2021-10-04

  • cs.CL updates on arXiv.org

    Annotation Inconsistency and Entity Bias in MultiWOZ. (arXiv:2105.14150v3 [cs.CL] UPDATED)
    (2 min) MultiWOZ is one of the most popular multi-domain task-oriented dialog datasets, containing 10K+ annotated dialogs covering eight domains. It has been widely accepted as a benchmark for various dialog tasks, e.g., dialog state tracking (DST), natural language generation (NLG), and end-to-end (E2E) dialog modeling. In this work, we identify an overlooked issue with dialog state annotation inconsistencies in the dataset, where a slot type is tagged inconsistently across similar dialogs leading to confusion for DST modeling. We propose an automated correction for this issue, which is present in a whopping 70% of the dialogs. Additionally, we notice that there is significant entity bias in the dataset (e.g., "cambridge" appears in 50% of the destination cities in the train domain). The entity bias can potentially lead to named entity memorization in generative models, which may go unnoticed as the test set suffers from a similar entity bias as well. We release a new test set with all entities replaced with unseen entities. Finally, we benchmark joint goal accuracy (JGA) of the state-of-the-art DST baselines on these modified versions of the data. Our experiments show that the annotation inconsistency corrections lead to 7-10% improvement in JGA. On the other hand, we observe a 29% drop in JGA when models are evaluated on the new test set with unseen entities.
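    A toy sketch of how an unseen-entity test split can be built: every occurrence of a known entity value in the utterance and dialog state is swapped for a value the model never saw in training. The mapping and slot names below are illustrative, not the released test set.

      # Illustrative entity swap for building an unseen-entity evaluation split.
      unseen = {"cambridge": "zakopane", "london": "tromso"}   # hypothetical mapping

      def replace_entities(utterance, state):
          for old, new in unseen.items():
              utterance = utterance.replace(old, new)
          # keep slot names, swap only the values the model could have memorized
          new_state = {slot: unseen.get(value, value) for slot, value in state.items()}
          return utterance, new_state

      print(replace_entities("i need a train to cambridge",
                             {"train-destination": "cambridge"}))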
    BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition. (arXiv:2109.13226v2 [eess.AS] UPDATED)
    (2 min) We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.
    Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition. (arXiv:2108.07789v2 [cs.CL] UPDATED)
    (2 min) Language models (LMs) pre-trained on massive amounts of text, in particular bidirectional encoder representations from Transformers (BERT), generative pre-training (GPT), and GPT-2, have become a key technology for many natural language processing tasks. In this paper, we present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR). Unlike the unidirectional LMs GPT and GPT-2, BERT is bidirectional, so the direct product of its output probabilities is no longer a valid language prior probability. A conversion method is proposed to compute the correct language prior probability based on bidirectional LM outputs in a mathematically exact way. Experimental results on the widely used AMI and Switchboard ASR tasks showed that the combination of the fine-tuned GPT and GPT-2 outperformed the combination of three neural LMs with different architectures trained from scratch on the in-domain text by up to a 12% relative word error rate reduction (WERR). Furthermore, on the AMI corpus, the proposed conversion for language prior probabilities enables BERT to obtain an extra 3% relative WERR, and the combination of BERT, GPT and GPT-2 results in further improvements.
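    A minimal sketch of why BERT's raw outputs cannot be multiplied into a language prior: the common workaround is a pseudo-log-likelihood that masks one position at a time. This is only the usual approximation, shown for intuition; the paper derives an exact conversion, and the model name and helper below are illustrative.

      import torch
      from transformers import AutoTokenizer, AutoModelForMaskedLM

      tok = AutoTokenizer.from_pretrained("bert-base-uncased")
      mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

      def pseudo_log_likelihood(sentence: str) -> float:
          ids = tok(sentence, return_tensors="pt")["input_ids"][0]
          total = 0.0
          # mask one position at a time and sum the log-prob of the true token
          for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
              masked = ids.clone()
              masked[i] = tok.mask_token_id
              with torch.no_grad():
                  logits = mlm(masked.unsqueeze(0)).logits[0, i]
              total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
          return total

      print(pseudo_log_likelihood("the cat sat on the mat"))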
    Unsupervised Self-Training for Sentiment Analysis of Code-Switched Data. (arXiv:2103.14797v2 [cs.CL] UPDATED)
    (2 min) Sentiment analysis is an important task in understanding social media content like customer reviews, Twitter and Facebook feeds, etc. In multilingual communities around the world, a large amount of social media text is characterized by the presence of Code-Switching. Thus, it has become important to build models that can handle code-switched data. However, annotated code-switched data is scarce and there is a need for unsupervised models and algorithms. We propose a general framework called Unsupervised Self-Training and show its applications for the specific use case of sentiment analysis of code-switched data. We use the power of pre-trained BERT models for initialization and fine-tune them in an unsupervised manner, only using pseudo labels produced by zero-shot transfer. We test our algorithm on multiple code-switched languages and provide a detailed analysis of the learning dynamics of the algorithm with the aim of answering the question: "Does our unsupervised model understand the Code-Switched languages or does it just learn its representations?". Our unsupervised models compete well with their supervised counterparts, with their performance reaching within 1-7% (weighted F1 scores) when compared to supervised models trained for a two-class problem.
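    A toy self-training loop in the spirit of the framework, assuming a generic sklearn classifier stands in for the fine-tuned BERT model and the seed labels stand in for zero-shot pseudo labels; high-confidence predictions on unlabeled text are folded back into the training set each round.

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      labeled_x, labeled_y = ["good film", "bad film"], [1, 0]    # seed (pseudo) labels
      unlabeled = ["great acting", "terrible plot", "loved it", "hated it"]

      vec = TfidfVectorizer().fit(labeled_x + unlabeled)
      for _ in range(3):                                          # a few self-training rounds
          clf = LogisticRegression().fit(vec.transform(labeled_x), labeled_y)
          if not unlabeled:
              break
          probs = clf.predict_proba(vec.transform(unlabeled))
          confident = probs.max(axis=1) > 0.6                     # keep high-confidence pseudo labels
          labeled_x += [t for t, c in zip(unlabeled, confident) if c]
          labeled_y += list(probs.argmax(axis=1)[confident])
          unlabeled = [t for t, c in zip(unlabeled, confident) if not c]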
    CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP. (arXiv:2104.08835v2 [cs.CL] UPDATED)
    (2 min) Humans can learn a new language task efficiently with only a few examples, by leveraging their knowledge obtained when learning prior tasks. In this paper, we explore whether and how such cross-task generalization ability can be acquired, and further applied to build better few-shot learners across diverse NLP tasks. We introduce CrossFit, a problem setup for studying cross-task generalization ability, which standardizes seen/unseen task partitions, data access during different learning stages, and the evaluation protocols. To instantiate different seen/unseen task partitions in CrossFit and facilitate in-depth analysis, we present the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks created from open-access NLP datasets and converted to a unified text-to-text format. Our analysis reveals that the few-shot learning ability on unseen tasks can be improved via an upstream learning stage using a set of seen tasks. We also observe that the selection of upstream learning tasks can significantly influence few-shot performance on unseen tasks, calling for further analysis of task similarity and transferability.
    On the Influence of Masking Policies in Intermediate Pre-training. (arXiv:2104.08840v2 [cs.CL] UPDATED)
    (2 min) Current NLP models are predominantly trained through a two-stage "pre-train then fine-tune" pipeline. Prior work has shown that inserting an intermediate pre-training stage, using heuristic masking policies for masked language modeling (MLM), can significantly improve final performance. However, it is still unclear (1) in what cases such intermediate pre-training is helpful, (2) whether hand-crafted heuristic objectives are optimal for a given task, and (3) whether a masking policy designed for one task is generalizable beyond that task. In this paper, we perform a large-scale empirical study to investigate the effect of various masking policies in intermediate pre-training with nine selected tasks across three categories. Crucially, we introduce methods to automate the discovery of optimal masking policies via direct supervision or meta-learning. We conclude that the success of intermediate pre-training depends on an appropriate pre-training corpus, the selection of the output format (i.e., masked spans or full sentence), and a clear understanding of the role that MLM plays for the downstream task. In addition, we find our learned masking policies outperform the heuristic of masking named entities on TriviaQA, and policies learned from one task can positively transfer to other tasks in certain cases, inviting future research in this direction.
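    An illustrative, hand-written masking policy that prefers named-entity tokens and falls back to random tokens. The paper's policies are learned via supervision or meta-learning rather than written like this; the sketch only shows where a policy plugs into MLM data preparation, and the tokens and flags are made up.

      import random

      MASK = "[MASK]"

      def apply_policy(tokens, entity_flags, mask_ratio=0.15):
          n_mask = max(1, int(len(tokens) * mask_ratio))
          entity_idx = [i for i, is_ent in enumerate(entity_flags) if is_ent]
          other_idx = [i for i in range(len(tokens)) if i not in entity_idx]
          # policy choice: mask entity positions first, then random fillers
          chosen = entity_idx[:n_mask] + random.sample(other_idx, max(0, n_mask - len(entity_idx)))
          return [MASK if i in chosen else t for i, t in enumerate(tokens)]

      tokens = ["the", "trophy", "was", "won", "by", "serena", "williams"]
      flags  = [False, False, False, False, False, True, True]
      print(apply_policy(tokens, flags))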
    Attention-based Clinical Note Summarization. (arXiv:2104.08942v2 [cs.CL] UPDATED)
    (2 min) The trend of deploying digital systems in numerous industries has induced a hike in recording digital information. The health sector has observed an extensive adoption of digital devices and systems that generate large volumes of personal medical records. Electronic health records contain valuable information for retrospective and prospective analysis that is often not entirely exploited because of the dense information storage. The core purpose of condensing health records is to select the information that best characterizes the original documents with respect to the reported disease. These summaries may boost diagnosis and extend a doctor's time with the patient during a high workload situation like the COVID-19 pandemic. In this paper, we propose applying a multi-head attention-based mechanism to perform extractive summarization of meaningful phrases in clinical notes. This method finds major sentences for a summary by correlating tokens, segments, and positional embeddings. The model outputs attention scores that are statistically transformed to extract key phrases and can be projected onto a heat-mapping tool for visual inspection and human use.
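    A rough sketch of attention-driven extractive selection, assuming a head-averaged attention matrix and known sentence spans are already available; the paper's statistical transformation of the scores is more involved than this mean-and-top-k heuristic, and the random matrix is only a placeholder.

      import numpy as np

      def top_sentences(attn, sent_spans, k=2):
          # attn: (seq_len, seq_len) attention matrix averaged over heads
          token_scores = attn.sum(axis=0)                     # attention received per token
          sent_scores = [token_scores[s:e].mean() for s, e in sent_spans]
          return sorted(int(i) for i in np.argsort(sent_scores)[-k:])

      rng = np.random.default_rng(0)
      attn = rng.random((12, 12))                             # placeholder attention
      print(top_sentences(attn, [(0, 4), (4, 8), (8, 12)], k=2))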
    FRAKE: Fusional Real-time Automatic Keyword Extraction. (arXiv:2104.04830v2 [cs.CL] UPDATED)
    (2 min) Keyword extraction is the process of identifying the words or phrases that best express the main concepts of a text. Electronic infrastructure continuously creates a considerable amount of text every day. This massive volume of documents makes it practically impossible for human resources to study and manage them. Nevertheless, the need for these documents to be accessed efficiently and effectively is evident for numerous purposes. A blog post, news article, or technical note is a relatively long text, and readers aim to learn its subject from keywords or topics. Our approach combines two sets of features: graph centrality features and textual features. The proposed method extracts the best keywords among the candidates with an optimal combination of graph centralities, such as degree, betweenness, eigenvector, and closeness centrality, and textual features, such as casing, term position, term frequency normalization, term in different sentences, and part-of-speech tagging. There have also been attempts to distinguish keywords from candidate phrases and to consider them as separate keywords. For evaluating the proposed method, seven datasets were used: Semeval2010, SemEval2017, Inspec, fao30, Thesis100, pak2018, and Wikinews, with results reported as Precision, Recall, and F-measure. Our proposed method performed much better in terms of evaluation metrics in all reviewed datasets compared with available methods in the literature. An approximate 16.9% increase in F-score was observed, with even larger gains on the English Inspec dataset and the foreign-language WikiNews dataset.
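    A small sketch of the graph side of this kind of scoring: build a co-occurrence graph over candidate words with networkx and combine several centralities. The toy sentence, adjacency-only co-occurrence, and equal weighting are illustrative, not the paper's tuned combination with textual features.

      import networkx as nx

      words = "keyword extraction identifies words that express main concepts of text".split()
      g = nx.Graph()
      g.add_edges_from(zip(words, words[1:]))                  # adjacent words co-occur

      centralities = [
          nx.degree_centrality(g),
          nx.closeness_centrality(g),
          nx.betweenness_centrality(g),
          nx.eigenvector_centrality(g, max_iter=500),
      ]
      score = {w: sum(c[w] for c in centralities) for w in g}  # unweighted combination
      print(sorted(score, key=score.get, reverse=True)[:3])    # top candidate keywords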
    Unpacking the Interdependent Systems of Discrimination: Ableist Bias in NLP Systems through an Intersectional Lens. (arXiv:2110.00521v1 [cs.CL])
    (2 min) Much of the world's population experiences some form of disability during their lifetime. Caution must be exercised while designing natural language processing (NLP) systems to prevent systems from inadvertently perpetuating ableist bias against people with disabilities, i.e., prejudice that favors those with typical abilities. We report on various analyses based on word predictions of a large-scale BERT language model. Statistically significant results demonstrate that people with disabilities can be disadvantaged. Findings also explore overlapping forms of discrimination related to interconnected gender and race identities.
    Emergence of Pragmatics from Referential Game between Theory of Mind Agents. (arXiv:2001.07752v2 [cs.AI] UPDATED)
    (2 min) Pragmatics studies how context can contribute to language meanings. In human communication, language is never interpreted out of context, and sentences can usually convey more information than their literal meanings. However, this mechanism is missing in most multi-agent systems, restricting the communication efficiency and the capability of human-agent interaction. In this paper, we propose an algorithm with which agents can spontaneously learn the ability to "read between the lines" without any explicit hand-designed rules. We integrate the theory of mind (ToM) in a cooperative multi-agent pedagogical situation and propose an adaptive reinforcement learning (RL) algorithm to develop a communication protocol. ToM is a profound cognitive science concept, claiming that people regularly reason about others' mental states, including beliefs, goals, and intentions, to obtain a performance advantage in competition, cooperation, or coalition. With this ability, agents consider language as not only messages but also rational acts reflecting others' hidden states. Our experiments demonstrate the advantage of pragmatic protocols over non-pragmatic protocols. We also show that the teaching complexity following the pragmatic protocol empirically approximates the recursive teaching dimension (RTD).
    Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation. (arXiv:2104.05848v6 [cs.CL] UPDATED)
    (3 min) We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. In other words, given a text in 124 source languages, we translate it into a severely low resource language using only ~1,000 lines of low resource data without any external help. Firstly, we propose a systematic method to rank and choose source languages that are close to the low resource language. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on ~1,000 lines (~3.5%) of low resource data. To translate named entities correctly, we build a massive lexicon table for 2,939 Bible named entities in 124 source languages, covering more than 66 severely low resource languages and including many entities that occur only once. Moreover, we build a novel method of combining translations from different source languages into one. Using English as a hypothetical low resource language, we get a +23.9 BLEU increase over a multilingual baseline, and a +10.3 BLEU increase over our asymmetric baseline in the Bible dataset. We get a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset. We also have good results for a real severely low resource Mayan language, Eastern Pokomchi.
    Moving on from OntoNotes: Coreference Resolution Model Transfer. (arXiv:2104.08457v2 [cs.CL] UPDATED)
    (2 min) Academic neural models for coreference resolution (coref) are typically trained on a single dataset, OntoNotes, and model improvements are benchmarked on that same dataset. However, real-world applications of coref depend on the annotation guidelines and the domain of the target dataset, which often differ from those of OntoNotes. We aim to quantify transferability of coref models based on the number of annotated documents available in the target dataset. We examine eleven target datasets and find that continued training is consistently effective and especially beneficial when there are few target documents. We establish new benchmarks across several datasets, including state-of-the-art results on PreCo.
    A Web Scale Entity Extraction System. (arXiv:2110.00423v1 [cs.CL])
    (2 min) Understanding the semantic meaning of content on the web through the lens of entities and concepts has many practical advantages. However, when building large-scale entity extraction systems, practitioners face unique challenges in finding the best ways to leverage the scale and variety of data available on internet platforms. We present learnings from our efforts in building an entity extraction system for multiple document types at large scale using multi-modal Transformers. We empirically demonstrate the effectiveness of multi-lingual, multi-task and cross-document type learning. We also discuss the label collection schemes that help to minimize the amount of noise in the collected data.
    TEACh: Task-driven Embodied Agents that Chat. (arXiv:2110.00534v1 [cs.CV])
    (2 min) Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.
    Learning to Ask for Data-Efficient Event Argument Extraction. (arXiv:2110.00479v1 [cs.CL])
    (2 min) Event argument extraction (EAE) is an important task for information extraction to discover specific argument roles. In this study, we cast EAE as a question-based cloze task and empirically analyze fixed discrete token template performance. As generating human-annotated question templates is often time-consuming and labor-intensive, we further propose a novel approach called "Learning to Ask," which can learn optimized question templates for EAE without human annotations. Experiments using the ACE-2005 dataset demonstrate that our method based on optimized questions achieves state-of-the-art performance in both the few-shot and supervised settings.
    Zero-shot Natural Language Video Localization. (arXiv:2110.00428v1 [cs.CL])
    (2 min) Understanding videos to localize moments with natural language often requires large, expensive annotations of video regions paired with language queries. To eliminate the annotation costs, we make a first attempt to train a natural language video localization model in a zero-shot manner. Inspired by the unsupervised image captioning setup, we merely require random text corpora, unlabeled video collections, and an off-the-shelf object detector to train a model. With the unpaired data, we propose to generate pseudo-supervision of candidate temporal regions and corresponding query sentences, and develop a simple NLVL model to train with the pseudo-supervision. Our empirical validations show that the proposed pseudo-supervised method outperforms several baseline approaches and a number of methods using stronger supervision on Charades-STA and ActivityNet-Captions.
    LEMON: Explainable Entity Matching. (arXiv:2110.00516v1 [cs.DB])
    (2 min) State-of-the-art entity matching (EM) methods are hard to interpret, and there is significant value in bringing explainable AI to EM. Unfortunately, most popular explainability methods do not work well out of the box for EM and need adaptation. In this paper, we identify three challenges of applying local post hoc feature attribution methods to entity matching: cross-record interaction effects, non-match explanations, and variation in sensitivity. We propose our novel model-agnostic and schema-flexible method LEMON that addresses all three challenges by (i) producing dual explanations to avoid cross-record interaction effects, (ii) introducing the novel concept of attribution potential to explain how two records could have matched, and (iii) automatically choosing explanation granularity to match the sensitivity of the matcher and record pair in question. Experiments on public datasets demonstrate that the proposed method is more faithful to the matcher and does a better job of helping users understand the decision boundary of the matcher than previous work. Furthermore, user studies show that the rate at which human subjects can construct counterfactual examples after seeing an explanation from our proposed method increases from 54% to 64% for matches and from 15% to 49% for non-matches compared to explanations from a standard adaptation of LIME.
    Improving Punctuation Restoration for Speech Transcripts via External Data. (arXiv:2110.00560v1 [cs.CL])
    (2 min) Automatic Speech Recognition (ASR) systems generally do not produce punctuated transcripts. To make transcripts more readable and follow the expected input format for downstream language models, it is necessary to add punctuation marks. In this paper, we tackle the punctuation restoration problem specifically for noisy text (e.g., phone conversation scenarios). To leverage the available written text datasets, we introduce a data sampling technique based on an n-gram language model to sample more training data that are similar to our in-domain data. Moreover, we propose a two-stage fine-tuning approach that utilizes the sampled external data as well as our in-domain dataset for models based on BERT. Extensive experiments show that the proposed approach outperforms the baseline with an improvement of 1.12% in F1 score.
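    A minimal sketch of the data-selection idea, assuming a tiny add-one-smoothed bigram LM built from in-domain text; external sentences are ranked by length-normalized log-probability and the most in-domain-like ones are kept. A real setup would use a proper smoothed n-gram LM over far more data.

      from collections import Counter
      import math

      def bigram_logprob(sentence, bigrams, unigrams, vocab_size):
          toks = ["<s>"] + sentence.lower().split()
          lp = 0.0
          for prev, cur in zip(toks, toks[1:]):                # add-one smoothing
              lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
          return lp / (len(toks) - 1)                          # length-normalized

      in_domain = ["yeah i will call you back later", "okay sounds good thanks bye"]
      unigrams, bigrams = Counter(), Counter()
      for s in in_domain:
          toks = ["<s>"] + s.split()
          unigrams.update(toks)
          bigrams.update(zip(toks, toks[1:]))

      external = ["the committee approved the budget", "okay i will call you later"]
      ranked = sorted(external, reverse=True,
                      key=lambda s: bigram_logprob(s, bigrams, unigrams, len(unigrams)))
      print(ranked[0])    # most in-domain-like external sentence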
    Machine Translation of Low-Resource Indo-European Languages. (arXiv:2108.03739v2 [cs.CL] UPDATED)
    (2 min) In this work, we investigate methods for the challenging task of translating between low-resource language pairs that exhibit some level of similarity. In particular, we consider the utility of transfer learning for translating between several Indo-European low-resource languages from the Germanic and Romance language families. Specifically, we build two main classes of transfer-based systems to study how relatedness can benefit the translation performance. The primary system fine-tunes a model pre-trained on a related language pair and the contrastive system fine-tunes one pre-trained on an unrelated language pair. Our experiments show that although relatedness is not necessary for transfer learning to work, it does benefit model performance.
    External knowledge transfer deployment inside a simple double agent Viterbi algorithm. (arXiv:2110.00433v1 [cs.CL])
    (2 min) In this paper, we consider deploying external knowledge transfer inside a simple double-agent Viterbi algorithm, first introduced by the author in the preprint "Hidden Markov Based Mathematical Model dedicated to Extract Ingredients from Recipe Text". The key challenge of this work lies in discovering why our earlier model performs poorly when estimating ingredient states for unknown words, and in determining whether deploying external knowledge transfer directly in the state-matrix calculation could be the solution, rather than deploying it only in the back-propagation step.
    Attention based Sequence to Sequence Learning for Machine Translation of Low Resourced Indic Languages -- A case of Sanskrit to Hindi. (arXiv:2110.00435v1 [cs.CL])
    (2 min) Deep Learning techniques are powerful in mimicking humans in a particular set of problems. They have achieved a remarkable performance in complex learning tasks. Deep learning inspired Neural Machine Translation (NMT) is a proficient technique that outperforms traditional machine translation. Performing machine-aided translation on Indic languages has always been a challenging task considering their rich and diverse grammar. Neural machine translation has shown quality results compared to traditional machine translation approaches. Fully automatic machine translation becomes problematic for low-resourced languages, especially Sanskrit. This paper presents attention-based neural machine translation that selectively focuses on particular parts of source sentences during translation. The work presents the construction of a Sanskrit-to-Hindi bilingual parallel corpus with nearly 10K samples and 178,000 tokens. The neural translation model equipped with an attention mechanism has been trained on the Sanskrit-to-Hindi parallel corpus. The approach has shown the significance of attention mechanisms to overcome long-term dependencies, primarily associated with low-resource Indic languages. The paper shows the attention plots on testing data to demonstrate the alignment between source and translated words. For the evaluation of the translated sentences, both manual score-based human evaluation and automatic-metric-based techniques have been adopted. The attention-based neural translation achieved 88% accuracy in human evaluation and a BLEU score of 0.92 on Sanskrit-to-Hindi translation.
    Language Invariant Properties in Natural Language Processing. (arXiv:2109.13037v2 [cs.CL] UPDATED)
    (2 min) Meaning is context-dependent, but many properties of language (should) remain the same even if we transform the context. For example, sentiment, entailment, or speaker properties should be the same in a translation and original of a text. We introduce language invariant properties: i.e., properties that should not change when we transform text, and how they can be used to quantitatively evaluate the robustness of transformation algorithms. We use translation and paraphrasing as transformation examples, but our findings apply more broadly to any transformation. Our results indicate that many NLP transformations change properties like author characteristics, i.e., make them sound more male. We believe that studying these properties will allow NLP to address both social factors and pragmatic aspects of language. We also release an application suite that can be used to evaluate the invariance of transformation applications.
    Rumor Detection on Social Media with Hierarchical Adversarial Training. (arXiv:2110.00425v1 [cs.CL])
    (2 min) The proliferation of rumors on social media has a huge impact on society. However, natural language text is high-dimensional and sparse, and the same rumor may be expressed in hundreds of ways on social media. As such, the robustness and generalization of current rumor detection models are put into question. We propose a new hierarchical model called HAT-RD, which is divided into two components: post-level modules and event-level modules. HAT-RD adopts a novel hierarchical adversarial training method based on gradient ascent by adding adversarial perturbations to the embedding layers of both the post-level and event-level modules to deceive the detector. At the same time, the detector uses stochastic gradient descent to minimize the adversarial risk to learn a more robust model. In this way, the post-level and event-level sample spaces are enhanced, and experiments indicate that the model drifts into an area with a flat loss landscape, which leads to better generalization. Experiments on two real-world datasets demonstrate that our model achieves better results than state-of-the-art methods.
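    A generic FGM-style sketch of gradient-ascent perturbation on an embedding layer, which is the basic building block the abstract describes; HAT-RD applies such perturbations hierarchically at both the post and event levels, which this single-layer stand-in does not capture.

      import torch

      class FGM:
          """Fast-Gradient-Method-style perturbation of an embedding layer."""
          def __init__(self, model, emb_name="embedding", eps=1.0):
              self.model, self.emb_name, self.eps, self.backup = model, emb_name, eps, {}

          def attack(self):
              for name, p in self.model.named_parameters():
                  if p.requires_grad and self.emb_name in name and p.grad is not None:
                      self.backup[name] = p.data.clone()
                      norm = torch.norm(p.grad)
                      if norm != 0:
                          p.data.add_(self.eps * p.grad / norm)   # gradient ascent step

          def restore(self):
              for name, p in self.model.named_parameters():
                  if name in self.backup:
                      p.data = self.backup[name]
              self.backup = {}

      # per batch: loss.backward(); fgm.attack(); adversarial loss.backward();
      # fgm.restore(); optimizer.step() -- the detector thus minimizes adversarial risk.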
    Paraphrases as Foreign Languages in Multilingual Neural Machine Translation. (arXiv:1808.08438v3 [cs.CL] UPDATED)
    (2 min) Paraphrases, the rewordings of the same semantic meaning, are useful for improving generalization and translation. However, prior works only explore paraphrases at the word or phrase level, not at the sentence or corpus level. In contrast, we use different translations of the whole training data that are consistent in structure as paraphrases at the corpus level. We train on parallel paraphrases in multiple languages from various sources. We treat paraphrases as foreign languages, tag source sentences with paraphrase labels, and train on parallel paraphrases in the style of multilingual Neural Machine Translation (NMT). Our multi-paraphrase NMT that trains only on two languages outperforms the multilingual baselines. Adding paraphrases improves the rare word translation and increases entropy and diversity in lexical choice. Adding the source paraphrases boosts performance better than adding the target ones. Combining both the source and the target paraphrases lifts performance further; combining paraphrases with multilingual data helps but has mixed performance. We achieve a BLEU score of 57.2 for French-to-English translation using 24 corpus-level paraphrases of the Bible, which outperforms the multilingual baselines and is +34.7 above the single-source single-target NMT baseline.
    Natural language understanding for logical games. (arXiv:2110.00558v1 [cs.CL])
    (2 min) We developed a system able to automatically solve logical puzzles in natural language. Our solution is composed of a parser and an inference module. The parser translates the text into first order logic (FOL), while the MACE4 model finder is used to compute the models of the given FOL theory. We also empower our software agent with the capability to provide Yes/No answers to natural language questions related to each puzzle. Moreover, in line with Explainable Artificial Intelligence (XAI), the agent can back up its answer, providing a graphical representation of the proof. The advantage of using reasoning for Natural Language Understanding (NLU) instead of machine learning is that the user can obtain an explanation of the reasoning chain. We illustrate how the system performs on various types of natural language puzzles, including 382 knights and knaves puzzles. These features, together with the overall performance rate of 80.89%, make the proposed solution an improvement upon similar solvers for natural language understanding in the puzzle domain.
    FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition. (arXiv:2105.03842v4 [cs.CL] UPDATED)
    (3 min) Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of significantly increased ASR error rate. In this paper, observing distinctive error patterns and correction operations (i.e., insertion, deletion, and substitution) in ASR, we propose FastCorrect, a novel NAR error correction model based on edit alignment. In training, FastCorrect aligns each source token from an ASR output sentence to the target tokens from the corresponding ground-truth sentence based on the edit distance between the source and target sentences, and extracts the number of target tokens corresponding to each source token during editing/correction, which is then used to train a length predictor and to adjust the source tokens to match the length of the target sentence for parallel generation. In inference, the token number predicted by the length predictor is used to adjust the source tokens for target sequence generation. Experiments on the public AISHELL-1 dataset and an internal industrial-scale ASR dataset show the effectiveness of FastCorrect for ASR error correction: 1) it speeds up the inference by 6-9 times and maintains the accuracy (8-14% WER reduction) compared with the autoregressive correction model; and 2) it outperforms the popular NAR models adopted in neural machine translation and text editing by a large margin.
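    A rough stand-in for the edit alignment: difflib is used here to derive how many target tokens each ASR source token should expand to, which is the quantity the length predictor is trained on. The paper uses its own edit-distance alignment and conventions; attaching insertions to the previous token is just one simple choice.

      from difflib import SequenceMatcher

      def target_counts(src, tgt):
          counts = [0] * len(src)
          for op, i1, i2, j1, j2 in SequenceMatcher(a=src, b=tgt).get_opcodes():
              if op == "equal":
                  for k in range(i1, i2):
                      counts[k] = 1
              elif op == "replace":
                  n_src, n_tgt = i2 - i1, j2 - j1
                  for k in range(n_src):               # spread target span over source span
                      counts[i1 + k] = n_tgt // n_src + (1 if k < n_tgt % n_src else 0)
              elif op == "insert" and i1 > 0:
                  counts[i1 - 1] += j2 - j1            # attach insertions to previous token
              # "delete": the source token maps to zero target tokens (stays 0)
          return counts

      # hypothesis vs. ground truth -> per-token expansion counts
      print(target_counts("i lik cat very".split(), "i like cats".split()))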
    Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers. (arXiv:2104.01604v2 [cs.CL] UPDATED)
    (2 min) This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baseline models, the code for which has been made available as part of the SpeechBrain toolkit.
    Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images. (arXiv:2110.00519v1 [cs.CV])
    (2 min) While neural symbolic methods demonstrate impressive performance in visual question answering on synthetic images, their performance suffers on real images. We identify that the long-tail distribution of visual concepts and unequal importance of reasoning steps in real data are the two key obstacles that limit the models' real-world potential. To address these challenges, we propose a new paradigm, Calibrating Concepts and Operations (CCO), which enables neural symbolic models to capture underlying data characteristics and to reason with hierarchical importance. Specifically, we introduce an executor with learnable concept embedding magnitudes for handling distribution imbalance, and an operation calibrator for highlighting important operations and suppressing redundant ones. Our experiments show CCO substantially boosts the performance of neural symbolic methods on real images. By evaluating models on the real-world dataset GQA, CCO helps the neural symbolic method NSCL outperform its vanilla counterpart by 9.1% (from 47.0% to 56.1%); this result also largely reduces the performance gap between symbolic and non-symbolic methods. Additionally, we create a perturbed test set for better understanding and analyzing model performance on real images. Code is available at https://github.com/Lizw14/CaliCO.git.
    Phonology Recognition in American Sign Language. (arXiv:2110.00453v1 [cs.CL])
    (2 min) Inspired by recent developments in natural language processing, we propose a novel approach to sign language processing based on phonological properties validated by American Sign Language users. By taking advantage of datasets composed of phonological data and videos of people using sign language, we use a pretrained deep model based on mesh reconstruction to extract the 3D coordinates of the signers' keypoints. Then, we train standard statistical and deep machine learning models in order to assign phonological classes to each temporal sequence of coordinates. Our paper introduces the idea of exploiting the phonological properties manually assigned by sign language users to classify videos of people performing signs by regressing a 3D mesh. We establish a new baseline for this problem based on the statistical distribution of 725 different signs. Our best-performing models achieve a micro-averaged F1-score of 58% for the major location class and 70% for the sign type using statistical and deep learning algorithms, compared to their corresponding baselines of 35% and 39%.
    Under the Microscope: Interpreting Readability Assessment Models for Filipino. (arXiv:2110.00157v1 [cs.CL])
    (2 min) Readability assessment is the process of identifying the level of ease or difficulty of a certain piece of text for its intended audience. Approaches have evolved from the use of arithmetic formulas to more complex pattern-recognizing models trained using machine learning algorithms. While using these approaches provide competitive results, limited work is done on analyzing how linguistic variables affect model inference quantitatively. In this work, we dissect machine learning-based readability assessment models in Filipino by performing global and local model interpretation to understand the contributions of varying linguistic features and discuss its implications in the context of the Filipino language. Results show that using a model trained with top features from global interpretation obtained higher performance than the ones using features selected by Spearman correlation. Likewise, we also empirically observed local feature weight boundaries for discriminating reading difficulty at an extremely fine-grained level and their corresponding effects if values are perturbed.
    Evaluation of Non-Negative Matrix Factorization and n-stage Latent Dirichlet Allocation for Emotion Analysis in Turkish Tweets. (arXiv:2110.00418v1 [cs.CL])
    (2 min) With the development of technology, the use of social media has become quite common. Analyzing comments on social media in areas such as media and advertising plays an important role today. For this reason, new and traditional natural language processing methods are used to detect the emotion of these posts. In this paper, the Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) topic modeling methods were used to determine which emotion Turkish tweets posted on Twitter express. In addition, the accuracy of a proposed n-stage method based on LDA was analyzed. The dataset consists of 5 emotions, namely angry, fear, happy, sad and confused. NMF was the most successful method among all topic modeling methods in this study. Then, the word weights and class labels of the topics were used to produce a Weka-compatible file, and the F1-measure of the Random Forest, Naive Bayes and Support Vector Machine methods was analyzed. Among the Weka results, the most successful method was n-stage LDA, and the most successful algorithm was Random Forest.
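    A minimal sketch of the two topic-modeling methods compared in the paper, using sklearn's implementations; the toy English documents stand in for the Turkish tweet corpus, and the two-topic setup is only for illustration.

      from sklearn.decomposition import NMF, LatentDirichletAllocation
      from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

      docs = ["i am so happy today", "this makes me really angry",
              "i feel scared and afraid", "such a sad and lonely day"]

      # LDA over raw counts
      count_x = CountVectorizer().fit_transform(docs)
      lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(count_x)

      # NMF over TF-IDF weights
      tfidf = TfidfVectorizer()
      nmf = NMF(n_components=2, random_state=0).fit(tfidf.fit_transform(docs))

      terms = tfidf.get_feature_names_out()
      for topic in nmf.components_:                            # top words per NMF topic
          print([terms[i] for i in topic.argsort()[-3:]])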
    BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based Sentiment Classification. (arXiv:2110.00171v1 [cs.CL])
    (2 min) Graph-based Aspect-based Sentiment Classification (ABSC) approaches have yielded state-of-the-art results, especially when equipped with contextual word embeddings from pre-trained language models (PLMs). However, they ignore sequential features of the context and have not yet made the best of PLMs. In this paper, we propose a novel model, BERT4GCN, which integrates the grammatical sequential features from the PLM of BERT, and the syntactic knowledge from dependency graphs. BERT4GCN utilizes outputs from intermediate layers of BERT and positional information between words to augment GCN (Graph Convolutional Network) to better encode the dependency graphs for the downstream classification. Experimental results demonstrate that the proposed BERT4GCN outperforms all state-of-the-art baselines, justifying that augmenting GCN with the grammatical features from intermediate layers of BERT can significantly empower ABSC models.
    Span Labeling Approach for Vietnamese and Chinese Word Segmentation. (arXiv:2110.00156v1 [cs.CL])
    (2 min) In this paper, we propose a span labeling approach to model n-gram information for Vietnamese word segmentation, namely SpanSeg. We compare the span labeling approach with the conditional random field by using encoders with the same architecture. Since Vietnamese and Chinese have similar linguistic phenomena, we evaluated the proposed method on the Vietnamese treebank benchmark dataset and five Chinese benchmark datasets. Our experimental results show that the proposed approach SpanSeg achieves higher performance than the sequence tagging approach, with a state-of-the-art F-score of 98.31% on the Vietnamese treebank benchmark, when both apply the contextual pre-trained language model XLM-RoBERTa and the predicted word boundary information. Besides, we do fine-tuning experiments for the span labeling approach on the BERT and ZEN pre-trained language models for Chinese, with fewer parameters, faster inference time, and competitive or higher F-scores than the previous state-of-the-art approach, word segmentation with word-hood memory networks, on five Chinese benchmarks.
    Large-scale ASR Domain Adaptation by Self- and Semi-supervised Learning. (arXiv:2110.00165v1 [cs.LG])
    (2 min) Self- and Semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance the model performance. However, these approaches mostly focus on in-domain performance for public datasets. In this study, we utilize the combination of self- and semi-supervised learning methods to solve the unseen-domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source domain data with a small fraction of the target domain data (3%) can recover the performance gap compared to a full-data baseline: a relative 13.5% WER improvement for target domain data.
    FiLMing Multimodal Sarcasm Detection with Attention. (arXiv:2110.00416v1 [cs.MM])
    (2 min) Sarcasm detection identifies natural language expressions whose intended meaning differs from their surface meaning. It finds applications in many NLP tasks such as opinion mining, sentiment analysis, etc. Today, social media has given rise to an abundant amount of multimodal data where users express their opinions through text and images. Our paper aims to leverage multimodal data to improve the performance of the existing systems for sarcasm detection. So far, various approaches have been proposed that use the text and image modalities and a fusion of both. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes. Further, we integrate feature-wise affine transformation by conditioning the input image through FiLMed ResNet blocks with the textual features using the GRU network to capture the multimodal information. The outputs from both models and the CLS token from RoBERTa are concatenated and used for the final prediction. Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% in F1 score on the public Twitter multimodal sarcasm detection dataset.
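    A minimal FiLM-style conditioning layer: text features produce per-channel gamma/beta that modulate image feature maps. The dimensions and module names are illustrative, not the paper's RoBERTa/GRU/FiLMed-ResNet configuration.

      import torch
      import torch.nn as nn

      class FiLM(nn.Module):
          def __init__(self, text_dim: int, channels: int):
              super().__init__()
              self.to_gamma_beta = nn.Linear(text_dim, 2 * channels)

          def forward(self, image_feats, text_feats):
              # image_feats: (B, C, H, W); text_feats: (B, text_dim)
              gamma, beta = self.to_gamma_beta(text_feats).chunk(2, dim=-1)
              gamma = gamma.unsqueeze(-1).unsqueeze(-1)        # (B, C, 1, 1)
              beta = beta.unsqueeze(-1).unsqueeze(-1)
              return gamma * image_feats + beta                # feature-wise affine transform

      film = FiLM(text_dim=16, channels=8)
      out = film(torch.randn(2, 8, 4, 4), torch.randn(2, 16))
      print(out.shape)                                         # torch.Size([2, 8, 4, 4])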
    MemBERT: Injecting Unstructured Knowledge into BERT. (arXiv:2110.00125v1 [cs.CL])
    (2 min) Transformers changed modern NLP in many ways. However, they can hardly exploit domain knowledge, and like other black-box models, they lack interpretability. Unfortunately, structured knowledge injection, in the long run, risks suffering from a knowledge acquisition bottleneck. We thus propose a memory enhancement of transformer models that makes use of unstructured domain knowledge expressed in plain natural language. An experimental evaluation conducted on two challenging NLP tasks demonstrates that our approach yields better performance and model interpretability than baseline transformer-based architectures.
    Building an Efficient and Effective Retrieval-based Dialogue System via Mutual Learning. (arXiv:2110.00159v1 [cs.CL])
    (2 min) Establishing retrieval-based dialogue systems that can select appropriate responses from the pre-built index has gained increasing attention from researchers. For this task, the adoption of pre-trained language models (such as BERT) has led to remarkable progress in a number of benchmarks. There are two common approaches: cross-encoders, which perform full attention over the inputs, and bi-encoders, which encode the context and response separately. The former gives considerable improvements in accuracy but is often inapplicable in practice for large-scale retrieval given the cost of the full attention required for each sample at test time. The latter is efficient for billions of indexes but suffers from sub-optimal performance. In this work, we propose to combine the best of both worlds to build a retrieval system. Specifically, we employ a fast bi-encoder to replace the traditional feature-based pre-retrieval model (such as BM25) and set the response re-ranking model as a more complicated architecture (such as cross-encoder). To further improve the effectiveness of our framework, we train the pre-retrieval model and the re-ranking model at the same time via mutual learning, which enables two models to learn from each other throughout the training process. We conduct experiments on two benchmarks and evaluation results demonstrate the efficiency and effectiveness of our proposed framework.
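    A sketch of the two-stage inference path only: a bi-encoder pre-retrieves candidates from the response index and a cross-encoder re-ranks them. The sentence-transformers model names are arbitrary examples, and the mutual-learning training procedure from the paper is not shown.

      import numpy as np
      from sentence_transformers import SentenceTransformer, CrossEncoder

      responses = ["Sure, see you at 3pm.", "The capital of France is Paris.",
                   "I can resend the invoice today.", "Let's reschedule to Friday."]

      bi = SentenceTransformer("all-MiniLM-L6-v2")
      ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
      index = bi.encode(responses, normalize_embeddings=True)   # pre-built response index

      def retrieve(context, k=2):
          q = bi.encode([context], normalize_embeddings=True)[0]
          top = np.argsort(index @ q)[-k:]                      # fast bi-encoder pre-retrieval
          scores = ce.predict([(context, responses[i]) for i in top])
          return responses[top[int(np.argmax(scores))]]         # cross-encoder re-ranking

      print(retrieve("Could you send the invoice again?"))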
    A Survey of Knowledge Enhanced Pre-trained Models. (arXiv:2110.00269v1 [cs.CL])
    (2 min) Pre-trained models learn contextualized word representations on large-scale text corpus through a self-supervised learning method, which has achieved promising performance after fine-tuning. These models, however, suffer from poor robustness and lack of interpretability. Pre-trained models with knowledge injection, which we call knowledge enhanced pre-trained models (KEPTMs), possess deep understanding and logical reasoning and introduce interpretability to some extent. In this survey, we provide a comprehensive overview of KEPTMs for natural language processing. We first introduce the progress of pre-trained models and knowledge representation learning. Then we systematically categorize existing KEPTMs from three different perspectives. Finally, we outline some potential directions of KEPTMs for future research.
    Detecting Harmful Memes and Their Targets. (arXiv:2110.00413v1 [cs.CL])
    (2 min) Among the various modes of communication in social media, the use of Internet memes has emerged as a powerful means to convey political, psychological, and socio-cultural opinions. Although memes are typically humorous in nature, recent days have witnessed a proliferation of harmful memes targeted to abuse various social entities. As most harmful memes are highly satirical and abstruse without appropriate contexts, off-the-shelf multimodal models may not be adequate to understand their underlying semantics. In this work, we propose two novel problem formulations: detecting harmful memes and the social entities that these harmful memes target. To this end, we present HarMeme, the first benchmark dataset, containing 3,544 memes related to COVID-19. Each meme went through a rigorous two-stage annotation process. In the first stage, we labeled a meme as very harmful, partially harmful, or harmless; in the second stage, we further annotated the type of target(s) that each harmful meme points to: individual, organization, community, or society/general public/other. The evaluation results using ten unimodal and multimodal models highlight the importance of using multimodal signals for both tasks. We further discuss the limitations of these models and we argue that more research is needed to address these problems.
    UserIdentifier: Implicit User Representations for Simple and Effective Personalized Sentiment Analysis. (arXiv:2110.00135v1 [cs.LG])
    (2 min) Global models are trained to be as generalizable as possible, with user invariance considered desirable since the models are shared across multitudes of users. As such, these models are often unable to produce personalized responses for individual users, based on their data. Contrary to widely-used personalization techniques based on few-shot learning, we propose UserIdentifier, a novel scheme for training a single shared model for all users. Our approach produces personalized responses by adding fixed, non-trainable user identifiers to the input data. We empirically demonstrate that this proposed method outperforms the prefix-tuning based state-of-the-art approach by up to 13%, on a suite of sentiment analysis datasets. We also show that, unlike prior work, this method needs neither any additional model parameters nor any extra rounds of few-shot fine-tuning.
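    A minimal sketch of the UserIdentifier idea: a fixed, non-trainable user string is prepended to every input so a single shared model can condition on the user. The identifier format below is made up for illustration.

      def add_user_identifier(user_id: str, text: str) -> str:
          # fixed, non-trainable marker; the downstream tokenizer sees it as plain text
          return f"user_{user_id} {text}"

      batch = [("u042", "the battery life is amazing"),
               ("u977", "the battery life is amazing")]
      inputs = [add_user_identifier(uid, txt) for uid, txt in batch]
      print(inputs)   # same text, different user context -> potentially different predictions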
    Variance of Twitter Embeddings and Temporal Trends of COVID-19 cases. (arXiv:2110.00031v1 [cs.CL])
    (2 min) The severity of the coronavirus pandemic necessitates the need of effective administrative decisions. Over 400,000 people in India have succumbed to COVID-19, with over 30 million confirmed cases, and still counting. The threat of a plausible third wave continues to haunt millions. In the ever-changing dynamics of the virus, predictive modeling methods can serve as an integral tool. The pandemic has further triggered an unprecedented usage of social media. This paper aims to propose a method for harnessing social media, specifically Twitter, to predict the upcoming scenarios related to COVID-19 cases. In this study, we seek to understand how the surges in COVID-19 related tweets can indicate rise in the cases. This prospective analysis can be utilised to aid administrators about timely resource allocation to lessen the severity of the damage. Using word embeddings to capture the semantic meaning of tweets, we identify Significant Dimensions (SDs). Our methodology predicts the rise in cases with a lead time of 15 days and 30 days with R2 scores of 0.80 and 0.62 respectively. Finally, we explain the thematic utility of the SDs.
    #ContextMatters: Advantages and Limitations of Using Machine Learning to Support Women in Politics. (arXiv:2110.00116v1 [cs.SI])
    (3 min) The United Nations identified gender equality as a Sustainable Development Goal in 2015, recognizing the underrepresentation of women in politics as a specific barrier to achieving gender equality. Political systems around the world experience gender inequality across all levels of elected government as fewer women run for office than men. This is due in part to online abuse, particularly on social media platforms like Twitter, where women seeking or in power tend to be targeted with more toxic maltreatment than their male counterparts. In this paper, we present reflections on ParityBOT - the first natural language processing-based intervention designed to affect online discourse for women in politics for the better, at scale. Deployed across elections in Canada, the United States and New Zealand, ParityBOT was used to analyse and classify more than 12 million tweets directed at women candidates and counter toxic tweets with supportive ones. From these elections we present three case studies highlighting the current limitations of, and future research and application opportunities for, using a natural language processing-based system to detect online toxicity, specifically with regards to contextually important microaggressions. We examine the rate of false negatives, where ParityBOT failed to pick up on insults directed at specific high profile women, which would be obvious to human users. We examine the unaddressed harms of microaggressions and the potential of yet unseen damage they cause for women in these communities, and for progress towards gender equality overall, in light of these technological blindspots. This work concludes with a discussion on the benefits of partnerships between nonprofit social groups and technology experts to develop responsible, socially impactful approaches to addressing online hate.
    Tree-Constrained Graph Neural Networks For Argument Mining. (arXiv:2110.00124v1 [cs.CL])
    (2 min) We propose a novel architecture for Graph Neural Networks that is inspired by the idea behind Tree Kernels of measuring similarity between trees by taking into account their common substructures, named fragments. By imposing a series of regularization constraints to the learning problem, we exploit a pooling mechanism that incorporates such notion of fragments within the node soft assignment function that produces the embeddings. We present an extensive experimental evaluation on a collection of sentence classification tasks conducted on several argument mining corpora, showing that the proposed approach performs well with respect to state-of-the-art techniques.
  • cs.CV updates on arXiv.org

    Learning to Track with Object Permanence. (arXiv:2103.14258v2 [cs.CV] UPDATED)
    (2 min) Tracking by detection, the dominant approach for online multi-object tracking, alternates between localization and association steps. As a result, it strongly depends on the quality of instantaneous observations, often failing when objects are not fully visible. In contrast, tracking in humans is underpinned by the notion of object permanence: once an object is recognized, we are aware of its physical existence and can approximately localize it even under full occlusions. In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning. We build on top of the recent CenterTrack architecture, which takes pairs of frames as input, and extend it to videos of arbitrary length. To this end, we augment the model with a spatio-temporal, recurrent memory module, allowing it to reason about object locations and identities in the current frame using all the previous history. It is, however, not obvious how to train such an approach. We study this question on a new, large-scale, synthetic dataset for multi-object tracking, which provides ground truth annotations for invisible objects, and propose several approaches for supervising tracking behind occlusions. Our model, trained jointly on synthetic and real data, outperforms the state of the art on KITTI and MOT17 datasets thanks to its robustness to occlusions.
    Few-shot Neural Human Performance Rendering from Sparse RGBD Videos. (arXiv:2107.06505v2 [cs.CV] UPDATED)
    (0 min) Recent neural rendering approaches for human activities achieve remarkable view synthesis results, but still rely on dense input views or dense training with all the capture frames, leading to deployment difficulty and inefficient training overhead. Moreover, existing approaches become ill-posed if the input is both spatially and temporally sparse. To fill this gap, in this paper we propose a few-shot neural human rendering approach (FNHR) from only sparse RGBD inputs, which exploits the temporal and spatial redundancy to generate photo-realistic free-view output of human activities. Our FNHR is trained only on the key-frames which expand the motion manifold in the input sequences. We introduce a two-branch neural blending scheme to combine the neural point renderer and the classical graphics texturing pipeline, which integrates reliable observations over sparse key-frames. Furthermore, we adopt a patch-based adversarial training process to make use of the local redundancy and avoid over-fitting to the key-frames, which generates finely detailed rendering results. Extensive experiments demonstrate the effectiveness of our approach to generate high-quality free view-point results for challenging human performances under the sparse setting.
    Scalable, Axiomatic Explanations of Deep Alzheimer's Diagnosis from Heterogeneous Data. (arXiv:2107.05997v2 [cs.LG] UPDATED)
    (0 min) Deep Neural Networks (DNNs) have an enormous potential to learn from complex biomedical data. In particular, DNNs have been used to seamlessly fuse heterogeneous information from neuroanatomy, genetics, biomarkers, and neuropsychological tests for highly accurate Alzheimer's disease diagnosis. On the other hand, their black-box nature is still a barrier for the adoption of such a system in the clinic, where interpretability is absolutely essential. We propose Shapley Value Explanation of Heterogeneous Neural Networks (SVEHNN) for explaining the Alzheimer's diagnosis made by a DNN from the 3D point cloud of the neuroanatomy and tabular biomarkers. Our explanations are based on the Shapley value, which is the unique method that satisfies all fundamental axioms for local explanations previously established in the literature. Thus, SVEHNN has many desirable characteristics that previous work on interpretability for medical decision making is lacking. To avoid the exponential time complexity of the Shapley value, we propose to transform a given DNN into a Lightweight Probabilistic Deep Network without re-training, thus achieving a complexity only quadratic in the number of features. In our experiments on synthetic and real data, we show that we can closely approximate the exact Shapley value with a dramatically reduced runtime and can reveal the hidden knowledge the network has learned from the data.
    Building 3D Morphable Models from a Single Scan. (arXiv:2011.12440v2 [cs.CV] UPDATED)
    (0 min) We propose a method for constructing generative models of 3D objects from a single 3D mesh. Our method produces a 3D morphable model that represents shape and albedo in terms of Gaussian processes. We define the shape deformations in physical (3D) space and the albedo deformations as a combination of physical-space and color-space deformations. Whereas previous approaches have typically built 3D morphable models from multiple high-quality 3D scans through principal component analysis, we build 3D morphable models from a single scan or template. As we demonstrate in the face domain, these models can be used to infer 3D reconstructions from 2D data (inverse graphics) or 3D data (registration). Specifically, we show that our approach can be used to perform face recognition using only a single 3D scan (one scan total, not one per person), and further demonstrate how multiple scans can be incorporated to improve performance without requiring dense correspondence. Our approach enables the synthesis of 3D morphable models for 3D object categories where dense correspondence between multiple scans is unavailable. We demonstrate this by constructing additional 3D morphable models for fish and birds and use them to perform simple inverse rendering tasks. We share the code used to generate these models and to perform our inverse rendering and registration experiments.
    Fully Spiking Variational Autoencoder. (arXiv:2110.00375v1 [cs.NE])
    (0 min) Spiking neural networks (SNNs) can be run on neuromorphic devices with ultra-high speed and ultra-low energy consumption because of their binary and event-driven nature. Therefore, SNNs are expected to have various applications, including as generative models running on edge devices to create high-quality images. In this study, we build a variational autoencoder (VAE) with an SNN to enable image generation. VAEs are known for their stability among generative models, and their output quality has recently improved. In a vanilla VAE, the latent space is represented as a normal distribution, and floating-point calculations are required for sampling. However, this is not possible in SNNs because all features must be binary time series data. Therefore, we constructed the latent space with an autoregressive SNN model, and randomly selected samples from its output to sample the latent variables. This allows the latent variables to follow a Bernoulli process and enables variational learning. Thus, we build the Fully Spiking Variational Autoencoder, where all modules are constructed with SNNs. To the best of our knowledge, we are the first to build a VAE only with SNN layers. We experimented with several datasets and confirmed that it can generate images of equal or better quality than conventional ANNs. The code will be available soon.
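    As a rough illustration of the Bernoulli latent sampling described above (not the authors' exact SNN implementation), one can draw a binary latent time series by randomly selecting one of several candidate binary outputs of an autoregressive model at each time step; the tensor layout below is assumed for illustration.

```python
import torch

def sample_bernoulli_latent(spike_outputs):
    """spike_outputs: (T, K, D) binary tensor holding K candidate spike
    vectors per time step from an autoregressive model. Randomly picking
    one candidate per step yields a (T, D) binary latent whose marginals
    follow a Bernoulli process."""
    T, K, D = spike_outputs.shape
    idx = torch.randint(K, (T,))
    return spike_outputs[torch.arange(T), idx]  # (T, D) with values in {0, 1}
```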
    Segmentation of Anatomical Layers and Artifacts in Intravascular Polarization Sensitive Optical Coherence Tomography Using Attending Physician and Boundary Cardinality Losses. (arXiv:2105.05137v2 [eess.IV] UPDATED)
    (0 min) Intravascular ultrasound and optical coherence tomography are widely available for characterizing coronary stenoses and provide critical vessel parameters to optimize percutaneous intervention. Intravascular polarization-sensitive optical coherence tomography (PS-OCT) simultaneously provides high-resolution cross-sectional images of vascular structures while also revealing preponderant tissue components such as collagen and smooth muscle and thereby enhances plaque characterization. Automated interpretation of these features promises to facilitate the objective clinical investigation of the natural history and significance of coronary atheromas. Here, we propose a convolutional neural network model, optimized using a new multi-term loss function, to classify the lumen, intima, and media layers in addition to the guidewire and plaque shadows. We demonstrate that our multi-class classification model outperforms state-of-the-art methods in detecting the coronary anatomical layers. Furthermore, the proposed model segments two classes of common imaging artifacts and detects the anatomical layers within the thickened vessel wall regions that were excluded from analysis by other studies. The source code and the trained model are publicly available at https://github.com/mhaft/OCTseg
    Mask or Non-Mask? Robust Face Mask Detector via Triplet-Consistency Representation Learning. (arXiv:2110.00523v1 [cs.CV])
    (0 min) In the absence of vaccines or medicines to stop COVID-19, one of the effective methods to slow the spread of the coronavirus and reduce the overloading of healthcare is to wear a face mask. Nevertheless, to mandate the use of face masks or coverings in public areas, additional human resources are required, which is tedious and attention-intensive. To automate the monitoring process, one of the promising solutions is to leverage existing object detection models to detect faces with or without masks. As such, security officers do not have to stare at monitoring devices or crowds, and only have to deal with the alerts triggered by the detection of faces without masks. Existing object detection models usually focus on designing CNN-based network architectures for extracting discriminative features. However, the training datasets for face mask detection are small, while the difference between faces with and without masks is subtle. Therefore, in this paper, we propose a face mask detection framework that uses a context attention module to enable effective attention in the feed-forward convolutional neural network by adapting its attention maps for feature refinement. Moreover, we further propose an anchor-free detector with Triplet-Consistency Representation Learning, which integrates a consistency loss and a triplet loss to deal with the small-scale training data and the similarity between masks and occlusions. Extensive experimental results show that our method outperforms other state-of-the-art methods. The source code is publicly available at https://github.com/wei-1006/MaskFaceDetection.
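    A minimal sketch of combining a triplet loss with a consistency term between embeddings of two augmented views, in the general spirit of the description above; this is not the authors' exact formulation, and the weight `lam` is a hypothetical hyper-parameter.

```python
import torch
import torch.nn.functional as F

def triplet_consistency_loss(anchor, positive, negative, view1, view2,
                             margin=0.3, lam=1.0):
    # Standard triplet loss on L2-normalized embeddings.
    triplet = F.triplet_margin_loss(
        F.normalize(anchor, dim=1),
        F.normalize(positive, dim=1),
        F.normalize(negative, dim=1),
        margin=margin)
    # Consistency loss: embeddings of two augmented views of the same
    # sample should agree.
    consistency = F.mse_loss(F.normalize(view1, dim=1),
                             F.normalize(view2, dim=1))
    return triplet + lam * consistency
```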
    Towards Protecting Face Embeddings in Mobile Face Verification Scenarios. (arXiv:2110.00434v1 [cs.CV])
    (0 min) This paper proposes PolyProtect, a method for protecting the sensitive face embeddings that are used to represent people's faces in neural-network-based face verification systems. PolyProtect transforms a face embedding to a more secure template, using a mapping based on multivariate polynomials parameterised by user-specific coefficients and exponents. In this work, PolyProtect is evaluated on two open-source face verification systems in a mobile application context, under the toughest threat model that assumes a fully-informed attacker with complete knowledge of the system and all its parameters. Results indicate that PolyProtect can be tuned to achieve a satisfactory trade-off between the recognition accuracy of the PolyProtected face verification system and the irreversibility of the PolyProtected templates. Furthermore, PolyProtected templates are shown to be effectively unlinkable, especially if the user-specific parameters employed in the PolyProtect mapping are selected in a non-naive manner. The evaluation is conducted using practical methodologies with tangible results, to present realistic insight into the method's robustness as a face embedding protection scheme in practice. The code to fully reproduce this work is available at: https://gitlab.idiap.ch/bob/bob.paper.polyprotect_2021.
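    A hypothetical illustration of a user-specific polynomial mapping in the spirit of PolyProtect; the true mapping, window overlap, and parameter ranges are defined in the paper, and the sketch below only conveys the general idea of secret coefficients and exponents applied to consecutive embedding elements.

```python
import numpy as np

def poly_protect_like(embedding, coeffs, exps, step):
    """Map consecutive windows of a face embedding through a polynomial:
    each protected element is sum_j c_j * v_j ** e_j over one window.
    `coeffs`, `exps` (length m) and `step` stand in for the user-specific
    secret parameters."""
    m = len(coeffs)
    protected = []
    for start in range(0, len(embedding) - m + 1, step):
        window = embedding[start:start + m]
        protected.append(np.sum(coeffs * np.power(window, exps)))
    return np.array(protected)

# Example usage with hypothetical parameters:
# v = np.random.randn(128)
# p = poly_protect_like(v, coeffs=np.array([3., -5., 2.]),
#                       exps=np.array([1, 2, 3]), step=2)
```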
    Robust Event Detection based on Spatio-Temporal Latent Action Unit using Skeletal Information. (arXiv:2109.02376v2 [cs.CV] UPDATED)
    (0 min) This paper proposes a novel dictionary learning approach to detect event actions using skeletal information extracted from RGBD video. The event action is represented as several latent atoms and composed of latent spatial and temporal attributes. We demonstrate the method on the example of fall event detection. The skeleton frames are clustered by an initial K-means method. Each skeleton frame is assigned a varying weight parameter and fed into our Gradual Online Dictionary Learning (GODL) algorithm. During the training process, outlier frames are gradually filtered out by reducing their weights, which are inversely proportional to a cost. In order to strictly distinguish the event action from similar actions and robustly acquire its action unit, we build a latent unit temporal structure for each sub-action. We evaluate the proposed method on parts of the NTU RGB+D dataset, which includes 209 fall videos, 405 ground-lift videos, 420 sit-down videos, and 280 videos of 46 other actions. We report the achieved accuracy, recall, and precision. Our approach achieves the best precision and accuracy for human fall event detection compared with other existing dictionary learning methods. With increasing noise ratios, our method maintains the highest accuracy and the lowest variance.
    Dual-Stream Reciprocal Disentanglement Learning for Domain Adaptation Person Re-Identification. (arXiv:2106.13929v2 [cs.CV] UPDATED)
    (0 min) Since the target set is free of human-labeled samples, unsupervised person re-identification (Re-ID) has attracted much attention in recent years by additionally exploiting the source set. However, due to differences in camera styles, illumination, and backgrounds, there exists a large gap between the source domain and the target domain, posing a great challenge for cross-domain matching. To tackle this problem, in this paper we propose a novel method named Dual-stream Reciprocal Disentanglement Learning (DRDL), which is quite efficient in learning domain-invariant features. In DRDL, two encoders are first constructed for id-related and id-unrelated feature extraction, each measured by its associated classifier. Furthermore, through an adversarial learning strategy, both streams reciprocally and positively affect each other, so that the id-related features and id-unrelated features are completely disentangled from a given image, allowing the encoder to be powerful enough to obtain discriminative but domain-invariant features. In contrast to existing approaches, our proposed method is free from image generation, which not only reduces the computational complexity remarkably, but also removes redundant information from the id-related features. Extensive experiments substantiate the superiority of our proposed method compared with state-of-the-art approaches. The source code has been released at https://github.com/lhf12278/DRDL.
    Score-Based Generative Classifiers. (arXiv:2110.00473v1 [stat.ML])
    (0 min) The tremendous success of generative models in recent years raises the question of whether they can also be used to perform classification. Generative models have been used as adversarially robust classifiers on simple datasets such as MNIST, but this robustness has not been observed on more complex datasets like CIFAR-10. Additionally, on natural image datasets, previous results have suggested a trade-off between the likelihood of the data and classification accuracy. In this work, we investigate score-based generative models as classifiers for natural images. We show that these models not only obtain competitive likelihood values but simultaneously achieve state-of-the-art classification accuracy for generative classifiers on CIFAR-10. Nevertheless, we find that these models are only slightly, if at all, more robust than discriminative baseline models on out-of-distribution tasks based on common image corruptions. Similarly, and contrary to prior results, we find that score-based models are prone to worst-case distribution shifts in the form of adversarial perturbations. Our work highlights that score-based generative models are closing the gap in classification accuracy compared to standard discriminative models. While they do not yet deliver on the promise of adversarial and out-of-domain robustness, they provide a different approach to classification that warrants further research.
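    For context (this is the generic recipe, not specific to this paper), a generative classifier turns class-conditional log-likelihoods into class posteriors via Bayes' rule; a minimal sketch:

```python
import numpy as np

def generative_classify(log_px_given_y, log_prior=None):
    """log_px_given_y: (N, C) array of log p(x | y=c) from C class-
    conditional generative models. Returns (N, C) posteriors p(y | x)."""
    if log_prior is None:
        log_prior = np.zeros(log_px_given_y.shape[1])  # uniform class prior
    logits = log_px_given_y + log_prior             # log p(x|y) + log p(y)
    logits -= logits.max(axis=1, keepdims=True)     # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```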
    X2CT-FLOW: Maximum a posteriori reconstruction using a progressive flow-based deep generative model for ultra sparse-view computed tomography in ultra low-dose protocols. (arXiv:2104.04179v2 [eess.IV] UPDATED)
    (2 min) Ultra sparse-view computed tomography (CT) algorithms can reduce radiation exposure of patients, but those algorithms lack an explicit cycle consistency loss minimization and an explicit log-likelihood maximization in testing. Here, we propose X2CT-FLOW for the maximum a posteriori (MAP) reconstruction of a three-dimensional (3D) chest CT image from a single or a few two-dimensional (2D) projection images using a progressive flow-based deep generative model, especially for ultra low-dose protocols. The MAP reconstruction can simultaneously optimize the cycle consistency loss and the log-likelihood. The proposed algorithm is built upon a newly developed progressive flow-based deep generative model, which is featured with exact log-likelihood estimation, efficient sampling, and progressive learning. We applied X2CT-FLOW to reconstruction of 3D chest CT images from biplanar projection images without noise contamination (assuming a standard-dose protocol) and with strong noise contamination (assuming an ultra low-dose protocol). With the standard-dose protocol, our images reconstructed from 2D projected images and 3D ground-truth CT images showed good agreement in terms of structural similarity (SSIM, 0.7675 on average), peak signal-to-noise ratio (PSNR, 25.89 dB on average), mean absolute error (MAE, 0.02364 on average), and normalized root mean square error (NRMSE, 0.05731 on average). Moreover, with the ultra low-dose protocol, our images reconstructed from 2D projected images and the 3D ground-truth CT images also showed good agreement in terms of SSIM (0.7008 on average), PSNR (23.58 dB on average), MAE (0.02991 on average), and NRMSE (0.07349 on average).
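    The abstract does not spell out the exact objective, but a MAP reconstruction that jointly maximizes the flow log-likelihood and a projection (cycle) consistency term plausibly takes a form such as the following, where $p_\theta$ is the flow density over 3D volumes, $P$ the forward projection operator, $y$ the observed 2D projections, and $\lambda$ a weighting assumed here for illustration:

```latex
\hat{x}_{3D} \;=\; \arg\max_{x_{3D}} \;\; \log p_\theta(x_{3D}) \;-\; \lambda \,\bigl\lVert P(x_{3D}) - y \bigr\rVert_2^2
```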
    Layered Neural Rendering for Retiming People in Video. (arXiv:2009.07833v2 [cs.CV] UPDATED)
    (2 min) We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video. A key property of our model is that it not only disentangles the direct motions of each person in the input video, but also correlates each person automatically with the scene changes they generate -- e.g., shadows, reflections, and motion of loose clothing. The layers can be individually retimed and recombined into a new video, allowing us to achieve realistic, high-quality renderings of retiming effects for real-world videos depicting complex actions and involving multiple individuals, including dancing, trampoline jumping, or group running.
    Unsupervised Motion Representation Learning with Capsule Autoencoders. (arXiv:2110.00529v1 [cs.CV])
    (2 min) We propose the Motion Capsule Autoencoder (MCAE), which addresses a key challenge in the unsupervised learning of motion representations: transformation invariance. MCAE models motion in a two-level hierarchy. In the lower level, a spatio-temporal motion signal is divided into short, local, and semantic-agnostic snippets. In the higher level, the snippets are aggregated to form full-length semantic-aware segments. For both levels, we represent motion with a set of learned transformation invariant templates and the corresponding geometric transformations by using capsule autoencoders of a novel design. This leads to a robust and efficient encoding of viewpoint changes. MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets. Notably, it achieves better results than baselines on Trajectory20 with considerably fewer parameters and state-of-the-art performance on the unsupervised skeleton-based action recognition task.
    From SLAM to Situational Awareness: Challenges and Survey. (arXiv:2110.00273v1 [cs.RO])
    (0 min) The knowledge that an intelligent and autonomous mobile robot has, and is able to acquire, of itself and the environment, namely the situation, limits its reasoning, decision-making, and execution skills when performing complex missions efficiently and safely. Situational awareness is a basic capability of humans that has been deeply studied in fields like Psychology, Military, Aerospace, and Education, but it has barely been considered in robotics, which has focused on ideas such as sensing, perception, sensor fusion, state estimation, localization and mapping, and spatial AI. In our research, we connect the broad multidisciplinary existing knowledge on situational awareness with its counterpart in mobile robotics. In this paper, we survey the state-of-the-art robotics algorithms, analyze the situational awareness aspects they cover, and discuss their missing points. We find that existing robotics algorithms still miss many important aspects of situational awareness. As a consequence, we conclude that these missing features limit the performance of robotic situational awareness, and that further research is needed to overcome this challenge. We see this as an opportunity, and provide our vision for future research on robotic situational awareness.
    Omnimatte: Associating Objects and Their Effects in Video. (arXiv:2105.06993v2 [cs.CV] UPDATED)
    (2 min) Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects -- shadows, reflections, generated smoke, etc -- are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject -- an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic -- it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.
    TEACh: Task-driven Embodied Agents that Chat. (arXiv:2110.00534v1 [cs.CV])
    (2 min) Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human--human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.
    On Deep Neural Network Calibration by Regularization and its Impact on Refinement. (arXiv:2106.09385v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks have been shown to be highly miscalibrated; often they tend to be overconfident in their predictions. This poses a significant challenge to utilising deep neural networks (DNNs) reliably in safety-critical systems. Many recently proposed approaches to mitigate this have demonstrated substantial progress in improving DNN calibration. However, they hardly touch upon refinement, which historically has been an essential aspect of calibration. Refinement indicates the separability of a network's correct and incorrect predictions. This paper presents a theoretically and empirically supported exposition reviewing the refinement of a calibrated model. Firstly, we show the breakdown of the expected calibration error (ECE) into predicted confidence and refinement under the assumption of over-confident predictions. Secondly, linking with this result, we highlight that regularization-based calibration only focuses on naively reducing a model's confidence. This has a severe downside for a model's refinement, as correct and incorrect predictions become tightly coupled. Lastly, connecting refinement with ECE also provides support for existing refinement-based approaches which improve calibration but do not explain the reasoning behind it. We support our claims through rigorous empirical evaluations of many state-of-the-art calibration approaches on widely used datasets and neural networks. We find that many calibration approaches, such as label smoothing and mixup, lower the usefulness of a DNN by degrading its refinement. Even under natural data shift, this calibration-refinement trade-off holds for the majority of calibration methods.
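    For reference, the expected calibration error discussed above is typically computed by binning predictions by confidence; a minimal sketch with equal-width bins, assuming arrays of predicted confidences, predicted labels, and true labels:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)| over confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = (predictions[mask] == labels[mask]).mean()
        conf = confidences[mask].mean()
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```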
    Generative Compositional Augmentations for Scene Graph Prediction. (arXiv:2007.05756v3 [cs.CV] UPDATED)
    (2 min) Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language. We consider a challenging problem of compositional generalization that emerges in this task due to a long-tail data distribution. Current scene graph generation models are trained on a tiny fraction of the distribution corresponding to the most frequent compositions. However, test images might contain zero- and few-shot compositions of objects and relationships. Despite each of the object categories and the predicate (e.g. 'on') being frequent in the training data, the models often fail to properly understand such unseen or rare compositions. To improve generalization, it is natural to attempt increasing the diversity of the training distribution. However, in the graph domain this is non-trivial. To that end, we propose a method to synthesize rare yet plausible scene graphs by perturbing real ones. We then propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs and learn from them in a joint fashion. When evaluated on the Visual Genome dataset, our approach yields marginal, but consistent improvements in zero- and few-shot metrics. We analyze the limitations of our approach, indicating promising directions for future research.
    Iterative Human and Automated Identification of Wildlife Images. (arXiv:2105.02320v2 [cs.CV] UPDATED)
    (2 min) Camera trapping is increasingly used to monitor wildlife, but this technology typically requires extensive data annotation. Recently, deep learning has significantly advanced automatic wildlife recognition. However, current methods are hampered by a dependence on large static data sets when wildlife data is intrinsically dynamic and involves long-tailed distributions. These two drawbacks can be overcome through a hybrid combination of machine learning and humans in the loop. Our proposed iterative human and automated identification approach is capable of learning from wildlife imagery data with a long-tailed distribution. Additionally, it includes self-updating learning that facilitates capturing the community dynamics of rapidly changing natural systems. Extensive experiments show that our approach can achieve a ~90% accuracy employing only ~20% of the human annotations of existing approaches. Our synergistic collaboration of humans and machines transforms deep learning from a relatively inefficient post-annotation tool to a collaborative on-going annotation tool that vastly relieves the burden of human annotation and enables efficient and constant model updates.
    A Neurally-Inspired Hierarchical Prediction Network for Spatiotemporal Sequence Learning and Prediction. (arXiv:1901.09002v2 [cs.NE] UPDATED)
    (2 min) In this paper we develop a hierarchical network model, called the Hierarchical Prediction Network (HPNet), to understand how spatiotemporal memories might be learned and encoded in the recurrent circuits of the visual cortical hierarchy for predicting future video frames. This neurally inspired model operates in the analysis-by-synthesis framework. It contains a feed-forward path that computes and encodes spatiotemporal features of successive complexity and a feedback path through which successive levels project their interpretations to the level below. Within each level, the feed-forward path and the feedback path intersect in a recurrent gated circuit, instantiated in an LSTM module, to generate a prediction or explanation of the incoming signals. The network learns its internal model of the world by minimizing the errors of its prediction of the incoming signals at each level of the hierarchy. We found that hierarchical interaction in the network increases semantic clustering of global movement patterns in the population codes of the units along the hierarchy, even in the earliest module. This facilitates the learning of relationships among movement patterns, yielding state-of-the-art performance in long-range video sequence prediction on benchmark datasets. The network model automatically reproduces a variety of prediction suppression and familiarity suppression neurophysiological phenomena observed in the visual cortex, suggesting that hierarchical prediction might indeed be an important principle for representational learning in the visual cortex.
    Optic Disc Segmentation using Disk-Centered Patch Augmentation. (arXiv:2110.00512v1 [eess.IV])
    (2 min) The optic disc is a crucial diagnostic feature in the eye since changes to its physiognomy are correlated with the severity of various ocular and cardiovascular diseases. While identifying the bulk of the optic disc in a color fundus image is straightforward, accurately segmenting its boundary at the pixel level is very challenging. In this work, we propose disc-centered patch augmentation (DCPA) -- a simple, yet novel training scheme for deep neural networks -- to address this problem. DCPA achieves state-of-the-art results on full-size images even when using small neural networks, specifically a U-Net with only 7 million parameters as opposed to the original 31 million. In DCPA, we restrict the training data to patches that fully contain the optic nerve. In addition, we also train the network using dynamic cost functions to increase its robustness. We tested DCPA-trained networks on five retinal datasets: DRISTI, DRIONS-DB, DRIVE, AV-WIDE, and CHASE-DB. The first two had available optic disc ground truth, and we manually estimated the ground truth for the latter three. Our approach achieved state-of-the-art F1 and IOU results on four datasets (95% F1, 91% IOU on DRISTI; 92% F1, 84% IOU on DRIVE; 83% F1, 71% IOU on AV-WIDE; 83% F1, 71% IOU on CHASE-DB) and competitive results on the fifth (95% F1, 91% IOU on DRIONS-DB), confirming its generality. Our open-source code and ground-truth annotations are available at: https://github.com/saeidmotevali/fundusdisk
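    A minimal sketch of disc-centered patch extraction under the stated idea (patches restricted to fully contain the disc); the patch size, jitter, and sampling policy below are assumptions, not the paper's exact DCPA settings.

```python
import numpy as np

def disc_centered_patches(image, disc_mask, patch_size=256, n_patches=8,
                          rng=None):
    """Crop patches that fully contain the optic disc, with random jitter
    of the patch center around the disc centroid. Assumes the image is
    larger than the patch size."""
    rng = np.random.default_rng(rng)
    ys, xs = np.nonzero(disc_mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    cy, cx = (y0 + y1) // 2, (x0 + x1) // 2
    h, w = disc_mask.shape
    half = patch_size // 2
    # Maximum jitter that keeps the whole disc inside the patch.
    jy = max(half - (y1 - y0) // 2 - 1, 0)
    jx = max(half - (x1 - x0) // 2 - 1, 0)
    patches = []
    for _ in range(n_patches):
        py = int(np.clip(cy + rng.integers(-jy, jy + 1), half, h - half))
        px = int(np.clip(cx + rng.integers(-jx, jx + 1), half, w - half))
        patches.append(image[py - half:py + half, px - half:px + half])
    return patches
```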
    GAN-based Reactive Motion Synthesis with Class-aware Discriminators for Human-human Interaction. (arXiv:2110.00380v1 [cs.GR])
    (0 min) Creating realistic characters that can react to a user's or another character's movement can greatly benefit computer graphics, games, and virtual reality. However, synthesizing such reactive motions in human-human interactions is a challenging task due to the many different ways two humans can interact. While there are a number of successful studies on adapting the generative adversarial network (GAN) to synthesize single human actions, there are very few on modelling human-human interactions. In this paper, we propose a semi-supervised GAN system that synthesizes the reactive motion of a character given the active motion from another character. Our key insights are two-fold. First, to effectively encode the complicated spatial-temporal information of a human motion, we empower the generator with a part-based long short-term memory (LSTM) module, such that the temporal movement of different limbs can be effectively modelled. We further include an attention module such that the temporal significance of the interaction can be learned, which enhances the temporal alignment of the active-reactive motion pair. Second, as the reactive motions of different types of interactions can be significantly different, we introduce a discriminator that not only tells whether the generated movement is realistic, but also predicts the class label of the interaction. This allows such labels to be used in supervising the training of the generator. We experiment with the SBU and the HHOI datasets. The high quality of the synthesized motion demonstrates the effective design of our generator, and the discriminability of the synthesis also demonstrates the strength of our discriminator.
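    A minimal sketch of a class-aware discriminator in the spirit described (a real/fake head plus a class head, as in auxiliary-classifier GANs); the motion encoder here is a placeholder MLP, not the authors' part-based LSTM.

```python
import torch
import torch.nn as nn

class ClassAwareDiscriminator(nn.Module):
    def __init__(self, motion_dim, n_classes, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(motion_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2))
        self.real_fake = nn.Linear(hidden, 1)          # real vs. synthesized
        self.classifier = nn.Linear(hidden, n_classes)  # interaction class

    def forward(self, motion):
        h = self.backbone(motion)
        return self.real_fake(h), self.classifier(h)
```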
    Video Temporal Relationship Mining for Data-Efficient Person Re-identification. (arXiv:2110.00549v1 [cs.CV])
    (2 min) This paper is a technical report on our submission to the ICCV 2021 VIPriors Re-identification Challenge. In order to make full use of the visual inductive priors of the data, we treat the query and gallery images of the same identity as continuous frames in a video sequence. We propose a novel post-processing strategy for video temporal relationship mining, which calculates not only the distance matrix between query and gallery images, but also the matrix between gallery images. The initial query image is used to retrieve the most similar image from the gallery; the retrieved image is then treated as a new query to retrieve its most similar image from the gallery. By iteratively searching for the closest image, we achieve accurate image retrieval and finally obtain a robust retrieval sequence.
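    A minimal sketch of the iterative nearest-neighbour search described above, assuming precomputed feature vectors; the distance metric and the number of steps are assumptions.

```python
import numpy as np

def iterative_retrieval(query_feat, gallery_feats, n_steps=5):
    """Start from the query, repeatedly retrieve the closest unused gallery
    image and use it as the next query, producing a retrieval sequence of
    gallery indices."""
    remaining = list(range(len(gallery_feats)))
    current = query_feat
    sequence = []
    for _ in range(min(n_steps, len(remaining))):
        dists = np.linalg.norm(gallery_feats[remaining] - current, axis=1)
        best = int(np.argmin(dists))
        idx = remaining.pop(best)
        sequence.append(idx)
        current = gallery_feats[idx]
    return sequence
```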
    Visual Vibration Tomography: Estimating Interior Material Properties from Monocular Video. (arXiv:2104.02735v2 [cs.CV] UPDATED)
    (2 min) An object's interior material properties, while invisible to the human eye, determine motion observed on its surface. We propose an approach that estimates heterogeneous material properties of an object directly from a monocular video of its surface vibrations. Specifically, we estimate Young's modulus and density throughout a 3D object with known geometry. Knowledge of how these values change across the object is useful for characterizing defects and simulating how the object will interact with different environments. Traditional non-destructive testing approaches, which generally estimate homogenized material properties or the presence of defects, are expensive and use specialized instruments. We propose an approach that leverages monocular video to (1) measure an object's sub-pixel motion and decompose this motion into image-space modes, and (2) directly infer spatially-varying Young's modulus and density values from the observed image-space modes. On both simulated and real videos, we demonstrate that our approach is able to image material properties simply by analyzing surface motion. In particular, our method allows us to identify unseen defects on a 2D drum head from real, high-speed video.
    MEDIC: A Multi-Task Learning Dataset for Disaster Image Classification. (arXiv:2108.12828v2 [cs.CV] UPDATED)
    (2 min) Recent research in disaster informatics demonstrates a practical and important use case of artificial intelligence to save human lives and reduce suffering in the aftermath of natural disasters based on social media content (text and images). While notable progress has been made using texts, research on exploiting images remains relatively under-explored. To advance the image-based approach, we propose MEDIC (available at: https://crisisnlp.qcri.org/medic/index.html), the largest social media image classification dataset for humanitarian response, consisting of 71,198 images addressing four different tasks in a multi-task learning setup. This is the first dataset of its kind at the intersection of social media imagery, disaster response, and multi-task learning research. An important property of this dataset is its high potential to contribute to research on multi-task learning, which has recently received much interest from the machine learning community and has shown remarkable results in terms of memory, inference speed, performance, and generalization capability. Therefore, the proposed dataset is an important resource for advancing image-based disaster management and multi-task machine learning research.
    Stochastic Modeling for Learnable Human Pose Triangulation. (arXiv:2110.00280v1 [cs.CV])
    (2 min) We propose a stochastic modeling framework for 3D human pose triangulation and evaluate its performance across different datasets and spatial camera arrangements. The common approach to 3D pose estimation is to first detect 2D keypoints in images and then apply triangulation from multiple views. However, the majority of existing triangulation models are limited to a single dataset, i.e., a particular camera arrangement and number of cameras. Moreover, they require known camera parameters. The proposed stochastic pose triangulation model successfully generalizes to different camera arrangements and between two public datasets. In each step, we generate a set of 3D pose hypotheses obtained by triangulation from a random subset of views. The hypotheses are evaluated by a neural network and the expectation of the triangulation error is minimized. The key novelty is that the network learns to evaluate the poses without taking into account the spatial camera arrangement, thus improving generalization. Additionally, we demonstrate that the proposed stochastic framework can also be used for fundamental matrix estimation, showing promising results towards relative camera pose estimation from noisy keypoint correspondences.
    Cross Modal Focal Loss for RGBD Face Anti-Spoofing. (arXiv:2103.00948v1 [cs.CV] CROSS LISTED)
    (2 min) Automatic methods for detecting presentation attacks are essential to ensure the reliable use of facial recognition technology. Most of the methods available in the literature for presentation attack detection (PAD) fail to generalize to unseen attacks. In recent years, multi-channel methods have been proposed to improve the robustness of PAD systems. Often, only a limited amount of data is available for additional channels, which limits the effectiveness of these methods. In this work, we present a new framework for PAD that uses RGB and depth channels together with a novel loss function. The new architecture uses complementary information from the two modalities while reducing the impact of overfitting. Essentially, a cross-modal focal loss function is proposed to modulate the loss contribution of each channel as a function of the confidence of the individual channels. Extensive evaluations on two publicly available datasets demonstrate the effectiveness of the proposed approach.
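    One plausible form of a confidence-modulated per-channel loss in the spirit of the description above (not the exact CMFL formulation from the paper): the cross-entropy of each channel is down-weighted when the other channel is already confident.

```python
import torch
import torch.nn.functional as F

def cross_modal_focal_like(logits_rgb, logits_depth, target, gamma=2.0):
    """logits_*: (N, 2) bona-fide/attack logits per channel; target: (N,).
    Each channel's CE is modulated by (1 - p_other)**gamma, so a channel
    contributes less when its counterpart is already confident."""
    p_rgb = F.softmax(logits_rgb, dim=1).gather(1, target[:, None]).squeeze(1)
    p_dep = F.softmax(logits_depth, dim=1).gather(1, target[:, None]).squeeze(1)
    ce_rgb = F.cross_entropy(logits_rgb, target, reduction='none')
    ce_dep = F.cross_entropy(logits_depth, target, reduction='none')
    loss = (1 - p_dep) ** gamma * ce_rgb + (1 - p_rgb) ** gamma * ce_dep
    return loss.mean()
```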
    Content-Preserving Unpaired Translation from Simulated to Realistic Ultrasound Images. (arXiv:2103.05745v2 [eess.IV] UPDATED)
    (2 min) Interactive simulation of ultrasound imaging greatly facilitates sonography training. Although ray-tracing based methods have shown promising results, obtaining realistic images requires substantial modeling effort and manual parameter tuning. In addition, current techniques still result in a significant appearance gap between simulated images and real clinical scans. Herein we introduce a novel content-preserving image translation framework (ConPres) to bridge this appearance gap, while maintaining the simulated anatomical layout. We achieve this goal by leveraging both simulated images with semantic segmentations and unpaired in-vivo ultrasound scans. Our framework is based on recent contrastive unpaired translation techniques and we propose a regularization approach by learning an auxiliary segmentation-to-real image translation task, which encourages the disentanglement of content and style. In addition, we extend the generator to be class-conditional, which enables the incorporation of additional losses, in particular a cyclic consistency loss, to further improve the translation quality. Qualitative and quantitative comparisons against state-of-the-art unpaired translation methods demonstrate the superiority of our proposed framework.
    Scientific evidence extraction. (arXiv:2110.00061v1 [cs.LG])
    (2 min) Recently, interest has grown in applying machine learning to the problem of table structure inference and extraction from unstructured documents. However, progress in this area has been challenging both to make and to measure, due to several issues that arise in training and evaluating models from labeled data. This includes challenges as fundamental as the lack of a single definitive ground truth output for each input sample and the lack of an ideal metric for measuring partial correctness for this task. To address these we propose a new dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric, grid table similarity (GriTS). PubTables-1M is nearly twice as large as the previous largest comparable dataset, can be used for models across multiple architectures and modalities, and addresses issues such as ambiguity and lack of consistency in the annotations. We apply DETR to table extraction for the first time and show that object detection models trained on PubTables-1M produce excellent results out-of-the-box for all three tasks of detection, structure recognition, and functional analysis. We describe the dataset in detail to enable others to build on our work and combine this data with other datasets for these and related tasks. It is our hope that PubTables-1M and the proposed metrics can further progress in this area by creating a benchmark suitable for training and evaluating a wide variety of models for table extraction. Data and code will be released at https://github.com/microsoft/table-transformer.
    Collapsible Linear Blocks for Super-Efficient Super Resolution. (arXiv:2103.09404v2 [eess.IV] UPDATED)
    (2 min) With the advent of smart devices that support 4K and 8K resolution, Single Image Super Resolution (SISR) has become an important computer vision problem. However, most super resolution deep networks are computationally very expensive. In this paper, we propose SESR, a new class of Super-Efficient Super Resolution networks that significantly improve image quality and reduce computational complexity. Detailed experiments across six benchmark datasets demonstrate that SESR achieves similar or better image quality than state-of-the-art models while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. As a result, SESR can be used on constrained hardware to perform x2 (1080p to 4K) and x4 SISR (1080p to 8K). Towards this, we simulate hardware performance numbers for a commercial mobile Neural Processing Unit (NPU) for 1080p to 4K (x2) and 1080p to 8K (x4) SISR. Our results highlight the challenges faced by super resolution on AI accelerators and demonstrate that SESR is significantly faster than existing models. Overall, SESR establishes a new Pareto frontier on the quality (PSNR)-computation relationship for the super resolution task. The code for this work is available at https://github.com/ARM-software/sesr.
    Instance Segmentation Challenge Track Technical Report, VIPriors Workshop at ICCV 2021: Task-Specific Copy-Paste Data Augmentation Method for Instance Segmentation. (arXiv:2110.00470v1 [cs.CV])
    (2 min) Copy-Paste has proven to be a very effective data augmentation for instance segmentation that can improve the generalization of the model. We used a task-specific Copy-Paste data augmentation method to achieve good performance on the instance segmentation track of the 2nd VIPriors workshop challenge. We also applied additional data augmentation techniques, including RandAugment and GridMask. Our segmentation model is the HTC detector on a CBSwin-B backbone with CBFPN, with some tweaks. The model was trained in multi-scale mode with a random sampler on the 6x schedule and tested in single-scale mode. By combining these techniques, we achieved 0.398 AP@0.50:0.95 on the validation set and 0.433 AP@0.50:0.95 on the test set. Finally, we reached 0.477 AP@0.50:0.95 on the test set by adding the validation set to the training data. Source code is available at https://github.com/jahongir7174/VIP2021.
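    A minimal sketch of the basic Copy-Paste idea (pasting masked instances from a source image onto a target image of the same size); the task-specific selection rules used for the challenge are not reproduced here.

```python
import numpy as np

def copy_paste(src_img, src_masks, dst_img, dst_masks, p=0.5, rng=None):
    """Paste instance masks from the source image onto the destination
    image (both assumed the same size); destination masks are updated so
    that pasted pixels are removed from occluded instances."""
    rng = np.random.default_rng(rng)
    out_img = dst_img.copy()
    pasted = np.zeros(dst_img.shape[:2], dtype=bool)
    new_masks = []
    for m in src_masks:
        if rng.random() > p:            # paste each instance with prob. p
            continue
        out_img[m] = src_img[m]
        pasted |= m
        new_masks.append(m.copy())
    kept = [m & ~pasted for m in dst_masks]  # occlude original instances
    return out_img, kept + new_masks
```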
    Summarize and Search: Learning Consensus-aware Dynamic Convolution for Co-Saliency Detection. (arXiv:2110.00338v1 [cs.CV])
    (2 min) Humans perform co-saliency detection by first summarizing the consensus knowledge in the whole group and then searching corresponding objects in each image. Previous methods usually lack robustness, scalability, or stability for the first process and simply fuse consensus features with image features for the second process. In this paper, we propose a novel consensus-aware dynamic convolution model to explicitly and effectively perform the "summarize and search" process. To summarize consensus image features, we first summarize robust features for every single image using an effective pooling method and then aggregate cross-image consensus cues via the self-attention mechanism. By doing this, our model meets the scalability and stability requirements. Next, we generate dynamic kernels from consensus features to encode the summarized consensus knowledge. Two kinds of kernels are generated in a supplementary way to summarize fine-grained image-specific consensus object cues and the coarse group-wise common knowledge, respectively. Then, we can effectively perform object searching by employing dynamic convolution at multiple scales. Besides, a novel and effective data synthesis method is also proposed to train our network. Experimental results on four benchmark datasets verify the effectiveness of our proposed method. Our code and saliency maps are available at \url{https://github.com/nnizhang/CADC}.
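    A minimal sketch of generating a convolution kernel from a group-level consensus feature and applying it to per-image features (a generic consensus-aware dynamic convolution; the two-kernel design and multi-scale search of CADC are not reproduced).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsensusDynamicConv(nn.Module):
    def __init__(self, channels, ksize=3):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        # Predict a depthwise kernel from the group-level consensus vector.
        self.kernel_head = nn.Linear(channels, channels * ksize * ksize)

    def forward(self, feats, consensus):
        """feats: (N, C, H, W) per-image features; consensus: (C,) vector."""
        n, c, h, w = feats.shape
        k = self.kernel_head(consensus).view(c, 1, self.ksize, self.ksize)
        # Depthwise convolution with the consensus-generated kernel.
        return F.conv2d(feats, k, padding=self.ksize // 2, groups=c)
```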
    Vision Xformers: Efficient Attention for Image Classification. (arXiv:2107.02239v4 [cs.CV] UPDATED)
    (2 min) Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computations in order to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Plus, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures that address these shortcomings. Firstly, we alleviate the quadratic bottleneck by using linear attention mechanisms, called X-formers (such that, X in {Performer, Linformer, Nystr\"omformer}), thereby creating Vision X-formers (ViXs). This resulted in up to a seven times reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Secondly, we introduced an inductive bias for images by replacing the initial linear embedding layer by convolutional layers in ViX, which significantly increased classification accuracy without increasing the model size. Thirdly, we replaced the learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE), which increases the classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.
    Development of the algorithm for differentiating bone metastases and trauma of the ribs in bone scintigraphy and demonstration of visual evidence of the algorithm -- Using only anterior bone scan view of thorax. (arXiv:2110.00130v1 [eess.IV])
    (2 min) Background: Although there are many studies on the application of artificial intelligence (AI) models to medical imaging, there is no report of an AI model that differentiates rib uptake due to bone metastases from that due to trauma using only the anterior bone scan view of the thorax. In recent years, a method for visualizing diagnostic grounds called Gradient-weighted Class Activation Mapping (Grad-CAM) has been proposed for diagnostic imaging with Deep Convolutional Neural Networks (DCNNs). As far as we have investigated, there are no reports of visualization of the diagnostic basis in bone scintigraphy. Our aim is to visualize the area of interest of the DCNN, in addition to developing an algorithm to classify whether RI accumulation on the ribs is bone metastasis or trauma using only the anterior bone scan view of the thorax. Material and Methods: For this retrospective study, we used 838 patients who underwent bone scintigraphy to search for bone metastases at our institution. A frontal chest image from bone scintigraphy was used to create the algorithm. We used 437 cases with bone metastases on the ribs and 401 cases with abnormal RI accumulation due to trauma. Results: The AI model was able to detect bone metastasis lesions with a sensitivity of 90.00% and an accuracy of 86.5%, and the regions the model focused on could be visualized with Grad-CAM.
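    For reference, Grad-CAM as cited above can be reproduced with forward and backward hooks on a chosen convolutional layer; a minimal generic sketch (not the study's code or data):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """image: (1, C, H, W) tensor; target_layer: a conv module inside `model`.
    Returns an (H, W) heatmap normalized to [0, 1]."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    weights = grads['g'].mean(dim=(2, 3), keepdim=True)      # GAP of gradients
    cam = F.relu((weights * acts['a']).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear',
                        align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```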
    Deep Image Synthesis from Intuitive User Input: A Review and Perspectives. (arXiv:2107.04240v2 [cs.CV] UPDATED)
    (2 min) In many applications of computer graphics, art and design, it is desirable for a user to provide intuitive non-image input, such as text, sketch, stroke, graph or layout, and have a computer system automatically generate photo-realistic images that adhere to the input content. While classic works that allow such automatic image content generation have followed a framework of image retrieval and composition, recent advances in deep generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and flow-based methods have enabled more powerful and versatile image generation tasks. This paper reviews recent works for image synthesis given intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics. This motivates new perspectives on input representation and interactivity, cross pollination between major image generation paradigms, and evaluation and comparison of generation methods.
    Preconditioned Plug-and-Play ADMM with Locally Adjustable Denoiser for Image Restoration. (arXiv:2110.00493v1 [eess.IV])
    (2 min) Plug-and-Play optimization recently emerged as a powerful technique for solving inverse problems by plugging a denoiser into a classical optimization algorithm. The denoiser accounts for the regularization and therefore implicitly determines the prior knowledge on the data, hence replacing typical handcrafted priors. In this paper, we extend the concept of plug-and-play optimization to use denoisers that can be parameterized for non-constant noise variance. In that aim, we introduce a preconditioning of the ADMM algorithm, which mathematically justifies the use of such an adjustable denoiser. We additionally propose a procedure for training a convolutional neural network for high quality non-blind image denoising that also allows for pixel-wise control of the noise standard deviation. We show that our pixel-wise adjustable denoiser, along with a suitable preconditioning strategy, can further improve the plug-and-play ADMM approach for several applications, including image completion, interpolation, demosaicing and Poisson denoising.
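    A minimal sketch of a standard (un-preconditioned) plug-and-play ADMM loop for a linear inverse problem y = Ax + n with a pluggable denoiser; the preconditioning and the pixel-wise adjustable denoiser proposed in the paper are not shown, and the inner step size is a hypothetical choice.

```python
import numpy as np

def pnp_admm(y, A, At, denoiser, shape, rho=1.0, n_iters=50):
    """Approximately solve min_x 0.5*||A x - y||^2 + prior(x) via PnP-ADMM.
    A, At: forward operator and its adjoint (callables on flat arrays);
    denoiser: callable acting as the proximal operator of the prior."""
    x = np.zeros(shape).ravel()
    z = x.copy()
    u = np.zeros_like(x)
    for _ in range(n_iters):
        # x-update: a few gradient steps approximating the data-fidelity prox.
        for _ in range(5):
            grad = At(A(x) - y) + rho * (x - z + u)
            x = x - 0.1 * grad
        # z-update: the denoiser replaces the proximal operator of the prior.
        z = denoiser((x + u).reshape(shape)).ravel()
        # Dual update.
        u = u + x - z
    return x.reshape(shape)
```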
    HUMBI: A Large Multiview Dataset of Human Body Expressions and Benchmark Challenge. (arXiv:2110.00119v1 [cs.CV])
    (2 min) This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling view-specific appearance and geometry of five primary body signals including gaze, face, hand, body, and garment from assorted people. 107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and style. With the multiview image streams, we reconstruct high fidelity body expressions using 3D mesh models, which allows representing view-specific appearance. We demonstrate that HUMBI is highly effective in learning and reconstructing a complete human model and is complementary to the existing datasets of human body expressions with limited views and subjects such as MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio datasets. Based on HUMBI, we formulate a new benchmark challenge of a pose-guided appearance rendering task that aims to substantially extend photorealism in modeling diverse human expressions in 3D, which is the key enabling factor of authentic social tele-presence. HUMBI is publicly available at this http URL
    Distance Metric Learned Collaborative Representation Classifier. (arXiv:1905.01168v3 [cs.CV] UPDATED)
    (2 min) Any generic deep machine learning algorithm is essentially a function fitting exercise, where the network tunes its weights and parameters to learn discriminatory features by minimizing some cost function. Though the network tries to learn the optimal feature space, it seldom tries to learn an optimal distance metric in the cost function, and hence misses out on an additional layer of abstraction. We present a simple and effective way of achieving this by learning a generic Mahalanobis distance in a collaborative loss function in an end-to-end fashion with any standard convolutional network as the feature learner. The proposed method, DML-CRC, gives state-of-the-art performance on the benchmark fine-grained classification datasets CUB Birds, Oxford Flowers, and Oxford-IIIT Pets using the VGG-19 deep network. The method is network agnostic and can be used for any similar classification tasks.
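    A minimal sketch of learning a generic Mahalanobis distance end-to-end by parameterizing it as d(x1, x2) = ||L(x1 - x2)||^2 with a learnable matrix L (so the metric M = L^T L stays positive semi-definite); how this distance enters the collaborative loss of DML-CRC is not reproduced here.

```python
import torch
import torch.nn as nn

class MahalanobisDistance(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.L = nn.Parameter(torch.eye(dim))  # M = L^T L is learned implicitly

    def forward(self, x1, x2):
        diff = x1 - x2                    # (N, dim) pairwise differences
        proj = diff @ self.L.t()          # linear transform by L
        return proj.pow(2).sum(dim=1)     # squared Mahalanobis distance
```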
    ResNet strikes back: An improved training procedure in timm. (arXiv:2110.00476v1 [cs.CV])
    (2 min) The influential Residual Networks designed by He et al. remain the gold-standard architecture in numerous scientific publications. They typically serve as the default architecture in studies, or as baselines when new architectures are proposed. Yet there has been significant progress on best practices for training neural networks since the inception of the ResNet architecture in 2015. Novel optimization and data-augmentation strategies have increased the effectiveness of training recipes. In this paper, we re-evaluate the performance of the vanilla ResNet-50 when trained with a procedure that integrates such advances. We share competitive training settings and pre-trained models in the timm open-source library, with the hope that they will serve as better baselines for future work. For instance, with our more demanding training setting, a vanilla ResNet-50 reaches 80.4% top-1 accuracy at resolution 224x224 on ImageNet-val without extra data or distillation. We also report the performance achieved with popular models with our training procedure.
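    The recipes and weights live in the timm library; loading a pre-trained ResNet-50 baseline is a one-liner (which exact pre-trained tags are available depends on the installed timm version).

```python
import timm

# List ResNet-50 variants shipped with timm, then load a pre-trained one.
print(timm.list_models('resnet50*')[:5])
model = timm.create_model('resnet50', pretrained=True)
model.eval()
```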
    Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images. (arXiv:2110.00519v1 [cs.CV])
    (2 min) While neural symbolic methods demonstrate impressive performance in visual question answering on synthetic images, their performance suffers on real images. We identify that the long-tail distribution of visual concepts and the unequal importance of reasoning steps in real data are the two key obstacles that limit the models' real-world potential. To address these challenges, we propose a new paradigm, Calibrating Concepts and Operations (CCO), which enables neural symbolic models to capture underlying data characteristics and to reason with hierarchical importance. Specifically, we introduce an executor with learnable concept embedding magnitudes for handling distribution imbalance, and an operation calibrator for highlighting important operations and suppressing redundant ones. Our experiments show that CCO substantially boosts the performance of neural symbolic methods on real images. By evaluating models on the real-world dataset GQA, CCO helps the neural symbolic method NSCL outperform its vanilla counterpart by 9.1% (from 47.0% to 56.1%); this result also largely reduces the performance gap between symbolic and non-symbolic methods. Additionally, we create a perturbed test set for better understanding and analyzing model performance on real images. Code is available at https://github.com/Lizw14/CaliCO.git .
    Self-supervised Secondary Landmark Detection via 3D Representation Learning. (arXiv:2110.00543v1 [cs.CV])
    (2 min) Recent technological developments have spurred great advances in the computerized tracking of joints and other landmarks in moving animals, including humans. Such tracking promises important advances in biology and biomedicine. Modern tracking models depend critically on labor-intensive datasets of primary landmarks annotated by non-expert humans. However, such annotation approaches can be costly and impractical for secondary landmarks, that is, ones that reflect the fine-grained geometry of animals and that are often specific to customized behavioral tasks. Due to visual and geometric ambiguity, non-experts are often not qualified for secondary landmark annotation, which can require anatomical and zoological knowledge. These barriers significantly impede downstream behavioral studies because the learned tracking models exhibit limited generalizability. We hypothesize that there exists a shared representation between the primary and secondary landmarks because the range of motion of the secondary landmarks can be approximately spanned by that of the primary landmarks. We present a method to learn this spatial relationship between the primary and secondary landmarks in three-dimensional space, which can, in turn, self-supervise the secondary landmark detector. This 3D representation learning is generic, and can therefore be applied to various multiview settings across diverse organisms, including macaques, flies, and humans.
    Zero-shot Natural Language Video Localization. (arXiv:2110.00428v1 [cs.CL])
    (2 min) Understanding videos to localize moments with natural language often requires large, expensively annotated video regions paired with language queries. To eliminate the annotation costs, we make a first attempt to train a natural language video localization (NLVL) model in a zero-shot manner. Inspired by the unsupervised image captioning setup, we merely require random text corpora, unlabeled video collections, and an off-the-shelf object detector to train a model. With the unpaired data, we propose to generate pseudo-supervision of candidate temporal regions and corresponding query sentences, and develop a simple NLVL model to train with the pseudo-supervision. Our empirical validations show that the proposed pseudo-supervised method outperforms several baseline approaches and a number of methods using stronger supervision on Charades-STA and ActivityNet-Captions.
    Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?. (arXiv:2110.00528v1 [cs.CV])
    (2 min) Despite the success of a number of recent techniques for visual self-supervised deep learning, there remains limited investigation into the representations that are ultimately learned. By using recent advances in comparing neural representations, we explore this question by comparing a contrastive self-supervised algorithm (SimCLR) to supervision for simple image data in a common architecture. We find that the methods learn similar intermediate representations through dissimilar means, and that the representations diverge rapidly in the final few layers. We investigate this divergence, finding that it is caused by these layers strongly fitting to the distinct learning objectives. We also find that SimCLR's objective implicitly fits the supervised objective in intermediate layers, but that the reverse is not true. Our work particularly highlights the importance of the learned intermediate representations, and raises important questions for auxiliary task design.
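    One widely used tool for such representation comparisons is linear centered kernel alignment (CKA); whether the paper uses exactly this variant is an assumption here, but a minimal sketch is:

```python
import numpy as np

def linear_cka(X, Y):
    """X: (n, p1), Y: (n, p2) activations for the same n examples.
    Returns the linear CKA similarity in [0, 1]."""
    X = X - X.mean(axis=0)                    # center each feature dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)
```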
    Robustly Removing Deep Sea Lighting Effects for Visual Mapping of Abyssal Plains. (arXiv:2110.00480v1 [cs.CV])
    (2 min) The majority of Earth's surface lies deep in the oceans, where no surface light reaches. Robots diving down to great depths must bring light sources that create moving illumination patterns in the darkness, such that the same 3D point appears with a different color in each image. In addition, scattering and attenuation of light in the water make images appear foggy and typically blueish, with the degradation depending on each pixel's distance to its observed seafloor patch, on the local composition of the water, and on the relative poses and cones of the light sources. Consequently, visual mapping, including image matching and surface albedo estimation, severely suffers from the effects that co-moving light sources produce, and larger mosaic maps built from photos are often dominated by lighting effects that obscure the actual seafloor structure. In this contribution, a practical approach to estimating and compensating these lighting effects on predominantly homogeneous, flat seafloor regions, as can be found in the abyssal plains of our oceans, is presented. The method is essentially parameter-free and intended as a preprocessing step to facilitate visual mapping, but already produces convincing lighting artefact compensation up to a global white balance factor. It does not require to be trained beforehand on huge sets of annotated images, which are not available for the deep sea. Rather, we motivate our work by physical models of light propagation, perform robust statistics-based estimates of additive and multiplicative nuisances that avoid explicit parameters for light, camera, water or scene, discuss the breakdown point of the algorithms, and show results on imagery captured by robots at several kilometers of water depth.
    ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception. (arXiv:2110.00511v1 [cs.CV])
    (2 min) We present ASH, a modern and high-performance framework for parallel spatial hashing on GPU. Compared to existing GPU hash map implementations, ASH achieves higher performance, supports richer functionality, and requires fewer lines of code (LoC) when used for implementing spatially varying operations from volumetric geometry reconstruction to differentiable appearance reconstruction. Unlike existing GPU hash maps, the ASH framework provides a versatile tensor interface, hiding low-level details from the users. In addition, by decoupling the internal hashing data structures and key-value data in buffers, we offer direct access to spatially varying data via indices, enabling seamless integration into modern libraries such as PyTorch. To achieve this, we 1) detach stored key-value data from the low-level hash map implementation; 2) bridge the pointer-first low-level data structures to index-first high-level tensor interfaces via an index heap; 3) adapt both generic and non-generic integer-only hash map implementations as backends to operate on multi-dimensional keys. We first profile our hash map against state-of-the-art hash maps on synthetic data to show the performance gain from this architecture. We then show that ASH can consistently achieve higher performance on various large-scale 3D perception tasks with fewer LoC by showcasing several applications, including 1) point cloud voxelization, 2) dense volumetric SLAM, 3) non-rigid point cloud registration and volumetric deformation, and 4) spatially varying geometry and appearance refinement. ASH and its example applications are open-sourced in Open3D (this http URL).
    Synergizing between Self-Training and Adversarial Learning for Domain Adaptive Object Detection. (arXiv:2110.00249v1 [cs.CV])
    (2 min) We study adapting trained object detectors to unseen domains manifesting significant variations of object appearance, viewpoints and backgrounds. Most current methods align domains by using either image- or instance-level feature alignment in an adversarial fashion. This often suffers from the presence of unwanted background and, as such, lacks class-specific alignment. A common remedy to promote class-level alignment is to use high confidence predictions on the unlabelled domain as pseudo labels. These high confidence predictions are often fallacious since the model is poorly calibrated under domain shift. In this paper, we propose to leverage model predictive uncertainty to strike the right balance between adversarial feature alignment and class-level alignment. Specifically, we measure predictive uncertainty on class assignments and the bounding box predictions. Model predictions with low uncertainty are used to generate pseudo-labels for self-supervision, whereas the ones with higher uncertainty are used to generate tiles for an adversarial feature alignment stage. This synergy between tiling around the uncertain object regions and generating pseudo-labels from highly certain object regions allows us to capture both the image and instance level context during the model adaptation stage. We perform extensive experiments covering various domain shift scenarios. Our approach improves upon existing state-of-the-art methods by visible margins.
    Improving Object Permanence using Agent Actions and Reasoning. (arXiv:2110.00238v1 [cs.RO])
    (2 min) Object permanence in psychology means knowing that objects still exist even if they are no longer visible. It is a crucial concept for robots to operate autonomously in uncontrolled environments. Existing approaches learn object permanence from low-level perception, but perform poorly on more complex scenarios, like when objects are contained and carried by others. Knowledge about manipulation actions performed on an object prior to its disappearance allows us to reason about its location, e.g., that the object has been placed in a carrier. In this paper we argue that object permanence can be improved when the robot uses knowledge about executed actions and describe an approach to infer hidden object states from agent actions. We show that considering agent actions not only improves rule-based reasoning models but also purely neural approaches, showing its general applicability. Then, we conduct quantitative experiments on a snitch localization task using a dataset of 1,371 synthesized videos, where we compare the performance of different object permanence models with and without action annotations. We demonstrate that models with action annotations can significantly increase performance of both neural and rule-based approaches. Finally, we evaluate the usability of our approach in real-world applications by conducting qualitative experiments with two Universal Robots (UR5 and UR16e) in both lab and industrial settings. The robots complete benchmark tasks for a gearbox assembly and demonstrate the object permanence capabilities with real sensor data in an industrial environment.
    Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation. (arXiv:2110.00329v1 [cs.CV])
    (2 min) Knowledge distillation usually transfers the knowledge from a pre-trained cumbersome teacher network to a compact student network, which follows the classical teacher-teaching-student paradigm. Based on this paradigm, previous methods mostly focus on how to efficiently train a better student network for deployment. Different from the existing practices, in this paper, we propose a novel student-helping-teacher formula, Teacher Evolution via Self-Knowledge Distillation (TESKD), where the target teacher (for deployment) is learned with the help of multiple hierarchical students by sharing the structural backbone. The diverse feedback from multiple students allows the teacher to improve itself through the shared feature representations. The effectiveness of our proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks including CIFAR-100 and ImageNet. Notably, when trained together with our proposed method, ResNet-18 achieves 79.15% and 71.14% accuracy on CIFAR-100 and ImageNet, outperforming the baseline results by 4.74% and 1.43%, respectively. The code is available at: https://github.com/zhengli427/TESKD.
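    A minimal, generic sketch of the temperature-scaled distillation term that such frameworks build on; the TESKD-specific hierarchy of students sharing the teacher's backbone is not reproduced here, and the tensor names are placeholders.

    ```python
    # Generic temperature-scaled knowledge-distillation loss (standard KD building
    # block); not the TESKD architecture itself, only the knowledge-transfer term.
    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, T=4.0):
        log_p_student = F.log_softmax(student_logits / T, dim=1)
        p_teacher = F.softmax(teacher_logits / T, dim=1)
        # scale by T^2 to keep gradient magnitudes comparable across temperatures
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

    # toy usage: a batch of 8 samples, 100 classes (e.g., CIFAR-100)
    student_logits = torch.randn(8, 100, requires_grad=True)
    teacher_logits = torch.randn(8, 100)
    loss = kd_loss(student_logits, teacher_logits)
    loss.backward()
    ```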
    Learning of Inter-Label Geometric Relationships Using Self-Supervised Learning: Application To Gleason Grade Segmentation. (arXiv:2110.00404v1 [eess.IV])
    (2 min) Segmentation of Prostate Cancer (PCa) tissues from Gleason graded histopathology images is vital for accurate diagnosis. Although deep learning (DL) based segmentation methods achieve state-of-the-art accuracy, they rely on large datasets with manual annotations. We propose a method to synthesize PCa histopathology images by learning the geometrical relationship between different disease labels using self-supervised learning. We use a weakly supervised segmentation approach that uses the Gleason score to segment the diseased regions, and the resulting segmentation map is used to train a Shape Restoration Network (ShaRe-Net) to predict missing mask segments in a self-supervised manner. Using DenseUNet as the backbone generator architecture, we incorporate latent variable sampling to inject diversity into the image generation process and thus improve robustness. Experiments on multiple histopathology datasets demonstrate the superiority of our method over competing image synthesis methods for segmentation tasks. Ablation studies show the benefits of integrating geometry and diversity in generating high-quality images, and our self-supervised approach with limited class-labeled data achieves similar performance as fully supervised learning.
    Survey and synthesis of state of the art in driver monitoring. (arXiv:2110.00472v1 [cs.CV])
    (2 min) Road-vehicle accidents are mostly due to human errors, and many such accidents could be avoided by continuously monitoring the driver. Driver monitoring (DM) is a topic of growing interest in the automotive industry, and it will remain relevant for all vehicles that are not fully autonomous, and thus for decades for the average vehicle owner. The present paper focuses on the first step of DM, which consists in characterizing the state of the driver. Since DM will be increasingly linked to driving automation (DA), this paper presents a clear view of the role of DM at each of the six SAE levels of DA. This paper surveys the state of the art of DM, and then synthesizes it, providing a unique, structured, polychotomous view of the many characterization techniques of DM. Informed by the survey, the paper characterizes the driver state along the five main dimensions--called here "(sub)states"--of drowsiness, mental workload, distraction, emotions, and under the influence. The polychotomous view of DM is presented through a pair of interlocked tables that relate these states to their indicators (e.g., the eye-blink rate) and the sensors that can access each of these indicators (e.g., a camera). The tables factor in not only the effects linked directly to the driver, but also those linked to the (driven) vehicle and the (driving) environment. They show, at a glance, to concerned researchers, equipment providers, and vehicle manufacturers (1) most of the options they have to implement various forms of advanced DM systems, and (2) fruitful areas for further research and innovation.
    Consistent Explanations by Contrastive Learning. (arXiv:2110.00527v1 [cs.CV])
    (2 min) Understanding and explaining the decisions of neural networks are critical to building trust, rather than relying on them as black box algorithms. Post-hoc evaluation techniques, such as Grad-CAM, enable humans to inspect the spatial regions responsible for a particular network decision. However, it is shown that such explanations are not always consistent with human priors, such as consistency across image transformations. Given an interpretation algorithm, e.g., Grad-CAM, we introduce a novel training method to train the model to produce more consistent explanations. Since obtaining the ground truth for a desired model interpretation is not a well-defined task, we adopt ideas from contrastive self-supervised learning and apply them to the interpretations of the model rather than its embeddings. Explicitly training the network to produce more reasonable interpretations and subsequently evaluating those interpretations will enhance our ability to trust the network. We show that our method, Contrastive Grad-CAM Consistency (CGC), results in Grad-CAM interpretation heatmaps that are consistent with human annotations while still achieving comparable classification accuracy. Moreover, since our method can be seen as a form of regularizer, on limited-data fine-grained classification settings, our method outperforms the baseline classification accuracy on Caltech-Birds, Stanford Cars, VGG Flowers, and FGVC-Aircraft datasets. In addition, because our method does not rely on annotations, it allows for the incorporation of unlabeled data into training, which enables better generalization of the model. Our code is publicly available.
    Self-Supervised Decomposition, Disentanglement and Prediction of Video Sequences while Interpreting Dynamics: A Koopman Perspective. (arXiv:2110.00547v1 [cs.CV])
    (2 min) Human interpretation of the world encompasses the use of symbols to categorize sensory inputs and compose them in a hierarchical manner. One of the long-term objectives of Computer Vision and Artificial Intelligence is to endow machines with the capacity to structure and interpret the world as we do. Towards this goal, recent methods have successfully been able to decompose and disentangle video sequences into their composing objects and dynamics, in a self-supervised fashion. However, little effort has been devoted to interpreting the dynamics of the scene. We propose a method to decompose a video into moving objects and their attributes, and to model each object's dynamics with linear system identification tools, by means of a Koopman embedding. This allows interpretation, manipulation and extrapolation of the dynamics of the different objects by employing the Koopman operator K. We test our method on various synthetic datasets and successfully forecast challenging trajectories while interpreting them.
    DCT based Fusion of Variable Exposure Images for HDRI. (arXiv:2110.00312v1 [eess.IV])
    (2 min) Combining images with different exposure settings is of prime importance in the field of computational photography. Both transform-domain and filtering-based approaches can be used to fuse multiple exposure images into a well-exposed image. We propose a Discrete Cosine Transform (DCT) based approach for fusing multiple exposure images. The input image stack is processed in the transform domain by an averaging operation, and the inverse transform is performed on the averaged coefficients to generate the fused multiple-exposure image. Our experimental observations lead us to conjecture that the obtained DCT coefficients act as indicators of well-exposedness, contrast and saturation, as specified in the traditional exposure fusion approach, and that the averaging corresponds to assigning equal weights to the DCT coefficients in this non-parametric, non-pyramidal approach to fusing the multiple-exposure stack.
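    A minimal sketch of the transform-domain averaging described above, assuming grayscale exposures of equal size; the function and variable names are illustrative, not the paper's code.

    ```python
    # Equal-weight DCT-domain fusion of a multi-exposure stack (illustrative sketch).
    import numpy as np
    from scipy.fft import dctn, idctn

    def dct_fuse(stack):
        """stack: iterable of (H, W) float arrays with different exposures."""
        coeffs = [dctn(img, norm="ortho") for img in stack]       # forward 2D DCT
        fused_coeffs = np.mean(coeffs, axis=0)                    # equal weights
        return idctn(fused_coeffs, norm="ortho")                  # back to pixel domain

    # toy usage with a synthetic three-exposure stack
    base = np.random.rand(64, 64)
    stack = [np.clip(base * g, 0, 1) for g in (0.5, 1.0, 1.8)]
    fused = dct_fuse(stack)
    ```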
    A Graph-theoretic Algorithm for Small Bowel Path Tracking in CT Scans. (arXiv:2110.00466v1 [eess.IV])
    (2 min) We present a novel graph-theoretic method for small bowel path tracking. It is formulated as finding the minimum-cost path between given start and end nodes on a graph that is constructed based on bowel wall detection. We observed that a trivial solution with many short-cuts is easily obtained even with the wall detection, where the tracked path penetrates indistinct walls around the contact between different parts of the small bowel. Thus, we propose to include must-pass nodes when finding the path, to better cover the entire course of the small bowel. Unlike previous methods, the proposed method does not entail training with ground-truth paths. We acquired ground-truth paths that are all connected from the start to the end of the small bowel for 10 abdominal CT scans, which enables evaluation of path tracking over the entire course of the small bowel. The proposed method showed clear improvements on several metrics compared to the baseline method. With the proposed method, the maximum length of path tracked without error per scan is above 800 mm on average.
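    A toy sketch of the must-pass idea, assuming the waypoints are already ordered along the bowel; the graph construction from wall detections and the paper's cost design are omitted.

    ```python
    # Chain Dijkstra shortest paths through ordered must-pass nodes (toy example).
    import networkx as nx

    def path_with_must_pass(G, start, end, must_pass, weight="weight"):
        waypoints = [start, *must_pass, end]
        path = [start]
        for u, v in zip(waypoints[:-1], waypoints[1:]):
            segment = nx.shortest_path(G, u, v, weight=weight)
            path.extend(segment[1:])          # skip u, already in the path
        return path

    G = nx.path_graph(6)                      # chain 0-1-2-3-4-5 with unit-weight edges
    G.add_edge(0, 5, weight=1.0)              # tempting "short-cut" edge
    print(path_with_must_pass(G, 0, 5, must_pass=[3]))   # forced through node 3
    ```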
    Visual Cluster Separation Using High-Dimensional Sharpened Dimensionality Reduction. (arXiv:2110.00317v1 [cs.CV])
    (2 min) Applying dimensionality reduction (DR) to large, high-dimensional data sets can be challenging when the goal is to distinguish the underlying high-dimensional data clusters in a 2D projection for exploratory analysis. We address this problem by first sharpening the clusters in the original high-dimensional data prior to the DR step using Local Gradient Clustering (LGC). We then project the sharpened data from the high-dimensional space to 2D with a user-selected DR method. The sharpening step helps this method preserve cluster separation in the resulting 2D projection. With our method, end-users can label each distinct cluster to further analyze an otherwise unlabeled data set. Our `High-Dimensional Sharpened DR' (HD-SDR) method, tested on both synthetic and real-world data sets, benefits DR methods with poor cluster separation and yields better visual cluster separation than the same DR methods without sharpening. Our method achieves good quality (measured by quality metrics) and scales computationally well with large high-dimensional data. To illustrate its concrete applications, we further apply HD-SDR to a recent astronomical catalog.
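    A crude stand-in for the sharpen-then-project pipeline, where each point is nudged toward its local neighbourhood mean for a few iterations (a simplification of Local Gradient Clustering) before a user-chosen DR method is applied; the parameters and data are illustrative.

    ```python
    # Simplified "sharpen, then project" pipeline (not the paper's LGC implementation).
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.manifold import TSNE

    def sharpen(X, k=10, steps=5, lr=0.3):
        X = X.copy()
        for _ in range(steps):
            idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X, return_distance=False)
            X += lr * (X[idx].mean(axis=1) - X)   # nudge each point toward its neighbourhood mean
        return X

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(100, 20)) for c in (0.0, 3.0, 6.0)])
    embedding_2d = TSNE(n_components=2, random_state=0).fit_transform(sharpen(X))
    ```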
    Generative Memory-Guided Semantic Reasoning Model for Image Inpainting. (arXiv:2110.00261v1 [cs.CV])
    (2 min) Most existing methods for image inpainting focus on learning the intra-image priors from the known regions of the current input image to infer the content of the corrupted regions in the same image. While such methods perform well on images with small corrupted regions, it is challenging for them to deal with images with large corrupted areas due to two potential limitations: 1) such methods tend to overfit each single training pair of images, relying solely on the intra-image prior knowledge learned from the limited known area; 2) the inter-image prior knowledge about the general distribution patterns of visual semantics, which can be transferred across images sharing similar semantics, is not exploited. In this paper, we propose the Generative Memory-Guided Semantic Reasoning Model (GM-SRM), which not only learns the intra-image priors from the known regions, but also distills the inter-image reasoning priors to infer the content of the corrupted regions. In particular, the proposed GM-SRM first pre-learns a generative memory from the whole training data to capture the semantic distribution patterns in a global view. Then the learned memory is leveraged to retrieve the matching inter-image priors for the current corrupted image to perform semantic reasoning during image inpainting. While the intra-image priors are used for guaranteeing pixel-level content consistency, the inter-image priors are favorable for performing high-level semantic reasoning, which is particularly effective for inferring semantic content in large corrupted areas. Extensive experiments on the Paris Street View, CelebA-HQ, and Places2 benchmarks demonstrate that our GM-SRM outperforms the state-of-the-art methods for image inpainting in terms of both visual quality and quantitative metrics.
    PhiNets: a scalable backbone for low-power AI at the edge. (arXiv:2110.00337v1 [cs.CV])
    (2 min) In the Internet of Things era, where we see many interconnected and heterogeneous mobile and fixed smart devices, distributing the intelligence from the cloud to the edge has become a necessity. Due to limited computational and communication capabilities, low memory and a limited energy budget, bringing artificial intelligence algorithms to peripheral devices, such as the end-nodes of a sensor network, is a challenging task and requires the design of innovative methods. In this work, we present PhiNets, a new scalable backbone optimized for deep-learning-based image processing on resource-constrained platforms. PhiNets are based on inverted residual blocks specifically designed to decouple the computational cost, working memory, and parameter memory, thus exploiting all the available resources. With a YoloV2 detection head and Simple Online and Realtime Tracking, the proposed architecture achieves state-of-the-art results in (i) detection on the COCO and VOC2012 benchmarks, and (ii) tracking on the MOT15 benchmark. PhiNets reduce the parameter count by 87% to 93% with respect to previous state-of-the-art models (EfficientNetv1, MobileNetv2) and achieve better performance at lower computational cost. Moreover, we demonstrate our approach on a prototype node based on an STM32H743 microcontroller (MCU) with 2MB of internal Flash and 1MB of RAM, and achieve power requirements on the order of 10 mW. The code for PhiNets is publicly available on GitHub.
    Personalized Retrogress-Resilient Framework for Real-World Medical Federated Learning. (arXiv:2110.00394v1 [cs.CV])
    (2 min) Nowadays, deep learning methods with large-scale datasets can produce clinically useful models for computer-aided diagnosis. However, privacy and ethical concerns are increasingly critical, which makes it difficult to collect large quantities of data from multiple institutions. Federated Learning (FL) provides a promising decentralized solution for training models collaboratively by exchanging client models instead of private data. However, the server aggregation of existing FL methods is observed to degrade model performance in real-world medical FL settings, a phenomenon we term retrogress. To address this problem, we propose a personalized retrogress-resilient framework to produce a superior personalized model for each client. Specifically, we devise a Progressive Fourier Aggregation (PFA) at the server to achieve more stable and effective global knowledge gathering by integrating client models gradually from low-frequency to high-frequency components. Moreover, with an introduced deputy model to receive the aggregated server model, we design a Deputy-Enhanced Transfer (DET) strategy at the client and conduct three steps of Recover-Exchange-Sublimate to ameliorate the personalized local model by transferring the global knowledge smoothly. Extensive experiments on a real-world dermoscopic FL dataset show that our personalized retrogress-resilient framework outperforms state-of-the-art FL methods and generalizes to an out-of-distribution cohort. The code and dataset are available at https://github.com/CityU-AIM-Group/PRR-FL.
    MonoCInIS: Camera Independent Monocular 3D Object Detection using Instance Segmentation. (arXiv:2110.00464v1 [cs.CV])
    (2 min) Monocular 3D object detection has recently shown promising results, however there remain challenging problems. One of those is the lack of invariance to different camera intrinsic parameters, which can be observed across different 3D object datasets. Little effort has been made to exploit the combination of heterogeneous 3D object datasets. In contrast to general intuition, we show that more data does not automatically guarantee a better performance, but rather, methods need to have a degree of 'camera independence' in order to benefit from large and heterogeneous training data. In this paper we propose a category-level pose estimation method based on instance segmentation, using camera independent geometric reasoning to cope with the varying camera viewpoints and intrinsics of different datasets. Every pixel of an instance predicts the object dimensions, the 3D object reference points projected in 2D image space and, optionally, the local viewing angle. Camera intrinsics are only used outside of the learned network to lift the predicted 2D reference points to 3D. We surpass camera independent methods on the challenging KITTI3D benchmark and show the key benefits compared to camera dependent methods.
    Accelerating Inverse Rendering By Using a GPU and Reuse of Light Paths. (arXiv:2110.00085v1 [cs.CV])
    (2 min) Inverse rendering seeks to estimate scene characteristics from a set of data images. The dominant approach is based on differentiable rendering using Monte Carlo methods. Such algorithms usually rely on a forward model and use an iterative gradient method that requires sampling millions of light paths per iteration. This paper presents an efficient framework that speeds up existing inverse rendering algorithms. This is achieved by tailoring the iterative process of inverse rendering specifically to a GPU architecture. To this end, we introduce two interleaved steps - Path Sorting and Path Recycling. Path Sorting allows the GPU to deal with light paths of the same size. Path Recycling allows the algorithm to use light paths from previous iterations to better utilize the information they encode. Together, these steps significantly speed up gradient optimization. In this paper, we give the theoretical background for Path Recycling. We demonstrate its efficiency for volumetric scattering tomography and reflectometry (surface reflections).
    Learning Multi-Site Harmonization of Magnetic Resonance Images Without Traveling Human Phantoms. (arXiv:2110.00041v1 [eess.IV])
    (2 min) Harmonization improves data consistency and is central to effective integration of diverse imaging data acquired across multiple sites. Recent deep learning techniques for harmonization are predominantly supervised in nature and hence require imaging data of the same human subjects to be acquired at multiple sites. Data collection as such requires the human subjects to travel across sites and is hence challenging, costly, and impractical, more so when sufficient sample size is needed for reliable network training. Here we show how harmonization can be achieved with a deep neural network that does not rely on traveling human phantom data. Our method disentangles site-specific appearance information and site-invariant anatomical information from images acquired at multiple sites and then employs the disentangled information to generate the image of each subject for any target site. We demonstrate with more than 6,000 multi-site T1- and T2-weighted images that our method is remarkably effective in generating images with realistic site-specific appearances without altering anatomical details. Our method allows retrospective harmonization of data in a wide range of existing modern large-scale imaging studies, conducted via different scanners and protocols, without additional data collection.
    Mining for strong gravitational lenses with self-supervised learning. (arXiv:2110.00023v1 [astro-ph.IM])
    (2 min) We employ self-supervised representation learning to distill information from 76 million galaxy images from the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys' Data Release 9. Targeting the identification of new strong gravitational lens candidates, we first create a rapid similarity search tool to discover new strong lenses given only a single labelled example. We then show how training a simple linear classifier on the self-supervised representations, requiring only a few minutes on a CPU, can automatically classify strong lenses with great efficiency. We present 1192 new strong lens candidates that we identified through a brief visual identification campaign, and release an interactive web-based similarity search tool and the top network predictions to facilitate crowd-sourcing rapid discovery of additional strong gravitational lenses and other rare objects: github.com/georgestein/ssl-legacysurvey
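    A schematic version of the "linear classifier on frozen self-supervised representations" step; the embeddings and labels below are random stand-ins, not DESI data, and the array names are placeholders.

    ```python
    # Linear probe on precomputed representations (stand-in data; trains in seconds on a CPU).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(5000, 128))                   # placeholder for encoder outputs
    labels = (embeddings[:, :3].sum(axis=1) > 0).astype(int)    # placeholder lens labels

    X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy:", probe.score(X_te, y_te))
    ```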
    Lightweight Transformer in Federated Setting for Human Activity Recognition. (arXiv:2110.00244v1 [cs.CV])
    (2 min) Human Activity Recognition (HAR) remains a challenging problem that still needs to be solved. It is mainly used as an assistive technology for eldercare and healthcare, especially when combined with other technologies such as the Internet of Things (IoT). HAR can be achieved with the help of sensors, smartphones, or images. Deep neural network techniques such as artificial neural networks, convolutional neural networks and recurrent neural networks have been used for HAR, in both centralized and federated settings. However, these techniques have certain limitations: RNNs are difficult to parallelize, while CNNs are limited by sequence length and are computationally expensive. In this paper, to address these challenges, we present a novel inertial-sensor-based one-patch transformer that combines the strengths of both RNNs and CNNs for human activity recognition. We also design a testbed to collect real-time human activity data, which is then used to train and test the proposed transformer. Our experiments show that the proposed transformer outperforms state-of-the-art CNN- and RNN-based classifiers in both federated and centralized settings. Moreover, the proposed transformer is computationally inexpensive, as it uses very few parameters compared to existing state-of-the-art CNN- and RNN-based classifiers. This makes it more suitable for federated learning, since it incurs lower communication and computational costs.
    Noise2Recon: A Semi-Supervised Framework for Joint MRI Reconstruction and Denoising. (arXiv:2110.00075v1 [eess.IV])
    (2 min) Deep learning (DL) has shown promise for faster, high quality accelerated MRI reconstruction. However, standard supervised DL methods depend on extensive amounts of fully-sampled ground-truth data and are sensitive to out-of-distribution (OOD) shifts, in particular for low signal-to-noise ratio (SNR) acquisitions. To alleviate this challenge, we propose a semi-supervised, consistency-based framework (termed Noise2Recon) for joint MR reconstruction and denoising. Our method enables the usage of a limited number of fully-sampled and a large number of undersampled-only scans. We compare our method to augmentation-based supervised techniques and fine-tuned denoisers. Results demonstrate that even with minimal ground-truth data, Noise2Recon (1) achieves high performance on in-distribution (low-noise) scans and (2) improves generalizability to OOD, noisy scans.
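    A bare-bones sketch of a noise-consistency term in the spirit described above; the actual Noise2Recon pipeline (k-space undersampling, coil sensitivities, the supervised branch) is omitted, and the model below is a placeholder.

    ```python
    # Consistency between reconstructions of a scan and its noise-augmented copy (sketch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def consistency_loss(model, undersampled, noise_std=0.05):
        recon_clean = model(undersampled)
        noisy_input = undersampled + noise_std * torch.randn_like(undersampled)
        recon_noisy = model(noisy_input)
        return F.mse_loss(recon_noisy, recon_clean.detach())   # match the clean reconstruction

    model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
    batch = torch.randn(2, 1, 64, 64)        # placeholder undersampled images
    loss = consistency_loss(model, batch)
    loss.backward()
    ```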
    DeepMCAT: Large-Scale Deep Clustering for Medical Image Categorization. (arXiv:2110.00109v1 [eess.IV])
    (2 min) In recent years, the research landscape of machine learning in medical imaging has changed drastically from supervised to semi-, weakly- or unsupervised methods. This is mainly due to the fact that ground-truth labels are time-consuming and expensive to obtain manually. Generating labels from patient metadata might be feasible but it suffers from user-originated errors which introduce biases. In this work, we propose an unsupervised approach for automatically clustering and categorizing large-scale medical image datasets, with a focus on cardiac MR images, and without using any labels. We investigated the end-to-end training using both class-balanced and imbalanced large-scale datasets. Our method was able to create clusters with high purity and achieved over 0.99 cluster purity on these datasets. The results demonstrate the potential of the proposed method for categorizing unstructured large medical databases, such as organizing clinical PACS systems in hospitals.
    Beyond Neighbourhood-Preserving Transformations for Quantization-Based Unsupervised Hashing. (arXiv:2110.00216v1 [cs.CV])
    (2 min) An effective unsupervised hashing algorithm leads to compact binary codes that preserve the neighborhood structure of the data as much as possible. One of the most established schemes for unsupervised hashing is to reduce the dimensionality of the data and then find a rigid (neighbourhood-preserving) transformation that reduces the quantization error. Although employing rigid transformations is effective, it may not reduce the quantization loss to its ultimate limits. Moreover, reducing dimensionality and quantization loss in two separate steps seems to be sub-optimal. Motivated by these shortcomings, we propose to employ both rigid and non-rigid transformations to reduce quantization error and dimensionality simultaneously. We relax the orthogonality constraint on the projection in a PCA formulation and regularize this by a quantization term. We show that both the non-rigid projection matrix and the rotation matrix contribute towards minimizing quantization loss, but in different ways. A scalable nested coordinate descent approach is proposed to optimize this mixed-integer optimization problem. We evaluate the proposed method on five public benchmark datasets providing almost half a million images. Comparative results indicate that the proposed method mostly outperforms state-of-the-art linear methods and competes with end-to-end deep solutions.
    Geometry Attention Transformer with Position-aware LSTMs for Image Captioning. (arXiv:2110.00335v1 [cs.CV])
    (2 min) In recent years, transformer structures have been widely applied to image captioning with impressive performance. For good captioning results, the geometry and position relations of different visual objects are often considered crucial information. Aiming to further improve image captioning with transformers, this paper proposes an improved Geometry Attention Transformer (GAT) model. To further leverage geometric information, two novel geometry-aware architectures are designed for the encoder and decoder in our GAT. In addition, the model includes two working modules: 1) a geometry gate-controlled self-attention refiner, which explicitly incorporates relative spatial information into image region representations in the encoding steps, and 2) a group of position-LSTMs, which precisely inform the decoder of relative word positions when generating caption texts. Experimental comparisons on the MS COCO and Flickr30K datasets show that our GAT is efficient and often outperforms current state-of-the-art image captioning models.
    Deep Learning-based Action Detection in Untrimmed Videos: A Survey. (arXiv:2110.00111v1 [cs.CV])
    (2 min) Understanding human behavior and activity facilitates advancement of numerous real-world applications and is critical for video analysis. Despite the progress of action recognition algorithms on trimmed videos, the majority of real-world videos are lengthy and untrimmed, with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundaries of actions and classify the action categories. The task has been investigated under full and limited supervision settings, depending on the availability of action annotations. This paper provides an extensive overview of deep learning-based algorithms for temporal action detection in untrimmed videos with different supervision levels, including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, this paper reviews advances in spatio-temporal action detection, where actions are localized in both the temporal and spatial dimensions. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of the state-of-the-art methods is compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.
    Seeing Glass: Joint Point Cloud and Depth Completion for Transparent Objects. (arXiv:2110.00087v1 [cs.CV])
    (2 min) The basis of many object manipulation algorithms is RGB-D input. Yet, commodity RGB-D sensors can only provide distorted depth maps for a wide range of transparent objects due to light refraction and absorption. To tackle the perception challenges posed by transparent objects, we propose TranspareNet, a joint point cloud and depth completion method, with the ability to complete the depth of transparent objects in cluttered and complex scenes, even with partially filled fluid contents within the vessels. To address the shortcomings of existing transparent object data collection schemes in the literature, we also propose an automated dataset creation workflow that consists of robot-controlled image collection and vision-based automatic annotation. Through this automated workflow, we created the Toronto Transparent Objects Depth Dataset (TODD), which consists of nearly 15,000 RGB-D images. Our experimental evaluation demonstrates that TranspareNet outperforms existing state-of-the-art depth completion methods on multiple datasets, including ClearGrasp, and that it also handles cluttered scenes when trained on TODD. Code and dataset will be released at https://www.pair.toronto.edu/TranspareNet/
    Learning to Predict Trustworthiness with Steep Slope Loss. (arXiv:2110.00054v1 [cs.CV])
    (2 min) Understanding the trustworthiness of a prediction yielded by a classifier is critical for the safe and effective use of AI models. Prior efforts have been proven to be reliable on small-scale datasets. In this work, we study the problem of predicting trustworthiness on real-world large-scale datasets, where the task is more challenging due to high-dimensional features, diverse visual concepts, and large-scale samples. In such a setting, we observe that the trustworthiness predictors trained with prior-art loss functions, i.e., the cross entropy loss, focal loss, and true class probability confidence loss, are prone to view both correct predictions and incorrect predictions to be trustworthy. The reasons are two-fold. Firstly, correct predictions are generally dominant over incorrect predictions. Secondly, due to the data complexity, it is challenging to differentiate the incorrect predictions from the correct ones on real-world large-scale datasets. To improve the generalizability of trustworthiness predictors, we propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other. The proposed loss is evaluated with two representative deep learning models, i.e., Vision Transformer and ResNet, as trustworthiness predictors. We conduct comprehensive experiments and analyses on ImageNet, which show that the proposed loss effectively improves the generalizability of trustworthiness predictors. The code and pre-trained trustworthiness predictors for reproducibility are available at https://github.com/luoyan407/predict_trustworthiness.
    Data-Efficient Instance Segmentation with a Single GPU. (arXiv:2110.00242v1 [cs.CV])
    (2 min) Not everyone is wealthy enough to have hundreds of GPUs or TPUs, so we have to find a way around that constraint. In this paper, we introduce the data-efficient instance segmentation method we used in the 2021 VIPriors Instance Segmentation Challenge. Our solution is a modified version of Swin Transformer, built on the mmdetection toolbox. To compensate for the lack of data, we employ data augmentation, including random flips and multi-scale training, to train our model. During inference, multi-scale fusion is used to boost performance. We use only a single GPU during the entire training and testing stages. In the end, our team, THU_IVG_2018, achieved 0.366 AP@0.50:0.95 on the test set, which is competitive with other top-ranking methods while using only one GPU. In addition, our method achieved an AP@0.50:0.95 (medium) of 0.592, ranking second among all contestants.
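    A hedged sketch of the flip and multi-scale augmentation mentioned above using torchvision transforms; the team's actual mmdetection config (which also transforms boxes and masks) is not reproduced, and the scales are assumed values.

    ```python
    # Image-only flip + multi-scale augmentation sketch (boxes and masks would need
    # matching transforms in a real instance-segmentation pipeline).
    from torchvision import transforms

    scales = [480, 640, 800]                                   # assumed training scales
    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomChoice([transforms.Resize(s) for s in scales]),
        transforms.ToTensor(),
    ])
    ```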
    Sparse Quadratic Optimisation over the Stiefel Manifold with Application to Permutation Synchronisation. (arXiv:2110.00053v1 [math.OC])
    (2 min) We address the non-convex optimisation problem of finding a sparse matrix on the Stiefel manifold (matrices with mutually orthogonal columns of unit length) that maximises (or minimises) a quadratic objective function. Optimisation problems on the Stiefel manifold occur for example in spectral relaxations of various combinatorial problems, such as graph matching, clustering, or permutation synchronisation. Although sparsity is a desirable property in such settings, it is mostly neglected in spectral formulations since existing solvers, e.g. based on eigenvalue decomposition, are unable to account for sparsity while at the same time maintaining global optimality guarantees. We fill this gap and propose a simple yet effective sparsity-promoting modification of the Orthogonal Iteration algorithm for finding the dominant eigenspace of a matrix. By doing so, we can guarantee that our method finds a Stiefel matrix that is globally optimal with respect to the quadratic objective function, while in addition being sparse. As a motivating application we consider the task of permutation synchronisation, which can be understood as a constrained clustering problem that has particular relevance for matching multiple images or 3D shapes in computer vision, computer graphics, and beyond. We demonstrate that the proposed approach outperforms previous methods in this domain.
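    An illustrative sketch of plain Orthogonal Iteration for the dominant eigenspace, with a naive soft-thresholding step standing in for a sparsity-promoting modification; the paper's actual update rule and its global-optimality guarantees are not reproduced.

    ```python
    # Orthogonal Iteration with a naive sparsity-encouraging soft-threshold (illustration only).
    import numpy as np

    def sparse_orthogonal_iteration(A, k, iters=200, tau=0.0, seed=0):
        rng = np.random.default_rng(seed)
        X, _ = np.linalg.qr(rng.normal(size=(A.shape[0], k)))
        for _ in range(iters):
            Y = A @ X
            Y = np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)   # soft-threshold entries
            X, _ = np.linalg.qr(Y)                              # restore orthonormal columns
        return X

    rng = np.random.default_rng(1)
    B = rng.normal(size=(50, 50))
    A = B @ B.T                                                 # symmetric test matrix
    U = sparse_orthogonal_iteration(A, k=3, tau=0.05)
    ```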
  • cs.IR updates on arXiv.org

    Attention-based Clinical Note Summarization. (arXiv:2104.08942v2 [cs.CL] UPDATED)
    (2 min) The trend of deploying digital systems in numerous industries has led to a sharp increase in the amount of recorded digital information. The health sector has observed an extensive adoption of digital devices and systems that generate large volumes of personal medical records. Electronic health records contain valuable information for retrospective and prospective analysis that is often not fully exploited because of the dense information storage. The core purpose of condensing health records is to select the information that best preserves the characteristics of the original documents with respect to the reported disease. These summaries may speed up diagnosis and extend a doctor's time with the patient during high-workload situations like the COVID-19 pandemic. In this paper, we propose applying a multi-head attention-based mechanism to perform extractive summarization of meaningful phrases in clinical notes. The method finds the major sentences for a summary by correlating token, segment, and positional embeddings. The model outputs attention scores that are statistically transformed to extract key phrases, and these can be projected onto a heat-mapping tool for visual inspection by humans.
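    A loose illustration of ranking sentences by the attention mass they receive; sentence encoding, the token/segment/position embeddings, and the multi-head setup are all simplified away, and the tensors below are random placeholders.

    ```python
    # Rank sentences by total self-attention received (simplified, single-head sketch).
    import torch

    def rank_sentences(sentence_embeddings, top_k=3):
        d = sentence_embeddings.shape[1]
        attn = torch.softmax(sentence_embeddings @ sentence_embeddings.T / d ** 0.5, dim=-1)
        importance = attn.sum(dim=0)                 # how much attention each sentence receives
        return importance.topk(min(top_k, importance.numel())).indices

    sentence_embeddings = torch.randn(8, 64)         # placeholder for encoded clinical sentences
    print(rank_sentences(sentence_embeddings))
    ```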
    Explainable Point-Based Document Visualizations. (arXiv:2110.00462v1 [cs.IR])
    (2 min) Two-dimensional data maps can visually reveal information about the relations between data instances. Popular techniques for constructing data maps are t-SNE and UMAP. The resulting point-based visualizations, though, provide information only through their interpretation. Here, we consider a set of abstracts from articles on longevity and argue for using keyword extraction methods to label clusters of documents in the map. Among the considered approaches, the best results were obtained by the recently proposed YAKE!. Surprisingly, a classical TF-IDF term ranking outperformed graph- and embedding-based techniques.
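    A toy sketch of labelling document clusters with their top TF-IDF terms, the classical baseline mentioned above; a keyword extractor such as YAKE! would replace the ranking step, and the cluster assignments are assumed to come from the 2D map.

    ```python
    # Label each document cluster with its highest mean-TF-IDF terms (toy example).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def cluster_keywords(docs, clusters, top_k=3):
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform(docs)
        terms = vec.get_feature_names_out()
        clusters = np.asarray(clusters)
        labels = {}
        for c in np.unique(clusters):
            rows = np.where(clusters == c)[0]
            mean_tfidf = np.asarray(X[rows].mean(axis=0)).ravel()
            labels[c] = [terms[i] for i in mean_tfidf.argsort()[::-1][:top_k]]
        return labels

    docs = ["caloric restriction extends lifespan in mice",
            "telomere length predicts cellular ageing",
            "clearing senescent cells delays ageing",
            "dietary restriction and longevity in flies"]
    print(cluster_keywords(docs, clusters=[0, 1, 1, 0]))
    ```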
    SAM: A Self-adaptive Attention Module for Context-Aware Recommendation System. (arXiv:2110.00452v1 [cs.IR])
    (2 min) Recently, textual information has been proved to play a positive role in recommendation systems. However, most of the existing methods only focus on representation learning of textual information in ratings, while potential selection bias induced by the textual information is ignored. In this work, we propose a novel and general self-adaptive module, the Self-adaptive Attention Module (SAM), which adjusts the selection bias by capturing contextual information based on its representation. This module can be embedded into recommendation systems that contain learning components of contextual information. Experimental results on three real-world datasets demonstrate the effectiveness of our proposal, and the state-of-the-art models with SAM significantly outperform the original ones.
    Evaluation of Non-Negative Matrix Factorization and n-stage Latent Dirichlet Allocation for Emotion Analysis in Turkish Tweets. (arXiv:2110.00418v1 [cs.CL])
    (2 min) With the development of technology, the use of social media has become quite common. Analyzing comments on social media in areas such as media and advertising plays an important role today. For this reason, both new and traditional natural language processing methods are used to detect the emotion of these posts. In this paper, the Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization topic modeling methods were used to determine which emotions Turkish tweets posted on Twitter convey. In addition, the accuracy of a proposed n-stage method based on LDA was analyzed. The dataset consists of 5 emotions, namely angry, fear, happy, sad and confused. NMF was the most successful of all the topic modeling methods in this study. Then, the F1-measure of the Random Forest, Naive Bayes and Support Vector Machine methods was analyzed by building a Weka-compatible file from the word weights and class labels of the topics. Among the Weka results, the most successful method was n-stage LDA, and the most successful algorithm was Random Forest.
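    A compact sketch of the two topic-modeling steps compared above on a toy English corpus; the Turkish tweet data, emotion labels, and the n-stage extension are not reproduced.

    ```python
    # NMF on TF-IDF features vs. LDA on raw counts (toy corpus, 2 topics).
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.decomposition import NMF, LatentDirichletAllocation

    docs = ["so happy and excited today", "this delay makes me really angry",
            "afraid and fearful of the dark", "feeling sad, lost and confused"]

    tfidf = TfidfVectorizer().fit_transform(docs)
    nmf = NMF(n_components=2, init="nndsvda", random_state=0).fit(tfidf)

    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    print(nmf.transform(tfidf).argmax(axis=1))   # dominant topic per document (NMF)
    print(lda.transform(counts).argmax(axis=1))  # dominant topic per document (LDA)
    ```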
    Learner to learner fuzzy profiles similarity using a hybrid interaction analysis grid. (arXiv:2110.00247v1 [cs.AI])
    (2 min) The analysis of remote discussions is not yet at the same level as that of face-to-face ones. The present paper has a twofold aim. On the one hand, it attempts to establish a suitable environment for interaction and collaboration among learners by using speech acts via a semi-structured synchronous communication tool. On the other hand, it aims to define a hybrid grid of behavioral profiles and interpersonal skills by matching BALES' IPA and PLETY's analysis system. By applying fuzzy logic, we formalize human reasoning, which gives appreciable flexibility to the reasoning that uses it and makes it possible to take imprecision and uncertainty into account. In addition, educational data mining techniques are used to optimize the mapping of behaviors to learner profiles, with similarity-based clustering using the Eros and PCA measures. To show the validity of our system, we performed an experiment on real-world data. The results show, among others: (1) the usefulness of fuzzy logic to properly translate the profile text descriptions into a mathematical format, (2) an irregularity in the behavior of the learners, (3) the correlation between the profiles, and (4) the superior precision of the Eros method over the PCA factor.
  • cs.LG updates on arXiv.org

    Exploiting Verified Neural Networks via Floating Point Numerical Error. (arXiv:2003.03021v4 [cs.LG] UPDATED)
    (2 min) Researchers have developed neural network verification algorithms motivated by the need to characterize the robustness of deep neural networks. The verifiers aspire to answer whether a neural network guarantees certain properties with respect to all inputs in a space. However, many verifiers inaccurately model floating point arithmetic but do not thoroughly discuss the consequences. We show that the negligence of floating point error leads to unsound verification that can be systematically exploited in practice. For a pretrained neural network, we present a method that efficiently searches inputs as witnesses for the incorrectness of robustness claims made by a complete verifier. We also present a method to construct neural network architectures and weights that induce wrong results of an incomplete verifier. Our results highlight that, to achieve practically reliable verification of neural networks, any verification system must accurately (or conservatively) model the effects of any floating point computations in the network inference or verification system.
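    A one-line reminder of the kind of discrepancy at stake: exact-real reasoning about a network need not match its floating point execution, for example because float addition is not associative.

    ```python
    # Floating point addition is not associative, so reordering sums changes results.
    a, b, c = 1e20, -1e20, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0  (b + c rounds back to -1e20)
    ```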
    Tractable structured natural gradient descent using local parameterizations. (arXiv:2102.07405v8 [stat.ML] UPDATED)
    (2 min) Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.
    A survey on datasets for fairness-aware machine learning. (arXiv:2110.00530v1 [cs.LG])
    (2 min) As decision-making increasingly relies on machine learning and (big) data, the issue of fairness in data-driven AI systems is receiving increasing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed, introducing fairness-related interventions in the data, learning algorithms and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware machine learning. We focus on tabular data as the most common data representation for fairness-aware machine learning. We start our analysis by identifying relationships among the different attributes, particularly w.r.t. protected attributes and class attributes, using a Bayesian network. For a deeper understanding of bias and fairness in the datasets, we further investigate these relationships using exploratory analysis.
    Weight Vector Tuning and Asymptotic Analysis of Binary Linear Classifiers. (arXiv:2110.00567v1 [stat.ML])
    (2 min) Unlike its intercept, a linear classifier's weight vector cannot be tuned by a simple grid search. Hence, this paper proposes weight vector tuning of a generic binary linear classifier through the parameterization of a decomposition of the discriminant by a scalar which controls the trade-off between conflicting informative and noisy terms. By varying this parameter, the original weight vector is modified in a meaningful way. Applying this method to a number of linear classifiers under a variety of data dimensionality and sample size settings reveals that the classification performance loss due to non-optimal native hyperparameters can be compensated for by weight vector tuning. This yields computational savings as the proposed tuning method reduces to tuning a scalar compared to tuning the native hyperparameter, which may involve repeated weight vector generation along with its burden of optimization, dimensionality reduction, etc., depending on the classifier. It is also found that weight vector tuning significantly improves the performance of Linear Discriminant Analysis (LDA) under high estimation noise. Proceeding from this second finding, an asymptotic study of the misclassification probability of the parameterized LDA classifier in the growth regime where the data dimensionality and sample size are comparable is conducted. Using random matrix theory, the misclassification probability is shown to converge to a quantity that is a function of the true statistics of the data. Additionally, an estimator of the misclassification probability is derived. Finally, computationally efficient tuning of the parameter using this estimator is demonstrated on real data.
    On the Influence of Masking Policies in Intermediate Pre-training. (arXiv:2104.08840v2 [cs.CL] UPDATED)
    (2 min) Current NLP models are predominantly trained through a two-stage "pre-train then fine-tune" pipeline. Prior work has shown that inserting an intermediate pre-training stage, using heuristic masking policies for masked language modeling (MLM), can significantly improve final performance. However, it is still unclear (1) in what cases such intermediate pre-training is helpful, (2) whether hand-crafted heuristic objectives are optimal for a given task, and (3) whether a masking policy designed for one task is generalizable beyond that task. In this paper, we perform a large-scale empirical study to investigate the effect of various masking policies in intermediate pre-training with nine selected tasks across three categories. Crucially, we introduce methods to automate the discovery of optimal masking policies via direct supervision or meta-learning. We conclude that the success of intermediate pre-training is dependent on appropriate pre-train corpus, selection of output format (i.e., masked spans or full sentence), and clear understanding of the role that MLM plays for the downstream task. In addition, we find our learned masking policies outperform the heuristic of masking named entities on TriviaQA, and policies learned from one task can positively transfer to other tasks in certain cases, inviting future research in this direction.
    PMCE: efficient inference of expressive models of cancer evolution with high prognostic power. (arXiv:1408.6032v3 [stat.ML] UPDATED)
    (2 min) Motivation: Driver (epi)genomic alterations underlie the positive selection of cancer subpopulations, which promotes drug resistance and relapse. Even though substantial heterogeneity is witnessed in most cancer types, mutation accumulation patterns can be regularly found and can be exploited to reconstruct predictive models of cancer evolution. Yet, available methods cannot infer logical formulas connecting events to represent alternative evolutionary routes or convergent evolution. Results: We introduce PMCE, an expressive framework that leverages mutational profiles from cross-sectional sequencing data to infer probabilistic graphical models of cancer evolution including arbitrary logical formulas, and which outperforms the state-of-the-art in terms of accuracy and robustness to noise, on simulations. The application of PMCE to 7866 samples from the TCGA database allows us to identify a highly significant correlation between the predicted evolutionary paths and the overall survival in 7 tumor types, proving that our approach can effectively stratify cancer patients in reliable risk groups. Availability: PMCE is freely available at https://github.com/BIMIB-DISCo/PMCE, in addition to the code to replicate all the analyses presented in the manuscript. Contacts: daniele.ramazzotti@unimib.it, alex.graudenzi@ibfm.cnr.it.
    An Ensemble-based Multi-Criteria Decision Making Method for COVID-19 Cough Classification. (arXiv:2110.00508v1 [cs.LG])
    (2 min) The objectives of this research are analysing the performance of the state-of-the-art machine learning techniques for classifying COVID-19 from cough sound and identifying the model(s) that consistently perform well across different cough datasets. Different performance evaluation metrics (such as precision, sensitivity, specificity, AUC, accuracy, etc.) make it difficult to select the best performance model. To address this issue, in this paper, we propose an ensemble-based multi-criteria decision making (MCDM) method for selecting top performance machine learning technique(s) for COVID-19 cough classification. We use four cough datasets, namely Cambridge, Coswara, Virufy, and NoCoCoDa to verify the proposed method. At first, our proposed method uses the audio features of cough samples and then applies machine learning (ML) techniques to classify them as COVID-19 or non-COVID-19. Then, we consider a multi-criteria decision-making (MCDM) method that combines ensemble technologies (i.e., soft and hard) to select the best model. In MCDM, we use the technique for order preference by similarity to ideal solution (TOPSIS) for ranking purposes, while entropy is applied to calculate evaluation criteria weights. In addition, we apply the feature reduction process through recursive feature elimination with cross-validation under different estimators. The results of our empirical evaluations show that the proposed method outperforms the state-of-the-art models.
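    A generic sketch of entropy-weighted TOPSIS ranking over a model-by-metric score matrix, treating every metric as a benefit criterion; the cough datasets, classifiers, and the soft/hard ensemble combination from the paper are not reproduced, and the score matrix below is illustrative.

    ```python
    # Entropy weights + TOPSIS closeness scores for ranking candidate models (sketch).
    import numpy as np

    def entropy_weights(X):
        P = X / X.sum(axis=0, keepdims=True)
        E = -np.sum(P * np.log(P + 1e-12), axis=0) / np.log(X.shape[0])
        d = 1.0 - E                                   # degree of diversification per criterion
        return d / d.sum()

    def topsis(X, w):
        V = (X / np.linalg.norm(X, axis=0, keepdims=True)) * w    # weighted, vector-normalized
        best, worst = V.max(axis=0), V.min(axis=0)                # ideal / anti-ideal points
        d_best = np.linalg.norm(V - best, axis=1)
        d_worst = np.linalg.norm(V - worst, axis=1)
        return d_worst / (d_best + d_worst)                       # closeness: higher is better

    # rows = candidate classifiers, columns = e.g. precision, sensitivity, AUC
    scores = np.array([[0.91, 0.88, 0.95],
                       [0.89, 0.92, 0.93],
                       [0.85, 0.80, 0.90]])
    closeness = topsis(scores, entropy_weights(scores))
    print(np.argsort(-closeness))    # ranking, best model first
    ```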
    Approximate Regions of Attraction in Learning with Decision-Dependent Distributions. (arXiv:2107.00055v2 [cs.LG] UPDATED)
    (2 min) As data-driven methods are deployed in real-world settings, the processes that generate the observed data will often react to the decisions of the learner. For example, a data source may have some incentive for the algorithm to provide a particular label (e.g. approve a bank loan), and manipulate their features accordingly. Work in strategic classification and decision-dependent distributions seeks to characterize the closed-loop behavior of deploying learning algorithms by explicitly considering the effect of the classifier on the underlying data distribution. More recently, works in performative prediction seek to classify the closed-loop behavior by considering general properties of the mapping from classifier to data distribution, rather than an explicit form. Building on this notion, we analyze repeated risk minimization as the perturbed trajectories of the gradient flows of performative risk minimization. We consider the case where there may be multiple local minimizers of performative risk, motivated by situations where the initial conditions may have significant impact on the long-term behavior of the system. We provide sufficient conditions to characterize the region of attraction for the various equilibria in this settings. Additionally, we introduce the notion of performative alignment, which provides a geometric condition on the convergence of repeated risk minimization to performative risk minimizers.
    A fast randomized incremental gradient method for decentralized non-convex optimization. (arXiv:2011.03853v3 [math.OC] UPDATED)
    (2 min) We study decentralized non-convex finite-sum minimization problems described over a network of nodes, where each node possesses a local batch of data samples. In this context, we analyze a single-timescale randomized incremental gradient method, called GT-SAGA. GT-SAGA is computationally efficient as it evaluates one component gradient per node per iteration and achieves provably fast and robust performance by leveraging node-level variance reduction and network-level gradient tracking. For general smooth non-convex problems, we show the almost sure and mean-squared convergence of GT-SAGA to a first-order stationary point and further describe regimes of practical significance where it outperforms the existing approaches and achieves a network topology-independent iteration complexity, respectively. When the global function satisfies the Polyak-Łojasiewicz condition, we show that GT-SAGA exhibits linear convergence to an optimal solution in expectation and describe regimes of practical interest where the performance is network topology-independent and improves upon the existing methods. Numerical experiments are included to highlight the main convergence aspects of GT-SAGA in non-convex settings.
    Emergence of Pragmatics from Referential Game between Theory of Mind Agents. (arXiv:2001.07752v2 [cs.AI] UPDATED)
    (2 min) Pragmatics studies how context can contribute to language meanings. In human communication, language is never interpreted out of context, and sentences can usually convey more information than their literal meanings. However, this mechanism is missing in most multi-agent systems, restricting the communication efficiency and the capability of human-agent interaction. In this paper, we propose an algorithm, using which agents can spontaneously learn the ability to "read between the lines" without any explicit hand-designed rules. We integrate the theory of mind (ToM) in a cooperative multi-agent pedagogical situation and propose an adaptive reinforcement learning (RL) algorithm to develop a communication protocol. ToM is a profound cognitive science concept, claiming that people regularly reason about others' mental states, including beliefs, goals, and intentions, to obtain a performance advantage in competition, cooperation or coalition. With this ability, agents consider language as not only messages but also rational acts reflecting others' hidden states. Our experiments demonstrate the advantage of pragmatic protocols over non-pragmatic protocols. We also show that the teaching complexity following the pragmatic protocol empirically approximates the recursive teaching dimension (RTD).
    Best Principal Submatrix Selection for the Maximum Entropy Sampling Problem: Scalable Algorithms and Performance Guarantees. (arXiv:2001.08537v2 [stat.ML] UPDATED)
    (2 min) This paper studies a classic maximum entropy sampling problem (MESP), which aims to select the most informative principal submatrix of a prespecified size from a covariance matrix. MESP has been widely applied to many areas, including healthcare, power systems, manufacturing, and data science. By investigating its Lagrangian dual and primal characterization, we derive a novel convex integer program for MESP and show that its continuous relaxation yields a near-optimal solution. The results motivate us to study an efficient sampling algorithm and develop its approximation bound for MESP, which improves the best-known bound in the literature. We then provide an efficient deterministic implementation of the sampling algorithm with the same approximation bound. By developing new mathematical tools for singular matrices and analyzing the Lagrangian dual of the proposed convex integer program, we investigate the widely-used local search algorithm and prove its first-known approximation bound for MESP. The proof techniques further inspire us with an efficient implementation of the local search algorithm. Our numerical experiments demonstrate that these approximation algorithms can efficiently solve medium-sized and large-scale instances to near-optimality. Our proposed algorithms are coded and released as open-source software. Finally, we extend the analyses to the A-Optimal MESP (A-MESP), where the objective is to minimize the trace of the inverse of the selected principal submatrix.
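    A toy greedy heuristic in Python conveying the MESP objective (maximise the log-determinant of the selected principal submatrix); this is only the naive baseline idea, not the paper's algorithms or their approximation guarantees:

        import numpy as np

        def greedy_mesp(C, s):
            """Greedily pick s row/column indices of covariance C maximising log det C[S, S]."""
            chosen, best_val = [], -np.inf
            for _ in range(s):
                best_i, best_val = None, -np.inf
                for i in range(C.shape[0]):
                    if i in chosen:
                        continue
                    idx = chosen + [i]
                    sign, logdet = np.linalg.slogdet(C[np.ix_(idx, idx)])
                    if sign > 0 and logdet > best_val:
                        best_i, best_val = i, logdet
                chosen.append(best_i)
            return chosen, best_val

        rng = np.random.default_rng(0)
        A = rng.standard_normal((8, 20))
        C = A @ A.T / 20 + 0.1 * np.eye(8)     # a toy positive-definite covariance
        print(greedy_mesp(C, 3))               # selected indices and their log-determinant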
    Inverse Reinforcement Learning: A Control Lyapunov Approach. (arXiv:2104.04483v2 [cs.LG] UPDATED)
    (2 min) Inferring the intent of an intelligent agent from demonstrations and subsequently predicting its behavior is a critical task in many collaborative settings. A common approach to solve this problem is the framework of inverse reinforcement learning (IRL), where the observed agent, e.g., a human demonstrator, is assumed to behave according to an intrinsic cost function that reflects its intent and informs its control actions. In this work, we reformulate the IRL inference problem to learning control Lyapunov functions (CLF) from demonstrations by exploiting the inverse optimality property, which states that every CLF is also a meaningful value function. Moreover, the derived CLF formulation directly guarantees stability of inferred control policies. We show the flexibility of our proposed method by learning from goal-directed movement demonstrations in a continuous environment.
    Cross Modal Focal Loss for RGBD Face Anti-Spoofing. (arXiv:2103.00948v1 [cs.CV] CROSS LISTED)
    (2 min) Automatic methods for detecting presentation attacks are essential to ensure the reliable use of facial recognition technology. Most of the methods available in the literature for presentation attack detection (PAD) fail to generalize to unseen attacks. In recent years, multi-channel methods have been proposed to improve the robustness of PAD systems. Often, only a limited amount of data is available for additional channels, which limits the effectiveness of these methods. In this work, we present a new framework for PAD that uses RGB and depth channels together with a novel loss function. The new architecture uses complementary information from the two modalities while reducing the impact of overfitting. Essentially, a cross-modal focal loss function is proposed to modulate the loss contribution of each channel as a function of the confidence of individual channels. Extensive evaluations on two publicly available datasets demonstrate the effectiveness of the proposed approach.
    A Cram\'er Distance perspective on Non-crossing Quantile Regression in Distributional Reinforcement Learning. (arXiv:2110.00535v1 [stat.ML])
    (2 min) Distributional reinforcement learning (DRL) extends the value-based approach by using a deep convolutional network to approximate the full distribution over future returns instead of the mean only, providing a richer signal that leads to improved performance. Quantile-based methods like QR-DQN project arbitrary distributions onto a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance; however, due to biases in the gradients, the quantile regression loss is used instead for training, which guarantees the same minimizer and enjoys unbiased gradients. Recently, monotonicity constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cram\'er distance yields a projection that coincides with the 1-Wasserstein one and that, under monotonicity constraints, the squared Cram\'er and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a novel non-crossing neural architecture that allows a good training performance using a novel algorithm to compute the Cram\'er distance, yielding significant improvements over QR-DQN in a number of games of the standard Atari 2600 benchmark.
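    For concreteness, a short PyTorch sketch of the quantile (Huber) regression loss that QR-DQN-style agents minimise at fixed quantile levels tau_i = (i + 0.5)/N; this reproduces only the standard loss, not the paper's Cram\'er-distance algorithm or non-crossing architecture:

        import torch

        def quantile_huber_loss(pred, target, kappa=1.0):
            """pred: (batch, N) predicted quantiles; target: (batch, M) target return samples."""
            N = pred.shape[1]
            taus = (torch.arange(N, dtype=pred.dtype) + 0.5) / N   # fixed quantile levels
            u = target.unsqueeze(1) - pred.unsqueeze(2)            # pairwise TD errors (B, N, M)
            huber = torch.where(u.abs() <= kappa,
                                0.5 * u ** 2,
                                kappa * (u.abs() - 0.5 * kappa))
            weight = (taus.view(1, N, 1) - (u.detach() < 0).float()).abs()
            return (weight * huber).mean()

        pred = torch.randn(4, 8, requires_grad=True)     # 8 quantiles for 4 transitions
        target = torch.randn(4, 8)                       # e.g. bootstrapped target quantiles
        quantile_huber_loss(pred, target).backward()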
    Fed-LAMB: Layerwise and Dimensionwise Locally Adaptive Optimization Algorithm. (arXiv:2110.00532v1 [cs.LG])
    (2 min) In the emerging paradigm of federated learning (FL), a large number of clients, such as mobile devices, are used to train possibly high-dimensional models on their respective data. Due to the low bandwidth of mobile devices, decentralized optimization methods need to shift the computation burden from those clients to the computation server while preserving privacy and reasonable communication cost. In this paper, we focus on the training of deep, as in multilayered, neural networks, under the FL settings. We present Fed-LAMB, a novel federated learning method based on layerwise and dimensionwise updates of the local models, alleviating the nonconvexity and the multilayered nature of the optimization task at hand. We provide a thorough finite-time convergence analysis for Fed-LAMB characterizing how fast its gradient decreases. We provide experimental results under iid and non-iid settings to corroborate not only our theory, but also the faster convergence of our method compared to the state-of-the-art.
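    A rough PyTorch sketch of the layerwise-adaptive (LAMB-style) local step that Fed-LAMB builds on, where each layer's update is rescaled by a per-layer trust ratio before the server averages client models; the constants and structure here are illustrative, not the paper's exact algorithm:

        import torch

        def lamb_step(params, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, wd=0.01):
            """One layerwise-adaptive update over a list of parameter tensors (one per layer)."""
            for i, p in enumerate(params):
                if p.grad is None:
                    continue
                m, v = state.get(i, (torch.zeros_like(p), torch.zeros_like(p)))
                m = betas[0] * m + (1 - betas[0]) * p.grad
                v = betas[1] * v + (1 - betas[1]) * p.grad ** 2
                state[i] = (m, v)
                direction = m / (v.sqrt() + eps) + wd * p.data      # Adam-style direction
                trust = p.data.norm() / (direction.norm() + eps)    # per-layer trust ratio
                p.data -= lr * trust.clamp(max=10.0) * direction

        # toy usage with two "layers"
        w1 = torch.randn(10, 5, requires_grad=True)
        w2 = torch.randn(5, requires_grad=True)
        (w1.sum() + (w2 ** 2).sum()).backward()
        lamb_step([w1, w2], state={})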
    Stochastic Contrastive Learning. (arXiv:2110.00552v1 [cs.LG])
    (2 min) While state-of-the-art contrastive Self-Supervised Learning (SSL) models produce results competitive with their supervised counterparts, they lack the ability to infer latent variables. In contrast, prescribed latent variable (LV) models enable attributing uncertainty, inducing task-specific compression, and in general allow for more interpretable representations. In this work, we introduce LV approximations to large scale contrastive SSL models. We demonstrate that this addition improves downstream performance (resulting in 96.42% and 77.49% test top-1 fine-tuned performance on CIFAR10 and ImageNet respectively with a ResNet50) as well as producing highly compressed representations (588x reduction) that are useful for interpretability, classification and regression downstream tasks.
    Evaluating the fairness of fine-tuning strategies in self-supervised learning. (arXiv:2110.00538v1 [cs.LG])
    (2 min) In this work we examine how fine-tuning impacts the fairness of contrastive Self-Supervised Learning (SSL) models. Our findings indicate that Batch Normalization (BN) statistics play a crucial role, and that updating only the BN statistics of a pre-trained SSL backbone improves its downstream fairness (36% worst subgroup, 25% mean subgroup gap). This procedure is competitive with supervised learning, while taking 4.4x less time to train and requiring only 0.35% as many parameters to be updated. Finally, inspired by recent work in supervised learning, we find that updating BN statistics and training residual skip connections (12.3% of the parameters) achieves parity with a fully fine-tuned model, while taking 1.33x less time to train.
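    A small PyTorch sketch of the "update only the BN statistics" procedure: freeze every weight of a pre-trained backbone and let a few forward passes in train mode refresh the BatchNorm running statistics on downstream data. The torchvision ResNet-50 and the random tensors below stand in for an SSL checkpoint and a real data loader:

        import torch
        from torchvision import models

        backbone = models.resnet50(weights=None)   # stand-in for an SSL-pretrained checkpoint
        for p in backbone.parameters():
            p.requires_grad = False                # no weight updates at all

        backbone.train()                           # BN layers now update running mean/var
        with torch.no_grad():
            for _ in range(10):                    # a few passes over downstream data
                x = torch.randn(32, 3, 224, 224)   # placeholder batch
                backbone(x)
        backbone.eval()                            # adapted statistics are frozen again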
    Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search. (arXiv:2102.01645v4 [cs.NE] UPDATED)
    (2 min) In this research work we present CLIP-GLaSS, a novel zero-shot framework to generate an image (or a caption) corresponding to a given caption (or image). CLIP-GLaSS is based on the CLIP neural network, which, given an image and a descriptive caption, provides similar embeddings. In contrast, CLIP-GLaSS takes a caption (or an image) as an input, and generates the image (or the caption) whose CLIP embedding is most similar to the input one. This optimal image (or caption) is produced via a generative network, after an exploration by a genetic algorithm. Promising results are shown, based on experiments with the image generators BigGAN and StyleGAN2, and with the text generator GPT-2.
    Personalized Rehabilitation Robotics based on Online Learning Control. (arXiv:2110.00481v1 [cs.LG])
    (2 min) The use of rehabilitation robotics in clinical applications gains increasing importance, due to therapeutic benefits and the ability to alleviate labor-intensive work. However, their practical utility is dependent on the deployment of appropriate control algorithms, which adapt the level of task-assistance according to each individual patient's need. Generally, the required personalization is achieved through manual tuning by clinicians, which is cumbersome and error-prone. In this work we propose a novel online learning control architecture, which is able to personalize the control force at run time to each individual user. To this end, we deploy Gaussian process-based online learning with previously unseen prediction and update rates. Finally, we evaluate our method in an experimental user study, where the learning controller is shown to provide personalized control, while also obtaining safe interaction forces.
    Optimization Networks for Integrated Machine Learning. (arXiv:2110.00415v1 [cs.LG])
    (2 min) Optimization networks are a new methodology, developed with combinatorial optimization problems in mind, for holistically solving interrelated problems. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark application and compare the results of optimization networks to ordinary least squares with optional elastic net regularization. Based on this example we justify the advantages of optimization networks by adapting the network to solve other machine learning problems. Finally, optimization analysis is presented, where optimal input values of a system have to be found to achieve desired output values. Optimization analysis can be divided into three subproblems: model creation to describe the system, model selection to choose the most appropriate one and parameter optimization to obtain the input values. Therefore, optimization networks are an obvious choice for handling optimization analysis tasks.
    Applying Differential Privacy to Tensor Completion. (arXiv:2110.00539v1 [cs.LG])
    (2 min) Tensor completion aims at filling the missing or unobserved entries based on partially observed tensors. However, utilization of the observed tensors often raises serious privacy concerns in many practical scenarios. To address this issue, we propose a solid and unified framework that contains several approaches for applying differential privacy to the two most widely used tensor decomposition methods: i) CANDECOMP/PARAFAC~(CP) and ii) Tucker decompositions. For each approach, we establish a rigorous privacy guarantee and meanwhile evaluate the privacy-accuracy trade-off. Experiments on synthetic and real-world datasets demonstrate that our proposal achieves high accuracy for tensor completion while ensuring strong privacy protections.
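    A numpy sketch giving the flavour of perturbing a CP (CANDECOMP/PARAFAC) factor update with Gaussian noise; this is only the generic mechanism shape, with no privacy calibration, and not the specific approaches or guarantees developed in the paper:

        import numpy as np

        def khatri_rao(A, B):
            """Column-wise Khatri-Rao product, rows ordered to match C-order unfoldings."""
            return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

        def noisy_cp_als_sweep(T, factors, sigma, rng):
            """One alternating-least-squares sweep with Gaussian noise added to each factor."""
            for mode in range(3):
                others = [factors[m] for m in range(3) if m != mode]
                KR = khatri_rao(others[0], others[1])
                Tm = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)   # mode unfolding
                sol = Tm @ KR @ np.linalg.pinv(KR.T @ KR)                 # least-squares factor
                factors[mode] = sol + sigma * rng.standard_normal(sol.shape)
            return factors

        rng = np.random.default_rng(0)
        A, B, C = (rng.standard_normal((d, 3)) for d in (6, 5, 4))
        T = np.einsum('ir,jr,kr->ijk', A, B, C)                  # a toy rank-3 tensor
        factors = [rng.standard_normal((d, 3)) for d in (6, 5, 4)]
        for _ in range(20):
            factors = noisy_cp_als_sweep(T, factors, sigma=0.01, rng=rng)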
    A generalization of the randomized singular value decomposition. (arXiv:2105.13052v2 [math.NA] UPDATED)
    (2 min) The randomized singular value decomposition (SVD) is a popular and effective algorithm for computing a near-best rank $k$ approximation of a matrix $A$ using matrix-vector products with standard Gaussian vectors. Here, we generalize the theory of randomized SVD to multivariate Gaussian vectors, allowing one to incorporate prior knowledge of $A$ into the algorithm. This enables us to explore the continuous analogue of the randomized SVD for Hilbert--Schmidt (HS) operators using operator-function products with functions drawn from a Gaussian process (GP). We then construct a new covariance kernel for GPs, based on weighted Jacobi polynomials, which allows us to rapidly sample the GP and control the smoothness of the randomly generated functions. Numerical examples on matrices and HS operators demonstrate the applicability of the algorithm.
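    The matrix case is easy to sketch in numpy: drawing the test vectors from N(0, K) instead of N(0, I) is the only change to the standard randomized SVD, and the covariance K is where prior knowledge about A enters (K = I recovers the usual algorithm). The kernel construction and the Hilbert--Schmidt extension are the paper's contribution and are not reproduced here:

        import numpy as np

        def randomized_svd(A, k, K=None, oversample=10, seed=0):
            """Near-best rank-k approximation using (possibly correlated) Gaussian test vectors."""
            rng = np.random.default_rng(seed)
            n, p = A.shape[1], k + oversample
            Omega = rng.standard_normal((n, p))
            if K is not None:
                Omega = np.linalg.cholesky(K) @ Omega      # samples from N(0, K)
            Q, _ = np.linalg.qr(A @ Omega)                 # orthonormal basis for the range
            U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
            return (Q @ U_small)[:, :k], s[:k], Vt[:k]

        A = np.random.default_rng(1).standard_normal((100, 80))
        U, s, Vt = randomized_svd(A, k=10)
        print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))   # relative error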
    Guiding Evolutionary Strategies by Differentiable Robot Simulators. (arXiv:2110.00438v1 [cs.RO])
    (2 min) In recent years, Evolutionary Strategies were actively explored in robotic tasks for policy search as they provide a simpler alternative to reinforcement learning algorithms. However, this class of algorithms is often claimed to be extremely sample-inefficient. On the other hand, there is a growing interest in Differentiable Robot Simulators (DRS) as they potentially can find successful policies with only a handful of trajectories. But the resulting gradient is not always useful for first-order optimization. In this work, we demonstrate how the DRS gradient can be used in conjunction with Evolutionary Strategies. Preliminary results suggest that this combination can reduce the sample complexity of Evolutionary Strategies by 3x-5x in both simulation and the real world.
    Evaluation of Non-Negative Matrix Factorization and n-stage Latent Dirichlet Allocation for Emotion Analysis in Turkish Tweets. (arXiv:2110.00418v1 [cs.CL])
    (2 min) With the development of technology, the use of social media has become quite common. Analyzing comments on social media in areas such as media and advertising plays an important role today. For this reason, new and traditional natural language processing methods are used to detect the emotion of these posts. In this paper, the Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) topic modeling methods were used to determine which emotion Turkish tweets posted on Twitter express. In addition, the accuracy of a proposed n-stage method based on LDA was analyzed. The dataset consists of 5 emotions, namely angry, fear, happy, sad and confused. NMF was the most successful method among all topic modeling methods in this study. Then, the F1-measure of the Random Forest, Naive Bayes and Support Vector Machine methods was analyzed by obtaining a file suitable for Weka using the word weights and class labels of the topics. Among the Weka results, the most successful method was n-stage LDA, and the most successful algorithm was Random Forest.
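    A scikit-learn miniature of the NMF topic-modeling step, with the dominant topic of each tweet serving as its proxy emotion; the five toy tweets below are placeholders for the actual dataset, and the downstream Weka classifiers are not reproduced:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import NMF

        tweets = ["bugün çok mutluyum", "gerçekten çok korkuyorum",
                  "sinirliyim ve kızgınım", "bugün çok üzgünüm",
                  "şaşkınım ne diyeceğimi bilemiyorum"]          # toy placeholders
        X = TfidfVectorizer().fit_transform(tweets)

        nmf = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
        W = nmf.fit_transform(X)          # tweet-topic weights
        H = nmf.components_               # topic-word weights (inspected to name emotions)
        print(W.argmax(axis=1))           # dominant topic per tweet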
    Probabilistic Robust Autoencoders for Anomaly Detection. (arXiv:2110.00494v1 [cs.LG])
    (2 min) Empirical observations often consist of anomalies (or outliers) that contaminate the data. Accurate identification of anomalous samples is crucial for the success of downstream data analysis tasks. To automatically identify anomalies, we propose a new type of autoencoder (AE) which we term Probabilistic Robust autoencoder (PRAE). PRAE is designed to simultaneously remove outliers and identify a low-dimensional representation for the inlier samples. We first describe Robust AE (RAE) as a model that aims to split the data into inlier samples, from which a low dimensional representation is learned via an AE, and anomalous (outlier) samples that are excluded as they do not fit the low dimensional representation. Robust AE minimizes the reconstruction error of the AE while attempting to incorporate as many observations as possible. This could be realized by subtracting from the reconstruction term an $\ell_0$ norm counting the number of selected observations. Since the $\ell_0$ norm is not differentiable, we propose two probabilistic relaxations for the RAE approach and demonstrate that they can effectively identify anomalies. We prove that the solution to PRAE is equivalent to the solution of RAE and demonstrate using extensive simulations that PRAE is on par with state-of-the-art methods for anomaly detection.
    Arbitrary Marginal Neural Ratio Estimation for Simulation-based Inference. (arXiv:2110.00449v1 [cs.LG])
    (2 min) In many areas of science, complex phenomena are modeled by stochastic parametric simulators, often featuring high-dimensional parameter spaces and intractable likelihoods. In this context, performing Bayesian inference can be challenging. In this work, we present a novel method that enables amortized inference over arbitrary subsets of the parameters, without resorting to numerical integration, which makes interpretation of the posterior more convenient. Our method is efficient and can be implemented with arbitrary neural network architectures. We demonstrate the applicability of the method on parameter inference of binary black hole systems from gravitational waves observations.
    SMATE: Semi-Supervised Spatio-Temporal Representation Learning on Multivariate Time Series. (arXiv:2110.00578v1 [cs.LG])
    (2 min) Learning from Multivariate Time Series (MTS) has attracted widespread attention in recent years. In particular, label shortage is a real challenge for the classification task on MTS, considering its complex dimensional and sequential data structure. Unlike self-training and positive unlabeled learning that rely on distance-based classifiers, in this paper, we propose SMATE, a novel semi-supervised model for learning the interpretable Spatio-Temporal representation from weakly labeled MTS. We validate empirically the learned representation on 22 public datasets from the UEA MTS archive. We compare it with 13 state-of-the-art baseline methods for fully supervised tasks and four baselines for semi-supervised tasks. The results show the reliability and efficiency of our proposed method.
    SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference. (arXiv:2110.00478v1 [cs.AR])
    (2 min) Edge computing devices inherently face tight resource constraints, which is especially apparent when deploying Deep Neural Networks (DNN) with high memory and compute demands. FPGAs are commonly available in edge devices. Since these reconfigurable circuits can achieve higher throughput and lower power consumption than general purpose processors, they are especially well-suited for DNN acceleration. However, existing solutions for designing FPGA-based DNN accelerators for edge devices come with high development overheads, given the cost of repeated FPGA synthesis passes, reimplementation in a Hardware Description Language (HDL) of the simulated design, and accelerator system integration. In this paper we propose SECDA, a new hardware/software co-design methodology to reduce design time of optimized DNN inference accelerators on edge devices with FPGAs. SECDA combines cost-effective SystemC simulation with hardware execution, streamlining design space exploration and the development process via reduced design evaluation time. As a case study, we use SECDA to efficiently develop two different DNN accelerator designs on a PYNQ-Z1 board, a platform that includes an edge FPGA. We quickly and iteratively explore the system's hardware/software stack, while identifying and mitigating performance bottlenecks. We evaluate the two accelerator designs with four common DNN models, achieving an average performance speedup across models of up to 3.5$\times$ with a 2.9$\times$ reduction in energy consumption over CPU-only inference. Our code is available at https://github.com/gicLAB/SECDA
    New Evolutionary Computation Models and their Applications to Machine Learning. (arXiv:2110.00468v1 [cs.NE])
    (3 min) Automatic Programming is one of the most important areas of computer science research today. Hardware speed and capability have increased exponentially, but software is years behind. The demand for software has also increased significantly, but it is still written the old-fashioned way: by humans. There are multiple problems when the work is done by humans: cost, time, and quality. It is costly to pay humans, it is hard to keep them satisfied for a long time, it takes a lot of time to teach and train them, and the quality of their output is in most cases low (in software, mostly due to bugs). The real advances in human civilization appeared during the industrial revolutions. Before the first revolution, most people worked in agriculture. Today, only a very small percentage of people work in this field. A similar revolution must appear in the computer programming field; otherwise, we will have as many people working in this field as we once had working in agriculture. How do people know how to write computer programs? Very simply: by learning. Can we do the same for software? Can we put software to learn how to write software? It seems that this is possible (to some degree), and the term is called Machine Learning. It was first coined in 1959 by Arthur Samuel, the first person who made a computer perform a serious learning task. However, things are not as easy as in humans (truth be told, for some humans it is impossible to learn how to write software). So far we do not have software that can learn perfectly how to write software. We have some particular cases where some programs do better than humans, but the examples are sporadic at best. Learning from experience is difficult for computer programs. Instead of trying to simulate how humans teach humans to write computer programs, we can simulate nature.
    MEDIC: A Multi-Task Learning Dataset for Disaster Image Classification. (arXiv:2108.12828v2 [cs.CV] UPDATED)
    (2 min) Recent research in disaster informatics demonstrates a practical and important use case of artificial intelligence to save human lives and reduce suffering in the aftermath of natural disasters, based on social media content (text and images). While notable progress has been made using texts, research on exploiting the images remains relatively under-explored. To advance the image-based approach, we propose MEDIC (available at: https://crisisnlp.qcri.org/medic/index.html), which is the largest social media image classification dataset for humanitarian response, consisting of 71,198 images to address four different tasks in a multi-task learning setup. This is the first dataset of its kind: social media images, disaster response, and multi-task learning research. An important property of this dataset is its high potential to contribute to research on multi-task learning, which has recently received much interest from the machine learning community and has shown remarkable results in terms of memory, inference speed, performance, and generalization capability. Therefore, the proposed dataset is an important resource for advancing image-based disaster management and multi-task machine learning research.
    X2CT-FLOW: Maximum a posteriori reconstruction using a progressive flow-based deep generative model for ultra sparse-view computed tomography in ultra low-dose protocols. (arXiv:2104.04179v2 [eess.IV] UPDATED)
    (0 min) Ultra sparse-view computed tomography (CT) algorithms can reduce radiation exposure of patients, but those algorithms lack an explicit cycle consistency loss minimization and an explicit log-likelihood maximization in testing. Here, we propose X2CT-FLOW for the maximum a posteriori (MAP) reconstruction of a three-dimensional (3D) chest CT image from a single or a few two-dimensional (2D) projection images using a progressive flow-based deep generative model, especially for ultra low-dose protocols. The MAP reconstruction can simultaneously optimize the cycle consistency loss and the log-likelihood. The proposed algorithm is built upon a newly developed progressive flow-based deep generative model, which is featured with exact log-likelihood estimation, efficient sampling, and progressive learning. We applied X2CT-FLOW to reconstruction of 3D chest CT images from biplanar projection images without noise contamination (assuming a standard-dose protocol) and with strong noise contamination (assuming an ultra low-dose protocol). With the standard-dose protocol, our images reconstructed from 2D projected images and 3D ground-truth CT images showed good agreement in terms of structural similarity (SSIM, 0.7675 on average), peak signal-to-noise ratio (PSNR, 25.89 dB on average), mean absolute error (MAE, 0.02364 on average), and normalized root mean square error (NRMSE, 0.05731 on average). Moreover, with the ultra low-dose protocol, our images reconstructed from 2D projected images and the 3D ground-truth CT images also showed good agreement in terms of SSIM (0.7008 on average), PSNR (23.58 dB on average), MAE (0.02991 on average), and NRMSE (0.07349 on average).
    Deep Image Synthesis from Intuitive User Input: A Review and Perspectives. (arXiv:2107.04240v2 [cs.CV] UPDATED)
    (2 min) In many applications of computer graphics, art and design, it is desirable for a user to provide intuitive non-image input, such as text, sketch, stroke, graph or layout, and have a computer system automatically generate photo-realistic images that adhere to the input content. While classic works that allow such automatic image content generation have followed a framework of image retrieval and composition, recent advances in deep generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and flow-based methods have enabled more powerful and versatile image generation tasks. This paper reviews recent works for image synthesis given intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics. This motivates new perspectives on input representation and interactivity, cross pollination between major image generation paradigms, and evaluation and comparison of generation methods.
    Learning Reward Functions from Scale Feedback. (arXiv:2110.00284v1 [cs.RO])
    (2 min) Today's robots are increasingly interacting with people and need to efficiently learn inexperienced users' preferences. A common framework is to iteratively query the user about which of two presented robot trajectories they prefer. While this minimizes the user's effort, a strict choice does not yield any information on how much one trajectory is preferred. We propose scale feedback, where the user utilizes a slider to give more nuanced information. We introduce a probabilistic model on how users would provide feedback and derive a learning framework for the robot. We demonstrate the performance benefit of slider feedback in simulations, and validate our approach in two user studies suggesting that scale feedback enables more effective learning in practice.
    Sharp Restricted Isometry Property Bounds for Low-rank Matrix Recovery Problems with Corrupted Measurements. (arXiv:2105.08232v2 [math.OC] UPDATED)
    (2 min) In this paper, we study a general low-rank matrix recovery problem with linear measurements corrupted by some noise. The objective is to understand under what conditions on the restricted isometry property (RIP) of the problem local search methods can find the ground truth with a small error. By analyzing the landscape of the non-convex problem, we first propose a global guarantee on the maximum distance between an arbitrary local minimizer and the ground truth under the assumption that the RIP constant is smaller than $1/2$. We show that this distance shrinks to zero as the intensity of the noise reduces. Our new guarantee is sharp in terms of the RIP constant and is much stronger than the existing results. We then present a local guarantee for problems with an arbitrary RIP constant, which states that any local minimizer is either considerably close to the ground truth or far away from it. Next, we prove the strict saddle property, which guarantees the global convergence of the perturbed gradient descent method in polynomial time. The developed results demonstrate how the noise intensity and the RIP constant of the problem affect the landscape of the problem.
    PFPN: Continuous Control of Physically Simulated Characters using Particle Filtering Policy Network. (arXiv:2003.06959v4 [cs.LG] UPDATED)
    (2 min) Data-driven methods for physics-based character control using reinforcement learning have been successfully applied to generate high-quality motions. However, existing approaches typically rely on Gaussian distributions to represent the action policy, which can prematurely commit to suboptimal actions when solving high-dimensional continuous control problems for highly-articulated characters. In this paper, to improve the learning performance of physics-based character controllers, we propose a framework that considers a particle-based action policy as a substitute for Gaussian policies. We exploit particle filtering to dynamically explore and discretize the action space, and track the posterior policy represented as a mixture distribution. The resulting policy can replace the unimodal Gaussian policy which has been the staple for character control problems, without changing the underlying model architecture of the reinforcement learning algorithm used to perform policy optimization. We demonstrate the applicability of our approach on various motion capture imitation tasks. Baselines using our particle-based policies achieve better imitation performance and speed of convergence as compared to corresponding implementations using Gaussians, and are more robust to external perturbations during character control. Related code is available at: https://motion-lab.github.io/PFPN.
    Neural Networks for Encoding Dynamic Security-Constrained Optimal Power Flow. (arXiv:2003.07939v4 [eess.SY] UPDATED)
    (0 min) This paper introduces a framework to capture previously intractable optimization constraints and transform them to a mixed-integer linear program, through the use of neural networks. We encode the feasible space of optimization problems characterized by both tractable and intractable constraints, e.g. differential equations, to a neural network. Leveraging an exact mixed-integer reformulation of neural networks, we solve mixed-integer linear programs that accurately approximate solutions to the originally intractable non-linear optimization problem. We apply our methods to the AC optimal power flow problem (AC-OPF), where directly including dynamic security constraints renders the AC-OPF intractable. Our proposed approach has the potential to be significantly more scalable than traditional approaches. We demonstrate our approach for power system operation considering N-1 security and small-signal stability, showing how it can efficiently obtain cost-optimal solutions which at the same time satisfy both static and dynamic security constraints.
    Unsupervised Belief Representation Learning in Polarized Networks with Information-Theoretic Variational Graph Auto-Encoders. (arXiv:2110.00210v1 [cs.SI])
    (2 min) This paper develops a novel unsupervised algorithm for belief representation learning in polarized networks that (i) uncovers the latent dimensions of the underlying belief space and (ii) jointly embeds users and content items (that they interact with) into that space in a manner that facilitates a number of downstream tasks, such as stance detection, stance prediction, and ideology mapping. Inspired by total correlation in information theory, we propose a novel Information-Theoretic Variational Graph Auto-Encoder (InfoVGAE) that learns to project both users and content items (e.g., posts that represent user views) into an appropriate disentangled latent space. In order to better disentangle orthogonal latent variables in that space, we develop total correlation regularization, a PI control module, and adopt a rectified Gaussian distribution for the latent space. The latent representation of users and content can then be used to quantify their ideological leaning and detect/predict their stances on issues. We evaluate the performance of the proposed InfoVGAE on three real-world datasets, of which two are collected from Twitter and one from U.S. Congress voting records. The evaluation results show that our model outperforms state-of-the-art unsupervised models and produces results comparable with supervised models. We also discuss stance prediction and user ranking within ideological groups.
    Effect of barren plateaus on gradient-free optimization. (arXiv:2011.12245v2 [quant-ph] UPDATED)
    (2 min) Barren plateau landscapes correspond to gradients that vanish exponentially in the number of qubits. Such landscapes have been demonstrated for variational quantum algorithms and quantum neural networks with either deep circuits or global cost functions. For obvious reasons, it is expected that gradient-based optimizers will be significantly affected by barren plateaus. However, whether or not gradient-free optimizers are impacted is a topic of debate, with some arguing that gradient-free approaches are unaffected by barren plateaus. Here we show that, indeed, gradient-free optimizers do not solve the barren plateau problem. Our main result proves that cost function differences, which are the basis for making decisions in a gradient-free optimization, are exponentially suppressed in a barren plateau. Hence, without exponential precision, gradient-free optimizers will not make progress in the optimization. We numerically confirm this by training in a barren plateau with several gradient-free optimizers (Nelder-Mead, Powell, and COBYLA algorithms), and show that the number of shots required in the optimization grows exponentially with the number of qubits.
    Cellular traffic offloading via Opportunistic Networking with Reinforcement Learning. (arXiv:2110.00397v1 [cs.NI])
    (2 min) The widespread diffusion of mobile phones is triggering an exponential growth of mobile data traffic that is likely to cause, in the near future, considerable traffic overload issues even in last-generation cellular networks. Offloading part of the traffic to other networks is considered a very promising approach and, in particular, in this paper, we consider offloading through opportunistic networks of users' devices. However, the performance of this solution strongly depends on the pattern of encounters between mobile nodes, which should therefore be taken into account when designing offloading control algorithms. In this paper, we propose an adaptive offloading solution based on the Reinforcement Learning framework and we evaluate and compare the performance of two well-known learning algorithms: Actor-Critic and Q-Learning. More precisely, in our solution the controller of the dissemination process, once trained, is able to select a proper number of content replicas to be injected into the opportunistic network to guarantee the timely delivery of contents to all interested users. We show that our system based on Reinforcement Learning is able to automatically learn a very efficient strategy to reduce the traffic on the cellular network, without relying on any additional context information about the opportunistic network. Our solution achieves a higher level of offloading with respect to other state-of-the-art approaches, in a range of different mobility settings. Moreover, we show that a more refined learning solution, based on the Actor-Critic algorithm, is significantly more efficient than a simpler solution based on Q-learning.
    Smooth Normalizing Flows. (arXiv:2110.00351v1 [stat.ML])
    (2 min) Normalizing flows are a promising tool for modeling probability distributions in physical systems. While state-of-the-art flows accurately approximate distributions and energies, applications in physics additionally require smooth energies to compute forces and higher-order derivatives. Furthermore, such densities are often defined on non-trivial topologies. A recent example are Boltzmann Generators for generating 3D-structures of peptides and small proteins. These generative models leverage the space of internal coordinates (dihedrals, angles, and bonds), which is a product of hypertori and compact intervals. In this work, we introduce a class of smooth mixture transformations working on both compact intervals and hypertori. In practice, mixture transformations have to be inverted with root-finding methods, which has so far prevented bi-directional flow training. To this end, we show that parameter gradients and forces of such inverses can be computed from forward evaluations via the inverse function theorem. We demonstrate two advantages of such smooth flows: they allow training by force matching to simulation data and can be used as potentials in molecular dynamics simulations.
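    The key trick, differentiating through a root-finding inverse via the inverse function theorem, is easy to illustrate numerically; the toy monotone map below stands in for a smooth mixture transformation:

        import numpy as np
        from scipy.optimize import brentq

        def f(x, theta):                     # a toy strictly monotone transformation
            return 0.7 * np.tanh(theta * x) + 0.3 * x

        theta, y = 1.5, 0.4
        x_star = brentq(lambda x: f(x, theta) - y, -5.0, 5.0)    # invert by root finding

        eps = 1e-6
        df_dx = (f(x_star + eps, theta) - f(x_star - eps, theta)) / (2 * eps)
        df_dth = (f(x_star, theta + eps) - f(x_star, theta - eps)) / (2 * eps)
        dx_dtheta = -df_dth / df_dx          # inverse function theorem at the root

        # finite-difference check: re-invert with a perturbed parameter
        x_pert = brentq(lambda x: f(x, theta + eps) - y, -5.0, 5.0)
        print(dx_dtheta, (x_pert - x_star) / eps)   # the two estimates should agree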
    Error-free approximation of explicit linear MPC through lattice piecewise affine expression. (arXiv:2110.00201v1 [eess.SY])
    (2 min) In this paper, the disjunctive and conjunctive lattice piecewise affine (PWA) approximations of explicit linear model predictive control (MPC) are proposed. The training data is generated uniformly in the domain of interest and consists of state samples and the corresponding affine control laws, based on which the lattice PWA approximations are constructed. Resampling of the data is also proposed to guarantee that the lattice PWA approximations are identical to the explicit MPC control law in the unique order (UO) regions containing the sample points as interior points. Besides, under mild assumptions, the equivalence of the two lattice PWA approximations guarantees that the approximations are error-free in the domain of interest. An algorithm for deriving a statistically error-free approximation to the explicit linear MPC is proposed, and the complexity of the whole procedure, which is polynomial with respect to the number of samples, is analyzed. The performance of the proposed approximation strategy is tested through two simulation examples, and the results show that with a moderate number of sample points, we can construct lattice PWA approximations that are equivalent to the optimal control law of the explicit linear MPC.
    Lagrangian Inference for Ranking Problems. (arXiv:2110.00151v1 [stat.ML])
    (2 min) We propose a novel combinatorial inference framework to conduct general uncertainty quantification in ranking problems. We consider the widely adopted Bradley-Terry-Luce (BTL) model, where each item is assigned a positive preference score that determines the Bernoulli distributions of pairwise comparisons' outcomes. Our proposed method aims to infer general ranking properties of the BTL model. The general ranking properties include the "local" properties such as if an item is preferred over another and the "global" properties such as if an item is among the top $K$-ranked items. We further generalize our inferential framework to multiple testing problems where we control the false discovery rate (FDR), and apply the method to infer the top-$K$ ranked items. We also derive the information-theoretic lower bound to justify the minimax optimality of the proposed method. We conduct extensive numerical studies using both synthetic and real datasets to back up our theory.
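    A toy maximum-likelihood fit of BTL preference scores from pairwise win counts (Hunter's minorize-maximize updates); the paper's contribution is the combinatorial inference and FDR control built on top of such estimates, which this sketch does not attempt:

        import numpy as np

        def fit_btl(wins, n_iter=200):
            """wins[i, j] = number of times item i beat item j; returns normalised scores."""
            n = wins.shape[0]
            games = wins + wins.T
            theta = np.ones(n)
            for _ in range(n_iter):
                for i in range(n):
                    denom = sum(games[i, j] / (theta[i] + theta[j])
                                for j in range(n) if j != i)
                    theta[i] = wins[i].sum() / denom
                theta /= theta.sum()
            return theta

        wins = np.array([[0, 6, 8],
                         [4, 0, 7],
                         [2, 3, 0]])
        print(fit_btl(wins))        # larger score = more strongly preferred item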
    CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP. (arXiv:2104.08835v2 [cs.CL] UPDATED)
    (2 min) Humans can learn a new language task efficiently with only a few examples, by leveraging their knowledge obtained when learning prior tasks. In this paper, we explore whether and how such cross-task generalization ability can be acquired, and further applied to build better few-shot learners across diverse NLP tasks. We introduce CrossFit, a problem setup for studying cross-task generalization ability, which standardizes seen/unseen task partitions, data access during different learning stages, and the evaluation protocols. To instantiate different seen/unseen task partitions in CrossFit and facilitate in-depth analysis, we present the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks created from open-access NLP datasets and converted to a unified text-to-text format. Our analysis reveals that the few-shot learning ability on unseen tasks can be improved via an upstream learning stage using a set of seen tasks. We also observe that the selection of upstream learning tasks can significantly influence few-shot performance on unseen tasks, calling for further analysis of task similarity and transferability.
    Tree in Tree: from Decision Trees to Decision Graphs. (arXiv:2110.00392v1 [cs.LG])
    (2 min) Decision trees have been widely used as classifiers in many machine learning applications thanks to their lightweight and interpretable decision process. This paper introduces Tree in Tree decision graph (TnT), a framework that extends the conventional decision tree to a more generic and powerful directed acyclic graph. TnT constructs decision graphs by recursively growing decision trees inside the internal or leaf nodes instead of greedy training. The time complexity of TnT is linear in the number of nodes in the graph, and it can construct decision graphs on large datasets. Compared to decision trees, we show that TnT achieves better classification performance with reduced model size, both as a stand-alone classifier and as a base estimator in bagging/AdaBoost ensembles. Our proposed model is a novel, more efficient, and accurate alternative to the widely-used decision trees.
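    A rough scikit-learn illustration of the "grow a tree inside a node" idea: a shallow root tree is refined by fitting a second small tree on the samples reaching each impure leaf. Note that this produces a plain tree-of-trees, not the merged decision graph TnT actually constructs:

        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)
        root = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

        leaf_ids = root.apply(X)
        leaf_models = {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            if len(np.unique(y[mask])) > 1:        # refine only impure leaves
                leaf_models[leaf] = DecisionTreeClassifier(
                    max_depth=2, random_state=0).fit(X[mask], y[mask])

        def predict(Xq):
            out, leaves = root.predict(Xq), root.apply(Xq)
            for leaf, model in leaf_models.items():
                m = leaves == leaf
                if m.any():
                    out[m] = model.predict(Xq[m])
            return out

        print((predict(X) == y).mean())            # training accuracy of the nested model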
    Offline Reinforcement Learning with Reverse Model-based Imagination. (arXiv:2110.00188v1 [cs.LG])
    (2 min) In offline reinforcement learning (offline RL), one of the main challenges is to deal with the distributional shift between the learning policy and the given dataset. To address this problem, recent offline RL methods attempt to introduce conservatism bias to encourage learning on high-confidence areas. Model-free approaches directly encode such bias into policy or value function learning using conservative regularizations or special network structures, but their constrained policy search limits the generalization beyond the offline dataset. Model-based approaches learn forward dynamics models with conservatism quantifications and then generate imaginary trajectories to extend the offline datasets. However, due to limited samples in offline dataset, conservatism quantifications often suffer from overgeneralization in out-of-support regions. The unreliable conservative measures will mislead forward model-based imaginations to undesired areas, leading to overaggressive behaviors. To encourage more conservatism, we propose a novel model-based offline RL framework, called Reverse Offline Model-based Imagination (ROMI). We learn a reverse dynamics model in conjunction with a novel reverse policy, which can generate rollouts leading to the target goal states within the offline dataset. These reverse imaginations provide informed data augmentation for the model-free policy learning and enable conservative generalization beyond the offline dataset. ROMI can effectively combine with off-the-shelf model-free algorithms to enable model-based generalization with proper conservatism. Empirical results show that our method can generate more conservative behaviors and achieve state-of-the-art performance on offline RL benchmark tasks.
    Learning Representations for Sub-Symbolic Reasoning. (arXiv:2106.00393v2 [cs.AI] UPDATED)
    (0 min) Neuro-symbolic methods integrate neural architectures, knowledge representation and reasoning. However, they have struggled both to deal with the intrinsic uncertainty of the observations and to scale to real-world applications. This paper presents Relational Reasoning Networks (R2N), a novel end-to-end model that performs relational reasoning in the latent space of a deep learner architecture, where the representations of constants, ground atoms and their manipulations are learned in an integrated fashion. Unlike flat architectures like Knowledge Graph Embedders, which can only represent relations between entities, R2Ns define an additional computational structure, accounting for higher-level relations among the ground atoms. The considered relations can be explicitly known, like the ones defined by logic formulas, or defined as unconstrained correlations among groups of ground atoms. R2Ns can be applied to purely symbolic tasks or as a neuro-symbolic platform to integrate learning and reasoning in heterogeneous problems with both symbolic and feature-based represented entities. The proposed model overcomes the limitations of previous neuro-symbolic methods, which have been restricted in either scalability or expressivity. The proposed methodology is shown to achieve state-of-the-art results in different experimental settings.
    Predicting Flat-Fading Channels via Meta-Learned Closed-Form Linear Filters and Equilibrium Propagation. (arXiv:2110.00414v1 [cs.IT])
    (0 min) Predicting fading channels is a classical problem with a vast array of applications, including as an enabler of artificial intelligence (AI)-based proactive resource allocation for cellular networks. Under the assumption that the fading channel follows a stationary complex Gaussian process, as for Rayleigh and Rician fading models, the optimal predictor is linear, and it can be directly computed from the Doppler spectrum via standard linear minimum mean squared error (LMMSE) estimation. However, in practice, the Doppler spectrum is unknown, and the predictor has only access to a limited time series of estimated channels. This paper proposes to leverage meta-learning in order to mitigate the requirements in terms of training data for channel fading prediction. Specifically, it first develops an offline low-complexity solution based on linear filtering via a meta-trained quadratic regularization. Then, an online method is proposed based on gradient descent and equilibrium propagation (EP). Numerical results demonstrate the advantages of the proposed approach, showing its capacity to approach the genie-aided LMMSE solution with a small number of training data points.
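    The classical LMMSE building block is compact: with a known autocorrelation r[k] of the fading process (here a toy Jakes model), the one-step predictor weights solve the Wiener-Hopf normal equations. The meta-learned regularization and the equilibrium-propagation part of the paper are not reproduced in this sketch:

        import numpy as np
        from scipy.special import j0

        def lmmse_weights(r, L):
            """r[k]: autocorrelation at lag k (k = 0..L); returns L one-step predictor weights."""
            R = np.array([[r[abs(i - j)] for j in range(L)] for i in range(L)])
            R += 1e-6 * np.eye(L)                   # small loading for numerical stability
            rhs = np.array([r[k + 1] for k in range(L)])
            return np.linalg.solve(R, rhs)          # Wiener-Hopf solution

        fd_Ts = 0.05                                # normalised maximum Doppler frequency
        r = [j0(2 * np.pi * fd_Ts * k) for k in range(11)]   # toy Jakes autocorrelation
        w = lmmse_weights(r, L=10)
        # prediction: h_hat[t] = sum_k w[k] * h[t - 1 - k], given past channel estimates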
    Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads. (arXiv:2110.00459v1 [cs.DC])
    (0 min) We investigate the performance of the concurrency mechanisms available on NVIDIA's new Ampere GPU microarchitecture under deep learning training and inference workloads. In contrast to previous studies that treat the GPU as a black box, we examine scheduling at the microarchitectural level. We find that the lack of fine-grained preemption mechanisms, robust task prioritization options, and contention-aware thread block placement policies limits the effectiveness of NVIDIA's concurrency mechanisms. In summary, the sequential nature of deep learning workloads and their fluctuating resource requirements and kernel runtimes make it difficult to execute such workloads on current NVIDIA hardware while maintaining consistently high utilization and low, predictable turnaround times.
    Unsupervised Self-Training for Sentiment Analysis of Code-Switched Data. (arXiv:2103.14797v2 [cs.CL] UPDATED)
    (0 min) Sentiment analysis is an important task in understanding social media content like customer reviews, Twitter and Facebook feeds etc. In multilingual communities around the world, a large amount of social media text is characterized by the presence of Code-Switching. Thus, it has become important to build models that can handle code-switched data. However, annotated code-switched data is scarce and there is a need for unsupervised models and algorithms. We propose a general framework called Unsupervised Self-Training and show its applications for the specific use case of sentiment analysis of code-switched data. We use the power of pre-trained BERT models for initialization and fine-tune them in an unsupervised manner, only using pseudo labels produced by zero-shot transfer. We test our algorithm on multiple code-switched languages and provide a detailed analysis of the learning dynamics of the algorithm with the aim of answering the question: "Does our unsupervised model understand the Code-Switched languages or does it just learn its representations?". Our unsupervised models compete well with their supervised counterparts, with their performance reaching within 1-7% (weighted F1 scores) when compared to supervised models trained for a two class problem.
    Scalable, Axiomatic Explanations of Deep Alzheimer's Diagnosis from Heterogeneous Data. (arXiv:2107.05997v2 [cs.LG] UPDATED)
    (2 min) Deep Neural Networks (DNNs) have an enormous potential to learn from complex biomedical data. In particular, DNNs have been used to seamlessly fuse heterogeneous information from neuroanatomy, genetics, biomarkers, and neuropsychological tests for highly accurate Alzheimer's disease diagnosis. On the other hand, their black-box nature is still a barrier for the adoption of such a system in the clinic, where interpretability is absolutely essential. We propose Shapley Value Explanation of Heterogeneous Neural Networks (SVEHNN) for explaining the Alzheimer's diagnosis made by a DNN from the 3D point cloud of the neuroanatomy and tabular biomarkers. Our explanations are based on the Shapley value, which is the unique method that satisfies all fundamental axioms for local explanations previously established in the literature. Thus, SVEHNN has many desirable characteristics that previous work on interpretability for medical decision making is lacking. To avoid the exponential time complexity of the Shapley value, we propose to transform a given DNN into a Lightweight Probabilistic Deep Network without re-training, thus achieving a complexity only quadratic in the number of features. In our experiments on synthetic and real data, we show that we can closely approximate the exact Shapley value with a dramatically reduced runtime and can reveal the hidden knowledge the network has learned from the data.
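    To make the quantity being approximated concrete, here is brute-force Python code for the Shapley value of a model with a handful of features; SVEHNN exists precisely to avoid this exponential enumeration, via the probabilistic network transformation described above:

        from itertools import combinations
        from math import factorial
        import numpy as np

        def shapley_values(value_fn, n_features):
            """Exact Shapley values by enumerating all coalitions (exponential cost)."""
            phi = np.zeros(n_features)
            for i in range(n_features):
                rest = [j for j in range(n_features) if j != i]
                for size in range(n_features):
                    for S in combinations(rest, size):
                        w = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
                        phi[i] += w * (value_fn(set(S) | {i}) - value_fn(set(S)))
            return phi

        contrib = np.array([0.5, 0.2, -0.1, 0.4])          # toy additive "model"
        v = lambda S: sum(contrib[j] for j in S)
        print(shapley_values(v, 4))                        # recovers contrib, as the axioms demand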
    Towards Gradient-based Bilevel Optimization with Non-convex Followers and Beyond. (arXiv:2110.00455v1 [cs.LG])
    (2 min) In recent years, Bi-Level Optimization (BLO) techniques have received extensive attention from both the learning and vision communities. A variety of BLO models in complex and practical tasks are of non-convex follower structure in nature (a.k.a., without Lower-Level Convexity, LLC for short). However, this challenging class of BLOs lacks both efficient solution strategies and solid theoretical guarantees. In this work, we propose a new algorithmic framework, named Initialization Auxiliary and Pessimistic Trajectory Truncated Gradient Method (IAPTT-GM), to partially address the above issues. In particular, by introducing an auxiliary as initialization to guide the optimization dynamics and designing a pessimistic trajectory truncation operation, we construct a reliable approximate version of the original BLO in the absence of the LLC hypothesis. Our theoretical investigations establish the convergence of solutions returned by IAPTT-GM towards those of the original BLO without LLC. As an additional bonus, we also theoretically justify the quality of our IAPTT-GM embedded with Nesterov's accelerated dynamics under LLC. The experimental results confirm both the convergence of our algorithm without LLC, and the theoretical findings under LLC.
    Beyond Low-Pass Filters: Adaptive Feature Propagation on Graphs. (arXiv:2103.14187v5 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNNs) have been extensively studied for prediction tasks on graphs. As pointed out by recent studies, most GNNs assume local homophily, i.e., strong similarities in local neighborhoods. This assumption however limits the generalizability power of GNNs. To address this limitation, we propose a flexible GNN model, which is capable of handling any graphs without being restricted by their underlying homophily. At its core, this model adopts a node attention mechanism based on multiple learnable spectral filters; therefore, the aggregation scheme is learned adaptively for each graph in the spectral domain. We evaluated the proposed model on node classification tasks over eight benchmark datasets. The proposed model is shown to generalize well to both homophilic and heterophilic graphs. Further, it outperforms all state-of-the-art baselines on heterophilic graphs and performs comparably with them on homophilic graphs.
    Score-Based Generative Classifiers. (arXiv:2110.00473v1 [stat.ML])
    (2 min) The tremendous success of generative models in recent years raises the question whether they can also be used to perform classification. Generative models have been used as adversarially robust classifiers on simple datasets such as MNIST, but this robustness has not been observed on more complex datasets like CIFAR-10. Additionally, on natural image datasets, previous results have suggested a trade-off between the likelihood of the data and classification accuracy. In this work, we investigate score-based generative models as classifiers for natural images. We show that these models not only obtain competitive likelihood values but simultaneously achieve state-of-the-art classification accuracy for generative classifiers on CIFAR-10. Nevertheless, we find that these models are only slightly, if at all, more robust than discriminative baseline models on out-of-distribution tasks based on common image corruptions. Similarly, and contrary to prior results, we find that score-based classifiers are prone to worst-case distribution shifts in the form of adversarial perturbations. Our work highlights that score-based generative models are closing the gap in classification accuracy compared to standard discriminative models. While they do not yet deliver on the promise of adversarial and out-of-domain robustness, they provide a different approach to classification that warrants further research.
    Content-Preserving Unpaired Translation from Simulated to Realistic Ultrasound Images. (arXiv:2103.05745v2 [eess.IV] UPDATED)
    (2 min) Interactive simulation of ultrasound imaging greatly facilitates sonography training. Although ray-tracing based methods have shown promising results, obtaining realistic images requires substantial modeling effort and manual parameter tuning. In addition, current techniques still result in a significant appearance gap between simulated images and real clinical scans. Herein we introduce a novel content-preserving image translation framework (ConPres) to bridge this appearance gap, while maintaining the simulated anatomical layout. We achieve this goal by leveraging both simulated images with semantic segmentations and unpaired in-vivo ultrasound scans. Our framework is based on recent contrastive unpaired translation techniques and we propose a regularization approach by learning an auxiliary segmentation-to-real image translation task, which encourages the disentanglement of content and style. In addition, we extend the generator to be class-conditional, which enables the incorporation of additional losses, in particular a cyclic consistency loss, to further improve the translation quality. Qualitative and quantitative comparisons against state-of-the-art unpaired translation methods demonstrate the superiority of our proposed framework.
    Revisiting the Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning. (arXiv:2102.03479v15 [cs.LG] UPDATED)
    (3 min) QMIX, a popular MARL algorithm based on the monotonicity constraint, has been used as a baseline for benchmark environments such as the StarCraft Multi-Agent Challenge (SMAC) and Predator-Prey (PP). Recent variants of QMIX aim to relax its monotonicity constraint to improve expressive power, allowing for performance improvements in SMAC. However, we find that such performance improvements of the variants are significantly affected by various implementation tricks. In this paper, we revisit the monotonicity constraint of QMIX: (1) we design a novel model, RMC, to further investigate the monotonicity constraint; the results show that the monotonicity constraint can improve sample efficiency in some purely cooperative tasks; (2) we then re-evaluate the performance of QMIX and these variants by a grid hyperparameter search over the tricks; the results show that QMIX achieves the best performance among them, attaining SOTA performance on SMAC and PP; (3) we analyze the monotonic mixing network from a theoretical perspective and show that it can represent any task that can be interpreted as purely cooperative. These analyses demonstrate that relaxing the monotonicity constraint of the mixing network will not always improve the performance of QMIX, which challenges previous impressions of the monotonicity constraint. We open-source the code at \url{https://github.com/hijkzzz/pymarl2}.
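    A minimal PyTorch sketch of the monotonic mixing network at the heart of this constraint (non-negative mixing weights produced by hypernetworks); layer sizes and names are illustrative, and this is not the full QMIX algorithm:

        # QMIX-style monotonic mixer (simplified): taking the absolute value of the
        # hypernetwork-generated weights makes Q_tot monotone in each agent's Q-value.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MonotonicMixer(nn.Module):
            def __init__(self, n_agents, state_dim, embed_dim=32):
                super().__init__()
                self.n_agents, self.embed_dim = n_agents, embed_dim
                self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
                self.hyper_b1 = nn.Linear(state_dim, embed_dim)
                self.hyper_w2 = nn.Linear(state_dim, embed_dim)
                self.hyper_b2 = nn.Linear(state_dim, 1)

            def forward(self, agent_qs, state):            # agent_qs: (B, n_agents)
                w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
                b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
                hidden = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)      # (B, 1, embed_dim)
                w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
                b2 = self.hyper_b2(state).view(-1, 1, 1)
                return (hidden @ w2 + b2).view(-1)                   # joint Q_tot: (B,)

        mixer = MonotonicMixer(n_agents=3, state_dim=10)
        print(mixer(torch.randn(4, 3), torch.randn(4, 10)).shape)    # torch.Size([4])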
    When in Doubt: Neural Non-Parametric Uncertainty Quantification for Epidemic Forecasting. (arXiv:2106.03904v2 [cs.LG] UPDATED)
    (2 min) Accurate and trustworthy epidemic forecasting is an important problem that has impact on public health planning and disease mitigation. Most existing epidemic forecasting models disregard uncertainty quantification, resulting in mis-calibrated predictions. Recent works in deep neural models for uncertainty-aware time-series forecasting also have several limitations; e.g., it is difficult to specify meaningful priors in Bayesian NNs, while methods like deep ensembling are computationally expensive in practice. In this paper, we fill this important gap. We model the forecasting task as a probabilistic generative process and propose a functional neural process model called EPIFNP, which directly models the probability density of the forecast value. EPIFNP leverages a dynamic stochastic correlation graph to model the correlations between sequences in a non-parametric way, and designs different stochastic latent variables to capture functional uncertainty from different perspectives. Our extensive experiments in a real-time flu forecasting setting show that EPIFNP significantly outperforms previous state-of-the-art models in both accuracy and calibration metrics, up to 2.5x in accuracy and 2.4x in calibration. Additionally, due to properties of its generative process, EPIFNP learns the relations between the current season and similar patterns of historical seasons, enabling interpretable forecasts. Beyond epidemic forecasting, EPIFNP can be of independent interest for advancing principled uncertainty quantification in deep sequential models for predictive analytics.
    Model-free inference of unseen attractors: Reconstructing phase space features from a single noisy trajectory using reservoir computing. (arXiv:2108.04074v2 [cs.LG] UPDATED)
    (2 min) Reservoir computers are powerful tools for chaotic time series prediction. They can be trained to approximate phase space flows and can thus both predict future values to high accuracy and reconstruct the general properties of a chaotic attractor without requiring a model. In this work, we show that the ability to learn the dynamics of a complex system can be extended to systems with co-existing attractors, here a 4-dimensional extension of the well-known Lorenz chaotic system. We demonstrate that a reservoir computer can infer entirely unexplored parts of the phase space: a properly trained reservoir computer can predict the existence of attractors that were never approached during training and are therefore labelled as unseen. We provide examples where attractor inference is achieved after training solely on a single noisy trajectory.
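    A minimal echo-state-network sketch of the reservoir-computing setup (fixed random reservoir, trained linear readout); the sine input and all hyperparameters are assumptions for illustration, not the 4-dimensional Lorenz system studied in the paper:

        # Echo state network: only the linear readout is trained (ridge regression);
        # the reservoir itself is a fixed random recurrent network.
        import numpy as np

        rng = np.random.default_rng(0)
        N, T = 200, 1000                                   # reservoir size, series length
        u = np.sin(0.1 * np.arange(T + 1)) + 0.01 * rng.normal(size=T + 1)

        W_in = rng.uniform(-0.5, 0.5, size=N)
        W = rng.normal(size=(N, N))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # keep spectral radius below 1

        states, x = np.zeros((T, N)), np.zeros(N)
        for t in range(T):
            x = np.tanh(W @ x + W_in * u[t])
            states[t] = x

        washout = 100                                      # discard the initial transient
        A, y = states[washout:], u[washout + 1: T + 1]     # readout predicts the next value
        W_out = np.linalg.solve(A.T @ A + 1e-6 * np.eye(N), A.T @ y)
        print("one-step training error:", np.mean((A @ W_out - y) ** 2))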
    FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition. (arXiv:2105.03842v4 [cs.CL] UPDATED)
    (3 min) Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of a significantly increased ASR error rate. In this paper, observing distinctive error patterns and correction operations (i.e., insertion, deletion, and substitution) in ASR, we propose FastCorrect, a novel NAR error correction model based on edit alignment. In training, FastCorrect aligns each source token from an ASR output sentence to the target tokens from the corresponding ground-truth sentence based on the edit distance between the source and target sentences, and extracts the number of target tokens corresponding to each source token during editing/correction, which is then used to train a length predictor and to adjust the source tokens to match the length of the target sentence for parallel generation. In inference, the token number predicted by the length predictor is used to adjust the source tokens for target sequence generation. Experiments on the public AISHELL-1 dataset and an internal industrial-scale ASR dataset show the effectiveness of FastCorrect for ASR error correction: 1) it speeds up the inference by 6-9 times and maintains the accuracy (8-14% WER reduction) compared with the autoregressive correction model; and 2) it outperforms the popular NAR models adopted in neural machine translation and text editing by a large margin.
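    A simplified sketch of the edit-alignment step: a Levenshtein alignment between the ASR output and the reference, from which we count how many target tokens each source token covers (the signal used to train the length predictor). This is an illustrative variant, not the paper's exact procedure:

        # Align source (ASR output) to target (reference) tokens by edit distance and
        # count the number of target tokens covered by each source token.
        def align_counts(src, tgt):
            n, m = len(src), len(tgt)
            dp = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(n + 1):
                dp[i][0] = i
            for j in range(m + 1):
                dp[0][j] = j
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = 0 if src[i - 1] == tgt[j - 1] else 1
                    dp[i][j] = min(dp[i - 1][j] + 1,          # delete a source token
                                   dp[i][j - 1] + 1,          # insert a target token
                                   dp[i - 1][j - 1] + cost)   # keep or substitute
            counts = [0] * n
            i, j = n, m
            while i > 0 or j > 0:
                if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
                    counts[i - 1] += 1; i -= 1; j -= 1
                elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
                    counts[max(i - 1, 0)] += 1; j -= 1        # insertion: attach to the current source token
                else:
                    i -= 1                                    # deletion: source token covers zero targets
            return counts

        print(align_counts("I eat an apple".split(), "I ate an apple pie".split()))  # [1, 1, 1, 2]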
    Large-scale ASR Domain Adaptation by Self- and Semi-supervised Learning. (arXiv:2110.00165v1 [cs.LG])
    (0 min) Self- and semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance model performance. However, these approaches mostly focus on in-domain performance for public datasets. In this study, we utilize a combination of self- and semi-supervised learning methods to solve the unseen-domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source domain data with a small fraction of the target domain data (3%) can recover the performance gap compared to a full data baseline: a relative 13.5% WER improvement on target domain data.
    The Complexity of Learning Approval-Based Multiwinner Voting Rules. (arXiv:2110.00254v1 [cs.GT])
    (2 min) We study the PAC learnability of multiwinner voting, focusing on the class of approval-based committee scoring (ABCS) rules. These are voting rules applied on profiles with approval ballots, where each voter approves some of the candidates. ABCS rules adapt positional scoring rules in single-winner voting by assuming that each committee of $k$ candidates collects from each voter a score that depends on the size of the voter's ballot and on the size of its intersection with the committee. Then, committees of maximum score are the winning ones. Our goal is to learn a target rule (i.e., to learn the corresponding scoring function) using information about the winning committees of a small number of sampled profiles. Despite the existence of exponentially many outcomes compared to single-winner elections, we show that the sample complexity is still low: a polynomial number of samples carries enough information for learning the target rule with high confidence and accuracy. Unfortunately, even simple tasks that need to be solved for learning from these samples are intractable. We prove that deciding whether there exists some ABCS rule that makes a given committee winning in a given profile is a computationally hard problem. Our results extend to the class of sequential Thiele rules, which have received attention due to their simplicity.
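    A small brute-force sketch of an ABCS rule: every size-k committee collects f(|ballot|, |ballot ∩ committee|) from each voter and maximum-score committees win; the scoring function below (plain approval voting) is just one illustrative instance of f:

        # Brute-force ABCS evaluation: enumerate all size-k committees and score them.
        from itertools import combinations

        def winning_committees(ballots, candidates, k, f):
            best, winners = None, []
            for committee in combinations(sorted(candidates), k):
                score = sum(f(len(b), len(b & set(committee))) for b in ballots)
                if best is None or score > best:
                    best, winners = score, [committee]
                elif score == best:
                    winners.append(committee)
            return best, winners

        ballots = [{"a", "b"}, {"a", "c"}, {"b"}, {"a"}]
        candidates = {"a", "b", "c", "d"}
        approval = lambda ballot_size, overlap: overlap   # f(|ballot|, |ballot ∩ committee|)
        print(winning_committees(ballots, candidates, k=2, f=approval))   # (5, [('a', 'b')])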
    Solving the scalarization issues of Advantage-based Reinforcement Learning Algorithms. (arXiv:2004.04120v4 [cs.LG] UPDATED)
    (0 min) In this research, some of the issues that arise from the scalarization of the multi-objective optimization problem in the Advantage Actor Critic (A2C) reinforcement learning algorithm are investigated. The paper shows how a naive scalarization can lead to overlapping gradients. Furthermore, the possibility that the entropy regularization term can be a source of uncontrolled noise is discussed. With respect to the above issues, a technique to avoid gradient overlapping is proposed, while keeping the same loss formulation. Moreover, a method to avoid the uncontrolled noise, by sampling the actions from distributions with a desired minimum entropy, is investigated. Pilot experiments have been carried out to show how the proposed method speeds up the training. The proposed approach can be applied to any Advantage-based Reinforcement Learning algorithm.
    Generative Compositional Augmentations for Scene Graph Prediction. (arXiv:2007.05756v3 [cs.CV] UPDATED)
    (2 min) Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language. We consider a challenging problem of compositional generalization that emerges in this task due to a long tail data distribution. Current scene graph generation models are trained on a tiny fraction of the distribution corresponding to the most frequent compositions. However, test images might contain zero- and few-shot compositions of objects and relationships. Despite each of the object categories and the predicate (e.g. 'on') being frequent in the training data, the models often fail to properly understand such unseen or rare compositions. To improve generalization, it is natural to attempt increasing the diversity of the training distribution. However, in the graph domain this is non-trivial. To that end, we propose a method to synthesize rare yet plausible scene graphs by perturbing real ones. We then propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs and learn from them in a joint fashion. When evaluated on the Visual Genome dataset, our approach yields marginal, but consistent improvements in zero- and few-shot metrics. We analyze the limitations of our approach indicating promising directions for future research.
    SAM: A Self-adaptive Attention Module for Context-Aware Recommendation System. (arXiv:2110.00452v1 [cs.IR])
    (2 min) Recently, textual information has been shown to play a positive role in recommendation systems. However, most of the existing methods only focus on representation learning of textual information in ratings, while the potential selection bias induced by the textual information is ignored. In this work, we propose a novel and general self-adaptive module, the Self-adaptive Attention Module (SAM), which adjusts the selection bias by capturing contextual information based on its representation. This module can be embedded into recommendation systems that contain learning components of contextual information. Experimental results on three real-world datasets demonstrate the effectiveness of our proposal, and the state-of-the-art models with SAM significantly outperform the original ones.
    Regularized linear autoencoders recover the principal components, eventually. (arXiv:2007.06731v2 [cs.LG] UPDATED)
    (2 min) Our understanding of learning input-output relationships with neural nets has improved rapidly in recent years, but little is known about the convergence of the underlying representations, even in the simple case of linear autoencoders (LAEs). We show that when trained with proper regularization, LAEs can directly learn the optimal representation -- ordered, axis-aligned principal components. We analyze two such regularization schemes: non-uniform $\ell_2$ regularization and a deterministic variant of nested dropout [Rippel et al., ICML 2014]. Though both regularization schemes converge to the optimal representation, we show that this convergence is slow due to ill-conditioning that worsens with increasing latent dimension. We show that the inefficiency of learning the optimal representation is not inevitable -- we present a simple modification to the gradient descent update that greatly speeds up convergence empirically.
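    A small sketch of the non-uniform ℓ2 idea: each latent dimension gets its own penalty weight, which breaks rotational symmetry and nudges a linear autoencoder toward ordered, axis-aligned components; the data, penalty values, and learning rate are illustrative assumptions:

        # Linear autoencoder with per-dimension (non-uniform) l2 penalties lambda_i.
        import numpy as np

        rng = np.random.default_rng(0)
        n, d, k = 500, 10, 3
        X = rng.normal(size=(n, 5)) @ rng.normal(size=(5, d))       # roughly low-rank data

        W_e = 0.01 * rng.normal(size=(d, k))                        # encoder
        W_d = 0.01 * rng.normal(size=(k, d))                        # decoder
        lam = np.array([0.1, 0.2, 0.3])                             # increasing per-dimension penalties
        lr = 1e-3

        for _ in range(2000):
            Z = X @ W_e
            R = Z @ W_d - X                                         # reconstruction residual
            grad_d = Z.T @ R / n + lam[:, None] * W_d               # penalty on decoder rows
            grad_e = X.T @ (R @ W_d.T) / n + W_e * lam[None, :]     # penalty on encoder columns
            W_d -= lr * grad_d
            W_e -= lr * grad_e

        print("encoder column norms (ordered by penalty):", np.linalg.norm(W_e, axis=0))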
    Really Useful Synthetic Data -- A Framework to Evaluate the Quality of Differentially Private Synthetic Data. (arXiv:2004.07740v2 [stat.ML] UPDATED)
    (2 min) Recent advances in generating synthetic data that allow adding principled ways of protecting privacy -- such as Differential Privacy -- are a crucial step in sharing statistical information in a privacy preserving way. But while the focus has been on privacy guarantees, the resulting private synthetic data is only useful if it still carries statistical information from the original data. To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter. What is it that data analysts want? Acknowledging that data quality is a subjective concept, we develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective. Data quality can be measured along two dimensions. First, the quality of synthetic data can be evaluated against training data or against an underlying population. Second, the quality of synthetic data depends on general similarity of distributions or on specific tasks such as inference or prediction. It is clear that accommodating all goals at once is a formidable challenge. We invite the academic community to jointly advance the privacy-quality frontier.
    Learning to Elect. (arXiv:2108.02768v3 [cs.LG] UPDATED)
    (2 min) Voting systems have a wide range of applications including recommender systems, web search, product design and elections. Limited by the lack of general-purpose analytical tools, it is difficult to hand-engineer desirable voting rules for each use case. For this reason, it is appealing to automatically discover voting rules geared towards each scenario. In this paper, we show that set-input neural network architectures such as Set Transformers, fully-connected graph networks and DeepSets are both theoretically and empirically well-suited for learning voting rules. In particular, we show that these network models can not only mimic a number of existing voting rules to compelling accuracy -- both position-based (such as Plurality and Borda) and comparison-based (such as Kemeny, Copeland and Maximin) -- but also discover near-optimal voting rules that maximize different social welfare functions. Furthermore, the learned voting rules generalize well to different voter utility distributions and election sizes unseen during training.
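    A minimal DeepSets-style sketch of the set-input architecture used for learning voting rules: a shared encoder per ballot, a sum over voters for permutation invariance, and a decoder over candidates. The ballot encoding, sizes, and the training objective are assumptions for illustration:

        # DeepSets for voting: phi encodes each ballot, summation pools over voters,
        # rho maps the pooled representation to logits over candidates.
        import torch
        import torch.nn as nn

        class DeepSetVoting(nn.Module):
            def __init__(self, ballot_dim, n_candidates, hidden=64):
                super().__init__()
                self.phi = nn.Sequential(nn.Linear(ballot_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden))
                self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                         nn.Linear(hidden, n_candidates))

            def forward(self, ballots):                  # ballots: (batch, n_voters, ballot_dim)
                pooled = self.phi(ballots).sum(dim=1)    # permutation-invariant pooling over voters
                return self.rho(pooled)                  # logits over candidates

        model = DeepSetVoting(ballot_dim=5, n_candidates=5)
        ballots = torch.randn(8, 100, 5)                 # e.g. 8 elections, 100 voters each
        print(model(ballots).shape)                      # torch.Size([8, 5])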
    Snowflake: Scaling GNNs to High-Dimensional Continuous Control via Parameter Freezing. (arXiv:2103.01009v2 [cs.LG] UPDATED)
    (2 min) Recent research has shown that graph neural networks (GNNs) can learn policies for locomotion control that are as effective as a typical multi-layer perceptron (MLP), with superior transfer and multi-task performance (Wang et al., 2018; Huang et al., 2020). Results have so far been limited to training on small agents, with the performance of GNNs deteriorating rapidly as the number of sensors and actuators grows. A key motivation for the use of GNNs in the supervised learning setting is their applicability to large graphs, but this benefit has not yet been realised for locomotion control. We identify a weakness in a common GNN architecture that causes this poor scaling: overfitting in the MLPs within the network that encode, decode, and propagate messages. To combat this, we introduce Snowflake, a GNN training method for high-dimensional continuous control that freezes parameters in parts of the network that suffer from overfitting. Snowflake significantly boosts the performance of GNNs for locomotion control on large agents, now matching the performance of MLPs, and with superior transfer properties.
    Vision Xformers: Efficient Attention for Image Classification. (arXiv:2107.02239v4 [cs.CV] UPDATED)
    (2 min) Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computations in order to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Moreover, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures that address these shortcomings. Firstly, we alleviate the quadratic bottleneck by using linear attention mechanisms, called X-formers (where X is one of Performer, Linformer, or Nyströmformer), thereby creating Vision X-formers (ViXs). This resulted in up to a seven times reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Secondly, we introduced an inductive bias for images by replacing the initial linear embedding layer with convolutional layers in ViX, which significantly increased classification accuracy without increasing the model size. Thirdly, we replaced the learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE), which increases the classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.
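    A small sketch of the rotary position embedding (RoPE) applied to query/key vectors, where each pair of feature dimensions is rotated by a position-dependent angle; the base frequency of 10000 follows the common convention and the shapes are illustrative:

        # Rotary position embedding: rotate dimension pairs by angles proportional to
        # the token position, so relative position enters through dot products.
        import numpy as np

        def rope(x):                                                # x: (seq_len, dim), dim even
            seq_len, dim = x.shape
            pos = np.arange(seq_len)[:, None]
            freqs = 10000.0 ** (-np.arange(0, dim, 2) / dim)        # (dim/2,)
            angles = pos * freqs[None, :]                           # (seq_len, dim/2)
            cos, sin = np.cos(angles), np.sin(angles)
            x1, x2 = x[:, 0::2], x[:, 1::2]
            out = np.empty_like(x)
            out[:, 0::2] = x1 * cos - x2 * sin
            out[:, 1::2] = x1 * sin + x2 * cos
            return out

        q = np.random.default_rng(0).normal(size=(16, 8))           # 16 patch tokens, head dim 8
        print(rope(q).shape)                                        # (16, 8)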
    Human-Inspired Multi-Agent Navigation using Knowledge Distillation. (arXiv:2103.10000v4 [cs.RO] UPDATED)
    (2 min) Despite significant advancements in the field of multi-agent navigation, agents still lack the sophistication and intelligence that humans exhibit in multi-agent settings. In this paper, we propose a framework for learning a human-like general collision avoidance policy for agent-agent interactions in fully decentralized, multi-agent environments. Our approach uses knowledge distillation with reinforcement learning to shape the reward function based on expert policies extracted from human trajectory demonstrations through behavior cloning. We show that agents trained with our approach can take human-like trajectories in collision avoidance and goal-directed steering tasks not provided by the demonstrations, outperforming the experts as well as learning-based agents trained without knowledge distillation.
    Open-set Classification of Common Waveforms Using A Deep Feed-forward Network and Binary Isolation Forest Models. (arXiv:2110.00252v1 [cs.LG])
    (2 min) In this paper, we examine the use of a deep multi-layer perceptron architecture to classify received signals as one of seven common waveforms -- single carrier (SC), single-carrier frequency division multiple access (SC-FDMA), orthogonal frequency division multiplexing (OFDM), linear frequency modulation (LFM), amplitude modulation (AM), frequency modulation (FM), and phase-coded pulse modulation -- used in communication and radar networks. Synchronization of the signals is not needed, as we assume there is an unknown and uncompensated time and frequency offset. The classifier is open-set, meaning it assumes unknown waveforms may appear. Isolation forest (IF) models acting as binary classifiers are used for each known signal class to detect possible unknown signals. This is accomplished using the 32-length feature vector from a dense layer as input to the IF models. The classifier and IF models work together to monitor the spectrum and identify waveforms along with detecting unknown waveforms. Results showed the classifier achieved a 100% classification rate above 0 dB, with accuracies of 83.2% and 94.7% at -10 dB and -5 dB, respectively, with signal impairments present. Results for the IF models showed an overall accuracy of 98% when detecting known and unknown signals with signal impairments present. IF models were able to reject all unknown signals, while signals similar to known signals passed through 2% of the time due to the contamination rate used during training. Overall, the entire system can classify correctly in an open-set mode with 98% accuracy at SNR greater than 0 dB.
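    A sketch of the open-set stage using scikit-learn: one IsolationForest per known class is fitted on that class's feature vectors, and at test time the forest of the predicted class decides known vs. unknown. The random stand-in features and the contamination value are assumptions:

        # Per-class IsolationForest rejection of unknown waveforms (illustrative features).
        import numpy as np
        from sklearn.ensemble import IsolationForest

        rng = np.random.default_rng(0)
        n_classes, feat_dim = 7, 32
        features = {c: rng.normal(loc=c, scale=1.0, size=(200, feat_dim)) for c in range(n_classes)}

        forests = {c: IsolationForest(contamination=0.02, random_state=0).fit(X)
                   for c, X in features.items()}

        def is_known(feature_vec, predicted_class):
            # IsolationForest.predict returns +1 for inliers and -1 for outliers.
            return forests[predicted_class].predict(feature_vec.reshape(1, -1))[0] == 1

        known_sample = features[3][0]
        unknown_sample = rng.normal(loc=50.0, size=feat_dim)        # far from every known class
        print(is_known(known_sample, 3), is_known(unknown_sample, 3))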
    Mean-Field Controls with Q-learning for Cooperative MARL: Convergence and Complexity Analysis. (arXiv:2002.04131v6 [cs.LG] UPDATED)
    (2 min) Multi-agent reinforcement learning (MARL), despite its popularity and empirical success, suffers from the curse of dimensionality. This paper builds the mathematical framework to approximate cooperative MARL by a mean-field control (MFC) approach, and shows that the approximation error is of $\mathcal{O}(\frac{1}{\sqrt{N}})$. By establishing an appropriate form of the dynamic programming principle for both the value function and the Q function, it proposes a model-free kernel-based Q-learning algorithm (MFC-K-Q), which is shown to have a linear convergence rate for the MFC problem, the first of its kind in the MARL literature. It further establishes that the convergence rate and the sample complexity of MFC-K-Q are independent of the number of agents $N$, which provides an $\mathcal{O}(\frac{1}{\sqrt{N}})$ approximation to the MARL problem with $N$ agents in the learning environment. Empirical studies for the network traffic congestion problem demonstrate that MFC-K-Q outperforms existing MARL algorithms when $N$ is large, for instance when $N>50$.
    Attention-based Clinical Note Summarization. (arXiv:2104.08942v2 [cs.CL] UPDATED)
    (0 min) The trend of deploying digital systems in numerous industries has induced a hike in recording digital information. The health sector has observed an extensive adoption of digital devices and systems that generate large volumes of personal medical records. Electronic health records contain valuable information for retrospective and prospective analysis that is often not entirely exploited because of the dense information storage. The core purpose of condensing health records is to select the information that best characterizes the original documents based on the reported disease. These summaries may boost diagnosis and extend a doctor's time with the patient during a high workload situation like the COVID-19 pandemic. In this paper, we propose applying a multi-head attention-based mechanism to perform extractive summarization of meaningful phrases in clinical notes. This method finds major sentences for a summary by correlating tokens, segments, and positional embeddings. The model outputs attention scores that are statistically transformed to extract key phrases and can be projected onto a heat-mapping tool for visual inspection and human use.
    Exploring the social influence of Kaggle virtual community on the M5 competition. (arXiv:2103.00501v3 [cs.SI] UPDATED)
    (2 min) One of the most significant differences between M5 and previous forecasting competitions is that it was held on Kaggle, an online platform of data scientists and machine learning practitioners. Kaggle provides a gathering place, or virtual community, for web users who are interested in the M5 competition. Users can share code, models, features, loss functions, etc. through online notebooks and discussion forums. This paper aims to study the social influence of the virtual community on user behaviors in the M5 competition. We first study the content of the M5 virtual community by topic modeling and trend analysis. Further, we perform social media analysis to identify the potential relationship network of the virtual community. We study the roles and characteristics of some key participants that promote the diffusion of information within the M5 virtual community. Overall, this study provides in-depth insights into the mechanism of the virtual community's influence on the participants and has potential implications for future online competitions.
    Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective. (arXiv:2106.00092v2 [cs.LG] UPDATED)
    (2 min) Accelerated gradient-based methods are being extensively used for solving non-convex machine learning problems, especially when the data points are abundant or the available data is distributed across several agents. Two of the prominent accelerated gradient algorithms are AdaGrad and Adam. AdaGrad is the simplest accelerated gradient method, which is particularly effective for sparse data. Adam has been shown to perform favorably in deep learning problems compared to other methods. In this paper, we propose a new fast optimizer, Generalized AdaGrad (G-AdaGrad), for accelerating the solution of potentially non-convex machine learning problems. Specifically, we adopt a state-space perspective for analyzing the convergence of gradient acceleration algorithms, namely G-AdaGrad and Adam, in machine learning. Our proposed state-space models are governed by ordinary differential equations. We present simple convergence proofs of these two algorithms in the deterministic setting with minimal assumptions. Our analysis also provides intuition behind improving upon AdaGrad's convergence rate. We provide empirical results on the MNIST dataset to reinforce our claims on the convergence and performance of G-AdaGrad and Adam.
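    For reference, a short sketch of the standard AdaGrad and Adam updates that the analysis builds on (the G-AdaGrad generalization itself is not reproduced here); the toy objective and step sizes are illustrative:

        # Standard AdaGrad and Adam updates, applied to f(x) = ||x||^2.
        import numpy as np

        def adagrad_step(x, g, state, lr=0.1, eps=1e-8):
            state["G"] += g ** 2                                    # accumulate squared gradients
            return x - lr * g / (np.sqrt(state["G"]) + eps)

        def adam_step(x, g, state, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
            state["m"] = b1 * state["m"] + (1 - b1) * g             # first-moment estimate
            state["v"] = b2 * state["v"] + (1 - b2) * g ** 2        # second-moment estimate
            m_hat = state["m"] / (1 - b1 ** t)                      # bias correction
            v_hat = state["v"] / (1 - b2 ** t)
            return x - lr * m_hat / (np.sqrt(v_hat) + eps)

        x_a, s_a = np.ones(3), {"G": np.zeros(3)}
        x_m, s_m = np.ones(3), {"m": np.zeros(3), "v": np.zeros(3)}
        for t in range(1, 501):
            x_a = adagrad_step(x_a, 2 * x_a, s_a)                   # gradient of ||x||^2 is 2x
            x_m = adam_step(x_m, 2 * x_m, s_m, t)
        print(np.linalg.norm(x_a), np.linalg.norm(x_m))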
    Predicting COVID-19 Patient Shielding: A Comprehensive Study. (arXiv:2110.00183v1 [cs.LG])
    (0 min) There are many ways machine learning and big data analytics are used in the fight against the COVID-19 pandemic, including predictions, risk management, diagnostics, and prevention. This study focuses on predicting COVID-19 patient shielding -- identifying and protecting patients who are clinically extremely vulnerable to coronavirus -- using techniques for the multi-label classification of medical text. Using the information published by the United Kingdom NHS and the World Health Organisation, we present a novel approach to predicting COVID-19 patient shielding as a multi-label classification problem. We use publicly available, de-identified ICU medical text data for our experiments. The labels are derived from the published COVID-19 patient shielding data. We present an extensive comparison across 12 multi-label classifiers, from simple binary relevance to neural networks and the most recent transformers. To the best of our knowledge, this is the first comprehensive study where such a range of multi-label classifiers for medical text is considered. We highlight the benefits of various approaches, and argue that, for the task at hand, both predictive accuracy and processing time are essential.
    Rapid Assessments of Light-Duty Gasoline Vehicle Emissions Using On-Road Remote Sensing and Machine Learning. (arXiv:2110.00260v1 [cs.LG])
    (0 min) In-time and accurate assessments of on-road vehicle emissions play a central role in urban air quality and health policymaking. However, official insight is hampered by the Inspection/Maintenance (I/M) procedure conducted in the laboratory annually. It not only has a large gap to real-world situations (e.g., meteorological conditions) but also is incapable of regular supervision. Here we build a unique dataset including 103831 light-duty gasoline vehicles, in which on-road remote sensing (ORRS) measurements are linked to the I/M records based on the vehicle identification numbers and license plates. On this basis, we develop an ensemble model framework that integrates three machine learning algorithms, including neural network (NN), extreme gradient boosting (XGBoost), and random forest (RF). We demonstrate that this ensemble model can rapidly assess the vehicle-specific emissions (i.e., CO, HC, and NO). In particular, the model performs quite well for the passing vehicles under normal conditions (i.e., lower VSP (< 18 kW/t), temperature (6 ~ 32 {\deg}C), relative humidity (< 80%), and wind speed (< 5 m/s)). Together with the current emission standard, we identify a large number of the dirty (2.33%) or clean (74.92%) vehicles in the real world. Our results show that the ORRS measurements, assisted by the machine-learning-based ensemble model developed here, can realize day-to-day supervision of on-road vehicle-specific emissions. This framework provides a valuable opportunity to reform the I/M procedures globally and substantially mitigate urban air pollution.
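    A rough sketch of the ensemble idea with scikit-learn stand-ins for the NN/XGBoost/RF components (predictions averaged); the synthetic features and target are assumptions and bear no relation to the real ORRS/I/M data:

        # Average the predictions of three regressors as a stand-in for the ensemble.
        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 6))                               # e.g. speed, VSP, temperature, humidity, ...
        y = 2 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)   # synthetic emission value

        models = [MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
                  GradientBoostingRegressor(random_state=0),
                  RandomForestRegressor(random_state=0)]
        for m in models:
            m.fit(X[:400], y[:400])

        ensemble_pred = np.mean([m.predict(X[400:]) for m in models], axis=0)
        print("ensemble RMSE:", np.sqrt(np.mean((ensemble_pred - y[400:]) ** 2)))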
    AEGD: Adaptive Gradient Descent with Energy. (arXiv:2010.05109v2 [math.OC] UPDATED)
    (0 min) We propose AEGD, a new algorithm for first-order gradient-based optimization of non-convex objective functions, based on a dynamically updated energy variable. The method is shown to be unconditionally energy stable, irrespective of the step size. We prove energy-dependent convergence rates of AEGD for both non-convex and convex objectives, which for a suitably small step size recover the desired convergence rates for batch gradient descent. We also provide an energy-dependent bound on the stationary convergence of AEGD in the stochastic non-convex setting. The method is straightforward to implement and requires little tuning of hyper-parameters. Experimental results demonstrate that AEGD works well for a large variety of optimization problems: it is robust with respect to initial data and capable of making rapid initial progress. The stochastic AEGD shows comparable and often better generalization performance than SGD with momentum for deep neural networks.
    Online Primal-Dual Algorithms with Predictions for Packing Problems. (arXiv:2110.00391v1 [cs.DS])
    (0 min) The domain of online algorithms with predictions has been extensively studied for different applications such as scheduling, caching (paging), clustering, ski rental, etc. Recently, Bamas et al., aiming for a unified method, provided a primal-dual framework for linear covering problems. They extended the online primal-dual method by incorporating predictions in order to achieve a performance beyond worst-case analysis. In this paper, we continue this line of research and present a framework to design algorithms with predictions for non-linear packing problems. We illustrate the applicability of our framework in submodular maximization, and in particular ad-auction maximization, for which the optimal bound is given and supporting experiments are provided.
    Hybrid Model for Anomaly Detection on Call Detail Records by Time Series Forecasting. (arXiv:2006.04101v2 [cs.LG] UPDATED)
    (0 min) Mobile network operators store an enormous amount of information, such as log files that describe various events and users' activities. Analysis of these logs might be used in many critical applications such as detecting cyber-attacks, finding behavioral patterns of users, security incident response, network forensics, etc. In a cellular network, Call Detail Records (CDR) are one type of such logs, containing metadata of calls and usually including valuable information about a call such as the phone numbers of originating and receiving subscribers, call duration, the area of activity, type of call (SMS or voice call) and a timestamp. With anomaly detection, it is possible to determine abnormal reductions or increases in network traffic in an area or for a particular person. This paper's primary goal is to study subscribers' behavior in a cellular network, mainly predicting the number of calls in a region and detecting anomalies in the network traffic. In this paper, a new hybrid method is proposed based on various anomaly detection methods such as GARCH, K-means, and Neural Network to determine the anomalous data. Moreover, we discuss the possible causes of such anomalies.
    A survey on active noise control techniques -- Part I: Linear systems. (arXiv:2110.00531v1 [eess.SP])
    (0 min) Active noise control (ANC) is an effective way for reducing the noise level in electroacoustic or electromechanical systems. Since its first introduction in 1936, this approach has been greatly developed. This paper focuses on discussing the development of ANC techniques over the past decade. Linear ANC algorithms, including the celebrated filtered-x least-mean-square (FxLMS)-based algorithms and distributed ANC algorithms, are investigated and evaluated. Nonlinear ANC (NLANC) techniques, such as functional link artificial neural network (FLANN)-based algorithms, are pursued in Part II. Furthermore, some novel methods and applications of ANC emerging in the past decade are summarized. Finally, future research challenges regarding the ANC technique are discussed.
    Update in Unit Gradient. (arXiv:2110.00199v1 [cs.LG])
    (0 min) In machine learning, optimization has mostly been done using gradient descent to find the minimum value of the loss. However, especially in deep learning, finding a global minimum of a nonconvex loss function over a high-dimensional space is an extraordinarily difficult task. Recently, a generalization-oriented learning algorithm, Sharpness-Aware Minimization (SAM), has achieved great success on image classification tasks. Despite its strong performance, the proper update direction induced by SAM remains unclear. We therefore propose constructing a unit-vector space in SAM, which is grounded in basic linear algebra and retains the advantages of adaptive gradient algorithms. Moreover, applying SAM with unit gradients yields competitive performance on image classification datasets such as CIFAR-{10, 100}. Experiments show that it performs even better and is more robust than SAM.
    Collapsible Linear Blocks for Super-Efficient Super Resolution. (arXiv:2103.09404v2 [eess.IV] UPDATED)
    (2 min) With the advent of smart devices that support 4K and 8K resolution, Single Image Super Resolution (SISR) has become an important computer vision problem. However, most super resolution deep networks are computationally very expensive. In this paper, we propose SESR, a new class of Super-Efficient Super Resolution networks that significantly improve image quality and reduce computational complexity. Detailed experiments across six benchmark datasets demonstrate that SESR achieves similar or better image quality than state-of-the-art models while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. As a result, SESR can be used on constrained hardware to perform x2 (1080p to 4K) and x4 SISR (1080p to 8K). Towards this, we simulate hardware performance numbers for a commercial mobile Neural Processing Unit (NPU) for 1080p to 4K (x2) and 1080p to 8K (x4) SISR. Our results highlight the challenges faced by super resolution on AI accelerators and demonstrate that SESR is significantly faster than existing models. Overall, SESR establishes a new Pareto frontier on the quality (PSNR)-computation relationship for the super resolution task. The code for this work is available at https://github.com/ARM-software/sesr.
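    The core algebra behind collapsible linear blocks, shown here for two fully connected layers with no nonlinearity in between (the paper applies the same idea to over-parameterized convolutions): an expanded block used during training collapses exactly into a single cheap layer at inference:

        # Two stacked linear maps collapse exactly into one at inference time.
        import numpy as np

        rng = np.random.default_rng(0)
        W1, b1 = rng.normal(size=(64, 16)), rng.normal(size=64)     # expand 16 -> 64 (training-time block)
        W2, b2 = rng.normal(size=(16, 64)), rng.normal(size=16)     # project 64 -> 16

        x = rng.normal(size=16)
        y_train_time = W2 @ (W1 @ x + b1) + b2                      # two layers during training

        W = W2 @ W1                                                 # collapsed weight (16 x 16)
        b = W2 @ b1 + b2                                            # collapsed bias
        y_inference = W @ x + b                                     # single layer at inference

        print(np.allclose(y_train_time, y_inference))               # True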
    Reconstruction for Powerful Graph Representations. (arXiv:2110.00577v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have limited expressive power, failing to represent many graph classes correctly. While more expressive graph representation learning (GRL) alternatives can distinguish some of these classes, they are significantly harder to implement, may not scale well, and have not been shown to outperform well-tuned GNNs in real-world tasks. Thus, devising simple, scalable, and expressive GRL architectures that also achieve real-world improvements remains an open challenge. In this work, we show the extent to which graph reconstruction -- reconstructing a graph from its subgraphs -- can mitigate the theoretical and practical problems currently faced by GRL architectures. First, we leverage graph reconstruction to build two new classes of expressive graph representations. Secondly, we show how graph reconstruction boosts the expressive power of any GNN architecture while being a (provably) powerful inductive bias for invariances to vertex removals. Empirically, we show how reconstruction can boost GNN's expressive power -- while maintaining its invariance to permutations of the vertices -- by solving seven graph property tasks not solvable by the original GNN. Further, we demonstrate how it boosts state-of-the-art GNN's performance across nine real-world benchmark datasets.
    Learning Multi-Site Harmonization of Magnetic Resonance Images Without Traveling Human Phantoms. (arXiv:2110.00041v1 [eess.IV])
    (2 min) Harmonization improves data consistency and is central to effective integration of diverse imaging data acquired across multiple sites. Recent deep learning techniques for harmonization are predominantly supervised in nature and hence require imaging data of the same human subjects to be acquired at multiple sites. Data collection as such requires the human subjects to travel across sites and is hence challenging, costly, and impractical, more so when sufficient sample size is needed for reliable network training. Here we show how harmonization can be achieved with a deep neural network that does not rely on traveling human phantom data. Our method disentangles site-specific appearance information and site-invariant anatomical information from images acquired at multiple sites and then employs the disentangled information to generate the image of each subject for any target site. We demonstrate with more than 6,000 multi-site T1- and T2-weighted images that our method is remarkably effective in generating images with realistic site-specific appearances without altering anatomical details. Our method allows retrospective harmonization of data in a wide range of existing modern large-scale imaging studies, conducted via different scanners and protocols, without additional data collection.
    Neural Dependency Coding inspired Multimodal Fusion. (arXiv:2110.00385v1 [cs.NE])
    (2 min) Information integration from different modalities is an active area of research. Human beings and, in general, biological neural systems are quite adept at using a multitude of signals from different sensory perceptive fields to interact with the environment and each other. Recent work in deep fusion models via neural networks has led to substantial improvements over unimodal approaches in areas like speech recognition, emotion recognition and analysis, captioning and image description. However, such research has mostly focused on architectural changes allowing for fusion of different modalities while keeping the model complexity manageable. Inspired by recent neuroscience ideas about multisensory integration and processing, we investigate the effect of synergy-maximizing loss functions. Experiments on the multimodal sentiment analysis tasks CMU-MOSI and CMU-MOSEI with different models show that our approach provides a consistent performance boost.
    Physics and Equality Constrained Artificial Neural Networks: Application to Partial Differential Equations. (arXiv:2109.14860v1 [physics.comp-ph] CROSS LISTED)
    (2 min) Physics-informed neural networks (PINNs) have been proposed to learn the solution of partial differential equations (PDE). In PINNs, the residual form of the PDE of interest and its boundary conditions are lumped into a composite objective function as an unconstrained optimization problem, which is then used to train a deep feed-forward neural network. Here, we show that this specific way of formulating the objective function is the source of severe limitations in the PINN approach when applied to different kinds of PDEs. To address these limitations, we propose a versatile framework that can tackle both inverse and forward problems. The framework is adept at multi-fidelity data fusion and can seamlessly constrain the governing physics equations with proper initial and boundary conditions. The backbone of the proposed framework is a nonlinear, equality-constrained optimization problem formulation aimed at minimizing a loss functional, where an augmented Lagrangian method (ALM) is used to formally convert a constrained-optimization problem into an unconstrained-optimization problem. We implement the ALM within a stochastic, gradient-descent type training algorithm in a way that scrupulously focuses on meeting the constraints without sacrificing other loss terms. Additionally, as a modification of the original residual layers, we propose lean residual layers in our neural network architecture to address the so-called vanishing-gradient problem. We demonstrate the efficacy and versatility of our physics- and equality-constrained deep-learning framework by applying it to learn the solutions of various multi-dimensional PDEs, including a nonlinear inverse problem from the hydrology field with multi-fidelity data fusion. The results produced with our proposed model match exact solutions very closely for all the cases considered.
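    A minimal augmented-Lagrangian sketch on a toy equality-constrained problem, illustrating only the constrained-to-unconstrained conversion the framework relies on (not the PINN architecture or the paper's training loop); the objective, constraint, and step sizes are assumptions:

        # Augmented Lagrangian method: minimize f(x) s.t. g(x) = 0 by repeatedly
        # minimizing f + mu*g + (rho/2)*g^2 and updating the multiplier mu.
        import numpy as np

        f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2         # objective
        g = lambda x: x[0] + x[1] - 1.0                             # equality constraint g(x) = 0

        def grad_aug_lagrangian(x, mu, rho):
            df = np.array([2 * (x[0] - 2.0), 2 * (x[1] - 1.0)])
            dg = np.array([1.0, 1.0])
            return df + (mu + rho * g(x)) * dg

        x, mu, rho = np.zeros(2), 0.0, 10.0
        for _ in range(50):                                         # outer ALM iterations
            for _ in range(200):                                    # inner unconstrained minimization
                x -= 0.01 * grad_aug_lagrangian(x, mu, rho)
            mu += rho * g(x)                                        # multiplier update

        print(x, "constraint violation:", g(x))                     # x -> [1, 0], violation -> 0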
    Fully Spiking Variational Autoencoder. (arXiv:2110.00375v1 [cs.NE])
    (2 min) Spiking neural networks (SNNs) can be run on neuromorphic devices with ultra-high speed and ultra-low energy consumption because of their binary and event-driven nature. Therefore, SNNs are expected to have various applications, including as generative models running on edge devices to create high-quality images. In this study, we build a variational autoencoder (VAE) with an SNN to enable image generation. VAE is known for its stability among generative models; recently, its generation quality has also advanced. In a vanilla VAE, the latent space is represented as a normal distribution, and floating-point calculations are required in sampling. However, this is not possible in SNNs because all features must be binary time series data. Therefore, we constructed the latent space with an autoregressive SNN model, and randomly selected samples from its output to sample the latent variables. This allows the latent variables to follow a Bernoulli process and allows variational learning. Thus, we build the Fully Spiking Variational Autoencoder, where all modules are constructed with SNN layers. To the best of our knowledge, we are the first to build a VAE only with SNN layers. We experimented with several datasets, and confirmed that it can generate images with the same or better quality compared to conventional ANNs. The code will be available soon.
    Detecting Harmful Memes and Their Targets. (arXiv:2110.00413v1 [cs.CL])
    (2 min) Among the various modes of communication in social media, the use of Internet memes has emerged as a powerful means to convey political, psychological, and socio-cultural opinions. Although memes are typically humorous in nature, recent days have witnessed a proliferation of harmful memes targeted to abuse various social entities. As most harmful memes are highly satirical and abstruse without appropriate contexts, off-the-shelf multimodal models may not be adequate to understand their underlying semantics. In this work, we propose two novel problem formulations: detecting harmful memes and the social entities that these harmful memes target. To this end, we present HarMeme, the first benchmark dataset, containing 3,544 memes related to COVID-19. Each meme went through a rigorous two-stage annotation process. In the first stage, we labeled a meme as very harmful, partially harmful, or harmless; in the second stage, we further annotated the type of target(s) that each harmful meme points to: individual, organization, community, or society/general public/other. The evaluation results using ten unimodal and multimodal models highlight the importance of using multimodal signals for both tasks. We further discuss the limitations of these models and we argue that more research is needed to address these problems.
    Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation. (arXiv:2110.00329v1 [cs.CV])
    (0 min) Knowledge distillation usually transfers the knowledge from a pre-trained cumbersome teacher network to a compact student network, which follows the classical teacher-teaching-student paradigm. Based on this paradigm, previous methods mostly focus on how to efficiently train a better student network for deployment. Different from the existing practices, in this paper, we propose a novel student-helping-teacher formula, Teacher Evolution via Self-Knowledge Distillation (TESKD), where the target teacher (for deployment) is learned with the help of multiple hierarchical students by sharing the structural backbone. The diverse feedback from multiple students allows the teacher to improve itself through the shared feature representations. The effectiveness of our proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks including CIFAR-100 and ImageNet. Notably, when trained together with our proposed method, ResNet-18 achieves 79.15% and 71.14% accuracy on CIFAR-100 and ImageNet, outperforming the baseline results by 4.74% and 1.43%, respectively. The code is available at: https://github.com/zhengli427/TESKD.
    Conditional Deep Gaussian Processes: empirical Bayes hyperdata learning. (arXiv:2110.00568v1 [cs.LG])
    (0 min) It is desirable to combine the expressive power of deep learning with Gaussian Processes (GPs) in one expressive Bayesian learning model. Deep kernel learning, proposed in [1], showed success in adopting a deep network for feature extraction followed by a GP used as the function model. Recently, [2] suggested that the deterministic nature of the feature extractor may lead to overfitting, while replacement with a Bayesian network seemed to cure it. Here, we propose the conditional Deep Gaussian Process (DGP), in which the intermediate GPs in the hierarchical composition are supported by the hyperdata and the exposed GP remains zero mean. Motivated by the inducing points in sparse GPs, the hyperdata also play the role of function supports, but are hyperparameters rather than random variables. We use the moment matching method [3] to approximate the marginal prior of the conditional DGP with a GP carrying an effective kernel. Thus, as in empirical Bayes, the hyperdata are learned by optimizing the approximate marginal likelihood, which implicitly depends on the hyperdata via the kernel. We show the equivalence with deep kernel learning in the limit of dense hyperdata in latent space. However, the conditional DGP and the corresponding approximate inference enjoy the benefit of being more Bayesian than deep kernel learning. Preliminary extrapolation results demonstrate the expressive power of the proposed model compared with GP kernel composition, DGP variational inference, and deep kernel learning. We also address the non-Gaussian aspect of our model as well as a way of upgrading to full Bayesian inference.
    Optic Disc Segmentation using Disk-Centered Patch Augmentation. (arXiv:2110.00512v1 [eess.IV])
    (0 min) The optic disc is a crucial diagnostic feature in the eye, since changes to its physiognomy are correlated with the severity of various ocular and cardiovascular diseases. While identifying the bulk of the optic disc in a color fundus image is straightforward, accurately segmenting its boundary at the pixel level is very challenging. In this work, we propose disc-centered patch augmentation (DCPA) -- a simple, yet novel training scheme for deep neural networks -- to address this problem. DCPA achieves state-of-the-art results on full-size images even when using small neural networks, specifically a U-Net with only 7 million parameters as opposed to the original 31 million. In DCPA, we restrict the training data to patches that fully contain the optic nerve. In addition, we also train the network using dynamic cost functions to increase its robustness. We tested DCPA-trained networks on five retinal datasets: DRISTI, DRIONS-DB, DRIVE, AV-WIDE, and CHASE-DB. The first two had available optic disc ground truth, and we manually estimated the ground truth for the latter three. Our approach achieved state-of-the-art F1 and IOU results on four datasets (95% F1, 91% IOU on DRISTI; 92% F1, 84% IOU on DRIVE; 83% F1, 71% IOU on AV-WIDE; 83% F1, 71% IOU on CHASE-DB) and competitive results on the fifth (95% F1, 91% IOU on DRIONS-DB), confirming its generality. Our open-source code and ground-truth annotations are available at: https://github.com/saeidmotevali/fundusdisk
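    A small sketch of the patch-sampling rule in DCPA: training patches are drawn only from positions that fully contain the optic-disc region. The toy image, mask, and patch size are assumptions for illustration:

        # Sample patches whose window fully covers the optic-disc bounding box.
        import numpy as np

        def disc_centered_patches(image, disc_mask, patch=128, n_patches=8, rng=None):
            if rng is None:
                rng = np.random.default_rng()
            ys, xs = np.where(disc_mask > 0)
            y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()          # disc bounding box
            # Valid top-left corners: the patch must cover the bounding box and stay in bounds.
            top_lo, top_hi = max(0, y1 - patch + 1), min(y0, image.shape[0] - patch)
            left_lo, left_hi = max(0, x1 - patch + 1), min(x0, image.shape[1] - patch)
            patches = []
            for _ in range(n_patches):
                t = rng.integers(top_lo, top_hi + 1)
                l = rng.integers(left_lo, left_hi + 1)
                patches.append(image[t:t + patch, l:l + patch])
            return np.stack(patches)

        image = np.random.default_rng(0).random((512, 512))
        mask = np.zeros((512, 512))
        mask[200:260, 240:300] = 1                                            # toy optic-disc mask
        print(disc_centered_patches(image, mask).shape)                       # (8, 128, 128)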
    Divergence-Regularized Multi-Agent Actor-Critic. (arXiv:2110.00304v1 [cs.LG])
    (0 min) Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective and makes the converged policy deviate from the optimal policy of the original Markov Decision Process. Though divergence regularization has been proposed to settle this problem, it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Mathematically, we derive the update rule of DMAC which is naturally off-policy, guarantees a monotonic policy improvement and is not biased by the regularization. DMAC is a flexible framework and can be combined with many existing MARL algorithms. We evaluate DMAC in a didactic stochastic game and StarCraft Multi-Agent Challenge and empirically show that DMAC substantially improves the performance of existing MARL algorithms.
    Comparing Sequential Forecasters. (arXiv:2110.00115v1 [stat.ME])
    (0 min) We consider two or more forecasters each making a sequence of predictions over time and tackle the problem of how to compare them -- either online or post-hoc. In fields ranging from meteorology to sports, forecasters make predictions on different events or quantities over time, and this work describes how to compare them in a statistically rigorous manner. Specifically, we design a nonasymptotic sequential inference procedure for estimating the time-varying difference in forecast quality when using a relatively large class of scoring rules (bounded scores with a linear equivalent). The resulting confidence intervals can be continuously monitored and yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting recent variance-adaptive confidence sequences (CS) to our setting. In the spirit of Shafer and Vovk's game-theoretic probability, the coverage guarantees for our CSs are also distribution-free, in the sense that they make no distributional assumptions whatsoever on the forecasts or outcomes. Additionally, in contrast to a recent preprint by Henzi and Ziegel, we show how to sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time, by designing different e-processes that quantify the evidence at any stopping time. We examine the validity of our methods over their fixed-time and asymptotic counterparts in synthetic experiments and demonstrate their effectiveness in real-data settings, including comparing probability forecasts on Major League Baseball (MLB) games and comparing statistical postprocessing methods for ensemble weather forecasts.
    Batched Thompson Sampling. (arXiv:2110.00202v1 [cs.LG])
    (0 min) We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits where the agent observes the rewards of her actions and adjusts her policy only at the end of a small number of batches. We show that this policy simultaneously achieves a problem dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$ while the number of batches can be bounded by $O(\log(T))$ independent of the problem instance over a time horizon $T$. We also show that in expectation the number of batches used by our policy can be bounded by an instance dependent bound of order $O(\log\log(T))$. These results indicate that Thompson sampling maintains the same performance in this batched setting as in the case when instantaneous feedback is available after each action, while requiring minimal feedback. These results also indicate that Thompson sampling performs competitively with recently proposed algorithms tailored for the batched setting. These algorithms optimize the batch structure for a given time horizon $T$ and prioritize exploration in the beginning of the experiment to eliminate suboptimal actions. We show that Thompson sampling combined with an adaptive batching strategy can achieve a similar performance without knowing the time horizon $T$ of the problem and without having to carefully optimize the batch structure to achieve a target regret bound (i.e. problem dependent vs minimax regret) for a given $T$.
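    A small sketch of batched Thompson sampling on a Bernoulli bandit: the Beta posterior is updated only at the end of each batch and batch sizes grow geometrically, so only O(log T) policy updates are made. The arm means and the doubling schedule are illustrative assumptions:

        # Beta-Bernoulli Thompson sampling with posterior updates only at batch ends.
        import numpy as np

        rng = np.random.default_rng(0)
        true_means = np.array([0.3, 0.5, 0.7])
        K, T = len(true_means), 10000
        alpha, beta = np.ones(K), np.ones(K)          # Beta(1, 1) priors

        t, batch = 0, 16
        while t < T:
            batch = min(batch, T - t)
            pending = np.zeros((K, 2))                # per-arm (successes, pulls) inside the batch
            for _ in range(batch):
                theta = rng.beta(alpha, beta)         # sample from the (stale) posterior
                arm = int(np.argmax(theta))
                reward = rng.random() < true_means[arm]
                pending[arm] += (reward, 1)
            alpha += pending[:, 0]                    # posterior update only at the batch end
            beta += pending[:, 1] - pending[:, 0]
            t += batch
            batch *= 2                                # geometric (doubling) batch schedule

        print("posterior means:", alpha / (alpha + beta))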
    Gaussian Variational State Estimation for Nonlinear State-Space Models. (arXiv:2002.02620v4 [stat.ML] UPDATED)
    (0 min) In this paper, the problem of state estimation, in the context of both filtering and smoothing, for nonlinear state-space models is considered. Due to the nonlinear nature of the models, the state estimation problem is generally intractable as it involves integrals of general nonlinear functions and the filtered and smoothed state distributions lack closed-form solutions. As such, it is common to approximate the state estimation problem. In this paper, we develop an assumed Gaussian solution based on variational inference, which offers the key advantage of a flexible, but principled, mechanism for approximating the required distributions. Our main contribution lies in a new formulation of the state estimation problem as an optimisation problem, which can then be solved using standard optimisation routines that employ exact first- and second-order derivatives. The resulting state estimation approach involves a minimal number of assumptions and applies directly to nonlinear systems with both Gaussian and non-Gaussian probabilistic models. The performance of our approach is demonstrated on several examples: a challenging scalar system, a model of a simple robotic system, and a target tracking problem using a von Mises-Fisher distribution; it outperforms alternative assumed Gaussian approaches to state estimation.
    DualNet: Continual Learning, Fast and Slow. (arXiv:2110.00175v1 [cs.LG])
    (0 min) According to Complementary Learning Systems (CLS) theory~\citep{mcclelland1995there} in neuroscience, humans do effective \emph{continual learning} through two complementary systems: a fast learning system centered on the hippocampus for rapid learning of the specifics and individual experiences, and a slow learning system located in the neocortex for the gradual acquisition of structured knowledge about the environment. Motivated by this theory, we propose a novel continual learning framework named "DualNet", which comprises a fast learning system for supervised learning of pattern-separated representation from specific tasks and a slow learning system for unsupervised representation learning of task-agnostic general representation via a Self-Supervised Learning (SSL) technique. The two fast and slow learning systems are complementary and work seamlessly in a holistic continual learning framework. Our extensive experiments on two challenging continual learning benchmarks of CORE50 and miniImageNet show that DualNet outperforms state-of-the-art continual learning methods by a large margin. We further conduct ablation studies of different SSL objectives to validate DualNet's efficacy, robustness, and scalability. Code will be made available upon acceptance.
    Multi Expression Programming -- an in-depth description. (arXiv:2110.00367v1 [cs.NE])
    (0 min) Multi Expression Programming (MEP) is a Genetic Programming variant that uses a linear representation of chromosomes. MEP individuals are strings of genes encoding complex computer programs. When MEP individuals encode expressions, their representation is similar to the way in which compilers translate $C$ or $Pascal$ expressions into machine code. A unique MEP feature is the ability to store multiple solutions of a problem in a single chromosome. Usually, the best solution is chosen for fitness assignment. When solving symbolic regression or classification problems (or any other problems for which the training set is known before the problem is solved), MEP has the same complexity as other techniques storing a single solution in a chromosome (such as GP, CGP, GEP or GE). Evaluation of the expressions encoded into an MEP individual can be performed by a single parsing of the chromosome. Offspring obtained by crossover and mutation are always syntactically correct MEP individuals (computer programs). Thus, no extra processing for repairing newly obtained individuals is needed.
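    To make the single-pass evaluation concrete, here is a minimal sketch under an assumed gene encoding: each gene is either a terminal or an operator whose arguments point to earlier genes, so one left-to-right pass yields a value for every expression stored in the chromosome. The tuple format and operator set are illustrative placeholders, not the paper's exact representation.

```python
# Sketch: evaluating an MEP-style chromosome in a single left-to-right pass.
# Each gene is either a terminal (a variable name) or an operator whose
# arguments are indices of earlier genes; the encoding is an assumption.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate_chromosome(genes, variables):
    """genes: e.g. [("x",), ("y",), ("+", 0, 1), ("*", 2, 0)]"""
    values = []
    for gene in genes:
        if gene[0] in OPS:                 # operator gene: refers to earlier genes
            op, i, j = gene
            values.append(OPS[op](values[i], values[j]))
        else:                              # terminal gene: a variable
            values.append(variables[gene[0]])
    return values                          # one candidate solution per gene

print(evaluate_chromosome(
    [("x",), ("y",), ("+", 0, 1), ("*", 2, 0)], {"x": 2.0, "y": 3.0}))
# -> [2.0, 3.0, 5.0, 10.0]; the fittest of these would be kept for fitness assignment.
```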
    Inverse airfoil design method for generating varieties of smooth airfoils using conditional WGAN-gp. (arXiv:2110.00212v1 [cs.LG])
    (0 min) Machine learning models have recently been utilized for airfoil shape generation, where the goal is to obtain airfoil shapes that satisfy a required lift coefficient. Generative adversarial networks (GANs) output reasonable airfoil shapes. However, shapes obtained from ordinary GAN models are not smooth, and they need smoothing before flow analysis. Therefore, the models need to be coupled with Bezier curves or other smoothing methods to obtain smooth shapes. Generating smooth shapes without any smoothing method is challenging. In this study, we employ a conditional Wasserstein GAN with gradient penalty (CWGAN-GP) to generate airfoil shapes, and the obtained shapes are as smooth as those obtained using smoothing methods. With the proposed method, no additional smoothing is needed to generate airfoils. Moreover, the proposed model outputs shapes that satisfy the lift coefficient requirements.
    LEMON: Explainable Entity Matching. (arXiv:2110.00516v1 [cs.DB])
    (0 min) State-of-the-art entity matching (EM) methods are hard to interpret, and there is significant value in bringing explainable AI to EM. Unfortunately, most popular explainability methods do not work well out of the box for EM and need adaptation. In this paper, we identify three challenges of applying local post hoc feature attribution methods to entity matching: cross-record interaction effects, non-match explanations, and variation in sensitivity. We propose our novel model-agnostic and schema-flexible method LEMON that addresses all three challenges by (i) producing dual explanations to avoid cross-record interaction effects, (ii) introducing the novel concept of attribution potential to explain how two records could have matched, and (iii) automatically choosing explanation granularity to match the sensitivity of the matcher and record pair in question. Experiments on public datasets demonstrate that the proposed method is more faithful to the matcher and does a better job of helping users understand the decision boundary of the matcher than previous work. Furthermore, user studies show that the rate at which human subjects can construct counterfactual examples after seeing an explanation from our proposed method increases from 54% to 64% for matches and from 15% to 49% for non-matches compared to explanations from a standard adaptation of LIME.
    On the Importance of Gradients for Detecting Distributional Shifts in the Wild. (arXiv:2110.00218v1 [cs.LG])
    (0 min) Detecting out-of-distribution (OOD) data has become a critical component in ensuring the safe deployment of machine learning models in the real world. Existing OOD detection approaches primarily rely on the output or feature space for deriving OOD scores, while largely overlooking information from the gradient space. In this paper, we present GradNorm, a simple and effective approach for detecting OOD inputs by utilizing information extracted from the gradient space. GradNorm directly employs the vector norm of gradients, backpropagated from the KL divergence between the softmax output and a uniform probability distribution. Our key idea is that the magnitude of gradients is higher for in-distribution (ID) data than that for OOD data, making it informative for OOD detection. GradNorm demonstrates superior performance, reducing the average FPR95 by up to 10.89% compared to the previous best method.
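    A minimal sketch of a GradNorm-style score in PyTorch: a divergence-to-uniform objective is backpropagated and the norm of the resulting gradient is used as the OOD score (larger for in-distribution inputs). The exact form of the divergence and the restriction of the norm to an assumed final linear layer are illustration choices, not necessarily the paper's.

```python
# Sketch: a GradNorm-style OOD score in PyTorch. A divergence-to-uniform loss
# (cross-entropy against a uniform target, equal to a KL term up to a constant)
# is backpropagated, and the L1 norm of the gradient at an assumed final
# linear layer serves as the score; larger values indicate in-distribution data.
import torch
import torch.nn.functional as F

def gradnorm_score(model, x, last_layer):
    model.zero_grad()
    logits = model(x.unsqueeze(0))
    loss = -F.log_softmax(logits, dim=-1).mean()   # divergence to the uniform distribution
    loss.backward()
    return last_layer.weight.grad.abs().sum().item()
```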
    DNN-Opt: An RL Inspired Optimization for Analog Circuit Sizing using Deep Neural Networks. (arXiv:2110.00211v1 [cs.LG])
    (0 min) Analog circuit sizing takes a significant amount of manual effort in a typical design cycle. With rapidly developing technology and tight schedules, bringing automated solutions for sizing has attracted great attention. This paper presents DNN-Opt, a Reinforcement Learning (RL) inspired Deep Neural Network (DNN) based black-box optimization framework for analog circuit sizing. The key contributions of this paper are a novel sample-efficient two-stage deep learning optimization framework leveraging RL actor-critic algorithms, and a recipe to extend it on large industrial circuits using critical device identification. Our method shows 5--30x sample efficiency compared to other black-box optimization methods both on small building blocks and on large industrial circuits with better performance metrics. To the best of our knowledge, this is the first application of DNN-based circuit sizing on industrial scale circuits.
    Sim and Real: Better Together. (arXiv:2110.00445v1 [stat.ML])
    (0 min) Simulation is used extensively in autonomous systems, particularly in robotic manipulation. By far, the most common approach is to train a controller in simulation, and then use it as an initial starting point for the real system. We demonstrate how to learn simultaneously from both simulation and interaction with the real environment. We propose an algorithm for balancing the large number of samples from the high throughput but less accurate simulation and the low-throughput, high-fidelity and costly samples from the real environment. We achieve that by maintaining a replay buffer for each environment the agent interacts with. We analyze such multi-environment interaction theoretically, and provide convergence properties, through a novel theoretical replay buffer analysis. We demonstrate the efficacy of our method on a sim-to-real environment.
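    A minimal sketch of the per-environment replay buffer idea: one buffer for simulated transitions and one for real ones, with each training batch mixed between them. The fixed mixing ratio is a placeholder assumption; the paper balances the two sources more carefully.

```python
# Sketch: one replay buffer per environment, with training batches mixed
# between simulated and real transitions at an assumed fixed ratio.
import random
from collections import deque

class MixedReplay:
    def __init__(self, capacity=100_000, real_fraction=0.25):
        self.sim = deque(maxlen=capacity)     # high-throughput, less accurate
        self.real = deque(maxlen=capacity)    # low-throughput, high-fidelity
        self.real_fraction = real_fraction

    def add(self, transition, is_real):
        (self.real if is_real else self.sim).append(transition)

    def sample(self, batch_size):
        n_real = min(int(batch_size * self.real_fraction), len(self.real))
        batch = random.sample(list(self.real), n_real)
        batch += random.sample(list(self.sim), batch_size - n_real)
        random.shuffle(batch)
        return batch
```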
    Asymptotic Performance of Thompson Sampling in the Batched Multi-Armed Bandits. (arXiv:2110.00158v1 [cs.LG])
    (0 min) We study the asymptotic performance of the Thompson sampling algorithm in the batched multi-armed bandit setting where the time horizon $T$ is divided into batches, and the agent is not able to observe the rewards of her actions until the end of each batch. We show that in this batched setting, Thompson sampling achieves the same asymptotic performance as in the case where instantaneous feedback is available after each action, provided that the batch sizes increase subexponentially. This result implies that Thompson sampling can maintain its performance even if it receives delayed feedback in $\omega(\log T)$ batches. We further propose an adaptive batching scheme that reduces the number of batches to $\Theta(\log T)$ while maintaining the same performance. Although the batched multi-armed bandit setting has been considered in several recent works, previous results rely on tailored algorithms for the batched setting, which optimize the batch structure and prioritize exploration in the beginning of the experiment to eliminate suboptimal actions. We show that Thompson sampling, on the other hand, is able to achieve a similar asymptotic performance in the batched setting without any modifications.
    Powerpropagation: A sparsity inducing weight reparameterisation. (arXiv:2110.00296v1 [stat.ML])
    (0 min) The training of sparse neural networks is becoming an increasingly important tool for reducing the computational footprint of models at training and evaluation, as well as enabling the effective scaling up of models. Whereas much work over the years has been dedicated to specialised pruning techniques, little attention has been paid to the inherent effect of gradient-based training on model sparsity. In this work, we introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models. Exploiting the behaviour of gradient descent, our method gives rise to weight updates exhibiting a "rich get richer" dynamic, leaving low-magnitude parameters largely unaffected by learning. Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely. Powerpropagation is general, intuitive, cheap and straightforward to implement and can readily be combined with various other techniques. To highlight its versatility, we explore it in two very different settings: Firstly, following a recent line of work, we investigate its effect on sparse training for resource-constrained settings. Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark. Secondly, we advocate the use of sparsity in overcoming catastrophic forgetting, where compressed representations allow accommodating a large number of tasks at fixed model capacity. In all cases our reparameterisation considerably increases the efficacy of the off-the-shelf methods.
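    A minimal sketch of a Powerpropagation-style layer, assuming the reparameterisation w = theta * |theta|^(alpha - 1): gradient updates are then scaled by the parameter magnitude, so low-magnitude weights stay close to zero. The exact functional form and the choice of alpha are assumptions for illustration.

```python
# Sketch: a Powerpropagation-style linear layer. The effective weight is
# w = theta * |theta|**(alpha - 1), so gradients are scaled by the parameter
# magnitude and small weights stay small; the form and alpha are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.theta, a=5 ** 0.5)
        self.alpha = alpha

    def forward(self, x):
        weight = self.theta * self.theta.abs().pow(self.alpha - 1)
        return F.linear(x, weight, self.bias)
```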
    Iterative Teacher-Aware Learning. (arXiv:2110.00137v1 [cs.LG])
    (0 min) In human pedagogy, teachers and students can interact adaptively to maximize communication efficiency. The teacher adjusts her teaching method for different students, and the student, after getting familiar with the teacher's instruction mechanism, can infer the teacher's intention to learn faster. Recently, the benefits of integrating this cooperative pedagogy into machine concept learning in discrete spaces have been proved by multiple works. However, how cooperative pedagogy can facilitate machine parameter learning has not been thoroughly studied. In this paper, we propose a gradient-optimization-based teacher-aware learner that can incorporate the teacher's cooperative intention into the likelihood function and learn provably faster than the naive learning algorithms used in previous machine teaching works. We give theoretical proof that the iterative teacher-aware learning (ITAL) process leads to local and global improvements. We then validate our algorithms with extensive experiments on various tasks including regression, classification, and inverse reinforcement learning using synthetic and real data. We also show the advantage of modeling teacher-awareness when agents are learning from human teachers.
    Unsupervised Motion Representation Learning with Capsule Autoencoders. (arXiv:2110.00529v1 [cs.CV])
    (0 min) We propose the Motion Capsule Autoencoder (MCAE), which addresses a key challenge in the unsupervised learning of motion representations: transformation invariance. MCAE models motion in a two-level hierarchy. In the lower level, a spatio-temporal motion signal is divided into short, local, and semantic-agnostic snippets. In the higher level, the snippets are aggregated to form full-length semantic-aware segments. For both levels, we represent motion with a set of learned transformation invariant templates and the corresponding geometric transformations by using capsule autoencoders of a novel design. This leads to a robust and efficient encoding of viewpoint changes. MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets. Notably, it achieves better results than baselines on Trajectory20 with considerably fewer parameters and state-of-the-art performance on the unsupervised skeleton-based action recognition task.
    Scientific evidence extraction. (arXiv:2110.00061v1 [cs.LG])
    (0 min) Recently, interest has grown in applying machine learning to the problem of table structure inference and extraction from unstructured documents. However, progress in this area has been challenging both to make and to measure, due to several issues that arise in training and evaluating models from labeled data. This includes challenges as fundamental as the lack of a single definitive ground truth output for each input sample and the lack of an ideal metric for measuring partial correctness for this task. To address these we propose a new dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric, grid table similarity (GriTS). PubTables-1M is nearly twice as large as the previous largest comparable dataset, can be used for models across multiple architectures and modalities, and addresses issues such as ambiguity and lack of consistency in the annotations. We apply DETR to table extraction for the first time and show that object detection models trained on PubTables-1M produce excellent results out-of-the-box for all three tasks of detection, structure recognition, and functional analysis. We describe the dataset in detail to enable others to build on our work and combine this data with other datasets for these and related tasks. It is our hope that PubTables-1M and the proposed metrics can further progress in this area by creating a benchmark suitable for training and evaluating a wide variety of models for table extraction. Data and code will be released at https://github.com/microsoft/table-transformer.
    Improving Load Forecast in Energy Markets During COVID-19. (arXiv:2110.00181v1 [eess.SP])
    (0 min) The abrupt outbreak of the COVID-19 pandemic was the most significant event in 2020, which had profound and lasting impacts across the world. Studies on energy markets observed a decline in energy demand and changes in energy consumption behaviors during COVID-19. However, as an essential part of system operation, how load forecasting performs amid COVID-19 is not well understood. This paper aims to bridge the research gap by systematically evaluating models and features that can be used to improve load forecasting performance amid COVID-19. Using real-world data from the New York Independent System Operator, our analysis employs three deep learning models and adopts both novel COVID-related features and classical weather-related features. We also propose simulating the stay-at-home situation with pre-stay-at-home weekend data and demonstrate its effectiveness in improving load forecasting accuracy during COVID-19.
    Spiking Hyperdimensional Network: Neuromorphic Models Integrated with Memory-Inspired Framework. (arXiv:2110.00214v1 [cs.NE])
    (0 min) Recently, brain-inspired computing models have shown great potential to outperform today's deep learning solutions in terms of robustness and energy efficiency. Particularly, Spiking Neural Networks (SNNs) and HyperDimensional Computing (HDC) have shown promising results in enabling efficient and robust cognitive learning. Despite this success, the two brain-inspired models have different strengths. While SNNs mimic the physical properties of the human brain, HDC models the brain on a more abstract and functional level. Their design philosophies demonstrate complementary patterns that motivate their combination. With the help of a classical psychological model of memory, we propose SpikeHD, the first framework that fundamentally combines spiking neural networks and hyperdimensional computing. SpikeHD generates a scalable and strong cognitive learning system that better mimics brain functionality. SpikeHD exploits spiking neural networks to extract low-level features by preserving the spatial and temporal correlation of raw event-based spike data. Then, it utilizes HDC to operate over the SNN output by mapping the signal into high-dimensional space, learning the abstract information, and classifying the data. Our extensive evaluation on a set of benchmark classification problems shows that SpikeHD provides the following benefits compared to an SNN architecture: (1) it significantly enhances learning capability by exploiting two-stage information processing, (2) it is substantially more robust to noise and failure, and (3) it reduces the network size and the number of parameters required to learn complex information.
    Learn to Communicate with Neural Calibration: Scalability and Generalization. (arXiv:2110.00272v1 [eess.SP])
    (0 min) The conventional design of wireless communication systems typically relies on established mathematical models that capture the characteristics of different communication modules. Unfortunately, such design cannot be easily and directly applied to future wireless networks, which will be characterized by large-scale ultra-dense networks whose design complexity scales exponentially with the network size. Furthermore, such networks will vary dynamically in a significant way, which makes it intractable to develop comprehensive analytical models. Recently, deep learning-based approaches have emerged as potential alternatives for designing complex and dynamic wireless systems. However, existing learning-based methods have limited capabilities to scale with the problem size and to generalize with varying network settings. In this paper, we propose a scalable and generalizable neural calibration framework for future wireless system design, where a neural network is adopted to calibrate the input of conventional model-based algorithms. Specifically, the backbone of a traditional time-efficient algorithm is integrated with deep neural networks to achieve a high computational efficiency, while enjoying enhanced performance. The permutation equivariance property, induced by the topological structure of wireless systems, is furthermore utilized to develop a generalizable neural network architecture. The proposed neural calibration framework is applied to solve challenging resource management problems in massive multiple-input multiple-output (MIMO) systems. Simulation results show that the proposed neural calibration approach enjoys significantly improved scalability and generalization compared with existing learning-based methods.
    Two ways towards combining Sequential Neural Network and Statistical Methods to Improve the Prediction of Time Series. (arXiv:2110.00082v1 [cs.LG])
    (0 min) Statistical modeling and data-driven learning are two vital fields that attract much attention. Statistical models aim to capture and interpret the relationships among variables, while data-driven learning attempts to extract information directly from the data without going through complex pre-specified models. Given the extensive studies in both fields, a subtle issue is how to properly integrate data-based methods with existing knowledge or models. In this paper, focusing on time series data, we propose two different ways to integrate the two: a decomposition-based method and a method exploiting the statistical extraction of data features. The first one decomposes the data into linear stable, nonlinear stable and unstable parts, where suitable statistical models are used for the linear stable and nonlinear stable parts while appropriate machine learning tools are used for the unstable parts. The second one applies statistical models to extract statistical features of the data and feeds them as additional inputs into the machine learning platform for training. The most critical and challenging step is determining and extracting the valuable information from mathematical or statistical models to boost the performance of machine learning algorithms. We evaluate the proposal using time series data with varying degrees of stability. Performance results show that both methods can outperform existing schemes that use models and learning separately, and the improvements can be over 60%. Both proposed methods are promising in bridging the gap between model-based and data-driven schemes and integrating the two to provide an overall higher learning performance.
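    A minimal sketch of the decomposition-based direction, under assumed model choices: a simple statistical model fits the stable (here, linear-trend) component, a gradient-boosted model fits the residual from its own lags, and the two forecasts are added.

```python
# Sketch: decomposition-based hybrid forecasting. A linear trend model handles
# the stable component and a gradient-boosted model predicts the residual from
# its own lags; both model choices are placeholder assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def hybrid_forecast(y, n_lags=8):
    """y: 1-D numpy array of observations; returns a one-step-ahead forecast."""
    t = np.arange(len(y)).reshape(-1, 1)
    trend_model = LinearRegression().fit(t, y)          # stable (linear) part
    residual = y - trend_model.predict(t)

    # Lagged residuals as features for the data-driven part.
    X = np.column_stack([residual[i:len(residual) - n_lags + i] for i in range(n_lags)])
    ml_model = GradientBoostingRegressor().fit(X, residual[n_lags:])

    next_trend = trend_model.predict([[len(y)]])[0]
    next_resid = ml_model.predict(residual[-n_lags:].reshape(1, -1))[0]
    return next_trend + next_resid
```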
    Empirical Quantitative Analysis of COVID-19 Forecasting Models. (arXiv:2110.00174v1 [cs.LG])
    (0 min) COVID-19 has been a public health emergency of international concern since early 2020. Reliable forecasting is critical to diminish the impact of this disease. To date, a large number of different forecasting models have been proposed, mainly including statistical models, compartmental models, and deep learning models. However, due to various uncertain factors across different regions such as economics and government policy, no forecasting model appears to be the best for all scenarios. In this paper, we perform a quantitative analysis of COVID-19 forecasting of confirmed cases and deaths across different regions in the United States with different forecasting horizons, and evaluate the relative impacts of the following three dimensions on the predictive performance (improvement and variation) through different evaluation metrics: model selection, hyperparameter tuning, and the length of time series required for training. We find that a dimension that brings larger performance gains when well tuned may also lead to harsher performance penalties when it is not. Furthermore, model selection is the dominant factor in determining the predictive performance. It is responsible for both the largest improvement and the largest variation in performance in all prediction tasks across different regions. While practitioners may perform more complicated time series analysis in practice, they should be able to achieve reasonable results if they have adequate insight into key decisions like model selection.
    On the Trustworthiness of Tree Ensemble Explainability Methods. (arXiv:2110.00086v1 [cs.LG])
    (0 min) The recent increase in the deployment of machine learning models in critical domains such as healthcare, criminal justice, and finance has highlighted the need for trustworthy methods that can explain these models to stakeholders. Feature importance methods (e.g. gain and SHAP) are among the most popular explainability methods used to address this need. For any explainability technique to be trustworthy and meaningful, it has to provide an explanation that is accurate and stable. Although the stability of local feature importance methods (explaining individual predictions) has been studied before, there is still a knowledge gap about the stability of global feature importance methods (explanations for the whole model). Additionally, there is no study that evaluates and compares the accuracy of global feature importance methods with respect to feature ordering. In this paper, we evaluate the accuracy and stability of global feature importance methods through comprehensive experiments done on simulations as well as four real-world datasets. We focus on tree-based ensemble methods as they are used widely in industry and measure the accuracy and stability of explanations under two scenarios: 1) when inputs are perturbed and 2) when models are perturbed. Our findings provide a comparison of these methods under a variety of settings and shed light on the limitations of global feature importance methods by indicating their lack of accuracy with and without noisy inputs, as well as their lack of stability with respect to: 1) increases in input dimension or noise in the data; and 2) perturbations in models initialized by different random seeds or hyperparameter settings.
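    A minimal sketch of the first perturbation scenario, with placeholder data, model, and noise level: train a tree ensemble, perturb the inputs with Gaussian noise, retrain, and compare the two global (gain-based) importance vectors by Spearman correlation.

```python
# Sketch: stability of global (gain-based) feature importance under input
# perturbation, measured by Spearman rank correlation. Dataset, model, and
# noise level are placeholders.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=0.1, random_state=0)
imp_clean = RandomForestRegressor(random_state=0).fit(X, y).feature_importances_

noise = np.random.default_rng(0).normal(scale=0.1 * X.std(axis=0), size=X.shape)
imp_noisy = RandomForestRegressor(random_state=0).fit(X + noise, y).feature_importances_

rho, _ = spearmanr(imp_clean, imp_noisy)
print(f"Spearman correlation of global importances: {rho:.3f}")
```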
    Leveraging power grid topology in machine learning assisted optimal power flow. (arXiv:2110.00306v1 [cs.LG])
    (0 min) Machine learning assisted optimal power flow (OPF) aims to reduce the computational complexity of these non-linear and non-convex constrained optimisation problems by consigning expensive (online) optimisation to offline training. The majority of work in this area typically employs fully-connected neural networks (FCNN). However, convolutional (CNN) and graph (GNN) neural networks have recently also been investigated, in an effort to exploit topological information within the power grid. Although promising results have been obtained, the literature lacks a systematic comparison between these architectures. Accordingly, we assess the performance of a variety of FCNN, CNN and GNN models for two fundamental approaches to machine learning assisted OPF: regression (predicting optimal generator set-points) and classification (predicting the active set of constraints). For several synthetic grids with interconnected utilities, we show that locality properties between feature and target variables are scarce, and hence find limited merit in harnessing topological information in NN models for this set of problems.
    Predicting Consumer Purchasing Decision in The Online Food Delivery Industry. (arXiv:2110.00502v1 [stat.ML])
    (0 min) The transformation of food delivery businesses to online platforms has gained considerable attention in recent years, due in part to the availability of customizable ordering experiences, easy payment methods, fast delivery, and more. The competition between online food delivery providers has intensified to attain a wider range of customers, so providers should have a better understanding of their customers' needs and be able to predict their purchasing decisions. Machine learning has a significant impact on companies' bottom line; it is used to construct models and strategies in industries that rely on big data and need a system to evaluate it fast and effectively. Predictive modeling is a type of machine learning that uses various regression algorithms, analytics, and statistics to estimate the probability of an occurrence. The incorporation of predictive models helps online food delivery providers to understand their customers. In this study, a dataset collected from 388 consumers in Bangalore, India is used to predict their purchasing decisions. Four prediction models are considered: CART and C4.5 decision trees, random forest, and rule-based classifiers, and their accuracies in providing the correct class label are evaluated. The findings show that all models perform similarly, but C4.5 outperforms them all with an accuracy of 91.67%.
    SpliceOut: A Simple and Efficient Audio Augmentation Method. (arXiv:2110.00046v1 [cs.SD])
    (0 min) Time masking has become a de facto augmentation technique for speech and audio tasks, including automatic speech recognition (ASR) and audio classification, most notably as a part of SpecAugment. In this work, we propose SpliceOut, a simple modification to time masking which makes it computationally more efficient. SpliceOut performs comparably to (and sometimes outperforms) SpecAugment on a wide variety of speech and audio tasks, including ASR for seven different languages using varying amounts of training data, as well as on speech translation, sound and music classification, thus establishing itself as a broadly applicable audio augmentation method. SpliceOut also provides additional gains when used in conjunction with other augmentation techniques. Apart from the fully-supervised setting, we also demonstrate that SpliceOut can complement unsupervised representation learning with performance gains in the semi-supervised and self-supervised settings.
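    A minimal sketch contrasting SpecAugment-style time masking with a SpliceOut-style operation on a spectrogram: instead of zeroing a time span, the span is removed and the remaining frames are concatenated, yielding a shorter sequence. The span-sampling details are assumptions for illustration.

```python
# Sketch: time masking zeroes a span of frames, while a SpliceOut-style
# operation removes the span and concatenates the remainder, yielding a
# shorter sequence. Span sampling details are assumptions.
import numpy as np

def time_mask(spec, width, rng):
    start = rng.integers(0, spec.shape[1] - width)
    out = spec.copy()
    out[:, start:start + width] = 0.0          # SpecAugment-style masking
    return out

def splice_out(spec, width, rng):
    start = rng.integers(0, spec.shape[1] - width)
    keep = np.r_[0:start, start + width:spec.shape[1]]
    return spec[:, keep]                        # span removed, not zeroed

rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 400))           # (mel bins, frames)
print(time_mask(spec, 30, rng).shape, splice_out(spec, 30, rng).shape)
```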
    Machine learning models for prediction of droplet collision outcomes. (arXiv:2110.00167v1 [cs.LG])
    (0 min) Predicting the outcome of liquid droplet collisions is an extensively studied phenomenon, but the current physics-based models for predicting the outcomes are poor (accuracy $\approx 43\%$). The key weakness of these models is their limited complexity: they only account for 3 features, while there are many more relevant features that go unaccounted for. This limitation of traditional models can be easily overcome through machine learning modeling of the problem. In an ML setting this problem directly translates to a classification problem with 4 classes. Here we compile a large labelled dataset and tune different ML classifiers over this dataset. We evaluate the accuracy and robustness of the classifiers. ML classifiers, with accuracies over 90\%, significantly outperform the physics-based models. Another key question we try to answer in this paper is whether existing knowledge of the physics-based models can be exploited to boost the accuracy of the ML classifiers. We find that while this knowledge improves the accuracy marginally for small datasets, it does not improve accuracy when larger datasets are used for training the models.
    Belief propagation for permutations, rankings, and partial orders. (arXiv:2110.00513v1 [cs.AI])
    (0 min) Many datasets give partial information about an ordering or ranking by indicating which team won a game, which item a user prefers, or who infected whom. We define a continuous spin system whose Gibbs distribution is the posterior distribution on permutations, given a probabilistic model of these interactions. Using the cavity method we derive a belief propagation algorithm that computes the marginal distribution of each node's position. In addition, the Bethe free energy lets us approximate the number of linear extensions of a partial order and perform model selection.
    Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?. (arXiv:2110.00528v1 [cs.CV])
    (0 min) Despite the success of a number of recent techniques for visual self-supervised deep learning, there remains limited investigation into the representations that are ultimately learned. By using recent advances in comparing neural representations, we explore this direction by comparing a contrastive self-supervised algorithm (SimCLR) to supervised learning for simple image data in a common architecture. We find that the methods learn similar intermediate representations through dissimilar means, and that the representations diverge rapidly in the final few layers. We investigate this divergence, finding that it is caused by these layers strongly fitting to the distinct learning objectives. We also find that SimCLR's objective implicitly fits the supervised objective in intermediate layers, but that the reverse is not true. Our work particularly highlights the importance of the learned intermediate representations, and raises important questions for auxiliary task design.
    FiLMing Multimodal Sarcasm Detection with Attention. (arXiv:2110.00416v1 [cs.MM])
    (0 min) Sarcasm detection identifies natural language expressions whose intended meaning is different from their surface meaning. It finds applications in many NLP tasks such as opinion mining, sentiment analysis, etc. Today, social media has given rise to an abundant amount of multimodal data where users express their opinions through text and images. Our paper aims to leverage multimodal data to improve the performance of the existing systems for sarcasm detection. So far, various approaches have been proposed that use the text and image modalities and a fusion of both. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes. Further, we integrate feature-wise affine transformation by conditioning the input image through FiLMed ResNet blocks with the textual features using the GRU network to capture the multimodal information. The outputs from both models and the CLS token from RoBERTa are concatenated and used for the final prediction. Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal sarcasm detection dataset.
    Topologically-Informed Atlas Learning. (arXiv:2110.00429v1 [cs.RO])
    (0 min) We present a new technique that enables manifold learning to accurately embed data manifolds that contain holes, without discarding any topological information. Manifold learning aims to embed high dimensional data into a lower dimensional Euclidean space by learning a coordinate chart, but it requires that the entire manifold can be embedded in a single chart. This is impossible for manifolds with holes. In such cases, it is necessary to learn an atlas: a collection of charts that collectively cover the entire manifold. We begin with many small charts, and combine them in a bottom-up approach, where charts are only combined if doing so will not introduce problematic topological features. When it is no longer possible to combine any charts, each chart is individually embedded with standard manifold learning techniques, completing the construction of the atlas. We show the efficacy of our method by constructing atlases for challenging synthetic manifolds; learning human motion embeddings from motion capture data; and learning kinematic models of articulated objects.
    TyXe: Pyro-based Bayesian neural nets for Pytorch. (arXiv:2110.00276v1 [stat.ML])
    (0 min) We introduce TyXe, a Bayesian neural network library built on top of Pytorch and Pyro. Our leading design principle is to cleanly separate architecture, prior, inference and likelihood specification, allowing for a flexible workflow where users can quickly iterate over combinations of these components. In contrast to existing packages TyXe does not implement any layer classes, and instead relies on architectures defined in generic Pytorch code. TyXe then provides modular choices for canonical priors, variational guides, inference techniques, and layer selections for a Bayesian treatment of the specified architecture. Sampling tricks for variance reduction, such as local reparameterization or flipout, are implemented as effect handlers, which can be applied independently of other specifications. We showcase the ease of use of TyXe to explore Bayesian versions of popular models from various libraries: toy regression with a pure Pytorch neural network; large-scale image classification with torchvision ResNets; graph neural networks based on DGL; and Neural Radiance Fields built on top of Pytorch3D. Finally, we provide convenient abstractions for variational continual learning. In all cases the change from a deterministic to a Bayesian neural network comes with minimal modifications to existing code, offering a broad range of researchers and practitioners alike practical access to uncertainty estimation techniques. The library is available at https://github.com/TyXe-BDL/TyXe.
    PhiNets: a scalable backbone for low-power AI at the edge. (arXiv:2110.00337v1 [cs.CV])
    (0 min) In the Internet of Things era, where we see many interconnected and heterogeneous mobile and fixed smart devices, distributing the intelligence from the cloud to the edge has become a necessity. Due to limited computational and communication capabilities, low memory and limited energy budget, bringing artificial intelligence algorithms to peripheral devices, such as the end-nodes of a sensor network, is a challenging task and requires the design of innovative methods. In this work, we present PhiNets, a new scalable backbone optimized for deep-learning-based image processing on resource-constrained platforms. PhiNets are based on inverted residual blocks specifically designed to decouple the computational cost, working memory, and parameter memory, thus exploiting all the available resources. With a YoloV2 detection head and Simple Online and Realtime Tracking, the proposed architecture achieves state-of-the-art results in (i) detection on the COCO and VOC2012 benchmarks, and (ii) tracking on the MOT15 benchmark. PhiNets reduce the parameter count by 87% to 93% with respect to previous state-of-the-art models (EfficientNetv1, MobileNetv2) and achieve better performance with lower computational cost. Moreover, we demonstrate our approach on a prototype node based on a STM32H743 microcontroller (MCU) with 2MB of internal Flash and 1MB of RAM and achieve power requirements on the order of 10 mW. The code for the PhiNets is publicly available on GitHub.
    Discovering Boundary Values of Feature-based Machine Learning Classifiers through Exploratory Datamorphic Testing. (arXiv:2110.00330v1 [cs.LG])
    (0 min) Testing has been widely recognised as difficult for AI applications. This paper proposes a set of testing strategies for testing machine learning applications in the framework of the datamorphism testing methodology. In these strategies, testing aims at exploring the data space of a classification or clustering application to discover the boundaries between classes that the machine learning application defines. This enables the tester to understand precisely the behaviour and function of the software under test. In the paper, three variants of exploratory strategies are presented with the algorithms implemented in the automated datamorphic testing tool Morphy. The correctness of these algorithms is formally proved. Their capability and cost of discovering borders between classes are evaluated via a set of controlled experiments with manually designed subjects and a set of case studies with real machine learning models.
    Tree-Constrained Graph Neural Networks For Argument Mining. (arXiv:2110.00124v1 [cs.CL])
    (0 min) We propose a novel architecture for Graph Neural Networks that is inspired by the idea behind Tree Kernels of measuring similarity between trees by taking into account their common substructures, named fragments. By imposing a series of regularization constraints to the learning problem, we exploit a pooling mechanism that incorporates such notion of fragments within the node soft assignment function that produces the embeddings. We present an extensive experimental evaluation on a collection of sentence classification tasks conducted on several argument mining corpora, showing that the proposed approach performs well with respect to state-of-the-art techniques.
    On Deep Neural Network Calibration by Regularization and its Impact on Refinement. (arXiv:2106.09385v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks have been shown to be highly miscalibrated; often they tend to be overconfident in their predictions. This poses a significant challenge to the reliable use of deep neural networks (DNNs) in safety-critical systems. Many recently proposed approaches to mitigate this have demonstrated substantial progress in improving DNN calibration. However, they hardly touch upon refinement, which historically has been an essential aspect of calibration. Refinement indicates separability of a network's correct and incorrect predictions. This paper presents a theoretically and empirically supported exposition reviewing refinement of a calibrated model. Firstly, we show the breakdown of expected calibration error (ECE) into predicted confidence and refinement under the assumption of over-confident predictions. Secondly, linking with this result, we highlight that regularization-based calibration only focuses on naively reducing a model's confidence. This logically has a severe downside to a model's refinement, as correct and incorrect predictions become tightly coupled. Lastly, connecting refinement with ECE also provides support to existing refinement-based approaches, which improve calibration but do not explain the reasoning behind it. We support our claims through rigorous empirical evaluations of many state-of-the-art calibration approaches on widely used datasets and neural networks. We find that many calibration approaches, such as label smoothing and mixup, lower the usefulness of a DNN by degrading its refinement. Even under natural data shift, this calibration-refinement trade-off holds for the majority of calibration methods.
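    For reference, a minimal sketch of the expected calibration error (ECE) discussed above: predictions are binned by confidence, and ECE is the occupancy-weighted average gap between confidence and accuracy per bin (the usual equal-width binning is assumed here).

```python
# Sketch: expected calibration error with equal-width confidence bins; ECE is
# the occupancy-weighted average gap between confidence and accuracy per bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap            # weight by bin occupancy
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```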
    Determining Standard Occupational Classification Codes from Job Descriptions in Immigration Petitions. (arXiv:2110.00078v1 [cs.LG])
    (0 min) Accurate specification of standard occupational classification (SOC) code is critical to the success of many U.S. work visa applications. Determination of correct SOC code relies on careful study of job requirements and comparison to definitions given by the U.S. Bureau of Labor Statistics, which is often a tedious activity. In this paper, we apply methods from natural language processing (NLP) to computationally determine SOC code based on job description. We implement and empirically evaluate a broad variety of predictive models with respect to quality of prediction and training time, and identify models best suited for this task.
    DeepMCAT: Large-Scale Deep Clustering for Medical Image Categorization. (arXiv:2110.00109v1 [eess.IV])
    (2 min) In recent years, the research landscape of machine learning in medical imaging has changed drastically from supervised to semi-, weakly- or unsupervised methods. This is mainly due to the fact that ground-truth labels are time-consuming and expensive to obtain manually. Generating labels from patient metadata might be feasible but it suffers from user-originated errors which introduce biases. In this work, we propose an unsupervised approach for automatically clustering and categorizing large-scale medical image datasets, with a focus on cardiac MR images, and without using any labels. We investigated the end-to-end training using both class-balanced and imbalanced large-scale datasets. Our method was able to create clusters with high purity and achieved over 0.99 cluster purity on these datasets. The results demonstrate the potential of the proposed method for categorizing unstructured large medical databases, such as organizing clinical PACS systems in hospitals.
    Seeing Glass: Joint Point Cloud and Depth Completion for Transparent Objects. (arXiv:2110.00087v1 [cs.CV])
    (0 min) The basis of many object manipulation algorithms is RGB-D input. Yet, commodity RGB-D sensors can only provide distorted depth maps for a wide range of transparent objects due to light refraction and absorption. To tackle the perception challenges posed by transparent objects, we propose TranspareNet, a joint point cloud and depth completion method, with the ability to complete the depth of transparent objects in cluttered and complex scenes, even with partially filled fluid contents within the vessels. To address the shortcomings of existing transparent object data collection schemes in the literature, we also propose an automated dataset creation workflow that consists of robot-controlled image collection and vision-based automatic annotation. Through this automated workflow, we created the Toronto Transparent Objects Depth Dataset (TODD), which consists of nearly 15000 RGB-D images. Our experimental evaluation demonstrates that TranspareNet outperforms existing state-of-the-art depth completion methods on multiple datasets, including ClearGrasp, and that it also handles cluttered scenes when trained on TODD. Code and dataset will be released at https://www.pair.toronto.edu/TranspareNet/
    Q-Net: A Quantitative Susceptibility Mapping-based Deep Neural Network for Differential Diagnosis of Brain Iron Deposition in Hemochromatosis. (arXiv:2110.00203v1 [cs.LG])
    (2 min) Brain iron deposition, in particular in the deep gray matter nuclei, increases with advancing age. Hereditary Hemochromatosis (HH) is the most common inherited disorder of systemic iron excess in Europeans, and recent studies have claimed high brain iron accumulation in patients with Hemochromatosis. In this study, we focus on Artificial Intelligence (AI)-based differential diagnosis of brain iron deposition in HH via Quantitative Susceptibility Mapping (QSM), which is an established Magnetic Resonance Imaging (MRI) technique to study the distribution of iron in the brain. Our main objective is to investigate the potential of AI-driven frameworks to accurately and efficiently differentiate individuals with Hemochromatosis from those of the healthy control group. More specifically, we developed the Q-Net framework, which is a data-driven model that processes information on iron deposition in the brain obtained from multi-echo gradient echo imaging data and anatomical information on T1-weighted images of the brain. We illustrate that the Q-Net framework can assist in differentiating between a subject with HH and a healthy control (HC) of the same age, something that is not possible by just visualizing images. The study is performed based on a unique dataset that was collected from 52 subjects with HH and 47 HC. Q-Net provides a differential diagnosis accuracy of 83.16% and 80.37% in the scan-level and image-level classification, respectively.
    Strengthening Probabilistic Graphical Models: The Purge-and-merge Algorithm. (arXiv:2110.00091v1 [cs.LG])
    (2 min) Probabilistic graphical models (PGMs) are powerful tools for solving systems of complex relationships over a variety of probability distributions. Tree-structured PGMs always result in efficient and exact solutions, while inference on graph (or loopy) structured PGMs is not guaranteed to discover the optimal solutions. It is in principle possible to convert loopy PGMs to an equivalent tree structure, but for most interesting problems this is impractical due to exponential blow-up. To address this, we developed the purge-and-merge algorithm. The idea behind this algorithm is to iteratively nudge a malleable graph structure towards a tree structure by selectively merging factors. The merging process is designed to avoid exponential blow-up by making use of sparse structures from which redundancy is purged as the algorithm progresses. This approach is evaluated on a number of constraint-satisfaction puzzles such as Sudoku, Fill-a-pix, and Kakuro. On these tasks, our system outperformed other PGM-based approaches reported in the literature. Although these tasks were limited to the binary logic of CSP, we believe it holds promise for extension to general PGM inference.
    Personalized Retrogress-Resilient Framework for Real-World Medical Federated Learning. (arXiv:2110.00394v1 [cs.CV])
    (0 min) Nowadays, deep learning methods with large-scale datasets can produce clinically useful models for computer-aided diagnosis. However, privacy and ethical concerns are increasingly critical, which makes it difficult to collect large quantities of data from multiple institutions. Federated Learning (FL) provides a promising decentralized solution for training models collaboratively by exchanging client models instead of private data. However, the server aggregation of existing FL methods is observed to degrade the model performance in real-world medical FL settings, a phenomenon termed retrogress. To address this problem, we propose a personalized retrogress-resilient framework to produce a superior personalized model for each client. Specifically, we devise a Progressive Fourier Aggregation (PFA) at the server to achieve more stable and effective global knowledge gathering by integrating client models from low-frequency to high-frequency components gradually. Moreover, with an introduced deputy model to receive the aggregated server model, we design a Deputy-Enhanced Transfer (DET) strategy at the client and conduct three steps of Recover-Exchange-Sublimate to ameliorate the personalized local model by transferring the global knowledge smoothly. Extensive experiments on a real-world dermoscopic FL dataset show that our personalized retrogress-resilient framework outperforms state-of-the-art FL methods and generalizes better to an out-of-distribution cohort. The code and dataset are available at https://github.com/CityU-AIM-Group/PRR-FL.
    Sparse Quadratic Optimisation over the Stiefel Manifold with Application to Permutation Synchronisation. (arXiv:2110.00053v1 [math.OC])
    (2 min) We address the non-convex optimisation problem of finding a sparse matrix on the Stiefel manifold (matrices with mutually orthogonal columns of unit length) that maximises (or minimises) a quadratic objective function. Optimisation problems on the Stiefel manifold occur for example in spectral relaxations of various combinatorial problems, such as graph matching, clustering, or permutation synchronisation. Although sparsity is a desirable property in such settings, it is mostly neglected in spectral formulations since existing solvers, e.g. based on eigenvalue decomposition, are unable to account for sparsity while at the same time maintaining global optimality guarantees. We fill this gap and propose a simple yet effective sparsity-promoting modification of the Orthogonal Iteration algorithm for finding the dominant eigenspace of a matrix. By doing so, we can guarantee that our method finds a Stiefel matrix that is globally optimal with respect to the quadratic objective function, while in addition being sparse. As a motivating application we consider the task of permutation synchronisation, which can be understood as a constrained clustering problem that has particular relevance for matching multiple images or 3D shapes in computer vision, computer graphics, and beyond. We demonstrate that the proposed approach outperforms previous methods in this domain.
    Incremental Layer-wise Self-Supervised Learning for Efficient Speech Domain Adaptation On Device. (arXiv:2110.00155v1 [cs.SD])
    (2 min) Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvements in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on-device training: limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not applicable on mobile devices directly because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm obtains a Word Error Rate (WER) on the target domain that is $24.2\%$ better than the supervised baseline, while costing $89.7\%$ less training memory than the end-to-end self-supervised learning algorithm.
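    A minimal sketch of the layer-wise idea, with an assumed round-robin schedule and toy model: all parameters are frozen except those of the currently selected layer, which bounds the memory needed for gradients and optimizer state.

```python
# Sketch: update one layer at a time to bound training memory. All parameters
# are frozen except those of the currently selected layer; the round-robin
# schedule and toy model are assumptions.
import torch.nn as nn

def set_trainable_layer(model: nn.Sequential, layer_idx: int):
    for i, layer in enumerate(model):
        for p in layer.parameters():
            p.requires_grad = (i == layer_idx)

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 40))
for step in range(1000):
    set_trainable_layer(model, layer_idx=(step // 200) % len(model))
    # ... forward pass, self-supervised loss, and optimizer step on the
    #     trainable parameters would go here ...
```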
    UserIdentifier: Implicit User Representations for Simple and Effective Personalized Sentiment Analysis. (arXiv:2110.00135v1 [cs.LG])
    (2 min) Global models are trained to be as generalizable as possible, with user invariance considered desirable since the models are shared across multitudes of users. As such, these models are often unable to produce personalized responses for individual users, based on their data. Contrary to widely-used personalization techniques based on few-shot learning, we propose UserIdentifier, a novel scheme for training a single shared model for all users. Our approach produces personalized responses by adding fixed, non-trainable user identifiers to the input data. We empirically demonstrate that this proposed method outperforms the prefix-tuning based state-of-the-art approach by up to 13%, on a suite of sentiment analysis datasets. We also show that, unlike prior work, this method needs neither any additional model parameters nor any extra rounds of few-shot fine-tuning.
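    A minimal sketch of the UserIdentifier idea: a fixed, non-trainable identifier string derived from the user ID is prepended to each input example, so one shared model can condition on the user without any per-user parameters. The identifier format (a short hash split into tokens) is an assumption for illustration.

```python
# Sketch: prepend a fixed, non-trainable user identifier string (derived here
# from a hash of the user ID, an assumed format) to every input example.
import hashlib

def add_user_identifier(text: str, user_id: str, n_tokens: int = 4) -> str:
    digest = hashlib.sha1(user_id.encode()).hexdigest()
    tokens = [digest[i * 4:(i + 1) * 4] for i in range(n_tokens)]
    return " ".join(tokens) + " " + text        # identifier tokens + original input

print(add_user_identifier("loved the movie, would watch again", user_id="user-42"))
```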
    #ContextMatters: Advantages and Limitations of Using Machine Learning to Support Women in Politics. (arXiv:2110.00116v1 [cs.SI])
    (3 min) The United Nations identified gender equality as a Sustainable Development Goal in 2015, recognizing the underrepresentation of women in politics as a specific barrier to achieving gender equality. Political systems around the world experience gender inequality across all levels of elected government as fewer women run for office than men. This is due in part to online abuse, particularly on social media platforms like Twitter, where women seeking or in power tend to be targeted with more toxic maltreatment than their male counterparts. In this paper, we present reflections on ParityBOT - the first natural language processing-based intervention designed to affect online discourse for women in politics for the better, at scale. Deployed across elections in Canada, the United States and New Zealand, ParityBOT was used to analyse and classify more than 12 million tweets directed at women candidates and counter toxic tweets with supportive ones. From these elections we present three case studies highlighting the current limitations of, and future research and application opportunities for, using a natural language processing-based system to detect online toxicity, specifically with regards to contextually important microaggressions. We examine the rate of false negatives, where ParityBOT failed to pick up on insults directed at specific high profile women, which would be obvious to human users. We examine the unaddressed harms of microaggressions and the potential of yet unseen damage they cause for women in these communities, and for progress towards gender equality overall, in light of these technological blindspots. This work concludes with a discussion on the benefits of partnerships between nonprofit social groups and technology experts to develop responsible, socially impactful approaches to addressing online hate.
    PIETS: Parallelised Irregularity Encoders for Forecasting with Heterogeneous Time-Series. (arXiv:2110.00071v1 [cs.LG])
    (2 min) Heterogeneity and irregularity of multi-source data sets present a significant challenge to time-series analysis. In the literature, the fusion of multi-source time-series has been achieved either by using ensemble learning models, which ignore temporal patterns and correlation within features, or by defining a fixed-size window to select specific parts of the data sets. On the other hand, many studies have shown major improvements in handling the irregularity of time-series, yet none of these studies has been applied to multi-source data. In this work, we design a novel architecture, PIETS, to model heterogeneous time-series. PIETS has the following characteristics: (1) irregularity encoders for multi-source samples that can leverage all available information and accelerate the convergence of the model; (2) parallelised neural networks to enable flexibility and avoid information overload; and (3) an attention mechanism that highlights different information and gives high importance to the most related data. Through extensive experiments on real-world data sets related to COVID-19, we show that the proposed architecture is able to effectively model heterogeneous temporal data and outperforms other state-of-the-art approaches in the prediction task.
    Under the Microscope: Interpreting Readability Assessment Models for Filipino. (arXiv:2110.00157v1 [cs.CL])
    (2 min) Readability assessment is the process of identifying the level of ease or difficulty of a certain piece of text for its intended audience. Approaches have evolved from the use of arithmetic formulas to more complex pattern-recognizing models trained using machine learning algorithms. While these approaches provide competitive results, limited work has been done on quantitatively analyzing how linguistic variables affect model inference. In this work, we dissect machine learning-based readability assessment models in Filipino by performing global and local model interpretation to understand the contributions of varying linguistic features, and discuss their implications in the context of the Filipino language. Results show that a model trained with the top features from global interpretation obtains higher performance than models using features selected by Spearman correlation. Likewise, we also empirically observed local feature-weight boundaries for discriminating reading difficulty at an extremely fine-grained level and their corresponding effects if values are perturbed.

2021-10-01

  • cs.CL updates on arXiv.org

    Multi-granular Legal Topic Classification on Greek Legislation. (arXiv:2109.15298v1 [cs.CL])
    (0 min) In this work, we study the task of classifying legal texts written in the Greek language. We introduce and make publicly available a novel dataset based on Greek legislation, consisting of more than 47 thousand official, categorized Greek legislation resources. We experiment with this dataset and evaluate a battery of advanced methods and classifiers, ranging from traditional machine learning and RNN-based methods to state-of-the-art Transformer-based methods. We show that recurrent architectures with domain-specific word embeddings offer improved overall performance while being competitive even with transformer-based models. Finally, we show that cutting-edge multilingual and monolingual transformer-based models compete at the top of the classifiers' ranking, making us question the necessity of training monolingual transfer learning models as a rule of thumb. To the best of our knowledge, this is the first time the task of Greek legal text classification is considered in an open research project, while Greek is also a language with very limited NLP resources in general.
    Learning to Rank Question Answer Pairs with Bilateral Contrastive Data Augmentation. (arXiv:2106.11096v2 [cs.CL] UPDATED)
    (0 min) In this work, we propose a novel and easy-to-apply data augmentation strategy, namely Bilateral Generation (BiG), with a contrastive training objective for improving the performance of ranking question answer pairs with existing labeled data. Specifically, we synthesize pseudo-positive QA pairs in contrast to the original negative QA pairs with two pre-trained generation models, one for question generation, the other for answer generation, which are fine-tuned on the limited positive QA pairs from the original dataset. With the augmented dataset, we design a contrastive training objective for learning to rank question answer pairs. Experimental results on three benchmark datasets show that our method significantly improves the performance of ranking models by making full use of existing labeled data and can be easily applied to different ranking models.
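    A contrastive training objective for ranking could, for instance, take the form of an in-batch softmax over one positive and several negatives; a hedged sketch of that generic idea (not the paper's exact loss):

        import torch
        import torch.nn.functional as F

        def contrastive_rank_loss(q_emb, pos_emb, neg_embs, temperature=0.1):
            # q_emb: (batch, dim); pos_emb: (batch, dim); neg_embs: (batch, k, dim)
            q = F.normalize(q_emb, dim=-1)
            pos = F.normalize(pos_emb, dim=-1)
            neg = F.normalize(neg_embs, dim=-1)
            pos_score = (q * pos).sum(-1, keepdim=True)        # (batch, 1)
            neg_score = torch.einsum("bd,bkd->bk", q, neg)     # (batch, k)
            logits = torch.cat([pos_score, neg_score], dim=1) / temperature
            labels = torch.zeros(q.size(0), dtype=torch.long)  # positive is index 0
            return F.cross_entropy(logits, labels)

        loss = contrastive_rank_loss(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 5, 64))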
    Semi-Supervised Text Classification via Self-Pretraining. (arXiv:2109.15300v1 [cs.CL])
    (0 min) We present a neural semi-supervised learning model termed Self-Pretraining. Our model is inspired by the classic self-training algorithm. However, as opposed to self-training, Self-Pretraining is threshold-free, it can potentially update its belief about previously labeled documents, and can cope with the semantic drift problem. Self-Pretraining is iterative and consists of two classifiers. In each iteration, one classifier draws a random set of unlabeled documents and labels them. This set is used to initialize the second classifier, to be further trained by the set of labeled documents. The algorithm proceeds to the next iteration and the classifiers' roles are reversed. To improve the flow of information across the iterations and also to cope with the semantic drift problem, Self-Pretraining employs an iterative distillation process, transfers hypotheses across the iterations, utilizes a two-stage training model, uses an efficient learning rate schedule, and employs a pseudo-label transformation heuristic. We evaluated our model on three publicly available social media datasets. Our experiments show that Self-Pretraining outperforms the existing state-of-the-art semi-supervised classifiers across multiple settings. Our code is available at https://github.com/p-karisani/self_pretraining.
    Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. (arXiv:2104.08758v2 [cs.CL] UPDATED)
    (0 min) Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.
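    Blocklist filtering of the general kind applied when building such corpora can be illustrated in a few lines of Python; the word list and matching rule here are hypothetical placeholders, not the actual C4 filter:

        import re

        BLOCKLIST = {"offensiveword", "anotherbadword"}   # placeholder terms, not C4's list

        def keep_document(text):
            tokens = set(re.findall(r"\w+", text.lower()))
            return not (tokens & BLOCKLIST)               # drop the doc if any term appears

        docs = ["a harmless sentence", "a sentence containing offensiveword"]
        print([d for d in docs if keep_document(d)])      # only the first survives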
    GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph. (arXiv:2105.02605v2 [cs.CL] UPDATED)
    (0 min) The representation learning on textual graph is to generate low-dimensional embeddings for the nodes based on the individual textual features and the neighbourhood information. Recent breakthroughs on pretrained language models and graph neural networks push forward the development of corresponding techniques. The existing works mainly rely on the cascaded model architecture: the textual features of nodes are independently encoded by language models at first; the textual embeddings are aggregated by graph neural networks afterwards. However, the above architecture is limited due to the independent modeling of textual features. In this work, we propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models. With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow, so that each node's semantics can be accurately comprehended from a global perspective. In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated data and original data to reinforce its capability of integrating information on graph. Extensive evaluations are conducted on three large-scale benchmark datasets, where GraphFormers outperform the SOTA baselines with comparable running efficiency.
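    The nesting idea can be sketched as a layer that alternates token-level self-attention with aggregation over graph neighbours; a schematic PyTorch layer under that assumption (not the released GraphFormers code):

        import torch
        import torch.nn as nn

        class GraphFormerLayer(nn.Module):
            """Toy GNN-nested transformer layer: node summaries exchange neighbour info."""
            def __init__(self, dim, n_heads=4):
                super().__init__()
                self.text_attn = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
                self.graph_agg = nn.Linear(2 * dim, dim)   # fuse node summary with neighbours

            def forward(self, tokens, adj):
                # tokens: (n_nodes, seq_len, dim); adj: (n_nodes, n_nodes), row-normalised
                tokens = self.text_attn(tokens)
                summary = tokens[:, 0]                      # first token as node summary
                neighbour = adj @ summary                   # weighted mean over neighbours
                fused = torch.relu(self.graph_agg(torch.cat([summary, neighbour], dim=-1)))
                tokens = tokens.clone()
                tokens[:, 0] = fused                        # write the fused summary back
                return tokens

        layer = GraphFormerLayer(dim=32)
        tokens = torch.randn(5, 12, 32)                     # 5 nodes, 12 tokens each
        adj = torch.full((5, 5), 0.2)                       # toy dense, row-normalised graph
        out = layer(tokens, adj)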
    Automatic Rule Generation for Time Expression Normalization. (arXiv:2108.13658v2 [cs.CL] UPDATED)
    (0 min) The understanding of time expressions includes two sub-tasks: recognition and normalization. In recent years, significant progress has been made in the recognition of time expressions while research on normalization has lagged behind. Existing SOTA normalization methods highly rely on rules or grammars designed by experts, which limits their performance on emerging corpora, such as social media texts. In this paper, we model time expression normalization as a sequence of operations to construct the normalized temporal value, and we present a novel method called ARTime, which can automatically generate normalization rules from training data without expert interventions. Specifically, ARTime automatically captures possible operation sequences from annotated data and generates normalization rules on time expressions with common surface forms. The experimental results show that ARTime can significantly surpass SOTA methods on the Tweets benchmark, and achieves competitive results with existing expert-engineered rule methods on the TempEval-3 benchmark.
    MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction. (arXiv:2109.15290v1 [cs.CL])
    (0 min) An overwhelmingly large amount of knowledge in the materials domain is generated and stored as text published in peer-reviewed scientific literature. Recent developments in natural language processing, such as bidirectional encoder representations from transformers (BERT) models, provide promising tools to extract information from these texts. However, direct application of these models in the materials domain may yield suboptimal results as the models themselves may not be trained on notations and jargon that are specific to the domain. Here, we present a materials-aware language model, namely, MatSciBERT, which is trained on a large corpus of scientific literature published in the materials domain. We further evaluate the performance of MatSciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction, on different materials datasets. We show that MatSciBERT outperforms SciBERT, a language model trained on a science corpus, on all the tasks. Further, we discuss some of the applications of MatSciBERT in the materials domain for extracting information, which can, in turn, contribute to materials discovery or optimization. Finally, to make the work accessible to the larger materials community, we make the pretrained and finetuned weights and the models of MatSciBERT freely accessible.
    Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts. (arXiv:2105.01542v6 [cs.CL] UPDATED)
    (0 min) Machine reading comprehension (MRC) is a sub-field in natural language processing that aims to help computers understand unstructured texts and then answer questions related to them. In practice, conversation is an essential way to communicate and transfer information. To help machines understand conversation texts, we present UIT-ViCoQA, a new corpus for conversational machine reading comprehension in the Vietnamese language. This corpus consists of 10,000 questions with answers over 2,000 conversations about health news articles. Then, we evaluate several baseline approaches for conversational machine comprehension on the UIT-ViCoQA corpus. The best model obtains an F1 score of 45.27%, which is 30.91 points behind human performance (76.18%), indicating that there is ample room for improvement. Our dataset is available at our website: this http URL for research purposes.
    Feature-Rich Named Entity Recognition for Bulgarian Using Conditional Random Fields. (arXiv:2109.15121v1 [cs.CL])
    (0 min) The paper presents a feature-rich approach to the automatic recognition and categorization of named entities (persons, organizations, locations, and miscellaneous) in news text for Bulgarian. We combine well-established features used for other languages with language-specific lexical, syntactic and morphological information. In particular, we make use of the rich tagset annotation of the BulTreeBank (680 morpho-syntactic tags), from which we derive suitable task-specific tagsets (local and nonlocal). We further add domain-specific gazetteers and additional unlabeled data, achieving F1=89.4%, which is comparable to the state-of-the-art results for English.
    Compositional generalization in semantic parsing with pretrained transformers. (arXiv:2109.15101v1 [cs.CL])
    (0 min) Large-scale pretraining instills large amounts of knowledge in deep neural networks. This, in turn, improves the generalization behavior of these models in downstream tasks. What exactly are the limits to the generalization benefits of large-scale pretraining? Here, we report observations from some simple experiments aimed at addressing this question in the context of two semantic parsing tasks involving natural language, SCAN and COGS. We show that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization in these benchmarks, compared with models trained from scratch, even though both benchmarks are English-based. This demonstrates the surprisingly broad transferability of pretrained representations and knowledge. Pretraining with a large-scale protein sequence prediction task, on the other hand, mostly deteriorates the generalization performance in SCAN and COGS, suggesting that pretrained representations do not transfer universally and that there are constraints on the similarity between the pretraining and downstream domains for successful transfer. Finally, we show that larger models are harder to train from scratch and their generalization accuracy is lower when trained up to convergence on the relatively small SCAN and COGS datasets, but the benefits of large-scale pretraining become much clearer with larger models.
    A formal model for ledger management systems based on contracts and temporal logic. (arXiv:2109.15212v1 [cs.CR])
    (0 min) A key component of blockchain technology is the ledger, viz., a database that, unlike standard databases, keeps in memory the complete history of past transactions as in a notarial archive for the benefit of any future test. In second-generation blockchains such as Ethereum the ledger is coupled with smart contracts, which enable the automation of transactions associated with agreements between the parties of a financial or commercial nature. The coupling of smart contracts and ledgers provides the technological background for very innovative application areas, such as Decentralized Autonomous Organizations (DAOs), Initial Coin Offerings (ICOs) and Decentralized Finance (DeFi), which propelled blockchains beyond cryptocurrencies that were the only focus of first-generation blockchains such as Bitcoin. However, the currently used implementation of smart contracts as arbitrary programming constructs has made them susceptible to dangerous bugs that can be exploited maliciously and has moved their semantics away from that of legal contracts. We propose here to recompose the split and recover the reliability of databases by formalizing a notion of contract modelled as a finite-state automaton with well-defined computational characteristics derived from an encoding in terms of allocations of resources to actors, as an alternative to the approach based on programming. To complete the work, we use temporal logic as the basis for an abstract query language that is effectively suited to the historical nature of the information kept in the ledger.
    Improved statistical machine translation using monolingual paraphrases. (arXiv:2109.15119v1 [cs.CL])
    (0 min) We propose a novel monolingual sentence paraphrasing method for augmenting the training data for statistical machine translation systems "for free" -- by creating it from data that is already available rather than having to create more aligned data. Starting with a syntactic tree, we recursively generate new sentence variants where noun compounds are paraphrased using suitable prepositions, and vice-versa -- preposition-containing noun phrases are turned into noun compounds. The evaluation shows an improvement equivalent to 33%-50% of that of doubling the amount of training data.
    Measuring Sentence-Level and Aspect-Level (Un)certainty in Science Communications. (arXiv:2109.14776v1 [cs.CL])
    (0 min) Certainty and uncertainty are fundamental to science communication. Hedges have widely been used as proxies for uncertainty. However, certainty is a complex construct, with authors expressing not only the degree but the type and aspects of uncertainty in order to give the reader a certain impression of what is known. Here, we introduce a new study of certainty that models both the level and the aspects of certainty in scientific findings. Using a new dataset of 2167 annotated scientific findings, we demonstrate that hedges alone account for only a partial explanation of certainty. We show that both the overall certainty and individual aspects can be predicted with pre-trained language models, providing a more complete picture of the author's intended communication. Downstream analyses on 431K scientific findings from news and scientific abstracts demonstrate that modeling sentence-level and aspect-level certainty is meaningful for areas like science communication. Both the model and datasets used in this paper are released at https://blablablab.si.umich.edu/projects/certainty/.
    Multilingual AMR Parsing with Noisy Knowledge Distillation. (arXiv:2109.15196v1 [cs.CL])
    (0 min) We study multilingual AMR parsing from the perspective of knowledge distillation, where the aim is to learn and improve a multilingual AMR parser by using an existing English parser as its teacher. We constrain our exploration in a strict multilingual setting: there is but one model to parse all different languages including English. We identify that noisy input and precise output are the key to successful distillation. Together with extensive pre-training, we obtain an AMR parser whose performances surpass all previously published results on four different foreign languages, including German, Spanish, Italian, and Chinese, by large margins (up to 18.8 Smatch points on Chinese and on average 11.3 Smatch points). Our parser also achieves comparable performance on English to the latest state-of-the-art English-only parser.
    Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism. (arXiv:2109.15153v1 [cs.CL])
    (0 min) In the past decade, sarcasm detection has been studied intensively in the textual scenario. With the popularization of video communication, the analysis in multi-modal scenarios has received much attention in recent years. Therefore, multi-modal sarcasm detection, which aims at detecting sarcasm in video conversations, has become an increasingly active topic in both the natural language processing and multi-modal analysis communities. In this paper, considering that sarcasm is often conveyed through incongruity between modalities (e.g., text expressing a compliment while acoustic tone indicating a grumble), we construct a Contrastive-Attention-based Sarcasm Detection (ConAttSD) model, which uses an inter-modality contrastive attention mechanism to extract several contrastive features for an utterance. A contrastive feature represents the incongruity of information between two modalities. Our experiments on MUStARD, a benchmark multi-modal sarcasm dataset, demonstrate the effectiveness of the proposed ConAttSD model.
    SUper Team at SemEval-2016 Task 3: Building a feature-rich system for community question answering. (arXiv:2109.15120v1 [cs.CL])
    (0 min) We present the system we built for participating in SemEval-2016 Task 3 on Community Question Answering. We achieved the best results on subtask C, and strong results on subtasks A and B, by combining a rich set of various types of features: semantic, lexical, metadata, and user-related. The most important group turned out to be the metadata for the question and for the comment, semantic vectors trained on QatarLiving data and similarities between the question and the comment for subtasks A and C, and between the original and the related question for Subtask B.
    Does My Representation Capture X? Probe-Ably. (arXiv:2104.05807v2 [cs.LG] UPDATED)
    (0 min) Probing (or diagnostic classification) has become a popular strategy for investigating whether a given set of intermediate features is present in the representations of neural models. Probing studies may have misleading results, but various recent works have suggested more reliable methodologies that compensate for the possible pitfalls of probing. However, these best practices are numerous and fast-evolving. To simplify the process of running a set of probing experiments in line with suggested methodologies, we introduce Probe-Ably: an extendable probing framework which supports and automates the application of probing methods to the user's inputs.
    Integrating Graph Contextualized Knowledge into Pre-trained Language Models. (arXiv:1912.00147v3 [cs.CL] UPDATED)
    (0 min) Complex node interactions are common in knowledge graphs, and these interactions also contain rich knowledge information. However, traditional methods usually treat a triple as a training unit during the knowledge representation learning (KRL) procedure, neglecting contextualized information of the nodes in knowledge graphs (KGs). We generalize the modeling object to a very general form, which theoretically supports any subgraph extracted from the knowledge graph, and these subgraphs are fed into a novel transformer-based model to learn the knowledge embeddings. To broaden usage scenarios of knowledge, pre-trained language models are utilized to build a model that incorporates the learned knowledge representations. Experimental results demonstrate that our model achieves the state-of-the-art performance on several medical NLP tasks, and improvement above TransE indicates that our KRL method captures the graph contextualized information effectively.
    Meta Sequence Learning for Generating Adequate Question-Answer Pairs. (arXiv:2010.01620v2 [cs.CL] UPDATED)
    (0 min) Creating multiple-choice questions to assess reading comprehension of a given article involves generating question-answer pairs (QAPs) on the main points of the document. We present a learning scheme to generate adequate QAPs via meta-sequence representations of sentences. A meta sequence is a sequence of vectors comprising semantic and syntactic tags. In particular, we devise a scheme called MetaQA to learn meta sequences from training data to form pairs of a meta sequence for a declarative sentence (MD) and a corresponding interrogative sentence (MIs). Given a declarative sentence, a trained MetaQA model converts it to a meta sequence, finds a matched MD, and uses the corresponding MIs and the input sentence to generate QAPs. We implement MetaQA for the English language using semantic-role labeling, part-of-speech tagging, and named-entity recognition, and show that, trained on a small dataset, MetaQA efficiently generates a large number of syntactically and semantically correct QAPs over the official SAT practice reading tests, with over 97% accuracy.
    Prose2Poem: The blessing of Transformer-based Language Models in translating Prose to Persian Poetry. (arXiv:2109.14934v1 [cs.CL])
    (0 min) Persian Poetry has consistently expressed its philosophy, wisdom, speech, and rationale on the basis of its couplets, making it an enigmatic language on its own to both native and non-native speakers. Nevertheless, the noticeable gap between Persian prose and poem has left the two pieces of literature medium-less. Having curated a parallel corpus of prose and their equivalent poems, we introduce a novel Neural Machine Translation (NMT) approach to translate prose to ancient Persian poetry using transformer-based Language Models in an extremely low-resource setting. More specifically, we trained a Transformer model from scratch to obtain initial translations and pretrained different variations of BERT to obtain final translations. To address the challenge of using masked language modelling under poeticness criteria, we heuristically joined the two models and generated valid poems in terms of automatic and human assessments. Final results demonstrate the eligibility and creativity of our novel heuristically aided approach among literature professionals and non-professionals in generating novel Persian poems.
    BERT got a Date: Introducing Transformers to Temporal Tagging. (arXiv:2109.14927v1 [cs.CL])
    (0 min) Temporal expressions in text play a significant role in language understanding and correctly identifying them is fundamental to various retrieval and natural language processing systems. Previous works have slowly shifted from rule-based to neural architectures, capable of tagging expressions with higher accuracy. However, neural models cannot yet distinguish between different expression types at the same level as their rule-based counterparts. In this work, we aim to identify the most suitable transformer architecture for joint temporal tagging and type classification, as well as to investigate the effect of semi-supervised training on the performance of these systems. After studying variants of token classification and encoder-decoder architectures, we ultimately present a transformer encoder-decoder model using the RoBERTa language model as our best performing system. By supplementing training resources with weakly labeled data from rule-based systems, our model surpasses previous works in temporal tagging and type classification, especially on rare classes. Additionally, we make the code and pre-trained experiments publicly available.
    SlovakBERT: Slovak Masked Language Model. (arXiv:2109.15254v1 [cs.CL])
    (0 min) We introduce a new Slovak masked language model called SlovakBERT in this paper. It is the first Slovak-only transformers-based model trained on a sizeable corpus. We evaluate the model on several NLP tasks and achieve state-of-the-art results. We publish the masked language model, as well as the subsequently fine-tuned models for part-of-speech tagging, sentiment analysis and semantic textual similarity.
    A surprisal--duration trade-off across and within the world's languages. (arXiv:2109.15000v1 [cs.CL])
    (0 min) While there exist scores of natural languages, each with its unique features and idiosyncrasies, they all share a unifying theme: enabling human communication. We may thus reasonably predict that human cognition shapes how these languages evolve and are used. Assuming that the capacity to process information is roughly constant across human populations, we expect a surprisal--duration trade-off to arise both across and within languages. We analyse this trade-off using a corpus of 600 languages and, after controlling for several potential confounds, we find strong supporting evidence in both settings. Specifically, we find that, on average, phones are produced faster in languages where they are less surprising, and vice versa. Further, we confirm that more surprising phones are longer, on average, in 319 languages out of the 600. We thus conclude that there is strong evidence of a surprisal--duration trade-off in operation, both across and within the world's languages.
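    A minimal way to probe such a trade-off within a single language is to correlate per-phone surprisal with duration; the sketch below uses synthetic data and applies none of the paper's confound controls:

        import numpy as np
        from scipy.stats import spearmanr

        rng = np.random.default_rng(0)
        probs = rng.uniform(0.01, 0.5, size=200)                 # toy unigram phone probabilities
        surprisal = -np.log2(probs)                               # surprisal in bits
        duration = 40 + 8 * surprisal + rng.normal(0, 5, 200)     # duration in ms, with noise
        rho, p = spearmanr(surprisal, duration)
        print(f"Spearman rho={rho:.2f}, p={p:.1e}")               # positive under the trade-off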
    Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments. (arXiv:2109.15207v1 [cs.CV])
    (0 min) In the Vision-and-Language Navigation (VLN) task an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle 'off the path' scenarios where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent's location to the goal, but such goal-oriented supervision is often not in alignment with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
    CrossAug: A Contrastive Data Augmentation Method for Debiasing Fact Verification Models. (arXiv:2109.15107v1 [cs.CL])
    (0 min) Fact verification datasets are typically constructed using crowdsourcing techniques due to the lack of text sources with veracity labels. However, the crowdsourcing process often produces undesired biases in data that cause models to learn spurious patterns. In this paper, we propose CrossAug, a contrastive data augmentation method for debiasing fact verification models. Specifically, we employ a two-stage augmentation pipeline to generate new claims and evidence from existing samples. The generated samples are then paired cross-wise with the original pair, forming contrastive samples that encourage the model to rely less on spurious patterns and learn more robust representations. Experimental results show that our method outperforms the previous state-of-the-art debiasing technique by 3.6% on the debiased extension of the FEVER dataset, with a total performance boost of 10.13% from the baseline. Furthermore, we evaluate our approach in data-scarce settings, where models can be more susceptible to biases due to the lack of training data. Experimental results demonstrate that our approach is also effective at debiasing in these low-resource conditions, exceeding the baseline performance on the Symmetric dataset with just 1% of the original data.
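    The cross-wise pairing step can be pictured as follows, assuming one generated (contradicting) claim and one generated evidence per original pair; the labels are illustrative of the scheme rather than taken from the paper's code:

        # Illustrative cross-wise pairing of original and generated samples.
        def crosswise_pairs(claim, evidence, new_claim, new_evidence):
            # new_claim contradicts the original evidence; new_evidence supports new_claim
            return [
                (claim, evidence, "SUPPORTS"),
                (new_claim, evidence, "REFUTES"),
                (new_claim, new_evidence, "SUPPORTS"),
                (claim, new_evidence, "REFUTES"),
            ]

        pairs = crosswise_pairs(
            "The Eiffel Tower is in Paris.",
            "The Eiffel Tower is a landmark in Paris, France.",
            "The Eiffel Tower is in Berlin.",
            "The Eiffel Tower is a landmark in Berlin, Germany.",
        )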
    A Review of Text Style Transfer using Deep Learning. (arXiv:2109.15144v1 [cs.CL])
    (0 min) Style is an integral component of a sentence indicated by the choice of words a person makes. Different people have different ways of expressing themselves; however, they adjust their speaking and writing style to a social context, an audience, an interlocutor or the formality of an occasion. Text style transfer is defined as a task of adapting and/or changing the stylistic manner in which a sentence is written, while preserving the meaning of the original sentence. A systematic review of text style transfer methodologies using deep learning is presented in this paper. We point out the technological advances in deep neural networks that have been the driving force behind current successes in the fields of natural language understanding and generation. The review is structured around two key stages in the text style transfer process, namely, representation learning and sentence generation in a new style. The discussion highlights the commonalities and differences between proposed solutions as well as challenges and opportunities that are expected to direct and foster further research in the field.
    Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks. (arXiv:2109.15256v1 [cs.CL])
    (0 min) Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions. However, existing neural models have been shown to lack this basic ability in learning symbolic structures. Motivated by the failure of a Transformer model on the SCAN compositionality challenge (Lake and Baroni, 2018), which requires parsing a command into actions, we propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics, as additional training supervision. These automatically-generated sequences are more representative of the underlying compositional symbolic structures of the input data. During inference, the model jointly predicts the next action and the next tokens in the auxiliary sequences at each step. Experiments on the SCAN dataset show that our method encourages the Transformer to understand compositional structures of the command, improving its accuracy on multiple challenging splits from <= 10% to 100%. With only 418 (5%) training instances, our approach still achieves 97.8% accuracy on the MCD1 split. Therefore, we argue that compositionality can be induced in Transformers given minimal but proper guidance. We also show that a better result is achieved using less contextualized vectors as the attention's query, providing insights into architecture choices in achieving systematic compositionality. Finally, we show positive generalization results on the groundedSCAN task (Ruis et al., 2020). Our code is publicly available at: https://github.com/jiangycTarheel/compositional-auxseq
    Overview of the CLEF-2019 CheckThat!: Automatic Identification and Verification of Claims. (arXiv:2109.15118v1 [cs.CL])
    (0 min) We present an overview of the second edition of the CheckThat! Lab at CLEF 2019. The lab featured two tasks in two different languages: English and Arabic. Task 1 (English) challenged the participating systems to predict which claims in a political debate or speech should be prioritized for fact-checking. Task 2 (Arabic) asked to (A) rank a given set of Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) classify these same Web pages according to their degree of usefulness for fact-checking the target claim, (C) identify useful passages from these pages, and (D) use the useful pages to predict the claim's factuality. CheckThat! provided a full evaluation framework, consisting of data in English (derived from fact-checking sources) and Arabic (gathered and annotated from scratch) and evaluation based on mean average precision (MAP) and normalized discounted cumulative gain (nDCG) for ranking, and F1 for classification. A total of 47 teams registered to participate in this lab, and fourteen of them actually submitted runs (compared to nine last year). The evaluation results show that the most successful approaches to Task 1 used various neural networks and logistic regression. As for Task 2, learning-to-rank was used by the highest scoring runs for subtask A, while different classifiers were used in the other subtasks. We release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important tasks of check-worthiness estimation and automatic claim verification.
    Focused Contrastive Training for Test-based Constituency Analysis. (arXiv:2109.15159v1 [cs.CL])
    (0 min) We propose a scheme for self-training of grammaticality models for constituency analysis based on linguistic tests. A pre-trained language model is fine-tuned by contrastive estimation of grammatical sentences from a corpus, and ungrammatical sentences that were perturbed by a syntactic test, a transformation that is motivated by constituency theory. We show that consistent gains can be achieved if only certain positive instances are chosen for training, depending on whether they could be the result of a test transformation. This way, the positives and negatives exhibit similar characteristics, which makes the objective more challenging for the language model, and also allows for additional markup that indicates the position of the test application within the sentence.
    Detecting Cross-Geographic Biases in Toxicity Modeling on Social Media. (arXiv:2104.06999v2 [cs.CL] UPDATED)
    (0 min) Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in sub-par performance on content relevant to marginalized groups, potentially furthering disproportionate harms towards them. Studies on such biases so far have focused on only a handful of axes of disparities and subgroups that have annotations/lexicons available. Consequently, biases concerning non-Western contexts are largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geocultural contexts. Through a case study on a publicly available toxicity detection model, we demonstrate that our method identifies salient groups of cross-geographic errors, and, in a follow up, demonstrate that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts. We also conduct analysis of a model trained on a dataset with ground truth labels to better understand these biases, and present preliminary mitigation experiments.
    Deep Neural Compression Via Concurrent Pruning and Self-Distillation. (arXiv:2109.15014v1 [cs.LG])
    (0 min) Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel self-distillation based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed cross-correlation objective for self-distilled pruning implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against (6 times) larger distilled networks. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization.
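    A cross-correlation objective between pruned and unpruned representations might be sketched along Barlow-Twins-like lines; this is a hedged illustration of the general idea, not the paper's exact formulation:

        import torch

        def cross_correlation_loss(z_teacher, z_student, off_diag_weight=5e-3):
            # z_*: (batch, dim) representations of the same inputs from the
            # unpruned ("teacher") and pruned ("student") versions of a network
            zt = (z_teacher - z_teacher.mean(0)) / (z_teacher.std(0) + 1e-6)
            zs = (z_student - z_student.mean(0)) / (z_student.std(0) + 1e-6)
            c = (zt.T @ zs) / zt.size(0)                     # (dim, dim) cross-correlation
            on_diag = ((torch.diagonal(c) - 1) ** 2).sum()   # match each feature with itself
            off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
            return on_diag + off_diag_weight * off_diag

        loss = cross_correlation_loss(torch.randn(32, 64), torch.randn(32, 64))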
    Syntactic Persistence in Language Models: Priming as a Window into Abstract Language Representations. (arXiv:2109.14989v1 [cs.CL])
    (2 min) We investigate the extent to which modern, neural language models are susceptible to syntactic priming, the phenomenon where the syntactic structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the nature of the syntactic knowledge acquired by these models. We introduce a novel metric and release Prime-LM, a large corpus where we control for various linguistic factors which interact with priming strength. We find that recent large Transformer models indeed show evidence of syntactic priming, but also that the syntactic generalisations learned by these models are to some extent modulated by semantic information. We report surprisingly strong priming effects when priming with multiple sentences, each with different words and meaning but with identical syntactic structure. We conclude that the syntactic priming paradigm is a highly useful, additional tool for gaining insights into the capacities of language models.
    Privacy Policy Question Answering Assistant: A Query-Guided Extractive Summarization Approach. (arXiv:2109.14638v1 [cs.CL])
    (2 min) Existing work on making privacy policies accessible has explored new presentation forms such as color-coding based on the risk factors or summarization to assist users with conscious agreement. To facilitate a more personalized interaction with the policies, in this work, we propose an automated privacy policy question answering assistant that extracts a summary in response to the input user query. This is a challenging task because users articulate their privacy-related questions in a very different language than the legal language of the policy, making it difficult for the system to understand their inquiry. Moreover, existing annotated data in this domain are limited. We address these problems by paraphrasing to bring the style and language of the user's question closer to the language of privacy policies. Our content scoring module uses the existing in-domain data to find relevant information in the policy and incorporates it in a summary. Our pipeline is able to find an answer for 89% of the user queries in the privacyQA dataset.
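    At its simplest, a content scoring step of this kind could be approximated by lexical similarity between the (paraphrased) query and policy sentences; a minimal TF-IDF sketch under that assumption (the paper's scorer is trained on in-domain data):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        policy_sentences = [
            "We collect your location data to improve our services.",
            "Payment information is processed by a third-party provider.",
            "You can request deletion of your account data at any time.",
        ]
        query = "Do you share my payment details with other companies?"

        vec = TfidfVectorizer().fit(policy_sentences + [query])
        scores = cosine_similarity(vec.transform([query]), vec.transform(policy_sentences))[0]
        # keep the top-scoring sentences as the extractive answer summary
        summary = [s for _, s in sorted(zip(scores, policy_sentences), reverse=True)[:2]]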
    Classifying Tweet Sentiment Using the Hidden State and Attention Matrix of a Fine-tuned BERTweet Model. (arXiv:2109.14692v1 [cs.CL])
    (2 min) This paper introduces a study on tweet sentiment classification. Our task is to classify a tweet as either positive or negative. We approach the problem in two steps, namely embedding and classifying. Our baseline methods include several combinations of traditional embedding methods and classification algorithms. Furthermore, we explore the current state-of-the-art tweet analysis model, BERTweet, and propose a novel approach in which features are engineered from the hidden states and attention matrices of the model, inspired by empirical study of the tweets. Using a multi-layer perceptron trained with a high dropout rate for classification, our proposed approach achieves a validation accuracy of 0.9111.
    Key Point Analysis via Contrastive Learning and Extractive Argument Summarization. (arXiv:2109.15086v1 [cs.CL])
    (2 min) Key point analysis is the task of extracting a set of concise and high-level statements from a given collection of arguments, representing the gist of these arguments. This paper presents our proposed approach to the Key Point Analysis shared task, collocated with the 8th Workshop on Argument Mining. The approach integrates two complementary components. One component employs contrastive learning via a siamese neural network for matching arguments to key points; the other is a graph-based extractive summarization model for generating key points. In both automatic and manual evaluation, our approach was ranked best among all submissions to the shared task.
    Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by Machine Translation Systems. (arXiv:2109.14895v1 [cs.CL])
    (2 min) In translating text where sentiment is the main message, human translators give particular attention to sentiment-carrying words. The reason is that an incorrect translation of such words would miss the fundamental aspect of the source text, i.e. the author's sentiment. In the online world, MT systems are extensively used to translate User-Generated Content (UGC) such as reviews, tweets, and social media posts, where the main message is often the author's positive or negative attitude towards the topic of the text. In such scenarios, it is important to accurately measure how reliably an MT system transfers the correct affective message in real-life use. This paper tackles an under-recognised problem in the field of machine translation evaluation: judging to what extent automatic metrics concur with the gold standard of human evaluation in identifying a correct translation of sentiment. We evaluate the efficacy of conventional quality metrics in spotting a mistranslation of sentiment, especially when it is the sole error in the MT output. We propose a numerical 'sentiment-closeness' measure appropriate for assessing the accuracy of a translated affect message in UGC text by an MT system. We show that incorporating this sentiment-aware measure can significantly enhance the correlation of some available quality metrics with the human judgement of an accurate translation of sentiment.
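    One way to picture a sentiment-aware measure is as a conventional quality score blended with a sentiment-closeness term; the weighting and sentiment values below are placeholders, not the proposed SAM:

        # Toy combination of a quality metric with a sentiment-closeness term.
        def sentiment_aware_score(quality, src_sentiment, mt_sentiment, alpha=0.5):
            # quality in [0, 1] (e.g. a BLEU-like score); sentiments in [-1, 1]
            closeness = 1.0 - abs(src_sentiment - mt_sentiment) / 2.0
            return (1 - alpha) * quality + alpha * closeness

        # A fluent translation that flips the author's sentiment is penalised:
        print(sentiment_aware_score(0.9, src_sentiment=-0.8, mt_sentiment=0.7))  # ~0.58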
    COVID-19 Fake News Detection Using Bidirectiona lEncoder Representations from Transformers Based Models. (arXiv:2109.14816v1 [cs.CL])
    (2 min) Nowadays, the development of social media allows people to access the latest news easily. During the COVID-19 pandemic, it is important for people to access the news so that they can take corresponding protective measures. However, fake news is flooding these platforms and is a serious issue, especially during the global pandemic. Misleading fake news can cause significant harm to individuals and to society. COVID-19 fake news detection has become a novel and important task in the NLP field. However, fake news always contains both correct and incorrect portions, which increases the difficulty of the classification task. In this paper, we fine-tune the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model as our base model. We add BiLSTM layers and CNN layers on top of the fine-tuned BERT model, with either frozen or unfrozen BERT parameters. The evaluation results show that our best model (the fine-tuned BERT model with frozen parameters plus BiLSTM layers) achieves state-of-the-art results on the COVID-19 fake news detection task. We also explore keyword evaluation methods using our best model and evaluate its performance after removing keywords.
    Process discovery on deviant traces and other stranger things. (arXiv:2109.14883v1 [cs.LO])
    (2 min) As the need to understand and formalise business processes into a model has grown over the last years, the process discovery research field has gained more and more importance, developing two different classes of approaches to model representation: procedural and declarative. Orthogonally to this classification, the vast majority of works envisage the discovery task as a one-class supervised learning process guided by the traces that are recorded into an input log. In this work instead, we focus on declarative processes and embrace the less-popular view of process discovery as a binary supervised learning task, where the input log reports both examples of the normal system execution, and traces representing "stranger" behaviours according to the domain semantics. We therefore investigate how the valuable information carried by both these sets can be extracted and formalised into a model that is "optimal" according to user-defined goals. Our approach, namely NegDis, is evaluated w.r.t. other relevant works in this field, and shows promising results regarding both the performance and the quality of the obtained solution.
    Towards Efficient Post-training Quantization of Pre-trained Language Models. (arXiv:2109.15082v1 [cs.CL])
    (2 min) Network quantization has gained increasing attention with the rapid growth of large pre-trained language models (PLMs). However, most existing quantization methods for PLMs follow quantization-aware training (QAT) that requires end-to-end training with full access to the entire dataset. Therefore, they suffer from slow training, large memory overhead, and data security issues. In this paper, we study post-training quantization (PTQ) of PLMs, and propose module-wise quantization error minimization (MREM), an efficient solution to mitigate these issues. By partitioning the PLM into multiple modules, we minimize the reconstruction error incurred by quantization for each module. In addition, we design a new model parallel training strategy such that each module can be trained locally on separate computing devices without waiting for preceding modules, which brings nearly the theoretical training speed-up (e.g., 4x on 4 GPUs). Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
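    Module-wise reconstruction can be sketched as calibrating each quantized module against its full-precision counterpart on a small calibration set; the fake-quantizer and loop below are simplified stand-ins, not the paper's MREM implementation:

        import torch
        import torch.nn as nn

        def fake_quant(w, bits=8):
            # naive symmetric fake-quantization stand-in
            scale = w.abs().max() / (2 ** (bits - 1) - 1)
            return torch.round(w / scale) * scale

        fp_module = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
        q_module = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
        q_module.load_state_dict(fp_module.state_dict())
        with torch.no_grad():
            for p in q_module.parameters():
                p.copy_(fake_quant(p))            # start from quantized weights

        opt = torch.optim.Adam(q_module.parameters(), lr=1e-3)
        calib = torch.randn(256, 16)              # small calibration set
        for _ in range(100):
            x = calib[torch.randint(0, 256, (32,))]
            # minimise this module's reconstruction error against its FP counterpart
            loss = nn.functional.mse_loss(q_module(x), fp_module(x).detach())
            opt.zero_grad()
            loss.backward()
            opt.step()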
    Phonetic Word Embeddings. (arXiv:2109.14796v1 [cs.CL])
    (2 min) This work presents a novel methodology for calculating the phonetic similarity between words taking motivation from the human perception of sounds. This metric is employed to learn a continuous vector embedding space that groups similar sounding words together and can be used for various downstream computational phonology tasks. The efficacy of the method is presented for two different languages (English, Hindi) and performance gains over previous reported works are discussed on established tests for predicting phonetic similarity. To address limited benchmarking mechanisms in this field, we also introduce a heterographic pun dataset based evaluation methodology to compare the effectiveness of acoustic similarity algorithms. Further, a visualization of the embedding space is presented with a discussion on the various possible use-cases of this novel algorithm. An open-source implementation is also shared to aid reproducibility and enable adoption in related tasks.
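    For flavour, a toy phonetic distance over phoneme sequences is shown below (a plain weighted edit distance; the paper's similarity is perceptually motivated and then embedded into a vector space, which this sketch does not attempt):

        def phoneme_distance(a, b, sub_cost=lambda x, y: 0.5 if x[0] == y[0] else 1.0):
            # a, b: lists of phonemes; substitutions between similar symbols cost less
            d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
            for i in range(1, len(a) + 1):
                d[i][0] = i
            for j in range(1, len(b) + 1):
                d[0][j] = j
            for i in range(1, len(a) + 1):
                for j in range(1, len(b) + 1):
                    cost = 0.0 if a[i-1] == b[j-1] else sub_cost(a[i-1], b[j-1])
                    d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
            return d[len(a)][len(b)]

        print(phoneme_distance(["K", "AE", "T"], ["K", "AA", "T"]))   # cat vs cot: small
        print(phoneme_distance(["K", "AE", "T"], ["D", "AO", "G"]))   # cat vs dog: larger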
    BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief. (arXiv:2109.14723v1 [cs.CL])
    (2 min) Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually "believes" about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs -- a BeliefBank -- that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component -- a weighted MaxSAT solver -- revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining.
    Tipping the Scales: A Corpus-Based Reconstruction of Adjective Scales in the McGill Pain Questionnaire. (arXiv:2109.14788v1 [cs.CL])
    (2 min) Modern medical diagnosis relies on precise pain assessment tools in translating clinical information from patient to physician. The McGill Pain Questionnaire (MPQ) is a clinical pain assessment technique that utilizes 78 adjectives of different intensities in 20 different categories to quantify a patient's pain. The questionnaire's efficacy depends on a predictable pattern of adjective use by patients experiencing pain. In this study, I recreate the MPQ's adjective intensity orderings using data gathered from patient forums and modern NLP techniques. I extract adjective intensity relationships by searching for key linguistic contexts, and then combine the relationship information to form robust adjective scales. Of 17 adjective relationships predicted by this research, 10 show agreement with the MPQ, which is statistically significant at the .1 alpha level. The results suggest predictable patterns of adjective use by people experiencing pain, but call into question the MPQ's categories for grouping adjectives.
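    The "key linguistic contexts" step can be illustrated by mining a single pattern such as "X but not Y", which suggests Y is the more intense adjective; a toy regex sketch over made-up forum posts, not the study's full pipeline:

        import re
        from collections import Counter

        PATTERN = re.compile(r"\b(\w+) but not (\w+)\b", re.IGNORECASE)
        posts = [
            "The pain was aching but not burning today.",
            "It felt sore but not excruciating after the walk.",
            "Still aching but not burning this morning.",
        ]
        relations = Counter()
        for post in posts:
            for weaker, stronger in PATTERN.findall(post):
                relations[(weaker.lower(), stronger.lower())] += 1
        print(relations.most_common())   # e.g. ('aching', 'burning') observed twice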
    DICoE@FinSim-3: Financial Hypernym Detection using Augmented Terms and Distance-based Features. (arXiv:2109.14906v1 [cs.CL])
    (2 min) We present the submission of team DICoE for FinSim-3, the 3rd Shared Task on Learning Semantic Similarities for the Financial Domain. The task provides a set of terms in the financial domain and requires to classify them into the most relevant hypernym from a financial ontology. After augmenting the terms with their Investopedia definitions, our system employs a Logistic Regression classifier over financial word embeddings and a mix of hand-crafted and distance-based features. Also, for the first time in this task, we employ different replacement methods for out-of-vocabulary terms, leading to improved performance. Finally, we have also experimented with word representations generated from various financial corpora. Our best-performing submission ranked 4th on the task's leaderboard.
    Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System. (arXiv:2109.14739v1 [cs.CL])
    (2 min) Pre-trained language models have been recently shown to benefit task-oriented dialogue (TOD) systems. Despite their success, existing methods often formulate this task as a cascaded generation problem which can lead to error accumulation across different sub-tasks and greater data annotation overhead. In this study, we present PPTOD, a unified plug-and-play model for task-oriented dialogue. In addition, we introduce a new dialogue multi-task pre-training strategy that allows the model to learn the primary TOD task completion skills from heterogeneous dialog corpora. We extensively test our model on three benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification. Experimental results show that PPTOD achieves new state of the art on all evaluated tasks in both high-resource and low-resource scenarios. Furthermore, comparisons against previous SOTA methods show that the responses generated by PPTOD are more factually correct and semantically coherent as judged by human annotators.
    Adaptive Approach For Sparse Representations Using The Locally Competitive Algorithm For Audio. (arXiv:2109.14705v1 [cs.SD])
    (2 min) Gammachirp filterbank has been used to approximate the cochlea in sparse coding algorithms. An oriented grid search optimization was applied to adapt the gammachirp's parameters and improve the Matching Pursuit (MP) algorithm's sparsity along with the reconstruction quality. However, this combination of a greedy algorithm with a grid search at each iteration is computationally demanding and not suitable for real-time applications. This paper presents an adaptive approach to optimize the gammachirp's parameters but in the context of the Locally Competitive Algorithm (LCA) that requires far fewer computations than MP. The proposed method consists of taking advantage of the LCA's neural architecture to automatically adapt the gammachirp's filterbank using the backpropagation algorithm. Results demonstrate an improvement in the LCA's performance with our approach in terms of sparsity, reconstruction quality, and convergence time. This approach can yield a significant advantage over existing approaches for real-time applications.

  • cs.CV updates on arXiv.org

    Egocentric Hand-object Interaction Detection and Application. (arXiv:2109.14734v1 [cs.CV])
    (0 min) In this paper, we present a method to detect hand-object interaction from an egocentric perspective. In contrast to massive data-driven discriminator-based methods such as that of Shan et al., we propose a novel workflow that utilises the cues of hand and object. Specifically, we train networks predicting hand pose, hand mask and in-hand object mask to jointly predict the hand-object interaction status. We compare our method with the most recent work from Shan et al. on selected images from the EPIC-KITCHENS dataset and achieve 89% accuracy on HOI (hand-object interaction) detection, which is comparable to Shan's (92%). However, for real-time performance on the same machine, our method can run at over 30 FPS, which is much more efficient than Shan's (1-2 FPS). Furthermore, with our approach, we are able to segment script-less activities by extracting the frames flagged by the HOI status detection. We achieve 68.2% and 82.8% F1 score on the GTEA and UTGrasp datasets respectively, both comparable to the SOTA methods.
    Data Augmentation and CNN Classification For Automatic COVID-19 Diagnosis From CT-Scan Images On Small Dataset. (arXiv:2108.07148v2 [eess.IV] UPDATED)
    (0 min) We present an automatic COVID-19 diagnosis framework from lung CT images. The focus is on signal processing and classification on small datasets, with effort put into exploring data preparation and augmentation to improve the generalization capability of the 2D CNN classification models. We propose a unique and effective data augmentation method using multiple Hounsfield Unit (HU) normalization windows. In addition, the original slice image is cropped to exclude background, and a filter is applied to remove closed-lung images. For the classification network, we choose to use 2D Densenet and Xception with the feature pyramid network (FPN). To further improve the classification accuracy, an ensemble of multiple CNN models and HU windows is used. On the training/validation dataset, we achieve a patient classification accuracy of 93.39%.
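    Multi-window HU normalization can be sketched as rescaling the same CT slice under several (center, width) windows; the window values below are common examples, not necessarily the paper's settings:

        import numpy as np

        def hu_window(slice_hu, center, width):
            lo, hi = center - width / 2, center + width / 2
            clipped = np.clip(slice_hu, lo, hi)
            return (clipped - lo) / (hi - lo)               # scale to [0, 1]

        slice_hu = np.random.randint(-1000, 400, size=(512, 512)).astype(np.float32)
        windows = [(-600, 1500), (-500, 1200), (40, 400)]   # e.g. lung and soft-tissue views
        augmented = [hu_window(slice_hu, c, w) for c, w in windows]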
    Representation Learning for Event-based Visuomotor Policies. (arXiv:2103.00806v2 [cs.CV] UPDATED)
    (0 min) Event-based cameras are dynamic vision sensors that provide asynchronous measurements of changes in per-pixel brightness at a microsecond level. This makes them significantly faster than conventional frame-based cameras, and an appealing choice for high-speed navigation. While an interesting sensor modality, this asynchronously streamed event data poses a challenge for machine learning techniques that are more suited for frame-based data. In this paper, we present an event variational autoencoder and show that it is feasible to learn compact representations directly from asynchronous spatiotemporal event data. Furthermore, we show that such pretrained representations can be used for event-based reinforcement learning instead of end-to-end reward driven perception. We validate this framework of learning event-based visuomotor policies by applying it to an obstacle avoidance scenario in simulation. Compared to techniques that treat event data as images, we show that representations learnt from event streams result in faster policy training, adapt to different control capacities, and demonstrate a higher degree of robustness.
    3D Pose Transfer with Correspondence Learning and Mesh Refinement. (arXiv:2109.15025v1 [cs.CV])
    (0 min) 3D pose transfer is one of the most challenging 3D generation tasks. It aims to transfer the pose of a source mesh to a target mesh and keep the identity (e.g., body shape) of the target mesh. Some previous works require key point annotations to build reliable correspondence between the source and target meshes, while other methods do not consider any shape correspondence between sources and targets, which leads to limited generation quality. In this work, we propose a correspondence-refinement network to help the 3D pose transfer for both human and animal meshes. The correspondence between source and target meshes is first established by solving an optimal transport problem. Then, we warp the source mesh according to the dense correspondence and obtain a coarse warped mesh. The warped mesh will be better refined with our proposed Elastic Instance Normalization, which is a conditional normalization layer and can help to generate high-quality meshes. Extensive experimental results show that the proposed architecture can effectively transfer the poses from source to target meshes and produce results with better visual quality than state-of-the-art methods.
    AffectGAN: Affect-Based Generative Art Driven by Semantics. (arXiv:2109.14845v1 [cs.CV])
    (0 min) This paper introduces a novel method for generating artistic images that express particular affective states. Leveraging state-of-the-art deep learning methods for visual generation (through generative adversarial networks), semantic models from OpenAI, and the annotated dataset of the visual art encyclopedia WikiArt, our AffectGAN model is able to generate images based on specific or broad semantic prompts and intended affective outcomes. A small dataset of 32 images generated by AffectGAN is annotated by 50 participants in terms of the particular emotion they elicit, as well as their quality and novelty. Results show that for most instances the intended emotion used as a prompt for image generation matches the participants' responses. This small-scale study brings forth a new vision towards blending affective computing with computational creativity, enabling generative systems with intentionality in terms of the emotions they wish their output to elicit.
    iShape: A First Step Towards Irregular Shape Instance Segmentation. (arXiv:2109.15068v1 [cs.CV])
    (0 min) In this paper, we introduce a brand new dataset to promote the study of instance segmentation for objects with irregular shapes. Our key observation is that though irregularly shaped objects widely exist in daily life and industrial scenarios, they have received little attention in the instance segmentation field due to the lack of corresponding datasets. To fill this gap, we propose iShape, an irregular shape dataset for instance segmentation. iShape contains six sub-datasets, one real and five synthetic, each representing a scene with a typical irregular shape. Unlike most existing instance segmentation datasets of regular objects, iShape has many characteristics that challenge existing instance segmentation algorithms, such as large overlaps between bounding boxes of instances, extreme aspect ratios, and large numbers of connected components per instance. We benchmark popular instance segmentation methods on iShape and find that their performance drops dramatically. Hence, we propose an affinity-based instance segmentation algorithm, called ASIS, as a stronger baseline. ASIS explicitly combines perception and reasoning to solve Arbitrary Shape Instance Segmentation including irregular objects. Experimental results show that ASIS outperforms the state-of-the-art on iShape. Dataset and code are available at https://ishape.github.io
    Sensor-Guided Optical Flow. (arXiv:2109.15321v1 [cs.CV])
    (0 min) This paper proposes a framework to guide an optical flow network with external cues to achieve superior accuracy either on known or unseen domains. Given the availability of sparse yet accurate optical flow hints from an external source, these are injected to modulate the correlation scores computed by a state-of-the-art optical flow network and guide it towards more accurate predictions. Although no real sensor can provide sparse flow hints, we show how these can be obtained by combining depth measurements from active sensors with geometry and hand-crafted optical flow algorithms, leading to accurate enough hints for our purpose. Experimental results with a state-of-the-art flow network on standard benchmarks support the effectiveness of our framework, both in simulated and real conditions.
    ScanMix: Learning from Severe Label Noise via Semantic Clustering and Semi-Supervised Learning. (arXiv:2103.11395v2 [cs.CV] UPDATED)
    (0 min) In this paper, we address the problem of training deep neural networks in the presence of severe label noise. Our proposed training algorithm, ScanMix, combines semantic clustering with semi-supervised learning (SSL) to improve the feature representations and enable an accurate identification of noisy samples, even in severe label noise scenarios. To be specific, ScanMix is designed based on the expectation maximisation (EM) framework, where the E-step estimates the value of a latent variable to cluster the training images based on their appearance representations and classification results, and the M-step optimises the SSL classification and learns effective feature representations via semantic clustering. In our evaluations, we show state-of-the-art results on standard benchmarks for symmetric, asymmetric and semantic label noise on CIFAR-10 and CIFAR-100, as well as large-scale real label noise on WebVision. Most notably, for the benchmarks contaminated with large noise rates (80% and above), our results are up to 27% better than the related work. The code is available at https://github.com/ragavsachdeva/ScanMix.
    Deep Homography Estimation in Dynamic Surgical Scenes for Laparoscopic Camera Motion Extraction. (arXiv:2109.15098v1 [eess.IV])
    (0 min) Current laparoscopic camera motion automation relies on rule-based approaches or only focuses on surgical tools. Imitation Learning (IL) methods could alleviate these shortcomings, but have so far been applied only to oversimplified setups. In this work we instead introduce a method that extracts a laparoscope holder's actions from videos of real laparoscopic interventions. We synthetically add camera motion to a newly acquired dataset of camera-motion-free da Vinci surgery image sequences through a novel homography generation algorithm. The synthetic camera motion serves as a supervisory signal for camera motion estimation that is invariant to object and tool motion. We perform an extensive evaluation of state-of-the-art (SOTA) Deep Neural Networks (DNNs) across multiple compute regimes, finding that our method transfers from the camera-motion-free da Vinci surgery dataset to videos of laparoscopic interventions, outperforming classical homography estimation approaches in both precision (by 41%) and CPU runtime (by 43%).
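    A minimal sketch of synthesizing camera motion with random homographies (corner-jitter magnitude and the OpenCV warping below are illustrative, not the paper's homography generation algorithm):

        import cv2
        import numpy as np

        def random_homography(h, w, max_shift=32, rng=None):
            """Sample a homography by jittering the four image corners."""
            rng = rng or np.random.default_rng()
            src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
            dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
            return cv2.getPerspectiveTransform(src, dst)

        def warp_with_synthetic_motion(frame):
            """Warp a camera-motion-free frame; the homography is the supervision signal."""
            h, w = frame.shape[:2]
            H = random_homography(h, w)
            warped = cv2.warpPerspective(frame, H, (w, h))
            return warped, H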
    The Object at Hand: Automated Editing for Mixed Reality Video Guidance from Hand-Object Interactions. (arXiv:2109.14744v1 [cs.CV])
    (0 min) In this paper, we address the problem of automatically extracting the steps that compose real-life hand activities. This is a key competence towards processing, monitoring and providing video guidance in Mixed Reality systems. We use egocentric vision to observe hand-object interactions in real-world tasks and automatically decompose a video into its constituent steps. Our approach combines hand-object interaction (HOI) detection, object similarity measurement and a finite state machine (FSM) representation to automatically edit videos into steps. We use a combination of Convolutional Neural Networks (CNNs) and the FSM to discover, cut and merge segments while observing real hand activities. We evaluate our algorithm quantitatively and qualitatively on two datasets: GTEA and a new dataset we introduce for Chinese tea making. Results show our method is able to segment hand-object interaction videos into key step segments with high levels of precision.
    Revisiting Point Cloud Simplification: A Learnable Feature Preserving Approach. (arXiv:2109.14982v1 [cs.CV])
    (0 min) The recent advances in 3D sensing technology have made it possible to capture point clouds at very high resolution. However, increased detail usually comes at the expense of high storage, as well as computational costs in terms of processing and visualization operations. Mesh and Point Cloud simplification methods aim to reduce the complexity of 3D models while retaining visual quality and relevant salient features. Traditional simplification techniques usually rely on solving a time-consuming optimization problem, hence they are impractical for large-scale datasets. In an attempt to alleviate this computational burden, we propose a fast point cloud simplification method by learning to sample salient points. The proposed method relies on a graph neural network architecture trained to select an arbitrary, user-defined number of points from the input space and to re-arrange their positions so as to minimize the visual perception error. The approach is extensively evaluated on various datasets using several perceptual metrics. Importantly, our method is able to generalize to out-of-distribution shapes, hence demonstrating zero-shot capabilities.
    FASA: Feature Augmentation and Sampling Adaptation for Long-Tailed Instance Segmentation. (arXiv:2102.12867v2 [cs.CV] UPDATED)
    (0 min) Recent methods for long-tailed instance segmentation still struggle on rare object classes with few training data. We propose a simple yet effective method, Feature Augmentation and Sampling Adaptation (FASA), that addresses the data scarcity issue by augmenting the feature space especially for rare classes. Both the Feature Augmentation (FA) and feature sampling components are adaptive to the actual training status -- FA is informed by the feature mean and variance of observed real samples from past iterations, and we sample the generated virtual features in a loss-adapted manner to avoid over-fitting. FASA does not require any elaborate loss design, and removes the need for inter-class transfer learning that often involves large cost and manually-defined head/tail class groups. We show FASA is a fast, generic method that can be easily plugged into standard or long-tailed segmentation frameworks, with consistent performance gains and little added cost. FASA is also applicable to other tasks like long-tailed classification with state-of-the-art performance.
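    A rough sketch of the feature-augmentation idea: per-class feature statistics are tracked across iterations and virtual features for rare classes are drawn from the resulting Gaussians (momentum and shapes below are illustrative, and the loss-adapted sampling is omitted):

        import torch

        class ClasswiseFeatureAugmenter:
            """Track per-class feature mean/variance and sample virtual features."""
            def __init__(self, num_classes, feat_dim, momentum=0.9):
                self.mean = torch.zeros(num_classes, feat_dim)
                self.var = torch.ones(num_classes, feat_dim)
                self.momentum = momentum

            def update(self, feats, labels):
                """feats: (N, feat_dim) real features observed this iteration."""
                for c in labels.unique():
                    f = feats[labels == c]
                    self.mean[c] = self.momentum * self.mean[c] + (1 - self.momentum) * f.mean(0)
                    self.var[c] = self.momentum * self.var[c] + (1 - self.momentum) * f.var(0, unbiased=False)

            def sample(self, class_id, n):
                """Draw n virtual features for a (typically rare) class."""
                std = self.var[class_id].sqrt()
                return self.mean[class_id] + std * torch.randn(n, std.shape[0])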
    Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning. (arXiv:2012.01345v3 [cs.CV] UPDATED)
    (0 min) Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.
    IntentVizor: Towards Generic Query Guided Interactive Video Summarization Using Slow-Fast Graph Convolutional Networks. (arXiv:2109.14834v1 [cs.CV])
    (0 min) The goal of automatic video summarization is to create a short skim of the original long video while preserving the major content/events. There is growing interest in integrating user queries into video summarization, i.e., query-driven video summarization. This method predicts a concise synopsis of the original video based on the user query, which is commonly represented by input text. However, two inherent problems exist in this query-driven approach. First, the query text might not be enough to describe the exact and diverse needs of the user. Second, the user cannot edit the summaries once they are produced, limiting this technique's practical value. We assume the needs of the user are subtle and need to be adjusted interactively. To solve these two problems, we propose IntentVizor, an interactive video summarization framework guided by generic multi-modality queries. The input query that describes the user's needs is not limited to text but can also include video snippets. We further summarize these fine-grained multi-modality queries as user 'intent', a newly proposed concept in this paper. This intent is interpretable, interactable, and better quantifies/describes the user's needs. To be more specific, we use a set of intents to represent user inputs and design a new interactive visual analytic interface around them. Users can interactively control and adjust these mixed-initiative intents to obtain a more satisfying summary through this newly proposed interface. On the algorithm side, we propose two novel intent/scoring networks based on slow-fast features, which help users achieve their summarization goal via video understanding. We conduct our experiments on two benchmark datasets. The comparison with state-of-the-art methods verifies the effectiveness of the proposed framework.
    Moving Object Detection for Event-based vision using Graph Spectral Clustering. (arXiv:2109.14979v1 [cs.CV])
    (0 min) Moving object detection has been a central topic of discussion in computer vision for its wide range of applications such as self-driving cars, video surveillance, security, and enforcement. Neuromorphic Vision Sensors (NVS) are bio-inspired sensors that mimic the working of the human eye. Unlike conventional frame-based cameras, these sensors capture a stream of asynchronous 'events' that offer multiple advantages over the former, such as high dynamic range, low latency, low power consumption, and reduced motion blur. However, these advantages come at a high cost, as the event camera data typically contains more noise and has low resolution. Moreover, as event-based cameras can only capture the relative changes in brightness of a scene, event data do not contain the usual visual information (such as texture and color) available in video data from normal cameras. Moving object detection with event-based cameras therefore becomes an extremely challenging task. In this paper, we present an unsupervised Graph Spectral Clustering technique for Moving Object Detection in Event-based data (GSCEventMOD). We additionally show how the optimum number of moving objects can be automatically determined. Experimental comparisons on publicly available datasets show that the proposed GSCEventMOD algorithm outperforms a number of state-of-the-art techniques by a maximum margin of 30%.
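    A minimal sketch, assuming events are given as (x, y, t) tuples, of clustering them with off-the-shelf spectral clustering on a nearest-neighbour graph; GSCEventMOD's exact graph construction and its automatic selection of the number of objects are not reproduced here:

        import numpy as np
        from sklearn.cluster import SpectralClustering

        def cluster_events(events_xyt, n_clusters=2, time_scale=1e-3):
            """Cluster events in (x, y, scaled t) space; returns one label per event."""
            feats = np.array(events_xyt, dtype=float)
            feats[:, 2] *= time_scale  # bring the time axis to a scale comparable to pixels
            sc = SpectralClustering(n_clusters=n_clusters,
                                    affinity="nearest_neighbors",
                                    n_neighbors=10,
                                    assign_labels="kmeans")
            return sc.fit_predict(feats)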
    Convolutional Neural Network Compression through Generalized Kronecker Product Decomposition. (arXiv:2109.14710v1 [cs.CV])
    (0 min) Modern Convolutional Neural Network (CNN) architectures, despite their superiority in solving various problems, are generally too large to be deployed on resource-constrained edge devices. In this paper, we reduce memory usage and floating-point operations required by convolutional layers in CNNs. We compress these layers by generalizing the Kronecker Product Decomposition to apply to multidimensional tensors, leading to the Generalized Kronecker Product Decomposition (GKPD). Our approach yields a plug-and-play module that can be used as a drop-in replacement for any convolutional layer. Experimental results for image classification on CIFAR-10 and ImageNet datasets using ResNet, MobileNetV2 and SENet architectures substantiate the effectiveness of our proposed approach. We find that GKPD outperforms state-of-the-art decomposition methods including Tensor-Train and Tensor-Ring as well as other relevant compression methods such as pruning and knowledge distillation.
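    For the 2D case, the nearest Kronecker product approximation of a weight matrix can be computed from a rank-1 SVD of a rearranged matrix (the Van Loan-Pitsianis construction); GKPD's generalization to multidimensional convolution tensors is not shown. A minimal numpy sketch:

        import numpy as np

        def nearest_kronecker(W, shape_a, shape_b):
            """Approximate W of shape (m1*m2, n1*n2) by kron(A, B), A: (m1, n1), B: (m2, n2)."""
            m1, n1 = shape_a
            m2, n2 = shape_b
            # Rearrange W so that each row holds one flattened (m2 x n2) block.
            R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
            U, s, Vt = np.linalg.svd(R, full_matrices=False)
            A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
            B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
            return A, B  # np.kron(A, B) is the best single-term Kronecker approximation of W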
    Forming a sparse representation for visual place recognition using a neurorobotic approach. (arXiv:2109.14916v1 [cs.CV])
    (0 min) This paper introduces a novel unsupervised neural network model for visual information encoding which aims to address the problem of large-scale visual localization. Inspired by the structure of the visual cortex, the model (named HSD) alternates layers of topologic sparse coding and pooling to build a more compact code of visual information. Intended for visual place recognition (VPR) systems that use local descriptors, the impact of its integration in a bio-inspired model for self-localization (LPMP) is evaluated. Our experimental results on the KITTI dataset show that HSD improves the runtime speed of LPMP by a factor of at least 2 and its localization accuracy by 10%. A comparison with CoHog, a state-of-the-art VPR approach, showed that our method achieves slightly better results.
    LayoutTransformer: Layout Generation and Completion with Self-attention. (arXiv:2006.14615v2 [cs.CV] UPDATED)
    (0 min) We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents, and 3D objects. Most complex scenes, natural or human-designed, can be expressed as a meaningful arrangement of simpler compositional graphical primitives. Generating a new layout or extending an existing layout requires understanding the relationships between these primitives. To do this, we propose LayoutTransformer, a novel framework that leverages self-attention to learn contextual relationships between layout elements and generate novel layouts in a given domain. Our framework allows us to generate a new layout either from an empty set or from an initial seed set of primitives, and can easily scale to support an arbitrary number of primitives per layout. Furthermore, our analyses show that the model is able to automatically capture the semantic properties of the primitives. We propose simple improvements in both the representation of layout primitives and the training methods to demonstrate competitive performance in very diverse data domains such as object bounding boxes in natural images (COCO bounding boxes), documents (PubLayNet), mobile applications (RICO dataset) as well as 3D shapes (Part-Net). Code and other materials will be made available at https://kampta.github.io/layout.
    You Cannot Easily Catch Me: A Low-Detectable Adversarial Patch for Object Detectors. (arXiv:2109.15177v1 [cs.CV])
    (0 min) Blind spots or outright deceit can bedevil and deceive machine learning models. Unidentified objects such as digital "stickers," also known as adversarial patches, can fool facial recognition systems, surveillance systems and self-driving cars. Fortunately, most existing adversarial patches can be outwitted, disabled and rejected by a simple classification network called an adversarial patch detector, which distinguishes adversarial patches from original images. An object detector classifies and predicts the types of objects within an image, such as by distinguishing a motorcyclist from the motorcycle, while also localizing each object's placement within the image by "drawing" so-called bounding boxes around each object, once again separating the motorcyclist from the motorcycle. To train detectors even better, however, we need to keep subjecting them to confusing or deceitful adversarial patches as we probe for the models' blind spots. For such probes, we came up with a novel approach, a Low-Detectable Adversarial Patch (LDAP), which attacks an object detector with small and texture-consistent adversarial patches, making these adversaries less likely to be recognized. Concretely, we use several geometric primitives to model the shapes and positions of the patches. To enhance our attack performance, we also assign different weights to the bounding boxes in terms of loss function. Our experiments on the common detection dataset COCO as well as the driving-video dataset D2-City show that LDAP is an effective attack method, and can resist the adversarial patch detector.
    Motion-aware Self-supervised Video Representation Learning via Foreground-background Merging. (arXiv:2109.15130v1 [cs.CV])
    (0 min) In light of the success of contrastive learning in the image domain, current self-supervised video representation learning methods usually employ contrastive loss to facilitate video representation learning. When naively pulling two augmented views of a video closer, the model however tends to learn the common static background as a shortcut but fails to capture the motion information, a phenomenon dubbed background bias. This bias makes the model suffer from weak generalization ability, leading to worse performance on downstream tasks such as action recognition. To alleviate such bias, we propose Foreground-background Merging (FAME) to deliberately compose the foreground region of the selected video onto the background of others. Specifically, without any off-the-shelf detector, we extract the foreground and background regions via the frame difference and color statistics, and shuffle the background regions among the videos. By leveraging the semantic consistency between the original clips and the fused ones, the model focuses more on the foreground motion pattern and is thus more robust to the background context. Extensive experiments demonstrate that FAME can significantly boost the performance in different downstream tasks with various backbones. When integrated with MoCo, FAME reaches 84.8% and 53.5% accuracy on UCF101 and HMDB51, respectively, achieving state-of-the-art performance.
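    A rough sketch of the merging step, assuming the foreground mask comes from simple frame differencing (the threshold is illustrative, and the colour-statistics cue used in the paper is omitted):

        import numpy as np

        def frame_diff_mask(clip, thresh=20):
            """clip: (T, H, W, 3) uint8. Rough foreground mask from temporal differences."""
            diffs = np.abs(np.diff(clip.astype(np.int16), axis=0)).max(axis=-1)  # (T-1, H, W)
            mask = (diffs.mean(axis=0) > thresh).astype(np.float32)              # (H, W)
            return mask[None, :, :, None]  # broadcastable over time and channels

        def fame_merge(fg_clip, bg_clip, thresh=20):
            """Composite the moving foreground of one clip onto another clip's background."""
            m = frame_diff_mask(fg_clip, thresh)
            return (m * fg_clip + (1.0 - m) * bg_clip).astype(np.uint8)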
    SPATE-GAN: Improved Generative Modeling of Dynamic Spatio-Temporal Patterns with an Autoregressive Embedding Loss. (arXiv:2109.15044v1 [cs.LG])
    (0 min) From ecology to atmospheric sciences, many academic disciplines deal with data characterized by intricate spatio-temporal complexities, the modeling of which often requires specialized approaches. Generative models of these data are of particular interest, as they enable a range of impactful downstream applications like simulation or creating synthetic training data. Recent work has highlighted the potential of generative adversarial nets (GANs) for generating spatio-temporal data. A new GAN algorithm COT-GAN, inspired by the theory of causal optimal transport (COT), was proposed in an attempt to better tackle this challenge. However, the task of learning more complex spatio-temporal patterns requires additional knowledge of their specific data structures. In this study, we propose a novel loss objective combined with COT-GAN based on an autoregressive embedding to reinforce the learning of spatio-temporal dynamics. We devise SPATE (spatio-temporal association), a new metric measuring spatio-temporal autocorrelation by using the deviance of observations from their expected values. We compute SPATE for real and synthetic data samples and use it to compute an embedding loss that considers space-time interactions, nudging the GAN to learn outputs that are faithful to the observed dynamics. We test this new objective on a diverse set of complex spatio-temporal patterns: turbulent flows, log-Gaussian Cox processes and global weather data. We show that our novel embedding loss improves performance without any changes to the architecture of the COT-GAN backbone, highlighting our model's increased capacity for capturing autoregressive structures. We also contextualize our work with respect to recent advances in physics-informed deep learning and interdisciplinary work connecting neural networks with geographic and geophysical sciences.
    Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond. (arXiv:2004.14557v4 [cs.CV] UPDATED)
    (0 min) Conventional deformable registration methods aim at solving an optimization model carefully designed on image pairs and their computational costs are exceptionally high. In contrast, recent deep learning based approaches can provide fast deformation estimation. These heuristic network architectures are fully data-driven and thus lack explicit geometric constraints, e.g., topology preservation, which are indispensable for generating plausible deformations. We design a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation in order to integrate advantages and avoid limitations of these two categories of approaches. Specifically, we introduce a generic optimization model to formulate diffeomorphic registration and develop a series of learnable architectures to obtain propagative updating in the coarse-to-fine feature space. Moreover, we propose a novel bilevel self-tuned training strategy, allowing efficient search of task-specific hyper-parameters. This training strategy increases the flexibility to various types of data while reducing computational and human burdens. We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data. Extensive results demonstrate the state-of-the-art performance of the proposed method with diffeomorphic guarantee and extreme efficiency. We also apply our framework to challenging multi-modal image registration, and investigate how our registration supports downstream tasks in medical image analysis, including multi-modal fusion and image segmentation.
    SOLQ: Segmenting Objects by Learning Queries. (arXiv:2106.02351v3 [cs.CV] UPDATED)
    (0 min) In this paper, we propose an end-to-end framework for instance segmentation. Based on the recently introduced DETR [1], our method, termed SOLQ, segments objects by learning unified queries. In SOLQ, each query represents one object and has multiple representations: class, location and mask. The learned object queries perform classification, box regression and mask encoding simultaneously in a unified vector form. During the training phase, the encoded mask vectors are supervised by the compression coding of raw spatial masks. At inference time, the produced mask vectors can be directly transformed into spatial masks by inverting the compression coding. Experimental results show that SOLQ can achieve state-of-the-art performance, surpassing most existing approaches. Moreover, the joint learning of the unified query representation can greatly improve the detection performance of DETR. We hope SOLQ can serve as a strong baseline for Transformer-based instance segmentation. Code is available at https://github.com/megvii-research/SOLQ.
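    As an illustration of encoding a spatial mask into a compact vector and inverting it, the sketch below uses a low-frequency 2D DCT; whether this matches SOLQ's exact compression coding is an assumption here:

        import numpy as np
        from scipy.fft import dctn, idctn

        def encode_mask(mask, k=16):
            """Compress a binary HxW mask into its k x k low-frequency DCT coefficients."""
            coeffs = dctn(mask.astype(float), norm="ortho")
            return coeffs[:k, :k].reshape(-1)  # the vector a query would be trained to predict

        def decode_mask(vec, k, out_shape):
            """Invert the coding: place the coefficients back and apply the inverse DCT."""
            coeffs = np.zeros(out_shape)
            coeffs[:k, :k] = vec.reshape(k, k)
            return idctn(coeffs, norm="ortho") > 0.5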
    Automated airway segmentation by learning graphical structure. (arXiv:2109.14792v1 [eess.IV])
    (0 min) In this research project, we put forward an advanced method for airway segmentation based on existing convolutional neural network (CNN) and graph neural network (GNN) architectures. The method originates from vessel segmentation, but we adapt it so that the model performs better on datasets from computed tomography (CT) scans. Current methods for airway segmentation consider only the regular grid. Whether the model is a 3-dimensional CNN or a 2-dimensional CNN applied in three directions, the overall graph structure is not taken into consideration. In our model, the graph structure of the airway neighbourhood is incorporated and airway segmentation is improved compared with traditional CNN methods. We perform experiments on chest CT scans, where the ground truth segmentation labels are produced manually. Compared with the CNN-only method, the combination of CNN and GNN performs better: the bronchi in the chest CT scans are detected in most cases. In addition, the proposed model extends readily to other datasets with similar aims, making it widely applicable. Keywords: Airway segmentation, Convolutional neural network, Graph neural network
    Domain-Robust Mitotic Figure Detection with Style Transfer. (arXiv:2109.01124v2 [cs.CV] UPDATED)
    (0 min) We propose a new training scheme for domain generalization in mitotic figure detection. Mitotic figures show different characteristics for each scanner. We consider each scanner as a 'domain' and the image distribution specified for each domain as 'style'. The goal is to train our network to be robust on scanner types by using various 'style' images. To expand the style variance, we transfer a style of the training image into arbitrary styles, by defining a module based on StarGAN. Our model with the proposed training scheme shows positive performance on MIDOG Preliminary Test-Set containing scanners never seen before.
    CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation. (arXiv:2109.15170v1 [cs.CV])
    (0 min) Some cognitive research has discovered that humans accomplish event segmentation as a side effect of event anticipation. Inspired by this discovery, we propose a simple yet effective end-to-end self-supervised learning framework for event segmentation/boundary detection. Unlike the mainstream clustering-based methods, our framework exploits a transformer-based feature reconstruction scheme to detect event boundaries by reconstruction errors. This is consistent with the fact that humans spot new events by leveraging the deviation between their prediction and what is actually perceived. Thanks to their heterogeneity in semantics, the frames at boundaries are difficult to reconstruct (generally yielding large reconstruction errors), which is favorable for event boundary detection. Additionally, since the reconstruction occurs on the semantic feature level instead of pixel level, we develop a temporal contrastive feature embedding module to learn the semantic visual representation for frame feature reconstruction. This procedure is like humans building up experiences with "long-term memory". The goal of our work is to segment generic events rather than localize some specific ones. We focus on achieving accurate event boundaries. As a result, we adopt F1 score (Precision/Recall) as our primary evaluation metric for a fair comparison with previous approaches. Meanwhile, we also calculate the conventional frame-based MoF and IoU metrics. We thoroughly benchmark our work on four publicly available datasets and demonstrate much better results.
    HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning. (arXiv:2109.15163v1 [cs.CV])
    (0 min) Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective due to the heterogeneous nature of the feature representations in the two domains, which intrinsically contain both distribution and structure variations. To address this and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains by adopting a hierarchical two-step adaptation, i.e., structure adaptation and distribution adaptation. In the structure adaptation step, we take two task-specific encoders to encode the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, thus making the visual and semantic feature manifolds more closely aligned. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions using a common encoder. Finally, the structure and distribution adaptation are derived in a unified framework under two partially-aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at \url{https://github.com/shiming-chen/HSVA} .
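    For the distribution-adaptation step, the squared 2-Wasserstein distance between two multivariate Gaussians has a closed form in terms of their means and covariances; a minimal numpy sketch of that formula (not the paper's implementation):

        import numpy as np
        from scipy.linalg import sqrtm

        def gaussian_w2_squared(mu1, cov1, mu2, cov2):
            """W2^2 between N(mu1, cov1) and N(mu2, cov2):
            ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov2^1/2 cov1 cov2^1/2)^1/2)."""
            s2_half = sqrtm(cov2)
            cross = np.real(sqrtm(s2_half @ cov1 @ s2_half))  # drop negligible imaginary parts
            return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross))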
    End-to-End Image Compression with Probabilistic Decoding. (arXiv:2109.14837v1 [eess.IV])
    (0 min) Lossy image compression is a many-to-one process, thus one bitstream corresponds to multiple possible original images, especially at low bit rates. However, this nature was seldom considered in previous studies on image compression, which usually chose one possible image as the reconstruction, e.g. the one with the maximal a posteriori probability. We propose a learned image compression framework to natively support probabilistic decoding. The compressed bitstream is decoded into a series of parameters that instantiate a pre-chosen distribution; then the distribution is used by the decoder to sample and reconstruct images. The decoder may adopt different sampling strategies and produce diverse reconstructions, among which some have higher signal fidelity and others have better visual quality. The proposed framework relies on an invertible neural network-based transform to convert pixels into coefficients that obey the pre-chosen distribution as closely as possible. Our code and models will be made publicly available.
    Suppression of Independent and Correlated Noise with Similarity-based Unsupervised Deep Learning. (arXiv:2011.03384v5 [cs.LG] UPDATED)
    (0 min) Denoising is one of the most important data processing tasks and is generally a prerequisite for downstream image analysis in many fields. Despite their superior denoising performance, supervised deep denoising methods require paired noise-clean or noise-noise samples, which are often unavailable in practice. On the other hand, unsupervised deep denoising methods such as Noise2Void and its variants predict masked pixels from their neighboring pixels in single noisy images. However, these unsupervised algorithms only work under the independent noise assumption while real noise distributions are usually correlated with complex structural patterns. Here we propose the first-of-its-kind feature similarity-based unsupervised denoising approach that works in a nonlocal and nonlinear fashion to suppress not only independent but also correlated noise. Our approach is referred to as Noise2Sim since different noisy sub-images with similar signals are extracted to form as many training pairs as possible so that the parameters of a deep denoising network can be optimized in a self-learning fashion. Theoretically, we establish that Noise2Sim is equivalent to supervised learning methods under mild conditions. Experimentally, Noise2Sim achieves excellent results on natural, microscopic, low-dose CT and photon-counting micro-CT images, removing both independent and correlated image noise and outperforming competing denoising methods. Potentially, Noise2Sim would open a new direction of research and lead to the development of adaptive denoising tools in diverse applications.
    Sim-to-real for high-resolution optical tactile sensing: From images to 3D contact force distributions. (arXiv:2012.11295v3 [cs.RO] UPDATED)
    (0 min) The images captured by vision-based tactile sensors carry information about high-resolution tactile fields, such as the distribution of the contact forces applied to their soft sensing surface. However, extracting the information encoded in the images is challenging and often addressed with learning-based approaches, which generally require a large amount of training data. This article proposes a strategy to generate tactile images in simulation for a vision-based tactile sensor based on an internal camera that tracks the motion of spherical particles within a soft material. The deformation of the material is simulated in a finite element environment under a diverse set of contact conditions, and spherical particles are projected to a simulated image. Features extracted from the images are mapped to the 3D contact force distribution by an artificial neural network, with the ground truth also obtained via finite-element simulations; the network is therefore entirely trained on synthetic data, avoiding the need for real-world data collection. The resulting model exhibits high accuracy when evaluated on real-world tactile images, is transferable across multiple tactile sensors without further training, and is suitable for efficient real-time inference.
    A Technical Report for ICCV 2021 VIPriors Re-identification Challenge. (arXiv:2109.15164v1 [cs.CV])
    (0 min) Person re-identification has always been a hot and challenging task. This paper introduces our solution for the re-identification track in the VIPriors Challenge 2021. In this challenge, the difficulty is how to train the model from scratch without any pretrained weights. In our method, we show that by using state-of-the-art data processing strategies, model designs, and post-processing ensemble methods, it is possible to overcome the difficulty of data shortage and obtain competitive results. (1) Both an image augmentation strategy and a novel pre-processing method for occluded images can help the model learn more discriminative features. (2) Several strong backbones and multiple loss functions are used to learn more representative features. (3) Post-processing techniques including re-ranking, automatic query expansion, ensemble learning, etc., significantly improve the final performance. The final score of our team (ALONG) is 96.5154% mAP, ranking first on the leaderboard.
    Designing a Practical Degradation Model for Deep Blind Image Super-Resolution. (arXiv:2103.14006v2 [eess.IV] UPDATED)
    (0 min) It is widely acknowledged that single image super-resolution (SISR) methods would not perform well if the assumed degradation model deviates from those in real images. Although several degradation models take additional factors into consideration, such as blur, they are still not effective enough to cover the diverse degradations of real images. To address this issue, this paper proposes to design a more complex but practical degradation model that consists of randomly shuffled blur, downsampling and noise degradations. Specifically, the blur is approximated by two convolutions with isotropic and anisotropic Gaussian kernels; the downsampling is randomly chosen from nearest, bilinear and bicubic interpolations; the noise is synthesized by adding Gaussian noise with different noise levels, adopting JPEG compression with different quality factors, and generating processed camera sensor noise via reverse-forward camera image signal processing (ISP) pipeline model and RAW image noise model. To verify the effectiveness of the new degradation model, we have trained a deep blind ESRGAN super-resolver and then applied it to super-resolve both synthetic and real images with diverse degradations. The experimental results demonstrate that the new degradation model can help to significantly improve the practicability of deep super-resolvers, thus providing a powerful alternative solution for real SISR applications.
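    A minimal sketch of a randomly shuffled blur / downsampling / noise / JPEG pipeline in the spirit of the proposed degradation model (kernel sizes, noise levels and quality factors are illustrative, and the anisotropic kernels and camera ISP noise model are omitted):

        import random
        import cv2
        import numpy as np

        def gaussian_blur(img, _):
            return cv2.GaussianBlur(img, (21, 21), random.uniform(0.2, 3.0))

        def downsample(img, scale):
            interp = random.choice([cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC])
            h, w = img.shape[:2]
            return cv2.resize(img, (w // scale, h // scale), interpolation=interp)

        def gaussian_noise(img, _):
            noise = np.random.normal(0, random.uniform(1, 25), img.shape)
            return np.clip(img.astype(float) + noise, 0, 255).astype(np.uint8)

        def jpeg_compress(img, _):
            q = random.randint(30, 95)
            _, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, q])
            return cv2.imdecode(buf, cv2.IMREAD_COLOR)

        def degrade(hr_img, scale=4):
            """hr_img: HxWx3 uint8 high-resolution image -> synthetically degraded LR image."""
            ops = [gaussian_blur, downsample, gaussian_noise, jpeg_compress]
            random.shuffle(ops)  # randomly shuffled degradation order
            for op in ops:
                hr_img = op(hr_img, scale)
            return hr_img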
    Improvising the Learning of Neural Networks on Hyperspherical Manifold. (arXiv:2109.14746v1 [cs.CV])
    (0 min) Convolutional neural networks (CNNs) in supervised settings have provided tremendous gains in performance. Representations learned by CNNs operating on a hyperspherical manifold have led to insightful outcomes in face recognition, face identification and other supervised tasks. A broad range of activation functions has been developed with hypersphere intuition, performing better than softmax in Euclidean space. The main motive of this research is to provide insights. First, the stereographic projection is applied to transform data from Euclidean space ($\mathbb{R}^{n}$) to a hyperspherical manifold ($\mathbb{S}^{n}$) in order to analyze the performance of angular margin losses. Second, we show both theoretically and practically that decision boundaries constructed on the hypersphere using the stereographic projection aid the learning of neural networks. Experiments show that applying the stereographic projection to existing state-of-the-art angular margin objective functions improves performance on standard image classification data sets (CIFAR-10/100). The code is publicly available at: https://github.com/barulalithb/stereo-angular-margin.
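    A minimal sketch of the (inverse) stereographic projection that lifts a feature vector in $\mathbb{R}^{n}$ onto the unit hypersphere $\mathbb{S}^{n}$; whether this matches the paper's exact convention is an assumption:

        import torch

        def inverse_stereographic_projection(x):
            """Map features x in R^n to points on the unit hypersphere S^n (one extra dim)."""
            sq = (x ** 2).sum(dim=-1, keepdim=True)      # ||x||^2, shape (..., 1)
            first = 2.0 * x / (sq + 1.0)                 # first n coordinates
            last = (sq - 1.0) / (sq + 1.0)               # last coordinate
            return torch.cat([first, last], dim=-1)      # unit norm up to numerical error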
    Augmented reality navigation system for visual prosthesis. (arXiv:2109.14957v1 [cs.CV])
    (0 min) The visual functions of visual prostheses, such as field of view, resolution and dynamic range, seriously restrict the person's ability to navigate in unknown environments. Implanted patients still require constant assistance to navigate from one location to another. Hence, there is a need for a system that is able to assist them safely during their journey. In this work, we propose an augmented reality navigation system for visual prostheses that incorporates reactive navigation and path-planning software to guide the subject along a convenient, obstacle-free route. It consists of four steps: locating the subject on a map, planning the subject's trajectory, showing it to the subject and re-planning without obstacles. We have also designed a simulated prosthetic vision environment which allows us to systematically study navigation performance. Twelve subjects participated in the experiment. Subjects were guided by the augmented reality navigation system with the instruction to navigate through different environments until they reached two goals, crossing the door and finding an object (a bin), as fast and accurately as possible. Results show that our augmented navigation system helps navigation performance by reducing the time and distance to reach the goals, and even significantly reduces the number of obstacle collisions, compared to other baseline methods.
    Unlocking Pixels for Reinforcement Learning via Implicit Attention. (arXiv:2102.04353v4 [cs.LG] UPDATED)
    (0 min) There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and the potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is an attention bottleneck, which provides a simple and effective framework for learning high-performing policies, even in the presence of distractions. However, due to poor scalability of attention architectures, these methods cannot be applied beyond low resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these techniques can be successfully adopted for the RL setting. This allows our attention-based controllers to scale to larger visual inputs, and facilitates the use of smaller patches, even individual pixels, improving generalization. We show this on a range of tasks from the Distracting Control Suite to vision-based quadruped robot locomotion. We provide rigorous theoretical analysis of the proposed algorithm.
    Deep Contextual Video Compression. (arXiv:2109.15047v1 [eess.IV])
    (0 min) Most of the existing neural video compression methods adopt the predictive coding framework, which first generates the predicted frame and then encodes its residue with the current frame. However, as for compression ratio, predictive coding is only a sub-optimal solution as it uses simple subtraction operation to remove the redundancy across frames. In this paper, we propose a deep contextual video compression framework to enable a paradigm shift from predictive coding to conditional coding. In particular, we try to answer the following questions: how to define, use, and learn condition under a deep video compression framework. To tap the potential of conditional coding, we propose using feature domain context as condition. This enables us to leverage the high dimension context to carry rich information to both the encoder and the decoder, which helps reconstruct the high-frequency contents for higher video quality. Our framework is also extensible, in which the condition can be flexibly designed. Experiments show that our method can significantly outperform the previous state-of-the-art (SOTA) deep video compression methods. When compared with x265 using veryslow preset, we can achieve 26.0% bitrate saving for 1080P standard test videos.
    Unsupervised Landmark Detection Based Spatiotemporal Motion Estimation for 4D Dynamic Medical Images. (arXiv:2109.14805v1 [eess.IV])
    (0 min) Motion estimation is a fundamental step in dynamic medical image processing for the assessment of target organ anatomy and function. However, existing image-based motion estimation methods, which optimize the motion field by evaluating the local image similarity, are prone to produce implausible estimates, especially in the presence of large motion. In this study, we provide a novel motion estimation framework of Dense-Sparse-Dense (DSD), which comprises two stages. In the first stage, we process the raw dense image to extract sparse landmarks to represent the target organ anatomical topology and discard the redundant information that is unnecessary for motion estimation. For this purpose, we introduce an unsupervised 3D landmark detection network to extract spatially sparse but representative landmarks for the target organ motion estimation. In the second stage, we derive the sparse motion displacement from the extracted sparse landmarks of two images of different time points. Then, we present a motion reconstruction network to construct the motion field by projecting the sparse landmark displacements back into the dense image domain. Furthermore, we employ the estimated motion field from our two-stage DSD framework as initialization and boost the motion estimation quality in a light-weight yet effective iterative optimization. We evaluate our method on two dynamic medical imaging tasks to model cardiac motion and lung respiratory motion, respectively. Our method has produced superior motion estimation accuracy compared to existing comparative methods. In addition, the extensive experimental results demonstrate that our solution can extract representative anatomical landmarks without any requirement for manual annotation. Our code is publicly available online.
    Semantic Dense Reconstruction with Consistent Scene Segments. (arXiv:2109.14821v1 [cs.CV])
    (0 min) In this paper, a method for dense semantic 3D scene reconstruction from an RGB-D sequence is proposed to solve high-level scene understanding tasks. First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera tracking backbone that propagates objects' labels with high probabilities from full scans to corresponding ones of partial views. Then a dense 3D mesh model of an unknown environment is incrementally generated from the input RGB-D sequence. Benefiting from 2D consistent semantic segments and the 3D model, a novel semantic projection block (SP-Block) is proposed to extract deep feature volumes from 2D segments of different views. Moreover, the semantic volumes are fused into deep volumes from a point cloud encoder to make the final semantic segmentation. Extensive experimental evaluations on public datasets show that our system achieves accurate 3D dense reconstruction and state-of-the-art semantic prediction performances simultaneously.
    Self-Supervised Out-of-Distribution Detection and Localization with Natural Synthetic Anomalies (NSA). (arXiv:2109.15222v1 [cs.CV])
    (0 min) We introduce a new self-supervised task, NSA, for training an end-to-end model for anomaly detection and localization using only normal data. NSA uses Poisson image editing to seamlessly blend scaled patches of various sizes from separate images. This creates a wide range of synthetic anomalies which are more similar to natural sub-image irregularities than previous data-augmentation strategies for self-supervised anomaly detection. We evaluate the proposed method using natural and medical images. Our experiments with the MVTec AD dataset show that a model trained to localize NSA anomalies generalizes well to detecting real-world a priori unknown types of manufacturing defects. Our method achieves an overall detection AUROC of 97.2, outperforming all previous methods that learn from scratch without pre-training datasets.
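    A rough sketch of creating one such synthetic anomaly with OpenCV's Poisson (seamless) cloning; the fixed patch size, placement and single-scale blending are illustrative simplifications of NSA:

        import cv2
        import numpy as np

        def synth_anomaly(dst_img, src_img, patch=64, rng=None):
            """dst_img, src_img: HxWx3 uint8. Blend a random src patch into dst; return image and mask."""
            rng = rng or np.random.default_rng()
            H, W = dst_img.shape[:2]
            sy, sx = rng.integers(0, H - patch), rng.integers(0, W - patch)
            cy = int(rng.integers(patch // 2, H - patch // 2))
            cx = int(rng.integers(patch // 2, W - patch // 2))
            src_patch = src_img[sy:sy + patch, sx:sx + patch]
            blend_mask = 255 * np.ones((patch, patch), dtype=np.uint8)
            blended = cv2.seamlessClone(src_patch, dst_img, blend_mask, (cx, cy), cv2.NORMAL_CLONE)
            anomaly_mask = np.zeros((H, W), dtype=np.uint8)
            anomaly_mask[cy - patch // 2:cy + patch // 2, cx - patch // 2:cx + patch // 2] = 1
            return blended, anomaly_mask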
    Segmentation of Roads in Satellite Images using specially modified U-Net CNNs. (arXiv:2109.14671v1 [cs.CV])
    (0 min) The image classification problem has been deeply investigated by the research community, with computer vision algorithms and with the help of Neural Networks. The aim of this paper is to build an image classifier for satellite images of urban scenes that identifies the portions of the images in which a road is located, separating these portions from the rest. Unlike conventional computer vision algorithms, convolutional neural networks (CNNs) provide accurate and reliable results on this task. Our novel approach uses a sliding window to extract patches out of the whole image, data augmentation for generating more training/testing data and lastly a series of specially modified U-Net CNNs. This proposed technique outperforms all other baselines tested in terms of mean F-score metric.
    Learning to synthesise the ageing brain without longitudinal data. (arXiv:1912.02620v6 [eess.IV] UPDATED)
    (0 min) How will my face look when I get older? Or, for a more challenging question: How will my brain look when I get older? To answer this question one must devise (and learn from data) a multivariate auto-regressive function which given an image and a desired target age generates an output image. While collecting data for faces may be easier, collecting longitudinal brain data is not trivial. We propose a deep learning-based method that learns to simulate subject-specific brain ageing trajectories without relying on longitudinal data. Our method synthesises images conditioned on two factors: age (a continuous variable), and status of Alzheimer's Disease (AD, an ordinal variable). With an adversarial formulation we learn the joint distribution of brain appearance, age and AD status, and define reconstruction losses to address the challenging problem of preserving subject identity. We compare with several benchmarks using two widely used datasets. We evaluate the quality and realism of synthesised images using ground-truth longitudinal data and a pre-trained age predictor. We show that, despite the use of cross-sectional data, our model learns patterns of gray matter atrophy in the middle temporal gyrus in patients with AD. To demonstrate generalisation ability, we train on one dataset and evaluate predictions on the other. In conclusion, our model shows an ability to separate age, disease influence and anatomy using only 2D cross-sectional data that should be useful in large studies into neurodegenerative disease, that aim to combine several data sources. To facilitate such future studies by the community at large our code is made available at https://github.com/xiat0616/BrainAgeing.
    Twins: Revisiting the Design of Spatial Attention in Vision Transformers. (arXiv:2104.13840v4 [cs.CV] UPDATED)
    (0 min) Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at https://github.com/Meituan-AutoML/Twins .
    Unsupervised Few-Shot Action Recognition via Action-Appearance Aligned Meta-Adaptation. (arXiv:2109.15317v1 [cs.CV])
    (0 min) We present MetaUVFS as the first Unsupervised Meta-learning algorithm for Video Few-Shot action recognition. MetaUVFS leverages over 550K unlabeled videos to train a two-stream 2D and 3D CNN architecture via contrastive learning to capture the appearance-specific spatial and action-specific spatio-temporal video features respectively. MetaUVFS comprises a novel Action-Appearance Aligned Meta-adaptation (A3M) module that learns to focus on the action-oriented video features in relation to the appearance features via explicit few-shot episodic meta-learning over unsupervised hard-mined episodes. Our action-appearance alignment and explicit few-shot learner conditions the unsupervised training to mimic the downstream few-shot task, enabling MetaUVFS to significantly outperform all unsupervised methods on few-shot benchmarks. Moreover, unlike previous few-shot action recognition methods that are supervised, MetaUVFS needs neither base-class labels nor a supervised pretrained backbone. Thus, we need to train MetaUVFS just once to perform competitively or sometimes even outperform state-of-the-art supervised methods on popular HMDB51, UCF101, and Kinetics100 few-shot datasets.
    Video Instance Segmentation with a Propose-Reduce Paradigm. (arXiv:2103.13746v2 [cs.CV] UPDATED)
    (0 min) Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos. Prior methods usually obtain segmentation for a frame or clip first, and merge the incomplete results by tracking or matching. These methods may cause error accumulation in the merging step. Contrarily, we propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step. We further build a sequence propagation head on the existing image-level instance segmentation network for long-term propagation. To ensure robustness and high recall of our proposed framework, multiple sequences are proposed where redundant sequences of the same instance are reduced. We achieve state-of-the-art performance on two representative benchmark datasets -- we obtain 47.6% in terms of AP on YouTube-VIS validation set and 70.4% for J&F on DAVIS-UVOS validation set. Code is available at https://github.com/dvlab-research/ProposeReduce.
    CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations. (arXiv:2109.14910v1 [cs.CV])
    (0 min) Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs from sets of negative samples. Recently, the principle has also been used to learn cross-modal embeddings for video and text, yet without exploiting its full potential. In particular, previous losses do not take the intra-modality similarities into account, which leads to inefficient embeddings, as the same content is mapped to multiple points in the embedding space. With CrossCLR, we present a contrastive loss that fixes this issue. Moreover, we define sets of highly related samples in terms of their input embeddings and exclude them from the negative samples to avoid issues with false negatives. We show that these principles consistently improve the quality of the learned embeddings. The joint embeddings learned with CrossCLR extend the state of the art in video-text retrieval on Youcook2 and LSMDC datasets and in video captioning on Youcook2 dataset by a large margin. We also demonstrate the generality of the concept by learning improved joint embeddings for other pairs of modalities.
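    A rough sketch of the core idea, assuming L2-normalized video and text embeddings for matched pairs: an InfoNCE-style loss in which negatives that are too similar to the anchor within its own modality are masked out as likely false negatives (the threshold and weighting differ from CrossCLR's actual loss):

        import torch
        import torch.nn.functional as F

        def masked_cross_modal_nce(v, t, temperature=0.07, sim_threshold=0.9):
            """v, t: (B, D) L2-normalized video / text embeddings of matched pairs."""
            logits = v @ t.T / temperature                  # cross-modal similarities
            labels = torch.arange(v.size(0), device=v.device)

            # Mask negatives whose intra-modality similarity to the anchor is too high,
            # i.e. likely false negatives that share content with the positive pair.
            intra_v = v @ v.T
            eye = torch.eye(v.size(0), dtype=torch.bool, device=v.device)
            mask = (intra_v > sim_threshold) & ~eye
            logits = logits.masked_fill(mask, float("-inf"))

            loss_v2t = F.cross_entropy(logits, labels)
            loss_t2v = F.cross_entropy(logits.T, labels)
            return 0.5 * (loss_v2t + loss_t2v)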
    Real-Time Tactile Grasp Force Sensing Using Fingernail Imaging via Deep Neural Networks. (arXiv:2109.15231v1 [cs.CV])
    (0 min) This paper introduces a novel approach for the real-time estimation of 3D tactile forces exerted by human fingertips via vision only. The introduced approach is entirely monocular vision-based and does not require any physical force sensor. Therefore, it is scalable, non-intrusive, and easily fused with other perception systems such as body pose estimation, making it ideal for HRI applications where force sensing is necessary. The introduced approach consists of three main modules: finger tracking for detection and tracking of each individual finger, image alignment for preserving the spatial information in the images, and the force model for estimating the 3D forces from coloration patterns in the images. The model has been implemented experimentally, and the results have shown a maximum RMS error of 8.4% (for the entire range of force levels) along all three directions. The estimation accuracy is comparable to the offline models in the literature, such as EigenNail, while this model is capable of performing force estimation at 30 frames per second.
    Transferability Estimation for Semantic Segmentation Task. (arXiv:2109.15242v1 [cs.CV])
    (0 min) Transferability estimation is a fundamental problem in transfer learning to predict how good the performance is when transferring a source model (or source task) to a target task. With the guidance of a transferability score, we can efficiently select the highly transferable source models without performing the real transfer in practice. Recent analytical transferability metrics are mainly designed for image classification, and currently there is no specific investigation for the transferability estimation of the semantic segmentation task, which is an essential problem in autonomous driving, medical image analysis, etc. Consequently, we further extend the recent analytical transferability metric OTCE (Optimal Transport based Conditional Entropy) score to the semantic segmentation task. The challenge in applying the OTCE score is the high-dimensional segmentation output, which makes it difficult to find the optimal coupling between so many pixels at an acceptable computation cost. Thus we propose to randomly sample N pixels for computing the OTCE score and take the expectation over K repetitions as the final transferability score. Experimental evaluation on the Cityscapes, BDD100K and GTA5 datasets demonstrates that the OTCE score highly correlates with the transfer performance.
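    A rough, non-authoritative sketch of the pixel-sampling idea described above: N pixels are drawn per repetition and a per-subset score is averaged over K repetitions. The nearest-neighbour matching and the helper names (nn_conditional_entropy, sampled_transferability) are assumptions standing in for the real optimal-transport-based OTCE computation.

```python
import numpy as np

def nn_conditional_entropy(src_feats, src_labels, tgt_feats, tgt_labels):
    # Crude stand-in for the OTCE computation on a small pixel subset:
    # nearest-neighbour matching in feature space replaces the optimal-transport
    # coupling, and H(Y_tgt | Y_src) is estimated from the matched label pairs.
    d = ((tgt_feats[:, None, :] - src_feats[None, :, :]) ** 2).sum(-1)
    nn = d.argmin(axis=1)                              # nearest source pixel per target pixel
    joint = np.zeros((src_labels.max() + 1, tgt_labels.max() + 1))
    for s, t in zip(src_labels[nn], tgt_labels):
        joint[s, t] += 1
    joint /= joint.sum()
    src_marginal = joint.sum(axis=1, keepdims=True)
    cond = np.divide(joint, src_marginal, out=np.zeros_like(joint), where=src_marginal > 0)
    with np.errstate(divide="ignore"):
        return float(-(joint * np.where(cond > 0, np.log(cond), 0.0)).sum())

def sampled_transferability(src_feats, src_labels, tgt_feats, tgt_labels,
                            n_pixels=500, k_reps=10, seed=0):
    # Average the per-subset score over K random draws of N pixels each,
    # mirroring the sampling strategy described in the abstract.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(k_reps):
        i_s = rng.choice(src_feats.shape[0], size=min(n_pixels, src_feats.shape[0]), replace=False)
        i_t = rng.choice(tgt_feats.shape[0], size=min(n_pixels, tgt_feats.shape[0]), replace=False)
        scores.append(nn_conditional_entropy(src_feats[i_s], src_labels[i_s],
                                             tgt_feats[i_t], tgt_labels[i_t]))
    return float(np.mean(scores))
```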
    DAAS: Differentiable Architecture and Augmentation Policy Search. (arXiv:2109.15273v1 [cs.LG])
    (0 min) Neural architecture search (NAS) has been an active direction of automatic machine learning (Auto-ML), aiming to explore efficient network structures. The searched architecture is evaluated by training on datasets with fixed data augmentation policies. However, recent works on auto-augmentation show that suitable augmentation policies can vary across different structures. Therefore, this work considers the possible coupling between neural architectures and data augmentation and proposes an effective algorithm jointly searching for them. Specifically, 1) for the NAS task, we adopt a single-path-based differentiable method with a Gumbel-softmax reparameterization strategy due to its memory efficiency; 2) for the auto-augmentation task, we introduce a novel search method based on the policy gradient algorithm, which can significantly reduce the computation complexity. Our approach achieves 97.91% accuracy on CIFAR-10 and 76.6% Top-1 accuracy on the ImageNet dataset, showing the outstanding performance of our search algorithm.
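    A minimal sketch of single-path sampling with a Gumbel-softmax gate, in the spirit of the NAS component described above; the candidate operations and the SinglePathEdge class are illustrative placeholders, not the paper's actual search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SinglePathEdge(nn.Module):
    """Toy mixed edge: one candidate op is sampled per forward pass via
    Gumbel-softmax, so only a single path is evaluated (memory-efficient),
    while the architecture logits `alpha` remain differentiable."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture logits

    def forward(self, x, tau=1.0):
        # hard one-hot sample with straight-through gradients to alpha
        gate = F.gumbel_softmax(self.alpha, tau=tau, hard=True)
        idx = int(gate.argmax())
        # evaluate only the sampled op; multiplying by gate[idx] keeps alpha in the graph
        return gate[idx] * self.ops[idx](x)

edge = SinglePathEdge(channels=8)
out = edge(torch.randn(2, 8, 16, 16))
```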
    Workflow Augmentation of Video Data for Event Recognition with Time-Sensitive Neural Networks. (arXiv:2109.15063v1 [eess.IV])
    (0 min) Supervised training of neural networks requires large, diverse and well-annotated data sets. In the medical field, this is often difficult to achieve due to constraints in time, expert knowledge and prevalence of an event. Artificial data augmentation can help to prevent overfitting and improve the detection of rare events as well as overall performance. However, most augmentation techniques use purely spatial transformations, which are not sufficient for video data with temporal correlations. In this paper, we present a novel methodology for workflow augmentation and demonstrate its benefit for event recognition in cataract surgery. The proposed approach increases the frequency of event alternation by creating artificial videos. The original video is split into event segments and a workflow graph is extracted from the original annotations. Finally, the segments are assembled into new videos based on the workflow graph. Compared to the original videos, the frequency of event alternation in the augmented cataract surgery videos increased by 26%. Further, a 3% higher classification accuracy and a 7.8% higher precision were achieved compared to a state-of-the-art approach. Our approach is particularly helpful to increase the occurrence of rare but important events and can be applied to a large variety of use cases.
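    A rough sketch of the segment-reassembly idea above, assuming each annotated video is given as a list of (event_label, segment) tuples; build_workflow_graph and augment_video are hypothetical helper names and the procedure only approximates the paper's augmentation.

```python
import random
from collections import defaultdict

def build_workflow_graph(annotated_videos):
    # Count event-to-event transitions observed in the original annotations.
    graph = defaultdict(list)
    for video in annotated_videos:
        for (ev_a, _), (ev_b, _) in zip(video, video[1:]):
            graph[ev_a].append(ev_b)
    return graph

def augment_video(annotated_videos, length, seed=None):
    # Assemble an artificial video by walking the workflow graph and picking a
    # real segment for each visited event (illustrative, not the exact method).
    rng = random.Random(seed)
    graph = build_workflow_graph(annotated_videos)
    segments = defaultdict(list)
    for video in annotated_videos:
        for ev, seg in video:
            segments[ev].append(seg)
    event = rng.choice([video[0][0] for video in annotated_videos])  # start event
    new_video = []
    for _ in range(length):
        new_video.append((event, rng.choice(segments[event])))
        if not graph[event]:        # terminal event: no observed outgoing transition
            break
        event = rng.choice(graph[event])
    return new_video
```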
    Adversarial Semantic Contour for Object Detection. (arXiv:2109.15009v1 [cs.CV])
    (0 min) Modern object detectors are vulnerable to adversarial examples, which brings potential risks to numerous applications, e.g., self-driving cars. Among attacks regularized by the $\ell_p$ norm, the $\ell_0$-attack aims to modify as few pixels as possible. Nevertheless, the problem is nontrivial since it generally requires optimizing the shape and the texture simultaneously, which is an NP-hard problem. To address this issue, we propose a novel method of Adversarial Semantic Contour (ASC) guided by the object contour as a prior. With this prior, we reduce the search space to accelerate the $\ell_0$ optimization, and also introduce more semantic information which should affect the detectors more. Based on the contour, we alternately optimize the selection of modified pixels via sampling and their colors with gradient descent. Extensive experiments demonstrate that our proposed ASC outperforms the most common manually designed patterns (e.g., square patches and grids) on the disappearing attack task. By modifying no more than 5\% and 3.5\% of the object area respectively, our proposed ASC can successfully mislead the mainstream object detectors including the SSD512, Yolov4, Mask RCNN, Faster RCNN, etc.
    A Deep Learning Localization Method for Measuring Abdominal Muscle Dimensions in Ultrasound Images. (arXiv:2109.14919v1 [eess.IV])
    (0 min) Health professionals extensively use Two-Dimensional (2D) Ultrasound (US) videos and images to visualize and measure internal organs for various purposes including evaluation of muscle architectural changes. US images can be used to measure abdominal muscle dimensions for the diagnosis and creation of customized treatment plans for patients with Low Back Pain (LBP); however, they are difficult to interpret. Due to high variability, skilled professionals with specialized training are required to take measurements to avoid low intra-observer reliability. This variability stems from the challenging nature of accurately finding the correct spatial location of measurement endpoints in abdominal US images. In this paper, we use a Deep Learning (DL) approach to automate the measurement of the abdominal muscle thickness in 2D US images. By treating the problem as a localization task, we develop a modified Fully Convolutional Network (FCN) architecture to generate blobs of coordinate locations of measurement endpoints, similar to what a human operator does. We demonstrate that using the TrA400 US image dataset, our network achieves a Mean Absolute Error (MAE) of 0.3125 on the test set, which almost matches the performance of skilled ultrasound technicians. Our approach can facilitate next steps for automating the process of measurements in 2D US images, while reducing inter-observer as well as intra-observer variability for more effective clinical outcomes.
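    The blob targets can be pictured as standard heatmap-regression targets; the sketch below, with the hypothetical endpoint_heatmaps helper and an assumed Gaussian sigma, renders one blob per measurement endpoint and illustrates the general setup rather than the paper's exact target encoding.

```python
import numpy as np

def endpoint_heatmaps(points, height, width, sigma=4.0):
    # Render one Gaussian blob per measurement endpoint, one channel per point.
    # `points` is an iterable of (row, col) coordinates.
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for (r, c) in points:
        maps.append(np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2)))
    return np.stack(maps, axis=0)   # shape: (num_points, H, W)

# e.g. two thickness-measurement endpoints on a 256x256 ultrasound frame
targets = endpoint_heatmaps([(100, 80), (150, 180)], 256, 256)
```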
    Chest X-Rays Image Classification from beta-Variational Autoencoders Latent Features. (arXiv:2109.14760v1 [eess.IV])
    (0 min) Chest X-Ray (CXR) is one of the most common diagnostic techniques used in everyday clinical practice all around the world. We hereby present a work that investigates and analyses the use of Deep Learning (DL) techniques to extract information from such images and to classify them, trying to keep our methodology as general as possible and, in the future, possibly usable in a real-world scenario without much effort. To move in this direction, we trained several beta-Variational Autoencoder (beta-VAE) models on the CheXpert dataset, one of the largest publicly available collections of labeled CXR images; from these models, latent features have been extracted and used to train other Machine Learning models able to classify the original images from the features extracted by the beta-VAE. Lastly, tree-based models have been combined in ensembles to improve the results without the need for further training or model engineering. Expecting some drop in pure performance with respect to state-of-the-art classification-specific models, we obtained encouraging results, which show the viability of our approach and the usability of the high-level features extracted by the autoencoders for classification tasks.
    Bend-Net: Bending Loss Regularized Multitask Learning Network for Nuclei Segmentation in Histopathology Images. (arXiv:2109.15283v1 [eess.IV])
    (0 min) Separating overlapped nuclei is a major challenge in histopathology image analysis. Recently published approaches have achieved promising overall performance on nuclei segmentation; however, their performance on separating overlapped nuclei is quite limited. To address the issue, we propose a novel multitask learning network with a bending loss regularizer to separate overlapped nuclei accurately. The newly proposed multitask learning architecture enhances the generalization by learning shared representation from three tasks: instance segmentation, nuclei distance map prediction, and overlapped nuclei distance map prediction. The proposed bending loss defines high penalties to concave contour points with large curvatures, and applies small penalties to convex contour points with small curvatures. Minimizing the bending loss avoids generating contours that encompass multiple nuclei. In addition, two new quantitative metrics, Aggregated Jaccard Index of overlapped nuclei (AJIO) and Accuracy of overlapped nuclei (ACCO), are designed for the evaluation of overlapped nuclei segmentation. We validate the proposed approach on the CoNSeP and MoNuSegv1 datasets using seven quantitative metrics: Aggregate Jaccard Index, Dice, Segmentation Quality, Recognition Quality, Panoptic Quality, AJIO, and ACCO. Extensive experiments demonstrate that the proposed Bend-Net outperforms eight state-of-the-art approaches.
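    An illustrative curvature-based penalty in the spirit of the bending loss described above, assuming a closed counter-clockwise contour given as an N x 2 array; the weighting scheme and the bending_penalty helper are assumptions, not the paper's exact formulation.

```python
import numpy as np

def bending_penalty(contour, concave_weight=10.0, convex_weight=1.0):
    # Concave points with large turning angles are penalized heavily, convex
    # points lightly, discouraging contours that wrap around multiple nuclei.
    prev_pts = np.roll(contour, 1, axis=0)
    next_pts = np.roll(contour, -1, axis=0)
    v1 = contour - prev_pts
    v2 = next_pts - contour
    cross = v1[:, 0] * v2[:, 1] - v1[:, 1] * v2[:, 0]   # sign: + convex (CCW), - concave
    dot = (v1 * v2).sum(axis=1)
    turn = np.abs(np.arctan2(cross, dot))                # turning angle ~ curvature magnitude
    weights = np.where(cross < 0, concave_weight, convex_weight)
    return float((weights * turn).mean())
```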
    Unsupervised Domain Adaptation for LiDAR Panoptic Segmentation. (arXiv:2109.15286v1 [cs.CV])
    (0 min) Scene understanding is a pivotal task for autonomous vehicles to safely navigate in the environment. Recent advances in deep learning enable accurate semantic reconstruction of the surroundings from LiDAR data. However, these models encounter a large domain gap when deployed on vehicles equipped with different LiDAR setups, which drastically decreases their performance. Fine-tuning the model for every new setup is infeasible due to the expensive and cumbersome process of recording and manually labeling new data. Unsupervised Domain Adaptation (UDA) techniques are thus essential to fill this domain gap and retain the performance of models on new sensor setups without the need for additional data labeling. In this paper, we propose AdaptLPS, a novel UDA approach for LiDAR panoptic segmentation that leverages task-specific knowledge and accounts for variation in the number of scan lines, mounting position, intensity distribution, and environmental conditions. We tackle the UDA task by employing two complementary domain adaptation strategies, data-based and model-based. While data-based adaptations reduce the domain gap by processing the raw LiDAR scans to resemble the scans in the target domain, model-based techniques guide the network in extracting features that are representative of both domains. Extensive evaluations on three pairs of real-world autonomous driving datasets demonstrate that AdaptLPS outperforms existing UDA approaches by up to 6.41 pp in terms of the PQ score.
    Gradient-based Hyperparameter Optimization Over Long Horizons. (arXiv:2007.07869v2 [cs.LG] UPDATED)
    (0 min) Gradient-based hyperparameter optimization has earned a widespread popularity in the context of few-shot meta-learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn hyperparameters online, but this introduces greediness which comes with a significant performance drop. We propose forward-mode differentiation with sharing (FDS), a simple and efficient algorithm which tackles memory scaling issues with forward-mode differentiation, and gradient degradation issues by sharing hyperparameters that are contiguous in time. We provide theoretical guarantees about the noise reduction properties of our algorithm, and demonstrate its efficiency empirically by differentiating through $\sim 10^4$ gradient steps of unrolled optimization. We consider large hyperparameter search ranges on CIFAR-10 where we significantly outperform greedy gradient-based alternatives, while achieving $\times 20$ speedups compared to the state-of-the-art black-box methods. Code is available at: \url{https://github.com/polo5/FDS}
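    A toy illustration of forward-mode hypergradients for a single learning rate shared across all unrolled steps, on a one-dimensional quadratic loss; it only conveys the constant-memory idea and is not the FDS algorithm itself.

```python
def forward_mode_hypergradient(eta, w0=5.0, w_star=0.0, steps=100):
    # Forward-mode hypergradient of the final loss w.r.t. a learning rate shared
    # across the whole horizon, for L(w) = 0.5 * (w - w_star)^2.
    # Memory is O(1) in the horizon length, unlike reverse-mode unrolling.
    w, z = w0, 0.0                 # z accumulates dw/d(eta)
    for _ in range(steps):
        g = w - w_star             # dL/dw
        dg = 1.0                   # d^2L/dw^2 (constant for a quadratic)
        z = z - g - eta * dg * z   # differentiate the SGD update w <- w - eta * g
        w = w - eta * g
    return (w - w_star) * z        # chain rule: dL(w_T)/d(eta)

print(forward_mode_hypergradient(eta=0.01))
```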
    FathomNet: A global underwater image training set for enabling artificial intelligence in the ocean. (arXiv:2109.14646v1 [cs.CV])
    (0 min) Ocean-going platforms are integrating high-resolution camera feeds for observation and navigation, producing a deluge of visual data. The volume and rate of this data collection can rapidly outpace researchers' abilities to process and analyze them. Recent advances in machine learning enable fast, sophisticated analysis of visual data, but have had limited success in the oceanographic world due to a lack of dataset standardization, sparse annotation tools, and insufficient formatting and aggregation of existing, expertly curated imagery for use by data scientists. To address this need, we have built FathomNet, a public platform that makes use of existing (and future), expertly curated data. Initial efforts have leveraged MBARI's Video Annotation and Reference System and annotated deep sea video database, which has more than 7M annotations, 1M framegrabs, and 5k terms in the knowledgebase, with additional contributions by National Geographic Society (NGS) and NOAA's Office of Ocean Exploration and Research. FathomNet has over 100k localizations of 1k midwater and benthic classes, and contains iconic and non-iconic views of marine animals, underwater equipment, debris, etc. We will demonstrate how machine learning models trained on FathomNet data can be applied across different institutional video data (e.g., NGS' Deep Sea Camera System and NOAA's ROV Deep Discoverer) and enable automated acquisition and tracking of midwater animals using MBARI's ROV MiniROV. As FathomNet continues to develop and incorporate more image data from other oceanographic community members, this effort will enable scientists, explorers, policymakers, storytellers, and the public to understand and care for our ocean.
    MetaHistoSeg: A Python Framework for Meta Learning in Histopathology Image Segmentation. (arXiv:2109.14754v1 [eess.IV])
    (0 min) Few-shot learning is a standard practice in most deep-learning-based histopathology image segmentation, given the relatively low number of digitized slides that are generally available. While many models have been developed for domain-specific histopathology image segmentation, cross-domain generalization remains a key challenge for properly validating models. Here, tooling and datasets to benchmark model performance across histopathological domains are lacking. To address this limitation, we introduce MetaHistoSeg - a Python framework that implements unique scenarios in both meta learning and instance-based transfer learning. Designed for easy extension to customized datasets and task sampling schemes, the framework empowers researchers to rapidly design models and run experiments. We also curate a histopathology meta dataset - a benchmark dataset for training and validating models on out-of-distribution performance across a range of cancer types. In experiments we showcase the usage of MetaHistoSeg with the meta dataset and find that both meta-learning and instance-based transfer learning deliver comparable results on average, but in some cases tasks can greatly benefit from one over the other.
    Automatic Estimation of Ulcerative Colitis Severity from Endoscopy Videos using Ordinal Multi-Instance Learning. (arXiv:2109.14685v1 [eess.IV])
    (0 min) Ulcerative colitis (UC) is a chronic inflammatory bowel disease characterized by relapsing inflammation of the large intestine. The severity of UC is often represented by the Mayo Endoscopic Subscore (MES) which quantifies mucosal disease activity from endoscopy videos. In clinical trials, an endoscopy video is assigned an MES based upon the most severe disease activity observed in the video. For this reason, severe inflammation spread throughout the colon will receive the same MES as an otherwise healthy colon with severe inflammation restricted to a small, localized segment. Therefore, the extent of disease activity throughout the large intestine, and overall response to treatment, may not be completely captured by the MES. In this work, we aim to automatically estimate UC severity for each frame in an endoscopy video to provide a higher resolution assessment of disease activity throughout the colon. Because annotating severity at the frame-level is expensive, labor-intensive, and highly subjective, we propose a novel weakly supervised, ordinal classification method to estimate frame severity from video MES labels alone. Using clinical trial data, we first achieved 0.92 and 0.90 AUC for predicting mucosal healing and remission of UC, respectively. Then, for severity estimation, we demonstrate that our models achieve substantial Cohen's Kappa agreement with ground truth MES labels, comparable to the inter-rater agreement of expert clinicians. These findings indicate that our framework could serve as a foundation for novel clinical endpoints, based on a more localized scoring system, to better evaluate UC drug efficacy in clinical trials.
    PP-LCNet: A Lightweight CPU Convolutional Neural Network. (arXiv:2109.15099v1 [cs.CV])
    (0 min) We propose a lightweight CPU network based on the MKLDNN acceleration strategy, named PP-LCNet, which improves the performance of lightweight models on multiple tasks. This paper lists techniques that can improve network accuracy while keeping the latency almost constant. With these improvements, the accuracy of PP-LCNet greatly surpasses that of previous network structures with the same inference time for classification. As shown in Figure 1, it outperforms most state-of-the-art models. It also performs very well on downstream computer vision tasks such as object detection and semantic segmentation. All our experiments are implemented based on PaddlePaddle. Code and pretrained models are available at PaddleClas.
    Identity-Disentangled Neural Deformation Model for Dynamic Meshes. (arXiv:2109.15299v1 [cs.CV])
    (0 min) Neural shape models can represent complex 3D shapes with a compact latent space. When applied to dynamically deforming shapes such as the human hands, however, they would need to preserve temporal coherence of the deformation as well as the intrinsic identity of the subject. These properties are difficult to regularize with manually designed loss functions. In this paper, we learn a neural deformation model that disentangles the identity-induced shape variations from pose-dependent deformations using implicit neural functions. We perform template-free unsupervised learning on 3D scans without explicit mesh correspondence or semantic correspondences of shapes across subjects. We can then apply the learned model to reconstruct partial dynamic 4D scans of novel subjects performing unseen actions. We propose two methods to integrate global pose alignment with our neural deformation model. Experiments demonstrate the efficacy of our method in the disentanglement of identities and pose. Our method also outperforms traditional skeleton-driven models in reconstructing surface details such as palm prints or tendons without limitations from a fixed template.
    Spark in the Dark: Evaluating Encoder-Decoder Pairs for COVID-19 CT's Semantic Segmentation. (arXiv:2109.14818v1 [eess.IV])
    (0 min) With the COVID-19 global pandemic, computer-assisted diagnoses of medical images have gained a lot of attention, and robust methods for Semantic Segmentation of Computed Tomography (CT) became highly desirable. Semantic Segmentation of CT is one of many research fields in the automatic detection of COVID-19 and has been widely explored since the COVID-19 outbreak. In the robotic field, Semantic Segmentation of organs and CTs is widely used in robots developed for surgery tasks. As new methods and new datasets are proposed quickly, the necessity of an extensive evaluation of those methods becomes apparent. To provide a standardized comparison of different architectures across multiple recently proposed datasets, we propose in this paper an extensive benchmark of multiple encoders and decoders with a total of 120 architectures evaluated on five datasets, with each dataset being validated through a five-fold cross-validation strategy, totaling 3,000 experiments. To the best of our knowledge, this is the largest evaluation in number of encoders, decoders, and datasets proposed in the field of COVID-19 CT segmentation.
    HLIC: Harmonizing Optimization Metrics in Learned Image Compression by Reinforcement Learning. (arXiv:2109.14863v1 [cs.CV])
    (0 min) Learned image compression has made good progress in recent years. Peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) are the two most popular evaluation metrics. As different metrics only reflect certain aspects of human perception, works in this field normally optimize two models using PSNR and MS-SSIM as loss functions separately, which is suboptimal and makes it difficult to select the model with the best visual quality or overall performance. Towards solving this problem, we propose to Harmonize optimization metrics in Learned Image Compression (HLIC) using online loss function adaptation by reinforcement learning. By doing so, we are able to leverage the advantages of both PSNR and MS-SSIM, achieving better visual quality and a higher VMAF score. To our knowledge, our work is the first to explore automatic loss function adaptation for harmonizing optimization metrics in low-level vision tasks like learned image compression.
    Comparative Validation of Machine Learning Algorithms for Surgical Workflow and Skill Analysis with the HeiChole Benchmark. (arXiv:2109.14956v1 [eess.IV])
    (0 min) PURPOSE: Surgical workflow and skill analysis are key technologies for the next generation of cognitive surgical assistance systems. These systems could increase the safety of the operation through context-sensitive warnings and semi-autonomous robotic assistance or improve training of surgeons via data-driven feedback. In surgical workflow analysis, up to 91% average precision has been reported for phase recognition on an open-data single-center dataset. In this work, we investigated the generalizability of phase recognition algorithms in a multi-center setting including more difficult recognition tasks such as surgical action and surgical skill. METHODS: To achieve this goal, a dataset with 33 laparoscopic cholecystectomy videos from three surgical centers with a total operation time of 22 hours was created. Labels included annotation of seven surgical phases with 250 phase transitions, 5514 occurrences of four surgical actions, 6980 occurrences of 21 surgical instruments from seven instrument categories and 495 skill classifications in five skill dimensions. The dataset was used in the 2019 Endoscopic Vision challenge, sub-challenge for surgical workflow and skill analysis. Here, 12 teams submitted their machine learning algorithms for recognition of phase, action, instrument and/or skill assessment. RESULTS: F1-scores were achieved for phase recognition between 23.9% and 67.7% (n=9 teams), for instrument presence detection between 38.5% and 63.8% (n=8 teams), but for action recognition only between 21.8% and 23.3% (n=5 teams). The average absolute error for skill assessment was 0.78 (n=1 team). CONCLUSION: Surgical workflow and skill analysis are promising technologies to support the surgical team, but are not solved yet, as shown by our comparison of algorithms. This novel benchmark can be used for comparable evaluation and validation of future work.
    Sketch2Mesh: Reconstructing and Editing 3D Shapes from Sketches. (arXiv:2104.00482v2 [cs.CV] UPDATED)
    (0 min) Reconstructing 3D shape from 2D sketches has long been an open problem because the sketches only provide very sparse and ambiguous information. In this paper, we use an encoder/decoder architecture for sketch-to-mesh translation. When integrated into a user interface that provides camera parameters for the sketches, this enables us to leverage its latent parametrization to represent and refine a 3D mesh so that its projections match the external contours outlined in the sketch. We will show that this approach is easy to deploy, robust to style changes, and effective. Furthermore, it can be used for shape refinement given only single pen strokes. We compare our approach to state-of-the-art methods on sketches -- both hand-drawn and synthesized -- and demonstrate that we outperform them.
    Understanding Egocentric Hand-Object Interactions from Hand Pose Estimation. (arXiv:2109.14657v1 [cs.CV])
    (0 min) In this paper, we address the problem of estimating the hand pose from the egocentric view when the hand is interacting with objects. Specifically, we propose a method to label a dataset, Ego-Siam, which contains egocentric images in pairs. We also use the collected pairwise data to train our encoder-decoder style network, which has been proven efficient in prior work. This could bring extra training efficiency and testing accuracy. Our network is lightweight and can run at over 30 FPS on an outdated GPU. We demonstrate that our method outperforms Mueller et al., the state-of-the-art work dealing with egocentric hand-object interaction problems, on the GANerated dataset. To show our method's ability to preserve semantic information, we also report the performance of grasp type classification on the GUN-71 dataset and outperform the benchmark using only the predicted 3D hand pose.
    ASOD60K: An Audio-Induced Salient Object Detection Dataset for Panoramic Videos. (arXiv:2107.11629v2 [cs.CV] UPDATED)
    (0 min) Exploring to what humans pay attention in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-/object-level saliency detection tasks, we focus on audio-induced salient object detection (SOD), where the salient objects are labeled with the guidance of audio-induced eye movements. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, thus distinguishing itself with richness, diversity and quality. Specifically, each sequence is marked with both its super-/sub-class, with objects of each sub-class being further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modelling, e.g., determining the major challenges for existing SOD models, and predicting scanpaths to study the long-term eye fixation behaviors of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study could serve as a good starting point for advancing SOD research towards panoramic videos. The dataset and benchmark will be made publicly available at https://github.com/PanoAsh/ASOD60K.
    A Prior Knowledge Based Tumor and Tumoral Subregion Segmentation Tool for Pediatric Brain Tumors. (arXiv:2109.14775v1 [eess.IV])
    (0 min) In the past few years, deep learning (DL) models have drawn great attention and shown superior performance on brain tumor and subregion segmentation tasks. However, the success is limited to segmentation of adult gliomas, where sufficient data have been collected, manually labeled, and published for training DL models. It is still challenging to segment pediatric tumors, because their appearance differs from that of adult gliomas. Hence, directly applying a pretrained DL model on pediatric data usually generates unacceptable results. Because pediatric data is very limited, both labeled and unlabeled, we present a brain tumor segmentation model that is based on knowledge rather than learning from data. We also provide segmentation of more subregions for highly heterogeneous tumors such as atypical teratoid rhabdoid tumor (ATRT). Our proposed approach showed performance superior to DL-based models on both whole-tumor and subregion segmentation tasks on our pediatric data when training data is not available for transfer learning.
    Robust Multi-Domain Mitosis Detection. (arXiv:2109.15092v1 [eess.IV])
    (0 min) Domain variability is a common bottleneck in developing generalisable algorithms for various medical applications. Motivated by the observation that the domain variability of the medical images is to some extent compact, we propose to learn a target representative feature space through unpaired image-to-image translation (CycleGAN). We comprehensively evaluate the performance and usefulness by applying the transformation to mitosis detection with candidate proposal and classification. This work presents a simple yet effective multi-step mitotic figure detection algorithm developed as a baseline for the MIDOG challenge. On the preliminary test set, the algorithm scores an F1 score of 0.52.
    Fake It Till You Make It: Face analysis in the wild using synthetic data alone. (arXiv:2109.15102v1 [cs.CV])
    (0 min) We demonstrate that it is possible to perform face-related computer vision in the wild using synthetic data alone. The community has long enjoyed the benefits of synthesizing training data with graphics, but the domain gap between real and synthetic data has remained a problem, especially for human faces. Researchers have tried to bridge this gap with data mixing, domain adaptation, and domain-adversarial training, but we show that it is possible to synthesize data with minimal domain gap, so that models trained on synthetic data generalize to real in-the-wild datasets. We describe how to combine a procedurally-generated parametric 3D face model with a comprehensive library of hand-crafted assets to render training images with unprecedented realism and diversity. We train machine learning systems for face-related tasks such as landmark localization and face parsing, showing that synthetic data can both match real data in accuracy as well as open up new approaches where manual labelling would be impossible.
    Vision-Aided Beam Tracking: Explore the Proper Use of Camera Images with Deep Learning. (arXiv:2109.14686v1 [cs.LG])
    (0 min) We investigate the problem of wireless beam tracking on mmWave bands with the assistance of camera images. In particular, based on the beam indices used by the user and the camera images taken along the trajectory, we predict the optimal beam indices in the next few time spots. To resolve this problem, we first reformulate the "ViWi" dataset in [1] to get rid of the image repetition problem. Then we develop a deep learning approach and investigate various model components to achieve the best performance. Finally, we explore whether, when, and how to use the image for better beam prediction. To answer this question, we split the dataset into three clusters -- (LOS, light NLOS, serious NLOS)-like -- based on the standard deviation of the beam sequence. With experiments, we demonstrate that using the image indeed helps beam tracking, especially when the user is in serious NLOS, and that the solution relies on a carefully designed dataset for training a model. Generally speaking, including NLOS-like data for training a model does not benefit beam tracking of the user in LOS, but including light NLOS-like data for training a model benefits beam tracking of the user in serious NLOS.
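    A minimal sketch of the splitting rule described above: a beam-index sequence is assigned to a (LOS, light NLOS, serious NLOS)-like cluster based on its standard deviation; the thresholds are invented for illustration, not taken from the paper.

```python
import numpy as np

def los_cluster(beam_indices, light_thresh=2.0, serious_thresh=6.0):
    # Cluster assignment from the standard deviation of the beam sequence.
    std = float(np.std(beam_indices))
    if std < light_thresh:
        return "LOS-like"
    if std < serious_thresh:
        return "light-NLOS-like"
    return "serious-NLOS-like"

print(los_cluster([31, 31, 32, 31, 30, 31]))   # stable beams -> LOS-like
```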
    Federated Self-Supervised Contrastive Learning via Ensemble Similarity Distillation. (arXiv:2109.14611v1 [cs.LG])
    (0 min) This paper investigates the feasibility of learning good representation space with unlabeled client data in the federated scenario. Existing works trivially inherit the supervised federated learning methods, which does not apply to the model heterogeneity and has the potential risk of privacy exposure. To tackle the problems above, we first identify that self-supervised contrastive local training is more robust against the non-i.i.d.-ness than the traditional supervised learning paradigm. Then we propose a novel federated self-supervised contrastive learning framework FLESD that supports architecture-agnostic local training and communication-efficient global aggregation. At each round of communication, the server first gathers a fraction of the clients' inferred similarity matrices on a public dataset. Then FLESD ensembles the similarity matrices and trains the global model via similarity distillation. We verify the effectiveness of our proposed framework by a series of empirical experiments and show that FLESD has three main advantages over the existing methods: it handles the model heterogeneity, is less prone to privacy leak, and is more communication-efficient. We will release the code of this paper in the future.
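    One plausible way to picture the ensembling and similarity-distillation steps, assuming clients report [N, N] similarity matrices over a shared public dataset; the temperature, masking, and KL objective below are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ensemble_similarities(client_sims):
    # Average the similarity matrices reported by the sampled clients.
    return torch.stack(client_sims).mean(dim=0)

def similarity_distillation_loss(student_feats, teacher_sim, temperature=0.1):
    # Match the student's row-wise similarity distribution over the public
    # dataset to the ensembled teacher similarities.
    student_feats = F.normalize(student_feats, dim=1)
    student_sim = student_feats @ student_feats.t()
    n = student_sim.size(0)
    mask = ~torch.eye(n, dtype=torch.bool)               # ignore self-similarity
    p_teacher = F.softmax(teacher_sim[mask].view(n, n - 1) / temperature, dim=1)
    log_p_student = F.log_softmax(student_sim[mask].view(n, n - 1) / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```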
    GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation. (arXiv:2109.14813v1 [cs.CV])
    (0 min) To achieve an accurate assessment of root canal therapy, a fundamental step is to perform tooth root segmentation on oral X-ray images, since the position of the tooth root boundary is important anatomical information for root canal therapy evaluation. However, the fuzzy boundary makes the tooth root segmentation very challenging. In this paper, we propose a novel end-to-end U-Net like Group Transformer Network (GT U-Net) for the tooth root segmentation. The proposed network retains the essential structure of U-Net but each of the encoders and decoders is replaced by a group Transformer, which significantly reduces the computational cost of traditional Transformer architectures by using the grouping structure and the bottleneck structure. In addition, the proposed GT U-Net is composed of a hybrid structure of convolution and Transformer, which makes it independent of pre-trained weights. For optimization, we also propose a shape-sensitive Fourier Descriptor (FD) loss function to make use of shape prior knowledge. Experimental results show that our proposed network achieves the state-of-the-art performance on our collected tooth root segmentation dataset and the public retina dataset DRIVE. Code has been released at https://github.com/Kent0n-Li/GT-U-Net.
    LR-to-HR Face Hallucination with an Adversarial Progressive Attribute-Induced Network. (arXiv:2109.14690v1 [cs.CV])
    (0 min) Face super-resolution is a challenging and highly ill-posed problem since a low-resolution (LR) face image may correspond to multiple high-resolution (HR) ones during the hallucination process and cause a dramatic identity change for the final super-resolved results. Thus, to address this problem, we propose an end-to-end progressive learning framework incorporating facial attributes and enforcing additional supervision from multi-scale discriminators. By incorporating facial attributes into the learning process and progressively resolving the facial image, the mapping between LR and HR images is constrained more, and this significantly helps to reduce the ambiguity and uncertainty in the one-to-many mapping. In addition, we conduct thorough evaluations on the CelebA dataset following the settings of previous works (i.e., super-resolving by a factor of 8x from tiny 16x16 face images), and the results demonstrate that the proposed approach can yield satisfactory face hallucination images outperforming other state-of-the-art approaches.
    Robust Segmentation Models using an Uncertainty Slice Sampling Based Annotation Workflow. (arXiv:2109.14879v1 [cs.CV])
    (0 min) Semantic segmentation neural networks require pixel-level annotations in large quantities to achieve a good performance. In the medical domain, such annotations are expensive, because they are time-consuming and require expert knowledge. Active learning optimizes the annotation effort by devising strategies to select cases for labeling that are most informative to the model. In this work, we propose an uncertainty slice sampling (USS) strategy for semantic segmentation of 3D medical volumes that selects 2D image slices for annotation and compare it with various other strategies. We demonstrate the efficiency of USS on a CT liver segmentation task using multi-site data. After five iterations, the training data resulting from USS consisted of 2410 slices (4% of all slices in the data pool) compared to 8121 (13%), 8641 (14%), and 3730 (6%) for uncertainty volume (UVS), random volume (RVS), and random slice (RSS) sampling, respectively. Despite being trained on the smallest amount of data, the model based on the USS strategy evaluated on 234 test volumes significantly outperformed models trained according to other strategies and achieved a mean Dice index of 0.964, a relative volume error of 4.2%, a mean surface distance of 1.35 mm, and a Hausdorff distance of 23.4 mm. This was only slightly inferior to 0.967, 3.8%, 1.18 mm, and 22.9 mm achieved by a model trained on all available data, but the robustness analysis using the 5th percentile of Dice and the 95th percentile of the remaining metrics demonstrated that USS resulted not only in the most robust model compared to other sampling schemes, but also outperformed the model trained on all data according to Dice (0.946 vs. 0.945) and mean surface distance (1.92 mm vs. 2.03 mm).
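    A minimal sketch of uncertainty slice sampling, assuming softmax probabilities are available for every slice of a volume; mean voxel-wise entropy stands in for whatever uncertainty measure the paper actually uses.

```python
import numpy as np

def select_uncertain_slices(prob_volume, num_slices):
    # prob_volume: softmax probabilities of shape (num_slices_total, num_classes, H, W).
    # Returns the indices of the slices with the highest mean voxel-wise entropy.
    eps = 1e-8
    entropy = -(prob_volume * np.log(prob_volume + eps)).sum(axis=1)   # (S, H, W)
    slice_uncertainty = entropy.mean(axis=(1, 2))                      # (S,)
    return np.argsort(slice_uncertainty)[::-1][:num_slices]
```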
    T\"oRF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis. (arXiv:2109.15271v1 [cs.CV])
    (0 min) Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with low reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors that are now available on modern smartphones.
    Riedones3D: a celtic coin dataset for registration and fine-grained clustering. (arXiv:2109.15033v1 [cs.CV])
    (0 min) Clustering coins with respect to their die is an important component of numismatic research and crucial for understanding the economic history of tribes (especially when literary production does not exist, as in Celtic culture). It is a very hard task that requires a lot of time and expertise. To cluster thousands of coins, automatic methods are becoming necessary. Nevertheless, public datasets for coin die clustering evaluation are too rare, though they are very important for the development of new methods. Therefore, we propose a new 3D dataset of 2,070 scans of coins. With this dataset, we propose two benchmarks, one for point cloud registration, which is essential for coin die recognition, and one for coin die clustering. We show how we automatically cluster coins to help experts, and perform a preliminary evaluation for these two tasks. The code of the baseline and the dataset will be publicly available at https://www.npm3d.fr/coins-riedones3d and https://www.chronocarto.eu/spip.php?article84&lang=fr
    Targeted Gradient Descent: A Novel Method for Convolutional Neural Networks Fine-tuning and Online-learning. (arXiv:2109.14729v1 [cs.CV])
    (0 min) A convolutional neural network (ConvNet) is usually trained and then tested using images drawn from the same distribution. Generalizing a ConvNet to various tasks often requires a complete training dataset that consists of images drawn from different tasks. In most scenarios, it is nearly impossible to collect every possible representative dataset a priori. New data may only become available after the ConvNet is deployed in clinical practice. The ConvNet, however, may generate artifacts on out-of-distribution testing samples. In this study, we present Targeted Gradient Descent (TGD), a novel fine-tuning method that can extend a pre-trained network to a new task without revisiting data from the previous task while preserving the knowledge acquired from previous training. Furthermore, the proposed method also enables online learning of patient-specific data. The method is built on the idea of reusing a pre-trained ConvNet's redundant kernels to learn new knowledge. We compare the performance of TGD to several commonly used training approaches on the task of positron emission tomography (PET) image denoising. Results from clinical images show that TGD generated results on par with training-from-scratch while significantly reducing data preparation and network training time. More importantly, it enables online learning on the testing study to enhance the network's generalization capability in real-world applications.
    Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments. (arXiv:2109.15207v1 [cs.CV])
    (0 min) In the Vision-and-Language Navigation (VLN) task an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle 'off the path' scenarios where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent's location to the goal, but such goal-oriented supervision is often not in alignment with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
    Uncertainty-aware Mean Teacher for Source-free Unsupervised Domain Adaptive 3D Object Detection. (arXiv:2109.14651v1 [cs.CV])
    (0 min) Pseudo-label-based self-training approaches are a popular method for source-free unsupervised domain adaptation. However, their efficacy depends on the quality of the labels generated by the source-trained model. These labels may be incorrect with high confidence, rendering thresholding methods ineffective. In order to avoid reinforcing errors caused by label noise, we propose an uncertainty-aware mean teacher framework which implicitly filters incorrect pseudo-labels during training. Leveraging model uncertainty allows the mean teacher network to perform implicit filtering by down-weighting losses corresponding to uncertain pseudo-labels. Effectively, we perform automatic soft-sampling of pseudo-labeled data while aligning predictions from the student and teacher networks. We demonstrate our method on several domain adaptation scenarios, from cross-dataset to cross-weather conditions, and achieve state-of-the-art performance in these cases, on the KITTI lidar target dataset.
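    A rough sketch of uncertainty-weighted consistency with a mean-teacher EMA update; the exponential down-weighting and the MSE consistency term are illustrative choices, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_consistency(student_logits, teacher_logits, teacher_uncertainty):
    # Down-weight the per-sample student/teacher consistency loss by the teacher's
    # uncertainty, so uncertain pseudo-labels contribute little (soft-sampling).
    teacher_logits = teacher_logits.detach()       # teacher is not updated by gradients
    per_sample = F.mse_loss(student_logits, teacher_logits, reduction="none").mean(dim=1)
    weights = torch.exp(-teacher_uncertainty)      # high uncertainty -> small weight
    return (weights * per_sample).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Standard mean-teacher update: teacher weights are an exponential moving
    # average of the student weights.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```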
    USIS: Unsupervised Semantic Image Synthesis. (arXiv:2109.14715v1 [cs.CV])
    (0 min) Semantic Image Synthesis (SIS) is a subclass of image-to-image translation where a photorealistic image is synthesized from a segmentation mask. SIS has mostly been addressed as a supervised problem. However, state-of-the-art methods depend on a huge amount of labeled data and cannot be applied in an unpaired setting. On the other hand, generic unpaired image-to-image translation frameworks underperform in comparison, because they color-code semantic layouts and feed them to traditional convolutional networks, which then learn correspondences in appearance instead of semantic content. In this initial work, we propose a new Unsupervised paradigm for Semantic Image Synthesis (USIS) as a first step towards closing the performance gap between paired and unpaired settings. Notably, the framework deploys a SPADE generator that learns to output images with visually separable semantic classes using a self-supervised segmentation loss. Furthermore, in order to match the color and texture distribution of real images without losing high-frequency information, we propose to use whole image wavelet-based discrimination. We test our methodology on 3 challenging datasets and demonstrate its ability to generate multimodal photorealistic images with an improved quality in the unpaired setting.
    Unlocking the potential of deep learning for marine ecology: overview, applications, and outlook. (arXiv:2109.14737v1 [cs.LG])
    (2 min) The deep learning revolution is touching all scientific disciplines and corners of our lives as a means of harnessing the power of big data. Marine ecology is no exception. These new methods provide analysis of data from sensors, cameras, and acoustic recorders, even in real time, in ways that are reproducible and rapid. Off-the-shelf algorithms can find, count, and classify species from digital images or video and detect cryptic patterns in noisy data. Using these opportunities requires collaboration across ecological and data science disciplines, which can be challenging to initiate. To facilitate these collaborations and promote the use of deep learning towards ecosystem-based management of the sea, this paper aims to bridge the gap between marine ecologists and computer scientists. We provide insight into popular deep learning approaches for ecological data analysis in plain language, focusing on the techniques of supervised learning with deep neural networks, and illustrate challenges and opportunities through established and emerging applications of deep learning to marine ecology. We use established and future-looking case studies on plankton, fishes, marine mammals, pollution, and nutrient cycling that involve object detection, classification, tracking, and segmentation of visualized data. We conclude with a broad outlook of the field's opportunities and challenges, including potential technological advances and issues with managing complex data sets.
  • cs.IR updates on arXiv.org

    SUper Team at SemEval-2016 Task 3: Building a feature-rich system for community question answering. (arXiv:2109.15120v1 [cs.CL])
    (2 min) We present the system we built for participating in SemEval-2016 Task 3 on Community Question Answering. We achieved the best results on subtask C, and strong results on subtasks A and B, by combining a rich set of various types of features: semantic, lexical, metadata, and user-related. The most important group turned out to be the metadata for the question and for the comment, semantic vectors trained on QatarLiving data and similarities between the question and the comment for subtasks A and C, and between the original and the related question for Subtask B.
    Towards Understanding Trends Manipulation in Pakistan Twitter. (arXiv:2109.14872v1 [cs.IR])
    (2 min) The rapid adoption of online social media platforms has transformed the way of communication and interaction. On these platforms, discussions in the form of trending topics provide a glimpse of events happening around the world in real-time. Also, these trends are used for political campaigns, public awareness, and brand promotions. Consequently, these trends are sensitive to manipulation by malicious users who aim to mislead the mass audience. In this article, we identify and study the characteristics of users involved in the manipulation of Twitter trends in Pakistan. We propose 'Manipify', a framework for automatic detection and analysis of malicious users for Twitter trends. Our framework consists of three distinct modules: i) user classifier, ii) hashtag classifier, and iii) trend analyzer. The user classifier introduces a novel approach to automatically detect manipulators using tweet content and user behaviour features. Also, the module classifies human and bot users. Next, the hashtag classifier categorizes trending hashtags into six categories, assisting in examining manipulators' behaviour across different categories. Finally, the trend analyzer module examines users, hashtags, and tweets for hashtag reach, linguistic features and user behaviour. Our user classifier module achieves 0.91 accuracy in classifying the manipulators. We further test Manipify on a dataset comprising 665 trending hashtags with 5.4 million tweets and 1.9 million users. The analysis of trends reveals that the trending panel is mostly dominated by political hashtags. In addition, our results show a higher contribution of human accounts in trend manipulation as compared to bots. Furthermore, we present two case studies of hashtag-wars and anti-state propaganda to illustrate the real-world application of our research.
    Overview of the CLEF-2019 CheckThat!: Automatic Identification and Verification of Claims. (arXiv:2109.15118v1 [cs.CL])
    (3 min) We present an overview of the second edition of the CheckThat! Lab at CLEF 2019. The lab featured two tasks in two different languages: English and Arabic. Task 1 (English) challenged the participating systems to predict which claims in a political debate or speech should be prioritized for fact-checking. Task 2 (Arabic) asked to (A) rank a given set of Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) classify these same Web pages according to their degree of usefulness for fact-checking the target claim, (C) identify useful passages from these pages, and (D) use the useful pages to predict the claim's factuality. CheckThat! provided a full evaluation framework, consisting of data in English (derived from fact-checking sources) and Arabic (gathered and annotated from scratch) and evaluation based on mean average precision (MAP) and normalized discounted cumulative gain (nDCG) for ranking, and F1 for classification. A total of 47 teams registered to participate in this lab, and fourteen of them actually submitted runs (compared to nine last year). The evaluation results show that the most successful approaches to Task 1 used various neural networks and logistic regression. As for Task 2, learning-to-rank was used by the highest scoring runs for subtask A, while different classifiers were used in the other subtasks. We release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important tasks of check-worthiness estimation and automatic claim verification.
    Accelerating COVID-19 research with graph mining and transformer-based learning. (arXiv:2102.07631v2 [cs.IR] UPDATED)
    (2 min) In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," wherein artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present the automated general-purpose hypothesis generation systems AGATHA-C and AGATHA-GP for COVID-19 research. The systems are based on graph-mining and the transformer model. The systems are massively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a domain-expert-curated study, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the oxytocin hormone.
    Born Again Neural Rankers. (arXiv:2109.15285v1 [cs.IR])
    (2 min) We introduce Born Again neural Rankers (BAR) in the Learning to Rank (LTR) setting, where student rankers, trained in the Knowledge Distillation (KD) framework, are parameterized identically to their teachers. Unlike the existing ranking distillation work which pursues a good trade-off between performance and efficiency, BAR adapts the idea of Born Again Networks (BAN) to ranking problems and significantly improves ranking performance of students over the teacher rankers without increasing model capacity. The key differences between BAR and common distillation techniques for classification are: (1) an appropriate teacher score transformation function, and (2) a novel listwise distillation framework. Both techniques are specifically designed for ranking problems and are rarely studied in the knowledge distillation literature. Using the state-of-the-art neural ranking structure, BAR is able to push the limits of neural rankers above a recent rigorous benchmark study and significantly outperforms traditionally strong gradient boosted decision tree based models on 7 out of 9 key metrics, the first time in the literature. In addition to the strong empirical results, we give theoretical explanations on why listwise distillation is effective for neural rankers.
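    A minimal sketch of listwise distillation for ranking: per query, the student matches a softmax distribution over the (possibly transformed) teacher scores; the temperature and the plain softmax transformation are stand-ins for the paper's specific choices.

```python
import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_scores, teacher_scores, temperature=1.0):
    # student_scores, teacher_scores: (num_queries, list_size) ranking scores.
    # The teacher distribution acts as a soft listwise target for the student.
    p_teacher = F.softmax(teacher_scores / temperature, dim=1)
    log_p_student = F.log_softmax(student_scores / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```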
    Mac Users Do It Differently: the Role of Operating System and Individual Differences in File Management. (arXiv:2109.15272v1 [cs.HC])
    (2 min) Despite much discussion in HCI research about how individual differences likely determine computer users' personal information management (PIM) practices, the extent of the influence of several important factors remains unclear, including users' personalities, spatial abilities, and the different software used to manage their collections. We therefore analyse data from prior CHI work to explore (1) associations of people's file collections with personality and spatial ability, and (2) differences between collections managed with different operating systems and file managers. We find no notable associations between users' attributes and their collections, and minimal predictive power, but do find considerable and surprising differences across operating systems. We discuss these findings and how they can inform future research.
    GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph. (arXiv:2105.02605v2 [cs.CL] UPDATED)
    (2 min) Representation learning on textual graphs aims to generate low-dimensional embeddings for the nodes based on the individual textual features and the neighbourhood information. Recent breakthroughs on pretrained language models and graph neural networks push forward the development of corresponding techniques. The existing works mainly rely on the cascaded model architecture: the textual features of nodes are independently encoded by language models at first; the textual embeddings are aggregated by graph neural networks afterwards. However, the above architecture is limited due to the independent modeling of textual features. In this work, we propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models. With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow, making each node's semantics accurately comprehended from a global perspective. In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated data and original data to reinforce its capability of integrating information on the graph. Extensive evaluations are conducted on three large-scale benchmark datasets, where GraphFormers outperform the SOTA baselines with comparable running efficiency.
    Learning to Rank Question Answer Pairs with Bilateral Contrastive Data Augmentation. (arXiv:2106.11096v2 [cs.CL] UPDATED)
    (2 min) In this work, we propose a novel and easy-to-apply data augmentation strategy, namely Bilateral Generation (BiG), with a contrastive training objective for improving the performance of ranking question answer pairs with existing labeled data. Specifically, we synthesize pseudo-positive QA pairs in contrast to the original negative QA pairs with two pre-trained generation models, one for question generation, the other for answer generation, which are fine-tuned on the limited positive QA pairs from the original dataset. With the augmented dataset, we design a contrastive training objective for learning to rank question answer pairs. Experimental results on three benchmark datasets show that our method significantly improves the performance of ranking models by making full use of existing labeled data and can be easily applied to different ranking models.
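    A hedged sketch of a contrastive ranking objective of the kind described above, in which each (pseudo-)positive QA pair competes against a set of negative pairs; the softmax form and temperature are assumptions rather than the paper's exact loss.
        import torch
        import torch.nn.functional as F

        def contrastive_pair_ranking_loss(pos_scores, neg_scores, temperature=0.1):
            """pos_scores: (batch,) scores of positive QA pairs.
            neg_scores: (batch, num_negatives) scores of negative QA pairs."""
            logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1) / temperature
            labels = torch.zeros(logits.size(0), dtype=torch.long)  # the positive sits at index 0
            return F.cross_entropy(logits, labels)

        # Toy usage: a batch of 3 questions with 4 negatives each.
        print(float(contrastive_pair_ranking_loss(torch.tensor([0.8, 0.2, 0.5]), torch.randn(3, 4))))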
    Library of Congress Subject Heading (LCSH) Browsing and Natural Language Searching. (arXiv:2109.15276v1 [cs.IR])
    (2 min) Controlled topical vocabularies (CVs) are built into information systems to aid browsing and retrieval of items that may be unfamiliar, but it is unclear how this feature should be integrated with standard keyword searching. Few systems or scholarly prototypes have attempted this, and none have used the most widely used CV, the Library of Congress Subject Headings (LCSH), which organizes monograph collections in academic libraries throughout the world. This paper describes a working prototype of a Web application that concurrently allows topic exploration using an outline tree view of the LCSH hierarchy and natural language keyword searching of a real-world Science and Engineering bibliographic collection. Pilot testing shows the system is functional, and work to fit the complex LCSH structure into a usable hierarchy is ongoing. This study contributes to knowledge of the practical design decisions required when developing linked interactions between topical hierarchy browsing and natural language searching, which promise to facilitate information discovery and exploration.
    Privacy Policy Question Answering Assistant: A Query-Guided Extractive Summarization Approach. (arXiv:2109.14638v1 [cs.CL])
    (2 min) Existing work on making privacy policies accessible has explored new presentation forms such as color-coding based on the risk factors or summarization to assist users with conscious agreement. To facilitate a more personalized interaction with the policies, in this work, we propose an automated privacy policy question answering assistant that extracts a summary in response to the input user query. This is a challenging task because users articulate their privacy-related questions in a very different language than the legal language of the policy, making it difficult for the system to understand their inquiry. Moreover, existing annotated data in this domain are limited. We address these problems by paraphrasing to bring the style and language of the user's question closer to the language of privacy policies. Our content scoring module uses the existing in-domain data to find relevant information in the policy and incorporates it in a summary. Our pipeline is able to find an answer for 89% of the user queries in the privacyQA dataset.
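    A simplified sketch in the spirit of the query-guided content scoring described above: score each policy sentence against the (possibly paraphrased) user query and keep the top-scoring ones as an extractive summary. The TF-IDF features and top-k rule are assumptions for illustration, not the authors' pipeline.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def extract_answer_sentences(user_query, policy_sentences, top_k=2):
            """Rank policy sentences by similarity to the query and return the top-k."""
            vec = TfidfVectorizer().fit(policy_sentences + [user_query])
            scores = cosine_similarity(vec.transform([user_query]), vec.transform(policy_sentences)).ravel()
            ranked = sorted(zip(scores, policy_sentences), reverse=True)
            return [sentence for _, sentence in ranked[:top_k]]

        policy = ["We collect your email address to create an account.",
                  "Data may be shared with advertising partners.",
                  "You can request deletion of your data at any time."]
        print(extract_answer_sentences("is my data shared with advertising partners", policy))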
    SCIMAT: Science and Mathematics Dataset. (arXiv:2109.15005v1 [math.NA])
    (2 min) In this work, we announce a comprehensive, well-curated, and open-source dataset with millions of samples for pre-college and college-level problems in mathematics and science. A preliminary set of results using a transformer architecture with character-to-character encoding is shown. The dataset identifies some challenging problems and invites research on better architecture search.
    Assessing Algorithmic Biases for Musical Version Identification. (arXiv:2109.15188v1 [cs.SD])
    (2 min) Version identification (VI) systems now offer accurate and scalable solutions for detecting different renditions of a musical composition, allowing the use of these systems in industrial applications and throughout the wider music ecosystem. Such use can have an important impact on various stakeholders regarding recognition and financial benefits, including how royalties are circulated for digital rights management. In this work, we take a step toward acknowledging this impact and consider VI systems as socio-technical systems rather than isolated technologies. We propose a framework for quantifying performance disparities across 5 systems and 6 relevant side attributes: gender, popularity, country, language, year, and prevalence. We also consider 3 main stakeholders for this particular information retrieval use case: the performing artists of query tracks, those of reference (original) tracks, and the composers. By categorizing the recordings in our dataset using such attributes and stakeholders, we analyze whether the considered VI systems show any implicit biases. We find signs of disparities in identification performance for most of the groups we include in our analyses. Moreover, we also find that learning- and rule-based systems behave differently for some attributes, which suggests an additional dimension to consider along with accuracy and scalability when evaluating VI systems. Lastly, we share our dataset with attribute annotations to encourage VI researchers to take these aspects into account while building new systems.
    Boost-RS: Boosted Embeddings for Recommender Systems and its Application to Enzyme-Substrate Interaction Prediction. (arXiv:2109.14766v1 [q-bio.QM])
    (2 min) Despite experimental and curation efforts, the extent of enzyme promiscuity on substrates continues to be largely unexplored and under-documented. Recommender systems (RS), which are currently unexplored for the enzyme-substrate interaction prediction problem, can be utilized to provide enzyme recommendations for substrates, and vice versa. The performance of Collaborative-Filtering (CF) recommender systems however hinges on the quality of embedding vectors of users and items (enzymes and substrates in our case). Importantly, enhancing CF embeddings with heterogeneous auxiliary data, especially relational data (e.g., hierarchical, pairwise, or groupings), remains a challenge. We propose an innovative general RS framework, termed Boost-RS, that enhances RS performance by "boosting" embedding vectors through auxiliary data. Specifically, Boost-RS is trained and dynamically tuned on multiple relevant auxiliary learning tasks. Boost-RS utilizes contrastive learning tasks to exploit relational data. To show the efficacy of Boost-RS for the enzyme-substrate interaction prediction problem, we apply the Boost-RS framework to several baseline CF models. We show that each of our auxiliary tasks boosts learning of the embedding vectors, and that contrastive learning using Boost-RS outperforms attribute concatenation and multi-label learning. We also show that Boost-RS outperforms similarity-based models. Ablation studies and visualization of learned representations highlight the importance of using contrastive learning on some of the auxiliary data in boosting the embedding vectors.
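    A minimal sketch of the auxiliary-contrastive idea above: a margin-based loss that pulls embeddings of items related by auxiliary data together and pushes unrelated ones apart, added to the main recommendation loss with a weight. The pairwise relational labels, margin, and weight are illustrative assumptions, not the Boost-RS formulation.
        import torch
        import torch.nn.functional as F

        def auxiliary_contrastive_loss(emb_a, emb_b, related, margin=1.0):
            """emb_a, emb_b: (batch, dim) embedding pairs; related: (batch,) 1 if the pair
            is linked by auxiliary relational data, else 0."""
            dist = F.pairwise_distance(emb_a, emb_b)
            pos = related * dist.pow(2)                         # pull related pairs together
            neg = (1 - related) * F.relu(margin - dist).pow(2)  # push unrelated pairs apart
            return (pos + neg).mean()

        emb_a, emb_b = torch.randn(8, 32, requires_grad=True), torch.randn(8, 32)
        related = torch.randint(0, 2, (8,)).float()
        rec_loss = torch.tensor(0.7)  # placeholder for the main collaborative-filtering loss
        total = rec_loss + 0.5 * auxiliary_contrastive_loss(emb_a, emb_b, related)
        print(float(total))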
    Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning. (arXiv:2012.01345v3 [cs.CV] UPDATED)
    (2 min) Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.
    USER: A Unified Information Search and Recommendation Model based on Integrated Behavior Sequence. (arXiv:2109.15012v1 [cs.IR])
    (2 min) Search and recommendation are the two most common approaches used by people to obtain information. They share the same goal -- satisfying the user's information need at the right time. Many Internet platforms and apps already provide both search and recommendation services, showing the demand and opportunity to handle both tasks simultaneously. However, most platforms consider these two tasks independently -- they tend to train separate search and recommendation models, without exploiting the relatedness and dependency between them. In this paper, we argue that jointly modeling these two tasks will benefit both of them and finally improve overall user satisfaction. We investigate the interactions between these two tasks in the specific information content service domain. We propose first integrating the user's behaviors in search and recommendation into a heterogeneous behavior sequence, then utilizing a joint model for handling both tasks based on the unified sequence. More specifically, we design the Unified Information Search and Recommendation model (USER), which mines user interests from the integrated sequence and accomplishes the two tasks in a unified way.
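    A toy sketch of the first step described above: tagging a user's search and recommendation behaviours with their source and merging them into one chronologically ordered sequence that a joint model could consume. The field names are illustrative only.
        # Hypothetical behaviour records; "ts" is a timestamp.
        search_log = [{"ts": 3, "query": "running shoes", "clicked": "doc_12"},
                      {"ts": 9, "query": "marathon plan", "clicked": "doc_40"}]
        rec_log = [{"ts": 1, "item": "item_7", "clicked": True},
                   {"ts": 6, "item": "item_2", "clicked": False}]

        def integrate(search_log, rec_log):
            """Tag each event with its source and sort by time to form one heterogeneous sequence."""
            events = ([{"type": "search", **e} for e in search_log] +
                      [{"type": "rec", **e} for e in rec_log])
            return sorted(events, key=lambda e: e["ts"])

        for event in integrate(search_log, rec_log):
            print(event)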
  • cs.LG updates on arXiv.org

    Process discovery on deviant traces and other stranger things. (arXiv:2109.14883v1 [cs.LG])
    (2 min) As the need to understand and formalise business processes into a model has grown over recent years, the process discovery research field has gained more and more importance, developing two different classes of approaches to model representation: procedural and declarative. Orthogonally to this classification, the vast majority of works envisage the discovery task as a one-class supervised learning process guided by the traces that are recorded into an input log. In this work instead, we focus on declarative processes and embrace the less-popular view of process discovery as a binary supervised learning task, where the input log reports both examples of the normal system execution, and traces representing "stranger" behaviours according to the domain semantics. We therefore examine in depth how the valuable information brought by these two sets can be extracted and formalised into a model that is "optimal" according to user-defined goals. Our approach, namely NegDis, is evaluated w.r.t. other relevant works in this field, and shows promising results regarding both the performance and the quality of the obtained solution.
    Dynamic Graph-Based Anomaly Detection in the Electrical Grid. (arXiv:2012.15006v3 [cs.LG] UPDATED)
    (0 min) Given sensor readings over time from a power grid, how can we accurately detect when an anomaly occurs? A key part of achieving this goal is to use the network of power grid sensors to quickly detect, in real time, when any unusual events, whether natural faults or malicious attacks, occur on the power grid. Existing bad-data detectors in the industry lack the sophistication to robustly detect broad types of anomalies, especially those due to emerging cyber-attacks, since they operate on a single measurement snapshot of the grid at a time. New ML methods are more widely applicable, but generally do not consider the impact of topology change on sensor measurements and thus cannot accommodate regular topology adjustments in historical data. Hence, we propose DYNWATCH, a domain-knowledge-based and topology-aware algorithm for anomaly detection using sensors placed on a dynamic grid. Our approach is accurate, outperforming existing approaches by 20% or more (F-measure) in experiments; and fast, running in less than 1.7 ms on average per time tick per sensor on a 60K+ branch case using a laptop computer, and scaling linearly in the size of the graph.
    Accelerating COVID-19 research with graph mining and transformer-based learning. (arXiv:2102.07631v2 [cs.IR] UPDATED)
    (0 min) In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," wherein artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present two automated general-purpose hypothesis generation systems, AGATHA-C and AGATHA-GP, for COVID-19 research. The systems are based on graph mining and the transformer model. The systems are massively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a domain-expert-curated study, we show that the systems are able to discover ongoing research findings, such as the relationship between COVID-19 and the oxytocin hormone.
    DiCE4EL: Interpreting Process Predictions using a Milestone-Aware Counterfactual Approach. (arXiv:2107.08697v2 [cs.LG] UPDATED)
    (0 min) Predictive process analytics often apply machine learning to predict the future states of a running business process. However, the internal mechanisms of many existing predictive algorithms are opaque and a human decision-maker is unable to understand why a certain activity was predicted. Recently, counterfactuals have been proposed in the literature to derive human-understandable explanations from predictive models. Current counterfactual approaches consist of finding the minimum feature change that can make a certain prediction flip its outcome. Although many algorithms have been proposed, their application to multi-dimensional sequence data like event logs has not been explored in the literature. In this paper, we explore the use of a recent, popular model-agnostic counterfactual algorithm, DiCE, in the context of predictive process analytics. The analysis reveals that DiCE is unable to derive explanations for process predictions, due to (1) process domain knowledge not being taken into account, (2) long traces of process execution that often tend to be less understandable, and (3) difficulties in optimising the counterfactual search with categorical variables. We design an extension of DiCE, namely DiCE4EL (DiCE for Event Logs), that can generate counterfactual explanations for process prediction, and propose an approach that supports deriving milestone-aware counterfactual explanations at key intermediate stages along process execution to promote interpretability. We apply our approach to a publicly available real-life event log and the analysis results demonstrate the effectiveness of the proposed approach.
    Federated Dropout -- A Simple Approach for Enabling Federated Learning on Resource Constrained Devices. (arXiv:2109.15258v1 [cs.LG])
    (0 min) Federated learning (FL) is a popular framework for training an AI model using distributed mobile data in a wireless network. It features data parallelism by distributing the learning task to multiple edge devices while attempting to preserve their local-data privacy. One main challenge confronting practical FL is that resource-constrained devices struggle with the computation-intensive task of updating a deep neural network model. To tackle the challenge, in this paper, a federated dropout (FedDrop) scheme is proposed building on the classic dropout scheme for random model pruning. Specifically, in each iteration of the FL algorithm, several subnets are independently generated from the global model at the server using dropout but with heterogeneous dropout rates (i.e., parameter-pruning probabilities), each of which is adapted to the state of an assigned channel. The subnets are downloaded to associated devices for updating. Thereby, FedDrop reduces both the communication overhead and devices' computation loads compared with the conventional FL while outperforming the latter in the case of overfitting, and also outperforming the FL scheme with uniform dropout (i.e., identical subnets).
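    An illustrative sketch of the subnet-generation step described above: the server draws a pruning mask per device with that device's own dropout rate, and the device updates only the surviving parameters. The per-layer masking and rate assignment are assumptions for illustration, not the paper's exact scheme.
        import numpy as np

        def make_subnet_mask(layer_shapes, dropout_rate, rng):
            """One binary mask per layer; a True entry keeps the parameter in the device's subnet."""
            return [rng.random(shape) >= dropout_rate for shape in layer_shapes]

        def masked_weights(global_weights, masks):
            """The pruned copy of the global model that a device downloads and updates."""
            return [w * m for w, m in zip(global_weights, masks)]

        rng = np.random.default_rng(0)
        global_weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 2))]
        shapes = [w.shape for w in global_weights]
        device_rates = {"device_0": 0.2, "device_1": 0.5}  # heterogeneous, e.g. channel-adapted
        for name, rate in device_rates.items():
            masks = make_subnet_mask(shapes, rate, rng)
            subnet = masked_weights(global_weights, masks)
            kept = sum(m.sum() for m in masks) / sum(m.size for m in masks)
            print(name, "keeps", round(float(kept), 2), "of the parameters")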
    Width-Based Planning and Active Learning for Atari. (arXiv:2109.15310v1 [cs.AI])
    (0 min) Width-based planning has shown promising results on Atari 2600 games using pixel input, while using substantially fewer environment interactions than reinforcement learning. Recent width-based approaches have computed feature vectors for each screen using a hand designed feature set or a variational autoencoder (VAE) trained on game screens, and prune screens that do not have novel features during the search. In this paper, we explore consideration of uncertainty in features generated by a VAE during width-based planning. Our primary contribution is the introduction of active learning to maximize the utility of screens observed during planning. Experimental results demonstrate that use of active learning strategies increases gameplay scores compared to alternative width-based approaches with equal numbers of environment interactions.
    Convolution-Free Waveform Transformers for Multi-Lead ECG Classification. (arXiv:2109.15129v1 [eess.SP])
    (0 min) We present our entry to the 2021 PhysioNet/CinC challenge - a waveform transformer model to detect cardiac abnormalities from ECG recordings. We compare the performance of the waveform transformer model on different ECG-lead subsets using approximately 88,000 ECG recordings from six datasets. In the official rankings, team prna ranked between 9 and 15 on 12, 6, 4, 3 and 2-lead sets respectively. Our waveform transformer model achieved an average challenge metric of 0.47 on the held-out test set across all ECG-lead subsets. Our combined performance across all leads placed us at rank 11 out of 39 officially ranking teams.
    Twins: Revisiting the Design of Spatial Attention in Vision Transformers. (arXiv:2104.13840v4 [cs.CV] UPDATED)
    (0 min) Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at https://github.com/Meituan-AutoML/Twins .
    Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. (arXiv:1910.12799v2 [stat.ML] UPDATED)
    (0 min) Deep learning has exhibited superior performance for various tasks, especially for high-dimensional datasets, such as images. To understand this property, we investigate the approximation and estimation ability of deep learning on anisotropic Besov spaces. The anisotropic Besov space is characterized by direction-dependent smoothness and includes several function classes that have been investigated thus far. We demonstrate that the approximation error and estimation error of deep learning only depend on the average value of the smoothness parameters in all directions. Consequently, the curse of dimensionality can be avoided if the smoothness of the target function is highly anisotropic. Unlike existing studies, our analysis does not require a low-dimensional structure of the input data. We also investigate the minimax optimality of deep learning and compare its performance with that of the kernel method (more generally, linear estimators). The results show that deep learning has better dependence on the input dimensionality if the target function possesses anisotropic smoothness, and it achieves an adaptive rate for functions with spatially inhomogeneous smoothness.
    Multi-index Antithetic Stochastic Gradient Algorithm. (arXiv:2006.06102v2 [stat.ML] UPDATED)
    (0 min) Stochastic Gradient Algorithms (SGAs) are ubiquitous in computational statistics, machine learning and optimisation. Recent years have brought an influx of interest in SGAs, and the non-asymptotic analysis of their bias is by now well-developed. However, relatively little is known about the optimal choice of the random approximation (e.g., mini-batching) of the gradient in SGAs, as this relies on the analysis of the variance and is problem-specific. While there have been numerous attempts to reduce the variance of SGAs, these typically exploit a particular structure of the sampled distribution by requiring a priori knowledge of its density's mode. It is thus unclear how to adapt such algorithms to non-log-concave settings. In this paper, we construct a Multi-index Antithetic Stochastic Gradient Algorithm (MASGA) whose implementation is independent of the structure of the target measure and which achieves performance on par with Monte Carlo estimators that have access to unbiased samples from the distribution of interest. In other words, MASGA is an optimal estimator from the mean-square-error versus computational-cost perspective within the class of Monte Carlo estimators. We prove this fact rigorously for log-concave settings and verify it numerically for some examples where the log-concavity assumption is not satisfied.
    Which Design Decisions in AI-enabled Mobile Applications Contribute to Greener AI?. (arXiv:2109.15284v1 [cs.LG])
    (0 min) Background: The construction, evolution and usage of complex artificial intelligence (AI) models demand expensive computational resources. While currently available high-performance computing environments support this complexity well, the deployment of AI models in mobile devices, which is an increasing trend, is challenging. Mobile applications consist of environments with low computational resources and hence imply limitations in the design decisions during the AI-enabled software engineering lifecycle that balance the trade-off between the accuracy and the complexity of the mobile applications. Objective: Our objective is to systematically assess the trade-off between accuracy and complexity when deploying complex AI models (e.g. neural networks) to mobile devices, which have an implicit resource limitation. We aim to cover (i) the impact of the design decisions on the achievement of high-accuracy and low resource-consumption implementations; and (ii) the validation of profiling tools for systematically promoting greener AI. Method: This confirmatory registered report consists of a plan to conduct an empirical study to quantify the implications of the design decisions on the performance of AI-enabled applications and to report experiences of the end-to-end AI-enabled software engineering lifecycle. Concretely, we will implement both image-based and language-based neural networks in mobile applications to solve multiple image classification and text classification problems on different benchmark datasets. Overall, we plan to model the accuracy and complexity of AI-enabled applications in operation with respect to their design decisions and will provide tools for allowing practitioners to become aware of the quantitative relationship between the design decisions and the green characteristics under study.
    Transfer Learning Based Multi-Objective Evolutionary Algorithm for Community Detection of Dynamic Complex Networks. (arXiv:2109.15136v1 [cs.SI])
    (0 min) Dynamic community detection has been a central and fundamental problem in complex network and artificial intelligence research in recent years. As the network structure changes, it is necessary not only to maximize the clustering accuracy but also to minimize the difference between two consecutive clustering results. There is a trade-off relationship between these two objectives. In this paper, we propose a Feature Transfer Based Multi-Objective Optimization Genetic Algorithm (TMOGA) based on transfer learning and a traditional multi-objective evolutionary algorithm framework. The main idea is to extract stable features from past community structures, retain valuable feature information, and integrate this feature information into current optimization processes to improve the evolutionary algorithms. Additionally, a new theoretical framework is proposed in this paper to analyze the community detection problem based on information theory. Then, we exploit this framework to prove the rationality of TMOGA. Finally, the experimental results show that our algorithm can achieve better clustering effects compared with the state-of-the-art dynamic network community detection algorithms in diverse test problems.
    Fine-tuning wav2vec2 for speaker recognition. (arXiv:2109.15053v1 [cs.SD])
    (0 min) This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant, w2v2-aam, achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.
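    A hedged sketch of the pooling-plus-classification setup described above: mean-pool frame-level wav2vec2 hidden states into a fixed-length speaker embedding and train with a cross-entropy speaker classifier (the AAM-softmax and pairwise BCE variants from the paper are omitted). The layer sizes are placeholders, and a random tensor stands in for real wav2vec2 outputs.
        import torch
        import torch.nn as nn

        class SpeakerHeadSketch(nn.Module):
            """Mean-pool frame features into a speaker embedding, then classify the speaker."""
            def __init__(self, feat_dim=768, emb_dim=256, num_speakers=1000):
                super().__init__()
                self.proj = nn.Linear(feat_dim, emb_dim)
                self.classifier = nn.Linear(emb_dim, num_speakers)

            def forward(self, frames):                     # frames: (batch, time, feat_dim)
                embedding = self.proj(frames.mean(dim=1))  # fixed-length speaker embedding
                return embedding, self.classifier(embedding)

        head = SpeakerHeadSketch()
        frames = torch.randn(4, 200, 768)                  # stand-in for wav2vec2 hidden states
        speakers = torch.randint(0, 1000, (4,))
        _, logits = head(frames)
        print(float(nn.functional.cross_entropy(logits, speakers)))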
    Unlocking Pixels for Reinforcement Learning via Implicit Attention. (arXiv:2102.04353v4 [cs.LG] UPDATED)
    (0 min) There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and the potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is an attention bottleneck, which provides a simple and effective framework for learning high-performing policies, even in the presence of distractions. However, due to the poor scalability of attention architectures, these methods cannot be applied beyond low-resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these techniques can be successfully adopted for the RL setting. This allows our attention-based controllers to scale to larger visual inputs and facilitates the use of smaller patches, even individual pixels, improving generalization. We show this on a range of tasks from the Distracting Control Suite to vision-based quadruped robot locomotion. We provide a rigorous theoretical analysis of the proposed algorithm.
    A Survey of Embodied AI: From Simulators to Research Tasks. (arXiv:2103.04918v5 [cs.AI] UPDATED)
    (0 min) There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI", where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey for the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators with our proposed seven features, this paper aims to understand the simulators in their provision for use in embodied AI research and their limitations. Lastly, this paper surveys the three main research tasks in embodied AI -- visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, with the new insights revealed through surveying the field, the paper will provide suggestions for simulator-for-task selections and recommendations for the future directions of the field.
    Variational Marginal Particle Filters. (arXiv:2109.15134v1 [stat.ML])
    (0 min) Variational inference for state space models (SSMs) is known to be hard in general. Recent works focus on deriving variational objectives for SSMs from unbiased sequential Monte Carlo estimators. We reveal that the marginal particle filter is obtained from sequential Monte Carlo by applying Rao-Blackwellization operations, which sacrifices the trajectory information for reduced variance and differentiability. We propose the variational marginal particle filter (VMPF), which is a differentiable and reparameterizable variational filtering objective for SSMs based on an unbiased estimator. We find that VMPF with biased gradients gives tighter bounds than previous objectives, and the unbiased reparameterization gradients are sometimes beneficial.
    Suppression of Independent and Correlated Noise with Similarity-based Unsupervised Deep Learning. (arXiv:2011.03384v5 [cs.LG] UPDATED)
    (0 min) Denoising is one of the most important data processing tasks and is generally a prerequisite for downstream image analysis in many fields. Despite their superior denoising performance, supervised deep denoising methods require paired noise-clean or noise-noise samples often unavailable in practice. On the other hand, unsupervised deep denoising methods such as Noise2Void and its variants predict masked pixels from their neighboring pixels in single noisy images. However, these unsupervised algorithms only work under the independent noise assumption, while real noise distributions are usually correlated with complex structural patterns. Here we propose a first-of-its-kind feature similarity-based unsupervised denoising approach that works in a nonlocal and nonlinear fashion to suppress not only independent but also correlated noise. Our approach is referred to as Noise2Sim since different noisy sub-images with similar signals are extracted to form as many training pairs as possible, so that the parameters of a deep denoising network can be optimized in a self-learning fashion. Theoretically, we establish that Noise2Sim is equivalent to supervised learning methods under mild conditions. Experimentally, Noise2Sim achieves excellent results on natural, microscopic, low-dose CT and photon-counting micro-CT images, removing both independent and correlated image noise and outperforming competitive denoising methods. Potentially, Noise2Sim would open a new direction of research and lead to the development of adaptive denoising tools in diverse applications.
    Mitigation of Adversarial Policy Imitation via Constrained Randomization of Policy (CRoP). (arXiv:2109.14678v1 [cs.LG])
    (0 min) Deep reinforcement learning (DRL) policies are vulnerable to unauthorized replication attacks, where an adversary exploits imitation learning to reproduce target policies from observed behavior. In this paper, we propose Constrained Randomization of Policy (CRoP) as a mitigation technique against such attacks. CRoP induces the execution of sub-optimal actions at random under performance loss constraints. We present a parametric analysis of CRoP, address the optimality of CRoP, and establish theoretical bounds on the adversarial budget and the expectation of loss. Furthermore, we report an experimental evaluation of CRoP in Atari environments under adversarial imitation, which demonstrates the efficacy and feasibility of our proposed method against policy replication attacks.
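    A small sketch of the constrained-randomization idea described above: with some probability the agent replaces the greedy action with a random action whose estimated value stays within a tolerated loss of the best action. The probability and tolerance below are placeholders, not the paper's tuned parameters.
        import numpy as np

        def crop_action(q_values, rand_prob=0.3, max_loss=0.5, rng=None):
            """Greedy most of the time; otherwise uniform over actions whose value gap
            to the best action is at most max_loss."""
            rng = rng or np.random.default_rng()
            best = int(np.argmax(q_values))
            if rng.random() >= rand_prob:
                return best
            admissible = np.flatnonzero(q_values >= q_values[best] - max_loss)
            return int(rng.choice(admissible))

        q = np.array([1.0, 0.8, 0.2, 0.95])
        rng = np.random.default_rng(0)
        print([crop_action(q, rng=rng) for _ in range(10)])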
    Real Robot Challenge using Deep Reinforcement Learning. (arXiv:2109.15233v1 [cs.RO])
    (0 min) This paper details our winning submission to Phase 1 of the 2021 Real Robot Challenge, a challenge in which a three fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach which requires minimal expert knowledge of the robotic system or of robotic grasping in general. A sparse goal-based reward is employed in conjunction with Hindsight Experience Replay to teach the control policy to move the cube to the desired x and y coordinates. Simultaneously, a dense distance-based reward is employed to teach the policy to lift the cube to the desired z coordinate. The policy is trained in simulation with domain randomization before being transferred to the real robot for evaluation. Although performance tends to worsen after this transfer, our best trained policy can successfully lift the real cube along goal trajectories via the use of an effective pinching grasp. Our approach outperforms all other submissions, including those leveraging more traditional robotic control techniques, and is the first learning-based approach to solve this challenge.
    A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms. (arXiv:2009.06125v2 [cs.LG] UPDATED)
    (0 min) The dynamic behavior of the RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations. Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes in the late phase. The sign gradient descent (signGD) flow, which is the limit of Adam when taking the learning rate to 0 while keeping the momentum parameters fixed, is used to explain the fast initial convergence. For the late phase of Adam, three different types of qualitative patterns are observed depending on the choice of the hyper-parameters: oscillations, spikes, and divergence. In particular, Adam converges much more smoothly and even faster when the values of the two momentum factors are close to each other. This observation is particularly important for scientific computing tasks, for which the training process usually proceeds into the high-precision regime.
    Reliable Estimation of KL Divergence using a Discriminator in Reproducing Kernel Hilbert Space. (arXiv:2109.14688v1 [cs.LG])
    (0 min) Estimating Kullback-Leibler (KL) divergence from samples of two distributions is essential in many machine learning problems. Variational methods using a neural network discriminator have been proposed to achieve this task in a scalable manner. However, we note that most of these methods suffer from high fluctuations (variance) in estimates and instability in training. In this paper, we look at this issue from a statistical learning theory and function space complexity perspective to understand why this happens and how to solve it. We argue that the cause of these pathologies is a lack of control over the complexity of the neural network discriminator function, and that they can be mitigated by controlling that complexity. To achieve this objective, we 1) present a novel construction of the discriminator in the Reproducing Kernel Hilbert Space (RKHS), 2) theoretically relate the error probability bound of the KL estimates to the complexity of the discriminator in the RKHS, 3) present a scalable way to control the complexity (RKHS norm) of the discriminator for a reliable estimation of KL divergence, and 4) prove the consistency of the proposed estimator. In three different applications of KL divergence: estimation of KL, estimation of mutual information, and Variational Bayes, we show that by controlling the complexity as developed in the theory, we are able to reduce the variance of KL estimates and stabilize the training.
    Overview of the CLEF-2019 CheckThat!: Automatic Identification and Verification of Claims. (arXiv:2109.15118v1 [cs.CL])
    (0 min) We present an overview of the second edition of the CheckThat! Lab at CLEF 2019. The lab featured two tasks in two different languages: English and Arabic. Task 1 (English) challenged the participating systems to predict which claims in a political debate or speech should be prioritized for fact-checking. Task 2 (Arabic) asked to (A) rank a given set of Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) classify these same Web pages according to their degree of usefulness for fact-checking the target claim, (C) identify useful passages from these pages, and (D) use the useful pages to predict the claim's factuality. CheckThat! provided a full evaluation framework, consisting of data in English (derived from fact-checking sources) and Arabic (gathered and annotated from scratch) and evaluation based on mean average precision (MAP) and normalized discounted cumulative gain (nDCG) for ranking, and F1 for classification. A total of 47 teams registered to participate in this lab, and fourteen of them actually submitted runs (compared to nine last year). The evaluation results show that the most successful approaches to Task 1 used various neural networks and logistic regression. As for Task 2, learning-to-rank was used by the highest scoring runs for subtask A, while different classifiers were used in the other subtasks. We release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important tasks of check-worthiness estimation and automatic claim verification.
    Bend-Net: Bending Loss Regularized Multitask Learning Network for Nuclei Segmentation in Histopathology Images. (arXiv:2109.15283v1 [eess.IV])
    (0 min) Separating overlapped nuclei is a major challenge in histopathology image analysis. Recently published approaches have achieved promising overall performance on nuclei segmentation; however, their performance on separating overlapped nuclei is quite limited. To address the issue, we propose a novel multitask learning network with a bending loss regularizer to separate overlapped nuclei accurately. The newly proposed multitask learning architecture enhances the generalization by learning shared representation from three tasks: instance segmentation, nuclei distance map prediction, and overlapped nuclei distance map prediction. The proposed bending loss defines high penalties to concave contour points with large curvatures, and applies small penalties to convex contour points with small curvatures. Minimizing the bending loss avoids generating contours that encompass multiple nuclei. In addition, two new quantitative metrics, Aggregated Jaccard Index of overlapped nuclei (AJIO) and Accuracy of overlapped nuclei (ACCO), are designed for the evaluation of overlapped nuclei segmentation. We validate the proposed approach on the CoNSeP and MoNuSegv1 datasets using seven quantitative metrics: Aggregate Jaccard Index, Dice, Segmentation Quality, Recognition Quality, Panoptic Quality, AJIO, and ACCO. Extensive experiments demonstrate that the proposed Bend-Net outperforms eight state-of-the-art approaches.
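    An illustrative sketch of a curvature-weighted contour penalty in the spirit of the bending loss above, using discrete turning angles and a heavier weight on concave points of a counter-clockwise contour; the exact penalty in the paper may differ.
        import numpy as np

        def bending_penalty(contour, concave_weight=5.0, convex_weight=1.0):
            """contour: (N, 2) ordered points of a closed, counter-clockwise contour."""
            v_in = contour - np.roll(contour, 1, axis=0)    # edge arriving at each point
            v_out = np.roll(contour, -1, axis=0) - contour  # edge leaving each point
            cross = v_in[:, 0] * v_out[:, 1] - v_in[:, 1] * v_out[:, 0]
            dot = (v_in * v_out).sum(axis=1)
            angles = np.abs(np.arctan2(cross, dot))         # turning-angle magnitude per point
            weights = np.where(cross < 0, concave_weight, convex_weight)  # concave turns cost more
            return float((weights * angles).sum())

        square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)               # convex contour
        dented = np.array([[0, 0], [1, 0], [0.5, 0.5], [1, 1], [0, 1]], float)   # one concave dent
        print(bending_penalty(square), bending_penalty(dented))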
    Hands-on Bayesian Neural Networks -- a Tutorial for Deep Learning Users. (arXiv:2007.06823v2 [cs.LG] UPDATED)
    (0 min) Modern deep learning methods constitute incredibly powerful tools to tackle a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use and evaluate Bayesian Neural Networks, i.e. Stochastic Artificial Neural Networks trained using Bayesian methods.
    SUper Team at SemEval-2016 Task 3: Building a feature-rich system for community question answering. (arXiv:2109.15120v1 [cs.CL])
    (0 min) We present the system we built for participating in SemEval-2016 Task 3 on Community Question Answering. We achieved the best results on subtask C, and strong results on subtasks A and B, by combining a rich set of various types of features: semantic, lexical, metadata, and user-related. The most important group turned out to be the metadata for the question and for the comment, semantic vectors trained on QatarLiving data and similarities between the question and the comment for subtasks A and C, and between the original and the related question for Subtask B.
    Coordinated Reinforcement Learning for Optimizing Mobile Networks. (arXiv:2109.15175v1 [cs.LG])
    (0 min) Mobile networks are composed of many base stations, and for each of them many parameters must be optimized to provide good service. Automatically and dynamically optimizing all these entities is challenging, as they are sensitive to variations in the environment and can affect each other through interference. Reinforcement learning (RL) algorithms are good candidates for automatically learning base station configuration strategies from incoming data, but they are often hard to scale to many agents. In this work, we demonstrate how to use coordination graphs and reinforcement learning in a complex application involving hundreds of cooperating agents. We show how mobile networks can be modeled using coordination graphs and how network optimization problems can be solved efficiently using multi-agent reinforcement learning. The graph structure arises naturally from expert knowledge about the network and allows coordinating behaviors between the antennas to be learned explicitly through edge value functions represented by neural networks. We show empirically that coordinated reinforcement learning outperforms other methods. The use of local RL updates and parameter sharing can handle a large number of agents without sacrificing coordination, which makes the approach well suited to optimize the ever denser networks brought by 5G and beyond.
    Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons. (arXiv:2107.02397v3 [cs.LG] UPDATED)
    (0 min) This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple and computable continuous activation function $\sigma$ leveraging a triangular-wave function and a softsign function. We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the space of continuous functions. Furthermore, classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$, when there exist pairwise disjoint closed bounded subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset.
    Rotation Equivariant Fourier Neural Operators for Learning Symmetry Preserving Transformations on Scalar Fields, Vector Fields, and Higher Order Tensor Fields. (arXiv:2108.09541v2 [cs.LG] UPDATED)
    (0 min) We introduce equivariant neural operators for learning resolution invariant, rotation equivariant and consequently symmetry preserving transformations between arbitrary sets of tensor fields including scalar fields, vector fields, and higher order fields. Our tensor field convolution layers emulate any linear operator by learning its impulse response or Green's function as the convolution kernel. Our tensor field attention layers emulate pairwise field coupling via local tensor products. Convolutions and associated adjoints can be in real or Fourier space allowing for linear scaling. By unifying concepts from Euclidean tensor field networks (E3NN), tensor basis networks (TBNN) and Fourier neural operators (FNO), we achieve good predictive performance on a wide range of PDEs and dynamical systems in engineering and quantum chemistry. Code is in Julia and available upon request from authors.
    Disease control as an optimization problem. (arXiv:2009.06576v4 [q-bio.PE] UPDATED)
    (0 min) In the context of epidemiology, policies for disease control are often devised through a mixture of intuition and brute-force, whereby the set of logically conceivable policies is narrowed down to a small family described by a few parameters, following which linearization or grid search is used to identify the optimal policy within the set. This scheme runs the risk of leaving out more complex (and perhaps counter-intuitive) policies for disease control that could tackle the disease more efficiently. In this article, we use techniques from convex optimization theory and machine learning to conduct optimizations over disease policies described by hundreds of parameters. In contrast to past approaches for policy optimization based on control theory, our framework can deal with arbitrary uncertainties on the initial conditions and model parameters controlling the spread of the disease, and stochastic models. In addition, our methods allow for optimization over policies which remain constant over weekly periods, specified by either continuous or discrete (e.g.: lockdown on/off) government measures. We illustrate our approach by minimizing the total time required to eradicate COVID-19 within the Susceptible-Exposed-Infected-Recovered (SEIR) model proposed by Kissler et al. (March, 2020).
    Does My Representation Capture X? Probe-Ably. (arXiv:2104.05807v2 [cs.LG] UPDATED)
    (0 min) Probing (or diagnostic classification) has become a popular strategy for investigating whether a given set of intermediate features is present in the representations of neural models. Probing studies may have misleading results, but various recent works have suggested more reliable methodologies that compensate for the possible pitfalls of probing. However, these best practices are numerous and fast-evolving. To simplify the process of running a set of probing experiments in line with suggested methodologies, we introduce Probe-Ably: an extendable probing framework which supports and automates the application of probing methods to the user's inputs.
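    For readers unfamiliar with probing, a bare-bones diagnostic classifier looks like the sketch below: fit a simple classifier on frozen representations and check whether the property of interest is decodable, ideally alongside a control with shuffled labels. The data here are synthetic stand-ins; Probe-Ably is meant to automate and standardise this kind of experiment in line with suggested best practices.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        # Synthetic stand-ins for 500 frozen model representations and a binary property label.
        rng = np.random.default_rng(0)
        reps = rng.normal(size=(500, 64))
        labels = (reps[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)  # property leaks into dim 0

        X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, test_size=0.3, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        control = LogisticRegression(max_iter=1000).fit(X_tr, rng.permutation(y_tr))  # shuffled-label control
        print("probe accuracy:", round(probe.score(X_te, y_te), 3))
        print("control accuracy:", round(control.score(X_te, y_te), 3))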
    An Offline Deep Reinforcement Learning for Maintenance Decision-Making. (arXiv:2109.15050v1 [cs.LG])
    (0 min) Several machine learning and deep learning frameworks have been proposed to solve remaining useful life estimation and failure prediction problems in recent years. Having access to the remaining useful life estimation or likelihood of failure in near future helps operators to assess the operating conditions and, therefore, provides better opportunities for sound repair and maintenance decisions. However, many operators believe remaining useful life estimation and failure prediction solutions are incomplete answers to the maintenance challenge. They argue that knowing the likelihood of failure in the future is not enough to make maintenance decisions that minimize costs and keep the operators safe. In this paper, we present a maintenance framework based on offline supervised deep reinforcement learning that instead of providing information such as likelihood of failure, suggests actions such as "continuation of the operation" or "the visitation of the repair shop" to the operators in order to maximize the overall profit. Using offline reinforcement learning makes it possible to learn the optimum maintenance policy from historical data without relying on expensive simulators. We demonstrate the application of our solution in a case study using the NASA C-MAPSS dataset.
    Dynamical symmetry breaking through AI: The dimer self-trapping transition. (arXiv:2109.15057v1 [physics.comp-ph])
    (0 min) The nonlinear dimer obtained through the nonlinear Schrödinger equation has been a workhorse for discovering the role nonlinearity plays in strongly interacting systems. While the analysis of the stationary states demonstrates the onset of a symmetry-broken state for some degree of nonlinearity, the full dynamics maps the system into an effective $\phi^4$ model. In this latter context, the self-trapping transition is an initial-condition-dependent transfer of a classical particle over a barrier set by the nonlinear term. This transition has been investigated analytically, and mathematically it is expressed through the hyperbolic limit of Jacobian elliptic functions. The aim of the present work is to recapture this transition through the use of methods of Artificial Intelligence (AI). Specifically, we used a physics-motivated machine learning model that is shown to be able to capture the original dynamic self-trapping transition and its dependence on initial conditions. Exploitation of this result in the case of the non-degenerate nonlinear dimer gives additional information on the more general dynamics and helps delineate linear from nonlinear localization. This work shows how AI methods may be embedded in physics and provide useful tools for discovery.
    The Role of Bio-Inspired Modularity in General Learning. (arXiv:2109.15097v1 [cs.LG])
    (0 min) One goal of general intelligence is to learn novel information without overwriting prior learning. The utility of learning without catastrophic forgetting (CF) is twofold: first, the system can return to previously learned tasks after learning something new. In addition, bootstrapping previous knowledge may allow for faster learning of a novel task. Previous approaches to CF and bootstrapping are primarily based on modifying learning in the form of changing weights to tune the model to the current task, overwriting previously tuned weights from previous tasks. However, another critical factor that has been largely overlooked is the initial network topology, or architecture. Here, we argue that the topology of biological brains likely evolved certain features that are designed to achieve this kind of informational conservation. In particular, we consider that the highly conserved property of modularity may offer a solution for weight-update learning methods that adheres to the learning-without-catastrophic-forgetting and bootstrapping constraints. Final considerations are then made on how to combine these two learning objectives in a dynamical, general learning system.
    Vision-Aided Beam Tracking: Explore the Proper Use of Camera Images with Deep Learning. (arXiv:2109.14686v1 [cs.LG])
    (0 min) We investigate the problem of wireless beam tracking on mmWave bands with the assistance of camera images. In particular, based on the beam indices the user has used and the camera images taken along the trajectory, we predict the optimal beam indices for the next few time spots. To resolve this problem, we first reformulate the "ViWi" dataset in [1] to get rid of the image repetition problem. Then we develop a deep learning approach and investigate various model components to achieve the best performance. Finally, we explore whether, when, and how to use the image for better beam prediction. To answer this question, we split the dataset into three clusters -- (LOS, light NLOS, serious NLOS)-like -- based on the standard deviation of the beam sequence. With experiments we demonstrate that using the image indeed helps beam tracking, especially when the user is in serious NLOS, and that the solution relies on a carefully designed dataset for training the model. Generally speaking, including NLOS-like data for training a model does not benefit beam tracking of the user in LOS, but including light NLOS-like data for training a model benefits beam tracking of the user in serious NLOS.
    Tiny-CRNN: Streaming Wakeword Detection In A Low Footprint Setting. (arXiv:2109.14725v1 [cs.LG])
    (0 min) In this work, we propose Tiny-CRNN (Tiny Convolutional Recurrent Neural Network) models applied to the problem of wakeword detection, and augment them with scaled dot product attention. We find that, compared to Convolutional Neural Network models, False Accepts in a 250k parameter budget can be reduced by 25% with a 10% reduction in parameter size by using models based on the Tiny-CRNN architecture, and we can get up to 32% reduction in False Accepts at a 50k parameter budget with 75% reduction in parameter size compared to word-level Dense Neural Network models. We discuss solutions to the challenging problem of performing inference on streaming audio with this architecture, as well as differences in start-end index errors and latency in comparison to CNN, DNN, and DNN-HMM models.
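    A rough sketch of a small convolutional-recurrent wakeword model with scaled dot-product attention pooling over the recurrent outputs; the layer sizes are arbitrary and this is not the Tiny-CRNN configuration from the paper.
        import torch
        import torch.nn as nn

        class TinyCRNNSketch(nn.Module):
            """Conv front-end over log-mel frames -> GRU -> attention pooling -> wakeword logit."""
            def __init__(self, n_mels=40, conv_ch=16, hidden=64):
                super().__init__()
                self.conv = nn.Sequential(nn.Conv2d(1, conv_ch, kernel_size=3, stride=2, padding=1), nn.ReLU())
                self.gru = nn.GRU(conv_ch * (n_mels // 2), hidden, batch_first=True)
                self.query = nn.Parameter(torch.randn(hidden))
                self.out = nn.Linear(hidden, 1)

            def forward(self, mels):                   # mels: (batch, time, n_mels)
                x = self.conv(mels.unsqueeze(1))       # (batch, ch, time/2, n_mels/2)
                b, c, t, f = x.shape
                h, _ = self.gru(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
                attn = torch.softmax(h @ self.query / h.size(-1) ** 0.5, dim=1)  # scaled dot-product
                pooled = (attn.unsqueeze(-1) * h).sum(dim=1)
                return self.out(pooled).squeeze(-1)    # one wakeword score per utterance

        model = TinyCRNNSketch()
        print(model(torch.randn(2, 100, 40)).shape)    # torch.Size([2])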
    Feature Selection on a Flare Forecasting Testbed: A Comparative Study of 24 Methods. (arXiv:2109.14770v1 [astro-ph.SR])
    (0 min) The Space-Weather ANalytics for Solar Flares (SWAN-SF) is a multivariate time series benchmark dataset recently created to serve the heliophysics community as a testbed for solar flare forecasting models. SWAN-SF contains 54 unique features, with 24 quantitative features computed from the photospheric magnetic field maps of active regions, describing their precedent flare activity. In this study, for the first time, we systematically attacked the problem of quantifying the relevance of these features to the ambitious task of flare forecasting. We implemented an end-to-end pipeline for the preprocessing, feature selection, and evaluation phases. We incorporated 24 Feature Subset Selection (FSS) algorithms, including multivariate and univariate, supervised and unsupervised, wrappers and filters. We methodically compared the results of different FSS algorithms, both on the multivariate time series and vectorized formats, and tested their correlation and reliability, to the extent possible, by using the selected features for flare forecasting on unseen data, in univariate and multivariate fashions. We concluded our investigation with a report of the best FSS methods in terms of their top-k features, and an analysis of the findings. We hope that the reproducibility of our study and the availability of the data will allow future attempts to be compared with our findings and with each other.
    LIFE: Learning Individual Features for Multivariate Time Series Prediction with Missing Values. (arXiv:2109.14844v1 [cs.LG])
    (0 min) Multivariate time series (MTS) prediction is ubiquitous in many real-world fields, but MTS data often contains missing values. In recent years, there has been an increasing interest in using end-to-end models to handle MTS with missing values. To generate features for prediction, existing methods either merge all input dimensions of MTS or tackle each input dimension independently. However, both approaches struggle to perform well, because the former usually produces many unreliable features and the latter lacks correlated information. In this paper, we propose a Learning Individual Features (LIFE) framework, which provides a new paradigm for MTS prediction with missing values. LIFE generates reliable features for prediction by using the correlated dimensions as auxiliary information and suppressing the interference from uncorrelated dimensions with missing values. Experiments on three real-world data sets verify the superiority of LIFE over existing state-of-the-art models.
    A Friend Recommendation System using Semantic Based KNN Algorithm. (arXiv:2109.14970v1 [cs.DB])
    (0 min) Social networking has become a major part of our lives, and we depend on it for day-to-day purposes. It is a medium used by people all around the world, even in the smallest of towns. Its main purpose is to promote and aid communication between people. Social networks such as Facebook and Twitter were created for the sole purpose of helping individuals communicate with each other about anything. These networks have become an important and contemporary way to make friends from any part of the world, and these new friends can communicate through any form of social media. Recommendation systems exist in all of these social networks to help users find new friends, connect with more people, and form associations and alliances.
    Optimal Learning for Structured Bandits. (arXiv:2007.07302v2 [cs.LG] UPDATED)
    (0 min) We study structured multi-armed bandits, which is the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker is aware of certain structural information regarding the reward distributions and would like to minimize their regret by exploiting this information, where the regret is its performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thompson sampling algorithms are well known to suffer only minimal regret. As recently pointed out, however, neither algorithm is capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm, called DUSA, whose worst-case regret matches the information-theoretic regret lower bound up to a constant factor and that can handle a wide range of structural information. Our algorithm DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. Our proposed algorithm is the first computationally viable learning policy for structured bandit problems with asymptotically minimal regret.
    Predicting Code Review Completion Time in Modern Code Review. (arXiv:2109.15141v1 [cs.SE])
    (0 min) Context. Modern Code Review (MCR) is being adopted in both open source and commercial projects as a common practice. MCR is a widely acknowledged quality assurance practice that allows early detection of defects as well as poor coding practices. It also brings several other benefits such as knowledge sharing, team awareness, and collaboration. Problem. In practice, code reviews can experience significant completion delays due to various socio-technical factors, which can affect project quality and cost. For a successful review process, peer reviewers should perform their review tasks in a timely manner while providing relevant feedback about the code change being reviewed. However, there is a lack of tool support to help developers estimate the time required to complete a code review prior to accepting or declining a review request. Objective. Our objective is to build and validate an effective approach to predict code review completion time in the context of MCR and help developers better manage and prioritize their code review tasks. Method. We formulate the prediction of code review completion time as a learning problem. In particular, we propose a framework based on regression models to (i) effectively estimate code review completion time, and (ii) understand the main factors influencing it.
    Mitigating Black-Box Adversarial Attacks via Output Noise Perturbation. (arXiv:2109.15160v1 [cs.CR])
    (0 min) In black-box adversarial attacks, adversaries query the deep neural network (DNN), use the output to reconstruct gradients, and then optimize the adversarial inputs iteratively. In this paper, we study the method of adding white noise to the DNN output to mitigate such attacks, with a unique focus on the trade-off analysis of noise level and query cost. The attacker's query count (QC) is derived mathematically as a function of noise standard deviation. With this result, the defender can conveniently find the noise level needed to mitigate attacks for the desired security level specified by QC and limited DNN performance loss. Our analysis shows that the added noise is drastically magnified by the small variation of DNN outputs, which makes the reconstructed gradient have an extremely low signal-to-noise ratio (SNR). Adding slight white noise with a standard deviation less than 0.01 is enough to increase QC by many orders of magnitude without introducing any noticeable classification accuracy reduction. Our experiments demonstrate that this method can effectively mitigate both soft-label and hard-label black-box attacks under realistic QC constraints. We also show that this method outperforms many other defense methods and is robust to the attacker's countermeasures.
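    A minimal sketch of the defense idea, assuming the defender controls the probability vector returned to the querier: add small Gaussian noise to the model's output and renormalize before releasing it. The noise level and the renormalization step are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of output-noise perturbation against black-box gradient
# reconstruction: perturb the released probabilities with white noise.
import numpy as np

def noisy_predict(model_probs, sigma=0.01, rng=None):
    """Perturb output probabilities with white noise of std `sigma`, then renormalize."""
    rng = rng or np.random.default_rng()
    noisy = model_probs + rng.normal(0.0, sigma, size=model_probs.shape)
    noisy = np.clip(noisy, 1e-12, None)               # keep entries positive
    return noisy / noisy.sum(axis=-1, keepdims=True)  # return a valid distribution

probs = np.array([[0.70, 0.20, 0.10]])
print(noisy_predict(probs))   # the hard label (argmax) is essentially never changed
```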
    Stock Index Prediction using Cointegration test and Quantile Loss. (arXiv:2109.15045v1 [q-fin.ST])
    (0 min) Stock prediction using deep learning methods has recently been studied actively. The task is to predict the future movement of stock prices based on historical trends. The approach of predicting the movement based solely on the pattern of its historical movement on charts, rather than on fundamental values, is called Technical Analysis, which can be divided into univariate and multivariate methods in the regression task. In the latter approach, it is important to select informative factors as inputs to enhance the performance of the model. Moreover, performance can depend on which loss is used to train the model. However, most studies tend to focus on building model structures, not on how to select informative factors as inputs for training. In this paper, we propose a method that achieves better performance in terms of returns by selecting informative factors with the cointegration test and training the model with a quantile loss. We compare two RNN variants trained with quantile loss using only the five factors obtained through the cointegration test out of the 15 stock index factors collected in the experiment. Cumulative return and the Sharpe ratio were used to evaluate the performance of the trained models. Our experimental results show that the proposed method outperforms the other conventional approaches.
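    The sketch below illustrates the two ingredients named above with synthetic data: screening candidate factors via an Engle-Granger cointegration test (statsmodels' coint) and computing a pinball (quantile) loss. It is an assumption-laden illustration, not the paper's pipeline; thresholds and series names are made up.

```python
# Minimal sketch: (1) keep factors cointegrated with the target,
# (2) evaluate predictions with a pinball (quantile) loss.
import numpy as np
from statsmodels.tsa.stattools import coint

def select_cointegrated(target, factors, alpha=0.05):
    """Keep factors whose cointegration-test p-value with the target is below alpha."""
    return [name for name, series in factors.items()
            if coint(target, series)[1] < alpha]

def pinball_loss(y_true, y_pred, q=0.5):
    """Quantile (pinball) loss for quantile level q."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Toy usage with synthetic series.
rng = np.random.default_rng(0)
index = np.cumsum(rng.normal(size=500))
factors = {"coint_factor": index + rng.normal(scale=0.5, size=500),
           "noise_factor": np.cumsum(rng.normal(size=500))}
print(select_cointegrated(index, factors))
print(pinball_loss(np.array([1.0, 2.0]), np.array([0.8, 2.5]), q=0.9))
```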
    LayoutTransformer: Layout Generation and Completion with Self-attention. (arXiv:2006.14615v2 [cs.CV] UPDATED)
    (0 min) We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents, and 3D objects. Most complex scenes, natural or human-designed, can be expressed as a meaningful arrangement of simpler compositional graphical primitives. Generating a new layout or extending an existing layout requires understanding the relationships between these primitives. To do this, we propose LayoutTransformer, a novel framework that leverages self-attention to learn contextual relationships between layout elements and generate novel layouts in a given domain. Our framework allows us to generate a new layout either from an empty set or from an initial seed set of primitives, and can easily scale to support an arbitrary number of primitives per layout. Furthermore, our analyses show that the model is able to automatically capture the semantic properties of the primitives. We propose simple improvements in both the representation of layout primitives and the training methods to demonstrate competitive performance in very diverse data domains such as object bounding boxes in natural images (COCO bounding boxes), documents (PubLayNet), mobile applications (the RICO dataset), as well as 3D shapes (PartNet). Code and other materials will be made available at https://kampta.github.io/layout.
    Posttraumatic Stress Disorder Hyperarousal Event Detection Using Smartwatch Physiological and Activity Data. (arXiv:2109.14743v1 [cs.LG])
    (0 min) Posttraumatic Stress Disorder (PTSD) is a psychiatric condition affecting nearly a quarter of the United States war veterans who return from war zones. Treatment for PTSD typically consists of a combination of in-session therapy and medication. However, patients often experience their most severe PTSD symptoms outside of therapy sessions. Mobile health applications may address this gap, but their effectiveness is limited by the current lack of continuous monitoring and detection capabilities that would enable timely intervention. The goal of this article is to develop a novel method to detect hyperarousal events using physiological and activity-based machine learning algorithms. Physiological data, including heart rate and body acceleration, as well as self-reported hyperarousal events, were collected over several days from 99 United States veterans diagnosed with PTSD using a tool developed for commercial off-the-shelf wearable devices. The data were used to develop four machine learning algorithms: Random Forest, Support Vector Machine, Logistic Regression, and XGBoost. The XGBoost model had the best performance in detecting onset of PTSD symptoms, with over 83% accuracy and an AUC of 0.70. Post-hoc SHapley Additive exPlanations (SHAP) analysis showed that algorithm predictions were correlated with average heart rate, minimum heart rate, and average body acceleration. The findings show promise in detecting onset of PTSD symptoms, which could be the basis for developing remote and continuous monitoring systems for PTSD. Such systems may address a vital gap in just-in-time interventions for PTSD self-management outside of scheduled clinical appointments.
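    A minimal sketch of the kind of pipeline described, using synthetic data and illustrative feature names (average heart rate, minimum heart rate, average acceleration): a gradient-boosted classifier plus SHAP attributions. This is an illustration under those assumptions, not the study's code or data.

```python
# Minimal sketch: gradient-boosted event detector + SHAP feature attributions.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                        # [avg_hr, min_hr, avg_accel]
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)               # per-feature contributions
print(np.abs(shap_values).mean(axis=0))              # global feature importance
```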
    An End-to-End Approach for Training Neural Network Binary Classifiers on Metrics Based on the Confusion Matrix. (arXiv:2009.01367v2 [cs.LG] UPDATED)
    (0 min) While neural network binary classifiers are often evaluated on metrics such as Accuracy and $F_1$-Score, they are commonly trained with a cross-entropy objective. How can this training-testing gap be addressed? While specific techniques have been adopted to optimize certain confusion matrix based metrics, it is challenging or impossible in some cases to generalize the techniques to other metrics. Adversarial learning approaches have also been proposed to optimize networks via confusion matrix based metrics, but they tend to be much slower than common training methods. In this work, we propose to approximate the Heaviside step function, typically used to compute confusion matrix based metrics, to render these metrics amenable to gradient descent. Our extensive experiments show the effectiveness of our end-to-end approach for binary classification in several domains.
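    The core idea can be illustrated with a small sketch: replace the hard thresholding step (a Heaviside function) with a steep sigmoid so that confusion-matrix counts, and hence a metric such as F1, become differentiable and can be minimized by gradient descent. The exact surrogate used in the paper may differ.

```python
# Minimal sketch of training on a confusion-matrix metric by smoothing the
# Heaviside step with a steep sigmoid (the paper's approximation may differ).
import torch

def soft_f1_loss(logits, targets, tau=10.0):
    """1 - soft-F1, where hard thresholding is replaced by sigmoid(tau * logit)."""
    y_hat = torch.sigmoid(tau * logits)          # smooth stand-in for 1[logit > 0]
    tp = (y_hat * targets).sum()
    fp = (y_hat * (1 - targets)).sum()
    fn = ((1 - y_hat) * targets).sum()
    soft_f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
    return 1.0 - soft_f1                         # differentiable, minimizable by SGD

logits = torch.randn(32, requires_grad=True)
targets = (torch.rand(32) > 0.5).float()
loss = soft_f1_loss(logits, targets)
loss.backward()                                  # gradients flow through the surrogate
```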
    Convolutional Neural Network Compression through Generalized Kronecker Product Decomposition. (arXiv:2109.14710v1 [cs.CV])
    (0 min) Modern Convolutional Neural Network (CNN) architectures, despite their superiority in solving various problems, are generally too large to be deployed on resource constrained edge devices. In this paper, we reduce memory usage and floating-point operations required by convolutional layers in CNNs. We compress these layers by generalizing the Kronecker Product Decomposition to apply to multidimensional tensors, leading to the Generalized Kronecker Product Decomposition (GKPD). Our approach yields a plug-and-play module that can be used as a drop-in replacement for any convolutional layer. Experimental results for image classification on CIFAR-10 and ImageNet datasets using ResNet, MobileNetV2 and SENet architectures substantiate the effectiveness of our proposed approach. We find that GKPD outperforms state-of-the-art decomposition methods including Tensor-Train and Tensor-Ring as well as other relevant compression methods such as pruning and knowledge distillation.
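    For intuition on Kronecker-based compression of a plain 2-D weight matrix, the sketch below uses Van Loan's classic nearest-Kronecker-product construction via a rank-1 SVD of a rearranged matrix; GKPD generalizes this idea to multidimensional convolution tensors, which is not shown here.

```python
# Minimal sketch of the nearest-Kronecker-product idea behind Kronecker
# compression: approximate W (m1*m2 x n1*n2) by A (m1 x n1) kron B (m2 x n2)
# via a rank-1 SVD of a rearranged W (Van Loan's construction).
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2):
    # Rearrange W so that each row is the (row-major) flattening of one block.
    R = np.empty((m1 * n1, m2 * n2))
    for i in range(m1):
        for j in range(n1):
            R[i * n1 + j] = W[i * m2:(i + 1) * m2, j * n2:(j + 1) * n2].ravel()
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = (np.sqrt(S[0]) * U[:, 0]).reshape(m1, n1)
    B = (np.sqrt(S[0]) * Vt[0]).reshape(m2, n2)
    return A, B

A0, B0 = np.random.randn(4, 3), np.random.randn(5, 2)
W = np.kron(A0, B0)                       # exactly Kronecker-structured test matrix
A, B = nearest_kronecker(W, 4, 3, 5, 2)
print(np.allclose(np.kron(A, B), W))      # True: the factorization is recovered
```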
    Biologically Plausible Training Mechanisms for Self-Supervised Learning in Deep Networks. (arXiv:2109.15089v1 [cs.NE])
    (0 min) We develop biologically plausible training mechanisms for self-supervised learning (SSL) in deep networks. SSL, with a contrastive loss, is more natural as it does not require labelled data and its robustness to perturbations yields more adaptable embeddings. Moreover the perturbation of data required to create positive pairs for SSL is easily produced in a natural environment by observing objects in motion and with variable lighting over time. We propose a contrastive hinge based loss whose error involves simple local computations as opposed to the standard contrastive losses employed in the literature, which do not lend themselves easily to implementation in a network architecture due to complex computations involving ratios and inner products. Furthermore we show that learning can be performed with one of two more plausible alternatives to backpropagation. The first is difference target propagation (DTP), which trains network parameters using target-based local losses and employs a Hebbian learning rule, thus overcoming the biologically implausible symmetric weight problem in backpropagation. The second is simply layer-wise learning, where each layer is directly connected to a layer computing the loss error. The layers are either updated sequentially in a greedy fashion (GLL) or in random order (RLL), and each training stage involves a single hidden layer network. The one step backpropagation needed for each such network can either be altered with fixed random feedback weights as proposed in Lillicrap et al. (2016), or using updated random feedback as in Amit (2019). Both methods represent alternatives to the symmetric weight issue of backpropagation. By training convolutional neural networks (CNNs) with SSL and DTP, GLL or RLL, we find that our proposed framework achieves comparable performance to its implausible counterparts in both linear evaluation and transfer learning tasks.
    Robust Allocations with Diversity Constraints. (arXiv:2109.15015v1 [cs.GT])
    (0 min) We consider the problem of allocating divisible items among multiple agents, and consider the setting where any agent is allowed to introduce diversity constraints on the items they are allocated. We motivate this via settings where the items themselves correspond to user ad slots or task workers with attributes such as race and gender on which the principal seeks to achieve demographic parity. We consider the following question: When an agent expresses diversity constraints into an allocation rule, is the allocation of other agents hurt significantly? If this happens, the cost of introducing such constraints is disproportionately borne by agents who do not benefit from diversity. We codify this via two desiderata capturing robustness. These are no negative externality -- other agents are not hurt -- and monotonicity -- the agent enforcing the constraint does not see a large increase in value. We show in a formal sense that the Nash Welfare rule that maximizes product of agent values is uniquely positioned to be robust when diversity constraints are introduced, while almost all other natural allocation rules fail this criterion. We also show that the guarantees achieved by Nash Welfare are nearly optimal within a widely studied class of allocation rules. We finally perform an empirical simulation on real-world data that models ad allocations to show that this gap between Nash Welfare and other rules persists in the wild.
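    A minimal sketch of the Nash Welfare rule for divisible items, assuming linear valuations: maximize the sum of log utilities (equivalently, the product of values) under unit supply of each item. The cvxpy formulation and the valuations are illustrative; diversity constraints would enter as additional linear constraints on the allocation variable.

```python
# Minimal sketch of a Nash Welfare allocation of divisible items.
import cvxpy as cp
import numpy as np

V = np.array([[3.0, 1.0, 2.0],          # valuations: agents x items
              [1.0, 4.0, 1.0]])
n_agents, n_items = V.shape

X = cp.Variable((n_agents, n_items), nonneg=True)    # fractional allocation
utilities = cp.sum(cp.multiply(V, X), axis=1)         # linear agent values
problem = cp.Problem(cp.Maximize(cp.sum(cp.log(utilities))),  # log Nash welfare
                     [cp.sum(X, axis=0) <= 1])                # unit supply per item
problem.solve()
print(np.round(X.value, 3))
```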
    Greedy Shallow Networks: An Approach for Constructing and Training Neural Networks. (arXiv:1905.10409v3 [cs.LG] UPDATED)
    (0 min) We present a greedy-based approach to construct an efficient single hidden layer neural network with the ReLU activation that approximates a target function. In our approach we obtain a shallow network by utilizing a greedy algorithm with the prescribed dictionary provided by the available training data and a set of possible inner weights. To facilitate the greedy selection process we employ an integral representation of the network, based on the ridgelet transform, that significantly reduces the cardinality of the dictionary and hence promotes feasibility of the greedy selection. Our approach allows for the construction of efficient architectures which can be treated either as improved initializations to be used in place of random-based alternatives, or as fully-trained networks in certain cases, thus potentially nullifying the need for backpropagation training. Numerical experiments demonstrate the tenability of the proposed concept and its advantages compared to the conventional techniques for selecting architectures and initializations for neural networks.
    Impact of Channel Variation on One-Class Learning for Spoof Detection. (arXiv:2109.14900v1 [cs.LG])
    (0 min) Spoofing detection is of unparalleled value in increasing the reliability of automatic speaker verification (ASV) systems. In practice, however, the performance of countermeasure systems (CMs) degrades significantly due to channel variation. Multi-conditional training (MCT) is a well-established technique to handle such scenarios. However, which data-feeding strategy is optimal for MCT is not known in the case of spoof detection. In this paper, various codec simulations were used to modify the ASVspoof 2019 dataset, and assessments were done using different data-feeding and mini-batching strategies to help address this question. Our experiments test the efficacy of various margin-based losses for training ResNet-based models with an LFCC front-end feature extractor to correctly classify spoofed and bonafide samples degraded by codec simulations. In contrast to most works, which focus mainly on architectures, this study highlights the relevance of the often-overlooked processes of data feeding and mini-batching, and raises awareness of the need to refine them for better performance.
    Bitcoin Transaction Strategy Construction Based on Deep Reinforcement Learning. (arXiv:2109.14789v1 [cs.LG])
    (0 min) The emerging cryptocurrency market has lately received great attention for asset allocation due to its unique decentralization. However, its volatility and brand new trading mode have made it challenging to devise an acceptable automatically-generated strategy. This study proposes a framework for automatic high-frequency bitcoin transactions based on a deep reinforcement learning algorithm, proximal policy optimization (PPO). The framework regards the transaction process as actions, returns as rewards, and prices as states, to align with the idea of reinforcement learning. It compares advanced machine learning-based models for static price prediction, including support vector machine (SVM), multi-layer perceptron (MLP), long short-term memory (LSTM), temporal convolutional network (TCN), and Transformer, by applying them to real-time bitcoin prices, and the experimental results demonstrate that LSTM performs best. An automatically-generated transaction strategy is then constructed on top of PPO, with LSTM as the basis of the policy. Extensive empirical studies validate that the proposed method performs superiorly to various common trading strategy benchmarks for a single financial product. The approach is able to trade bitcoins in a simulated environment with synchronous data and obtains 31.67% more return than the best benchmark, improving on it by 12.75%. The proposed framework can earn excess returns through both periods of volatility and of surge, which opens the door to research on building single-cryptocurrency trading strategies based on deep learning. Visualizations of the trading process show how the model handles high-frequency transactions, providing inspiration and demonstrating that it can be extended to other financial products.
    Back in Black: A Comparative Evaluation of Recent State-Of-The-Art Black-Box Attacks. (arXiv:2109.15031v1 [cs.CR])
    (0 min) The field of adversarial machine learning has experienced a near exponential growth in the amount of papers being produced since 2018. This massive information output has yet to be properly processed and categorized. In this paper, we seek to help alleviate this problem by systematizing the recent advances in adversarial machine learning black-box attacks since 2019. Our survey summarizes and categorizes 20 recent black-box attacks. We also present a new analysis for understanding the attack success rate with respect to the adversarial model used in each paper. Overall, our paper surveys a wide body of literature to highlight recent attack developments and organizes them into four attack categories: score based attacks, decision based attacks, transfer attacks and non-traditional attacks. Further, we provide a new mathematical framework to show exactly how attack results can fairly be compared.
    Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems. (arXiv:2109.15142v1 [cs.LG])
    (0 min) The Transformer and its variants have been proven to be efficient sequence learners in many different domains. Despite their staggering success, a critical issue has been the enormous number of parameters that must be trained (ranging from $10^7$ to $10^{11}$) along with the quadratic complexity of dot-product attention. In this work, we investigate the problem of approximating the two central components of the Transformer -- multi-head self-attention and point-wise feed-forward transformation, with reduced parameter space and computational complexity. We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations. Taking advantage of an analogy between Transformer stages and the evolution of a dynamical system of multiple interacting particles, we formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers. We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks. We observe that the degree of approximation (or inversely, the degree of parameter reduction) has different effects on the performance, depending on the task. While in the encoder-decoder regime, TransEvolve delivers performances comparable to the original Transformer, in encoder-only tasks it consistently outperforms Transformer along with several subsequent variants.
    Latent Network Embedding via Adversarial Auto-encoders. (arXiv:2109.15257v1 [cs.LG])
    (0 min) Graph auto-encoders have proved to be useful in network embedding task. However, current models only consider explicit structures and fail to explore the informative latent structures cohered in networks. To address this issue, we propose a latent network embedding model based on adversarial graph auto-encoders. Under this framework, the problem of discovering latent structures is formulated as inferring the latent ties from partial observations. A latent transmission matrix that describes the strengths of existing edges and latent ties is derived based on influence cascades sampled by simulating diffusion processes over networks. Besides, since the inference process may bring extra noises, we introduce an adversarial training that works as regularization to dislodge noises and improve the model robustness. Extensive experiments on link prediction and node classification tasks show that the proposed model achieves superior results compared with baseline models.
    Gradient-based Hyperparameter Optimization Over Long Horizons. (arXiv:2007.07869v2 [cs.LG] UPDATED)
    (0 min) Gradient-based hyperparameter optimization has earned a widespread popularity in the context of few-shot meta-learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn hyperparameters online, but this introduces greediness which comes with a significant performance drop. We propose forward-mode differentiation with sharing (FDS), a simple and efficient algorithm which tackles memory scaling issues with forward-mode differentiation, and gradient degradation issues by sharing hyperparameters that are contiguous in time. We provide theoretical guarantees about the noise reduction properties of our algorithm, and demonstrate its efficiency empirically by differentiating through $\sim 10^4$ gradient steps of unrolled optimization. We consider large hyperparameter search ranges on CIFAR-10 where we significantly outperform greedy gradient-based alternatives, while achieving $\times 20$ speedups compared to the state-of-the-art black-box methods. Code is available at: \url{https://github.com/polo5/FDS}
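    Forward-mode hypergradients can be illustrated with a tiny JAX sketch that differentiates a validation loss with respect to a learning rate through an unrolled SGD trajectory via a Jacobian-vector product. The hyperparameter-sharing scheme of FDS itself is not reproduced, and the quadratic objective is a toy stand-in.

```python
# Minimal sketch of a forward-mode (JVP) hypergradient of a validation loss
# with respect to a single learning rate, through unrolled inner SGD steps.
import jax
import jax.numpy as jnp

A = jnp.array([[3.0, 0.5], [0.5, 1.0]])          # toy quadratic "training" loss
w0 = jnp.array([1.0, -2.0])

def train_then_validate(lr, steps=100):
    w = w0
    for _ in range(steps):                        # unrolled inner optimization
        grad = A @ w                              # gradient of 0.5 * w^T A w
        w = w - lr * grad
    return 0.5 * w @ A @ w                        # "validation" loss after training

val_loss, dval_dlr = jax.jvp(train_then_validate, (0.05,), (1.0,))
print(val_loss, dval_dlr)                         # forward-mode hypergradient w.r.t. lr
```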
    Explanation-Aware Experience Replay in Rule-Dense Environments. (arXiv:2109.14711v1 [cs.LG])
    (0 min) Human environments are often regulated by explicit and complex rulesets. Integrating Reinforcement Learning (RL) agents into such environments motivates the development of learning mechanisms that perform well in rule-dense and exception-ridden environments such as autonomous driving on regulated roads. In this paper, we propose a method for organising experience by means of partitioning the experience buffer into clusters labelled on a per-explanation basis. We present discrete and continuous navigation environments compatible with modular rulesets and 9 learning tasks. For environments with explainable rulesets, we convert rule-based explanations into case-based explanations by allocating state-transitions into clusters labelled with explanations. This allows us to sample experiences in a curricular and task-oriented manner, focusing on the rarity, importance, and meaning of events. We label this concept Explanation-Awareness (XA). We perform XA experience replay (XAER) with intra and inter-cluster prioritisation, and introduce XA-compatible versions of DQN, TD3, and SAC. Performance is consistently superior with XA versions of those algorithms, compared to traditional Prioritised Experience Replay baselines, indicating that explanation engineering can be used in lieu of reward engineering for environments with explainable features.
    Prune Your Model Before Distill It. (arXiv:2109.14960v1 [cs.LG])
    (0 min) Unstructured pruning reduces a significant number of weights in neural networks. However, unstructured pruning provides a sparse network with the same network architecture as the original network. On the other hand, structured pruning provides an efficient network architecture by removing channels, but the parameter reduction is not significant. In this paper, we consider transferring knowledge from unstructured pruning to a network with an efficient architecture (with fewer channels). In particular, we apply knowledge distillation (KD), where the teacher network is a sparse network (obtained from unstructured pruning) and the student network has an efficient architecture. We observe that learning from the pruned teacher is more effective than learning from the unpruned teacher. We further obtain promising experimental results indicating that unstructured pruning can improve the performance of knowledge distillation in general.
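    A minimal sketch of the recipe described above: magnitude-prune a trained teacher, then distill into a smaller student with a temperature-softened KL term plus a hard-label term. The models, sparsity level, and hyperparameters are placeholders, not the paper's settings.

```python
# Minimal sketch: unstructured magnitude pruning of a teacher, then knowledge
# distillation into a narrower student.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Unstructured magnitude pruning of the teacher (same architecture, zeroed weights).
for module in teacher:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))
with torch.no_grad():
    t_logits = teacher(x)            # soft targets from the pruned teacher
loss = distillation_loss(student(x), t_logits, y)
loss.backward()                      # updates flow only into the student
```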
    Physical Gradients for Deep Learning. (arXiv:2109.15048v1 [cs.LG])
    (0 min) Solving inverse problems, such as parameter estimation and optimal control, is a vital part of science. Many experiments repeatedly collect data and employ machine learning algorithms to quickly infer solutions to the associated inverse problems. We find that state-of-the-art training techniques are not well-suited to many problems that involve physical processes since the magnitude and direction of the gradients can vary strongly. We propose a novel hybrid training approach that combines higher-order optimization methods with machine learning techniques. We replace the gradient of the physical process by a new construct, referred to as the physical gradient. This also allows us to introduce domain knowledge into training by incorporating priors about the solution space into the gradients. We demonstrate the capabilities of our method on a variety of canonical physical systems, showing that physical gradients yield significant improvements on a wide range of optimization and learning problems.
    Reinforcement Learning for Classical Planning: Viewing Heuristics as Dense Reward Generators. (arXiv:2109.14830v1 [cs.AI])
    (0 min) Recent advances in reinforcement learning (RL) have led to a growing interest in applying RL to classical planning domains or applying classical planning methods to some complex RL domains. However, the long-horizon goal-based problems found in classical planning lead to sparse rewards for RL, making direct application inefficient. In this paper, we propose to leverage domain-independent heuristic functions commonly used in the classical planning literature to improve the sample efficiency of RL. These classical heuristics act as dense reward generators to alleviate the sparse-rewards issue and enable our RL agent to learn domain-specific value functions as residuals on these heuristics, making learning easier. Correct application of this technique requires consolidating the discounted metric used in RL and the non-discounted metric used in heuristics. We implement the value functions using Neural Logic Machines, a neural network architecture designed for grounded first-order logic inputs. We demonstrate on several classical planning domains that using classical heuristics for RL allows for good sample efficiency compared to sparse-reward RL. We further show that our learned value functions generalize to novel problem instances in the same domain.
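    One simple way to realize "heuristics as dense rewards" is potential-based reward shaping with the negated heuristic as the potential, as sketched below. The heuristic here is a placeholder goal-count; the paper additionally reconciles discounted and undiscounted metrics and learns residual value functions with Neural Logic Machines, which is not shown.

```python
# Minimal sketch of heuristic-based dense rewards via potential-based shaping:
# r' = r + gamma * phi(s') - phi(s), with phi(s) = -h(s).

def heuristic(state):
    """Placeholder heuristic, e.g. a goal-count estimate of remaining work."""
    goal, current = state
    return len(goal - current)          # number of unsatisfied goal atoms

def shaped_reward(reward, state, next_state, gamma=0.99):
    phi_s = -heuristic(state)
    phi_next = -heuristic(next_state)
    return reward + gamma * phi_next - phi_s

# Toy transition: one more goal atom becomes true; the sparse reward stays 0,
# but the shaped reward is positive, giving the RL agent a dense learning signal.
goal = frozenset({"on(a,b)", "on(b,c)"})
s = (goal, frozenset())
s_next = (goal, frozenset({"on(b,c)"}))
print(shaped_reward(0.0, s, s_next))
```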
    Trainless Model Performance Estimation for Neural Architecture Search. (arXiv:2103.08312v2 [cs.LG] UPDATED)
    (0 min) Neural architecture search has become an indispensable part of the deep learning field. Modern methods make it possible to find one of the best performing architectures, or to build one from scratch, but they typically make decisions based on trained accuracy information. In the present article we explore instead how the architectural component of a neural network affects its prediction power. We focus on relationships between the trained accuracy of an architecture and its accuracy prior to training, by considering statistics over multiple initialisations. We observe that minimising the coefficient of variation of the untrained accuracy, $CV_{U}$, consistently leads to better performing architectures. We test $CV_{U}$ as a neural architecture search scoring metric using the NAS-Bench-201 database of trained neural architectures. The architectures with the lowest $CV_{U}$ value have on average an accuracy of $91.90 \pm 2.27$, $64.08 \pm 5.63$ and $38.76 \pm 6.62$ for CIFAR-10, CIFAR-100 and a downscaled version of ImageNet, respectively. Since these values are statistically above the random baseline, we conclude that a good architecture should be stable against weight initialisations. It takes about $190$ s for CIFAR-10 and CIFAR-100 and $133.9$ s for ImageNet16-120 to process $100$ architectures, on a batch of $256$ images, with $100$ initialisations.
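    A minimal sketch of the scoring rule: rank candidate architectures by the coefficient of variation of their untrained accuracy over several random initializations and prefer the lowest $CV_{U}$. The accuracy evaluation below is a stub to be replaced by an actual forward pass over a batch of data.

```python
# Minimal sketch of the CV_U zero-training score for architecture ranking.
import numpy as np

def untrained_accuracy(architecture, seed):
    """Stub: build `architecture` with init `seed` and measure accuracy on a batch."""
    rng = np.random.default_rng(hash((architecture, seed)) % 2**32)
    return float(rng.uniform(0.05, 0.20))     # placeholder for a real evaluation

def cv_untrained(architecture, n_inits=100):
    accs = np.array([untrained_accuracy(architecture, s) for s in range(n_inits)])
    return accs.std() / accs.mean()           # CV_U: std over mean

candidates = ["arch_a", "arch_b", "arch_c"]
best = min(candidates, key=cv_untrained)      # architecture with the lowest CV_U
print(best)
```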
    Early Bearing Fault Diagnosis of Rotating Machinery by 1D Self-Organized Operational Neural Networks. (arXiv:2109.14873v1 [cs.LG])
    (0 min) Preventive maintenance of modern electric rotating machinery (RM) is critical for ensuring reliable operation, preventing unpredicted breakdowns and avoiding costly repairs. Recently, many studies have investigated machine learning based monitoring methods, especially those based on deep learning networks, focusing mostly on detecting bearing faults; however, none of them has addressed bearing fault severity classification for early fault diagnosis with high enough accuracy. 1D Convolutional Neural Networks (CNNs) have indeed achieved good performance for detecting RM bearing faults from raw vibration and current signals but did not classify fault severity. Furthermore, recent studies have demonstrated the limited learning capability of conventional CNNs, attributed to the basic underlying linear neuron model. Recently, Operational Neural Networks (ONNs) were proposed to enhance the learning capability of CNNs by introducing non-linear neuron models and further heterogeneity in the network configuration. In this study, we propose 1D Self-organized ONNs (Self-ONNs) with generative neurons for bearing fault severity classification and continuous condition monitoring. Experimental results over the benchmark NSF/IMS bearing vibration dataset, using both x- and y-axis vibration signals for inner race and rolling element faults, demonstrate that the proposed 1D Self-ONNs achieve a significant performance gain over the state-of-the-art (1D CNNs) with similar computational complexity.
    Out-of-Distribution Detection for Medical Applications: Guidelines for Practical Evaluation. (arXiv:2109.14885v1 [cs.LG])
    (0 min) Detection of Out-of-Distribution (OOD) samples in real time is a crucial safety check for deployment of machine learning models in the medical field. Despite a growing number of uncertainty quantification techniques, there is a lack of evaluation guidelines on how to select OOD detection methods in practice. This gap impedes implementation of OOD detection methods for real-world applications. Here, we propose a series of practical considerations and tests to choose the best OOD detector for a specific medical dataset. These guidelines are illustrated on a real-life use case of Electronic Health Records (EHR). Our results can serve as a guide for implementation of OOD detection methods in clinical practice, mitigating risks associated with the use of machine learning models in healthcare.
    Automatic Estimation of Ulcerative Colitis Severity from Endoscopy Videos using Ordinal Multi-Instance Learning. (arXiv:2109.14685v1 [eess.IV])
    (0 min) Ulcerative colitis (UC) is a chronic inflammatory bowel disease characterized by relapsing inflammation of the large intestine. The severity of UC is often represented by the Mayo Endoscopic Subscore (MES) which quantifies mucosal disease activity from endoscopy videos. In clinical trials, an endoscopy video is assigned an MES based upon the most severe disease activity observed in the video. For this reason, severe inflammation spread throughout the colon will receive the same MES as an otherwise healthy colon with severe inflammation restricted to a small, localized segment. Therefore, the extent of disease activity throughout the large intestine, and overall response to treatment, may not be completely captured by the MES. In this work, we aim to automatically estimate UC severity for each frame in an endoscopy video to provide a higher resolution assessment of disease activity throughout the colon. Because annotating severity at the frame-level is expensive, labor-intensive, and highly subjective, we propose a novel weakly supervised, ordinal classification method to estimate frame severity from video MES labels alone. Using clinical trial data, we first achieved 0.92 and 0.90 AUC for predicting mucosal healing and remission of UC, respectively. Then, for severity estimation, we demonstrate that our models achieve substantial Cohen's Kappa agreement with ground truth MES labels, comparable to the inter-rater agreement of expert clinicians. These findings indicate that our framework could serve as a foundation for novel clinical endpoints, based on a more localized scoring system, to better evaluate UC drug efficacy in clinical trials.
    Learning the Markov Decision Process in the Sparse Gaussian Elimination. (arXiv:2109.14929v1 [math.NA])
    (0 min) We propose a learning-based approach for sparse Gaussian Elimination. There are many hard combinatorial optimization problems in modern sparse solvers. These NP-hard problems can be handled in the framework of the Markov Decision Process, in particular with the Q-Learning technique. We propose Q-Learning algorithms for the main modules of a sparse solver: minimum degree ordering, task scheduling, and adaptive pivoting. Finally, we recast the sparse solver into the framework of Q-Learning. Our study is a first step in connecting these two classical mathematical models: Gaussian Elimination and the Markov Decision Process. Our learning-based algorithm can help improve the performance of sparse solvers, which has been verified in numerical experiments.
    AffectGAN: Affect-Based Generative Art Driven by Semantics. (arXiv:2109.14845v1 [cs.CV])
    (0 min) This paper introduces a novel method for generating artistic images that express particular affective states. Leveraging state-of-the-art deep learning methods for visual generation (through generative adversarial networks), semantic models from OpenAI, and the annotated dataset of the visual art encyclopedia WikiArt, our AffectGAN model is able to generate images based on specific or broad semantic prompts and intended affective outcomes. A small dataset of 32 images generated by AffectGAN is annotated by 50 participants in terms of the particular emotion they elicit, as well as their quality and novelty. Results show that for most instances the intended emotion used as a prompt for image generation matches the participants' responses. This small-scale study brings forth a new vision towards blending affective computing with computational creativity, enabling generative systems with intentionality in terms of the emotions they wish their output to elicit.
    A Priori Calibration of Transient Kinetics Data via Machine Learning. (arXiv:2109.15042v1 [cs.LG])
    (0 min) The temporal analysis of products (TAP) reactor provides a vast amount of transient kinetic information that may be used to describe a variety of chemical features, including the residence time distribution, kinetic coefficients, number of active sites, and the reaction mechanism. However, as with any measurement device, the TAP reactor signal is convoluted with noise. To reduce the uncertainty of the kinetic measurement and any derived parameters or mechanisms, proper preprocessing must be performed prior to any advanced analysis. This preprocessing consists of baseline correction, i.e., a shift in the voltage response, and calibration, i.e., a scaling of the flux response based on prior experiments. The current methodology of preprocessing requires significant user discretion and reliance on previous experiments that may drift over time. Herein we use machine learning techniques combined with physical constraints to convert the raw instrument signal to chemical information. As such, the proposed methodology demonstrates clear benefits over traditional preprocessing in the calibration of the inert and feed mixture products, without the need for prior calibration experiments or heuristic input from the user.
    Scalable Rule-Based Representation Learning for Interpretable Classification. (arXiv:2109.15103v1 [cs.LG])
    (0 min) Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. An improved design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on nine small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios. Our code is available at: https://github.com/12wang3/rrl.
    Learning Material Parameters and Hydrodynamics of Soft Robotic Fish via Differentiable Simulation. (arXiv:2109.14855v1 [cs.RO])
    (0 min) The high dimensionality of soft mechanisms and the complex physics of fluid-structure interactions render the sim2real gap for soft robots particularly challenging. Our framework allows high fidelity prediction of dynamic behavior for composite bi-morph bending structures in real hardware to accuracy near measurement uncertainty. We address this gap with our differentiable simulation tool by learning the material parameters and hydrodynamics of our robots. We demonstrate an experimentally-verified, fast optimization pipeline for learning the material parameters and hydrodynamics from quasi-static and dynamic data via differentiable simulation. Our method identifies physically plausible Young's moduli for various soft silicone elastomers and stiff acetal copolymers used in creation of our three different fish robot designs. For these robots we provide a differentiable and more robust estimate of the thrust force than analytical models and we successfully predict deformation to millimeter accuracy in dynamic experiments under various actuation signals. Although we focus on a specific application for underwater soft robots, our framework is applicable to any pneumatically actuated soft mechanism. This work presents a prototypical hardware and simulation problem solved using our framework that can be extended straightforwardly to higher dimensional parameter inference, learning control policies, and computational design enabled by its differentiability.
    On the Convergence of the Projected Alternating Maximization Algorithm for Equitable and Optimal Transport. (arXiv:2109.15030v1 [math.OC])
    (0 min) This paper studies the equitable and optimal transport (EOT) problem, which has many applications such as fair division and optimal transport with multiple agents. In the discrete distributions case, the EOT problem can be formulated as a linear program (LP). Since this LP is prohibitively large for general LP solvers, Scetbon et al. \cite{scetbon2021equitable} suggest perturbing the problem by adding an entropy regularization. They proposed a projected alternating maximization algorithm (PAM) to solve the dual of the entropy-regularized EOT. In this paper, we provide the first convergence analysis of PAM. A novel rounding procedure is proposed to help construct the primal solution for the original EOT problem. We also propose a variant of PAM that incorporates the extrapolation technique and can numerically improve the performance of PAM. Results in this paper may shed light on block coordinate (gradient) descent methods for general optimization problems.
    Comparative Validation of Machine Learning Algorithms for Surgical Workflow and Skill Analysis with the HeiChole Benchmark. (arXiv:2109.14956v1 [eess.IV])
    (0 min) PURPOSE: Surgical workflow and skill analysis are key technologies for the next generation of cognitive surgical assistance systems. These systems could increase the safety of the operation through context-sensitive warnings and semi-autonomous robotic assistance or improve training of surgeons via data-driven feedback. In surgical workflow analysis, up to 91% average precision has been reported for phase recognition on an open-data, single-center dataset. In this work we investigated the generalizability of phase recognition algorithms in a multi-center setting including more difficult recognition tasks such as surgical action and surgical skill. METHODS: To achieve this goal, a dataset with 33 laparoscopic cholecystectomy videos from three surgical centers with a total operation time of 22 hours was created. Labels included annotation of seven surgical phases with 250 phase transitions, 5514 occurrences of four surgical actions, 6980 occurrences of 21 surgical instruments from seven instrument categories, and 495 skill classifications in five skill dimensions. The dataset was used in the 2019 Endoscopic Vision challenge, sub-challenge for surgical workflow and skill analysis. Here, 12 teams submitted their machine learning algorithms for recognition of phase, action, instrument and/or skill assessment. RESULTS: F1-scores were achieved for phase recognition between 23.9% and 67.7% (n=9 teams), for instrument presence detection between 38.5% and 63.8% (n=8 teams), but for action recognition only between 21.8% and 23.3% (n=5 teams). The average absolute error for skill assessment was 0.78 (n=1 team). CONCLUSION: Surgical workflow and skill analysis are promising technologies to support the surgical team, but are not solved yet, as shown by our comparison of algorithms. This novel benchmark can be used for comparable evaluation and validation of future work.
    Genealogical Population-Based Training for Hyperparameter Optimization. (arXiv:2109.14925v1 [cs.LG])
    (0 min) Hyperparameter optimization aims at finding the best hyperparameters (HPs) of learning models such as neural networks more rapidly and efficiently. In this work, we present a new approach called GPBT (Genealogical Population-Based Training), which shares many points with Population-Based Training: our approach outputs a schedule of HPs and updates both weights and HPs in a single run, but it brings several novel contributions: the choice of new HPs is made by a modular search algorithm; the search algorithm can search HPs independently for models with different weights and can separately exploit the maximum amount of meaningful, genealogically-related information from previous HP evaluations instead of exploiting all previous HP evaluations together; and a variation of early stopping allows a 2-3 fold acceleration at small performance cost. GPBT significantly outperforms all other HP optimization approaches, in terms of both speed and performance, on all supervised learning experiments tested. With our approach, HP tuning will become less computationally expensive, not only in the deep learning field, but potentially for all processes based on iterative optimization.
    A useful criterion on studying consistent estimation in community detection. (arXiv:2109.14950v1 [cs.LG])
    (0 min) In network analysis, developing a unified theoretical framework that can compare methods under different models is an interesting problem. This paper proposes a partial solution to this problem. We summarize the idea of using separation condition for a standard network and sharp threshold of Erd\"os-R\'enyi random graph to study consistent estimation, compare theoretical error rates and requirements on network sparsity of spectral methods under models that can degenerate to stochastic block model as a four-step criterion SCSTC. Using SCSTC, we find some inconsistent phenomena on separation condition and sharp threshold in community detection. Especially, we find original theoretical results of the SPACL algorithm introduced to estimate network memberships under the mixed membership stochastic blockmodel were sub-optimal. To find the formation mechanism of inconsistencies, we re-establish theoretical convergence rates of this algorithm by applying recent techniques on row-wise eigenvector deviation. The results are further extended to the degree corrected mixed membership model. By comparison, our results enjoy smaller error rates, lesser dependence on the number of communities, weaker requirements on network sparsity, and so forth. Furthermore, separation condition and sharp threshold obtained from our theoretical results match classical results, which shows the usefulness of this criterion on studying consistent estimation.
    Semi-tensor Product-based Tensor Decomposition for Neural Network Compression. (arXiv:2109.15200v1 [cs.LG])
    (0 min) Existing tensor networks adopt the conventional matrix product for connection. The classical matrix product requires strict dimensionality consistency between factors, which can result in redundancy in data representation. In this paper, the semi-tensor product is used to generalize the classical matrix product-based mode product to a semi-tensor mode product. As it permits the connection of two factors with different dimensionality, more flexible and compact tensor decompositions can be obtained with smaller factor sizes. Tucker decomposition, Tensor Train (TT) and Tensor Ring (TR) are common decompositions for low-rank compression of deep neural networks. The semi-tensor product is applied to these tensor decompositions to obtain their generalized versions, i.e., semi-tensor Tucker decomposition (STTu), semi-tensor train (STT) and semi-tensor ring (STR). Experimental results show that STTu, STT and STR achieve higher compression factors than the conventional tensor decompositions with the same accuracy but less training time in ResNet and WideResNet compression. With 2% accuracy degradation, the TT-RN (rank = 14) and the TR-WRN (rank = 16) obtain only 3 times and 99 times compression factors, while the STT-RN (rank = 14) and the STR-WRN (rank = 16) achieve 9 times and 179 times compression factors, respectively.
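    For reference, the (left) semi-tensor product that relaxes the dimension-matching requirement can be sketched as follows: for A of size m x n and B of size p x q, with t = lcm(n, p), A |x| B = (A kron I_{t/n})(B kron I_{t/p}), which reduces to the ordinary matrix product when n = p. The code below illustrates only this definition, not the paper's decomposition pipeline.

```python
# Minimal sketch of the left semi-tensor product of two matrices.
import numpy as np
from math import lcm

def semi_tensor_product(A, B):
    n, p = A.shape[1], B.shape[0]
    t = lcm(n, p)
    return np.kron(A, np.eye(t // n)) @ np.kron(B, np.eye(t // p))

A = np.random.randn(2, 4)
B = np.random.randn(2, 3)                # 4 != 2, so the classical product is undefined
print(semi_tensor_product(A, B).shape)   # (2, 6)
```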
    Dr Jekyll and Mr Hyde: the Strange Case of Off-Policy Policy Updates. (arXiv:2109.14727v1 [cs.LG])
    (0 min) The policy gradient theorem states that the policy should only be updated in states that are visited by the current policy, which leads to insufficient planning in the off-policy states, and thus to convergence to suboptimal policies. We tackle this planning issue by extending the policy gradient theory to policy updates with respect to any state density. Under these generalized policy updates, we show convergence to optimality under a necessary and sufficient condition on the updates' state densities, and thereby solve the aforementioned planning issue. We also prove asymptotic convergence rates that significantly improve those in the policy gradient literature. To implement the principles prescribed by our theory, we propose an agent, Dr Jekyll & Mr Hyde (JH), with a double personality: Dr Jekyll purely exploits while Mr Hyde purely explores. JH's independent policies allow us to record two separate replay buffers: one on-policy (Dr Jekyll's) and one off-policy (Mr Hyde's), and therefore to update JH's models with a mixture of on-policy and off-policy updates. More than an algorithm, JH defines principles for actor-critic algorithms to satisfy the requirements we identify in our analysis. We extensively test on finite MDPs where JH demonstrates a superior ability to recover from converging to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem where it shows promising results.
    Automated airway segmentation by learning graphical structure. (arXiv:2109.14792v1 [eess.IV])
    (0 min) In this research project, we put forward an advanced method for airway segmentation based on existing convolutional neural network (CNN) and graph neural network (GNN) architectures. The method originates from vessel segmentation, but we improve it and enable the novel model to perform better on datasets from computed tomography (CT) scans. Current methods for airway segmentation consider the regular grid only. Regardless of the detailed model, whether a 3-dimensional CNN or 2-dimensional CNNs in three directions, the overall graph structure is not taken into consideration. In our model, with the neighbourhoods of the airway taken into account, the graph structure is incorporated and the segmentation of airways is improved compared with traditional CNN methods. We perform experiments on chest CT scans, where the ground truth segmentation labels are produced manually. The results show that, compared with the CNN-only method, the combination of CNN and GNN performs better in that the bronchi in the chest CT scans can be detected in most cases. In addition, the proposed model extends readily to other datasets, since the architecture is also useful for fulfilling similar aims there. Hence, the model is of practical significance and broadly applicable. Keywords: Airway segmentation, Convolutional neural network, Graph neural network
    CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations. (arXiv:2109.14910v1 [cs.CV])
    (0 min) Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs from sets of negative samples. Recently, the principle has also been used to learn cross-modal embeddings for video and text, yet without exploiting its full potential. In particular, previous losses do not take the intra-modality similarities into account, which leads to inefficient embeddings, as the same content is mapped to multiple points in the embedding space. With CrossCLR, we present a contrastive loss that fixes this issue. Moreover, we define sets of highly related samples in terms of their input embeddings and exclude them from the negative samples to avoid issues with false negatives. We show that these principles consistently improve the quality of the learned embeddings. The joint embeddings learned with CrossCLR extend the state of the art in video-text retrieval on Youcook2 and LSMDC datasets and in video captioning on Youcook2 dataset by a large margin. We also demonstrate the generality of the concept by learning improved joint embeddings for other pairs of modalities.
    A Variational Perspective on Diffusion-Based Generative Models and Score Matching. (arXiv:2106.02808v2 [cs.LG] UPDATED)
    (0 min) Discrete-time diffusion-based generative models and score matching methods have shown promising results in modeling high-dimensional image data. Recently, Song et al. (2021) show that diffusion processes that transform data into noise can be reversed via learning the score function, i.e. the gradient of the log-density of the perturbed data. They propose to plug the learned score function into an inverse formula to define a generative diffusion process. Despite the empirical success, a theoretical underpinning of this procedure is still lacking. In this work, we approach the (continuous-time) generative diffusion directly and derive a variational framework for likelihood estimation, which includes continuous-time normalizing flows as a special case, and can be seen as an infinitely deep variational autoencoder. Under this framework, we show that minimizing the score-matching loss is equivalent to maximizing a lower bound of the likelihood of the plug-in reverse SDE proposed by Song et al. (2021), bridging the theoretical gap.
    Interpretability in Safety-Critical Financial Trading Systems. (arXiv:2109.15112v1 [cs.LG])
    (0 min) Sophisticated machine learning (ML) models to inform trading in the financial sector create problems of interpretability and risk management. Seemingly robust forecasting models may behave erroneously in out of distribution settings. In 2020, some of the world's most sophisticated quant hedge funds suffered losses as their ML models were first underhedged, and then overcompensated. We implement a gradient-based approach for precisely stress-testing how a trading model's forecasts can be manipulated, and their effects on downstream tasks at the trading execution level. We construct inputs -- whether in changes to sentiment or market variables -- that efficiently affect changes in the return distribution. In an industry-standard trading pipeline, we perturb model inputs for eight S&P 500 stocks. We find our approach discovers seemingly in-sample input settings that result in large negative shifts in return distributions. We provide the financial community with mechanisms to interpret ML forecasts in trading systems. For the security community, we provide a compelling application where studying ML robustness necessitates that one capture an end-to-end system's performance rather than study a ML model in isolation. Indeed, we show in our evaluation that errors in the forecasting model's predictions alone are not sufficient for trading decisions made based on these forecasts to yield a negative return.
    Adversarial Regression with Doubly Non-negative Weighting Matrices. (arXiv:2109.14875v1 [stat.ML])
    (0 min) Many machine learning tasks that involve predicting an output response can be solved by training a weighted regression model. Unfortunately, the predictive power of this type of model may severely deteriorate under low sample sizes or under covariate perturbations. Reweighting the training samples has emerged as an effective mitigation strategy for these problems. In this paper, we propose a novel and coherent scheme for kernel-reweighted regression by reparametrizing the sample weights using a doubly non-negative matrix. When the weighting matrix is confined in an uncertainty set using either the log-determinant divergence or the Bures-Wasserstein distance, we show that the adversarially reweighted estimate can be solved efficiently using first-order methods. Numerical experiments show that our reweighting strategy delivers promising results on numerous datasets.
    Unlocking the potential of deep learning for marine ecology: overview, applications, and outlook. (arXiv:2109.14737v1 [cs.LG])
    (0 min) The deep learning revolution is touching all scientific disciplines and corners of our lives as a means of harnessing the power of big data. Marine ecology is no exception. These new methods provide analysis of data from sensors, cameras, and acoustic recorders, even in real time, in ways that are reproducible and rapid. Off-the-shelf algorithms can find, count, and classify species from digital images or video and detect cryptic patterns in noisy data. Using these opportunities requires collaboration across ecological and data science disciplines, which can be challenging to initiate. To facilitate these collaborations and promote the use of deep learning towards ecosystem-based management of the sea, this paper aims to bridge the gap between marine ecologists and computer scientists. We provide insight into popular deep learning approaches for ecological data analysis in plain language, focusing on the techniques of supervised learning with deep neural networks, and illustrate challenges and opportunities through established and emerging applications of deep learning to marine ecology. We use established and future-looking case studies on plankton, fishes, marine mammals, pollution, and nutrient cycling that involve object detection, classification, tracking, and segmentation of visualized data. We conclude with a broad outlook of the field's opportunities and challenges, including potential technological advances and issues with managing complex data sets.
    Batched Bandits with Crowd Externalities. (arXiv:2109.14733v1 [cs.LG])
    (0 min) In Batched Multi-Armed Bandits (BMAB), the policy is not allowed to be updated at each time step. Usually, the setting asserts a maximum number of allowed policy updates and the algorithm schedules them so as to minimize the expected regret. In this paper, we describe a novel setting for BMAB with the following twist: the timing of the policy update is not controlled by the BMAB algorithm; instead, the amount of data received during each batch, called the \textit{crowd}, is influenced by the past selection of arms. We first design a near-optimal policy with approximate knowledge of the parameters that we prove to have a regret in $\mathcal{O}(\sqrt{\frac{\ln x}{x}}+\epsilon)$ where $x$ is the size of the crowd and $\epsilon$ is the parameter error. Next, we implement a UCB-inspired algorithm that guarantees an additional regret in $\mathcal{O}\left(\max(K\ln T,\sqrt{T\ln T})\right)$, where $K$ is the number of arms and $T$ is the horizon.
    Federated Self-Supervised Contrastive Learning via Ensemble Similarity Distillation. (arXiv:2109.14611v1 [cs.LG])
    (0 min) This paper investigates the feasibility of learning a good representation space with unlabeled client data in the federated scenario. Existing works trivially inherit supervised federated learning methods, which do not accommodate model heterogeneity and carry a potential risk of privacy exposure. To tackle these problems, we first identify that self-supervised contrastive local training is more robust against non-i.i.d.-ness than the traditional supervised learning paradigm. Then we propose a novel federated self-supervised contrastive learning framework, FLESD, that supports architecture-agnostic local training and communication-efficient global aggregation. At each round of communication, the server first gathers a fraction of the clients' inferred similarity matrices on a public dataset. Then FLESD ensembles the similarity matrices and trains the global model via similarity distillation. We verify the effectiveness of our proposed framework through a series of empirical experiments and show that FLESD has three main advantages over the existing methods: it handles model heterogeneity, is less prone to privacy leakage, and is more communication-efficient. We will release the code of this paper in the future.
    Predicting the outputs of finite deep neural networks trained with noisy gradients. (arXiv:2004.01190v3 [stat.ML] UPDATED)
    (0 min) A recent line of works studied wide deep neural networks (DNNs) by approximating them as Gaussian Processes (GPs). A DNN trained with gradient flow was shown to map to a GP governed by the Neural Tangent Kernel (NTK), whereas earlier works showed that a DNN with an i.i.d. prior over its weights maps to the so-called Neural Network Gaussian Process (NNGP). Here we consider a DNN training protocol, involving noise, weight decay and finite width, whose outcome corresponds to a certain non-Gaussian stochastic process. An analytical framework is then introduced to analyze this non-Gaussian process, whose deviation from a GP is controlled by the finite width. Our contribution is three-fold: (i) In the infinite width limit, we establish a correspondence between DNNs trained with noisy gradients and the NNGP, not the NTK. (ii) We provide a general analytical form for the finite width correction (FWC) for DNNs with arbitrary activation functions and depth and use it to predict the outputs of empirical finite networks with high accuracy. Analyzing the FWC behavior as a function of $n$, the training set size, we find that it is negligible for both the very small $n$ regime, and, surprisingly, for the large $n$ regime (where the GP error scales as $O(1/n)$). (iii) We flesh out algebraically how these FWCs can improve the performance of finite convolutional neural networks (CNNs) relative to their GP counterparts on image classification tasks.
    Memory-Efficient Deep Learning Inference in Trusted Execution Environments. (arXiv:2104.15109v2 [cs.CR] UPDATED)
    (0 min) This study identifies and proposes techniques to alleviate two key bottlenecks to executing deep neural networks in trusted execution environments (TEEs): page thrashing during the execution of convolutional layers and the decryption of large weight matrices in fully-connected layers. For the former, we propose a novel partitioning scheme, y-plane partitioning, designed to (i) provide consistent execution time when the layer output is large compared to the TEE secure memory; and (ii) significantly reduce the memory footprint of convolutional layers. For the latter, we leverage quantization and compression. In our evaluation, the proposed optimizations incurred latency overheads ranging from 1.09X to 2X baseline for a wide range of TEE sizes; in contrast, an unmodified implementation incurred latencies of up to 26X when running inside of the TEE.
    Learned holographic light transport. (arXiv:2108.08253v2 [physics.optics] UPDATED)
    (0 min) Computer-Generated Holography (CGH) algorithms often fall short in matching simulations with results from a physical holographic display. Our work addresses this mismatch by learning the holographic light transport in holographic displays. Using a camera and a holographic display, we capture the image reconstructions of optimized holograms that rely on ideal simulations to generate a dataset. Inspired by the ideal simulations, we learn a complex-valued convolution kernel that can propagate given holograms to captured photographs in our dataset. Our method can dramatically improve simulation accuracy and image quality in holographic displays while paving the way for physically informed learning approaches.
    From Zero-Shot Machine Learning to Zero-Day Attack Detection. (arXiv:2109.14868v1 [cs.LG])
    (0 min) The standard ML methodology assumes that the test samples are derived from a set of pre-observed classes used in the training phase, where the model extracts and learns useful patterns to detect new data samples belonging to the same classes. However, in certain applications such as Network Intrusion Detection Systems (NIDSs), it is challenging to obtain data samples for all attack classes that the model will most likely observe in production. ML-based NIDSs face new attack traffic known as zero-day attacks, which are not used in the training of the learning models because they do not exist at training time. In this paper, a zero-shot learning methodology is proposed to evaluate ML model performance in the detection of zero-day attack scenarios. In the attribute learning stage, the ML models map the network data features to distinguish semantic attributes from known attack (seen) classes. In the inference stage, the models are evaluated on the detection of zero-day attack (unseen) classes by constructing the relationships between known attacks and zero-day attacks. A new metric, the Zero-day Detection Rate, is defined to measure the effectiveness of the learning model in the inference stage. The results demonstrate that the majority of attack classes do not represent significant risks to organisations adopting an ML-based NIDS in a zero-day attack scenario. However, for certain attack groups identified in this paper, such systems are not effective in applying the learnt attributes of attack behaviour to detect them as malicious. Further analysis was conducted using the Wasserstein Distance technique to measure how different such attacks are from the other attack types used in the training of the ML model. The results demonstrate that sophisticated attacks with a low zero-day detection rate have a significantly distinct feature distribution compared to the other attack classes.
    Towards Principled Causal Effect Estimation by Deep Identifiable Models. (arXiv:2109.15062v1 [stat.ML])
    (0 min) As an important problem of causal inference, we discuss the estimation of treatment effects (TEs) under unobserved confounding. Representing the confounder as a latent variable, we propose Intact-VAE, a new variant of variational autoencoder (VAE), motivated by the prognostic score that is sufficient for identifying TEs. Our VAE also naturally gives representations balanced across treatment groups, using its prior. Experiments on (semi-)synthetic datasets show state-of-the-art performance under diverse settings. Based on the identifiability of our model, further theoretical developments on identification and consistent estimation are also discussed. This paves the way towards principled causal effect estimation by deep neural networks.
    Feature Selection for Multivariate Time Series via Network Pruning. (arXiv:2102.06024v2 [cs.LG] UPDATED)
    (0 min) In recent years, there has been an ever-increasing amount of multivariate time series (MTS) data in various domains, typically generated by a large family of sensors such as wearable devices. This has led to the development of novel learning methods on MTS data, with deep learning models dominating the most recent advancements. Prior literature has primarily focused on designing new network architectures for modeling temporal dependencies within MTS. However, a less studied challenge is the high dimensionality of MTS data. In this paper, we propose a novel neural component, namely the Neural Feature Selector (NFS), as an end-to-end solution for feature selection in MTS data. Specifically, NFS is based on a decomposed convolution design and includes two modules: first, each feature stream within the MTS is processed by a temporal CNN independently; then an aggregating CNN combines the processed streams to produce input for other downstream networks. We evaluated the proposed NFS model on four real-world MTS datasets and found that it achieves comparable results with state-of-the-art methods while providing the benefit of feature selection. Our paper also highlights the robustness and effectiveness of feature selection with NFS compared to using recent autoencoder-based methods.
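    A minimal sketch of the decomposed convolution design as described, under the assumption of 1D per-feature streams (channel counts and kernel sizes are hypothetical, and this is not the authors' implementation):

        import torch
        import torch.nn as nn

        class DecomposedFeatureSelector(nn.Module):
            """Each MTS feature gets its own temporal Conv1d stream; an aggregating
            1x1 Conv1d then mixes the processed streams across the feature dimension."""
            def __init__(self, n_features, stream_channels=8, out_channels=32, kernel_size=3):
                super().__init__()
                self.streams = nn.ModuleList([
                    nn.Conv1d(1, stream_channels, kernel_size, padding=kernel_size // 2)
                    for _ in range(n_features)
                ])
                self.aggregate = nn.Conv1d(n_features * stream_channels, out_channels, kernel_size=1)

            def forward(self, x):                     # x: (batch, n_features, time)
                processed = [torch.relu(conv(x[:, i:i + 1, :])) for i, conv in enumerate(self.streams)]
                return self.aggregate(torch.cat(processed, dim=1))   # (batch, out_channels, time)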
    Surveillance Evasion Through Bayesian Reinforcement Learning. (arXiv:2109.14811v1 [cs.LG])
    (0 min) We consider a 2D continuous path planning problem with a completely unknown intensity of random termination: an Evader is trying to escape a domain while minimizing the cumulative risk of detection (termination) by adversarial Observers. Those Observers' surveillance intensity is a priori unknown and has to be learned through repetitive path planning. We propose a new algorithm that utilizes Gaussian process regression to model the unknown surveillance intensity and relies on a confidence bound technique to promote strategic exploration. We illustrate our method through several examples and confirm the convergence of averaged regret experimentally.
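    A rough sketch of the learning component under stated assumptions (scikit-learn GP regression, synthetic stand-in observations, and a simple lower-confidence-bound rule standing in for the paper's confidence bound technique):

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        rng = np.random.default_rng(0)
        visited_positions = rng.uniform(0.0, 1.0, size=(50, 2))     # past Evader positions in the 2D domain
        observed_intensity = np.exp(-np.sum((visited_positions - 0.5) ** 2, axis=1) / 0.1)  # stand-in data

        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
        gp.fit(visited_positions, observed_intensity)

        candidate_grid = rng.uniform(0.0, 1.0, size=(200, 2))
        mean, std = gp.predict(candidate_grid, return_std=True)
        optimistic_intensity = np.clip(mean - 2.0 * std, a_min=0.0, a_max=None)
        # A path planner would use optimistic_intensity as the running detection cost,
        # so poorly explored regions look cheap and get visited (strategic exploration).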
    ScanMix: Learning from Severe Label Noise via Semantic Clustering and Semi-Supervised Learning. (arXiv:2103.11395v2 [cs.CV] UPDATED)
    (0 min) In this paper, we address the problem of training deep neural networks in the presence of severe label noise. Our proposed training algorithm ScanMix, combines semantic clustering with semi-supervised learning (SSL) to improve the feature representations and enable an accurate identification of noisy samples, even in severe label noise scenarios. To be specific, ScanMix is designed based on the expectation maximisation (EM) framework, where the E-step estimates the value of a latent variable to cluster the training images based on their appearance representations and classification results, and the M-step optimises the SSL classification and learns effective feature representations via semantic clustering. In our evaluations, we show state-of-the-art results on standard benchmarks for symmetric, asymmetric and semantic label noise on CIFAR-10 and CIFAR-100, as well as large scale real label noise on WebVision. Most notably, for the benchmarks contaminated with large noise rates (80% and above), our results are up to 27% better than the related work. The code is available at https://github.com/ragavsachdeva/ScanMix.
    Sequential Estimation under Multiple Resources: a Bandit Point of View. (arXiv:2109.14703v1 [cs.LG])
    (0 min) The problem of Sequential Estimation under Multiple Resources (SEMR) is defined in a federated setting. SEMR can be considered as the intersection of statistical estimation and bandit theory. In this problem, an agent is confronted with $k$ resources to estimate a parameter $\theta$. The agent should continuously learn the quality of the resources by choosing them wisely and, at the end, propose an estimator based on the collected data. In this paper, we assume that the resources' distributions are Gaussian. The quality of the final estimator is evaluated by its mean squared error. Also, we restrict our class of estimators to unbiased estimators in order to define a meaningful notion of regret. The regret measures the performance of the agent by the variance of the final estimator in comparison to the optimal variance. We propose a lower bound to determine the fundamental limit of the setting even in the case that the distributions are not Gaussian. Also, we offer an order-optimal algorithm to achieve this lower bound.
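    As a generic illustration of the estimation side of this setting (not the paper's algorithm), with Gaussian resources of known variances the minimum-variance unbiased combination of the per-resource sample means is the inverse-variance weighted mean:

        import numpy as np

        rng = np.random.default_rng(1)
        theta, variances = 2.0, np.array([0.5, 1.0, 4.0])   # true parameter and per-resource variances
        counts = np.array([40, 30, 10])                     # how often each resource was queried

        sample_means = np.array([rng.normal(theta, np.sqrt(v), n).mean()
                                 for v, n in zip(variances, counts)])
        weights = counts / variances                        # inverse of each sample mean's variance v / n
        estimate = np.sum(weights * sample_means) / np.sum(weights)
        print(estimate, 1.0 / np.sum(weights))              # the estimator and its variance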
    Data Sharing and Compression for Cooperative Networked Control. (arXiv:2109.14675v1 [cs.LG])
    (0 min) Sharing forecasts of network timeseries data, such as cellular or electricity load patterns, can improve independent control applications ranging from traffic scheduling to power generation. Typically, forecasts are designed without knowledge of a downstream controller's task objective, and thus simply optimize for mean prediction error. However, such task-agnostic representations are often too large to stream over a communication network and do not emphasize salient temporal features for cooperative control. This paper presents a solution to learn succinct, highly-compressed forecasts that are co-designed with a modular controller's task objective. Our simulations with real cellular, Internet-of-Things (IoT), and electricity load data show we can improve a model predictive controller's performance by at least $25\%$ while transmitting $80\%$ less data than the competing method. Further, we present theoretical compression results for a networked variant of the classical linear quadratic regulator (LQR) control problem.
    CALDA: Improving Multi-Source Time Series Domain Adaptation with Contrastive Adversarial Learning. (arXiv:2109.14778v1 [cs.LG])
    (0 min) Unsupervised domain adaptation (UDA) provides a strategy for improving machine learning performance in data-rich (target) domains where ground truth labels are inaccessible but can be found in related (source) domains. In cases where meta-domain information such as label distributions is available, weak supervision can further boost performance. We propose a novel framework, CALDA, to tackle these two problems. CALDA synergistically combines the principles of contrastive learning and adversarial learning to robustly support multi-source UDA (MS-UDA) for time series data. Similar to prior methods, CALDA utilizes adversarial learning to align source and target feature representations. Unlike prior approaches, CALDA additionally leverages cross-source label information across domains. CALDA pulls examples with the same label close to each other, while pushing apart examples with different labels, reshaping the space through contrastive learning. Unlike prior contrastive adaptation methods, CALDA requires neither data augmentation nor pseudo labeling, which may be more challenging for time series. We empirically validate our proposed approach. Based on results from human activity recognition, electromyography, and synthetic datasets, we find utilizing cross-source information improves performance over prior time series and contrastive methods. Weak supervision further improves performance, even in the presence of noise, allowing CALDA to offer generalizable strategies for MS-UDA. Code is available at: https://github.com/floft/calda
    Scalable Hierarchical Agglomerative Clustering. (arXiv:2010.11821v3 [cs.LG] UPDATED)
    (0 min) The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition, but also provide a two-approximation to the non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds better quality clusters than the current state of the art.
    Automated Workers Ergonomic Risk Assessment in Manual Material Handling using sEMG Wearable Sensors and Machine Learning. (arXiv:2109.15036v1 [cs.LG])
    (0 min) Manual material handling tasks have the potential to be highly unsafe from an ergonomic viewpoint. Safety inspections to monitor body postures can help mitigate ergonomic risks of material handling. However, the real effect of awkward muscle movements, strains, and excessive forces that may result in an injury may not be identified by external cues. This paper evaluates the ability of surface electromyogram (EMG)-based systems together with machine learning algorithms to automatically detect body movements that may harm muscles in material handling. The analysis utilized a lifting equation developed by the U.S. National Institute for Occupational Safety and Health (NIOSH). This equation determines a Recommended Weight Limit, which suggests the maximum acceptable weight that a healthy worker can lift and carry as well as a Lifting Index value to assess the risk extent. Four different machine learning models, namely Decision Tree, Support Vector Machine, K-Nearest Neighbor, and Random Forest are developed to classify the risk assessments calculated based on the NIOSH lifting equation. The sensitivity of the models to various parameters is also evaluated to find the best performance using each algorithm. Results indicate that Decision Tree models have the potential to predict the risk level with close to 99.35% accuracy.
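    For concreteness, the revised NIOSH lifting equation referenced above is multiplicative; the sketch below uses the standard metric constants, with the frequency and coupling multipliers set to 1.0 for brevity (in practice they come from lookup tables, and each multiplier is clamped to its valid range):

        def niosh_rwl(H, V, D, A, FM=1.0, CM=1.0):
            """Recommended Weight Limit (kg) from the revised NIOSH lifting equation.
            H: horizontal distance (cm), V: vertical height (cm), D: vertical travel (cm),
            A: asymmetry angle (degrees). FM/CM are frequency/coupling multipliers (tables)."""
            LC = 23.0                                # load constant, kg
            HM = min(1.0, 25.0 / H)                  # horizontal multiplier
            VM = 1.0 - 0.003 * abs(V - 75.0)         # vertical multiplier
            DM = min(1.0, 0.82 + 4.5 / D)            # distance multiplier
            AM = 1.0 - 0.0032 * A                    # asymmetry multiplier
            return LC * HM * VM * DM * AM * FM * CM

        load = 15.0                                   # kg actually lifted
        rwl = niosh_rwl(H=40.0, V=60.0, D=50.0, A=30.0)
        lifting_index = load / rwl                    # LI > 1 indicates elevated ergonomic risk
        print(round(rwl, 2), round(lifting_index, 2))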
    Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks. (arXiv:2109.15256v1 [cs.CL])
    (0 min) Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions. However, existing neural models have been shown to lack this basic ability in learning symbolic structures. Motivated by the failure of a Transformer model on the SCAN compositionality challenge (Lake and Baroni, 2018), which requires parsing a command into actions, we propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics, as additional training supervision. These automatically-generated sequences are more representative of the underlying compositional symbolic structures of the input data. During inference, the model jointly predicts the next action and the next tokens in the auxiliary sequences at each step. Experiments on the SCAN dataset show that our method encourages the Transformer to understand compositional structures of the command, improving its accuracy on multiple challenging splits from <= 10% to 100%. With only 418 (5%) training instances, our approach still achieves 97.8% accuracy on the MCD1 split. Therefore, we argue that compositionality can be induced in Transformers given minimal but proper guidance. We also show that a better result is achieved using less contextualized vectors as the attention's query, providing insights into architecture choices in achieving systematic compositionality. Finally, we show positive generalization results on the groundedSCAN task (Ruis et al., 2020). Our code is publicly available at: https://github.com/jiangycTarheel/compositional-auxseq
    BulletTrain: Accelerating Robust Neural Network Training via Boundary Example Mining. (arXiv:2109.14707v1 [cs.LG])
    (0 min) Neural network robustness has become a central topic in machine learning in recent years. Most training algorithms that improve the model's robustness to adversarial and common corruptions also introduce a large computational overhead, requiring as many as ten times the number of forward and backward passes in order to converge. To combat this inefficiency, we propose BulletTrain $-$ a boundary example mining technique to drastically reduce the computational cost of robust training. Our key observation is that only a small fraction of examples are beneficial for improving robustness. BulletTrain dynamically predicts these important examples and optimizes robust training algorithms to focus on the important examples. We apply our technique to several existing robust training algorithms and achieve a 2.1$\times$ speed-up for TRADES and MART on CIFAR-10 and a 1.7$\times$ speed-up for AugMix on CIFAR-10-C and CIFAR-100-C without any reduction in clean and robust accuracy.
    A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning. (arXiv:2109.14756v1 [math.OC])
    (0 min) We study a novel two-time-scale stochastic gradient method for solving optimization problems where the gradient samples are generated from a time-varying Markov random process parameterized by the underlying optimization variable. These time-varying samples make the stochastic gradient biased and dependent, which can potentially lead to the divergence of the iterates. To address this issue, we consider a two-time-scale update scheme, where one scale is used to estimate the true gradient from the Markovian samples and the other scale is used to update the decision variable with the estimated gradient. While these two iterates are implemented simultaneously, the former is updated "faster" (using bigger step sizes) than the latter (using smaller step sizes). Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different objective functions, namely, strong convexity, convexity, non-convexity under the PL condition, and general non-convexity. Our second contribution is to apply our framework to study the performance of the popular actor-critic methods in solving stochastic control and reinforcement learning problems. First, we study an online natural actor-critic algorithm for the linear-quadratic regulator and show that a convergence rate of $\mathcal{O}(k^{-2/3})$ is achieved. This is the first time such a result is known in the literature. Second, we look at the standard online actor-critic algorithm over finite state and action spaces and derive a convergence rate of $\mathcal{O}(k^{-2/5})$, which recovers the best known rate derived specifically for this problem. Finally, we support our theoretical analysis with numerical simulations where the convergence rate is visualized.
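    The two-time-scale idea can be sketched on a toy quadratic objective with correlated (Markovian) gradient noise; the step-size schedules and the objective below are illustrative only, not the paper's exact algorithm:

        import numpy as np

        rng = np.random.default_rng(0)
        theta = np.zeros(2)          # decision variable (slow iterate)
        g_hat = np.zeros(2)          # running gradient estimate (fast iterate)
        state = 0.0                  # state of the underlying Markov process

        def noisy_grad(theta, state):
            # Biased, correlated sample of the gradient of f(theta) = 0.5 * ||theta - 1||^2
            return (theta - 1.0) + 0.5 * state

        for k in range(1, 5001):
            state = 0.9 * state + rng.normal(scale=0.1)   # Markovian noise (parameter-independent here)
            beta = 1.0 / k ** (2.0 / 3.0)                 # fast step size
            alpha = 1.0 / k                               # slow step size (eventually much smaller)
            g_hat += beta * (noisy_grad(theta, state) - g_hat)   # fast scale: track the true gradient
            theta -= alpha * g_hat                                # slow scale: descend along the estimate

        print(theta)   # approaches the minimizer [1, 1]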
    HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning. (arXiv:2109.15163v1 [cs.CV])
    (0 min) Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective due to the heterogeneous nature of the feature representations in the two domains, which intrinsically contain both distribution and structure variations. To address this and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains by adopting a hierarchical two-step adaptation, i.e., structure adaptation and distribution adaptation. In the structure adaptation step, we take two task-specific encoders to encode the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, thus making the visual and semantic feature manifolds more closely aligned. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions using a common encoder. Finally, the structure and distribution adaptation are derived in a unified framework under two partially-aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at \url{https://github.com/shiming-chen/HSVA} .
    Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning. (arXiv:2012.01345v3 [cs.CV] UPDATED)
    (0 min) Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.
    Increasing the Efficiency of Policy Learning for Autonomous Vehicles by Multi-Task Representation Learning. (arXiv:2103.14718v2 [cs.LG] UPDATED)
    (0 min) Driving in a dynamic, multi-agent, and complex urban environment is a difficult task requiring a complex decision-making policy. The learning of such a policy requires a state representation that can encode the entire environment. Mid-level representations that encode a vehicle's environment as images have become a popular choice. Still, they are quite high-dimensional, limiting their use in data-hungry approaches such as reinforcement learning. In this article, we propose to learn a low-dimensional and rich latent representation of the environment by leveraging the knowledge of relevant semantic factors. To do this, we train an encoder-decoder deep neural network to predict multiple application-relevant factors such as the trajectories of other agents and the ego car. Furthermore, we propose a hazard signal based on other vehicles' future trajectories and the planned route which is used in conjunction with the learned latent representation as input to a down-stream policy. We demonstrate that using the multi-head encoder-decoder neural network results in a more informative representation than a standard single-head model. In particular, the proposed representation learning and the hazard signal help reinforcement learning to learn faster, with increased performance and less data than baseline methods.
    Towards Better Data Augmentation using Wasserstein Distance in Variational Auto-encoder. (arXiv:2109.14795v1 [cs.LG])
    (0 min) VAE, or variational auto-encoder, compresses data into latent attributes, and generates new data of different varieties. VAE based on KL divergence has been considered as an effective technique for data augmentation. In this paper, we propose the use of Wasserstein distance as a measure of distributional similarity for the latent attributes, and show its superior theoretical lower bound (ELBO) compared with that of KL divergence under mild conditions. Using multiple experiments, we demonstrate that the new loss function exhibits better convergence property and generates artificial images that could better aid the image classification tasks.
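    For the common case of a diagonal Gaussian posterior against a standard normal prior, the squared 2-Wasserstein distance has a closed form that can stand in for the KL term; a minimal sketch of that regularizer (our illustration, not necessarily the paper's exact loss):

        import torch

        def w2_squared_diag_gaussians(mu, logvar):
            """Squared 2-Wasserstein distance between N(mu, diag(sigma^2)) and N(0, I):
            ||mu||^2 + sum_i (sigma_i - 1)^2, summed per sample then averaged over the batch."""
            sigma = torch.exp(0.5 * logvar)
            return (mu.pow(2).sum(dim=1) + (sigma - 1.0).pow(2).sum(dim=1)).mean()

        mu = torch.zeros(4, 8, requires_grad=True)
        logvar = torch.zeros(4, 8, requires_grad=True)
        loss = w2_squared_diag_gaussians(mu, logvar)   # would be added to the reconstruction loss
        loss.backward()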
    Deep Embedded K-Means Clustering. (arXiv:2109.15149v1 [cs.LG])
    (0 min) Recently, deep clustering methods have gained momentum because of the high representational power of deep neural networks (DNNs) such as autoencoders. The key idea is that representation learning and clustering can reinforce each other: good representations lead to good clustering while good clustering provides good supervisory signals to representation learning. Critical questions include: 1) How should representation learning and clustering be optimized? 2) Should the reconstruction loss of the autoencoder always be considered? In this paper, we propose DEKM (for Deep Embedded K-Means) to answer these two questions. Since the embedding space generated by the autoencoder may have no obvious cluster structure, we propose to further transform the embedding space to a new space that reveals the cluster-structure information. This is achieved by an orthonormal transformation matrix, which contains the eigenvectors of the within-class scatter matrix of K-means. The eigenvalues indicate the importance of the eigenvectors' contributions to the cluster-structure information in the new space. Our goal is to increase the cluster-structure information. To this end, we discard the decoder and propose a greedy method to optimize the representation. Representation learning and clustering are alternately optimized by DEKM. Experimental results on real-world datasets demonstrate that DEKM achieves state-of-the-art performance.
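    The transformation step described above can be sketched as follows: cluster the embeddings with K-means, form the within-class scatter matrix, and rotate the space with its eigenvectors (a simplified illustration with synthetic embeddings, not the authors' full alternating training loop):

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        Z = np.vstack([rng.normal(c, 0.3, size=(100, 5)) for c in (0.0, 3.0)])   # stand-in autoencoder embeddings

        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

        # Within-class scatter: sum of outer products of deviations from each cluster centroid.
        S_w = np.zeros((Z.shape[1], Z.shape[1]))
        for k in np.unique(labels):
            Zk = Z[labels == k]
            d = Zk - Zk.mean(axis=0)
            S_w += d.T @ d

        eigvals, eigvecs = np.linalg.eigh(S_w)   # columns of eigvecs form the orthonormal transformation
        Z_new = Z @ eigvecs                      # rotated space; eigenvalues rank each direction's within-class scatter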
    Epsilon Consistent Mixup: Structural Regularization with an Adaptive Consistency-Interpolation Tradeoff. (arXiv:2104.09452v2 [stat.ML] UPDATED)
    (0 min) In this paper we propose $\epsilon$-Consistent Mixup ($\epsilon$mu). $\epsilon$mu is a data-based structural regularization technique that combines Mixup's linear interpolation with consistency regularization in the Mixup direction, by compelling a simple adaptive tradeoff between the two. This learnable combination of consistency and interpolation induces a more flexible structure on the evolution of the response across the feature space and is shown to improve semi-supervised classification accuracy on the SVHN and CIFAR10 benchmark datasets, yielding the largest gains in the most challenging low label-availability scenarios. Empirical studies comparing $\epsilon$mu and Mixup are presented and provide insight into the mechanisms behind $\epsilon$mu's effectiveness. In particular, $\epsilon$mu is found to produce more accurate synthetic labels and more confident predictions than Mixup.
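    To make the two ingredients concrete, the sketch below combines standard Mixup interpolation with a generic consistency penalty along the Mixup direction; the fixed eps weight here does not reproduce $\epsilon$mu's adaptive tradeoff and is only illustrative:

        import torch
        import torch.nn.functional as F

        def mixup_with_consistency(model, x, y, n_classes, alpha=1.0, eps=0.5):
            """Standard Mixup loss plus a generic consistency term along the Mixup direction.
            NOT the adaptive epsilon-mu tradeoff, just an illustration of the two ingredients."""
            lam = torch.distributions.Beta(alpha, alpha).sample().item()
            perm = torch.randperm(x.size(0))
            x_mix = lam * x + (1.0 - lam) * x[perm]

            logits_mix = model(x_mix)
            y_onehot = F.one_hot(y, n_classes).float()
            y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
            interp_loss = -(y_mix * F.log_softmax(logits_mix, dim=1)).sum(dim=1).mean()  # Mixup loss

            with torch.no_grad():
                p_ends = lam * F.softmax(model(x), dim=1) + (1.0 - lam) * F.softmax(model(x[perm]), dim=1)
            consistency = F.mse_loss(F.softmax(logits_mix, dim=1), p_ends)   # agreement along the Mixup direction
            return (1.0 - eps) * interp_loss + eps * consistency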
    A Prior Knowledge Based Tumor and Tumoral Subregion Segmentation Tool for Pediatric Brain Tumors. (arXiv:2109.14775v1 [eess.IV])
    (0 min) In the past few years, deep learning (DL) models have drawn great attention and shown superior performance on brain tumor and subregion segmentation tasks. However, this success is limited to the segmentation of adult gliomas, for which sufficient data have been collected, manually labeled, and published for training DL models. It is still challenging to segment pediatric tumors, because their appearances differ from adult gliomas. Hence, directly applying a pretrained DL model to pediatric data usually generates unacceptable results. Because pediatric data, both labeled and unlabeled, is very limited, we present a brain tumor segmentation model that is based on knowledge rather than learning from data. We also provide segmentation of more subregions for highly heterogeneous tumors such as atypical teratoid rhabdoid tumor (ATRT). Our proposed approach showed superior performance on both whole tumor and subregion segmentation tasks compared to DL-based models on our pediatric data when training data is not available for transfer learning.
    Real-Time Multi-Level Neonatal Heart and Lung Sound Quality Assessment for Telehealth Applications. (arXiv:2109.15127v1 [eess.AS])
    (0 min) Digital stethoscopes in combination with telehealth allow chest sounds to be easily collected and transmitted for remote monitoring and diagnosis. Chest sounds contain important information about a newborn's cardio-respiratory health. However, low-quality recordings complicate the remote monitoring and diagnosis. In this study, a new method is proposed to objectively and automatically assess heart and lung signal quality on a 5-level scale in real-time and to assess the effect of signal quality on vital sign estimation. For the evaluation, a total of 207 10s long chest sounds were taken from 119 preterm and full-term babies. Thirty of the recordings from ten subjects were obtained with synchronous vital signs from the Neonatal Intensive Care Unit (NICU) based on electrocardiogram recordings. As reference, seven annotators independently assessed the signal quality. For automatic quality classification, 400 features were extracted from the chest sounds. After feature selection using minimum redundancy and maximum relevancy algorithm, class balancing, and hyper-parameter optimization, a variety of multi-class and ordinal classification and regression algorithms were trained. Then, heart rate and breathing rate were automatically estimated from the chest sounds using adapted pre-existing methods. The results of subject-wise leave-one-out cross-validation show that the best-performing models had a mean squared error (MSE) of 0.49 and 0.61, and balanced accuracy of 57% and 51% for heart and lung qualities, respectively. The best-performing models for real-time analysis (<200ms) had MSE of 0.459 and 0.67, and balanced accuracy of 57% and 46%, respectively. Our experimental results underscore that increasing the signal quality leads to a reduction in vital sign error, with only high-quality recordings having a mean absolute error of less than 5 beats per minute, as required for clinical usage.
    Adapting Bandit Algorithms for Settings with Sequentially Available Arms. (arXiv:2109.15228v1 [cs.LG])
    (0 min) Although the classical version of the Multi-Armed Bandits (MAB) framework has been applied successfully to several practical problems, in many real-world applications, the possible actions are not presented to the learner simultaneously, such as in Internet campaign management and environmental monitoring settings. Instead, in such applications, a set of options is presented sequentially to the learner within a time span, and this process is repeated throughout a time horizon. At each time, the learner is asked whether to select the proposed option or not. We define this scenario as the Sequential Pull/No-pull Bandit setting, and we propose a meta-algorithm, namely Sequential Pull/No-pull for MAB (Seq), to adapt any classical MAB policy to better suit this setting for both the regret minimization and best-arm identification problems. By allowing the selection of multiple arms within a round, the proposed meta-algorithm gathers more information, especially in the first rounds, which are characterized by high uncertainty in the arms' estimated values. At the same time, the adapted algorithms provide the same theoretical guarantees as the classical policy employed. The Seq meta-algorithm was extensively tested and compared with classical MAB policies on synthetic and real-world datasets from advertising and environmental monitoring applications, highlighting its good empirical performance.
    Unified Data Collection for Visual-Inertial Calibration via Deep Reinforcement Learning. (arXiv:2109.14974v1 [cs.RO])
    (0 min) Visual-inertial sensors have a wide range of applications in robotics. However, good performance often requires different sophisticated motion routines to accurately calibrate camera intrinsics and inter-sensor extrinsics. This work presents a novel formulation to learn a motion policy to be executed on a robot arm for automatic data collection for calibrating intrinsics and extrinsics jointly. Our approach models the calibration process compactly using model-free deep reinforcement learning to derive a policy that guides the motions of a robotic arm holding the sensor to efficiently collect measurements that can be used for both camera intrinsic calibration and camera-IMU extrinsic calibration. Given the current pose and collected measurements, the learned policy generates the subsequent transformation that optimizes sensor calibration accuracy. The evaluations in simulation and on a real robotic system show that our learned policy generates favorable motion trajectories and collects enough measurements efficiently that yield the desired intrinsics and extrinsics with short path lengths. In simulation we are able to perform calibrations 10 times faster than hand-crafted policies, which transfers to a real-world speed up of 3 times over a human expert.
    Active Refinement for Multi-Label Learning: A Pseudo-Label Approach. (arXiv:2109.14676v1 [cs.LG])
    (0 min) The goal of multi-label learning (MLL) is to associate a given instance with its relevant labels from a set of concepts. Previous works of MLL mainly focused on the setting where the concept set is assumed to be fixed, while many real-world applications require introducing new concepts into the set to meet new demands. One common need is to refine the original coarse concepts and split them into finer-grained ones, where the refinement process typically begins with limited labeled data for the finer-grained concepts. To address the need, we formalize the problem into a special weakly supervised MLL problem to not only learn the fine-grained concepts efficiently but also allow interactive queries to strategically collect more informative annotations to further improve the classifier. The key idea within our approach is to learn to assign pseudo-labels to the unlabeled entries, and in turn leverage the pseudo-labels to train the underlying classifier and to inform a better query strategy. Experimental results demonstrate that our pseudo-label approach is able to accurately recover the missing ground truth, boosting the prediction performance significantly over the baseline methods and facilitating a competitive active learning strategy.
    Multi Scale Graph Wavenet for Spatio-temporal Wind Speed Forecasting. (arXiv:2109.15239v1 [cs.LG])
    (0 min) Geometric deep learning has gained tremendous attention in both academia and industry due to its inherent capability of representing arbitrary structures. Due to the exponential increase in interest towards renewable sources of energy, especially wind energy, accurate wind speed forecasting has become very important. In this paper, we propose a novel deep learning architecture, Multi Scale Graph Wavenet, for wind speed forecasting. It is based on a graph convolutional neural network and captures both spatial and temporal relationships in multivariate time series weather data for wind speed forecasting. We took inspiration from dilated convolutions, skip connections and the inception network to capture temporal relationships, and from graph convolutional networks to capture spatial relationships in the data. We conducted experiments on real wind speed data measured at different cities in Denmark and compared our results with state-of-the-art baseline models. Our novel architecture outperformed the state-of-the-art methods on wind speed forecasting for multiple forecast horizons by 4-5%.
    Boost-RS: Boosted Embeddings for Recommender Systems and its Application to Enzyme-Substrate Interaction Prediction. (arXiv:2109.14766v1 [q-bio.QM])
    (0 min) Despite experimental and curation efforts, the extent of enzyme promiscuity on substrates continues to be largely unexplored and under-documented. Recommender systems (RS), which are currently unexplored for the enzyme-substrate interaction prediction problem, can be utilized to provide enzyme recommendations for substrates, and vice versa. The performance of Collaborative-Filtering (CF) recommender systems, however, hinges on the quality of embedding vectors of users and items (enzymes and substrates in our case). Importantly, enhancing CF embeddings with heterogeneous auxiliary data, especially relational data (e.g., hierarchical, pairwise, or groupings), remains a challenge. We propose an innovative general RS framework, termed Boost-RS, that enhances RS performance by "boosting" embedding vectors through auxiliary data. Specifically, Boost-RS is trained and dynamically tuned on multiple relevant auxiliary learning tasks. Boost-RS utilizes contrastive learning tasks to exploit relational data. To show the efficacy of Boost-RS for the enzyme-substrate interaction prediction problem, we apply the Boost-RS framework to several baseline CF models. We show that each of our auxiliary tasks boosts learning of the embedding vectors, and that contrastive learning using Boost-RS outperforms attribute concatenation and multi-label learning. We also show that Boost-RS outperforms similarity-based models. Ablation studies and visualization of learned representations highlight the importance of using contrastive learning on some of the auxiliary data in boosting the embedding vectors.
    An Empirical Study of Accuracy, Fairness, Explainability, Distributional Robustness, and Adversarial Robustness. (arXiv:2109.14653v1 [cs.LG])
    (0 min) To ensure trust in AI models, it is becoming increasingly apparent that evaluation of models must be extended beyond traditional performance metrics, like accuracy, to other dimensions, such as fairness, explainability, adversarial robustness, and distribution shift. We describe an empirical study to evaluate multiple model types on various metrics along these dimensions on several datasets. Our results show that no particular model type performs well on all dimensions, and demonstrate the kinds of trade-offs involved in selecting models evaluated along multiple dimensions.
    iShape: A First Step Towards Irregular Shape Instance Segmentation. (arXiv:2109.15068v1 [cs.CV])
    (0 min) In this paper, we introduce a brand new dataset to promote the study of instance segmentation for objects with irregular shapes. Our key observation is that though irregularly shaped objects widely exist in daily life and industrial scenarios, they received little attention in the instance segmentation field due to the lack of corresponding datasets. To fill this gap, we propose iShape, an irregular shape dataset for instance segmentation. iShape contains six sub-datasets with one real and five synthetics, each represents a scene of a typical irregular shape. Unlike most existing instance segmentation datasets of regular objects, iShape has many characteristics that challenge existing instance segmentation algorithms, such as large overlaps between bounding boxes of instances, extreme aspect ratios, and large numbers of connected components per instance. We benchmark popular instance segmentation methods on iShape and find their performance drop dramatically. Hence, we propose an affinity-based instance segmentation algorithm, called ASIS, as a stronger baseline. ASIS explicitly combines perception and reasoning to solve Arbitrary Shape Instance Segmentation including irregular objects. Experimental results show that ASIS outperforms the state-of-the-art on iShape. Dataset and code are available at https://ishape.github.io
    Introducing the DOME Activation Functions. (arXiv:2109.14798v1 [cs.LG])
    (0 min) In this paper, we introduce a novel non-linear activation function that spontaneously induces class-compactness and regularization in the embedding space of neural networks. The function is dubbed DOME for Difference Of Mirrored Exponential terms. The basic form of the function can replace the sigmoid or the hyperbolic tangent functions as an output activation function for binary classification problems. The function can also be extended to the case of multi-class classification, and used as an alternative to the standard softmax function. It can also be further generalized to take more flexible shapes suitable for intermediate layers of a network. In this version of the paper, we only introduce the concept. In a subsequent version, experimental evaluation will be added.
    A Generalized Hierarchical Nonnegative Tensor Decomposition. (arXiv:2109.14820v1 [cs.LG])
    (0 min) Nonnegative matrix factorization (NMF) has found many applications including topic modeling and document analysis. Hierarchical NMF (HNMF) variants are able to learn topics at various levels of granularity and illustrate their hierarchical relationship. Recently, nonnegative tensor factorization (NTF) methods have been applied in a similar fashion in order to handle data sets with complex, multi-modal structure. Hierarchical NTF (HNTF) methods have been proposed, however these methods do not naturally generalize their matrix-based counterparts. Here, we propose a new HNTF model which directly generalizes a HNMF model special case, and provide a supervised extension. We also provide a multiplicative updates training method for this model. Our experimental results show that this model more naturally illuminates the topic hierarchy than previous HNMF and HNTF methods.
    Who Explains the Explanation? Quantitatively Assessing Feature Attribution Methods. (arXiv:2109.15035v1 [cs.LG])
    (0 min) AI explainability seeks to increase the transparency of models, making them more trustworthy in the process. The need for transparency has recently been motivated by the emergence of deep learning models, which are particularly obscure by nature. Even in the domain of images, where deep learning has succeeded the most, explainability is still poorly assessed. Multiple feature attribution methods have been proposed in the literature with the purpose of explaining a DL model's behavior using visual cues, but no standardized metrics to assess or select these methods exist. In this paper we propose a novel evaluation metric -- the Focus -- designed to quantify the faithfulness of explanations provided by feature attribution methods, such as LRP or GradCAM. First, we show the robustness of the metric through randomization experiments, and then use Focus to evaluate and compare three popular explainability techniques using multiple architectures and datasets. Our results find LRP and GradCAM to be consistent and reliable, the former being more accurate for high-performing models, while the latter remains most competitive even when applied to poorly performing models. Finally, we identify a strong relation between Focus and factors like model architecture and task, unveiling a new unsupervised approach for the assessment of models.
    Time-Distributed Feature Learning in Network Traffic Classification for Internet of Things. (arXiv:2109.14696v1 [cs.NI])
    (0 min) The plethora of Internet of Things (IoT) devices leads to explosive network traffic. Network traffic classification (NTC) is an essential tool for exploring the behaviours of network flows, and NTC is required for Internet service providers (ISPs) to manage the performance of the IoT network. We propose a novel network data representation, treating the traffic data as a series of images. Thus, the network data is realized as a video stream to employ time-distributed (TD) feature learning. The intra-temporal information within the network statistical data is learned using convolutional neural networks (CNN) and long short-term memory (LSTM), and the inter pseudo-temporal feature among the flows is learned by a TD multi-layer perceptron (MLP). We conduct experiments using a large dataset with a larger number of classes. The experimental results show that TD feature learning elevates the network classification performance by 10%.
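    A minimal sketch of the time-distributed arrangement described (a shared per-frame CNN followed by an LSTM; the TD MLP for inter-flow features is omitted, and all sizes are hypothetical):

        import torch
        import torch.nn as nn

        class TDTrafficClassifier(nn.Module):
            """Treats a flow's statistics as a sequence of 'frames': a shared CNN encodes each
            frame (time-distributed), an LSTM models intra-temporal structure, and an MLP classifies."""
            def __init__(self, n_classes):
                super().__init__()
                self.frame_cnn = nn.Sequential(
                    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(4), nn.Flatten(),          # -> 8 * 4 * 4 = 128 features per frame
                )
                self.lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
                self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, n_classes))

            def forward(self, x):                     # x: (batch, time, 1, H, W) sequence of traffic "images"
                b, t = x.shape[:2]
                feats = self.frame_cnn(x.flatten(0, 1)).view(b, t, -1)   # apply the shared CNN to every frame
                out, _ = self.lstm(feats)
                return self.head(out[:, -1])          # classify from the last time step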
    Extracting stochastic dynamical systems with $\alpha$-stable L\'evy noise from data. (arXiv:2109.14881v1 [stat.ML])
    (0 min) With the rapid increase of valuable observational, experimental and simulated data for complex systems, much effort has been devoted to identifying the governing laws underlying the evolution of these systems. Despite the wide relevance of non-Gaussian fluctuations in numerous physical phenomena, data-driven approaches to extract stochastic dynamical systems with (non-Gaussian) L\'evy noise are relatively few so far. In this work, we propose a data-driven method to extract stochastic dynamical systems with $\alpha$-stable L\'evy noise from short burst data based on the properties of $\alpha$-stable distributions. More specifically, we first estimate the L\'evy jump measure and noise intensity by computing the mean and variance of the amplitude of the increments of the sample paths. Then we approximate the drift coefficient by combining nonlocal Kramers-Moyal formulas with normalizing flows. Numerical experiments on one- and two-dimensional prototypical examples illustrate the accuracy and effectiveness of our method. This approach will become an effective scientific tool for discovering the stochastic governing laws of complex phenomena and understanding dynamical behaviors under non-Gaussian fluctuations.
    On Riemannian Approach for Constrained Optimization Model in Extreme Classification Problems. (arXiv:2109.15021v1 [math.OC])
    (0 min) We propose a novel Riemannian method for solving the extreme multi-label classification problem that exploits the geometric structure of sparse low-dimensional local embedding models. A constrained optimization problem is formulated as an optimization problem on a matrix manifold and solved using a Riemannian optimization method. The proposed approach is tested on several real-world large-scale multi-label datasets and its usefulness is demonstrated through numerical experiments. The numerical experiments suggest that the proposed method is the fastest to train and has the smallest model size among the embedding-based methods. An outline of the proof of convergence for the proposed Riemannian optimization method is also stated.
    DAAS: Differentiable Architecture and Augmentation Policy Search. (arXiv:2109.15273v1 [cs.LG])
    (0 min) Neural architecture search (NAS) has been an active direction of automatic machine learning (Auto-ML), aiming to explore efficient network structures. The searched architecture is evaluated by training on datasets with fixed data augmentation policies. However, recent works on auto-augmentation show that the suited augmentation policies can vary over different structures. Therefore, this work considers the possible coupling between neural architectures and data augmentation and proposes an effective algorithm jointly searching for them. Specifically, 1) for the NAS task, we adopt a single-path based differentiable method with Gumbel-softmax reparameterization strategy due to its memory efficiency; 2) for the auto-augmentation task, we introduce a novel search method based on policy gradient algorithm, which can significantly reduce the computation complexity. Our approach achieves 97.91% accuracy on CIFAR-10 and 76.6% Top-1 accuracy on ImageNet dataset, showing the outstanding performance of our search algorithm.
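    The single-path, Gumbel-softmax-based architecture choice in the NAS part can be sketched as a mixed operation whose selector is sampled with straight-through gradients (candidate operations and sizes are hypothetical; the memory tricks that make the search truly single-path are omitted):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GumbelMixedOp(nn.Module):
            """Selects one candidate op per forward pass via a hard Gumbel-softmax sample,
            keeping the choice differentiable with respect to the architecture logits."""
            def __init__(self, channels):
                super().__init__()
                self.ops = nn.ModuleList([
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.Conv2d(channels, channels, 5, padding=2),
                    nn.Identity(),
                ])
                self.arch_logits = nn.Parameter(torch.zeros(len(self.ops)))

            def forward(self, x, tau=1.0):
                weights = F.gumbel_softmax(self.arch_logits, tau=tau, hard=True)   # one-hot, straight-through
                # With hard=True only one op's output actually contributes to the sum.
                return sum(w * op(x) for w, op in zip(weights, self.ops))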
    A Study of Feature Selection and Extraction Algorithms for Cancer Subtype Prediction. (arXiv:2109.14648v1 [cs.LG])
    (0 min) In this work, we study and analyze different feature selection algorithms that can be used to classify cancer subtypes in case of highly varying high-dimensional data. We apply three different feature selection methods on five different types of cancers having two separate omics each. We show that the existing feature selection methods are computationally expensive when applied individually. Instead, we apply these algorithms sequentially which helps in lowering the computational cost and improving the predictive performance. We further show that reducing the number of features using some dimension reduction techniques can improve the performance of machine learning models in some cases. We support our findings through comprehensive data analysis and visualization.
    Compositional generalization in semantic parsing with pretrained transformers. (arXiv:2109.15101v1 [cs.CL])
    (0 min) Large-scale pretraining instills large amounts of knowledge in deep neural networks. This, in turn, improves the generalization behavior of these models in downstream tasks. What exactly are the limits to the generalization benefits of large-scale pretraining? Here, we report observations from some simple experiments aimed at addressing this question in the context of two semantic parsing tasks involving natural language, SCAN and COGS. We show that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization in these benchmarks, compared with models trained from scratch, even though both benchmarks are English-based. This demonstrates the surprisingly broad transferability of pretrained representations and knowledge. Pretraining with a large-scale protein sequence prediction task, on the other hand, mostly deteriorates the generalization performance in SCAN and COGS, suggesting that pretrained representations do not transfer universally and that there are constraints on the similarity between the pretraining and downstream domains for successful transfer. Finally, we show that larger models are harder to train from scratch and their generalization accuracy is lower when trained up to convergence on the relatively small SCAN and COGS datasets, but the benefits of large-scale pretraining become much clearer with larger models.
    Accelerating Perturbed Stochastic Iterates in Asynchronous Lock-Free Optimization. (arXiv:2109.15292v1 [math.OC])
    (0 min) We show that stochastic acceleration can be achieved under the perturbed iterate framework (Mania et al., 2017) in asynchronous lock-free optimization, which leads to the optimal incremental gradient complexity for finite-sum objectives. We prove that our new accelerated method requires the same linear speed-up condition as the existing non-accelerated methods. Our core algorithmic discovery is a new accelerated SVRG variant with sparse updates. Empirical results are presented to verify our theoretical findings.
    FathomNet: A global underwater image training set for enabling artificial intelligence in the ocean. (arXiv:2109.14646v1 [cs.CV])
    (0 min) Ocean-going platforms are integrating high-resolution camera feeds for observation and navigation, producing a deluge of visual data. The volume and rate of this data collection can rapidly outpace researchers' abilities to process and analyze them. Recent advances in machine learning enable fast, sophisticated analysis of visual data, but have had limited success in the oceanographic world due to lack of dataset standardization, sparse annotation tools, and insufficient formatting and aggregation of existing, expertly curated imagery for use by data scientists. To address this need, we have built FathomNet, a public platform that makes use of existing (and future), expertly curated data. Initial efforts have leveraged MBARI's Video Annotation and Reference System and annotated deep sea video database, which has more than 7M annotations, 1M framegrabs, and 5k terms in the knowledgebase, with additional contributions by National Geographic Society (NGS) and NOAA's Office of Ocean Exploration and Research. FathomNet has over 100k localizations of 1k midwater and benthic classes, and contains iconic and non-iconic views of marine animals, underwater equipment, debris, etc. We will demonstrate how machine learning models trained on FathomNet data can be applied across different institutional video data, (e.g., NGS' Deep Sea Camera System and NOAA's ROV Deep Discoverer), and enable automated acquisition and tracking of midwater animals using MBARI's ROV MiniROV. As FathomNet continues to develop and incorporate more image data from other oceanographic community members, this effort will enable scientists, explorers, policymakers, storytellers, and the public to understand and care for our ocean.
    Reinforcement Learning with Information-Theoretic Actuation. (arXiv:2109.15147v1 [cs.AI])
    (0 min) Reinforcement Learning formalises an embodied agent's interaction with the environment through observations, rewards and actions. But where do the actions come from? Actions are often considered to represent something external, such as the movement of a limb, a chess piece, or more generally, the output of an actuator. In this work we explore and formalize a contrasting view, namely that actions are best thought of as the output of a sequence of internal choices with respect to an action model. This view is particularly well-suited for leveraging the recent advances in large sequence models as prior knowledge for multi-task reinforcement learning problems. Our main contribution in this work is to show how to augment the standard MDP formalism with a sequential notion of internal action using information-theoretic techniques, and that this leads to self-consistent definitions of both internal and external action value functions.
    Robust Adversarial Classification via Abstaining. (arXiv:2104.02334v2 [cs.LG] UPDATED)
    (0 min) In this work, we consider a binary classification problem and cast it into a binary hypothesis testing framework, where the observations can be perturbed by an adversary. To improve the adversarial robustness of a classifier, we include an abstain option, where the classifier abstains from making a decision when it has low confidence about the prediction. We propose metrics to quantify the nominal performance of a classifier with an abstain option and its robustness against adversarial perturbations. We show that there exists a tradeoff between the two metrics regardless of what method is used to choose the abstain region. Our results imply that the robustness of a classifier with an abstain option can only be improved at the expense of its nominal performance. Further, we provide necessary conditions to design the abstain region for a 1-dimensional binary classification problem. We validate our theoretical results on the MNIST dataset, where we numerically show that the tradeoff between performance and robustness also exists for general multi-class classification problems.
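    A minimal way to see the nominal-performance/robustness tradeoff described above is a confidence-thresholded abstain rule on top of any probabilistic classifier; the threshold and dataset below are arbitrary and the setup is illustrative only.

        # Sketch: classification with an abstain option based on predicted confidence (scikit-learn).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

        proba = clf.predict_proba(Xte).max(axis=1)
        threshold = 0.8                     # abstain when the top-class probability is low
        decided = proba >= threshold
        preds = clf.predict(Xte)

        coverage = decided.mean()                                   # fraction of non-abstained points
        accuracy_on_decided = (preds[decided] == yte[decided]).mean()
        print(f"coverage={coverage:.2f}, accuracy on decided={accuracy_on_decided:.2f}")
        # Raising the threshold means fewer decisions (lower nominal coverage), but the
        # remaining decisions are typically harder to flip with small perturbations.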
    A Bayesian Approach to (Online) Transfer Learning: Theory and Algorithms. (arXiv:2109.01377v2 [cs.LG] UPDATED)
    (0 min) Transfer learning is a machine learning paradigm where knowledge from one problem is utilized to solve a new but related problem. While conceivable that knowledge from one task could be useful for solving a related task, if not executed properly, transfer learning algorithms can impair the learning performance instead of improving it -- commonly known as negative transfer. In this paper, we study transfer learning from a Bayesian perspective, where a parametric statistical model is used. Specifically, we study three variants of transfer learning problems, instantaneous, online, and time-variant transfer learning. For each problem, we define an appropriate objective function, and provide either exact expressions or upper bounds on the learning performance using information-theoretic quantities, which allow simple and explicit characterizations when the sample size becomes large. Furthermore, examples show that the derived bounds are accurate even for small sample sizes. The obtained bounds give valuable insights into the effect of prior knowledge for transfer learning, at least with respect to our Bayesian formulation of the transfer learning problem. In particular, we formally characterize the conditions under which negative transfer occurs. Lastly, we devise two (online) transfer learning algorithms that are amenable to practical implementations, one of which does not require the parametric assumption. We demonstrate the effectiveness of our algorithms with real data sets, focusing primarily on when the source and target data have strong similarities.
    A system on chip for melanoma detection using FPGA-based SVM classifier. (arXiv:2109.14840v1 [cs.LG])
    (0 min) Support Vector Machine (SVM) is a robust machine learning model that shows high accuracy on different classification problems and is widely used for various embedded applications. However, implementing embedded SVM classifiers is challenging due to the complicated computations required. This motivates implementing the SVM on hardware platforms to achieve high-performance computing at low cost and power consumption. Melanoma is the most aggressive form of skin cancer and substantially increases mortality. We aim to develop an optimized embedded SVM classifier dedicated to a low-cost handheld device for early detection of melanoma in primary healthcare. In this paper, we propose a hardware/software co-design for implementing the SVM classifier on an FPGA to realize melanoma detection on a chip. The SVM, implemented on a recent hybrid FPGA (Zynq) platform using the modern UltraFast High-Level Synthesis design methodology, achieves efficient melanoma classification on chip. The hardware implementation results demonstrate a classification accuracy of 97.9%, a significant hardware acceleration factor of 21 with only 3% resource utilization, and 1.69 W power consumption. These results show that the implemented system on chip meets crucial embedded-system constraints of high performance and low resource utilization, power consumption, and cost, while achieving high classification accuracy.
    Deep Neural Compression Via Concurrent Pruning and Self-Distillation. (arXiv:2109.15014v1 [cs.LG])
    (0 min) Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel \emph{self-distillation} based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed {\em cross-correlation objective for self-distilled pruning} implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against (6 times) larger distilled networks. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization.
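    The cross-correlation idea can be conveyed with a generic sketch: standardize the batch representations of the pruned and unpruned networks, build their cross-correlation matrix, and penalize deviation from the identity, alongside simple magnitude pruning. This is my own sketch of such an objective, not the paper's implementation; the feature sizes and weights are placeholders.

        # Sketch: cross-correlation self-distillation loss between pruned and unpruned features (PyTorch).
        import torch

        def cross_correlation_loss(z_unpruned, z_pruned, off_diag_weight=5e-3):
            # Standardize each feature dimension over the batch.
            z1 = (z_unpruned - z_unpruned.mean(0)) / (z_unpruned.std(0) + 1e-6)
            z2 = (z_pruned - z_pruned.mean(0)) / (z_pruned.std(0) + 1e-6)
            n, _ = z1.shape
            c = z1.T @ z2 / n                               # d x d cross-correlation matrix
            on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # matched features should stay correlated
            off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
            return on_diag + off_diag_weight * off_diag

        def magnitude_prune_(weight, sparsity=0.5):
            # Zero out the smallest-magnitude weights in place (simple unstructured pruning).
            k = int(weight.numel() * sparsity)
            threshold = weight.abs().flatten().kthvalue(k).values
            weight.data[weight.abs() <= threshold] = 0.0

        z_full = torch.randn(32, 128)
        z_pruned = z_full + 0.1 * torch.randn(32, 128)      # stand-in for the pruned network's features
        print(cross_correlation_loss(z_full, z_pruned).item())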
    Causal Matrix Completion. (arXiv:2109.15154v1 [econ.EM])
    (0 min) Matrix completion is the study of recovering an underlying matrix from a sparse subset of noisy observations. Traditionally, it is assumed that the entries of the matrix are "missing completely at random" (MCAR), i.e., each entry is revealed at random, independent of everything else, with uniform probability. This is likely unrealistic due to the presence of "latent confounders", i.e., unobserved factors that determine both the entries of the underlying matrix and the missingness pattern in the observed matrix. For example, in the context of movie recommender systems -- a canonical application for matrix completion -- a user who vehemently dislikes horror films is unlikely to ever watch horror films. In general, these confounders yield "missing not at random" (MNAR) data, which can severely impact any inference procedure that does not correct for this bias. We develop a formal causal model for matrix completion through the language of potential outcomes, and provide novel identification arguments for a variety of causal estimands of interest. We design a procedure, which we call "synthetic nearest neighbors" (SNN), to estimate these causal estimands. We prove finite-sample consistency and asymptotic normality of our estimator. Our analysis also leads to new theoretical results for the matrix completion literature. In particular, we establish entry-wise, i.e., max-norm, finite-sample consistency and asymptotic normality results for matrix completion with MNAR data. As a special case, this also provides entry-wise bounds for matrix completion with MCAR data. Across simulated and real data, we demonstrate the efficacy of our proposed estimator.
    Formal derivation of Mesh Neural Networks with their Forward-Only gradient Propagation. (arXiv:1905.06684v6 [cs.LG] UPDATED)
    (0 min) This paper proposes the Mesh Neural Network (MNN), a novel architecture which allows neurons to be connected in any topology in order to efficiently route information. In MNNs, information is propagated between neurons through a state transition function. State and error gradients are then computed directly from state updates without backward computation. The MNN architecture and the error propagation scheme are formalized and derived in tensor algebra. The proposed computational model can fully supply a gradient descent process and is potentially suitable for very large scale sparse NNs, due to its expressivity and training efficiency, with respect to NNs based on back-propagation and computational graphs.
    Segmentation of Roads in Satellite Images using specially modified U-Net CNNs. (arXiv:2109.14671v1 [cs.CV])
    (0 min) The image classification problem has been deeply investigated by the research community, both with classical computer vision algorithms and with neural networks. The aim of this paper is to build an image classifier for satellite images of urban scenes that identifies the portions of the images in which a road is located, separating these portions from the rest. Unlike conventional computer vision algorithms, convolutional neural networks (CNNs) provide accurate and reliable results on this task. Our approach uses a sliding window to extract patches from the whole image, data augmentation to generate more training/testing data, and lastly a series of specially modified U-Net CNNs. The proposed technique outperforms all other baselines tested in terms of mean F-score.
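    The patch-extraction step described above can be sketched with a plain sliding window over the image array; the patch size and stride below are placeholders, not the paper's settings.

        # Sketch: sliding-window patch extraction for satellite-image segmentation (NumPy).
        import numpy as np

        def extract_patches(image, patch=128, stride=64):
            """Yield (row, col, patch) tuples covering the image with overlap."""
            h, w = image.shape[:2]
            for r in range(0, h - patch + 1, stride):
                for c in range(0, w - patch + 1, stride):
                    yield r, c, image[r:r + patch, c:c + patch]

        img = np.zeros((400, 400, 3), dtype=np.uint8)     # toy stand-in for a satellite tile
        patches = list(extract_patches(img))
        print(len(patches), patches[0][2].shape)          # 25 (128, 128, 3)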
    Modeling Interactions of Autonomous Vehicles and Pedestrians with Deep Multi-Agent Reinforcement Learning for Collision Avoidance. (arXiv:2109.15266v1 [cs.RO])
    (0 min) Reliable pedestrian crash avoidance mitigation (PCAM) systems are crucial components of safe autonomous vehicles (AVs). The sequential nature of the vehicle-pedestrian interaction, i.e., where immediate decisions of one agent directly influence the following decisions of the other agent, is an often neglected but important aspect. In this work, we model the corresponding interaction sequence as a Markov decision process (MDP) that is solved by deep reinforcement learning (DRL) algorithms to define the PCAM system's policy. The simulated driving scenario is based on an AV acting as a DRL agent driving along an urban street, facing a pedestrian at an unmarked crosswalk who tries to cross. Since modeling realistic crossing behavior of the pedestrian is challenging, we introduce two levels of intelligent pedestrian behavior: While the baseline model follows a predefined strategy, our advanced model captures continuous learning and the inherent uncertainty in human behavior by defining the pedestrian as a second DRL agent, i.e., we introduce a deep multi-agent reinforcement learning (DMARL) problem. The presented PCAM system with different levels of intelligent pedestrian behavior is benchmarked according to the agents' collision rate and the resulting traffic flow efficiency. In this analysis, our focus lies on evaluating the influence of observation noise on the decision making of the agents. The results show that the AV is able to completely mitigate collisions under the majority of the investigated conditions and that the DRL-based pedestrian model indeed learns a more human-like crossing behavior.
    Transformer-based Map Matching Model with Limited Ground-Truth Data using Transfer-Learning Approach. (arXiv:2108.00439v3 [cs.LG] UPDATED)
    (0 min) In many spatial trajectory-based applications, it is necessary to map raw trajectory data points onto road networks in digital maps, which is commonly referred to as a map-matching process. While most previous map-matching methods have focused on using rule-based algorithms to deal with the map-matching problems, in this paper, we consider the map-matching task from the data-driven perspective, proposing a deep learning-based map-matching model. We build a Transformer-based map-matching model with a transfer learning approach. We generate trajectory data to pre-train the Transformer model and then fine-tune the model with a limited number of ground-truth data to minimize the model development cost and reduce the real-to-virtual gap. Three metrics (Average Hamming Distance, F-score, and BLEU) at two levels (point and segment level) are used to evaluate the model performance. The results indicate that the proposed model outperforms existing models. Furthermore, we use the attention weights of the Transformer to plot the map-matching process and find how the model matches the road segments correctly.
    SPATE-GAN: Improved Generative Modeling of Dynamic Spatio-Temporal Patterns with an Autoregressive Embedding Loss. (arXiv:2109.15044v1 [cs.LG])
    (0 min) From ecology to atmospheric sciences, many academic disciplines deal with data characterized by intricate spatio-temporal complexities, the modeling of which often requires specialized approaches. Generative models of these data are of particular interest, as they enable a range of impactful downstream applications like simulation or creating synthetic training data. Recent work has highlighted the potential of generative adversarial nets (GANs) for generating spatio-temporal data. A new GAN algorithm COT-GAN, inspired by the theory of causal optimal transport (COT), was proposed in an attempt to better tackle this challenge. However, the task of learning more complex spatio-temporal patterns requires additional knowledge of their specific data structures. In this study, we propose a novel loss objective combined with COT-GAN based on an autoregressive embedding to reinforce the learning of spatio-temporal dynamics. We devise SPATE (spatio-temporal association), a new metric measuring spatio-temporal autocorrelation by using the deviance of observations from their expected values. We compute SPATE for real and synthetic data samples and use it to compute an embedding loss that considers space-time interactions, nudging the GAN to learn outputs that are faithful to the observed dynamics. We test this new objective on a diverse set of complex spatio-temporal patterns: turbulent flows, log-Gaussian Cox processes and global weather data. We show that our novel embedding loss improves performance without any changes to the architecture of the COT-GAN backbone, highlighting our model's increased capacity for capturing autoregressive structures. We also contextualize our work with respect to recent advances in physics-informed deep learning and interdisciplinary work connecting neural networks with geographic and geophysical sciences.
    Deep neural networks with controlled variable selection for the identification of putative causal genetic variants. (arXiv:2109.14719v1 [cs.LG])
    (0 min) Deep neural networks (DNN) have been used successfully in many scientific problems for their high prediction accuracy, but their application to genetic studies remains challenging due to their poor interpretability. In this paper, we consider the problem of scalable, robust variable selection in DNN for the identification of putative causal genetic variants in genome sequencing studies. We identified a pronounced randomness in feature selection in DNN due to its stochastic nature, which may hinder interpretability and give rise to misleading results. We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies. The merit of the proposed method includes: (1) flexible modelling of the non-linear effect of genetic variants to improve statistical power; (2) multiple knockoffs in the input layer to rigorously control false discovery rate; (3) hierarchical layers to substantially reduce the number of weight parameters and activations to improve computational efficiency; (4) de-randomized feature selection to stabilize identified signals. We evaluated the proposed method in extensive simulation studies and applied it to the analysis of Alzheimer disease genetics. We showed that the proposed method, when compared to conventional linear and nonlinear methods, can lead to substantially more discoveries.
    SCIMAT: Science and Mathematics Dataset. (arXiv:2109.15005v1 [math.NA])
    (0 min) In this work, we announce a comprehensive, well-curated and open-source dataset with millions of samples for pre-college and college level problems in mathematics and science. A preliminary set of results using a Transformer architecture with character-to-character encoding is shown. The dataset identifies some challenging problems and invites research on better architecture search.
    Stroke recovery phenotyping through network trajectory approaches and graph neural networks. (arXiv:2109.14659v1 [q-bio.QM])
    (0 min) Stroke is a leading cause of neurological injury characterized by impairments in multiple neurological domains including cognition, language, sensory and motor functions. Clinical recovery in these domains is tracked using a wide range of measures that may be continuous, ordinal, interval or categorical in nature, which presents challenges for standard multivariate regression approaches. This has hindered stroke researchers' ability to achieve an integrated picture of the complex time-evolving interactions amongst symptoms. Here we use tools from network science and machine learning that are particularly well-suited to extracting underlying patterns in such data, and may assist in prediction of recovery patterns. To demonstrate the utility of this approach, we analyzed data from the NINDS tPA trial using the Trajectory Profile Clustering (TPC) method to identify distinct stroke recovery patterns for 11 different neurological domains at 5 discrete time points. Our analysis identified 3 distinct stroke trajectory profiles that align with clinically relevant stroke syndromes, characterized both by distinct clusters of symptoms, as well as differing degrees of symptom severity. We then validated our approach using graph neural networks to determine how well our model performed predictively for stratifying patients into these trajectory profiles at early vs. later time points post-stroke. We demonstrate that trajectory profile clustering is an effective method for identifying clinically relevant recovery subtypes in multidimensional longitudinal datasets, and for early prediction of symptom progression subtypes in individual patients. This paper is the first work introducing network trajectory approaches for stroke recovery phenotyping, and is aimed at enhancing the translation of such novel computational approaches for practical clinical application.
    A Privacy-preserving Distributed Training Framework for Cooperative Multi-agent Deep Reinforcement Learning. (arXiv:2109.14998v1 [cs.LG])
    (0 min) Deep Reinforcement Learning (DRL) sometimes needs a large amount of data to converge during training, and in some cases each action of the agent may produce regret. This barrier naturally motivates different dataset or environment owners to cooperate, sharing their knowledge to train their agents more efficiently. However, directly merging raw data from different owners raises privacy concerns. To solve this problem, we propose a new Deep Neural Network (DNN) architecture with both a global NN and a local NN, together with a distributed training framework. We allow the global weights to be updated by all the collaborating agents, while the local weights are only updated by the agent they belong to. In this way, the global weights can share common knowledge among the collaborators, while the local NN keeps specialized properties and keeps each agent compatible with its specific environment. Experiments show that the framework can efficiently help agents in the same or similar environments to collaborate during training and attain a higher convergence rate and better performance.
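    One way to read the global-plus-local design is a network whose first layers are averaged across agents after each round while the head stays agent-specific. The sketch below illustrates that split with made-up layer sizes; it is not the paper's architecture or training loop.

        # Sketch: per-agent networks with a shared "global" trunk and private "local" heads (PyTorch).
        import torch
        import torch.nn as nn

        class AgentNet(nn.Module):
            def __init__(self, obs_dim=8, n_actions=4):
                super().__init__()
                self.global_trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())  # shared knowledge
                self.local_head = nn.Linear(64, n_actions)                            # agent-specific

            def forward(self, x):
                return self.local_head(self.global_trunk(x))

        def average_global_weights(agents):
            # Average only the global trunk parameters across agents; local heads stay private.
            ref = agents[0].global_trunk.state_dict()
            avg = {k: torch.stack([a.global_trunk.state_dict()[k] for a in agents]).mean(0) for k in ref}
            for a in agents:
                a.global_trunk.load_state_dict(avg)

        agents = [AgentNet() for _ in range(3)]
        average_global_weights(agents)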
    Robust Segmentation Models using an Uncertainty Slice Sampling Based Annotation Workflow. (arXiv:2109.14879v1 [cs.CV])
    (0 min) Semantic segmentation neural networks require pixel-level annotations in large quantities to achieve a good performance. In the medical domain, such annotations are expensive, because they are time-consuming and require expert knowledge. Active learning optimizes the annotation effort by devising strategies to select cases for labeling that are most informative to the model. In this work, we propose an uncertainty slice sampling (USS) strategy for semantic segmentation of 3D medical volumes that selects 2D image slices for annotation and compare it with various other strategies. We demonstrate the efficiency of USS on a CT liver segmentation task using multi-site data. After five iterations, the training data resulting from USS consisted of 2410 slices (4% of all slices in the data pool) compared to 8121 (13%), 8641 (14%), and 3730 (6%) for uncertainty volume (UVS), random volume (RVS), and random slice (RSS) sampling, respectively. Despite being trained on the smallest amount of data, the model based on the USS strategy evaluated on 234 test volumes significantly outperformed models trained according to other strategies and achieved a mean Dice index of 0.964, a relative volume error of 4.2%, a mean surface distance of 1.35 mm, and a Hausdorff distance of 23.4 mm. This was only slightly inferior to 0.967, 3.8%, 1.18 mm, and 22.9 mm achieved by a model trained on all available data, but the robustness analysis using the 5th percentile of Dice and the 95th percentile of the remaining metrics demonstrated that USS resulted not only in the most robust model compared to other sampling schemes, but also outperformed the model trained on all data according to Dice (0.946 vs. 0.945) and mean surface distance (1.92 mm vs. 2.03 mm).
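    A simple stand-in for the slice-level uncertainty ranking is to score each 2D slice by the mean voxel-wise entropy of the model's predicted probabilities and pick the top slices for annotation. The sketch below assumes per-voxel foreground probabilities are already available; it is an illustration of the general idea, not the USS strategy itself.

        # Sketch: rank 2D slices of a 3D volume by mean prediction entropy (NumPy).
        import numpy as np

        def slice_uncertainty(prob_volume, eps=1e-8):
            """prob_volume: (slices, H, W) foreground probabilities in [0, 1]."""
            p = np.clip(prob_volume, eps, 1 - eps)
            entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))   # binary entropy per voxel
            return entropy.mean(axis=(1, 2))                        # one score per slice

        rng = np.random.default_rng(0)
        probs = rng.uniform(size=(50, 64, 64))                      # toy model output for 50 slices
        scores = slice_uncertainty(probs)
        to_annotate = np.argsort(scores)[::-1][:5]                  # 5 most uncertain slices
        print(to_annotate)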
    Focused Contrastive Training for Test-based Constituency Analysis. (arXiv:2109.15159v1 [cs.CL])
    (0 min) We propose a scheme for self-training of grammaticality models for constituency analysis based on linguistic tests. A pre-trained language model is fine-tuned by contrastive estimation of grammatical sentences from a corpus and ungrammatical sentences that were perturbed by a syntactic test, a transformation motivated by constituency theory. We show that consistent gains can be achieved if only certain positive instances are chosen for training, depending on whether they could be the result of a test transformation. This way, the positives and negatives exhibit similar characteristics, which makes the objective more challenging for the language model and also allows for additional markup indicating the position of the test application within the sentence.
    Coding for Straggler Mitigation in Federated Learning. (arXiv:2109.15226v1 [cs.LG])
    (0 min) We present a novel coded federated learning (FL) scheme for linear regression that mitigates the effect of straggling devices while retaining the privacy level of conventional FL. The proposed scheme combines one-time padding to preserve privacy and gradient codes to yield resiliency against stragglers and consists of two phases. In the first phase, the devices share a one-time padded version of their local data with a subset of other devices. In the second phase, the devices and the central server collaboratively and iteratively train a global linear model using gradient codes on the one-time padded local data. To apply one-time padding to real data, our scheme exploits a fixed-point arithmetic representation of the data. Unlike the coded FL scheme recently introduced by Prakash et al., the proposed scheme maintains the same level of privacy as conventional FL while achieving a similar training time. Compared to conventional FL, we show that the proposed scheme achieves a training speed-up factor of $6.6$ and $9.2$ on the MNIST and Fashion-MNIST datasets for an accuracy of $95\%$ and $85\%$, respectively.
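    The one-time padding over fixed-point data can be sketched as modular addition of a random pad to integer-quantized values; the scale and word size below are arbitrary choices, and the sketch only illustrates the padding idea, not the full coded-FL protocol.

        # Sketch: one-time padding of fixed-point-encoded data via modular arithmetic (NumPy).
        import numpy as np

        SCALE, BITS = 2 ** 10, 32
        MOD = 2 ** BITS

        def to_fixed(x):                      # quantize real-valued features to integers
            return np.round(x * SCALE).astype(np.int64) % MOD

        def from_fixed(x):                    # map back to (signed) reals
            x = np.where(x >= MOD // 2, x - MOD, x)
            return x / SCALE

        rng = np.random.default_rng(0)
        data = rng.normal(size=5)
        pad = rng.integers(0, MOD, size=data.shape, dtype=np.int64)   # shared one-time pad

        padded = (to_fixed(data) + pad) % MOD          # what a device would share
        recovered = from_fixed((padded - pad) % MOD)   # only holders of the pad can undo it
        print(np.allclose(recovered, data, atol=1 / SCALE))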
    Extended Tactile Perception: Vibration Sensing through Tools and Grasped Objects. (arXiv:2106.00489v2 [cs.RO] UPDATED)
    (0 min) Humans display the remarkable ability to sense the world through tools and other held objects. For example, we are able to pinpoint impact locations on a held rod and tell apart different textures using a rigid probe. In this work, we consider how we can enable robots to have a similar capacity, i.e., to embody tools and extend perception using standard grasped objects. We propose that vibro-tactile sensing, using dynamic tactile sensors on the robot fingers together with machine learning models, enables robots to decipher contact information that is transmitted as vibrations along rigid objects. This paper reports on extensive experiments using the BioTac micro-vibration sensor and a new event dynamic sensor, the NUSkin, capable of multi-taxel sensing at 4 kHz. We demonstrate that fine localization on a held rod is possible using our approach (with errors less than 1 cm on a 20 cm rod). Next, we show that vibro-tactile perception can lead to reasonable grasp stability prediction during object handover, and accurate food identification using a standard fork. We find that multi-taxel vibro-tactile sensing at a sufficiently high sampling rate led to the best performance across the various tasks and objects. Taken together, our results provide both evidence and guidelines for using vibro-tactile sensing to extend tactile perception, which we believe will lead to enhanced competency with tools and better physical human-robot interaction.
    Kernel distance measures for time series, random fields and other structured data. (arXiv:2109.14752v1 [stat.ML])
    (0 min) This paper introduces kdiff, a novel kernel-based measure for estimating distances between instances of time series, random fields and other forms of structured data. This measure is based on the idea of matching distributions that only overlap over a portion of their region of support. Our proposed measure is inspired by MPdist which has been previously proposed for such datasets and is constructed using Euclidean metrics, whereas kdiff is constructed using non-linear kernel distances. Also, kdiff accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution. Comparing the cross similarity to self similarity allows for measures of similarity that are more robust to noise and partial occlusions of the relevant signals. Our proposed measure kdiff is a more general form of the well known kernel-based Maximum Mean Discrepancy (MMD) distance estimated over the embeddings. Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems where the embedding distributions can be modeled as two component mixtures. Applications are demonstrated for clustering of synthetic and real-life time series and image data, and the performance of kdiff is compared to competing distance measures for clustering.
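    To convey the flavour of a quantile-based kernel distance between two sets of instances, the sketch below compares a lower quantile of the cross kernel distances with the corresponding self distances. This is a generic illustration under my own simplifications (RBF kernel, fixed quantile), not the kdiff definition from the paper.

        # Sketch: a quantile-based kernel distance between two sets of time-series windows (NumPy).
        import numpy as np

        def rbf_dist(a, b, gamma=0.1):
            # Kernel-induced distance d(x, y) = sqrt(k(x,x) + k(y,y) - 2 k(x,y)) for an RBF kernel.
            sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            k = np.exp(-gamma * sq)
            return np.sqrt(np.clip(2.0 - 2.0 * k, 0.0, None))

        def quantile_kernel_distance(X, Y, q=0.1):
            cross = np.quantile(rbf_dist(X, Y), q)            # best partial matches across sets
            self_x = np.quantile(rbf_dist(X, X), q)
            self_y = np.quantile(rbf_dist(Y, Y), q)
            return cross - 0.5 * (self_x + self_y)            # compare cross to self similarity

        rng = np.random.default_rng(0)
        X = rng.normal(size=(40, 16))            # windows from one series
        Y = rng.normal(loc=0.5, size=(40, 16))   # windows from another series
        print(quantile_kernel_distance(X, Y))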
    An Automated Scanning Transmission Electron Microscope Guided by Sparse Data Analytics. (arXiv:2109.14772v1 [cond-mat.mtrl-sci])
    (0 min) Artificial intelligence (AI) promises to reshape scientific inquiry and enable breakthrough discoveries in areas such as energy storage, quantum computing, and biomedicine. Scanning transmission electron microscopy (STEM), a cornerstone of the study of chemical and materials systems, stands to benefit greatly from AI-driven automation. However, present barriers to low-level instrument control, as well as generalizable and interpretable feature detection, make truly automated microscopy impractical. Here, we discuss the design of a closed-loop instrument control platform guided by emerging sparse data analytics. We demonstrate how a centralized controller, informed by machine learning combining limited a priori knowledge and task-based discrimination, can drive on-the-fly experimental decision-making. This platform unlocks practical, automated analysis of a variety of material features, enabling new high-throughput and statistical studies.
    Chest X-Rays Image Classification from beta-Variational Autoencoders Latent Features. (arXiv:2109.14760v1 [eess.IV])
    (0 min) Chest X-Ray (CXR) is one of the most common diagnostic techniques used in everyday clinical practice all around the world. We present a work that investigates and analyses the use of Deep Learning (DL) techniques to extract information from such images and classify them, keeping our methodology as general as possible so that it could also be used in a real-world scenario without much effort in the future. To move in this direction, we trained several beta-Variational Autoencoder (beta-VAE) models on the CheXpert dataset, one of the largest publicly available collections of labeled CXR images; from these models, latent features were extracted and used to train other machine learning models able to classify the original images from the features extracted by the beta-VAE. Lastly, tree-based models were combined in ensembles to improve the results without the need for further training or model engineering. While expecting some drop in pure performance with respect to state-of-the-art classification-specific models, we obtained encouraging results, which show the viability of our approach and the usability of the high-level features extracted by the autoencoders for classification tasks.
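    The downstream part of such a pipeline, classifying images from autoencoder latents, can be sketched independently of the beta-VAE itself: below, a stand-in encoder produces latent vectors and a gradient-boosted tree model is fit on them. The encoder here is a random projection placeholder, not a trained beta-VAE, and the data are synthetic.

        # Sketch: classify images from (stand-in) autoencoder latent features with tree models (scikit-learn).
        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(0)

        def encode(images):
            # Placeholder for a trained encoder: here just a random projection to 32 dims.
            proj = rng.normal(size=(images.shape[1], 32))
            return images @ proj

        images = rng.normal(size=(500, 1024))        # toy flattened CXR images
        labels = rng.integers(0, 2, size=500)        # toy pathology labels

        Z = encode(images)
        Ztr, Zte, ytr, yte = train_test_split(Z, labels, random_state=0)
        clf = GradientBoostingClassifier().fit(Ztr, ytr)
        print(roc_auc_score(yte, clf.predict_proba(Zte)[:, 1]))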
    How Neural Processes Improve Graph Link Prediction. (arXiv:2109.14894v1 [cs.LG])
    (0 min) Link prediction is a fundamental problem in graph data analysis. While most of the literature focuses on transductive link prediction that requires all the graph nodes and majority of links in training, inductive link prediction, which only uses a proportion of the nodes and their links in training, is a more challenging problem in various real-world applications. In this paper, we propose a meta-learning approach with graph neural networks for link prediction: Neural Processes for Graph Neural Networks (NPGNN), which can perform both transductive and inductive learning tasks and adapt to patterns in a large new graph after training with a small subgraph. Experiments on real-world graphs are conducted to validate our model, where the results suggest that the proposed method achieves stronger performance compared to other state-of-the-art models, and meanwhile generalizes well when training on a small subgraph.
    Learning to synthesise the ageing brain without longitudinal data. (arXiv:1912.02620v6 [eess.IV] UPDATED)
    (0 min) How will my face look when I get older? Or, for a more challenging question: How will my brain look when I get older? To answer this question one must devise (and learn from data) a multivariate auto-regressive function which given an image and a desired target age generates an output image. While collecting data for faces may be easier, collecting longitudinal brain data is not trivial. We propose a deep learning-based method that learns to simulate subject-specific brain ageing trajectories without relying on longitudinal data. Our method synthesises images conditioned on two factors: age (a continuous variable), and status of Alzheimer's Disease (AD, an ordinal variable). With an adversarial formulation we learn the joint distribution of brain appearance, age and AD status, and define reconstruction losses to address the challenging problem of preserving subject identity. We compare with several benchmarks using two widely used datasets. We evaluate the quality and realism of synthesised images using ground-truth longitudinal data and a pre-trained age predictor. We show that, despite the use of cross-sectional data, our model learns patterns of gray matter atrophy in the middle temporal gyrus in patients with AD. To demonstrate generalisation ability, we train on one dataset and evaluate predictions on the other. In conclusion, our model shows an ability to separate age, disease influence and anatomy using only 2D cross-sectional data that should be useful in large studies into neurodegenerative disease, that aim to combine several data sources. To facilitate such future studies by the community at large our code is made available at https://github.com/xiat0616/BrainAgeing.
    Convergence of Gradient Algorithms for Nonconvex C^{1+alpha} Cost Functions. (arXiv:2012.00628v3 [math.OC] UPDATED)
    (0 min) This paper is concerned with the convergence of stochastic gradient algorithms with momentum terms in the nonconvex setting. A class of stochastic momentum methods, including stochastic gradient descent, heavy ball, and Nesterov's accelerated gradient, is analyzed in a general framework under mild assumptions. Based on the convergence result for expected gradients, we prove almost sure convergence through a detailed discussion of the effects of momentum and the number of upcrossings. It is worth noting that no additional restrictions are imposed on the objective function or step size. Another improvement over previous results is that the usual Lipschitz condition on the gradient is relaxed to Hölder continuity. As a byproduct, we apply a localization procedure to extend our results to stochastic step sizes.
    Unsupervised Few-Shot Action Recognition via Action-Appearance Aligned Meta-Adaptation. (arXiv:2109.15317v1 [cs.CV])
    (0 min) We present MetaUVFS as the first Unsupervised Meta-learning algorithm for Video Few-Shot action recognition. MetaUVFS leverages over 550K unlabeled videos to train a two-stream 2D and 3D CNN architecture via contrastive learning to capture the appearance-specific spatial and action-specific spatio-temporal video features respectively. MetaUVFS comprises a novel Action-Appearance Aligned Meta-adaptation (A3M) module that learns to focus on the action-oriented video features in relation to the appearance features via explicit few-shot episodic meta-learning over unsupervised hard-mined episodes. Our action-appearance alignment and explicit few-shot learner conditions the unsupervised training to mimic the downstream few-shot task, enabling MetaUVFS to significantly outperform all unsupervised methods on few-shot benchmarks. Moreover, unlike previous few-shot action recognition methods that are supervised, MetaUVFS needs neither base-class labels nor a supervised pretrained backbone. Thus, we need to train MetaUVFS just once to perform competitively or sometimes even outperform state-of-the-art supervised methods on popular HMDB51, UCF101, and Kinetics100 few-shot datasets.
    Feature-Rich Named Entity Recognition for Bulgarian Using Conditional Random Fields. (arXiv:2109.15121v1 [cs.CL])
    (0 min) The paper presents a feature-rich approach to the automatic recognition and categorization of named entities (persons, organizations, locations, and miscellaneous) in news text for Bulgarian. We combine well-established features used for other languages with language-specific lexical, syntactic and morphological information. In particular, we make use of the rich tagset annotation of the BulTreeBank (680 morpho-syntactic tags), from which we derive suitable task-specific tagsets (local and nonlocal). We further add domain-specific gazetteers and additional unlabeled data, achieving F1=89.4%, which is comparable to the state-of-the-art results for English.
    Workflow Augmentation of Video Data for Event Recognition with Time-Sensitive Neural Networks. (arXiv:2109.15063v1 [eess.IV])
    (0 min) Supervised training of neural networks requires large, diverse and well annotated data sets. In the medical field, this is often difficult to achieve due to constraints in time, expert knowledge and the prevalence of an event. Artificial data augmentation can help to prevent overfitting and improve the detection of rare events as well as overall performance. However, most augmentation techniques use purely spatial transformations, which are not sufficient for video data with temporal correlations. In this paper, we present a novel methodology for workflow augmentation and demonstrate its benefit for event recognition in cataract surgery. The proposed approach increases the frequency of event alternation by creating artificial videos. The original video is split into event segments and a workflow graph is extracted from the original annotations. Finally, the segments are assembled into new videos based on the workflow graph. Compared to the original videos, the frequency of event alternation in the augmented cataract surgery videos increased by 26%. Further, a 3% higher classification accuracy and a 7.8% higher precision were achieved compared to a state-of-the-art approach. Our approach is particularly helpful for increasing the occurrence of rare but important events and can be applied to a large variety of use cases.
    From unbiased MDI Feature Importance to Explainable AI for Trees. (arXiv:2003.12043v4 [stat.ML] UPDATED)
    (0 min) We attempt to give a unifying view of the various recent attempts to (i) improve the interpretability of tree-based models and (ii) debias the default variable-importance measure in random forests, Gini importance. In particular, we demonstrate a common thread among the out-of-bag based bias correction methods and their connection to local explanations for trees. In addition, we point out a bias caused by the inclusion of in-bag data in the newly developed explainable AI for trees algorithms.
    Stock Price Prediction Under Anomalous Circumstances. (arXiv:2109.15059v1 [q-fin.ST])
    (0 min) The stock market is volatile and complicated, especially in 2020. Because of a series of global and regional "black swans," such as the COVID-19 pandemic, the U.S. stock market triggered the circuit breaker three times within one week, from March 9 to 16, which is unprecedented in history. Affected by these circumstances, the stock prices of individual corporations also plummeted at rates that were never predicted by any pre-developed forecasting models. This reveals a lack of satisfactory models that can predict changes in stock prices when catastrophic, highly unlikely events occur. To fill the void of such models and to help prevent investors from heavy losses during uncertain times, this paper aims to capture the movement pattern of stock prices under anomalous circumstances. First, we detect outliers in sequential stock prices by fitting a standard ARIMA model and identifying the points where predictions deviate significantly from actual values. With the selected data points, we train ARIMA and LSTM models at the single-stock level, industry level, and general market level, respectively. Since public mood affects the stock market tremendously, a sentiment analysis is also incorporated into the models in the form of sentiment scores, which are converted from comments about specific stocks on Reddit. Based on 100 companies' stock prices in the period 2016 to 2020, the models achieve an average prediction accuracy of 98%, which can be used to optimize existing prediction methodologies.
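    The outlier-detection step described above, flagging points where an ARIMA fit deviates strongly from the observed price, can be sketched with statsmodels; the model order, threshold, and toy series below are arbitrary choices, not the paper's configuration.

        # Sketch: flag anomalous points via ARIMA residuals (statsmodels).
        import numpy as np
        from statsmodels.tsa.arima.model import ARIMA

        rng = np.random.default_rng(0)
        prices = np.cumsum(rng.normal(0, 1, size=300)) + 100   # toy price series
        prices[150] -= 25                                      # inject an abrupt single-point shock

        res = ARIMA(prices, order=(1, 1, 1)).fit()
        residuals = res.resid
        threshold = 3 * np.std(residuals)                      # simple z-score style cutoff
        anomalies = np.where(np.abs(residuals) > threshold)[0]
        print(anomalies)                                       # expected to include the injected shock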
    Semi-Decentralized Federated Learning with Cooperative D2D Local Model Aggregations. (arXiv:2103.10481v3 [cs.LG] UPDATED)
    (0 min) Federated learning has emerged as a popular technique for distributing machine learning (ML) model training across the wireless edge. In this paper, we propose two-timescale hybrid federated learning (TT-HF), a semi-decentralized learning architecture that combines the conventional device-to-server communication paradigm for federated learning with device-to-device (D2D) communications for model training. In TT-HF, during each global aggregation interval, devices (i) perform multiple stochastic gradient descent iterations on their individual datasets, and (ii) aperiodically engage in a consensus procedure over their model parameters through cooperative, distributed D2D communications within local clusters. With a new general definition of gradient diversity, we formally study the convergence behavior of TT-HF, resulting in new convergence bounds for distributed ML. We leverage our convergence bounds to develop an adaptive control algorithm that tunes the step size, D2D communication rounds, and global aggregation period of TT-HF over time to target a sublinear convergence rate of O(1/t) while minimizing network resource utilization. Our subsequent experiments demonstrate that TT-HF significantly outperforms the current art in federated learning in terms of model accuracy and/or network energy consumption in different scenarios where local device datasets exhibit statistical heterogeneity. Finally, our numerical evaluations demonstrate robustness against outages caused by fading channels, as well as favorable performance with non-convex loss functions.
    XPROAX-Local explanations for text classification with progressive neighborhood approximation. (arXiv:2109.15004v1 [cs.LG])
    (0 min) The importance of the neighborhood for training a local surrogate model to approximate the local decision boundary of a black box classifier has been already highlighted in the literature. Several attempts have been made to construct a better neighborhood for high dimensional data, like texts, by using generative autoencoders. However, existing approaches mainly generate neighbors by selecting purely at random from the latent space and struggle under the curse of dimensionality to learn a good local decision boundary. To overcome this problem, we propose a progressive approximation of the neighborhood using counterfactual instances as initial landmarks and a careful 2-stage sampling approach to refine counterfactuals and generate factuals in the neighborhood of the input instance to be explained. Our work focuses on textual data and our explanations consist of both word-level explanations from the original instance (intrinsic) and the neighborhood (extrinsic) and factual- and counterfactual-instances discovered during the neighborhood generation process that further reveal the effect of altering certain parts in the input text. Our experiments on real-world datasets demonstrate that our method outperforms the competitors in terms of usefulness and stability (for the qualitative part) and completeness, compactness and correctness (for the quantitative part).
    Linear systems with neural network nonlinearities: Improved stability analysis via acausal Zames-Falb multipliers. (arXiv:2103.17106v2 [eess.SY] UPDATED)
    (0 min) In this paper, we analyze the stability of feedback interconnections of a linear time-invariant system with a neural network nonlinearity in discrete time. Our analysis is based on abstracting neural networks using integral quadratic constraints (IQCs), exploiting the sector-bounded and slope-restricted structure of the underlying activation functions. In contrast to existing approaches, we leverage the full potential of dynamic IQCs to describe the nonlinear activation functions in a less conservative fashion. To be precise, we consider multipliers based on the full-block Yakubovich / circle criterion in combination with acausal Zames-Falb multipliers, leading to linear matrix inequality based stability certificates. Our approach provides a flexible and versatile framework for stability analysis of feedback interconnections with neural network nonlinearities, allowing to trade off computational efficiency and conservatism. Finally, we provide numerical examples that demonstrate the applicability of the proposed framework and the achievable improvements over previous approaches.

2021-09-30

  • cs.CL updates on arXiv.org

    Influence Patterns for Explaining Information Flow in BERT. (arXiv:2011.00740v2 [cs.CL] UPDATED)
    (2 min) While "attention is all you need" may be proving true, we do not know why: attention-based transformer models such as BERT are superior, but how information flows from input tokens to output predictions is unclear. We introduce influence patterns, abstractions of sets of paths through a transformer model. Patterns quantify and localize the flow of information to paths passing through a sequence of model nodes. Experimentally, we find that a significant portion of information flow in BERT goes through skip connections instead of attention heads. We further show that consistency of patterns across instances is an indicator of BERT's performance. Finally, we demonstrate that patterns account for far more model performance than previous attention-based and layer-based methods.
    Multilingual Fact Linking. (arXiv:2109.14364v1 [cs.CL])
    (2 min) Knowledge-intensive NLP tasks can benefit from linking natural language text with facts from a Knowledge Graph (KG). Although facts themselves are language-agnostic, the fact labels (i.e., language-specific representations of the fact) in the KG are often present only in a few languages. This makes it challenging to link KG facts to sentences in languages other than this limited set. To address this problem, we introduce the task of Multilingual Fact Linking (MFL), where the goal is to link a fact expressed in a sentence to the corresponding fact in the KG, even when the fact label in the KG is not available in the language of the sentence. To facilitate research in this area, we present a new evaluation dataset, IndicLink. This dataset contains 11,293 linked WikiData facts and 6,429 sentences spanning English and six Indian languages. We propose a Retrieval+Generation model, ReFCoG, that can scale to millions of KG facts by combining Dual Encoder based retrieval with a Seq2Seq based generation model that is constrained to output only valid KG facts. ReFCoG outperforms standard Retrieval+Re-ranking models by 10.7 points in Precision@1. In spite of this gain, the model achieves an overall score of 52.1, showing ample scope for improvement on the task. ReFCoG code and IndicLink data are available at https://github.com/SaiKeshav/mfl
    Call Larisa Ivanovna: Code-Switching Fools Multilingual NLU Models. (arXiv:2109.14350v1 [cs.CL])
    (2 min) The practical need to develop task-oriented dialogue assistants requires the ability to understand many languages. Novel benchmarks for multilingual natural language understanding (NLU) include monolingual sentences in several languages, annotated with intents and slots. In such a setup, models for cross-lingual transfer show remarkable performance in joint intent recognition and slot filling. However, existing benchmarks lack code-switched utterances, which are difficult to gather and label due to the complexity of their grammatical structure. The evaluation of NLU models therefore appears biased and limited, since code-switching is left out of scope. Our work adopts recognized methods to generate plausible and naturally-sounding code-switched utterances and uses them to create a synthetic code-switched test set. Based on experiments, we report that state-of-the-art NLU models are unable to handle code-switching: at worst, the performance, evaluated by semantic accuracy, drops from 80% to as low as 15% across languages. We further show that pre-training on synthetic code-mixed data helps to maintain performance on the proposed test set at a level comparable to monolingual data. Finally, we analyze different language pairs and show that the closer the languages are, the better the NLU model handles their alternation, in line with the common understanding of how multilingual models transfer between languages.
    BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text. (arXiv:2109.14250v1 [cs.CL])
    (2 min) Social media companies as well as authorities make extensive use of artificial intelligence (AI) tools to monitor postings of hate speech, celebrations of violence or profanity. Since AI software requires massive volumes of data to train computers, Machine Translation (MT) of the online content is commonly used to process posts written in several languages and hence augment the data needed for training. However, MT mistakes are a regular occurrence when translating sentiment-oriented user-generated content (UGC), especially when a low-resource language is involved. The adequacy of the whole process relies on the assumption that the evaluation metrics used give a reliable indication of the quality of the translation. In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors which can cause serious misunderstanding of the affect message. We compare the performance of three canonical metrics on meaningless translations where the semantic content is seriously impaired as compared to meaningful translations with a critical error which exclusively distorts the sentiment of the source text. We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.
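    For readers unfamiliar with how such automatic metrics are computed, here is a minimal sentence-level BLEU example with NLTK; the sentences are invented and it only illustrates the mechanics of one of the three metrics, not the paper's evaluation protocol.

        # Sketch: sentence-level BLEU for candidate translations against a reference (NLTK).
        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

        reference = ["the", "service", "was", "absolutely", "terrible"]
        good_mt   = ["the", "service", "was", "really", "terrible"]
        bad_mt    = ["the", "service", "was", "absolutely", "great"]   # sentiment-critical error

        smooth = SmoothingFunction().method1
        print(sentence_bleu([reference], good_mt, smoothing_function=smooth))
        print(sentence_bleu([reference], bad_mt,  smoothing_function=smooth))
        # The hypothesis that flips the sentiment can score as high as (or higher than) the
        # harmless paraphrase, which is exactly the weakness of surface-overlap metrics
        # discussed above.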
    FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition. (arXiv:2109.14420v1 [cs.CL])
    (2 min) Error correction is widely used in automatic speech recognition (ASR) to post-process the generated sentence, and can further reduce the word error rate (WER). Although multiple candidates are generated by an ASR system through beam search, current error correction approaches can only correct one sentence at a time, failing to leverage the voting effect from multiple candidates to better detect and correct error tokens. In this work, we propose FastCorrect 2, an error correction model that takes multiple ASR candidates as input for better correction accuracy. FastCorrect 2 adopts non-autoregressive generation for fast inference, which consists of an encoder that processes multiple source sentences and a decoder that generates the target sentence in parallel from the adjusted source sentence, where the adjustment is based on the predicted duration of each source token. However, there are some issues when handling multiple source sentences. First, it is non-trivial to leverage the voting effect from multiple source sentences since they usually vary in length. Thus, we propose a novel alignment algorithm to maximize the degree of token alignment among multiple sentences in terms of token and pronunciation similarity. Second, the decoder can only take one adjusted source sentence as input, while there are multiple source sentences. Thus, we develop a candidate predictor to detect the most suitable candidate for the decoder. Experiments on our inhouse dataset and AISHELL-1 show that FastCorrect 2 can further reduce the WER over the previous correction model with single candidate by 3.2% and 2.6%, demonstrating the effectiveness of leveraging multiple candidates in ASR error correction. FastCorrect 2 achieves better performance than the cascaded re-scoring and correction pipeline and can serve as a unified post-processing module for ASR.
    Decomposing and Recomposing Event Structure. (arXiv:2103.10387v2 [cs.CL] UPDATED)
    (2 min) We present an event structure classification empirically derived from inferential properties annotated on sentence- and document-level Universal Decompositional Semantics (UDS) graphs. We induce this classification jointly with semantic role, entity, and event-event relation classifications using a document-level generative model structured by these graphs. To support this induction, we augment existing annotations found in the UDS1.0 dataset, which covers the entirety of the English Web Treebank, with an array of inferential properties capturing fine-grained aspects of the temporal and aspectual structure of events. The resulting dataset (available at decomp.io) is the largest annotation of event structure and (partial) event coreference to date.
    VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. (arXiv:2105.09996v2 [cs.CV] UPDATED)
    (2 min) We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
    Contextualize Knowledge Bases with Transformer for End-to-end Task-Oriented Dialogue Systems. (arXiv:2010.05740v4 [cs.CL] UPDATED)
    (2 min) Incorporating knowledge bases (KB) into end-to-end task-oriented dialogue systems is challenging, since it requires properly representing each KB entity, which is associated with both its KB context and the dialogue context. Existing works represent an entity by perceiving only part of its KB context, which can lead to less effective representations due to information loss and adversely affects KB reasoning and response generation. To tackle this issue, we explore fully contextualizing the entity representation by dynamically perceiving all the relevant entities and the dialogue history. To achieve this, we propose a COntext-aware Memory Enhanced Transformer framework (COMET), which treats the KB as a sequence and leverages a novel Memory Mask to enforce that each entity only attends to its relevant entities and the dialogue history, while avoiding distraction from irrelevant entities. Through extensive experiments, we show that our COMET framework achieves superior performance over the state of the art.
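    The masking idea, letting each KB entity attend only to its relevant entities and the dialogue history, can be illustrated with a boolean attention mask passed to a standard attention layer. The toy position layout and relevance map below are my own, not the COMET implementation.

        # Sketch: a relevance mask restricting which positions KB-entity tokens may attend to (PyTorch).
        import torch
        import torch.nn as nn

        seq_len, d_model = 6, 16
        # Toy layout: positions 0-2 are dialogue history, 3-5 are KB entities.
        history = [0, 1, 2]
        relevant = {3: [4], 4: [3], 5: []}     # hypothetical entity-to-relevant-entity map

        # attn_mask is True where attention is NOT allowed (PyTorch convention for bool masks).
        attn_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
        for ent, rel in relevant.items():
            allowed = set(history) | set(rel) | {ent}
            for j in range(seq_len):
                attn_mask[ent, j] = j not in allowed   # block irrelevant entities for this entity

        x = torch.randn(seq_len, 1, d_model)           # (seq, batch, dim)
        attn = nn.MultiheadAttention(d_model, num_heads=2)
        out, _ = attn(x, x, x, attn_mask=attn_mask)
        print(out.shape)                               # torch.Size([6, 1, 16])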
    Evaluating Various Tokenizers for Arabic Text Classification. (arXiv:2106.07540v2 [cs.CL] UPDATED)
    (2 min) The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size in a given text corpus. Most tokenization techniques are language-agnostic, i.e., they don't incorporate the linguistic features of a given language, and such techniques are also difficult to evaluate in practice. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other baselines using unsupervised evaluations. In addition, we compare all six algorithms by evaluating them on three supervised classification tasks, namely sentiment analysis, news classification, and poetry classification, using six publicly available datasets. Our experiments show that no single tokenization technique is the best choice overall and that the performance of a given tokenization algorithm depends on the size of the dataset, the type of the task, and the amount of morphology present in the dataset. However, some tokenization techniques are better overall than others on various text classification tasks.
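    For readers unfamiliar with subword tokenization, here is a minimal, generic byte-pair-encoding sketch that learns merge rules from word frequencies. It is purely illustrative and is not one of the paper's Arabic-specific tokenizers, which incorporate morphology-aware rules.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn byte-pair-encoding merges from a word -> frequency dict."""
    vocab = {tuple(w): f for w, f in words.items()}   # word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4))
```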
    mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset. (arXiv:2108.13897v2 [cs.CL] UPDATED)
    (2 min) The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 8 languages, created using machine translation. We evaluated mMARCO by fine-tuning mono- and multilingual re-ranking models on it. Experimental results demonstrate that multilingual models fine-tuned on our translated dataset are more effective than models fine-tuned on the original English version alone. Also, our distilled multilingual re-ranker is competitive with non-distilled models while having 5.4 times fewer parameters. The translated datasets as well as fine-tuned models are available at https://github.com/unicamp-dl/mMARCO.git.
    StoryDB: Broad Multi-language Narrative Dataset. (arXiv:2109.14396v1 [cs.CL])
    (2 min) This paper presents StoryDB - a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.
    Patterns of Lexical Ambiguity in Contextualised Language Models. (arXiv:2109.13032v2 [cs.CL] UPDATED)
    (2 min) One of the central aspects of contextualised language models is that they should be able to distinguish the meaning of lexically ambiguous words by their contexts. In this paper we investigate the extent to which the contextualised embeddings of word forms that display multiplicity of sense reflect traditional distinctions of polysemy and homonymy. To this end, we introduce an extended, human-annotated dataset of graded word sense similarity and co-predication acceptability, and evaluate how well the similarity of embeddings predicts similarity in meaning. Both types of human judgements indicate that the similarity of polysemic interpretations falls in a continuum between identity of meaning and homonymy. However, we also observe significant differences within the similarity ratings of polysemes, forming consistent patterns for different types of polysemic sense alternation. Our dataset thus appears to capture a substantial part of the complexity of lexical ambiguity, and can provide a realistic test bed for contextualised embeddings. Among the tested models, BERT Large shows the strongest correlation with the collected word sense similarity ratings, but struggles to consistently replicate the observed similarity patterns. When clustering ambiguous word forms based on their embeddings, the model displays high confidence in discerning homonyms and some types of polysemic alternations, but consistently fails for others.
    How Does Adversarial Fine-Tuning Benefit BERT?. (arXiv:2108.13602v2 [cs.CL] UPDATED)
    (2 min) Adversarial training (AT) is one of the most reliable methods for defending against adversarial attacks in machine learning. Variants of this method have been used as regularization mechanisms to achieve SOTA results on NLP benchmarks, and they have been found to be useful for transfer learning and continual learning. We search for the reasons for the effectiveness of AT by contrasting vanilla and adversarially fine-tuned BERT models. We identify partial preservation of BERT's syntactic abilities during fine-tuning as the key to the success of AT. We observe that adversarially fine-tuned models remain more faithful to BERT's language modeling behavior and are more sensitive to the word order. As concrete examples of syntactic abilities, an adversarially fine-tuned model could have an advantage of up to 38% on anaphora agreement and up to 11% on dependency parsing. Our analysis demonstrates that vanilla fine-tuning oversimplifies the sentence representation by focusing heavily on a small subset of words. AT, however, moderates the effect of these influential words and encourages representational diversity. This allows for a more hierarchical representation of a sentence and leads to the mitigation of BERT's loss of syntactic abilities.
    Fast and Scalable Dialogue State Tracking with Explicit Modular Decomposition. (arXiv:2004.10663v3 [cs.CL] UPDATED)
    (2 min) We present a fast and scalable architecture called Explicit Modular Decomposition (EMD), in which we incorporate both classification-based and extraction-based methods and design four modules (for classification and sequence labelling) to jointly extract dialogue states. Experimental results based on the MultiWoz 2.0 dataset validate the superiority of our proposed model in terms of both complexity and scalability when compared to the state-of-the-art methods, especially in the scenario of multi-domain dialogues entangled with many turns of utterances.
    BiQUE: Biquaternionic Embeddings of Knowledge Graphs. (arXiv:2109.14401v1 [cs.CL])
    (2 min) Knowledge graph embeddings (KGEs) compactly encode multi-relational knowledge graphs (KGs). Existing KGE models rely on geometric operations to model relational patterns. Euclidean (circular) rotation is useful for modeling patterns such as symmetry, but cannot represent hierarchical semantics. In contrast, hyperbolic models are effective at modeling hierarchical relations, but do not perform as well on patterns on which circular rotation excels. It is crucial for KGE models to unify multiple geometric transformations so as to fully cover the multifarious relations in KGs. To do so, we propose BiQUE, a novel model that employs biquaternions to integrate multiple geometric transformations, viz., scaling, translation, Euclidean rotation, and hyperbolic rotation. BiQUE makes the best trade-offs among geometric operators during training, picking the best one (or their best combination) for each relation. Experiments on five datasets show BiQUE's effectiveness.
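    To ground the geometric vocabulary, the sketch below implements the standard Hamilton product of two quaternions, the rotation operator on which quaternion-family KGE models build. BiQUE's biquaternions extend this to complex-valued components and combine it with scaling, translation, and hyperbolic rotation, none of which is shown; the example vectors and the scoring comment are illustrative assumptions.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) arrays."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

# A quaternion-rotation KGE score can be based on how close
# hamilton_product(head, relation) lands to the tail embedding.
h = np.array([0.9, 0.1, 0.3, 0.2])
r = np.array([0.7, 0.0, 0.5, 0.1])
print(hamilton_product(h, r))
```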
    COVID-19 Vaccine and Social Media: Exploring Emotions and Discussions on Twitter. (arXiv:2108.04816v2 [cs.SI] UPDATED)
    (2 min) The understanding of the public response to COVID-19 vaccines is the key success factor to control the COVID-19 pandemic. To understand the public response, there is a need to explore public opinion. Traditional surveys are expensive and time-consuming, address limited health topics, and obtain small-scale data. Twitter can provide a great opportunity to understand public opinion regarding COVID-19 vaccines. The current study proposes an approach using computational and human coding methods to collect and analyze a large number of tweets to provide a wider perspective on the COVID-19 vaccine. This study identifies the sentiment of tweets using a machine learning rule-based approach, discovers major topics, explores temporal trend and compares topics of negative and non-negative tweets using statistical tests, and discloses top topics of tweets having negative and non-negative sentiment. Our findings show that the negative sentiment regarding the COVID-19 vaccine had a decreasing trend between November 2020 and February 2021. We found Twitter users have discussed a wide range of topics from vaccination sites to the 2020 U.S. election between November 2020 and February 2021. The findings show that there was a significant difference between tweets having negative and non-negative sentiment regarding the weight of most topics. Our results also indicate that the negative and non-negative tweets had different topic priorities and focuses. This research illustrates that Twitter data can be used to explore public opinion regarding the COVID-19 vaccine.
    Overview of the Arabic Sentiment Analysis 2021 Competition at KAUST. (arXiv:2109.14456v1 [cs.CL])
    (2 min) This paper provides an overview of the Arabic Sentiment Analysis Challenge organized by King Abdullah University of Science and Technology (KAUST). The task in this challenge is to develop machine learning models to classify a given tweet into one of three categories: Positive, Negative, or Neutral. From our recently released ASAD dataset, we provide the competitors with 55K tweets for training and 20K tweets for validation, based on which the performance of participating teams is ranked on a leaderboard, https://www.kaggle.com/c/arabic-sentiment-analysis-2021-kaust. The competition received in total 1247 submissions from 74 teams (99 team members). The final winners are determined on another private set of 20K tweets that has the same distribution as the training and validation sets. In this paper, we present the main findings of the competition and summarize the methods and tools used by the top-ranked teams. The full dataset of 100K labeled tweets is also released for public use at https://www.kaggle.com/c/arabic-sentiment-analysis-2021-kaust/data.
    FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark. (arXiv:2107.07498v2 [cs.CL] UPDATED)
    (2 min) Pretrained Language Models (PLMs) have achieved tremendous success in natural language understanding tasks. While different learning schemes -- fine-tuning, zero-shot, and few-shot learning -- have been widely explored and compared for languages such as English, there is comparatively little work in Chinese to fairly and comprehensively evaluate and compare these methods, which hinders cumulative progress. In this paper, we introduce the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive few-shot evaluation benchmark in Chinese. It includes nine tasks, ranging from single-sentence and sentence-pair classification tasks to machine reading comprehension tasks. We systematically evaluate five state-of-the-art (SOTA) few-shot learning methods (including PET, ADAPET, LM-BFF, P-tuning and EFL), and compare their performance with fine-tuning and zero-shot learning schemes on the newly constructed FewCLUE benchmark. Experimental results reveal that: 1) the effect of different few-shot learning methods is sensitive to the pre-trained model to which the methods are applied; 2) PET and P-tuning achieve the best overall performance with RoBERTa and ERNIE, respectively. Our benchmark is used in the few-shot learning contest of NLPCC 2021. In addition, we provide a user-friendly toolkit, as well as an online leaderboard, to help facilitate further progress on Chinese few-shot learning, and we report baseline performance for the different learning methods as a reference for future research.
    Hierarchical Character Tagger for Short Text Spelling Error Correction. (arXiv:2109.14259v1 [cs.CL])
    (2 min) State-of-the-art approaches to spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference time; and sequence labeling models based on Transformer encoders like BERT, which involve token-level label space and therefore a large pre-defined vocabulary dictionary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach to alleviate the issue of long-tail label distribution without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is an accurate and much faster approach than many existing models.
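    The character-level edit formulation can be illustrated with the standard library alone: the hedged sketch below derives per-character edit labels (keep/delete/replace/insert) that turn a misspelled string into its correction, roughly the kind of label space a character tagger would predict. HCTagger's actual label construction, encoder, and hierarchical decoding are not shown.

```python
from difflib import SequenceMatcher

def char_edit_tags(source, target):
    """Derive character-level edit labels that transform `source` into `target`.
    Illustrative label construction only; HCTagger itself predicts such edits
    with a character-level pre-trained encoder."""
    tags = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=source, b=target).get_opcodes():
        if op == "equal":
            tags += [("KEEP", c) for c in source[i1:i2]]
        elif op == "delete":
            tags += [("DELETE", c) for c in source[i1:i2]]
        elif op == "replace":
            tags.append((f"REPLACE:{target[j1:j2]}", source[i1:i2]))
        elif op == "insert":
            tags.append((f"INSERT:{target[j1:j2]}", ""))
    return tags

print(char_edit_tags("helo wrold", "hello world"))
```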
    EDGAR-CORPUS: Billions of Tokens Make The World Go Round. (arXiv:2109.14394v1 [cs.CL])
    (2 min) We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.
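    The released EDGAR-W2V vectors were trained by the authors with their own pipeline; purely for orientation, a generic word2vec run over tokenized report text might look like the sketch below (assuming the gensim library is installed; the toy sentences, hyperparameters, and query word are placeholders, not the authors' setup).

```python
from gensim.models import Word2Vec

# `sentences` stands in for an iterable of token lists extracted from report items.
sentences = [["net", "revenue", "increased"], ["operating", "income", "decreased"]]

model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                 min_count=1, workers=4, epochs=10)
print(model.wv.most_similar("revenue", topn=3))   # nearest neighbours in embedding space
```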
    EdinSaar@WMT21: North-Germanic Low-Resource Multilingual NMT. (arXiv:2109.14368v1 [cs.CL])
    (2 min) We describe the EdinSaar submission to the shared task of Multilingual Low-Resource Translation for North Germanic Languages at the Sixth Conference on Machine Translation (WMT2021). We submit multilingual translation models for translations to/from Icelandic (is), Norwegian-Bokmal (nb), and Swedish (sv). We employ various experimental approaches, including multilingual pre-training, back-translation, fine-tuning, and ensembling. In most translation directions, our models outperform other submitted systems.
    Reflexivity in Issues of Scale and Representation in a Digital Humanities Project. (arXiv:2109.14184v1 [cs.CL])
    (2 min) In this paper, we explore issues that we have encountered in developing a pipeline that combines natural language processing with data analysis and visualization techniques. The characteristics of the corpus - being comprised of diaries of a single person spanning several decades - present both conceptual challenges in terms of issues of representation, and affordances as a source for historical research. We consider these issues in a team context with a particular focus on the generation and interpretation of visualizations.
    Marked Attribute Bias in Natural Language Inference. (arXiv:2109.14039v1 [cs.CL])
    (2 min) Reporting and providing test sets for harmful bias in NLP applications is essential for building a robust understanding of the current problem. We present a new observation of gender bias in a downstream NLP application: marked attribute bias in natural language inference. Bias in downstream applications can stem from training data, word embeddings, or be amplified by the model in use. However, focusing on biased word embeddings is potentially the most impactful first step due to their universal nature. Here we seek to understand how the intrinsic properties of word embeddings contribute to this observed marked attribute effect, and whether current post-processing methods address the bias successfully. An investigation of the current debiasing landscape reveals two open problems: none of the current debiased embeddings mitigate the marked attribute error, and none of the intrinsic bias measures are predictive of the marked attribute effect. By noticing that a new type of intrinsic bias measure correlates meaningfully with the marked attribute effect, we propose a new postprocessing debiasing scheme for static word embeddings. The proposed method applied to existing embeddings achieves new best results on the marked attribute bias test set. See https://github.com/hillary-dawkins/MAB.
    Improving Dialogue State Tracking by Joint Slot Modeling. (arXiv:2109.14144v1 [cs.CL])
    (2 min) Dialogue state tracking models play an important role in a task-oriented dialogue system. However, most of them model the slot types conditionally independently given the input. We discover that this may cause the model to be confused by slot types that share the same data type. To mitigate this issue, we propose TripPy-MRF and TripPy-LSTM, which model the slots jointly. Our results show that they are able to alleviate the confusion mentioned above, and they push the state of the art on the MultiWoZ 2.1 dataset from 58.7 to 61.3. Our implementation is available at https://github.com/CTinRay/Trippy-Joint.
    Context based Roman-Urdu to Urdu Script Transliteration System. (arXiv:2109.14197v1 [cs.CL])
    (2 min) Computers have become essential for everyday life and are useful in many fields such as search engines, text processing, short messaging services, voice chat, and text recognition. Over many years, numerous tools and techniques have been developed to support writing in different language scripts. Many Asian languages, such as Arabic, Urdu, Persian, Chinese, and Korean, are often written informally using Roman letters; Roman alphabets are the most commonly used for transliterating languages with non-Latin scripts. Several keyboard layouts already exist for entering Urdu characters, yet most Urdu speakers prefer Roman-Urdu in their applications because they are not familiar with the Urdu keyboard. The objective of this work is to improve context-based transliteration of Roman-Urdu into Urdu script. In this paper, we propose an algorithm that effectively resolves the transliteration issues. The algorithm works as follows: each encoded Roman word is converted into candidate words in standard Urdu script and matched against a lexicon; if a match is found, the word is displayed in the text editor, and if more than one match is found, the highest-frequency word is displayed. If no match is found, the first encoded and converted instance is displayed by default, and the ambiguous word is then adjusted to its intended form according to its context. The results of this algorithm demonstrate its efficiency and significance compared to other models and algorithms for context-based Roman-Urdu to Urdu transliteration.
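    A simplified sketch of the described lookup-and-disambiguation step is given below; the lexicon entries, the fallback converter, and the omission of the context-based re-ranking are all simplifying assumptions made for illustration.

```python
def transliterate(roman_tokens, lexicon, fallback):
    """Lexicon lookup with frequency-based disambiguation.
    `lexicon` maps a Roman-Urdu word to candidate (urdu_word, frequency) pairs;
    `fallback` is a hypothetical rule-based character converter used when no
    lexicon entry matches. The paper's context-based adjustment of ambiguous
    words is not implemented here."""
    output = []
    for token in roman_tokens:
        candidates = lexicon.get(token.lower(), [])
        if candidates:
            # more than one match: pick the highest-frequency Urdu word
            best = max(candidates, key=lambda pair: pair[1])[0]
        else:
            # no match: fall back to the first rule-converted instance
            best = fallback(token)
        output.append(best)
    return output

lexicon = {"mein": [("میں", 950), ("مین", 40)], "acha": [("اچھا", 700)]}
print(transliterate(["mein", "acha", "hoon"], lexicon, fallback=lambda t: t))
```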
    Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation. (arXiv:2109.14200v1 [eess.AS])
    (2 min) Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While the gradual development of such capabilities is unquestionable, the exact nature of these skills and of the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, without the units being proximal learning targets for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent to which the audiovisual aspect of LLH is supported by existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect the phonetic, syllabic, or lexical structure of the input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated...
    Who says like a style of Vitamin: Towards Syntax-Aware Dialogue Summarization using Multi-task Learning. (arXiv:2109.14199v1 [cs.CL])
    (2 min) Abstractive dialogue summarization is a challenging task for several reasons. First, most of the important pieces of information in a conversation are scattered across utterances through multi-party interactions with different textual styles. Second, dialogues often have an informal structure in which different individuals express personal perspectives, unlike text summarization tasks that usually target formal documents such as news articles. To address these issues, we focused on the association between utterances from individual speakers and unique syntactic structures. Speakers have unique textual styles that can contain linguistic information, much like a voiceprint. Therefore, we constructed a syntax-aware model by leveraging linguistic information (i.e., POS tagging), which alleviates the above issues by inherently distinguishing sentences uttered by individual speakers. We employed multi-task learning of both syntax-aware information and dialogue summarization. To the best of our knowledge, our approach is the first method to apply multi-task learning to the dialogue summarization task. Experiments on the SAMSum corpus (a large-scale dialogue summarization corpus) demonstrated that our method improved upon the vanilla model. We further analyze the costs and benefits of our approach relative to baseline models.
    Shaking Syntactic Trees on the Sesame Street: Multilingual Probing with Controllable Perturbations. (arXiv:2109.14017v1 [cs.CL])
    (2 min) Recent research has adopted a new experimental field centered around the concept of text perturbations, which has revealed that shuffled word order has little to no impact on the downstream performance of Transformer-based language models across many NLP tasks. These findings contradict the common understanding of how the models encode hierarchical and structural information and even question whether word order is modeled with position embeddings. To this end, this paper proposes nine probing datasets organized by the type of controllable text perturbation for three Indo-European languages with a varying degree of word order flexibility: English, Swedish and Russian. Based on the probing analysis of the M-BERT and M-BART models, we report that syntactic sensitivity depends on the language and model pre-training objectives. We also find that the sensitivity grows across layers together with the increase of the perturbation granularity. Last but not least, we show that the models barely use the positional information to induce syntactic trees from their intermediate self-attention and contextualized representations.
    Text Simplification for Comprehension-based Question-Answering. (arXiv:2109.13984v1 [cs.CL])
    (2 min) Text simplification is the process of splitting and rephrasing a sentence to a sequence of sentences making it easier to read and understand while preserving the content and approximating the original meaning. Text simplification has been exploited in NLP applications like machine translation, summarization, semantic role labeling, and information extraction, opening a broad avenue for its exploitation in comprehension-based question-answering downstream tasks. In this work, we investigate the effect of text simplification in the task of question-answering using a comprehension context. We release Simple-SQuAD, a simplified version of the widely-used SQuAD dataset. Firstly, we outline each step in the dataset creation pipeline, including style transfer, thresholding of sentences showing correct transfer, and offset finding for each answer. Secondly, we verify the quality of the transferred sentences through various methodologies involving both automated and human evaluation. Thirdly, we benchmark the newly created corpus and perform an ablation study for examining the effect of the simplification process in the SQuAD-based question answering task. Our experiments show that simplification leads to up to 2.04% and 1.74% increase in Exact Match and F1, respectively. Finally, we conclude with an analysis of the transfer process, investigating the types of edits made by the model, and the effect of sentence length on the transfer model.
    Contrastive Video-Language Segmentation. (arXiv:2109.14131v1 [cs.CV])
    (2 min) We focus on the problem of segmenting a certain object referred to by a natural language sentence in video content, which is at the core of formulating a pinpoint vision-language relation. While existing attempts mainly construct such a relation in an implicit way, i.e., grid-level multi-modal feature fusion, it has been proven problematic to distinguish semantically similar objects under this paradigm. In this work, we propose to intertwine the visual and linguistic modalities in an explicit way via a contrastive learning objective, which directly aligns the referred object with the language description and separates the unreferred content across frames. Moreover, to remedy the degradation problem, we present two complementary hard instance mining strategies, i.e., Language-relevant Channel Filter and Relative Hard Instance Construction. They encourage the network to exclude visually distinguishable features and to focus on easily confused objects during the contrastive training. Extensive experiments on two benchmarks, i.e., A2D Sentences and J-HMDB Sentences, quantitatively demonstrate the state-of-the-art performance of our method and qualitatively show more accurate discrimination between semantically similar objects compared to baselines.
    Improving Arabic Diacritization by Learning to Diacritize and Translate. (arXiv:2109.14150v1 [cs.CL])
    (2 min) We propose a novel multitask learning method for diacritization which trains a model to both diacritize and translate. Our method addresses data sparsity by exploiting large, readily available bitext corpora. Furthermore, translation requires implicit linguistic and semantic knowledge, which is helpful for resolving ambiguities in the diacritization task. We apply our method to the Penn Arabic Treebank and report a new state-of-the-art word error rate of 4.79%. We also conduct manual and automatic analysis to better understand our method and highlight some of the remaining challenges in diacritization.
    Second Order WinoBias (SoWinoBias) Test Set for Latent Gender Bias Detection in Coreference Resolution. (arXiv:2109.14047v1 [cs.CL])
    (2 min) We observe an instance of gender-induced bias in a downstream application, despite the absence of explicit gender words in the test cases. We provide a test set, SoWinoBias, for the purpose of measuring such latent gender bias in coreference resolution systems. We evaluate the performance of current debiasing methods on the SoWinoBias test set, especially in reference to the method's design and altered embedding space properties. See https://github.com/hillarydawkins/SoWinoBias.
    RAFT: A Real-World Few-Shot Text Classification Benchmark. (arXiv:2109.14076v1 [cs.CL])
    (2 min) Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org .
    Generating Summaries for Scientific Paper Review. (arXiv:2109.14059v1 [cs.CL])
    (2 min) The review process is essential to ensure the quality of publications. Recently, the increase of submissions for top venues in machine learning and NLP has caused a problem of excessive burden on reviewers and has often caused concerns regarding how this may not only overload reviewers, but also may affect the quality of the reviews. An automatic system for assisting with the reviewing process could be a solution for ameliorating the problem. In this paper, we explore automatic review summary generation for scientific papers. We posit that neural language models have the potential to be valuable candidates for this task. In order to test this hypothesis, we release a new dataset of scientific papers and their reviews, collected from papers published in the NeurIPS conference from 2013 to 2020. We evaluate state of the art neural summarization models, present initial results on the feasibility of automatic review summary generation, and propose directions for the future.
    VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. (arXiv:2109.14084v1 [cs.CV])
    (2 min) We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
  • cs.CV updates on arXiv.org

    Robust Temporal Ensembling for Learning with Noisy Labels. (arXiv:2109.14563v1 [cs.CV])
    (2 min) Successful training of deep neural networks with noisy labels is an essential capability as most real-world datasets contain some amount of mislabeled data. Left unmitigated, label noise can sharply degrade typical supervised learning approaches. In this paper, we present robust temporal ensembling (RTE), which combines robust loss with semi-supervised regularization methods to achieve noise-robust learning. We demonstrate that RTE achieves state-of-the-art performance across the CIFAR-10, CIFAR-100, ImageNet, WebVision, and Food-101N datasets, while forgoing the recent trend of label filtering and/or fixing. Finally, we show that RTE also retains competitive corruption robustness to unforeseen input noise using CIFAR-10-C, obtaining a mean corruption error (mCE) of 13.50% even in the presence of an 80% noise ratio, versus 26.9% mCE with standard methods on clean data.
    Stain-Robust Mitotic Figure Detection for the Mitosis Domain Generalization Challenge. (arXiv:2109.00853v2 [cs.CV] UPDATED)
    (2 min) The detection of mitotic figures from different scanners/sites remains an important topic of research, owing to its potential in assisting clinicians with tumour grading. The MItosis DOmain Generalization (MIDOG) challenge aims to test the robustness of detection models on unseen data from multiple scanners for this task. We present a short summary of the approach employed by the TIA Centre team to address this challenge. Our approach is based on a hybrid detection model, where mitotic candidates are segmented on stain-normalised images before being refined by a deep learning classifier. The model achieved an F1-score of 0.786 in cross-validation on the training images and 0.765 on the preliminary test set, demonstrating the generalizability of our model to unseen data from new scanners.
    Designing Counterfactual Generators using Deep Model Inversion. (arXiv:2109.14274v1 [cs.LG])
    (2 min) Explanation techniques that synthesize small, interpretable changes to a given image while producing desired changes in the model prediction have become popular for introspecting black-box models. Commonly referred to as counterfactuals, the synthesized explanations are required to contain discernible changes (for easy interpretability) while also being realistic (consistency to the data manifold). In this paper, we focus on the case where we have access only to the trained deep classifier and not the actual training data. While the problem of inverting deep models to synthesize images from the training distribution has been explored, our goal is to develop a deep inversion approach to generate counterfactual explanations for a given query image. Despite their effectiveness in conditional image synthesis, we show that existing deep inversion methods are insufficient for producing meaningful counterfactuals. We propose DISC (Deep Inversion for Synthesizing Counterfactuals) that improves upon deep inversion by utilizing (a) stronger image priors, (b) incorporating a novel manifold consistency objective and (c) adopting a progressive optimization strategy. We find that, in addition to producing visually meaningful explanations, the counterfactuals from DISC are effective at learning classifier decision boundaries and are robust to unknown test-time corruptions.
    Beyond Background-Aware Correlation Filters: Adaptive Context Modeling by Hand-Crafted and Deep RGB Features for Visual Tracking. (arXiv:2004.02932v2 [cs.CV] UPDATED)
    (2 min) In recent years, background-aware correlation filters have attracted considerable research interest in visual target tracking. However, these methods cannot suitably model the target appearance due to their reliance on hand-crafted features. On the other hand, recent deep learning-based visual tracking methods have provided competitive performance at the cost of extensive computation. In this paper, an adaptive background-aware correlation filter-based tracker is proposed that effectively models the target appearance by using either histogram of oriented gradients (HOG) or convolutional neural network (CNN) feature maps. The proposed method exploits the fast 2D non-maximum suppression (NMS) algorithm and semantic information comparison to detect challenging situations. When the HOG-based response map is not reliable, or the context region has low semantic similarity with prior regions, the proposed method constructs the CNN context model to improve the target region estimation. Furthermore, the rejection option allows the proposed method to update the CNN context model only on valid regions. Comprehensive experimental results demonstrate that the proposed adaptive method clearly outperforms state-of-the-art methods in terms of the accuracy and robustness of visual target tracking on the OTB-50, OTB-100, TC-128, UAV-123, and VOT-2015 datasets.
    UNETR: Transformers for 3D Medical Image Segmentation. (arXiv:2103.10504v2 [eess.IV] UPDATED)
    (2 min) Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence in the majority of medical image segmentation applications over the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations, which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr
    Road Network Guided Fine-Grained Urban Traffic Flow Inference. (arXiv:2109.14251v1 [cs.LG])
    (2 min) Accurate inference of fine-grained traffic flow from coarse-grained flow is an emerging yet crucial problem, which can help greatly reduce the number of traffic monitoring sensors for cost savings. In this work, we notice that traffic flow has a high correlation with the road network, which was either completely ignored or simply treated as an external factor in previous works. To address this problem, we propose a novel Road-Aware Traffic Flow Magnifier (RATFM) that explicitly exploits the prior knowledge of road networks to fully learn the road-aware spatial distribution of fine-grained traffic flow. Specifically, a multi-directional 1D convolutional layer is first introduced to extract the semantic feature of the road network. Subsequently, we incorporate the road network feature and coarse-grained flow feature to regularize the short-range spatial distribution modeling of road-relative traffic flow. Furthermore, we take the road network feature as a query to capture the long-range spatial distribution of traffic flow with a transformer architecture. Benefiting from the road-aware inference mechanism, our method can generate high-quality fine-grained traffic flow maps. Extensive experiments on three real-world datasets show that the proposed RATFM outperforms state-of-the-art models under various scenarios.
    Decomposable-Net: Scalable Low-Rank Compression for Neural Networks. (arXiv:1910.13141v3 [cs.LG] UPDATED)
    (2 min) Compressing DNNs is important for real-world applications operating on resource-constrained devices. However, we typically observe drastic performance deterioration when changing model size after training is completed, so retraining is required to restore the performance of compressed models suited to different devices. In this paper, we propose Decomposable-Net (a network decomposable to any size), which allows flexible changes to model size without retraining. We decompose weight matrices in the DNNs via singular value decomposition and adjust ranks according to the target model size. Unlike existing low-rank compression methods that specialize the model to a fixed size, we propose a novel backpropagation scheme that jointly minimizes losses for both the full- and low-rank networks. This makes it possible not only to maintain the performance of a full-rank network without retraining but also to improve low-rank networks at multiple sizes. Additionally, we introduce a simple criterion for rank selection that effectively suppresses approximation error. In experiments on the ImageNet classification task, Decomposable-Net yields superior accuracy over a wide range of model sizes. In particular, Decomposable-Net achieves a top-1 accuracy of 73.2% at 0.27x MACs with ResNet-50, compared to Tucker decomposition (67.4% / 0.30x), Trained Rank Pruning (70.6% / 0.28x), and universally slimmable networks (71.4% / 0.26x).
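    The core decomposition step can be illustrated with a plain SVD truncation of a single weight matrix, as in the hedged sketch below; Decomposable-Net's joint full/low-rank training objective and rank-selection criterion are not reproduced here.

```python
import numpy as np

def truncate_weight(W, rank):
    """Low-rank factorization of a dense layer weight via SVD.
    Generic sketch only: the paper additionally trains full- and low-rank
    losses jointly and selects ranks with its own criterion."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]     # shape (out, rank)
    B = Vt[:rank, :]               # shape (rank, in)
    return A, B                    # W is approximated by A @ B, i.e. two thinner layers

W = np.random.randn(512, 1024)
A, B = truncate_weight(W, rank=64)
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, f"relative error: {rel_error:.3f}")
```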
    Multipath CNN with alpha matte inference for knee tissue segmentation from MRI. (arXiv:2109.14249v1 [eess.IV])
    (2 min) Precise segmentation of knee tissues from magnetic resonance imaging (MRI) is critical in quantitative imaging and diagnosis. Convolutional neural networks (CNNs), which are the state of the art, have limitations owing to the lack of image-specific adaptation to factors such as low tissue contrast and structural inhomogeneities, thereby leading to incomplete segmentation results. This paper presents a deep learning based automatic segmentation framework for knee tissue segmentation. A novel multipath CNN-based method is proposed, which consists of an encoder-decoder-based segmentation network in combination with a low-rank tensor-reconstructed segmentation network. Low-rank reconstruction in MRI tensor sub-blocks is introduced to exploit the structural and morphological variations in knee tissues. To further improve the segmentation from CNNs, trimap generation, which effectively utilizes superimposed regions, is proposed for defining high-, medium- and low-confidence regions from the multipath CNNs. The secondary path with low-rank reconstructed input mitigates the conditions in which the primary segmentation network can potentially fail and overlook the boundary regions. The outcome of the segmentation is solved as an alpha matting problem by blending the trimap with the source input. Experiments on Osteoarthritis Initiative (OAI) datasets and a self-prepared scan validate the effectiveness of the proposed method. We also demonstrate the application of the proposed method in a cartilage segmentation based thickness map for diagnosis purposes.
    Identity-Expression Ambiguity in 3D Morphable Face Models. (arXiv:2109.14203v1 [cs.CV])
    (2 min) 3D Morphable Models are a class of generative models commonly used to model faces. They are typically applied to ill-posed problems such as 3D reconstruction from 2D data. Several ambiguities in this problem's image formation process have been studied explicitly. We demonstrate that non-orthogonality of the variation in identity and expression can cause identity-expression ambiguity in 3D Morphable Models, and that in practice expression and identity are far from orthogonal and can explain each other surprisingly well. Whilst previously reported ambiguities only arise in an inverse rendering setting, identity-expression ambiguity emerges in the 3D shape generation process itself. We demonstrate this effect with 3D shapes directly as well as through an inverse rendering task, and use two popular models built from high quality 3D scans as well as a model built from a large collection of 2D images and videos. We explore this issue's implications for inverse rendering and observe that it cannot be resolved by a purely statistical prior on identity and expression deformations.
    Y-GAN: Learning Dual Data Representations for Efficient Anomaly Detection. (arXiv:2109.14020v1 [cs.CV])
    (2 min) We propose a novel reconstruction-based model for anomaly detection, called Y-GAN. The model consists of a Y-shaped auto-encoder and represents images in two separate latent spaces. The first captures meaningful image semantics, key for representing (normal) training data, whereas the second encodes low-level residual image characteristics. To ensure the dual representations encode mutually exclusive information, a disentanglement procedure is designed around a latent (proxy) classifier. Additionally, a novel consistency loss is proposed to prevent information leakage between the latent spaces. The model is trained in a one-class learning setting using normal training data only. Due to the separation of semantically relevant and residual information, Y-GAN is able to derive informative data representations that allow for efficient anomaly detection across a diverse set of anomaly detection tasks. The model is evaluated in comprehensive experiments with several recent anomaly detection models on four popular datasets, i.e., MNIST, FMNIST, CIFAR10, and PlantVillage.
    Deep Unrolled Recovery in Sparse Biological Imaging. (arXiv:2109.14025v1 [cs.CV])
    (2 min) Deep algorithm unrolling has emerged as a powerful model-based approach to develop deep architectures that combine the interpretability of iterative algorithms with the performance gains of supervised deep learning, especially in cases of sparse optimization. This framework is well-suited to applications in biological imaging, where physics-based models exist to describe the measurement process and the information to be recovered is often highly structured. Here, we review the method of deep unrolling, and show how it improves source localization in several biological imaging settings.
    RobFR: Benchmarking Adversarial Robustness on Face Recognition. (arXiv:2007.04118v2 [cs.CV] UPDATED)
    (2 min) Face recognition (FR) has recently made substantial progress and achieved high accuracy on standard benchmarks. However, it has raised security concerns in numerous FR applications because deep CNNs are unusually vulnerable to adversarial examples, and there is still a lack of comprehensive robustness evaluation before an FR model is deployed in safety-critical scenarios. To facilitate a better understanding of adversarial vulnerability in FR, we develop an adversarial robustness evaluation library for FR named RobFR, which serves as a reference for evaluating the robustness of downstream tasks. Specifically, RobFR involves 15 popular naturally trained FR models, 9 models with representative defense mechanisms and 2 commercial FR API services, and performs robustness evaluation by using various adversarial attacks as an important surrogate. The evaluations are conducted under diverse adversarial settings in terms of dodging and impersonation, L2 and L-infinity norms, as well as white-box and black-box attacks. We further propose a landmark-guided cutout (LGC) attack method to improve the transferability of adversarial examples for black-box attacks by considering the special characteristics of FR. Based on large-scale evaluations, the commercial FR API services fail to exhibit acceptable performance on robustness evaluation, and we also draw several important conclusions for understanding the adversarial robustness of FR models and provide insights for the design of robust FR models. RobFR is open-source and maintains all extendable modules, i.e., Datasets, FR Models, Attacks & Defenses, and Evaluations, at https://github.com/ShawnXYang/Face-Robustness-Benchmark, which will be continuously updated to promote future research on robust FR.
    Self-Supervised Learning for 3D Medical Image Analysis using 3D SimCLR and Monte Carlo Dropout. (arXiv:2109.14288v1 [cs.LG])
    (2 min) Self-supervised learning methods can be used to learn meaningful representations from unlabeled data that can be transferred to supervised downstream tasks to reduce the need for labeled data. In this paper, we propose a 3D self-supervised method that is based on the contrastive (SimCLR) method. Additionally, we show that employing Bayesian neural networks (with Monte-Carlo Dropout) during the inference phase can further enhance the results on the downstream tasks. We showcase our models on two medical imaging segmentation tasks: i) Brain Tumor Segmentation from 3D MRI, ii) Pancreas Tumor Segmentation from 3D CT. Our experimental results demonstrate the benefits of our proposed methods in both downstream data-efficiency and performance.
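    Monte-Carlo Dropout itself is simple to sketch: keep dropout active at inference and average several stochastic forward passes, using the spread across passes as an uncertainty estimate. The example below is a generic, hypothetical PyTorch sketch with a toy classifier, independent of the paper's 3D SimCLR pre-trained segmentation network.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, num_samples=20):
    """Monte-Carlo Dropout inference: dropout stays active while the rest of
    the model is in eval mode; predictions are averaged over several passes."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):   # volumetric models would also include nn.Dropout3d
            module.train()                   # keep dropout stochastic at inference
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(x), dim=1) for _ in range(num_samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # mean prediction, per-output uncertainty

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 3))
mean, std = mc_dropout_predict(model, torch.randn(4, 16))
print(mean.shape, std.shape)
```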
    Real-time Multi-Adaptive-Resolution-Surfel 6D LiDAR Odometry using Continuous-time Trajectory Optimization. (arXiv:2105.02010v2 [cs.RO] UPDATED)
    (2 min) Simultaneous Localization and Mapping (SLAM) is an essential capability for autonomous robots, but due to high data rates of 3D LiDARs real-time SLAM is challenging. We propose a real-time method for 6D LiDAR odometry. Our approach combines a continuous-time B-Spline trajectory representation with a Gaussian Mixture Model (GMM) formulation to jointly align local multi-resolution surfel maps. Sparse voxel grids and permutohedral lattices ensure fast access to map surfels, and an adaptive resolution selection scheme effectively speeds up registration. A thorough experimental evaluation shows the performance of our approach on multiple datasets and during real-robot experiments.
    VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. (arXiv:2105.09996v2 [cs.CV] UPDATED)
    (2 min) We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pretraining masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
    On Brightness Agnostic Adversarial Examples Against Face Recognition Systems. (arXiv:2109.14205v1 [cs.CV])
    (2 min) This paper introduces a novel adversarial example generation method against face recognition systems (FRSs). An adversarial example (AX) is an image with deliberately crafted noise to cause incorrect predictions by a target system. The AXs generated from our method remain robust under real-world brightness changes. Our method performs non-linear brightness transformations while leveraging the concept of curriculum learning during the attack generation procedure. We demonstrate that our method outperforms conventional techniques from comprehensive experimental investigations in the digital and physical world. Furthermore, this method enables practical risk assessment of FRSs against brightness agnostic AXs.
    Localizing Objects with Self-Supervised Transformers and no Labels. (arXiv:2109.14279v1 [cs.CV])
    (2 min) Localizing objects in image collections without supervision can help to avoid expensive annotation campaigns. We propose a simple approach to this problem, that leverages the activation features of a vision transformer pre-trained in a self-supervised manner. Our method, LOST, does not require any external object proposal nor any exploration of the image collection; it operates on a single image. Yet, we outperform state-of-the-art object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic detector on the discovered objects boosts results by another 7 points. Moreover, we show promising results on the unsupervised object discovery task. The code to reproduce our results can be found at https://github.com/valeoai/LOST.
    VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. (arXiv:2109.14084v1 [cs.CV])
    (2 min) We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
    The 2nd Anti-UAV Workshop & Challenge: Methods and Results. (arXiv:2108.09909v3 [cs.CV] UPDATED)
    (2 min) The 2nd Anti-UAV Workshop & Challenge aims to encourage research in developing novel and accurate methods for multi-scale object tracking. The Anti-UAV dataset used for the Anti-UAV Challenge has been publicly released. There are two subsets in the dataset, i.e., the test-dev subset and the test-challenge subset. Both subsets consist of 140 thermal infrared video sequences, spanning multiple occurrences of multi-scale UAVs. Around 24 participating teams from around the globe competed in the 2nd Anti-UAV Challenge. In this paper, we provide a brief summary of the 2nd Anti-UAV Workshop & Challenge, including brief introductions to the top three methods. The submission leaderboard will be reopened for researchers interested in the Anti-UAV challenge. The benchmark dataset and other information can be found at: https://anti-uav.github.io/.
    A Seamless and High-Performance Out-of-Distribution Detection Approach Simply Replacing the SoftMax Loss. (arXiv:2105.14399v4 [cs.LG] UPDATED)
    (2 min) Current out-of-distribution detection approaches usually present special requirements (e.g., collecting outlier data and hyperparameter validation) and produce side effects (classification accuracy drop and slow/inefficient inferences). Recently, entropic out-of-distribution detection has been proposed as a seamless approach (i.e., a solution that avoids all the previously mentioned drawbacks). The entropic out-of-distribution detection solution comprises the IsoMax loss for training and the entropic score for out-of-distribution detection. The IsoMax loss works as a SoftMax loss drop-in replacement because swapping the SoftMax loss with the IsoMax loss requires no changes in the model's architecture or training procedures/hyperparameters. In this paper, we propose to perform what we call an isometrization of the distances used in the IsoMax loss. Additionally, we propose to replace the entropic score with the minimum distance score. Our experiments showed that these simple modifications increase out-of-distribution detection performance while keeping the solution seamless. Besides being competitive with or outperforming all major current approaches, our solution avoids all their current limitations in addition to being much easier to use, as just a simple loss replacement for training the neural network is required. Code available at https://github.com/dlmacedo/entropic-out-of-distribution-detection.
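    For intuition, the hedged sketch below computes two such detection scores: a negative-entropy ("entropic") score over predictive probabilities, and a minimum-distance score with respect to a set of class prototypes, as an IsoMax-style head might use. The exact formulations in the authors' code may differ; the function names and toy tensors are assumptions.

```python
import torch
import torch.nn.functional as F

def entropic_score(logits):
    """Negative Shannon entropy of the predictive distribution:
    higher means more confident, i.e. more likely in-distribution."""
    probs = F.softmax(logits, dim=1)
    return (probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1)

def min_distance_score(features, prototypes):
    """Negative minimum Euclidean distance to class prototypes;
    higher (closer to a prototype) suggests in-distribution."""
    dists = torch.cdist(features, prototypes)   # (batch, num_classes)
    return -dists.min(dim=1).values

logits = torch.randn(5, 10)
features, prototypes = torch.randn(5, 64), torch.randn(10, 64)
print(entropic_score(logits).shape, min_distance_score(features, prototypes).shape)
```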
    An Information Theory-inspired Strategy for Automatic Network Pruning. (arXiv:2108.08532v2 [cs.CV] UPDATED)
    (2 min) Despite superior performance on many computer vision tasks, deep convolutional neural networks typically need to be compressed for deployment on devices that have resource constraints. Most existing network pruning methods require laborious human effort and prohibitive computation resources, especially when the constraints are changed. This practically limits the application of model compression when the model needs to be deployed on a wide range of devices. Besides, existing methods still lack theoretical guidance. In this paper we propose an information theory-inspired strategy for automatic model compression. The principle behind our method is the information bottleneck theory, i.e., hidden representations should compress the information they share with one another. We thus introduce the normalized Hilbert-Schmidt Independence Criterion (nHSIC) on network activations as a stable and generalized indicator of layer importance. When a certain resource constraint is given, we integrate the HSIC indicator with the constraint to transform the architecture search problem into a linear programming problem with quadratic constraints. Such a problem is easily solved by a convex optimization method within a few seconds. We also provide a rigorous proof to reveal that optimizing the normalized HSIC simultaneously minimizes the mutual information between different layers. Without any search process, our method achieves better compression tradeoffs compared to state-of-the-art compression algorithms. For instance, with ResNet-50, we achieve a 45.3% FLOPs reduction with 75.75% top-1 accuracy on ImageNet. Code is available at https://github.com/MAC-AutoML/ITPruner/tree/master.
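    As a rough illustration of the indicator, the sketch below computes a normalized HSIC between two activation matrices using linear kernels (essentially linear CKA); the paper's exact kernel choice and normalization may differ, and the random matrices are placeholders.

```python
import numpy as np

def hsic(K, L):
    """Biased HSIC estimator between two kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def normalized_hsic(X, Y):
    """nHSIC between two activation matrices (samples x features), linear kernels."""
    K, L = X @ X.T, Y @ Y.T
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L) + 1e-12)

X, Y = np.random.randn(100, 64), np.random.randn(100, 128)
print(normalized_hsic(X, Y))    # close to 0 for independent random activations
```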
    Compressive Visual Representations. (arXiv:2109.12909v2 [cs.LG] UPDATED)
    (2 min) Learning effective visual representations that generalize well without human supervision is a fundamental problem in order to apply Machine Learning to a wide variety of tasks. Recently, two families of self-supervised methods, contrastive learning and latent bootstrapping, exemplified by SimCLR and BYOL respectively, have made significant progress. In this work, we hypothesize that adding explicit information compression to these algorithms yields better and more robust representations. We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks. Furthermore, we explore the relationship between Lipschitz continuity and compression, showing a tractable lower bound on the Lipschitz constant of the encoders we learn. As Lipschitz continuity is closely related to robustness, this provides a new explanation for why compressed models are more robust. Our experiments confirm that adding compression to SimCLR and BYOL significantly improves linear evaluation accuracies and model robustness across a wide range of domain shifts. In particular, the compressed version of BYOL achieves 76.0% Top-1 linear evaluation accuracy on ImageNet with ResNet-50, and 78.8% with ResNet-50 2x.
    Positional Contrastive Learning for Volumetric Medical Image Segmentation. (arXiv:2106.09157v3 [cs.CV] UPDATED)
    (2 min) The success of deep learning heavily depends on the availability of large labeled training sets. However, it is hard to obtain large labeled datasets in the medical image domain because of strict privacy concerns and costly labeling efforts. Contrastive learning, an unsupervised learning technique, has proven powerful in learning image-level representations from unlabeled data. The learned encoder can then be transferred or fine-tuned to improve the performance of downstream tasks with limited labels. A critical step in contrastive learning is the generation of contrastive data pairs, which is relatively simple for natural image classification but quite challenging for medical image segmentation due to the existence of the same tissue or organ across the dataset. As a result, when applied to medical image segmentation, most state-of-the-art contrastive learning frameworks inevitably introduce many false-negative pairs and result in degraded segmentation quality. To address this issue, we propose a novel positional contrastive learning (PCL) framework to generate contrastive data pairs by leveraging the position information in volumetric medical images. Experimental results on CT and MRI datasets demonstrate that the proposed PCL method can substantially improve the segmentation performance compared to existing methods in both the semi-supervised and transfer learning settings.
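    One way to picture position-based pairing is the NumPy sketch below, which turns normalized slice positions into positive/negative masks for a contrastive loss. The `threshold` value and the exact pairing rule are illustrative assumptions rather than the paper's definition.

```python
import numpy as np

def positional_pairs(slice_positions, threshold=0.1):
    """Hypothetical pair assignment for positional contrastive learning:
    two slices (possibly from different volumes) count as a positive pair
    when their normalized axial positions are within `threshold`, and as a
    negative pair otherwise."""
    pos = np.asarray(slice_positions, dtype=float)  # each in [0, 1]
    diff = np.abs(pos[:, None] - pos[None, :])
    positive_mask = (diff < threshold) & ~np.eye(len(pos), dtype=bool)
    negative_mask = diff >= threshold
    return positive_mask, negative_mask

# Usage: feed the masks into an InfoNCE-style loss instead of the usual
# "two augmentations of the same image" positive definition.
```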
    Domain Adaptive Cascade R-CNN for MItosis DOmain Generalization (MIDOG) Challenge. (arXiv:2109.00965v2 [eess.IV] UPDATED)
    (2 min) We present a summary of the domain adaptive cascade R-CNN method for mitosis detection of digital histopathology images. By comprehensive data augmentation and adapting existing popular detection architecture, our proposed method has achieved an F1 score of 0.7500 on the preliminary test set in MItosis DOmain Generalization (MIDOG) Challenge at MICCAI 2021.
    flexgrid2vec: Learning Efficient Visual Representations Vectors. (arXiv:2007.15444v6 [cs.CV] UPDATED)
    (2 min) We propose flexgrid2vec, a novel approach for image representation learning. Existing visual representation methods suffer from several issues, including the need for highly intensive computation, the risk of losing in-depth structural information, and the specificity of the method to certain shapes or objects. flexgrid2vec converts an image to a low-dimensional feature vector. We represent each image with a graph of flexible, unique node locations and edge distances. flexgrid2vec is a multi-channel GCN that learns features of the most representative image patches. We have investigated both spectral and non-spectral implementations of the GCN node embedding. Specifically, we have implemented flexgrid2vec based on different node-aggregation methods, such as vector summation, concatenation and normalisation with eigenvector centrality. We compare the performance of flexgrid2vec with a set of state-of-the-art visual representation learning models on binary and multi-class image classification tasks. Although we utilise imbalanced, small and low-resolution datasets, flexgrid2vec shows stable and outstanding results against well-known baseline classifiers. flexgrid2vec achieves 96.23% on CIFAR-10, 83.05% on CIFAR-100, 94.50% on STL-10, 98.8% on ASIRRA and 89.69% on the COCO dataset.
    On the importance of cross-task features for class-incremental learning. (arXiv:2106.11930v3 [cs.LG] UPDATED)
    (2 min) In class-incremental learning, an agent with limited resources needs to learn a sequence of classification tasks, forming an ever-growing classification problem, with the constraint of not being able to access data from previous tasks. The main difference from task-incremental learning, where a task-ID is available at inference time, is that the learner also needs to perform cross-task discrimination, i.e., distinguish between classes that have not been seen together. Approaches to tackle this problem are numerous and mostly make use of an external memory (buffer) of non-negligible size. In this paper, we ablate the learning of cross-task features and study its influence on the performance of basic replay strategies used for class-IL. We also define a new forgetting measure for class-incremental learning, and see that forgetting is not the principal cause of low performance. Our experimental results show that future algorithms for class-incremental learning should not only prevent forgetting, but also aim to improve the quality of the cross-task features and the knowledge transfer between tasks. This is especially important when tasks contain a limited amount of data.
    GTNet: Guided Transformer Network for Detecting Human-Object Interactions. (arXiv:2108.00596v3 [cs.CV] UPDATED)
    (2 min) The human-object interaction (HOI) detection task refers to localizing humans, localizing objects, and predicting the interactions between each human-object pair. HOI is considered one of the fundamental steps in truly understanding complex visual scenes. For detecting HOI, it is important to utilize relative spatial configurations and object semantics to find salient spatial regions of images that highlight the interactions between human-object pairs. This issue is addressed by the novel self-attention based guided transformer network, GTNet. GTNet encodes this spatial contextual information in human and object visual features via self-attention while achieving state-of-the-art results on both the V-COCO and HICO-DET datasets. Code will be made available online.
    Integrating Contextual Knowledge to Visual Features for Fine Art Classification. (arXiv:2105.15028v2 [cs.CV] UPDATED)
    (2 min) Automatic art analysis has seen an ever-increasing interest from the pattern recognition and computer vision community. However, most of the current work is mainly based solely on digitized artwork images, sometimes supplemented with some metadata and textual comments. A knowledge graph that integrates a rich body of information about artworks, artists, painting schools, etc., in a unified structured framework can provide a valuable resource for more powerful information retrieval and knowledge discovery tools in the artistic domain. To this end, this paper presents ArtGraph: an artistic knowledge graph based on WikiArt and DBpedia. The graph, implemented in Neo4j, already provides knowledge discovery capabilities without having to train a learning system. In addition, the embeddings extracted from the graph are used to inject "contextual" knowledge into a deep learning model to improve the accuracy of artwork attribute prediction tasks.
    ToAlign: Task-oriented Alignment for Unsupervised Domain Adaptation. (arXiv:2106.10812v2 [cs.CV] UPDATED)
    (2 min) Unsupervised domain adaptive classification intends to improve the classification performance on an unlabeled target domain. To alleviate the adverse effect of domain shift, many approaches align the source and target domains in the feature space. However, a feature is usually taken as a whole for alignment without explicitly making domain alignment proactively serve the classification task, leading to a sub-optimal solution. What sub-feature should be aligned for better adaptation is under-explored. In this paper, we propose an effective Task-oriented Alignment (ToAlign) for unsupervised domain adaptation (UDA). We study what features should be aligned across domains and propose to make the domain alignment proactively serve classification by performing feature decomposition and alignment under the guidance of the prior knowledge induced from the classification task itself. Particularly, we explicitly decompose a feature in the source domain into a task-related/discriminative feature that should be aligned, and a task-irrelevant feature that should be avoided/ignored, based on the classification meta-knowledge. Extensive experimental results on various benchmarks (e.g., Office-Home, Visda-2017, and DomainNet) under different domain adaptation settings demonstrate the effectiveness of ToAlign, which helps achieve state-of-the-art performance.
    Attention-Guided Black-box Adversarial Attacks with Large-Scale Multiobjective Evolutionary Optimization. (arXiv:2101.07512v2 [cs.CV] UPDATED)
    (2 min) Fooling deep neural networks (DNNs) with black-box optimization has become a popular adversarial attack approach, as the structural prior knowledge of DNNs is usually unknown. Nevertheless, recent black-box adversarial attacks may struggle to balance their attack ability and the visual quality of the generated adversarial examples (AEs) when tackling high-resolution images. In this paper, we propose an attention-guided black-box adversarial attack based on large-scale multiobjective evolutionary optimization, termed LMOA. By considering the spatial semantic information of images, we firstly take advantage of the attention map to determine the perturbed pixels. Instead of attacking the entire image, reducing the perturbed pixels with the attention mechanism helps to avoid the notorious curse of dimensionality and thereby improves attack performance. Secondly, a large-scale multiobjective evolutionary algorithm is employed to traverse the reduced pixels in the salient region. Benefiting from its characteristics, the generated AEs have the potential to fool target DNNs while being imperceptible to human vision. Extensive experimental results have verified the effectiveness of the proposed LMOA on the ImageNet dataset. More importantly, it is more competitive in generating high-resolution AEs with better visual quality than existing black-box adversarial attacks.
    Vision-Guided Quadrupedal Locomotion in the Wild with Multi-Modal Delay Randomization. (arXiv:2109.14549v1 [cs.RO])
    (2 min) Developing robust vision-guided controllers for quadrupedal robots in complex environments, with various obstacles, dynamic surroundings and uneven terrain, is very challenging. While Reinforcement Learning (RL) provides a promising paradigm for agile locomotion skills with vision inputs in simulation, it is still very challenging to deploy the RL policy in the real world. Our key insight is that, aside from the domain gap in visual appearance between simulation and the real world, the latency from the control pipeline is also a major cause of difficulty. In this paper, we propose Multi-Modal Delay Randomization (MMDR) to address this issue when training RL agents. Specifically, we simulate the latency of real hardware by using past observations, sampled with randomized periods, for both proprioception and vision. We train the RL policy for end-to-end control in a physical simulator without any predefined controller or reference motion, and directly deploy it on a real A1 quadruped robot running in the wild. We evaluate our method in different outdoor environments with complex terrains and obstacles. We demonstrate that the robot can smoothly maneuver at high speed, avoid obstacles, and show significant improvement over the baselines. Our project page with videos is at https://mehooz.github.io/mmdr-wild/.
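    As a rough illustration of "past observations sampled with randomized periods", here is a small Python sketch of a per-modality delay buffer. The buffer sizes, the uniform delay distribution, and the class/method names are assumptions, not the paper's exact implementation.

```python
import random
from collections import deque

class DelayRandomizedObservation:
    """Sketch of multi-modal delay randomization: keep short histories of
    proprioceptive and visual observations and, at every control step,
    return observations sampled a randomized number of steps in the past,
    mimicking real-world sensing/control latency."""
    def __init__(self, max_delay_prop=3, max_delay_vision=6):
        self.prop_hist = deque(maxlen=max_delay_prop + 1)
        self.vision_hist = deque(maxlen=max_delay_vision + 1)

    def step(self, proprioception, vision):
        self.prop_hist.append(proprioception)
        self.vision_hist.append(vision)
        # Sample an independent, randomized delay for each modality.
        dp = random.randint(0, len(self.prop_hist) - 1)
        dv = random.randint(0, len(self.vision_hist) - 1)
        return self.prop_hist[-1 - dp], self.vision_hist[-1 - dv]
```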
    Generative Probabilistic Image Colorization. (arXiv:2109.14518v1 [cs.CV])
    (2 min) We propose Generative Probabilistic Image Colorization, a diffusion-based generative process that trains a sequence of probabilistic models to reverse each step of noise corruption. Given a line-drawing image as input, our method suggests multiple candidate colorized images. Therefore, our method accounts for the ill-posed nature of the colorization problem. We conducted comprehensive experiments investigating the colorization of line-drawing images, report the influence of a score-based MCMC approach that corrects the marginal distribution of estimated samples, and further compare different combinations of models and the similarity of their generated images. Despite using only a relatively small training dataset, we experimentally develop a method to generate multiple diverse colorization candidates which avoids mode collapse and does not require any additional constraints, losses, or re-training with alternative training conditions. Our proposed approach performed well not only on color-conditional image generation tasks using biased initial values, but also on some practical image completion and inpainting tasks.
    LGVTON: A Landmark Guided Approach to Virtual Try-On. (arXiv:2004.00562v2 [cs.CV] UPDATED)
    (2 min) In this paper, we propose a Landmark Guided Virtual Try-On (LGVTON) method for clothes, which aims to solve the problem of clothing trials on e-commerce websites. Given the images of two people, a person and a model, it generates a rendition of the person wearing the clothes of the model. This is useful considering the fact that on most e-commerce websites images of clothes alone are usually not available. We follow a three-stage approach to achieve our objective. In the first stage, LGVTON warps the clothes of the model using a Thin-Plate Spline (TPS) based transformation to fit the person. Unlike previous TPS-based methods, we use the landmarks (of human and clothes) to compute the TPS transformation. This enables the warping to work independently of the complex patterns, such as stripes, florals, and textures, present on the clothes. However, this computed warp may not always be very precise. We therefore further refine it in the subsequent stages with the help of a mask generator (Stage 2) and an image synthesizer (Stage 3). The mask generator improves the fit of the warped clothes, and the image synthesizer ensures a realistic output. To tackle the problem of lack of paired training data, we resort to a self-supervised training strategy. Here paired data refers to the image pair of model and person wearing the same clothes. We compare LGVTON with four existing methods on two popular fashion datasets, namely MPV and DeepFashion, using two performance measures, FID (Fr\'echet Inception Distance) and SSIM (Structural Similarity Index). The proposed method in most cases outperforms the state-of-the-art methods.
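    Since the Stage-1 warp is a thin-plate-spline transform computed from landmark correspondences, the SciPy sketch below shows one way a landmark-driven TPS warp can be realized. The function name, the use of scipy.interpolate.RBFInterpolator with a thin-plate-spline kernel, and the backward-mapping/resampling details are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def tps_warp(image, src_landmarks, dst_landmarks):
    """Warp an H x W x C source (clothes) image so that src_landmarks land on
    dst_landmarks (person). Landmarks are (N, 2) arrays of (x, y), N >= 3.
    Fits a TPS mapping from destination back to source (backward mapping),
    then resamples the source image on a dense pixel grid."""
    tps = RBFInterpolator(dst_landmarks, src_landmarks, kernel='thin_plate_spline')
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    grid = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(float)
    src_xy = tps(grid)  # where each output pixel samples from in the source
    coords = [src_xy[:, 1].reshape(h, w), src_xy[:, 0].reshape(h, w)]  # (row, col)
    warped = np.stack(
        [map_coordinates(image[..., c], coords, order=1) for c in range(image.shape[2])],
        axis=-1)
    return warped
```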
    REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays. (arXiv:2109.14187v1 [cs.CV])
    (2 min) Deep learning has shown recent success in classifying anomalies in chest x-rays, but datasets are still small compared to natural image datasets. Supervision of abnormality localization has been shown to improve trained models, partially compensating for dataset sizes. However, explicitly labeling these anomalies requires an expert and is very time-consuming. We propose a method for collecting implicit localization data using an eye tracker to capture gaze locations and a microphone to capture a dictation of a report, imitating the setup of a reading room, and potentially scalable for large datasets. The resulting REFLACX (Reports and Eye-Tracking Data for Localization of Abnormalities in Chest X-rays) dataset was labeled by five radiologists and contains 3,032 synchronized sets of eye-tracking data and timestamped report transcriptions. We also provide bounding boxes around lungs and heart and validation labels consisting of ellipses localizing abnormalities and image-level labels. Furthermore, a small subset of the data contains readings from all radiologists, allowing for the calculation of inter-rater scores.
    Programmable Spectral Filter Arrays for Hyperspectral Imaging. (arXiv:2109.14450v1 [cs.CV])
    (2 min) Modulating the spectral dimension of light has numerous applications in computational imaging. While there are many techniques for achieving this, there are few, if any, for implementing a spatially-varying and programmable spectral filter. This paper provides an optical design for implementing such a capability. Our key insight is that spatially-varying spectral modulation can be implemented using a liquid crystal spatial light modulator since it provides an array of liquid crystal cells, each of which can be purposed to act as a programmable spectral filter array. Relying on this insight, we provide an optical schematic and an associated lab prototype for realizing the capability, as well as address the associated challenges at implementation using optical and computational innovations. We show a number of unique operating points with our prototype including single- and multi-image hyperspectral imaging, as well as its application in material identification.
    Detailed Region-Adaptive Normalization for Heavy Makeup Transfer. (arXiv:2109.14525v1 [cs.CV])
    (2 min) In recent years, facial makeup transfer has attracted growing attention due to its efficiency and flexibility in transferring makeup styles between different faces. Although recent works have achieved realistic results, most of them fail to handle heavy makeup styles with multiple colors and subtle details. Hence we propose a novel GAN model to handle heavy makeup transfer, while maintaining robustness to different poses and expressions. Firstly, a Makeup Multi-Extraction Network is introduced to learn region-wise makeup features from multiple layers. Then, a key transferring module called Detailed Region-Adaptive Normalization is proposed to fuse different levels of makeup styles in an adaptive way, greatly improving the quality of heavy makeup transfer. With the outputs from the two components, a Makeup Transfer Network is used to perform the makeup transfer. To evaluate the efficacy of our proposed method, we collected a new makeup dataset containing a wide range of heavy styles. Experiments show that our method achieves state-of-the-art results on both light and heavy makeup styles, and is robust to different poses and expressions.
    Towards Flexible Blind JPEG Artifacts Removal. (arXiv:2109.14573v1 [eess.IV])
    (2 min) Training a single deep blind model to handle different quality factors for JPEG image artifacts removal has been attracting considerable attention due to its convenience for practical usage. However, existing deep blind methods usually directly reconstruct the image without predicting the quality factor, thus lacking the flexibility to control the output that non-blind methods have. To remedy this problem, in this paper, we propose a flexible blind convolutional neural network, namely FBCNN, that can predict the adjustable quality factor to control the trade-off between artifacts removal and details preservation. Specifically, FBCNN decouples the quality factor from the JPEG image via a decoupler module and then embeds the predicted quality factor into the subsequent reconstructor module through a quality factor attention block for flexible control. Besides, we find existing methods are prone to fail on non-aligned double JPEG images even with only a one-pixel shift, and we thus propose a double JPEG degradation model to augment the training data. Extensive experiments on single JPEG images, more general double JPEG images, and real-world JPEG images demonstrate that our proposed FBCNN achieves favorable performance against state-of-the-art methods in terms of both quantitative metrics and visual quality.
    CCTrans: Simplifying and Improving Crowd Counting with Transformer. (arXiv:2109.14483v1 [cs.CV])
    (2 min) Most recent methods used for crowd counting are based on the convolutional neural network (CNN), which has a strong ability to extract local features. But CNNs inherently fail in modeling the global context due to their limited receptive fields. However, the transformer can model the global context easily. In this paper, we propose a simple approach called CCTrans to simplify the design pipeline. Specifically, we utilize a pyramid vision transformer backbone to capture the global crowd information, a pyramid feature aggregation (PFA) model to combine low-level and high-level features, and an efficient regression head with multi-scale dilated convolution (MDC) to predict density maps. Besides, we tailor the loss functions for our pipeline. Without bells and whistles, extensive experiments demonstrate that our method achieves new state-of-the-art results on several benchmarks in both weakly- and fully-supervised crowd counting. Moreover, we currently rank No. 1 on the leaderboard of NWPU-Crowd. Our code will be made available.
    AutoPhaseNN: Unsupervised Physics-aware Deep Learning of 3D Nanoscale Coherent Imaging. (arXiv:2109.14053v1 [physics.app-ph])
    (2 min) The problem of phase retrieval, or the algorithmic recovery of lost phase information from measured intensity alone, underlies various imaging methods from astronomy to nanoscale imaging. Traditional methods of phase retrieval are iterative in nature, and are therefore computationally expensive and time consuming. More recently, deep learning (DL) models have been developed to either provide learned priors to iterative phase retrieval or, in some cases, completely replace phase retrieval with networks that learn to recover the lost phase information from measured intensity alone. However, such models require vast amounts of labeled data, which can only be obtained through simulation or by performing computationally prohibitive phase retrieval on hundreds or even thousands of experimental datasets. Using a 3D nanoscale X-ray imaging modality (Bragg Coherent Diffraction Imaging, or BCDI) as a representative technique, we demonstrate AutoPhaseNN, a DL-based approach which learns to solve the phase problem without labeled data. By incorporating the physics of the imaging technique into the DL model during training, AutoPhaseNN learns to invert 3D BCDI data from reciprocal space to real space in a single shot without ever being shown real space images. Once trained, AutoPhaseNN is about one hundred times faster than traditional iterative phase retrieval methods while providing comparable image quality.
    One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective. (arXiv:2109.14449v1 [cs.CV])
    (2 min) A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer, and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at https://github.com/kamwoh/orthohash
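    The single-objective idea can be sketched as follows: assign each class a fixed binary (+/-1) target code whose rows are mutually orthogonal (e.g., from a Hadamard matrix) and maximize the cosine similarity between the network's continuous code and the target code of its class. The class name, the Hadamard construction, and the omission of scaling, label smoothing, and the exact BN placement are simplifying assumptions over the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.linalg import hadamard

class SingleCosineHashLoss(nn.Module):
    """Sketch of a one-loss hashing objective: 1 - cos(z, t_y), where t_y is
    the fixed binary orthogonal target code of class y.
    Note: code_length must be a power of two and num_classes <= code_length."""
    def __init__(self, num_classes, code_length):
        super().__init__()
        codes = torch.from_numpy(hadamard(code_length)).float()  # rows are orthogonal, entries +/-1
        self.register_buffer('targets', codes[:num_classes])     # (num_classes, code_length)

    def forward(self, continuous_codes, labels):
        # Minimized when each continuous code aligns with its class target code.
        target = self.targets[labels]
        return (1.0 - F.cosine_similarity(continuous_codes, target, dim=1)).mean()
```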
    From Beginner to Master: A Survey for Deep Learning-based Single-Image Super-Resolution. (arXiv:2109.14335v1 [eess.IV])
    (2 min) Single-image super-resolution (SISR) is an important task in image processing, which aims to enhance the resolution of imaging systems. Recently, SISR has made a huge leap and has achieved promising results with the help of deep learning (DL). In this survey, we give an overview of DL-based SISR methods and group them according to their targets, such as reconstruction efficiency, reconstruction accuracy, and perceptual accuracy. Specifically, we first introduce the problem definition, research background, and the significance of SISR. Secondly, we introduce some related works, including benchmark datasets, upsampling methods, optimization objectives, and image quality assessment methods. Thirdly, we provide a detailed investigation of SISR and give some domain-specific applications of it. Fourthly, we present the reconstruction results of some classic SISR methods to give an intuitive sense of their performance. Finally, we discuss some issues that still exist in SISR and summarize some new trends and future directions. This is an exhaustive survey of SISR, which can help researchers better understand SISR and inspire more exciting research in this field. An investigation project for SISR is provided at https://github.com/CV-JunchengLi/SISR-Survey.
    Unsupervised Domain Adaptation in Semantic Segmentation Based on Pixel Alignment and Self-Training. (arXiv:2109.14219v1 [cs.CV])
    (2 min) This paper proposes an unsupervised cross-modality domain adaptation approach based on pixel alignment and self-training (PAST). Pixel alignment transfers ceT1 scans to the hrT2 modality, helping to reduce domain shift when training the segmentation model. Self-training adapts the decision boundary of the segmentation network to fit the distribution of hrT2 scans. Experimental results show that PAST has outperformed the non-UDA baseline significantly, and it achieved rank 2 on the CrossMoDA validation-phase leaderboard with a mean Dice score of 0.8395.
    An Automatic System to Monitor the Physical Distance and Face Mask Wearing of Construction Workers in COVID-19 Pandemic. (arXiv:2101.01373v3 [cs.CV] UPDATED)
    (3 min) The COVID-19 pandemic has caused many shutdowns in different industries around the world. Sectors such as infrastructure construction and maintenance projects have not been suspended due to their significant effect on people's routine life. In such projects, workers work close together, which creates a high risk of infection. The World Health Organization recommends wearing a face mask and practicing physical distancing to mitigate the virus's spread. This paper develops a computer vision system to automatically detect violations of face mask wearing and physical distancing among construction workers to assure their safety on infrastructure projects during the pandemic. For face mask detection, we collected and annotated 1,000 images, including different types of face mask wearing, and added them to a pre-existing face mask dataset to develop a dataset of 1,853 images. We then trained and tested multiple TensorFlow state-of-the-art object detection models on the face mask dataset and chose the Faster R-CNN Inception ResNet V2 network, which yielded an accuracy of 99.8%. For physical distance detection, we employed Faster R-CNN Inception V2 to detect people. A transformation matrix was used to eliminate the effect of the camera angle on the object distances in the image. The Euclidean distance on the transformed image was then used to compute the actual distance between people, and a threshold of six feet was considered to capture physical distance violations. We also used transfer learning for training the model. The final model was applied to four videos of road maintenance projects in Houston, TX, and effectively detected face mask and physical distance violations. We recommend that construction owners use the proposed system to enhance construction workers' safety during the pandemic.
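    To make the transformation-matrix step concrete, here is a small OpenCV sketch that maps image points to hypothetical ground-plane coordinates with a perspective transform and flags pairs of people closer than six feet. The calibration points, coordinate units, and function names are invented placeholders, not values or code from the paper.

```python
import numpy as np
import cv2
from itertools import combinations

# Four image points of a known ground rectangle and their real-world
# coordinates in feet (placeholder values for illustration only).
img_pts = np.float32([[420, 710], [1490, 705], [1210, 380], [640, 385]])
world_pts = np.float32([[0, 0], [20, 0], [20, 40], [0, 40]])
H = cv2.getPerspectiveTransform(img_pts, world_pts)

def violates_distance(feet_points_px, threshold_ft=6.0):
    """Project detected people's foot points to the ground plane and flag
    pairs closer than the six-foot threshold."""
    pts = np.float32(feet_points_px).reshape(-1, 1, 2)
    ground = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    violations = []
    for i, j in combinations(range(len(ground)), 2):
        if np.linalg.norm(ground[i] - ground[j]) < threshold_ft:
            violations.append((i, j))
    return violations
```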
    Three-Stream 3D/1D CNN for Fine-Grained Action Classification and Segmentation in Table Tennis. (arXiv:2109.14306v1 [cs.CV])
    (2 min) This paper proposes a fusion method of modalities extracted from video through a three-stream network with spatio-temporal and temporal convolutions for fine-grained action classification in sport. It is applied to the TTStroke-21 dataset, which consists of untrimmed videos of table tennis games. The goal is to detect and classify table tennis strokes in the videos, the first step of a bigger scheme aiming at giving feedback to players to improve their performance. The three modalities are raw RGB data, the computed optical flow and the estimated pose of the player. The network consists of three branches with attention blocks. Features are fused at the latest stage of the network using bilinear layers. Compared to previous approaches, the use of three modalities allows faster convergence and better performance on both tasks: classification of strokes with known temporal boundaries, and joint segmentation and classification. The pose is also further investigated in order to offer richer feedback to the athletes.
    Comparison of atlas-based and neural-network-based semantic segmentation for DENSE MRI images. (arXiv:2109.14116v1 [eess.IV])
    (2 min) Two segmentation methods, one atlas-based and one neural-network-based, were compared to see how well they can each automatically segment the brain stem and cerebellum in Displacement Encoding with Stimulated Echoes Magnetic Resonance Imaging (DENSE-MRI) data. The segmentation is a pre-requisite for estimating the average displacements in these regions, which have recently been proposed as biomarkers in the diagnosis of Chiari Malformation type I (CMI). In numerical experiments, the segmentations of both methods were similar to manual segmentations provided by trained experts. It was found that, overall, the neural-network-based method alone produced more accurate segmentations than the atlas-based method did alone, but that a combination of the two methods -- in which the atlas-based method is used for the segmentation of the brain stem and the neural-network is used for the segmentation of the cerebellum -- may be the most successful.
    TSAMT: Time-Series-Analysis-based Motion Transfer among Multiple Cameras. (arXiv:2109.14174v1 [cs.CV])
    (2 min) Along with advances in optical sensors comes the common practice of building imaging systems with heterogeneous cameras. While high-resolution (HR) video acquisition and analysis benefit from hybrid sensors, the intrinsic characteristics of multiple cameras lead to an interesting motion transfer problem. Unfortunately, most of the existing methods provide no theoretical analysis and require intensive training data. In this paper, we propose an algorithm using time series analysis for motion transfer among multiple cameras. Specifically, we first identify seasonality in motion data and then build an additive time series model to extract patterns that can be transferred across cameras. Our approach has a complete and clear mathematical formulation and is thus efficient and interpretable. Through quantitative evaluations on real-world data, we demonstrate the effectiveness of our method. Furthermore, our motion transfer algorithm can combine with and facilitate downstream tasks, e.g., enhancing pose estimation on LR videos with inherent patterns extracted from HR ones. Code is available at https://github.com/IndigoPurple/TSAMT.
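    For readers unfamiliar with additive time-series modeling, the following sketch applies statsmodels' seasonal_decompose to a synthetic 1-D motion signal, splitting it into trend, seasonal, and residual parts. The synthetic signal, the 30-sample period, and the idea of reusing the seasonal component are illustrative assumptions, not the paper's actual model.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic 1-D motion signal (e.g., one joint coordinate over time) with a
# slow trend, a periodic component, and noise, standing in for real data.
t = np.arange(600)
motion = pd.Series(0.01 * t + np.sin(2 * np.pi * t / 30) + 0.1 * np.random.randn(600))

# Additive decomposition: motion = trend + seasonal + residual. The seasonal
# pattern is the kind of reusable structure one might transfer across cameras.
result = seasonal_decompose(motion, model='additive', period=30)
trend, seasonal, resid = result.trend, result.seasonal, result.resid
```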
    One-shot Key Information Extraction from Document with Deep Partial Graph Matching. (arXiv:2109.13967v1 [cs.CV])
    (2 min) Automating Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios such as rapid indexing and archiving. Many existing supervised learning methods for the KIE task need a large number of labeled samples and learn separate models for different types of documents. However, collecting and labeling a large dataset is time-consuming and is not a user-friendly requirement for many cloud platforms. To overcome these challenges, we propose a deep end-to-end trainable network for one-shot KIE using partial graph matching. Contrary to previous methods, in which the learning of similarity and the solving are optimized separately, our method enables the learning of the two processes in an end-to-end framework. Existing one-shot KIE methods are either template-based or simple attention-based learning approaches that struggle to handle texts shifted beyond their desired positions by printers, as illustrated in Fig. 1. To solve this problem, we add a one-to-(at most)-one constraint such that we can find the globally optimized solution even if some texts are drifted. Further, we design a multimodal context ensemble block to boost the performance by fusing features of spatial, textual, and aspect representations. To promote research on KIE, we collected and annotated a one-shot document KIE dataset named DKIE with diverse types of images. The DKIE dataset consists of 2.5K document images captured by mobile phones in natural scenes, and it is the largest available one-shot KIE dataset to date. The results of experiments on DKIE show that our method achieves state-of-the-art performance compared with recent one-shot and supervised learning approaches. The dataset and the proposed one-shot KIE model will be released soon.
    WEDGE: Web-Image Assisted Domain Generalization for Semantic Segmentation. (arXiv:2109.14196v1 [cs.CV])
    (2 min) Domain generalization for semantic segmentation is highly demanded in real applications, where a trained model is expected to work well in previously unseen domains. One challenge lies in the lack of data which could cover the diverse distributions of the possible unseen domains for training. In this paper, we propose a WEb-image assisted Domain GEneralization (WEDGE) scheme, which is the first to exploit the diversity of web-crawled images for generalizable semantic segmentation. To explore and exploit the real-world data distributions, we collect a web-crawled dataset which presents large diversity in terms of weather conditions, sites, lighting, camera styles, etc. We also present a method which injects the style representation of the web-crawled data into the source domain on-the-fly during training, which enables the network to experience images of diverse styles with reliable labels for effective training. Moreover, we use the web-crawled dataset with predicted pseudo labels for training to further enhance the capability of the network. Extensive experiments demonstrate that our method clearly outperforms existing domain generalization techniques.
    UFO-ViT: High Performance Linear Vision Transformer without Softmax. (arXiv:2109.14382v1 [cs.CV])
    (2 min) Vision transformers have become one of the most important models for computer vision tasks. While they outperform earlier convolutional networks, complexity quadratic in $N$ is one of the major drawbacks when using traditional self-attention algorithms. Here we propose UFO-ViT (Unit Force Operated Vision Transformer), a novel method to reduce the computation of self-attention by eliminating some non-linearity. By modifying only a few lines of self-attention, UFO-ViT achieves linear complexity without degradation of performance. The proposed models outperform most transformer-based models on image classification and dense prediction tasks across most capacity regimes.
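    To show where the linear complexity comes from in softmax-free attention generally, here is a generic kernelized linear-attention sketch in PyTorch: associativity lets us compute a D x D summary of keys and values once instead of the N x N attention matrix. This is not UFO-ViT's own XNorm construction; the ELU-based feature map, tensor shapes, and function name are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic softmax-free attention with O(N) cost in the sequence length.
    q, k, v: tensors of shape (batch, heads, N, dim)."""
    q = F.elu(q) + 1.0  # positive feature map
    k = F.elu(k) + 1.0
    kv = torch.einsum('bhnd,bhne->bhde', k, v)                    # (B, H, D, D)
    z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps)
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)          # (B, H, N, D)
```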
    Multi-loss ensemble deep learning for chest X-ray classification. (arXiv:2109.14433v1 [eess.IV])
    (2 min) Class imbalance is common in medical image classification tasks, where the number of abnormal samples is fewer than the number of normal samples. The difficulty of imbalanced classification is compounded by other issues such as the size and distribution of the dataset. Reliable training of deep neural networks continues to be a major challenge in such class-imbalanced conditions. The loss function used to train a deep neural network highly impacts its performance on both balanced and imbalanced tasks. Currently, the cross-entropy loss remains the de facto loss function for balanced and imbalanced classification tasks. This loss, however, asserts equal learning on all classes, leading to the classification of most samples as the majority normal class. To provide a critical analysis of different loss functions and identify those suitable for class-imbalanced classification, we benchmark various state-of-the-art loss functions and propose novel loss functions to train a DL model and analyze its performance in a multiclass classification setting that classifies pediatric chest X-rays as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations. We also construct prediction-level and model-level ensembles of the models trained with various loss functions to improve classification performance. We performed localization studies to interpret model behavior and ensure that the individual models and their ensembles precisely learned the regions of interest showing disease manifestations when classifying the chest X-rays into their respective categories.
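    As one representative imbalance-aware alternative to plain cross-entropy, the focal loss down-weights well-classified (mostly majority-class) samples by a (1 - p_t)^gamma factor; whether it appears in the paper's exact benchmark set is an assumption, and the sketch below is only illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multiclass focal loss: -(1 - p_t)^gamma * log(p_t), where p_t is the
    predicted probability of the true class. `alpha` is an optional tensor
    of per-class weights."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:
        loss = loss * alpha.to(logits.device)[targets]
    return loss.mean()
```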
    Improved Xception with Dual Attention Mechanism and Feature Fusion for Face Forgery Detection. (arXiv:2109.14136v1 [cs.CV])
    (2 min) With the rapid development of deep learning technology, more and more deepfake face forgeries are being widely spread on social media, causing serious social concern. Face forgery detection has become a research hotspot in recent years, and many related methods have been proposed. For images with low quality and/or diverse sources, however, the detection performance of existing methods is still far from satisfactory. In this paper, we propose an improved Xception with a dual attention mechanism and feature fusion for face forgery detection. Different from the middle flow in the original Xception model, we try to capture different high-semantic features of the face images using different levels of convolution, and introduce the convolutional block attention module and feature fusion to refine and reorganize those high-semantic features. In the exit flow, we employ the self-attention mechanism and depthwise separable convolution to learn the global information and local information of the fused features separately, to improve the classification ability of the proposed model. Experimental results on three deepfake datasets demonstrate that the proposed method outperforms Xception as well as other related methods in both effectiveness and generalization ability.
    Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds. (arXiv:2109.14379v1 [cs.CV])
    (2 min) Infrared small-dim target detection is one of the key techniques in infrared search and tracking systems. Since local regions similar to infrared small-dim targets are spread over the whole background, exploring the interaction of image features across large-range dependencies to mine the difference between the target and the background is crucial for robust detection. However, existing deep learning-based methods are limited by the locality of convolutional neural networks, which impairs their ability to capture large-range dependencies. To this end, we propose a new infrared small-dim target detection method based on the transformer. We adopt the self-attention mechanism of the transformer to learn the interaction of image features over a larger range. Additionally, we design a feature enhancement module to learn more features of small-dim targets. After that, we adopt a decoder with a U-Net-like skip connection operation to obtain the detection result. Extensive experiments on two public datasets show the clear superiority of the proposed method over state-of-the-art methods.
    Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation. (arXiv:2109.14200v1 [eess.AS])
    (2 min) Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, without the units being proximal learning targets for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent to which the audiovisual aspect of LLH is supported by existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect the phonetic, syllabic, or lexical structure of the input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and the temporal characteristics of the representations. As a result, we find that representations associated...
    Semi-Supervised Segmentation of Radiation-Induced Pulmonary Fibrosis from Lung CT Scans with Multi-Scale Guided Dense Attention. (arXiv:2109.14172v1 [eess.IV])
    (2 min) Computed Tomography (CT) plays an important role in monitoring radiation-induced Pulmonary Fibrosis (PF), where accurate segmentation of the PF lesions is highly desired for diagnosis and treatment follow-up. However, the task is challenged by ambiguous boundaries, irregular shapes, and the varying positions and sizes of the lesions, as well as the difficulty of acquiring a large set of annotated volumetric images for training. To overcome these problems, we propose a novel convolutional neural network called PF-Net and incorporate it into a semi-supervised learning framework based on Iterative Confidence-based Refinement And Weighting of pseudo Labels (I-CRAWL). Our PF-Net combines 2D and 3D convolutions to deal with CT volumes with large inter-slice spacing, and uses multi-scale guided dense attention to segment complex PF lesions. For semi-supervised learning, our I-CRAWL employs pixel-level uncertainty-based confidence-aware refinement to improve the accuracy of pseudo labels of unannotated images, and uses image-level uncertainty for confidence-based image weighting to suppress low-quality pseudo labels in an iterative training process. Extensive experiments with CT scans of Rhesus Macaques with radiation-induced PF showed that: 1) PF-Net achieved higher segmentation accuracy than existing 2D, 3D and 2.5D neural networks, and 2) I-CRAWL outperformed state-of-the-art semi-supervised learning methods for the PF lesion segmentation task. Our method has the potential to improve the diagnosis of PF and the clinical assessment of side effects of radiotherapy for lung cancers.
    Neural Knitworks: Patched Neural Implicit Representation Networks. (arXiv:2109.14406v1 [cs.CV])
    (2 min) Coordinate-based Multilayer Perceptron (MLP) networks, despite being capable of learning neural implicit representations, are not performant for internal image synthesis applications. Convolutional Neural Networks (CNNs) are typically used instead for a variety of internal generative tasks, at the cost of a larger model. We propose Neural Knitwork, an architecture for neural implicit representation learning of natural images that achieves image synthesis by optimizing the distribution of image patches in an adversarial manner and by enforcing consistency between the patch predictions. To the best of our knowledge, this is the first implementation of a coordinate-based MLP tailored for synthesis tasks such as image inpainting, super-resolution, and denoising. We demonstrate the utility of the proposed technique by training on these three tasks. The results show that modeling natural images using patches, rather than pixels, produces results of higher fidelity. The resulting model requires 80% fewer parameters than alternative CNN-based solutions while achieving comparable performance and training time.
    Grouptron: Dynamic Multi-Scale Graph Convolutional Networks for Group-Aware Dense Crowd Trajectory Forecasting. (arXiv:2109.14128v1 [cs.CV])
    (2 min) Accurate, long-term forecasting of human pedestrian trajectories in highly dynamic and interactive scenes is a long-standing challenge. Recent advances in using data-driven approaches have achieved significant improvements in terms of prediction accuracy. However, the lack of group-aware analysis has limited the performance of forecasting models. This is especially apparent in highly populated scenes, where pedestrians are moving in groups and the interactions between groups are extremely complex and dynamic. In this paper, we present Grouptron, a multi-scale dynamic forecasting framework that leverages pedestrian group detection and utilizes individual-level, group-level, and scene-level information for better understanding and representation of the scenes. Our approach employs spatio-temporal clustering algorithms to identify pedestrian groups, creates spatio-temporal graphs at the individual, group, and scene levels. It then uses graph neural networks to encode dynamics at different scales and incorporates encoding across different scales for trajectory prediction. We carried out extensive comparisons and ablation experiments to demonstrate the effectiveness of our approach. Our method achieves 9.3% decrease in final displacement error (FDE) compared with state-of-the-art methods on ETH/UCY benchmark datasets, and 16.1% decrease in FDE in more crowded scenes where extensive human group interactions are more frequently present.
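    Since group detection underpins the multi-scale graphs, a minimal scikit-learn sketch of clustering pedestrian positions into groups is given below. Using DBSCAN on a single timestep, the `eps` radius, and the function name are simplifying assumptions for illustration; the paper's own clustering is spatio-temporal.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_groups(positions, eps=1.0, min_samples=2):
    """Cluster pedestrian (x, y) positions at one timestep with DBSCAN so
    that nearby pedestrians form groups; label -1 marks pedestrians not
    assigned to any group."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.asarray(positions))

# Usage: per-group graphs can then be built from pedestrians sharing a label,
# before encoding each scale with a graph neural network.
```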
    Contrastive Video-Language Segmentation. (arXiv:2109.14131v1 [cs.CV])
    (2 min) We focus on the problem of segmenting a certain object referred to by a natural language sentence in video content, which is at the core of formulating a pinpoint vision-language relation. While existing attempts mainly construct such a relation in an implicit way, i.e., grid-level multi-modal feature fusion, it has been proven problematic to distinguish semantically similar objects under this paradigm. In this work, we propose to intertwine the visual and linguistic modalities in an explicit way via a contrastive learning objective, which directly aligns the referred object with the language description and separates unreferred content apart across frames. Moreover, to remedy the degradation problem, we present two complementary hard instance mining strategies, i.e., Language-relevant Channel Filter and Relative Hard Instance Construction. They encourage the network to exclude visually distinguishable features and to focus on easily confused objects during contrastive training. Extensive experiments on two benchmarks, i.e., A2D Sentences and J-HMDB Sentences, quantitatively demonstrate the state-of-the-art performance of our method and qualitatively show more accurate distinguishment between semantically similar objects over baselines.
    Visually Grounded Concept Composition. (arXiv:2109.14115v1 [cs.CV])
    (2 min) We investigate ways to compose complex concepts in texts from primitive ones while grounding them in images. We propose Concept and Relation Graph (CRG), which builds on top of constituency analysis and consists of recursively combined concepts with predicate functions. Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning. Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy. Notably, our model can model grounded concepts forming at both the finer-grained sentence level and the coarser-grained intermediate level (or word-level). Composer leads to pronounced improvement in matching accuracy when the evaluation data has significant compound divergence from the training data.
    Geometry-Entangled Visual Semantic Transformer for Image Captioning. (arXiv:2109.14137v1 [cs.CV])
    (2 min) Recent advancements in image captioning have featured Visual-Semantic Fusion or Geometry-Aided attention refinement. However, those fusion-based models are still criticized for the lack of geometry information for inter- and intra-attention refinement. On the other side, models based on Geometry-Aided attention still suffer from the modality gap between visual and semantic information. In this paper, we introduce a novel Geometry-Entangled Visual Semantic Transformer (GEVST) network to realize the complementary advantages of Visual-Semantic Fusion and Geometry-Aided attention refinement. Concretely, a Dense-Cap model first proposes dense captions with corresponding geometry information. Then, to empower GEVST with the ability to bridge the modality gap between visual and semantic information, we build four parallel transformer encoders, VV (Pure Visual), VS (Semantic fused to Visual), SV (Visual fused to Semantic), and SS (Pure Semantic), for final caption generation. Both visual and semantic geometry features are used in the Fusion module and also in the Self-Attention module for better attention measurement. To validate our model, we conduct extensive experiments on the MS-COCO dataset; the experimental results show that our GEVST model can obtain promising performance gains.
    Competence-Aware Path Planning via Introspective Perception. (arXiv:2109.13974v1 [cs.RO])
    (2 min) Robots deployed in the real world over extended periods of time need to reason about unexpected failures, learn to predict them, and proactively take actions to avoid future failures. Existing approaches for competence-aware planning are either model-based, requiring explicit enumeration of known failure modes, or purely statistical, using state- and location-specific failure statistics to infer competence. We instead propose a structured model-free approach to competence-aware planning by reasoning about plan execution failures due to errors in perception, without requiring a priori enumeration of failure modes or location-specific failure statistics. We introduce competence-aware path planning via introspective perception (CPIP), a Bayesian framework to iteratively learn and exploit task-level competence in novel deployment environments. CPIP factorizes the competence-aware planning problem into two components. First, perception errors are learned in a model-free and location-agnostic setting via introspective perception prior to deployment in novel environments. Second, during actual deployments, the prediction of task-level failures is learned in a context-aware setting. Experiments in simulation show that the proposed CPIP approach outperforms the frequentist baseline in multiple mobile robot tasks, and is further validated via real robot experiments in an environment with perceptually challenging obstacles and terrain.
    Multi-frame Joint Enhancement for Early Interlaced Videos. (arXiv:2109.14151v1 [eess.IV])
    (2 min) Early interlaced videos usually contain multiple interlacing and complex compression artifacts, which significantly reduce visual quality. Although high-definition reconstruction technology for early videos has made great progress in recent years, related research on deinterlacing is still lacking. Traditional methods mainly focus on simple interlacing mechanisms and cannot deal with the complex artifacts in real-world early videos. Recent deep deinterlacing models for interlaced video reconstruction focus only on a single frame, neglecting important temporal information. Therefore, this paper proposes a multi-frame joint deinterlacing and enhancement network for early interlaced videos that consists of three modules, i.e., a spatial vertical interpolation module, a temporal alignment and fusion module, and a final refinement module. The proposed method can effectively remove the complex artifacts in early videos by using the temporal redundancy of multiple fields. Experimental results demonstrate that the proposed method can recover high-quality results on both a synthetic dataset and real-world early interlaced videos.
    Semantic Communications With AI Tasks. (arXiv:2109.14170v1 [cs.CV])
    (2 min) A radical paradigm shift of wireless networks from ``connected things'' to ``connected intelligence'' is underway, which coincides with Shannon and Weaver's vision: communications will transform from the technical level to the semantic level. This article proposes a semantic communication method with artificial intelligence tasks (SC-AIT). First, the architecture of SC-AIT is elaborated. Then, based on the proposed architecture, we implement SC-AIT for an image classification task. A prototype of SC-AIT is also built for surface defect detection. Experimental results show that SC-AIT has much lower bandwidth requirements and can achieve more than $40\%$ classification accuracy gains compared with communications at the technical level. Future trends and key challenges for semantic communications are also identified.
    Meta Learning on a Sequence of Imbalanced Domains with Difficulty Awareness. (arXiv:2109.14120v1 [cs.LG])
    (2 min) Recognizing new objects by learning from a few labeled examples in an evolving environment is crucial to obtaining excellent generalization ability for real-world machine learning systems. A typical setting across current meta learning algorithms assumes a stationary task distribution during meta training. In this paper, we explore a more practical and challenging setting where the task distribution changes over time with domain shift. Particularly, we consider realistic scenarios where the task distribution is highly imbalanced and domain labels are naturally unavailable. We propose a kernel-based method for domain change detection and a difficulty-aware memory management mechanism that jointly considers the imbalanced domain size and domain importance to learn across domains continuously. Furthermore, we introduce an efficient adaptive task sampling method during meta training, which significantly reduces task gradient variance with theoretical guarantees. Finally, we propose a challenging benchmark with imbalanced domain sequences and varied domain difficulty. We have performed extensive evaluations on the proposed benchmark, demonstrating the effectiveness of our method. We made our code publicly available.
    All-Around Real Label Supervision: Cyclic Prototype Consistency Learning for Semi-supervised Medical Image Segmentation. (arXiv:2109.13930v1 [eess.IV])
    (2 min) Semi-supervised learning has substantially advanced medical image segmentation since it alleviates the heavy burden of acquiring the costly expert-examined annotations. Especially, the consistency-based approaches have attracted more attention for their superior performance, wherein the real labels are only utilized to supervise their paired images via supervised loss while the unlabeled images are exploited by enforcing the perturbation-based "unsupervised" consistency without explicit guidance from those real labels. However, intuitively, the expert-examined real labels contain more reliable supervision signals. Observing this, we ask an unexplored but interesting question: can we exploit the unlabeled data via explicit real label supervision for semi-supervised training? To this end, we discard the previous perturbation-based consistency but absorb the essence of non-parametric prototype learning. Based on the prototypical network, we then propose a novel cyclic prototype consistency learning (CPCL) framework, which is constructed by a labeled-to-unlabeled (L2U) prototypical forward process and an unlabeled-to-labeled (U2L) backward process. Such two processes synergistically enhance the segmentation network by encouraging more discriminative and compact features. In this way, our framework turns previous "unsupervised" consistency into new "supervised" consistency, obtaining the "all-around real label supervision" property of our method. Extensive experiments on brain tumor segmentation from MRI and kidney segmentation from CT images show that our CPCL can effectively exploit the unlabeled data and outperform other state-of-the-art semi-supervised medical image segmentation methods.
    Fine-Grained Zero-Shot Learning with DNA as Side Information. (arXiv:2109.14133v1 [cs.CV])
    (2 min) Fine-grained zero-shot learning requires some form of side information to transfer discriminative information from seen to unseen classes. As manually annotated visual attributes are extremely costly and often impractical to obtain for a large number of classes, in this study we use DNA as side information for the first time for fine-grained zero-shot classification of species. Mitochondrial DNA plays an important role as a genetic marker in evolutionary biology and has been used to achieve near-perfect accuracy in the species classification of living organisms. We implement a simple hierarchical Bayesian model that uses DNA information to establish the hierarchy in the image space and employs local priors to define surrogate classes for unseen ones. On the benchmark CUB dataset, we show that DNA can be an equally promising, yet in general more accessible, alternative to word vectors as side information. This is especially important as obtaining robust word representations for fine-grained species names is not a practicable goal when information about these species in free-form text is limited. On a newly compiled fine-grained insect dataset that uses DNA information from over a thousand species, we show that the Bayesian approach outperforms the state of the art by a wide margin.
    Hybrid Dynamic Contrast and Probability Distillation for Unsupervised Person Re-Id. (arXiv:2109.14157v1 [cs.CV])
    (2 min) Unsupervised person re-identification (Re-Id) has attracted increasing attention due to its practical applications in real-world video surveillance systems. Traditional unsupervised Re-Id methods are mostly based on alternating between clustering and fine-tuning with classification or metric learning objectives on the grouped clusters. However, since person Re-Id is an open-set problem, clustering-based methods often leave out many outlier instances or group instances into the wrong clusters, and thus cannot make full use of the training samples as a whole. To solve these problems, we present a hybrid dynamic cluster contrast and probability distillation algorithm. It formulates the unsupervised Re-Id problem as a unified local-to-global dynamic contrastive learning and self-supervised probability distillation framework. Specifically, the proposed method makes the utmost of the self-supervised signals of all the clustered and un-clustered instances, from both the instance self-contrastive level and the probability distillation perspective, in a memory-based non-parametric manner. Besides, the proposed hybrid local-to-global contrastive learning can take full advantage of the informative and valuable training examples for effective and robust training. Extensive experimental results show that the proposed method achieves performance superior to state-of-the-art methods, under both purely unsupervised and unsupervised domain adaptation experiment settings.
    Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models. (arXiv:2109.13925v1 [cs.CV])
    (2 min) Transformers are state-of-the-art deep learning models composed of stacked attention and point-wise, fully connected layers designed for handling sequential data. Transformers are not only ubiquitous throughout Natural Language Processing (NLP), but, recently, they have inspired a new wave of Computer Vision (CV) applications research. In this work, a Vision Transformer (ViT) is applied to predict the state variables of 2-dimensional Ising model simulations. Our experiments show that ViTs outperform state-of-the-art Convolutional Neural Networks (CNNs) when using a small number of microstate images from the Ising model corresponding to various boundary conditions and temperatures. This work opens the possibility of applying ViTs to other simulations, and raises interesting research directions on how attention maps can learn about the underlying physics governing different phenomena.
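    A minimal sketch of how a pretrained ViT could be repurposed for this kind of state-variable prediction, assuming a recent torchvision and framing the task as single-output regression (the head size, optimizer and loss are illustrative choices, not the paper's exact setup):

        import torch
        import torch.nn as nn
        from torchvision.models import vit_b_16, ViT_B_16_Weights

        # ImageNet-pretrained ViT with its classification head swapped for a
        # single regression output (e.g., the temperature of a microstate).
        model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        model.heads.head = nn.Linear(model.heads.head.in_features, 1)

        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        loss_fn = nn.MSELoss()

        def train_step(images, targets):
            # images: (B, 3, 224, 224) spin configurations replicated to 3 channels and resized
            # targets: (B, 1) state variable to regress
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
            return loss.item()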
    A framework for quantitative analysis of Computed Tomography images of viral pneumonitis: radiomic features in COVID and non-COVID patients. (arXiv:2109.13931v1 [physics.med-ph])
    (2 min) Purpose: to optimize a pipeline of clinical data gathering and CT image processing implemented during the COVID-19 pandemic crisis and to develop artificial intelligence models for differentiating COVID from non-COVID viral pneumonia. Methods: 1028 chest CT images of patients with a positive swab were segmented automatically for lung extraction. A Gaussian model developed in Python was applied to calculate quantitative metrics (QM) describing the well-aerated and ill portions of the lungs from the histogram distribution of lung CT numbers in both lungs of each image and in four geometrical subdivisions. Furthermore, radiomic features (RF) of first and second order were extracted from bilateral lungs using PyRadiomics tools. QM and RF were used to develop 4 different Multi-Layer Perceptron (MLP) classifiers to discriminate images of patients with COVID (n=646) and non-COVID (n=382) viral pneumonia. Results: The Gaussian model applied to the lung CT histogram correctly described healthy parenchyma in 94% of the patients. The resulting accuracy of the models for COVID diagnosis, measured as the area under the receiver operating characteristic curve, was in the range 0.76-0.87. The best diagnostic performance was associated with the model based on RF of first and second order, with 21 relevant features after LASSO regression and an accuracy of 0.81$\pm$0.02 after 4-fold cross validation. Conclusions: Although these results were obtained with CT images from a single center, a platform for extracting useful quantitative metrics from CT images was developed and optimized. Four artificial intelligence-based models for classifying patients with COVID and non-COVID viral pneumonia were developed and compared, showing overall good diagnostic performance.
  • cs.IR updates on arXiv.org

    Accelerating Recommendation System Training by Leveraging Popular Choices. (arXiv:2103.00686v3 [cs.IR] UPDATED)
    (2 min) Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper deep dives into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000x more. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the execution of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution while maintaining baseline accuracy.
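    The statistic FAE exploits can be illustrated with a small sketch (not the FAE code): count embedding-row access frequencies on a sample of training data, keep the most popular rows in a small accelerator-resident table, and leave the long tail in host memory. The hot fraction and helper names below are assumptions of this sketch.

        import numpy as np
        import torch

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        def build_hot_cold_lookup(sampled_ids, num_rows, dim, hot_fraction=0.01):
            # Access counts from a sample of the training data reveal the skew.
            counts = np.bincount(sampled_ids, minlength=num_rows)
            hot_rows = np.argsort(counts)[::-1][: max(1, int(hot_fraction * num_rows))]
            remap = {int(r): i for i, r in enumerate(hot_rows)}

            # Hot rows live in (scarce) accelerator memory; the long tail stays on the host.
            hot = torch.nn.Embedding(len(hot_rows), dim).to(device)
            cold = torch.nn.Embedding(num_rows, dim)

            def lookup(ids):
                rows = [hot.weight[remap[i]] if i in remap
                        else cold.weight[i].to(device) for i in ids]
                return torch.stack(rows)

            return lookup

        rng = np.random.default_rng(0)
        lookup = build_hot_cold_lookup(rng.zipf(1.5, 100_000) % 10_000,
                                       num_rows=10_000, dim=16)
        vectors = lookup([3, 42, 9_999])    # mixed hot/cold fetch in one call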
    Click-through Rate Prediction with Auto-Quantized Contrastive Learning. (arXiv:2109.13921v1 [cs.IR])
    (2 min) Click-through rate (CTR) prediction becomes indispensable in ubiquitous web recommendation applications. Nevertheless, the current methods are struggling under the cold-start scenarios where the user interactions are extremely sparse. We consider this problem as an automatic identification about whether the user behaviors are rich enough to capture the interests for prediction, and propose an Auto-Quantized Contrastive Learning (AQCL) loss to regularize the model. Different from previous methods, AQCL explores both the instance-instance and the instance-cluster similarity to robustify the latent representation, and automatically reduces the information loss to the active users due to the quantization. The proposed framework is agnostic to different model architectures and can be trained in an end-to-end fashion. Extensive results show that it consistently improves the current state-of-the-art CTR models.
    How Much Data Analytics is Enough? The ROI of Machine Learning Classification and its Application to Requirements Dependency Classification. (arXiv:2109.14097v1 [cs.SE])
    (2 min) Machine Learning (ML) can substantially improve the efficiency and effectiveness of organizations and is widely used for different purposes within Software Engineering. However, the selection and implementation of ML techniques rely almost exclusively on accuracy criteria. Thus, for organizations wishing to realize the benefits of ML investments, this narrow approach ignores crucial considerations around the anticipated costs of the ML activities across the ML lifecycle, while failing to account for the benefits that are likely to accrue from the proposed activity. We present findings for an approach that addresses this gap by enhancing the accuracy criterion with return on investment (ROI) considerations. Specifically, we analyze the performance of two state-of-the-art ML techniques, Random Forest and Bidirectional Encoder Representations from Transformers (BERT), based on accuracy and ROI for two publicly available data sets. In particular, we compare decision-making on requirements dependency extraction (i) exclusively based on accuracy and (ii) extended to include ROI analysis. As a result, we propose recommendations for selecting ML classification techniques based on the degree of training data used. Our findings indicate that considering ROI as an additional criterion can drastically influence ML selection when compared to decisions based on accuracy as the sole criterion.
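    The decision criterion the paper argues for can be made concrete with a toy helper; the costs and benefits below are placeholder assumptions, not figures from the paper:

        def roi(benefit, cost):
            # Return on investment of an ML activity across its lifecycle.
            return (benefit - cost) / cost

        # Hypothetical numbers: the more accurate model is also far costlier to build.
        candidates = {
            "RandomForest": {"accuracy": 0.82, "cost": 5_000,  "benefit": 20_000},
            "BERT":         {"accuracy": 0.88, "cost": 40_000, "benefit": 26_000},
        }

        best_by_accuracy = max(candidates, key=lambda m: candidates[m]["accuracy"])
        best_by_roi = max(candidates, key=lambda m: roi(candidates[m]["benefit"],
                                                        candidates[m]["cost"]))
        print(best_by_accuracy, best_by_roi)   # the two criteria may disagree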
    New Hybrid Techniques for Business Recommender Systems. (arXiv:2109.13922v1 [cs.IR])
    (2 min) Besides the typical applications of recommender systems in B2C scenarios such as movie or shopping platforms, there is a rising interest in transforming the human-driven advice provided, e.g., in consultancy via the use of recommender systems. We explore the special characteristics of such knowledge-based B2B services and propose a process that allows recommender systems to be incorporated into them. We suggest and compare several recommender techniques that incorporate the necessary contextual knowledge (e.g. company demographics). These techniques are evaluated in isolation on a test set of business intelligence consultancy cases. We then identify the respective strengths of the different techniques and propose a new hybridisation strategy to combine these strengths. Our results show that the hybridisation leads to a substantial performance improvement over the individual methods.
    Text Simplification for Comprehension-based Question-Answering. (arXiv:2109.13984v1 [cs.CL])
    (2 min) Text simplification is the process of splitting and rephrasing a sentence into a sequence of sentences, making it easier to read and understand while preserving the content and approximating the original meaning. Text simplification has been exploited in NLP applications like machine translation, summarization, semantic role labeling, and information extraction, opening a broad avenue for its exploitation in comprehension-based question-answering downstream tasks. In this work, we investigate the effect of text simplification in the task of question-answering using a comprehension context. We release Simple-SQuAD, a simplified version of the widely-used SQuAD dataset. Firstly, we outline each step in the dataset creation pipeline, including style transfer, thresholding of sentences showing correct transfer, and offset finding for each answer. Secondly, we verify the quality of the transferred sentences through various methodologies involving both automated and human evaluation. Thirdly, we benchmark the newly created corpus and perform an ablation study for examining the effect of the simplification process in the SQuAD-based question answering task. Our experiments show that simplification leads to up to 2.04% and 1.74% increases in Exact Match and F1, respectively. Finally, we conclude with an analysis of the transfer process, investigating the types of edits made by the model, and the effect of sentence length on the transfer model.
    A Next Basket Recommendation Reality Check. (arXiv:2109.14233v1 [cs.IR])
    (2 min) The goal of a next basket recommendation (NBR) system is to recommend items for the next basket for a user, based on the sequence of their prior baskets. Recently, a number of methods with complex modules have been proposed that claim state-of-the-art performance. They rarely look into the predicted basket and just provide intuitive reasons for the observed improvements, e.g., better representation, capturing intentions or relations, etc. We provide a novel angle on the evaluation of next basket recommendation methods, centered on the distinction between repetition and exploration: the next basket is typically composed of previously consumed items (i.e., repeat items) and new items (i.e., explore items). We propose a set of metrics that measure the repeat/explore ratio and performance of NBR models. Using these new metrics, we analyze state-of-the-art NBR models. The results of our analysis help to clarify the extent of the actual progress achieved by existing NBR methods as well as the underlying reasons for the improvements. Overall, our work sheds light on the evaluation problem of NBR and provides useful insights into the model design for this task.
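    A minimal sketch of a repetition-aware metric in this spirit (illustrative definitions, not necessarily the authors' exact ones): split the ground-truth next basket into repeat and explore items relative to the user's history and score recall on each part separately.

        def repeat_explore_recall(history_baskets, true_basket, recommended):
            # history_baskets: list of sets of item ids from the user's prior baskets
            seen = set().union(*history_baskets) if history_baskets else set()
            repeat_items = set(true_basket) & seen
            explore_items = set(true_basket) - seen
            rec = set(recommended)

            def recall(target):
                return len(rec & target) / len(target) if target else None

            return {
                "repeat_ratio": len(repeat_items) / max(len(set(true_basket)), 1),
                "repeat_recall": recall(repeat_items),
                "explore_recall": recall(explore_items),
            }

        print(repeat_explore_recall([{1, 2}, {2, 3}], [2, 4], [2, 5, 4]))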
  • cs.LG updates on arXiv.org

    Energy Aware Deep Reinforcement Learning Scheduling for Sensors Correlated in Time and Space. (arXiv:2011.09747v2 [cs.LG] UPDATED)
    (2 min) Millions of battery-powered sensors deployed for monitoring purposes in a multitude of scenarios, e.g., agriculture, smart cities, industry, etc., require energy-efficient solutions to prolong their lifetime. When these sensors observe a phenomenon distributed in space and evolving in time, it is expected that collected observations will be correlated in time and space. In this paper, we propose a Deep Reinforcement Learning (DRL) based scheduling mechanism capable of taking advantage of correlated information. We design our solution using the Deep Deterministic Policy Gradient (DDPG) algorithm. The proposed mechanism is capable of determining the frequency with which sensors should transmit their updates, to ensure accurate collection of observations, while simultaneously considering the energy available. To evaluate our scheduling mechanism, we use multiple datasets containing environmental observations obtained in multiple real deployments. The real observations enable us to model the environment with which the mechanism interacts as realistically as possible. We show that our solution can significantly extend the sensors' lifetime. We compare our mechanism to an idealized, all-knowing scheduler to demonstrate that its performance is near-optimal. Additionally, we highlight the unique feature of our design, energy-awareness, by displaying the impact of sensors' energy levels on the frequency of updates.
    FORK: A Forward-Looking Actor For Model-Free Reinforcement Learning. (arXiv:2010.01652v3 [cs.LG] UPDATED)
    (2 min) In this paper, we propose a new type of Actor, named the forward-looking Actor, or FORK for short, for Actor-Critic algorithms. FORK can be easily integrated into a model-free Actor-Critic algorithm. Our experiments on six Box2D and MuJoCo environments with continuous state and action spaces demonstrate the significant performance improvement FORK can bring to state-of-the-art algorithms. A variation of FORK can further solve Bipedal-WalkerHardcore in as few as four hours using a single GPU.
    Multi-Level Attention Pooling for Graph Neural Networks: Unifying Graph Representations with Multiple Localities. (arXiv:2103.01488v3 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNNs) have been widely used to learn vector representation of graph-structured data and achieved better task performance than conventional methods. The foundation of GNNs is the message passing procedure, which propagates the information in a node to its neighbors. Since this procedure proceeds one step per layer, the range of the information propagation among nodes is small in the lower layers, and it expands toward the higher layers. Therefore, a GNN model has to be deep enough to capture global structural information in a graph. On the other hand, it is known that deep GNN models suffer from performance degradation because they lose nodes' local information, which would be essential for good model performance, through many message passing steps. In this study, we propose multi-level attention pooling (MLAP) for graph-level classification tasks, which can adapt to both local and global structural information in a graph. It has an attention pooling layer for each message passing step and computes the final graph representation by unifying the layer-wise graph representations. The MLAP architecture allows models to utilize the structural information of graphs with multiple levels of localities because it preserves layer-wise information before losing them due to oversmoothing. Results of our experiments show that the MLAP architecture improves the graph classification performance compared to the baseline architectures. In addition, analyses on the layer-wise graph representations suggest that aggregating information from multiple levels of localities indeed has the potential to improve the discriminability of learned graph representations.
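    The core idea, an attention pooling readout after every message-passing step with the layer-wise graph vectors unified (here simply summed), can be sketched as follows; the plain linear message-passing layer and the dimensions are illustrative assumptions, not the paper's exact architecture:

        import torch
        import torch.nn as nn

        class AttentionPool(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.score = nn.Linear(dim, 1)

            def forward(self, h):                          # h: (num_nodes, dim), one graph
                a = torch.softmax(self.score(h), dim=0)    # attention weights over nodes
                return (a * h).sum(dim=0)                  # (dim,) graph-level vector

        class MLAPReadout(nn.Module):
            # Pool after every message-passing step and sum the layer-wise readouts.
            def __init__(self, layers, dim):
                super().__init__()
                self.layers = nn.ModuleList(layers)
                self.pools = nn.ModuleList([AttentionPool(dim) for _ in layers])

            def forward(self, h, adj):
                readouts = []
                for layer, pool in zip(self.layers, self.pools):
                    h = torch.relu(layer(adj @ h))         # one simple message-passing step
                    readouts.append(pool(h))
                return torch.stack(readouts).sum(dim=0)    # unified graph representation

        model = MLAPReadout([nn.Linear(64, 64) for _ in range(4)], dim=64)
        graph_vec = model(torch.randn(10, 64), torch.eye(10))   # toy 10-node graph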
    baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling. (arXiv:2102.03291v3 [cs.LG] UPDATED)
    (2 min) Multi-agent spatiotemporal modeling is a challenging task from both an algorithmic design and computational complexity perspective. Recent work has explored the efficacy of traditional deep sequential models in this domain, but these architectures are slow and cumbersome to train, particularly as model size increases. Further, prior attempts to model interactions between agents across time have limitations, such as imposing an order on the agents, or making assumptions about their relationships. In this paper, we introduce baller2vec, a multi-entity generalization of the standard Transformer that can, with minimal assumptions, simultaneously and efficiently integrate information across entities and time. We test the effectiveness of baller2vec for multi-agent spatiotemporal modeling by training it to perform two different basketball-related tasks: (1) simultaneously modeling the trajectories of all players on the court and (2) modeling the trajectory of the ball. Not only does baller2vec learn to perform these tasks well (outperforming a graph recurrent neural network with a similar number of parameters by a wide margin), it also appears to "understand" the game of basketball, encoding idiosyncratic qualities of players in its embeddings, and performing basketball-relevant functions with its attention heads.
    Evaluating Various Tokenizers for Arabic Text Classification. (arXiv:2106.07540v2 [cs.CL] UPDATED)
    (2 min) The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size in a given text corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language, and such techniques are also difficult to evaluate in practice. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other baselines using unsupervised evaluations. In addition, we compare all six algorithms by evaluating them on three supervised classification tasks, namely sentiment analysis, news classification and poetry classification, using six publicly available datasets. Our experiments show that none of the tokenization techniques is the best choice overall and that the performance of a given tokenization algorithm depends on the size of the dataset, the type of the task, and the amount of morphology that exists in the dataset. However, some tokenization techniques are better overall as compared to others on various text classification tasks.
    Semi-Empirical Objective Functions for Neural MCMC Proposal Optimization. (arXiv:2106.02104v2 [cs.LG] UPDATED)
    (2 min) Current objective functions used for training neural MCMC proposal distributions implicitly rely on architectural restrictions to yield sensible optimization results, which hampers the development of highly expressive neural MCMC proposal architectures. In this work, we introduce and demonstrate a semi-empirical procedure for determining approximate objective functions suitable for optimizing arbitrarily parameterized proposal distributions in MCMC methods. Our proposed Ab Initio objective functions consist of the weighted combination of functions following constraints on their global optima and transformation invariances that we argue should be upheld by general measures of MCMC efficiency for use in proposal optimization. Our experimental results demonstrate that Ab Initio objective functions maintain favorable performance and preferable optimization behavior compared to existing objective functions for neural MCMC optimization. We find that Ab Initio objective functions are sufficiently robust to enable the confident optimization of neural proposal distributions parameterized by deep generative networks extending beyond the regimes of traditional MCMC schemes.
    Online Robust Reinforcement Learning with Model Uncertainty. (arXiv:2109.14523v1 [cs.LG])
    (2 min) Robust reinforcement learning (RL) aims to find a policy that optimizes the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on model-free robust RL, where the uncertainty set is defined to be centered at a misspecified MDP that generates a single sample trajectory sequentially and is assumed to be unknown. We develop a sample-based approach to estimate the unknown uncertainty set and design a robust Q-learning algorithm (tabular case) and robust TDC algorithm (function approximation setting), which can be implemented in an online and incremental fashion. For the robust Q-learning algorithm, we prove that it converges to the optimal robust Q function, and for the robust TDC algorithm, we prove that it converges asymptotically to some stationary points. Unlike the results in [Roy et al., 2017], our algorithms do not need any additional conditions on the discount factor to guarantee the convergence. We further characterize the finite-time error bounds of the two algorithms and show that both the robust Q-learning and robust TDC algorithms converge as fast as their vanilla counterparts (within a constant factor). Our numerical experiments further demonstrate the robustness of our algorithms. Our approach can be readily extended to robustify many other algorithms, e.g., TD, SARSA, and other GTD algorithms.
    A Permutation-Equivariant Neural Network Architecture For Auction Design. (arXiv:2003.01497v3 [cs.GT] UPDATED)
    (2 min) Designing an incentive compatible auction that maximizes expected revenue is a central problem in Auction Design. Theoretical approaches to the problem have hit some limits in the past decades and analytical solutions are known for only a few simple settings. Computational approaches to the problem through the use of LPs have their own set of limitations. Building on the success of deep learning, a new approach was recently proposed by Duetting et al. (2019) in which the auction is modeled by a feed-forward neural network and the design problem is framed as a learning problem. The neural architectures used in that work are general purpose and do not take advantage of any of the symmetries the problem could present, such as permutation equivariance. In this work, we consider auction design problems that have permutation-equivariant symmetry and construct a neural architecture that is capable of perfectly recovering the permutation-equivariant optimal mechanism, which we show is not possible with the previous architecture. We demonstrate that permutation-equivariant architectures are not only capable of recovering previous results, they also have better generalization properties.
    Distribution Regression for Sequential Data. (arXiv:2006.05805v5 [cs.LG] UPDATED)
    (2 min) Distribution regression refers to the supervised learning problem where labels are only available for groups of inputs instead of individual inputs. In this paper, we develop a rigorous mathematical framework for distribution regression where inputs are complex data streams. Leveraging properties of the expected signature and a recent signature kernel trick for sequential data from stochastic analysis, we introduce two new learning techniques, one feature-based and the other kernel-based. Each is suited to a different data regime in terms of the number of data streams and the dimensionality of the individual streams. We provide theoretical results on the universality of both approaches and demonstrate empirically their robustness to irregularly sampled multivariate time-series, achieving state-of-the-art performance on both synthetic and real-world examples from thermodynamics, mathematical finance and agricultural science.
    FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition. (arXiv:2109.14420v1 [cs.CL])
    (2 min) Error correction is widely used in automatic speech recognition (ASR) to post-process the generated sentence, and can further reduce the word error rate (WER). Although multiple candidates are generated by an ASR system through beam search, current error correction approaches can only correct one sentence at a time, failing to leverage the voting effect from multiple candidates to better detect and correct error tokens. In this work, we propose FastCorrect 2, an error correction model that takes multiple ASR candidates as input for better correction accuracy. FastCorrect 2 adopts non-autoregressive generation for fast inference, which consists of an encoder that processes multiple source sentences and a decoder that generates the target sentence in parallel from the adjusted source sentence, where the adjustment is based on the predicted duration of each source token. However, there are some issues when handling multiple source sentences. First, it is non-trivial to leverage the voting effect from multiple source sentences since they usually vary in length. Thus, we propose a novel alignment algorithm to maximize the degree of token alignment among multiple sentences in terms of token and pronunciation similarity. Second, the decoder can only take one adjusted source sentence as input, while there are multiple source sentences. Thus, we develop a candidate predictor to detect the most suitable candidate for the decoder. Experiments on our in-house dataset and AISHELL-1 show that FastCorrect 2 can further reduce the WER over the previous correction model with a single candidate by 3.2% and 2.6%, demonstrating the effectiveness of leveraging multiple candidates in ASR error correction. FastCorrect 2 achieves better performance than the cascaded re-scoring and correction pipeline and can serve as a unified post-processing module for ASR.
    Digital Twins based Day-ahead Integrated Energy System Scheduling under Load and Renewable Energy Uncertainties. (arXiv:2109.14423v1 [cs.IT])
    (2 min) By constructing digital twins (DT) of an integrated energy system (IES), one can benefit from the DT's predictive capabilities to improve coordination among various energy converters, hence enhancing energy efficiency, cost savings and carbon emission reduction. This paper is motivated by the fact that practical IESs suffer from multiple uncertainty sources and a complicated surrounding environment. To address this problem, a novel DT-based day-ahead scheduling method is proposed. The physical IES is modelled as a multi-vector energy system in its virtual space that interacts with the physical IES to manipulate its operations. A deep neural network is trained to make statistical cost-saving scheduling by learning from both historical forecasting errors and day-ahead forecasts. Case studies of IESs show that the proposed DT-based method is able to reduce the operating cost of the IES by 63.5%, compared to existing forecast-based scheduling methods. It is also found that both electric vehicles and thermal energy storage play proactive roles in the proposed method, highlighting their importance in future energy system integration and decarbonisation.
    A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms. (arXiv:2106.02126v2 [cs.LG] UPDATED)
    (2 min) One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between mean rewards in the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapts to this gap: for a horizon of play n, it achieves optimal O(log n) regret in instances with "large" gaps, and a near-optimal O(\sqrt{n log n}) minimax regret when the gap can be arbitrarily "small." This paper provides new results on the arm-sampling behavior of UCB, leading to several important insights. Among these, it is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity. This discovery facilitates new sharp asymptotics and a novel alternative proof for the O(\sqrt{n log n}) minimax regret of UCB. Furthermore, the paper also provides the first complete process-level characterization of the MAB problem under UCB in the conventional diffusion scaling. Among other things, the "small" gap worst-case lens adopted in this paper also reveals profound distinctions between the behavior of UCB and Thompson Sampling, such as an "incomplete learning" phenomenon characteristic of the latter.
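    For reference, the UCB policy analyzed here amounts to only a few lines (a standard UCB1-style sketch, not code from the paper); the returned pull counts are the arm-sampling rates whose asymptotics the paper characterizes.

        import numpy as np

        def ucb_run(pull, n_arms, horizon):
            counts = np.zeros(n_arms)
            means = np.zeros(n_arms)
            for t in range(horizon):
                if t < n_arms:
                    arm = t                                   # pull every arm once
                else:
                    bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
                    arm = int(np.argmax(means + bonus))       # optimism in the face of uncertainty
                reward = pull(arm)
                counts[arm] += 1
                means[arm] += (reward - means[arm]) / counts[arm]
            return counts                                     # arm-sampling counts

        # Two-armed Gaussian instance with a small gap, the regime the paper focuses on.
        rng = np.random.default_rng(1)
        print(ucb_run(lambda a: rng.normal([0.50, 0.48][a], 1.0), 2, 10_000))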
    Auction learning as a two-player game. (arXiv:2006.05684v3 [cs.GT] UPDATED)
    (2 min) Designing an incentive compatible auction that maximizes expected revenue is a central problem in Auction Design. While theoretical approaches to the problem have hit some limits, a recent research direction initiated by Duetting et al. (2019) consists in building neural network architectures to find optimal auctions. We propose two conceptual deviations from their approach which result in enhanced performance. First, we use recent results in theoretical auction design (Rubinstein and Weinberg, 2018) to introduce a time-independent Lagrangian. This not only circumvents the need for an expensive hyper-parameter search (as in prior work), but also provides a principled metric to compare the performance of two auctions (absent from prior work). Second, the optimization procedure in previous work uses an inner maximization loop to compute optimal misreports. We amortize this process through the introduction of an additional neural network. We demonstrate the effectiveness of our approach by learning competitive or strictly improved auctions compared to prior work. Both results together further imply a novel formulation of Auction Design as a two-player game with stationary utility functions.
    A hybrid ensemble method with negative correlation learning for regression. (arXiv:2104.02317v2 [cs.LG] UPDATED)
    (2 min) Hybrid ensembles, an essential branch of ensembles, have flourished in numerous machine learning problems, especially regression. Several studies have confirmed the importance of diversity; however, previous ensembles only consider diversity in the sub-model training stage, with limited improvement compared to single models. In contrast, this study focuses on the sub-model combination stage of the ensemble. It solves a non-convex optimization problem using an interior-point filter line-search algorithm to select and weight sub-models from a heterogeneous model pool automatically. This optimization problem innovatively incorporates negative correlation learning as a penalty term, so that a diverse model subset can be selected. Experimental results show that the approach outperforms single models and overcomes the instability of the models and parameters. Compared to bagging and stacking without model diversity, our method stands out more and confirms the importance of diversity in the ensemble. Additionally, the performance of our proposed method is better than that of simple and weighted averages, and the variance of the weights is lower and more stable than that of a linear model. Finally, the prediction accuracy can be further improved by fine-tuning the weights using the error inverse weights. In conclusion, the value of this study lies in its ease of use and effectiveness, allowing the hybrid ensemble to embrace both diversity and accuracy.
    Terahertz-Band Joint Ultra-Massive MIMO Radar-Communications: Model-Based and Model-Free Hybrid Beamforming. (arXiv:2103.00328v2 [eess.SP] UPDATED)
    (2 min) Wireless communications and sensing at terahertz (THz) band are increasingly investigated as promising short-range technologies because of the availability of high operational bandwidth at THz. In order to address the extremely high attenuation at THz, ultra-massive multiple-input multiple-output (MIMO) antenna systems have been proposed for THz communications to compensate propagation losses. However, the cost and power associated with fully digital beamformers of these huge antenna arrays are prohibitive. In this paper, we develop wideband hybrid beamformers based on both model-based and model-free techniques for a new group-of-subarrays (GoSA) ultra-massive MIMO structure in low-THz band. Further, driven by the recent developments to save the spectrum, we propose beamformers for a joint ultra-massive MIMO radar-communications system, wherein the base station serves multi-antenna user equipment (RX), and tracks radar targets by generating multiple beams toward both RX and the targets. We formulate the GoSA beamformer design as an optimization problem to provide a trade-off between the unconstrained communications beamformers and the desired radar beamformers. To mitigate the beam split effect at THz band arising from frequency-independent analog beamformers, we propose a phase correction technique to align the beams of multiple subcarriers toward a single physical direction. To further decrease the ultra-massive MIMO computational complexity and enhance robustness, we also implement deep learning solutions to the proposed model-based hybrid beamformers. Numerical experiments demonstrate that both techniques outperform the conventional approaches in terms of spectral efficiency and radar beampatterns, as well as exhibiting less hardware cost and computation time.
    Opportunistic Emulation of Computationally Expensive Simulations via Deep Learning. (arXiv:2108.11057v2 [cs.LG] UPDATED)
    (2 min) With the underlying aim of increasing the efficiency of computational modelling pertinent to managing and protecting the Great Barrier Reef, we perform a preliminary investigation on the use of deep neural networks for opportunistic model emulation of APSIM models by repurposing an existing large dataset containing outputs of APSIM model runs. The dataset has not been specifically tailored for the model emulation task. We employ two neural network architectures for the emulation task: a densely connected feed-forward neural network (FFNN), and a gated recurrent unit feeding into an FFNN (GRU-FFNN), a type of recurrent neural network. Various configurations of the architectures are trialled. A minimum correlation statistic is used to identify clusters of APSIM scenarios that can be aggregated to form training sets for model emulation. We focus on emulating 4 important outputs of the APSIM model: runoff, soil_loss, DINrunoff, and Nleached. The GRU-FFNN architecture with three hidden layers and 128 units per layer provides good emulation of runoff and DINrunoff. However, soil_loss and Nleached were emulated relatively poorly under a wide range of the considered architectures; the emulators failed to capture variability at higher values of these two outputs. While the opportunistic data available from past modelling activities provides a large and useful dataset for exploring APSIM emulation, it may not be rich enough for successful deep learning of more complex model dynamics. Design of Computer Experiments may be required to generate more informative data to emulate all output variables of interest. We also suggest the use of synthetic meteorology settings to allow the model to be fed a wide range of inputs. These need not all be representative of normal conditions, but can provide a denser, more informative dataset from which complex relationships between inputs and outputs can be learned.
    Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State. (arXiv:2109.14247v1 [cs.NE])
    (2 min) Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware. However, the supervised training of SNNs remains a hard problem due to the discontinuity of the spiking neuron model. Most existing methods imitate the backpropagation framework and feedforward architectures for artificial neural networks, and use surrogate derivatives or compute gradients with respect to the spiking time to deal with the problem. These approaches either accumulate approximation errors or only propagate information limitedly through existing spikes, and usually require information propagation along time steps with large memory costs and biological implausibility. In this work, we consider feedback spiking neural networks, which are more brain-like, and propose a novel training method that does not rely on the exact reverse of the forward computation. First, we show that the average firing rates of SNNs with feedback connections would gradually evolve to an equilibrium state along time, which follows a fixed-point equation. Then by viewing the forward computation of feedback SNNs as a black-box solver for this equation, and leveraging the implicit differentiation on the equation, we can compute the gradient for parameters without considering the exact forward procedure. In this way, the forward and backward procedures are decoupled and therefore the problem of non-differentiable spiking functions is avoided. We also briefly discuss the biological plausibility of implicit differentiation, which only requires computing another equilibrium. Extensive experiments on MNIST, Fashion-MNIST, N-MNIST, CIFAR-10, and CIFAR-100 demonstrate the superior performance of our method for feedback models with fewer neurons and parameters in a small number of time steps. Our code is available at https://github.com/pkuxmq/IDE-FSNN.
    Intuitiveness in Active Teaching. (arXiv:2012.13551v2 [cs.HC] UPDATED)
    (2 min) While machine learning gives rise to astonishing results in automated systems, it usually comes at the cost of large data requirements. This makes many successful algorithms from machine learning unsuitable for human-machine interaction, where the machine must learn from a small number of training samples that can be provided by a user within a reasonable time frame. Fortunately, the user can tailor the training data they create to be as useful as possible, severely limiting its necessary size -- as long as they know about the machine's requirements and limitations. Of course, acquiring this knowledge can in turn be cumbersome and costly. This raises the question of how easy machine learning algorithms are to interact with. In this work, we address this issue by analyzing the intuitiveness of certain algorithms when they are actively taught by users. After developing a theoretical framework of intuitiveness as a property of algorithms, we introduce an active teaching paradigm involving a prototypical two-dimensional spatial learning task as a method to judge the efficacy of human-machine interactions. Finally, we present and discuss the results of a large-scale user study into the performance and teaching strategies of 800 users interacting with two prominent machine learning algorithms in our system, providing first evidence for the role of intuition as an important factor impacting human-machine interaction.
    Multi-loss ensemble deep learning for chest X-ray classification. (arXiv:2109.14433v1 [eess.IV])
    (2 min) Class imbalance is common in medical image classification tasks, where the number of abnormal samples is fewer than the number of normal samples. The difficulty of imbalanced classification is compounded by other issues such as the size and distribution of the dataset. Reliable training of deep neural networks continues to be a major challenge in such class-imbalanced conditions. The loss function used to train deep neural networks highly impacts the performance of both balanced and imbalanced tasks. Currently, the cross-entropy loss remains the de-facto loss function for balanced and imbalanced classification tasks. This loss, however, asserts equal learning for all classes, leading to the classification of most samples as the majority normal class. To provide a critical analysis of different loss functions and identify those suitable for class-imbalanced classification, we benchmark various state-of-the-art loss functions and propose novel loss functions to train a DL model and analyze its performance in a multiclass classification setting that classifies pediatric chest X-rays as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations. We also construct prediction-level and model-level ensembles of the models that are trained with various loss functions to improve classification performance. We performed localization studies to interpret model behavior and to ensure that the individual models and their ensembles precisely learned the regions of interest showing disease manifestations when classifying the chest X-rays into their respective categories.
    One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective. (arXiv:2109.14449v1 [cs.CV])
    (2 min) A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at https://github.com/kamwoh/orthohash
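    A sketch of this single objective, cosine similarity between batch-normalized continuous codes and fixed binary orthogonal target codes (rows of a Hadamard matrix here), scaled and fed to cross-entropy; this illustrates the idea rather than reproducing the released OrthoHash code:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F
        from scipy.linalg import hadamard

        n_bits, n_classes, feat_dim = 64, 10, 512
        # Fixed binary orthogonal target codes: one Hadamard row per class.
        targets = torch.tensor(hadamard(n_bits)[:n_classes], dtype=torch.float32)

        head = nn.Sequential(nn.Linear(feat_dim, n_bits), nn.BatchNorm1d(n_bits))

        def one_loss(features, labels, scale=8.0):
            codes = head(features)                            # continuous codes; BN balances the bits
            cos = F.normalize(codes, dim=1) @ F.normalize(targets, dim=1).t()
            return F.cross_entropy(scale * cos, labels)       # the single learning objective

        loss = one_loss(torch.randn(32, feat_dim), torch.randint(0, n_classes, (32,)))
        # At retrieval time the hash code is simply torch.sign(head(features)).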
    PINNup: Robust neural network wavefield solutions using frequency upscaling and neuron splitting. (arXiv:2109.14536v1 [cs.LG])
    (2 min) Solving for the frequency-domain scattered wavefield via a physics-informed neural network (PINN) has great potential in seismic modeling and inversion. However, when dealing with high-frequency wavefields, its accuracy and training cost limit its applications. Thus, we propose a novel implementation of PINN using frequency upscaling and neuron splitting, which allows the neural network model to grow in size as we increase the frequency while leveraging the information from the pre-trained model for lower-frequency wavefields, resulting in fast convergence to high-accuracy solutions. Numerical results show that, compared to the commonly used PINN with random initialization, the proposed PINN exhibits notable superiority in terms of convergence and accuracy and can achieve neuron-based high-frequency wavefield solutions with a two-hidden-layer model.
    Fairness-Driven Private Collaborative Machine Learning. (arXiv:2109.14376v1 [cs.LG])
    (2 min) The performance of machine learning algorithms can be considerably improved when trained over larger datasets. In many domains, such as medicine and finance, larger datasets can be obtained if several parties, each having access to limited amounts of data, collaborate and share their data. However, such data sharing introduces significant privacy challenges. While multiple recent studies have investigated methods for private collaborative machine learning, the fairness of such collaborative algorithms was overlooked. In this work we suggest a feasible privacy-preserving pre-process mechanism for enhancing fairness of collaborative machine learning algorithms. Our experimentation with the proposed method shows that it is able to enhance fairness considerably with only a minor compromise in accuracy.
    Linear Asymptotic Convergence of Anderson Acceleration: Fixed-Point Analysis. (arXiv:2109.14176v1 [math.OC])
    (2 min) We study the asymptotic convergence of AA($m$), i.e., Anderson acceleration with window size $m$ for accelerating fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in R^n$. Convergence acceleration by AA($m$) has been widely observed but is not well understood. We consider the case where the fixed-point iteration function $q(x)$ is differentiable and the convergence of the fixed-point method itself is root-linear. We identify numerically several conspicuous properties of AA($m$) convergence: First, AA($m$) sequences $\{x_k\}$ converge root-linearly but the root-linear convergence factor depends strongly on the initial condition. Second, the AA($m$) acceleration coefficients $\beta^{(k)}$ do not converge but oscillate as $\{x_k\}$ converges to $x^*$. To shed light on these observations, we write the AA($m$) iteration as an augmented fixed-point iteration $z_{k+1} =\Psi(z_k)$, $z_k \in R^{n(m+1)}$ and analyze the continuity and differentiability properties of $\Psi(z)$ and $\beta(z)$. We find that the vector of acceleration coefficients $\beta(z)$ is not continuous at the fixed point $z^*$. However, we show that, despite the discontinuity of $\beta(z)$, the iteration function $\Psi(z)$ is Lipschitz continuous and directionally differentiable at $z^*$ for AA(1), and we generalize this to AA($m$) with $m>1$ for most cases. Furthermore, we find that $\Psi(z)$ is not differentiable at $z^*$. We then discuss how these theoretical findings relate to the observed convergence behaviour of AA($m$). The discontinuity of $\beta(z)$ at $z^*$ allows $\beta^{(k)}$ to oscillate as $\{x_k\}$ converges to $x^*$, and the non-differentiability of $\Psi(z)$ allows AA($m$) sequences to converge with root-linear convergence factors that strongly depend on the initial condition. Additional numerical results illustrate our findings.
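    For readers unfamiliar with AA($m$), the window-one case AA(1) for a fixed-point map $q$ can be written in a few lines; the mixing coefficient computed below plays the role of the acceleration coefficient $\beta^{(k)}$ whose oscillation the paper analyzes (illustrative code, not the authors'):

        import numpy as np

        def aa1(q, x0, iters=50, tol=1e-12):
            # Anderson acceleration with window size m = 1 for x_{k+1} = q(x_k).
            x_prev, x = x0, q(x0)                  # start with one plain fixed-point step
            for _ in range(iters):
                f, f_prev = q(x) - x, q(x_prev) - x_prev
                df = f - f_prev
                theta = float(np.dot(f, df) / (np.dot(df, df) + 1e-30))
                x_prev, x = x, q(x) - theta * (q(x) - q(x_prev))
                if np.linalg.norm(q(x) - x) < tol:
                    break
            return x

        # q(x) = cos(x) applied componentwise; AA(1) converges much faster than x_{k+1} = q(x_k).
        print(aa1(np.cos, np.array([1.0, 0.5])))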
    Multi-class Probabilistic Bounds for Self-learning. (arXiv:2109.14422v1 [cs.LG])
    (2 min) Self-learning is a classical approach for learning with both labeled and unlabeled observations which consists in giving pseudo-labels to unlabeled training instances with a confidence score over a predetermined threshold. At the same time, the pseudo-labeling technique is prone to error and runs the risk of adding noisy labels into unlabeled training data. In this paper, we present a probabilistic framework for analyzing self-learning in the multi-class classification scenario with partially labeled data. First, we derive a transductive bound over the risk of the multi-class majority vote classifier. Based on this result, we propose to automatically choose the threshold for pseudo-labeling that minimizes the transductive bound. Then, we introduce a mislabeling error model to analyze the error of the majority vote classifier in the case of the pseudo-labeled data. We derive a probabilistic C-bound over the majority vote error when an imperfect label is given. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised approaches.
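    The self-learning loop being analyzed can be summarized in a few lines; the fixed confidence threshold below is exactly the quantity the paper proposes to choose automatically by minimizing its transductive bound (a generic scikit-learn sketch, not the authors' code):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def self_learning(X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
            X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
            for _ in range(rounds):
                clf = LogisticRegression(max_iter=1000).fit(X, y)
                if len(pool) == 0:
                    break
                proba = clf.predict_proba(pool)
                conf, pseudo = proba.max(axis=1), proba.argmax(axis=1)
                keep = conf >= threshold               # the threshold the paper sets via its bound
                if not keep.any():
                    break
                X = np.vstack([X, pool[keep]])
                y = np.concatenate([y, clf.classes_[pseudo[keep]]])
                pool = pool[~keep]
            return clf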
    Implicit Generative Copulas. (arXiv:2109.14567v1 [stat.ML])
    (2 min) Copulas are a powerful tool for modeling multivariate distributions as they allow to separately estimate the univariate marginal distributions and the joint dependency structure. However, known parametric copulas offer limited flexibility especially in high dimensions, while commonly used non-parametric methods suffer from the curse of dimensionality. A popular remedy is to construct a tree-based hierarchy of conditional bivariate copulas. In this paper, we propose a flexible, yet conceptually simple alternative based on implicit generative neural networks. The key challenge is to ensure marginal uniformity of the estimated copula distribution. We achieve this by learning a multivariate latent distribution with unspecified marginals but the desired dependency structure. By applying the probability integral transform, we can then obtain samples from the high-dimensional copula distribution without relying on parametric assumptions or the need to find a suitable tree structure. Experiments on synthetic and real data from finance, physics, and image generation demonstrate the performance of this approach.
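    The key step described above, forcing uniform marginals on samples that already carry the learned dependency structure via the probability integral transform, can be illustrated with an empirical version; the correlated Gaussian below merely stands in for the neural generator and is an assumption of this sketch:

        import numpy as np
        from scipy.stats import rankdata, norm

        def empirical_pit(samples):
            # Column-wise empirical probability integral transform: ranks / (n + 1)
            # give approximately Uniform(0, 1) marginals while keeping the dependency.
            n = samples.shape[0]
            return np.column_stack([rankdata(col) / (n + 1) for col in samples.T])

        # Correlated Gaussian latents stand in for the learned generator in this sketch.
        rng = np.random.default_rng(0)
        z = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
        u = empirical_pit(z)                                   # copula samples on [0, 1]^2

        # Any marginals can then be attached through their inverse CDFs,
        # e.g. a standard normal and a unit-rate exponential.
        x = np.column_stack([norm.ppf(u[:, 0]), -np.log1p(-u[:, 1])])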
    Generalization Bounds For Meta-Learning: An Information-Theoretic Analysis. (arXiv:2109.14595v1 [cs.LG])
    (2 min) We derive a novel information-theoretic analysis of the generalization property of meta-learning algorithms. Concretely, our analysis proposes a generic understanding of both the conventional learning-to-learn framework and the modern model-agnostic meta-learning (MAML) algorithms. Moreover, we provide a data-dependent generalization bound for a stochastic variant of MAML, which is non-vacuous for deep few-shot learning. As compared to previous bounds that depend on the square norm of gradients, empirical validations on both simulated data and a well-known few-shot benchmark show that our bound is orders of magnitude tighter in most situations.
    Online Metro Origin-Destination Prediction via Heterogeneous Information Aggregation. (arXiv:2107.00946v4 [cs.LG] UPDATED)
    (2 min) Metro origin-destination prediction is a crucial yet challenging time-series analysis task in intelligent transportation systems, which aims to accurately forecast two specific types of cross-station ridership, i.e., Origin-Destination (OD) and Destination-Origin (DO) ridership. However, complete OD matrices of previous time intervals cannot be obtained immediately in online metro systems, and conventional methods use only limited information to forecast the future OD and DO ridership separately. In this work, we propose a novel neural network module termed Heterogeneous Information Aggregation Machine (HIAM), which fully exploits heterogeneous information of historical data (e.g., incomplete OD matrices, unfinished order vectors, and DO matrices) to jointly learn the evolutionary patterns of OD and DO ridership. Specifically, an OD modeling branch estimates the potential destinations of unfinished orders explicitly to complement the information of incomplete OD matrices, while a DO modeling branch takes DO matrices as input to capture the spatial-temporal distribution of DO ridership. Moreover, a Dual Information Transformer is introduced to propagate the mutual information among OD features and DO features for modeling the OD-DO causality and correlation. Based on the proposed HIAM, we develop a unified Seq2Seq network to forecast the future OD and DO ridership simultaneously. Extensive experiments conducted on two large-scale benchmarks demonstrate the effectiveness of our method for online metro origin-destination prediction.
    Real-Time Decentralized knowledge Transfer at the Edge. (arXiv:2011.05961v3 [cs.LG] UPDATED)
    (2 min) The proliferation of edge networks creates islands of learning agents working on local streams of data. Transferring knowledge between these agents in real-time without exposing private data allows for collaboration to decrease learning time and increase model confidence. Incorporating knowledge from data that a local model did not see creates an ability to debias a local model or add to classification abilities on data never before seen. Transferring knowledge in a selective decentralized approach enables models to retain their local insights, allowing for local flavors of a machine learning model. This approach suits the decentralized architecture of edge networks, as a local edge node will serve a community of learning agents that will likely encounter similar data. We propose a method based on knowledge distillation for pairwise knowledge transfer pipelines from models trained on non-i.i.d. data and compare it to other popular knowledge transfer methods. Additionally, we test different scenarios of knowledge transfer network construction and show the practicality of our approach. Our experiments show knowledge transfer using our model outperforms standard methods in a real-time transfer scenario.
    Dynamic Regret Analysis for Online Meta-Learning. (arXiv:2109.14375v1 [cs.LG])
    (2 min) The online meta-learning framework has arisen as a powerful tool for the continual lifelong learning setting. The goal for an agent is to quickly learn new tasks by drawing on prior experience while it faces tasks one after another. This formulation involves two levels: an outer level which learns meta-learners and an inner level which learns task-specific models, with only a small amount of data from the current task. While existing methods provide static regret analysis for the online meta-learning framework, we establish performance in terms of dynamic regret, which handles changing environments from a global perspective. We also build on a generalized version of the adaptive gradient methods that covers both ADAM and ADAGRAD to learn meta-learners at the outer level. We carry out our analyses in a stochastic setting, and in expectation prove a logarithmic local dynamic regret which depends explicitly on the total number of iterations T and the parameters of the learner. In addition, we indicate high-probability bounds on the convergence rates of the proposed algorithm with appropriate selection of parameters, which have not been established before.
    Interpretable Locally Adaptive Nearest Neighbors. (arXiv:2011.03904v2 [cs.LG] UPDATED)
    (2 min) When training automated systems, it has been shown to be beneficial to adapt the representation of data by learning a problem-specific metric. This metric is global. We extend this idea and, for the widely used family of k nearest neighbors algorithms, develop a method that allows learning locally adaptive metrics. These local metrics not only improve performance but are naturally interpretable. To demonstrate important aspects of how our approach works, we conduct a number of experiments on synthetic data sets, and we show its usefulness on real-world benchmark data sets.
    Decomposable-Net: Scalable Low-Rank Compression for Neural Networks. (arXiv:1910.13141v3 [cs.LG] UPDATED)
    (2 min) Compressing DNNs is important for real-world applications operating on resource-constrained devices. However, we typically observe drastic performance deterioration when changing model size after training is completed. Therefore, retraining is required to resume the performance of the compressed models suitable for different devices. In this paper, we propose Decomposable-Net (a network decomposable into any size), which allows flexible changes to model size without retraining. We decompose weight matrices in the DNNs via singular value decomposition and adjust ranks according to the target model size. Unlike existing low-rank compression methods that specialize the model to a fixed size, we propose a novel backpropagation scheme that jointly minimizes losses for both full- and low-rank networks. This enables us not only to maintain the performance of a full-rank network {\it without retraining} but also to improve low-rank networks of multiple sizes. Additionally, we introduce a simple criterion for rank selection that effectively suppresses approximation error. In experiments on the ImageNet classification task, Decomposable-Net yields superior accuracy in a wide range of model sizes. In particular, Decomposable-Net achieves a top-1 accuracy of $73.2\%$ with $0.27\times$MACs with ResNet-50, compared to Tucker decomposition ($67.4\% / 0.30\times$), Trained Rank Pruning ($70.6\% / 0.28\times$), and universally slimmable networks ($71.4\% / 0.26\times$).
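    The basic mechanism, truncated SVD of a trained weight matrix so that the rank, and hence the model size, can be chosen after training, looks roughly as follows (a sketch of the decomposition step only; the joint full/low-rank training objective from the paper is not shown):

        import torch
        import torch.nn as nn

        def decompose_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
            # Replace one Linear layer by two thinner ones via truncated SVD of its weight.
            W = layer.weight.data                            # (out_features, in_features)
            U, S, Vh = torch.linalg.svd(W, full_matrices=False)
            first = nn.Linear(layer.in_features, rank, bias=False)
            second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
            first.weight.data = Vh[:rank, :]
            second.weight.data = U[:, :rank] * S[:rank]      # fold singular values into U
            if layer.bias is not None:
                second.bias.data = layer.bias.data
            return nn.Sequential(first, second)

        layer = nn.Linear(512, 512)
        small = decompose_linear(layer, rank=32)             # ~8x fewer parameters, no retraining
        x = torch.randn(4, 512)
        print((layer(x) - small(x)).abs().max())             # approximation error from truncation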
    Sublinear Time and Space Algorithms for Correlation Clustering via Sparse-Dense Decompositions. (arXiv:2109.14528v1 [cs.DS])
    (2 min) We present a new approach for solving (minimum disagreement) correlation clustering that results in sublinear algorithms with highly efficient time and space complexity for this problem. In particular, we obtain the following algorithms for $n$-vertex $(+/-)$-labeled graphs $G$: -- A sublinear-time algorithm that with high probability returns a constant approximation clustering of $G$ in $O(n\log^2{n})$ time assuming access to the adjacency list of the $(+)$-labeled edges of $G$ (this is almost quadratically faster than even reading the input once). Previously, no sublinear-time algorithm was known for this problem with any multiplicative approximation guarantee. -- A semi-streaming algorithm that with high probability returns a constant approximation clustering of $G$ in $O(n\log{n})$ space and a single pass over the edges of the graph $G$ (this memory is almost quadratically smaller than input size). Previously, no single-pass algorithm with $o(n^2)$ space was known for this problem with any approximation guarantee. The main ingredient of our approach is a novel connection to sparse-dense graph decompositions that are used extensively in the graph coloring literature. To our knowledge, this connection is the first application of these decompositions beyond graph coloring, and in particular for the correlation clustering problem, and can be of independent interest.
    Self-supervised Learning on Graphs: Contrastive, Generative, or Predictive. (arXiv:2105.07342v4 [cs.LG] UPDATED)
    (2 min) Deep learning on graphs has recently achieved remarkable success on a variety of tasks, while such success relies heavily on the massive and carefully labeled data. However, precise annotations are generally very expensive and time-consuming. To address this problem, self-supervised learning (SSL) is emerging as a new paradigm for extracting informative knowledge through well-designed pretext tasks without relying on manual labels. In this survey, we extend the concept of SSL, which first emerged in the fields of computer vision and natural language processing, to present a timely and comprehensive review of existing SSL techniques for graph data. Specifically, we divide existing graph SSL methods into three categories: contrastive, generative, and predictive. More importantly, unlike other surveys that only provide a high-level description of published research, we present an additional mathematical summary of existing works in a unified framework. Furthermore, to facilitate methodological development and empirical comparisons, we also summarize the commonly used datasets, evaluation metrics, downstream tasks, open-source implementations, and experimental study of various algorithms. Finally, we discuss the technical challenges and potential future directions for improving graph self-supervised learning. Latest advances in graph SSL are summarized in a GitHub repository https://github.com/LirongWu/awesome-graph-self-supervised-learning.
    Guidelines for the Computational Testing of Machine Learning approaches to Vehicle Routing Problems. (arXiv:2109.13983v1 [cs.LG])
    (2 min) Despite extensive research efforts and the remarkable results obtained on Vehicle Routing Problems (VRPs) by algorithms proposed by the Machine Learning community, which are partially or entirely based on data-driven analysis, most of these approaches are still seldom employed by the Operations Research (OR) community. Among the possible causes, we believe that the differing approach to the computational evaluation of the proposed methods plays a major role. In the current work, we highlight a number of challenges (and possible ways to handle them) that arise during the computational study of heuristic approaches to VRPs; if appropriately addressed, they can produce a computational study with the characteristics of those presented in OR papers, hopefully promoting collaboration between the two communities.
    Efficient Reinforced Feature Selection via Early Stopping Traverse Strategy. (arXiv:2109.14180v1 [cs.LG])
    (2 min) In this paper, we propose a single-agent Monte Carlo based reinforced feature selection (MCRFS) method, as well as two efficiency-improvement strategies: an early stopping (ES) strategy and a reward-level interactive (RI) strategy. Feature selection is one of the most important technologies in data preprocessing, aiming to find the optimal feature subset for a given downstream machine learning task, and a great deal of research has been done to improve its effectiveness and efficiency. Recently, multi-agent reinforced feature selection (MARFS) has achieved great success in improving the performance of feature selection; however, MARFS suffers from a heavy computational cost, which greatly limits its application in real-world scenarios. We therefore propose an efficient reinforced feature selection method that uses a single agent to traverse the whole feature set and decide, one feature at a time, whether to select it. Specifically, we first develop a behavior policy and use it to traverse the feature set and generate training data. We then evaluate the target policy on the training data and improve it via the Bellman equation. In addition, we conduct importance sampling in an incremental way and propose an early stopping strategy that improves training efficiency by removing skewed data: the behavior policy stops traversing with a probability inversely proportional to the importance sampling weight. We further propose a reward-level interactive strategy that improves training efficiency via reward-level external advice. Finally, we design extensive experiments on real-world data to demonstrate the superiority of the proposed method.
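    The early stopping rule is described only at a high level, so the following is a loose, assumed sketch: a single agent walks the feature set under a behavior policy, accumulates an importance-sampling weight toward the target policy, and halts with a probability proportional to the inverse of that weight. The stub policies and the proportionality constant are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def traverse_with_early_stopping(behavior_p, target_p, stop_scale=0.05):
    """Single-agent traversal of the feature set (illustrative stub).

    behavior_p / target_p: per-feature selection probabilities standing in for the
    behavior and target policies. The cumulative importance-sampling weight w is
    updated incrementally; the walk halts early with probability proportional to
    1/w (capped at 1), discarding trajectories whose weights become too skewed.
    """
    decisions, w = [], 1.0
    for b, t in zip(behavior_p, target_p):
        select = rng.random() < b                     # behavior policy acts
        p_b = b if select else 1.0 - b
        p_t = t if select else 1.0 - t
        w *= p_t / max(p_b, 1e-8)                     # incremental importance weight
        decisions.append(select)
        if rng.random() < min(1.0, stop_scale / max(w, 1e-8)):
            break                                     # early stop, inversely proportional to w
    return decisions, w

behavior_p = rng.uniform(0.2, 0.8, size=20)
target_p = rng.uniform(0.2, 0.8, size=20)
print(traverse_with_early_stopping(behavior_p, target_p))
```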
    Information-Bottleneck-Based Behavior Representation Learning for Multi-agent Reinforcement learning. (arXiv:2109.14188v1 [cs.LG])
    (2 min) In multi-agent deep reinforcement learning, extracting sufficient and compact information about other agents is critical to attain efficient convergence and scalability of an algorithm. In canonical frameworks, the distillation of such information is often done in an implicit and uninterpretable manner, or explicitly with cost functions that fail to reflect the relationship between information compression and utility in the representation. In this paper, we present Information-Bottleneck-based Other agents' behavior Representation learning for Multi-agent reinforcement learning (IBORM), which explicitly seeks a low-dimensional mapping encoder through which a compact and informative representation relevant to other agents' behaviors is established. IBORM leverages the information bottleneck principle to compress observation information while retaining sufficient information relevant to other agents' behaviors used for cooperation decisions. Empirical results demonstrate that IBORM delivers the fastest convergence rate and the best performance of the learned policies, compared with implicit behavior representation learning and explicit behavior representation learning that does not explicitly consider information compression and utility.
    Three-Stream 3D/1D CNN for Fine-Grained Action Classification and Segmentation in Table Tennis. (arXiv:2109.14306v1 [cs.CV])
    (2 min) This paper proposes a fusion method for modalities extracted from video through a three-stream network with spatio-temporal and temporal convolutions, for fine-grained action classification in sport. It is applied to the TTStroke-21 dataset, which consists of untrimmed videos of table tennis games. The goal is to detect and classify table tennis strokes in the videos, the first step of a bigger scheme aimed at giving feedback to players to improve their performance. The three modalities are raw RGB data, the computed optical flow, and the estimated pose of the player. The network consists of three branches with attention blocks. Features are fused at the latest stage of the network using bilinear layers. Compared to previous approaches, the use of three modalities allows faster convergence and better performance on both tasks: classification of strokes with known temporal boundaries, and joint segmentation and classification. The pose is also further investigated in order to offer richer feedback to the athletes.
    Anderson Acceleration as a Krylov Method with Application to Asymptotic Convergence Analysis. (arXiv:2109.14181v1 [math.NA])
    (2 min) Anderson acceleration is widely used for accelerating the convergence of fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in \mathbb{R}^n$. We consider the case of linear fixed-point methods $x_{k+1}=M x_{k}+b$ and obtain polynomial residual update formulas for AA($m$), i.e., Anderson acceleration with window size $m$. We find that the standard AA($m$) method with initial iterates $x_k$, $k=0, \ldots, m$ defined recursively using AA($k$), is a Krylov space method. This immediately implies that $k$ iterations of AA($m$) cannot produce a smaller residual than $k$ iterations of GMRES without restart (but without implying anything about the relative convergence speed of (windowed) AA($m$) versus restarted GMRES($m$)). We introduce the notion of multi-Krylov method and show that AA($m$) with general initial iterates $\{x_0, \ldots, x_m\}$ is a multi-Krylov method. We find that the AA($m$) residual polynomials observe a periodic memory effect where increasing powers of the error iteration matrix $M$ act on the initial residual as the iteration number increases. We derive several further results based on these polynomial residual update formulas, including orthogonality relations, a lower bound on the AA(1) acceleration coefficient $\beta_k$, and explicit nonlinear recursions for the AA(1) residuals and residual polynomials that do not include the acceleration coefficient $\beta_k$. We apply these results to study the influence of the initial guess on the asymptotic convergence factor of AA(1).
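    For readers unfamiliar with the method, a minimal AA(1) iteration for a linear fixed-point map $x_{k+1}=Mx_k+b$ looks roughly like this (a textbook formulation, not the paper's multi-Krylov analysis); the mixing coefficient is the least-squares minimizer of the combined residual.

```python
import numpy as np

def aa1(M, b, x0, iters=50):
    """Anderson acceleration with window size 1 for the linear map q(x) = M x + b."""
    q = lambda x: M @ x + b
    x_prev = x0
    x_curr = q(x0)                       # plain fixed-point step to initialize
    for _ in range(iters):
        r_curr = q(x_curr) - x_curr      # residuals of the two most recent iterates
        r_prev = q(x_prev) - x_prev
        d = r_curr - r_prev
        beta = float(r_curr @ d) / max(float(d @ d), 1e-30)   # least-squares mixing coefficient
        x_next = (1 - beta) * q(x_curr) + beta * q(x_prev)
        x_prev, x_curr = x_curr, x_next
    return x_curr

rng = np.random.default_rng(0)
M = 0.9 * rng.random((5, 5)) / 5         # a contraction, so the plain iteration also converges
b = rng.random(5)
x_star = np.linalg.solve(np.eye(5) - M, b)
print(np.linalg.norm(aa1(M, b, np.zeros(5)) - x_star))
```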
    (Machine) Learning to Improve the Empirical Performance of Discrete Algorithms. (arXiv:2109.14271v1 [cs.LG])
    (2 min) This paper discusses a data-driven, empirically-based framework to make algorithmic decisions or recommendations without expert knowledge. We improve the performance of two algorithmic case studies: the selection of a pivot rule for the Simplex method and the selection of an all-pairs shortest paths algorithm. We train machine learning methods to select the optimal algorithm for given data without human expert opinion, using two types of techniques: neural networks and boosted decision trees. Based on our experiments, we conclude that: 1) Our selection framework recommends various pivot rules that improve overall performance over simply using a fixed default pivot rule. Over many years, experts have identified the steepest-edge pivot rule as a favorite. Our data analysis shows that the number of iterations taken by steepest-edge is no more than 4 percent above the optimal selection, which corroborates human expert knowledge, but this time the knowledge was obtained using machine learning. Here our recommendation system is best when using gradient boosted trees. 2) For the all-pairs shortest path problem, the trained models yield a large improvement, and our selection is on average 0.07 percent away from the optimal choice. These conclusions do not seem to be affected by the machine learning method used. We attempted a parallel analysis of both algorithmic problems, but it is clear that there are intrinsic differences: for example, in the all-pairs shortest path problem the graph density is a reasonable predictor, whereas there is no analogous single parameter for decisions in the Simplex method.
    How Much Data Analytics is Enough? The ROI of Machine Learning Classification and its Application to Requirements Dependency Classification. (arXiv:2109.14097v1 [cs.SE])
    (2 min) Machine Learning (ML) can substantially improve the efficiency and effectiveness of organizations and is widely used for different purposes within Software Engineering. However, the selection and implementation of ML techniques rely almost exclusively on accuracy criteria. Thus, for organizations wishing to realize the benefits of ML investments, this narrow approach ignores crucial considerations around the anticipated costs of the ML activities across the ML lifecycle, while failing to account for the benefits that are likely to accrue from the proposed activity. We present findings for an approach that addresses this gap by enhancing the accuracy criterion with return on investment (ROI) considerations. Specifically, we analyze the performance of two state-of-the-art ML techniques, Random Forest and Bidirectional Encoder Representations from Transformers (BERT), based on accuracy and ROI for two publicly available data sets, and compare decision-making on requirements dependency extraction (i) exclusively based on accuracy and (ii) extended to include ROI analysis. As a result, we propose recommendations for selecting ML classification techniques based on the degree of training data used. Our findings indicate that considering ROI as an additional criterion can drastically influence ML selection when compared to decisions based on accuracy as the sole criterion.
    Deep Spatio-Temporal Wind Power Forecasting. (arXiv:2109.14530v1 [cs.LG])
    (2 min) Wind power forecasting has drawn increasing attention among researchers as the consumption of renewable energy grows. In this paper, we develop a deep learning approach based on an encoder-decoder structure. Our model forecasts the wind power generated by a wind turbine using its spatial location relative to other turbines and historical wind speed data. In this way, we effectively integrate spatial dependency and temporal trends to make turbine-specific predictions. The advantages of our method over existing work can be summarized as: 1) it directly predicts wind power from historical wind speed, without needing to first predict wind speed and then apply a transformation; 2) it can effectively capture long-term dependencies; 3) it is more scalable and efficient than other deep learning based methods. We demonstrate the efficacy of our model on real-world benchmark datasets.
    Symbolic Brittleness in Sequence Models: on Systematic Generalization in Symbolic Mathematics. (arXiv:2109.13986v1 [cs.LG])
    (2 min) Neural sequence models trained with maximum likelihood estimation have led to breakthroughs in many tasks, where success is defined by the gap between training and test performance. However, their ability to achieve stronger forms of generalization remains unclear. We consider the problem of symbolic mathematical integration, as it requires generalizing systematically beyond the test set. We develop a methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier. Despite promising in-distribution performance of sequence-to-sequence models in this domain, we demonstrate challenges in achieving robustness, compositionality, and out-of-distribution generalization, through both carefully constructed manual test suites and a genetic algorithm that automatically finds large collections of failures in a controllable manner. Our investigation highlights the difficulty of generalizing well with the predominant modeling and learning approach, and the importance of evaluating beyond the test set, across different aspects of generalization.
    Minimax Kernel Machine Learning for a Class of Doubly Robust Functionals with Application to Proximal Causal Inference. (arXiv:2104.02929v2 [stat.ML] UPDATED)
    (2 min) A moment function is called doubly robust if it is comprised of two nuisance functions and has the desired property that the estimator based on it is a consistent estimator of the target parameter even if one of the nuisance functions is misspecified. A common approach for obtaining such a moment function is based on using the influence function (IF) of the parameter of interest. Robins et al. (2008) introduced a large class of doubly robust IFs. However, that class does not include the IF of functionals for which the nuisance functions are solutions to integral equations. Such functionals are particularly important in the field of causal inference, specifically in the recently proposed proximal inference framework (Miao et al., 2018; Tchetgen Tchetgen et al., 2020), which allows for estimating the average causal effect when unobserved confounders are present in the system. Motivated by the proximal inference framework, in this paper we first extend the class of Robins et al. to include doubly robust IFs in which the nuisance functions are solutions to integral equations. Then we demonstrate that the double robustness property of these IFs can be leveraged to construct estimating equations for the nuisance functions, which enables us to solve the integral equations without resorting to parametric models. The main idea is to choose each nuisance function such that it minimizes the dependence of the expected value of the moment function on the other nuisance function. We frame this idea as a minimax optimization problem and use RKHSes as the function spaces. We provide convergence rates for the nuisance functions and conditions required for asymptotic linearity of the estimator of the functional of interest. The experimental results demonstrate that our proposed methodology leads to robust and high-performance estimators of the average causal effect in the proximal inference framework.
    Simulation-based Bayesian inference for multi-fingered robotic grasping. (arXiv:2109.14275v1 [cs.RO])
    (2 min) Multi-fingered robotic grasping is an undeniable stepping stone to universal picking and dexterous manipulation. Yet, multi-fingered grippers remain challenging to control because of their rich nonsmooth contact dynamics or because of sensor noise. In this work, we aim to plan hand configurations by performing Bayesian posterior inference through the full stochastic forward simulation of the robot in its environment, hence robustly accounting for many of the uncertainties in the system. While previous methods either relied on simplified surrogates of the likelihood function or attempted to learn to directly predict maximum likelihood estimates, we bring a novel simulation-based approach for full Bayesian inference based on a deep neural network surrogate of the likelihood-to-evidence ratio. Hand configurations are found by directly optimizing through the resulting amortized and differentiable expression for the posterior. The geometry of the configuration space is accounted for by proposing a Riemannian manifold optimization procedure through the neural posterior. Simulation and physical benchmarks demonstrate the high success rate of the procedure.
    Designing Counterfactual Generators using Deep Model Inversion. (arXiv:2109.14274v1 [cs.LG])
    (2 min) Explanation techniques that synthesize small, interpretable changes to a given image while producing desired changes in the model prediction have become popular for introspecting black-box models. Commonly referred to as counterfactuals, the synthesized explanations are required to contain discernible changes (for easy interpretability) while also being realistic (consistency to the data manifold). In this paper, we focus on the case where we have access only to the trained deep classifier and not the actual training data. While the problem of inverting deep models to synthesize images from the training distribution has been explored, our goal is to develop a deep inversion approach to generate counterfactual explanations for a given query image. Despite their effectiveness in conditional image synthesis, we show that existing deep inversion methods are insufficient for producing meaningful counterfactuals. We propose DISC (Deep Inversion for Synthesizing Counterfactuals) that improves upon deep inversion by utilizing (a) stronger image priors, (b) incorporating a novel manifold consistency objective and (c) adopting a progressive optimization strategy. We find that, in addition to producing visually meaningful explanations, the counterfactuals from DISC are effective at learning classifier decision boundaries and are robust to unknown test-time corruptions.
    An Update on a Progressively Expanded Database for Automated Lung Sound Analysis. (arXiv:2102.04062v3 [cs.SD] UPDATED)
    (2 min) Purpose: We previously established an open-access lung sound database, HF_Lung_V1, and developed deep learning models for inhalation, exhalation, continuous adventitious sound (CAS), and discontinuous adventitious sound (DAS) detection. The amount of data used for training contributes to model accuracy. Herein, we collected larger quantities of data to further improve model performance. Moreover, the issues of noisy labels and sound overlapping were explored. Methods: HF_Lung_V1 was expanded to HF_Lung_V2 with a 1.45x increase in the number of audio files. Convolutional neural network-bidirectional gated recurrent unit network models were trained separately using the HF_Lung_V1 (V1_Train) and HF_Lung_V2 (V2_Train) training sets and then tested using the HF_Lung_V1 (V1_Test) and HF_Lung_V2 (V2_Test) test sets, respectively. Segment and event detection performance was evaluated using the F1 scores. Label quality was assessed. Moreover, the overlap ratios between inhalation, exhalation, CAS, and DAS labels were computed. Results: The model trained using V2_Train exhibited improved F1 scores in inhalation, exhalation, and CAS detection on both V1_Test and V2_Test but not in DAS detection. Poor CAS detection was attributed to the quality of CAS labels. DAS detection was strongly influenced by the overlapping of DAS labels with inhalation and exhalation labels. Conclusion: Collecting greater quantities of lung sound data is vital for developing more accurate lung sound analysis models. To build real ground-truth labels, the labels must be reworked; this process is ongoing. Furthermore, a method for addressing the sound overlapping problem in DAS detection must be formulated.
    Accelerating Recommendation System Training by Leveraging Popular Choices. (arXiv:2103.00686v3 [cs.IR] UPDATED)
    (2 min) Recommender models are commonly used to suggest relevant items to a user in e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper dives deep into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000x more. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the most frequently accessed embeddings, thus reducing data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the executions of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
    Hierarchical Character Tagger for Short Text Spelling Error Correction. (arXiv:2109.14259v1 [cs.CL])
    (2 min) State-of-the-art approaches to the spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference time, and sequence labeling models based on Transformer encoders like BERT, which involve a token-level label space and therefore a large pre-defined vocabulary dictionary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach to alleviate the issue of long-tail label distributions without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is an accurate and much faster approach than many existing models.
    Decision Forests vs. Deep Networks: Conceptual Similarities and Empirical Differences at Small Sample Sizes. (arXiv:2108.13637v2 [cs.LG] UPDATED)
    (2 min) Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies using the most contemporary best practices has yet to be performed. Conceptually, we illustrate that both can be profitably viewed as "partition and vote" schemes. Specifically, the representation space that they both learn is a partitioning of feature space into a union of convex polytopes. For inference, each decides on the basis of votes from the activated nodes. This formulation allows for a unified basic understanding of the relationship between these methods. Empirically, we compare these two strategies on hundreds of tabular data settings, as well as several vision and auditory settings. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found forests to excel at tabular and structured data (vision and audition) with small sample sizes, whereas deep nets performed better on structured data with larger sample sizes. This suggests that further gains in both scenarios may be realized via further combining aspects of forests and networks. We will continue revising this technical report in the coming months with updated results.
    An Accelerated Stochastic Gradient for Canonical Polyadic Decomposition. (arXiv:2109.13964v1 [eess.SP])
    (2 min) We consider the problem of structured canonical polyadic decomposition. If the size of the problem is very big, then stochastic gradient approaches are viable alternatives to classical methods, such as Alternating Optimization and All-At-Once optimization. We extend a recent stochastic gradient approach by employing an acceleration step (Nesterov momentum) in each iteration. We compare our approach with state-of-the-art alternatives, using both synthetic and real-world data, and find it to be very competitive.
    On the Estimation Bias in Double Q-Learning. (arXiv:2109.14419v1 [cs.LG])
    (2 min) Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking maximum estimated values in the Bellman operation. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value prediction and improving learning performance. However, as shown by prior work, double Q-learning is not fully unbiased and suffers from underestimation bias. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximated Bellman operator. To address the concerns of converging to non-optimal stationary solutions, we propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning. This approach leverages an approximate dynamic programming to bound the target value. We extensively evaluate our proposed method in the Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.
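    For context, the classical tabular double Q-learning update that the paper starts from can be sketched as below; the paper's own fix, which bounds the target value via approximate dynamic programming, is not shown here.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=None):
    """One tabular double Q-learning step: action selection and evaluation use
    different tables, which reduces (but does not eliminate) estimation bias."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))                                # select with QA ...
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])    # ... evaluate with QB
    else:
        b_star = int(np.argmax(QB[s_next]))
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])

n_states, n_actions = 5, 3
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))
double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2)
```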
    A Seamless and High-Performance Out-of-Distribution Detection Approach Simply Replacing the SoftMax Loss. (arXiv:2105.14399v4 [cs.LG] UPDATED)
    (2 min) Current out-of-distribution detection approaches usually present special requirements (e.g., collecting outlier data and hyperparameter validation) and produce side effects (classification accuracy drop and slow/inefficient inferences). Recently, entropic out-of-distribution detection has been proposed as a seamless approach (i.e., a solution that avoids all the previously mentioned drawbacks). The entropic out-of-distribution detection solution comprises the IsoMax loss for training and the entropic score for out-of-distribution detection. The IsoMax loss works as a SoftMax loss drop-in replacement because swapping the SoftMax loss with the IsoMax loss requires no changes in the model's architecture or training procedures/hyperparameters. In this paper, we propose to perform what we call an isometrization of the distances used in the IsoMax loss. Additionally, we propose to replace the entropic score with the minimum distance score. Our experiments showed that these simple modifications increase out-of-distribution detection performance while keeping the solution seamless. Besides being competitive with or outperforming all major current approaches, our solution avoids all their current limitations in addition to being much easier to use, as just a simple loss replacement for training the neural network is required. Code available at https://github.com/dlmacedo/entropic-out-of-distribution-detection.
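    A minimum-distance score of the general kind described can be sketched as follows, assuming class prototypes are available (here simply random stand-ins for learned class centers); this illustrates the idea of scoring by distance to the nearest prototype, not the exact IsoMax-based formulation.

```python
import numpy as np

def min_distance_score(features, prototypes):
    """Out-of-distribution score: negative distance to the nearest class prototype.
    Lower scores (farther from every prototype) suggest out-of-distribution inputs."""
    d = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=-1)  # (N, C)
    return -d.min(axis=1)

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(10, 64))                   # one prototype per class (assumed: class means)
in_dist = prototypes[3] + 0.1 * rng.normal(size=(5, 64)) # features near a known class
out_dist = 5.0 + rng.normal(size=(5, 64))                # features far from all classes
print(min_distance_score(in_dist, prototypes).mean(), min_distance_score(out_dist, prototypes).mean())
```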
    Spatially and Robustly Hybrid Mixture Regression Model for Inference of Spatial Dependence. (arXiv:2109.00539v3 [stat.ME] UPDATED)
    (2 min) In this paper, we propose a Spatial Robust Mixture Regression model to investigate the relationship between a response variable and a set of explanatory variables over the spatial domain, assuming that the relationships may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Our method integrates the robust finite mixture Gaussian regression model with spatial constraints, to simultaneously handle spatial nonstationarity, local homogeneity, and outlier contamination. Compared with existing spatial regression models, our proposed model assumes the existence of a few distinct regression models that are estimated based on observations exhibiting similar response-predictor relationships. As such, the proposed model not only accounts for nonstationarity in the spatial trend, but also clusters observations into a few distinct and homogeneous groups. This provides an advantage in interpretation, with a few stationary sub-processes identified that capture the predominant relationships between response and predictor variables. Moreover, the proposed method incorporates robust procedures to handle contamination from both regression outliers and spatial outliers. By doing so, we robustly segment the spatial domain into distinct local regions with similar regression coefficients, and sporadic locations that are purely outliers. A rigorous statistical hypothesis testing procedure is designed to test the significance of such segmentation. Experimental results on many synthetic and real-world datasets demonstrate the robustness, accuracy, and effectiveness of our proposed method, compared with other robust finite mixture regression, spatial regression, and spatial segmentation methods.
    Learning Models for Actionable Recourse. (arXiv:2011.06146v2 [cs.LG] UPDATED)
    (2 min) As machine learning models are increasingly deployed in high-stakes domains such as legal and financial decision-making, there has been growing interest in post-hoc methods for generating counterfactual explanations. Such explanations provide individuals adversely impacted by predicted outcomes (e.g., an applicant denied a loan) with recourse -- i.e., a description of how they can change their features to obtain a positive outcome. We propose a novel algorithm that leverages adversarial training and PAC confidence sets to learn models that theoretically guarantee recourse to affected individuals with high probability. We demonstrate the efficacy of our approach with extensive experiments on real data.
    A Theoretical Connection Between Statistical Physics and Reinforcement Learning. (arXiv:1906.10228v2 [cs.LG] UPDATED)
    (2 min) Sequential decision making in the presence of uncertainty and stochastic dynamics gives rise to distributions over state/action trajectories in reinforcement learning (RL) and optimal control problems. This observation has led to a variety of connections between RL and inference in probabilistic graphical models (PGMs). Here we explore a different dimension to this relationship, examining reinforcement learning using the tools and abstractions of statistical physics. The central object in the statistical physics abstraction is the idea of a partition function $\mathcal{Z}$, and here we construct a partition function from the ensemble of possible trajectories that an agent might take in a Markov decision process. Although value functions and $Q$-functions can be derived from this partition function and interpreted via average energies, the $\mathcal{Z}$-function provides an object with its own Bellman equation that can form the basis of alternative dynamic programming approaches. Moreover, when the MDP dynamics are deterministic, the Bellman equation for $\mathcal{Z}$ is linear, allowing direct solutions that are unavailable for the nonlinear equations associated with traditional value functions. The policies learned via these $\mathcal{Z}$-based Bellman updates are tightly linked to Boltzmann-like policy parameterizations. In addition to sampling actions proportionally to the exponential of the expected cumulative reward as Boltzmann policies would, these policies take entropy into account favoring states from which many outcomes are possible.
    On the Variance of the Fisher Information for Deep Learning. (arXiv:2107.04205v2 [cs.LG] UPDATED)
    (2 min) The Fisher information matrix (FIM) has been applied to the realm of deep learning. It is closely related to the loss landscape, the variance of the parameters, second order optimization, and deep learning theory. However, the exact FIM is either unavailable in closed form or too expensive to compute. In practice, it is almost always estimated based on empirical samples. We investigate two such estimators based on two equivalent representations of the FIM -- both unbiased and consistent with respect to the underlying "true" FIM. Their estimation quality is characterized by their variance given in closed form. We bound their variances and analyze how the parametric structure of a deep neural network can impact the variance. We discuss the meaning of this variance measure and our bounds in the context of deep learning.
    Conditions and Assumptions for Constraint-based Causal Structure Learning. (arXiv:2103.13521v2 [math.ST] UPDATED)
    (2 min) This paper formalizes constraint-based structure learning of the "true" causal graph from observed data when unobserved variables are also present. We provide conditions for a "natural" family of constraint-based structure-learning algorithms that output graphs that are Markov equivalent to the causal graph. Under the faithfulness assumption, this natural family contains all exact structure-learning algorithms. More importantly, we provide clear and testable assumptions, as an alternative to faithfulness, under which any natural structure-learning algorithm outputs graphs that are Markov equivalent to the causal graph. We provide these definitions and results for the general class of models under the assumption that the distribution is Markovian with respect to the true causal graph, and we specialize the definitions and results for structural causal models.
    Beyond Background-Aware Correlation Filters: Adaptive Context Modeling by Hand-Crafted and Deep RGB Features for Visual Tracking. (arXiv:2004.02932v2 [cs.CV] UPDATED)
    (2 min) In recent years, background-aware correlation filters have attracted a lot of research interest in visual target tracking. However, these methods cannot suitably model the target appearance due to their reliance on hand-crafted features. On the other hand, recent deep learning-based visual tracking methods have provided competitive performance at the cost of extensive computation. In this paper, an adaptive background-aware correlation filter-based tracker is proposed that effectively models the target appearance by using either histogram of oriented gradients (HOG) or convolutional neural network (CNN) feature maps. The proposed method exploits a fast 2D non-maximum suppression (NMS) algorithm and semantic information comparison to detect challenging situations. When the HOG-based response map is not reliable, or the context region has low semantic similarity with prior regions, the proposed method constructs the CNN context model to improve the target region estimation. Furthermore, the rejection option allows the proposed method to update the CNN context model only on valid regions. Comprehensive experimental results demonstrate that the proposed adaptive method clearly outperforms state-of-the-art methods in the accuracy and robustness of visual target tracking on the OTB-50, OTB-100, TC-128, UAV-123, and VOT-2015 datasets.
    Synthetic Interventions. (arXiv:2006.07691v4 [econ.EM] UPDATED)
    (2 min) Consider a setting where there are $N$ heterogeneous units (e.g., individuals, sub-populations) and $D$ interventions (e.g., socio-economic policies). Our goal is to learn the potential outcome associated with every intervention on every unit (i.e., $N \times D$ causal parameters). Towards this, we present a causal framework, synthetic interventions (SI), to infer these $N \times D$ causal parameters while only observing each of the $N$ units under at most two interventions, independent of $D$. This can be significant as the number of interventions, i.e, level of personalization, grows. Importantly, our estimator also allows for latent confounders that determine how interventions are assigned. Theoretically, under a novel tensor factor model across units, measurements, and interventions, we formally establish an identification result for each of these $N \times D$ causal parameters and establish finite-sample consistency and asymptotic normality of our estimator. The estimator is furnished with a data-driven test to verify its suitability. Empirically, we validate our framework through both experimental and observational case studies; namely, a large-scale A/B test performed on an e-commerce platform, and an evaluation of mobility restriction on morbidity outcomes due to COVID-19. We believe this has important implications for program evaluation and the design of data-efficient RCTs with heterogeneous units and multiple interventions.
    A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning. (arXiv:2106.02097v2 [cs.AI] UPDATED)
    (2 min) We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state, in order to plan and to generalize better out-of-distribution. The agent uses a bottleneck mechanism over a set-based representation to force the number of entities to which the agent attends at each planning step to be small. In experiments, we investigate the bottleneck mechanism with several sets of customized environments featuring different challenges. We consistently observe that the design allows the planning agents to generalize their learned task-solving abilities in compatible unseen environments by attending to the relevant objects, leading to better out-of-distribution performance.
    UNETR: Transformers for 3D Medical Image Segmentation. (arXiv:2103.10504v2 [eess.IV] UPDATED)
    (2 min) Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr
    Itsy Bitsy SpiderNet: Fully Connected Residual Network for Fraud Detection. (arXiv:2105.08120v2 [cs.LG] UPDATED)
    (2 min) As technology develops, the scope of fraud is increasing, resulting in annual losses of billions of dollars worldwide. Preventive protection measures become obsolete and vulnerable over time, so effective detection tools are needed. In this paper, we propose a convolutional neural network architecture, SpiderNet, designed to solve fraud detection problems. We noticed that the principles of pooling and convolutional layers in neural networks are very similar to the way anti-fraud analysts work when conducting investigations. Moreover, the skip-connections used in neural networks make it possible to use features of varying power in anti-fraud models. Our experiments have shown that SpiderNet provides better quality compared to Random Forest and to 1D-CNN, 1D-DenseNet, and F-DenseNet neural networks adapted for anti-fraud modeling problems. We also propose new approaches to fraud feature engineering called B-tests and W-tests, which generalize the concepts of Benford's Law for fraud anomaly detection. Our results show that B-tests and W-tests significantly increase the quality of our anti-fraud models. The SpiderNet code is available at https://github.com/aasmirnova24/SpiderNet
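    The B-tests and W-tests themselves are not specified in the abstract; as a hedged illustration of a Benford-style fraud feature, the snippet below measures how far the first-digit distribution of transaction amounts deviates from Benford's Law (the divergence measure and the toy data are assumptions, not the authors' tests).

```python
import numpy as np

BENFORD = np.log10(1 + 1 / np.arange(1, 10))           # expected first-digit frequencies, digits 1..9

def benford_divergence(amounts):
    """KL divergence between the observed first-digit distribution of positive
    amounts and Benford's Law; larger values flag more anomalous behaviour."""
    amounts = np.asarray(amounts, dtype=float)
    amounts = amounts[amounts > 0]
    first_digits = (amounts / 10 ** np.floor(np.log10(amounts))).astype(int)   # leading digit 1..9
    observed = np.bincount(first_digits, minlength=10)[1:10].astype(float)
    observed = (observed + 1e-9) / (observed.sum() + 9e-9)
    return float(np.sum(observed * np.log(observed / BENFORD)))

normal = np.random.lognormal(mean=3.0, sigma=1.0, size=5000)   # spans orders of magnitude, roughly Benford-like
suspicious = np.random.uniform(400, 500, size=5000)            # clustered amounts, far from Benford
print(benford_divergence(normal), benford_divergence(suspicious))
```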
    A Comprehensive Survey and Performance Analysis of Activation Functions in Deep Learning. (arXiv:2109.14545v1 [cs.LG])
    (2 min) Neural networks have shown tremendous growth in recent years to solve numerous problems. Various types of neural networks have been introduced to deal with different types of problems. However, the main goal of any neural network is to transform the non-linearly separable input data into more linearly separable abstract features using a hierarchy of layers. These layers are combinations of linear and nonlinear functions. The most popular and common non-linearity layers are activation functions (AFs), such as Logistic Sigmoid, Tanh, ReLU, ELU, Swish and Mish. In this paper, a comprehensive overview and survey is presented for AFs in neural networks for deep learning. Different classes of AFs such as Logistic Sigmoid and Tanh based, ReLU based, ELU based, and Learning based are covered. Several characteristics of AFs such as output range, monotonicity, and smoothness are also pointed out. A performance comparison is also performed among 18 state-of-the-art AFs with different networks on different types of data. The insights of AFs are presented to benefit the researchers for doing further research and practitioners to select among different choices. The code used for experimental comparison is released at: \url{https://github.com/shivram1987/ActivationFunctions}.
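    The activation functions named above are standard and can be written down directly; minimal NumPy definitions follow for reference.

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0.0, x)
def elu(x, alpha=1.0):  return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
def swish(x, beta=1.0): return x * sigmoid(beta * x)          # a.k.a. SiLU when beta = 1
def mish(x):    return x * np.tanh(np.log1p(np.exp(x)))       # x * tanh(softplus(x))

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, elu, swish, mish):
    print(f"{f.__name__:8s}", np.round(f(x), 3))
```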
    FLASHE: Additively Symmetric Homomorphic Encryption for Cross-Silo Federated Learning. (arXiv:2109.00675v2 [cs.CR] UPDATED)
    (2 min) Homomorphic encryption (HE) is a promising privacy-preserving technique for cross-silo federated learning (FL), where organizations perform collaborative model training on decentralized data. Despite the strong privacy guarantee, general HE schemes result in significant computation and communication overhead. Prior works employ batch encryption to address this problem, but it is still suboptimal in mitigating communication overhead and is incompatible with sparsification techniques. In this paper, we propose FLASHE, an HE scheme tailored for cross-silo FL. To capture the minimum requirements of security and functionality, FLASHE drops the asymmetric-key design and only involves modular addition operations with random numbers. Depending on whether to accommodate sparsification techniques, FLASHE is optimized in computation efficiency with different approaches. We have implemented FLASHE as a pluggable module atop FATE, an industrial platform for cross-silo FL. Compared to plaintext training, FLASHE slightly increases the training time by $\leq6\%$, with no communication overhead.
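    FLASHE's exact construction is not reproduced here; the sketch below only illustrates the general idea of additive masking with pairwise-cancelling pseudorandom masks, so that a server can aggregate quantized updates without seeing any individual one. Key agreement, mask derivation, dropout handling, and the sparsification-aware optimizations are all omitted, and every constant is illustrative.

```python
import numpy as np

MOD = 2 ** 32                                # work in integers modulo 2^32

def quantize(update, scale=2 ** 16):
    return (np.round(update * scale).astype(np.int64)) % MOD

def mask_update(update_q, my_id, peer_ids, seed=0):
    """Add one pseudorandom mask per peer; masks are constructed so that, summed over
    all clients, every pairwise mask cancels. Mask derivation here is illustrative only."""
    masked = update_q.copy()
    for p in peer_ids:
        rng = np.random.default_rng(seed + min(my_id, p) * 1000 + max(my_id, p))
        mask = rng.integers(0, MOD, size=update_q.shape, dtype=np.int64)
        masked = (masked + mask if my_id < p else masked - mask) % MOD
    return masked

clients = {0: np.array([0.5, -1.0]), 1: np.array([0.25, 0.75]), 2: np.array([-0.5, 0.5])}
masked = [mask_update(quantize(u), i, [j for j in clients if j != i]) for i, u in clients.items()]
aggregate = sum(masked) % MOD                # the server only ever sees masked vectors
# Convert back from modular arithmetic, mapping values above MOD/2 to negatives.
agg = np.where(aggregate > MOD // 2, aggregate - MOD, aggregate) / 2 ** 16
print(agg)   # equals the sum of the plaintext updates: [0.25, 0.25]
```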
    Bayesian optimization assisted unsupervised learning for efficient intra-tumor partitioning in MRI and survival prediction for glioblastoma patients. (arXiv:2012.03115v2 [eess.IV] UPDATED)
    (2 min) Glioblastoma is profoundly heterogeneous in microstructure and vasculature, which may lead to tumor regional diversity and distinct treatment responses. Although successful in tumor sub-region segmentation and survival prediction, radiomics based on machine learning algorithms is challenged in terms of robustness, due to its vague intermediate process and tracking of changes. The weak interpretability of the models also poses challenges to clinical application. Here we propose a machine learning framework to semi-automatically fine-tune clustering algorithms and quantitatively identify stable sub-regions for reliable clinical survival prediction. Hyper-parameters are automatically determined by the global minimum of the trained Gaussian Process (GP) surrogate model through Bayesian optimization (BO), to alleviate the difficulty of tuning parameters for clinical researchers. To enhance the interpretability of the survival prediction model, we incorporate prior knowledge of intra-tumoral heterogeneity by segmenting tumor sub-regions and extracting sub-regional features. The results demonstrate that the global minimum of the trained GP surrogate can be used as a sub-optimal hyper-parameter solution for efficient tuning. The sub-regions segmented based on physiological MRI can be applied to predict patient survival, which could enhance the clinical interpretability of the machine learning model.
    SigGPDE: Scaling Sparse Gaussian Processes on Sequential Data. (arXiv:2105.04211v2 [stat.ML] UPDATED)
    (2 min) Making predictions and quantifying their uncertainty when the input data is sequential is a fundamental learning challenge, recently attracting increasing attention. We develop SigGPDE, a new scalable sparse variational inference framework for Gaussian Processes (GPs) on sequential data. Our contribution is twofold. First, we construct inducing variables underpinning the sparse approximation so that the resulting evidence lower bound (ELBO) does not require any matrix inversion. Second, we show that the gradients of the GP signature kernel are solutions of a hyperbolic partial differential equation (PDE). This theoretical insight allows us to build an efficient back-propagation algorithm to optimize the ELBO. We showcase the significant computational gains of SigGPDE compared to existing methods, while achieving state-of-the-art performance for classification tasks on large datasets of up to 1 million multivariate time series.
    Multi-armed Bandit Requiring Monotone Arm Sequences. (arXiv:2106.03790v3 [cs.LG] UPDATED)
    (2 min) In many online learning or multi-armed bandit problems, the taken actions or pulled arms are ordinal and required to be monotone over time. Examples include dynamic pricing, in which the firms use markup pricing policies to please early adopters and deter strategic waiting, and clinical trials, in which the dose allocation usually follows the dose escalation principle to prevent dose limiting toxicities. We consider the continuum-armed bandit problem when the arm sequence is required to be monotone. We show that when the unknown objective function is Lipschitz continuous, the regret is $O(T)$. When in addition the objective function is unimodal or quasiconcave, the regret is $\tilde O(T^{3/4})$ under the proposed algorithm, which is also shown to be the optimal rate. This deviates from the optimal rate $\tilde O(T^{2/3})$ in the continuous-armed bandit literature and demonstrates the cost to the learning efficiency brought by the monotonicity requirement.
    Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games. (arXiv:2106.01969v3 [cs.LG] UPDATED)
    (2 min) Potential games are arguably one of the most important and widely studied classes of normal form games. They define the archetypal setting of multi-agent coordination as all agent utilities are perfectly aligned with each other via a common potential function. Can this intuitive framework be transplanted in the setting of Markov Games? What are the similarities and differences between multi-agent coordination with and without state dependence? We present a novel definition of Markov Potential Games (MPG) that generalizes prior attempts at capturing complex stateful multi-agent coordination. Counter-intuitively, insights from normal-form potential games do not carry over as MPGs can consist of settings where state-games can be zero-sum games. In the opposite direction, Markov games where every state-game is a potential game are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main technical result, we prove fast convergence of independent policy gradient to Nash policies by adapting recent gradient dominance property arguments developed for single agent MDPs to multi-agent learning settings.
    Training Spiking Neural Networks Using Lessons From Deep Learning. (arXiv:2109.12894v2 [cs.NE] UPDATED)
    (2 min) The brain is the perfect place to look for inspiration to develop more efficient neural networks. The inner workings of our synapses and neurons provide a glimpse at what the future of deep learning might look like. This paper shows how to apply the lessons learnt from several decades of research in deep learning, gradient descent, backpropagation, and neuroscience to biologically plausible spiking neural networks. The paper explores the delicate interplay between encoding data as spikes and the learning process; the challenges and solutions of applying gradient-based learning to spiking neural networks; the subtle link between temporal backpropagation and spike timing dependent plasticity; and how deep learning might move towards biologically plausible online learning. Some ideas are well accepted and commonly used amongst the neuromorphic engineering community, while others are presented or justified for the first time here.
    Network Clustering by Embedding of Attribute-augmented Graphs. (arXiv:2109.09367v2 [cs.LG] UPDATED)
    (2 min) In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. The aim is to group vertices that are similar not only in terms of structural connectivity but also in terms of attribute values. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges, as proposed in [5, 27]. The augmented graph is embedded in a Euclidean space associated with its Laplacian, and a modified K-means algorithm is applied to identify clusters. The modified K-means uses a vector distance measure in which each original vertex is assigned a vector-valued set of coordinates depending on both structural connectivity and attribute similarities. To define the coordinate vectors we employ an adaptive AMG (Algebraic MultiGrid) method to identify the coordinate directions in the embedding Euclidean space, extending our previous result for graphs without attributes. We demonstrate the effectiveness of our proposed clustering method on both synthetic and real-world attributed graphs.
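    The paper derives its embedding coordinates with an adaptive AMG method; as a generic stand-in, the sketch below embeds an (attribute-augmented) adjacency matrix with the smallest non-trivial eigenvectors of the normalized Laplacian and clusters the coordinates with off-the-shelf k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_attributed_graph(adj, n_clusters, n_dims=None):
    """Embed a (possibly attribute-augmented) adjacency matrix with the smallest
    non-trivial eigenvectors of the normalized Laplacian, then run k-means.
    This uses a plain eigendecomposition, not the adaptive AMG embedding of the paper."""
    n_dims = n_dims or n_clusters
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt     # normalized Laplacian
    vals, vecs = np.linalg.eigh(lap)
    coords = vecs[:, 1:n_dims + 1]                             # skip the trivial eigenvector
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(coords)

# Toy graph: two cliques joined by one edge; attribute similarity could be folded
# into `adj` by adding attribute vertices/edges before calling this function.
A = np.zeros((6, 6))
A[:3, :3] = 1; A[3:, 3:] = 1; A[2, 3] = A[3, 2] = 1
np.fill_diagonal(A, 0)
print(cluster_attributed_graph(A, n_clusters=2))
```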
    An Instance-Dependent Simulation Framework for Learning with Label Noise. (arXiv:2107.11413v3 [cs.LG] UPDATED)
    (2 min) We propose a simulation framework for generating instance-dependent noisy labels via a pseudo-labeling paradigm. We show that the distribution of the synthetic noisy labels generated with our framework is closer to human labels compared to independent and class-conditional random flipping. Equipped with controllable label noise, we study the negative impact of noisy labels across a few practical settings to understand when label noise is more problematic. We also benchmark several existing algorithms for learning with noisy labels and compare their behavior on our synthetic datasets and on the datasets with independent random label noise. Additionally, with the availability of annotator information from our simulation framework, we propose a new technique, Label Quality Model (LQM), that leverages annotator features to predict and correct against noisy labels. We show that by adding LQM as a label correction step before applying existing noisy label techniques, we can further improve the models' performance.
    Metrical Task Systems with Online Machine Learned Advice. (arXiv:2012.09394v2 [cs.LG] UPDATED)
    (2 min) Machine learning algorithms are designed to make accurate predictions of the future based on existing data, while online algorithms seek to bound some performance measure (typically the competitive ratio) without knowledge of the future. Lykouris and Vassilvitskii demonstrated that augmenting online algorithms with a machine learned predictor can provably decrease the competitive ratio as long as the predictor is suitably accurate. In this work we apply this idea to the Online Metrical Task System problem, which was put forth by Borodin, Linial, and Saks as a general model for dynamic systems processing tasks in an online fashion. We focus on the specific class of uniform task systems on $n$ tasks, for which the best deterministic algorithm is $O(n)$ competitive and the best randomized algorithm is $O(\log n)$ competitive. By giving an online algorithm access to a machine learned oracle with absolute predictive error bounded above by $\eta_0$, we construct a $\Theta(\min(\sqrt{\eta_0}, \log n))$ competitive algorithm for the uniform case of the metrical task systems problem. We also give a $\Theta(\log \eta_0)$ lower bound on the competitive ratio of any randomized algorithm.
    Compressive Visual Representations. (arXiv:2109.12909v2 [cs.LG] UPDATED)
    (2 min) Learning effective visual representations that generalize well without human supervision is a fundamental problem in order to apply Machine Learning to a wide variety of tasks. Recently, two families of self-supervised methods, contrastive learning and latent bootstrapping, exemplified by SimCLR and BYOL respectively, have made significant progress. In this work, we hypothesize that adding explicit information compression to these algorithms yields better and more robust representations. We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks. Furthermore, we explore the relationship between Lipschitz continuity and compression, showing a tractable lower bound on the Lipschitz constant of the encoders we learn. As Lipschitz continuity is closely related to robustness, this provides a new explanation for why compressed models are more robust. Our experiments confirm that adding compression to SimCLR and BYOL significantly improves linear evaluation accuracies and model robustness across a wide range of domain shifts. In particular, the compressed version of BYOL achieves 76.0% Top-1 linear evaluation accuracy on ImageNet with ResNet-50, and 78.8% with ResNet-50 2x.
    Non-Asymptotic Analysis for Two Time-scale TDC with General Smooth Function Approximation. (arXiv:2104.02836v2 [cs.LG] UPDATED)
    (2 min) Temporal-difference learning with gradient correction (TDC) is a two time-scale algorithm for policy evaluation in reinforcement learning. This algorithm was initially proposed with linear function approximation, and was later extended to the one with general smooth function approximation. The asymptotic convergence for the on-policy setting with general smooth function approximation was established in [bhatnagar2009convergent], however, the finite-sample analysis remains unsolved due to challenges in the non-linear and two-time-scale update structure, non-convex objective function and the time-varying projection onto a tangent plane. In this paper, we develop novel techniques to explicitly characterize the finite-sample error bound for the general off-policy setting with i.i.d.\ or Markovian samples, and show that it converges as fast as $\mathcal O(1/\sqrt T)$ (up to a factor of $\mathcal O(\log T)$). Our approach can be applied to a wide range of value-based reinforcement learning algorithms with general smooth function approximation.
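    For reference, the original linear-function-approximation version of TDC (the special case this paper generalizes) maintains a main weight vector and an auxiliary weight vector updated on two different time scales. A minimal numpy sketch of one update step:

        import numpy as np

        def tdc_update(theta, w, phi, phi_next, reward, gamma=0.99, alpha=0.01, beta=0.1):
            """One step of linear TDC (Sutton et al., 2009): theta parameterizes the value
            function, w is the auxiliary vector updated on the faster time scale (beta > alpha)."""
            delta = reward + gamma * phi_next @ theta - phi @ theta        # TD error
            theta = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
            w = w + beta * (delta - phi @ w) * phi
            return theta, w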
    Clustering with Tangles: Algorithmic Framework and Theoretical Guarantees. (arXiv:2006.14444v2 [cs.LG] UPDATED)
    (2 min) Originally, tangles had been invented as an abstract tool in mathematical graph theory to prove the famous graph minor theorem. In this paper we showcase the practical potential of tangles in machine learning applications. Given a collection of weak cuts (bipartitions) of any dataset, tangles provide a mechanism to aggregate these cuts such that they point in the direction of a dense structure. As a result, a cluster is softly characterized by a set of consistent pointers. This highly flexible approach can solve soft clustering problems in a variety of setups, ranging from questionnaires over community detection in graphs to clustering points in metric spaces. The output of our proposed framework is of a hierarchical nature and induces the notion of a soft dendrogram, which can be helpful to explore the cluster structure of a dataset. The computational complexity of aggregating the cuts is only linear in the number of data points and places the bottleneck on generating the cuts, for which simple and fast algorithms form a sufficient basis. The contribution of this paper is to translate the abstract mathematical literature into algorithms that can be used in machine learning. We demonstrate the power of tangles in many different use cases and provide theoretical guarantees for their performance.
    Learning When to Switch: Composing Controllers to Traverse a Sequence of Terrain Artifacts. (arXiv:2011.00440v2 [cs.RO] UPDATED)
    (2 min) Legged robots often use separate control policies that are highly engineered for traversing difficult terrain such as stairs, gaps, and steps, where switching between policies is only possible when the robot is in a region that is common to adjacent controllers. Deep Reinforcement Learning (DRL) is a promising alternative to hand-crafted control design, though typically requires the full set of test conditions to be known before training. DRL policies can result in complex (often unrealistic) behaviours that have few or no overlapping regions between adjacent policies, making it difficult to switch behaviours. In this work we develop multiple DRL policies with Curriculum Learning (CL), each that can traverse a single respective terrain condition, while ensuring an overlap between policies. We then train a network for each destination policy that estimates the likelihood of successfully switching from any other policy. We evaluate our switching method on a previously unseen combination of terrain artifacts and show that it performs better than heuristic methods. While our method is trained on individual terrain types, it performs comparably to a Deep Q Network trained on the full set of terrain conditions. This approach allows the development of separate policies in constrained conditions with embedded prior knowledge about each behaviour, that is scalable to any number of behaviours, and prepares DRL methods for applications in the real world.
    Optimal Targeting in Fundraising: A Causal Machine-Learning Approach. (arXiv:2103.10251v3 [econ.EM] UPDATED)
    (2 min) Ineffective fundraising lowers the resources charities can use to provide goods. We combine a field experiment and a causal machine-learning approach to increase a charity's fundraising effectiveness. The approach optimally targets a fundraising instrument to individuals whose expected donations exceed solicitation costs. Our results demonstrate that machine-learning-based optimal targeting allows the charity to substantially increase donations net of fundraising costs relative to uniform benchmarks in which either everybody or no one receives the gift. To that end, it (a) should direct its fundraising efforts to a subset of past donors and (b) never address individuals who were previously asked but never donated. Further, we show that the benefits of machine-learning-based optimal targeting even materialize when the charity only exploits publicly available geospatial information or applies the estimated optimal targeting rule to later fundraising campaigns conducted in similar samples. We conclude that charities not engaging in optimal targeting waste significant resources.
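    The targeting logic itself is simple once a causal model has estimated each individual's expected treatment effect on donations. As a hedged illustration (the paper's estimator and cost figures are not reproduced here), one would solicit exactly those individuals whose predicted incremental donation exceeds the solicitation cost:

        import numpy as np

        def optimal_targeting(predicted_uplift, solicitation_cost):
            """Return a boolean mask of individuals to target: solicit whenever the
            estimated incremental donation (conditional average treatment effect)
            exceeds the per-contact cost. `predicted_uplift` would come from a causal
            ML estimator such as a causal forest; the numbers below are placeholders."""
            predicted_uplift = np.asarray(predicted_uplift, dtype=float)
            return predicted_uplift > solicitation_cost

        # Illustrative use with made-up numbers:
        uplift = np.array([1.20, 0.10, 3.50, -0.40])   # expected extra donation per person
        print(optimal_targeting(uplift, solicitation_cost=0.80))   # [ True False  True False]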
    GNNLens: A Visual Analytics Approach for Prediction Error Diagnosis of Graph Neural Networks. (arXiv:2011.11048v4 [cs.HC] UPDATED)
    (2 min) Graph Neural Networks (GNNs) aim to extend deep learning techniques to graph data and have achieved significant progress in graph analysis tasks (e.g., node classification) in recent years. However, similar to other deep neural networks like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), GNNs behave like a black box with their details hidden from model developers and users. It is therefore difficult to diagnose possible errors of GNNs. Despite many visual analytics studies being done on CNNs and RNNs, little research has addressed the challenges for GNNs. This paper fills the research gap with an interactive visual analysis tool, GNNLens, to assist model developers and users in understanding and analyzing GNNs. Specifically, Parallel Sets View and Projection View enable users to quickly identify and validate error patterns in the set of wrong predictions; Graph View and Feature Matrix View offer a detailed analysis of individual nodes to assist users in forming hypotheses about the error patterns. Since GNNs jointly model the graph structure and the node features, we reveal the relative influences of the two types of information by comparing the predictions of three models: GNN, Multi-Layer Perceptron (MLP), and GNN Without Using Features (GNNWUF). Two case studies and interviews with domain experts demonstrate the effectiveness of GNNLens in facilitating the understanding of GNN models and their errors.
    FINE Samples for Learning with Noisy Labels. (arXiv:2102.11628v2 [cs.LG] UPDATED)
    (2 min) Modern deep neural networks (DNNs) become frail when the datasets contain noisy (incorrect) class labels. Robust techniques in the presence of noisy labels can be categorized into two folds: developing noise-robust functions or using noise-cleansing methods by detecting the noisy data. Recently, noise-cleansing methods have been considered as the most competitive noisy-label learning algorithms. Despite their success, their noisy label detectors are often based on heuristics more than a theory, requiring a robust classifier to predict the noisy data with loss values. In this paper, we propose a novel detector for filtering label noise. Unlike most existing methods, we focus on each data's latent representation dynamics and measure the alignment between the latent distribution and each representation using the eigendecomposition of the data covariance matrix. Our framework, coined as filtering noisy instances via their eigenvectors (FINE), provides a robust detector with derivative-free simple methods having theoretical guarantees. Under our framework, we propose three applications of the FINE: sample-selection approach, semi-supervised learning approach, and collaboration with noise-robust loss functions. Experimental results show that the proposed methods consistently outperform corresponding baselines for all three applications on various benchmark datasets.
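    A minimal sketch of the eigenvector-alignment idea (simplified, not the full FINE pipeline): for each observed class, take the principal eigenvector of that class's feature gram matrix and keep the samples whose representations align best with it.

        import numpy as np

        def fine_style_filter(features, noisy_labels, keep_ratio=0.7):
            """features: (n, d) latent representations; noisy_labels: (n,) observed labels.
            Returns a boolean mask of samples judged clean by eigenvector alignment."""
            n = len(noisy_labels)
            keep = np.zeros(n, dtype=bool)
            for c in np.unique(noisy_labels):
                idx = np.where(noisy_labels == c)[0]
                X = features[idx]
                X = X / np.linalg.norm(X, axis=1, keepdims=True)   # work with directions
                _, _, vt = np.linalg.svd(X, full_matrices=False)
                v = vt[0]                       # principal direction of the class features
                score = (X @ v) ** 2            # alignment of each sample with that direction
                cutoff = np.quantile(score, 1.0 - keep_ratio)
                keep[idx[score >= cutoff]] = True
            return keep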
    baller2vec++: A Look-Ahead Multi-Entity Transformer For Modeling Coordinated Agents. (arXiv:2104.11980v2 [cs.LG] UPDATED)
    (2 min) In many multi-agent spatiotemporal systems, agents operate under the influence of shared, unobserved variables (e.g., the play a team is executing in a game of basketball). As a result, the trajectories of the agents are often statistically dependent at any given time step; however, almost universally, multi-agent models implicitly assume the agents' trajectories are statistically independent at each time step. In this paper, we introduce baller2vec++, a multi-entity Transformer that can effectively model coordinated agents. Specifically, baller2vec++ applies a specially designed self-attention mask to a mixture of location and "look-ahead" trajectory sequences to learn the distributions of statistically dependent agent trajectories. We show that, unlike baller2vec (baller2vec++'s predecessor), baller2vec++ can learn to emulate the behavior of perfectly coordinated agents in a simulated toy dataset. Additionally, when modeling the trajectories of professional basketball players, baller2vec++ outperforms baller2vec by a wide margin.
    Graphical modelling in continuous-time: consistency guarantees and algorithms using Neural ODEs. (arXiv:2105.02522v2 [stat.ML] UPDATED)
    (2 min) The discovery of structure from time series data is a key problem in fields of study working with complex systems. Most identifiability results and learning algorithms assume the underlying dynamics to be discrete in time. Comparatively few, in contrast, explicitly define dependencies in infinitesimal intervals of time, independently of the scale of observation and of the regularity of sampling. In this paper, we consider score-based graph learning for the study of dynamical systems. We prove that for vector fields parameterized in a large class of neural networks, least squares optimization with adaptive regularization schemes consistently recovers directed graphs of local independencies in systems of stochastic differential equations. Using this insight, we propose a score-based learning algorithm based on penalized Neural Ordinary Differential Equations (modelling the mean process) that we show to be applicable to the general setting of irregularly-sampled multivariate time series and to outperform the state of the art across a range of dynamical systems.
    Classification with Rejection Based on Cost-sensitive Classification. (arXiv:2010.11748v5 [stat.ML] UPDATED)
    (2 min) The goal of classification with rejection is to avoid risky misclassification in error-critical applications such as medical diagnosis and product inspection. In this paper, based on the relationship between classification with rejection and cost-sensitive classification, we propose a novel method of classification with rejection by learning an ensemble of cost-sensitive classifiers, which satisfies all the following properties: (i) it can avoid estimating class-posterior probabilities, resulting in improved classification accuracy, (ii) it allows a flexible choice of losses including non-convex ones, (iii) it does not require complicated modifications when using different losses, (iv) it is applicable to both binary and multiclass cases, and (v) it is theoretically justifiable for any classification-calibrated loss. Experimental results demonstrate the usefulness of our proposed approach in clean-labeled, noisy-labeled, and positive-unlabeled classification.
    An Energy Efficient Health Monitoring Approach with Wireless Body Area Networks. (arXiv:2109.14546v1 [eess.SP])
    (3 min) Wireless Body Area Networks (WBANs) comprise a network of sensors subcutaneously implanted or placed near the body surface and facilitate continuous monitoring of health parameters of a patient. Research endeavours involving WBAN are directed towards effective transmission of detected parameters to a Local Processing Unit (LPU, usually a mobile device) and analysis of the parameters at the LPU or a back-end cloud. An important concern in WBAN is the lightweight nature of WBAN nodes and the need to conserve their energy. This is especially true for subcutaneously implanted nodes that cannot be recharged or regularly replaced. Work in energy conservation is mostly aimed at optimising the routing of signals to minimise energy expended. In this paper, a simple yet innovative approach to energy…
    Untangling Braids with Multi-agent Q-Learning. (arXiv:2109.14502v1 [cs.LG])
    (2 min) We use reinforcement learning to tackle the problem of untangling braids. We experiment with braids with 2 and 3 strands. Two competing players learn to tangle and untangle a braid. We interface the braid untangling problem with the OpenAI Gym environment, a widely used way of connecting agents to reinforcement learning problems. The results provide evidence that the more we train the system, the better the untangling player gets at untangling braids. At the same time, our tangling player produces good examples of tangled braids.
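    A toy Gym-style environment conveys the setup (hedged: this sketch only applies free cancellation of adjacent inverse generators, not the full braid relations the paper's system would need):

        import random

        class BraidUntangleEnv:
            """Toy environment: the state is a braid word over generators 1..n_gens
            (negative = inverse). An action appends a generator; the reward is the
            reduction in word length after cancelling adjacent inverse pairs."""

            def __init__(self, n_gens=2, start_len=6):
                self.n_gens, self.start_len = n_gens, start_len
                self.actions = [g for g in range(1, n_gens + 1)] + [-g for g in range(1, n_gens + 1)]

            def _reduce(self, word):
                out = []
                for g in word:
                    if out and out[-1] == -g:
                        out.pop()               # free cancellation: sigma_i sigma_i^{-1} = identity
                    else:
                        out.append(g)
                return out

            def reset(self):
                self.word = self._reduce([random.choice(self.actions) for _ in range(self.start_len)])
                return list(self.word)

            def step(self, action):
                before = len(self.word)
                self.word = self._reduce(self.word + [self.actions[action]])
                reward = before - len(self.word)    # positive when the braid gets shorter
                done = len(self.word) == 0          # fully untangled
                return list(self.word), reward, done, {}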
    Variational Inference for Continuous-Time Switching Dynamical Systems. (arXiv:2109.14492v1 [cs.LG])
    (2 min) Switching dynamical systems provide a powerful, interpretable modeling framework for inference in time-series data in, e.g., the natural sciences or engineering applications. Since many areas, such as biology or discrete-event systems, are naturally described in continuous time, we present a model based on a Markov jump process modulating a subordinated diffusion process. We provide the exact evolution equations for the prior and posterior marginal densities, the direct solutions of which are however computationally intractable. Therefore, we develop a new continuous-time variational inference algorithm, combining a Gaussian process approximation on the diffusion level with posterior inference for Markov jump processes. By minimizing the path-wise Kullback-Leibler divergence we obtain (i) Bayesian latent state estimates for arbitrary points on the real axis and (ii) point estimates of unknown system parameters, utilizing variational expectation maximization. We extensively evaluate our algorithm under the model assumption and for real-world examples.
    CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning. (arXiv:1902.01981v3 [stat.ML] UPDATED)
    (2 min) We focus on the commonly used synchronous Gradient Descent paradigm for large-scale distributed learning, for which there has been a growing interest to develop efficient and robust gradient aggregation strategies that overcome two key system bottlenecks: communication bandwidth and stragglers' delays. In particular, Ring-AllReduce (RAR) design has been proposed to avoid bandwidth bottleneck at any particular node by allowing each worker to only communicate with its neighbors that are arranged in a logical ring. On the other hand, Gradient Coding (GC) has been recently proposed to mitigate stragglers in a master-worker topology by allowing carefully designed redundant allocation of the data set to the workers. We propose a joint communication topology design and data set allocation strategy, named CodedReduce (CR), that combines the best of both RAR and GC. That is, it parallelizes the communications over a tree topology leading to efficient bandwidth utilization, and carefully designs a redundant data set allocation and coding strategy at the nodes to make the proposed gradient aggregation scheme robust to stragglers. In particular, we quantify the communication parallelization gain and resiliency of the proposed CR scheme, and prove its optimality when the communication topology is a regular tree. Moreover, we characterize the expected run-time of CR and show order-wise speedups compared to the benchmark schemes. Finally, we empirically evaluate the performance of our proposed CR design over Amazon EC2 and demonstrate that it achieves speedups of up to 27.2x and 7.0x, respectively over the benchmarks GC and RAR.
    On Assessing the Usefulness of Proxy Domains for Developing and Evaluating Embodied Agents. (arXiv:2109.14516v1 [cs.LG])
    (2 min) In many situations it is either impossible or impractical to develop and evaluate agents entirely on the target domain on which they will be deployed. This is particularly true in robotics, where doing experiments on hardware is much more arduous than in simulation. This has become arguably more so in the case of learning-based agents. To this end, considerable recent effort has been devoted to developing increasingly realistic and higher fidelity simulators. However, we lack any principled way to evaluate how good a ``proxy domain'' is, specifically in terms of how useful it is in helping us achieve our end objective of building an agent that performs well in the target domain. In this work, we investigate methods to address this need. We begin by clearly separating two uses of proxy domains that are often conflated: 1) their ability to be a faithful predictor of agent performance and 2) their ability to be a useful tool for learning. In this paper, we attempt to clarify the role of proxy domains and establish new proxy usefulness (PU) metrics to compare the usefulness of different proxy domains. We propose the relative predictive PU to assess the predictive ability of a proxy domain and the learning PU to quantify the usefulness of a proxy as a tool to generate learning data. Furthermore, we argue that the value of a proxy is conditioned on the task that it is being used to help solve. We demonstrate how these new metrics can be used to optimize parameters of the proxy domain for which obtaining ground truth via system identification is not trivial.
    Polynomial-time algorithms for Multimarginal Optimal Transport problems with structure. (arXiv:2008.03006v3 [math.OC] UPDATED)
    (2 min) Multimarginal Optimal Transport (MOT) has attracted significant interest due to applications in machine learning, statistics, and the sciences. However, in most applications, the success of MOT is severely limited by a lack of efficient algorithms. Indeed, MOT in general requires exponential time in the number of marginals k and their support sizes n. This paper develops a general theory about what "structure" makes MOT solvable in poly(n,k) time. We develop a unified algorithmic framework for solving MOT in poly(n,k) time by characterizing the "structure" that different algorithms require in terms of simple variants of the dual feasibility oracle. This framework has several benefits. First, it enables us to show that the Sinkhorn algorithm, which is currently the most popular MOT algorithm, requires strictly more structure than other algorithms do to solve MOT in poly(n,k) time. Second, our framework makes it much simpler to develop poly(n,k) time algorithms for a given MOT problem. In particular, it is necessary and sufficient to (approximately) solve the dual feasibility oracle -- which is much more amenable to standard algorithmic techniques. We illustrate this ease-of-use by developing poly(n,k) time algorithms for three general classes of MOT cost structures: (1) graphical structure; (2) set-optimization structure; and (3) low-rank plus sparse structure. For structure (1), we recover the known result that Sinkhorn has poly(n,k) runtime; moreover, we provide the first poly(n,k) time algorithms for computing solutions that are exact and sparse. For structures (2)-(3), we give the first poly(n,k) time algorithms, even for approximate computation. Together, these three structures encompass many -- if not most -- current applications of MOT.
    PAC-Bayes Information Bottleneck. (arXiv:2109.14509v1 [cs.LG])
    (2 min) Information bottleneck (IB) depicts a trade-off between the accuracy and conciseness of encoded representations. IB has succeeded in explaining the objective and behavior of neural networks (NNs) as well as learning better representations. However, there are still criticisms of the universality of IB, e.g., phase transition usually fades away, representation compression is not causally related to generalization, and IB is trivial in deterministic cases. In this work, we build a new IB based on the trade-off between the accuracy and complexity of learned weights of NNs. We argue that this new IB represents a more solid connection to the objective of NNs since the information stored in weights bounds their PAC-Bayes generalization capability, hence we name it PAC-Bayes IB (PIB). On PIB, we can identify the phase transition phenomenon in general cases and solidify the causality between compression and generalization. We then derive a tractable solution of PIB and design a stochastic inference algorithm by Markov chain Monte Carlo sampling. We empirically verify our claims through extensive experiments. We also substantiate the superiority of the proposed algorithm in training better NNs.
    Learning Off-Policy with Online Planning. (arXiv:2008.10066v4 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) in low-data and risk-sensitive domains requires performant and flexible deployment policies that can readily incorporate constraints during deployment. One such class of policies are the semi-parametric H-step lookahead policies, which select actions using trajectory optimization over a dynamics model for a fixed horizon with a terminal value function. In this work, we investigate a novel instantiation of H-step lookahead with a learned model and a terminal value function learned by a model-free off-policy algorithm, named Learning Off-Policy with Online Planning (LOOP). We provide a theoretical analysis of this method, suggesting a tradeoff between model errors and value function errors and empirically demonstrate this tradeoff to be beneficial in deep reinforcement learning. Furthermore, we identify the "Actor Divergence" issue in this framework and propose Actor Regularized Control (ARC), a modified trajectory optimization procedure. We evaluate our method on a set of robotic tasks for Offline and Online RL and demonstrate improved performance. We also show the flexibility of LOOP to incorporate safety constraints during deployment with a set of navigation environments. We demonstrate that LOOP is a desirable framework for robotics applications based on its strong performance in various important RL settings. Project video and details can be found at $\href{https://hari-sikchi.github.io/loop}{\text{hari-sikchi.github.io/loop}}$.
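    The H-step lookahead structure itself is compact to write down. A hedged random-shooting sketch (not the paper's ARC procedure) scores each sampled action sequence by model-predicted rewards plus a terminal value; `dynamics`, `reward`, and `terminal_value` are assumed learned models passed in as callables:

        import numpy as np

        def h_step_lookahead(state, dynamics, reward, terminal_value, action_dim,
                             horizon=5, n_candidates=256, gamma=0.99, rng=None):
            """Pick the first action of the best of `n_candidates` random action sequences,
            each rolled out with the learned dynamics/reward models for `horizon` steps
            and closed off with a terminal value learned by an off-policy RL algorithm."""
            rng = rng or np.random.default_rng()
            best_ret, best_action = -np.inf, None
            for _ in range(n_candidates):
                s, ret = state, 0.0
                actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
                for h in range(horizon):
                    ret += (gamma ** h) * reward(s, actions[h])
                    s = dynamics(s, actions[h])
                ret += (gamma ** horizon) * terminal_value(s)   # value beyond the horizon
                if ret > best_ret:
                    best_ret, best_action = ret, actions[0]
            return best_action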
    Robust Temporal Ensembling for Learning with Noisy Labels. (arXiv:2109.14563v1 [cs.CV])
    (2 min) Successful training of deep neural networks with noisy labels is an essential capability as most real-world datasets contain some amount of mislabeled data. Left unmitigated, label noise can sharply degrade typical supervised learning approaches. In this paper, we present robust temporal ensembling (RTE), which combines robust loss with semi-supervised regularization methods to achieve noise-robust learning. We demonstrate that RTE achieves state-of-the-art performance across the CIFAR-10, CIFAR-100, ImageNet, WebVision, and Food-101N datasets, while forgoing the recent trend of label filtering and/or fixing. Finally, we show that RTE also retains competitive corruption robustness to unforeseen input noise using CIFAR-10-C, obtaining a mean corruption error (mCE) of 13.50% even in the presence of an 80% noise ratio, versus 26.9% mCE with standard methods on clean data.
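    One widely used robust loss of the kind such methods build on (not necessarily the exact loss inside RTE) is the generalized cross-entropy, which interpolates between cross-entropy (q -> 0) and mean absolute error (q = 1):

        import numpy as np

        def generalized_cross_entropy(probs, labels, q=0.7, eps=1e-12):
            """probs: (n, k) predicted class probabilities; labels: (n,) integer targets.
            L_q = (1 - p_y^q) / q, which down-weights samples the model fits poorly and
            thereby reduces the influence of mislabeled examples during training."""
            p_y = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
            return np.mean((1.0 - p_y ** q) / q)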
    Towards Understanding Cooperative Multi-Agent Q-Learning with Value Factorization. (arXiv:2006.00587v4 [cs.LG] UPDATED)
    (2 min) Value factorization is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we formalize a multi-agent fitted Q-iteration framework for analyzing factorized multi-agent Q-learning. Based on this framework, we investigate linear value factorization and reveal that multi-agent Q-learning with this simple decomposition implicitly realizes a powerful counterfactual credit assignment, but may not converge in some settings. Through further analysis, we find that on-policy training or richer joint value function classes can improve its local or global convergence properties, respectively. Finally, to support and extend our theoretical implications to practical realization, we conduct an empirical analysis of state-of-the-art deep multi-agent Q-learning algorithms on didactic examples and a broad set of StarCraft II unit micromanagement tasks.
    On the importance of cross-task features for class-incremental learning. (arXiv:2106.11930v3 [cs.LG] UPDATED)
    (2 min) In class-incremental learning, an agent with limited resources needs to learn a sequence of classification tasks, forming an ever-growing classification problem, with the constraint of not being able to access data from previous tasks. The main difference with task-incremental learning, where a task-ID is available at inference time, is that the learner also needs to perform cross-task discrimination, i.e. distinguish between classes that have not been seen together. Approaches to tackle this problem are numerous and mostly make use of an external memory (buffer) of non-negligible size. In this paper, we ablate the learning of cross-task features and study its influence on the performance of basic replay strategies used for class-IL. We also define a new forgetting measure for class-incremental learning, and see that forgetting is not the principal cause of low performance. Our experimental results show that future algorithms for class-incremental learning should not only prevent forgetting, but also aim to improve the quality of the cross-task features, and the knowledge transfer between tasks. This is especially important when tasks contain a limited amount of data.
    Minimal Expected Regret in Linear Quadratic Control. (arXiv:2109.14429v1 [cs.LG])
    (2 min) We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices $A$ and $B$ may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by $\widetilde{O}(d_x^2\log(T))$ if only $A$ is unknown, and (iii) by $\widetilde{O}(d_x(d_u+d_x)\log(T))$ if only $B$ is unknown and under some mild non-degeneracy condition ($d_x$ and $d_u$ denote the dimensions of the state and of the control input, respectively). These regret scalings are minimal in $T$, $d_x$ and $d_u$ as they match existing lower bounds in scenario (i) when $d_x\le d_u$ [SF20], and in scenario (ii) [lai1986]. We conjecture that our upper bounds are also optimal in scenario (iii) (there is no known lower bound in this setting). Existing online algorithms proceed in epochs of (typically exponentially) growing durations. The control policy is fixed within each epoch, which considerably simplifies the analysis of the estimation error on $A$ and $B$ and hence of the regret. Our algorithm departs from this design choice: it is a simple variant of certainty-equivalence regulators, where the estimates of $A$ and $B$ and the resulting control policy can be updated as frequently as we wish, possibly at every step. Quantifying the impact of such a constantly-varying control policy on the performance of these estimates and on the regret constitutes one of the technical challenges tackled in this paper.
    Vision-Guided Quadrupedal Locomotion in the Wild with Multi-Modal Delay Randomization. (arXiv:2109.14549v1 [cs.RO])
    (2 min) Developing robust vision-guided controllers for quadrupedal robots in complex environments, with various obstacles, dynamical surroundings and uneven terrains, is very challenging. While Reinforcement Learning (RL) provides a promising paradigm for agile locomotion skills with vision inputs in simulation, it is still very challenging to deploy the RL policy in the real world. Our key insight is that aside from the domain gap in visual appearance between the simulation and the real world, the latency from the control pipeline is also a major cause of difficulty. In this paper, we propose Multi-Modal Delay Randomization (MMDR) to address this issue when training RL agents. Specifically, we simulate the latency of real hardware by using past observations, sampled with randomized periods, for both proprioception and vision. We train the RL policy for end-to-end control in a physical simulator without any predefined controller or reference motion, and directly deploy it on the real A1 quadruped robot running in the wild. We evaluate our method in different outdoor environments with complex terrains and obstacles. We demonstrate the robot can smoothly maneuver at a high speed, avoid the obstacles, and show significant improvement over the baselines. Our project page with videos is at https://mehooz.github.io/mmdr-wild/.
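    The core of the delay-randomization idea is easy to sketch (illustrative only; the modalities and sampling ranges below are placeholders, not the paper's exact settings): keep short histories of each modality and feed the policy observations taken from randomly lagged time steps.

        import random
        from collections import deque

        class DelayRandomizedObs:
            """Keep per-modality histories and return observations delayed by a random
            number of control steps, to mimic sensor/control-pipeline latency."""

            def __init__(self, max_delay=None, history=8):
                self.max_delay = max_delay or {"proprio": 2, "vision": 5}
                self.buffers = {k: deque(maxlen=history) for k in self.max_delay}

            def push(self, **obs):
                for k, v in obs.items():
                    self.buffers[k].append(v)

            def sample(self):
                delayed = {}
                for k, buf in self.buffers.items():
                    if not buf:
                        continue                      # nothing observed yet for this modality
                    d = random.randint(0, min(self.max_delay[k], len(buf) - 1))
                    delayed[k] = buf[-1 - d]          # observation from d steps in the past
                return delayed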
    Towards a theory of out-of-distribution learning. (arXiv:2109.14501v1 [stat.ML])
    (2 min) What is learning? 20$^{th}$ century formalizations of learning theory -- which precipitated revolutions in artificial intelligence -- focus primarily on \textit{in-distribution} learning, that is, learning under the assumption that the training data are sampled from the same distribution as the evaluation distribution. This assumption renders these theories inadequate for characterizing 21$^{st}$ century real world data problems, which are typically characterized by evaluation distributions that differ from the training data distributions (referred to as out-of-distribution learning). We therefore make a small change to existing formal definitions of learnability by relaxing that assumption. We then introduce \textbf{learning efficiency} (LE) to quantify the amount a learner is able to leverage data for a given problem, regardless of whether it is an in- or out-of-distribution problem. We then define and prove the relationship between generalized notions of learnability, and show how this framework is sufficiently general to characterize transfer, multitask, meta, continual, and lifelong learning. We hope this unification helps bridge the gap between empirical practice and theoretical guidance in real world problems. Finally, because biological learning continues to outperform machine learning algorithms on certain OOD challenges, we discuss the limitations of this framework vis-\'a-vis its ability to formalize biological learning, suggesting multiple avenues for future research.
    PyHard: a novel tool for generating hardness embeddings to support data-centric analysis. (arXiv:2109.14430v1 [cs.LG])
    (2 min) For building successful Machine Learning (ML) systems, it is imperative to have high quality data and well tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on a dataset be revealed? Our new tool PyHard employs a methodology known as Instance Space Analysis (ISA) to produce a hardness embedding of a dataset relating the predictive performance of multiple ML models to estimated instance hardness meta-features. This space is built so that observations are distributed linearly regarding how hard they are to classify. The user can visually interact with this embedding in multiple ways and obtain useful insights about data and algorithmic performance along the individual observations of the dataset. We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models and are therefore worth closer inspection, and the delineation of regions of strengths and weaknesses of ML models.
    A gradient-based variable selection for binary classification in reproducing kernel Hilbert space. (arXiv:2109.14282v1 [stat.ML])
    (2 min) Variable selection is essential in high-dimensional data analysis. Although various variable selection methods have been developed, most rely on the linear model assumption. This article proposes a nonparametric variable selection method for the large-margin classifier defined in a reproducing kernel Hilbert space (RKHS). We propose a gradient-based representation of the large-margin classifier and then regularize the gradient functions by the group-lasso penalty to obtain sparse gradients that naturally lead to the variable selection. The groupwise-majorization-descent algorithm (GMD, Yang and Zou, 2015) is adopted to efficiently solve the proposed problem with a large number of parameters. We employ the strong sequential rule (Tibshirani et al., 2012) to facilitate the tuning procedure. The selection consistency of the proposed method is established by obtaining the risk bound of the estimated classifier and its gradient. Finally, we demonstrate the promising performance of the proposed method through simulations and real data illustration.
    Sequential Deep Learning Architectures for Anomaly Detection in Virtual Network Function Chains. (arXiv:2109.14276v1 [cs.LG])
    (2 min) Software-defined networking (SDN) and network function virtualization (NFV) have enabled the efficient provision of network services. However, they also introduced new tasks, namely monitoring and ensuring the status of virtualized services, and anomaly detection is one such task. There have been many data-driven approaches to implement anomaly detection systems (ADS) for virtual network functions in service function chains (SFCs). In this paper, we aim to develop more advanced deep learning models for ADS. Previous approaches used learning algorithms such as random forest (RF), gradient boosting machine (GBM), or deep neural networks (DNNs). However, these models have not utilized sequential dependencies in the data. Furthermore, they are limited as they can only apply to the SFC setting from which they were trained. Therefore, we propose several sequential deep learning models to learn time-series patterns and sequential patterns of the virtual network functions (VNFs) in the chain with variable lengths. As a result, the suggested models improve detection performance and apply to SFCs with varying numbers of VNFs.
    Combining Human Predictions with Model Probabilities via Confusion Matrices and Calibration. (arXiv:2109.14591v1 [cs.LG])
    (2 min) An increasingly common use case for machine learning models is augmenting the abilities of human decision makers. For classification tasks where neither the human nor the model is perfectly accurate, a key step in obtaining high performance is combining their individual predictions in a manner that leverages their relative strengths. In this work, we develop a set of algorithms that combine the probabilistic output of a model with the class-level output of a human. We show theoretically that the accuracy of our combination model is driven not only by the individual human and model accuracies, but also by the model's confidence. Empirical results on image classification with CIFAR-10 and a subset of ImageNet demonstrate that such human-model combinations consistently have higher accuracies than the model or human alone, and that the parameters of the combination method can be estimated effectively with as few as ten labeled datapoints.
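    A simple instance of such a combination (hedged: one natural estimator under a conditional-independence assumption, not necessarily the paper's exact algorithm) multiplies the calibrated model probabilities by the column of the human confusion matrix corresponding to the human's label:

        import numpy as np

        def combine_human_and_model(model_probs, human_label, confusion):
            """model_probs: (k,) calibrated model probabilities over classes;
            human_label: class index chosen by the human;
            confusion: (k, k) matrix with confusion[i, j] = P(human says j | true class i).
            Assuming human and model errors are conditionally independent given the true
            class, the combined posterior is proportional to their product."""
            post = model_probs * confusion[:, human_label]
            return post / post.sum()

        # Illustrative 3-class example with made-up numbers:
        model_probs = np.array([0.6, 0.3, 0.1])
        confusion = np.array([[0.8, 0.1, 0.1],
                              [0.2, 0.7, 0.1],
                              [0.1, 0.2, 0.7]])
        print(combine_human_and_model(model_probs, human_label=1, confusion=confusion))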
    Partitioning Cloud-based Microservices (via Deep Learning). (arXiv:2109.14569v1 [cs.LG])
    (2 min) Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are ``brittle''; i.e., if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.
    Online Aggregation of Probability Forecasts with Confidence. (arXiv:2109.14309v1 [cs.AI])
    (2 min) The paper presents numerical experiments and some theoretical developments in prediction with expert advice (PEA). One experiment deals with predicting electricity consumption depending on temperature and uses real data. As the pattern of dependence can change with season and time of the day, the domain naturally admits PEA formulation with experts having different ``areas of expertise''. We consider the case where several competing methods produce online predictions in the form of probability distribution functions. The dissimilarity between a probability forecast and an outcome is measured by a loss function (scoring rule). A popular example of scoring rule for continuous outcomes is Continuous Ranked Probability Score (CRPS). In this paper the problem of combining probabilistic forecasts is considered in the PEA framework. We show that CRPS is a mixable loss function and then the time-independent upper bound for the regret of the Vovk aggregating algorithm using CRPS as a loss function can be obtained. Also, we incorporate a ``smooth'' version of the method of specialized experts in this scheme which allows us to combine the probabilistic predictions of the specialized experts with overlapping domains of their competence.
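    For a forecast given as an ensemble (an empirical distribution), the CRPS mentioned above has a standard closed-form estimate, CRPS = E|X - y| - 0.5 E|X - X'|, with X, X' drawn independently from the forecast. A direct numpy sketch:

        import numpy as np

        def crps_ensemble(samples, observation):
            """CRPS of an empirical forecast given by `samples` against a scalar outcome."""
            samples = np.asarray(samples, dtype=float)
            term1 = np.mean(np.abs(samples - observation))
            term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
            return term1 - term2

        print(crps_ensemble([19.0, 21.0, 23.0], observation=20.0))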
    Overview of the Arabic Sentiment Analysis 2021 Competition at KAUST. (arXiv:2109.14456v1 [cs.CL])
    (2 min) This paper provides an overview of the Arabic Sentiment Analysis Challenge organized by King Abdullah University of Science and Technology (KAUST). The task in this challenge is to develop machine learning models to classify a given tweet into one of three categories: Positive, Negative, or Neutral. From our recently released ASAD dataset, we provide the competitors with 55K tweets for training, and 20K tweets for validation, based on which the performance of participating teams is ranked on a leaderboard, https://www.kaggle.com/c/arabic-sentiment-analysis-2021-kaust. The competition received in total 1247 submissions from 74 teams (99 team members). The final winners are determined by another private set of 20K tweets that have the same distribution as the training and validation set. In this paper, we present the main findings in the competition and summarize the methods and tools used by the top ranked teams. The full dataset of 100K labeled tweets is also released for public usage, at https://www.kaggle.com/c/arabic-sentiment-analysis-2021-kaust/data.
    How Does Adversarial Fine-Tuning Benefit BERT?. (arXiv:2108.13602v2 [cs.CL] UPDATED)
    (2 min) Adversarial training (AT) is one of the most reliable methods for defending against adversarial attacks in machine learning. Variants of this method have been used as regularization mechanisms to achieve SOTA results on NLP benchmarks, and they have been found to be useful for transfer learning and continual learning. We search for the reasons for the effectiveness of AT by contrasting vanilla and adversarially fine-tuned BERT models. We identify partial preservation of BERT's syntactic abilities during fine-tuning as the key to the success of AT. We observe that adversarially fine-tuned models remain more faithful to BERT's language modeling behavior and are more sensitive to the word order. As concrete examples of syntactic abilities, an adversarially fine-tuned model could have an advantage of up to 38% on anaphora agreement and up to 11% on dependency parsing. Our analysis demonstrates that vanilla fine-tuning oversimplifies the sentence representation by focusing heavily on a small subset of words. AT, however, moderates the effect of these influential words and encourages representational diversity. This allows for a more hierarchical representation of a sentence and leads to the mitigation of BERT's loss of syntactic abilities.
    Learning Dynamics Models for Model Predictive Agents. (arXiv:2109.14311v1 [cs.LG])
    (2 min) Model-Based Reinforcement Learning involves learning a \textit{dynamics model} from data, and then using this model to optimise behaviour, most often with an online \textit{planner}. Much of the recent research along these lines presents a particular set of design choices, involving problem definition, model learning and planning. Given the multiple contributions, it is difficult to evaluate the effects of each. This paper sets out to disambiguate the role of different design choices for learning dynamics models, by comparing their performance to planning with a ground-truth model -- the simulator. First, we collect a rich dataset from the training sequence of a model-free agent on 5 domains of the DeepMind Control Suite. Second, we train feed-forward dynamics models in a supervised fashion, and evaluate planner performance while varying and analysing different model design choices, including ensembling, stochasticity, multi-step training and timestep size. Besides the quantitative analysis, we describe a set of qualitative findings, rules of thumb, and future research directions for planning with learned dynamics models. Videos of the results are available at https://sites.google.com/view/learning-better-models.
    Deep Reinforcement Q-Learning for Intelligent Traffic Signal Control with Partial Detection. (arXiv:2109.14337v1 [cs.LG])
    (2 min) Intelligent traffic signal controllers, applying DQN algorithms to traffic light policy optimization, efficiently reduce traffic congestion by adjusting traffic signals to real-time traffic. Most propositions in the literature however consider that all vehicles at the intersection are detected, an unrealistic scenario. Recently, new wireless communication technologies have enabled cost-efficient detection of connected vehicles by infrastructures. With only a small fraction of the total fleet currently equipped, methods able to perform under low detection rates are desirable. In this paper, we propose a deep reinforcement Q-learning model to optimize traffic signal control at an isolated intersection, in a partially observable environment with connected vehicles. First, we present the novel DQN model within the RL framework. We introduce a new state representation for partially observable environments and a new reward function for traffic signal control, and provide a network architecture and tuned hyper-parameters. Second, we evaluate the performances of the model in numerical simulations on multiple scenarios, in two steps. At first in full detection against existing actuated controllers, then in partial detection with loss estimates for proportions of connected vehicles. Finally, from the obtained results, we define thresholds for detection rates with acceptable and optimal performance levels.
    Large-scale Crash Localization using Multi-Task Learning. (arXiv:2109.14326v1 [cs.SE])
    (2 min) Crash localization, an important step in debugging crashes, is challenging when dealing with an extremely large number of diverse applications and platforms and underlying root causes. Large-scale error reporting systems, e.g., Windows Error Reporting (WER), commonly rely on manually developed rules and heuristics to localize blamed frames causing the crashes. As new applications and features are routinely introduced and existing applications are run under new environments, developing new rules and maintaining existing ones become extremely challenging. We propose a data-driven solution to address the problem. We start with the first large-scale empirical study of 362K crashes and their blamed methods reported to WER by tens of thousands of applications running in the field. The analysis provides valuable insights on where and how the crashes happen and what methods to blame for the crashes. These insights enable us to develop DeepAnalyze, a novel multi-task sequence labeling approach for identifying blamed frames in stack traces. We evaluate our model with over a million real-world crashes from four popular Microsoft applications and show that DeepAnalyze, trained with crashes from one set of applications, not only accurately localizes crashes of the same applications, but also bootstraps crash localization for other applications with zero to very little additional training data.
    Self-Supervised Learning for 3D Medical Image Analysis using 3D SimCLR and Monte Carlo Dropout. (arXiv:2109.14288v1 [cs.LG])
    (2 min) Self-supervised learning methods can be used to learn meaningful representations from unlabeled data that can be transferred to supervised downstream tasks to reduce the need for labeled data. In this paper, we propose a 3D self-supervised method that is based on the contrastive (SimCLR) method. Additionally, we show that employing Bayesian neural networks (with Monte-Carlo Dropout) during the inference phase can further enhance the results on the downstream tasks. We showcase our models on two medical imaging segmentation tasks: i) Brain Tumor Segmentation from 3D MRI, ii) Pancreas Tumor Segmentation from 3D CT. Our experimental results demonstrate the benefits of our proposed methods in both downstream data-efficiency and performance.
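    Monte-Carlo dropout at inference is a small change: keep dropout active and average several stochastic forward passes. A generic PyTorch sketch (not the authors' 3D segmentation pipeline):

        import torch

        def mc_dropout_predict(model, x, n_samples=20):
            """Run `n_samples` stochastic forward passes with dropout enabled and return
            the mean prediction and its per-class variance as an uncertainty estimate."""
            model.train()        # keeps dropout layers active (sketch: also affects batch norm)
            with torch.no_grad():
                preds = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
            model.eval()
            return preds.mean(dim=0), preds.var(dim=0)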
    Be Confident! Towards Trustworthy Graph Neural Networks via Confidence Calibration. (arXiv:2109.14285v1 [cs.LG])
    (2 min) Although Graph Neural Networks (GNNs) have achieved remarkable accuracy, whether their results are trustworthy is still largely unexplored. Previous studies suggest that many modern neural networks are over-confident in their predictions; surprisingly, however, we discover that GNNs are primarily in the opposite direction, i.e., GNNs are under-confident. Therefore, the confidence calibration for GNNs is highly desired. In this paper, we propose a novel trustworthy GNN model by designing a topology-aware post-hoc calibration function. Specifically, we first verify that the confidence distribution in a graph has a homophily property, and this finding inspires us to design a calibration GNN model (CaGCN) to learn the calibration function. CaGCN is able to obtain a unique transformation from logits of GNNs to the calibrated confidence for each node; meanwhile, such transformation is able to preserve the order between classes, satisfying the accuracy-preserving property. Moreover, we apply the calibration GNN to a self-training framework, showing that more trustworthy pseudo labels can be obtained with the calibrated confidence and further improve the performance. Extensive experiments demonstrate the effectiveness of our proposed model in terms of both calibration and accuracy.
    Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks. (arXiv:2109.14320v1 [cs.AR])
    (2 min) Emerging edge computing platforms often contain machine learning (ML) accelerators that can accelerate inference for a wide range of neural network (NN) models. These models are designed to fit within the limited area and energy constraints of the edge computing platforms, each targeting various applications (e.g., face detection, speech recognition, translation, image captioning, video analytics). To understand how edge ML accelerators perform, we characterize the performance of a commercial Google Edge TPU, using 24 Google edge NN models (which span a wide range of NN model types) and analyzing each NN layer within each model. We find that the Edge TPU suffers from three major shortcomings: (1) it operates significantly below peak computational throughput, (2) it operates significantly below its theoretical energy efficiency, and (3) its memory system is a large energy and performance bottleneck. Our characterization reveals that the one-size-fits-all, monolithic design of the Edge TPU ignores the high degree of heterogeneity both across different NN models and across different NN layers within the same NN model, leading to the shortcomings we observe. We propose a new acceleration framework called Mensa. Mensa incorporates multiple heterogeneous edge ML accelerators (including both on-chip and near-data accelerators), each of which caters to the characteristics of a particular subset of NN models and layers. During NN inference, for each NN layer, Mensa decides which accelerator to schedule the layer on, taking into account both the optimality of each accelerator for the layer and layer-to-layer communication costs. Averaged across all 24 Google edge NN models, Mensa improves energy efficiency and throughput by 3.0x and 3.1x over the Edge TPU, and by 2.4x and 4.3x over Eyeriss~v2, a state-of-the-art accelerator.
    Distribution Knowledge Embedding for Graph Pooling. (arXiv:2109.14333v1 [cs.LG])
    (2 min) Graph-level representation learning is the pivotal step for downstream tasks that operate on the whole graph. The most common approach to this problem heretofore is graph pooling, where node features are typically averaged or summed to obtain the graph representations. However, pooling operations like averaging or summing inevitably cause massive information loss, which may severely downgrade the final performance. In this paper, we argue that what is crucial to graph-level downstream tasks includes not only the topological structure but also the distribution from which nodes are sampled. Therefore, powered by existing Graph Neural Networks (GNN), we propose a new plug-and-play pooling module, termed Distribution Knowledge Embedding (DKEPool), where graphs are rephrased as distributions on top of GNNs and the pooling goal is to summarize the entire distribution information instead of retaining a certain feature vector by simple predefined pooling operations. A DKEPool network de facto disassembles representation learning into two stages, structure learning and distribution learning. Structure learning follows a recursive neighborhood aggregation scheme to update node features where structure information is obtained. Distribution learning, on the other hand, omits node interconnections and focuses more on the distribution depicted by all the nodes. Extensive experiments demonstrate that the proposed DKEPool significantly and consistently outperforms the state-of-the-art methods.
    Error rate control for classification rules in multiclass mixture models. (arXiv:2109.14235v1 [cs.LG])
    (2 min) In the context of finite mixture models one considers the problem of classifying as many observations as possible in the classes of interest while controlling the classification error rate in these same classes. Similar to what is done in the framework of statistical test theory, different type I and type II-like classification error rates can be defined, along with their associated optimal rules, where optimality is defined as minimizing type II error rate while controlling type I error rate at some nominal level. It is first shown that finding an optimal classification rule boils down to searching for an optimal region of the observation space in which to apply the classical Maximum A Posteriori (MAP) rule. Depending on the misclassification rate to be controlled, the shape of the optimal region is provided, along with a heuristic to compute the optimal classification rule in practice. In particular, a multiclass FDR-like optimal rule is defined and compared to the thresholded MAP rule that is used in most applications. It is shown on both simulated and real datasets that the FDR-like optimal rule may be significantly less conservative than the thresholded MAP rule.
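    The thresholded MAP rule used as the comparison point is straightforward to state: apply MAP only where the posterior is confident enough, and abstain elsewhere. A minimal sketch (the paper's optimal FDR-like region is more involved):

        import numpy as np

        def thresholded_map(posteriors, threshold=0.9):
            """posteriors: (n, k) class-posterior probabilities from the fitted mixture.
            Returns class indices, with -1 meaning 'not classified' (posterior below threshold)."""
            labels = np.argmax(posteriors, axis=1)
            labels[np.max(posteriors, axis=1) < threshold] = -1
            return labels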
    Road Network Guided Fine-Grained Urban Traffic Flow Inference. (arXiv:2109.14251v1 [cs.LG])
    (2 min) Accurate inference of fine-grained traffic flow from coarse-grained one is an emerging yet crucial problem, which can help greatly reduce the number of traffic monitoring sensors for cost savings. In this work, we notice that traffic flow has a high correlation with road network, which was either completely ignored or simply treated as an external factor in previous works. To facilitate this problem, we propose a novel Road-Aware Traffic Flow Magnifier (RATFM) that explicitly exploits the prior knowledge of road networks to fully learn the road-aware spatial distribution of fine-grained traffic flow. Specifically, a multi-directional 1D convolutional layer is first introduced to extract the semantic feature of the road network. Subsequently, we incorporate the road network feature and coarse-grained flow feature to regularize the short-range spatial distribution modeling of road-relative traffic flow. Furthermore, we take the road network feature as a query to capture the long-range spatial distribution of traffic flow with a transformer architecture. Benefiting from the road-aware inference mechanism, our method can generate high-quality fine-grained traffic flow maps. Extensive experiments on three real-world datasets show that the proposed RATFM outperforms state-of-the-art models under various scenarios.
    Federated Learning Algorithms for Generalized Mixed-effects Model (GLMM) on Horizontally Partitioned Data from Distributed Sources. (arXiv:2109.14046v1 [stat.ML])
    (2 min) Objectives: This paper develops two algorithms to achieve federated generalized linear mixed effect models (GLMM), and compares the developed model's outcomes with each other, as well as that from the standard R package (`lme4'). Methods: The log-likelihood function of GLMM is approximated by two numerical methods (Laplace approximation and Gaussian Hermite approximation), which supports federated decomposition of GLMM to bring computation to data. Results: Our developed method can handle GLMM to accommodate hierarchical data with multiple non-independent levels of observations in a federated setting. The experiment results demonstrate comparable (Laplace) and superior (Gaussian-Hermite) performances with simulated and real-world data. Conclusion: We developed and compared federated GLMMs with different approximations, which can support researchers in analyzing biomedical data to accommodate mixed effects and address non-independence due to hierarchical structures (i.e., institutes, region, country, etc.).
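    The federated decomposition rests on the fact that the log-likelihood, and hence its gradient and Hessian, is a sum over sites. A hedged sketch for a plain logistic GLM (deliberately ignoring the random effects and the Laplace/Gauss-Hermite machinery of the actual GLMM algorithms) shows the pattern of sending only aggregate statistics to the coordinator:

        import numpy as np

        def site_statistics(X, y, beta):
            """Each site computes only its gradient and Hessian contribution for
            logistic regression; the raw data never leave the site."""
            p = 1.0 / (1.0 + np.exp(-X @ beta))
            grad = X.T @ (y - p)
            hess = -(X.T * (p * (1 - p))) @ X
            return grad, hess

        def federated_newton_step(beta, site_data):
            """Coordinator aggregates the per-site statistics and takes one Newton step."""
            grads, hesses = zip(*(site_statistics(X, y, beta) for X, y in site_data))
            grad = np.sum(grads, axis=0)
            hess = np.sum(hesses, axis=0)
            return beta - np.linalg.solve(hess, grad)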
    Improving Dialogue State Tracking by Joint Slot Modeling. (arXiv:2109.14144v1 [cs.CL])
    (2 min) Dialogue state tracking models play an important role in a task-oriented dialogue system. However, most of them model the slot types conditionally independently given the input. We discover that it may cause the model to be confused by slot types that share the same data type. To mitigate this issue, we propose TripPy-MRF and TripPy-LSTM that models the slots jointly. Our results show that they are able to alleviate the confusion mentioned above, and they push the state-of-the-art on dataset MultiWoZ 2.1 from 58.7 to 61.3. Our implementation is available at https://github.com/CTinRay/Trippy-Joint.
    Formulation and validation of a car-following model based on deep reinforcement learning. (arXiv:2109.14268v1 [cs.LG])
    (2 min) We propose and validate a novel car following model based on deep reinforcement learning. Our model is trained to maximize externally given reward functions for the free and car-following regimes rather than reproducing existing follower trajectories. The parameters of these reward functions such as desired speed, time gap, or accelerations resemble that of traditional models such as the Intelligent Driver Model (IDM) and allow for explicitly implementing different driving styles. Moreover, they partially lift the black-box nature of conventional neural network models. The model is trained on leading speed profiles governed by a truncated Ornstein-Uhlenbeck process reflecting a realistic leader's kinematics. This allows for arbitrary driving situations and an infinite supply of training data. For various parameterizations of the reward functions, and for a wide variety of artificial and real leader data, the model turned out to be unconditionally string stable, comfortable, and crash-free. String stability has been tested with a platoon of five followers following an artificial and a real leading trajectory. A cross-comparison with the IDM calibrated to the goodness-of-fit of the relative gaps showed a higher reward compared to the traditional model and a better goodness-of-fit.
    Learning Perceptual Locomotion on Uneven Terrains using Sparse Visual Observations. (arXiv:2109.14026v1 [cs.RO])
    (2 min) Legged robots have achieved remarkable performance in blind walking using either model-based control or data-driven deep reinforcement learning. To proactively navigate and traverse various terrains, active use of visual perception becomes indispensable, and this work aims to exploit the use of sparse visual observations to achieve perceptual locomotion over a range of commonly seen bumps, ramps, and stairs in human-centred environments. We first formulate the selection of minimal visual input that can represent the uneven surfaces of interest, and propose a learning framework that integrates such exteroceptive and proprioceptive data. We specifically select state observations and design a training curriculum to learn feedback control policies more effectively over a range of different terrains. Using an extensive benchmark, we validate the learned policy in tasks that require omnidirectional walking over flat ground and forward locomotion over terrains with obstacles, showing a high success rate of traversal. Particularly, the robot performs autonomous perceptual locomotion with minimal visual perception using depth measurements, which are easily available from a Lidar or RGB-D sensor, and successfully demonstrates robust ascent and descent over high stairs of 20 cm step height, i.e., 50% of its leg length.
    Apple Tasting Revisited: Bayesian Approaches to Partially Monitored Online Binary Classification. (arXiv:2109.14412v1 [cs.LG])
    (2 min) We consider a variant of online binary classification where a learner sequentially assigns labels ($0$ or $1$) to items with unknown true class. If, and only if, the learner chooses label $1$, they immediately observe the true label of the item. The learner faces a trade-off between short-term classification accuracy and long-term information gain. This problem has previously been studied under the name of the `apple tasting' problem. We revisit it as a partial monitoring problem with side information, and focus on the case where item features are linked to true classes via a logistic regression model. Our principal contribution is a study of the performance of Thompson Sampling (TS) for this problem. Using recently developed information-theoretic tools, we show that TS achieves a Bayesian regret bound of an improved order compared to previous approaches. Further, we experimentally verify that efficient approximations to TS and Information Directed Sampling via P\'{o}lya-Gamma augmentation have superior empirical performance to existing methods.
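    A minimal sketch of the apple-tasting loop with Thompson Sampling over a logistic model is given below. It uses a simple Laplace (Gaussian) posterior approximation rather than the P\'{o}lya-Gamma augmentation studied in the paper, and the dimensions, prior variance, and horizon are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        d, T, prior_var = 3, 500, 4.0                     # illustrative dimensions / prior
        theta_true = rng.normal(size=d)                   # unknown logistic parameter

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        X_obs, y_obs = [], []                             # feedback is revealed only for label 1
        for t in range(T):
            x = rng.normal(size=d)                        # side information (item features)
            if X_obs:                                     # Laplace approximation: MAP + inverse Hessian
                Xo, yo = np.array(X_obs), np.array(y_obs)
                theta = np.zeros(d)
                for _ in range(25):                       # Newton iterations for the MAP
                    p = sigmoid(Xo @ theta)
                    g = Xo.T @ (p - yo) + theta / prior_var
                    H = Xo.T @ (Xo * (p * (1 - p))[:, None]) + np.eye(d) / prior_var
                    theta -= np.linalg.solve(H, g)
                mean, cov = theta, np.linalg.inv(H)
            else:
                mean, cov = np.zeros(d), prior_var * np.eye(d)
            theta_sample = rng.multivariate_normal(mean, cov)        # Thompson draw
            if sigmoid(x @ theta_sample) > 0.5:                      # choose label 1 -> observe truth
                y = rng.binomial(1, sigmoid(x @ theta_true))
                X_obs.append(x)
                y_obs.append(y)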
    Click-through Rate Prediction with Auto-Quantized Contrastive Learning. (arXiv:2109.13921v1 [cs.IR])
    (2 min) Click-through rate (CTR) prediction has become indispensable in ubiquitous web recommendation applications. Nevertheless, current methods struggle under cold-start scenarios where user interactions are extremely sparse. We frame this problem as automatically identifying whether a user's behaviors are rich enough to capture their interests for prediction, and propose an Auto-Quantized Contrastive Learning (AQCL) loss to regularize the model. Different from previous methods, AQCL explores both instance-instance and instance-cluster similarity to robustify the latent representation, and automatically reduces the information loss for active users due to quantization. The proposed framework is agnostic to different model architectures and can be trained in an end-to-end fashion. Extensive results show that it consistently improves current state-of-the-art CTR models.
    LightSecAgg: Rethinking Secure Aggregation in Federated Learning. (arXiv:2109.14236v1 [cs.LG])
    (2 min) Secure model aggregation is a key component of federated learning (FL) that aims to protect the privacy of each user's individual model while allowing global aggregation. It can be applied to any aggregation-based approach, including algorithms for training a global model as well as personalized FL frameworks. Model aggregation also needs to be resilient to likely user dropouts in FL systems, making its design substantially more complex. State-of-the-art secure aggregation protocols essentially rely on secret sharing of the random seeds used for mask generation at the users, in order to enable the reconstruction and cancellation of the masks belonging to dropped users. The complexity of such approaches, however, grows substantially with the number of dropped users. We propose a new approach, named LightSecAgg, to overcome this bottleneck by turning the focus from "random-seed reconstruction of the dropped users" to "one-shot aggregate-mask reconstruction of the active users". More specifically, in LightSecAgg each user protects its local model by generating a single random mask. This mask is then encoded and shared with other users, in such a way that the aggregate mask of any sufficiently large set of active users can be reconstructed directly at the server via the encoded masks. We show that LightSecAgg achieves the same privacy and dropout-resiliency guarantees as state-of-the-art protocols, while significantly reducing the overhead for resiliency to dropped users. Furthermore, our system optimization helps to hide the runtime cost of offline processing by parallelizing it with model training. We evaluate LightSecAgg via extensive experiments for training diverse models on various datasets in a realistic FL system, and demonstrate that LightSecAgg significantly reduces the total training time, achieving a performance gain of up to $12.7\times$ over baselines.
    Fine-Grained Zero-Shot Learning with DNA as Side Information. (arXiv:2109.14133v1 [cs.CV])
    (2 min) Fine-grained zero-shot learning requires some form of side information to transfer discriminative information from seen to unseen classes. As manually annotated visual attributes are extremely costly and often impractical to obtain for a large number of classes, in this study we use DNA as side information for the first time for fine-grained zero-shot classification of species. Mitochondrial DNA plays an important role as a genetic marker in evolutionary biology and has been used to achieve near-perfect accuracy in the species classification of living organisms. We implement a simple hierarchical Bayesian model that uses DNA information to establish the hierarchy in the image space and employs local priors to define surrogate classes for unseen ones. On the benchmark CUB dataset, we show that DNA can be an equally promising, yet in general more accessible, alternative to word vectors as side information. This is especially important as obtaining robust word representations for fine-grained species names is not a practicable goal when information about these species in free-form text is limited. On a newly compiled fine-grained insect dataset that uses DNA information from over a thousand species, we show that the Bayesian approach outperforms the state of the art by a wide margin.
    Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation. (arXiv:2109.14200v1 [eess.AS])
    (2 min) Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While the gradual development of such capabilities is unquestionable, the exact nature of these skills and of the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent, referentially ambiguous visual input. These models can operate without prior linguistic knowledge, such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, without the units being proximal learning targets for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent to which the audiovisual aspect of LLH is supported by existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect the phonetic, syllabic, or lexical structure of the input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated...
    Generating Summaries for Scientific Paper Review. (arXiv:2109.14059v1 [cs.CL])
    (2 min) The review process is essential to ensure the quality of publications. Recently, the increase in submissions to top venues in machine learning and NLP has placed an excessive burden on reviewers and raised concerns that this overload may affect the quality of the reviews. An automatic system for assisting with the reviewing process could help ameliorate the problem. In this paper, we explore automatic review summary generation for scientific papers. We posit that neural language models have the potential to be valuable candidates for this task. To test this hypothesis, we release a new dataset of scientific papers and their reviews, collected from papers published at the NeurIPS conference from 2013 to 2020. We evaluate state-of-the-art neural summarization models, present initial results on the feasibility of automatic review summary generation, and propose directions for future work.
    Can multi-label classification networks know what they don't know?. (arXiv:2109.14162v1 [cs.LG])
    (2 min) Estimating out-of-distribution (OOD) uncertainty is a central challenge for safely deploying machine learning models in the open-world environment. Improved methods for OOD detection in multi-class classification have emerged, while OOD detection methods for multi-label classification remain underexplored and use rudimentary techniques. We propose JointEnergy, a simple and effective method, which estimates the OOD indicator scores by aggregating energy scores from multiple labels. We show that JointEnergy can be mathematically interpreted from a joint likelihood perspective. Our results show consistent improvement over previous methods that are based on the maximum-valued scores, which fail to capture joint information from multiple labels. We demonstrate the effectiveness of our method on three common multi-label classification benchmarks, including MS-COCO, PASCAL-VOC, and NUS-WIDE. We show that JointEnergy can reduce the FPR95 by up to 10.05% compared to the previous best baseline, establishing state-of-the-art performance.
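    As we read the abstract, aggregating label-wise energies of a multi-label (sigmoid) head could look roughly like the PyTorch sketch below; the exact scoring function in the paper may differ, so treat this only as an illustration of the idea.

        import torch
        import torch.nn.functional as F

        def joint_energy_score(logits: torch.Tensor) -> torch.Tensor:
            """Aggregate label-wise energies of a multi-label (sigmoid) classifier.
            logits: [batch, num_labels] raw scores; a higher aggregate score suggests
            in-distribution, so an OOD detector can threshold the negative of this value."""
            return F.softplus(logits).sum(dim=-1)   # sum_i log(1 + exp(f_i(x)))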
    Non-stationary Gaussian process discriminant analysis with variable selection for high-dimensional functional data. (arXiv:2109.14171v1 [stat.ML])
    (2 min) High-dimensional classification and feature selection tasks are ubiquitous with the recent advancement in data acquisition technology. In several application areas such as biology, genomics and proteomics, the data are often functional in their nature and exhibit a degree of roughness and non-stationarity. These structures pose additional challenges to commonly used methods that rely mainly on a two-stage approach performing variable selection and classification separately. We propose in this work a novel Gaussian process discriminant analysis (GPDA) that combines these steps in a unified framework. Our model is a two-layer non-stationary Gaussian process coupled with an Ising prior to identify differentially-distributed locations. Scalable inference is achieved via developing a variational scheme that exploits advances in the use of sparse inverse covariance matrices. We demonstrate the performance of our methodology on simulated datasets and two proteomics datasets: breast cancer and SARS-CoV-2. Our approach distinguishes itself by offering explainability as well as uncertainty quantification in addition to low computational cost, which are crucial to increase trust and social acceptance of data-driven tools.
    EBSD Grain Knowledge Graph Representation Learning for Material Structure-Property Prediction. (arXiv:2109.14248v1 [cs.LG])
    (2 min) The microstructure is an essential part of a material, storing the material's genes and having a decisive influence on its physical and chemical properties. The material genetic engineering program aims to establish the relationship between material composition/processing, microstructure, and performance to realize the reverse design of materials, thereby accelerating the research and development of new materials. However, microstructure analysis methods in materials science, such as metallographic analysis, XRD analysis, and EBSD analysis, cannot directly establish a complete quantitative relationship between microstructure and performance. Therefore, this paper proposes a novel data- and knowledge-driven microstructure representation and performance prediction method to obtain a quantitative structure-performance relationship. First, a knowledge graph based on EBSD is constructed to describe the material's mesoscopic microstructure. Then a graph representation learning network based on graph attention is constructed, and the EBSD grain knowledge graph is fed into the network to obtain a graph-level feature embedding. Finally, the graph-level feature embedding is input to a graph feature mapping network to predict the material's mechanical properties. The experimental results show that our method is superior to traditional machine learning and machine vision methods.
    On Brightness Agnostic Adversarial Examples Against Face Recognition Systems. (arXiv:2109.14205v1 [cs.CV])
    (2 min) This paper introduces a novel adversarial example generation method against face recognition systems (FRSs). An adversarial example (AX) is an image with deliberately crafted noise to cause incorrect predictions by a target system. The AXs generated from our method remain robust under real-world brightness changes. Our method performs non-linear brightness transformations while leveraging the concept of curriculum learning during the attack generation procedure. We demonstrate that our method outperforms conventional techniques from comprehensive experimental investigations in the digital and physical world. Furthermore, this method enables practical risk assessment of FRSs against brightness agnostic AXs.
    Equivariant Neural Network for Factor Graphs. (arXiv:2109.14218v1 [cs.LG])
    (2 min) Several indices used in a factor graph data structure can be permuted without changing the underlying probability distribution. An algorithm that performs inference on a factor graph should ideally be equivariant or invariant to permutations of global indices of nodes, variable orderings within a factor, and variable assignment orderings. However, existing neural network-based inference procedures fail to take advantage of this inductive bias. In this paper, we precisely characterize these isomorphic properties of factor graphs and propose two inference models: Factor-Equivariant Neural Belief Propagation (FE-NBP) and Factor-Equivariant Graph Neural Networks (FE-GNN). FE-NBP is a neural network that generalizes BP and respects each of the above properties of factor graphs while FE-GNN is an expressive GNN model that relaxes an isomorphic property in favor of greater expressivity. Empirically, we demonstrate on both real-world and synthetic datasets, for both marginal inference and MAP inference, that FE-NBP and FE-GNN together cover a range of sample complexity regimes: FE-NBP achieves state-of-the-art performance on small datasets while FE-GNN achieves state-of-the-art performance on large datasets.
    On the One-sided Convergence of Adam-type Algorithms in Non-convex Non-concave Min-max Optimization. (arXiv:2109.14213v1 [cs.LG])
    (2 min) Adam-type methods, the extension of adaptive gradient methods, have shown great performance in the training of both supervised and unsupervised machine learning models. In particular, Adam-type optimizers have been widely used empirically as the default tool for training generative adversarial networks (GANs). On the theory side, however, despite existing results showing the efficiency of Adam-type methods in minimization problems, the reason for their strong performance in GAN training remains unexplained. In existing works, fast convergence has long been considered one of the most important reasons, and multiple works give theoretical guarantees of convergence to a critical point of min-max optimization algorithms under certain assumptions. In this paper, we first show empirically that in GAN training, Adam does not converge to a critical point even upon successful training: only the generator converges, while the discriminator's gradient norm remains high throughout training. We name this one-sided convergence. We then bridge the gap between experiments and theory by showing that Adam-type algorithms provably converge to one-sided first-order stationary points in min-max optimization problems under a one-sided MVI condition. We also empirically verify that this one-sided MVI condition is satisfied for standard GANs after training on standard datasets. To the best of our knowledge, this is the first result that provides both an empirical observation and a rigorous theoretical guarantee of the one-sided convergence of Adam-type algorithms in min-max optimization.
    Second-Order Neural ODE Optimizer. (arXiv:2109.14158v1 [cs.LG])
    (2 min) We propose a novel second-order optimization framework for training the emerging deep continuous-time models, specifically the Neural Ordinary Differential Equations (Neural ODEs). Since their training already involves expensive gradient computation by solving a backward ODE, deriving efficient second-order methods becomes highly nontrivial. Nevertheless, inspired by the recent Optimal Control (OC) interpretation of training deep networks, we show that a specific continuous-time OC methodology, called Differential Programming, can be adopted to derive backward ODEs for higher-order derivatives at the same O(1) memory cost. We further explore a low-rank representation of the second-order derivatives and show that it leads to efficient preconditioned updates with the aid of Kronecker-based factorization. The resulting method converges much faster than first-order baselines in wall-clock time, and the improvement remains consistent across various applications, e.g. image classification, generative flow, and time-series prediction. Our framework also enables direct architecture optimization, such as the integration time of Neural ODEs, with second-order feedback policies, strengthening the OC perspective as a principled tool of analyzing optimization in deep learning.
    Adaptive Multi-layer Contrastive Graph Neural Networks. (arXiv:2109.14159v1 [cs.LG])
    (2 min) We present Adaptive Multi-layer Contrastive Graph Neural Networks (AMC-GNN), a self-supervised learning framework for Graph Neural Networks that learns feature representations of sample data without data labels. AMC-GNN generates two graph views by data augmentation and compares the output embeddings of different layers of Graph Neural Network encoders to obtain feature representations, which can be used for downstream tasks. AMC-GNN learns the importance weights of embeddings in different layers adaptively through an attention mechanism, and an auxiliary encoder is introduced to train the graph contrastive encoders better. Accuracy is improved by maximizing the consistency of positive pairs' representations in the early layers and in the final embedding space. Our experiments show that results can be consistently improved by using the AMC-GNN framework across four established graph benchmarks: the Cora, Citeseer, Pubmed, and DBLP citation network datasets, as well as four newly proposed datasets: Co-author-CS, Co-author-Physics, Amazon-Computers, and Amazon-Photo.
    Temporal Clustering with External Memory Network for Disease Progression Modeling. (arXiv:2109.14147v1 [cs.LG])
    (2 min) Disease progression modeling (DPM) uses mathematical frameworks to quantitatively measure how a disease progresses. DPM is useful in many ways, such as predicting health states, categorizing disease stages, and assessing a patient's disease trajectory. Recently, with the wider availability of electronic health records (EHRs) and the broad application of data-driven machine learning methods, DPM has attracted much attention, yet it faces two major challenges: (i) Due to the irregularity, heterogeneity, and long-term dependencies in EHRs, most existing DPM methods may not be able to provide comprehensive patient representations. (ii) Many records in EHRs are irrelevant to the target disease. Most existing models learn to automatically focus on relevant information rather than explicitly capturing target-relevant events, which can make the learned model suboptimal. To address these two issues, we propose Temporal Clustering with External Memory Network (TC-EMNet) for DPM, which groups patients with similar trajectories to form disease clusters/stages. TC-EMNet uses a variational autoencoder (VAE) to capture internal complexity from the input data and utilizes an external memory network to capture long-term information, both of which help produce comprehensive patient states. Finally, the k-means algorithm is adopted to cluster the extracted patient states to capture disease progression. Experiments on two real-world datasets show that our model demonstrates competitive clustering performance against state-of-the-art methods and is able to identify clinically meaningful clusters. Visualization of the extracted patient states shows that the proposed model generates better patient states than the baselines.
    Flow Based Models For Manifold Data. (arXiv:2109.14216v1 [stat.ML])
    (2 min) Flow-based generative models typically define a latent space with dimensionality identical to the observational space. In many problems, however, the data do not populate the full ambient space in which they natively reside, instead inhabiting a lower-dimensional manifold. In such scenarios, flow-based models are unable to represent the data structure exactly, as their density will always have support off the data manifold, potentially resulting in degraded model performance. In addition, the requirement of equal latent and data space dimensionality can unnecessarily increase the complexity of contemporary flow models. Towards addressing these problems, we propose to learn a manifold prior that benefits both sample generation and representation quality. An auxiliary benefit of our approach is the ability to identify the intrinsic dimension of the data distribution.
    Exact Statistical Inference for the Wasserstein Distance by Selective Inference. (arXiv:2109.14206v1 [stat.ML])
    (2 min) In this paper, we study statistical inference for the Wasserstein distance, which has attracted much attention and has been applied to various machine learning tasks. Several studies have been proposed in the literature, but almost all of them are based on asymptotic approximation and do not have finite-sample validity. In this study, we propose an exact (non-asymptotic) inference method for the Wasserstein distance inspired by the concept of conditional Selective Inference (SI). To our knowledge, this is the first method that can provide a valid confidence interval (CI) for the Wasserstein distance with finite-sample coverage guarantee, which can be applied not only to one-dimensional problems but also to multi-dimensional problems. We evaluate the performance of the proposed method on both synthetic and real-world datasets.
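    For readers who want to see the statistic being tested, the one-dimensional empirical 1-Wasserstein distance between two equal-size samples has a simple closed form via sorting; this sketch illustrates only the distance, not the selective-inference procedure.

        import numpy as np

        def wasserstein_1d(x, y):
            """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
            with the L1 ground cost, the optimal coupling matches sorted values."""
            x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
            assert x.shape == y.shape, "this sketch assumes equal sample sizes"
            return np.mean(np.abs(x - y))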
    Neural Network Ensembles: Theory, Training, and the Importance of Explicit Diversity. (arXiv:2109.14117v1 [cs.LG])
    (2 min) Ensemble learning is a process by which multiple base learners are strategically generated and combined into one composite learner. There are two features that are essential to an ensemble's performance, the individual accuracies of the component learners and the overall diversity in the ensemble. The right balance of learner accuracy and ensemble diversity can improve the performance of machine learning tasks on benchmark and real-world data sets, and recent theoretical and practical work has demonstrated the subtle trade-off between accuracy and diversity in an ensemble. In this paper, we extend the extant literature by providing a deeper theoretical understanding for assessing and improving the optimality of any given ensemble, including random forests and deep neural network ensembles. We also propose a training algorithm for neural network ensembles and demonstrate that our approach provides improved performance when compared to both state-of-the-art individual learners and ensembles of state-of-the-art learners trained using standard loss functions. Our key insight is that it is better to explicitly encourage diversity in an ensemble, rather than merely allowing diversity to occur by happenstance, and that rigorous theoretical bounds on the trade-off between diversity and learner accuracy allow one to know when an optimal arrangement has been achieved.
    Vitruvion: A Generative Model of Parametric CAD Sketches. (arXiv:2109.14124v1 [cs.LG])
    (2 min) Parametric computer-aided design (CAD) tools are the predominant way that engineers specify physical structures, from bicycle pedals to airplanes to printed circuit boards. The key characteristic of parametric CAD is that design intent is encoded not only via geometric primitives, but also by parameterized constraints between the elements. This relational specification can be viewed as the construction of a constraint program, allowing edits to coherently propagate to other parts of the design. Machine learning offers the intriguing possibility of accelerating the design process via generative modeling of these structures, enabling new tools such as autocompletion, constraint inference, and conditional synthesis. In this work, we present such an approach to generative modeling of parametric CAD sketches, which constitute the basic computational building blocks of modern mechanical design. Our model, trained on real-world designs from the SketchGraphs dataset, autoregressively synthesizes sketches as sequences of primitives, with initial coordinates, and constraints that reference back to the sampled primitives. As samples from the model match the constraint graph representation used in standard CAD software, they may be directly imported, solved, and edited according to downstream design tasks. In addition, we condition the model on various contexts, including partial sketches (primers) and images of hand-drawn sketches. Evaluation of the proposed approach demonstrates its ability to synthesize realistic CAD sketches and its potential to aid the mechanical design workflow.
    On the Provable Generalization of Recurrent Neural Networks. (arXiv:2109.14142v1 [cs.LG])
    (2 min) The Recurrent Neural Network (RNN) is a fundamental structure in deep learning. Recently, some works have studied the training process of over-parameterized neural networks and shown that over-parameterized networks can learn functions in some notable concept classes with a provable generalization error bound. In this paper, we analyze the training and generalization of RNNs with random initialization and provide the following improvements over recent works: 1) For an RNN with input sequence $x=(X_1,X_2,...,X_L)$, previous works study learning functions that are sums of $f(\beta^T_l X_l)$ and require normalization conditions $\|X_l\|\leq\epsilon$ with some very small $\epsilon$ depending on the complexity of $f$. In this paper, using a detailed analysis of the neural tangent kernel matrix, we prove a generalization error bound for learning such functions without normalization conditions and show that some notable concept classes are learnable with the number of iterations and samples scaling almost polynomially in the input length $L$. 2) Moreover, we prove a novel result on learning $N$-variable functions of the input sequence of the form $f(\beta^T[X_{l_1},...,X_{l_N}])$, which do not belong to the ``additive'' concept class, i.e., sums of functions $f(\beta^T_l X_l)$. We show that when either $N$ or $l_0=\max(l_1,..,l_N)-\min(l_1,..,l_N)$ is small, $f(\beta^T[X_{l_1},...,X_{l_N}])$ is learnable with the number of iterations and samples scaling almost polynomially in the input length $L$.
    Breaking the curse of dimensionality with Isolation Kernel. (arXiv:2109.14198v1 [cs.LG])
    (2 min) The curse of dimensionality has been studied in different aspects. However, breaking the curse has been elusive. We show for the first time that it is possible to break the curse using the recently introduced Isolation Kernel. We show that only Isolation Kernel performs consistently well in indexed search, spectral & density peaks clustering, SVM classification and t-SNE visualization in both low and high dimensions, compared with distance, Gaussian and linear kernels. This is also supported by our theoretical analyses that Isolation Kernel is the only kernel that has the provable ability to break the curse, compared with existing metric-based Lipschitz continuous kernels.
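    One common construction of the Isolation Kernel (the nearest-neighbour/Voronoi variant) has an explicit finite-dimensional feature map. The sketch below follows that construction with illustrative parameters; it is not necessarily the exact variant used in the paper.

        import numpy as np

        def isolation_kernel_features(X, data, psi=16, t=100, seed=0):
            """aNNE-style Isolation Kernel feature map: in each of t rounds, sample psi
            reference points from `data` and one-hot encode every row of X by its nearest
            reference; K(x, y) = <phi(x), phi(y)> is then the fraction of rounds in which
            x and y fall into the same Voronoi cell."""
            rng = np.random.default_rng(seed)
            X, data = np.asarray(X, float), np.asarray(data, float)
            feats = np.zeros((len(X), t * psi))
            for i in range(t):
                centers = data[rng.choice(len(data), size=psi, replace=False)]
                dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
                feats[np.arange(len(X)), i * psi + dists.argmin(axis=1)] = 1.0
            return feats / np.sqrt(t)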
    An Explainable-AI approach for Diagnosis of COVID-19 using MALDI-ToF Mass Spectrometry. (arXiv:2109.14099v1 [cs.LG])
    (2 min) The novel severe acute respiratory syndrome coronavirus type-2 (SARS-CoV-2) caused a global pandemic that has taken more than 4.5 million lives and severely affected the global economy. To curb the spread of the virus, an accurate, cost-effective, and quick testing for large populations is exceedingly important in order to identify, isolate, and treat infected people. Current testing methods commonly use PCR (Polymerase Chain Reaction) based equipment that have limitations on throughput, cost-effectiveness, and simplicity of procedure which creates a compelling need for developing additional coronavirus disease-2019 (COVID-19) testing mechanisms, that are highly sensitive, rapid, trustworthy, and convenient to use by the public. We propose a COVID-19 testing method using artificial intelligence (AI) techniques on MALDI-ToF (matrix-assisted laser desorption/ionization time-of-flight) data extracted from 152 human gargle samples (60 COVID-19 positive tests and 92 COVID-19 negative tests). Our AI-based approach leverages explainable-AI (X-AI) methods to explain the decision rules behind the predictive algorithm both on a local (per-sample) and global (all-samples) basis to make the AI model more trustworthy. Finally, we evaluated our proposed method using a 70%-30% train-test-split strategy and achieved a training accuracy of 86.79% and a testing accuracy of 91.30%.
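    For orientation, a generic 70%-30% evaluation with a simple global explanation (permutation importance) might look like the sketch below; the classifier, the placeholder data, and the explanation method are our illustrative assumptions, not the authors' specific pipeline or X-AI technique.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.inspection import permutation_importance
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split

        # Placeholder data standing in for per-sample spectral features and COVID-19 labels.
        X, y = np.random.rand(152, 200), np.random.randint(0, 2, 152)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
        clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
        print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
        imp = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
        print("most informative feature bins:", np.argsort(imp.importances_mean)[::-1][:10])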
    Meta Learning on a Sequence of Imbalanced Domains with Difficulty Awareness. (arXiv:2109.14120v1 [cs.LG])
    (2 min) Recognizing new objects by learning from a few labeled examples in an evolving environment is crucial to obtain excellent generalization ability for real-world machine learning systems. A typical setting across current meta learning algorithms assumes a stationary task distribution during meta training. In this paper, we explore a more practical and challenging setting where task distribution changes over time with domain shift. Particularly, we consider realistic scenarios where task distribution is highly imbalanced with domain labels unavailable in nature. We propose a kernel-based method for domain change detection and a difficulty-aware memory management mechanism that jointly considers the imbalanced domain size and domain importance to learn across domains continuously. Furthermore, we introduce an efficient adaptive task sampling method during meta training, which significantly reduces task gradient variance with theoretical guarantees. Finally, we propose a challenging benchmark with imbalanced domain sequences and varied domain difficulty. We have performed extensive evaluations on the proposed benchmark, demonstrating the effectiveness of our method. We made our code publicly available.
    The impact of non-target events in synthetic soundscapes for sound event detection. (arXiv:2109.14061v1 [eess.AS])
    (2 min) The Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently, only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. Firstly, we investigate to what extent using non-target events during only the training or only the validation phase (or neither) helps the system correctly detect target events. Secondly, we analyze to what extent adjusting the signal-to-noise ratio between target and non-target events at training time improves sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on the non-target subset and on the acoustic similarity between target and non-target events, which might confuse the system.
    Reversible Gromov-Monge Sampler for Simulation-Based Inference. (arXiv:2109.14090v1 [stat.ME])
    (2 min) This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d. samples, circumventing usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work of M\'emoli (2011) and Sturm (2012) on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers in order to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogenous metric measure spaces $(\mathcal{X}, \mu, c_{\mathcal{X}})$ and $(\mathcal{Y}, \nu, c_{\mathcal{Y}})$ from empirical data sets, with estimated maps that approximately push forward one measure $\mu$ to the other $\nu$, and vice versa. Analytic properties of RGM distance are derived; statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.
    Stochastic Training is Not Necessary for Generalization. (arXiv:2109.14119v1 [cs.LG])
    (2 min) It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD, using modern architectures in settings with and without data augmentation. To this end, we utilize modified hyperparameters and show that the implicit regularization of SGD can be completely replaced with explicit regularization. This strongly suggests that theories that rely heavily on properties of stochastic sampling to explain generalization are incomplete, as strong generalization behavior is still observed in the absence of stochastic sampling. Fundamentally, deep learning can succeed without stochasticity. Our observations further indicate that the perceived difficulty of full-batch training is largely the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.
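    Conceptually, non-stochastic training is just one gradient step per pass over the entire dataset, with explicit regularization standing in for the implicit regularization of SGD. The PyTorch sketch below is a generic illustration (the hyperparameters and the clipping choice are ours, not the paper's), and it assumes the whole dataset fits in memory as single tensors; in practice the full-batch gradient is accumulated over chunks.

        import torch

        def full_batch_train(model, X, y, epochs=3000, lr=0.1, weight_decay=5e-4, clip=0.25):
            """One gradient step per pass over the whole dataset (no mini-batch sampling),
            with weight decay and gradient clipping as explicit regularizers."""
            loss_fn = torch.nn.CrossEntropyLoss()
            opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                                  weight_decay=weight_decay)
            for _ in range(epochs):
                opt.zero_grad()
                loss = loss_fn(model(X), y)       # full-batch loss, one step per epoch
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
                opt.step()
            return model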
    Sample-Efficient Safety Assurances using Conformal Prediction. (arXiv:2109.14082v1 [cs.RO])
    (2 min) When deploying machine learning models in high-stakes robotics applications, the ability to detect unsafe situations is crucial. Early warning systems can provide alerts when an unsafe situation is imminent (in the absence of corrective action). To reliably improve safety, these warning systems should have a provable false negative rate; i.e., of the situations that are unsafe, fewer than an $\epsilon$ fraction occur without an alert. In this work, we present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics, in order to tune warning systems to provably achieve an $\epsilon$ false negative rate using as few as $1/\epsilon$ data points. We apply our framework to a driver warning system and a robotic grasping application, and empirically demonstrate a guaranteed false negative rate and a low false detection (positive) rate using very little data.
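    A split-conformal calibration of the alert threshold can be sketched in a few lines: given warning scores collected on unsafe calibration situations (higher score = more alarming), choose the threshold as a low order statistic so that a fresh unsafe situation falls below it with probability at most $\epsilon$. The function below is our illustration of that standard argument, not the paper's exact procedure.

        import numpy as np

        def calibrate_alert_threshold(unsafe_scores, eps):
            """Pick an alert threshold tau so that a fresh (exchangeable) unsafe situation
            satisfies P(score < tau) <= eps, i.e. a provable false-negative rate.
            Needs roughly 1/eps calibration points: eps * (n + 1) must be at least 1."""
            s = np.sort(np.asarray(unsafe_scores, dtype=float))
            n = len(s)
            k = int(np.floor(eps * (n + 1)))      # calibration scores allowed below the threshold
            if k < 1:
                raise ValueError("need at least ceil(1/eps) - 1 unsafe calibration points")
            return s[k - 1]                       # alert whenever a new score >= this threshold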
    Deep Unrolled Recovery in Sparse Biological Imaging. (arXiv:2109.14025v1 [cs.CV])
    (2 min) Deep algorithm unrolling has emerged as a powerful model-based approach to develop deep architectures that combine the interpretability of iterative algorithms with the performance gains of supervised deep learning, especially in cases of sparse optimization. This framework is well-suited to applications in biological imaging, where physics-based models exist to describe the measurement process and the information to be recovered is often highly structured. Here, we review the method of deep unrolling, and show how it improves source localization in several biological imaging settings.
    RAFT: A Real-World Few-Shot Text Classification Benchmark. (arXiv:2109.14076v1 [cs.CL])
    (2 min) Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org .
    slimTrain -- A Stochastic Approximation Method for Training Separable Deep Neural Networks. (arXiv:2109.14002v1 [cs.LG])
    (2 min) Deep neural networks (DNNs) have shown success as high-dimensional function approximators in many applications; however, training DNNs can be challenging in general. DNN training is commonly phrased as a stochastic optimization problem whose challenges include non-convexity, non-smoothness, insufficient regularization, and complicated data distributions. Hence, the performance of a DNN on a given task depends crucially on tuning hyperparameters, especially learning rates and regularization parameters. In the absence of theoretical guidelines or prior experience on similar tasks, this requires solving many training problems, which can be time-consuming and demanding of computational resources. This can limit the applicability of DNNs to problems with non-standard, complex, and scarce datasets, e.g., those arising in many scientific applications. To remedy the challenges of DNN training, we propose slimTrain, a stochastic optimization method for training DNNs with reduced sensitivity to the choice of hyperparameters and fast initial convergence. The central idea of slimTrain is to exploit the separability inherent in many DNN architectures; that is, we separate the DNN into a nonlinear feature extractor followed by a linear model. This separability allows us to leverage recent advances made in solving large-scale, linear, ill-posed inverse problems. Crucially, for the linear weights, slimTrain does not require a learning rate and automatically adapts the regularization parameter. Since our method operates on mini-batches, its computational overhead per iteration is modest. In our numerical experiments, slimTrain outperforms existing DNN training methods with the recommended hyperparameter settings and reduces the sensitivity of DNN training to the remaining hyperparameters.
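    The separability idea can be illustrated with a variable-projection step: holding the nonlinear feature extractor fixed, the optimal linear output weights have a closed form from a regularized least-squares solve. The sketch below shows only that step with a fixed regularization parameter; slimTrain itself adapts this parameter automatically and couples the step with stochastic training of the nonlinear part.

        import numpy as np

        def optimal_linear_weights(Z, Y, lam=1e-3):
            """Closed-form solution of min_W ||Z W - Y||_F^2 + lam * ||W||_F^2, where Z holds
            the nonlinear features of a (mini-)batch and Y the targets. slimTrain-style methods
            adapt lam automatically and use specialized solvers; here lam is simply fixed."""
            d = Z.shape[1]
            return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)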
    Formalizing the Generalization-Forgetting Trade-off in Continual Learning. (arXiv:2109.14035v1 [cs.LG])
    (2 min) We formulate the continual learning (CL) problem via dynamic programming and model the trade-off between catastrophic forgetting and generalization as a two-player sequential game. In this approach, player 1 maximizes the cost due to lack of generalization whereas player 2 minimizes the cost due to catastrophic forgetting. We show theoretically that a balance point between the two players exists for each task and that this point is stable (once the balance is achieved, the two players stay at the balance point). Next, we introduce balanced continual learning (BCL), which is designed to attain balance between generalization and forgetting and empirically demonstrate that BCL is comparable to or better than the state of the art.
    Local Repair of Neural Networks Using Optimization. (arXiv:2109.14041v1 [cs.LG])
    (2 min) In this paper, we propose a framework to repair a pre-trained feed-forward neural network (NN) to satisfy a set of properties. We formulate the properties as a set of predicates that impose constraints on the output of NN over the target input domain. We define the NN repair problem as a Mixed Integer Quadratic Program (MIQP) to adjust the weights of a single layer subject to the given predicates while minimizing the original loss function over the original training domain. We demonstrate the application of our framework in bounding an affine transformation, correcting an erroneous NN in classification, and bounding the inputs of a NN controller.
    Risk averse non-stationary multi-armed bandits. (arXiv:2109.13977v1 [cs.LG])
    (2 min) This paper tackles the risk averse multi-armed bandits problem when incurred losses are non-stationary. The conditional value-at-risk (CVaR) is used as the objective function. Two estimation methods are proposed for this objective function in the presence of non-stationary losses, one relying on a weighted empirical distribution of losses and another on the dual representation of the CVaR. Such estimates can then be embedded into classic arm selection methods such as epsilon-greedy policies. Simulation experiments assess the performance of the arm selection algorithms based on the two novel estimation approaches, and such policies are shown to outperform naive benchmarks not taking non-stationarity into account.
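    The CVaR of a weighted empirical loss distribution can be computed directly from the Rockafellar-Uryasev representation; the sketch below is a generic illustration (with alpha denoting the tail mass), not necessarily the exact estimator proposed in the paper.

        import numpy as np

        def cvar_weighted(losses, weights, alpha=0.1):
            """CVaR of a loss distribution from a weighted empirical sample, using
            CVaR_alpha = min_q  q + E[(L - q)_+] / alpha,
            where alpha is the tail mass (alpha = 0.1 averages the worst 10% of losses)."""
            losses = np.asarray(losses, dtype=float)
            w = np.asarray(weights, dtype=float)
            w = w / w.sum()
            order = np.argsort(losses)
            losses, w = losses[order], w[order]
            cum = np.cumsum(w)
            idx = min(int(np.searchsorted(cum, 1.0 - alpha)), len(losses) - 1)
            q = losses[idx]                       # weighted (1 - alpha)-quantile, i.e. the VaR
            return q + np.sum(w * np.maximum(losses - q, 0.0)) / alpha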
    IGLU: Efficient GCN Training via Lazy Updates. (arXiv:2109.13995v1 [cs.LG])
    (2 min) Graph Convolution Networks (GCN) are used in numerous settings involving a large underlying graph as well as several layers. Standard SGD-based training scales poorly here since each descent step ends up updating node embeddings for a large portion of the graph. Recent methods attempt to remedy this by sub-sampling the graph which does reduce the compute load, but at the cost of biased gradients which may offer suboptimal performance. In this work we introduce a new method IGLU that caches forward-pass embeddings for all nodes at various GCN layers. This enables IGLU to perform lazy updates that do not require updating a large number of node embeddings during descent which offers much faster convergence but does not significantly bias the gradients. Under standard assumptions such as objective smoothness, IGLU provably converges to a first-order saddle point. We validate IGLU extensively on a variety of benchmarks, where it offers up to 1.2% better accuracy despite requiring up to 88% less wall-clock time.
    Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models. (arXiv:2109.13925v1 [cs.CV])
    (2 min) Transformers are state-of-the-art deep learning models composed of stacked attention and point-wise, fully connected layers designed for handling sequential data. Transformers are not only ubiquitous throughout Natural Language Processing (NLP), but have also recently inspired a new wave of Computer Vision (CV) research. In this work, a Vision Transformer (ViT) is applied to predict the state variables of 2-dimensional Ising model simulations. Our experiments show that ViTs outperform state-of-the-art Convolutional Neural Networks (CNNs) when using a small number of microstate images from the Ising model corresponding to various boundary conditions and temperatures. This work opens up the possibility of applying ViTs to other simulations and raises interesting research directions on how attention maps can learn about the underlying physics governing different phenomena.
    Smart at what cost? Characterising Mobile Deep Neural Networks in the wild. (arXiv:2109.13963v1 [cs.LG])
    (2 min) With smartphones' omnipresence in people's pockets, Machine Learning (ML) on mobile is gaining traction as devices become more powerful. With applications ranging from visual filters to voice assistants, intelligence on mobile comes in many forms and facets. However, Deep Neural Network (DNN) inference remains a compute-intensive workload, with devices struggling to support intelligence at the cost of responsiveness. On the one hand, there is significant research on reducing model runtime requirements and supporting deployment on embedded devices. On the other hand, the drive to maximise the accuracy of a task is supported by deeper and wider neural networks, making mobile deployment of state-of-the-art DNNs a moving target. In this paper, we perform the first holistic study of DNN usage in the wild in an attempt to track deployed models and characterise how they run on widely deployed devices. To this end, we analyse over 16k of the most popular apps in the Google Play Store to characterise their DNN usage and performance across devices of different capabilities, both across tiers and generations. Simultaneously, we measure the models' energy footprint, as a core cost dimension of any mobile deployment. To streamline the process, we have developed gaugeNN, a tool that automates the deployment, measurement and analysis of DNNs on devices, with support for different frameworks and platforms. Results from our study paint the landscape of deep learning deployments on smartphones and indicate their popularity across app developers. Furthermore, our study shows the gap between bespoke techniques and real-world deployments and the need for optimised deployment of deep learning models in a highly dynamic and heterogeneous ecosystem.

2021-09-29

  • cs.CL updates on arXiv.org

    Unsolved Problems in ML Safety. (arXiv:2109.13916v1 [cs.LG])
    (2 min) Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), steering ML systems ("Alignment"), and reducing risks to how ML systems are handled ("External Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
    Expectation-based Minimalist Grammars. (arXiv:2109.13871v1 [cs.CL])
    (2 min) Expectation-based Minimalist Grammars (e-MGs) are simplified versions of the (Conflated) Minimalist Grammars, (C)MGs, formalized by Stabler (Stabler, 2011, 2013, 1997) and Phase-based Minimalist Grammars, PMGs (Chesi, 2005, 2007; Stabler, 2011). The crucial simplification consists of driving structure building only by relying on lexically encoded categorial top-down expectations. The commitment on a top-down derivation (as in e-MGs and PMGs, as opposed to (C)MGs, Chomsky, 1995; Stabler, 2011) allows us to define a core derivation that should be the same in both parsing and generation (Momma & Phillips, 2018).
    Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets. (arXiv:2109.13723v1 [cs.CL])
    (2 min) This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles. In our experiments, we find that such character-level models cut the number of untranslated words by over 40% and are especially competitive (improvements of 2-3 BLEU points) in the case of limited training data. We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality. We further compare cascaded translation models to the use of synthetic training data via multiple pivots, and find that the latter works significantly better. Finally, we demonstrate that neither word- nor character-level BLEU correlates perfectly with human judgments, due to BLEU's sensitivity to length.
    Identifying and Mitigating Gender Bias in Hyperbolic Word Embeddings. (arXiv:2109.13767v1 [cs.CL])
    (2 min) Euclidean word embedding models such as GloVe and Word2Vec have been shown to reflect human-like gender biases. In this paper, we extend the study of gender bias to the recently popularized hyperbolic word embeddings. We propose gyrocosine bias, a novel measure for quantifying gender bias in hyperbolic word representations, and observe a significant presence of gender bias. To address this problem, we propose Poincar\'e Gender Debias (PGD), a novel debiasing procedure for hyperbolic word representations. Experiments on a suite of evaluation tests show that PGD effectively reduces bias while adding a minimal semantic offset.
    TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. (arXiv:2102.07988v2 [cs.LG] UPDATED)
    (2 min) Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe
    Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations. (arXiv:2109.13059v2 [cs.CL] UPDATED)
    (2 min) In NLP, a large volume of tasks involve pairwise comparison between two sequences (e.g., sentence similarity and paraphrase identification). Predominantly, two formulations are used for sentence-pair tasks: bi-encoders and cross-encoders. Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient; however, they usually underperform cross-encoders. Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance, but they require task fine-tuning and are computationally more expensive. In this paper, we present a completely unsupervised sentence representation model, termed Trans-Encoder, that combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders. Specifically, on top of a pre-trained Language Model (PLM), we start by converting it into an unsupervised bi-encoder, and then alternate between the bi- and cross-encoder task formulations. In each alternation, one task formulation produces pseudo-labels which are used as learning signals for the other task formulation. We then propose an extension to conduct this self-distillation approach on multiple PLMs in parallel and use the average of their pseudo-labels for mutual distillation. Trans-Encoder creates, to the best of our knowledge, the first completely unsupervised cross-encoder and also a state-of-the-art unsupervised bi-encoder for sentence similarity. Both the bi-encoder and cross-encoder formulations of Trans-Encoder outperform recently proposed state-of-the-art unsupervised sentence encoders such as Mirror-BERT and SimCSE by up to 5% on the sentence similarity benchmarks.
    Arabic aspect based sentiment analysis using BERT. (arXiv:2107.13290v2 [cs.CL] UPDATED)
    (2 min) Aspect-based sentiment analysis (ABSA) is a textual analysis methodology that determines the polarity of opinions on certain aspects related to specific targets. The majority of research on ABSA is in English, with only a small amount of work available in Arabic. Most previous Arabic research has relied on deep learning models that depend primarily on context-independent word embeddings (e.g., word2vec), where each word has a fixed representation independent of its context. This article explores the modeling capabilities of contextual embeddings from pre-trained language models, such as BERT, and the use of sentence-pair input for Arabic ABSA tasks. In particular, we build a simple but effective BERT-based neural baseline to handle this task. Our BERT architecture with a simple linear classification layer surpasses state-of-the-art works, according to experimental results on the benchmark Arabic hotel reviews dataset.
    Open Knowledge Graphs Canonicalization using Variational Autoencoders. (arXiv:2012.04780v2 [cs.CL] UPDATED)
    (2 min) Noun phrases and Relation phrases in open knowledge graphs are not canonicalized, leading to an explosion of redundant and ambiguous subject-relation-object triples. Existing approaches to solve this problem take a two-step approach. First, they generate embedding representations for both noun and relation phrases, then a clustering algorithm is used to group them using the embeddings as features. In this work, we propose Canonicalizing Using Variational Autoencoders (CUVA), a joint model to learn both embeddings and cluster assignments in an end-to-end approach, which leads to a better vector representation for the noun and relation phrases. Our evaluation over multiple benchmarks shows that CUVA outperforms the existing state-of-the-art approaches. Moreover, we introduce CanonicNell, a novel dataset to evaluate entity canonicalization systems.
    Hidden Backdoors in Human-Centric Language Models. (arXiv:2105.00164v3 [cs.CL] UPDATED)
    (2 min) Natural language processing (NLP) systems have been proven to be vulnerable to backdoor attacks, whereby hidden features (backdoors) are trained into a language model and may only be activated by specific inputs (called triggers), to trick the model into producing unexpected behaviors. In this paper, we create covert and natural triggers for textual backdoor attacks, hidden backdoors, where triggers can fool both modern language models and human inspection. We deploy our hidden backdoors through two state-of-the-art trigger embedding methods. The first approach, via homograph replacement, embeds the trigger into deep neural networks through the visual spoofing of lookalike character replacement. The second approach uses subtle differences between text generated by language models and real natural text to produce trigger sentences with correct grammar and high fluency. We demonstrate that the proposed hidden backdoors can be effective across three downstream security-critical NLP tasks, representative of modern human-centric NLP systems, including toxic comment detection, neural machine translation (NMT), and question answering (QA). Our two hidden backdoor attacks can achieve an Attack Success Rate (ASR) of at least 97% with an injection rate of only 3% in toxic comment detection, a 95.1% ASR in NMT with less than 0.5% injected data, and a 91.12% ASR against QA updated with only 27 poisoning data samples on a model previously trained with 92,024 samples (0.029%). We demonstrate the adversary's high attack success rate while maintaining functionality for regular users, with triggers that remain inconspicuous to human administrators.
    Prefix-to-SQL: Text-to-SQL Generation from Incomplete User Questions. (arXiv:2109.13066v2 [cs.CL] UPDATED)
    (2 min) Existing text-to-SQL research only considers complete questions as the input, but lay users might struggle to formulate a complete question. To build a smarter natural language interface to database systems (NLIDB) that also processes incomplete questions, we propose a new task, prefix-to-SQL, which takes a question prefix from users as the input and predicts the intended SQL. We construct a new benchmark called PAGSAS that contains 124K user question prefixes and the intended SQL for 5 sub-tasks: Advising, GeoQuery, Scholar, ATIS, and Spider. Additionally, we propose a new metric, SAVE, to measure how much effort can be saved by users. Experimental results show that PAGSAS is challenging even for strong baseline models such as T5. As we observe that the difficulty of prefix-to-SQL is related to the number of omitted tokens, we incorporate curriculum learning, feeding examples with an increasing number of omitted tokens. This improves scores on various sub-tasks, by as much as 9% recall on the GeoQuery sub-task of PAGSAS.
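    A small illustration of how question prefixes with an increasing number of omitted tokens could be enumerated for such a curriculum; the example question and the schedule are hypothetical and not drawn from PAGSAS.

```python
# Sketch of a curriculum over question prefixes, ordered by the number of
# omitted tokens, following the curriculum-learning idea described above.
def prefixes_by_omitted_tokens(question: str):
    """Yield (prefix, n_omitted) pairs from the fewest to the most omitted tokens."""
    tokens = question.split()
    for n_omitted in range(len(tokens)):          # 0 omitted first, then 1, 2, ...
        yield " ".join(tokens[: len(tokens) - n_omitted]), n_omitted

question = "show all flights from denver to boston on tuesday"
for prefix, n in prefixes_by_omitted_tokens(question):
    print(f"{n} omitted: {prefix!r}")

# A curriculum would feed training examples sorted by n_omitted, starting with
# complete questions and gradually adding shorter and shorter prefixes.
```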
    Automatic Generation of Word Problems for Academic Education via Natural Language Processing (NLP). (arXiv:2109.13123v2 [cs.CL] UPDATED)
    (2 min) Digital learning platforms enable students to learn on a flexible and individual schedule as well as providing instant feedback mechanisms. The field of STEM education requires students to solve numerous training exercises to grasp underlying concepts. It is apparent that there are restrictions in current online education in terms of exercise diversity and individuality. Many exercises show little variance in structure and content, hindering the development of abstraction capabilities by students. This thesis proposes an approach to generate diverse, context-rich word problems. In addition to requiring the generated language to be grammatically correct, the nature of word problems implies additional constraints on the validity of contents. The proposed approach is shown to be effective in generating valid word problems for mathematical statistics. The experimental results present a tradeoff between generation time and exercise validity. The system can easily be parametrized to handle this tradeoff according to the requirements of specific use cases.
    SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation. (arXiv:2107.14154v2 [cs.CL] UPDATED)
    (2 min) To address a looming crisis of unreproducible evaluation for named entity recognition, we propose guidelines and introduce SeqScore, a software package to improve reproducibility. The guidelines we propose are extremely simple and center around transparency regarding how chunks are encoded and scored. We demonstrate that despite the apparent simplicity of NER evaluation, unreported differences in the scoring procedure can result in changes to scores that are both of noticeable magnitude and statistically significant. We describe SeqScore, which addresses many of the issues that cause replication failures.
    Improving Low-resource Reading Comprehension via Cross-lingual Transposition Rethinking. (arXiv:2107.05002v2 [cs.CL] UPDATED)
    (2 min) Extractive Reading Comprehension (ERC) has made tremendous advances enabled by the availability of large-scale high-quality ERC training data. Despite such rapid progress and widespread application, datasets in languages other than high-resource languages such as English remain scarce. To address this issue, we propose a Cross-Lingual Transposition ReThinking (XLTT) model that models existing high-quality extractive reading comprehension datasets in a multilingual environment. To be specific, we present multilingual adaptive attention (MAA), which combines intra-attention and inter-attention to learn more generalizable semantic and lexical knowledge from each pair of language families. Furthermore, to make full use of existing datasets, we adopt a new training framework to train our model by calculating task-level similarities between each existing dataset and the target dataset. The experimental results show that our XLTT model surpasses six baselines on two multilingual ERC benchmarks and is especially effective for low-resource languages, with average improvements of 3.9 and 4.1 points in F1 and EM, respectively.
    HeadlineCause: A Dataset of News Headlines for Detecting Causalities. (arXiv:2108.12626v2 [cs.CL] UPDATED)
    (2 min) Detecting implicit causal relations in texts is a task that requires both common sense and world knowledge. Existing datasets are focused either on commonsense causal reasoning or explicit causal relations. In this work, we present HeadlineCause, a dataset for detecting implicit causal relations between pairs of news headlines. The dataset includes over 5000 headline pairs from English news and over 9000 headline pairs from Russian news labeled through crowdsourcing. The pairs vary from totally unrelated or belonging to the same general topic to the ones including causation and refutation relations. We also present a set of models and experiments that demonstrates the dataset validity, including a multilingual XLM-RoBERTa based model for causality detection and a GPT-2 based model for possible effects prediction.
    Sentiment Analysis in Twitter for Macedonian. (arXiv:2109.13725v1 [cs.CL])
    (2 min) We present work on sentiment analysis in Twitter for Macedonian. As this is pioneering work for this combination of language and genre, we created suitable resources for training and evaluating a system for sentiment analysis of Macedonian tweets. In particular, we developed a corpus of tweets annotated with tweet-level sentiment polarity (positive, negative, and neutral), as well as with phrase-level sentiment, which we made freely available for research purposes. We further bootstrapped several large-scale sentiment lexicons for Macedonian, motivated by previous work for English. The impact of several different pre-processing steps as well as of various features is shown in experiments that represent the first attempt to build a system for sentiment analysis in Twitter for the morphologically rich Macedonian language. Overall, our experimental results show an F1-score of 92.16, which is very strong and is on par with the best results for English, which were achieved in recent SemEval competitions.
    Single-dataset Experts for Multi-dataset Question Answering. (arXiv:2109.13880v1 [cs.CL])
    (2 min) Many datasets have been created for training reading comprehension models, and a natural question is whether we can combine them to build models that (1) perform better on all of the training datasets and (2) generalize and transfer better to new datasets. Prior work has addressed this goal by training one network simultaneously on multiple datasets, which works well on average but is prone to over- or under-fitting different sub-distributions and might transfer worse compared to source models with more overlap with the target dataset. Our approach is to model multi-dataset question answering with a collection of single-dataset experts, by training a collection of lightweight, dataset-specific adapter modules (Houlsby et al., 2019) that share an underlying Transformer model. We find that these Multi-Adapter Dataset Experts (MADE) outperform all our baselines in terms of in-distribution accuracy, and simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance, offering a strong and versatile starting point for building new reading comprehension systems.
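    A minimal sketch of the parameter-averaging step described above, assuming toy bottleneck adapters; the Adapter class, dataset names, and sizes are illustrative stand-ins rather than the MADE code.

```python
# Sketch of averaging dataset-specific adapters into one merged adapter,
# mirroring the parameter-averaging idea described above. The Adapter module
# and sizes are illustrative; real adapters follow Houlsby et al. (2019).
import copy
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# One adapter per training dataset (the "single-dataset experts").
experts = {name: Adapter() for name in ["squad", "hotpotqa", "triviaqa"]}

def average_adapters(adapters):
    """Average the state dicts of several adapters into one merged adapter."""
    merged = copy.deepcopy(next(iter(adapters.values())))
    state = merged.state_dict()
    for key in state:
        state[key] = torch.stack([a.state_dict()[key] for a in adapters.values()]).mean(0)
    merged.load_state_dict(state)
    return merged

zero_shot_adapter = average_adapters(experts)   # starting point for a new dataset
```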
    Chekhov's Gun Recognition. (arXiv:2109.13855v1 [cs.CL])
    (2 min) Chekhov's gun is a dramatic principle stating that every element in a story must be necessary, and irrelevant elements should be removed. This paper presents a new natural language processing task, Chekhov's gun recognition (CGR): the recognition of entities that are pivotal for the development of the plot. Though similar to classical Named Entity Recognition (NER), it has profound differences and is crucial for narrative processing tasks, since Chekhov's guns have a profound impact on the causal relationships in a story. The paper presents a new benchmark dataset for the CGR task that includes 5550 descriptions, each containing one or more Chekhov's guns, and validates the task on two more datasets available in the natural language processing (NLP) literature.
    On Homophony and Rényi Entropy. (arXiv:2109.13766v1 [cs.CL])
    (2 min) Homophony's widespread presence in natural languages is a controversial topic. Recent theories of language optimality have tried to justify its prevalence, despite its negative effects on cognitive processing time; e.g., Piantadosi et al. (2012) argued homophony enables the reuse of efficient wordforms and is thus beneficial for languages. This hypothesis has recently been challenged by Trott and Bergen (2020), who posit that good wordforms are more often homophonous simply because they are more phonotactically probable. In this paper, we join in on the debate. We first propose a new information-theoretic quantification of a language's homophony: the sample Rényi entropy. Then, we use this quantification to revisit Trott and Bergen's claims. While their point is theoretically sound, a specific methodological issue in their experiments raises doubts about their results. After addressing this issue, we find no clear pressure either towards or against homophony -- a much more nuanced result than either Piantadosi et al.'s or Trott and Bergen's findings.
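    For reference, the standard Rényi entropy of order alpha computed from a sample of frequencies; the toy counts and the choice of alpha are illustrative, and the paper's sample-based estimator may differ in its details.

```python
# Standard Rényi entropy of order alpha computed from a sample of wordform
# frequencies: H_alpha(p) = 1/(1-alpha) * log(sum_i p_i^alpha), for alpha != 1.
# The counts and alpha below are illustrative only.
import math

def renyi_entropy(counts, alpha=2.0):
    """Rényi entropy of the empirical distribution given by integer counts."""
    total = sum(counts)
    probs = [c / total for c in counts]
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

# Toy frequency sample of wordforms; heavy concentration on few forms lowers the entropy.
wordform_counts = [120, 80, 40, 10, 5, 5]
print(renyi_entropy(wordform_counts, alpha=2.0))
```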
    Temporal Information and Event Markup Language: TIE-ML Markup Process and Schema Version 1.0. (arXiv:2109.13892v1 [cs.CL])
    (2 min) Temporal Information and Event Markup Language (TIE-ML) is a markup strategy and annotation schema to improve the productivity and accuracy of temporal and event related annotation of corpora, with the goal of facilitating machine learning based model training. For the annotation of events, temporal sequencing, and durations, it is significantly simpler than existing schemes, providing a heavily reduced tag set covering just temporal relations and event enumeration. In comparison to other standards, for example the Time Markup Language (TimeML), it is much easier to use because it drops sophisticated formalisms, theoretical concepts, and annotation approaches. Annotations of corpora using TimeML can be mapped to TIE-ML with some loss, while TIE-ML annotations can be fully mapped to TimeML with certain under-specification.
    How Different Text-preprocessing Techniques Using The BERT Model Affect The Gender Profiling of Authors. (arXiv:2109.13890v1 [cs.CL])
    (2 min) Forensic author profiling plays an important role in indicating possible profiles for suspects. Among the many automated solutions recently proposed for author profiling, transfer learning outperforms many other state-of-the-art techniques in natural language processing. Nevertheless, this sophisticated technique has yet to be fully exploited for author profiling. At the same time, whereas current methods of author profiling, largely based on feature engineering, vary considerably from model to model, transfer learning usually requires a preprocessed text to be fed into the model. We reviewed multiple references in the literature and determined the most common preprocessing techniques associated with profiling authors' gender. Considering the variations in potential preprocessing techniques, we conducted an experimental study that applied five such techniques to measure each technique's effect while using the BERT model, chosen for being one of the most widely used pretrained models. We used the Hugging Face Transformers library to implement the code for each preprocessing case. In our five experiments, we found that BERT achieves the best accuracy in predicting the gender of the author when no preprocessing technique is applied. Our best case achieved 86.67% accuracy in predicting the gender of authors.
    Multi-Task Triplet Loss for Named Entity Recognition using Supplementary Text. (arXiv:2109.13736v1 [cs.CL])
    (2 min) Retail item data contains many different forms of text, such as the title of an item, the description of an item, the item name, and reviews. It is of interest to identify the item name in the other forms of text using a named entity tagger. However, the title of an item and its description are syntactically different (but semantically similar) in that the title is not necessarily a well-formed sentence while the description is made up of well-formed sentences. In this work, we use a triplet loss to contrast the embeddings of the item title with the description to establish a proof of concept. We find that using the triplet loss in a multi-task NER algorithm improves both the precision and recall by a small percentage. While the improvement is small, we think it is a step in the right direction of using various forms of text in a multi-task algorithm. In addition to precision and recall, the multi-task triplet loss method is also found to significantly improve the exact match accuracy, i.e., the accuracy of tagging the entire set of tokens in the text with the correct tags.
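    A minimal sketch of the contrastive term described above, assuming random toy embeddings: the item title acts as the anchor, its own description as the positive, and another item's description as the negative. Tensor sizes and names are illustrative.

```python
# Sketch of the title-vs-description triplet loss described above.
# Random toy tensors stand in for the encoder outputs.
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

batch, dim = 16, 128
title_emb      = torch.randn(batch, dim, requires_grad=True)  # anchor
own_desc_emb   = torch.randn(batch, dim)                       # positive
other_desc_emb = torch.randn(batch, dim)                       # negative

loss = triplet_loss(title_emb, own_desc_emb, other_desc_emb)
loss.backward()   # in the multi-task NER model this term would be added to the tagging loss
```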
    Exposing Paid Opinion Manipulation Trolls. (arXiv:2109.13726v1 [cs.LG])
    (2 min) Recently, Web forums have been invaded by opinion manipulation trolls. Some trolls try to influence the other users driven by their own convictions, while in other cases they can be organized and paid, e.g., by a political party or a PR agency that gives them specific instructions what to write. Finding paid trolls automatically using machine learning is a hard task, as there is not enough training data to train a classifier; yet some test data can be obtained, as these trolls are sometimes caught and widely exposed. In this paper, we solve the training data problem by assuming that a user who is called a troll by several different people is likely to be such, and one who has never been called a troll is unlikely to be such. We compare the profiles of (i) paid trolls vs. (ii) "mentioned" trolls vs. (iii) non-trolls, and we further show that a classifier trained to distinguish (ii) from (iii) does quite well also at telling apart (i) from (iii).
    Translating from Morphologically Complex Languages: A Paraphrase-Based Approach. (arXiv:2109.13724v1 [cs.CL])
    (2 min) We propose a novel approach to translating from a morphologically complex language. Unlike previous research, which has targeted word inflections and concatenations, we focus on the pairwise relationship between morphologically related words, which we treat as potential paraphrases and handle using paraphrasing techniques at the word, phrase, and sentence level. An important advantage of this framework is that it can cope with derivational morphology, which has so far remained largely beyond the capabilities of statistical machine translation systems. Our experiments translating from Malay, whose morphology is mostly derivational, into English show significant improvements over rivaling approaches based on five automatic evaluation measures (for 320,000 sentence pairs; 9.5 million English word tokens).
    Micromodels for Efficient, Explainable, and Reusable Systems: A Case Study on Mental Health. (arXiv:2109.13770v1 [cs.CL])
    (2 min) Many statistical models have high accuracy on test benchmarks, but are not explainable, struggle in low-resource scenarios, cannot be reused for multiple tasks, and cannot easily integrate domain expertise. These factors limit their use, particularly in settings such as mental health, where it is difficult to annotate datasets and model outputs have significant impact. We introduce a micromodel architecture to address these challenges. Our approach allows researchers to build interpretable representations that embed domain knowledge and provide explanations throughout the model's decision process. We demonstrate the idea on multiple mental health tasks: depression classification, PTSD classification, and suicidal risk assessment. Our systems consistently produce strong results, even in low-resource scenarios, and are more interpretable than alternative methods.
    Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS. (arXiv:2109.13673v1 [cs.CL])
    (2 min) This paper presents Nana-HDR, a new non-attentive non-autoregressive model with hybrid Transformer-based Dense-fuse encoder and RNN-based decoder for TTS. It mainly consists of three parts: Firstly, a novel Dense-fuse encoder with dense connections between basic Transformer blocks for coarse feature fusion and a multi-head attention layer for fine feature fusion. Secondly, a single-layer non-autoregressive RNN-based decoder. Thirdly, a duration predictor instead of an attention model that connects the above hybrid encoder and decoder. Experiments indicate that Nana-HDR gives full play to the advantages of each component, such as strong text encoding ability of Transformer-based encoder, stateful decoding without being bothered by exposure bias and local information preference, and stable alignment provided by duration predictor. Due to these advantages, Nana-HDR achieves competitive performance in naturalness and robustness on two Mandarin corpora.
    One to rule them all: Towards Joint Indic Language Hate Speech Detection. (arXiv:2109.13711v1 [cs.CL])
    (2 min) This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2021 shared task. Social media today is a hotbed of toxic and hateful conversations, in various languages. Recent news reports have shown that current models struggle to automatically identify hate posted in minority languages. Therefore, efficiently curbing hate speech is a critical challenge and problem of interest. We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection across three languages namely, English, Hindi, and Marathi. On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, 0.8651 for sub-task 1A and 0.6268, 0.5603 during the fine-grained classification of sub-task 1B. These results show the efficacy of exploiting a multilingual training scheme.
    CIDEr-R: Robust Consensus-based Image Description Evaluation. (arXiv:2109.13701v1 [cs.CV])
    (2 min) This paper shows that CIDEr-D, a traditional evaluation metric for image description, does not work properly on datasets where the number of words in the sentence is significantly greater than in the MS COCO Captions dataset. We also show that CIDEr-D's performance is hampered by the lack of multiple reference sentences and high variance of sentence length. To bypass this problem, we introduce CIDEr-R, which improves CIDEr-D, making it more flexible in dealing with datasets with high sentence length variance. We demonstrate that CIDEr-R is more accurate and closer to human judgment than CIDEr-D, and that CIDEr-R is more robust regarding the number of available references. Our results reveal that using Self-Critical Sequence Training to optimize CIDEr-R generates descriptive captions. In contrast, when CIDEr-D is optimized, the generated captions' length tends to be similar to the reference length, but the models also tend to repeat the same word several times to increase the sentence length.
    "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations. (arXiv:2109.13489v1 [cs.CL])
    (2 min) Most prior work in dialogue modeling has been on written conversations mostly because of existing data sets. However, written dialogues are not sufficient to fully capture the nature of spoken conversations as well as the potential speech recognition errors in practical spoken dialogue systems. This work presents a new benchmark on spoken task-oriented conversations, which is intended to study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling. We report that the existing state-of-the-art models trained on written conversations are not performing well on our spoken data, as expected. Furthermore, we observe improvements in task performances when leveraging n-best speech recognition hypotheses such as by combining predictions based on individual hypotheses. Our data set enables speech-based benchmarking of task-oriented dialogue systems.
    Generating texts under constraint through discriminator-guided MCTS. (arXiv:2109.13582v1 [cs.CL])
    (2 min) Large pre-trained language models (LM) based on Transformers make it possible to generate very plausible long texts. In this paper, we explore how this generation can be further controlled to satisfy certain constraints (e.g., being non-toxic, conveying positive or negative polarity, conveying certain emotions, etc.) without fine-tuning the LM. Precisely, we formalize constrained generation as a tree exploration process guided by a discriminator according to how well the associated sequence respects the constraint. Using a discriminator to guide this generation, rather than fine-tuning the LM, is easier and cheaper to train, and also allows the constraint to be applied more finely and dynamically. We propose several original methods to search this generation tree, notably Monte Carlo Tree Search (MCTS), which provides theoretical guarantees on the search efficiency, but also simpler methods based on re-ranking a pool of diverse sequences using the discriminator scores. We evaluate these methods on two types of constraints and languages: review polarity and emotion control in French and English. We show that MCTS achieves state-of-the-art results in constrained generation, without having to tune the language model, in both tasks and languages. We also demonstrate that our other proposed methods based on re-ranking can be very effective when diversity among the generated propositions is encouraged.
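    A minimal sketch of the simpler re-ranking variant mentioned above, with stand-in functions for the language-model sampler and the constraint discriminator; both are hypothetical placeholders, not the paper's models.

```python
# Sketch of discriminator-based re-ranking: sample a diverse pool of
# continuations, then keep the one the constraint scorer ranks highest.
# The generator and discriminator below are stand-ins for an LM and a
# trained classifier.
import random

def sample_continuations(prompt, n=8):
    """Stand-in for LM sampling; returns n candidate continuations."""
    endings = ["great", "fine", "terrible", "wonderful", "boring",
               "lovely", "awful", "pleasant"]
    return [f"{prompt} {random.choice(endings)}." for _ in range(n)]

def discriminator_score(text):
    """Stand-in constraint scorer, e.g. an estimate of P(positive sentiment | text)."""
    positive = {"great", "wonderful", "lovely", "pleasant", "fine"}
    words = text.split()
    return sum(w.strip(".") in positive for w in words) / len(words)

candidates = sample_continuations("The movie was")
best = max(candidates, key=discriminator_score)
print(best)
```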
    DeepPSL: End-to-end perception and reasoning with applications to zero shot learning. (arXiv:2109.13662v1 [eess.SY])
    (2 min) We introduce DeepPSL, a variant of Probabilistic Soft Logic (PSL), to produce an end-to-end trainable system that integrates reasoning and perception. PSL represents first-order logic in terms of a convex graphical model: Hinge-Loss Markov random fields (HL-MRFs). PSL stands out among probabilistic logic frameworks due to its tractability, having been applied to systems of more than 1 billion ground rules. The key to our approach is to represent predicates in first-order logic using deep neural networks and then to approximately back-propagate through the HL-MRF and thus train every aspect of the first-order system being represented. We believe that this approach represents an interesting direction for the integration of deep learning and reasoning techniques, with applications to knowledge base learning, multi-task learning, and explainability. We evaluate DeepPSL on a zero-shot learning problem in image classification. State-of-the-art results demonstrate the utility and flexibility of our approach.
    SYGMA: System for Generalizable Modular Question Answering Over Knowledge Bases. (arXiv:2109.13430v1 [cs.CL])
    (2 min) Knowledge Base Question Answering (KBQA) tasks that involve complex reasoning are emerging as an important research direction. However, most KBQA systems struggle with generalizability, particularly on two dimensions: (a) across multiple reasoning types, where both datasets and systems have primarily focused on multi-hop reasoning, and (b) across multiple knowledge bases, where KBQA approaches are specifically tuned to a single knowledge base. In this paper, we present SYGMA, a modular approach facilitating generalizability across multiple knowledge bases and multiple reasoning types. Specifically, SYGMA contains three high-level modules: 1) a KB-agnostic question understanding module that is common across KBs, 2) rules to support additional reasoning types, and 3) a KB-specific question mapping and answering module to address the KB-specific aspects of the answer extraction. We demonstrate the effectiveness of our system by evaluating on datasets belonging to two distinct knowledge bases, DBpedia and Wikidata. In addition, to demonstrate extensibility to additional reasoning types, we evaluate on multi-hop reasoning datasets and a new temporal KBQA benchmark dataset on Wikidata, named TempQA-WD1, introduced in this paper. We show that our generalizable approach achieves better or competitive performance on multiple datasets on DBpedia and Wikidata that require both multi-hop and temporal reasoning.
    Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation. (arXiv:2109.13318v1 [cs.CL])
    (2 min) Automating sign language translation (SLT) is a challenging real-world application. Despite its societal importance, though, research progress in the field remains rather limited. Crucially, existing methods that yield viable performance necessitate the availability of laborious-to-obtain gloss sequence groundtruth. In this paper, we attenuate this need by introducing an end-to-end SLT model that does not entail explicit use of glosses; the model only needs text groundtruth. This is in stark contrast to existing end-to-end models that use gloss sequence groundtruth, either in the form of a modality that is recognized at an intermediate model stage, or in the form of a parallel output process, jointly trained with the SLT model. Our approach constitutes a Transformer network with a novel type of layers that combines: (i) local winner-takes-all (LWTA) layers with stochastic winner sampling, instead of conventional ReLU layers, (ii) stochastic weights with posterior distributions estimated via variational inference, and (iii) a weight compression technique at inference time that exploits the estimated posterior variance to perform massive, almost lossless compression. We demonstrate that our approach can reach the currently best reported BLEU-4 score on the PHOENIX 2014T benchmark, but without making use of glosses for model training, and with a memory footprint reduced by more than 70%.
    Active Learning for Argument Mining: A Practical Approach. (arXiv:2109.13611v1 [cs.CL])
    (2 min) Despite considerable recent progress, the creation of well-balanced and diverse resources remains a time-consuming and costly challenge in Argument Mining. Active Learning reduces the amount of data necessary for the training of machine learning models by querying the most informative samples for annotation and therefore is a promising method for resource creation. In a large scale comparison of several Active Learning methods, we show that Active Learning considerably decreases the effort necessary to get good deep learning performance on the task of Argument Unit Recognition and Classification (AURC).
    Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification. (arXiv:2109.13486v1 [cs.CL])
    (2 min) End-to-end speech-to-intent classification has shown its advantage in harvesting information from both text and speech. In this paper, we study a technique to develop such an end-to-end system that supports multiple languages. To overcome the scarcity of multi-lingual speech corpus, we exploit knowledge from a pre-trained multi-lingual natural language processing model. Multi-lingual bidirectional encoder representations from transformers (mBERT) models are trained on multiple languages and hence expected to perform well in the multi-lingual scenario. In this work, we employ a teacher-student learning approach to sufficiently extract information from an mBERT model to train a multi-lingual speech model. In particular, we use synthesized speech generated from an English-Mandarin text corpus for analysis and training of a multi-lingual intent classification model. We also demonstrate that the teacher-student learning approach obtains an improved performance (91.02%) over the traditional end-to-end (89.40%) intent classification approach in a practical multi-lingual scenario.
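    A minimal sketch of the teacher-student transfer described above, assuming a frozen linear "teacher" over text features and a linear "student" over speech features; the dimensions, models, and data are illustrative stand-ins for mBERT and the speech encoder.

```python
# Sketch of teacher-student distillation for intent classification: a frozen
# text teacher supervises a speech student by matching output distributions
# with a KL loss. Toy linear layers over random features stand in for the
# real models.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_intents, batch = 10, 32
text_feats   = torch.randn(batch, 768)   # teacher input (text representation)
speech_feats = torch.randn(batch, 256)   # student input (speech representation)

teacher = nn.Linear(768, n_intents)      # assumed already trained on text; kept frozen
student = nn.Linear(256, n_intents)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

with torch.no_grad():
    teacher_logp = F.log_softmax(teacher(text_feats), dim=-1)

for _ in range(100):
    opt.zero_grad()
    student_logp = F.log_softmax(student(speech_feats), dim=-1)
    # KL divergence between teacher and student distributions (log-probs on both sides)
    loss = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
    loss.backward()
    opt.step()
```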
    Evaluating Biomedical BERT Models for Vocabulary Alignment at Scale in the UMLS Metathesaurus. (arXiv:2109.13348v1 [cs.CL])
    (2 min) The current UMLS (Unified Medical Language System) Metathesaurus construction process for integrating over 200 biomedical source vocabularies is expensive and error-prone, as it relies on lexical algorithms and human editors to decide whether two biomedical terms are synonymous. Recent advances in Natural Language Processing, such as Transformer models like BERT and its biomedical variants with contextualized word embeddings, have achieved state-of-the-art (SOTA) performance on downstream tasks. We aim to validate whether approaches using these BERT models can actually outperform the existing approaches for predicting synonymy in the UMLS Metathesaurus. In the existing Siamese Networks with LSTM and BioWordVec embeddings, we replace the BioWordVec embeddings with the biomedical BERT embeddings extracted from each BERT model using different extraction strategies. In the Transformer architecture, we evaluate the use of different biomedical BERT models that have been pre-trained using different datasets and tasks. Given the SOTA performance of these BERT models on other downstream tasks, our experiments yield surprising results: (1) in both model architectures, the approaches employing these biomedical BERT-based models do not outperform the existing approaches using a Siamese Network with BioWordVec embeddings for the UMLS synonymy prediction task, (2) the original BioBERT large model that has not been pre-trained with the UMLS outperforms the SapBERT models that have been pre-trained with the UMLS, and (3) using the Siamese Networks yields better performance for synonymy prediction when compared to using the biomedical BERT models.
    Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking. (arXiv:2109.13620v1 [cs.CL])
    (2 min) Recent progress in task-oriented neural dialogue systems is largely focused on a handful of languages, as annotation of training data is tedious and expensive. Machine translation has been used to make systems multilingual, but this can introduce a pipeline of errors. Another promising solution is using cross-lingual transfer learning through pretrained multilingual models. Existing methods train multilingual models with additional code-mixed task data or refine the cross-lingual representations through parallel ontologies. In this work, we enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models, where the multilingual models are fine-tuned with different but related data and/or tasks. Specifically, we use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks suitable for downstream dialogue tasks. We use only 200K lines of parallel data for intermediate fine-tuning which is already available for 1782 language pairs. We test our approach on the cross-lingual dialogue state tracking task for the parallel MultiWoZ (English -> Chinese, Chinese -> English) and Multilingual WoZ (English -> German, English -> Italian) datasets. We achieve impressive improvements (> 20% on joint goal accuracy) on the parallel MultiWoZ dataset and the Multilingual WoZ dataset over the vanilla baseline with only 10% of the target language task data and zero-shot setup respectively.
    VoxCeleb Enrichment for Age and Gender Recognition. (arXiv:2109.13510v1 [cs.LG])
    (2 min) VoxCeleb datasets are widely used in speaker recognition studies. Our work serves two purposes. First, we provide speaker age labels and an alternative annotation of speaker gender. Second, we demonstrate the use of this metadata by constructing age and gender recognition models with different features and classifiers. We query different celebrity databases and apply consensus rules to derive age and gender labels. We also compare the original VoxCeleb gender labels with our labels to identify records that might be mislabeled in the original VoxCeleb data. On the modeling side, we conduct a comprehensive study of multiple features and models for recognizing gender and age. Our best system, using i-vector features, achieved an F1-score of 0.9829 for the gender recognition task using logistic regression, and the lowest mean absolute error (MAE) in age regression, 9.443 years, was obtained with ridge regression. This indicates the challenge of age estimation from in-the-wild speech data.
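    A minimal sketch of the two downstream models named above (logistic regression for gender, ridge regression for age), with random vectors standing in for the i-vector features; the data and the resulting scores are synthetic.

```python
# Sketch of the gender (F1) and age (MAE) models described above, using
# random stand-in features instead of real i-vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import f1_score, mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))             # stand-in i-vector features
gender = rng.integers(0, 2, size=500)       # binary gender labels
age = rng.uniform(20, 70, size=500)         # ages in years

X_tr, X_te, g_tr, g_te, a_tr, a_te = train_test_split(X, gender, age, random_state=0)

gender_clf = LogisticRegression(max_iter=1000).fit(X_tr, g_tr)
print("gender F1:", f1_score(g_te, gender_clf.predict(X_te)))

age_reg = Ridge(alpha=1.0).fit(X_tr, a_tr)
print("age MAE (years):", mean_absolute_error(a_te, age_reg.predict(X_te)))
```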
    Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators' Disagreement. (arXiv:2109.13563v1 [cs.CL])
    (2 min) Since state-of-the-art approaches to offensive language detection rely on supervised learning, it is crucial to quickly adapt them to the continuously evolving scenario of social media. While several approaches have been proposed to tackle the problem from an algorithmic perspective, so as to reduce the need for annotated data, less attention has been paid to the quality of these data. Following a trend that has emerged recently, we focus on the level of agreement among annotators while selecting data to create offensive language datasets, a task involving a high level of subjectivity. Our study comprises the creation of three novel datasets of English tweets covering different topics and having five crowd-sourced judgments each. We also present an extensive set of experiments showing that selecting training and test data according to different levels of annotators' agreement has a strong effect on classifier performance and robustness. Our findings are further validated in cross-domain experiments and studied using a popular benchmark dataset. We show that such hard cases, where low agreement is present, are not necessarily due to poor-quality annotation and we advocate for a higher presence of ambiguous cases in future datasets, particularly in test sets, to better account for the different points of view expressed online.
    Multilingual Counter Narrative Type Classification. (arXiv:2109.13664v1 [cs.CL])
    (2 min) The growing interest in employing counter narratives for hatred intervention brings with it a focus on dataset creation and automation strategies. In this scenario, learning to recognize counter narrative types from natural text is expected to be useful for applications such as hate speech countering, where operators from non-governmental organizations are supposed to answer to hate with several and diverse arguments that can be mined from online sources. This paper presents the first multilingual work on counter narrative type classification, evaluating SoTA pre-trained language models in monolingual, multilingual and cross-lingual settings. When considering a fine-grained annotation of counter narrative classes, we report strong baseline classification results for the majority of the counter narrative types, especially if we translate every language to English before cross-lingual prediction. This suggests that knowledge about counter narratives can be successfully transferred across languages.
    On Isotropy Calibration of Transformers. (arXiv:2109.13304v1 [cs.CL])
    (2 min) Different studies of the embedding space of transformer models suggest that the distribution of contextual representations is highly anisotropic - the embeddings are distributed in a narrow cone. Meanwhile, static word representations (e.g., Word2Vec or GloVe) have been shown to benefit from isotropic spaces. Therefore, previous work has developed methods to calibrate the embedding space of transformers in order to ensure isotropy. However, a recent study (Cai et al. 2021) shows that the embedding space of transformers is locally isotropic, which suggests that these models are already capable of exploiting the expressive capacity of their embedding space. In this work, we conduct an empirical evaluation of state-of-the-art methods for isotropy calibration on transformers and find that they do not provide consistent improvements across models and tasks. These results support the thesis that, given the local isotropy, transformers do not benefit from additional isotropy calibration.
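    A common diagnostic for the anisotropy discussed above is the average pairwise cosine similarity of embeddings (values far from zero indicate a narrow cone). The sketch below uses random vectors and is a generic measurement, not any specific calibration method from the surveyed work.

```python
# Sketch of a simple anisotropy diagnostic: mean pairwise cosine similarity.
# Random vectors stand in for contextual embeddings.
import numpy as np

def mean_pairwise_cosine(embeddings):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    return (sims.sum() - n) / (n * (n - 1))     # exclude self-similarities

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(1000, 64))
anisotropic = isotropic + 5.0                    # a shared offset creates a narrow cone

print("isotropic  :", mean_pairwise_cosine(isotropic))
print("anisotropic:", mean_pairwise_cosine(anisotropic))
```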
    When in Doubt: Improving Classification Performance with Alternating Normalization. (arXiv:2109.13449v1 [cs.LG])
    (2 min) We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification. CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution using the predicted class distributions of high-confidence validation examples. CAN is easily applicable to any probabilistic classifier, with minimal computation overhead. We analyze the properties of CAN using simulated experiments, and empirically demonstrate its effectiveness across a diverse set of classification tasks.
    Template-free Prompt Tuning for Few-shot NER. (arXiv:2109.13532v1 [cs.CL])
    (2 min) Prompt-based methods have been successfully applied in sentence-level few-shot learning tasks, mostly owing to the sophisticated design of templates and label words. However, when applied to token-level labeling tasks such as NER, it would be time-consuming to enumerate the template queries over all potential entity spans. In this work, we propose a more elegant method to reformulate NER tasks as LM problems without any templates. Specifically, we discard the template construction process while maintaining the word prediction paradigm of pre-training models to predict a class-related pivot word (or label word) at the entity position. Meanwhile, we also explore principled ways to automatically search for appropriate label words that the pre-trained models can easily adapt to. While avoiding the complicated template-based process, the proposed LM objective also reduces the gap between the objectives used in pre-training and fine-tuning, so it can better benefit few-shot performance. Experimental results demonstrate the effectiveness of the proposed method over bert-tagger and template-based methods in the few-shot setting. Moreover, the decoding speed of the proposed method is up to 1930.12 times faster than that of the template-based method.
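    A rough illustration of the template-free idea, probing which label words a masked LM prefers at a candidate entity position; the model choice and the label-word list are assumptions for illustration, not the automatically searched label words from the paper, and the paper's training objective differs from this simple probe.

```python
# Sketch: compare masked-LM scores of assumed label words at an entity
# position, in the spirit of predicting a class-related pivot word there.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

label_words = {"PER": "John", "LOC": "London", "O": "the"}   # assumed pivot words
sentence = f"{tokenizer.mask_token} lives in Paris ."        # probe the first position

inputs = tokenizer(sentence, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

for tag, word in label_words.items():
    word_id = tokenizer.convert_tokens_to_ids(word)
    print(tag, logits[word_id].item())
```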
    Instance-Based Neural Dependency Parsing. (arXiv:2109.13497v1 [cs.CL])
    (2 min) Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and have the reasonable plausibility of instance-based explanations.
    TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation. (arXiv:2109.13296v1 [cs.CL])
    (2 min) Recent progress in generative language models has enabled machines to generate astonishingly realistic texts. While there are many legitimate applications of such models, there is also a rising need to distinguish machine-generated texts from human-written ones (e.g., fake news detection). However, to our best knowledge, there is currently no benchmark environment with datasets and tasks to systematically study the so-called "Turing Test" problem for neural text generation methods. In this work, we present the TuringBench benchmark environment, which is comprised of (1) a dataset with 200K human- or machine-generated samples across 20 labels {Human, GPT-1, GPT-2_small, GPT-2_medium, GPT-2_large, GPT-2_xl, GPT-2_PyTorch, GPT-3, GROVER_base, GROVER_large, GROVER_mega, CTRL, XLM, XLNET_base, XLNET_large, FAIR_wmt19, FAIR_wmt20, TRANSFORMER_XL, PPLM_distil, PPLM_gpt2}, (2) two benchmark tasks -- i.e., Turing Test (TT) and Authorship Attribution (AA), and (3) a website with leaderboards. Our preliminary experimental results using TuringBench show that FAIR_wmt20 and GPT-3 are the current winners, among all language models tested, in generating the most human-like indistinguishable texts with the lowest F1 score by five state-of-the-art TT detection models. The TuringBench is available at: https://turingbench.ist.psu.edu/
    Visually Grounded Reasoning across Languages and Cultures. (arXiv:2109.13238v1 [cs.CL])
    (2 min) The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.
  • cs.CV updates on arXiv.org

    Compound eye inspired flat lensless imaging with spatially-coded Voronoi-Fresnel phase. (arXiv:2109.13703v1 [cs.CV])
    (2 min) Lensless cameras are a class of imaging devices that shrink the physical dimensions to the very close vicinity of the image sensor by integrating flat optics and computational algorithms. Here we report a flat lensless camera with spatially-coded Voronoi-Fresnel phase, partly inspired by biological apposition compound eye, to achieve superior image quality. We propose a design principle of maximizing the information in optics to facilitate the computational reconstruction. By introducing a Fourier domain metric, Modulation Transfer Function volume (MTFv), we devise an optimization framework to guide the optimal design of the optical element. The resulting Voronoi-Fresnel phase features an irregular array of quasi-Centroidal Voronoi cells containing a base first-order Fresnel phase function. We demonstrate and verify the imaging performance with a prototype Voronoi-Fresnel lensless camera on a 1.6-megapixel image sensor in various illumination conditions. The proposed design could benefit the development of compact imaging systems working in extreme physical conditions.
    Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy. (arXiv:2106.01100v2 [eess.IV] UPDATED)
    (3 min) During lung cancer radiotherapy, the position of infrared reflective objects on the chest can be recorded to estimate the tumor location. However, radiotherapy systems usually have a latency inherent to robot control limitations that impedes the radiation delivery precision. Not taking this phenomenon into account may cause unwanted damage to healthy tissues and lead to side effects such as radiation pneumonitis. In this research, we use nine observation records of the three-dimensional position of three external markers on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The sampling frequency is equal to 10Hz and the amplitudes of the recorded trajectories range from 6mm to 40mm in the superior-inferior direction. We forecast the location of each marker simultaneously with a horizon value (the time interval in advance for which the prediction is made) between 0.1s and 2.0s, using a recurrent neural network (RNN) trained with unbiased online recurrent optimization (UORO). We compare its performance with an RNN trained with real-time recurrent learning, least mean squares (LMS), and offline linear regression. Training and cross-validation are performed during the first minute of each sequence. On average, UORO achieves the lowest root-mean-square (RMS) and maximum error, equal respectively to 1.3mm and 8.8mm, with a prediction time per time step lower than 2.8ms (Dell Intel core i9-9900K 3.60Ghz). Linear regression has the lowest RMS error for the horizon values 0.1s and 0.2s, followed by LMS for horizon values between 0.3s and 0.5s, and UORO for horizon values greater than 0.6s.
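    A minimal sketch of the offline linear-regression baseline mentioned above: predicting a marker coordinate a fixed horizon ahead from a sliding window of past samples. The sinusoidal breathing-like signal, window length, and horizon are illustrative, not the study's data.

```python
# Sketch of sliding-window linear regression for forecasting a marker
# coordinate a fixed horizon ahead, as one of the baselines compared above.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0, 60, 0.1)                       # 10 Hz sampling, 60 s of signal
signal = 20 * np.sin(2 * np.pi * t / 4.0) + rng.normal(0, 0.5, t.size)   # mm

window, horizon = 20, 5                         # 2.0 s of history, 0.5 s ahead
X = np.stack([signal[i:i + window] for i in range(len(signal) - window - horizon)])
y = signal[window + horizon:]

n_train = len(X) // 2                           # fit on the first half, test on the rest
coef, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
pred = X[n_train:] @ coef
rms = np.sqrt(np.mean((pred - y[n_train:]) ** 2))
print(f"RMS error: {rms:.2f} mm")
```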
    COV-ELM classifier: An Extreme Learning Machine based identification of COVID-19 using Chest X-Ray Images. (arXiv:2007.08637v6 [eess.IV] UPDATED)
    (3 min) Coronaviruses constitute a family of viruses that gives rise to respiratory diseases. As COVID-19 is highly contagious, early diagnosis of COVID-19 is crucial for an effective treatment strategy. However, the RT-PCR test which is considered to be a gold standard in the diagnosis of COVID-19 suffers from a high false-negative rate. Chest X-ray (CXR) image analysis has emerged as a feasible and effective diagnostic technique towards this objective. In this work, we propose the COVID-19 classification problem as a three-class classification problem to distinguish between COVID-19, normal, and pneumonia classes. We propose a three-stage framework, named COV-ELM. Stage one deals with preprocessing and transformation while stage two deals with feature extraction. These extracted features are passed as an input to the ELM at the third stage, resulting in the identification of COVID-19. The choice of ELM in this work has been motivated by its faster convergence, better generalization capability, and shorter training time in comparison to the conventional gradient-based learning algorithms. As bigger and diverse datasets become available, ELM can be quickly retrained as compared to its gradient-based competitor models. The proposed model achieved a macro average F1-score of 0.95 and the overall sensitivity of 0.94 ± 0.02 at a 95% confidence interval. When compared to state-of-the-art machine learning algorithms, the COV-ELM is found to outperform its competitors in this three-class classification scenario. Further, LIME has been integrated with the proposed COV-ELM model to generate annotated CXR images. The annotations are based on the superpixels that have contributed to distinguish between the different classes. It was observed that the superpixels correspond to the regions of the human lungs that are clinically observed in COVID-19 and Pneumonia cases.
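    For reference, a minimal extreme learning machine in the spirit of the third stage described above: a random, untrained hidden layer followed by a closed-form least-squares readout. The synthetic data and sizes are illustrative; the real model consumes features extracted from chest X-ray images.

```python
# Minimal ELM sketch: random hidden weights, closed-form least-squares readout.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden, n_classes = 300, 50, 200, 3

X = rng.normal(size=(n_samples, n_features))   # stand-in image features
y = rng.integers(0, n_classes, size=n_samples)
T = np.eye(n_classes)[y]                       # one-hot targets

W = rng.normal(size=(n_features, n_hidden))    # random input weights (never trained)
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)                         # hidden activations

beta = np.linalg.pinv(H) @ T                   # closed-form least-squares readout
pred = (H @ beta).argmax(axis=1)
print("train accuracy:", (pred == y).mean())
```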
    Inter Extreme Points Geodesics for Weakly Supervised Image Segmentation. (arXiv:2107.00583v2 [cs.CV] UPDATED)
    (2 min) We introduce InExtremIS, a weakly supervised 3D approach to train a deep image segmentation network using particularly weak train-time annotations: only 6 extreme clicks at the boundary of the objects of interest. Our fully-automatic method is trained end-to-end and does not require any test-time annotations. From the extreme points, 3D bounding boxes are extracted around objects of interest. Then, deep geodesics connecting extreme points are generated to increase the amount of "annotated" voxels within the bounding boxes. Finally, a weakly supervised regularised loss derived from a Conditional Random Field formulation is used to encourage prediction consistency over homogeneous regions. Extensive experiments are performed on a large open dataset for Vestibular Schwannoma segmentation. InExtremIS obtained competitive performance, approaching full supervision and outperforming significantly other weakly supervised techniques based on bounding boxes. Moreover, given a fixed annotation time budget, InExtremIS outperforms full supervision. Our code and data are available online.
    Two-step Domain Adaptation for Mitosis Cell Detection in Histopathology Images. (arXiv:2109.00109v2 [cs.CV] UPDATED)
    (2 min) We propose a two-step domain shift-invariant mitosis cell detection method based on Faster RCNN and a convolutional neural network (CNN). We generate various domain-shifted versions of existing histopathology images using a stain augmentation technique, enabling our method to effectively learn various stain domains and achieve better generalization. The performance of our method is evaluated on the preliminary test data set of the MIDOG-2021 challenge. The experimental results demonstrate that the proposed mitosis detection method can achieve promising performance for domain-shifted histopathology images.
    A Contrastive Learning Approach to Auroral Identification and Classification. (arXiv:2109.13899v1 [cs.CV])
    (2 min) Unsupervised learning algorithms are beginning to achieve accuracies comparable to their supervised counterparts on benchmark computer vision tasks, but their utility for practical applications has not yet been demonstrated. In this work, we present a novel application of unsupervised learning to the task of auroral image classification. Specifically, we modify and adapt the Simple framework for Contrastive Learning of Representations (SimCLR) algorithm to learn representations of auroral images in a recently released auroral image dataset constructed using image data from Time History of Events and Macroscale Interactions during Substorms (THEMIS) all-sky imagers. We demonstrate that (a) simple linear classifiers fit to the learned representations of the images achieve state-of-the-art classification performance, improving the classification accuracy by almost 10 percentage points over the current benchmark; and (b) the learned representations naturally cluster into more clusters than there are manually assigned categories, suggesting that existing categorizations are overly coarse and may obscure important connections between auroral types, near-earth solar wind conditions, and geomagnetic disturbances at the earth's surface. Moreover, our model is much lighter than the previous benchmark on this dataset, requiring fewer than 25% of its number of parameters. Our approach exceeds an established threshold for operational purposes, demonstrating readiness for deployment and utilization.
    Cross-Camera Convolutional Color Constancy. (arXiv:2011.11890v5 [cs.CV] UPDATED)
    (2 min) We present "Cross-Camera Convolutional Color Constancy" (C5), a learning-based method, trained on images from multiple cameras, that accurately estimates a scene's illuminant color from raw images captured by a new camera previously unseen during training. C5 is a hypernetwork-like extension of the convolutional color constancy (CCC) approach: C5 learns to generate the weights of a CCC model that is then evaluated on the input image, with the CCC weights dynamically adapted to different input content. Unlike prior cross-camera color constancy models, which are usually designed to be agnostic to the spectral properties of test-set images from unobserved cameras, C5 approaches this problem through the lens of transductive inference: additional unlabeled images are provided as input to the model at test time, which allows the model to calibrate itself to the spectral properties of the test-set camera during inference. C5 achieves state-of-the-art accuracy for cross-camera color constancy on several datasets, is fast to evaluate (~7 and ~90 ms per image on a GPU or CPU, respectively), and requires little memory (~2 MB), and thus is a practical solution to the problem of calibration-free automatic white balance for mobile photography.
    KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. (arXiv:2109.13410v1 [cs.CV])
    (2 min) For the last few decades, several major subfields of artificial intelligence including computer vision, graphics, and robotics have progressed largely independently from each other. Recently, however, the community has realized that progress towards robust intelligent systems such as self-driving cars requires a concerted effort across the different fields. This motivated us to develop KITTI-360, the successor of the popular KITTI dataset. KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization to facilitate research at the intersection of vision, graphics and robotics. For efficient annotation, we created a tool to label 3D scenes with bounding primitives and developed a model that transfers this information into the 2D image domain, resulting in over 150k semantic and instance annotated images and 1B annotated 3D points. Moreover, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset. KITTI-360 will enable progress at the intersection of these research areas and thus contribute towards solving one of our grand challenges: the development of fully autonomous self-driving systems.
    Unsupervised Diffeomorphic Surface Registration and Non-Linear Modelling. (arXiv:2109.13630v1 [eess.IV])
    (2 min) Registration is an essential tool in image analysis. Deep learning based alternatives have recently become popular, achieving competitive performance at a faster speed. However, many contemporary techniques are limited to volumetric representations, despite increased popularity of 3D surface and shape data in medical image analysis. We propose a one-step registration model for 3D surfaces that internalises a lower dimensional probabilistic deformation model (PDM) using conditional variational autoencoders (CVAE). The deformations are constrained to be diffeomorphic using an exponentiation layer. The one-step registration model is benchmarked against iterative techniques, trading in a slightly lower performance in terms of shape fit for a higher compactness. We experiment with two distance metrics, Chamfer distance (CD) and Sinkhorn divergence (SD), as specific distance functions for surface data in real-world registration scenarios. The internalised deformation model is benchmarked against linear principal component analysis (PCA) achieving competitive results and improved generalisability from lower dimensions.
    When in Doubt: Improving Classification Performance with Alternating Normalization. (arXiv:2109.13449v1 [cs.LG])
    (2 min) We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification. CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution using the predicted class distributions of high-confidence validation examples. CAN is easily applicable to any probabilistic classifier, with minimal computation overhead. We analyze the properties of CAN using simulated experiments, and empirically demonstrate its effectiveness across a diverse set of classification tasks.
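    A minimal numpy sketch of the alternating-normalization idea: the query's predicted distribution is stacked with the predictions for high-confidence validation examples, and column (class) and row (example) normalizations are alternated before reading back the query row. The prior handling, number of iterations, and sharpening exponent are illustrative assumptions rather than the paper's exact procedure.

        # Minimal numpy sketch of the alternating-normalization idea behind CAN.
        import numpy as np

        def can_adjust(query_probs, confident_probs, prior=None, iters=3, alpha=1.0):
            """query_probs: (C,) predicted distribution for one uncertain example.
            confident_probs: (L, C) predictions for high-confidence validation examples.
            prior: (C,) target class distribution (defaults to uniform)."""
            C = query_probs.shape[0]
            prior = np.full(C, 1.0 / C) if prior is None else prior
            A = np.vstack([confident_probs, query_probs[None, :]]).astype(float)
            A = A ** alpha                       # optional sharpening exponent (alpha=1: unchanged)
            for _ in range(iters):
                # Column step: rescale each class column towards the target prior.
                A = A * (prior / A.sum(axis=0, keepdims=True).clip(min=1e-12))
                # Row step: renormalize each example back to a probability distribution.
                A = A / A.sum(axis=1, keepdims=True).clip(min=1e-12)
            return A[-1]                         # re-adjusted distribution for the query

        adjusted = can_adjust(np.array([0.4, 0.35, 0.25]),
                              np.array([[0.9, 0.05, 0.05], [0.1, 0.85, 0.05]]))
        print(adjusted, adjusted.sum())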
    The VVAD-LRS3 Dataset for Visual Voice Activity Detection. (arXiv:2109.13789v1 [cs.CV])
    (2 min) Robots are becoming everyday devices, increasing their interaction with humans. To make human-machine interaction more natural, cognitive features like Visual Voice Activity Detection (VVAD), which can detect whether a person is speaking or not given the visual input of a camera, need to be implemented. Neural networks are state of the art for tasks in image processing, time series prediction, natural language processing and other domains. These networks require large quantities of labeled data. Currently there are not many datasets for the task of VVAD. In this work we created a large-scale dataset called the VVAD-LRS3 dataset, derived by automatic annotations from the LRS3 dataset. The VVAD-LRS3 dataset contains over 44K samples, more than three times the size of the next competitive dataset (WildVVAD). We evaluate different baselines on four kinds of features: facial and lip images, and facial and lip landmark features. With a Convolutional Neural Network Long Short Term Memory (CNN LSTM) on facial images an accuracy of 92% was reached on the test set. A study with humans showed that they reach an accuracy of 87.93% on the test set.
    Visually Grounded Reasoning across Languages and Cultures. (arXiv:2109.13238v1 [cs.CL])
    (2 min) The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.
    Urban Driver: Learning to Drive from Real-world Demonstrations Using Policy Gradients. (arXiv:2109.13333v1 [cs.RO])
    (2 min) In this work we are the first to present an offline policy gradient method for learning imitative policies for complex urban driving from a large corpus of real-world demonstrations. This is achieved by building a differentiable data-driven simulator on top of perception outputs and high-fidelity HD maps of the area. It allows us to synthesize new driving experiences from existing demonstrations using mid-level representations. Using this simulator we then train a policy network in closed-loop employing policy gradients. We train our proposed method on 100 hours of expert demonstrations on urban roads and show that it learns complex driving policies that generalize well and can perform a variety of driving maneuvers. We demonstrate this in simulation as well as deploy our model to self-driving vehicles in the real-world. Our method outperforms previously demonstrated state-of-the-art for urban driving scenarios -- all this without the need for complex state perturbations or collecting additional on-policy data during training. We make code and data publicly available.
    A Strong Baseline for the VIPriors Data-Efficient Image Classification Challenge. (arXiv:2109.13561v1 [cs.CV])
    (2 min) Learning from limited amounts of data is the hallmark of intelligence, requiring strong generalization and abstraction skills. In a machine learning context, data-efficient methods are of high practical importance since data collection and annotation are prohibitively expensive in many domains. Thus, coordinated efforts to foster progress in this area emerged recently, e.g., in the form of dedicated workshops and competitions. Besides a common benchmark, measuring progress requires strong baselines. We present such a strong baseline for data-efficient image classification on the VIPriors challenge dataset, which is a sub-sampled version of ImageNet-1k with 100 images per class. We do not use any methods tailored to data-efficient classification but only standard models and techniques as well as common competition tricks and thorough hyper-parameter tuning. Our baseline achieves 69.7% accuracy on the VIPriors image classification dataset and outperforms 50% of submissions to the VIPriors 2021 challenge.
    Physics-guided Neural Networks (PGNN): An Application in Lake Temperature Modeling. (arXiv:1710.11431v3 [cs.LG] UPDATED)
    (2 min) This paper introduces a framework for combining scientific knowledge of physics-based models with neural networks to advance scientific discovery. This framework, termed physics-guided neural networks (PGNN), leverages the output of physics-based model simulations along with observational features in a hybrid modeling setup to generate predictions using a neural network architecture. Further, this framework uses physics-based loss functions in the learning objective of neural networks to ensure that the model predictions not only show lower errors on the training set but are also scientifically consistent with the known physics on the unlabeled set. We illustrate the effectiveness of PGNN for the problem of lake temperature modeling, where physical relationships between the temperature, density, and depth of water are used to design a physics-based loss function. By using scientific knowledge to guide the construction and learning of neural networks, we are able to show that the proposed framework ensures better generalizability as well as scientific consistency of results. All the code and datasets used in this study have been made available at https://github.com/arkadaw9/PGNN.
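    A short PyTorch sketch of the kind of physics-guided loss described above, assuming the standard empirical temperature-to-density relation for fresh water and a simple monotonicity penalty (density should not decrease with depth); the weighting and exact constraint form are illustrative.

        # Illustrative PyTorch sketch of a physics-guided loss: penalize predicted lake
        # temperatures whose implied water density decreases with depth. The density
        # formula and weighting are assumptions based on the description above.
        import torch

        def water_density(temp_c):
            # Empirical temperature -> density relation for fresh water (kg/m^3).
            return 1000.0 * (1 - (temp_c + 288.9414) * (temp_c - 3.9863) ** 2
                             / (508929.2 * (temp_c + 68.12963)))

        def pgnn_loss(pred_temp, target_temp, lambda_phy=0.1):
            """pred_temp/target_temp: (B, D) temperatures at D depths, shallow -> deep."""
            mse = torch.mean((pred_temp - target_temp) ** 2)
            rho = water_density(pred_temp)
            # Physics constraint: density should be non-decreasing with depth,
            # so rho[:, d] - rho[:, d+1] should not be positive.
            violation = torch.relu(rho[:, :-1] - rho[:, 1:])
            return mse + lambda_phy * violation.mean()

        loss = pgnn_loss(torch.rand(4, 10) * 20, torch.rand(4, 10) * 20)
        print(loss.item())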
    HarrisZ+: Harris Corner Selection for Next-Gen Image Matching Pipelines. (arXiv:2109.12925v2 [cs.CV] UPDATED)
    (2 min) Due to its role in many computer vision tasks, image matching has been actively investigated by researchers, which has led to better and more discriminant feature descriptors and more robust matching strategies, also thanks to the advent of deep learning and the increased computational power of modern hardware. Despite these achievements, the keypoint extraction process at the base of the image matching pipeline has not seen equivalent progress. This paper presents HarrisZ$^{+}$, an upgrade to the HarrisZ corner detector, optimized to synergistically take advantage of the recent improvements of the other steps of the image matching pipeline. HarrisZ$^{+}$ does not only consist of a tuning of the setup parameters, but introduces further refinements to the selection criteria delineated by HarrisZ, providing more, yet still discriminative, keypoints that are better distributed on the image and have higher localization accuracy. The image matching pipeline including HarrisZ$^{+}$, together with the other modern components, obtained state-of-the-art results among the classic image matching pipelines on different recent matching benchmarks, closely following the results of the more recent fully deep end-to-end trainable approaches.
    Neural Human Deformation Transfer. (arXiv:2109.01588v2 [cs.CV] UPDATED)
    (2 min) We consider the problem of human deformation transfer, where the goal is to retarget poses between different characters. Traditional methods that tackle this problem require a clear definition of the pose, and use this definition to transfer poses between characters. In this work, we take a different approach and transform the identity of a character into a new identity without modifying the character's pose. This offers the advantage of not having to define equivalences between 3D human poses, which is not straightforward as poses tend to change depending on the identity of the character performing them, and as their meaning is highly contextual. To achieve the deformation transfer, we propose a neural encoder-decoder architecture where only identity information is encoded and where the decoder is conditioned on the pose. We use pose-independent representations, such as isometry-invariant shape characteristics, to represent identity features. Our model uses these features to supervise the prediction of offsets from the deformed pose to the result of the transfer. We show experimentally that our method outperforms state-of-the-art methods both quantitatively and qualitatively, and generalises better to poses not seen during training. We also introduce a fine-tuning step that allows us to obtain competitive results for extreme identities and to transfer simple clothing.
    Fiducial marker recovery and detection from severely truncated data in navigation assisted spine surgery. (arXiv:2108.13844v3 [eess.IV] UPDATED)
    (2 min) Fiducial markers are commonly used in navigation assisted minimally invasive spine surgery (MISS) and they help transfer image coordinates into real world coordinates. In practice, these markers might be located outside the field-of-view (FOV), due to the limited detector sizes of C-arm cone-beam computed tomography (CBCT) systems used in intraoperative surgeries. As a consequence, reconstructed markers in CBCT volumes suffer from artifacts and have distorted shapes, which sets an obstacle for navigation. In this work, we propose two fiducial marker detection methods: direct detection from distorted markers (direct method) and detection after marker recovery (recovery method). For direct detection from distorted markers in reconstructed volumes, an efficient automatic marker detection method using two neural networks and a conventional circle detection algorithm is proposed. For marker recovery, a task-specific learning strategy is proposed to recover markers from severely truncated data. Afterwards, a conventional marker detection algorithm is applied for position detection. The two methods are evaluated on simulated data and real data, both achieving a marker registration error smaller than 0.2 mm. Our experiments demonstrate that the direct method is capable of detecting distorted markers accurately and the recovery method with task-specific learning has high robustness and generalizability on various data sets.
    UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation. (arXiv:2107.00781v2 [cs.CV] UPDATED)
    (2 min) The Transformer architecture has been successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation. UTNet applies self-attention modules in both the encoder and decoder to capture long-range dependencies at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism along with relative position encoding that reduces the complexity of the self-attention operation significantly from $O(n^2)$ to approximately $O(n)$. A new self-attention decoder is also proposed to recover fine-grained details from the skip connections in the encoder. Our approach addresses the dilemma that Transformers require huge amounts of data to learn vision inductive biases. Our hybrid layer design allows the Transformer to be initialized within convolutional networks without the need for pre-training. We have evaluated UTNet on a multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness compared with state-of-the-art approaches, holding the promise to generalize well to other medical image segmentation tasks.
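    As a rough illustration of how attention cost can be reduced in the way described above, the PyTorch sketch below downsamples the key/value feature map before multi-head attention so the cost scales with the reduced resolution; it is not the UTNet implementation, and all module names and sizes are assumptions.

        # Minimal sketch of an "efficient" self-attention block: keys and values are
        # spatially downsampled so attention cost grows with the reduced resolution.
        import torch
        import torch.nn as nn

        class DownsampledSelfAttention(nn.Module):
            def __init__(self, dim, heads=4, reduce=8):
                super().__init__()
                self.q = nn.Conv2d(dim, dim, 1)
                self.kv = nn.Conv2d(dim, 2 * dim, 1)
                self.pool = nn.AvgPool2d(reduce, reduce)   # shrink the K/V feature map
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.out = nn.Conv2d(dim, dim, 1)

            def forward(self, x):
                B, C, H, W = x.shape
                q = self.q(x).flatten(2).transpose(1, 2)              # (B, H*W, C)
                k, v = self.kv(self.pool(x)).chunk(2, dim=1)          # low-res K, V
                k = k.flatten(2).transpose(1, 2)                      # (B, h*w, C)
                v = v.flatten(2).transpose(1, 2)
                y, _ = self.attn(q, k, v)                             # cost ~ (H*W) * (h*w)
                return x + self.out(y.transpose(1, 2).reshape(B, C, H, W))

        y = DownsampledSelfAttention(32)(torch.rand(2, 32, 64, 64))
        print(y.shape)  # torch.Size([2, 32, 64, 64])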
    Conditional Extreme Value Theory for Open Set Video Domain Adaptation. (arXiv:2109.00522v2 [cs.CV] UPDATED)
    (2 min) With the advent of media streaming, video action recognition has become progressively important for various applications, yet at the high expense of requiring large-scale data labelling. To overcome the problem of expensive data labelling, domain adaptation techniques have been proposed that transfer knowledge from fully labelled data (i.e., source domain) to unlabelled data (i.e., target domain). The majority of video domain adaptation algorithms are proposed for closed-set scenarios in which all the classes are shared among the domains. In this work, we propose an open-set video domain adaptation approach to mitigate the domain discrepancy between the source and target data, allowing the target data to contain additional classes that do not belong to the source domain. Different from previous works, which only focus on improving accuracy for shared classes, we aim to jointly enhance the alignment of shared classes and recognition of unknown samples. Towards this goal, class-conditional extreme value theory is applied to enhance the unknown recognition. Specifically, the entropy values of target samples are modelled as generalised extreme value distributions, which allows separating unknown samples lying in the tail of the distribution. To alleviate the negative transfer issue, weights computed by the distance from the sample entropy to the threshold are leveraged in adversarial learning in the sense that confident source and target samples are aligned, and unconfident samples are pushed away. The proposed method has been thoroughly evaluated on both small-scale and large-scale cross-domain video datasets and achieved state-of-the-art performance.
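    An illustrative sketch of the extreme-value idea: fit a generalized extreme value distribution to the prediction entropies of target samples, flag tail samples as unknown, and derive soft weights from the distance to the threshold. The tail quantile and weighting scheme are assumptions, and the class-conditional aspect is omitted for brevity.

        # Fit a GEV distribution to prediction entropies and flag tail samples as unknown.
        import numpy as np
        from scipy.stats import genextreme

        def split_known_unknown(probs, tail_quantile=0.90):
            """probs: (N, C) softmax outputs for unlabeled target samples."""
            entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
            shape, loc, scale = genextreme.fit(entropy)
            threshold = genextreme.ppf(tail_quantile, shape, loc=loc, scale=scale)
            unknown_mask = entropy > threshold          # tail samples -> likely unknown
            # Soft weights for adversarial alignment: far below the threshold = confident.
            weights = np.clip((threshold - entropy) / (threshold + 1e-12), 0.0, 1.0)
            return unknown_mask, weights

        probs = np.random.dirichlet(np.ones(10), size=200)
        mask, w = split_known_unknown(probs)
        print(mask.sum(), w.mean())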
    Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval. (arXiv:2105.07391v2 [cs.CV] UPDATED)
    (2 min) Visual-semantic embedding is an interesting research topic because it is useful for various tasks, such as visual question answering (VQA), image-text retrieval, image captioning, and scene graph generation. In this paper, we focus on zero-shot image retrieval using sentences as queries and present a survey of the technological trends in this area. First, we provide a comprehensive overview of the history of the technology, starting with a discussion of the early studies of image-to-text matching and how the technology has evolved over time. In addition, a description of the datasets commonly used in experiments and a comparison of the evaluation results of each method are presented. We also introduce the implementation available on github for use in confirming the accuracy of experiments and for further improvement. We hope that this survey paper will encourage researchers to further develop their research on bridging images and languages.
    mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets. (arXiv:2012.08385v2 [cs.CV] UPDATED)
    (2 min) One challenge of object recognition is to generalize to new domains, to more classes and/or to new modalities. This necessitates methods to combine and reuse existing datasets that may belong to different domains, have partial annotations, and/or have different data modalities. This paper formulates this as a multi-source domain adaptation and label unification problem, and proposes a novel method for it. Our method consists of a partially-supervised adaptation stage and a fully-supervised adaptation stage. In the former, partial knowledge is transferred from multiple source domains to the target domain and fused therein. Negative transfer between unmatching label spaces is mitigated via three new modules: domain attention, uncertainty maximization and attention-guided adversarial alignment. In the latter, knowledge is transferred in the unified label space after a label completion process with pseudo-labels. Extensive experiments on three different tasks - image classification, 2D semantic image segmentation, and joint 2D-3D semantic segmentation - show that our method outperforms all competing methods significantly.
    Refractive Geometry for Underwater Domes. (arXiv:2108.06575v2 [cs.CV] UPDATED)
    (2 min) Underwater cameras are typically placed behind glass windows to protect them from the water. Spherical glass, a dome port, is well suited for high water pressures at great depth, allows for a large field of view, and avoids refraction if a pinhole camera is positioned exactly at the sphere's center. Adjusting a real lens perfectly to the dome center is a challenging task, both in terms of how to actually guide the centering process (e.g. visual servoing) and how to measure the alignment quality, but also, how to mechanically perform the alignment. Consequently, such systems are prone to being decentered by some offset, leading to challenging refraction patterns at the sphere that invalidate the pinhole camera model. We show that the overall camera system becomes an axial camera, even for thick domes as used for deep sea exploration and provide a non-iterative way to compute the center of refraction without requiring knowledge of exact air, glass or water properties. We also analyze the refractive geometry at the sphere, looking at effects such as forward- vs. backward decentering, iso-refraction curves and obtain a 6th-degree polynomial equation for forward projection of 3D points in thin domes. We then propose a pure underwater calibration procedure to estimate the decentering from multiple images. This estimate can either be used during adjustment to guide the mechanical position of the lens, or can be considered in photogrammetric underwater applications.
    FSNet: A Failure Detection Framework for Semantic Segmentation. (arXiv:2108.08748v2 [cs.CV] UPDATED)
    (2 min) Semantic segmentation is an important task that helps autonomous vehicles understand their surroundings and navigate safely. During deployment, even the most mature segmentation models are vulnerable to various external factors that can degrade the segmentation performance with potentially catastrophic consequences for the vehicle and its surroundings. To address this issue, we propose a failure detection framework to identify pixel-level misclassification. We do so by exploiting internal features of the segmentation model and training it simultaneously with a failure detection network. During deployment, the failure detector can flag areas in the image where the segmentation model has failed to segment correctly. We evaluate the proposed approach against state-of-the-art methods and achieve 12.30%, 9.46%, and 9.65% performance improvement in the AUPR-Error metric on the Cityscapes, BDD100K, and Mapillary semantic segmentation datasets.
    Generalisable and distinctive 3D local deep descriptors for point cloud registration. (arXiv:2105.10382v2 [cs.CV] UPDATED)
    (2 min) An effective 3D descriptor should be invariant to different geometric transformations, such as scale and rotation, repeatable in the case of occlusions and clutter, and generalisable in different contexts when data is captured with different sensors. We present a simple yet effective method to learn generalisable and distinctive 3D local descriptors that can be used to register point clouds captured in different contexts with different sensors. Point cloud patches are extracted, canonicalised with respect to their local reference frame, and encoded into scale- and rotation-invariant compact descriptors by a point permutation-invariant deep neural network. Our descriptors can effectively generalise across sensor modalities from locally and randomly sampled points. We evaluate and compare our descriptors with alternative handcrafted and deep learning-based descriptors on several indoor and outdoor datasets reconstructed using both RGBD sensors and laser scanners. Our descriptors outperform most recent descriptors by a large margin in terms of generalisation, and also become the state of the art in benchmarks where training and testing are performed in the same scenarios.
    A Simple and Strong Baseline for Universal Targeted Attacks on Siamese Visual Tracking. (arXiv:2105.02480v2 [cs.CV] UPDATED)
    (2 min) Siamese trackers have recently been shown to be vulnerable to adversarial attacks. However, existing attack methods craft the perturbations for each video independently, which comes at a non-negligible computational cost. In this paper, we show the existence of universal perturbations that make a targeted attack, e.g., forcing a tracker to follow the ground-truth trajectory with specified offsets, both video-agnostic and free from network inference. Specifically, we attack a tracker by adding a universal imperceptible perturbation to the template image and adding a fake target, i.e., a small universal adversarial patch, into the search images adhering to the predefined trajectory, so that the tracker outputs the location and size of the fake target instead of the real target. Our approach allows a novel video to be perturbed at no additional cost beyond simple addition operations, without requiring gradient optimization or network inference. Experimental results on several datasets demonstrate that our approach can effectively fool Siamese trackers in a targeted-attack manner. We show that the proposed perturbations are not only universal across videos, but also generalize well across different trackers. Such perturbations are therefore doubly universal, both with respect to the data and the network architectures. We will make our code publicly available.
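    A minimal numpy sketch of how such a pre-computed universal perturbation and patch could be applied at test time: the perturbation is added to the template crop and the patch is pasted into each search crop along a predefined fake trajectory. Crop sizes, the perturbation budget, and the clipping are illustrative assumptions.

        # Apply a universal perturbation to the template and paste a universal patch
        # into the search image at the fake-trajectory position for this frame.
        import numpy as np

        def attack_frame(template, search, uap, patch, center, eps=8):
            """template: (Ht, Wt, 3) and search: (Hs, Ws, 3) uint8 crops used by the tracker.
            uap: universal perturbation for the template; patch: small universal patch;
            center: (y, x) position of the fake target in this frame's search region."""
            adv_template = np.clip(template.astype(np.int16) + np.clip(uap, -eps, eps),
                                   0, 255).astype(np.uint8)
            adv_search = search.copy()
            ph, pw = patch.shape[:2]
            y0 = int(np.clip(center[0] - ph // 2, 0, search.shape[0] - ph))
            x0 = int(np.clip(center[1] - pw // 2, 0, search.shape[1] - pw))
            adv_search[y0:y0 + ph, x0:x0 + pw] = patch   # paste fake target (addition only)
            return adv_template, adv_search

        tpl, srch = np.zeros((127, 127, 3), np.uint8), np.zeros((255, 255, 3), np.uint8)
        uap = np.random.randint(-8, 9, tpl.shape, dtype=np.int16)
        patch = np.full((24, 24, 3), 255, np.uint8)
        adv_t, adv_s = attack_frame(tpl, srch, uap, patch, center=(100, 140))
        print(adv_t.shape, adv_s.shape)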
    3D Hand Pose and Shape Estimation from RGB Images for Improved Keypoint-Based Hand-Gesture Recognition. (arXiv:2109.13879v1 [cs.CV])
    (2 min) Estimating the 3D hand pose from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand-gesture recognition. Currently, good estimations can be computed starting from single RGB images, especially when forcing the system to also consider, through a multi-task learning approach, the hand shape when the pose is determined. However, when addressing the aforementioned real-life tasks, performances can drop considerably depending on the hand representation, thus suggesting that stable descriptions are required to achieve satisfactory results. As a consequence, in this paper we present a keypoint-based end-to-end framework for 3D hand pose and shape estimation, and successfully apply it to the hand-gesture recognition task as a study case. Specifically, after a pre-processing step where the images are normalized, the proposed pipeline comprises a multi-task semantic feature extractor generating 2D heatmaps and hand silhouettes from RGB images; a viewpoint encoder predicting hand and camera view parameters; a stable hand estimator producing the 3D hand pose and shape; and a loss function designed to jointly guide all of the components during the learning phase. To assess the proposed framework, tests were performed on a 3D pose and shape estimation benchmark dataset, obtaining state-of-the-art performances. What is more, the devised system was also evaluated on two hand-gesture recognition benchmark datasets, where the framework significantly outperforms other keypoint-based approaches; indicating that the presented method is an effective solution able to generate stable 3D estimates for the hand pose and shape.
    Deep graph matching meets mixed-integer linear programming: Relax at your own risk? (arXiv:2108.00394v3 [cs.CV] UPDATED)
    (2 min) Graph matching is an important problem that has received widespread attention, especially in the field of computer vision. Recently, state-of-the-art methods seek to incorporate graph matching with deep learning. However, there is no research explaining what role the graph matching algorithm plays in the model. Therefore, we propose an approach integrating a MILP formulation of the graph matching problem. This formulation is solved to optimality and provides an inherent baseline. Meanwhile, similar approaches are derived by relaxing the optimality guarantee of the graph matching solver and by introducing a quality level. This quality level controls the quality of the solutions provided by the graph matching solver. In addition, several relaxations of the graph matching problem are put to the test. Our experimental evaluation gives several theoretical insights and guides the direction of deep graph matching methods.
    Anomaly Detection With Partitioning Overfitting Autoencoder Ensembles. (arXiv:2009.02755v8 [cs.LG] UPDATED)
    (2 min) In this paper, we propose POTATOES (Partitioning OverfiTting AuTOencoder EnSemble), a new method for unsupervised outlier detection (UOD). More precisely, given any autoencoder for UOD, this technique can be used to improve its accuracy while at the same time removing the burden of tuning its regularization. The idea is to not regularize at all, but to rather randomly partition the data into sufficiently many equally sized parts, overfit each part with its own autoencoder, and to use the maximum over all autoencoder reconstruction errors as the anomaly score. We apply our model to various realistic datasets and show that if the set of inliers is dense enough, our method indeed improves the UOD performance of a given autoencoder significantly. For reproducibility, the code is made available on github so the reader can recreate the results in this paper as well as apply the method to other autoencoders and datasets.
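    A compact sketch of the partition-overfit-ensemble idea using scikit-learn: the data is split into equally sized parts, one deliberately unregularized autoencoder is overfit per part, and each point's anomaly score is the maximum reconstruction error over the ensemble. The MLP autoencoder below is a stand-in, not the paper's model.

        # Partition the data, overfit one autoencoder per part, and score each point by
        # its maximum reconstruction error over the ensemble.
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        def potatoes_scores(X, k=4, seed=0):
            rng = np.random.default_rng(seed)
            parts = np.array_split(rng.permutation(len(X)), k)
            models = []
            for idx in parts:
                ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000,
                                  alpha=0.0, random_state=seed)   # no regularization
                ae.fit(X[idx], X[idx])                            # deliberately overfit
                models.append(ae)
            errors = np.stack([((m.predict(X) - X) ** 2).mean(axis=1) for m in models])
            return errors.max(axis=0)                             # anomaly score per point

        X = np.vstack([np.random.randn(200, 5), np.random.randn(5, 5) * 6 + 10])
        scores = potatoes_scores(X)
        print(scores[-5:].mean() > scores[:200].mean())           # outliers should score higher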
    Investigating Bi-Level Optimization for Learning and Vision from a Unified Perspective: A Survey and Beyond. (arXiv:2101.11517v3 [cs.LG] UPDATED)
    (2 min) Bi-Level Optimization (BLO) originated in the area of economic game theory and was then introduced into the optimization community. BLO is able to handle problems with a hierarchical structure, involving two levels of optimization tasks, where one task is nested inside the other. In machine learning and computer vision fields, despite the different motivations and mechanisms, a lot of complex problems, such as hyper-parameter optimization, multi-task and meta-learning, neural architecture search, adversarial learning and deep reinforcement learning, actually all contain a series of closely related subproblems. In this paper, we first uniformly express these complex learning and vision problems from the perspective of BLO. Then we construct a best-response-based single-level reformulation and establish a unified algorithmic framework to understand and formulate mainstream gradient-based BLO methodologies, covering aspects ranging from fundamental automatic differentiation schemes to various accelerations, simplifications, extensions and their convergence and complexity properties. Last but not least, we discuss the potentials of our unified BLO framework for designing new algorithms and point out some promising directions for future research.
    Towards clinically applicable automated aneurysm detection in TOF-MRA: weak labels, anatomical knowledge, and open data. (arXiv:2103.06168v3 [eess.IV] UPDATED)
    (3 min) Purpose: 1) Develop a deep learning algorithm for brain aneurysm detection exploiting weak labels and prior anatomical knowledge. 2) Describe and release the largest Time-Of-Flight Magnetic Resonance Angiography (TOF-MRA) dataset to the community. Materials and Methods: In this retrospective study we retrieved TOF-MRA images of 284 subjects (170 females) scanned between 2010 and 2015. Out of these, 157 are patients with a total of 198 aneurysms, while 127 are controls. We used spherical weak labels as detection ground truth, thus making data annotation, a major bottleneck for medical AI, noticeably faster. Since aneurysms mainly occur in specific locations, we built our deep neural network leveraging the anatomy of the brain vasculature. To assess model robustness, we participated in the first public challenge for TOF-MRA data (93 patients, 20 controls, 125 aneurysms). We stratified results according to aneurysm risk-of-rupture, location, and size. Results: Our network achieves a sensitivity of 80% on the in-house data, with False Positive (FP) rate of 1.2 per patient. On the public challenge data, sensitivity was 68% (FP rate = 2.5), ranking 4th/16 on the open leaderboard. We found no significant difference in sensitivity between risk groups (p = 0.75), locations (p = 0.72), or sizes (p = 0.15). Conclusion: Competitive results can be obtained using fast weak labels and anatomical knowledge for automated aneurysm detection. Our open-source code and open access dataset can foster reproducibility, and bring us closer to clinical application.
    Self-Supervised Learning in Multi-Task Graphs through Iterative Consensus Shift. (arXiv:2103.14417v2 [cs.LG] UPDATED)
    (2 min) The human ability to synchronize the feedback from all their senses inspired recent works in multi-task and multi-modal learning. While these works rely on expensive supervision, our multi-task graph requires only pseudo-labels from expert models. Every graph node represents a task, and each edge learns transformations between tasks. Once initialized, the graph learns in a self-supervised manner, based on a novel consensus shift algorithm that intelligently exploits the agreement between graph pathways to generate new pseudo-labels for the next learning cycle. We demonstrate significant improvement from one unsupervised learning iteration to the next, outperforming related recent methods in extensive multi-task learning experiments on two challenging datasets.
    ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering. (arXiv:2011.10331v3 [cs.CV] UPDATED)
    (2 min) Multi-view clustering has wide applications in many image processing scenarios. In these scenarios, original image data often contain missing instances and noises, which are ignored by most multi-view clustering methods. However, missing instances may make these methods difficult to use directly and noises will lead to unreliable clustering results. In this paper, we propose a novel Auto-weighted Noisy and Incomplete Multi-view Clustering framework (ANIMC) via a soft auto-weighted strategy and a doubly soft regularized regression model. Firstly, by designing adaptive semi-regularized nonnegative matrix factorization (adaptive semi-RNMF), the soft auto-weighted strategy assigns a proper weight to each view and adds a soft boundary to balance the influence of noises and incompleteness. Secondly, by proposing the $\theta$-norm, the doubly soft regularized regression model adjusts the sparsity of our model by choosing different $\theta$. Compared with existing methods, ANIMC has three unique advantages: 1) it is a soft algorithm to adjust our framework in different scenarios, thereby improving its generalization ability; 2) it automatically learns a proper weight for each view, thereby reducing the influence of noises; 3) it performs doubly soft regularized regression that aligns the same instances in different views, thereby decreasing the impact of missing instances. Extensive experimental results demonstrate its superior advantages over other state-of-the-art methods.
    Robust Retinal Vessel Segmentation from a Data Augmentation Perspective. (arXiv:2007.15883v2 [eess.IV] UPDATED)
    (2 min) Retinal vessel segmentation is a fundamental step in screening, diagnosis, and treatment of various cardiovascular and ophthalmic diseases. Robustness is one of the most critical requirements for practical utilization, since the test images may be captured using different fundus cameras, or be affected by various pathological changes. We investigate this problem from a data augmentation perspective, with the merits of no additional training data or inference time. In this paper, we propose two new data augmentation modules, namely, channel-wise random gamma correction and channel-wise random vessel augmentation. Given a training color fundus image, the former applies random gamma correction on each color channel of the entire image, while the latter intentionally enhances or decreases only the fine-grained blood vessel regions using morphological transformations. With the additional training samples generated by applying these two modules sequentially, a model could learn more invariant and discriminating features against both global and local disturbances. Experimental results on both real-world and synthetic datasets demonstrate that our method can improve the performance and robustness of a classic convolutional neural network architecture. The source code is available at https://github.com/PaddlePaddle/Research/tree/master/CV/robust_vessel_segmentation.
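    For the first augmentation module, a minimal numpy implementation of channel-wise random gamma correction might look as follows; the gamma range is an assumption, not the paper's setting.

        # Illustrative channel-wise random gamma correction for a color fundus image.
        import numpy as np

        def channelwise_random_gamma(img, gamma_range=(0.5, 2.0), rng=None):
            """img: (H, W, 3) uint8 RGB image; a different gamma is drawn per channel."""
            rng = np.random.default_rng() if rng is None else rng
            out = img.astype(np.float32) / 255.0
            for c in range(out.shape[2]):
                gamma = rng.uniform(*gamma_range)
                out[..., c] = out[..., c] ** gamma     # per-channel gamma correction
            return (out * 255.0).clip(0, 255).astype(np.uint8)

        augmented = channelwise_random_gamma(np.random.randint(0, 256, (64, 64, 3), np.uint8))
        print(augmented.shape, augmented.dtype)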
    Head2Head++: Deep Facial Attributes Re-Targeting. (arXiv:2006.10199v2 [cs.CV] UPDATED)
    (2 min) Facial video re-targeting is a challenging problem aiming to modify the facial attributes of a target subject in a seamless manner by a driving monocular sequence. We leverage the 3D geometry of faces and Generative Adversarial Networks (GANs) to design a novel deep learning architecture for the task of facial and head reenactment. Our method is different to purely 3D model-based approaches, or recent image-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames. We manage to capture the complex non-rigid facial motion from the driving monocular performances and synthesise temporally consistent videos, with the aid of a sequential Generator and an ad-hoc Dynamics Discriminator network. We conduct a comprehensive set of quantitative and qualitative tests and demonstrate experimentally that our proposed method can successfully transfer facial expressions, head pose and eye gaze from a source video to a target subject, in a photo-realistic and faithful fashion, better than other state-of-the-art methods. Most importantly, our system performs end-to-end reenactment in nearly real-time speed (18 fps).
    $f$-Cal: Calibrated aleatoric uncertainty estimation from neural networks for robot perception. (arXiv:2109.13913v1 [cs.CV])
    (2 min) While modern deep neural networks are performant perception modules, performance (accuracy) alone is insufficient, particularly for safety-critical robotic applications such as self-driving vehicles. Robot autonomy stacks also require these otherwise blackbox models to produce reliable and calibrated measures of confidence on their predictions. Existing approaches estimate uncertainty from these neural network perception stacks by modifying network architectures, inference procedure, or loss functions. However, in general, these methods lack calibration, meaning that the predictive uncertainties do not faithfully represent the true underlying uncertainties (process noise). Our key insight is that calibration is only achieved by imposing constraints across multiple examples, such as those in a mini-batch; as opposed to existing approaches which only impose constraints per-sample, often leading to overconfident (thus miscalibrated) uncertainty estimates. By enforcing the distribution of outputs of a neural network to resemble a target distribution by minimizing an $f$-divergence, we obtain significantly better-calibrated models compared to prior approaches. Our approach, $f$-Cal, outperforms existing uncertainty calibration approaches on robot perception tasks such as object detection and monocular depth estimation over multiple real-world benchmarks.
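    A simplified PyTorch sketch of the batch-level calibration idea: residuals are standardized by their predicted uncertainties, and the divergence between their empirical batch distribution and a standard normal is penalized (here via a closed-form Gaussian-moment KL). The actual f-Cal objective may use a different divergence and target distribution.

        # Batch-level calibration penalty on standardized residuals, added to a Gaussian NLL.
        import torch

        def fcal_style_loss(pred_mean, pred_std, target, lam=1.0):
            """pred_mean/pred_std/target: (B,) per-sample predictions, uncertainties, labels."""
            nll = 0.5 * (((target - pred_mean) / pred_std) ** 2 + 2 * torch.log(pred_std)).mean()
            z = (target - pred_mean) / pred_std          # standardized residuals over the batch
            mu, var = z.mean(), z.var(unbiased=False) + 1e-8
            # KL( N(mu, var) || N(0, 1) ) in closed form: distribution-level constraint.
            kl = 0.5 * (var + mu ** 2 - 1.0 - torch.log(var))
            return nll + lam * kl

        loss = fcal_style_loss(torch.randn(32), torch.rand(32) + 0.1, torch.randn(32))
        print(loss.item())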
    Degenerative Adversarial NeuroImage Nets for Brain Scan Simulations: Application in Ageing and Dementia. (arXiv:1912.01526v4 [eess.IV] UPDATED)
    (3 min) Accurate and realistic simulation of high-dimensional medical images has become an important research area relevant to many AI-enabled healthcare applications. However, current state-of-the-art approaches lack the ability to produce satisfactory high-resolution and accurate subject-specific images. In this work, we present a deep learning framework, namely 4D-Degenerative Adversarial NeuroImage Net (4D-DANI-Net), to generate high-resolution, longitudinal MRI scans that mimic subject-specific neurodegeneration in ageing and dementia. 4D-DANI-Net is a modular framework based on adversarial training and a set of novel spatiotemporal, biologically-informed constraints. To ensure efficient training and overcome memory limitations affecting such high-dimensional problems, we rely on three key technological advances: i) a new 3D training consistency mechanism called Profile Weight Functions (PWFs), ii) a 3D super-resolution module and iii) a transfer learning strategy to fine-tune the system for a given individual. To evaluate our approach, we trained the framework on 9852 T1-weighted MRI scans from 876 participants in the Alzheimer's Disease Neuroimaging Initiative dataset and held out a separate test set of 1283 MRI scans from 170 participants for quantitative and qualitative assessment of the personalised time series of synthetic images. We performed three evaluations: i) image quality assessment; ii) quantifying the accuracy of regional brain volumes over and above benchmark models; and iii) quantifying visual perception of the synthetic images by medical experts. Overall, both quantitative and qualitative results show that 4D-DANI-Net produces realistic, low-artefact, personalised time series of synthetic T1 MRI that outperforms benchmark models.
    Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding. (arXiv:1911.00232v4 [cs.CV] UPDATED)
    (2 min) Videos capture events that typically contain multiple sequential, and simultaneous, actions even in the span of only a few seconds. However, most large-scale datasets built to train models for action recognition in video only provide a single label per video. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled and do not learn the full spectrum of information present in each video in training. Towards this goal, we present the Multi-Moments in Time dataset (M-MiT) which includes over two million action labels for over one million three second videos. This multi-label dataset introduces novel challenges on how to train and analyze models for multi-action detection. Here, we present baseline results for multi-action recognition using loss functions adapted for long tail multi-label learning, provide improved methods for visualizing and interpreting models trained for multi-label action detection and show the strength of transferring models trained on M-MiT to smaller datasets.
    Unsolved Problems in ML Safety. (arXiv:2109.13916v1 [cs.LG])
    (2 min) Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), steering ML systems ("Alignment"), and reducing risks to how ML systems are handled ("External Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
    ViT Cane: Visual Assistant for the Visually Impaired. (arXiv:2109.13857v1 [cs.CV])
    (2 min) Blind and visually challenged people face multiple issues when navigating the world independently. Some of these challenges include finding the shortest path to a destination and detecting obstacles from a distance. To tackle this issue, this paper proposes ViT Cane, which leverages a vision transformer model in order to detect obstacles in real time. Our entire system consists of a Pi Camera Module v2, a Raspberry Pi 4B with 8 GB RAM and 4 motors. Based on tactile input using the 4 motors, the obstacle detection model is highly efficient in helping the visually impaired navigate unknown terrain and is designed to be easily reproduced. The paper discusses the utility of a vision transformer model in comparison to other CNN-based models for this specific application. Through rigorous testing, the proposed obstacle detection model has achieved higher performance on the Common Objects in Context (COCO) dataset than its CNN counterpart. Comprehensive field tests were conducted to verify the effectiveness of our system for holistic indoor understanding and obstacle avoidance.
    PFENet++: Boosting Few-shot Semantic Segmentation with the Noise-filtered Context-aware Prior Mask. (arXiv:2109.13788v1 [cs.CV])
    (2 min) In this work, we revisit the prior mask guidance proposed in "Prior Guided Feature Enrichment Network for Few-Shot Segmentation". The prior mask serves as an indicator that highlights the regions of interest of unseen categories, and it is effective in achieving better performance on different frameworks of recent studies. However, the current method directly takes the maximum element-to-element correspondence between the query and support features to indicate the probability of belonging to the target class, thus the broader contextual information is seldom exploited during the prior mask generation. To address this issue, first, we propose the Context-aware Prior Mask (CAPM) that leverages additional nearby semantic cues for better locating the objects in query images. Second, since the maximum correlation value is vulnerable to noisy features, we go one step further by incorporating a lightweight Noise Suppression Module (NSM) to screen out the unnecessary responses, yielding high-quality masks for providing the prior knowledge. Both contributions are experimentally shown to have substantial practical merit, and the new model named PFENet++ significantly outperforms the baseline PFENet as well as all other competitors on the three challenging benchmarks PASCAL-5$^i$, COCO-20$^i$ and FSS-1000. The new state-of-the-art performance is achieved without compromising efficiency, manifesting the potential for being a new strong baseline in few-shot semantic segmentation. Our code will be available at https://github.com/dvlab-research/PFENet++.
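    For reference, a PyTorch sketch of the baseline prior-mask computation this work builds on: each query location receives the maximum cosine similarity to any support foreground feature, and the map is then rescaled to [0, 1]. The proposed CAPM and NSM refinements are not reproduced here.

        # Baseline prior mask: per-query-pixel maximum cosine similarity to support foreground.
        import torch
        import torch.nn.functional as F

        def prior_mask(query_feat, support_feat, support_mask):
            """query_feat: (B, C, Hq, Wq); support_feat: (B, C, Hs, Ws);
            support_mask: (B, 1, Hs, Ws) binary foreground mask."""
            B, C, Hq, Wq = query_feat.shape
            q = F.normalize(query_feat.flatten(2), dim=1)                     # (B, C, Nq)
            s = F.normalize((support_feat * support_mask).flatten(2), dim=1)  # (B, C, Ns)
            sim = torch.bmm(q.transpose(1, 2), s)                             # (B, Nq, Ns)
            prior = sim.max(dim=2).values.view(B, 1, Hq, Wq)                  # max correspondence
            # Rescale to [0, 1] per image so the prior can be concatenated as a feature map.
            pmin = prior.amin(dim=(2, 3), keepdim=True)
            pmax = prior.amax(dim=(2, 3), keepdim=True)
            return (prior - pmin) / (pmax - pmin + 1e-7)

        p = prior_mask(torch.rand(2, 64, 32, 32), torch.rand(2, 64, 32, 32),
                       (torch.rand(2, 1, 32, 32) > 0.5).float())
        print(p.shape)  # torch.Size([2, 1, 32, 32])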
    Real-Time Glaucoma Detection from Digital Fundus Images using Self-ONNs. (arXiv:2109.13604v1 [eess.IV])
    (2 min) Glaucoma leads to permanent vision disability by damaging the optic nerve that transmits visual images to the brain. The fact that glaucoma does not show any symptoms as it progresses and cannot be stopped at the later stages makes it critical to diagnose it in its early stages. Although various deep learning models have been applied for detecting glaucoma from digital fundus images, due to the scarcity of labeled data, their generalization performance was limited along with high computational complexity and special hardware requirements. In this study, compact Self-Organized Operational Neural Networks (Self-ONNs) are proposed for early detection of glaucoma in fundus images and their performance is compared against the conventional (deep) Convolutional Neural Networks (CNNs) over three benchmark datasets: ACRIMA, RIM-ONE, and ESOGU. The experimental results demonstrate that Self-ONNs not only achieve superior detection performance but can also significantly reduce the computational complexity, making them a potentially suitable network model for biomedical datasets, especially when data is scarce.
    PDC-Net+: Enhanced Probabilistic Dense Correspondence Network. (arXiv:2109.13912v1 [cs.CV])
    (2 min) Establishing robust and accurate correspondences between a pair of images is a long-standing computer vision problem with numerous applications. While classically dominated by sparse methods, emerging dense approaches offer a compelling alternative paradigm that avoids the keypoint detection step. However, dense flow estimation is often inaccurate in the case of large displacements, occlusions, or homogeneous regions. In order to apply dense methods to real-world applications, such as pose estimation, image manipulation, or 3D reconstruction, it is therefore crucial to estimate the confidence of the predicted matches. We propose the Enhanced Probabilistic Dense Correspondence Network, PDC-Net+, capable of estimating accurate dense correspondences along with a reliable confidence map. We develop a flexible probabilistic approach that jointly learns the flow prediction and its uncertainty. In particular, we parametrize the predictive distribution as a constrained mixture model, ensuring better modelling of both accurate flow predictions and outliers. Moreover, we develop an architecture and an enhanced training strategy tailored for robust and generalizable uncertainty prediction in the context of self-supervised training. Our approach obtains state-of-the-art results on multiple challenging geometric matching and optical flow datasets. We further validate the usefulness of our probabilistic confidence estimation for the tasks of pose estimation, 3D reconstruction, image-based localization, and image retrieval. Code and models are available at https://github.com/PruneTruong/DenseMatching.
    Turning old models fashion again: Recycling classical CNN networks using the Lattice Transformation. (arXiv:2109.13885v1 [cs.CV])
    (2 min) In the early 1990s, the first signs of life of the CNN era appeared: LeCun et al. proposed a CNN model trained by the backpropagation algorithm to classify low-resolution images of handwritten digits. Undoubtedly, it was a breakthrough in the field of computer vision. But with the rise of other classification methods, it fell out of fashion. That was until 2012, when Krizhevsky et al. revived the interest in CNNs by exhibiting considerably higher image classification accuracy on the ImageNet challenge. Since then, the complexity of architectures has been increasing exponentially and many structures are rapidly becoming obsolete. Using multistream networks as a base and the feature infusion precept, we explore the proposed LCNN cross-fusion strategy to use the backbones of former state-of-the-art networks on image classification in order to discover whether the technique is able to put these designs back in the game. In this paper, we show that we can obtain an accuracy increase of up to 63.21% on the NORB dataset compared with the original structure. However, no technique is definitive. While our goal is to try to reuse previous state-of-the-art architectures with few modifications, we also expose the disadvantages of our explored strategy.
    Introduce the Result Into Self-Attention. (arXiv:2109.13860v1 [cs.CV])
    (2 min) Traditional self-attention mechanisms in convolutional networks tend to use only the output of the previous layer as input to the attention network, such as SENet, CBAM, etc. In this paper, we propose a new attention modification method that tries to obtain the output of the classification network in advance and use it as a part of the input to the attention network. We used the auxiliary classifier proposed in GoogLeNet to obtain the results in advance and pass them into the attention networks. We added this mechanism to SE-ResNet in our experiments and achieved a classification accuracy improvement of up to 1.94% on CIFAR-100.
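    A small PyTorch sketch of the idea: an SE-style channel-attention block that conditions its gates on both globally pooled features and the auxiliary classifier's early prediction. The fusion by concatenation, the softmax on the auxiliary logits, and the layer sizes are assumptions for illustration.

        # SE-style channel attention conditioned on an auxiliary classifier's prediction.
        import torch
        import torch.nn as nn

        class ResultConditionedSE(nn.Module):
            def __init__(self, channels, num_classes, reduction=16):
                super().__init__()
                self.fc = nn.Sequential(
                    nn.Linear(channels + num_classes, channels // reduction), nn.ReLU(),
                    nn.Linear(channels // reduction, channels), nn.Sigmoid(),
                )

            def forward(self, x, aux_logits):
                # x: (B, C, H, W) feature map; aux_logits: (B, num_classes) from the
                # auxiliary classifier attached to an earlier stage of the network.
                pooled = x.mean(dim=(2, 3))                          # global average pooling
                cond = torch.cat([pooled, aux_logits.softmax(dim=1)], dim=1)
                scale = self.fc(cond).unsqueeze(-1).unsqueeze(-1)    # per-channel gates
                return x * scale

        se = ResultConditionedSE(channels=64, num_classes=100)
        out = se(torch.rand(2, 64, 8, 8), torch.rand(2, 100))
        print(out.shape)  # torch.Size([2, 64, 8, 8])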
    Image scaling by de la Vallée-Poussin filtered interpolation. (arXiv:2109.13897v1 [cs.CV])
    (2 min) We present a new image scaling method both for downscaling and upscaling, running with any scale factor or desired size. It is based on the sampling of an approximating bivariate polynomial, which globally interpolates the data and is defined by a filter of de la Vallée Poussin type whose action ray is suitably regulated to improve the approximation. The method has been tested on a significant number of different image datasets. The results are evaluated in qualitative and quantitative terms and compared with other available competitive methods. The perceived quality of the resulting scaled images is such that important details are preserved, and the appearance of artifacts is low. Very high quality-measure values in downscaling and competitive ones in upscaling evidence the effectiveness of the method. Good visual quality, limited computational effort, and moderate memory demands make the method suitable for real-world applications.
    Not Color Blind: AI Predicts Racial Identity from Black and White Retinal Vessel Segmentations. (arXiv:2109.13845v1 [cs.CV])
    (2 min) Background: Artificial intelligence (AI) may demonstrate racial bias when skin or choroidal pigmentation is present in medical images. Recent studies have shown that convolutional neural networks (CNNs) can predict race from images that were not previously thought to contain race-specific features. We evaluate whether grayscale retinal vessel maps (RVMs) of patients screened for retinopathy of prematurity (ROP) contain race-specific features. Methods: 4095 retinal fundus images (RFIs) were collected from 245 Black and White infants. A U-Net generated RVMs from RFIs, which were subsequently thresholded, binarized, or skeletonized. To determine whether RVM differences between Black and White eyes were physiological, CNNs were trained to predict race from color RFIs, raw RVMs, and thresholded, binarized, or skeletonized RVMs. Area under the precision-recall curve (AUC-PR) was evaluated. Findings: CNNs predicted race from RFIs near perfectly (image-level AUC-PR: 0.999, subject-level AUC-PR: 1.000). Raw RVMs were almost as informative as color RFIs (image-level AUC-PR: 0.938, subject-level AUC-PR: 0.995). Ultimately, CNNs were able to detect whether RFIs or RVMs were from Black or White babies, regardless of whether images contained color, vessel segmentation brightness differences were nullified, or vessel segmentation widths were normalized. Interpretation: AI can detect race from grayscale RVMs that were not thought to contain racial information. Two potential explanations for these findings are that: retinal vessels physiologically differ between Black and White babies or the U-Net segments the retinal vasculature differently for various fundus pigmentations. Either way, the implications remain the same: AI algorithms have potential to demonstrate racial bias in practice, even when preliminary attempts to remove such information from the underlying images appear to be successful.
    Visual resemblance and communicative context constrain the emergence of graphical conventions. (arXiv:2109.13861v1 [cs.CV])
    (2 min) From photorealistic sketches to schematic diagrams, drawing provides a versatile medium for communicating about the visual world. How do images spanning such a broad range of appearances reliably convey meaning? Do viewers understand drawings based solely on their ability to resemble the entities they refer to (i.e., as images), or do they understand drawings based on shared but arbitrary associations with these entities (i.e., as symbols)? In this paper, we provide evidence for a cognitive account of pictorial meaning in which both visual and social information is integrated to support effective visual communication. To evaluate this account, we used a communication task where pairs of participants used drawings to repeatedly communicate the identity of a target object among multiple distractor objects. We manipulated social cues across three experiments and a full internal replication, finding pairs of participants develop referent-specific and interaction-specific strategies for communicating more efficiently over time, going beyond what could be explained by either task practice or a pure resemblance-based account alone. Using a combination of model-based image analyses and crowdsourced sketch annotations, we further determined that drawings did not drift toward arbitrariness, as predicted by a pure convention-based account, but systematically preserved those visual features that were most distinctive of the target object. Taken together, these findings advance theories of pictorial meaning and have implications for how successful graphical conventions emerge via complex interactions between visual perception, communicative experience, and social context.
    3N-GAN: Semi-Supervised Classification of X-Ray Images with a 3-Player Adversarial Framework. (arXiv:2109.13862v1 [eess.IV])
    (2 min) The success of deep learning for medical imaging tasks, such as classification, is heavily reliant on the availability of large-scale datasets. However, acquiring datasets with large quantities of labeled data is challenging, as labeling is expensive and time-consuming. Semi-supervised learning (SSL) is a growing alternative to fully-supervised learning, but requires unlabeled samples for training. In medical imaging, many datasets lack unlabeled data entirely, so SSL can't be conventionally utilized. We propose 3N-GAN, or 3 Network Generative Adversarial Networks, to perform semi-supervised classification of medical images in fully-supervised settings. We incorporate a classifier into the adversarial relationship such that the generator trains adversarially against both the classifier and discriminator. Our preliminary results show improved classification performance and GAN generations over various algorithms. Our work can seamlessly integrate with numerous other medical imaging model architectures and SSL methods for greater performance.
    Stable training of autoencoders for hyperspectral unmixing. (arXiv:2109.13748v1 [cs.CV])
    (2 min) Neural networks, autoencoders in particular, are one of the most promising solutions for unmixing hyperspectral data, i.e. reconstructing the spectra of observed substances (endmembers) and their relative mixing fractions (abundances). Unmixing is needed for effective hyperspectral analysis and classification. However, as we show in this paper, the training of autoencoders for unmixing is highly dependent on weights initialisation. Some sets of weights lead to degenerate or low performance solutions, introducing negative bias in expected performance. In this work we present the results of experiments investigating autoencoders' stability, verifying the dependence of reconstruction error on initial weights and exploring conditions needed for successful optimisation of autoencoder parameters.
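    The kind of stability experiment described above can be sketched as follows: the same small unmixing autoencoder (softmax abundances and a non-negative linear decoder acting as endmembers, a common unmixing setup assumed here) is trained from several random initialisations and the final reconstruction errors are compared.

        # Train the same unmixing autoencoder from different random initialisations and
        # compare final reconstruction errors to probe sensitivity to weight initialisation.
        import torch
        import torch.nn as nn

        class UnmixingAE(nn.Module):
            def __init__(self, bands, endmembers):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(bands, 32), nn.ReLU(),
                                             nn.Linear(32, endmembers))
                self.decoder = nn.Linear(endmembers, bands, bias=False)  # columns ~ endmember spectra

            def forward(self, x):
                abundances = torch.softmax(self.encoder(x), dim=1)  # non-negative, sum to 1
                return self.decoder(abundances), abundances

        def final_error(X, seed, epochs=200):
            torch.manual_seed(seed)                    # different seed -> different init
            model = UnmixingAE(X.shape[1], endmembers=3)
            opt = torch.optim.Adam(model.parameters(), lr=1e-2)
            for _ in range(epochs):
                recon, _ = model(X)
                loss = ((recon - X) ** 2).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
                # Keep decoder weights (endmember spectra) non-negative.
                with torch.no_grad():
                    model.decoder.weight.clamp_(min=0)
            return loss.item()

        X = torch.rand(256, 50)                        # 256 pixels, 50 spectral bands
        print([round(final_error(X, s), 4) for s in range(3)])  # spread across seeds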
    SafetyNet: Safe planning for real-world self-driving vehicles using machine-learned policies. (arXiv:2109.13602v1 [cs.RO])
    (2 min) In this paper we present the first safe system for full control of self-driving vehicles trained from human demonstrations and deployed in challenging, real-world, urban environments. Current industry-standard solutions use rule-based systems for planning. Although they perform reasonably well in common scenarios, the engineering complexity renders this approach incompatible with human-level performance. On the other hand, the performance of machine-learned (ML) planning solutions can be improved by simply adding more exemplar data. However, ML methods cannot offer safety guarantees and sometimes behave unpredictably. To combat this, our approach uses a simple yet effective rule-based fallback layer that performs sanity checks on an ML planner's decisions (e.g. avoiding collision, assuring physical feasibility). This allows us to leverage ML to handle complex situations while still assuring the safety, reducing ML planner-only collisions by 95%. We train our ML planner on 300 hours of expert driving demonstrations using imitation learning and deploy it along with the fallback layer in downtown San Francisco, where it takes complete control of a real vehicle and navigates a wide variety of challenging urban driving scenarios.
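    An illustrative Python sketch of a rule-based fallback layer of this kind: the ML planner's trajectory is accepted only if it passes simple sanity checks (kinematic feasibility, obstacle clearance), otherwise a conservative fallback trajectory is used. The specific checks, limits, and data layout are assumptions.

        # Rule-based fallback layer: sanity-check the ML planner's trajectory and
        # substitute a conservative fallback when a check fails.
        import numpy as np

        MAX_ACCEL = 3.0      # m/s^2, assumed comfort/feasibility limit
        MIN_CLEARANCE = 1.0  # m, assumed minimum distance to any obstacle

        def passes_sanity_checks(traj, obstacles, dt=0.1):
            """traj: (T, 2) planned x/y positions; obstacles: (N, 2) obstacle positions."""
            vel = np.diff(traj, axis=0) / dt
            acc = np.diff(vel, axis=0) / dt
            if np.linalg.norm(acc, axis=1).max(initial=0.0) > MAX_ACCEL:
                return False                               # physically/comfortably infeasible
            if len(obstacles):
                dists = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=2)
                if dists.min() < MIN_CLEARANCE:
                    return False                           # predicted collision / near miss
            return True

        def plan(ml_trajectory, fallback_trajectory, obstacles):
            if passes_sanity_checks(ml_trajectory, obstacles):
                return ml_trajectory                       # trust the ML planner
            return fallback_trajectory                     # e.g. a conservative in-lane stop

        ml = np.cumsum(np.full((20, 2), [1.0, 0.0]), axis=0)
        fallback = ml * 0.5
        print(plan(ml, fallback, obstacles=np.array([[5.0, 0.2]])) is fallback)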
    Domain Generalization for Vision-based Driving Trajectory Generation. (arXiv:2109.13858v1 [cs.CV])
    (2 min) One of the challenges in vision-based driving trajectory generation is dealing with out-of-distribution scenarios. In this paper, we propose a domain generalization method for vision-based driving trajectory generation for autonomous vehicles in urban environments, which can be seen as a solution to extend the Invariant Risk Minimization (IRM) method in complex problems. We leverage an adversarial learning approach to train a trajectory generator as the decoder. Based on the pre-trained decoder, we infer the latent variables corresponding to the trajectories, and pre-train the encoder by regressing the inferred latent variable. Finally, we fix the decoder but fine-tune the encoder with the final trajectory loss. We compare our proposed method with the state-of-the-art trajectory generation method and some recent domain generalization methods on both datasets and simulation, demonstrating that our method has better generalization ability.
    Fail-Safe Human Detection for Drones Using a Multi-Modal Curriculum Learning Approach. (arXiv:2109.13666v1 [cs.CV])
    (2 min) Drones are currently being explored for safety-critical applications where human agents are expected to evolve in their vicinity. In such applications, robust people avoidance must be provided by fusing a number of sensing modalities in order to avoid collisions. Currently, however, people detection systems used on drones are solely based on standard cameras, apart from an emerging number of works discussing the fusion of imaging and event-based cameras. On the other hand, radar-based systems provide utmost robustness to environmental conditions but do not provide complete information on their own and have mainly been investigated in automotive contexts, not for drones. In order to enable the fusion of radars with both event-based and standard cameras, we present KUL-UAVSAFE, a first-of-its-kind dataset for the study of safety-critical people detection by drones. In addition, we propose a baseline CNN architecture with cross-fusion highways and introduce a curriculum learning strategy for multi-modal data termed SAUL, which greatly enhances the robustness of the system towards hard RGB failures and provides a significant gain of 15% in peak F1 score compared to the use of BlackIn, previously proposed for cross-fusion networks. We demonstrate the real-time performance and feasibility of the approach by implementing the system in an edge-computing unit. We release our dataset and additional material on the project home page.
    Motion Deblurring with Real Events. (arXiv:2109.13695v1 [cs.CV])
    (2 min) In this paper, we propose an end-to-end learning framework for event-based motion deblurring in a self-supervised manner, where real-world events are exploited to alleviate the performance degradation caused by data inconsistency. To achieve this end, optical flows are predicted from events, with which the blurry consistency and photometric consistency are exploited to enable self-supervision on the deblurring network with real-world data. Furthermore, a piece-wise linear motion model is proposed to take into account motion non-linearities and thus leads to an accurate model for the physical formation of motion blurs in the real-world scenario. Extensive evaluation on both synthetic and real motion blur datasets demonstrates that the proposed algorithm bridges the gap between simulated and real-world motion blurs and shows remarkable performance for event-based motion deblurring in real-world scenarios.
    StereoSpike: Depth Learning with a Spiking Neural Network. (arXiv:2109.13751v1 [cs.CV])
    (2 min) Depth estimation is an important computer vision task, useful in particular for navigation in autonomous vehicles, or for object manipulation in robotics. Here we solved it using an end-to-end neuromorphic approach, combining two event-based cameras and a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, that we named StereoSpike. More specifically, we used the Multi Vehicle Stereo Event Camera Dataset (MVSEC). It provides a depth ground-truth, which was used to train StereoSpike in a supervised manner, using surrogate gradient descent. We propose a novel readout paradigm to obtain a dense analog prediction -- the depth of each pixel -- from the spikes of the decoder. We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts, leading to state-of-the-art test accuracy. To the best of our knowledge, it is the first time that such a large-scale regression problem is solved by a fully spiking network. Finally, we show that low firing rates (<10%) can be obtained via regularization, with a minimal cost in accuracy. This means that StereoSpike could be efficiently implemented on neuromorphic chips, opening the door for low power and real time embedded systems.
    An Efficient Network Design for Face Video Super-resolution. (arXiv:2109.13626v1 [cs.CV])
    (2 min) Face video super-resolution algorithms aim to reconstruct realistic face details through continuous input video sequences. However, existing video processing algorithms usually contain redundant parameters to handle different super-resolution scenes. In this work, we focus on super-resolution of face areas in original video scenes, while the remaining areas are interpolated. This specific super-resolution task makes it possible to cut redundant parameters in general video super-resolution networks. We construct a dataset consisting entirely of face video sequences for network training and evaluation, and conduct hyper-parameter optimization in our experiments. We use three combined strategies to optimize the network parameters with a simultaneous train-evaluation method to accelerate the optimization process. Results show that the simultaneous train-evaluation method improves the training speed and facilitates the generation of efficient networks. The generated network can reduce parameters by at least 52.4% and FLOPs by 20.7%, and achieves better PSNR and SSIM than state-of-the-art video super-resolution algorithms. When processing 36x36x1x3 input video frame sequences, the efficient network provides 47.62 FPS real-time processing performance. We name our proposal hyper-parameter optimization for face Video Super-Resolution (HO-FVSR), which is open-sourced at https://github.com/yphone/efficient-network-for-face-VSR.
    Adaptive Attribute and Structure Subspace Clustering Network. (arXiv:2109.13742v1 [cs.CV])
    (2 min) Deep self-expressiveness-based subspace clustering methods have demonstrated effectiveness. However, existing works only consider the attribute information to conduct the self-expressiveness, which may limit the clustering performance. In this paper, we propose a novel adaptive attribute and structure subspace clustering network (AASSC-Net) to simultaneously consider the attribute and structure information in an adaptive graph fusion manner. Specifically, we first exploit an auto-encoder to represent input data samples with latent features for the construction of an attribute matrix. We also construct a mixed signed and symmetric structure matrix to capture the local geometric structure underlying data samples. Then, we perform self-expressiveness on the constructed attribute and structure matrices to learn their affinity graphs separately. Finally, we design a novel attention-based fusion module to adaptively leverage these two affinity graphs to construct a more discriminative affinity graph. Extensive experimental results on commonly used benchmark datasets demonstrate that our AASSC-Net significantly outperforms state-of-the-art methods. In addition, we conduct comprehensive ablation studies to discuss the effectiveness of the designed modules. The code will be publicly available at https://github.com/ZhihaoPENG-CityU.
    NudgeSeg: Zero-Shot Object Segmentation by Repeated Physical Interaction. (arXiv:2109.13859v1 [cs.CV])
    (2 min) Recent advances in object segmentation have demonstrated that deep neural networks excel at object segmentation for specific classes in color and depth images. However, their performance is dictated by the number of classes and objects used for training, thereby hindering generalization to never seen objects or zero-shot samples. To exacerbate the problem further, object segmentation using image frames relies on recognition and pattern matching cues. Instead, we utilize the 'active' nature of a robot and its ability to 'interact' with the environment to induce additional geometric constraints for segmenting zero-shot samples. In this paper, we present the first framework to segment unknown objects in a cluttered scene by repeatedly 'nudging' at the objects and moving them to obtain additional motion cues at every step using only a monochrome monocular camera. We call our framework NudgeSeg. These motion cues are used to refine the segmentation masks. We successfully test our approach to segment novel objects in various cluttered scenes and provide an extensive study with image and motion segmentation methods. We show an impressive average detection rate of over 86% on zero-shot objects.
    Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video. (arXiv:2109.13593v1 [cs.CV])
    (2 min) Performing real-time and accurate instrument segmentation from videos is of great significance for improving the performance of robotic-assisted surgery. We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration. However, most existing works perform segmentation purely using visual cues in a single frame. Optical flow is only used to model the motion between two frames and brings heavy computational cost. We propose a novel dual-memory network (DMNet) to relate both global and local spatio-temporal knowledge to augment the current features, boosting the segmentation performance while retaining the real-time prediction capability. On the one hand, we propose an efficient local memory that takes the complementary advantages of convolutional LSTM and non-local mechanisms with respect to the receptive field. On the other hand, we develop an active global memory that relates the global semantic correlation over a long temporal range to the current frame, gathering the most informative frames according to model uncertainty and frame similarity. We have extensively validated our method on two public benchmark surgical video datasets. Experimental results demonstrate that our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
    Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning. (arXiv:2109.13499v1 [cs.CV])
    (2 min) This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos. We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges: (i) neighbor relations that determine the aggregation strength from intra-frame neighbors in space, and (ii) similarity relations that indicate the transition probability of inter-frame paths across time. Leveraging the cycle-consistency in videos, our contrastive learning objective discriminates dynamic objects from both their neighboring views and temporal views. Compared with prior works, our approach actively explores the neighbor relations of central instances to learn a latent association between center-neighbor pairs (e.g., "hand -- arm") across time, thus improving the instance discrimination. Without fine-tuning, our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks including video object propagation, part propagation, and pose keypoint tracking. Our self-supervised method also surpasses some fully supervised algorithms designed for the specific tasks.
    CIDEr-R: Robust Consensus-based Image Description Evaluation. (arXiv:2109.13701v1 [cs.CV])
    (2 min) This paper shows that CIDEr-D, a traditional evaluation metric for image description, does not work properly on datasets where the number of words in the sentence is significantly greater than those in the MS COCO Captions dataset. We also show that CIDEr-D's performance is hampered by the lack of multiple reference sentences and by high variance in sentence length. To bypass this problem, we introduce CIDEr-R, which improves CIDEr-D, making it more flexible in dealing with datasets with high sentence length variance. We demonstrate that CIDEr-R is more accurate and closer to human judgment than CIDEr-D, and that CIDEr-R is more robust regarding the number of available references. Our results reveal that using Self-Critical Sequence Training to optimize CIDEr-R generates descriptive captions. In contrast, when CIDEr-D is optimized, the generated captions' length tends to be similar to the reference length; however, the models also repeat the same word several times to increase the sentence length.
    Information Elevation Network for Fast Online Action Detection. (arXiv:2109.13572v1 [cs.CV])
    (2 min) Online action detection (OAD) is a task that receives video segments within a streaming video as inputs and identifies ongoing actions within them. It is important to retain past information associated with a current action. However, long short-term memory (LSTM), a popular recurrent unit for modeling temporal information from videos, accumulates past information from the previous hidden and cell states and the extracted visual features at each timestep without considering the relationships between the past and current information. Consequently, the forget gate of the original LSTM can lose the accumulated information relevant to the current action because it determines which information to forget without considering the current action. We introduce a novel information elevation unit (IEU) that lifts up and accumulates the past information that is especially relevant to the current action. Through ablation studies, we design an efficient and effective OAD network using IEUs, called an information elevation network (IEN). To the best of our knowledge, IEN is the first attempt that considers the computational overhead required for the practical use of OAD. Our IEN uses visual features extracted by a fast action recognition network taking only RGB frames, because extracting optical flows requires heavy computational overhead. On two OAD benchmark datasets, THUMOS-14 and TVSeries, our IEN outperforms state-of-the-art OAD methods using only RGB frames. Furthermore, on the THUMOS-14 dataset, our IEN outperforms the state-of-the-art OAD methods using two-stream features based on RGB frames and optical flows.
    Evaluation of Deep Neural Network Domain Adaptation Techniques for Image Recognition. (arXiv:2109.13420v1 [cs.CV])
    (2 min) It has been well established that deep networks are efficient at extracting features from a given (source) labeled dataset. However, it is not always the case that they can generalize well to other (target) datasets which very often have a different underlying distribution. In this report, we evaluate four different domain adaptation techniques for image classification tasks: DeepCORAL, DeepDomainConfusion, CDAN and CDAN+E. These techniques are unsupervised given that the target dataset does not carry any labels during the training phase. We evaluate model performance on the Office-31 dataset. A link to the GitHub repository of this report can be found here: https://github.com/agrija9/Deep-Unsupervised-Domain-Adaptation.
    Making Curiosity Explicit in Vision-based RL. (arXiv:2109.13588v1 [cs.LG])
    (2 min) Vision-based reinforcement learning (RL) is a promising technique to solve control tasks involving images as the main observation. State-of-the-art RL algorithms still struggle in terms of sample efficiency, especially when using image observations. This has led to increased attention to integrating state representation learning (SRL) techniques into the RL pipeline. Work in this field demonstrates a substantial improvement in sample efficiency among other benefits. However, to take full advantage of this paradigm, the quality of samples used for training plays a crucial role. More importantly, the diversity of these samples could affect not only the sample efficiency of vision-based RL but also its generalization capability. In this work, we present an approach to improve the sample diversity. Our method enhances the exploration capability of the RL algorithms by taking advantage of the SRL setup. Our experiments show that the presented approach outperforms the baseline for all tested environments. These results are most apparent for environments where the baseline method struggles. Even in simple environments, our method stabilizes the training, reduces the reward variance and boosts sample efficiency.
    Warp-Refine Propagation: Semi-Supervised Auto-labeling via Cycle-consistency. (arXiv:2109.13432v1 [cs.CV])
    (2 min) Deep learning models for semantic segmentation rely on expensive, large-scale, manually annotated datasets. Labelling is a tedious process that can take hours per image. Automatically annotating video sequences by propagating sparsely labeled frames through time is a more scalable alternative. In this work, we propose a novel label propagation method, termed Warp-Refine Propagation, that combines semantic cues with geometric cues to efficiently auto-label videos. Our method learns to refine geometrically-warped labels and infuse them with learned semantic priors in a semi-supervised setting by leveraging cycle consistency across time. We quantitatively show that our method improves label-propagation by a noteworthy margin of 13.1 mIoU on the ApolloScape dataset. Furthermore, by training with the auto-labelled frames, we achieve competitive results on three semantic-segmentation benchmarks, improving the state-of-the-art by a large margin of 1.8 and 3.61 mIoU on NYU-V2 and KITTI, while matching the current best results on Cityscapes.
    A hierarchical residual network with compact triplet-center loss for sketch recognition. (arXiv:2109.13536v1 [cs.CV])
    (2 min) With the widespread use of touch-screen devices, it is more and more convenient for people to draw sketches on screen. This results in a demand for automatically understanding sketches, making the sketch recognition task more significant than before. To accomplish this task, it is necessary to solve the critical issue of improving the distinctiveness of sketch features. To this end, we have made efforts in three aspects. First, a novel multi-scale residual block is designed. Compared with the conventional basic residual block, it can better perceive multi-scale information and reduce the number of parameters during training. Second, a hierarchical residual structure is built by stacking multi-scale residual blocks in a specific way. In contrast with the single-level residual structure, the features learned from this structure are richer. Last but not least, the compact triplet-center loss is proposed specifically for the sketch recognition task. It addresses the problem that the triplet-center loss does not fully account for overly large intra-class distances and overly small inter-class distances in the sketch domain. By combining the above modules, a hierarchical residual network as a whole is proposed for sketch recognition and evaluated thoroughly on the TU-Berlin benchmark. The experimental results show that the proposed network outperforms most of the baseline methods and is among the best non-sequential models at present.
    Convolutional Shapelet Transform: A new approach for time series shapelets. (arXiv:2109.13514v1 [cs.CV])
    (2 min) Shapelet-based algorithms are widely used for time series classification because of their ease of interpretation, but they are currently outperformed, notably by methods using convolutional kernels, capable of reaching state-of-the-art performance while being highly scalable. We present a new formulation of time series shapelets including the notion of dilation, and a shapelet extraction method based on convolutional kernels, which is able to target the discriminant information identified by convolutional kernels. Experiments performed on 108 datasets show that our method improves on the state-of-the-art for shapelet algorithms, and we show that it can be used to interpret results from convolutional kernels.
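    A dilated shapelet can be matched against a series the same way a dilated convolutional kernel is applied: compare the shapelet to subsequences whose elements are spaced `dilation` steps apart and keep the minimum distance as the feature. A minimal NumPy sketch under those assumptions; this illustrates the notion of dilation only, not the authors' extraction procedure.

        import numpy as np

        def dilated_shapelet_distance(series, shapelet, dilation=1):
            """Minimum squared distance between a shapelet and all dilated subsequences."""
            length = len(shapelet)
            span = (length - 1) * dilation + 1   # how much of the series one match covers
            best = np.inf
            for start in range(len(series) - span + 1):
                window = series[start : start + span : dilation]
                best = min(best, float(np.sum((window - shapelet) ** 2)))
            return best

        # Example: the same 3-point shapelet matched with and without dilation.
        x = np.array([0., 0., 1., 0., 1., 0., 1., 0., 0.])
        s = np.array([1., 1., 1.])
        print(dilated_shapelet_distance(x, s, dilation=1))   # no exact contiguous match
        print(dilated_shapelet_distance(x, s, dilation=2))   # matches 1,1,1 at every other step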
    Towards Rotation Invariance in Object Detection. (arXiv:2109.13488v1 [cs.CV])
    (2 min) Rotation augmentations generally improve a model's invariance/equivariance to rotation - except in object detection. In object detection the shape is not known, therefore rotation creates a label ambiguity. We show that the de-facto method for bounding box label rotation, the Largest Box Method, creates very large labels, leading to poor performance and in many cases worse performance than using no rotation at all. We propose a new method of rotation augmentation that can be implemented in a few lines of code. First, we create a differentiable approximation of label accuracy and show that axis-aligning the bounding box around an ellipse is optimal. We then introduce Rotation Uncertainty (RU) Loss, allowing the model to adapt to the uncertainty of the labels. On five different datasets (including COCO, PascalVOC, and Transparent Object Bin Picking), this approach improves the rotational invariance of both one-stage and two-stage architectures when measured with AP, AP50, and AP75. The code is available at \url{https://github.com/akasha-imaging/ICCV2021}.
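    The geometry behind the ellipse variant can be sketched directly: compare the 'largest box' (the axis-aligned box around the rotated corners) with the tighter axis-aligned box around the ellipse inscribed in the original box. The NumPy illustration below shows only those two label constructions; it is not the paper's code, and the RU loss is not shown.

        import numpy as np

        def largest_box(w, h, theta):
            """Axis-aligned box around the four rotated corners of a w x h box (centred at 0)."""
            corners = np.array([[ w,  h], [ w, -h], [-w,  h], [-w, -h]]) / 2.0
            c, s = np.cos(theta), np.sin(theta)
            rotated = corners @ np.array([[c, -s], [s, c]]).T
            return 2 * rotated[:, 0].max(), 2 * rotated[:, 1].max()

        def ellipse_box(w, h, theta):
            """Axis-aligned box around the rotated ellipse inscribed in the w x h box."""
            a, b = w / 2.0, h / 2.0
            c, s = np.cos(theta), np.sin(theta)
            new_w = 2 * np.sqrt((a * c) ** 2 + (b * s) ** 2)
            new_h = 2 * np.sqrt((a * s) ** 2 + (b * c) ** 2)
            return new_w, new_h

        # A 100 x 20 box rotated by 45 degrees: the largest-box label is much looser.
        print(largest_box(100, 20, np.pi / 4))   # ~ (84.9, 84.9)
        print(ellipse_box(100, 20, np.pi / 4))   # ~ (72.1, 72.1)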
    Metal Artifact Reduction in 2D CT Images with Self-supervised Cross-domain Learning. (arXiv:2109.13483v1 [eess.IV])
    (2 min) The presence of metallic implants often introduces severe metal artifacts in X-ray CT images, which can adversely influence clinical diagnosis or dose calculation in radiation therapy. In this work, we present a novel deep-learning-based approach for metal artifact reduction (MAR). In order to alleviate the need for anatomically identical CT image pairs (i.e., a metal artifact-corrupted CT image and a metal artifact-free CT image) for network learning, we propose a self-supervised cross-domain learning framework. Specifically, we train a neural network to restore the metal trace region values in the given metal-free sinogram, where the metal trace is identified by the forward projection of metal masks. We then design a novel filtered back projection (FBP) reconstruction loss to encourage the network to generate more accurate completion results, and a residual-learning-based image refinement module to reduce the secondary artifacts in the reconstructed CT images. To preserve the fine structure details and fidelity of the final MAR image, instead of directly adopting CNN-refined images as output, we incorporate metal trace replacement into our framework and replace the metal-affected projections of the original sinogram with the prior sinogram generated by the forward projection of the CNN output. We then use the FBP algorithm for final MAR image reconstruction. We conduct an extensive evaluation on simulated and real artifact data to show the effectiveness of our design. Our method produces superior MAR results and outperforms other compelling methods. We also demonstrate the potential of our framework for other organ sites.
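    The inference pipeline described at the end of the abstract can be summarised in a few lines of sketch code: restore only the metal-trace region of the sinogram, keep the unaffected projections as they are, and reconstruct with FBP. Every name below (the projector, the reconstruction routine, the two networks) is a placeholder for whatever implementation is actually used, not the paper's code.

        def reduce_metal_artifacts(ct_image, metal_mask,
                                   forward_project, fbp_reconstruct,
                                   completion_net, refinement_net):
            """Sketch of sinogram completion followed by metal trace replacement."""
            sinogram = forward_project(ct_image)            # original, artifact-affected sinogram
            metal_trace = forward_project(metal_mask) > 0   # projections touched by metal (boolean array)

            completed = completion_net(sinogram, metal_trace)       # network fills the metal trace
            prior_sino = forward_project(refinement_net(fbp_reconstruct(completed)))

            # Replace only the metal-affected projections; keep the trustworthy ones as-is.
            final_sino = sinogram * (~metal_trace) + prior_sino * metal_trace
            return fbp_reconstruct(final_sino)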
    Delve into the Performance Degradation of Differentiable Architecture Search. (arXiv:2109.13466v1 [cs.LG])
    (2 min) Differentiable architecture search (DARTS) is widely considered prone to overfitting the validation set, which leads to performance degradation. We first employ a series of exploratory experiments to verify that neither high-strength architecture parameter regularization nor a warmup training scheme can effectively solve this problem. Based on the insights from the experiments, we conjecture that the performance of DARTS does not depend on the well-trained supernet weights and argue that the architecture parameters should be trained by the gradients obtained in the early stage rather than the final stage of training. This argument is then verified by exchanging the learning rate schemes of weights and parameters. Experimental results show that the simple swap of the learning rates can effectively solve the degradation and achieve competitive performance. Further empirical evidence suggests that the degradation is not simply a matter of overfitting the validation set but is linked to the operation selection bias within the bilevel optimization dynamics. We demonstrate the generality of this bias and propose to utilize it to achieve an operation-magnitude-based selective stop.
    SiamEvent: Event-based Object Tracking via Edge-aware Similarity Learning with Siamese Networks. (arXiv:2109.13456v1 [cs.CV])
    (2 min) Event cameras are novel sensors that perceive per-pixel intensity changes and output asynchronous event streams, offering many advantages over traditional cameras, such as high dynamic range (HDR) and no motion blur. It has been shown that events alone can be used for object tracking by motion compensation or prediction. However, existing methods assume that the target is always moving and is a stand-alone object. Moreover, they fail to track objects that stop moving, or that do not move independently, in fixed scenes. In this paper, we propose a novel event-based object tracking framework, called SiamEvent, using Siamese networks via edge-aware similarity learning. Importantly, to find the part whose edge structure is most similar to that of the target, we propose to correlate the embedded events at two timestamps to compute the target edge similarity. The Siamese network enables tracking an arbitrary target edge by finding the part with the highest similarity score. This extends event-based object tracking beyond independently moving, stand-alone objects to various camera and scene settings. In addition, target edge initialization and an edge detector are proposed to prevent SiamEvent from drifting. Lastly, we built an open dataset including various synthetic and real scenes to train and evaluate SiamEvent. Extensive experiments demonstrate that SiamEvent achieves up to a 15% tracking performance improvement over the baselines on real-world scenes, as well as more robust tracking performance under challenging HDR and motion blur conditions.
    To Which Out-Of-Distribution Object Orientations Are DNNs Capable of Generalizing?. (arXiv:2109.13445v1 [cs.CV])
    (2 min) The capability of Deep Neural Networks (DNNs) to recognize objects in orientations outside the distribution of the training data, i.e., out-of-distribution (OoD) orientations, is not well understood. For humans, behavioral studies showed that recognition accuracy varies across OoD orientations, where generalization is much better for some orientations than for others. In contrast, for DNNs, it remains unknown how generalization abilities are distributed among OoD orientations. In this paper, we investigate the limitations of DNNs' generalization capacities by systematically inspecting patterns of success and failure of DNNs across OoD orientations. We use an intuitive and controlled, yet challenging learning paradigm, in which some instances of an object category are seen at only a few geometrically restricted orientations, while other instances are seen at all orientations. The effect of data diversity is also investigated by increasing the number of instances seen at all orientations in the training set. We present a comprehensive analysis of DNNs' generalization abilities and limitations for representative architectures (ResNet, Inception, DenseNet and CORnet). Our results reveal an intriguing pattern -- DNNs are only capable of generalizing to instances of objects that appear like 2D, i.e., in-plane, rotations of in-distribution orientations.
    Multi-Semantic Image Recognition Model and Evaluating Index for explaining the deep learning models. (arXiv:2109.13531v1 [cs.CV])
    (2 min) Although deep learning models are powerful across various applications, most of them remain black boxes, lacking verifiability and interpretability, which means their decision-making process cannot be understood by human beings. Therefore, how to evaluate deep neural networks with explanations remains an urgent task. In this paper, we first propose a multi-semantic image recognition model, which enables human beings to understand the decision-making process of the neural network. Then, we present a new evaluation index, which can quantitatively assess the model's interpretability. We also comprehensively summarize the semantic information that affects the image classification results in the judgment process of neural networks. Finally, this paper also reports the relevant baseline performance of current state-of-the-art deep learning models.
    Efficient Computer Vision on Edge Devices with Pipeline-Parallel Hierarchical Neural Networks. (arXiv:2109.13356v1 [cs.CV])
    (2 min) Computer vision on low-power edge devices enables applications including search-and-rescue and security. State-of-the-art computer vision algorithms, such as Deep Neural Networks (DNNs), are too large for inference on low-power edge devices. To improve efficiency, some existing approaches parallelize DNN inference across multiple edge devices. However, these techniques introduce significant communication and synchronization overheads or are unable to balance workloads across devices. This paper demonstrates that the hierarchical DNN architecture is well suited for parallel processing on multiple edge devices. We design a novel method that creates a parallel inference pipeline for computer vision problems that use hierarchical DNNs. The method balances loads across the collaborating devices and reduces communication costs to facilitate the processing of multiple video frames simultaneously with higher throughput. Our experiments consider a representative computer vision problem where image recognition is performed on each video frame, running on multiple Raspberry Pi 4Bs. With four collaborating low-power edge devices, our approach achieves 3.21X higher throughput, 68% less energy consumption per device per frame, and 58% decrease in memory when compared with existing single-device hierarchical DNNs.
    WarpedGANSpace: Finding non-linear RBF paths in GAN latent space. (arXiv:2109.13357v1 [cs.CV])
    (2 min) This work addresses the problem of discovering, in an unsupervised manner, interpretable paths in the latent space of pretrained GANs, so as to provide an intuitive and easy way of controlling the underlying generative factors. In doing so, it addresses some of the limitations of the state-of-the-art works, namely, a) that they discover directions that are independent of the latent code, i.e., paths that are linear, and b) that their evaluation relies either on visual inspection or on laborious human labeling. More specifically, we propose to learn non-linear warpings on the latent space, each one parametrized by a set of RBF-based latent space warping functions, where each warping gives rise to a family of non-linear paths via the gradient of the function. Building on the work of Voynov and Babenko, which discovers linear paths, we optimize the trainable parameters of the set of RBFs so that images generated by codes along different paths are easily distinguishable by a discriminator network. This leads to easily distinguishable image transformations, such as pose and facial expressions in facial images. We show that linear paths can be derived as a special case of our method, and show experimentally that non-linear paths in the latent space lead to steeper, more disentangled and interpretable changes in the image space than state-of-the-art methods, both qualitatively and quantitatively. We make the code and the pretrained models publicly available at: https://github.com/chi0tzp/WarpedGANSpace.
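    In RBF-based warping of this kind, each interpretable direction comes from a scalar warping function over the latent space, and a non-linear path is traced by repeatedly stepping along its gradient. A hedged sketch of that construction, written with generic symbols that are not necessarily the paper's notation:

        f_k(z) = \sum_{i=1}^{N} \alpha_{k,i} \exp\left( -\gamma_{k,i} \, \lVert z - c_{k,i} \rVert^2 \right),
        \qquad
        z_{t+1} = z_t + \epsilon \, \frac{\nabla_z f_k(z_t)}{\lVert \nabla_z f_k(z_t) \rVert}

    Here c_{k,i}, gamma_{k,i} and alpha_{k,i} are the trainable centres, widths and weights of the k-th warping and epsilon is a small step size; a warping whose gradient is constant in z traces a straight line, which is how linear paths arise as a special case.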
    IGAN: Inferent and Generative Adversarial Networks. (arXiv:2109.13360v1 [cs.LG])
    (2 min) I present IGAN (Inferent Generative Adversarial Networks), a neural architecture that learns both a generative and an inference model on a complex high dimensional data distribution, i.e. a bidirectional mapping between data samples and a simpler low-dimensional latent space. It extends the traditional GAN framework with inference by rewriting the adversarial strategy in both the image and the latent space with an entangled game between data-latent encoded posteriors and priors. It brings a measurable stability and convergence to the classical GAN scheme, while keeping its generative quality and remaining simple and frugal in order to run on a lab PC. IGAN fosters the encoded latents to span the full prior space: this enables the exploitation of an enlarged and self-organised latent space in an unsupervised manner. An analysis of previously published articles sets the theoretical ground for the proposed algorithm. A qualitative demonstration of potential applications like self-supervision or multi-modal data translation is given on common image datasets including SAR and optical imagery.
    Weakly Supervised Keypoint Discovery. (arXiv:2109.13423v1 [cs.CV])
    (2 min) In this paper, we propose a method for keypoint discovery from a 2D image using image-level supervision. Recent works on unsupervised keypoint discovery reliably discover keypoints of aligned instances. However, when the target instances have high viewpoint or appearance variation, the discovered keypoints do not match the semantic correspondences over different images. Our work aims to discover keypoints even when the target instances have high viewpoint and appearance variation by using image-level supervision. Motivated by the weakly-supervised learning approach, our method exploits image-level supervision to identify discriminative parts and infer the viewpoint of the target instance. To discover diverse parts, we adopt a conditional image generation approach using a pair of images with structural deformation. Finally, we enforce a viewpoint-based equivariance constraint using the keypoints from the image-level supervision to resolve the spatial correlation problem that consistently appears in the images taken from various viewpoints. Our approach achieves state-of-the-art performance for the task of keypoint estimation on the limited supervision scenarios. Furthermore, the discovered keypoints are directly applicable to downstream tasks without requiring any keypoint labels.
    DOODLER: Determining Out-Of-Distribution Likelihood from Encoder Reconstructions. (arXiv:2109.13237v1 [cs.LG])
    (2 min) Deep Learning models possess two key traits that, in combination, make their use in the real world a risky prospect. One, they do not typically generalize well outside of the distribution for which they were trained, and two, they tend to exhibit confident behavior regardless of whether or not they are producing meaningful outputs. While Deep Learning possesses immense power to solve realistic, high-dimensional problems, these traits in concert make it difficult to have confidence in their real-world applications. To overcome this difficulty, the task of Out-Of-Distribution (OOD) Detection has been defined, to determine when a model has received an input from outside of the distribution for which it is trained to operate. This paper introduces and examines a novel methodology, DOODLER, for OOD Detection, which directly leverages the traits which result in its necessity. By training a Variational Auto-Encoder (VAE) on the same data as another Deep Learning model, the VAE learns to accurately reconstruct In-Distribution (ID) inputs, but not to reconstruct OOD inputs, meaning that its failure state can be used to perform OOD Detection. Unlike other work in the area, DOODLER requires only very weak assumptions about the existence of an OOD dataset, allowing for more realistic application. DOODLER also enables pixel-wise segmentations of input images by OOD likelihood, and experimental results show that it matches or outperforms methodologies that operate under the same constraints.
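    The detection rule itself is simple to sketch: score each input by how badly a VAE trained on in-distribution data reconstructs it, and flag high scores as OOD. A minimal illustration with an assumed, already-trained PyTorch VAE; threshold calibration and the per-pixel segmentation variant are omitted.

        import torch

        def ood_scores(vae, x):
            """Per-sample reconstruction error; high error suggests out-of-distribution input."""
            with torch.no_grad():
                recon = vae(x)                                  # assumed: vae(x) returns the reconstruction
                return ((recon - x) ** 2).flatten(1).mean(dim=1)

        def detect_ood(vae, x, threshold):
            # `threshold` would typically be calibrated on held-out in-distribution data,
            # e.g. a high percentile of its reconstruction-error distribution.
            return ood_scores(vae, x) > threshold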
    The Impact of Domain Shift on Left and Right Ventricle Segmentation in Short Axis Cardiac MR Images. (arXiv:2109.13230v1 [eess.IV])
    (2 min) Domain shift refers to the difference in the data distribution of two datasets, normally between the training set and the test set for machine learning algorithms. Domain shift is a serious problem for generalization of machine learning models and it is well-established that a domain shift between the training and test sets may cause a drastic drop in the model's performance. In medical imaging, there can be many sources of domain shift such as different scanners or scan protocols, different pathologies in the patient population, anatomical differences in the patient population (e.g. men vs women) etc. Therefore, in order to train models that have good generalization performance, it is important to be aware of the domain shift problem, its potential causes and to devise ways to address it. In this paper, we study the effect of domain shift on left and right ventricle blood pool segmentation in short axis cardiac MR images. Our dataset contains short axis images from 4 different MR scanners and 3 different pathology groups. The training is performed with nnUNet. The results show that scanner differences cause a greater drop in performance compared to changing the pathology group, and that the impact of domain shift is greater on right ventricle segmentation compared to left ventricle segmentation. Increasing the number of training subjects increased cross-scanner performance more than in-scanner performance at small training set sizes, but this difference in improvement decreased with larger training set sizes. Training models using data from multiple scanners improved cross-domain performance.
    Discriminative Attribution from Counterfactuals. (arXiv:2109.13412v1 [cs.LG])
    (2 min) We present a method for neural network interpretability by combining feature attribution with counterfactual explanations to generate attribution maps that highlight the most discriminative features between pairs of classes. We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner, thus preventing potential observer bias. We evaluate the proposed method on three diverse datasets, including a challenging artificial dataset and real-world biological data. We show quantitatively and qualitatively that the highlighted features are substantially more discriminative than those extracted using conventional attribution methods and argue that this type of explanation is better suited for understanding fine grained class differences as learned by a deep neural network.
  • cs.IR updates on arXiv.org

    Crowd Sensing and Living Lab Outdoor Experimentation Made Easy. (arXiv:2107.04117v3 [cs.HC] UPDATED)
    (2 min) Living lab outdoor experimentation using pervasive computing provides new opportunities: higher realism, external validity and socio-spatio-temporal observations at large scale. However, experimentation `in the wild' is complex and costly. Noise, biases, privacy concerns, compliance with standards of ethical review boards, remote moderation, and control of experimental conditions and equipment perplex the collection of high-quality data for causal inference. This article introduces Smart Agora, a novel open-source software platform for rigorous, systematic outdoor experimentation. Without writing a single line of code, highly complex experimental scenarios are visually designed and automatically deployed to smart phones. Novel geolocated survey and sensor data are collected subject to participants verifying desired experimental conditions, for instance, their localization at certain urban spots. This new approach drastically improves the quality and purposefulness of crowd sensing, tailored to conditions that confirm or reject hypotheses. The features that support this innovative functionality and the broad spectrum of its applicability are demonstrated.
    Open Knowledge Graphs Canonicalization using Variational Autoencoders. (arXiv:2012.04780v2 [cs.CL] UPDATED)
    (2 min) Noun phrases and Relation phrases in open knowledge graphs are not canonicalized, leading to an explosion of redundant and ambiguous subject-relation-object triples. Existing approaches to solve this problem take a two-step approach. First, they generate embedding representations for both noun and relation phrases, then a clustering algorithm is used to group them using the embeddings as features. In this work, we propose Canonicalizing Using Variational Autoencoders (CUVA), a joint model to learn both embeddings and cluster assignments in an end-to-end approach, which leads to a better vector representation for the noun and relation phrases. Our evaluation over multiple benchmarks shows that CUVA outperforms the existing state-of-the-art approaches. Moreover, we introduce CanonicNell, a novel dataset to evaluate entity canonicalization systems.
    Synerise at RecSys 2021: Twitter user engagement prediction with a fast neural model. (arXiv:2109.12985v2 [cs.IR] UPDATED)
    (2 min) In this paper we present our 2nd place solution to ACM RecSys 2021 Challenge organized by Twitter. The challenge aims to predict user engagement for a set of tweets, offering an exceptionally large data set of 1 billion data points sampled from over four weeks of real Twitter interactions. Each data point contains multiple sources of information, such as tweet text along with engagement features, user features, and tweet features. The challenge brings the problem close to a real production environment by introducing strict latency constraints in the model evaluation phase: the average inference time for single tweet engagement prediction is limited to 6ms on a single CPU core with 64GB memory. Our proposed model relies on extensive feature engineering performed with methods such as the Efficient Manifold Density Estimator (EMDE) - our previously introduced algorithm based on Locality Sensitive Hashing method, and novel Fourier Feature Encoding, among others. In total, we create numerous features describing a user's Twitter account status and the content of a tweet. In order to adhere to the strict latency constraints, the underlying model is a simple residual feed-forward neural network. The system is a variation of our previous methods which proved successful in KDD Cup 2021, WSDM Challenge 2021, and SIGIR eCom Challenge 2020. We release the source code at: https://github.com/Synerise/recsys-challenge-2021
    A Framework for Causal Discovery in non-intervenable systems. (arXiv:2010.02247v3 [stat.ME] UPDATED)
    (2 min) Many frameworks exist to infer cause and effect relations in complex nonlinear systems, but a complete theory is lacking. A new framework is presented that is fully nonlinear, provides a complete information-theoretic disentanglement of causal processes, allows for nonlinear interactions between causes, identifies the causal strength of missing or unknown processes, and can analyze systems that cannot be represented on Directed Acyclic Graphs. The basic building blocks are information-theoretic measures such as (conditional) mutual information and a new concept called certainty that monotonically increases with the information available about the target process. The framework is presented in detail and compared with other existing frameworks, and the treatment of confounders is discussed. While there are systems with structures that the framework cannot disentangle, it is argued that any causal framework that is based on integrated quantities will miss out on potentially important information about the underlying probability density functions. The framework is tested on several highly simplified stochastic processes to demonstrate how blocking and gateways are handled, and on the chaotic Lorenz 1963 system. We show that the framework provides information on the local dynamics, but also reveals information on the larger-scale structure of the underlying attractor. Furthermore, by applying it to real observations related to the El Niño-Southern Oscillation system, we demonstrate its power and advantage over other methodologies.
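    For reference, the conditional mutual information that such information-theoretic frameworks build on has the standard form below; this is the textbook definition, not the paper's new "certainty" measure:

        I(X; Y \mid Z) = \iiint p(x, y, z) \, \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)} \, dx \, dy \, dz

    It is non-negative and equals zero exactly when X and Y are conditionally independent given Z, which is what makes it a natural building block for separating direct from mediated causal influences.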
    Exploring The Role of Local and Global Explanations in Recommender Systems. (arXiv:2109.13301v1 [cs.IR])
    (2 min) Explanations are well-known to improve recommender systems' transparency. These explanations may be local, explaining an individual recommendation, or global, explaining the recommender model in general. Despite their widespread use, there has been little investigation into the relative benefits of these two approaches. Do they provide the same benefits to users, or do they serve different purposes? We conducted a 30-participant exploratory study and a 30-participant controlled user study with a research-paper recommender system to analyze how providing participants local, global, or both explanations influences user understanding of system behavior. Our results provide evidence suggesting that both explanations are more helpful than either alone for explaining how to improve recommendations, yet both appeared less helpful than global alone for efficiency in identifying false positives and negatives. However, we note that the two explanation approaches may be better compared in the context of a higher-stakes or more opaque domain.
    Chekhov's Gun Recognition. (arXiv:2109.13855v1 [cs.CL])
    (2 min) Chekhov's gun is a dramatic principle stating that every element in a story must be necessary, and irrelevant elements should be removed. This paper presents a new natural language processing task - Chekhov's gun recognition (CGR) - the recognition of entities that are pivotal for the development of the plot. Though similar to classical Named Entity Recognition (NER), it has profound differences and is crucial for narrative processing tasks, since Chekhov's guns have a profound impact on the causal relationships in a story. The paper presents a new benchmark dataset for the CGR task that includes 5550 descriptions, each with one or more Chekhov's guns, and validates the task on two more datasets available in the natural language processing (NLP) literature.
    Concept-Aware Denoising Graph Neural Network for Micro-Video Recommendation. (arXiv:2109.13527v1 [cs.IR])
    (2 min) Recently, micro-video sharing platforms such as Kuaishou and TikTok have become a major source of information in people's lives. Owing to the large traffic volume, short video lifespan and streaming fashion of these services, it has become increasingly pressing to improve existing recommender systems to accommodate these challenges in a cost-effective way. In this paper, we propose a novel concept-aware denoising graph neural network (named CONDE) for micro-video recommendation. CONDE consists of a three-phase graph convolution process to derive user and micro-video representations: warm-up propagation, graph denoising and preference refinement. A heterogeneous tripartite graph is constructed by connecting user nodes with video nodes, and video nodes with associated concept nodes extracted from the captions and comments of the videos. To address the noisy information in the graph, we introduce a user-oriented graph denoising phase to extract a subgraph which can better reflect the user's preference. Although the main focus of this paper is micro-video recommendation, we also show that our method can be generalized to other types of tasks. Therefore, we also conduct empirical studies on a well-known public E-commerce dataset. The experimental results suggest that the proposed CONDE achieves significantly better recommendation performance than the existing state-of-the-art solutions.
    Fake or Credible? Towards Designing Services to Support Users' Credibility Assessment of News Content. (arXiv:2109.13336v1 [cs.HC])
    (2 min) Fake news has become omnipresent in digitalized areas such as social media platforms. While being disseminated online, it also poses a threat to individuals and societies offline, for example, in the context of democratic elections. Research and practice have investigated the detection of fake news with behavioral science or method-related perspectives. However, to date, we lack design knowledge on presenting fake news warnings to users to support their individual news credibility assessment. We present the journey through the first design cycle on developing a fake news detection service focusing on the user interface design. The design is grounded in concepts from the field of source credibility theory and instantiated in a prototype that was qualitatively evaluated. The 13 participants communicated their interest in a lightweight application that aids in the news credibility assessment and rated the design features as useful as well as desirable.
    The eDiscovery Medicine Show. (arXiv:2109.13908v1 [cs.IR])
    (2 min) The practice of bloodletting gradually fell into disfavor as a growing body of scientific evidence showed its ineffectiveness and demonstrated the effectiveness of various pharmaceuticals for the prevention and treatment of certain diseases. At the same time, the patent medicine industry promoted ineffective remedies at medicine shows featuring entertainment, testimonials, and pseudo-scientific claims with all the trappings--but none of the methodology--of science. Today, many producing parties and eDiscovery vendors similarly promote obsolete technology as well as unvetted tools labeled "artificial intelligence" or "technology-assisted review," along with unsound validation protocols. This situation will end only when eDiscovery technologies and tools are subject to testing using the methods of information retrieval.
    Exposing Paid Opinion Manipulation Trolls. (arXiv:2109.13726v1 [cs.LG])
    (2 min) Recently, Web forums have been invaded by opinion manipulation trolls. Some trolls try to influence the other users driven by their own convictions, while in other cases they can be organized and paid, e.g., by a political party or a PR agency that gives them specific instructions on what to write. Finding paid trolls automatically using machine learning is a hard task, as there is not enough training data to train a classifier; yet some test data can be obtained, as these trolls are sometimes caught and widely exposed. In this paper, we solve the training data problem by assuming that a user who is called a troll by several different people is likely to be one, and a user who has never been called a troll is unlikely to be one. We compare the profiles of (i) paid trolls vs. (ii) "mentioned" trolls vs. (iii) non-trolls, and we further show that a classifier trained to distinguish (ii) from (iii) also does quite well at telling apart (i) from (iii).
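    The weak-labelling setup lends itself to a very small sketch: treat users repeatedly called trolls as positives, never-called users as negatives, and train an ordinary classifier on profile features. The feature names, toy values and model choice below are illustrative assumptions, not the paper's feature set.

        from sklearn.ensemble import RandomForestClassifier

        # Hypothetical per-user profile features, e.g. [posts_per_day, avg_post_length,
        # fraction_of_replies, fraction_of_political_threads].
        X_train = [[40, 120, 0.9, 0.8],    # users called "troll" by several distinct people
                   [35, 200, 0.8, 0.9],
                   [ 2, 300, 0.3, 0.1],    # users never called a troll
                   [ 5, 150, 0.4, 0.2]]
        y_train = [1, 1, 0, 0]             # weak labels: mentioned-troll vs. non-troll

        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

        # At test time the same classifier is applied to profiles of known paid trolls
        # to check whether the weakly supervised signal transfers, as the paper does.
        print(clf.predict([[50, 110, 0.95, 0.85]]))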
    Extracting Attentive Social Temporal Excitation for Sequential Recommendation. (arXiv:2109.13539v1 [cs.SI])
    (2 min) In collaborative filtering, making full use of social information is an important way to improve recommendation quality, and it has been proven effective because a user's behavior is influenced by her friends. However, existing works leverage the social relationship to aggregate user features from friends' historical behavior sequences in a user-level indirect paradigm. A significant defect of the indirect paradigm is that it ignores the temporal relationships between behavior events across users. In this paper, we propose a novel time-aware sequential recommendation framework called Social Temporal Excitation Networks (STEN), which introduces temporal point processes to model the fine-grained impact of friends' behaviors on the user's dynamic interests in an event-level direct paradigm. Moreover, we propose to decompose the temporal effect in sequential recommendation into a social mutual temporal effect and an ego temporal effect. Specifically, we employ a social heterogeneous graph embedding layer to refine user representations via structural information. To enhance temporal information propagation, STEN directly extracts the fine-grained temporal mutual influence of friends' behaviors through the mutually exciting temporal network. In addition, the user's dynamic interests are captured through the self-exciting temporal network. Extensive experiments on three real-world datasets show that STEN outperforms state-of-the-art baseline methods. Moreover, STEN provides event-level recommendation explainability, which is also illustrated experimentally.
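    The "mutually exciting" and "self-exciting" terminology points to Hawkes-style temporal point processes, whose conditional intensity has the standard form below; this is shown for background only, and the paper's neural parameterisation is not reproduced here:

        \lambda_u(t) = \mu_u + \sum_{v} \sum_{t_{v,j} < t} \alpha_{uv} \, \kappa(t - t_{v,j})

    Here mu_u is user u's base rate, the inner sum runs over past events of user v (for example a friend's interactions), alpha_{uv} measures how strongly v's events excite u, and kappa is a decaying kernel such as kappa(dt) = beta * exp(-beta * dt); the v = u terms give the self-exciting (ego) effect and the v != u terms the social mutual effect.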
    Introducing the viewpoint in the resource description using machine learning. (arXiv:2109.13306v1 [cs.LG])
    (2 min) Search engines provide the user with information according to their interests and specialty. Thus, it is necessary to exploit descriptions of the resources that take viewpoints into consideration. Generally, resource descriptions are available in RDF (e.g., DBpedia of Wikipedia content). However, these descriptions do not take viewpoints into consideration. In this paper, we propose a new approach that converts a classic RDF resource description into a resource description that takes viewpoints into consideration. To detect viewpoints in the document, a machine learning technique is applied to an instantiated ontology, which represents the viewpoint in a given domain. An experimental study shows that converting a classic RDF resource description into a viewpoint-aware resource description yields highly relevant responses to the user's requests.
  • cs.LG updates on arXiv.org

    Adaptive Informative Path Planning Using Deep Reinforcement Learning for UAV-based Active Sensing. (arXiv:2109.13570v1 [cs.RO])
    (2 min) Aerial robots are increasingly being utilized for a wide range of environmental monitoring and exploration tasks. However, a key challenge is efficiently planning paths to maximize the information value of acquired data as an initially unknown environment is explored. To address this, we propose a new approach for informative path planning (IPP) based on deep reinforcement learning (RL). Bridging the gap between recent advances in RL and robotic applications, our method combines Monte Carlo tree search with an offline-learned neural network predicting informative sensing actions. We introduce several components making our approach applicable for robotic tasks with continuous high-dimensional state spaces and large action spaces. By deploying the trained network during a mission, our method enables sample-efficient online replanning on physical platforms with limited computational resources. Evaluations using synthetic data show that our approach performs on par with existing information-gathering methods while reducing runtime by a factor of 8-10. We validate the performance of our framework using real-world surface temperature data from a crop field.
    EEG Signal Processing using Wavelets for Accurate Seizure Detection through Cost Sensitive Data Mining. (arXiv:2109.13818v1 [eess.SP])
    (2 min) Epilepsy is one of the most common and yet most diverse sets of chronic neurological disorders, characterized by excessive or synchronous neuronal activity termed seizures. Electroencephalogram signal processing plays a significant role in the detection and prediction of epileptic seizures. In this paper we introduce an approach that relies upon the properties of wavelets for seizure detection. We utilise the Maximum Overlap Discrete Wavelet Transform, which enables us to reduce signal noise. Then, from the variance exhibited in the wavelet coefficients, we derive connectivity and communication efficiency between the electrodes, as these properties differ significantly during a seizure period in comparison to a non-seizure period. We use basic statistical parameters derived from the reconstructed noise-reduced signal, the electrode connectivity and the efficiency of information transfer to build the attribute space. We have tested our method on publicly available data and found it to be significantly better than some existing approaches.
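    A small sketch of the wavelet step with PyWavelets, using the stationary (undecimated) wavelet transform as a stand-in for the Maximum Overlap DWT and per-level coefficient variance as the kind of feature the abstract mentions. The wavelet, level, sampling rate and channel layout are assumptions, not the paper's settings.

        import numpy as np
        import pywt

        def wavelet_variance_features(signal, wavelet="db4", level=4):
            """Variance of detail coefficients at each level of an undecimated wavelet transform."""
            # pywt.swt expects the signal length to be a multiple of 2**level.
            coeffs = pywt.swt(signal, wavelet, level=level)
            return np.array([np.var(detail) for _approx, detail in coeffs])

        # Example: one synthetic 4-second EEG channel sampled at 256 Hz (1024 samples).
        rng = np.random.default_rng(0)
        eeg = np.sin(2 * np.pi * 10 * np.arange(1024) / 256) + 0.5 * rng.standard_normal(1024)
        print(wavelet_variance_features(eeg))   # one variance feature per decomposition level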
    Real-Time Glaucoma Detection from Digital Fundus Images using Self-ONNs. (arXiv:2109.13604v1 [eess.IV])
    (2 min) Glaucoma leads to permanent vision disability by damaging the optic nerve that transmits visual images to the brain. Because glaucoma shows no symptoms as it progresses and cannot be stopped at later stages, it is critical to diagnose it early. Although various deep learning models have been applied for detecting glaucoma from digital fundus images, due to the scarcity of labeled data their generalization performance has been limited, along with high computational complexity and special hardware requirements. In this study, compact Self-Organized Operational Neural Networks (Self-ONNs) are proposed for early detection of glaucoma in fundus images, and their performance is compared against conventional (deep) Convolutional Neural Networks (CNNs) over three benchmark datasets: ACRIMA, RIM-ONE, and ESOGU. The experimental results demonstrate that Self-ONNs not only achieve superior detection performance but can also significantly reduce the computational complexity, making them a potentially suitable network model for biomedical datasets, especially when data is scarce.
    Discriminative Attribution from Counterfactuals. (arXiv:2109.13412v1 [cs.LG])
    (2 min) We present a method for neural network interpretability by combining feature attribution with counterfactual explanations to generate attribution maps that highlight the most discriminative features between pairs of classes. We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner, thus preventing potential observer bias. We evaluate the proposed method on three diverse datasets, including a challenging artificial dataset and real-world biological data. We show quantitatively and qualitatively that the highlighted features are substantially more discriminative than those extracted using conventional attribution methods and argue that this type of explanation is better suited for understanding fine-grained class differences as learned by a deep neural network.
    Intra-Day Price Simulation with Generative Adversarial Modelling of the Order Flow. (arXiv:2109.13905v1 [q-fin.ST])
    (2 min) Intra-day price variations in financial markets are driven by the sequence of orders, called the order flow, that is submitted at high frequency by traders. This paper introduces a novel application of the Sequence Generative Adversarial Networks framework to model the order flow, such that random sequences of the order flow can then be generated to simulate the intra-day variation of prices. As a benchmark, a well-known parametric model from the quantitative finance literature is selected. The models are fitted, and then multiple random paths of the order flow sequences are sampled from each model. Model performances are then evaluated by using the generated sequences to simulate price variations, and we compare the empirical regularities between the price variations produced by the generated and real sequences. The empirical regularities considered include the distribution of the price log-returns, the price volatility, and the heavy-tail of the log-returns distributions. The results show that the order sequences from the generative model are better able to reproduce the statistical behaviour of real price variations than the sequences from the benchmark.
    On Interpretability of Artificial Neural Networks: A Survey. (arXiv:2001.02522v4 [cs.LG] UPDATED)
    (2 min) Deep learning as represented by the artificial deep neural networks (DNNs) has achieved great success in many important areas that deal with text, images, videos, graphs, and so on. However, the black-box nature of DNNs has become one of the primary obstacles for their wide acceptance in mission-critical applications such as medical diagnosis and therapy. Due to the huge potential of deep learning, interpreting neural networks has recently attracted much research attention. In this paper, based on our comprehensive taxonomy, we systematically review recent studies in understanding the mechanism of neural networks, describe applications of interpretability especially in medicine, and discuss future directions of interpretability research, such as in relation to fuzzy logic and brain science.
    A multi-stage semi-supervised improved deep embedded clustering (MS-SSIDEC) method for bearing fault diagnosis under the situation of insufficient labeled samples. (arXiv:2109.13521v1 [cs.LG])
    (2 min) Intelligent data-driven fault diagnosis methods have been widely applied, but most of them need a large number of high-quality labeled samples. Labeling data in actual industrial processes costs a lot of labor and time, which challenges the application of intelligent fault diagnosis methods. To solve this problem, a multi-stage semi-supervised improved deep embedded clustering (MS-SSIDEC) method is proposed for bearing fault diagnosis when labeled samples are insufficient. The method includes three stages: pre-training, deep clustering, and enhanced supervised learning. In the first stage, a skip-connection-based convolutional auto-encoder (SCCAE) is proposed and pre-trained to automatically learn low-dimensional representations. In the second stage, a semi-supervised improved deep embedded clustering (SSIDEC) model that integrates the pre-trained auto-encoder with a clustering layer is proposed for deep clustering. Additionally, virtual adversarial training (VAT) is introduced as a regularization term to overcome overfitting during the model's training. In the third stage, the high-quality clustering results obtained in the second stage are assigned to unlabeled samples as pseudo-labels. The labeled dataset is augmented with those pseudo-labeled samples and used to train a bearing fault discriminative model. The effectiveness of the method is evaluated on the Case Western Reserve University (CWRU) bearing dataset. The results show that the method not only handles semi-supervised learning with a small number of labeled samples but also addresses the unsupervised setting, and it achieves better results than traditional diagnosis methods. This method provides a new research direction for fault diagnosis with limited labeled samples by effectively using unlabeled data.
    DynG2G: An Efficient Stochastic Graph Embedding Method for Temporal Graphs. (arXiv:2109.13441v1 [cs.LG])
    (2 min) Dynamic graph embedding has gained great attention recently due to its capability of learning low-dimensional graph representations for complex temporal graphs with high accuracy. However, recent advances mostly focus on learning node embeddings as deterministic "vectors" for static graphs, disregarding the key temporal dynamics of the graph and the evolving uncertainties associated with node embeddings in the latent space. In this work, we propose an efficient stochastic dynamic graph embedding method (DynG2G) that applies an inductive feed-forward encoder trained with a node triplet-based contrastive loss. Every node at each timestamp is encoded as a time-dependent probabilistic multivariate Gaussian distribution in the latent space, hence we can quantify the node embedding uncertainty on-the-fly. We adopted eight different benchmarks that represent diversity in size (from 96 to 87,626 nodes and from 13,398 to 4,870,863 edges) and diversity in dynamics. We demonstrate via extensive experiments on these eight dynamic graph benchmarks that DynG2G achieves new state-of-the-art performance in capturing the underlying temporal node embeddings. We also demonstrate that DynG2G can predict the evolving node embedding uncertainty, which plays a crucial role in quantifying the intrinsic dimensionality of the dynamical system over time. We obtain a universal relation between the optimal embedding dimension, $L_o$, and the effective dimensionality of uncertainty, $D_u$, and we infer that $L_o=D_u$ for all cases. This implies that the uncertainty quantification approach we employ in DynG2G correctly captures the intrinsic dimensionality of the dynamics of such evolving graphs despite the diverse nature and composition of the graphs at each timestamp. Moreover, this $L_o - D_u$ correlation provides a clear path to select adaptively the optimum embedding size at each timestamp by setting $L \ge D_u$.
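    A hedged sketch of the Gaussian-embedding idea with a triplet-style loss, in the spirit of the abstract: each node is mapped to a mean and a diagonal standard deviation, and triplets push the KL divergence to a nearby node below the KL divergence to a distant node. The encoder, dimensions, and margin are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim, emb_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.mu = nn.Linear(hid_dim, emb_dim)
        self.log_sigma = nn.Linear(hid_dim, emb_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), torch.exp(self.log_sigma(h))          # mean and diagonal std

def diag_gauss_kl(mu1, s1, mu2, s2):
    """KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ), computed per row."""
    v1, v2 = s1 ** 2, s2 ** 2
    return 0.5 * (v1 / v2 + (mu2 - mu1) ** 2 / v2 - 1 + torch.log(v2 / v1)).sum(-1)

def triplet_energy_loss(enc, x_anchor, x_pos, x_neg, margin=1.0):
    mu_a, s_a = enc(x_anchor)
    mu_p, s_p = enc(x_pos)
    mu_n, s_n = enc(x_neg)
    return torch.relu(diag_gauss_kl(mu_a, s_a, mu_p, s_p)
                      - diag_gauss_kl(mu_a, s_a, mu_n, s_n) + margin).mean()

enc = GaussianEncoder(in_dim=32, hid_dim=64, emb_dim=16)
x_a, x_p, x_n = (torch.randn(128, 32) for _ in range(3))         # toy node features
print(triplet_energy_loss(enc, x_a, x_p, x_n).item())
```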
    Exposing Paid Opinion Manipulation Trolls. (arXiv:2109.13726v1 [cs.LG])
    (2 min) Recently, Web forums have been invaded by opinion manipulation trolls. Some trolls try to influence the other users driven by their own convictions, while in other cases they can be organized and paid, e.g., by a political party or a PR agency that gives them specific instructions on what to write. Finding paid trolls automatically using machine learning is a hard task, as there is not enough training data to train a classifier; yet some test data can be obtained, as these trolls are sometimes caught and widely exposed. In this paper, we solve the training data problem by assuming that a user who is called a troll by several different people is likely to be one, and a user who has never been called a troll is unlikely to be one. We compare the profiles of (i) paid trolls vs. (ii) "mentioned" trolls vs. (iii) non-trolls, and we further show that a classifier trained to distinguish (ii) from (iii) also does quite well at telling apart (i) from (iii).
    Convolutional Shapelet Transform: A new approach for time series shapelets. (arXiv:2109.13514v1 [cs.CV])
    (2 min) Shapelet-based algorithms are widely used for time series classification because of their ease of interpretation, but they are currently outperformed, notably by methods using convolutional kernels, capable of reaching state-of-the-art performance while being highly scalable. We present a new formulation of time series shapelets including the notion of dilation, and a shapelet extraction method based on convolutional kernels, which is able to target the discriminant information identified by convolutional kernels. Experiments performed on 108 datasets show that our method improves on the state-of-the-art for shapelet algorithms, and we show that it can be used to interpret results from convolutional kernels.
    Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication. (arXiv:2109.13895v1 [cs.LG])
    (2 min) Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. Enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming in many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.
    Automated Estimation of Construction Equipment Emission using Inertial Sensors and Machine Learning Models. (arXiv:2109.13375v1 [cs.LG])
    (2 min) The construction industry is one of the main producers of greenhouse gases (GHG). Quantifying the amount of air pollutants, including GHG emissions, during a construction project has become an additional project objective alongside traditional metrics such as time, cost, and safety in many parts of the world. A major contributor to air pollution during construction is the use of heavy equipment, and thus efficient operation and management of this equipment can substantially reduce harm to the environment. Although on-road vehicle emission prediction is a widely researched topic, construction equipment emission measurement and reduction have received very little attention. This paper describes the development and deployment of a novel framework that uses machine learning (ML) methods to predict the level of emissions from heavy construction equipment monitored via an Internet of Things (IoT) system comprised of accelerometer and gyroscope sensors. The developed framework was validated using an excavator performing real-world construction work. A portable emission measurement system (PEMS) was employed along with the inertial sensors to record data, including the amounts of CO, NOX, CO2, SO2, and CH4 emitted by the equipment. Different ML algorithms were developed and compared to identify the best model for predicting emission levels from inertial sensor data. The results showed that Random Forest, with coefficients of determination (R2) of 0.94, 0.91, and 0.94 for CO, NOX, and CO2, respectively, was the best algorithm among the models evaluated in this study.
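    A minimal sketch of the regression setup described above, assuming scikit-learn; the synthetic features standing in for windowed accelerometer/gyroscope statistics, the target, and the split are illustrative placeholders, not the paper's data or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for per-window inertial features (means, stds, spectral stats, ...).
X = rng.standard_normal((500, 12))
y = 2.0 * X[:, 0] + X[:, 3] ** 2 + rng.normal(scale=0.1, size=500)   # fake emission level

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R2 on held-out windows:", r2_score(y_te, model.predict(X_te)))
```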
    A Contrastive Learning Approach to Auroral Identification and Classification. (arXiv:2109.13899v1 [cs.CV])
    (2 min) Unsupervised learning algorithms are beginning to achieve accuracies comparable to their supervised counterparts on benchmark computer vision tasks, but their utility for practical applications has not yet been demonstrated. In this work, we present a novel application of unsupervised learning to the task of auroral image classification. Specifically, we modify and adapt the Simple framework for Contrastive Learning of Representations (SimCLR) algorithm to learn representations of auroral images in a recently released auroral image dataset constructed using image data from Time History of Events and Macroscale Interactions during Substorms (THEMIS) all-sky imagers. We demonstrate that (a) simple linear classifiers fit to the learned representations of the images achieve state-of-the-art classification performance, improving the classification accuracy by almost 10 percentage points over the current benchmark; and (b) the learned representations naturally cluster into more groups than there are manually assigned categories, suggesting that existing categorizations are overly coarse and may obscure important connections between auroral types, near-earth solar wind conditions, and geomagnetic disturbances at the earth's surface. Moreover, our model is much lighter than the previous benchmark on this dataset, requiring fewer than 25% of the number of parameters. Our approach exceeds an established threshold for operational purposes, demonstrating readiness for deployment and utilization.
    DOODLER: Determining Out-Of-Distribution Likelihood from Encoder Reconstructions. (arXiv:2109.13237v1 [cs.LG])
    (2 min) Deep Learning models possess two key traits that, in combination, make their use in the real world a risky prospect. One, they do not typically generalize well outside of the distribution for which they were trained, and two, they tend to exhibit confident behavior regardless of whether or not they are producing meaningful outputs. While Deep Learning possesses immense power to solve realistic, high-dimensional problems, these traits in concert make it difficult to have confidence in their real-world applications. To overcome this difficulty, the task of Out-Of-Distribution (OOD) Detection has been defined, to determine when a model has received an input from outside of the distribution for which it is trained to operate. This paper introduces and examines a novel methodology, DOODLER, for OOD Detection, which directly leverages the traits which result in its necessity. By training a Variational Auto-Encoder (VAE) on the same data as another Deep Learning model, the VAE learns to accurately reconstruct In-Distribution (ID) inputs, but not to reconstruct OOD inputs, meaning that its failure state can be used to perform OOD Detection. Unlike other work in the area, DOODLER requires only very weak assumptions about the existence of an OOD dataset, allowing for more realistic application. DOODLER also enables pixel-wise segmentations of input images by OOD likelihood, and experimental results show that it matches or outperforms methodologies that operate under the same constraints.
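    A hedged sketch of the scoring step described in the abstract: score inputs by reconstruction error under a model trained on in-distribution data and flag inputs whose error exceeds a calibrated threshold. A plain autoencoder stands in for the paper's VAE, and the 95th-percentile threshold rule is an illustrative choice, not the authors' procedure.

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def reconstruction_error(model, x):
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)               # per-sample error

model = TinyAE()                                                # pretend this was trained on ID data
x_id = torch.randn(256, 64)                                     # in-distribution samples
threshold = reconstruction_error(model, x_id).quantile(0.95)    # calibrate on ID errors
x_new = torch.randn(16, 64) + 3.0                                # shifted, possibly OOD inputs
flags = reconstruction_error(model, x_new) > threshold
print("fraction flagged OOD:", flags.float().mean().item())
```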
    Adaptive Attribute and Structure Subspace Clustering Network. (arXiv:2109.13742v1 [cs.CV])
    (2 min) Deep self-expressiveness-based subspace clustering methods have demonstrated effectiveness. However, existing works only consider the attribute information to conduct the self-expressiveness, which may limit the clustering performance. In this paper, we propose a novel adaptive attribute and structure subspace clustering network (AASSC-Net) to simultaneously consider the attribute and structure information in an adaptive graph fusion manner. Specifically, we first exploit an auto-encoder to represent input data samples with latent features for the construction of an attribute matrix. We also construct a mixed signed and symmetric structure matrix to capture the local geometric structure underlying data samples. Then, we perform self-expressiveness on the constructed attribute and structure matrices to learn their affinity graphs separately. Finally, we design a novel attention-based fusion module to adaptively leverage these two affinity graphs to construct a more discriminative affinity graph. Extensive experimental results on commonly used benchmark datasets demonstrate that our AASSC-Net significantly outperforms state-of-the-art methods. In addition, we conduct comprehensive ablation studies to discuss the effectiveness of the designed modules. The code will be publicly available at https://github.com/ZhihaoPENG-CityU.
    Multi-Task Triplet Loss for Named Entity Recognition using Supplementary Text. (arXiv:2109.13736v1 [cs.CL])
    (2 min) Retail item data contains many different forms of text, such as the title of an item, the description of an item, the item name, and reviews. It is of interest to identify the item name in the other forms of text using a named entity tagger. However, the title of an item and its description are syntactically different (but semantically similar), in that the title is not necessarily a well-formed sentence while the description is made up of well-formed sentences. In this work, we use a triplet loss to contrast the embeddings of the item title with the description to establish a proof of concept. We find that using the triplet loss in a multi-task NER algorithm improves both the precision and recall by a small percentage. While the improvement is small, we think it is a step in the right direction of using various forms of text in a multi-task algorithm. In addition to precision and recall, the multi-task triplet loss method is also found to significantly improve the exact match accuracy, i.e., the accuracy of tagging the entire set of tokens in the text with correct tags.
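    A minimal sketch of the auxiliary triplet objective: pull an item title's embedding toward its own description and away from another item's description. The encoder, feature dimensions, batch construction, and the combination weight are illustrative assumptions; the paper adds this term to a standard NER loss in a multi-task setup.

```python
import torch
import torch.nn as nn

emb_dim = 128
encoder = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, emb_dim))
triplet = nn.TripletMarginLoss(margin=1.0)

title_feats = torch.randn(32, 300)        # anchors: pooled title representations (toy)
own_desc_feats = torch.randn(32, 300)     # positives: matching descriptions
other_desc_feats = torch.randn(32, 300)   # negatives: descriptions of other items

aux_loss = triplet(encoder(title_feats), encoder(own_desc_feats), encoder(other_desc_feats))
# total_loss = ner_loss + lambda_triplet * aux_loss   # multi-task combination (weight assumed)
print(aux_loss.item())
```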
    TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. (arXiv:2102.07988v2 [cs.LG] UPDATED)
    (2 min) Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to their autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe
    Conjecturing-Based Computational Discovery of Patterns in Data. (arXiv:2011.11576v2 [cs.LG] UPDATED)
    (2 min) Modern machine learning methods are designed to exploit complex patterns in data regardless of their form, while not necessarily revealing them to the investigator. Here we demonstrate situations where modern machine learning methods are ill-equipped to reveal feature interaction effects and other nonlinear relationships. We propose the use of a conjecturing machine that generates feature relationships in the form of bounds for numerical features and boolean expressions for nominal features that are ignored by machine learning algorithms. The proposed framework is demonstrated for a classification problem with an interaction effect and a nonlinear regression problem. In both settings, true underlying relationships are revealed and generalization performance improves. The framework is then applied to patient-level data regarding COVID-19 outcomes to suggest possible risk factors.
    Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy. (arXiv:2106.01100v2 [eess.IV] UPDATED)
    (3 min) During lung cancer radiotherapy, the position of infrared reflective objects on the chest can be recorded to estimate the tumor location. However, radiotherapy systems usually have a latency inherent to robot control limitations that impedes the radiation delivery precision. Not taking this phenomenon into account may cause unwanted damage to healthy tissues and lead to side effects such as radiation pneumonitis. In this research, we use nine observation records of the three-dimensional position of three external markers on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The sampling frequency is equal to 10Hz and the amplitudes of the recorded trajectories range from 6mm to 40mm in the superior-inferior direction. We forecast the location of each marker simultaneously with a horizon value (the time interval in advance for which the prediction is made) between 0.1s and 2.0s, using a recurrent neural network (RNN) trained with unbiased online recurrent optimization (UORO). We compare its performance with an RNN trained with real-time recurrent learning, least mean squares (LMS), and offline linear regression. Training and cross-validation are performed during the first minute of each sequence. On average, UORO achieves the lowest root-mean-square (RMS) and maximum error, equal respectively to 1.3mm and 8.8mm, with a prediction time per time step lower than 2.8ms (Dell, Intel Core i9-9900K, 3.60 GHz). Linear regression has the lowest RMS error for the horizon values 0.1s and 0.2s, followed by LMS for horizon values between 0.3s and 0.5s, and UORO for horizon values greater than 0.6s.
    COV-ELM classifier: An Extreme Learning Machine based identification of COVID-19 using Chest X-Ray Images. (arXiv:2007.08637v6 [eess.IV] UPDATED)
    (3 min) Coronaviruses constitute a family of viruses that gives rise to respiratory diseases. As COVID-19 is highly contagious, early diagnosis of COVID-19 is crucial for an effective treatment strategy. However, the RT-PCR test, which is considered the gold standard in the diagnosis of COVID-19, suffers from a high false-negative rate. Chest X-ray (CXR) image analysis has emerged as a feasible and effective diagnostic technique towards this objective. In this work, we formulate the COVID-19 classification problem as a three-class classification problem to distinguish between COVID-19, normal, and pneumonia classes. We propose a three-stage framework, named COV-ELM. Stage one deals with preprocessing and transformation, while stage two deals with feature extraction. These extracted features are passed as input to the ELM at the third stage, resulting in the identification of COVID-19. The choice of ELM in this work has been motivated by its faster convergence, better generalization capability, and shorter training time in comparison to conventional gradient-based learning algorithms. As bigger and more diverse datasets become available, ELM can be retrained quickly compared to its gradient-based competitor models. The proposed model achieved a macro-average F1-score of 0.95 and an overall sensitivity of $0.94 \pm 0.02$ at a 95% confidence interval. When compared to state-of-the-art machine learning algorithms, COV-ELM is found to outperform its competitors in this three-class classification scenario. Further, LIME has been integrated with the proposed COV-ELM model to generate annotated CXR images. The annotations are based on the superpixels that have contributed to distinguishing between the different classes. It was observed that the superpixels correspond to the regions of the human lungs that are clinically observed in COVID-19 and pneumonia cases.
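    A hedged sketch of the core ELM mechanism the abstract relies on: a random hidden layer followed by a closed-form least-squares output layer. The preprocessing and feature-extraction stages of COV-ELM are not reproduced; the hidden size, sigmoid activation, and toy data are illustrative.

```python
import numpy as np

class SimpleELM:
    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))    # sigmoid random features

    def fit(self, X, y):
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))   # fixed random weights
        self.b = self.rng.standard_normal(self.n_hidden)
        Y = np.eye(y.max() + 1)[y]                              # one-hot targets
        self.beta = np.linalg.pinv(self._hidden(X)) @ Y         # closed-form output weights
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta).argmax(axis=1)

X = np.random.default_rng(1).standard_normal((300, 50))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)       # toy 3-class labels
print("train accuracy:", (SimpleELM().fit(X, y).predict(X) == y).mean())
```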
    From Server-Based to Client-Based Machine Learning: A Comprehensive Survey. (arXiv:1909.08329v2 [cs.LG] UPDATED)
    (2 min) In recent years, mobile devices have developed rapidly, with stronger computation capability and larger storage space. Some computation-intensive machine learning tasks can now be run on mobile devices. To exploit the resources available on mobile devices and preserve personal privacy, the concept of client-based machine learning has been proposed. It leverages the users' local hardware and local data to solve machine learning sub-problems on mobile devices and only uploads computation results rather than the original data for the optimization of the global model. Such an architecture can not only relieve computation and storage burdens on servers but also protect the users' sensitive information. Another benefit is the bandwidth reduction, because various kinds of local data can be involved in the training process without being uploaded. In this article, we provide a literature review on the progressive development of machine learning from server-based to client-based. We revisit a number of widely used server-based and client-based machine learning methods and applications. We also extensively discuss the challenges and future directions in this area. We believe that this survey will give a clear overview of client-based machine learning and provide guidelines on applying client-based machine learning to practice.
    Projection Robust Wasserstein Barycenters. (arXiv:2102.03390v4 [cs.LG] UPDATED)
    (2 min) Collecting and aggregating information from several probability measures or histograms is a fundamental task in machine learning. One of the popular solution methods for this task is to compute the barycenter of the probability measures under the Wasserstein metric. However, approximating the Wasserstein barycenter is numerically challenging because of the curse of dimensionality. This paper proposes the projection robust Wasserstein barycenter (PRWB) that has the potential to mitigate the curse of dimensionality. Since PRWB is numerically very challenging to solve, we further propose a relaxed PRWB (RPRWB) model, which is more tractable. The RPRWB projects the probability measures onto a lower-dimensional subspace that maximizes the Wasserstein barycenter objective. The resulting problem is a max-min problem over the Stiefel manifold. By combining the iterative Bregman projection algorithm and Riemannian optimization, we propose two new algorithms for computing the RPRWB. The complexity of arithmetic operations of the proposed algorithms for obtaining an $\epsilon$-stationary solution is analyzed. We incorporate the RPRWB into a discrete distribution clustering algorithm, and the numerical results on real text datasets confirm that our RPRWB model helps improve the clustering performance significantly.
    Understanding How Over-Parametrization Leads to Acceleration: A case of learning a single teacher neuron. (arXiv:2010.01637v3 [cs.LG] UPDATED)
    (2 min) Over-parametrization has become a popular technique in deep learning. It is observed that, with over-parametrization, a larger neural network needs fewer training iterations than a smaller one to achieve a certain level of performance -- namely, over-parametrization leads to acceleration in optimization. However, although over-parametrization is widely used nowadays, little theory is available to explain the acceleration it brings. In this paper, we propose understanding it by studying a simple problem first. Specifically, we consider the setting in which there is a single teacher neuron with quadratic activation, and over-parametrization is realized by having multiple student neurons learn the data generated from the teacher neuron. We provably show that over-parametrization helps the iterate generated by gradient descent to enter the neighborhood of a global optimal solution that achieves zero testing error faster. On the other hand, we also point out an issue regarding the necessity of over-parametrization and study how the scaling of the output neurons affects the convergence time.
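    A toy sketch of the teacher-student setting described above, under assumed dimensions, learning rate, and initialization: one quadratic teacher neuron generates the labels and m student neurons are fit by gradient descent, so one can compare the final training loss for several widths. This is a numerical illustration of the setup, not the paper's analysis.

```python
import numpy as np

def train_students(m, d=10, n=200, steps=3000, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w_teacher = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    y = (X @ w_teacher) ** 2                        # labels from a quadratic teacher neuron
    W = 0.1 * rng.standard_normal((m, d))           # m student neurons (over-parametrized if m > 1)
    for _ in range(steps):
        scores = X @ W.T                            # (n, m) pre-activations
        pred = (scores ** 2).mean(axis=1)           # averaged quadratic student outputs
        resid = pred - y
        # Gradient of the mean squared error with respect to each student weight vector.
        grad = (4.0 / (n * m)) * (resid[:, None] * scores).T @ X
        W -= lr * grad
    pred = ((X @ W.T) ** 2).mean(axis=1)
    return float(np.mean((pred - y) ** 2))

for m in (1, 5, 50):
    print("students:", m, "final training MSE:", train_students(m))
```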
    HeadlineCause: A Dataset of News Headlines for Detecting Causalities. (arXiv:2108.12626v2 [cs.CL] UPDATED)
    (2 min) Detecting implicit causal relations in texts is a task that requires both common sense and world knowledge. Existing datasets are focused either on commonsense causal reasoning or explicit causal relations. In this work, we present HeadlineCause, a dataset for detecting implicit causal relations between pairs of news headlines. The dataset includes over 5000 headline pairs from English news and over 9000 headline pairs from Russian news labeled through crowdsourcing. The pairs vary from totally unrelated or belonging to the same general topic to the ones including causation and refutation relations. We also present a set of models and experiments that demonstrates the dataset validity, including a multilingual XLM-RoBERTa based model for causality detection and a GPT-2 based model for possible effects prediction.
    Conditional Extreme Value Theory for Open Set Video Domain Adaptation. (arXiv:2109.00522v2 [cs.CV] UPDATED)
    (2 min) With the advent of media streaming, video action recognition has become progressively important for various applications, yet at the high expense of requiring large-scale data labelling. To overcome the problem of expensive data labelling, domain adaptation techniques have been proposed that transfers knowledge from fully labelled data (i.e., source domain) to unlabelled data (i.e., target domain). The majority of video domain adaptation algorithms are proposed for closed-set scenarios in which all the classes are shared among the domains. In this work, we propose an open-set video domain adaptation approach to mitigate the domain discrepancy between the source and target data, allowing the target data to contain additional classes that do not belong to the source domain. Different from previous works, which only focus on improving accuracy for shared classes, we aim to jointly enhance the alignment of shared classes and recognition of unknown samples. Towards this goal, class-conditional extreme value theory is applied to enhance the unknown recognition. Specifically, the entropy values of target samples are modelled as generalised extreme value distributions, which allows separating unknown samples lying in the tail of the distribution. To alleviate the negative transfer issue, weights computed by the distance from the sample entropy to the threshold are leveraged in adversarial learning in the sense that confident source and target samples are aligned, and unconfident samples are pushed away. The proposed method has been thoroughly evaluated on both small-scale and large-scale cross-domain video datasets and achieved the state-of-the-art performance.
    Anomaly Detection With Partitioning Overfitting Autoencoder Ensembles. (arXiv:2009.02755v8 [cs.LG] UPDATED)
    (2 min) In this paper, we propose POTATOES (Partitioning OverfiTting AuTOencoder EnSemble), a new method for unsupervised outlier detection (UOD). More precisely, given any autoencoder for UOD, this technique can be used to improve its accuracy while at the same time removing the burden of tuning its regularization. The idea is to not regularize at all, but to rather randomly partition the data into sufficiently many equally sized parts, overfit each part with its own autoencoder, and to use the maximum over all autoencoder reconstruction errors as the anomaly score. We apply our model to various realistic datasets and show that if the set of inliers is dense enough, our method indeed improves the UOD performance of a given autoencoder significantly. For reproducibility, the code is made available on github so the reader can recreate the results in this paper as well as apply the method to other autoencoders and datasets.
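    A minimal sketch of the idea described above, under assumed architecture and training settings: randomly partition the data, deliberately overfit one unregularized autoencoder per part, and score a point by the maximum reconstruction error over the ensemble. The tiny autoencoder, number of parts, and epochs are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

def make_ae(dim):
    return nn.Sequential(nn.Linear(dim, 8), nn.ReLU(), nn.Linear(8, dim))

def fit_potatoes(X, n_parts=4, epochs=200, lr=1e-2):
    parts = np.array_split(np.random.permutation(len(X)), n_parts)
    models = []
    for part in parts:
        ae = make_ae(X.shape[1])
        opt = torch.optim.Adam(ae.parameters(), lr=lr)
        xb = torch.tensor(X[part], dtype=torch.float32)
        for _ in range(epochs):                      # overfit on purpose, no regularization
            opt.zero_grad()
            loss = ((ae(xb) - xb) ** 2).mean()
            loss.backward()
            opt.step()
        models.append(ae)
    return models

def anomaly_score(models, x):
    xt = torch.tensor(x, dtype=torch.float32)
    with torch.no_grad():
        errs = [((m(xt) - xt) ** 2).mean(dim=1) for m in models]
    return torch.stack(errs).max(dim=0).values       # max reconstruction error over the ensemble

X = np.random.default_rng(0).standard_normal((400, 16)).astype("float32")
models = fit_potatoes(X)
print(anomaly_score(models, X[:5]))
```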
    Open Knowledge Graphs Canonicalization using Variational Autoencoders. (arXiv:2012.04780v2 [cs.CL] UPDATED)
    (2 min) Noun phrases and Relation phrases in open knowledge graphs are not canonicalized, leading to an explosion of redundant and ambiguous subject-relation-object triples. Existing approaches to solve this problem take a two-step approach. First, they generate embedding representations for both noun and relation phrases, then a clustering algorithm is used to group them using the embeddings as features. In this work, we propose Canonicalizing Using Variational Autoencoders (CUVA), a joint model to learn both embeddings and cluster assignments in an end-to-end approach, which leads to a better vector representation for the noun and relation phrases. Our evaluation over multiple benchmarks shows that CUVA outperforms the existing state-of-the-art approaches. Moreover, we introduce CanonicNell, a novel dataset to evaluate entity canonicalization systems.
    ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering. (arXiv:2011.10331v3 [cs.CV] UPDATED)
    (2 min) Multi-view clustering has wide applications in many image processing scenarios. In these scenarios, original image data often contain missing instances and noise, which is ignored by most multi-view clustering methods. However, missing instances may make these methods difficult to use directly, and noise will lead to unreliable clustering results. In this paper, we propose a novel Auto-weighted Noisy and Incomplete Multi-view Clustering framework (ANIMC) via a soft auto-weighted strategy and a doubly soft regularized regression model. Firstly, by designing an adaptive semi-regularized nonnegative matrix factorization (adaptive semi-RNMF), the soft auto-weighted strategy assigns a proper weight to each view and adds a soft boundary to balance the influence of noise and incompleteness. Secondly, by proposing the $\theta$-norm, the doubly soft regularized regression model adjusts the sparsity of our model by choosing different values of $\theta$. Compared with existing methods, ANIMC has three unique advantages: 1) it is a soft algorithm that adjusts our framework to different scenarios, thereby improving its generalization ability; 2) it automatically learns a proper weight for each view, thereby reducing the influence of noise; 3) it performs doubly soft regularized regression that aligns the same instances in different views, thereby decreasing the impact of missing instances. Extensive experimental results demonstrate its superior advantages over other state-of-the-art methods.
    Predicting Hidden Links and Missing Nodes in Scale-Free Networks with Artificial Neural Networks. (arXiv:2109.12331v1 [cs.SI] CROSS LISTED)
    (2 min) Many real-life networks take the form of scale-free networks, such as the World Wide Web, protein-protein interaction networks, semantic networks, airline networks, and interbank payment networks. To analyze these networks, it is necessary to understand the properties of scale-free networks. By using these properties, we can identify various types of anomalies in such networks. In this research, we propose a methodology, in the form of an algorithm, to predict hidden links and missing nodes in scale-free networks. It combines a generator of random networks, as a source of training data, with artificial neural networks for supervised classification. The neural networks are trained to discriminate between different subtypes of scale-free networks and to predict the missing nodes and hidden links among (present and missing) nodes in a given scale-free network. We chose Bela Bollobas's directed scale-free random graph generation algorithm as the generator of random networks to produce a large set of scale-free network data.
    Physics-guided Neural Networks (PGNN): An Application in Lake Temperature Modeling. (arXiv:1710.11431v3 [cs.LG] UPDATED)
    (2 min) This paper introduces a framework for combining scientific knowledge of physics-based models with neural networks to advance scientific discovery. This framework, termed physics-guided neural networks (PGNN), leverages the output of physics-based model simulations along with observational features in a hybrid modeling setup to generate predictions using a neural network architecture. Further, this framework uses physics-based loss functions in the learning objective of neural networks to ensure that the model predictions not only show lower errors on the training set but are also scientifically consistent with the known physics on the unlabeled set. We illustrate the effectiveness of PGNN for the problem of lake temperature modeling, where physical relationships between the temperature, density, and depth of water are used to design a physics-based loss function. By using scientific knowledge to guide the construction and learning of neural networks, we are able to show that the proposed framework ensures better generalizability as well as scientific consistency of results. All the code and datasets used in this study have been made available on this link \url{https://github.com/arkadaw9/PGNN}.
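    A hedged sketch of a physics-guided loss in the spirit of the framework described above: a supervised term plus a penalty that discourages physically inconsistent predictions on unlabeled inputs. The generic monotonicity-with-depth penalty, network, and weighting factor below are illustrative assumptions; the paper's actual density-temperature relation is not reproduced.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))

def pgnn_loss(x_labeled, y_labeled, x_unlabeled_by_depth, lam=0.1):
    data_loss = ((net(x_labeled) - y_labeled) ** 2).mean()          # supervised error
    pred = net(x_unlabeled_by_depth).squeeze(-1)                    # rows assumed sorted by depth
    physics_violation = torch.relu(pred[:-1] - pred[1:])            # penalize decreases with depth
    return data_loss + lam * physics_violation.mean()

x_l, y_l = torch.randn(64, 3), torch.randn(64, 1)                   # labeled samples (toy)
x_u = torch.randn(128, 3)                                           # unlabeled samples (toy)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
opt.zero_grad()
loss = pgnn_loss(x_l, y_l, x_u)
loss.backward()
opt.step()
print(loss.item())
```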
    Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. (arXiv:2008.07146v4 [cs.LG] UPDATED)
    (2 min) Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy. Because of its huge potential impact in practice, there has been growing research interest in this field. There is, however, no real-world public dataset that enables the evaluation of OPE, making its experimental studies unrealistic and irreproducible. With the goal of enabling realistic and reproducible OPE research, we publicize Open Bandit Dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN. Our dataset is unique in that it contains a set of multiple logged bandit feedback datasets collected by running different policies on the same platform. This enables experimental comparisons of different OPE estimators for the first time. We also develop Python software called Open Bandit Pipeline to streamline and standardize the implementation of batch bandit algorithms and OPE. Our open data and software will contribute to the fair and transparent OPE research and help the community identify fruitful research directions. We provide extensive benchmark experiments of existing OPE estimators using our dataset and software. The results open up essential challenges and new avenues for future OPE research.
    Decentralized Deep Learning for Mobile Edge Computing: A Survey on Communication Efficiency and Trustworthiness. (arXiv:2108.03980v2 [cs.DC] UPDATED)
    (2 min) A wider coverage and a better solution to latency reduction in 5G necessitate its combination with mobile edge computing (MEC) technology. Decentralized deep learning (DDL) is a promising solution for privacy-preserving data processing across millions of smart edge devices: it leverages federated learning within a network of local models, without disclosing a client's raw data. In particular, in industries such as finance and healthcare, where sensitive data such as transactions and personal medical records are cautiously maintained, DDL facilitates collaboration among these institutions to improve the performance of local models while protecting the data privacy of participating clients. In this survey paper, we present the technical fundamentals of DDL that benefit many walks of society through decentralized learning. Furthermore, we offer a comprehensive overview of recent challenges of DDL and the most relevant solutions, from the novel perspectives of communication efficiency and trustworthiness.
    Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data. (arXiv:2104.12678v4 [cs.NI] UPDATED)
    (2 min) Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in cloud-based machine learning solutions while preserving data privacy. Unfortunately, the learning performance of FEEL may be compromised due to limited training data in a single edge cluster. In this paper, we investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL). By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency while improving the learning performance by accessing richer training data from multiple edge clusters. A training algorithm for SD-FEEL with three main procedures in each round is presented, including local model updates and intra-cluster and inter-cluster model aggregations, and it is proved to converge on non-independent and identically distributed (non-IID) data. We also characterize the interplay between the network topology of the edge servers and the communication overhead of inter-cluster model aggregation on the training performance. Experimental results corroborate our analysis and demonstrate the effectiveness of SD-FEEL in achieving faster convergence than traditional federated learning architectures. Besides, guidelines on choosing critical hyper-parameters of the training algorithm are also provided.
    Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations. (arXiv:2109.13059v2 [cs.CL] UPDATED)
    (2 min) In NLP, a large volume of tasks involve pairwise comparison between two sequences (e.g., sentence similarity and paraphrase identification). Predominantly, two formulations are used for sentence-pair tasks: bi-encoders and cross-encoders. Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient; however, they usually underperform cross-encoders. Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance, but they require task fine-tuning and are computationally more expensive. In this paper, we present a completely unsupervised sentence representation model, termed Trans-Encoder, that combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders. Specifically, on top of a pre-trained language model (PLM), we first convert it into an unsupervised bi-encoder, and then alternate between the bi- and cross-encoder task formulations. In each alternation, one task formulation produces pseudo-labels which are used as learning signals for the other task formulation. We then propose an extension to conduct this self-distillation approach on multiple PLMs in parallel and use the average of their pseudo-labels for mutual distillation. Trans-Encoder creates, to the best of our knowledge, the first completely unsupervised cross-encoder and also a state-of-the-art unsupervised bi-encoder for sentence similarity. Both the bi-encoder and cross-encoder formulations of Trans-Encoder outperform recently proposed state-of-the-art unsupervised sentence encoders such as Mirror-BERT and SimCSE by up to 5% on the sentence similarity benchmarks.
    Self-Supervised Learning in Multi-Task Graphs through Iterative Consensus Shift. (arXiv:2103.14417v2 [cs.LG] UPDATED)
    (2 min) The human ability to synchronize the feedback from all their senses inspired recent works in multi-task and multi-modal learning. While these works rely on expensive supervision, our multi-task graph requires only pseudo-labels from expert models. Every graph node represents a task, and each edge learns transformations between tasks. Once initialized, the graph learns in a self-supervised manner, based on a novel consensus shift algorithm that intelligently exploits the agreement between graph pathways to generate new pseudo-labels for the next learning cycle. We demonstrate significant improvement from one unsupervised learning iteration to the next, outperforming related recent methods in extensive multi-task learning experiments on two challenging datasets.
    Deep Phasor Networks: Connecting Conventional and Spiking Neural Networks. (arXiv:2106.11908v2 [cs.NE] UPDATED)
    (2 min) In this work, we extend standard neural networks by building upon the assumption that neuronal activations correspond to the angle of a complex number lying on the unit circle, or 'phasor.' Each layer in such a network produces new activations by taking a weighted superposition of the previous layer's phases and calculating the new phase value. This generalized architecture allows models to reach high accuracy and carries the singular advantage that mathematically equivalent versions of the network can be executed with or without regard to a temporal variable. Importantly, the value of a phase angle in the temporal domain can be sparsely represented by a periodically repeating series of delta functions, or 'spikes'. We demonstrate the atemporal training of a phasor network on standard deep learning tasks and show that these networks can then be executed in either the traditional atemporal domain or the spiking temporal domain with no conversion step needed. This provides a novel basis for constructing deep networks which operate via temporal, spike-based calculations suitable for neuromorphic computing hardware.
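    A small numerical sketch of the atemporal view of a phasor layer as described in the abstract: activations are angles on the unit circle, a layer forms a weighted superposition of the previous layer's unit phasors, and the new activation is the angle of that sum. Layer sizes and the random weights are illustrative; training and the spiking execution mode are not shown.

```python
import numpy as np

def phasor_layer(phases, W):
    """phases: (batch, n_in) angles in radians; W: (n_in, n_out) real weights."""
    z = np.exp(1j * phases) @ W            # weighted superposition of unit phasors
    return np.angle(z)                     # new phase angles in (-pi, pi]

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=(4, 16))     # input phases for a batch of 4
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((8, 3))
out = phasor_layer(phasor_layer(x, W1), W2)      # two stacked phasor layers
print(out.shape)                                  # (4, 3)
```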
    JFB: Jacobian-Free Backpropagation for Implicit Networks. (arXiv:2103.12803v3 [cs.LG] UPDATED)
    (2 min) A promising trend in deep learning replaces traditional feedforward networks with implicit networks. Unlike traditional networks, implicit networks solve a fixed point equation to compute inferences. Solving for the fixed point varies in complexity, depending on provided data and an error tolerance. Importantly, implicit networks may be trained with fixed memory costs in stark contrast to feedforward networks, whose memory requirements scale linearly with depth. However, there is no free lunch -- backpropagation through implicit networks often requires solving a costly Jacobian-based equation arising from the implicit function theorem. We propose Jacobian-Free Backpropagation (JFB), a fixed-memory approach that circumvents the need to solve Jacobian-based equations. JFB makes implicit networks faster to train and significantly easier to implement, without sacrificing test accuracy. Our experiments show implicit networks trained with JFB are competitive with feedforward networks and prior implicit networks given the same number of parameters.
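    A hedged sketch of the Jacobian-free idea on a toy implicit layer: find the fixed point z* = f(z*, x) without tracking gradients, then take one final differentiable application of f at z* and backpropagate through that single step, skipping the Jacobian-based linear solve. Layer sizes and the iteration count are illustrative, not the paper's setup.

```python
import torch
import torch.nn as nn

class ImplicitLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin_z = nn.Linear(dim, dim)
        self.lin_x = nn.Linear(dim, dim)

    def f(self, z, x):
        return torch.tanh(self.lin_z(z) + self.lin_x(x))

    def forward(self, x, n_iter=30):
        z = torch.zeros_like(x)
        with torch.no_grad():                      # fixed-point iteration, no graph built
            for _ in range(n_iter):
                z = self.f(z, x)
        return self.f(z.detach(), x)               # one tracked application at the fixed point

layer = ImplicitLayer(16)
x = torch.randn(8, 16)
loss = layer(x).pow(2).mean()
loss.backward()                                    # gradients flow through the single step only
print(layer.lin_x.weight.grad.norm().item())
```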
    Head2Head++: Deep Facial Attributes Re-Targeting. (arXiv:2006.10199v2 [cs.CV] UPDATED)
    (2 min) Facial video re-targeting is a challenging problem aiming to modify the facial attributes of a target subject in a seamless manner by a driving monocular sequence. We leverage the 3D geometry of faces and Generative Adversarial Networks (GANs) to design a novel deep learning architecture for the task of facial and head reenactment. Our method is different to purely 3D model-based approaches, or recent image-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames. We manage to capture the complex non-rigid facial motion from the driving monocular performances and synthesise temporally consistent videos, with the aid of a sequential Generator and an ad-hoc Dynamics Discriminator network. We conduct a comprehensive set of quantitative and qualitative tests and demonstrate experimentally that our proposed method can successfully transfer facial expressions, head pose and eye gaze from a source video to a target subject, in a photo-realistic and faithful fashion, better than other state-of-the-art methods. Most importantly, our system performs end-to-end reenactment in nearly real-time speed (18 fps).
    A Variational Inequality Approach to Bayesian Regression Games. (arXiv:2103.13509v2 [cs.LG] UPDATED)
    (2 min) Bayesian regression games are a special class of two-player general-sum Bayesian games in which the learner is partially informed about the adversary's objective through a Bayesian prior. This formulation captures the uncertainty in regard to the adversary, and is useful in problems where the learner and adversary may have conflicting, but not necessarily perfectly antagonistic objectives. Although the Bayesian approach is a more general alternative to the standard minimax formulation, the applications of Bayesian regression games have been limited due to computational difficulties, and the existence and uniqueness of a Bayesian equilibrium are only known for quadratic cost functions. First, we prove the existence and uniqueness of a Bayesian equilibrium for a class of convex and smooth Bayesian games by regarding it as a solution of an infinite-dimensional variational inequality (VI) in Hilbert space. We consider two special cases in which the infinite-dimensional VI reduces to a high-dimensional VI or a nonconvex stochastic optimization, and provide two simple algorithms of solving them with strong convergence guarantees. Numerical results on real datasets demonstrate the promise of this approach.
    Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems. (arXiv:2002.05814v2 [cs.DC] UPDATED)
    (2 min) Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models that are difficult to execute efficiently with traditional collective communication by up to 7.8x, 3.9x, and 3.3x, respectively.
    Trustworthy AI and Robotics and the Implications for the AEC Industry: A Systematic Literature Review and Future Potentials. (arXiv:2109.13373v1 [cs.HC])
    (2 min) Human-technology interaction deals with trust as an inevitable requirement for user acceptance. As the applications of artificial intelligence (AI) and robotics emerge and with their ever-growing socio-economic influence in various fields of research and practice, there is an imminent need to study trust in such systems. With the opaque work mechanism of AI-based systems and the prospect of intelligent robots as workers' companions, context-specific interdisciplinary studies on trust are key in increasing their adoption. Through a thorough systematic literature review on (1) trust in AI and robotics (AIR) and (2) AIR applications in the architecture, engineering, and construction (AEC) industry, this study identifies common trust dimensions in the literature and uses them to organize the paper. Furthermore, the connections of the identified dimensions to the existing and potential AEC applications are determined and discussed. Finally, major future directions on trustworthy AI and robotics in AEC research and practice are outlined.
    A biconvex optimization for solving semidefinite programs via bilinear factorization. (arXiv:1811.01198v8 [cs.LG] UPDATED)
    (2 min) Many problems in machine learning can be reduced to learning a low-rank positive semidefinite matrix (denoted as $Z$), which leads to a semidefinite program (SDP). Existing SDP solvers based on classical convex optimization are too expensive for large-scale problems. Exploiting the low rank of the solution, Burer-Monteiro's method reformulated the SDP as a nonconvex problem via the $quadratic$ factorization ($Z$ as $XX^\top$). However, this loses the structure of the problem during optimization. In this paper, we propose to convert the SDP into a biconvex problem via the $bilinear$ factorization ($Z$ as $XY^\top$), while adding the term $\frac{\gamma}{2}||X-Y||_F^2$ to penalize the difference between $X$ and $Y$. Thus, the biconvex structure (w.r.t. $X$ and $Y$) can be exploited naturally in optimization. As a theoretical result, we provide a bound on the penalty parameter $\gamma$ under the assumptions of $L$-Lipschitz smoothness and $\sigma$-strong biconvexity, such that, at stationary points, the proposed bilinear factorization is equivalent to Burer-Monteiro's factorization when the bound is attained, that is, $\gamma>\frac{1}{4}(L-\sigma)_+$. Our proposal opens up a new way to surrogate an SDP by a biconvex program. Experiments on two SDP-related applications demonstrate that the proposed method is as effective as the state-of-the-art.
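    A toy sketch of the bilinear factorization with the agreement penalty: represent the low-rank PSD variable $Z$ as $XY^\top$, add $\frac{\gamma}{2}||X-Y||_F^2$ so the two factors stay close, and run simple gradient updates. The objective here is a low-rank PSD approximation problem chosen for illustration, not one of the paper's SDP applications; $\gamma$, the rank, and step counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, gamma, lr, steps = 20, 3, 1.0, 0.01, 2000
A = rng.standard_normal((n, r))
M = A @ A.T                                        # target PSD matrix to approximate

X = 0.1 * rng.standard_normal((n, r))
Y = 0.1 * rng.standard_normal((n, r))
for _ in range(steps):
    R = X @ Y.T - M                                # residual of the bilinear fit 0.5*||XY^T - M||_F^2
    grad_X = R @ Y + gamma * (X - Y)               # gradient w.r.t. X (fit term + penalty)
    grad_Y = R.T @ X - gamma * (X - Y)             # gradient w.r.t. Y
    X -= lr * grad_X
    Y -= lr * grad_Y

print("fit error:", np.linalg.norm(X @ Y.T - M), "factor gap:", np.linalg.norm(X - Y))
```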
    Byzantine-robust Federated Learning through Spatial-temporal Analysis of Local Model Updates. (arXiv:2107.01477v2 [cs.LG] UPDATED)
    (2 min) Federated Learning (FL) enables multiple distributed clients (e.g., mobile devices) to collaboratively train a centralized model while keeping the training data locally on the client. Compared to traditional centralized machine learning, FL offers many favorable features such as offloading operations which would usually be performed by a central server and reducing risks of serious privacy leakage. However, Byzantine clients that send incorrect or disruptive updates due to system failures or adversarial attacks may disturb the joint learning process, consequently degrading the performance of the resulting model. In this paper, we propose to mitigate these failures and attacks from a spatial-temporal perspective. Specifically, we use a clustering-based method to detect and exclude incorrect updates by leveraging their geometric properties in the parameter space. Moreover, to further handle malicious clients with time-varying behaviors, we propose to adaptively adjust the learning rate according to momentum-based update speculation. Extensive experiments on 4 public datasets demonstrate that our algorithm achieves enhanced robustness compared to existing methods under both cross-silo and cross-device FL settings with faulty/malicious clients.
    Deep graph matching meets mixed-integer linear programming: Relax at your own risk ?. (arXiv:2108.00394v3 [cs.CV] UPDATED)
    (2 min) Graph matching is an important problem that has received widespread attention, especially in the field of computer vision. Recently, state-of-the-art methods seek to incorporate graph matching with deep learning. However, there is no research explaining what role the graph matching algorithm plays in the model. Therefore, we propose an approach integrating a MILP formulation of the graph matching problem. This formulation is solved to optimality and provides an inherent baseline. Meanwhile, similar approaches are derived by relaxing the optimality guarantee of the graph matching solver and by introducing a quality level. This quality level controls the quality of the solutions provided by the graph matching solver. In addition, several relaxations of the graph matching problem are put to the test. Our experimental evaluation gives several theoretical insights and guides the direction of deep graph matching methods.
    Learning Rates for Multi-task Regularization Networks. (arXiv:2104.00453v3 [cs.LG] UPDATED)
    (2 min) Multi-task learning is an important trend of machine learning in facing the era of artificial intelligence and big data. Despite a large amount of research on learning rate estimates for various single-task machine learning algorithms, there is little parallel work for multi-task learning. We present a mathematical analysis of the learning rate estimate of multi-task learning based on the theory of vector-valued reproducing kernel Hilbert spaces and matrix-valued reproducing kernels. For the typical multi-task regularization networks, an explicit learning rate dependent on both the number of sample data and the number of tasks is obtained. It reveals that the generalization ability of multi-task learning algorithms is indeed affected as the number of tasks increases.
    Online Robust and Adaptive Learning from Data Streams. (arXiv:2007.12160v2 [stat.ML] UPDATED)
    (2 min) In online learning from non-stationary data streams, it is necessary to learn robustly in the presence of outliers and to adapt quickly to changes in the underlying data generating mechanism. In this paper, we refer to the former attribute of online learning algorithms as robustness and to the latter as adaptivity. There is an obvious tradeoff between the two attributes. It is a fundamental issue to quantify and evaluate the tradeoff because it provides important information on the data generating mechanism. However, no previous work has considered the tradeoff quantitatively. We propose a novel algorithm called the stochastic approximation-based robustness-adaptivity algorithm (SRA) to evaluate the tradeoff. The key idea of SRA is to update the parameters of a distribution or its sufficient statistics with a biased stochastic approximation scheme, while dropping data points with large values of the stochastic update. We address the relation between the two parameters: one is the step size of the stochastic approximation, and the other is the threshold parameter on the norm of the stochastic update. The former controls the adaptivity and the latter controls the robustness. We give a theoretical analysis for the non-asymptotic convergence of SRA in the presence of outliers, which depends on both the step size and the threshold parameter. Because SRA is formulated on the majorization-minimization principle, it is a general algorithm that includes many algorithms, such as the online EM algorithm and stochastic gradient descent. Empirical experiments on both synthetic and real datasets demonstrated that SRA was superior to previous methods.
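    A schematic sketch of the thresholding idea behind SRA (not the authors' implementation), shown for robust online mean estimation: the stochastic update is dropped when its magnitude exceeds a threshold, so the step size governs adaptivity and the threshold governs robustness; alpha and tau are illustrative.

        import numpy as np

        def sra_style_mean(stream, alpha=0.05, tau=3.0):
            """Online mean estimate that skips points whose stochastic update is too large."""
            theta = 0.0
            for x in stream:
                update = x - theta            # stochastic update for the mean parameter
                if abs(update) <= tau:        # robustness: drop outlier-driven updates
                    theta += alpha * update   # adaptivity: step size controls tracking speed
            return theta

        rng = np.random.default_rng(0)
        data = rng.normal(2.0, 1.0, size=5000)
        data[::100] = 50.0                    # inject occasional gross outliers
        print(sra_style_mean(data))           # stays close to 2.0 despite the outliers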
    Sign-RIP: A Robust Restricted Isometry Property for Low-rank Matrix Recovery. (arXiv:2102.02969v3 [cs.LG] UPDATED)
    (2 min) The restricted isometry property (RIP), essentially stating that the linear measurements are approximately norm-preserving, plays a crucial role in studying the low-rank matrix recovery problem. However, RIP fails in the robust setting, when a subset of the measurements are grossly corrupted with noise. In this work, we propose a robust restricted isometry property, called Sign-RIP, and show its broad applications in robust low-rank matrix recovery. In particular, we show that Sign-RIP can guarantee the uniform convergence of the subdifferentials of the robust matrix recovery with a nonsmooth loss function, even in the presence of arbitrarily dense and arbitrarily large outliers. Based on Sign-RIP, we characterize the location of the critical points in robust rank-1 matrix recovery, and prove that they are either close to the true solution, or have small norm. Moreover, in the over-parameterized regime, where the rank of the true solution is over-estimated, we show that the subgradient method converges to the true solution at a (nearly) dimension-free rate. Finally, we show that Sign-RIP enjoys almost the same complexity as its classical counterparts, but provides significantly better robustness against noise.
    A Causal Direction Test for Heterogeneous Populations. (arXiv:2006.04877v2 [stat.ME] UPDATED)
    (2 min) A probabilistic expert system emulates the decision-making ability of a human expert through a directional graphical model. The first step in building such systems is to understand the data generation mechanism. To this end, one may try to decompose a multivariate distribution into a product of several conditionals, evolving black-box machine learning predictive models towards transparent cause-and-effect discovery. Most causal models assume a single homogeneous population, an assumption that may fail to hold in many applications. We show that when the homogeneity assumption is violated, causal models developed based on this assumption can fail to identify the correct causal direction. We propose an adjustment to a commonly used causal direction test statistic by using a $k$-means type clustering algorithm, where both the labels and the number of components are estimated from the collected data. Our simulation results show that the proposed adjustment significantly improves the performance of the causal direction test statistic for heterogeneous data. We study the large sample behaviour of our proposed test statistic and demonstrate the application of the proposed method using real data.
    Physics-Augmented Learning: A New Paradigm Beyond Physics-Informed Learning. (arXiv:2109.13901v1 [cs.LG])
    (2 min) Integrating physical inductive biases into machine learning can improve model generalizability. We generalize the successful paradigm of physics-informed learning (PIL) into a more general framework that also includes what we term physics-augmented learning (PAL). PIL and PAL complement each other by handling discriminative and generative properties, respectively. In numerical experiments, we show that PAL performs well on examples where PIL is inapplicable or inefficient.
    Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning. (arXiv:1804.00308v3 [cs.CR] UPDATED)
    (2 min) As machine learning becomes widely used for automated decisions, attackers have strong incentives to manipulate the results and models generated by machine learning algorithms. In this paper, we perform the first systematic study of poisoning attacks and their countermeasures for linear regression models. In poisoning attacks, attackers deliberately influence the training data to manipulate the results of a predictive model. We propose a theoretically-grounded optimization framework specifically designed for linear regression and demonstrate its effectiveness on a range of datasets and models. We also introduce a fast statistical attack that requires limited knowledge of the training process. Finally, we design a new principled defense method that is highly resilient against all poisoning attacks. We provide formal guarantees about its convergence and an upper bound on the effect of poisoning attacks when the defense is deployed. We extensively evaluate our attacks and defenses on three realistic datasets from the health care, loan assessment, and real estate domains.
    Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding. (arXiv:1911.00232v4 [cs.CV] UPDATED)
    (2 min) Videos capture events that typically contain multiple sequential, and simultaneous, actions even in the span of only a few seconds. However, most large-scale datasets built to train models for action recognition in video only provide a single label per video. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled, and do not learn the full spectrum of information present in each video in training. To address this, we present the Multi-Moments in Time dataset (M-MiT), which includes over two million action labels for over one million three second videos. This multi-label dataset introduces novel challenges on how to train and analyze models for multi-action detection. Here, we present baseline results for multi-action recognition using loss functions adapted for long tail multi-label learning, provide improved methods for visualizing and interpreting models trained for multi-label action detection, and show the strength of transferring models trained on M-MiT to smaller datasets.
    A PAC-Bayesian Analysis of Distance-Based Classifiers: Why Nearest-Neighbour works!. (arXiv:2109.13889v1 [cs.LG])
    (2 min) We present PAC-Bayesian bounds for the generalisation error of the K-nearest-neighbour classifier (K-NN). This is achieved by casting the K-NN classifier into a kernel space framework in the limit of vanishing kernel bandwidth. We establish a relation between prior measures over the coefficients in the kernel expansion and the induced measure on the weight vectors in kernel space. Defining a sparse prior over the coefficients allows the application of a PAC-Bayesian folk theorem that leads to a generalisation bound that is a function of the number of redundant training examples: those that can be left out without changing the solution. The presented bound requires quantifying a prior belief in the sparseness of the solution and is evaluated after learning, when the actual redundancy level is known. Even for small sample sizes (m ~ 100) the bound gives non-trivial results when both the expected sparseness and the actual redundancy are high.
    Investigating Bi-Level Optimization for Learning and Vision from a Unified Perspective: A Survey and Beyond. (arXiv:2101.11517v3 [cs.LG] UPDATED)
    (2 min) Bi-Level Optimization (BLO) originated in the area of economic game theory and was then introduced into the optimization community. BLO is able to handle problems with a hierarchical structure, involving two levels of optimization tasks, where one task is nested inside the other. In the machine learning and computer vision fields, despite the different motivations and mechanisms, a lot of complex problems, such as hyper-parameter optimization, multi-task and meta-learning, neural architecture search, adversarial learning and deep reinforcement learning, actually all contain a series of closely related subproblems. In this paper, we first uniformly express these complex learning and vision problems from the perspective of BLO. Then we construct a best-response-based single-level reformulation and establish a unified algorithmic framework to understand and formulate mainstream gradient-based BLO methodologies, covering aspects ranging from fundamental automatic differentiation schemes to various accelerations, simplifications, extensions and their convergence and complexity properties. Last but not least, we discuss the potential of our unified BLO framework for designing new algorithms and point out some promising directions for future research.
    $f$-Cal: Calibrated aleatoric uncertainty estimation from neural networks for robot perception. (arXiv:2109.13913v1 [cs.CV])
    (2 min) While modern deep neural networks are performant perception modules, performance (accuracy) alone is insufficient, particularly for safety-critical robotic applications such as self-driving vehicles. Robot autonomy stacks also require these otherwise blackbox models to produce reliable and calibrated measures of confidence on their predictions. Existing approaches estimate uncertainty from these neural network perception stacks by modifying network architectures, inference procedure, or loss functions. However, in general, these methods lack calibration, meaning that the predictive uncertainties do not faithfully represent the true underlying uncertainties (process noise). Our key insight is that calibration is only achieved by imposing constraints across multiple examples, such as those in a mini-batch; as opposed to existing approaches which only impose constraints per-sample, often leading to overconfident (thus miscalibrated) uncertainty estimates. By enforcing the distribution of outputs of a neural network to resemble a target distribution by minimizing an $f$-divergence, we obtain significantly better-calibrated models compared to prior approaches. Our approach, $f$-Cal, outperforms existing uncertainty calibration approaches on robot perception tasks such as object detection and monocular depth estimation over multiple real-world benchmarks.
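    One possible instantiation of the batch-level idea (a sketch, not the released $f$-Cal code): standardized residuals over a mini-batch are pushed towards a standard normal target by adding a KL term, one choice of $f$-divergence, to a Gaussian regression loss; the weighting and variable names are assumptions.

        import torch

        def fcal_style_loss(mean, log_var, target, calib_weight=1.0):
            """Per-sample Gaussian NLL plus a batch-level KL term pushing z-scores towards N(0, 1)."""
            var = log_var.exp()
            nll = 0.5 * (log_var + (target - mean) ** 2 / var).mean()
            z = (target - mean) / var.sqrt()             # standardized residuals across the mini-batch
            m, s2 = z.mean(), z.var(unbiased=False)
            kl = 0.5 * (s2 + m ** 2 - 1.0 - s2.log())    # KL( N(m, s2) || N(0, 1) )
            return nll + calib_weight * kl

        mean = torch.randn(64, requires_grad=True)       # stand-ins for network outputs on a regression batch
        log_var = torch.zeros(64, requires_grad=True)
        target = torch.randn(64)
        fcal_style_loss(mean, log_var, target).backward()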
    An Efficient Epileptic Seizure Detection Technique using Discrete Wavelet Transform and Machine Learning Classifiers. (arXiv:2109.13811v1 [eess.SP])
    (2 min) This paper presents an epilepsy detection method based on the discrete wavelet transform (DWT) and machine learning classifiers. Here DWT is used for feature extraction, as it provides a better decomposition of the signals in different frequency bands. First, DWT is applied to the EEG signal to extract the detail and approximation coefficients of the different sub-bands. After the extraction of the coefficients, principal component analysis (PCA) is applied to the different sub-bands, and then a feature-level fusion technique is used to extract the important features in a low-dimensional feature space. Three classifiers, namely the Support Vector Machine (SVM), K-Nearest-Neighbor (KNN), and Naive Bayes (NB) classifiers, have been used in the proposed work for classifying the EEG signals. The proposed method is tested on the Bonn databases and provides a maximum of 100% recognition accuracy for the KNN, SVM, and NB classifiers.
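    A simplified sketch of the described pipeline using PyWavelets and scikit-learn; the synthetic signals, the 'db4' wavelet, and the per-sub-band summary statistics (in place of the paper's exact feature-level fusion) are all assumptions.

        import numpy as np
        import pywt
        from sklearn.decomposition import PCA
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        def dwt_features(segment, wavelet="db4", level=4):
            """Decompose an EEG segment into sub-bands and summarize each with simple statistics."""
            coeffs = pywt.wavedec(segment, wavelet, level=level)
            return np.concatenate([[c.mean(), c.std(), np.abs(c).max()] for c in coeffs])

        rng = np.random.default_rng(0)                    # synthetic stand-in for EEG segments
        segments = rng.standard_normal((200, 512))
        labels = rng.integers(0, 2, size=200)
        X = np.array([dwt_features(s) for s in segments])

        clf = make_pipeline(PCA(n_components=10), SVC(kernel="rbf"))
        print(cross_val_score(clf, X, labels, cv=5).mean())   # chance-level here, by construction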
    Reinforcement Learning for Quantitative Trading. (arXiv:2109.13851v1 [cs.LG])
    (2 min) Quantitative trading (QT), which refers to the usage of mathematical models and data-driven techniques in analyzing the financial market, has been a popular topic in both academia and the financial industry since the 1970s. In the last decade, reinforcement learning (RL) has garnered significant interest in many domains such as robotics and video games, owing to its outstanding ability in solving complex sequential decision-making problems. RL's impact is pervasive, recently demonstrating its ability to conquer many challenging QT tasks. Exploring the potential of RL techniques on QT tasks is a flourishing research direction. This paper aims at providing a comprehensive survey of research efforts on RL-based methods for QT tasks. More concretely, we devise a taxonomy of RL-based QT models, along with a comprehensive summary of the state of the art. Finally, we discuss current challenges and propose future research directions in this exciting field.
    Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme. (arXiv:2109.13821v1 [cs.SD])
    (2 min) Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario. The most challenging one often referred to as one-shot many-to-many voice conversion consists in copying the target voice from only one reference utterance in the most general case when both source and target speakers do not belong to the training dataset. We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on real-time applications, we investigate general principles which can make diffusion models faster while keeping synthesis quality at a high level. As a result, we develop a novel Stochastic Differential Equations solver suitable for various diffusion model types and generative tasks as shown through empirical studies and justify it by theoretical analysis.
    3N-GAN: Semi-Supervised Classification of X-Ray Images with a 3-Player Adversarial Framework. (arXiv:2109.13862v1 [eess.IV])
    (2 min) The success of deep learning for medical imaging tasks, such as classification, is heavily reliant on the availability of large-scale datasets. However, acquiring datasets with large quantities of labeled data is challenging, as labeling is expensive and time-consuming. Semi-supervised learning (SSL) is a growing alternative to fully-supervised learning, but requires unlabeled samples for training. In medical imaging, many datasets lack unlabeled data entirely, so SSL can't be conventionally utilized. We propose 3N-GAN, or 3 Network Generative Adversarial Networks, to perform semi-supervised classification of medical images in fully-supervised settings. We incorporate a classifier into the adversarial relationship such that the generator trains adversarially against both the classifier and discriminator. Our preliminary results show improved classification performance and GAN generations over various algorithms. Our work can seamlessly integrate with numerous other medical imaging model architectures and SSL methods for greater performance.
    Unsolved Problems in ML Safety. (arXiv:2109.13916v1 [cs.LG])
    (2 min) Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), steering ML systems ("Alignment"), and reducing risks to how ML systems are handled ("External Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
    Sentiment Analysis in Twitter for Macedonian. (arXiv:2109.13725v1 [cs.CL])
    (2 min) We present work on sentiment analysis in Twitter for Macedonian. As this is pioneering work for this combination of language and genre, we created suitable resources for training and evaluating a system for sentiment analysis of Macedonian tweets. In particular, we developed a corpus of tweets annotated with tweet-level sentiment polarity (positive, negative, and neutral), as well as with phrase-level sentiment, which we made freely available for research purposes. We further bootstrapped several large-scale sentiment lexicons for Macedonian, motivated by previous work for English. The impact of several different pre-processing steps as well as of various features is shown in experiments that represent the first attempt to build a system for sentiment analysis in Twitter for the morphologically rich Macedonian language. Overall, our experimental results show an F1-score of 92.16, which is very strong and is on par with the best results for English, which were achieved in recent SemEval competitions.
    A Riemannian Block Coordinate Descent Method for Computing the Projection Robust Wasserstein Distance. (arXiv:2012.05199v5 [cs.LG] UPDATED)
    (2 min) The Wasserstein distance has become increasingly important in machine learning and deep learning. Despite its popularity, the Wasserstein distance is hard to approximate because of the curse of dimensionality. A recently proposed approach to alleviate the curse of dimensionality is to project the sampled data from the high dimensional probability distribution onto a lower-dimensional subspace, and then compute the Wasserstein distance between the projected data. However, this approach requires solving a max-min problem over the Stiefel manifold, which is very challenging in practice. The only existing work that solves this problem directly is the RGAS (Riemannian Gradient Ascent with Sinkhorn Iteration) algorithm, which requires solving an entropy-regularized optimal transport problem in each iteration, and thus can be costly for large-scale problems. In this paper, we propose a Riemannian block coordinate descent (RBCD) method to solve this problem, which is based on a novel reformulation of the regularized max-min problem over the Stiefel manifold. We show that the complexity of arithmetic operations for RBCD to obtain an $\epsilon$-stationary point is $O(\epsilon^{-3})$. This significantly improves the corresponding complexity of RGAS, which is $O(\epsilon^{-12})$. Moreover, our RBCD has very low per-iteration complexity, and hence is suitable for large-scale problems. Numerical results on both synthetic and real datasets demonstrate that our method is more efficient than existing methods, especially when the number of sampled data is very large.
    Machine learning methods for prediction of cancer driver genes: a survey paper. (arXiv:2109.13685v1 [cs.LG])
    (2 min) Identifying the genes and mutations that drive the emergence of tumors is a major step to improve understanding of cancer and identify new directions for disease diagnosis and treatment. Despite the large volume of genomics data, the precise detection of driver mutations and their carrying genes, known as cancer driver genes, from the millions of possible somatic mutations remains a challenge. Computational methods play an increasingly important role in identifying genomic patterns associated with cancer drivers and developing models to predict driver events. Machine learning (ML) has been the engine behind many of these efforts and provides excellent opportunities for tackling remaining gaps in the field. Thus, this survey aims to perform a comprehensive analysis of ML-based computational approaches to identify cancer driver mutations and genes, providing an integrated, panoramic view of the broad data and algorithmic landscape within this scientific problem. We discuss how the interactions among data types and ML algorithms have been explored in previous solutions and outline current analytical limitations that deserve further attention from the scientific community. We hope that by helping readers become more familiar with significant developments in the field brought by ML, we may inspire new researchers to address open problems and advance our knowledge towards cancer driver discovery.
    Cluster Analysis of a Symbolic Regression Search Space. (arXiv:2109.13898v1 [cs.LG])
    (2 min) In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for uni-variate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space we find that GP initially explores the whole search space and later converges to the subspace of highest quality expressions in a run for a simple benchmark problem.
    SafetyNet: Safe planning for real-world self-driving vehicles using machine-learned policies. (arXiv:2109.13602v1 [cs.RO])
    (2 min) In this paper we present the first safe system for full control of self-driving vehicles trained from human demonstrations and deployed in challenging, real-world, urban environments. Current industry-standard solutions use rule-based systems for planning. Although they perform reasonably well in common scenarios, the engineering complexity renders this approach incompatible with human-level performance. On the other hand, the performance of machine-learned (ML) planning solutions can be improved by simply adding more exemplar data. However, ML methods cannot offer safety guarantees and sometimes behave unpredictably. To combat this, our approach uses a simple yet effective rule-based fallback layer that performs sanity checks on an ML planner's decisions (e.g. avoiding collision, assuring physical feasibility). This allows us to leverage ML to handle complex situations while still assuring the safety, reducing ML planner-only collisions by 95%. We train our ML planner on 300 hours of expert driving demonstrations using imitation learning and deploy it along with the fallback layer in downtown San Francisco, where it takes complete control of a real vehicle and navigates a wide variety of challenging urban driving scenarios.
    MultiRocket: Multiple pooling operators and transformations for fast and effective time series classification. (arXiv:2102.00457v2 [cs.LG] UPDATED)
    (2 min) We propose MultiRocket, a fast time series classification (TSC) algorithm that achieves state-of-the-art performance with a tiny fraction of the time and without the complex ensembling structure of many state-of-the-art methods. MultiRocket improves on MiniRocket, one of the fastest TSC algorithms to date, by adding multiple pooling operators and transformations to improve the diversity of the features generated. In addition to processing the raw input series, MultiRocket also applies first order differences to transform the original series. Convolutions are applied to both representations, and four pooling operators are applied to the convolution outputs. When benchmarked using the University of California Riverside TSC benchmark datasets, MultiRocket is significantly more accurate than MiniRocket, and competitive with the best ranked current method in terms of accuracy, HIVE-COTE 2.0, while being orders of magnitude faster.
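    A schematic toy version of the recipe, not MultiRocket itself: random convolution kernels stand in for MiniRocket's fixed kernel set, only two pooling operators are shown, and features are extracted from both the raw series and its first-order differences before a ridge classifier.

        import numpy as np
        from sklearn.linear_model import RidgeClassifierCV

        def kernel_features(series, kernels):
            """Convolve a 1D series with each kernel; pool with PPV and max (two pooling operators)."""
            feats = []
            for w, b in kernels:
                conv = np.convolve(series, w, mode="valid") + b
                feats += [(conv > 0).mean(), conv.max()]
            return feats

        rng = np.random.default_rng(0)
        kernels = [(rng.standard_normal(9), rng.standard_normal()) for _ in range(100)]

        def transform(X):
            return np.array([kernel_features(s, kernels)
                             + kernel_features(np.diff(s), kernels)   # first-order differences
                             for s in X])

        t = np.linspace(0, 1, 128)                        # toy two-class dataset of noisy sinusoids
        X = np.array([np.sin(2 * np.pi * (3 + c) * t) + 0.3 * rng.standard_normal(128)
                      for c in (0, 1) for _ in range(50)])
        y = np.repeat([0, 1], 50)
        clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(transform(X), y)
        print(clf.score(transform(X), y))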
    A First-Occupancy Representation for Reinforcement Learning. (arXiv:2109.13863v1 [cs.LG])
    (2 min) Both animals and artificial agents benefit from state representations that support rapid transfer of learning across tasks and which enable them to efficiently traverse their environments to reach rewarding states. The successor representation (SR), which measures the expected cumulative, discounted state occupancy under a fixed policy, enables efficient transfer to different reward structures in an otherwise constant Markovian environment and has been hypothesized to underlie aspects of biological behavior and neural activity. However, in the real world, rewards may only be available for consumption once, may shift location, or agents may simply aim to reach goal states as rapidly as possible without the constraint of artificially imposed task horizons. In such cases, the most behaviorally-relevant representation would carry information about when the agent was likely to first reach states of interest, rather than how often it should expect to visit them over a potentially infinite time span. To reflect such demands, we introduce the first-occupancy representation (FR), which measures the expected temporal discount to the first time a state is accessed. We demonstrate that the FR facilitates the selection of efficient paths to desired states, allows the agent, under certain conditions, to plan provably optimal trajectories defined by a sequence of subgoals, and induces similar behavior to animals avoiding threatening stimuli.
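    A tabular sketch of one way to read this definition (not the authors' code): with a known policy-induced transition matrix P, the representation F[s, g] = E[gamma^(first hit time of g from s)] satisfies F[g, g] = 1 and F[s, g] = gamma * sum_s' P[s, s'] * F[s', g] otherwise, and can be computed by fixed-point iteration.

        import numpy as np

        def first_occupancy(P, gamma=0.95, iters=500):
            """Fixed-point iteration for F[s, g] ~ E[gamma ** (first time g is reached from s)]."""
            n = P.shape[0]
            F = np.zeros((n, n))
            for _ in range(iters):
                F = gamma * P @ F              # discount one more step before the first arrival
                np.fill_diagonal(F, 1.0)       # already at the goal state: no discounting
            return F

        P = np.array([[0.0, 1.0, 0.0],         # toy 3-state chain: 0 -> 1 -> 2 -> 2 (absorbing)
                      [0.0, 0.0, 1.0],
                      [0.0, 0.0, 1.0]])
        print(first_occupancy(P, gamma=0.9))   # e.g. F[0, 2] converges to 0.9 ** 2 = 0.81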
    Making Curiosity Explicit in Vision-based RL. (arXiv:2109.13588v1 [cs.LG])
    (2 min) Vision-based reinforcement learning (RL) is a promising technique to solve control tasks involving images as the main observation. State-of-the-art RL algorithms still struggle in terms of sample efficiency, especially when using image observations. This has led to increased attention on integrating state representation learning (SRL) techniques into the RL pipeline. Work in this field demonstrates a substantial improvement in sample efficiency among other benefits. However, to take full advantage of this paradigm, the quality of samples used for training plays a crucial role. More importantly, the diversity of these samples could affect not only the sample efficiency of vision-based RL, but also its generalization capability. In this work, we present an approach to improve the sample diversity. Our method enhances the exploration capability of the RL algorithms by taking advantage of the SRL setup. Our experiments show that the presented approach outperforms the baseline for all tested environments. These results are most apparent for environments where the baseline method struggles. Even in simple environments, our method stabilizes the training, reduces the reward variance and boosts sample efficiency.
    Gaussian Processes to speed up MCMC with automatic exploratory-exploitation effect. (arXiv:2109.13891v1 [stat.ML])
    (2 min) We present a two-stage Metropolis-Hastings algorithm for sampling probabilistic models, whose log-likelihood is computationally expensive to evaluate, by using a surrogate Gaussian Process (GP) model. The key feature of the approach, and the difference w.r.t. previous works, is the ability to learn the target distribution from scratch (while sampling), and so without the need of pre-training the GP. This is fundamental for automatic inference in Probabilistic Programming Languages. In particular, we present an alternative first stage acceptance scheme by marginalising out the GP distributed function, which makes the acceptance ratio explicitly dependent on the variance of the GP. This approach is extended to the Metropolis-Adjusted Langevin algorithm (MALA).
    What to Prioritize? Natural Language Processing for the Development of a Modern Bug Tracking Solution in Hardware Development. (arXiv:2109.13825v1 [cs.LG])
    (2 min) Managing large numbers of incoming bug reports and finding the most critical issues in hardware development is time consuming, but crucial in order to reduce development costs. In this paper, we present an approach to predict the time to fix, the risk and the complexity of debugging and resolution of a bug report using different supervised machine learning algorithms, namely Random Forest, Naive Bayes, SVM, MLP and XGBoost. Further, we investigate the effect of the application of active learning and we evaluate the impact of different text representation techniques, namely TF-IDF, Word2Vec, Universal Sentence Encoder and XLNet on the model's performance. The evaluation shows that a combination of text embeddings generated through the Universal Sentence Encoder and MLP as classifier outperforms all other methods, and is well suited to predict the risk and complexity of bug tickets.
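    A compact scikit-learn sketch of one of the compared configurations (TF-IDF features with an MLP classifier, rather than the best-performing Universal Sentence Encoder setup); the toy bug reports and risk labels below are invented for illustration.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline

        reports = [                                       # made-up hardware bug-report texts
            "timing violation on memory controller after clock gating change",
            "typo in register description, no functional impact",
            "intermittent data corruption under high temperature stress test",
            "documentation link broken in release notes",
        ]
        risk = ["high", "low", "high", "low"]             # illustrative risk labels

        model = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2), min_df=1),
            MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
        )
        model.fit(reports, risk)
        print(model.predict(["sporadic timing failure in controller stress run"]))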
    Opportunistic Implicit User Authentication for Health-Tracking IoT Wearables. (arXiv:2109.13705v1 [cs.HC])
    (2 min) With the advancement of technologies, market wearables are becoming increasingly popular with a range of services, including providing access to bank accounts, accessing cars, monitoring patients remotely, among several others. However, often these wearables collect various sensitive personal information of a user with no to limited authentication, e.g., knowledge-based external authentication techniques, such as PINs. While most of these external authentication techniques suffer from multiple limitations, including recall burden, human errors, or biases, researchers have started using various physiological and behavioral data, such as gait and heart rate, collected by the wearables to authenticate a wearable user implicitly with a limited accuracy due to sensing and computing constraints of wearables. In this work, we explore the usefulness of blood oxygen saturation (SpO2) values collected from an oximeter device to distinguish a user from others. From a cohort of 25 subjects, we find that in 92% of the cases SpO2 can distinguish pairs of users. From detailed modeling and performance analysis, we observe that while SpO2 alone can obtain an average accuracy of 0.69 and an F1 score of 0.69, the addition of heart rate (HR) can improve the average identification accuracy by 15% and the F1 score by 13%. These results show promise in using SpO2 along with other biometrics to develop implicit continuous authentication for wearables.
    Anomaly Detection for High-Dimensional Data Using Large Deviations Principle. (arXiv:2109.13698v1 [cs.LG])
    (2 min) Most current anomaly detection methods suffer from the curse of dimensionality when dealing with high-dimensional data. We propose an anomaly detection algorithm that can scale to high-dimensional data using concepts from the theory of large deviations. The proposed Large Deviations Anomaly Detection (LAD) algorithm is shown to outperform state-of-the-art anomaly detection methods on a variety of large and high-dimensional benchmark data sets. Exploiting the ability of the algorithm to scale to high-dimensional data, we propose an online anomaly detection method to identify anomalies in a collection of multivariate time series. We demonstrate the applicability of the online algorithm in identifying counties in the United States with anomalous trends in terms of COVID-19 related cases and deaths. Several of the identified anomalous counties correlate with counties with documented poor response to the COVID pandemic.
    Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation. (arXiv:2109.13841v1 [cs.RO])
    (2 min) We tackle real-world long-horizon robot manipulation tasks through skill discovery. We present a bottom-up approach to learning a library of reusable skills from unsegmented demonstrations and use these skills to synthesize prolonged robot behaviors. Our method starts with constructing a hierarchical task structure from each demonstration through agglomerative clustering. From the task structures of multi-task demonstrations, we identify skills based on the recurring patterns and train goal-conditioned sensorimotor policies with hierarchical imitation learning. Finally, we train a meta controller to compose these skills to solve long-horizon manipulation tasks. The entire model can be trained on a small set of human demonstrations collected within 30 minutes without further annotations, making it amenable to real-world deployment. We systematically evaluated our method in simulation environments and on a real robot. Our method has shown superior performance over state-of-the-art imitation learning methods in multi-stage manipulation tasks. Furthermore, skills discovered from multi-task demonstrations boost the average task success by $8\%$ compared to those discovered from individual tasks.
    StereoSpike: Depth Learning with a Spiking Neural Network. (arXiv:2109.13751v1 [cs.CV])
    (2 min) Depth estimation is an important computer vision task, useful in particular for navigation in autonomous vehicles, or for object manipulation in robotics. Here we solved it using an end-to-end neuromorphic approach, combining two event-based cameras and a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, that we named StereoSpike. More specifically, we used the Multi Vehicle Stereo Event Camera Dataset (MVSEC). It provides a depth ground-truth, which was used to train StereoSpike in a supervised manner, using surrogate gradient descent. We propose a novel readout paradigm to obtain a dense analog prediction -- the depth of each pixel -- from the spikes of the decoder. We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts, leading to state-of-the-art test accuracy. To the best of our knowledge, it is the first time that such a large-scale regression problem is solved by a fully spiking network. Finally, we show that low firing rates (<10%) can be obtained via regularization, with a minimal cost in accuracy. This means that StereoSpike could be efficiently implemented on neuromorphic chips, opening the door for low power and real time embedded systems.
    Deep Generative Modeling for Protein Design. (arXiv:2109.13754v1 [cs.LG])
    (2 min) Deep learning approaches have produced substantial breakthroughs in fields such as image classification and natural language processing and are making rapid inroads in the area of protein design. Many generative models of proteins have been developed that encompass all known protein sequences, model specific protein families, or extrapolate the dynamics of individual proteins. Those generative models can learn protein representations that are often more informative of protein structure and function than hand-engineered features. Furthermore, they can be used to quickly propose millions of novel proteins that resemble the native counterparts in terms of expression level, stability, or other attributes. The protein design process can further be guided by discriminative oracles to select candidates with the highest probability of having the desired properties. In this review, we discuss five classes of generative models that have been most successful at modeling proteins and provide a framework for model guided protein design.
    An Automated Data Engineering Pipeline for Anomaly Detection of IoT Sensor Data. (arXiv:2109.13828v1 [cs.LG])
    (2 min) The rapid development of System on Chip (SoC) technology, the Internet of Things (IoT), cloud computing, and artificial intelligence has brought more possibilities for improving and solving current problems. With data analytics and the use of machine learning/deep learning, it becomes possible to learn the underlying patterns and make decisions based on what is learned from the massive data generated by IoT sensors. When combined with cloud computing, the whole pipeline can be automated and freed of manual controls and operations. In this paper, an implementation of an automated data engineering pipeline for anomaly detection of IoT sensor data is studied and proposed. The process involves the use of IoT sensors, Raspberry Pis, Amazon Web Services (AWS), and multiple machine learning techniques with the intent of identifying anomalous cases for a smart home security system.
    IRMAC: Interpretable Refined Motifs and Binary Classification for Rooftops PV Owners. (arXiv:2109.13732v1 [cs.LG])
    (2 min) In this paper, we seek to identify residential rooftop solar PV owners using imported energy data. To solve this problem with an interpretable, fast, secure, and maintainable solution, we propose the Interpretable Refined Motifs And binary Classification (IRMAC) method, which includes a shape-based dimensionality reduction technique we call Refined Motif (RM) and a classification technique with linear complexity to identify solar owners. Furthermore, with real data from Australia and Denmark, the proposed method is tested and verified in identifying PV owners as well as electrical heating system users. The performance of the proposed method is studied and compared with various state-of-the-art methods, and the proposed method outperforms the alternatives.
    Domain Generalization for Vision-based Driving Trajectory Generation. (arXiv:2109.13858v1 [cs.CV])
    (2 min) One of the challenges in vision-based driving trajectory generation is dealing with out-of-distribution scenarios. In this paper, we propose a domain generalization method for vision-based driving trajectory generation for autonomous vehicles in urban environments, which can be seen as a solution to extend the Invariant Risk Minimization (IRM) method in complex problems. We leverage an adversarial learning approach to train a trajectory generator as the decoder. Based on the pre-trained decoder, we infer the latent variables corresponding to the trajectories, and pre-train the encoder by regressing the inferred latent variable. Finally, we fix the decoder but fine-tune the encoder with the final trajectory loss. We compare our proposed method with the state-of-the-art trajectory generation method and some recent domain generalization methods on both datasets and simulation, demonstrating that our method has better generalization ability.
    Confusion-based rank similarity filters for computationally-efficient machine learning on high dimensional data. (arXiv:2109.13610v1 [cs.LG])
    (2 min) We introduce a novel type of computationally efficient artificial neural network (ANN) called the rank similarity filter (RSF). RSFs can be used to both transform and classify nonlinearly separable datasets with many data points and dimensions. The weights of RSF are set using the rank orders of features in a data point, or optionally the 'confusion' adjusted ranks between features (determined from their distributions in the dataset). The activation strength of a filter determines its similarity to other points in the dataset, a measure related to cosine similarity. The activation of many RSFs maps samples into a new nonlinear space suitable for linear classification (the rank similarity transform (RST)). We additionally used this method to create the nonlinear rank similarity classifier (RSC), which is a fast and accurate multiclass classifier, and the nonlinear rank similarity probabilistic classifier (RSPC), which is an extension to the multilabel case. We evaluated the classifiers on multiple datasets and RSC was competitive with existing classifiers but with superior computational efficiency. Open-source code for RST, RSC and RSPC was written in Python using the popular scikit-learn framework to make it easily accessible. In future extensions the algorithm can be applied to specialised hardware suitable for the parallelization of an ANN (GPU) and a Spiking Neural Network (neuromorphic computing) with corresponding performance gains. This makes RSF a promising solution to the problem of efficient analysis of nonlinearly separable data.
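    A rough sketch of the rank-based similarity idea (not the released RST/RSC code): each training point becomes a filter whose weights are its centered, normalized feature ranks, and a query is assigned the label of its most strongly activated filter (a Spearman-like similarity); the toy data with class-dependent feature profiles is an assumption.

        import numpy as np
        from scipy.stats import rankdata

        def rank_weights(X):
            """Centered, unit-norm feature ranks: one filter (row) per data point."""
            R = np.apply_along_axis(rankdata, 1, X)
            R -= R.mean(axis=1, keepdims=True)
            return R / np.linalg.norm(R, axis=1, keepdims=True)

        class RankSimilarityClassifier:
            def fit(self, X, y):
                self.filters_, self.y_ = rank_weights(X), np.asarray(y)
                return self
            def predict(self, X):
                activations = rank_weights(X) @ self.filters_.T   # cosine of centered ranks
                return self.y_[activations.argmax(axis=1)]

        rng = np.random.default_rng(0)
        profile = np.linspace(0.0, 3.0, 20)
        X = np.vstack([profile + rng.normal(0, 0.5, (50, 20)),         # class 0: rising feature profile
                       profile[::-1] + rng.normal(0, 0.5, (50, 20))])  # class 1: falling feature profile
        y = np.repeat([0, 1], 50)
        clf = RankSimilarityClassifier().fit(X, y)
        X_test = np.vstack([profile + rng.normal(0, 0.5, (10, 20)),
                            profile[::-1] + rng.normal(0, 0.5, (10, 20))])
        print((clf.predict(X_test) == np.repeat([0, 1], 10)).mean())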
    Macroeconomic forecasting with LSTM and mixed frequency time series data. (arXiv:2109.13777v1 [econ.EM])
    (2 min) This paper demonstrates the potential of the long short-term memory (LSTM) when applied to macroeconomic time series data sampled at different frequencies. We first present how the conventional LSTM model can be adapted to time series observed at mixed frequencies when the same mismatch ratio is applied for all pairs of low-frequency output and higher-frequency variable. To generalize the LSTM to the case of multiple mismatch ratios, we adopt the unrestricted Mixed DAta Sampling (U-MIDAS) scheme (Foroni et al., 2015) into the LSTM architecture. We assess the out-of-sample predictive performance via both Monte Carlo simulations and an empirical application. Our proposed models outperform the restricted MIDAS model even in a set up favorable to the MIDAS estimator. For a real-world application, we study forecasting the quarterly growth rate of Thai real GDP using a vast array of macroeconomic indicators, both quarterly and monthly. Our LSTM with the U-MIDAS scheme easily beats the simple benchmark AR(1) model at all horizons, but outperforms the strong benchmark univariate LSTM only at one and six months ahead. Nonetheless, we find that our proposed model could be very helpful in periods of large economic downturns for short-term forecasts. Simulation and empirical results seem to support the use of our proposed LSTM with the U-MIDAS scheme for nowcasting applications.
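    A minimal numpy illustration of the U-MIDAS idea of giving every within-quarter monthly lag its own unrestricted coefficient (shown with plain least squares rather than the paper's LSTM); the mismatch ratio of 3 (monthly to quarterly), the six lags, and the synthetic data are assumptions.

        import numpy as np

        def umidas_design(monthly, quarterly, n_hf_lags=6):
            """Stack the latest monthly lags as separate regressors for each quarterly target."""
            X, y = [], []
            for q in range(len(quarterly)):
                end = (q + 1) * 3                                  # last month belonging to quarter q
                if end >= n_hf_lags:
                    X.append(monthly[end - n_hf_lags:end][::-1])   # lag 0 = most recent month
                    y.append(quarterly[q])
            return np.array(X), np.array(y)

        rng = np.random.default_rng(0)
        monthly = rng.standard_normal(120)                         # 10 years of a monthly indicator
        quarterly = monthly.reshape(40, 3).mean(axis=1) + 0.1 * rng.standard_normal(40)

        X, y = umidas_design(monthly, quarterly)
        beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
        print(beta.round(2))        # roughly 1/3 weight on each within-quarter monthly lag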
    Improved prediction rule ensembling through model-based data generation. (arXiv:2109.13672v1 [stat.ML])
    (2 min) Prediction rule ensembles (PRE) provide interpretable prediction models with relatively high accuracy. PRE obtain a large set of decision rules from a (boosted) decision tree ensemble, and achieve sparsity through application of Lasso-penalized regression. This article examines the use of surrogate models to improve the performance of PRE, wherein the Lasso regression is trained with the help of a massive dataset generated by the (boosted) decision tree ensemble. This use of model-based data generation may improve the stability and consistency of the Lasso step, thus leading to improved overall performance. We propose two surrogacy approaches, and evaluate them on simulated and existing datasets, in terms of sparsity and predictive accuracy. The results indicate that the use of surrogacy models can substantially improve the sparsity of PRE, while retaining predictive accuracy, especially through the use of a nested surrogacy approach.
    Near-Linear Time Algorithm with Near-Logarithmic Regret Per Switch for Mixable/Exp-Concave Losses. (arXiv:2109.13786v1 [cs.LG])
    (2 min) We investigate the problem of online learning, which has gained significant attention in recent years due to its applicability in a wide range of fields from machine learning to game theory. Specifically, we study the online optimization of mixable loss functions with logarithmic static regret in a dynamic environment. The best dynamic estimation sequence that we compete against is selected in hindsight with full observation of the loss functions and is allowed to select different optimal estimations in different time intervals (segments). We propose an online mixture framework that uses these static solvers as the base algorithm. We show that with a suitable selection of hyper-expert creations and weighting strategies, we can achieve logarithmic and squared logarithmic regret per switch in quadratic and linearithmic computational complexity, respectively. For the first time in the literature, we show that it is also possible to achieve near-logarithmic regret per switch with sub-polynomial complexity per time. Our results are guaranteed to hold in a strong deterministic sense in an individual sequence manner.
    Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query. (arXiv:2109.13814v1 [q-bio.NC])
    (2 min) Most neuroimaging experiments are under-powered, limited by the number of subjects and cognitive processes that an individual study can investigate. Nonetheless, over decades of research, neuroscience has accumulated an extensive wealth of results. It remains a challenge to digest this growing knowledge base and obtain new insights since existing meta-analytic tools are limited to keyword queries. In this work, we propose Text2Brain, a neural network approach for coordinate-based meta-analysis of neuroimaging studies to synthesize brain activation maps from open-ended text queries. Combining a transformer-based text encoder and a 3D image generator, Text2Brain was trained on variable-length text snippets and their corresponding activation maps sampled from 13,000 published neuroimaging studies. We demonstrate that Text2Brain can synthesize anatomically-plausible neural activation patterns from free-form textual descriptions of cognitive concepts. Text2Brain is available at https://braininterpreter.com as a web-based tool for retrieving established priors and generating new hypotheses for neuroscience research.
    Stable training of autoencoders for hyperspectral unmixing. (arXiv:2109.13748v1 [cs.CV])
    (2 min) Neural networks, autoencoders in particular, are one of the most promising solutions for unmixing hyperspectral data, i.e. reconstructing the spectra of observed substances (endmembers) and their relative mixing fractions (abundances). Unmixing is needed for effective hyperspectral analysis and classification. However, as we show in this paper, the training of autoencoders for unmixing is highly dependent on weights initialisation. Some sets of weights lead to degenerate or low performance solutions, introducing negative bias in expected performance. In this work we present the results of experiments investigating autoencoders' stability, verifying the dependence of reconstruction error on initial weights and exploring conditions needed for successful optimisation of autoencoder parameters.
    Improving Time Series Classification Algorithms Using Octave-Convolutional Layers. (arXiv:2109.13696v1 [cs.LG])
    (2 min) Deep learning models utilizing convolution layers have achieved state-of-the-art performance on univariate time series classification tasks. In this work, we propose improving CNN-based time series classifiers by utilizing Octave Convolutions (OctConv), allowing them to outperform their original counterparts. These network architectures include Fully Convolutional Networks (FCN), Residual Neural Networks (ResNets), LSTM-Fully Convolutional Networks (LSTM-FCN), and Attention LSTM-Fully Convolutional Networks (ALSTM-FCN). The proposed layers significantly improve each of these models with minimally increased network parameters. In this paper, we experimentally show that by substituting convolutions with OctConv, we significantly improve accuracy for time series classification tasks for most of the benchmark datasets. In addition, the updated ALSTM-OctFCN performs statistically the same as the top two time series classifiers, TS-CHIEF and HIVE-COTE (both ensemble models). To further explore the impact of the OctConv layers, we perform ablation tests of the augmented models compared to their base models.
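    A minimal 1D adaptation of an Octave Convolution layer in PyTorch (the original OctConv targets 2D images; the 1D form, the 50/50 channel split, and the kernel size below are illustrative, not the paper's exact architecture): channels are split into a full-resolution 'high' branch and a half-resolution 'low' branch, connected by the four cross-frequency paths.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class OctConv1d(nn.Module):
            """Minimal 1D Octave Convolution with high- and low-frequency feature maps."""
            def __init__(self, in_ch, out_ch, kernel_size=9, alpha=0.5):
                super().__init__()
                in_lo, out_lo = int(alpha * in_ch), int(alpha * out_ch)
                in_hi, out_hi = in_ch - in_lo, out_ch - out_lo
                pad = kernel_size // 2
                self.h2h = nn.Conv1d(in_hi, out_hi, kernel_size, padding=pad)
                self.h2l = nn.Conv1d(in_hi, out_lo, kernel_size, padding=pad)
                self.l2l = nn.Conv1d(in_lo, out_lo, kernel_size, padding=pad)
                self.l2h = nn.Conv1d(in_lo, out_hi, kernel_size, padding=pad)

            def forward(self, x_hi, x_lo):
                hi = self.h2h(x_hi) + F.interpolate(self.l2h(x_lo), size=x_hi.shape[-1])
                lo = self.l2l(x_lo) + self.h2l(F.avg_pool1d(x_hi, 2))
                return hi, lo

        x = torch.randn(4, 16, 128)                        # (batch, channels, length)
        x_hi, x_lo = x[:, :8], F.avg_pool1d(x[:, 8:], 2)   # low branch runs at half resolution
        y_hi, y_lo = OctConv1d(16, 32)(x_hi, x_lo)
        print(y_hi.shape, y_lo.shape)                      # torch.Size([4, 16, 128]) torch.Size([4, 16, 64])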
    The Fragility of Optimized Bandit Algorithms. (arXiv:2109.13595v1 [cs.LG])
    (2 min) Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the associated algorithms necessarily have the undesirable feature that the tail of the regret distribution behaves like that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the number of sub-optimal arm plays. We show that optimized Thompson sampling and UCB bandit designs are also fragile, in the sense that when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays that make that arm appear to be sub-optimal, thereby causing the algorithm to sample a truly sub-optimal arm much more than would be optimal.
    DeepPSL: End-to-end perception and reasoning with applications to zero shot learning. (arXiv:2109.13662v1 [eess.SY])
    (2 min) We introduce DeepPSL, a variant of Probabilistic Soft Logic (PSL), to produce an end-to-end trainable system that integrates reasoning and perception. PSL represents first-order logic in terms of a convex graphical model -- Hinge Loss Markov random fields (HL-MRFs). PSL stands out among probabilistic logic frameworks due to its tractability, having been applied to systems with more than 1 billion ground rules. The key to our approach is to represent predicates in first-order logic using deep neural networks and then to approximately back-propagate through the HL-MRF and thus train every aspect of the first-order system being represented. We believe that this approach represents an interesting direction for the integration of deep learning and reasoning techniques with applications to knowledge base learning, multi-task learning, and explainability. We evaluate DeepPSL on a zero shot learning problem in image classification. State-of-the-art results demonstrate the utility and flexibility of our approach.
    Meta-aprendizado para otimizacao de parametros de redes neurais (Meta-learning for the optimization of neural network parameters). (arXiv:2109.13745v1 [cs.NE])
    (2 min) The optimization of Artificial Neural Networks (ANNs) is an important task for the successful use of these models in real-world applications. The solutions adopted for this task are expensive in general, involving trial-and-error procedures or expert knowledge which is not always available. In this work, we investigated the use of meta-learning for the optimization of ANNs. Meta-learning is a research field aiming to automatically acquire knowledge which relates features of the learning problems to the performance of the learning algorithms. Meta-learning techniques were originally proposed and evaluated for the algorithm selection problem and later for the optimization of parameters for Support Vector Machines. However, meta-learning can be adopted as a more general strategy to optimize ANN parameters, which motivates new efforts in this research direction. In the current work, we performed a case study using meta-learning to choose the number of hidden nodes for MLP networks, which is an important parameter to be defined for good network performance. In our work, we generated a base of meta-examples associated with 93 regression problems. Each meta-example was generated from a regression problem and stored: 16 features describing the problem (e.g., number of attributes and correlation among the problem attributes) and the best number of nodes for this problem, empirically chosen from a range of possible values. This set of meta-examples was given as input to a meta-learner, which was able to predict the best number of nodes for new problems based on their features. The experiments performed in this case study revealed satisfactory results.
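    A small sketch of the described meta-learning setup (the meta-features and the k-NN meta-learner below are illustrative; the paper's 16 meta-features and 93 regression problems are not reproduced here):

        import numpy as np
        from sklearn.neighbors import KNeighborsRegressor

        # One row per past regression problem: a few meta-features describing it
        # (e.g. number of examples, number of attributes, mean absolute attribute correlation).
        meta_X = np.array([
            [100,   5, 0.10],
            [500,  20, 0.40],
            [2000,  8, 0.25],
            [150,  30, 0.60],
        ])
        best_nodes = np.array([4, 16, 8, 24])    # best hidden-layer size found empirically for each problem

        meta_learner = KNeighborsRegressor(n_neighbors=2).fit(meta_X, best_nodes)
        new_problem = np.array([[800, 12, 0.30]])
        print(int(round(meta_learner.predict(new_problem)[0])))   # suggested number of hidden nodes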
    Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets. (arXiv:2109.13723v1 [cs.CL])
    (2 min) This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles. In our experiments, we find that such character-level models cut the number of untranslated words by over 40% and are especially competitive (improvements of 2-3 BLEU points) in the case of limited training data. We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality. We further compare cascaded translation models to the use of synthetic training data via multiple pivots, and we find that the latter works significantly better. Finally, we demonstrate that neither word-nor character-BLEU correlate perfectly with human judgments, due to BLEU's sensitivity to length.
    Translating from Morphologically Complex Languages: A Paraphrase-Based Approach. (arXiv:2109.13724v1 [cs.CL])
    (2 min) We propose a novel approach to translating from a morphologically complex language. Unlike previous research, which has targeted word inflections and concatenations, we focus on the pairwise relationship between morphologically related words, which we treat as potential paraphrases and handle using paraphrasing techniques at the word, phrase, and sentence level. An important advantage of this framework is that it can cope with derivational morphology, which has so far remained largely beyond the capabilities of statistical machine translation systems. Our experiments translating from Malay, whose morphology is mostly derivational, into English show significant improvements over rivaling approaches based on five automatic evaluation measures (for 320,000 sentence pairs; 9.5 million English word tokens).
    Multimodality in Meta-Learning: A Comprehensive Survey. (arXiv:2109.13576v1 [cs.LG])
    (2 min) Meta-learning has gained wide popularity as a training framework that is more data-efficient than traditional machine learning methods. However, its generalization ability in complex task distributions, such as multimodal tasks, has not been thoroughly studied. Recently, some studies on multimodality-based meta-learning have emerged. This survey provides a comprehensive overview of the multimodality-based meta-learning landscape in terms of the methodologies and applications. We first formalize the definition of meta-learning and multimodality, along with the research challenges in this growing field, such as how to enrich the input in few-shot or zero-shot scenarios and how to generalize the models to new tasks. We then propose a new taxonomy to systematically discuss typical meta-learning algorithms combined with multimodal tasks. We investigate the contributions of related papers and summarize them by our taxonomy. Finally, we propose potential research directions for this promising field.
    Delve into the Performance Degradation of Differentiable Architecture Search. (arXiv:2109.13466v1 [cs.LG])
    (2 min) Differentiable architecture search (DARTS) is widely considered to easily overfit the validation set, which leads to performance degradation. We first employ a series of exploratory experiments to verify that neither high-strength architecture parameter regularization nor a warmup training scheme can effectively solve this problem. Based on the insights from the experiments, we conjecture that the performance of DARTS does not depend on the well-trained supernet weights and argue that the architecture parameters should be trained by the gradients obtained in the early stage rather than the final stage of training. This argument is then verified by exchanging the learning rate schemes of weights and parameters. Experimental results show that the simple swap of the learning rates can effectively solve the degradation and achieve competitive performance. Further empirical evidence suggests that the degradation is not a simple problem of validation set overfitting but reflects a link between the degradation and the operation selection bias within the bilevel optimization dynamics. We demonstrate the generality of this bias and propose to utilize it to achieve an operation-magnitude-based selective stop.
    Learning Ideological Embeddings from Information Cascades. (arXiv:2109.13589v1 [cs.SI])
    (2 min) Modeling information cascades in a social network through the lenses of the ideological leaning of its users can help in understanding phenomena such as misinformation propagation and confirmation bias, and in devising techniques for mitigating their toxic effects. In this paper we propose a stochastic model to learn the ideological leaning of each user in a multidimensional ideological space, by analyzing the way politically salient content propagates. In particular, our model assumes that information propagates from one user to another if both users are interested in the topic and ideologically aligned with each other. To infer the parameters of our model, we devise a gradient-based optimization procedure maximizing the likelihood of an observed set of information cascades. Our experiments on real-world political discussions on Twitter and Reddit confirm that our model is able to learn the political stance of social media users in a multidimensional ideological space.
    Exploratory State Representation Learning. (arXiv:2109.13596v1 [cs.LG])
    (2 min) Not having access to compact and meaningful representations is known to significantly increase the complexity of reinforcement learning (RL). For this reason, it can be useful to perform state representation learning (SRL) before tackling RL tasks. However, obtaining a good state representation can only be done if a large diversity of transitions is observed, which can require a difficult exploration, especially if the environment is initially reward-free. To solve the problems of exploration and SRL in parallel, we propose a new approach called XSRL (eXploratory State Representation Learning). On one hand, it jointly learns compact state representations and a state transition estimator which is used to remove unexploitable information from the representations. On the other hand, it continuously trains an inverse model, and adds to the prediction error of this model a $k$-step learning progress bonus to form the maximization objective of a discovery policy. This results in a policy that seeks complex transitions from which the trained models can effectively learn. Our experimental results show that the approach leads to efficient exploration in challenging environments with image observations, and to state representations that significantly accelerate learning in RL tasks.
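    As an illustration of the exploration-bonus idea described above, the minimal sketch below combines an inverse model's prediction error with a k-step learning-progress term; the architecture, names, and exact bonus formula are assumptions for illustration, not the authors' implementation.

    ```python
    import torch
    import torch.nn as nn

    class InverseModel(nn.Module):
        """Predicts the action taken between two consecutive state representations."""
        def __init__(self, state_dim, action_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim),
            )

        def forward(self, s_t, s_next):
            return self.net(torch.cat([s_t, s_next], dim=-1))

    def exploration_bonus(inverse_model, s_t, s_next, a_t, error_history, k=10):
        """Current prediction error plus how much that error dropped over the last k steps."""
        pred_a = inverse_model(s_t, s_next)
        error = ((pred_a - a_t) ** 2).mean().item()
        progress = (error_history[-k] - error) if len(error_history) >= k else 0.0
        error_history.append(error)
        return error + max(progress, 0.0)

    # toy usage with random tensors standing in for learned state representations
    inv = InverseModel(state_dim=8, action_dim=2)
    history = []
    bonus = exploration_bonus(inv, torch.randn(1, 8), torch.randn(1, 8), torch.randn(1, 2), history)
    ```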
    Clustering to the Fewest Clusters Under Intra-Cluster Dissimilarity Constraints. (arXiv:2109.13644v1 [cs.LG])
    (2 min) This paper introduces the equiwide clustering problem, where valid partitions must satisfy intra-cluster dissimilarity constraints. Unlike most existing clustering algorithms, equiwide clustering relies neither on density nor on a predefined number of expected classes, but on a dissimilarity threshold. Its main goal is to ensure an upper bound on the error induced by ultimately replacing any object with its cluster representative. Under this constraint, we then primarily focus on minimizing the number of clusters, along with potential sub-objectives. We argue that equiwide clustering is a sound clustering problem, and discuss its relationship with other optimization problems, existing and novel implementations as well as approximation strategies. We review and evaluate suitable clustering algorithms to identify trade-offs between the various practical solutions for this clustering problem.
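    For intuition on the threshold-driven setting above, here is a naive greedy baseline (an assumption for illustration, not the paper's algorithm): a point joins an existing cluster only if its distance to that cluster's representative stays below the dissimilarity threshold; otherwise it opens a new cluster.

    ```python
    import numpy as np

    def greedy_threshold_clustering(X, threshold):
        """Greedy assignment under an intra-cluster dissimilarity bound."""
        representatives, labels = [], []
        for x in X:
            dists = [np.linalg.norm(x - r) for r in representatives]
            if dists and min(dists) <= threshold:
                labels.append(int(np.argmin(dists)))
            else:
                representatives.append(x)          # open a new cluster with x as representative
                labels.append(len(representatives) - 1)
        return np.array(labels), np.array(representatives)

    X = np.random.rand(200, 2)
    labels, reps = greedy_threshold_clustering(X, threshold=0.3)
    print(len(reps), "clusters")
    ```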
    Convergence of Deep Convolutional Neural Networks. (arXiv:2109.13542v1 [cs.LG])
    (2 min) Convergence of deep neural networks as the depth of the networks tends to infinity is fundamental in building the mathematical foundation for deep learning. In a previous study, we investigated this question for deep ReLU networks with a fixed width. This does not cover the important convolutional neural networks where the widths are increasing from layer to layer. For this reason, we first study convergence of general ReLU networks with increasing widths and then apply the results obtained to deep convolutional neural networks. It turns out that the convergence reduces to the convergence of infinite products of matrices with increasing sizes, which has not been considered in the literature. We establish sufficient conditions for convergence of such infinite products of matrices. Based on these conditions, we present sufficient conditions for piecewise convergence of general deep ReLU networks with increasing widths, as well as pointwise convergence of deep ReLU convolutional neural networks.
    VoxCeleb Enrichment for Age and Gender Recognition. (arXiv:2109.13510v1 [cs.LG])
    (2 min) VoxCeleb datasets are widely used in speaker recognition studies. Our work serves two purposes. First, we provide speaker age labels and (an alternative) annotation of speaker gender. Second, we demonstrate the use of this metadata by constructing age and gender recognition models with different features and classifiers. We query different celebrity databases and apply consensus rules to derive age and gender labels. We also compare the original VoxCeleb gender labels with our labels to identify records that might be mislabeled in the original VoxCeleb data. On the modeling side, we design a comprehensive study of multiple features and models for recognizing gender and age. Our best system, using i-vector features, achieved an F1-score of 0.9829 on the gender recognition task using logistic regression, and the lowest mean absolute error (MAE) in age regression, 9.443 years, was obtained with ridge regression. This indicates the challenge of age estimation from in-the-wild speech data.
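    A minimal sketch of the modeling setup named above (logistic regression for gender, ridge regression for age); the random feature matrices stand in for i-vectors and the splits and dimensions are assumptions.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression, Ridge
    from sklearn.metrics import f1_score, mean_absolute_error

    # placeholder "i-vector" embeddings and labels
    X_train, X_test = np.random.randn(1000, 400), np.random.randn(200, 400)
    y_gender_train, y_gender_test = np.random.randint(0, 2, 1000), np.random.randint(0, 2, 200)
    y_age_train, y_age_test = np.random.uniform(20, 70, 1000), np.random.uniform(20, 70, 200)

    gender_clf = LogisticRegression(max_iter=1000).fit(X_train, y_gender_train)
    age_reg = Ridge(alpha=1.0).fit(X_train, y_age_train)

    print("gender F1:", f1_score(y_gender_test, gender_clf.predict(X_test)))
    print("age MAE:", mean_absolute_error(y_age_test, age_reg.predict(X_test)))
    ```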
    Concept-Aware Denoising Graph Neural Network for Micro-Video Recommendation. (arXiv:2109.13527v1 [cs.IR])
    (2 min) Recently, micro-video sharing platforms such as Kuaishou and TikTok have become a major source of information for people's lives. Given the large traffic volume, short video lifespan and streaming fashion of these services, it has become increasingly pressing to improve the existing recommender systems to accommodate these challenges in a cost-effective way. In this paper, we propose a novel concept-aware denoising graph neural network (named CONDE) for micro-video recommendation. CONDE consists of a three-phase graph convolution process to derive user and micro-video representations: warm-up propagation, graph denoising and preference refinement. A heterogeneous tripartite graph is constructed by connecting user nodes with video nodes, and video nodes with associated concept nodes extracted from the captions and comments of the videos. To address the noisy information in the graph, we introduce a user-oriented graph denoising phase to extract a subgraph which can better reflect the user's preference. Although the main focus of this paper is micro-video recommendation, we also show that our method can be generalized to other types of tasks. Therefore, we also conduct empirical studies on a well-known public E-commerce dataset. The experimental results suggest that the proposed CONDE achieves significantly better recommendation performance than the existing state-of-the-art solutions.
    FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis. (arXiv:2109.13675v1 [cs.SD])
    (2 min) Recently, non-autoregressive neural vocoders have provided remarkable performance in generating high-fidelity speech and have been able to produce synthetic speech in real-time. However, non-autoregressive neural vocoders such as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in terms of modeling audio signals due to their limited expressiveness. In addition, though NanoFlow is a state-of-the-art autoregressive neural vocoder with a very small number of parameters, its performance is marginally lower than WaveFlow. Therefore, in this paper, we propose a new type of autoregressive neural vocoder called FlowVocoder, which has a small memory footprint and is able to generate high-fidelity audio in real-time. Our proposed model improves the expressiveness of flow blocks by operating a mixture of Cumulative Distribution Functions (CDF) for the bipartite transformation. Hence, the proposed model is capable of modeling waveform signals as well as WaveFlow, while its memory footprint is much smaller than WaveFlow. As shown in experiments, FlowVocoder achieves competitive results with baseline methods in terms of both subjective and objective evaluation, and it is more suitable for real-time text-to-speech applications.
    Lyapunov-Net: A Deep Neural Network Architecture for Lyapunov Function Approximation. (arXiv:2109.13359v1 [cs.LG])
    (2 min) We develop a versatile deep neural network architecture, called Lyapunov-Net, to approximate Lyapunov functions of dynamical systems in high dimensions. Lyapunov-Net guarantees positive definiteness, and thus it can be easily trained to satisfy the negative orbital derivative condition, which only renders a single term in the empirical risk function in practice. This significantly reduces the number of hyper-parameters compared to existing methods. We also provide theoretical justifications on the approximation power of Lyapunov-Net and its complexity bounds. We demonstrate the efficiency of the proposed method on nonlinear dynamical systems involving up to 30-dimensional state spaces, and show that the proposed approach significantly outperforms the state-of-the-art methods.
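    The sketch below illustrates the two ingredients described: a network whose output is positive definite by construction, and a single-term empirical risk that penalizes a non-negative orbital derivative along known dynamics f. The specific architecture, margin, and toy dynamics are assumptions, not the paper's exact design.

    ```python
    import torch
    import torch.nn as nn

    class LyapunovNet(nn.Module):
        def __init__(self, dim, hidden=64, alpha=1e-3):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1))
            self.alpha = alpha

        def forward(self, x):
            zero = torch.zeros_like(x)
            # V(0) = 0 and V(x) > 0 for x != 0 by construction
            return (self.phi(x) - self.phi(zero)).pow(2).sum(-1) + self.alpha * x.pow(2).sum(-1)

    def orbital_derivative_loss(model, x, f):
        """Single risk term: penalize dV/dt >= -margin along the vector field f."""
        x = x.requires_grad_(True)
        V = model(x)
        gradV = torch.autograd.grad(V.sum(), x, create_graph=True)[0]
        dVdt = (gradV * f(x)).sum(-1)
        return torch.relu(dVdt + 0.1).mean()

    model = LyapunovNet(dim=2)
    f = lambda x: -x                      # toy stable dynamics dx/dt = -x
    loss = orbital_derivative_loss(model, torch.randn(256, 2), f)
    ```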
    Warp-Refine Propagation: Semi-Supervised Auto-labeling via Cycle-consistency. (arXiv:2109.13432v1 [cs.CV])
    (2 min) Deep learning models for semantic segmentation rely on expensive, large-scale, manually annotated datasets. Labelling is a tedious process that can take hours per image. Automatically annotating video sequences by propagating sparsely labeled frames through time is a more scalable alternative. In this work, we propose a novel label propagation method, termed Warp-Refine Propagation, that combines semantic cues with geometric cues to efficiently auto-label videos. Our method learns to refine geometrically-warped labels and infuse them with learned semantic priors in a semi-supervised setting by leveraging cycle consistency across time. We quantitatively show that our method improves label-propagation by a noteworthy margin of 13.1 mIoU on the ApolloScape dataset. Furthermore, by training with the auto-labelled frames, we achieve competitive results on three semantic-segmentation benchmarks, improving the state-of-the-art by a large margin of 1.8 and 3.61 mIoU on NYU-V2 and KITTI, while matching the current best results on Cityscapes.
    The JHU submission to VoxSRC-21: Track 3. (arXiv:2109.13425v1 [eess.AS])
    (2 min) This technical report describes the Johns Hopkins University speaker recognition system submitted to the VoxCeleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to that of the first-place team in last year's VoxSRC2020 challenge. The main difference is that a recently proposed non-contrastive self-supervised method from computer vision (CV), distillation with no labels (DINO), is used to train our initial model, which outperformed last year's contrastive learning approach based on momentum contrast (MoCo). Also, this requires only a few iterations in the iterative clustering stage, where pseudo labels for supervised embedding learning are updated based on the clusters of the embeddings generated from a model that is continually fine-tuned over iterations. In the final stage, Res2Net50 is trained on the final pseudo labels from the iterative clustering stage. This is our best submitted model to the challenge, showing EERs (%) of 1.89, 6.50, and 6.89 on the VoxCeleb1 test-o, VoxSRC-21 validation, and test trials, respectively.
    An Automated Approach to Causal Inference in Discrete Settings. (arXiv:2109.13471v1 [stat.ME])
    (2 min) When causal quantities cannot be point identified, researchers often pursue partial identification to quantify the range of possible values. However, the peculiarities of applied research conditions can make this analytically intractable. We present a general and automated approach to causal inference in discrete settings. We show causal questions with discrete data reduce to polynomial programming problems, and we present an algorithm to automatically bound causal effects using efficient dual relaxation and spatial branch-and-bound techniques. The user declares an estimand, states assumptions, and provides data (however incomplete or mismeasured). The algorithm then searches over admissible data-generating processes and outputs the most precise possible range consistent with available information -- i.e., sharp bounds -- including a point-identified solution if one exists. Because this search can be computationally intensive, our procedure reports and continually refines non-sharp ranges that are guaranteed to contain the truth at all times, even when the algorithm is not run to completion. Moreover, it offers an additional guarantee we refer to as $\epsilon$-sharpness, characterizing the worst-case looseness of the incomplete bounds. Analytically validated simulations show the algorithm accommodates classic obstacles, including confounding, selection, measurement error, noncompliance, and nonresponse.
    Instance-Based Neural Dependency Parsing. (arXiv:2109.13497v1 [cs.CL])
    (2 min) Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and provide reasonably plausible instance-based explanations.
    Metal Artifact Reduction in 2D CT Images with Self-supervised Cross-domain Learning. (arXiv:2109.13483v1 [eess.IV])
    (2 min) The presence of metallic implants often introduces severe metal artifacts in X-ray CT images, which can adversely influence clinical diagnosis or dose calculation in radiation therapy. In this work, we present a novel deep-learning-based approach for metal artifact reduction (MAR). In order to alleviate the need for anatomically identical CT image pairs (i.e., a metal artifact-corrupted CT image and a metal artifact-free CT image) for network learning, we propose a self-supervised cross-domain learning framework. Specifically, we train a neural network to restore the metal trace region values in the given metal-free sinogram, where the metal trace is identified by the forward projection of metal masks. We then design a novel FBP reconstruction loss to encourage the network to generate more accurate completion results and a residual-learning-based image refinement module to reduce the secondary artifacts in the reconstructed CT images. To preserve the fine structure details and fidelity of the final MAR image, instead of directly adopting CNN-refined images as output, we incorporate metal trace replacement into our framework and replace the metal-affected projections of the original sinogram with the prior sinogram generated by the forward projection of the CNN output. We then use the filtered back projection (FBP) algorithm for the final MAR image reconstruction. We conduct an extensive evaluation on simulated and real artifact data to show the effectiveness of our design. Our method produces superior MAR results and outperforms other compelling methods. We also demonstrate the potential of our framework for other organ sites.
    Physics-Informed Neural Nets for Control of Dynamical Systems. (arXiv:2104.02556v2 [cs.LG] UPDATED)
    (2 min) Physics-informed neural networks (PINNs) impose known physical laws into the learning of deep neural networks, making sure they respect the physics of the process while decreasing the demand for labeled data. For systems represented by Ordinary Differential Equations (ODEs), the conventional PINN has a continuous time input variable and outputs the solution of the corresponding ODE. In their original form, PINNs neither allow control inputs nor can they simulate over long-range intervals without serious degradation in their predictions. In this context, this work presents a new framework called Physics-Informed Neural Nets for Control (PINC), which proposes a novel PINN-based architecture that is amenable to \emph{control} problems and able to simulate over longer-range time horizons that are not fixed beforehand. The framework has new inputs to account for the initial state of the system and the control action. In PINC, the response over the complete time horizon is split such that each smaller interval constitutes a solution of the ODE conditioned on the fixed values of initial state and control action for that interval. The whole response is formed by feeding back the predictions of the terminal state as the initial state for the next interval. This proposal enables the optimal control of dynamic systems, integrating a priori knowledge from experts and data collected from plants into control applications. We showcase our proposal in the control of two nonlinear dynamic systems: the Van der Pol oscillator and the four-tank system.
    Learning to Superoptimize Real-world Programs. (arXiv:2109.13498v1 [cs.LG])
    (2 min) Program optimization is the process of modifying software to execute more efficiently. Because finding the optimal program is generally undecidable, modern compilers usually resort to expert-written heuristic optimizations. In contrast, superoptimizers attempt to find the optimal program by employing significantly more expensive search and constraint solving techniques. Generally, these methods do not scale well to programs in real development scenarios, and as a result superoptimization has largely been confined to small-scale, domain-specific, and/or synthetic program benchmarks. In this paper, we propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models. We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly, which enables experimentation on large-scale optimization of real-world programs. We propose an approach, Self Imitation Learning for Optimization (SILO), that is easy to implement and outperforms a standard policy gradient learning approach on our Big Assembly benchmark. Our method, SILO, superoptimizes an expected 6.2% of the programs in our test set when compared with the gcc version 10.3 compiler's aggressive optimization level -O3. We also report that SILO's rate of superoptimization on our test set is over five times that of a standard policy gradient approach and of a model pre-trained on compiler optimization demonstrations.
    Multiwavelet-based Operator Learning for Differential Equations. (arXiv:2109.13459v1 [cs.LG])
    (2 min) The solution of a partial differential equation can be obtained by computing the inverse operator map between the input and the solution space. Towards this end, we introduce a \textit{multiwavelet-based neural operator learning scheme} that compresses the associated operator's kernel using fine-grained wavelets. By explicitly embedding the inverse multiwavelet filters, we learn the projection of the kernel onto fixed multiwavelet polynomial bases. The projected kernel is trained at multiple scales derived from repeated computation of the multiwavelet transform. This allows learning the complex dependencies at various scales and results in a resolution-independent scheme. Compared to prior works, we exploit the fundamental properties of the operator's kernel, which enables a numerically efficient representation. We perform experiments on the Korteweg-de Vries (KdV) equation, Burgers' equation, Darcy Flow, and the Navier-Stokes equation. Compared with existing neural operator approaches, our model shows significantly higher accuracy and achieves state-of-the-art results on a range of datasets. For the time-varying equations, the proposed method exhibits a ($2X-10X$) improvement ($0.0018$ ($0.0033$) relative $L2$ error for Burgers' (KdV) equation). By learning the mappings between function spaces, the proposed method has the ability to find the solution of a high-resolution input after learning from lower-resolution data.
    To Which Out-Of-Distribution Object Orientations Are DNNs Capable of Generalizing?. (arXiv:2109.13445v1 [cs.CV])
    (2 min) The capability of Deep Neural Networks (DNNs) to recognize objects in orientations outside the distribution of the training data, i.e., out-of-distribution (OoD) orientations, is not well understood. For humans, behavioral studies showed that recognition accuracy varies across OoD orientations, where generalization is much better for some orientations than for others. In contrast, for DNNs, it remains unknown how generalization abilities are distributed among OoD orientations. In this paper, we investigate the limitations of DNNs' generalization capacities by systematically inspecting patterns of success and failure of DNNs across OoD orientations. We use an intuitive and controlled, yet challenging learning paradigm, in which some instances of an object category are seen at only a few geometrically restricted orientations, while other instances are seen at all orientations. The effect of data diversity is also investigated by increasing the number of instances seen at all orientations in the training set. We present a comprehensive analysis of DNNs' generalization abilities and limitations for representative architectures (ResNet, Inception, DenseNet and CORnet). Our results reveal an intriguing pattern -- DNNs are only capable of generalizing to instances of objects that appear like 2D, i.e., in-plane, rotations of in-distribution orientations.
    Structure Learning for Directed Trees. (arXiv:2108.08871v2 [stat.ML] UPDATED)
    (2 min) Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large nonlinear models, these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method based on Chu-Liu-Edmonds' algorithm we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce a method for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods.
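    As a toy illustration of the Chu-Liu-Edmonds step mentioned above, one can score every directed edge between variables and extract the best-scoring directed tree with networkx; the squared-correlation score below is a placeholder for illustration, not the CAT edge score.

    ```python
    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 4))              # 500 samples, 4 variables

    # complete directed graph with a placeholder "fit" score on each edge
    G = nx.DiGraph()
    for i in range(data.shape[1]):
        for j in range(data.shape[1]):
            if i != j:
                score = np.corrcoef(data[:, i], data[:, j])[0, 1] ** 2
                G.add_edge(i, j, weight=score)

    # best-scoring directed spanning tree (Chu-Liu-Edmonds / Edmonds' algorithm)
    tree = nx.maximum_spanning_arborescence(G)
    print(sorted(tree.edges()))
    ```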
    Exploring More When It Needs in Deep Reinforcement Learning. (arXiv:2109.13477v1 [cs.LG])
    (2 min) We propose an exploration mechanism for policies in Deep Reinforcement Learning, called Add Noise to Noise (AN2N), which explores more when the agent needs it. The core idea is: when the Deep Reinforcement Learning agent has historically performed poorly in a state, it needs to explore more. So we use cumulative rewards to evaluate in which past states the agent has not performed well, and use cosine distance to measure whether the current state needs to be explored more. This mechanism is conducive to efficient exploration. We combine the proposed exploration mechanism AN2N with the Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC) algorithms, and apply it to continuous control tasks such as HalfCheetah, Hopper, and Swimmer, achieving considerable improvements in performance and convergence speed.
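    A minimal sketch of the decision rule described above: keep a memory of states with poor cumulative reward and inject extra action noise when the current state is close to them in cosine distance. The threshold and noise boost are illustrative assumptions, not the paper's settings.

    ```python
    import numpy as np

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def extra_noise_scale(state, bad_states, dist_threshold=0.2, boost=0.3):
        """Return additional action-noise std if the state resembles a poorly performing one."""
        for s in bad_states:
            if cosine_distance(state, s) < dist_threshold:
                return boost
        return 0.0

    # usage sketch:
    # action = policy(state) + np.random.normal(0, base_sigma + extra_noise_scale(state, bad_states), act_dim)
    ```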
    Synerise at RecSys 2021: Twitter user engagement prediction with a fast neural model. (arXiv:2109.12985v2 [cs.IR] UPDATED)
    (2 min) In this paper we present our 2nd place solution to ACM RecSys 2021 Challenge organized by Twitter. The challenge aims to predict user engagement for a set of tweets, offering an exceptionally large data set of 1 billion data points sampled from over four weeks of real Twitter interactions. Each data point contains multiple sources of information, such as tweet text along with engagement features, user features, and tweet features. The challenge brings the problem close to a real production environment by introducing strict latency constraints in the model evaluation phase: the average inference time for single tweet engagement prediction is limited to 6ms on a single CPU core with 64GB memory. Our proposed model relies on extensive feature engineering performed with methods such as the Efficient Manifold Density Estimator (EMDE) - our previously introduced algorithm based on Locality Sensitive Hashing method, and novel Fourier Feature Encoding, among others. In total, we create numerous features describing a user's Twitter account status and the content of a tweet. In order to adhere to the strict latency constraints, the underlying model is a simple residual feed-forward neural network. The system is a variation of our previous methods which proved successful in KDD Cup 2021, WSDM Challenge 2021, and SIGIR eCom Challenge 2020. We release the source code at: https://github.com/Synerise/recsys-challenge-2021
    Deep Reinforcement Learning with Adjustments. (arXiv:2109.13463v1 [cs.LG])
    (2 min) Deep reinforcement learning (RL) algorithms can learn complex policies to optimize agent operation over time. RL algorithms have shown promising results in solving complicated problems in recent years. However, their application on real-world physical systems remains limited. Despite the advancements in RL algorithms, the industries often prefer traditional control strategies. Traditional methods are simple, computationally efficient and easy to adjust. In this paper, we first propose a new Q-learning algorithm for continuous action space, which can bridge the control and RL algorithms and bring us the best of both worlds. Our method can learn complex policies to achieve long-term goals and at the same time it can be easily adjusted to address short-term requirements without retraining. Next, we present an approximation of our algorithm which can be applied to address short-term requirements of any pre-trained RL algorithm. The case studies demonstrate that both our proposed method as well as its practical approximation can achieve short-term and long-term goals without complex reward functions.
    Interactive Dynamic Walking: Learning Gait Switching Policies with Generalization Guarantees. (arXiv:2109.13417v1 [cs.RO])
    (2 min) In this paper, we consider the problem of adapting a dynamically walking bipedal robot to follow a leading co-worker while engaging in tasks that require physical interaction. Our approach relies on switching among a family of Dynamic Movement Primitives (DMPs) as governed by a supervisor. We train the supervisor to orchestrate the switching among the DMPs in order to adapt to the leader's intentions, which are only implicitly available in the form of interaction forces. The primary contribution of our approach is its ability to furnish certificates of generalization to novel leader intentions for the trained supervisor. This is achieved by leveraging the Probably Approximately Correct (PAC)-Bayes bounds from generalization theory. We demonstrate the efficacy of our approach by training a neural-network supervisor to adapt the gait of a dynamically walking biped to a leading collaborator whose intended trajectory is not known explicitly.
    The Role of Lookahead and Approximate Policy Evaluation in Policy Iteration with Linear Value Function Approximation. (arXiv:2109.13419v1 [cs.LG])
    (2 min) When the sizes of the state and action spaces are large, solving MDPs can be computationally prohibitive even if the probability transition matrix is known. So in practice, a number of techniques are used to approximately solve the dynamic programming problem, including lookahead, approximate policy evaluation using an m-step return, and function approximation. In a recent paper, Efroni et al. (2019) studied the impact of lookahead on the convergence rate of approximate dynamic programming. In this paper, we show that these convergence results change dramatically when function approximation is used in conjunction with lookahead and approximate policy evaluation using an m-step return. Specifically, we show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed for the algorithm to even converge. And when this condition is met, we characterize the finite-time performance of policies obtained using such approximate policy iteration. Our results are presented for two different procedures to compute the function approximation: linear least-squares regression and gradient descent.
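    To make the two ingredients concrete, the snippet below computes an m-step return target and fits a linear value function by least squares; the rewards, features, and bootstrap values are random placeholders, not the paper's experimental setup.

    ```python
    import numpy as np

    def m_step_return(rewards, v_end, gamma, m):
        """R_t = r_t + gamma r_{t+1} + ... + gamma^{m-1} r_{t+m-1} + gamma^m V(s_{t+m})."""
        return sum(gamma ** k * rewards[k] for k in range(m)) + gamma ** m * v_end

    rng = np.random.default_rng(0)
    phi = rng.normal(size=(100, 8))        # features of 100 sampled states
    targets = np.array([m_step_return(rng.random(5), v_end=rng.random(), gamma=0.9, m=5)
                        for _ in range(100)])

    # linear value function V(s) = phi(s) . theta fitted by least squares on the m-step targets
    theta, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    ```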
    ST-MAML: A Stochastic-Task based Method for Task-Heterogeneous Meta-Learning. (arXiv:2109.13305v1 [cs.LG])
    (2 min) Optimization-based meta-learning typically assumes tasks are sampled from a single distribution - an assumption that oversimplifies and limits the diversity of tasks that meta-learning can model. Handling tasks from multiple different distributions is challenging for meta-learning due to a so-called task ambiguity issue. This paper proposes a novel method, ST-MAML, that empowers model-agnostic meta-learning (MAML) to learn from multiple task distributions. ST-MAML encodes tasks using a stochastic neural network module that summarizes every task with a stochastic representation. The proposed Stochastic Task (ST) strategy allows a meta-model to get tailored for the current task and enables us to learn a distribution of solutions for an ambiguous task. ST-MAML also propagates the task representation to revise the encoding of input variables. Empirically, we demonstrate that ST-MAML matches or outperforms the state-of-the-art on two few-shot image classification tasks, one curve regression benchmark, one image completion problem, and a real-world temperature prediction application. To the best of the authors' knowledge, this is the first time an optimization-based meta-learning method has been applied to a large-scale real-world task.
    Lithium-ion Battery State of Health Estimation based on Cycle Synchronization using Dynamic Time Warping. (arXiv:2109.13448v1 [cs.LG])
    (2 min) State of health (SOH) estimation plays an essential role in battery-powered applications to avoid unexpected breakdowns due to battery capacity fading. However, few studies have paid attention to the problem of the uneven length of degrading cycles, simply employing manual operation or leaving it to the automatic processing mechanisms of advanced machine learning models, like long short-term memory (LSTM). As a result, this causes information loss and caps the full capability of data-driven SOH estimation models. To address this challenge, this paper proposes an innovative cycle synchronization method to change the existing coordinate system using dynamic time warping, not only enabling equal-length inputs to the estimation model but also preserving all information. By exploiting the time information of the time series, the proposed method embeds the time index and the original measurements into a novel indicator to reflect the battery degradation status, which has the same length across cycles. Adopting the LSTM as the basic estimation model, the cycle synchronization-based SOH model can significantly improve the prediction accuracy by more than 30% compared to the traditional LSTM.
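    A pure-numpy dynamic time warping sketch of the cycle-synchronization idea: warp each degradation curve onto a reference cycle so that all cycles share a common length. The synthetic voltage curves below are placeholders, not battery data.

    ```python
    import numpy as np

    def dtw_path(a, b):
        """O(n*m) dynamic time warping; returns the optimal alignment path as (i, j) pairs."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    reference = np.linspace(4.2, 3.0, 100)                       # reference discharge curve
    cycle = np.linspace(4.2, 3.0, 87) + 0.01 * np.random.randn(87)
    aligned = np.empty_like(reference)
    for i, j in dtw_path(reference, cycle):                      # warp the cycle onto the reference grid
        aligned[i] = cycle[j]
    ```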
    DEBOSH: Deep Bayesian Shape Optimization. (arXiv:2109.13337v1 [cs.LG])
    (2 min) Shape optimization is at the heart of many industrial applications, such as aerodynamics, heat transfer, and structural analysis. It has recently been shown that Graph Neural Networks (GNNs) can predict the performance of a shape quickly and accurately and be used to optimize more effectively than traditional techniques that rely on response surfaces obtained by Kriging. However, GNNs suffer from the fact that they do not evaluate their own accuracy, which is something Bayesian Optimization methods require. Therefore, estimating confidence in generated predictions is necessary to go beyond straight deterministic optimization, which is less effective. In this paper, we demonstrate that an ensemble-based technique can overcome this limitation and outperform the state-of-the-art. Our experiments on diverse aerodynamics and structural analysis tasks prove that adding uncertainty to shape optimization significantly improves the quality of resulting shapes and reduces the time required for the optimization.
    Urban Driver: Learning to Drive from Real-world Demonstrations Using Policy Gradients. (arXiv:2109.13333v1 [cs.RO])
    (2 min) In this work we are the first to present an offline policy gradient method for learning imitative policies for complex urban driving from a large corpus of real-world demonstrations. This is achieved by building a differentiable data-driven simulator on top of perception outputs and high-fidelity HD maps of the area. It allows us to synthesize new driving experiences from existing demonstrations using mid-level representations. Using this simulator we then train a policy network in closed-loop employing policy gradients. We train our proposed method on 100 hours of expert demonstrations on urban roads and show that it learns complex driving policies that generalize well and can perform a variety of driving maneuvers. We demonstrate this in simulation as well as deploy our model to self-driving vehicles in the real-world. Our method outperforms previously demonstrated state-of-the-art for urban driving scenarios -- all this without the need for complex state perturbations or collecting additional on-policy data during training. We make code and data publicly available.
    Unrolling SGD: Understanding Factors Influencing Machine Unlearning. (arXiv:2109.13398v1 [cs.LG])
    (2 min) Machine unlearning is the process through which a deployed machine learning model forgets about one of its training data points. While naively retraining the model from scratch is an option, it is almost always associated with a large computational effort for deep learning models. Thus, several approaches to approximately unlearn have been proposed along with corresponding metrics that formalize what it means for a model to forget about a data point. In this work, we first taxonomize approaches and metrics of approximate unlearning. As a result, we identify verification error, i.e., the L2 difference between the weights of an approximately unlearned and a naively retrained model, as a metric approximate unlearning should optimize for as it implies a large class of other metrics. We theoretically analyze the canonical stochastic gradient descent (SGD) training algorithm to surface the variables which are relevant to reducing the verification error of approximate unlearning for SGD. From this analysis, we first derive an easy-to-compute proxy for verification error (termed unlearning error). The analysis also informs the design of a new training objective penalty that limits the overall change in weights during SGD and as a result facilitates approximate unlearning with lower verification error. We validate our theoretical work through an empirical evaluation on CIFAR-10, CIFAR-100, and IMDB sentiment analysis.
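    Under the definitions above, a minimal sketch of the two quantities: verification error as the L2 distance between two models' weights, and an SGD step with a penalty that limits the overall weight change. The penalty weight and usage are assumptions, not the paper's exact objective.

    ```python
    import torch
    import torch.nn as nn

    def verification_error(model_a, model_b):
        """L2 distance between the flattened weights of two models."""
        return torch.sqrt(sum(((pa - pb) ** 2).sum()
                              for pa, pb in zip(model_a.parameters(), model_b.parameters())))

    def sgd_step_with_weight_penalty(model, loss, init_params, lr=0.01, lam=1e-3):
        """One SGD step on loss + lam * ||w - w_init||^2, discouraging large drift from w_init."""
        penalty = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), init_params))
        total = loss + lam * penalty
        grads = torch.autograd.grad(total, list(model.parameters()))
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p -= lr * g

    net_a, net_b = nn.Linear(4, 2), nn.Linear(4, 2)
    print(verification_error(net_a, net_b).item())
    ```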
    CateCom: a practical data-centric approach to categorization of computational models. (arXiv:2109.13452v1 [cs.DB])
    (2 min) The advent of data-driven science in the 21st century brought about the need for well-organized structured data and associated infrastructure able to facilitate the applications of Artificial Intelligence and Machine Learning. We present an effort aimed at organizing the diverse landscape of physics-based and data-driven computational models in order to facilitate the storage of associated information as structured data. We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework that is: (1) capable of uniquely describing the approaches in structured data, (2) flexible enough to cover the majority of widely used models, and (3) utilizes collective intelligence through community contributions. We present example database schemas and corresponding data structures and explain how these are deployed in software at the time of this writing.
    Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation. (arXiv:2109.13318v1 [cs.CL])
    (2 min) Automating sign language translation (SLT) is a challenging real-world application. Despite its societal importance, though, research progress in the field remains rather poor. Crucially, existing methods that yield viable performance necessitate the availability of laborious-to-obtain gloss sequence ground truth. In this paper, we attenuate this need by introducing an end-to-end SLT model that does not entail explicit use of glosses; the model only needs text ground truth. This is in stark contrast to existing end-to-end models that use gloss sequence ground truth, either in the form of a modality that is recognized at an intermediate model stage, or in the form of a parallel output process, jointly trained with the SLT model. Our approach constitutes a Transformer network with a novel type of layer that combines: (i) local winner-takes-all (LWTA) layers with stochastic winner sampling, instead of conventional ReLU layers, (ii) stochastic weights with posterior distributions estimated via variational inference, and (iii) a weight compression technique at inference time that exploits estimated posterior variance to perform massive, almost lossless compression. We demonstrate that our approach can reach the currently best reported BLEU-4 score on the PHOENIX 2014T benchmark, but without making use of glosses for model training, and with a memory footprint reduced by more than 70%.
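    A sketch of ingredient (i), a local winner-takes-all layer with stochastic winner sampling: units compete in small blocks, one winner per block is sampled from a softmax over the block's activations, and the losers are zeroed. Block size and sampling details are assumptions for illustration.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StochasticLWTA(nn.Module):
        def __init__(self, in_features, out_features, block_size=2):
            super().__init__()
            assert out_features % block_size == 0
            self.linear = nn.Linear(in_features, out_features)
            self.block_size = block_size

        def forward(self, x):
            h = self.linear(x)                                     # (batch, out_features)
            b, u = h.shape
            h = h.view(b, u // self.block_size, self.block_size)   # competing blocks
            probs = F.softmax(h, dim=-1)
            winners = torch.multinomial(probs.view(-1, self.block_size), 1)
            mask = F.one_hot(winners.squeeze(-1), self.block_size).float().view_as(h)
            return (h * mask).view(b, u)                           # losers are zeroed

    layer = StochasticLWTA(16, 8, block_size=2)
    out = layer(torch.randn(4, 16))
    ```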
    Go-Blend behavior and affect. (arXiv:2109.13388v1 [cs.LG])
    (2 min) This paper proposes a paradigm shift for affective computing by viewing the affect modeling task as a reinforcement learning process. According to our proposed framework the context (environment) and the actions of an agent define the common representation that interweaves behavior and affect. To realise this framework we build on recent advances in reinforcement learning and use a modified version of the Go-Explore algorithm which has showcased supreme performance in hard exploration tasks. In this initial study, we test our framework in an arcade game by training Go-Explore agents to both play optimally and attempt to mimic human demonstrations of arousal. We vary the degree of importance between optimal play and arousal imitation and create agents that can effectively display a palette of affect and behavioral patterns. Our Go-Explore implementation not only introduces a new paradigm for affect modeling; it empowers believable AI-based game testing by providing agents that can blend and express a multitude of behavioral and affective patterns.
    IGAN: Inferent and Generative Adversarial Networks. (arXiv:2109.13360v1 [cs.LG])
    (2 min) I present IGAN (Inferent Generative Adversarial Networks), a neural architecture that learns both a generative and an inference model on a complex high dimensional data distribution, i.e. a bidirectional mapping between data samples and a simpler low-dimensional latent space. It extends the traditional GAN framework with inference by rewriting the adversarial strategy in both the image and the latent space with an entangled game between data-latent encoded posteriors and priors. It brings a measurable stability and convergence to the classical GAN scheme, while keeping its generative quality and remaining simple and frugal in order to run on a lab PC. IGAN fosters the encoded latents to span the full prior space: this enables the exploitation of an enlarged and self-organised latent space in an unsupervised manner. An analysis of previously published articles sets the theoretical ground for the proposed algorithm. A qualitative demonstration of potential applications like self-supervision or multi-modal data translation is given on common image datasets including SAR and optical imagery.
    Probabilistic modeling of lake surface water temperature using a Bayesian spatio-temporal graph convolutional neural network. (arXiv:2109.13235v1 [cs.LG])
    (2 min) Accurate lake temperature estimation is essential for numerous problems tackled in both hydrological and ecological domains. Nowadays physical models are developed to estimate lake dynamics; however, computations needed for accurate estimation of lake surface temperature can get prohibitively expensive. We propose to aggregate simulations of lake temperature at a certain depth together with a range of meteorological features to probabilistically estimate lake surface temperature. Accordingly, we introduce a spatio-temporal neural network that combines Bayesian recurrent neural networks and Bayesian graph convolutional neural networks. This work demonstrates that the proposed graphical model can deliver homogeneously good performance covering the whole lake surface despite having sparse training data available. Quantitative results are compared with a state-of-the-art Bayesian deep learning method. Code for the developed architectural layers, as well as demo scripts, are available on https://renkulab.io/projects/das/bstnn.
    Graph Neural Network-based Resource Allocation Strategies for Multi-Object Spectroscopy. (arXiv:2109.13361v1 [astro-ph.IM])
    (2 min) Resource allocation problems are often approached with linear programming techniques. But many concrete allocation problems in the experimental and observational sciences cannot or should not be expressed in the form of linear objective functions. Even if the objective is linear, its parameters may not be known beforehand because they depend on the results of the experiment for which the allocation is to be determined. To address these challenges, we present a bipartite Graph Neural Network architecture for trainable resource allocation strategies. Items of value and constraints form the two sets of graph nodes, which are connected by edges corresponding to possible allocations. The GNN is trained on simulations or past problem occurrences to maximize any user-supplied, scientifically motivated objective function, augmented by an infeasibility penalty. The amount of feasibility violation can be tuned in relation to any available slack in the system. We apply this method to optimize the astronomical target selection strategy for the highly multiplexed Subaru Prime Focus Spectrograph instrument, where it shows superior results to direct gradient descent optimization and extends the capabilities of the currently employed solver which uses linear objective functions. The development of this method enables fast adjustment and deployment of allocation strategies, statistical analyses of allocation patterns, and fully differentiable, science-driven solutions for resource allocation problems.
    An Adaptive Deep Learning Framework for Day-ahead Forecasting of Photovoltaic Power Generation. (arXiv:2109.13442v1 [cs.LG])
    (2 min) Accurate forecasts of photovoltaic power generation (PVPG) are essential to optimize operations between energy supply and demand. Recently, the proliferation of sensors and smart meters has produced an enormous volume of data, which supports the development of data-based PVPG forecasting. Although emerging deep learning (DL) models, such as the long short-term memory (LSTM) model, based on historical data, have provided effective solutions for PVPG forecasting with great success, these models utilize offline learning. As a result, DL models cannot take advantage of the opportunity to learn from newly-arrived data, and are unable to handle concept drift caused by installing extra PV units and unforeseen PV unit failures. Consequently, to improve day-ahead PVPG forecasting accuracy, as well as eliminate the impacts of concept drift, this paper proposes an adaptive LSTM (AD-LSTM) model, which is a DL framework that can not only acquire general knowledge from historical data, but also dynamically learn specific knowledge from newly-arrived data. A two-phase adaptive learning strategy (TP-ALS) is integrated into AD-LSTM, and a sliding window (SDWIN) algorithm is proposed to detect concept drift in PV systems. Multiple datasets from PV systems are utilized to assess the feasibility and effectiveness of the proposed approaches. The developed AD-LSTM model demonstrates greater forecasting capability than the offline LSTM model, particularly in the presence of concept drift. Additionally, the proposed AD-LSTM model also achieves superior performance in terms of day-ahead PVPG forecasting compared to other traditional machine learning models and statistical models in the literature.
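    In the spirit of the sliding-window idea above, a trivial drift check (an assumption for illustration, not the paper's SDWIN test) flags concept drift when the recent forecast-error window is substantially worse than the historical level, which would then trigger fine-tuning on newly-arrived data.

    ```python
    import numpy as np

    def drift_detected(errors, window=48, factor=1.5):
        """Flag drift when the mean error of the last `window` steps exceeds the historical mean by `factor`."""
        if len(errors) < 2 * window:
            return False
        recent = np.mean(errors[-window:])
        historical = np.mean(errors[:-window])
        return recent > factor * historical

    # if drift_detected(forecast_errors): fine-tune the LSTM on the newly-arrived data window
    ```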
    Audio-to-Image Cross-Modal Generation. (arXiv:2109.13354v1 [cs.MM])
    (2 min) Cross-modal representation learning allows to integrate information from different modalities into one representation. At the same time, research on generative models tends to focus on the visual domain with less emphasis on other domains, such as audio or text, potentially missing the benefits of shared representations. Studies successfully linking more than one modality in the generative setting are rare. In this context, we verify the possibility to train variational autoencoders (VAEs) to reconstruct image archetypes from audio data. Specifically, we consider VAEs in an adversarial training framework in order to ensure more variability in the generated data and find that there is a trade-off between the consistency and diversity of the generated images - this trade-off can be governed by scaling the reconstruction loss up or down, respectively. Our results further suggest that even in the case when the generated images are relatively inconsistent (diverse), features that are critical for proper image classification are preserved.
    The Impact of Domain Shift on Left and Right Ventricle Segmentation in Short Axis Cardiac MR Images. (arXiv:2109.13230v1 [eess.IV])
    (2 min) Domain shift refers to the difference in the data distribution of two datasets, normally between the training set and the test set for machine learning algorithms. Domain shift is a serious problem for generalization of machine learning models and it is well-established that a domain shift between the training and test sets may cause a drastic drop in the model's performance. In medical imaging, there can be many sources of domain shift such as different scanners or scan protocols, different pathologies in the patient population, anatomical differences in the patient population (e.g. men vs women) etc. Therefore, in order to train models that have good generalization performance, it is important to be aware of the domain shift problem, its potential causes and to devise ways to address it. In this paper, we study the effect of domain shift on left and right ventricle blood pool segmentation in short axis cardiac MR images. Our dataset contains short axis images from 4 different MR scanners and 3 different pathology groups. The training is performed with nnUNet. The results show that scanner differences cause a greater drop in performance compared to changing the pathology group, and that the impact of domain shift is greater on right ventricle segmentation compared to left ventricle segmentation. Increasing the number of training subjects increased cross-scanner performance more than in-scanner performance at small training set sizes, but this difference in improvement decreased with larger training set sizes. Training models using data from multiple scanners improved cross-domain performance.
    Exploring The Role of Local and Global Explanations in Recommender Systems. (arXiv:2109.13301v1 [cs.IR])
    (2 min) Explanations are well-known to improve recommender systems' transparency. These explanations may be local, explaining an individual recommendation, or global, explaining the recommender model in general. Despite their widespread use, there has been little investigation into the relative benefits of these two approaches. Do they provide the same benefits to users, or do they serve different purposes? We conducted a 30-participant exploratory study and a 30-participant controlled user study with a research-paper recommender system to analyze how providing participants local, global, or both explanations influences user understanding of system behavior. Our results provide evidence suggesting that both explanations are more helpful than either alone for explaining how to improve recommendations, yet both appeared less helpful than global alone for efficiency in identifying false positives and negatives. However, we note that the two explanation approaches may be better compared in the context of a higher-stakes or more opaque domain.
    When in Doubt: Improving Classification Performance with Alternating Normalization. (arXiv:2109.13449v1 [cs.LG])
    (2 min) We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification. CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution using the predicted class distributions of high-confidence validation examples. CAN is easily applicable to any probabilistic classifier, with minimal computation overhead. We analyze the properties of CAN using simulated experiments, and empirically demonstrate its effectiveness across a diverse set of classification tasks.
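    A hedged sketch of the post-processing step described: stack the predicted distributions of high-confidence validation examples with the distribution to adjust, then alternate a column normalization toward an estimated class prior with a row normalization back to valid distributions. The iteration count and prior estimate are assumptions, not necessarily the paper's exact procedure.

    ```python
    import numpy as np

    def alternating_normalization(p_target, P_confident, n_iter=3):
        L = np.vstack([P_confident, p_target[None, :]])        # (m+1, n_classes)
        prior = P_confident.mean(axis=0)                       # estimated class prior
        for _ in range(n_iter):
            L = L / L.sum(axis=0, keepdims=True) * prior       # column step toward the prior
            L = L / L.sum(axis=1, keepdims=True)               # row step: rows sum to one
        return L[-1]                                           # adjusted distribution for the target

    P_confident = np.array([[0.9, 0.05, 0.05], [0.1, 0.85, 0.05], [0.05, 0.1, 0.85]])
    p_target = np.array([0.4, 0.35, 0.25])
    print(alternating_normalization(p_target, P_confident))
    ```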
    The edge of chaos: quantum field theory and deep neural networks. (arXiv:2109.13247v1 [hep-th])
    (2 min) We explicitly construct the quantum field theory corresponding to a general class of deep neural networks encompassing both recurrent and feedforward architectures. We first consider the mean-field theory (MFT) obtained as the leading saddlepoint in the action, and derive the condition for criticality via the largest Lyapunov exponent. We then compute the loop corrections to the correlation function in a perturbative expansion in the ratio of depth $T$ to width $N$, and find a precise analogy with the well-studied $O(N)$ vector model, in which the variance of the weight initializations plays the role of the 't Hooft coupling. In particular, we compute both the $\mathcal{O}(1)$ corrections quantifying fluctuations from typicality in the ensemble of networks, and the subleading $\mathcal{O}(T/N)$ corrections due to finite-width effects. These provide corrections to the correlation length that controls the depth to which information can propagate through the network, and thereby sets the scale at which such networks are trainable by gradient descent. Our analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.
    Introducing the viewpoint in the resource description using machine learning. (arXiv:2109.13306v1 [cs.LG])
    (2 min) Search engines allow providing users with data and information according to their interests and specialty. Thus, it is necessary to exploit descriptions of the resources that take viewpoints into consideration. Generally, resource descriptions are available in RDF (e.g., DBpedia for Wikipedia content). However, these descriptions do not take viewpoints into consideration. In this paper, we propose a new approach that converts a classic RDF resource description into a resource description that takes viewpoints into consideration. To detect viewpoints in the document, a machine learning technique is applied to an instantiated ontology, which represents the viewpoint in a given domain. An experimental study shows that converting the classic RDF resource description into a viewpoint-aware resource description yields highly relevant responses to the user's requests.
    GANG-MAM: GAN based enGine for Modifying Android Malware. (arXiv:2109.13297v1 [cs.CR])
    (2 min) Malware detectors based on machine learning are vulnerable to adversarial attacks. Generative Adversarial Networks (GAN) are architectures based on Neural Networks that could produce successful adversarial samples. The interest towards this technology is quickly growing. In this paper, we propose a system that produces a feature vector for making an Android malware strongly evasive and then modify the malicious program accordingly. Such a system could have a twofold contribution: it could be used to generate datasets to validate systems for detecting GAN-based malware and to enlarge the training and testing dataset for making more robust malware classifiers.
    FedIPR: Ownership Verification for Federated Deep Neural Network Models. (arXiv:2109.13236v1 [cs.LG])
    (2 min) Federated learning models must be protected against plagiarism since these models are built upon valuable training data owned by multiple institutions or people. This paper illustrates a novel federated deep neural network (FedDNN) ownership verification scheme that allows ownership signatures to be embedded and verified to claim legitimate intellectual property rights (IPR) of FedDNN models, in case the models are illegally copied, re-distributed or misused. The effectiveness of the embedded ownership signatures is theoretically justified by proven conditions under which signatures can be embedded and detected by multiple clients without disclosing private signatures. Extensive experimental results on the CIFAR10 and CIFAR100 image datasets demonstrate that signatures of varying bit lengths can be embedded and reliably detected without affecting the models' classification performance. Signatures are also robust against removal attacks including fine-tuning and pruning.
    Bayesian Transfer Learning: An Overview of Probabilistic Graphical Models for Transfer Learning. (arXiv:2109.13233v1 [cs.LG])
    (2 min) Transfer learning, in which transferable knowledge is extracted from the source domain(s) and reused in a target domain, has become a research area of great interest in the field of artificial intelligence. Probabilistic graphical models (PGMs) have been recognized as a powerful tool for modeling complex systems with many advantages, e.g., the ability to handle uncertainty and good interpretability. Considering the success of these two aforementioned research areas, it seems natural to apply PGMs to transfer learning. However, although there are already some excellent PGMs specific to transfer learning in the literature, the potential of PGMs for this problem is still grossly underestimated. This paper aims to boost the development of PGMs for transfer learning by 1) examining the pilot studies on PGMs specific to transfer learning, i.e., analyzing and summarizing the existing mechanisms particularly designed for knowledge transfer; 2) discussing examples of real-world transfer problems where existing PGMs have been successfully applied; and 3) exploring several potential research directions on transfer learning using PGMs.
    Contributions to Large Scale Bayesian Inference and Adversarial Machine Learning. (arXiv:2109.13232v1 [stat.ML])
    (2 min) The rampant adoption of ML methodologies has revealed that models are usually adopted to make decisions without taking into account the uncertainties in their predictions. More critically, they can be vulnerable to adversarial examples. Thus, we believe that developing ML systems that take into account predictive uncertainties and are robust against adversarial examples is a must for critical, real-world tasks. We start with a case study in retailing. We propose a robust implementation of the Nerlove-Arrow model using a Bayesian structural time series model. Its Bayesian nature facilitates incorporating prior information reflecting the manager's views, which can be updated with relevant data. However, this case adopted classical Bayesian techniques, such as the Gibbs sampler. Nowadays, the ML landscape is pervaded with neural networks and this chapter also surveys current developments in this sub-field. Then, we tackle the problem of scaling Bayesian inference to complex models and large data regimes. In the first part, we propose a unifying view of two different Bayesian inference algorithms, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) and Stein Variational Gradient Descent (SVGD), leading to improved and efficient novel sampling schemes. In the second part, we develop a framework to boost the efficiency of Bayesian inference in probabilistic models by embedding a Markov chain sampler within a variational posterior approximation. After that, we present an alternative perspective on adversarial classification based on adversarial risk analysis, and leveraging the scalable Bayesian approaches from chapter 2. In chapter 4 we turn to reinforcement learning, introducing Threatened Markov Decision Processes, showing the benefits of accounting for adversaries in RL while the agent learns.

2021-09-28

  • cs.CL updates on arXiv.org

    Extracting and Inferring Personal Attributes from Dialogue. (arXiv:2109.12702v1 [cs.CL])
    (0 min) Personal attributes represent structured information about a person, such as their hobbies, pets, family, likes and dislikes. In this work, we introduce the tasks of extracting and inferring personal attributes from human-human dialogue. We first demonstrate the benefit of incorporating personal attributes in a social chit-chat dialogue model and a task-oriented dialogue setting. Thus motivated, we propose the tasks of personal attribute extraction and inference, and then analyze the linguistic demands of these tasks. To meet these challenges, we introduce a simple and extensible model that combines an autoregressive language model utilizing constrained attribute generation with a discriminative reranker. Our model outperforms strong baselines on extracting personal attributes as well as on inferring personal attributes that are not contained verbatim in utterances and instead require commonsense reasoning and lexical inference, which occur frequently in everyday conversation.
    MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents. (arXiv:2109.12595v1 [cs.CL])
    (2 min) We propose MultiDoc2Dial, a new task and dataset on modeling goal-oriented dialogues grounded in multiple documents. Most previous works treat document-grounded dialogue modeling as a machine reading comprehension task based on a single given document or passage. In this work, we aim to address more realistic scenarios where a goal-oriented information-seeking conversation involves multiple topics, and hence is grounded on different documents. To facilitate such a task, we introduce a new dataset that contains dialogues grounded in multiple documents from four different domains. We also explore modeling the dialogue-based and document-based context in the dataset. We present strong baseline approaches and various experimental results, aiming to support further research efforts on such a task.
    COVID-19 Twitter Dataset with Latent Topics, Sentiments and Emotions Attributes. (arXiv:2007.06954v7 [cs.CL] UPDATED)
    (3 min) This paper describes a large global dataset on people's social media responses to the COVID-19 pandemic over the Twitter platform. From 28 January 2020 to 1 September 2021, we collected over 198 million Twitter posts from more than 25 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging topic modeling techniques and pre-trained machine learning-based emotion analytic algorithms, we labeled each tweet with seventeen semantic attributes, including a) ten binary attributes indicating the tweet's relevance or irrelevance to the top ten detected topics, b) five quantitative emotion attributes indicating the degree of intensity of the valence or sentiment (from 0: very negative to 1: very positive), and the degree of intensity of fear, anger, happiness and sadness emotions (from 0: not at all to 1: extremely intense), and c) two qualitative attributes indicating the sentiment category (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion category (fear, anger, happiness, sadness, no specific emotion) the tweet is mainly expressing. We report the descriptive statistics around these new attributes, their temporal distributions, and the overall geographic representation of the dataset. The paper concludes with an outline of the dataset's possible usage in communication, psychology, public health, economics, and epidemiology.
    Paradigm Shift in Natural Language Processing. (arXiv:2109.12575v1 [cs.CL])
    (0 min) In the era of deep learning, modeling for most NLP tasks has converged to several mainstream paradigms. For example, we usually adopt the sequence labeling paradigm to solve a bundle of tasks such as POS-tagging, NER, Chunking, and adopt the classification paradigm to solve tasks like sentiment analysis. With the rapid progress of pre-trained language models, recent years have observed a rising trend of Paradigm Shift, which is solving one NLP task by reformulating it as another one. Paradigm shift has achieved great success on many tasks, becoming a promising way to improve model performance. Moreover, some of these paradigms have shown great potential to unify a large number of NLP tasks, making it possible to build a single model to handle diverse tasks. In this paper, we review such phenomenon of paradigm shifts in recent years, highlighting several paradigms that have the potential to solve different NLP tasks.
    Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits. (arXiv:2106.14371v2 [cs.SD] UPDATED)
    (0 min) Target speech separation is the process of filtering a certain speaker's voice out of speech mixtures according to the additional speaker identity information provided. Recent works have made considerable improvement by processing signals in the time domain directly. The majority of them take fully overlapped speech mixtures for training. However, since most real-life conversations occur randomly and are sparsely overlapped, we argue that training with data of different overlap ratios is beneficial. To do so, an unavoidable problem is that the popularly used SI-SNR loss has no definition for silent sources. This paper proposes the weighted SI-SNR loss, together with the joint learning of target speech separation and personal VAD. The weighted SI-SNR loss imposes a weight factor that is proportional to the target speaker's duration and returns zero when the target speaker is absent. Meanwhile, the personal VAD generates masks and sets non-target speech to silence. Experiments show that our proposed method outperforms the baseline by 1.73 dB in terms of SDR on fully overlapped speech, as well as by 4.17 dB and 0.9 dB on sparsely overlapped speech in clean and noisy conditions. Besides, our model can reduce inference time at the cost of a slight degradation in performance.
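    A minimal sketch of what a weighted SI-SNR loss of this kind could look like, assuming PyTorch tensors and a simple energy-based activity proxy; the frame size, hop, and weighting details are illustrative assumptions, not the authors' implementation:

        # Sketch of a weighted SI-SNR loss (assumed details, not the authors' implementation).
        import torch

        def weighted_si_snr_loss(est, ref, eps=1e-8):
            """est, ref: (batch, samples) waveforms. The weight is proportional to the
            target's active duration and is zero when the target speaker is absent."""
            # Activity proxy: fraction of 25 ms frames (16 kHz) with non-zero target energy.
            frame_energy = ref.unfold(1, 400, 160).pow(2).mean(-1)   # (batch, frames)
            weight = (frame_energy > eps).float().mean(-1)           # duration ratio in [0, 1]

            # Standard scale-invariant SNR.
            ref_zm = ref - ref.mean(-1, keepdim=True)
            est_zm = est - est.mean(-1, keepdim=True)
            proj = ((est_zm * ref_zm).sum(-1, keepdim=True)
                    / (ref_zm.pow(2).sum(-1, keepdim=True) + eps)) * ref_zm
            noise = est_zm - proj
            si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

            # Weighted negative SI-SNR; silent targets (weight == 0) contribute nothing.
            return -(weight * si_snr).mean()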
    Investigating Pretrained Language Models for Graph-to-Text Generation. (arXiv:2007.08426v3 [cs.CL] UPDATED)
    (0 min) Graph-to-text generation aims to generate fluent texts from graph-based data. In this paper, we investigate two recently proposed pretrained language models (PLMs) and analyze the impact of different task-adaptive pretraining strategies for PLMs in graph-to-text generation. We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs. We show that the PLMs BART and T5 achieve new state-of-the-art results and that task-adaptive pretraining strategies improve their performance even further. In particular, we report new state-of-the-art BLEU scores of 49.72 on LDC2017T10, 59.70 on WebNLG, and 25.66 on AGENDA datasets - a relative improvement of 31.8%, 4.5%, and 42.4%, respectively. In an extensive analysis, we identify possible reasons for the PLMs' success on graph-to-text tasks. We find evidence that their knowledge about true facts helps them perform well even when the input graph representation is reduced to a simple bag of node and edge labels.
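    To illustrate the general graph-to-text setup with a PLM, the following sketch linearizes a toy knowledge graph into a token sequence and feeds it to an off-the-shelf T5 checkpoint via Hugging Face transformers; the <H>/<R>/<T> linearization format and the checkpoint are assumptions for illustration, not the paper's exact configuration:

        # Illustrative graph-to-text sketch (assumed linearization and checkpoint, not the paper's setup).
        from transformers import T5ForConditionalGeneration, T5Tokenizer

        tokenizer = T5Tokenizer.from_pretrained("t5-base")
        model = T5ForConditionalGeneration.from_pretrained("t5-base")

        # A toy knowledge graph as (head, relation, tail) triples.
        triples = [("Alan Turing", "birth place", "London"),
                   ("Alan Turing", "field", "computer science")]

        # Naive linearization: a flat sequence of node and edge labels.
        source = "translate graph to text: " + " ".join(
            f"<H> {h} <R> {r} <T> {t}" for h, r, t in triples)

        inputs = tokenizer(source, return_tensors="pt")
        output_ids = model.generate(**inputs, max_length=64, num_beams=4)
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))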
    FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding. (arXiv:2109.12742v1 [cs.CL])
    (0 min) The few-shot natural language understanding (NLU) task has attracted much recent attention. However, prior methods have been evaluated under a disparate set of protocols, which hinders fair comparison and measuring progress of the field. To address this issue, we introduce an evaluation framework that improves previous evaluation procedures in three key aspects, i.e., test performance, dev-test correlation, and stability. Under this new evaluation framework, we re-evaluate several state-of-the-art few-shot methods for NLU tasks. Our framework reveals new insights: (1) both the absolute performance and relative gap of the methods were not accurately estimated in prior literature; (2) no single method dominates most tasks with consistent performance; (3) improvements of some methods diminish with a larger pretrained model; and (4) gains from different methods are often complementary and the best combined model performs close to a strong fully-supervised baseline. We open-source our toolkit, FewNLU, that implements our evaluation framework along with a number of state-of-the-art methods.
    Separate but Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data. (arXiv:2105.04727v3 [cs.SD] UPDATED)
    (0 min) We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive enhancement performance compared to IID training on a single device and that we can further facilitate the convergence speed and the overall performance using transfer learning on the server-side. Moreover, we show that we can effectively combine updates from clients trained locally with supervised and unsupervised losses. We also release a new dataset LibriFSD50K and its creation recipe in order to facilitate FL research for source separation problems.
    ReINTEL Challenge 2020: A Comparative Study of Hybrid Deep Neural Network for Reliable Intelligence Identification on Vietnamese SNSs. (arXiv:2109.12777v1 [cs.LG])
    (0 min) The overwhelming abundance of data has created a misinformation crisis. Unverified sensationalism that is designed to grab the readers' short attention span, when crafted with malice, has caused irreparable damage to our society's structure. As a result, determining the reliability of an article has become a crucial task. After various ablation studies, we propose a multi-input model that can effectively leverage both tabular metadata and post content for the task. Applying state-of-the-art finetuning techniques for the pretrained component and training strategies for our complete model, we have achieved a 0.9462 ROC-score on the VLSP private test set.
    Graph Reasoning with Context-Aware Linearization for Interpretable Fact Extraction and Verification. (arXiv:2109.12349v1 [cs.CL])
    (0 min) This paper presents an end-to-end system for fact extraction and verification using textual and tabular evidence, the performance of which we demonstrate on the FEVEROUS dataset. We experiment with both a multi-task learning paradigm to jointly train a graph attention network for both the task of evidence extraction and veracity prediction, as well as a single objective graph model for solely learning veracity prediction and separate evidence extraction. In both instances, we employ a framework for per-cell linearization of tabular evidence, thus allowing us to treat evidence from tables as sequences. The templates we employ for linearizing tables capture the context as well as the content of table data. We furthermore provide a case study to show the interpretability of our approach. Our best performing system achieves a FEVEROUS score of 0.23 and 53% label accuracy on the blind test data.
    An Investigation of Language Model Interpretability via Sentence Editing. (arXiv:2011.14039v2 [cs.CL] UPDATED)
    (0 min) Pre-trained language models (PLMs) like BERT are being used for almost all language-related tasks, but interpreting their behavior still remains a significant challenge and many important questions remain largely unanswered. In this work, we re-purpose a sentence editing dataset, where faithful high-quality human rationales can be automatically extracted and compared with extracted model rationales, as a new testbed for interpretability. This enables us to conduct a systematic investigation on an array of questions regarding PLMs' interpretability, including the role of pre-training procedure, comparison of rationale extraction methods, and different layers in the PLM. The investigation generates new insights, for example, contrary to the common understanding, we find that attention weights correlate well with human rationales and work better than gradient-based saliency in extracting model rationales. Both the dataset and code are available at https://github.com/samuelstevens/sentence-editing-interpretability to facilitate future interpretability research.
    Deciding Whether to Ask Clarifying Questions in Large-Scale Spoken Language Understanding. (arXiv:2109.12451v1 [cs.CL])
    (0 min) A large-scale conversational agent can suffer from understanding user utterances with various ambiguities such as ASR ambiguity, intent ambiguity, and hypothesis ambiguity. When ambiguities are detected, the agent should engage in a clarifying dialog to resolve the ambiguities before committing to actions. However, asking clarifying questions for all the ambiguity occurrences could lead to asking too many questions, essentially hampering the user experience. To trigger clarifying questions only when necessary for the user satisfaction, we propose a neural self-attentive model that leverages the hypotheses with ambiguities and contextual signals. We conduct extensive experiments on five common ambiguity types using real data from a large-scale commercial conversational agent and demonstrate significant improvement over a set of baseline approaches.
    Multiplicative Position-aware Transformer Models for Language Understanding. (arXiv:2109.12788v1 [cs.CL])
    (0 min) Transformer models, which leverage architectural improvements like self-attention, perform remarkably well on Natural Language Processing (NLP) tasks. The self-attention mechanism is position agnostic. In order to capture positional ordering information, various flavors of absolute and relative position embeddings have been proposed. However, there is no systematic analysis on their contributions and a comprehensive comparison of these methods is missing in the literature. In this paper, we review major existing position embedding methods and compare their accuracy on downstream NLP tasks, using our own implementations. We also propose a novel multiplicative embedding method which leads to superior accuracy when compared to existing methods. Finally, we show that our proposed embedding method, served as a drop-in replacement of the default absolute position embedding, can improve the RoBERTa-base and RoBERTa-large models on SQuAD1.1 and SQuAD2.0 datasets.
    Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates. (arXiv:2109.12804v1 [eess.AS])
    (0 min) The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model decomposing the overall task into ASR and machine translation sub-tasks. However, the decoding speed is not fast enough for real-world applications because it conducts beam search for both sub-tasks during inference. We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder. We investigated two types of NAR HI: (1) parallel HI by using an autoregressive Transformer ASR decoder and (2) masked HI by using Mask-CTC, which combines CTC and the conditional masked language model. To reduce a mismatch in the ASR decoder between teacher-forcing during training and conditioning on CTC outputs during testing, we also propose sampling CTC outputs during training. Experimental evaluations on three corpora show that Fast-MD achieved about 2x and 4x faster decoding speed than that of the naïve MD model on GPU and CPU with comparable translation quality. Adopting the Conformer encoder and intermediate CTC loss further boosts its quality without sacrificing decoding speed.
    A Graph-Based Neural Model for End-to-End Frame Semantic Parsing. (arXiv:2109.12319v1 [cs.CL])
    (0 min) Frame semantic parsing is a semantic analysis task based on FrameNet which has received great attention recently. The task usually involves three subtasks sequentially: (1) target identification, (2) frame classification and (3) semantic role labeling. The three subtasks are closely related, but previous studies model them individually, which ignores their internal connections and induces an error propagation problem. In this work, we propose an end-to-end neural model to tackle the task jointly. Concretely, we exploit a graph-based method, regarding frame semantic parsing as a graph construction problem. All predicates and roles are treated as graph nodes, and their relations are taken as graph edges. Experimental results on two benchmark datasets of frame semantic parsing show that our method is highly competitive, resulting in better performance than pipeline models.
    A Panoramic Survey of Natural Language Processing in the Arab World. (arXiv:2011.12631v3 [cs.CL] UPDATED)
    (0 min) The term natural language refers to any system of symbolic communication (spoken, signed or written) without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NLP) is the sub-field of artificial intelligence (AI) focused on modeling natural languages to build applications such as speech recognition and synthesis, machine translation, optical character recognition (OCR), sentiment analysis (SA), question answering, dialogue systems, etc. NLP is a highly interdisciplinary field with connections to computer science, linguistics, cognitive science, psychology, mathematics and others. Some of the earliest AI applications were in NLP (e.g., machine translation); and the last decade (2010-2020) in particular has witnessed an incredible increase in quality, matched with a rise in public awareness, use, and expectations of what may have seemed like science fiction in the past. NLP researchers pride themselves on developing language-independent models and tools that can be applied to all human languages, e.g., machine translation systems can be built for a variety of languages using the same basic mechanisms and models. However, the reality is that some languages do get more attention (e.g., English and Chinese) than others (e.g., Hindi and Swahili). Arabic, the primary language of the Arab world and the religious language of millions of non-Arab Muslims, is somewhere in the middle of this continuum. Though Arabic NLP has many challenges, it has seen many successes and developments. Next we discuss Arabic's main challenges as a necessary background, and we present a brief history of Arabic NLP. We then survey a number of its research areas, and close with a critical discussion of the future of Arabic NLP.
    Intent Recognition and Unsupervised Slot Identification for Low Resourced Spoken Dialog Systems. (arXiv:2104.01287v2 [cs.CL] UPDATED)
    (0 min) Intent Recognition and Slot Identification are crucial components in spoken language understanding (SLU) systems. In this paper, we present a novel approach towards both these tasks in the context of low resourced and unwritten languages. We present an acoustic based SLU system that converts speech to its phonetic transcription using a universal phone recognition system. We build a word-free natural language understanding module that does intent recognition and slot identification from these phonetic transcriptions. Our proposed SLU system performs competitively for resource-rich scenarios and significantly outperforms existing approaches as the amount of available data reduces. We observe more than 10% improvement for intent classification in Tamil and more than 5% improvement for intent classification in Sinhala. We also present a novel approach towards unsupervised slot identification using normalized attention scores. This approach can be used for unsupervised slot labelling, data augmentation and to generate data for a new slot in a one-shot way with only one speech recording.
    Sorting through the noise: Testing robustness of information processing in pre-trained language models. (arXiv:2109.12393v1 [cs.CL])
    (0 min) Pre-trained LMs have shown impressive performance on downstream NLP tasks, but we have yet to establish a clear understanding of their sophistication when it comes to processing, retaining, and applying information presented in their input. In this paper we tackle a component of this question by examining robustness of models' ability to deploy relevant context information in the face of distracting content. We present models with cloze tasks requiring use of critical context information, and introduce distracting content to test how robustly the models retain and use that critical information for prediction. We also systematically manipulate the nature of these distractors, to shed light on dynamics of models' use of contextual cues. We find that although models appear in simple contexts to make predictions based on understanding and applying relevant facts from prior context, the presence of distracting but irrelevant content has a clear impact, confusing model predictions. In particular, models appear particularly susceptible to factors of semantic similarity and word position. The findings are consistent with the conclusion that LM predictions are driven in large part by superficial contextual cues, rather than by robust representations of context meaning.
    Learning to Selectively Learn for Weakly-supervised Paraphrase Generation. (arXiv:2109.12457v1 [cs.CL])
    (0 min) Paraphrase generation is a longstanding NLP task that has diverse applications for downstream NLP tasks. However, the effectiveness of existing efforts predominantly relies on large amounts of golden labeled data. Though unsupervised endeavors have been proposed to address this issue, they may fail to generate meaningful paraphrases due to the lack of supervision signals. In this work, we go beyond the existing paradigms and propose a novel approach to generate high-quality paraphrases with weak supervision data. Specifically, we tackle the weakly-supervised paraphrase generation problem by: (1) obtaining abundant weakly-labeled parallel sentences via retrieval-based pseudo paraphrase expansion; and (2) developing a meta-learning framework to progressively select valuable samples for fine-tuning a pre-trained language model, i.e., BART, on the sentential paraphrasing task. We demonstrate that our approach achieves significant improvements over existing unsupervised approaches, and is even comparable in performance with supervised state-of-the-art methods.
    MINIMAL: Mining Models for Data Free Universal Adversarial Triggers. (arXiv:2109.12406v1 [cs.CL])
    (0 min) It is well known that natural language models are vulnerable to adversarial attacks, which are mostly input-specific in nature. Recently, it has been shown that there also exist input-agnostic attacks in NLP models, called universal adversarial triggers. However, existing methods to craft universal triggers are data intensive. They require large amounts of data samples to generate adversarial triggers, which are typically inaccessible by attackers. For instance, previous works take 3000 data samples per class for the SNLI dataset to generate adversarial triggers. In this paper, we present a novel data-free approach, MINIMAL, to mine input-agnostic adversarial triggers from models. Using the triggers produced with our data-free algorithm, we reduce the accuracy of Stanford Sentiment Treebank's positive class from 93.6% to 9.6%. Similarly, for the Stanford Natural Language Inference (SNLI), our single-word trigger reduces the accuracy of the entailment class from 90.95% to less than 0.6%. Despite being completely data-free, we get equivalent accuracy drops as data-dependent methods.
    Explanation-Based Human Debugging of NLP Models: A Survey. (arXiv:2104.15135v2 [cs.CL] UPDATED)
    (0 min) Debugging a machine learning model is hard since the bug usually involves the training data and the learning process. This becomes even harder for an opaque deep learning model if we have no clue about how the model actually works. In this survey, we review papers that exploit explanations to enable humans to give feedback and debug NLP models. We call this problem explanation-based human debugging (EBHD). In particular, we categorize and discuss existing work along three dimensions of EBHD (the bug context, the workflow, and the experimental setting), compile findings on how EBHD components affect the feedback providers, and highlight open problems that could be future research directions.
    More Than Reading Comprehension: A Survey on Datasets and Metrics of Textual Question Answering. (arXiv:2109.12264v1 [cs.CL])
    (0 min) Textual Question Answering (QA) aims to provide precise answers to users' questions in natural language using unstructured data. One of the most popular approaches to this goal is machine reading comprehension (MRC). In recent years, many novel datasets and evaluation metrics based on classical MRC tasks have been proposed for broader textual QA tasks. In this paper, we survey 47 recent textual QA benchmark datasets and propose a new taxonomy from an application point of view. In addition, we summarize 8 evaluation metrics of textual QA tasks. Finally, we discuss current trends in constructing textual QA benchmarks and suggest directions for future work.
    An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces. (arXiv:2109.12640v1 [cs.CL])
    (0 min) Much recent work in bilingual lexicon induction (BLI) views word embeddings as vectors in Euclidean space. As such, BLI is typically solved by finding a linear transformation that maps embeddings to a common space. Alternatively, word embeddings may be understood as nodes in a weighted graph. This framing allows us to examine a node's graph neighborhood without assuming a linear transform, and exploits new techniques from the graph matching optimization literature. These contrasting approaches have not been compared in BLI so far. In this work, we study the behavior of Euclidean versus graph-based approaches to BLI under differing data conditions and show that they complement each other when combined. We release our code at https://github.com/kellymarchisio/euc-v-graph-bli.
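    The Euclidean framing that the paper contrasts with the graph-based one is typically an orthogonal linear map between the two embedding spaces; a minimal sketch on toy data using SciPy's orthogonal Procrustes solver (not the authors' code):

        # Euclidean (Procrustes) framing of BLI on toy data (not the authors' code).
        import numpy as np
        from scipy.linalg import orthogonal_procrustes

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 50))                       # source embeddings for a seed dictionary
        W_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # hidden ground-truth rotation
        Y = X @ W_true + 0.01 * rng.normal(size=(100, 50))   # target-side counterparts

        # Learn an orthogonal map W minimizing ||X W - Y||_F.
        W, _ = orthogonal_procrustes(X, Y)

        # Induce translations by nearest neighbour in the mapped space.
        predictions = (X @ W @ Y.T).argmax(axis=1)
        print("P@1 on the toy dictionary:", (predictions == np.arange(100)).mean())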
    AuGPT: Auxiliary Tasks and Data Augmentation for End-To-End Dialogue with Pre-Trained Language Models. (arXiv:2102.05126v2 [cs.CL] UPDATED)
    (2 min) Attention-based pre-trained language models such as GPT-2 brought considerable progress to end-to-end dialogue modelling. However, they also present considerable risks for task-oriented dialogue, such as lack of knowledge grounding or diversity. To address these issues, we introduce modified training objectives for language model finetuning, and we employ massive data augmentation via back-translation to increase the diversity of the training data. We further examine the possibilities of combining data from multiple sources to improve performance on the target dataset. We carefully evaluate our contributions with both human and automatic methods. Our model substantially outperforms the baseline on the MultiWOZ data and shows competitive performance with state of the art in both automatic and human evaluation.
    Overcoming Conflicting Data when Updating a Neural Semantic Parser. (arXiv:2010.12675v2 [cs.CL] UPDATED)
    (2 min) In this paper, we explore how to use a small amount of new data to update a task-oriented semantic parsing model when the desired output for some examples has changed. When making updates in this way, one potential problem that arises is the presence of conflicting data, or out-of-date labels in the original training set. To evaluate the impact of this understudied problem, we propose an experimental setup for simulating changes to a neural semantic parser. We show that the presence of conflicting data greatly hinders learning of an update, then explore several methods to mitigate its effect. Our multi-task and data selection methods lead to large improvements in model accuracy compared to a naive data-mixing strategy, and our best method closes 86% of the accuracy gap between this baseline and an oracle upper bound.
    On the Prunability of Attention Heads in Multilingual BERT. (arXiv:2109.12683v1 [cs.CL])
    (2 min) Large multilingual models, such as mBERT, have shown promise in crosslingual transfer. In this work, we employ pruning to quantify the robustness and interpret layer-wise importance of mBERT. On four GLUE tasks, the relative drops in accuracy due to pruning have almost identical results on mBERT and BERT suggesting that the reduced attention capacity of the multilingual models does not affect robustness to pruning. For the crosslingual task XNLI, we report higher drops in accuracy with pruning indicating lower robustness in crosslingual transfer. Also, the importance of the encoder layers sensitively depends on the language family and the pre-training corpus size. The top layers, which are relatively more influenced by fine-tuning, encode important information for languages similar to English (SVO) while the bottom layers, which are relatively less influenced by fine-tuning, are particularly important for agglutinative and low-resource languages.
    Surface Form Competition: Why the Highest Probability Answer Isn't Always Right. (arXiv:2104.08315v4 [cs.CL] UPDATED)
    (2 min) Large language models have shown promising results in zero-shot settings (Brown et al., 2020; Radford et al., 2019). For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition, wherein different surface forms compete for probability mass, even if they represent the same underlying concept, e.g. "computer" and "PC." Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to a term that is proportional to its a priori likelihood within the context of the specific zero-shot task. It achieves consistent gains in zero-shot performance over both calibrated (Zhao et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models over a variety of multiple choice datasets.
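    A hedged sketch of domain-conditional PMI scoring with GPT-2: each option is scored by log P(option | prompt) minus log P(option | domain premise). The prompt strings, the premise, and the helper function are illustrative assumptions, not the paper's released code:

        # Illustrative DC-PMI scoring with GPT-2 (prompts and helper are assumptions, not the paper's code).
        import torch
        from transformers import GPT2LMHeadModel, GPT2Tokenizer

        tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

        def option_logprob(context, option):
            """Sum of log P(option tokens | context) under the LM."""
            ctx_ids = tokenizer(context, return_tensors="pt").input_ids
            opt_ids = tokenizer(option, return_tensors="pt").input_ids
            ids = torch.cat([ctx_ids, opt_ids], dim=1)
            with torch.no_grad():
                log_probs = model(ids).logits.log_softmax(-1)
            total = 0.0
            for i in range(opt_ids.shape[1]):
                pos = ctx_ids.shape[1] + i - 1        # logits at pos predict the token at pos + 1
                total += log_probs[0, pos, opt_ids[0, i]].item()
            return total

        question = "Q: What do people use to cut paper? A:"
        domain_premise = "A:"                          # stand-in for the task's domain premise
        options = [" scissors", " a hammer"]
        scores = {o: option_logprob(question, o) - option_logprob(domain_premise, o)
                  for o in options}
        print(max(scores, key=scores.get))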
    K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce. (arXiv:2104.06960v2 [cs.CL] UPDATED)
    (2 min) Existing pre-trained language models (PLMs) have demonstrated the effectiveness of self-supervised learning for a broad range of natural language processing (NLP) tasks. However, most of them are not explicitly aware of domain-specific knowledge, which is essential for downstream tasks in many domains, such as tasks in e-commerce scenarios. In this paper, we propose K-PLUG, a knowledge-injected pre-trained language model based on the encoder-decoder transformer that can be transferred to both natural language understanding and generation tasks. We verify our method in a diverse range of e-commerce scenarios that require domain-specific knowledge. Specifically, we propose five knowledge-aware self-supervised pre-training objectives to formulate the learning of domain-specific knowledge, including e-commerce domain-specific knowledge-bases, aspects of product entities, categories of product entities, and unique selling propositions of product entities. K-PLUG achieves new state-of-the-art results on a suite of domain-specific NLP tasks, including product knowledge base completion, abstractive product summarization, and multi-turn dialogue, significantly outperforming baselines across the board, which demonstrates that the proposed method effectively learns a diverse set of domain-specific knowledge for both language understanding and generation tasks.
    Latent Variable Models for Visual Question Answering. (arXiv:2101.06399v2 [cs.CV] UPDATED)
    (2 min) Current work on Visual Question Answering (VQA) explores deterministic approaches conditioned on various types of image and question features. We posit that, in addition to image and question pairs, other modalities are useful for teaching machines to carry out question answering. Hence in this paper, we propose latent variable models for VQA where extra information (e.g. captions and answer categories) are incorporated as latent variables, which are observed during training but in turn benefit question-answering performance at test time. Experiments on the VQA v2.0 benchmarking dataset demonstrate the effectiveness of our proposed models: they improve over strong baselines, especially those that do not rely on extensive language-vision pre-training.
    Curb Your Carbon Emissions: Benchmarking Carbon Emissions in Machine Translation. (arXiv:2109.12584v1 [cs.CL])
    (2 min) In recent times, there has been definitive progress in the field of NLP, with its applications growing as the utility of our language models increases with advances in their performance. However, these models require a large amount of computational power and data to train, consequently leading to large carbon footprints. Therefore, it is imperative that we study carbon efficiency and look for alternatives to reduce the overall environmental impact of training models, in particular large language models. In our work, we assess the performance of models for machine translation across multiple language pairs, measure the difference in computational power required to train these models for each language pair, and examine the various components of these models to identify aspects of our pipeline that can be optimized to reduce these carbon emissions.
    Effective Use of Graph Convolution Network and Contextual Sub-Tree for Commodity News Event Extraction. (arXiv:2109.12781v1 [cs.CL])
    (2 min) Event extraction in commodity news is a less researched area as compared to generic event extraction. However, accurate event extraction from commodity news is useful in a broad range of applications such as understanding event chains and learning event-event relations, which can then be used for commodity price prediction. The events found in commodity news exhibit characteristics different from generic events, hence posing a unique challenge in event extraction using existing methods. This paper proposes an effective use of Graph Convolutional Networks (GCN) with a pruned dependency parse tree, termed contextual sub-tree, for better event extraction in commodity news. The event extraction model is trained using feature embeddings from ComBERT, a BERT-based masked language model that was produced through domain-adaptive pre-training on a commodity news corpus. Experimental results show the efficiency of the proposed solution, which outperforms existing methods with F1 scores as high as 0.90. Furthermore, our pre-trained language model outperforms GloVe by 23%, and BERT and RoBERTa by 7% in terms of argument roles classification. For the goal of reproducibility, the code and trained models are made publicly available.
    Improving Question Answering Performance Using Knowledge Distillation and Active Learning. (arXiv:2109.12662v1 [cs.CL])
    (2 min) Contemporary question answering (QA) systems, including transformer-based architectures, suffer from increasing computational and model complexity which render them inefficient for real-world applications with limited resources. Further, training or even fine-tuning such models requires a vast amount of labeled data which is often not available for the task at hand. In this manuscript, we conduct a comprehensive analysis of the mentioned challenges and introduce suitable countermeasures. We propose a novel knowledge distillation (KD) approach to reduce the parameter and model complexity of a pre-trained BERT system and utilize multiple active learning (AL) strategies for immense reduction in annotation efforts. In particular, we demonstrate that our model achieves the performance of a 6-layer TinyBERT and DistilBERT, whilst using only 2% of their total parameters. Finally, by the integration of our AL approaches into the BERT framework, we show that state-of-the-art results on the SQuAD dataset can be achieved when we only use 20% of the training data.
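    The distillation component can be illustrated with a standard temperature-scaled KD objective that mixes a soft KL term against the teacher with the usual hard cross-entropy; this is generic Hinton-style KD, not necessarily the paper's exact loss:

        # Generic temperature-scaled knowledge distillation loss (standard KD, not necessarily the paper's objective).
        import torch.nn.functional as F

        def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
            # Soft targets: KL between temperature-softened teacher and student distributions.
            soft = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            # Hard targets: cross-entropy against the gold labels.
            hard = F.cross_entropy(student_logits, labels)
            return alpha * soft + (1 - alpha) * hard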
    Text to Insight: Accelerating Organic Materials Knowledge Extraction via Deep Learning. (arXiv:2109.12758v1 [cs.CL])
    (2 min) Scientific literature is one of the most significant resources for sharing knowledge. Researchers turn to scientific literature as a first step in designing an experiment. Given the extensive and growing volume of literature, the common approach of reading and manually extracting knowledge is too time consuming, creating a bottleneck in the research cycle. This challenge spans nearly every scientific domain. For materials science, experimental data distributed across millions of publications are extremely helpful for predicting materials properties and the design of novel materials. However, only recently have researchers explored computational approaches for knowledge extraction, primarily for inorganic materials. This study aims to explore knowledge extraction for organic materials. We built a research dataset composed of 855 annotated and 708,376 unannotated sentences drawn from 92,667 abstracts. We used named-entity recognition (NER) with a BiLSTM-CNN-CRF deep learning model to automatically extract key knowledge from the literature. Early-phase results show a high potential for automated knowledge extraction. The paper presents our findings and a framework for supervised knowledge extraction that can be adapted to other scientific domains.
    XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge. (arXiv:2109.12573v1 [cs.CL])
    (2 min) Cross-lingual pre-training has achieved great successes using monolingual and bilingual plain text corpora. However, existing pre-trained models neglect multilingual knowledge, which is language agnostic but comprises abundant cross-lingual structure alignment. In this paper, we propose XLM-K, a cross-lingual language model incorporating multilingual knowledge in pre-training. XLM-K augments existing multilingual pre-training with two knowledge tasks, namely Masked Entity Prediction Task and Object Entailment Task. We evaluate XLM-K on MLQA, NER and XNLI. Experimental results clearly demonstrate significant improvements over existing multilingual language models. The results on MLQA and NER exhibit the superiority of XLM-K in knowledge related tasks. The success in XNLI shows a better cross-lingual transferability obtained in XLM-K. What is more, we provide a detailed probing analysis to confirm the desired knowledge captured in our pre-training regimen.
    Language Model Priming for Cross-Lingual Event Extraction. (arXiv:2109.12383v1 [cs.CL])
    (2 min) We present a novel, language-agnostic approach to "priming" language models for the task of event extraction, providing particularly effective performance in low-resource and zero-shot cross-lingual settings. With priming, we augment the input to the transformer stack's language model differently depending on the question(s) being asked of the model at runtime. For instance, if the model is being asked to identify arguments for the trigger "protested", we will provide that trigger as part of the input to the language model, allowing it to produce different representations for candidate arguments than when it is asked about arguments for the trigger "arrest" elsewhere in the same sentence. We show that by enabling the language model to better compensate for the deficits of sparse and noisy training data, our approach improves both trigger and argument detection and classification significantly over the state of the art in a zero-shot cross-lingual setting.
    QA-Align: Representing Cross-Text Content Overlap by Aligning Question-Answer Propositions. (arXiv:2109.12655v1 [cs.CL])
    (2 min) Multi-text applications, such as multi-document summarization, are typically required to model redundancies across related texts. Current methods confronting consolidation struggle to fuse overlapping information. In order to explicitly represent content overlap, we propose to align predicate-argument relations across texts, providing a potential scaffold for information consolidation. We go beyond clustering coreferring mentions, and instead model overlap with respect to redundancy at a propositional level, rather than merely detecting shared referents. Our setting exploits QA-SRL, utilizing question-answer pairs to capture predicate-argument relations, facilitating laymen annotation of cross-text alignments. We employ crowd-workers for constructing a dataset of QA-based alignments, and present a baseline QA alignment model trained over our dataset. Analyses show that our new task is semantically challenging, capturing content overlap beyond lexical similarity and complements cross-document coreference with proposition-level links, offering potential use for downstream tasks.
    Enhancing Latent Space Clustering in Multi-filter Seq2Seq Model: A Reinforcement Learning Approach. (arXiv:2109.12399v1 [cs.CL])
    (2 min) In sequence-to-sequence language processing tasks, sentences with heterogeneous semantics or grammatical structures may increase the difficulty of convergence while training the network. To resolve this problem, we introduce a model that concentrates on each of the heterogeneous features in the input-output sequences. Built upon the encoder-decoder architecture, we design a latent-enhanced multi-filter seq2seq model (LMS2S) that analyzes the latent space representations using a clustering algorithm. The representations are generated from an encoder and a latent space enhancer. A cluster classifier is applied to group the representations into clusters. A soft actor-critic reinforcement learning algorithm is applied to the cluster classifier to enhance the clustering quality by maximizing the Silhouette score. Then, multiple filters are trained on features only from their corresponding clusters, so the heterogeneity of the training data is resolved accordingly. Our experiments on semantic parsing and machine translation demonstrate the positive correlation between the clustering quality and the model's performance, as well as the enhancement our model achieves over the ordinary encoder-decoder model.
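    The clustering-quality reward at the heart of this setup can be sketched with scikit-learn on toy latents: cluster the representations, then use the Silhouette score as the scalar the actor-critic would maximize (toy data and library choice are assumptions, not the authors' model):

        # Toy sketch of the clustering-quality reward (scikit-learn; not the authors' RL setup).
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        latents = np.random.default_rng(0).normal(size=(256, 32))   # stand-in for encoder latents
        assignments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(latents)

        # The Silhouette score (in [-1, 1]) is the reward the soft actor-critic would maximize.
        reward = silhouette_score(latents, assignments)
        print(f"clustering reward: {reward:.3f}")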
    Entity Linking Meets Deep Learning: Techniques and Solutions. (arXiv:2109.12520v1 [cs.CL])
    (2 min) Entity linking (EL) is the process of linking entity mentions appearing in web text with their corresponding entities in a knowledge base. EL plays an important role in the fields of knowledge engineering and data mining, underlying a variety of downstream applications such as knowledge base population, content analysis, relation extraction, and question answering. In recent years, deep learning (DL), which has achieved tremendous success in various domains, has also been leveraged in EL methods to surpass traditional machine learning based methods and yield the state-of-the-art performance. In this survey, we present a comprehensive review and analysis of existing DL based EL methods. First of all, we propose a new taxonomy, which organizes existing DL based EL methods using three axes: embedding, feature, and algorithm. Then we systematically survey the representative EL methods along the three axes of the taxonomy. Later, we introduce ten commonly used EL data sets and give a quantitative performance analysis of DL based EL methods over these data sets. Finally, we discuss the remaining limitations of existing methods and highlight some promising future directions.
    Rumour Detection via Zero-shot Cross-lingual Transfer Learning. (arXiv:2109.12773v1 [cs.CL])
    (2 min) Most rumour detection models for social media are designed for one specific language (mostly English). There are over 40 languages on Twitter and most languages lack annotated resources to build rumour detection models. In this paper we propose a zero-shot cross-lingual transfer learning framework that can adapt a rumour detection model trained for a source language to another target language. Our framework utilises pretrained multilingual language models (e.g., multilingual BERT) and a self-training loop to iteratively bootstrap the creation of "silver labels" in the target language to adapt the model from the source language to the target language. We evaluate our methodology on English and Chinese rumour datasets and demonstrate that our model substantially outperforms competitive benchmarks in both source and target language rumour detection.
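    The silver-label bootstrap is a standard self-training loop; below is a runnable toy version with a scikit-learn classifier and synthetic features, where only confident target-side predictions are added back as silver labels (the threshold, rounds, and classifier are illustrative, not the paper's setup):

        # Toy self-training loop with "silver labels" (synthetic data; not the paper's model).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X_src = rng.normal(size=(200, 16))                       # labeled source-language features
        y_src = (X_src[:, 0] > 0).astype(int)
        X_tgt = rng.normal(size=(500, 16)) + 0.2                 # unlabeled target-language features

        clf = LogisticRegression().fit(X_src, y_src)             # train on the source language
        for _ in range(3):                                       # self-training rounds
            probs = clf.predict_proba(X_tgt)
            confident = probs.max(axis=1) >= 0.9                 # keep only confident predictions
            X_train = np.vstack([X_src, X_tgt[confident]])
            y_train = np.concatenate([y_src, probs[confident].argmax(axis=1)])
            clf = LogisticRegression().fit(X_train, y_train)     # re-train on gold + silver labels
        print("silver-labelled target examples:", int(confident.sum()))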
    OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts. (arXiv:2109.12761v1 [cs.CL])
    (2 min) In order to better simulate the real human conversation process, models need to generate dialogue utterances based on not only preceding textual contexts but also visual contexts. However, with the development of multi-modal dialogue learning, the dataset scale gradually becomes a bottleneck. In this report, we release OpenViDial 2.0, a larger-scale open-domain multi-modal dialogue dataset compared to the previous version OpenViDial 1.0. OpenViDial 2.0 contains a total number of 5.6 million dialogue turns extracted from either movies or TV series from different resources, and each dialogue turn is paired with its corresponding visual context. We hope this large-scale dataset can help facilitate future research on open-domain multi-modal dialog generation, e.g., multi-modal pretraining for dialogue generation.
    VirTex: Learning Visual Representations from Textual Annotations. (arXiv:2006.06666v3 [cs.CV] UPDATED)
    (2 min) The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.
    BioCopy: A Plug-And-Play Span Copy Mechanism in Seq2Seq Models. (arXiv:2109.12533v1 [cs.CL])
    (2 min) Copy mechanisms explicitly obtain unchanged tokens from the source (input) sequence to generate the target (output) sequence under the neural seq2seq framework. However, most of the existing copy mechanisms only consider single word copying from the source sentences, which results in losing essential tokens while copying long spans. In this work, we propose a plug-and-play architecture, namely BioCopy, to alleviate the aforementioned problem. Specifically, in the training stage, we construct a BIO tag for each token and train the original model with BIO tags jointly. In the inference stage, the model will firstly predict the BIO tag at each time step, then conduct different mask strategies based on the predicted BIO label to diminish the scope of the probability distributions over the vocabulary list. Experimental results on two separate generative tasks show that adding our BioCopy to the original model structure outperforms the baseline models on both.
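    The inference-stage idea, predict a BIO tag and then mask the vocabulary distribution accordingly, might look roughly like the following for a single decoding step; the tag semantics and masking details here are assumptions for illustration, not the authors' code:

        # Sketch of BIO-conditioned vocabulary masking at one decoding step (details assumed, not the authors' code).
        import torch

        def mask_logits(logits, bio_tag, source_ids, prev_source_pos=None):
            """logits: (vocab,) scores at the current step; source_ids: token ids of the source sentence.
            bio_tag: 'B' start copying, 'I' continue the copied span, 'O' free generation."""
            if bio_tag == "O":
                return logits                                    # unrestricted generation
            masked = torch.full_like(logits, float("-inf"))
            if bio_tag == "B":
                allowed = source_ids                             # may start copying any source token
            else:  # 'I': only the token right after the previously copied source position
                allowed = source_ids[prev_source_pos + 1 : prev_source_pos + 2]
            masked[allowed] = logits[allowed]
            return masked

        # Example: at a 'B' step, only source tokens keep their scores.
        logits = torch.randn(50)
        print(mask_logits(logits, "B", torch.tensor([3, 7, 11])))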
    Finetuning Transformer Models to Build ASAG System. (arXiv:2109.12300v1 [cs.CL])
    (2 min) Research towards creating systems for automatic grading of student answers to quiz and exam questions in educational settings has been ongoing since 1966. Over the years, the problem was divided into many categories. Among them, grading text answers was divided into short answer grading and essay grading. The goal of this work was to develop an ML-based short answer grading system. I hence built a system which finetunes a RoBERTa large model pretrained on the STS benchmark dataset, and I have also created an interface to show the production readiness of the system. I evaluated the performance of the system on the Mohler extended dataset and the SciEntsBank dataset. The developed system achieved a Pearson's correlation of 0.82 and an RMSE of 0.7 on the Mohler dataset, which beats the SOTA performance on this dataset (a correlation of 0.805 and an RMSE of 0.793). Additionally, a Pearson's correlation of 0.79 and an RMSE of 0.56 were achieved on the SciEntsBank dataset, which only reconfirms the robustness of the system. A few observations made while achieving these results: a batch size of 1 produced better results than a batch size of 16 or 32, and the Huber loss performed well on this regression task. The system was tried and tested on train and validation splits using various random seeds and has been tweaked to achieve a minimum correlation of 0.76 and a maximum RMSE of 0.15 (out of 1) on any dataset.
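    The reported observation that Huber loss works well for this regression can be illustrated with a single fine-tuning step on a sentence-pair scoring head; the checkpoint, inputs, and hyperparameters below are guesses at a comparable configuration, not the author's exact setup:

        # One illustrative fine-tuning step for answer-grade regression with Huber loss
        # (checkpoint, inputs and hyperparameters are assumptions, not the author's configuration).
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("roberta-base")
        model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1)
        model.train()
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
        loss_fn = torch.nn.SmoothL1Loss()                 # Huber loss

        reference = "Photosynthesis converts light energy into chemical energy."
        answer = "Plants turn sunlight into energy they can store."
        target = torch.tensor([4.5])                      # gold grade for this answer

        batch = tokenizer(reference, answer, return_tensors="pt", truncation=True)
        score = model(**batch).logits.squeeze(-1)         # predicted grade
        loss = loss_fn(score, target)
        loss.backward()
        optimizer.step()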
    Investigating Non-local Features for Neural Constituency Parsing. (arXiv:2109.12814v1 [cs.CL])
    (2 min) Thanks to the strong representation power of neural encoders, neural chart-based parsers have achieved highly competitive performance by using local features. Recently, it has been shown that non-local features in CRF structures lead to improvements. In this paper, we investigate injecting non-local features into the training process of a local span-based parser, by predicting constituent n-gram non-local patterns and ensuring consistency between non-local patterns and local constituents. Results show that our simple method gives better results than the CRF parser on both PTB and CTB. Besides, our method achieves state-of-the-art BERT-based performance on PTB (95.92 F1) and strong performance on CTB (92.31 F1).
    Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features. (arXiv:2109.12258v1 [cs.CL])
    (2 min) We report two essential improvements in readability assessment: 1. three novel features in advanced semantics and 2. the timely evidence that traditional ML models (e.g. Random Forest, using handcrafted features) can combine with transformers (e.g. RoBERTa) to augment model performance. First, we explore suitable transformers and traditional ML models. Then, we extract 255 handcrafted linguistic features using self-developed extraction software. Finally, we assemble those to create several hybrid models, achieving state-of-the-art (SOTA) accuracy on popular datasets in readability assessment. The use of handcrafted features helps model performance on smaller datasets. Notably, our RoBERTA-RF-T1 hybrid achieves the near-perfect classification accuracy of 99%, a 20.3% increase from the previous SOTA.
    Jointly Learning to Repair Code and Generate Commit Message. (arXiv:2109.12296v1 [cs.CL])
    (2 min) We propose a novel task of jointly repairing program codes and generating commit messages. Code repair and commit message generation are two essential and related tasks for software development. However, existing work usually performs the two tasks independently. We construct a multilingual triple dataset including buggy code, fixed code, and commit messages for this novel task. We provide the cascaded models as baselines, which are enhanced with different training approaches, including the teacher-student method, the multi-task method, and the back-translation method. To deal with the error propagation problem of the cascaded method, the joint model is proposed that can both repair the code and generate the commit message in a unified framework. Experimental results show that the enhanced cascaded model with teacher-student method and multitask-learning method achieves the best score on different metrics of automated code repair, and the joint model behaves better than the cascaded model on commit message generation.
    Learning Neural Templates for Recommender Dialogue System. (arXiv:2109.12302v1 [cs.CL])
    (2 min) Though recent end-to-end neural models have shown promising progress on Conversational Recommender System (CRS), two key challenges still remain. First, the recommended items cannot be always incorporated into the generated replies precisely and appropriately. Second, only the items mentioned in the training corpus have a chance to be recommended in the conversation. To tackle these challenges, we introduce a novel framework called NTRD for recommender dialogue system that decouples the dialogue generation from the item recommendation. NTRD has two key components, i.e., response template generator and item selector. The former adopts an encoder-decoder model to generate a response template with slot locations tied to target items, while the latter fills in slot locations with the proper items using a sufficient attention mechanism. Our approach combines the strengths of both classical slot filling approaches (that are generally controllable) and modern neural NLG approaches (that are generally more natural and accurate). Extensive experiments on the benchmark ReDial show our NTRD significantly outperforms the previous state-of-the-art methods. Besides, our approach has the unique advantage to produce novel items that do not appear in the training set of dialogue corpus. The code is available at https://github.com/jokieleung/NTRD.
    Parallel Refinements for Lexically Constrained Text Generation with BART. (arXiv:2109.12487v1 [cs.CL])
    (2 min) Lexically constrained text generation aims to control the generated text by incorporating some pre-specified keywords into the output. Previous work injects lexical constraints into the output by controlling the decoding process or refining the candidate output iteratively, which tends to generate generic or ungrammatical sentences, and has high computational complexity. To address these challenges, we propose Constrained BART (CBART) for lexically constrained text generation. CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder by decomposing this task into two sub-tasks, thereby improving the sentence quality. Concretely, we extend BART by adding a token-level classifier over the encoder, aiming at instructing the decoder where to replace and insert. Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence. To further reduce the inference latency, the decoder predicts all tokens in parallel. Experiment results on One-Billion-Word and Yelp show that CBART can generate plausible text with high quality and diversity while significantly accelerating inference.
    DziriBERT: a Pre-trained Language Model for the Algerian Dialect. (arXiv:2109.12346v1 [cs.CL])
    (2 min) Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, Arabic, etc.). Therefore, there is still a number of low-resource languages that need more attention from the community. In this paper, we study the Algerian dialect which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets, and pre-trained the first Algerian language model: DziriBERT. When compared to existing models, DziriBERT achieves the best results on two Algerian downstream datasets. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our model is publicly available to the community.
    Electoral Programs of German Parties 2021: A Computational Analysis Of Their Comprehensibility and Likeability Based On SentiArt. (arXiv:2109.12500v1 [cs.CL])
    (2 min) The electoral programs of six German parties issued before the parliamentary elections of 2021 are analyzed using state-of-the-art computational tools for quantitative narrative, topic and sentiment analysis. We compare different methods for computing the textual similarity of the programs, Jaccard Bag similarity, Latent Semantic Analysis, doc2vec, and sBERT, the representational and computational complexity increasing from the 1st to the 4th method. A new similarity measure for entire documents derived from the Fowlkes Mallows Score is applied to kmeans clustering of sBERT transformed sentences. Using novel indices of the readability and emotion potential of texts computed via SentiArt (Jacobs, 2019), our data shed light on the similarities and differences of the programs regarding their length, main ideas, comprehensibility, likeability, and semantic complexity. Among others, they reveal that the programs of the SPD and CDU have the best chances to be comprehensible and likeable -all other things being equal-, and they raise the important issue of which similarity measure is optimal for comparing texts such as electoral programs which necessarily share a lot of words. While such analyses can not replace qualitative analyses or a deep reading of the texts, they offer predictions that can be verified in empirical studies and may serve as a motivation for changing aspects of future electoral programs potentially making them more comprehensible and/or likeable.
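    Two of the similarity ingredients mentioned above are easy to reproduce in a few lines; the sketch below (with placeholder sentences, and TF-IDF features standing in for sBERT embeddings) computes a Jaccard bag similarity between two token multisets and a Fowlkes-Mallows score between two k-means clusterings of the same sentences:

        from collections import Counter
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics import fowlkes_mallows_score

        def jaccard_bag(a_tokens, b_tokens):
            """Jaccard similarity on bags (multisets) of tokens."""
            a, b = Counter(a_tokens), Counter(b_tokens)
            inter = sum((a & b).values())
            union = sum((a | b).values())
            return inter / union if union else 0.0

        doc_a = "we want fair taxes and strong climate policy".split()
        doc_b = "we want strong climate policy and fair wages".split()
        print("Jaccard bag similarity:", round(jaccard_bag(doc_a, doc_b), 3))

        # Fowlkes-Mallows between two clusterings of the same sentences, e.g. k-means on two
        # different sentence representations (here two TF-IDF variants are placeholders).
        sentences = ["lower taxes now", "cut taxes for families", "protect the climate",
                     "reduce carbon emissions", "invest in schools", "more money for education"]
        X1 = TfidfVectorizer().fit_transform(sentences)
        X2 = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(sentences)
        labels1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X1)
        labels2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
        print("Fowlkes-Mallows score:", round(fowlkes_mallows_score(labels1, labels2), 3))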
    Style Control for Schema-Guided Natural Language Generation. (arXiv:2109.12211v1 [cs.CL])
    (2 min) Natural Language Generation (NLG) for task-oriented dialogue systems focuses on communicating specific content accurately, fluently, and coherently. While these attributes are crucial for a successful dialogue, it is also desirable to simultaneously accomplish specific stylistic goals, such as response length, point-of-view, descriptiveness, sentiment, formality, and empathy. In this work, we focus on stylistic control and evaluation for schema-guided NLG, with joint goals of achieving both semantic and stylistic control. We experiment in detail with various controlled generation methods for large pretrained language models: specifically, conditional training, guided fine-tuning, and guided decoding. We discuss their advantages and limitations, and evaluate them with a broad range of automatic and human evaluation metrics. Our results show that while high style accuracy and semantic correctness are easier to achieve for more lexically-defined styles with conditional training, stylistic control is also achievable for more semantically complex styles using discriminator-based guided decoding methods. The results also suggest that methods that are more scalable (with less hyper-parameters tuning) and that disentangle content generation and stylistic variations are more effective at achieving semantic correctness and style accuracy.
    Coreference Resolution for the Biomedical Domain: A Survey. (arXiv:2109.12424v1 [cs.CL])
    (2 min) Issues with coreference resolution are one of the most frequently mentioned challenges for information extraction from the biomedical literature. Thus, the biomedical genre has long been the second most researched genre for coreference resolution after the news domain, and the subject of a great deal of research for NLP in general. In recent years this interest has grown enormously leading to the development of a number of substantial datasets, of domain-specific contextual language models, and of several architectures. In this paper we review the state-of-the-art of coreference in the biomedical domain with a particular attention on these most recent developments.
    DialogueCSE: Dialogue-based Contrastive Learning of Sentence Embeddings. (arXiv:2109.12599v1 [cs.CL])
    (2 min) Learning sentence embeddings from dialogues has drawn increasing attention due to its low annotation cost and high domain adaptability. Conventional approaches employ the siamese-network for this task, which obtains the sentence embeddings through modeling the context-response semantic relevance by applying a feed-forward network on top of the sentence encoders. However, as the semantic textual similarity is commonly measured through the element-wise distance metrics (e.g. cosine and L2 distance), such architecture yields a large gap between training and evaluating. In this paper, we propose DialogueCSE, a dialogue-based contrastive learning approach to tackle this issue. DialogueCSE first introduces a novel matching-guided embedding (MGE) mechanism, which generates a context-aware embedding for each candidate response embedding (i.e. the context-free embedding) according to the guidance of the multi-turn context-response matching matrices. Then it pairs each context-aware embedding with its corresponding context-free embedding and finally minimizes the contrastive loss across all pairs. We evaluate our model on three multi-turn dialogue datasets: the Microsoft Dialogue Corpus, the Jing Dong Dialogue Corpus, and the E-commerce Dialogue Corpus. Evaluation results show that our approach significantly outperforms the baselines across all three datasets in terms of MAP and Spearman's correlation measures, demonstrating its effectiveness. Further quantitative experiments show that our approach achieves better performance when leveraging more dialogue context and remains robust when less training data is provided.
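    The pairing-and-contrast step described above can be sketched in a few lines of numpy (batch size, temperature, and the random embeddings are placeholders; the matching-guided embedding itself is not modeled here):

        import numpy as np

        rng = np.random.default_rng(0)
        B, d, tau = 8, 32, 0.05                      # batch size, embedding dim, temperature (assumed)

        context_free = rng.normal(size=(B, d))       # candidate response embeddings
        context_aware = rng.normal(size=(B, d))      # same responses after matching-guided re-embedding

        def l2norm(x):
            return x / np.linalg.norm(x, axis=-1, keepdims=True)

        za, zf = l2norm(context_aware), l2norm(context_free)
        logits = za @ zf.T / tau                     # cosine similarities, scaled by temperature
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        loss = -np.mean(np.diag(log_prob))           # pull matched pairs together, push others apart
        print("contrastive loss:", float(loss))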
    What Truly Matters? Using Linguistic Cues for Analyzing the #BlackLivesMatter Movement and its Counter Protests: 2013 to 2020. (arXiv:2109.12192v1 [cs.CL])
    (2 min) Since the fatal shooting of 17-year-old Black teenager Trayvon Martin in February 2012 by George Zimmerman, a White neighborhood watchman in Sanford, Florida, there has been a significant increase in digital activism addressing police-brutality-related and racially-motivated incidents in the United States. In this work, we study digital activism using social media as a primary source, examining and analyzing the linguistic cues and thematic relationships across the three associated movements. We conduct a multi-level text analysis on 36,984,559 tweets to investigate users' behavior, examining the language used and the impact of digital activism on social media within each social movement at the sentence, word, and topic level. Our results show that the counter-protests made excessive use of racially-related or prejudicial hashtags, which points to potential discriminatory tendencies. Our findings further highlight that the social activism of Black Lives Matter activists does not diverge from the social issues and topics involving police-brutality-related and racially-motivated killings of Black individuals: in its topical graph, the topics and conversations encircling the largest component relate directly to Black Lives Matter. Finally, we see that both the Blue Lives Matter and All Lives Matter movements depict a different directive, as their topics do not reside in the center. These findings suggest that topics and conversations within these movements are skewed, random, or possess racially-related undertones, and thus deviate from the prominent social injustice issues.
    Leveraging Pretrained Models for Automatic Summarization of Doctor-Patient Conversations. (arXiv:2109.12174v1 [cs.CL])
    (2 min) Fine-tuning pretrained models for automatically summarizing doctor-patient conversation transcripts presents many challenges: limited training data, significant domain shift, long and noisy transcripts, and high target summary variability. In this paper, we explore the feasibility of using pretrained transformer models for automatically summarizing doctor-patient conversations directly from transcripts. We show that fluent and adequate summaries can be generated with limited training data by fine-tuning BART on a specially constructed dataset. The resulting models greatly surpass the performance of an average human annotator and the quality of previous published work for the task. We evaluate multiple methods for handling long conversations, comparing them to the obvious baseline of truncating the conversation to fit the pretrained model length limit. We introduce a multistage approach that tackles the task by learning two fine-tuned models: one for summarizing conversation chunks into partial summaries, followed by one for rewriting the collection of partial summaries into a complete summary. Using a carefully chosen fine-tuning dataset, this method is shown to be effective at handling longer conversations, improving the quality of generated summaries. We conduct both an automatic evaluation (through ROUGE and two concept-based metrics focusing on medical findings) and a human evaluation (through qualitative examples from literature, assessing hallucination, generalization, fluency, and general quality of the generated summaries).
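    The multistage approach can be pictured as a small pipeline: split the transcript into chunks, summarize each chunk, then rewrite the partial summaries into one note. The sketch below only shows that control flow; the trivial placeholder functions stand in for the two fine-tuned BART models described in the abstract:

        def chunk_turns(turns, max_turns=4):
            """Split a long conversation into fixed-size chunks of dialogue turns."""
            return [turns[i:i + max_turns] for i in range(0, len(turns), max_turns)]

        def summarize_chunk(chunk):
            # Placeholder for model 1 (chunk -> partial summary).
            return "Partial note: " + " / ".join(t.split(":", 1)[-1].strip() for t in chunk[:2])

        def rewrite_summary(partial_summaries):
            # Placeholder for model 2 (collection of partial summaries -> complete summary).
            return " ".join(partial_summaries)

        conversation = [
            "Doctor: What brings you in today?",
            "Patient: I've had a cough for two weeks.",
            "Doctor: Any fever or chest pain?",
            "Patient: A mild fever, no chest pain.",
            "Doctor: I'll order a chest X-ray.",
            "Patient: Okay, thank you.",
        ]

        partials = [summarize_chunk(c) for c in chunk_turns(conversation)]
        print(rewrite_summary(partials))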
    An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog. (arXiv:2109.12212v1 [cs.CL])
    (2 min) Online conversations include more than just text. Increasingly, image-based responses such as memes and animated gifs serve as culturally recognized and often humorous responses in conversation. However, while NLP has broadened to multimodal models, conversational dialog systems have largely focused only on generating text replies. Here, we introduce a new dataset of 1.56M text-gif conversation turns and introduce a new multimodal conversational model Pepe the King Prawn for selecting gif-based replies. We demonstrate that our model produces relevant and high-quality gif responses and, in a large randomized control trial of multiple models replying to real users, we show that our model replies with gifs that are significantly better received by the community.
    Predicting Attention Sparsity in Transformers. (arXiv:2109.12188v1 [cs.CL])
    (2 min) A bottleneck in transformer architectures is their quadratic complexity with respect to the input sequence, which has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for detailed comparison between different models, and may guide future benchmarks for sparse models.
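    As an illustration of the clustering variant (not the authors' code), the sketch below buckets projected queries and keys with k-means and only marks query-key pairs in the same bucket for attention, which is the kind of sparsity pattern Sparsefinder predicts before the full score matrix is ever formed; the projection and bucket count are arbitrary stand-ins:

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        n, d, d_proj, n_buckets = 64, 32, 4, 8

        Q = rng.normal(size=(n, d))
        K = rng.normal(size=(n, d))
        W = rng.normal(size=(d, d_proj))             # stands in for a learned low-dimensional projection

        proj = np.concatenate([Q @ W, K @ W])        # project queries and keys jointly
        labels = KMeans(n_clusters=n_buckets, n_init=10, random_state=0).fit_predict(proj)
        q_lab, k_lab = labels[:n], labels[n:]

        pattern = q_lab[:, None] == k_lab[None, :]   # predicted attention graph (True = compute this entry)
        print("density of predicted pattern:", pattern.mean())

        # Only the selected entries would then be scored, e.g.:
        scores = np.where(pattern, Q @ K.T / np.sqrt(d), -np.inf)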
    Weakly Supervised Contrastive Learning for Chest X-Ray Report Generation. (arXiv:2109.12242v1 [cs.CL])
    (2 min) Radiology report generation aims at generating descriptive text from radiology images automatically, which may present an opportunity to improve radiology reporting and interpretation. A typical setting consists of training encoder-decoder models on image-report pairs with a cross entropy loss, which struggles to generate informative sentences for clinical diagnoses since normal findings dominate the datasets. To tackle this challenge and encourage more clinically-accurate text outputs, we propose a novel weakly supervised contrastive loss for medical report generation. Experimental results demonstrate that our method benefits from contrasting target reports with incorrect but semantically-close ones. It outperforms previous work on both clinical correctness and text generation metrics for two public benchmarks.
    Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?. (arXiv:2109.12243v1 [cs.CL])
    (2 min) We analyze the grounded SCAN (gSCAN) benchmark, which was recently proposed to study systematic generalization for grounded language understanding. First, we study which aspects of the original benchmark can be solved by commonly used methods in multi-modal research. We find that a general-purpose Transformer-based model with cross-modal attention achieves strong performance on a majority of the gSCAN splits, surprisingly outperforming more specialized approaches from prior work. Furthermore, our analysis suggests that many of the remaining errors reveal the same fundamental challenge in systematic generalization of linguistic constructs regardless of visual context. Second, inspired by this finding, we propose challenging new tasks for gSCAN by generating data to incorporate relations between objects in the visual environment. Finally, we find that current models are surprisingly data inefficient given the narrow scope of commands in gSCAN, suggesting another challenge for future work.
  • cs.CV updates on arXiv.org

    Physics perception in sloshing scenes with guaranteed thermodynamic consistency. (arXiv:2106.13301v2 [cs.CV] UPDATED)
    (2 min) Physics perception very often faces the problem that only limited data or partial measurements on the scene are available. In this work, we propose a strategy to learn the full state of sloshing liquids from measurements of the free surface. Our approach is based on recurrent neural networks (RNN) that project the limited information available to a reduced-order manifold so as to not only reconstruct the unknown information, but also to be capable of performing fluid reasoning about future scenarios in real time. To obtain physically consistent predictions, we train deep neural networks on the reduced-order manifold that, through the employ of inductive biases, ensure the fulfillment of the principles of thermodynamics. RNNs learn from history the required hidden information to correlate the limited information with the latent space where the simulation occurs. Finally, a decoder returns data back to the high-dimensional manifold, so as to provide the user with insightful information in the form of augmented reality. This algorithm is connected to a computer vision system to test the performance of the proposed methodology with real information, resulting in a system capable of understanding and predicting future states of the observed fluid in real-time.
    Robustness in Compressed Neural Networks for Object Detection. (arXiv:2102.05509v3 [cs.LG] UPDATED)
    (2 min) Model compression techniques make it possible to significantly reduce the computational cost associated with data processing by deep neural networks with only a minor decrease in average accuracy. Simultaneously, reducing the model size may have a large effect on noisy cases or objects belonging to less frequent classes. It is a crucial problem from the perspective of the models' safety, especially for object detection in the autonomous driving setting, which is considered in this work. It was shown in the paper that the sensitivity of compressed models to different distortion types is nuanced, and some of the corruptions are heavily impacted by the compression methods (i.e., additive noise), while others (blur effect) are only slightly affected. A common way to improve the robustness of models is to use data augmentation, which was confirmed to positively affect models' robustness, also for highly compressed models. It was further shown that while data imbalance methods brought only a slight increase in accuracy for the baseline model (without compression), the impact was more striking at higher compression rates for the structured pruning. Finally, methods for handling data imbalance brought a significant improvement in the pruned models' worst-detected class accuracy.
    Structure-Preserving Image Super-Resolution. (arXiv:2109.12530v1 [cs.CV])
    (2 min) Structures matter in single image super-resolution (SISR). Benefiting from generative adversarial networks (GANs), recent studies have promoted the development of SISR by recovering photo-realistic images. However, there are still undesired structural distortions in the recovered images. In this paper, we propose a structure-preserving super-resolution (SPSR) method to alleviate the above issue while maintaining the merits of GAN-based methods to generate perceptual-pleasant details. Firstly, we propose SPSR with gradient guidance (SPSR-G) by exploiting gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss to impose a second-order restriction on the super-resolved images, which helps generative networks concentrate more on geometric structures. Secondly, since the gradient maps are handcrafted and may only be able to capture limited aspects of structural information, we further extend SPSR-G by introducing a learnable neural structure extractor (NSE) to unearth richer local structures and provide stronger supervision for SR. We propose two self-supervised structure learning methods, contrastive prediction and solving jigsaw puzzles, to train the NSEs. Our methods are model-agnostic, which can be potentially used for off-the-shelf SR networks. Experimental results on five benchmark datasets show that the proposed methods outperform state-of-the-art perceptual-driven SR methods under LPIPS, PSNR, and SSIM metrics. Visual results demonstrate the superiority of our methods in restoring structures while generating natural SR images. Code is available at https://github.com/Maclory/SPSR.
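    The gradient-loss idea can be sketched with simple finite differences (random arrays stand in for the super-resolved and ground-truth images, and the 0.1 weight is arbitrary; the paper's exact gradient operator and loss weighting may differ):

        import numpy as np

        def gradient_map(img):
            """Magnitude of horizontal/vertical finite differences of a 2-D image."""
            gx = np.diff(img, axis=1, append=img[:, -1:])
            gy = np.diff(img, axis=0, append=img[-1:, :])
            return np.sqrt(gx ** 2 + gy ** 2)

        rng = np.random.default_rng(0)
        sr = rng.random((64, 64))     # super-resolved output (placeholder)
        hr = rng.random((64, 64))     # ground-truth high-resolution image (placeholder)

        # L1 pixel loss plus an L1 penalty on the gradient maps, pushing the network to match edges.
        pixel_loss = np.abs(sr - hr).mean()
        grad_loss = np.abs(gradient_map(sr) - gradient_map(hr)).mean()
        total = pixel_loss + 0.1 * grad_loss          # 0.1 is an arbitrary weighting
        print(pixel_loss, grad_loss, total)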
    Markerless Suture Needle 6D Pose Tracking with Robust Uncertainty Estimation for Autonomous Minimally Invasive Robotic Surgery. (arXiv:2109.12722v1 [cs.RO])
    (2 min) Suture needle localization plays a crucial role in autonomous suturing. To track the 6D pose of a suture needle robustly, previous approaches usually add markers on the needle or perform complex operations for feature extraction, making these methods difficult to apply in real-world environments. Therefore, in this work, we present a novel approach for markerless suture needle pose tracking using Bayesian filters. A data-efficient feature point detector is trained to extract the feature points on the needle. Then based on these detections, we propose a novel observation model that measures the overlap between the detections and the expected projection of the needle, which can be calculated efficiently. In addition, for the proposed method, we derive the approximation for the covariance of the observation noise, making this model more robust to the uncertainty in the detections. The experimental results in simulation show that the proposed observation model achieves low tracking errors of approximately 1.5mm in position in space and 1 degree in orientation. We also demonstrate the qualitative results of our trained markerless feature detector combined with the proposed observation model in real-world environments. The results show high consistency between the projection of the tracked pose and that of the real pose.
    A Predictive Visual Analytics System for Studying Neurodegenerative Disease based on DTI Fiber Tracts. (arXiv:2010.07047v3 [cs.CV] UPDATED)
    (2 min) Diffusion tensor imaging (DTI) has been used to study the effects of neurodegenerative diseases on neural pathways, which may lead to more reliable and early diagnosis of these diseases as well as a better understanding of how they affect the brain. We introduce an intelligent visual analytics system for studying patient groups based on their labeled DTI fiber tract data and corresponding statistics. The system's AI-augmented interface guides the user through an organized and holistic analysis space, including the statistical feature space, the physical space, and the space of patients over different groups. We use a custom machine learning pipeline to help narrow down this large analysis space, and then explore it pragmatically through a range of linked visualizations. We conduct several case studies using real data from the research database of Parkinson's Progression Markers Initiative.
    EDDA: Explanation-driven Data Augmentation to Improve Explanation Faithfulness. (arXiv:2105.14162v3 [cs.LG] UPDATED)
    (2 min) Recent years have seen the introduction of a range of methods for post-hoc explainability of image classifier predictions. However, these post-hoc explanations may not always be faithful to classifier predictions, which poses a significant challenge when attempting to debug models based on such explanations. To this end, we seek a methodology that can improve the faithfulness of an explanation method with respect to model predictions which does not require ground truth explanations. We achieve this through a novel explanation-driven data augmentation (EDDA) technique that augments the training data with occlusions inferred from model explanations; this is based on the simple motivating principle that \emph{if} the explainer is faithful to the model \emph{then} occluding salient regions for the model prediction should decrease the model confidence in the prediction, while occluding non-salient regions should not change the prediction. To verify that the proposed augmentation method has the potential to improve faithfulness, we evaluate EDDA using a variety of datasets and classification models. We demonstrate empirically that our approach leads to a significant increase of faithfulness, which can facilitate better debugging and successful deployment of image classification models in real-world applications.
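    The augmentation itself reduces to masking the most salient pixels and adding the occluded image to the training set; a toy version (with a random array standing in for the explanation map, and an arbitrary saliency quantile) might look like this:

        import numpy as np

        rng = np.random.default_rng(0)

        def occlude_salient(image, saliency, quantile=0.9, fill=0.0):
            """Mask out the most salient pixels according to an explanation map."""
            thresh = np.quantile(saliency, quantile)
            out = image.copy()
            out[saliency >= thresh] = fill
            return out

        image = rng.random((32, 32, 3))
        saliency = rng.random((32, 32))              # placeholder explanation for this image

        augmented = occlude_salient(image, saliency)
        # The occluded image is added to the training data; if the explainer is faithful,
        # the model's confidence on it should drop relative to the original image.
        print(augmented.shape, (augmented == 0.0).mean())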
    Prediction of MGMT Methylation Status of Glioblastoma using Radiomics and Latent Space Shape Features. (arXiv:2109.12339v1 [eess.IV])
    (0 min) In this paper we propose a method for predicting the status of MGMT promoter methylation in high-grade gliomas. From the available MR images, we segment the tumor using deep convolutional neural networks and extract both radiomic features and shape features learned by a variational autoencoder. We implemented a standard machine learning workflow to obtain predictions, consisting of feature selection followed by training of a random forest classification model. We trained and evaluated our method on the RSNA-ASNR-MICCAI BraTS 2021 challenge dataset and submitted our predictions to the challenge.
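    The stated workflow (feature selection followed by a random forest) maps directly onto a standard scikit-learn pipeline; the synthetic matrix below stands in for the radiomic and latent shape features per patient, and the feature count is arbitrary:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import Pipeline

        rng = np.random.default_rng(0)
        X = rng.normal(size=(120, 300))          # 120 patients x 300 extracted features (placeholder)
        y = rng.integers(0, 2, size=120)         # MGMT methylation status labels (placeholder)

        clf = Pipeline([
            ("select", SelectKBest(f_classif, k=30)),          # keep the 30 most informative features
            ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
        ])
        print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())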
    ST-GREED: Space-Time Generalized Entropic Differences for Frame Rate Dependent Video Quality Prediction. (arXiv:2010.13715v2 [cs.MM] UPDATED)
    (0 min) We consider the problem of conducting frame rate dependent video quality assessment (VQA) on videos of diverse frame rates, including high frame rate (HFR) videos. More generally, we study how perceptual quality is affected by frame rate, and how frame rate and compression combine to affect perceived quality. We devise an objective VQA model called Space-Time GeneRalized Entropic Difference (GREED) which analyzes the statistics of spatial and temporal band-pass video coefficients. A generalized Gaussian distribution (GGD) is used to model band-pass responses, while entropy variations between reference and distorted videos under the GGD model are used to capture video quality variations arising from frame rate changes. The entropic differences are calculated across multiple temporal and spatial subbands, and merged using a learned regressor. We show through extensive experiments that GREED achieves state-of-the-art performance on the LIVE-YT-HFR Database when compared with existing VQA models. The features used in GREED are highly generalizable and obtain competitive performance even on standard, non-HFR VQA databases. The implementation of GREED has been made available online: https://github.com/pavancm/GREED
    Hypothesis-driven Stream Learning with Augmented Memory. (arXiv:2104.02206v3 [cs.CV] UPDATED)
    (2 min) Stream learning refers to the ability to acquire and transfer knowledge across a continuous stream of data without forgetting and without repeated passes over the data. A common way to avoid catastrophic forgetting is to intersperse new examples with replays of old examples stored as image pixels or reproduced by generative models. Here, we consider stream learning in image classification tasks and propose a novel hypotheses-driven Augmented Memory Network, which efficiently consolidates previous knowledge with a limited number of hypotheses in the augmented memory and replays relevant hypotheses to avoid catastrophic forgetting. The advantages of hypothesis-driven replay over pixel-level replay and generative replay are two-fold. First, hypothesis-based knowledge consolidation avoids redundant information in the image pixel space and makes memory usage more efficient. Second, hypotheses in the augmented memory can be re-used for learning new tasks, improving generalization and transfer learning ability. We evaluated our method on three stream learning object recognition datasets. Our method performs comparably well or better than state-of-the-art methods, while offering more efficient memory usage. All source code and data are publicly available https://github.com/kreimanlab/AugMem.
    A Novel Hybrid Convolutional Neural Network for Accurate Organ Segmentation in 3D Head and Neck CT Images. (arXiv:2109.12634v1 [eess.IV])
    (0 min) Radiation therapy (RT) is widely employed in the clinic for the treatment of head and neck (HaN) cancers. An essential step of RT planning is the accurate segmentation of various organs-at-risk (OARs) in HaN CT images. Nevertheless, segmenting OARs manually is time-consuming, tedious, and error-prone considering that typical HaN CT images contain tens to hundreds of slices. Automated segmentation algorithms are urgently required. Recently, convolutional neural networks (CNNs) have been extensively investigated on this task. Particularly, 3D CNNs are frequently adopted to process 3D HaN CT images. There are two issues with naïve 3D CNNs. First, the depth resolution of 3D CT images is usually several times lower than the in-plane resolution. Direct employment of 3D CNNs without distinguishing this difference can lead to the extraction of distorted image features and influence the final segmentation performance. Second, a severe class imbalance problem exists, and large organs can be orders of magnitude larger than small organs. It is difficult to simultaneously achieve accurate segmentation for all the organs. To address these issues, we propose a novel hybrid CNN that fuses 2D and 3D convolutions to combat the different spatial resolutions and extract effective edge and semantic features from 3D HaN CT images. To accommodate large and small organs, our final model, named OrganNet2.5D, consists of only two instead of the classic four downsampling operations, and hybrid dilated convolutions are introduced to maintain the receptive field. Experiments on the MICCAI 2015 challenge dataset demonstrate that OrganNet2.5D achieves promising performance compared to state-of-the-art methods.
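    A hedged PyTorch sketch of the hybrid idea is shown below: an in-plane 1x3x3 convolution, a full 3x3x3 convolution, and a dilated 3x3x3 branch applied side by side. The layer sizes and the additive fusion are assumptions for illustration, not the actual OrganNet2.5D definition:

        import torch
        import torch.nn as nn

        class HybridBlock(nn.Module):
            def __init__(self, in_ch, out_ch):
                super().__init__()
                # 1x3x3 kernel: convolves only in-plane, respecting the lower depth resolution.
                self.conv2d_like = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
                # 3x3x3 kernel: full 3-D context.
                self.conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
                # Dilated 3x3x3 kernel: enlarges the receptive field without extra downsampling.
                self.dilated = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2)
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x):
                return self.relu(self.conv2d_like(x) + self.conv3d(x) + self.dilated(x))

        x = torch.randn(1, 1, 16, 64, 64)        # (batch, channels, depth, height, width) CT patch
        print(HybridBlock(1, 8)(x).shape)        # torch.Size([1, 8, 16, 64, 64])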
    WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels. (arXiv:2107.01002v2 [cs.NI] UPDATED)
    (0 min) We introduce WiCluster, a new machine learning (ML) approach for passive indoor positioning using radio frequency (RF) channel state information (CSI). WiCluster can predict both a zone-level position and a precise 2D or 3D position, without using any precise position labels during training. Prior CSI-based indoor positioning work has relied on non-parametric approaches using digital signal-processing (DSP) and, more recently, parametric approaches (e.g., fully supervised ML methods). However these do not handle the complexity of real-world environments well and do not meet requirements for large-scale commercial deployments: the accuracy of DSP-based method deteriorates significantly in non-line-of-sight conditions, while supervised ML methods need large amounts of hard-to-acquire centimeter accuracy position labels. In contrast, WiCluster is precise, requires weaker label-information that can be easily collected, and works well in non-line-of-sight conditions. Our first contribution is a novel dimensionality reduction method for charting. It combines a triplet-loss with a multi-scale clustering-loss to map the high-dimensional CSI representation to a 2D/3D latent space. Our second contribution is two weakly supervised losses that map this latent space into a Cartesian map, resulting in meter-accuracy position results. These losses only require simple to acquire priors: a sketch of the floorplan, approximate access-point locations and a few CSI packets that are labelled with the corresponding zone in the floorplan. Thirdly, we report results and a robustness study for 2D positioning in two single-floor office buildings and 3D positioning in a two-story home.
    Group Shift Pointwise Convolution for Volumetric Medical Image Segmentation. (arXiv:2109.12629v1 [eess.IV])
    (0 min) Recent studies have witnessed the effectiveness of 3D convolutions on segmenting volumetric medical images. Compared with the 2D counterparts, 3D convolutions can capture the spatial context in three dimensions. Nevertheless, models employing 3D convolutions introduce more trainable parameters and are more computationally complex, which may easily lead to model overfitting, especially for medical applications with limited available training data. This paper aims to improve the effectiveness and efficiency of 3D convolutions by introducing a novel Group Shift Pointwise Convolution (GSP-Conv). GSP-Conv simplifies 3D convolutions into pointwise ones with 1x1x1 kernels, which dramatically reduces the number of model parameters and FLOPs (e.g. 27x fewer than 3D convolutions with 3x3x3 kernels). Naïve pointwise convolutions with limited receptive fields cannot make full use of the spatial image context. To address this problem, we propose a parameter-free operation, Group Shift (GS), which shifts the feature maps along different spatial directions in an elegant way. With GS, pointwise convolutions can access features from different spatial locations, and the limited receptive fields of pointwise convolutions can be compensated. We evaluate the proposed methods on two datasets, PROMISE12 and BraTS18. Results show that our method, with substantially decreased model complexity, achieves comparable or even better performance than models employing 3D convolutions.
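    A possible reading of the operation, sketched in PyTorch: split the channels into groups, roll each group one voxel along a different spatial axis and direction (parameter-free), then mix everything with a 1x1x1 convolution. The group count, shift size, and direction assignment are assumptions for illustration:

        import torch
        import torch.nn as nn

        class GroupShiftPointwise(nn.Module):
            def __init__(self, in_ch, out_ch, shift=1):
                super().__init__()
                self.shift = shift
                self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)

            def forward(self, x):
                # Six groups: +/- shifts along depth, height and width; any remainder stays unshifted.
                groups = torch.chunk(x, 7, dim=1)
                dirs = [(2, 1), (2, -1), (3, 1), (3, -1), (4, 1), (4, -1)]
                shifted = [torch.roll(g, s * self.shift, dims=d) for g, (d, s) in zip(groups, dirs)]
                shifted.extend(groups[len(dirs):])
                return self.pointwise(torch.cat(shifted, dim=1))

        x = torch.randn(1, 14, 8, 32, 32)                  # 14 channels -> 7 groups of 2
        print(GroupShiftPointwise(14, 16)(x).shape)        # torch.Size([1, 16, 8, 32, 32])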
    Logo Generation Using Regional Features: A Faster R-CNN Approach to Generative Adversarial Networks. (arXiv:2109.12628v1 [cs.CV])
    (0 min) In this paper we introduce the Local Logo Generative Adversarial Network (LL-GAN) that uses regional features extracted from the Faster Regional Convolutional Neural Network (Faster R-CNN) to generate logos. We demonstrate the strength of this approach by training the framework on a small style-rich dataset collected online to generate large impressive logos. Our approach beats the state-of-the-art models (StyleGAN2, Self-Attention GANs) that suffer from mode collapse due to the size of the data.
    MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling. (arXiv:2109.12178v1 [cs.CV])
    (0 min) Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked Image-Region Modeling (MIRM). Both alignment and MIRM objectives mostly do not have ground truth. Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models. In this paper, we present Masked Language and Image Modeling (MLIM) for VLP. MLIM uses two loss functions: Masked Language Modeling (MLM) loss and image reconstruction (RECON) loss. We propose Modality Aware Masking (MAM) to boost cross-modality interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality. Using MLM + RECON tasks coupled with MAM, we present a simplified VLP methodology and show that it has better downstream task performance on a proprietary e-commerce multi-modal dataset.
    Disentangled Feature Representation for Few-shot Image Classification. (arXiv:2109.12548v1 [cs.CV])
    (2 min) Learning the generalizable feature representation is critical for few-shot image classification. While recent works exploited task-specific feature embedding using meta-tasks for few-shot learning, they are limited in many challenging tasks as being distracted by the excursive features such as the background, domain and style of the image samples. In this work, we propose a novel Disentangled Feature Representation framework, dubbed DFR, for few-shot learning applications. DFR can adaptively decouple the discriminative features that are modeled by the classification branch, from the class-irrelevant component of the variation branch. In general, most of the popular deep few-shot learning methods can be plugged in as the classification branch, thus DFR can boost their performance on various few-shot tasks. Furthermore, we propose a novel FS-DomainNet dataset based on DomainNet, for benchmarking the few-shot domain generalization tasks. We conducted extensive experiments to evaluate the proposed DFR on general and fine-grained few-shot classification, as well as few-shot domain generalization, using the corresponding four benchmarks, i.e., mini-ImageNet, tiered-ImageNet, CUB, as well as the proposed FS-DomainNet. Thanks to the effective feature disentangling, the DFR-based few-shot classifiers achieved the state-of-the-art results on all datasets.
    Rotation Invariance and Extensive Data Augmentation: a strategy for the Mitosis Domain Generalization (MIDOG) Challenge. (arXiv:2109.00823v2 [cs.CV] UPDATED)
    (2 min) Automated detection of mitotic figures in histopathology images is a challenging task: here, we present the different steps that describe the strategy we applied to participate in the MIDOG 2021 competition. The purpose of the competition was to evaluate the generalization of solutions to images acquired with unseen target scanners (hidden for the participants) under the constraint of using training data from a limited set of four independent source scanners. Given this goal and constraints, we joined the challenge by proposing a straight-forward solution based on a combination of state-of-the-art deep learning methods with the aim of yielding robustness to possible scanner-related distributional shifts at inference time. Our solution combines methods that were previously shown to be efficient for mitosis detection: hard negative mining, extensive data augmentation, rotation-invariant convolutional networks. We trained five models with different splits of the provided dataset. The subsequent classifiers produced F1-scores with a mean and standard deviation of 0.747+/-0.032 on the test splits. The resulting ensemble constitutes our candidate algorithm: its automated evaluation on the preliminary test set of the challenge returned a F1-score of 0.6828.
    N-shot Palm Vein Verification Using Siamese Networks. (arXiv:2109.12808v1 [cs.CV])
    (2 min) The use of deep learning methods to extract vascular biometric patterns from the palm surface has been of interest among researchers in recent years. In many biometric recognition tasks, there is a limit on the number of training samples. This is because of limited vein biometric databases being available for research. This restricts the application of deep learning methods to design algorithms that can effectively identify or authenticate people for vein recognition. This paper proposes an architecture using a Siamese neural network structure for few-shot palm vein identification. The proposed network uses images from both palms and consists of two sub-nets that share weights to identify a person. The architecture performance was tested on the HK PolyU multispectral palm vein database with limited samples. The results suggest that the method is effective since it achieves 91.9% precision, 91.1% recall, 92.2% specificity, 91.5% F1-score, and 90.5% accuracy.
    DRAEM -- A discriminatively trained reconstruction embedding for surface anomaly detection. (arXiv:2108.07610v2 [cs.CV] UPDATED)
    (2 min) Visual surface anomaly detection aims to detect local image regions that significantly deviate from normal appearance. Recent surface anomaly detection methods rely on generative models to accurately reconstruct the normal areas and to fail on anomalies. These methods are trained only on anomaly-free images, and often require hand-crafted post-processing steps to localize the anomalies, which prohibits optimizing the feature extraction for maximal detection capability. In addition to reconstructive approach, we cast surface anomaly detection primarily as a discriminative problem and propose a discriminatively trained reconstruction anomaly embedding model (DRAEM). The proposed method learns a joint representation of an anomalous image and its anomaly-free reconstruction, while simultaneously learning a decision boundary between normal and anomalous examples. The method enables direct anomaly localization without the need for additional complicated post-processing of the network output and can be trained using simple and general anomaly simulations. On the challenging MVTec anomaly detection dataset, DRAEM outperforms the current state-of-the-art unsupervised methods by a large margin and even delivers detection performance close to the fully-supervised methods on the widely used DAGM surface-defect detection dataset, while substantially outperforming them in localization accuracy.
    Unsupervised domain adaptation for clinician pose estimation and instance segmentation in the OR. (arXiv:2108.11801v2 [cs.CV] UPDATED)
    (3 min) The fine-grained localization of clinicians in the operating room (OR) is a key component to design the new generation of OR support systems. Computer vision models for person pixel-based segmentation and body-keypoints detection are needed to better understand the clinical activities and the spatial layout of the OR. This is challenging, not only because OR images are very different from traditional vision datasets, but also because data and annotations are hard to collect and generate in the OR due to privacy concerns. To address these concerns, we first study how joint person pose estimation and instance segmentation can be performed on low resolutions images from 1x to 12x. Second, to address the domain shift and the lack of annotations, we propose a novel unsupervised domain adaptation method, called \emph{AdaptOR}, to adapt a model from an \emph{in-the-wild} labeled source domain to a statistically different unlabeled target domain. We propose to exploit explicit geometric constraints on the different augmentations of the unlabeled target domain image to generate accurate pseudo labels, and using these pseudo labels to train the model on high- and low-resolution OR images in a \emph{self-training} framework. Furthermore, we propose \emph{disentangled feature normalization} to handle the statistically different source and target domain data. Extensive experimental results with detailed ablation studies on the two OR datasets \emph{MVOR+} and \emph{TUM-OR-test} show the effectiveness of our approach against strongly constructed baselines, especially on the low-resolution privacy-preserving OR images. Finally, we show the generality of our method as a semi-supervised learning (SSL) method on the large-scale \emph{COCO} dataset, where we achieve comparable results with as few as \textbf{1\%} of labeled supervision against a model trained with 100\% labeled supervision.
    Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better. (arXiv:2109.12507v1 [cs.CV])
    (2 min) The knowledge distillation field delicately designs various types of knowledge to shrink the performance gap between a compact student and a large-scale teacher. These existing distillation approaches simply focus on the improvement of \textit{knowledge quality}, but ignore the significant influence of \textit{knowledge quantity} on the distillation procedure. Opposed to the conventional distillation approaches, which extract knowledge from a fixed teacher computation graph, this paper explores a non-negligible research direction from a novel perspective of \textit{knowledge quantity} to further improve the efficacy of knowledge distillation. We introduce a new concept of knowledge decomposition, and further put forward the \textbf{P}artial to \textbf{W}hole \textbf{K}nowledge \textbf{D}istillation~(\textbf{PWKD}) paradigm. Specifically, we reconstruct the teacher into weight-sharing sub-networks with the same depth but increasing channel width, and train the sub-networks jointly to obtain decomposed knowledge~(sub-networks with more channels represent more knowledge). Then, the student extracts partial-to-whole knowledge from the pre-trained teacher within multiple training stages, where a cyclic learning rate is leveraged to accelerate convergence. Generally, \textbf{PWKD} can be regarded as a plugin to be compatible with existing offline knowledge distillation approaches. To verify the effectiveness of \textbf{PWKD}, we conduct experiments on two benchmark datasets:~CIFAR-100 and ImageNet, and comprehensive evaluation results reveal that \textbf{PWKD} consistently improves existing knowledge distillation approaches without bells and whistles.
    Mining Latent Classes for Few-shot Segmentation. (arXiv:2103.15402v3 [cs.CV] UPDATED)
    (2 min) Few-shot segmentation (FSS) aims to segment unseen classes given only a few annotated samples. Existing methods suffer the problem of feature undermining, i.e. potential novel classes are treated as background during training phase. Our method aims to alleviate this problem and enhance the feature embedding on latent novel classes. In our work, we propose a novel joint-training framework. Based on conventional episodic training on support-query pairs, we add an additional mining branch that exploits latent novel classes via transferable sub-clusters, and a new rectification technique on both background and foreground categories to enforce more stable prototypes. Over and above that, our transferable sub-cluster has the ability to leverage extra unlabeled data for further feature enhancement. Extensive experiments on two FSS benchmarks demonstrate that our method outperforms previous state-of-the-art by a large margin of 3.7% mIOU on PASCAL-5i and 7.0% mIOU on COCO-20i at the cost of 74% fewer parameters and 2.5x faster inference speed. The source code is available at https://github.com/LiheYoung/MiningFSS.
    Automatic Map Update Using Dashcam Videos. (arXiv:2109.12131v1 [cs.CV])
    (2 min) Autonomous driving requires 3D maps that provide accurate and up-to-date information about semantic landmarks. Due to the wider availability and lower cost of cameras compared with laser scanners, vision-based mapping has attracted much attention from academia and industry. Among the existing solutions, Structure-from-Motion (SfM) technology has proved to be feasible for building 3D maps from crowdsourced data, since it allows unordered images as input. Previous works on SfM have mainly focused on issues related to building 3D point clouds and calculating camera poses, leaving the issues of automatic change detection and localization open. We propose in this paper an SfM-based solution for automatic map update, with a focus on real-time change detection and localization. Our solution builds on a comparison of semantic map data (e.g. types and locations of traffic signs). Through a novel design of the pixel-wise 3D localization algorithm, our system can locate the objects detected from 2D images in a 3D space, utilizing sparse SfM point clouds. Experiments with dashcam videos collected from two urban areas prove that the system is able to locate visible traffic signs in front along the driving direction with a median distance error of 1.52 meters. Moreover, it can detect up to 80\% of the changes with a median distance error of 2.21 meters. The result analysis also shows the potential of significantly improving the system performance in the future by increasing the accuracy of the background technology in use, including in particular the object detection and point cloud geo-registration algorithms.
    NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU. (arXiv:2109.12191v1 [cs.LG])
    (2 min) Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm places computational overheads that can undo the benefit of batching in GPUs. Microbatching is a standard method to alleviate this and is fully supported in the TensorFlow Privacy library (TFDP). However, this technique, while improving training times also reduces the quality of the gradients and degrades the classification accuracy. Recent works that for example use the JAX framework show promise in also alleviating this but still show degradation in throughput from non-private to private SGD on CNNs, and have not yet shown ImageNet implementations. In our work, we argue that low batch sizes using group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.
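    For context, the per-example work that makes DPSGD expensive is the clip-then-noise step; a minimal numpy version of one update (with a tiny linear model and arbitrary hyper-parameters, not the paper's ResNet-50 setup) looks like this:

        import numpy as np

        rng = np.random.default_rng(0)
        n, d = 8, 5                              # "nano" batch of 8 examples, 5 weights
        X, y = rng.normal(size=(n, d)), rng.normal(size=n)
        w = np.zeros(d)
        clip_norm, noise_mult, lr = 1.0, 1.1, 0.1

        # Per-example gradients of a squared-error loss (this is the expensive part of DP-SGD).
        per_example_grads = 2 * (X @ w - y)[:, None] * X            # shape (n, d)

        # Clip each example's gradient to a maximum L2 norm, then sum and add calibrated noise.
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        noisy_sum = clipped.sum(axis=0) + rng.normal(scale=noise_mult * clip_norm, size=d)

        w -= lr * noisy_sum / n
        print(w)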
    PETA: Photo Albums Event Recognition using Transformers Attention. (arXiv:2109.12499v1 [cs.CV])
    (2 min) In recent years the amounts of personal photos captured increased significantly, giving rise to new challenges in multi-image understanding and high-level image understanding. Event recognition in personal photo albums presents one challenging scenario where life events are recognized from a disordered collection of images, including both relevant and irrelevant images. Event recognition in images also presents the challenge of high-level image understanding, as opposed to low-level image object classification. In absence of methods to analyze multiple inputs, previous methods adopted temporal mechanisms, including various forms of recurrent neural networks. However, their effective temporal window is local. In addition, they are not a natural choice given the disordered characteristic of photo albums. We address this gap with a tailor-made solution, combining the power of CNNs for image representation and transformers for album representation to perform global reasoning on image collection, offering a practical and efficient solution for photo albums event recognition. Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90\% mAP on all datasets. We further explore the related image-importance task in event recognition, demonstrating how the learned attentions correlate with the human-annotated importance for this subjective task, thus opening the door for new applications.
    Identifying Women with Mammographically-Occult Breast Cancer Leveraging GAN-Simulated Mammograms. (arXiv:2109.12113v1 [eess.IV])
    (3 min) Our objective is to show the feasibility of using simulated mammograms to detect mammographically-occult (MO) cancer in women with dense breasts and a normal screening mammogram who could be triaged for additional screening with magnetic resonance imaging (MRI) or ultrasound. We developed a Conditional Generative Adversarial Network (CGAN) to simulate a mammogram with normal appearance using the opposite mammogram as the condition. We used a Convolutional Neural Network (CNN) trained on Radon Cumulative Distribution Transform (RCDT) processed mammograms to detect MO cancer. For training CGAN, we used screening mammograms of 1366 women. For MO cancer detection, we used screening mammograms of 333 women (97 MO cancer) with dense breasts. We simulated the right mammogram for normal controls and the cancer side for MO cancer cases. We created two RCDT images, one from a real mammogram pair and another from a real-simulated mammogram pair. We finetuned a VGG16 on resulting RCDT images to classify the women with MO cancer. We compared the classification performance of the CNN trained on fused RCDT images, CNN_{Fused} to that of trained only on real RCDT images, CNN_{Real}, and to that of trained only on simulated RCDT images, CNN_{Simulated}. The test AUC for CNN_{Fused} was 0.77 with a 95% confidence interval (95CI) of [0.71, 0.83], which was statistically better (p-value < 0.02) than the CNN_{Real} AUC of 0.70 with a 95CI of [0.64, 0.77] and CNN_{Simulated} AUC of 0.68 with a 95CI of [0.62, 0.75]. It showed that CGAN simulated mammograms can help MO cancer detection.
    Profiling Neural Blocks and Design Spaces for Mobile Neural Architecture Search. (arXiv:2109.12426v1 [cs.LG])
    (2 min) Neural architecture search automates neural network design and has achieved state-of-the-art results in many deep learning applications. While recent literature has focused on designing networks to maximize accuracy, little work has been conducted to understand the compatibility of architecture design spaces to varying hardware. In this paper, we analyze the neural blocks used to build Once-for-All (MobileNetV3), ProxylessNAS and ResNet families, in order to understand their predictive power and inference latency on various devices, including Huawei Kirin 9000 NPU, RTX 2080 Ti, AMD Threadripper 2990WX, and Samsung Note10. We introduce a methodology to quantify the friendliness of neural blocks to hardware and the impact of their placement in a macro network on overall network performance via only end-to-end measurements. Based on extensive profiling results, we derive design insights and apply them to hardware-specific search space reduction. We show that searching in the reduced search space generates better accuracy-latency Pareto frontiers than searching in the original search spaces, customizing architecture search according to the hardware. Moreover, insights derived from measurements lead to notably higher ImageNet top-1 scores on all search spaces investigated.
    Robotic Vision for Space Mining. (arXiv:2109.12109v1 [cs.RO])
    (2 min) Future Moon bases will likely be constructed using resources mined from the surface of the Moon. The difficulty of maintaining a human workforce on the Moon and communications lag with Earth means that mining will need to be conducted using collaborative robots with a high degree of autonomy. In this paper, we explore the utility of robotic vision towards addressing several major challenges in autonomous mining in the lunar environment: lack of satellite positioning systems, navigation in hazardous terrain, and delicate robot interactions. Specifically, we describe and report the results of robotic vision algorithms that we developed for Phase 2 of the NASA Space Robotics Challenge, which was framed in the context of autonomous collaborative robots for mining on the Moon. The competition provided a simulated lunar environment that exhibits the complexities alluded to above. We show how machine learning-enabled vision could help alleviate the challenges posed by the lunar environment. A robust multi-robot coordinator was also developed to achieve long-term operation and effective collaboration between robots.
    Fully Convolutional Online Tracking. (arXiv:2004.07109v5 [cs.CV] UPDATED)
    (3 min) Online learning has turned out to be effective for improving tracking performance. However, it can be simply applied to the classification branch, but it remains challenging to adapt to the regression branch due to its complex design and intrinsic requirement for high-quality online samples. To tackle this issue, we present the fully convolutional online tracking framework, coined as FCOT, and focus on enabling online learning for both classification and regression branches by using a target filter based tracking paradigm. Our key contribution is to introduce an online regression model generator (RMG) for initializing weights of the target filter with online samples and then optimizing the target filter weights based on the groundtruth samples at the first frame. Based on the online RMG, we devise a simple anchor-free tracker (FCOT), composed of a feature backbone, an up-sampling decoder, a multi-scale classification branch, and a multi-scale regression branch. Thanks to the unique design of RMG, our FCOT can not only be more effective in handling target variation along the temporal dimension, thus generating more precise results, but also overcome the issue of error accumulation during the tracking procedure. In addition, due to its simplicity in design, our FCOT could be trained and deployed in a fully convolutional manner with a real-time running speed. The proposed FCOT achieves the state-of-the-art performance on seven benchmarks, including VOT2018, LaSOT, TrackingNet, GOT-10k, OTB100, UAV123, and NFS. Code and models of our FCOT have been released at: \url{https://github.com/MCG-NJU/FCOT}.
    Local Memory Attention for Fast Video Semantic Segmentation. (arXiv:2101.01715v2 [cs.CV] UPDATED)
    (2 min) We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline. In contrast to prior works, we strive towards a simple, fast, and general module that can be integrated into virtually any single-frame architecture. Our approach aggregates a rich representation of the semantic information in past frames into a memory module. Information stored in the memory is then accessed through an attention mechanism. In contrast to previous memory-based approaches, we propose a fast local attention layer, providing temporal appearance cues in the local region of prior frames. We further fuse these cues with an encoding of the current frame through a second attention-based module. The segmentation decoder processes the fused representation to predict the final semantic segmentation. We integrate our approach into two popular semantic segmentation networks: ERFNet and PSPNet. We observe an improvement in segmentation performance on Cityscapes by 1.7% and 2.1% in mIoU respectively, while increasing inference time of ERFNet by only 1.5ms.
    A Novel Patch Convolutional Neural Network for View-based 3D Model Retrieval. (arXiv:2109.12299v1 [cs.CV])
    (2 min) Recently, many view-based 3D model retrieval methods have been proposed and have achieved state-of-the-art performance. Most of these methods focus on extracting more discriminative view-level features and effectively aggregating the multi-view images of a 3D model, but the latent relationship among these multi-view images is not fully explored. Thus, we tackle this problem from the perspective of exploiting the relationships between patch features to capture long-range associations among multi-view images. To capture associations among views, in this work, we propose a novel patch convolutional neural network (PCNN) for view-based 3D model retrieval. Specifically, we first employ a CNN to extract patch features of each view image separately. Secondly, a novel neural network module named PatchConv is designed to exploit intrinsic relationships between neighboring patches in the feature space to capture long-range associations among multi-view images. Then, an adaptive weighted view layer is further embedded into PCNN to automatically assign a weight to each view according to the similarity between each view feature and the view-pooling feature. Finally, a discrimination loss function is employed to extract the discriminative 3D model feature, which consists of softmax loss values generated by the fusion classifier and the specific classifier. Extensive experimental results on two public 3D model retrieval benchmarks, namely ModelNet40 and ModelNet10, demonstrate that our proposed PCNN can outperform state-of-the-art approaches, with mAP values of 93.67% and 96.23%, respectively.
    Vehicle Detection and Tracking From Surveillance Cameras in Urban Scenes. (arXiv:2109.12414v1 [cs.CV])
    (2 min) Detecting and tracking vehicles in urban scenes is a crucial step in many traffic-related applications as it helps to improve road user safety among other benefits. Various challenges remain unresolved in multi-object tracking (MOT) including target information description, long-term occlusions and fast motion. We propose a multi-vehicle detection and tracking system following the tracking-by-detection paradigm that tackles the previously mentioned challenges. Our MOT method extends an Intersection-over-Union (IOU)-based tracker with vehicle re-identification features. This allows us to utilize appearance information to better match objects after long occlusion phases and/or when object location is significantly shifted due to fast motion. We outperform our baseline MOT method on the UA-DETRAC benchmark while maintaining a total processing speed suitable for online use cases.
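    A hedged sketch of the association step such a tracking-by-detection pipeline typically uses: the matching cost mixes (1 - IoU) with an appearance distance from re-identification embeddings, and the Hungarian algorithm solves the assignment. The equal weighting and helper names are assumptions, not the paper's exact design.

```python
# Sketch of combining IoU and re-ID appearance cues for track-detection matching.
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(track_boxes, track_embs, det_boxes, det_embs, w_app=0.5):
    """Returns matched (track, detection) index pairs and the full cost matrix."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, te) in enumerate(zip(track_boxes, track_embs)):
        for j, (db, de) in enumerate(zip(det_boxes, det_embs)):
            app_dist = 1.0 - np.dot(te, de) / (np.linalg.norm(te) * np.linalg.norm(de) + 1e-9)
            cost[i, j] = (1.0 - w_app) * (1.0 - iou(tb, db)) + w_app * app_dist
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost
```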
    Unsupervised Cross-Modality Domain Adaptation for Segmenting Vestibular Schwannoma and Cochlea with Data Augmentation and Model Ensemble. (arXiv:2109.12169v1 [eess.IV])
    (2 min) Magnetic resonance images (MRIs) are widely used to quantify vestibular schwannoma (VS) and the cochlea. Recently, deep learning methods have shown state-of-the-art performance for segmenting these structures. However, training segmentation models may require manual labels in the target domain, which are expensive and time-consuming to obtain. To overcome this problem, domain adaptation is an effective way to leverage information from the source domain to obtain accurate segmentations without requiring manual labels in the target domain. In this paper, we propose an unsupervised learning framework to segment the VS and cochlea. Our framework leverages information from contrast-enhanced T1-weighted (ceT1-w) MRIs and their labels, and produces segmentations for T2-weighted MRIs without any labels in the target domain. We first applied a generator to achieve image-to-image translation. Next, we combine the outputs of an ensemble of different models to obtain final segmentations. To cope with MRIs from different sites/scanners, we applied various 'online' augmentations during training to better capture the geometric variability and the variability in image appearance and quality. Our method is easy to build and produces promising segmentations, with a mean Dice score of 0.7930 and 0.7432 for VS and cochlea respectively on the validation set.
    Using Soft Labels to Model Uncertainty in Medical Image Segmentation. (arXiv:2109.12622v1 [cs.CV])
    (2 min) Medical image segmentation is inherently uncertain. For a given image, there may be multiple plausible segmentation hypotheses, and physicians will often disagree on lesion and organ boundaries. To be suited to real-world application, automatic segmentation systems must be able to capture this uncertainty and variability. Thus far, this has been addressed by building deep learning models that, through dropout, multiple heads, or variational inference, can produce a set - infinite, in some cases - of plausible segmentation hypotheses for any given image. However, in clinical practice, it may not be practical to browse all hypotheses. Furthermore, recent work shows that segmentation variability plateaus after a certain number of independent annotations, suggesting that a large enough group of physicians may be able to represent the whole space of possible segmentations. Inspired by this, we propose a simple method to obtain soft labels from the annotations of multiple physicians and train models that, for each image, produce a single well-calibrated output that can be thresholded at multiple confidence levels, according to each application's precision-recall requirements. We evaluated our method on the MICCAI 2021 QUBIQ challenge, showing that it performs well across multiple medical image segmentation tasks, produces well-calibrated predictions, and, on average, performs better at matching physicians' predictions than other physicians.
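    A minimal sketch of the soft-label idea, under assumed toy data: average several annotators' binary masks into one soft target, train with a loss that accepts soft targets, and threshold the calibrated output at whatever confidence level the application requires.

```python
# Toy sketch: soft labels from multiple raters, a soft-target BCE, and
# multi-threshold decoding. The data and the BCE choice are assumptions.
import numpy as np

rng = np.random.default_rng(0)
annotations = rng.integers(0, 2, size=(5, 64, 64)).astype(np.float32)  # 5 raters

soft_label = annotations.mean(axis=0)          # values in [0, 1], one per pixel


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy; works unchanged with soft targets."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())


pred = np.full_like(soft_label, 0.5)           # stand-in for a model's probability map
print("loss against soft labels:", bce(pred, soft_label))

# At inference, one calibrated probability map can be thresholded at several
# confidence levels depending on precision-recall requirements:
for t in (0.25, 0.5, 0.75):
    mask = pred >= t
    print(f"threshold {t}: {mask.mean():.2%} of pixels positive")
```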
    Nonnegative Low Rank Tensor Approximation and its Application to Multi-dimensional Images. (arXiv:2007.14137v2 [cs.CV] UPDATED)
    (2 min) The main aim of this paper is to develop a new algorithm for computing nonnegative low rank tensor approximations for nonnegative tensors that arise in many multi-dimensional imaging applications. Nonnegativity is an important property, as each pixel value corresponds to nonnegative light intensity in image data acquisition. Our approach is different from classical nonnegative tensor factorization (NTF), which requires each factorized matrix and/or tensor to be nonnegative. In this paper, we determine a nonnegative low Tucker rank tensor to approximate a given nonnegative tensor. We propose an alternating projections algorithm for computing such a nonnegative low rank tensor approximation, which is referred to as NLRT. The convergence of the proposed manifold projection method is established. Experimental results for synthetic data and multi-dimensional images are presented to demonstrate that the performance of NLRT is better than that of state-of-the-art NTF methods.
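    The alternating-projections idea can be illustrated with a rough sketch (assumptions throughout, not the paper's NLRT implementation): alternate between an approximate projection onto low Tucker rank via truncated HOSVD and a projection onto the nonnegative orthant by clipping.

```python
# Rough sketch of alternating projections for a nonnegative low-Tucker-rank
# approximation. Truncated HOSVD serves as an approximate low-rank projection;
# clipping projects onto the nonnegative orthant.
import numpy as np


def unfold(tensor, mode):
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def truncated_hosvd(tensor, ranks):
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = tensor
    for mode, u in enumerate(factors):       # project onto each mode's subspace
        core = np.moveaxis(np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    recon = core
    for mode, u in enumerate(factors):       # map the core back to the original space
        recon = np.moveaxis(np.tensordot(u, np.moveaxis(recon, mode, 0), axes=1), 0, mode)
    return recon


def nonneg_low_rank_approx(tensor, ranks, n_iter=20):
    x = tensor.copy()
    for _ in range(n_iter):
        x = truncated_hosvd(x, ranks)   # approximate low-Tucker-rank projection
        x = np.maximum(x, 0.0)          # projection onto the nonnegative orthant
    return x


if __name__ == "__main__":
    t = np.random.rand(10, 12, 8)
    approx = nonneg_low_rank_approx(t, ranks=(3, 3, 3))
    print("relative error:", np.linalg.norm(approx - t) / np.linalg.norm(t))
```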
    Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search. (arXiv:2010.04354v2 [cs.CV] UPDATED)
    (2 min) Quantized Neural Networks (QNNs) have attracted a lot of attention due to their high efficiency. To enhance the quantization accuracy, prior works mainly focus on designing advanced quantization algorithms but still fail to achieve satisfactory results in the extremely low-bit case. In this work, we take an architecture perspective to investigate the potential of high-performance QNNs. Therefore, we propose to combine Network Architecture Search methods with quantization to enjoy the merits of both sides. However, a naive combination inevitably faces unacceptable time consumption or unstable training problems. To alleviate these problems, we first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models. Then a bit-inheritance scheme is introduced to transfer the quantized models to lower bit-widths, which further reduces the time cost and meanwhile improves the quantization accuracy. Equipped with this overall framework, dubbed Once Quantization-Aware Training (OQAT), our searched model family, OQATNets, achieves a new state-of-the-art compared with various architectures under different bit-widths. In particular, OQAT-2bit-M achieves 61.6% ImageNet Top-1 accuracy, outperforming the 2-bit MobileNetV3 counterpart by a large margin of 9% with 10% less computation cost. A series of quantization-friendly architectures are identified easily, and extensive analysis can be made to summarize the interaction between quantization and neural architectures. Codes and models are released at https://github.com/LaVieEnRoseSMZ/OQA
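    The basic building block behind quantization-aware training with a learnable step size can be sketched as follows; how OQAT shares the step across architectures and inherits bit-widths is not shown, and the code is an illustrative assumption-level example rather than the released implementation.

```python
# Sketch of a learnable-step-size fake quantizer with a straight-through estimator.
import torch


class FakeQuantize(torch.nn.Module):
    def __init__(self, bits=2, init_step=0.1):
        super().__init__()
        self.qn = -(2 ** (bits - 1))          # e.g. -2 for 2-bit signed
        self.qp = 2 ** (bits - 1) - 1         # e.g. +1 for 2-bit signed
        self.step = torch.nn.Parameter(torch.tensor(float(init_step)))

    def forward(self, x):
        scaled = torch.clamp(x / self.step, self.qn, self.qp)
        quant = torch.round(scaled)
        # Straight-through estimator: forward uses the rounded value,
        # backward treats rounding as identity.
        quant = scaled + (quant - scaled).detach()
        return quant * self.step


if __name__ == "__main__":
    fq = FakeQuantize(bits=2)
    x = torch.randn(4, requires_grad=True)
    fq(x).sum().backward()
    print(x.grad, fq.step.grad)   # gradients flow to both the input and the step size
```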
    Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness?. (arXiv:2104.09425v2 [cs.LG] UPDATED)
    (3 min) While additional training data improves the robustness of deep neural networks against adversarial examples, it presents the challenge of curating a large number of specific real-world samples. We circumvent this challenge by using additional data from proxy distributions learned by state-of-the-art generative models. We first seek to formally understand the transfer of robustness from classifiers trained on proxy distributions to the real data distribution. We prove that the difference between the robustness of a classifier on the two distributions is upper bounded by the conditional Wasserstein distance between them. Motivated by our result, we next ask how to empirically select an appropriate generative model. We find that existing distance metrics, such as FID, fail to correctly determine the robustness transfer from proxy distributions. We propose a robust discrimination approach, which measures the distinguishability of synthetic and real samples under adversarial perturbations. Our approach accurately predicts the robustness transfer from different proxy distributions. After choosing a proxy distribution, the next question is which samples are most beneficial. We successfully optimize this selection by estimating the importance of each sample in robustness transfer. Finally, using our selection criterion for the proxy distribution and individual samples, we curate a set of ten million of the most beneficial synthetic samples for robust training on the CIFAR-10 dataset. Using this set, we improve robust accuracy by up to 7.5% and 6.7% in the $\ell_{\infty}$ and $\ell_2$ threat models, and certified robust accuracy by 7.6% in the $\ell_2$ threat model, over baselines not using proxy distributions on the CIFAR-10 dataset.
    A Video Summarization Method Using Temporal Interest Detection and Key Frame Prediction. (arXiv:2109.12581v1 [cs.CV])
    (2 min) In this paper, a Video Summarization Method using Temporal Interest Detection and Key Frame Prediction is proposed for supervised video summarization, where video summarization is formulated as a combination of sequence labeling and temporal interest detection problems. In our method, we first build a flexible, universal network framework that simultaneously predicts frame-level importance scores and temporal interest segments, and then combine the two components with different weights to achieve a more detailed video summarization. Extensive experiments and analysis on two benchmark datasets demonstrate the effectiveness of our method. Specifically, compared with other state-of-the-art methods, its performance is higher by at least 2.6% and 4.2% on TVSum and SumMe, respectively.
    Fully Differentiable and Interpretable Model for VIO with 4 Trainable Parameters. (arXiv:2109.12292v1 [cs.RO])
    (0 min) Monocular visual-inertial odometry (VIO) is a critical problem in robotics and autonomous driving. Traditional methods solve this problem based on filtering or optimization. While being fully interpretable, they rely on manual intervention and empirical parameter tuning. On the other hand, learning-based approaches allow for end-to-end training but require a large amount of training data to learn millions of parameters. Moreover, these non-interpretable and heavyweight models hinder generalization ability. In this paper, we propose a fully differentiable, interpretable, and lightweight monocular VIO model that contains only 4 trainable parameters. Specifically, we first adopt an Unscented Kalman Filter as a differentiable layer to predict the pitch and roll, where the covariance matrices of the noise are learned to filter out the noise of the raw IMU data. Second, the refined pitch and roll are used to retrieve a gravity-aligned BEV image of each frame using differentiable camera projection. Finally, a differentiable pose estimator is utilized to estimate the remaining 4-DoF poses between the BEV frames. Our method allows for learning the covariance matrices end-to-end, supervised by the pose estimation loss, demonstrating superior performance to empirical baselines. Experimental results on synthetic and real-world datasets demonstrate that our simple approach is competitive with state-of-the-art methods and generalizes well to unseen scenes.
    A Transductive Maximum Margin Classifier for Few-Shot Learning. (arXiv:2107.11975v2 [cs.CV] UPDATED)
    (0 min) Few-shot learning aims to train a classifier that can generalize well when just a small number of labeled examples per class are given. We introduce a transductive maximum margin classifier for few-shot learning (FS-TMMC). The basic idea of the classical maximum margin classifier is to solve for an optimal prediction function such that the training data can be correctly classified by the resulting classifier with the largest geometric margin. In few-shot learning, it is challenging to find such classifiers with good generalization ability due to the insufficiency of training data in the support set. FS-TMMC leverages the unlabeled query examples to adjust the separating hyperplane of the maximum margin classifier such that the prediction function is optimal on both the support and query sets. Furthermore, we use an efficient and effective quasi-Newton algorithm, the L-BFGS method, for optimization. Experimental results on three standard few-shot learning benchmarks including miniImagenet, tieredImagenet and CUB show that our method achieves state-of-the-art performance.
    DenseTNT: Waymo Open Dataset Motion Prediction Challenge 1st Place Solution. (arXiv:2106.14160v2 [cs.CV] UPDATED)
    (0 min) In autonomous driving, goal-based multi-trajectory prediction methods have recently proved effective: they first score goal candidates, then select a final set of goals, and finally complete trajectories based on the selected goals. However, these methods usually involve goal predictions based on sparse, predefined anchors. In this work, we propose an anchor-free model, named DenseTNT, which performs dense goal probability estimation for trajectory prediction. Our model achieves state-of-the-art performance, and ranks 1st on the Waymo Open Dataset Motion Prediction Challenge. Project page is at https://github.com/Tsinghua-MARS-Lab/DenseTNT.
    Video Summarization Using Deep Neural Networks: A Survey. (arXiv:2101.06072v2 [cs.CV] UPDATED)
    (0 min) Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions.
    Leveraging Multiple CNNs for Triaging Medical Workflow. (arXiv:2109.12783v1 [eess.IV])
    (0 min) High hospitalization rates due to the global spread of Covid-19 bring about a need for improvements to classical triaging workflows. To this end, convolutional neural networks (CNNs) can effectively differentiate critical from non-critical images so that critical cases may be addressed quickly, so long as there exists some representative image for the illness. Presented is a conglomerate neural network system consisting of multiple VGG16 CNNs; the system trains on weighted skin disease images re-labelled as critical or non-critical, and then attaches a critical index between 0 and 10 to input images. A critical index offers a more comprehensive rating system compared to binary critical/non-critical labels. Results for batches of input images run through the trained network are promising: the proposed architecture re-orders a batch from most critical to least critical with reasonable accuracy.
    Self-Supervised Learning for MRI Reconstruction with a Parallel Network Training Framework. (arXiv:2109.12502v1 [cs.CV])
    (2 min) Image reconstruction from undersampled k-space data plays an important role in accelerating the acquisition of MR data, and many deep learning-based methods have been explored recently. Despite the inspiring results achieved, the optimization of these methods commonly relies on fully-sampled reference data, which are time-consuming and difficult to collect. To address this issue, we propose a novel self-supervised learning method. Specifically, during model optimization, two subsets are constructed by randomly selecting part of the k-space data from the undersampled data and then fed into two parallel reconstruction networks to perform information recovery. Two reconstruction losses are defined on all the scanned data points to enhance the networks' capability of recovering the frequency information. Meanwhile, to constrain the learned unscanned data points of the network, a difference loss is designed to enforce consistency between the two parallel networks. In this way, the reconstruction model can be properly trained with only the undersampled data. During model evaluation, the undersampled data are treated as the inputs and either of the two trained networks can be used to reconstruct high-quality results. The proposed method is flexible and can be employed in any existing deep learning-based method. The effectiveness of the method is evaluated on an open brain MRI dataset. Experimental results demonstrate that the proposed self-supervised method can achieve competitive reconstruction performance compared to the corresponding supervised learning method at high acceleration rates (4 and 8). The code is publicly available at \url{https://github.com/chenhu96/Self-Supervised-MRI-Reconstruction}.
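    A highly simplified sketch of the described training objective, under several assumptions (single-coil data, k-space-to-k-space stand-in networks, L1 losses, arbitrary loss weight): two random subsets of the acquired k-space feed two parallel networks, each is supervised on the scanned points, and a difference loss ties the two predictions together.

```python
# Assumption-level sketch of the parallel-network self-supervised objective.
import torch

net_a = torch.nn.Conv2d(2, 2, 3, padding=1)   # stand-ins for the two recon networks
net_b = torch.nn.Conv2d(2, 2, 3, padding=1)

kspace = torch.randn(1, 2, 64, 64)                      # real/imag channels of undersampled data
acq_mask = (torch.rand(1, 1, 64, 64) < 0.3).float()     # which k-space points were scanned
split = (torch.rand_like(acq_mask) < 0.5).float()
mask_a, mask_b = acq_mask * split, acq_mask * (1 - split)

pred_a = net_a(kspace * mask_a)
pred_b = net_b(kspace * mask_b)

recon_loss = ((pred_a - kspace).abs() * acq_mask).mean() + \
             ((pred_b - kspace).abs() * acq_mask).mean()
diff_loss = (pred_a - pred_b).abs().mean()    # consistency on all points, scanned or not
loss = recon_loss + 0.1 * diff_loss           # 0.1 is an arbitrary illustrative weight
loss.backward()
```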
    Cluster Analysis with Deep Embeddings and Contrastive Learning. (arXiv:2109.12714v1 [cs.LG])
    (2 min) Unsupervised disentangled representation learning is a long-standing problem in computer vision. This work proposes a novel framework for performing image clustering from deep embeddings by combining instance-level contrastive learning with a deep embedding based cluster center predictor. Our approach jointly learns representations and predicts cluster centers in an end-to-end manner. This is accomplished via a three-pronged approach that combines a clustering loss, an instance-wise contrastive loss, and an anchor loss. Our fundamental intuition is that using an ensemble loss that incorporates instance-level features and a clustering procedure focusing on semantic similarity reinforces learning better representations in the latent space. We observe that our method performs exceptionally well on popular vision datasets when evaluated using standard clustering metrics such as Normalized Mutual Information (NMI), in addition to producing geometrically well-separated cluster embeddings as defined by the Euclidean distance. Our framework performs on par with widely accepted clustering methods and outperforms the state-of-the-art contrastive learning method on the CIFAR-10 dataset with an NMI score of 0.772, a 7-8% improvement on the strong baseline.
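    The instance-wise contrastive component can be illustrated with an NT-Xent-style loss over two augmented views of each image; the clustering and anchor losses from the abstract are not reproduced here, and the temperature and sizes are illustrative.

```python
# Sketch of an NT-Xent (SimCLR-style) instance contrastive loss.
import torch
import torch.nn.functional as F


def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, D) embeddings of two augmentations of the same N images."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D)
    sim = z @ z.t() / temperature                                # (2N, 2N)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))                   # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(nt_xent(z1, z2))
```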
    Deep Joint Source-Channel Coding for Multi-Task Network. (arXiv:2109.05779v2 [cs.CV] UPDATED)
    (0 min) Multi-task learning (MTL) is an efficient way to improve the performance of related tasks by sharing knowledge. However, most existing MTL networks run on a single end and are not suitable for collaborative intelligence (CI) scenarios. In this work, we propose an MTL network with a deep joint source-channel coding (JSCC) framework, which allows operating under CI scenarios. We first propose a feature fusion based MTL network (FFMNet) for joint object detection and semantic segmentation. Compared with other MTL networks, FFMNet achieves higher performance with fewer parameters. FFMNet is then split into two parts, which run on a mobile device and an edge server respectively. The feature generated by the mobile device is transmitted through the wireless channel to the edge server. To reduce the transmission overhead of the intermediate feature, a deep JSCC network is designed. By combining the two networks, the whole model achieves 512x compression of the intermediate feature with a performance loss within 2% on both tasks. Finally, by training with noise, the FFMNet with JSCC is robust to various channel conditions and outperforms the separate source and channel coding scheme.
    ReCal-Net: Joint Region-Channel-Wise Calibrated Network for Semantic Segmentation in Cataract Surgery Videos. (arXiv:2109.12448v1 [eess.IV])
    (0 min) Semantic segmentation in surgical videos is a prerequisite for a broad range of applications towards improving surgical outcomes and surgical video analysis. However, semantic segmentation in surgical videos involves many challenges. In particular, in cataract surgery, various features of the relevant objects such as blunt edges, color and context variation, reflection, transparency, and motion blur pose a challenge for semantic segmentation. In this paper, we propose a novel convolutional module, termed the ReCal module, which can calibrate the feature maps by employing intra- and inter-region dependencies and channel-region cross-dependencies. This calibration strategy can effectively enhance semantic representation by correlating different representations of the same semantic label, considering a multi-angle local view centering around each pixel. Thus the proposed module can deal with distant visual characteristics of unique objects as well as cross-similarities in the visual characteristics of different objects. Moreover, we propose a novel network architecture based on the proposed module, termed ReCal-Net. Experimental results confirm the superiority of ReCal-Net compared to rival state-of-the-art approaches for all relevant objects in cataract surgery. Moreover, ablation studies reveal the effectiveness of the ReCal module in boosting semantic segmentation accuracy.
    Simultaneous Semantic and Collision Learning for 6-DoF Grasp Pose Estimation. (arXiv:2108.02425v2 [cs.RO] UPDATED)
    (0 min) Grasping in cluttered scenes has always been a great challenge for robots, due to the requirement of the ability to well understand the scene and object information. Previous works usually assume that the geometry information of the objects is available, or utilize a step-wise, multi-stage strategy to predict the feasible 6-DoF grasp poses. In this work, we propose to formalize the 6-DoF grasp pose estimation as a simultaneous multi-task learning problem. In a unified framework, we jointly predict the feasible 6-DoF grasp poses, instance semantic segmentation, and collision information. The whole framework is jointly optimized and end-to-end differentiable. Our model is evaluated on large-scale benchmarks as well as the real robot system. On the public dataset, our method outperforms prior state-of-the-art methods by a large margin (+4.08 AP). We also demonstrate the implementation of our model on a real robotic platform and show that the robot can accurately grasp target objects in cluttered scenarios with a high success rate. Project link: https://openbyterobotics.github.io/sscl
    Multi-Modal Multi-Instance Learning for Retinal Disease Recognition. (arXiv:2109.12307v1 [cs.CV])
    (0 min) This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic efficacy of CFP and OCT is disease-dependent, the network's ability to be both selective and interpretable is important. Moreover, as both data acquisition and manual labeling are extremely expensive in the medical domain, the network has to be relatively lightweight for learning from a limited set of labeled multi-modal samples. Prior art on retinal disease recognition focuses either on a single disease or on a single modality, leaving multi-modal fusion largely underexplored. We propose in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head attention modules) makes it suited for learning from relatively small-sized datasets. For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by oversampling a given CFP. The benefits of this tactic include balancing instances across modalities, increasing the resolution of the CFP input, and finding regions of the CFP most relevant to the final diagnosis. Extensive experiments on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836 subjects demonstrate the viability of the proposed model.
    Short-Term Load Forecasting Using Time Pooling Deep Recurrent Neural Network. (arXiv:2109.12498v1 [cs.LG])
    (0 min) The integration of renewable energy sources and emerging loads like electric vehicles into smart grids brings more uncertainty to distribution system management. Demand Side Management (DSM) is one of the approaches to reduce this uncertainty. Some applications, like Nonintrusive Load Monitoring (NILM), can support DSM; however, they require accurate forecasting on high-resolution data. This is challenging when it comes to single loads like one residential household, due to their high volatility. In this paper, we review some of the existing deep learning-based methods and present our solution using a Time Pooling Deep Recurrent Neural Network. The proposed method augments data using a time pooling strategy and can overcome overfitting problems and model uncertainties of data more efficiently. Simulation and implementation results show that our method outperforms the existing algorithms in terms of RMSE and MAE metrics.
    Predicting survival of glioblastoma from automatic whole-brain and tumor segmentation of MR images. (arXiv:2109.12334v1 [eess.IV])
    (0 min) Survival prediction models can potentially be used to guide treatment of glioblastoma patients. However, currently available MR imaging biomarkers holding prognostic information are often challenging to interpret, have difficulties generalizing across data acquisitions, or are only applicable to pre-operative MR data. In this paper we aim to address these issues by introducing novel imaging features that can be automatically computed from MR images and fed into machine learning models to predict patient survival. The features we propose have a direct biological interpretation: They measure the deformation caused by the tumor on the surrounding brain structures, comparing the shape of various structures in the patient's brain to their expected shape in healthy individuals. To obtain the required segmentations, we use an automatic method that is contrast-adaptive and robust to missing modalities, making the features generalizable across scanners and imaging protocols. Since the features we propose do not depend on characteristics of the tumor region itself, they are also applicable to post-operative images, which have been much less studied in the context of survival prediction. Using experiments involving both pre- and post-operative data, we show that the proposed features carry prognostic value in terms of overall- and progression-free survival, over and above that of conventional non-imaging features.
    Multi-source Few-shot Domain Adaptation. (arXiv:2109.12391v1 [cs.CV])
    (0 min) Multi-source Domain Adaptation (MDA) aims to transfer predictive models from multiple, fully-labeled source domains to an unlabeled target domain. However, in many applications, relevant labeled source datasets may not be available, and collecting source labels can be as expensive as labeling the target data itself. In this paper, we investigate Multi-source Few-shot Domain Adaptation (MFDA): a new domain adaptation scenario with limited multi-source labels and unlabeled target data. As we show, existing methods often fail to learn discriminative features for both source and target domains in the MFDA setting. Therefore, we propose a novel framework, termed Multi-Source Few-shot Adaptation Network (MSFAN), which can be trained end-to-end in a non-adversarial manner. MSFAN operates by first using a type of prototypical, multi-domain, self-supervised learning to learn features that are not only domain-invariant but also class-discriminative. Second, MSFAN uses a small, labeled support set to enforce feature consistency and domain invariance across domains. Finally, prototypes from multiple sources are leveraged to learn better classifiers. Compared with state-of-the-art MDA methods, MSFAN improves the mean classification accuracy over different domain pairs on MFDA by 20.2%, 9.4%, and 16.2% on Office, Office-Home, and DomainNet, respectively.
    A novel network training approach for open set image recognition. (arXiv:2109.12756v1 [cs.CV])
    (0 min) Convolutional Neural Networks (CNNs) are commonly designed for closed set arrangements, where test instances only belong to some "Known Known" (KK) classes used in training. As such, they predict a class label for a test sample based on the distribution of the KK classes. However, when used under the Open Set Recognition (OSR) setup (where an input may belong to an "Unknown Unknown" or UU class), such a network will always classify a test instance as one of the KK classes even if it is from a UU class. As a solution, recently, data augmentation based on Generative Adversarial Networks (GANs) has been used. In this work, we propose a novel approach for mining a "Known Unknown Trainer" or KUT set and design a deep OSR Network (OSRNet) to harness this dataset. The goal is to teach OSRNet the essence of the UUs through the KUT set, which is effectively a collection of mined "hard Known Unknown negatives". Once trained, OSRNet can detect the UUs while maintaining high classification accuracy on KKs. We evaluate OSRNet on six benchmark datasets and demonstrate it outperforms contemporary OSR methods.
    TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval. (arXiv:2104.08271v2 [cs.CV] UPDATED)
    (0 min) In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders. By contrast, despite the natural symmetry, the design of effective algorithms for exploiting large-scale language pretraining remains under-explored. In this work, we are the first to investigate the design of such algorithms and propose a novel generalized distillation method, TeachText, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model. Moreover, we extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time without compromising performance. Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time. Last but not least, we show an effective application of our method for eliminating noise from retrieval datasets. Code and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.
    Misclassification-Aware Gaussian Smoothing and Mixed Augmentations improves Robustness against Domain Shifts. (arXiv:2104.01231v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks achieve high prediction accuracy when the train and test distributions coincide. In practice though, various types of corruptions can deviate from this setup and cause severe performance degradations. Few methods have been proposed to address generalization in the presence of unforeseen domain shifts. In this paper, we propose a misclassification-aware Gaussian smoothing approach, coupled with mixed data augmentations, for improving robustness of image classifiers against a variety of corruptions while still maintaining high clean accuracy. We show that our method improves upon the state-of-the-art in robustness and uncertainty calibration on several image classification benchmarks and network architectures.
    EllipseNet: Anchor-Free Ellipse Detection for Automatic Cardiac Biometrics in Fetal Echocardiography. (arXiv:2109.12474v1 [eess.IV])
    (0 min) As an important scan plane, the four-chamber view is routinely performed in both second-trimester perinatal screening and fetal echocardiographic examinations. The biometrics in this plane, including the cardio-thoracic ratio (CTR) and cardiac axis, are usually measured by sonographers for diagnosing congenital heart disease. However, due to commonly occurring artifacts like acoustic shadowing, traditional manual measurements not only suffer from low efficiency but also yield inconsistent results depending on the operators' skills. In this paper, we present an anchor-free ellipse detection network, namely EllipseNet, which detects the cardiac and thoracic regions as ellipses and automatically calculates the CTR and cardiac axis for fetal cardiac biometrics in the 4-chamber view. In particular, we formulate the network to detect the center of each object as a point and regress the ellipse parameters simultaneously. We define an intersection-over-union loss to further regulate the regression procedure. We evaluate EllipseNet on a clinical echocardiogram dataset with more than 2000 subjects. Experimental results show that the proposed framework outperforms several state-of-the-art methods. Source code will be available at https://git.openi.org.cn/capepoint/EllipseNet.
    An optimised deep spiking neural network architecture without gradients. (arXiv:2109.12813v1 [cs.NE])
    (0 min) We present an end-to-end trainable modular event-driven neural architecture that uses local synaptic and threshold adaptation rules to perform transformations between arbitrary spatio-temporal spike patterns. The architecture represents a highly abstracted model of existing Spiking Neural Network (SNN) architectures. The proposed Optimized Deep Event-driven Spiking neural network Architecture (ODESA) can simultaneously learn hierarchical spatio-temporal features at multiple arbitrary time scales. ODESA performs online learning without the use of error back-propagation or the calculation of gradients. Through the use of simple local adaptive selection thresholds at each node, the network rapidly learns to appropriately allocate its neuronal resources at each layer for any given problem without using a real-valued error measure. These adaptive selection thresholds are the central feature of ODESA, ensuring network stability and remarkable robustness to noise as well as to the selection of initial system parameters. Network activations are inherently sparse due to a hard Winner-Take-All (WTA) constraint at each layer. We evaluate the architecture on existing spatio-temporal datasets, including the spike-encoded IRIS and TIDIGITS datasets, as well as a novel set of tasks based on International Morse Code that we created. These tests demonstrate the hierarchical spatio-temporal learning capabilities of ODESA. Through these tests, we demonstrate ODESA can optimally solve practical and highly challenging hierarchical spatio-temporal learning tasks with the minimum possible number of computing nodes.
    VirTex: Learning Visual Representations from Textual Annotations. (arXiv:2006.06666v3 [cs.CV] UPDATED)
    (0 min) The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.
    FastAno: Fast Anomaly Detection via Spatio-temporal Patch Transformation. (arXiv:2106.08613v3 [cs.CV] UPDATED)
    (0 min) Video anomaly detection has gained significant attention due to the increasing requirements of automatic monitoring for surveillance videos. Especially, the prediction based approach is one of the most studied methods to detect anomalies by predicting frames that include abnormal events in the test set after learning with the normal frames of the training set. However, a lot of prediction networks are computationally expensive owing to the use of pre-trained optical flow networks, or fail to detect abnormal situations because of their strong generative ability to predict even the anomalies. To address these shortcomings, we propose spatial rotation transformation (SRT) and temporal mixing transformation (TMT) to generate irregular patch cuboids within normal frame cuboids in order to enhance the learning of normal features. Additionally, the proposed patch transformation is used only during the training phase, allowing our model to detect abnormal frames at fast speed during inference. Our model is evaluated on three anomaly detection benchmarks, achieving competitive accuracy and surpassing all the previous works in terms of speed.
    Joint Multi-Dimension Pruning via Numerical Gradient Update. (arXiv:2005.08931v2 [cs.CV] UPDATED)
    (0 min) We present joint multi-dimension pruning (abbreviated as JointPruning), an effective method of pruning a network on three crucial aspects: spatial, depth and channel, simultaneously. To tackle these three naturally different dimensions, we propose a general framework that defines pruning as seeking the best pruning vector (i.e., the numerical values of layer-wise channel number, spatial size, and depth) and constructs a unique mapping from the pruning vector to the pruned network structures. Then we optimize the pruning vector with gradient update and model joint pruning as a numerical gradient optimization process. To overcome the challenge that there is no explicit function between the loss and the pruning vectors, we propose self-adapted stochastic gradient estimation to construct a gradient path from the network loss to the pruning vectors and enable efficient gradient update. We show that the joint strategy discovers a better status than previous studies that focused on individual dimensions solely, as our method is optimized collaboratively across the three dimensions in a single end-to-end training and is more efficient than previous exhaustive methods. Extensive experiments on the large-scale ImageNet dataset across a variety of network architectures (MobileNet V1&V2&V3 and ResNet) demonstrate the effectiveness of our proposed method. For instance, we achieve significant margins of 2.5% and 2.6% improvement over the state-of-the-art approach on the already compact MobileNet V1&V2 under an extremely large compression ratio.
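    The numerical-gradient idea can be sketched with plain finite differences over a continuous pruning vector and a black-box loss; the toy objective below is a stand-in for evaluating the pruned network, and the probe-based estimator is an assumption rather than the paper's exact scheme.

```python
# Minimal sketch of stochastic finite-difference gradient estimation over a
# continuous "pruning vector" when no explicit loss function is available.
import numpy as np

rng = np.random.default_rng(0)


def blackbox_loss(pruning_vector):
    # Placeholder: imagine this evaluates the pruned network on a mini-batch.
    target = np.array([0.5, 0.75, 0.25])
    return float(((pruning_vector - target) ** 2).sum())


def estimate_gradient(v, eps=1e-2, n_samples=8):
    """Central differences along random probe directions."""
    grad = np.zeros_like(v)
    for _ in range(n_samples):
        u = rng.standard_normal(v.shape)
        grad += (blackbox_loss(v + eps * u) - blackbox_loss(v - eps * u)) / (2 * eps) * u
    return grad / n_samples


v = np.array([0.9, 0.9, 0.9])          # e.g. channel / spatial / depth ratios
for step in range(50):
    v -= 0.1 * estimate_gradient(v)
    v = np.clip(v, 0.0, 1.0)
print("optimized pruning vector:", v)
```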
    Siamese Network Training Using Artificial Triplets By Sampling and Image Transformation. (arXiv:2106.07015v2 [cs.CV] UPDATED)
    (0 min) The device used in this work detects objects on the surface of the water using two thermal cameras, which help users detect and avoid objects in scenarios where the human eye cannot (night, fog, etc.). To avoid obstacle collisions autonomously, it is required to track the objects in real time and assign a specific identity to each object to determine its dynamics (trajectory, velocity, etc.) for making estimated collision predictions. In this work, a Machine Learning (ML) approach for Computer Vision (CV), a Convolutional Neural Network (CNN), was used, with TensorFlow as the high-level programming environment in Python. To validate the algorithm, a test set was generated using an annotation tool created during the work for proper evaluation. Once validated, the algorithm was deployed on the platform and tested with the sequence generated by the test boat.
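    For context, training a Siamese embedding with artificial triplets can be sketched as follows (the paper used TensorFlow; this illustrative sketch uses PyTorch, and the flip transformation and tiny encoder are stand-ins): the positive is a transformed copy of the anchor and the negative is a different sampled image.

```python
# Sketch of triplet training with artificial triplets built by image transformation.
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 32),
)
triplet = torch.nn.TripletMarginLoss(margin=1.0)

anchors = torch.randn(16, 1, 64, 64)              # e.g. thermal image crops
positives = torch.flip(anchors, dims=[-1])        # artificial positive: a transformed anchor
negatives = anchors[torch.randperm(16)]           # a different image serves as the negative

loss = triplet(F.normalize(encoder(anchors), dim=1),
               F.normalize(encoder(positives), dim=1),
               F.normalize(encoder(negatives), dim=1))
loss.backward()
print(float(loss))
```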
    Closer Look at the Uncertainty Estimation in Semantic Segmentation under Distributional Shift. (arXiv:2106.00076v2 [cs.CV] UPDATED)
    (0 min) While recent computer vision algorithms achieve impressive performance on many benchmarks, they lack robustness - presented with an image from a different distribution, (e.g. weather or lighting conditions not considered during training), they may produce an erroneous prediction. Therefore, it is desired that such a model will be able to reliably predict its confidence measure. In this work, uncertainty estimation for the task of semantic segmentation is evaluated under a varying level of domain shift: in a cross-dataset setting and when adapting a model trained on data from the simulation. It was shown that simple color transformations already provide a strong baseline, comparable to using more sophisticated style-transfer data augmentation. Further, by constructing an ensemble consisting of models using different backbones and/or augmentation methods, it was possible to improve significantly model performance in terms of overall accuracy and uncertainty estimation under the domain shift setting. The Expected Calibration Error (ECE) on challenging GTA to Cityscapes adaptation was reduced from 4.05 to the competitive value of 1.1. Further, an ensemble of models was utilized in the self-training setting to improve the pseudo-labels generation, which resulted in a significant gain in the final model accuracy, compared to the standard fine-tuning (without ensemble).
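    Since the Expected Calibration Error is central to the comparison above, here is a small utility sketch: bin predictions by confidence and average the gap between accuracy and confidence, weighted by bin size. The 15 equal-width bins are a common convention assumed here.

```python
# Sketch of the Expected Calibration Error (ECE) metric.
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=15):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


if __name__ == "__main__":
    conf = np.array([0.9, 0.8, 0.95, 0.6, 0.55])
    corr = np.array([1, 1, 0, 1, 0])
    print(expected_calibration_error(conf, corr))
```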
    Assessing domain adaptation techniques for mitosis detection in multi-scanner breast cancer histopathology images. (arXiv:2109.00869v2 [eess.IV] UPDATED)
    (0 min) Breast cancer is the most prevalent cancer worldwide and is increasing in incidence, with over two million new cases now diagnosed each year. As part of diagnostic tumour grading, histopathologists manually count the number of dividing cells (mitotic figures) in a sample. Since the process is subjective and time-consuming, artificial intelligence (AI) methods have been developed to automate the process, however these methods often perform poorly when applied to data from outside of the original (training) domain, i.e. they do not generalise well to variations in histological background, staining protocols, or scanner types. Style transfer, a form of domain adaptation, provides the means to transform images from different domains to a shared visual appearance and have been adopted in various applications to mitigate the issue of domain shift. In this paper we train two mitosis detection models and two style transfer methods and evaluate the usefulness of the latter for improving mitosis detection performance in images digitised using different scanners. We found that the best of these models, U-Net without style transfer, achieved an F1-score of 0.693 on the MIDOG 2021 preliminary test set.
    Contrastive Unpaired Translation using Focal Loss for Patch Classification. (arXiv:2109.12431v1 [cs.CV])
    (0 min) Image-to-image translation models transfer images from an input domain to an output domain in an endeavor to retain the original content of the image. Contrastive Unpaired Translation is one of the existing methods for solving such problems. A significant advantage of this method, compared to competitors, is the ability to train and perform well in cases where both the input and output domains consist of only a single image. Another key aspect that differentiates this method from its predecessors is the usage of image patches rather than whole images. It also turns out that sampling negatives (patches required to calculate the loss) from the same image achieves better results than a scenario where the negatives are sampled from other images in the dataset. This type of approach encourages the mapping of corresponding patches to the same location in relation to other patches (negatives), while at the same time improving the output image quality and significantly decreasing memory usage as well as the time required to train the model, compared to the CycleGAN method used as a baseline. Through a series of experiments we show that using focal loss in place of cross-entropy loss within the PatchNCE loss can improve the model's performance and even surpass the current state-of-the-art model for image-to-image translation.
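    Swapping the cross-entropy inside an InfoNCE-style patch classification loss for a focal loss can be sketched as below; the feature shapes, temperature, and focusing parameter gamma are assumptions, not the paper's settings.

```python
# Sketch of a focal-loss variant of an InfoNCE-style patch classification loss.
import torch
import torch.nn.functional as F


def focal_nce(query, positive, negatives, temperature=0.07, gamma=2.0):
    """query, positive: (N, D); negatives: (N, K, D). Class 0 is the positive patch."""
    pos_logit = (query * positive).sum(dim=1, keepdim=True)             # (N, 1)
    neg_logits = torch.bmm(negatives, query.unsqueeze(2)).squeeze(2)    # (N, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    log_p = F.log_softmax(logits, dim=1)
    p_t = log_p[:, 0].exp()                                             # prob of the positive
    # Focal weighting down-weights easy positives, emphasising hard patches.
    return (-(1.0 - p_t) ** gamma * log_p[:, 0]).mean()


if __name__ == "__main__":
    q = F.normalize(torch.randn(8, 256), dim=1)
    pos = F.normalize(torch.randn(8, 256), dim=1)
    neg = F.normalize(torch.randn(8, 255, 256), dim=2)
    print(focal_nce(q, pos, neg))
```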
    Research on facial expression recognition based on Multimodal data fusion and neural network. (arXiv:2109.12724v1 [cs.CV])
    (0 min) Facial expression recognition is a challenging task when neural networks are applied to pattern recognition. Most current recognition research is based on single-source facial data, which generally suffers from low accuracy and low robustness. In this paper, a neural network algorithm for facial expression recognition based on multimodal data fusion is proposed. The algorithm takes the facial image, the histogram of oriented gradients of the image, and the facial landmarks as inputs, and establishes three sub-networks (CNN, LNN and HNN) to extract data features, using a multimodal feature fusion mechanism to improve the accuracy of facial expression recognition. Experimental results show that, benefiting from the complementarity of multimodal data, the algorithm achieves a great improvement in accuracy, robustness and detection speed compared with traditional facial expression recognition algorithms. Especially in cases of partial occlusion, illumination change and head posture transformation, the algorithm also maintains high confidence.
    A Principled Approach to Failure Analysis and Model Repairment: Demonstration in Medical Imaging. (arXiv:2109.12347v1 [cs.LG])
    (0 min) Machine learning models commonly exhibit unexpected failures post-deployment due to either data shifts or uncommon situations in the training environment. Domain experts typically go through the tedious process of inspecting the failure cases manually, identifying failure modes and then attempting to fix the model. In this work, we aim to standardise and bring principles to this process through answering two critical questions: (i) how do we know that we have identified meaningful and distinct failure types?; (ii) how can we validate that a model has, indeed, been repaired? We suggest that the quality of the identified failure types can be validated through measuring the intra- and inter-type generalisation after fine-tuning and introduce metrics to compare different subtyping methods. Furthermore, we argue that a model can be considered repaired if it achieves high accuracy on the failure types while retaining performance on the previously correct data. We combine these two ideas into a principled framework for evaluating the quality of both the identified failure subtypes and model repairment. We evaluate its utility on classification and object detection tasks. Our code is available at https://github.com/Rokken-lab6/Failure-Analysis-and-Model-Repairment
    Classification of COVID-19 from CXR Images in a 15-class Scenario: an Attempt to Avoid Bias in the System. (arXiv:2109.12453v1 [eess.IV])
    (0 min) As of June 2021, the World Health Organization (WHO) has reported 171.7 million confirmed cases including 3,698,621 deaths from COVID-19. Detecting COVID-19 and other lung diseases from Chest X-Ray (CXR) images can be very effective for emergency diagnosis and treatment as CXR is fast and cheap. The objective of this study is to develop a system capable of detecting COVID-19 along with 14 other lung diseases from CXRs in a fair and unbiased manner. The proposed system consists of a CXR image selection technique and a deep learning based model to classify 15 diseases including COVID-19. The proposed CXR selection technique aims to retain the maximum variation uniformly and eliminate poor quality CXRs with the goal of reducing the training dataset size without compromising classifier accuracy. More importantly, it reduces the often hidden bias and unfairness in decision making. The proposed solution exhibits a promising COVID-19 detection scheme in a more realistic situation than most existing studies as it deals with 15 lung diseases together. We hope the proposed method will have wider adoption in medical image classification and other related fields.
    Anchor-based Plain Net for Mobile Image Super-Resolution. (arXiv:2105.09750v2 [eess.IV] UPDATED)
    (0 min) Along with the rapid development of real-world applications, higher requirements on the accuracy and efficiency of image super-resolution (SR) are brought forward. Though existing methods have achieved remarkable success, the majority of them demand plenty of computational resources and a large amount of RAM, and thus cannot be readily applied to mobile devices. In this paper, we aim at designing an efficient architecture for 8-bit quantization and deploying it on mobile devices. First, we conduct an experiment on meta-node latency by decomposing lightweight SR architectures, which determines the portable operations we can utilize. Then, we dig deeper into what kind of architecture is beneficial to 8-bit quantization and propose the anchor-based plain net (ABPN). Finally, we adopt a quantization-aware training strategy to further boost the performance. Our model can outperform 8-bit quantized FSRCNN by nearly 2dB in terms of PSNR, while satisfying realistic needs at the same time. Code is available at https://github.com/NJU-Jet/SR_Mobile_Quantization.
    Self-Supervised Video Representation Learning by Video Incoherence Detection. (arXiv:2109.12493v1 [cs.CV])
    (0 min) This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, the training sample, denoted as the incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video with various lengths of incoherence between each other. The network is trained to learn high-level representations by predicting the location and length of incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between incoherent clips from the same raw video. We evaluate our proposed method through extensive experiments on action recognition and video retrieval utilizing various backbone networks. Experiments show that our proposed method achieves state-of-the-art performance across different backbone networks and different datasets compared with previous coherence-based methods.
    BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation. (arXiv:2109.12271v1 [eess.IV])
    (0 min) Convolutional neural networks (CNNs) have recently achieved remarkable success in automatically identifying organs or lesions on 3D medical images. Meanwhile, vision transformer networks have exhibited exceptional performance in 2D image classification tasks. Compared with CNNs, transformer networks have an obvious advantage of extracting long-range features due to their self-attention algorithm. Therefore, in this paper we present a CNN-Transformer combined model called BiTr-Unet for brain tumor segmentation on multi-modal MRI scans. The proposed BiTr-Unet achieves good performance on the BraTS 2021 validation dataset with mean Dice score 0.9076, 0.8392 and 0.8231, and mean Hausdorff distance 4.5322, 13.4592 and 14.9963 for the whole tumor, tumor core, and enhancing tumor, respectively.
    Joint Multimedia Event Extraction from Video and Article. (arXiv:2109.12776v1 [cs.CV])
    (0 min) Visual and textual modalities contribute complementary information about events described in multimedia documents. Videos contain rich dynamics and detailed unfoldings of events, while text describes more high-level and abstract concepts. However, existing event extraction methods either do not handle video or solely target video while ignoring other modalities. In contrast, we propose the first approach to jointly extract events from video and text articles. We introduce the new task of Video MultiMedia Event Extraction (Video M2E2) and propose two novel components to build the first system towards this task. First, we propose the first self-supervised multimodal event coreference model that can determine coreference between video events and text events without any manually annotated pairs. Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents. We also construct, and will publicly release, a new benchmark of 860 video-article pairs with extensive annotations for evaluating methods on this task. Our experimental results demonstrate the effectiveness of our proposed method on our new benchmark dataset. We achieve 6.0% and 5.8% absolute F-score gains on multimodal event coreference resolution and multimedia event extraction.
    ISF-GAN: An Implicit Style Function for High-Resolution Image-to-Image Translation. (arXiv:2109.12492v1 [cs.CV])
    (0 min) Recently, there has been an increasing interest in image editing methods that employ pre-trained unconditional image generators (e.g., StyleGAN). However, applying these methods to translate images to multiple visual domains remains challenging. Existing works often do not preserve the domain-invariant part of the image (e.g., the identity in human face translations), do not usually handle multiple domains, or do not allow for multi-modal translations. This work proposes an implicit style function (ISF) to straightforwardly achieve multi-modal and multi-domain image-to-image translation from pre-trained unconditional generators. The ISF manipulates the semantics of an input latent code to make the image generated from it lie in the desired visual domain. Our results on human face and animal manipulations show significant improvements over the baselines. Our model enables cost-effective multi-modal unsupervised image-to-image translations at high resolution using pre-trained unconditional GANs. The code and data are available at: \url{https://github.com/yhlleo/stylegan-mmuit}.
    Vision-Guided Forecasting -- Visual Context for Multi-Horizon Time Series Forecasting. (arXiv:2107.12674v2 [cs.CV] UPDATED)
    (0 min) Autonomous driving has gained huge traction in recent years, due to its potential to change the way we commute. Much effort has been put into trying to estimate the state of a vehicle. Meanwhile, learning to forecast the state of a vehicle ahead introduces new capabilities, such as predicting dangerous situations. Moreover, forecasting brings new supervision opportunities by learning to predict a richer context, expressed by multiple horizons. Intuitively, a video stream originating from a front-facing camera is necessary because it encodes information about the upcoming road. Besides, historical traces of the vehicle's states give more context. In this paper, we tackle multi-horizon forecasting of vehicle states by fusing the two modalities. We design and experiment with 3 end-to-end architectures that exploit 3D convolutions for visual feature extraction and 1D convolutions for feature extraction from speed and steering angle traces. To demonstrate the effectiveness of our method, we perform extensive experiments on two publicly available real-world datasets, Comma2k19 and the Udacity challenge. We show that we are able to forecast a vehicle's state to various horizons, while outperforming the current state-of-the-art results on the related task of driving state estimation. We examine the contribution of vision features, and find that a model fed with vision features achieves an error that is 56.6% and 66.9% of the error of a model that doesn't use those features, on the Udacity and Comma2k19 datasets respectively.
    Latent Variable Models for Visual Question Answering. (arXiv:2101.06399v2 [cs.CV] UPDATED)
    (0 min) Current work on Visual Question Answering (VQA) explores deterministic approaches conditioned on various types of image and question features. We posit that, in addition to image and question pairs, other modalities are useful for teaching machines to carry out question answering. Hence in this paper, we propose latent variable models for VQA where extra information (e.g. captions and answer categories) is incorporated as latent variables, which are observed during training but in turn benefit question-answering performance at test time. Experiments on the VQA v2.0 benchmarking dataset demonstrate the effectiveness of our proposed models: they improve over strong baselines, especially those that do not rely on extensive language-vision pre-training.
    OpenRooms: An End-to-End Open Framework for Photorealistic Indoor Scene Datasets. (arXiv:2007.12868v3 [cs.CV] UPDATED)
    (0 min) We propose a novel framework for creating large-scale photorealistic datasets of indoor scenes, with ground truth geometry, material, lighting and semantics. Our goal is to make the dataset creation process widely accessible, transforming scans into photorealistic datasets with high-quality ground truth for appearance, layout, semantic labels, high quality spatially-varying BRDF and complex lighting, including direct, indirect and visibility components. This enables important applications in inverse rendering, scene understanding and robotics. We show that deep networks trained on the proposed dataset achieve competitive performance for shape, material and lighting estimation on real images, enabling photorealistic augmented reality applications, such as object insertion and material editing. We also show our semantic labels may be used for segmentation and multi-task learning. Finally, we demonstrate that our framework may also be integrated with physics engines, to create virtual robotics environments with unique ground truth such as friction coefficients and correspondence to real scenes. The dataset and all the tools to create such datasets will be made publicly available.
    L$^{2}$NAS: Learning to Optimize Neural Architectures via Continuous-Action Reinforcement Learning. (arXiv:2109.12425v1 [cs.LG])
    (0 min) Neural architecture search (NAS) has achieved remarkable results in deep neural network design. Differentiable architecture search converts the search over discrete architectures into a hyperparameter optimization problem which can be solved by gradient descent. However, questions have been raised regarding the effectiveness and generalizability of gradient methods for solving non-convex architecture hyperparameter optimization problems. In this paper, we propose L$^{2}$NAS, which learns to intelligently optimize and update architecture hyperparameters via an actor neural network based on the distribution of high-performing architectures in the search history. We introduce a quantile-driven training procedure which efficiently trains L$^{2}$NAS in an actor-critic framework via continuous-action reinforcement learning. Experiments show that L$^{2}$NAS achieves state-of-the-art results on the NAS-Bench-201 benchmark as well as the DARTS and Once-for-All MobileNetV3 search spaces. We also show that search policies generated by L$^{2}$NAS are generalizable and transferable across different training datasets with minimal fine-tuning.
    A Compositional Feature Embedding and Similarity Metric for Ultra-Fine-Grained Visual Categorization. (arXiv:2109.12380v1 [cs.CV])
    (0 min) Fine-grained visual categorization (FGVC), which aims at classifying objects with small inter-class variances, has been significantly advanced in recent years. However, ultra-fine-grained visual categorization (ultra-FGVC), which aims to identify subclasses with extremely similar patterns, has not received much attention. In ultra-FGVC datasets, the samples per category become scarcer as the granularity moves down, which leads to overfitting problems. Moreover, the differences among categories are too subtle to distinguish even for professional experts. Motivated by these issues, this paper proposes a novel compositional feature embedding and similarity metric (CECS). Specifically, in the compositional feature embedding module, we randomly select patches in the original input image, and these patches are then replaced by patches from images of different categories or masked out. The replaced and masked images are used to augment the original input images, which provides more diverse samples and thus largely alleviates the overfitting problem resulting from limited training samples. Besides, learning with diverse samples forces the model to learn not only the most discriminative features but also other informative features in the remaining regions, enhancing the generalization and robustness of the model. In the compositional similarity metric module, a new similarity metric is developed to improve the classification performance by narrowing the intra-category distance and enlarging the inter-category distance. Experimental results on two ultra-FGVC datasets and one FGVC dataset with recent benchmark methods consistently demonstrate that the proposed CECS method achieves state-of-the-art performance.
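    As a rough illustration of the patch replace-and-mask augmentation described above, the following sketch operates on PyTorch image tensors in (C, H, W) layout; the patch size, patch count, and masking probability are illustrative choices, not the authors' settings.

```python
import torch

def compositional_augment(img, other_img, patch=32, n_patches=4, mask_prob=0.5):
    """Randomly replace or mask square patches of `img` (C, H, W).

    `other_img` is an image drawn from a *different* category; the
    hyper-parameters here are illustrative, not the paper's settings.
    """
    _, h, w = img.shape
    out = img.clone()
    for _ in range(n_patches):
        y = torch.randint(0, h - patch + 1, (1,)).item()
        x = torch.randint(0, w - patch + 1, (1,)).item()
        if torch.rand(1).item() < mask_prob:
            out[:, y:y + patch, x:x + patch] = 0.0   # mask the patch out
        else:                                        # graft a patch from another category
            out[:, y:y + patch, x:x + patch] = other_img[:, y:y + patch, x:x + patch]
    return out
```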
    Tensor Full Feature Measure and Its Nonconvex Relaxation Applications to Tensor Recovery. (arXiv:2109.12257v1 [cs.CV])
    (0 min) Tensor sparse modeling, as a promising approach, has been a huge success across science and engineering. As is well known, various data in practical applications are often generated by multiple factors, so tensors are used to represent data that contain the internal structure of multiple factors. However, unlike the matrix case, constructing a reasonable sparsity measure for tensors is a relatively difficult and very important task. Therefore, in this paper, we propose a new tensor sparsity measure called the Tensor Full Feature Measure (FFM). It can simultaneously describe the feature information of each dimension of the tensor and the related features between two dimensions, and it connects the Tucker rank with the tensor tube rank. This measure describes the sparse features of the tensor more comprehensively. On this basis, we establish its non-convex relaxation and apply FFM to low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA). LRTC and TRPCA models based on FFM are proposed, and two efficient Alternating Direction Method of Multipliers (ADMM) algorithms are developed to solve the proposed models. A variety of real numerical experiments substantiate the superiority of the proposed methods over the state of the art.
    Two Souls in an Adversarial Image: Towards Universal Adversarial Example Detection using Multi-view Inconsistency. (arXiv:2109.12459v1 [cs.CV])
    (0 min) In the evasion attacks against deep neural networks (DNN), the attacker generates adversarial instances that are visually indistinguishable from benign samples and sends them to the target DNN to trigger misclassifications. In this paper, we propose a novel multi-view adversarial image detector, namely Argos, based on a novel observation. That is, there exist two "souls" in an adversarial instance, i.e., the visually unchanged content, which corresponds to the true label, and the added invisible perturbation, which corresponds to the misclassified label. Such inconsistencies could be further amplified through an autoregressive generative approach that generates images with seed pixels selected from the original image, a selected label, and pixel distributions learned from the training data. The generated images (i.e., the "views") will deviate significantly from the original one if the label is adversarial, demonstrating inconsistencies that Argos expects to detect. To this end, Argos first amplifies the discrepancies between the visual content of an image and its misclassified label induced by the attack using a set of regeneration mechanisms and then identifies an image as adversarial if the reproduced views deviate to a preset degree. Our experimental results show that Argos significantly outperforms two representative adversarial detectors in both detection accuracy and robustness against six well-known adversarial attacks. Code is available at: https://github.com/sohaib730/Argos-Adversarial_Detection
    Deep Neural Networks for Blind Image Quality Assessment: Addressing the Data Challenge. (arXiv:2109.12161v1 [eess.IV])
    (0 min) The enormous space and diversity of natural images is usually represented by a few small-scale human-rated image quality assessment (IQA) datasets. This casts great challenges to deep neural network (DNN) based blind IQA (BIQA), which requires large-scale training data that is representative of the natural image distribution. It is extremely difficult to create human-rated IQA datasets composed of millions of images due to constraints of subjective testing. While a number of efforts have focused on design innovations to enhance the performance of DNN based BIQA, attempts to address the scarcity of labeled IQA data remain surprisingly missing. To address this data challenge, we construct the largest IQA database to date, namely Waterloo Exploration-II, which contains 3,570 pristine reference images and around 3.45 million singly and multiply distorted images. Since subjective testing for such a large dataset is nearly impossible, we develop a novel mechanism that synthetically assigns perceptual quality labels to the distorted images. We construct a DNN-based BIQA model called EONSS, train it on Waterloo Exploration-II, and test it on nine subject-rated IQA datasets, without any retraining or fine-tuning. The results show that with a straightforward DNN architecture, EONSS is able to outperform the state of the art in BIQA, both in terms of quality prediction performance and execution speed. This study strongly supports the view that the quantity and quality of meaningfully annotated training data, rather than a sophisticated network architecture or training strategy, is the dominating factor that determines the performance of DNN-based BIQA models. (Note: Since this is an ongoing project, the final versions of the Waterloo Exploration-II database, quality annotations, and EONSS will be made publicly available in the future when it culminates.)
    Generalized multiscale feature extraction for remaining useful life prediction of bearings with generative adversarial networks. (arXiv:2109.12513v1 [cs.LG])
    (0 min) Bearing is a key component in industrial machinery and its failure may lead to unwanted downtime and economic loss. Hence, it is necessary to predict the remaining useful life (RUL) of bearings. Conventional data-driven approaches of RUL prediction require expert domain knowledge for manual feature extraction and may suffer from data distribution discrepancy between training and test data. In this study, we propose a novel generalized multiscale feature extraction method with generative adversarial networks. The adversarial training learns the distribution of training data from different bearings and is introduced for health stage division and RUL prediction. To capture the sequence feature from a one-dimensional vibration signal, we adapt a U-Net architecture that reconstructs features to process them with multiscale layers in the generator of the adversarial network. To validate the proposed method, comprehensive experiments on two rotating machinery datasets have been conducted to predict the RUL. The experimental results show that the proposed feature extraction method can effectively predict the RUL and outperforms the conventional RUL prediction approaches based on deep neural networks. The implementation code is available at https://github.com/opensuh/GMFE.
    Long-Range Feature Propagating for Natural Image Matting. (arXiv:2109.12252v1 [cs.CV])
    (0 min) Natural image matting estimates the alpha values of unknown regions in the trimap. Recently, deep learning based methods propagate the alpha values from the known regions to unknown regions according to the similarity between them. However, we find that more than 50% of the pixels in the unknown regions cannot be correlated to pixels in known regions due to the limited effective receptive fields of common convolutional neural networks, which leads to inaccurate estimation when the pixels in the unknown regions cannot be inferred only from pixels inside the receptive fields. To solve this problem, we propose the Long-Range Feature Propagating Network (LFPNet), which learns the long-range context features outside the receptive fields for alpha matte estimation. Specifically, we first design the propagating module which extracts the context features from the downsampled image. Then, we present Center-Surround Pyramid Pooling (CSPP) that explicitly propagates the context features from the surrounding context image patch to the inner center image patch. Finally, we use the matting module which takes the image, trimap and context features to estimate the alpha matte. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on the AlphaMatting and Adobe Image Matting datasets.
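    For readers unfamiliar with pyramid pooling, the sketch below is a generic PyTorch stand-in for the idea of pooling a surrounding context feature map at several scales and injecting it back at the centre-patch resolution. It only loosely follows the abstract; the module name, bin sizes and channel widths are assumptions, not the actual CSPP design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidContextPooling(nn.Module):
    """Pool a context feature map at several scales, project each level with a
    1x1 conv, resize to the centre-patch resolution and concatenate."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 4, 8)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins])

    def forward(self, context_feat, center_size):
        pooled = [F.interpolate(stage(context_feat), size=center_size,
                                mode="bilinear", align_corners=False)
                  for stage in self.stages]
        return torch.cat(pooled, dim=1)   # long-range context for the centre patch
```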
    Equivariant Filters for Efficient Tracking in 3D Imaging. (arXiv:2103.10255v2 [cs.CV] UPDATED)
    (0 min) We demonstrate an object tracking method for 3D images with fixed computational cost and state-of-the-art performance. Previous methods predicted transformation parameters from convolutional layers. We instead propose an architecture that does not include either flattening of convolutional features or fully connected layers, but instead relies on equivariant filters to preserve transformations between inputs and outputs (e.g. rot./trans. of inputs rotate/translate outputs). The transformation is then derived in closed form from the outputs of the filters. This method is useful for applications requiring low latency, such as real-time tracking. We demonstrate our model on synthetically augmented adult brain MRI, as well as fetal brain MRI, which is the intended use-case.
    Attentive Contractive Flow: Improved Contractive Flows with Lipschitz-constrained Self-Attention. (arXiv:2109.12135v1 [cs.CV])
    (0 min) Normalizing flows provide an elegant method for obtaining tractable density estimates from distributions by using invertible transformations. The main challenge is to improve the expressivity of the models while keeping the invertibility constraints intact. We propose to do so via the incorporation of localized self-attention. However, conventional self-attention mechanisms do not satisfy the requirements to obtain invertible flows and cannot be naively incorporated into normalizing flows. To address this, we introduce a novel approach called Attentive Contractive Flow (ACF) which utilizes a special category of flow-based generative models - contractive flows. We demonstrate that ACF can be introduced into a variety of state-of-the-art flow models in a plug-and-play manner. This is demonstrated to not only improve the representation power of these models (improving on the bits per dim metric), but also to result in significantly faster convergence in training them. Qualitative results, including interpolations between test images, demonstrate that samples are more realistic and capture local correlations in the data well. We evaluate the results further by performing perturbation analysis using AWGN, demonstrating that ACF models (especially the dot-product variant) show better and more consistent resilience to additive noise.
    Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries. (arXiv:2107.03035v3 [eess.IV] UPDATED)
    (0 min) Coronary artery disease (CAD) has posed a leading threat to the lives of cardiovascular disease patients worldwide for a long time. Therefore, automated diagnosis of CAD has indispensable significance in clinical medicine. However, the complexity of coronary artery plaques that cause CAD makes the automatic detection of coronary artery stenosis in Coronary CT angiography (CCTA) a difficult task. In this paper, we propose a Transformer network (TR-Net) for the automatic detection of significant stenosis (i.e. luminal narrowing > 50%) while practically completing the computer-assisted diagnosis of CAD. The proposed TR-Net introduces a novel Transformer and tightly combines convolutional layers and Transformer encoders, allowing their advantages to be demonstrated in the task. By analyzing semantic information sequences, TR-Net can fully understand the relationship between image information at each position of a multiplanar reformatted (MPR) image, and accurately detect significant stenosis based on both local and global information. We evaluate our TR-Net on a dataset of 76 different patients annotated by experienced radiologists. Experimental results illustrate that our TR-Net achieves better results in the ACC (0.92), Spec (0.96), PPV (0.84), F1 (0.79) and MCC (0.74) indicators compared with the state-of-the-art methods. The source code is publicly available from the link (https://github.com/XinghuaMa/TR-Net).
    Benchmark for Anonymous Video Analytics. (arXiv:2009.14684v2 [cs.CV] UPDATED)
    (0 min) Out-of-home audience measurement aims to count and characterize the people exposed to advertising content in the physical world. While audience measurement solutions based on computer vision are of increasing interest, no commonly accepted benchmark exists to evaluate and compare their performance. In this paper, we propose the first benchmark for digital out-of-home audience measurement that evaluates the vision-based tasks of audience localization and counting, and audience demographics. The benchmark is composed of a novel dataset captured at multiple locations and a set of performance measures. Using the benchmark, we present an in-depth comparison of eight open-source algorithms on four hardware platforms with GPU and CPU-optimized inferences and of two commercial off-the-shelf solutions for localization, count, age, and gender estimation. This benchmark and related open-source codes are available at this http URL
    SIMstack: A Generative Shape and Instance Model for Unordered Object Stacks. (arXiv:2103.16442v2 [cs.CV] UPDATED)
    (0 min) By estimating 3D shape and instances from a single view, we can capture information about an environment quickly, without the need for comprehensive scanning and multi-view fusion. Solving this task for composite scenes (such as object stacks) is challenging: occluded areas are not only ambiguous in shape but also in instance segmentation; multiple decompositions could be valid. We observe that physics constrains decomposition as well as shape in occluded regions and hypothesise that a latent space learned from scenes built under physics simulation can serve as a prior to better predict shape and instances in occluded regions. To this end we propose SIMstack, a depth-conditioned Variational Auto-Encoder (VAE), trained on a dataset of objects stacked under physics simulation. We formulate instance segmentation as a centre voting task which allows for class-agnostic detection and doesn't require setting the maximum number of objects in the scene. At test time, our model can generate 3D shape and instance segmentation from a single depth view, probabilistically sampling proposals for the occluded region from the learned latent space. Our method has practical applications in providing robots some of the ability humans have to make rapid intuitive inferences of partially observed scenes. We demonstrate an application for precise (non-disruptive) object grasping of unknown objects from a single depth view.
    PW-MAD: Pixel-wise Supervision for Generalized Face Morphing Attack Detection. (arXiv:2108.10291v2 [cs.CV] UPDATED)
    (0 min) A face morphing attack image can be verified against multiple identities, making this attack a major vulnerability for processes based on identity verification, such as border checks. Various methods have been proposed to detect face morphing attacks; however, they show low generalizability to unexpected post-morphing processes. A major post-morphing process is the print and scan operation performed in many countries when issuing a passport or identity document. In this work, we address this generalization problem by adapting a pixel-wise supervision approach where we train a network to classify each pixel of the image as attack or not, rather than only having one label for the whole image. Our pixel-wise morphing attack detection (PW-MAD) solution proved to perform more accurately than a set of established baselines. More importantly, PW-MAD shows high generalizability in comparison to related works when evaluated on unknown re-digitized attacks. In addition to our PW-MAD approach, we create a new face morphing attack dataset with digital and re-digitized samples, namely the LMA-DRD dataset, which is publicly available for research purposes upon request.
    An embarrassingly simple comparison of machine learning algorithms for indoor scene classification. (arXiv:2109.12261v1 [cs.CV])
    (0 min) With the emergence of autonomous indoor robots, the computer vision task of indoor scene recognition has gained the spotlight. Indoor scene recognition is a challenging problem in computer vision that relies on local and global features in a scene. This study aims to compare the performance of five machine learning algorithms on the task of indoor scene classification to identify the pros and cons of each classifier. It also provides a comparison of low latency feature extractors versus enormous feature extractors to understand the performance effects. Finally, a simple MnasNet based indoor classification system is proposed, which can achieve 72% accuracy at 23 ms latency.
    Structure-aware scale-adaptive networks for cancer segmentation in whole-slide images. (arXiv:2109.12617v1 [eess.IV])
    (0 min) Cancer segmentation in whole-slide images is a fundamental step for viable tumour burden estimation, which is of great value for cancer assessment. However, factors like vague boundaries or small regions dissociated from viable tumour areas make it a challenging task. Considering the usefulness of multi-scale features in various vision-related tasks, we present a structure-aware scale-adaptive feature selection method for efficient and accurate cancer segmentation. Based on a segmentation network with a popular encoder-decoder architecture, a scale-adaptive module is proposed for selecting more robust features to represent the vague, non-rigid boundaries. Furthermore, a structural similarity metric is proposed for better tissue structure awareness to deal with small region segmentation. In addition, advanced designs including several attention mechanisms and selective-kernel convolutions are applied to the baseline network for comparative study purposes. Extensive experimental results show that the proposed structure-aware scale-adaptive networks achieve outstanding performance on liver cancer segmentation when compared to the top ten submitted results in the PAIP 2019 challenge. Further evaluation on colorectal cancer segmentation shows that the scale-adaptive module improves the baseline network or outperforms the other excellent designs of attention mechanisms when considering the tradeoff between efficiency and accuracy.
    Use of the Deep Learning Approach to Measure Alveolar Bone Level. (arXiv:2109.12115v1 [eess.IV])
    (0 min) Abstract: Aim: The goal was to use a Deep Convolutional Neural Network to measure the radiographic alveolar bone level to aid periodontal diagnosis. Material and methods: A Deep Learning (DL) model was developed by integrating three segmentation networks (bone area, tooth, cementoenamel junction) and image analysis to measure the radiographic bone level and assign radiographic bone loss (RBL) stages. The percentage of RBL was calculated to determine the stage of RBL for each tooth. A provisional periodontal diagnosis was assigned using the 2018 periodontitis classification. RBL percentage, staging, and presumptive diagnosis were compared to the measurements and diagnoses made by the independent examiners. Results: The average Dice Similarity Coefficient (DSC) for segmentation was over 0.91. There was no significant difference in RBL percentage measurements determined by DL and examiners (p=0.65). The Area Under the Receiver Operating Characteristics Curve of RBL stage assignment for stage I, II and III was 0.89, 0.90 and 0.90, respectively. The accuracy of the case diagnosis was 0.85. Conclusion: The proposed DL model provides reliable RBL measurements and image-based periodontal diagnosis using periapical radiographic images. However, this model has to be further optimized and validated by a larger number of images to facilitate its application.
    Hard-sample Guided Hybrid Contrast Learning for Unsupervised Person Re-Identification. (arXiv:2109.12333v1 [cs.CV])
    (0 min) Unsupervised person re-identification (Re-ID) is a promising and very challenging research problem in computer vision. Learning robust and discriminative features with unlabeled data is of central importance to Re-ID. Recently, more attention has been paid to unsupervised Re-ID algorithms based on clustered pseudo-labels. However, previous approaches did not fully exploit information from hard samples, simply using cluster centroids or all instances for contrastive learning. In this paper, we propose a Hard-sample Guided Hybrid Contrast Learning (HHCL) approach combining cluster-level loss with instance-level loss for unsupervised person Re-ID. Our approach applies a cluster centroid contrastive loss to ensure that the network is updated in a more stable way. Meanwhile, the introduction of a hard instance contrastive loss further mines discriminative information. Extensive experiments on two popular large-scale Re-ID benchmarks demonstrate that our HHCL outperforms previous state-of-the-art methods and significantly improves the performance of unsupervised person Re-ID. The code of our work will be available soon at https://github.com/bupt-ai-cz/HHCL-ReID.
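    A minimal sketch of how a cluster-centroid contrastive loss and a hard-instance contrastive loss might be combined, assuming PyTorch and a feature memory bank; the exact loss formulation, memory design, temperature and mining rule in HHCL may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def cluster_centroid_loss(feat, centroids, labels, temp=0.05):
    """InfoNCE against cluster centroids (a common formulation; HHCL's exact
    cluster-level loss may differ). `labels` are pseudo-label cluster ids."""
    logits = F.normalize(feat, dim=1) @ F.normalize(centroids, dim=1).t() / temp
    return F.cross_entropy(logits, labels)

def hard_instance_loss(feat, memory, mem_labels, labels, temp=0.05, k_neg=50):
    """Contrast each sample against its hardest (least similar) positive and the
    hardest (most similar) negatives in the memory bank. Assumes every
    pseudo-label has at least one instance stored in the memory."""
    sim = F.normalize(feat, dim=1) @ F.normalize(memory, dim=1).t()   # (B, M)
    losses = []
    for i, y in enumerate(labels):
        pos = sim[i][mem_labels == y]
        neg = sim[i][mem_labels != y]
        hardest_pos = pos.min()                                 # least similar positive
        hardest_negs = neg.topk(min(k_neg, neg.numel())).values  # most similar negatives
        logits = torch.cat([hardest_pos.view(1), hardest_negs]) / temp
        target = torch.zeros(1, dtype=torch.long, device=feat.device)
        losses.append(F.cross_entropy(logits.unsqueeze(0), target))
    return torch.stack(losses).mean()
```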
    Contrastive Learning for Mitochondria Segmentation. (arXiv:2109.12363v1 [cs.CV])
    (0 min) Mitochondria segmentation in electron microscopy images is essential in neuroscience. However, due to the image degradation during the imaging process, the large variety of mitochondrial structures, and the presence of noise, artifacts and other sub-cellular structures, mitochondria segmentation is very challenging. In this paper, we propose a novel and effective contrastive learning framework to learn a better feature representation from hard examples to improve segmentation. Specifically, we adopt a point sampling strategy to pick out representative pixels from hard examples in the training phase. Based on these sampled pixels, we introduce a pixel-wise label-based contrastive loss which consists of a similarity loss term and a consistency loss term. The similarity term increases the similarity of pixels from the same class and the separability of pixels from different classes in feature space, while the consistency term enhances the sensitivity of the 3D model to changes in image content from frame to frame. We demonstrate the effectiveness of our method on the MitoEM and FIB-SEM datasets and show results that are better than or on par with the state of the art.
    A Survey on Graph-Based Deep Learning for Computational Histopathology. (arXiv:2107.00272v2 [cs.LG] UPDATED)
    (0 min) With the remarkable success of representation learning for prediction problems, we have witnessed a rapid expansion of the use of machine learning and deep learning for the analysis of digital pathology and biopsy image patches. However, learning over patch-wise features using convolutional neural networks limits the ability of the model to capture global contextual information and comprehensively model tissue composition. The phenotypical and topological distribution of constituent histological entities play a critical role in tissue diagnosis. As such, graph data representations and deep learning have attracted significant attention for encoding tissue representations, and capturing intra- and inter- entity level interactions. In this review, we provide a conceptual grounding for graph analytics in digital pathology, including entity-graph construction and graph architectures, and present their current success for tumor localization and classification, tumor invasion and staging, image retrieval, and survival prediction. We provide an overview of these methods in a systematic manner organized by the graph representation of the input image, scale, and organ on which they operate. We also outline the limitations of existing techniques, and suggest potential future research directions in this domain.
    Frequency Disentangled Residual Network. (arXiv:2109.12556v1 [cs.CV])
    (0 min) Residual networks (ResNets) have been utilized for various computer vision and image processing applications. The residual connection improves the training of the network with better gradient flow. A residual block consists of a few convolutional layers with trainable parameters, which can lead to overfitting. Moreover, existing residual networks are not able to utilize high and low frequency information suitably, which also challenges the generalization capability of the network. In this paper, a frequency disentangled residual network (FDResNet) is proposed to tackle these issues. Specifically, FDResNet includes separate connections in the residual block for the low and high frequency components, respectively. Basically, the proposed model disentangles the low and high frequency components to increase the generalization ability. Moreover, the computation of the low and high frequency components using fixed filters further avoids overfitting. The proposed model is tested on the benchmark CIFAR10/100, Caltech and TinyImageNet datasets for image classification. The performance of the proposed model is also tested in an image retrieval framework. It is observed that the proposed model outperforms its counterpart residual model. The effects of kernel size and standard deviation are also evaluated. The impact of frequency disentangling is also analyzed using saliency maps.
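    A hedged PyTorch sketch of the idea of splitting a residual block's input into fixed low-pass (Gaussian) and high-pass components and giving each its own connection; the kernel size, the 1x1 projections and the way the paths are merged are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel(ksize=5, sigma=1.0):
    ax = torch.arange(ksize) - ksize // 2
    g = torch.exp(-(ax.float() ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

class FreqDisentangledBlock(nn.Module):
    """Residual block with separate low- and high-frequency shortcut paths.
    A fixed depthwise Gaussian filter gives the low-pass component; the
    high-pass component is the residual of the input against it."""
    def __init__(self, ch, ksize=5, sigma=1.0):
        super().__init__()
        k = gaussian_kernel(ksize, sigma).repeat(ch, 1, 1, 1)   # depthwise filter
        self.register_buffer("lowpass", k)
        self.pad = ksize // 2
        self.low_proj = nn.Conv2d(ch, ch, 1, bias=False)
        self.high_proj = nn.Conv2d(ch, ch, 1, bias=False)
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        low = F.conv2d(x, self.lowpass, padding=self.pad, groups=x.shape[1])
        high = x - low                                          # fixed high-pass
        return F.relu(self.body(x) + self.low_proj(low) + self.high_proj(high))
```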
    Learning Stereopsis from Geometric Synthesis for 6D Object Pose Estimation. (arXiv:2109.12266v1 [cs.CV])
    (0 min) Current monocular-based 6D object pose estimation methods generally achieve less competitive results than RGBD-based methods, mostly due to the lack of 3D information. To close this gap, this paper proposes a 3D geometric volume based pose estimation method with a short-baseline two-view setting. By constructing a geometric volume in 3D space, we combine the features from two adjacent images into the same 3D space. Then a network is trained to learn the distribution of the positions of object keypoints in the volume, and a robust soft RANSAC solver is deployed to solve the pose in closed form. To balance accuracy and cost, we propose a coarse-to-fine framework to improve the performance in an iterative way. The experiments show that our method outperforms state-of-the-art monocular-based methods, and is robust across different objects and scenes, especially in severe occlusion situations.
    High Frame Rate Video Quality Assessment using VMAF and Entropic Differences. (arXiv:2109.12785v1 [cs.MM])
    (0 min) The popularity of streaming videos with live, high-action content has led to an increased interest in High Frame Rate (HFR) videos. In this work we address the problem of frame rate dependent Video Quality Assessment (VQA) when the videos to be compared have different frame rates and compression factors. Current VQA models such as VMAF have superior correlation with perceptual judgments when the videos to be compared have the same frame rate and contain conventional distortions such as compression, scaling, etc. However, this framework requires an additional pre-processing step when videos with different frame rates need to be compared, which can potentially limit its overall performance. Recently, the Generalized Entropic Difference (GREED) VQA model was proposed to account for artifacts that arise due to changes in frame rate, and showed superior performance on the LIVE-YT-HFR database, which contains frame rate dependent artifacts such as judder, strobing, etc. In this paper we propose a simple extension, where the features from VMAF and GREED are fused in order to exploit the advantages of both models. We show through various experiments that the proposed fusion framework results in more efficient features for predicting frame rate dependent video quality. We also evaluate the fused feature set on standard non-HFR VQA databases and obtain superior performance compared to both GREED and VMAF, indicating that the combined feature set captures complementary perceptual quality information.
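    One simple way such a feature fusion could be realized is sketched below, assuming the VMAF and GREED features have already been computed offline per video and a support vector regressor maps the concatenated features to subjective scores; the regressor choice and hyper-parameters are assumptions, not necessarily what the paper uses.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def train_fused_quality_model(vmaf_feats, greed_feats, mos):
    """vmaf_feats: (N, d1) elementary VMAF features per video,
    greed_feats: (N, d2) GREED spatial/temporal entropic-difference features,
    mos: (N,) subjective quality scores. Returns a fitted regressor."""
    X = np.concatenate([vmaf_feats, greed_feats], axis=1)   # simple feature fusion
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
    model.fit(X, mos)
    return model

# Prediction on a new video: model.predict(np.concatenate([v, g])[None, :])
```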
    Effect Of Personalized Calibration On Gaze Estimation Using Deep-Learning. (arXiv:2109.12801v1 [cs.CV])
    (0 min) With the increase in computation power and the development of new state-of-the-art deep learning algorithms, appearance-based gaze estimation is becoming more and more popular. It is believed to work well with curated laboratory datasets; however, it faces several challenges when deployed in real-world scenarios. One such challenge is estimating the gaze of a person about whom the deep learning model trained for gaze estimation has no prior knowledge. To analyse the performance in such scenarios, we simulate a calibration mechanism. In this work we use the MPIIGaze dataset. We trained a multi-modal convolutional neural network and analysed its performance with and without calibration; this evaluation provides clear insights into how calibration improves the performance of the deep learning model in estimating gaze in the wild.
    Using Persistent Homology Topological Features to Characterize Medical Images: Case Studies on Lung and Brain Cancers. (arXiv:2012.12102v3 [cs.CV] UPDATED)
    (0 min) Tumor shape is a key factor that affects tumor growth and metastasis. This paper proposes a topological feature computed by persistent homology to characterize tumor progression from digital pathology and radiology images and examines its effect on time-to-event data. The proposed topological features are invariant to scale-preserving transformations and can summarize various tumor shape patterns. The topological features are represented in functional space and used as functional predictors in a functional Cox proportional hazards model. The proposed model enables interpretable inference about the association between topological shape features and survival risks. Two case studies are conducted using 143 consecutive lung cancer patients and 77 brain tumor patients. The results of both studies show that the topological features predict survival prognosis after adjusting for clinical variables, and the predicted high-risk groups have significantly (at the 0.001 level) worse survival outcomes than the low-risk groups. Also, the topological shape features found to be positively associated with survival hazards are irregular and heterogeneous shape patterns, which are known to be related to tumor progression.
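    A toy sketch of this kind of pipeline, assuming the `ripser` and `lifelines` packages are available: it reduces persistence diagrams to a few scalar summaries and fits an ordinary Cox model, whereas the paper uses functional persistence summaries inside a functional Cox model. Variable names and the clinical covariates are placeholders.

```python
import numpy as np
import pandas as pd
from ripser import ripser                 # persistent homology of a point cloud
from lifelines import CoxPHFitter         # proportional hazards model

def topo_features(points, maxdim=1):
    """Summarise a tumour-boundary point cloud by simple persistence statistics
    (total and maximum persistence per homology dimension)."""
    dgms = ripser(points, maxdim=maxdim)["dgms"]
    feats = {}
    for d, dgm in enumerate(dgms):
        finite = np.isfinite(dgm[:, 1])
        pers = dgm[finite, 1] - dgm[finite, 0]
        feats[f"h{d}_total_pers"] = float(pers.sum())
        feats[f"h{d}_max_pers"] = float(pers.max()) if len(pers) else 0.0
    return feats

def fit_survival_model(point_clouds, durations, events, clinical_df):
    # clinical_df is assumed to hold numeric clinical covariates, one row per patient
    rows = [topo_features(pc) for pc in point_clouds]
    df = pd.concat([pd.DataFrame(rows), clinical_df.reset_index(drop=True)], axis=1)
    df["duration"], df["event"] = durations, events
    cph = CoxPHFitter()
    cph.fit(df, duration_col="duration", event_col="event")
    return cph
```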
    Subjective and Objective Quality Assessment of High Frame Rate Videos. (arXiv:2007.11634v2 [cs.MM] UPDATED)
    (0 min) High frame rate (HFR) videos are becoming increasingly common with the tremendous popularity of live, high-action streaming content such as sports. Although HFR contents are generally of very high quality, high bandwidth requirements make them challenging to deliver efficiently, while simultaneously maintaining their quality. To optimize trade-offs between bandwidth requirements and video quality, in terms of frame rate adaptation, it is imperative to understand the intricate relationship between frame rate and perceptual video quality. Towards advancing progress in this direction, we designed a new subjective resource, called the LIVE-YouTube-HFR (LIVE-YT-HFR) dataset, which is comprised of 480 videos having 6 different frame rates, obtained from 16 diverse contents. In order to understand the combined effects of compression and frame rate adjustment, we also processed videos at 5 compression levels at each frame rate. To obtain subjective labels on the videos, we conducted a human study yielding 19,000 human quality ratings obtained from a pool of 85 human subjects. We also conducted a holistic evaluation of existing state-of-the-art Full and No-Reference video quality algorithms, and statistically benchmarked their performance on the new database. The LIVE-YT-HFR database has been made available online for public use and evaluation purposes, with hopes that it will help advance research in this exciting video technology direction. It may be obtained at \url{https://live.ece.utexas.edu/research/LIVE_YT_HFR/LIVE_YT_HFR/index.html}
    Cascade RCNN for MIDOG Challenge. (arXiv:2109.01085v2 [cs.CV] UPDATED)
    (0 min) Mitotic counts are one of the key indicators of breast cancer prognosis. However, accurate mitotic cell counting is still a difficult problem and is laborious. Automated methods have been proposed for this task, but they are usually dependent on the training images and show poor performance on unseen domains. In this work, we present a multi-stage mitosis detection method based on a Cascade RCNN developed to be sequentially more selective against false positives. On the preliminary test set, the algorithm scores an F1-score of 0.7492.
    Safety Metrics for Semantic Segmentation in Autonomous Driving. (arXiv:2105.10142v2 [cs.CV] UPDATED)
    (0 min) Within the context of autonomous driving, safety-related metrics for deep neural networks have been widely studied for image classification and object detection. In this paper, we further consider safety-aware correctness and robustness metrics specialized for semantic segmentation. The novelty of our proposal is to move beyond pixel-level metrics: Given two images with each having N pixels being class-flipped, the designed metrics should, depending on the clustering of pixels being class-flipped or the location of occurrence, reflect a different level of safety criticality. The result evaluated on an autonomous driving dataset demonstrates the validity and practicality of our proposed methodology.
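    A small illustration of what a "beyond pixel-level" metric could look like in this spirit, using SciPy connected components to weight clustered class flips more heavily than scattered ones and to up-weight flips inside a safety-critical region of interest; the weighting scheme is invented for illustration and is not the paper's metric.

```python
import numpy as np
from scipy import ndimage

def cluster_aware_flip_score(pred, pred_perturbed, roi_mask=None):
    """pred, pred_perturbed: integer class maps of equal shape.
    roi_mask: optional boolean mask of a safety-critical image region."""
    flipped = pred != pred_perturbed
    labels, n_clusters = ndimage.label(flipped)
    score = 0.0
    for comp in range(1, n_clusters + 1):
        component = labels == comp
        weight = float(component.sum()) ** 1.5        # large clusters count superlinearly
        if roi_mask is not None and (roi_mask & component).any():
            weight *= 2.0                             # flips in the ROI are more critical
        score += weight
    return score / flipped.size
```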
    Multi-Source domain adaptation via supervised contrastive learning and confident consistency regularization. (arXiv:2106.16093v3 [cs.CV] UPDATED)
    (0 min) Multi-Source Unsupervised Domain Adaptation (multi-source UDA) aims to learn a model from several labeled source domains while performing well on a different target domain where only unlabeled data are available at training time. To align source and target features distributions, several recent works use source and target explicit statistics matching such as features moments or class centroids. Yet, these approaches do not guarantee class conditional distributions alignment across domains. In this work, we propose a new framework called Contrastive Multi-Source Domain Adaptation (CMSDA) for multi-source UDA that addresses this limitation. Discriminative features are learned from interpolated source examples via cross entropy minimization and from target examples via consistency regularization and hard pseudo-labeling. Simultaneously, interpolated source examples are leveraged to align source class conditional distributions through an interpolated version of the supervised contrastive loss. This alignment leads to more general and transferable features which further improves the generalization on the target domain. Extensive experiments have been carried out on three standard multi-source UDA datasets where our method reports state-of-the-art results.
    ReRankMatch: Semi-Supervised Learning with Semantics-Oriented Similarity Representation. (arXiv:2102.06328v2 [cs.CV] UPDATED)
    (0 min) This paper proposes integrating semantics-oriented similarity representation into RankingMatch, a recently proposed semi-supervised learning method. Our method, dubbed ReRankMatch, aims to deal with the case in which labeled and unlabeled data share non-overlapping categories. ReRankMatch encourages the model to produce similar image representations for samples likely belonging to the same class. We evaluate our method on various datasets such as CIFAR-10, CIFAR-100, SVHN, STL-10, and Tiny ImageNet. We obtain promising results (4.21% error rate on CIFAR-10 with 4000 labels, 22.32% error rate on CIFAR-100 with 10000 labels, and 2.19% error rate on SVHN with 1000 labels) when the amount of labeled data is sufficient to learn semantics-oriented similarity representation. The code is made publicly available at https://github.com/tqtrunghnvn/ReRankMatch.
    Towards Better Explanations of Class Activation Mapping. (arXiv:2102.05228v3 [cs.CV] UPDATED)
    (0 min) Increasing demands for understanding the internal behavior of convolutional neural networks (CNNs) have led to remarkable improvements in explanation methods. Particularly, several class activation mapping (CAM) based methods, which generate visual explanation maps by a linear combination of activation maps from CNNs, have been proposed. However, the majority of the methods lack a clear theoretical basis on how they assign the coefficients of the linear combination. In this paper, we revisit the intrinsic linearity of CAM with respect to the activation maps; we construct an explanation model of CNN as a linear function of binary variables that denote the existence of the corresponding activation maps. With this approach, the explanation model can be determined by additive feature attribution methods in an analytic manner. We then demonstrate the adequacy of SHAP values, which is a unique solution for the explanation model with a set of desirable properties, as the coefficients of CAM. Since the exact SHAP values are unattainable, we introduce an efficient approximation method, LIFT-CAM, based on DeepLIFT. Our proposed LIFT-CAM can estimate the SHAP values of the activation maps with high speed and accuracy. Furthermore, it greatly outperforms other previous CAM-based methods in both qualitative and quantitative aspects.
    Data, Assemble: Leveraging Multiple Datasets with Heterogeneous and Partial Labels. (arXiv:2109.12265v1 [cs.CV])
    (0 min) The success of deep learning relies heavily on large datasets with extensive labels, but we often only have access to several small, heterogeneous datasets associated with partial labels, particularly in the field of medical imaging. When learning from multiple datasets, existing challenges include incomparable, heterogeneous, or even conflicting labeling protocols across datasets. In this paper, we propose a new initiative--"data, assemble"--which aims to unleash the full potential of partially labeled data and enormous unlabeled data from an assembly of datasets. To accommodate the supervised learning paradigm to partial labels, we introduce a dynamic adapter that encodes multiple visual tasks and aggregates image features in a question-and-answer manner. Furthermore, we employ pseudo-labeling and consistency constraints to harness images with missing labels and to mitigate the domain gap across datasets. From proof-of-concept studies on three natural imaging datasets and rigorous evaluations on two large-scale thorax X-ray benchmarks, we discover that learning from "negative examples" facilitates both classification and segmentation of classes of interest. This sheds new light on the computer-aided diagnosis of rare diseases and emerging pandemics, wherein "positive examples" are hard to collect, yet "negative examples" are relatively easier to assemble. As a result, besides exceeding the prior art in the NIH ChestXray benchmark, our model is particularly strong in identifying diseases of minority classes, yielding over 3-point improvement on average. Remarkably, when using existing partial labels, our model performance is on-par (p>0.05) with that using a fully curated dataset with exhaustive labels, eliminating the need for additional 40% annotation costs.
    S2Looking: A Satellite Side-Looking Dataset for Building Change Detection. (arXiv:2107.09244v2 [cs.CV] UPDATED)
    (0 min) Building change detection underpins many important applications, especially in the military and crisis management domains. Recent methods used for change detection have shifted towards deep-learning, which depends on the quality of its training data. The assembly of large-scale annotated satellite imagery datasets is therefore essential for global building change surveillance. Existing datasets almost exclusively offer near-nadir viewing angles. This limits the range of changes that can be detected. By offering larger observation ranges, the scroll imaging mode of optical satellites presents an opportunity to overcome this restriction. This paper therefore introduces S2Looking, a building change detection dataset that contains large-scale side-looking satellite images captured at various off-nadir angles. The dataset consists of 5000 bitemporal image pairs of rural areas and more than 65,920 annotated instances of changes throughout the world. The dataset can be used to train deep-learning-based change detection algorithms. It expands upon existing datasets by providing: 1) larger viewing angles; 2) large illumination variances; and 3) the added complexity of rural images. To facilitate use of the dataset, a benchmark task has been established and preliminary tests suggest deep-learning algorithms find the dataset significantly more challenging than the closest competing near-nadir dataset, LEVIR-CD+. S2Looking may therefore promote important advances in existing building change detection algorithms. The dataset is available at https://github.com/S2Looking/.
    GPRInvNet: Deep Learning-Based Ground Penetrating Radar Data Inversion for Tunnel Lining. (arXiv:1912.05759v3 [cs.CV] UPDATED)
    (0 min) A DNN architecture referred to as GPRInvNet was proposed to tackle the challenges of mapping the ground-penetrating radar (GPR) B-Scan data to complex permittivity maps of subsurface structures. The GPRInvNet consisted of a trace-to-trace encoder and a decoder. It was specially designed to take into account the characteristics of GPR inversion when faced with complex GPR B-Scan data, as well as addressing the spatial alignment issues between time-series B-Scan data and spatial permittivity maps. It displayed the ability to fuse features from several adjacent traces on the B-Scan data to enhance each trace, and then further condense the features of each trace separately. As a result, the sensitive zones on the permittivity maps spatially aligned to the enhanced trace could be reconstructed accurately. The GPRInvNet has been utilized to reconstruct the permittivity map of tunnel linings. A diverse range of dielectric models of tunnel linings containing complex defects has been reconstructed using GPRInvNet. The results have demonstrated that the GPRInvNet is capable of effectively reconstructing complex tunnel lining defects with clear boundaries. Comparative results with existing baseline methods also demonstrated the superiority of the GPRInvNet. For the purpose of generalizing the GPRInvNet to real GPR data, some background noise patches recorded from practical model testing were integrated into the synthetic GPR data to retrain the GPRInvNet. The model testing has been conducted for validation, and experimental results revealed that the GPRInvNet had also achieved satisfactory results with regard to the real data.
    Optical Inspection of the Silicon Micro-strip Sensors for the CBM Experiment employing Artificial Intelligence. (arXiv:2107.07714v2 [physics.ins-det] UPDATED)
    (0 min) Optical inspection of 1191 silicon micro-strip sensors was performed using a custom-made optical inspection setup, employing a machine-learning based approach for defect analysis and subsequent quality assurance. Furthermore, metrological control of the sensor surface was performed. In this manuscript, we present the analysis of various sensor surface defects. Among these are implant breaks, p-stop breaks, aluminium strip opens, aluminium strip shorts, surface scratches, double metallization layer defects, passivation layer defects, bias resistor defects, as well as dust particle identification. The defect detection was done using Convolutional Deep Neural Networks (CDNNs). From this, defective strips and defect clusters were identified, and a 2D map of the defects was produced using their geometrical positions on the sensor. Based on the total number of defects found on the sensor's surface, a method for the estimation of the sensor's overall quality grade and quality score was proposed.
    Distribution-sensitive Information Retention for Accurate Binary Neural Network. (arXiv:2109.12338v1 [cs.CV])
    (0 min) Model binarization is an effective method of compressing neural networks and accelerating their inference process, which enables state-of-the-art models to run on resource-limited devices. However, a significant performance gap still exists between the 1-bit model and the 32-bit one. The empirical study shows that binarization causes a great loss of information in the forward and backward propagation, which harms the performance of binary neural networks (BNNs), and the limited information representation ability of binarized parameters is one of the bottlenecks of BNN performance. We present a novel Distribution-sensitive Information Retention Network (DIR-Net) to retain the information of the forward activations and backward gradients, which improves BNNs by distribution-sensitive optimization without increasing the overhead in the inference process. The DIR-Net mainly relies on two technical contributions: (1) Information Maximized Binarization (IMB): minimizing the information loss and the quantization error of weights/activations simultaneously by balancing and standardizing the weight distribution in the forward propagation; (2) Distribution-sensitive Two-stage Estimator (DTE): minimizing the information loss of gradients by gradual distribution-sensitive approximation of the sign function in the backward propagation, jointly considering the updating capability and accurate gradients. The DIR-Net investigates both the forward and backward processes of BNNs from a unified information perspective, thereby providing new insight into the mechanism of network binarization. Comprehensive experiments on the CIFAR-10 and ImageNet datasets show that our DIR-Net consistently outperforms the SOTA binarization approaches under mainstream and compact architectures. Additionally, we deploy our DIR-Net on real-world resource-limited devices, achieving 11.1 times storage savings and a 5.4 times speedup.
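    A minimal PyTorch sketch of the "balance and standardize the weights, then binarize with a straight-through estimator" idea behind IMB; per-channel scaling factors and the DTE gradient estimator described in the abstract are omitted, so this is a simplified illustration rather than the DIR-Net method.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign in the forward pass, straight-through (clipped identity) in the backward."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # pass gradients only inside [-1, 1]

def binarize_weights(w, eps=1e-5):
    """Zero-mean, unit-variance standardization per output channel before the sign,
    so positive and negative binary weights stay roughly balanced."""
    flat = w.view(w.size(0), -1)
    balanced = (flat - flat.mean(dim=1, keepdim=True)) / (flat.std(dim=1, keepdim=True) + eps)
    return SignSTE.apply(balanced.view_as(w))
```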
    TreeNet: A lightweight One-Shot Aggregation Convolutional Network. (arXiv:2109.12342v1 [cs.CV])
    (0 min) The architecture of deep convolutional networks (CNNs) has evolved for years, becoming more accurate and faster. However, it is still challenging to design reasonable network structures that aim at obtaining the best accuracy under a limited computational budget. In this paper, we propose a Tree block, named after its appearance, which extends the One-Shot Aggregation (OSA) module while being more lightweight and flexible. Specifically, the Tree block replaces each of the $3\times3$ Conv layers in OSA with a stack of a shallow residual block (SRB) and a $1\times1$ Conv layer. The $1\times1$ Conv layer is responsible for increasing the dimension, and the SRB output is fed into the next step. By doing this, when aggregating the same number of subsequent feature maps, the Tree block has a deeper network structure while having lower model complexity. In addition, a residual connection and efficient channel attention (ECA) are added to the Tree block to further improve the performance of the network. Based on the Tree block, we build efficient backbone models called TreeNets. TreeNet has a similar network architecture to ResNet, making it flexible enough to replace ResNet in various computer vision frameworks. We comprehensively evaluate TreeNet on commonly used benchmarks, including ImageNet-1k for classification, and MS COCO for object detection and instance segmentation. Experimental results demonstrate that TreeNet is more efficient and performs favorably against the current state-of-the-art backbone methods.
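    A hedged PyTorch sketch of a Tree-style one-shot-aggregation block, in which each stage is a shallow residual block whose output feeds the next stage while a 1x1 conv produces the branch that is aggregated; the number of stages, channel widths and the ECA module are illustrative or omitted, so this should not be read as the exact TreeNet block.

```python
import torch
import torch.nn as nn

class SRB(nn.Module):
    """Shallow residual block: one 3x3 conv + BN with an identity shortcut."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(ch))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.conv(x))

class TreeBlock(nn.Module):
    """One-shot aggregation of per-stage 1x1 branches, with a residual connection."""
    def __init__(self, in_ch, branch_ch, n_stages=4):
        super().__init__()
        self.stages = nn.ModuleList([SRB(in_ch) for _ in range(n_stages)])
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 1, bias=False) for _ in range(n_stages)])
        self.aggregate = nn.Conv2d(in_ch + n_stages * branch_ch, in_ch, 1, bias=False)

    def forward(self, x):
        feats, cur = [x], x
        for srb, branch in zip(self.stages, self.branches):
            cur = srb(cur)               # SRB output feeds the next stage
            feats.append(branch(cur))    # 1x1 branch is kept for aggregation
        return x + self.aggregate(torch.cat(feats, dim=1))
```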
    A Simple Self-calibration Method for The Internal Time Synchronization of MEMS LiDAR. (arXiv:2109.12506v1 [cs.CV])
    (0 min) This paper proposes a simple self-calibration method for the internal time synchronization of MEMS (Micro-Electromechanical Systems) LiDAR during research and development. Firstly, we introduce the problem of internal time misalignment in MEMS LiDAR. Then, a robust Minimum Vertical Gradient (MVG) prior is proposed to calibrate the time difference between the laser and the MEMS mirror, which can be calculated automatically without any manual intervention or specially designed cooperation target. Finally, actual experiments on MEMS LiDARs are conducted to demonstrate the effectiveness of the proposed method. It should be noted that the calibration can be implemented in a simple laboratory environment without any ranging equipment or manual intervention, which greatly accelerates research and development progress in practical applications.
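    One way a vertical-gradient prior could drive such a calibration is a brute-force search over candidate laser/mirror time offsets, as sketched below; `mirror_angles_fn` and `to_depth_image` are assumed helper functions (not part of any real driver API), and the actual MVG formulation in the paper may differ.

```python
import numpy as np

def vertical_gradient(depth_image):
    """Mean absolute vertical gradient of a range image scanned on a flat target."""
    return np.abs(np.diff(depth_image, axis=0)).mean()

def calibrate_time_offset(ranges, mirror_angles_fn, offsets_us, to_depth_image):
    """Search the candidate time offsets (in microseconds) for the one that
    minimises the vertical-gradient cost. `mirror_angles_fn(dt)` should return
    the mirror angles re-sampled with a time shift of dt, and `to_depth_image`
    turns (range, angle) samples into a range image."""
    best_dt, best_cost = None, np.inf
    for dt in offsets_us:
        img = to_depth_image(ranges, mirror_angles_fn(dt))
        cost = vertical_gradient(img)
        if cost < best_cost:
            best_dt, best_cost = dt, cost
    return best_dt
```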
    Vision Transformer Hashing for Image Retrieval. (arXiv:2109.12564v1 [cs.CV])
    (0 min) Deep learning has shown tremendous growth in hashing techniques for image retrieval. Recently, the Transformer has emerged as a new architecture by utilizing self-attention without convolution. The Transformer has also been extended to the Vision Transformer (ViT) for visual recognition, with promising performance on ImageNet. In this paper, we propose a Vision Transformer based Hashing (VTS) for image retrieval. We utilize a ViT pre-trained on ImageNet as the backbone network and add a hashing head. The proposed VTS model is fine-tuned for hashing under six different image retrieval frameworks, including Deep Supervised Hashing (DSH), HashNet, GreedyHash, Improved Deep Hashing Network (IDHN), Deep Polarized Network (DPN) and Central Similarity Quantization (CSQ), with their objective functions. We perform extensive experiments on the CIFAR10, ImageNet, NUS-Wide, and COCO datasets. The proposed VTS based image retrieval outperforms recent state-of-the-art hashing techniques by a large margin. We also find that the proposed ViT backbone is better than existing networks such as AlexNet and ResNet.
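    A sketch of the "pre-trained ViT backbone plus hashing head" idea, assuming the `timm` library provides the pre-trained ViT; the head size, tanh relaxation and the training objective are placeholders standing in for the six hashing frameworks listed above.

```python
import timm                      # assumed available; provides pre-trained ViT models
import torch
import torch.nn as nn

class ViTHashing(nn.Module):
    """Pre-trained ViT backbone followed by a small hashing head that outputs
    K-bit relaxed codes in (-1, 1)."""
    def __init__(self, n_bits=64, backbone="vit_base_patch16_224"):
        super().__init__()
        self.backbone = timm.create_model(backbone, pretrained=True, num_classes=0)
        self.hash_head = nn.Linear(self.backbone.num_features, n_bits)

    def forward(self, x):
        feat = self.backbone(x)                   # (B, num_features) pooled features
        return torch.tanh(self.hash_head(feat))   # relaxed hash codes

# At retrieval time the relaxed codes are binarised, e.g.:
# codes = torch.sign(model(images))
```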
    Automated Multi-Process CTC Detection using Deep Learning. (arXiv:2109.12709v1 [cs.CV])
    (0 min) Circulating Tumor Cells (CTCs) bear great promise as biomarkers in tumor prognosis. However, the process of identification and later enumeration of CTCs requires manual labor, which is error-prone and time-consuming. Recent developments in object detection via deep learning using Mask-RCNNs and the wider availability of pre-trained models have enabled sensitive tasks with limited data to be tackled with unprecedented accuracy. In this report, we present a novel 3-stage detection model for automated identification of Circulating Tumor Cells in multi-channel darkfield microscopic images, comprising: RetinaNet based identification of Cytokeratin (CK) stains, Mask-RCNN based cell detection of DAPI cell nuclei, and Otsu thresholding to detect CD-45s. The training dataset is composed of 46 high-variance data points, with 10 negative and 36 positive data points. The test set is composed of 420 negative data points. The final accuracy of the pipeline is 98.81%.
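    A hedged sketch of how the third (Otsu) stage and a simple fusion rule over the three stages' outputs could look: a CK detection is kept as a CTC candidate only if it overlaps a DAPI nucleus and is CD-45 negative. The overlap thresholds and the decision rule are illustrative and not the report's actual pipeline.

```python
import numpy as np
from skimage.filters import threshold_otsu

def cd45_positive_mask(cd45_channel):
    """Otsu threshold on the CD-45 channel; returns a binary positivity mask."""
    return cd45_channel > threshold_otsu(cd45_channel)

def is_ctc(ck_box, dapi_masks, cd45_channel, min_nucleus_overlap=0.3, max_cd45_frac=0.1):
    """ck_box: (x0, y0, x1, y1) from the CK detector; dapi_masks: list of binary
    nucleus masks from the Mask-RCNN stage; cd45_channel: 2D CD-45 intensity image."""
    x0, y0, x1, y1 = map(int, ck_box)
    box_area = max((x1 - x0) * (y1 - y0), 1)
    if cd45_positive_mask(cd45_channel)[y0:y1, x0:x1].mean() > max_cd45_frac:
        return False                                  # leukocyte marker present
    for mask in dapi_masks:                           # require a nucleus inside the box
        if mask[y0:y1, x0:x1].sum() / box_area >= min_nucleus_overlap:
            return True
    return False
```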
    Machine Learning based Medical Image Deepfake Detection: A Comparative Study. (arXiv:2109.12800v1 [cs.CV])
    (0 min) Deep generative networks in recent years have reinforced the need for caution while consuming various modalities of digital information. One avenue of deepfake creation is aligned with injection and removal of tumors from medical scans. Failure to detect medical deepfakes can lead to large setbacks on hospital resources or even loss of life. This paper attempts to address the detection of such attacks with a structured case study. We evaluate different machine learning algorithms and pretrained convolutional neural networks on distinguishing between tampered and untampered data. The findings of this work show near perfect accuracy in detecting instances of tumor injections and removals.
    Excavating the Potential Capacity of Self-Supervised Monocular Depth Estimation. (arXiv:2109.12484v1 [cs.CV])
    (0 min) Self-supervised methods play an increasingly important role in monocular depth estimation due to their great potential and low annotation cost. To close the gap with supervised methods, recent works take advantage of extra constraints, e.g., semantic segmentation. However, these methods will inevitably increase the burden on the model. In this paper, we show theoretical and empirical evidence that the potential capacity of self-supervised monocular depth estimation can be excavated without increasing this cost. In particular, we propose (1) a novel data augmentation approach called data grafting, which forces the model to explore more cues to infer depth besides the vertical image position, (2) an exploratory self-distillation loss, which is supervised by the self-distillation label generated by our new post-processing method - selective post-processing, and (3) the full-scale network, designed to endow the encoder with the specialization of depth estimation task and enhance the representational power of the model. Extensive experiments show that our contributions can bring significant performance improvement to the baseline with even less computational overhead, and our model, named EPCDepth, surpasses the previous state-of-the-art methods even those supervised by additional constraints.
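    A minimal sketch of a vertical band-swapping augmentation in the spirit of the data grafting described above, assuming (C, H, W) image tensors; the cut-ratio range is an assumption, the same cut must be applied to whatever supervision signal accompanies the images, and the exact grafting recipe in the paper may differ.

```python
import torch

def data_graft(img_a, img_b, ratio=None):
    """Swap horizontal bands between two training images so that depth can no
    longer be read off the vertical image position alone."""
    _, h, _ = img_a.shape
    r = ratio if ratio is not None else float(torch.empty(1).uniform_(0.3, 0.7))
    cut = int(h * r)
    grafted_ab = torch.cat([img_a[:, :cut], img_b[:, cut:]], dim=1)  # top of A, bottom of B
    grafted_ba = torch.cat([img_b[:, :cut], img_a[:, cut:]], dim=1)  # top of B, bottom of A
    return grafted_ab, grafted_ba
```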
    Nesterov Accelerated ADMM for Fast Diffeomorphic Image Registration. (arXiv:2109.12688v1 [cs.CV])
    (0 min) Deterministic approaches using iterative optimisation have been historically successful in diffeomorphic image registration (DiffIR). Although these approaches are highly accurate, they typically carry a significant computational burden. Recent developments in stochastic approaches based on deep learning have achieved sub-second runtimes for DiffIR with competitive registration accuracy, offering a fast alternative to conventional iterative methods. In this paper, we attempt to reduce this difference in speed whilst retaining the performance advantage of iterative approaches in DiffIR. We first propose a simple iterative scheme that functionally composes intermediate non-stationary velocity fields to handle large deformations in images whilst guaranteeing diffeomorphisms in the resultant deformation. We then propose a convex optimisation model that uses a regularisation term of arbitrary order to impose smoothness on these velocity fields and solve this model with a fast algorithm that combines Nesterov gradient descent and the alternating direction method of multipliers (ADMM). Finally, we leverage the computational power of GPU to implement this accelerated ADMM solver on a 3D cardiac MRI dataset, further reducing runtime to less than 2 seconds. In addition to producing strictly diffeomorphic deformations, our methods outperform both state-of-the-art deep learning-based and iterative DiffIR approaches in terms of dice and Hausdorff scores, with speed approaching the inference time of deep learning-based methods.
    Exploring Visual Context for Weakly Supervised Person Search. (arXiv:2106.10506v2 [cs.CV] UPDATED)
    (0 min) Person search has recently emerged as a challenging task that jointly addresses pedestrian detection and person re-identification. Existing approaches follow a fully supervised setting where both bounding box and identity annotations are available. However, annotating identities is labor-intensive, limiting the practicability and scalability of current frameworks. This paper considers weakly supervised person search with only bounding box annotations. We propose to address this novel task by investigating three levels of context clues (i.e., detection, memory and scene) in unconstrained natural images. The first two are employed to promote local and global discriminative capabilities, while the last enhances clustering accuracy. Despite its simple design, our CGPS achieves 80.0% mAP on CUHK-SYSU, boosting the baseline model by 8.8%. Surprisingly, it even achieves performance comparable to several supervised person search models. Our code is available at https://github.com/ljpadam/CGPS
    An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog. (arXiv:2109.12212v1 [cs.CL])
    (0 min) Online conversations include more than just text. Increasingly, image-based responses such as memes and animated gifs serve as culturally recognized and often humorous responses in conversation. However, while NLP has broadened to multimodal models, conversational dialog systems have largely focused only on generating text replies. Here, we introduce a new dataset of 1.56M text-gif conversation turns and introduce a new multimodal conversational model Pepe the King Prawn for selecting gif-based replies. We demonstrate that our model produces relevant and high-quality gif responses and, in a large randomized control trial of multiple models replying to real users, we show that our model replies with gifs that are significantly better received by the community.
    Bringing Generalization to Deep Multi-view Detection. (arXiv:2109.12227v1 [cs.CV])
    (0 min) Multi-view Detection (MVD) is highly effective for occlusion reasoning and is a mainstream solution in various applications that require accurate top-view occupancy maps. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to investigate them: i) generalization across a varying number of cameras, ii) generalization with varying camera positions, and finally, iii) generalization to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. We propose modifications in terms of pre-training, pooling strategy, regularization, and loss function to an existing state-of-the-art framework, leading to successful generalization across new camera configurations and new scenes. We perform a comprehensive set of experiments on the Wildtrack and MultiviewX datasets to (a) motivate the necessity to evaluate MVD methods on generalization abilities and (b) demonstrate the efficacy of the proposed approach. The code is publicly available at https://github.com/jeetv/GMVD
    Auditing AI models for Verified Deployment under Semantic Specifications. (arXiv:2109.12456v1 [cs.LG])
    (0 min) Auditing trained deep learning (DL) models prior to deployment is vital in preventing unintended consequences. One of the biggest challenges in auditing is in understanding how we can obtain human-interpretable specifications that are directly useful to the end-user. We address this challenge through a sequence of semantically-aligned unit tests, where each unit test verifies whether a predefined specification (e.g., accuracy over 95%) is satisfied with respect to controlled and semantically aligned variations in the input space (e.g., in face recognition, the angle relative to the camera). We perform these unit tests by directly verifying the semantically aligned variations in an interpretable latent space of a generative model. Our framework, AuditAI, bridges the gap between interpretable formal verification and scalability. With evaluations on four different datasets, covering images of towers, chest X-rays, human faces, and ImageNet classes, we show how AuditAI allows us to obtain controlled variations for verification and certified training while addressing the limitations of verifying using only pixel-space perturbations. A blog post accompanying the paper is at this link https://developer.nvidia.com/blog/nvidia-research-auditing-ai-models-for-verified-deployment-under-semantic-specifications
    DAMix: Density-Aware Data Augmentation for Unsupervised Domain Adaptation on Single Image Dehazing. (arXiv:2109.12544v1 [cs.CV])
    (0 min) Learning-based methods have achieved great success on single image dehazing in recent years. However, these methods are often subject to performance degradation when domain shifts are confronted. Specifically, haze density gaps exist among the existing datasets, often resulting in poor performance when these methods are tested across datasets. To address this issue, we propose a density-aware data augmentation method (DAMix) that generates synthetic hazy samples according to the haze density level of the target domain. These samples are generated by combining a hazy image with its corresponding ground truth by a combination ratio sampled from a density-aware distribution. They not only comply with the atmospheric scattering model but also bridge the haze density gap between the source and target domains. DAMix ensures that the model learns from examples featuring diverse haze densities. To better utilize the various hazy samples generated by DAMix, we develop a dual-branch dehazing network involving two branches that can adaptively remove haze according to the haze density of the region. In addition, the dual-branch design enlarges the learning capacity of the entire network; hence, our network can fully utilize the DAMix-ed samples. We evaluate the effectiveness of DAMix by applying it to the existing open-source dehazing methods. The experimental results demonstrate that all methods show significant improvements after DAMix is applied. Furthermore, by combining DAMix with our model, we can achieve state-of-the-art (SOTA) performance in terms of domain adaptation.
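    A minimal sketch of the mixing step as described, with my own assumption for the sampling distribution: blending a hazy image with its haze-free ground truth by a ratio lambda stays consistent with the atmospheric scattering model I = J*t + A*(1 - t), since the blend corresponds to a new transmission t' = 1 - lambda*(1 - t).

        # Density-aware mixing of a hazy image with its clear ground truth (illustrative sketch).
        import numpy as np

        def damix_sample(hazy, clear, rng, alpha=2.0, beta=5.0):
            # The blend ratio lam is drawn from a Beta distribution; in the paper the distribution is
            # fitted to the haze-density level of the target domain (Beta(2, 5) here is an assumption).
            lam = rng.beta(alpha, beta)
            # lam*hazy + (1-lam)*clear still satisfies I = J*t' + A*(1-t') with t' = 1 - lam*(1 - t).
            return lam * hazy + (1.0 - lam) * clear

        rng = np.random.default_rng(0)
        hazy, clear = np.random.rand(256, 256, 3), np.random.rand(256, 256, 3)
        augmented = damix_sample(hazy, clear, rng)
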
    Ground material classification for UAV-based photogrammetric 3D data: A 2D-3D Hybrid Approach. (arXiv:2109.12221v1 [cs.CV])
    (0 min) In recent years, photogrammetry has been widely used in many areas to create photorealistic 3D virtual data representing the physical environment. The innovation of small unmanned aerial vehicles (sUAVs) has provided additional high-resolution imaging capabilities with low cost for mapping a relatively large area of interest. These cutting-edge technologies have caught the US Army and Navy's attention for the purpose of rapid 3D battlefield reconstruction, virtual training, and simulations. Our previous works have demonstrated the importance of information extraction from the derived photogrammetric data to create semantic-rich virtual environments (Chen et al., 2019). For example, an increase of simulation realism and fidelity was achieved by segmenting and replacing photogrammetric trees with game-ready tree models. In this work, we further investigated the semantic information extraction problem and focused on the ground material segmentation and object detection tasks. The main innovation of this work was that we leveraged both the original 2D images and the derived 3D photogrammetric data to overcome the challenges faced when using each individual data source. For ground material segmentation, we utilized an existing convolutional neural network architecture (i.e., 3DMV) which was originally designed for segmenting RGB-D sensed indoor data. We improved its performance for outdoor photogrammetric data by introducing a depth pooling layer in the architecture to take into consideration the distance between the source images and the reconstructed terrain model. To test the performance of our improved 3DMV, a ground truth ground material database was created using data from the One World Terrain (OWT) data repository. Finally, a workflow for importing the segmented ground materials into a virtual simulation scene was introduced, and visual results are reported in this paper.
    Improving the Thermal Infrared Monitoring of Volcanoes: A Deep Learning Approach for Intermittent Image Series. (arXiv:2109.12767v1 [cs.CV])
    (0 min) Active volcanoes are globally distributed and pose societal risks at multiple geographic scales, ranging from local hazards to regional/international disruptions. Many volcanoes do not have continuous ground monitoring networks, meaning that satellite observations provide the only record of volcanic behavior and unrest. Among these remote sensing observations, thermal imagery is inspected daily by volcanic observatories for examining the early signs, onset, and evolution of eruptive activity. However, thermal scenes are often obstructed by clouds, meaning that forecasts must be made from image sequences whose scenes are only usable intermittently through time. Here, we explore forecasting this thermal data stream from a deep learning perspective using existing architectures that model sequences with varying spatiotemporal considerations. Additionally, we propose and evaluate new architectures that explicitly model intermittent image sequences. Using ASTER Kinetic Surface Temperature data for 9 volcanoes between 1999 and 2020, we found that a proposed architecture (ConvLSTM + Time-LSTM + U-Net) forecasts volcanic temperature imagery with the lowest RMSE (4.164 °C; other methods: 4.217-5.291 °C). Additionally, we examined performance on multiple time series derived from the thermal imagery and the effect of training with data from singular volcanoes. Ultimately, we found that models with the lowest RMSE on forecasting imagery did not possess the lowest RMSE on recreating time series derived from that imagery, and that training with individual volcanoes generally worsened performance relative to a multi-volcano data set. This work highlights the potential of data-driven deep learning models for volcanic unrest forecasting while revealing the need for carefully constructed optimization targets.
    Point Transformer. (arXiv:2012.09164v2 [cs.CV] UPDATED)
    (0 min) Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks. For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of 70.4% on Area 5, outperforming the strongest prior model by 3.3 absolute percentage points and crossing the 70% mIoU threshold for the first time.
    A New Journey from SDRTV to HDRTV. (arXiv:2108.07978v2 [eess.IV] UPDATED)
    (0 min) Modern displays are capable of rendering video content with high dynamic range (HDR) and wide color gamut (WCG). However, most available resources are still in standard dynamic range (SDR). Therefore, there is an urgent demand to transform existing SDRTV content into its HDRTV version. In this paper, we conduct an analysis of the SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content. Based on this analysis, we propose a three-step solution pipeline including adaptive global color mapping, local enhancement and highlight generation. Moreover, the analysis inspires us to present a lightweight network that utilizes global statistics as guidance to conduct image-adaptive color mapping. In addition, we construct a dataset using HDR videos in the HDR10 standard, named HDRTV1K, and select five metrics to evaluate the results of SDRTV-to-HDRTV algorithms. Our final results achieve state-of-the-art performance in quantitative comparisons and visual quality. The code and dataset are available at https://github.com/chxy95/HDRTVNet.
    Joint Progressive and Coarse-to-fine Registration of Brain MRI via Deformation Field Integration and Non-Rigid Feature Fusion. (arXiv:2109.12384v1 [eess.IV])
    (0 min) Registration of brain MRI images requires solving for a deformation field, which is extremely difficult when aligning intricate brain tissues, e.g., subcortical nuclei. Existing efforts resort to decomposing the target deformation field into intermediate sub-fields with either tiny motions, i.e., progressive registration stage by stage, or lower resolutions, i.e., coarse-to-fine estimation of the full-size deformation field. In this paper, we argue that these efforts are not mutually exclusive, and propose a unified framework for robust brain MRI registration in both progressive and coarse-to-fine manners simultaneously. Specifically, building on a dual-encoder U-Net, the fixed-moving MRI pair is encoded and decoded into multi-scale deformation sub-fields from coarse to fine. Each decoding block contains two novel modules: i) Deformation Field Integration (DFI), in which a single integrated sub-field is calculated, warping by which is equivalent to warping progressively by the sub-fields from all previous decoding blocks, and ii) Non-rigid Feature Fusion (NFF), in which features of the fixed-moving pair are aligned by the DFI-integrated sub-field and then fused to predict a finer sub-field. Leveraging both DFI and NFF, the target deformation field is factorized into multi-scale sub-fields, where the coarser fields ease the estimation of finer ones and the finer fields learn to correct the misalignments that previous coarser ones cannot resolve. Extensive and comprehensive experimental results on both private and public datasets demonstrate superior registration performance for brain MRI images over progressive registration alone and coarse-to-fine estimation alone, with an improvement of up to 10% in average Dice.
  • cs.IR updates on arXiv.org

    Text to Insight: Accelerating Organic Materials Knowledge Extraction via Deep Learning. (arXiv:2109.12758v1 [cs.CL])
    (2 min) Scientific literature is one of the most significant resources for sharing knowledge. Researchers turn to scientific literature as a first step in designing an experiment. Given the extensive and growing volume of literature, the common approach of reading and manually extracting knowledge is too time-consuming, creating a bottleneck in the research cycle. This challenge spans nearly every scientific domain. In materials science, experimental data distributed across millions of publications are extremely helpful for predicting material properties and designing novel materials. However, researchers have only recently explored computational approaches for knowledge extraction, primarily for inorganic materials. This study aims to explore knowledge extraction for organic materials. We built a research dataset composed of 855 annotated and 708,376 unannotated sentences drawn from 92,667 abstracts. We used named entity recognition (NER) with a BiLSTM-CNN-CRF deep learning model to automatically extract key knowledge from the literature. Early-phase results show a high potential for automated knowledge extraction. The paper presents our findings and a framework for supervised knowledge extraction that can be adapted to other scientific domains.
    Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark. (arXiv:2109.12769v1 [cs.LG])
    (2 min) Developing new drugs for target diseases is a time-consuming and expensive task, so drug repurposing has become a popular topic in the drug development field. As more health claims data become available, many studies have been conducted on such data. Real-world data are noisy, sparse, and have many confounding factors. In addition, many studies have shown that drug effects are heterogeneous across the population. Many advanced machine learning models for estimating heterogeneous treatment effects (HTE) have emerged in recent years and have been applied in the econometrics and machine learning communities. These studies acknowledge medicine and drug development as the main application area, but there has been limited translational research from the HTE methodology to drug development. We aim to introduce the HTE methodology to the healthcare area and provide feasibility considerations for translating the methodology, with benchmark experiments on healthcare administrative claims data. We also use the benchmark experiments to show how to interpret and evaluate such models when they are applied to healthcare research. By introducing recent HTE techniques to a broad readership in the biomedical informatics community, we expect to promote the wide adoption of causal inference using machine learning. We also expect to demonstrate the feasibility of HTE for personalized drug effectiveness.
    SimpleX: A Simple and Strong Baseline for Collaborative Filtering. (arXiv:2109.12613v1 [cs.IR])
    (2 min) Collaborative filtering (CF) is a widely studied research topic in recommender systems. The learning of a CF model generally depends on three major components, namely the interaction encoder, the loss function, and negative sampling. While many existing studies focus on the design of more powerful interaction encoders, the impacts of loss functions and negative sampling ratios have not yet been well explored. In this work, we show that the choice of loss function and negative sampling ratio is equally important. More specifically, we propose the cosine contrastive loss (CCL) and further incorporate it into a simple unified CF model, dubbed SimpleX. Extensive experiments have been conducted on 11 benchmark datasets against 29 existing CF models in total. Surprisingly, the results show that, under our CCL loss and a large negative sampling ratio, SimpleX can surpass most sophisticated state-of-the-art models by a large margin (e.g., a maximum 48.5% improvement in NDCG@20 over LightGCN). We believe that SimpleX can not only serve as a simple, strong baseline to foster future research on CF, but also shed light on a potential research direction towards improving loss functions and negative sampling.
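    A compact PyTorch sketch of a cosine contrastive loss in the spirit described above: pull the user and positive-item embeddings together in cosine space and penalize negatives only once their similarity exceeds a margin. The margin and negative weight below are placeholders, not the paper's tuned settings.

        # Cosine contrastive loss sketch (placeholder hyper-parameters).
        import torch
        import torch.nn.functional as F

        def cosine_contrastive_loss(user, pos_item, neg_items, margin=0.8, neg_weight=1.0):
            # user: (B, d); pos_item: (B, d); neg_items: (B, K, d)
            u = F.normalize(user, dim=-1)
            p = F.normalize(pos_item, dim=-1)
            n = F.normalize(neg_items, dim=-1)
            pos_loss = 1.0 - (u * p).sum(dim=-1)                  # pull positives towards cosine similarity 1
            neg_sim = torch.einsum('bd,bkd->bk', u, n)            # cosine similarity to each sampled negative
            neg_loss = F.relu(neg_sim - margin).mean(dim=-1)      # penalize only negatives above the margin
            return (pos_loss + neg_weight * neg_loss).mean()

        loss = cosine_contrastive_loss(torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 100, 64))
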
    Topic Model Robustness to Automatic Speech Recognition Errors in Podcast Transcripts. (arXiv:2109.12306v1 [cs.IR])
    (2 min) For a multilingual podcast streaming service, it is critical to be able to deliver relevant content to all users independent of language. Podcast content relevance is conventionally determined using various metadata sources. However, with the increasing quality of speech recognition in many languages, utilizing automatic transcriptions to provide better content recommendations becomes possible. In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine. Specifically, we explore how increasing transcription noise influences topics obtained from transcriptions in Danish; a low resource language. First, we observe a baseline of cosine similarity scores between topic embeddings from automatic transcriptions and the descriptions of the podcasts written by the podcast creators. We then observe how the cosine similarities decrease as transcription noise increases and conclude that even when automatic speech recognition transcripts are erroneous, it is still possible to obtain high-quality topic embeddings from the transcriptions.
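    As a toy illustration of the evaluation idea (not the authors' pipeline, and with a made-up corpus), one can fit an LDA model, inject word-level "ASR" errors into a transcript, and track how the cosine similarity between the clean and noisy topic vectors changes:

        # Toy check of LDA topic-vector robustness to transcription noise (illustrative corpus only).
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = ["football match goal score team", "election vote parliament policy",
                "music concert band guitar song", "football team win league goal"]
        vec = CountVectorizer()
        X = vec.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

        def topic_vector(text):
            return lda.transform(vec.transform([text]))[0]

        clean = "football team goal score win"
        noisy = "football team gaol scone win"                    # simulated ASR word errors
        a, b = topic_vector(clean), topic_vector(noisy)
        print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))   # cosine similarity, clean vs noisy
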
    Improving Question Answering Performance Using Knowledge Distillation and Active Learning. (arXiv:2109.12662v1 [cs.CL])
    (2 min) Contemporary question answering (QA) systems, including transformer-based architectures, suffer from increasing computational and model complexity which render them inefficient for real-world applications with limited resources. Further, training or even fine-tuning such models requires a vast amount of labeled data which is often not available for the task at hand. In this manuscript, we conduct a comprehensive analysis of the mentioned challenges and introduce suitable countermeasures. We propose a novel knowledge distillation (KD) approach to reduce the parameter and model complexity of a pre-trained BERT system and utilize multiple active learning (AL) strategies for immense reduction in annotation efforts. In particular, we demonstrate that our model achieves the performance of a 6-layer TinyBERT and DistilBERT, whilst using only 2% of their total parameters. Finally, by the integration of our AL approaches into the BERT framework, we show that state-of-the-art results on the SQuAD dataset can be achieved when we only use 20% of the training data.
    COVID-19 Twitter Dataset with Latent Topics, Sentiments and Emotions Attributes. (arXiv:2007.06954v7 [cs.CL] UPDATED)
    (3 min) This paper describes a large global dataset on people's social media responses to the COVID-19 pandemic over the Twitter platform. From 28 January 2020 to 1 September 2021, we collected over 198 million Twitter posts from more than 25 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging topic modeling techniques and pre-trained machine learning-based emotion analytic algorithms, we labeled each tweet with seventeen semantic attributes, including a) ten binary attributes indicating the tweet's relevance or irrelevance to the top ten detected topics, b) five quantitative emotion attributes indicating the degree of intensity of the valence or sentiment (from 0: very negative to 1: very positive), and the degree of intensity of fear, anger, happiness and sadness emotions (from 0: not at all to 1: extremely intense), and c) two qualitative attributes indicating the sentiment category (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion category (fear, anger, happiness, sadness, no specific emotion) the tweet is mainly expressing. We report the descriptive statistics around these new attributes, their temporal distributions, and the overall geographic representation of the dataset. The paper concludes with an outline of the dataset's possible usage in communication, psychology, public health, economics, and epidemiology.
    MC$^2$-SF: Slow-Fast Learning for Mobile-Cloud Collaborative Recommendation. (arXiv:2109.12314v1 [cs.IR])
    (2 min) With the hardware development of mobile devices, it is possible to build recommendation models on the mobile side to utilize fine-grained features and real-time feedback. In contrast to straightforward mobile-based modeling appended to cloud-based modeling, we propose a Slow-Fast learning mechanism to make Mobile-Cloud Collaborative recommendation (MC$^2$-SF) mutually beneficial. Specifically, in MC$^2$-SF, the cloud-based model and the mobile-based model are respectively treated as the slow component and the fast component, according to their interaction frequency in real-world scenarios. During training and serving, they communicate prior/privileged knowledge to each other to help better capture user interests about the candidates, resembling the roles of System 1 and System 2 in human cognition. We conduct extensive experiments on three benchmark datasets and demonstrate that the proposed MC$^2$-SF outperforms several state-of-the-art methods.
    Dynamic Sequential Graph Learning for Click-Through Rate Prediction. (arXiv:2109.12541v1 [cs.IR])
    (2 min) Click-through rate prediction plays an important role in the field of recommender system and many other applications. Existing methods mainly extract user interests from user historical behaviors. However, behavioral sequences only contain users' directly interacted items, which are limited by the system's exposure, thus they are often not rich enough to reflect all the potential interests. In this paper, we propose a novel method, named Dynamic Sequential Graph Learning (DSGL), to enhance users or items' representations by utilizing collaborative information from the local sub-graphs associated with users or items. Specifically, we design the Dynamic Sequential Graph (DSG), i.e., a lightweight ego subgraph with timestamps induced from historical interactions. At every scoring moment, we construct DSGs for the target user and the candidate item respectively. Based on the DSGs, we perform graph convolutional operations iteratively in a bottom-up manner to obtain the final representations of the target user and the candidate item. As for the graph convolution, we design a Time-aware Sequential Encoding Layer that leverages the interaction time information as well as temporal dependencies to learn evolutionary user and item dynamics. Besides, we propose a Target-Preference Dual Attention Layer, composed of a preference-aware attention module and a target-aware attention module, to automatically search for parts of behaviors that are relevant to the target and alleviate the noise from unreliable neighbors. Results on real-world CTR prediction benchmarks demonstrate the improvements brought by DSGL.
    DemiNet: Dependency-Aware Multi-Interest Network with Self-Supervised Graph Learning for Click-Through Rate Prediction. (arXiv:2109.12512v1 [cs.IR])
    (2 min) In this paper, we propose a novel model named DemiNet (short for DEpendency-Aware Multi-Interest Network) for click-through rate prediction. To be specific, we first consider various dependency types between item nodes and perform dependency-aware heterogeneous attention for denoising and obtaining accurate sequence item representations. Secondly, for multiple-interest extraction, multi-head attention is conducted on top of the graph embedding. To filter out noisy inter-item correlations and enhance the robustness of extracted interests, self-supervised interest learning is introduced to the above two steps. Thirdly, to aggregate the multiple interests, interest experts corresponding to different interest routes give rating scores respectively, while a specialized network assigns the confidence of each score. Experimental results on three real-world datasets demonstrate that the proposed DemiNet significantly improves the overall recommendation performance over several state-of-the-art baselines. Further studies verify the efficacy and interpretability benefits brought by the fine-grained user interest modeling.
    Why Do We Click: Visual Impression-aware News Recommendation. (arXiv:2109.12651v1 [cs.IR])
    (2 min) There is a soaring interest in the news recommendation research scenario due to the information overload. To accurately capture users' interests, we propose to model multi-modal features, in addition to the news titles that are widely used in existing works, for news recommendation. Besides, existing research pays little attention to the click decision-making process in designing multi-modal modeling modules. In this work, inspired by the fact that users make their click decisions mostly based on the visual impression they perceive when browsing news, we propose to capture such visual impression information with visual-semantic modeling for news recommendation. Specifically, we devise the local impression modeling module to simultaneously attend to decomposed details in the impression when understanding the semantic meaning of news title, which could explicitly get close to the process of users reading news. In addition, we inspect the impression from a global view and take structural information, such as the arrangement of different fields and spatial position of different words on the impression, into the modeling of multiple modalities. To accommodate the research of visual impression-aware news recommendation, we extend the text-dominated news recommendation dataset MIND by adding snapshot impression images and will release it to nourish the research field. Extensive comparisons with the state-of-the-art news recommenders along with the in-depth analyses demonstrate the effectiveness of the proposed method and the promising capability of modeling visual impressions for the content-based recommenders.
    Extracting and Inferring Personal Attributes from Dialogue. (arXiv:2109.12702v1 [cs.CL])
    (2 min) Personal attributes represent structured information about a person, such as their hobbies, pets, family, likes and dislikes. In this work, we introduce the tasks of extracting and inferring personal attributes from human-human dialogue. We first demonstrate the benefit of incorporating personal attributes in a social chit-chat dialogue model and a task-oriented dialogue setting. Thus motivated, we propose the tasks of personal attribute extraction and inference, and then analyze the linguistic demands of these tasks. To meet these challenges, we introduce a simple and extensible model that combines an autoregressive language model with constrained attribute generation and a discriminative reranker. Our model outperforms strong baselines both on extracting personal attributes and on inferring personal attributes that are not stated verbatim in utterances and instead require commonsense reasoning and lexical inference, which occur frequently in everyday conversation.
    Deep Exploration for Recommendation Systems. (arXiv:2109.12509v1 [cs.IR])
    (2 min) We investigate the design of recommendation systems that can efficiently learn from sparse and delayed feedback. Deep Exploration can play an important role in such contexts, enabling a recommendation system to much more quickly assess a user's needs and personalize service. We design an algorithm based on Thompson Sampling that carries out Deep Exploration. We demonstrate through simulations that the algorithm can substantially amplify the rate of positive feedback relative to common recommendation system designs in a scalable fashion. These results demonstrate promise that we hope will inspire engineering of production recommendation systems that leverage Deep Exploration.
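    To make the starting point concrete, here is a textbook Beta-Bernoulli Thompson Sampling loop for picking items; the paper's Deep Exploration algorithm goes well beyond this sketch, which is included only as an illustration of posterior-sampling-driven exploration.

        # Textbook Beta-Bernoulli Thompson Sampling for picking items (illustrative baseline only).
        import numpy as np

        rng = np.random.default_rng(0)
        true_ctr = np.array([0.02, 0.05, 0.10])       # unknown click rates of 3 items
        alpha, beta = np.ones(3), np.ones(3)          # Beta posterior parameters per item

        for _ in range(5000):
            sampled = rng.beta(alpha, beta)           # one posterior sample per item
            item = int(np.argmax(sampled))            # recommend the item that looks best under the sample
            click = rng.random() < true_ctr[item]     # simulated user feedback
            alpha[item] += click
            beta[item] += 1 - click

        print(alpha / (alpha + beta))                 # posterior mean CTR estimates
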
    When Product Search Meets Collaborative Filtering: A Hierarchical Heterogeneous Graph Neural Network Approach. (arXiv:2108.07574v2 [cs.IR] UPDATED)
    (2 min) Personalization lies at the core of boosting product search performance. Prior studies mainly resorted to semantic matching between textual queries and user/product related documents, leaving users' collaborative behaviors untapped. In fact, collaborative filtering signals between users intuitively offer complementary information for semantic matching. To close the gap between collaborative filtering and product search, we propose a Hierarchical Heterogeneous Graph Neural Network (HHGNN) approach in this paper. Specifically, we organize HHGNN with a hierarchical graph structure according to three edge types. The sequence edge accounts for the syntactic formulation from word nodes to sentence nodes; the composition edge aggregates semantic features to the user and product nodes; and the interaction edge on top performs graph convolutional operations between user and product nodes. Finally, we integrate the higher-order neighboring collaborative features and the semantic features for better representation learning. We conduct extensive experiments on six Amazon review datasets. The results show that our proposed method outperforms state-of-the-art baselines by a large margin. In addition, we empirically show that collaborative filtering and semantic matching are complementary to each other in enhancing product search performance.
  • cs.LG updates on arXiv.org

    Copula-based synthetic data augmentation for machine-learning emulators. (arXiv:2012.09037v3 [cs.LG] UPDATED)
    (2 min) Can we improve machine-learning (ML) emulators with synthetic data? If data are scarce or expensive to source and a physical model is available, statistically generated data may be useful for augmenting training sets cheaply. Here we explore the use of copula-based models for generating synthetically augmented datasets in weather and climate by testing the method on a toy physical model of downwelling longwave radiation and corresponding neural network emulator. Results show that for copula-augmented datasets, predictions are improved by up to 62 % for the mean absolute error (from 1.17 to 0.44 W m$^{-2}$).
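    A small sketch of the general recipe (not the authors' exact copula family or marginals): estimate a Gaussian copula from the training features, sample new correlated uniforms, and map them back through the empirical marginals to obtain synthetic rows for augmentation.

        # Gaussian-copula synthetic augmentation sketch (simplified empirical marginals).
        import numpy as np
        from scipy import stats

        def copula_augment(data, n_new, rng):
            n, d = data.shape
            ranks = np.argsort(np.argsort(data, axis=0), axis=0) + 1       # ranks 1..n per column
            z = stats.norm.ppf(ranks / (n + 1.0))                          # normal scores
            corr = np.corrcoef(z, rowvar=False)                            # copula correlation matrix
            z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_new) # correlated normal samples
            u_new = stats.norm.cdf(z_new)
            sorted_cols = np.sort(data, axis=0)                            # empirical marginals
            idx = np.clip((u_new * n).astype(int), 0, n - 1)
            return np.take_along_axis(sorted_cols, idx, axis=0)            # map uniforms back to data scale

        rng = np.random.default_rng(0)
        real = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
        synthetic = copula_augment(real, n_new=500, rng=rng)
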
    An embarrassingly simple comparison of machine learning algorithms for indoor scene classification. (arXiv:2109.12261v1 [cs.CV])
    (2 min) With the emergence of autonomous indoor robots, the computer vision task of indoor scene recognition has gained the spotlight. Indoor scene recognition is a challenging problem in computer vision that relies on local and global features in a scene. This study aims to compare the performance of five machine learning algorithms on the task of indoor scene classification to identify the pros and cons of each classifier. It also provides a comparison of low latency feature extractors versus enormous feature extractors to understand the performance effects. Finally, a simple MnasNet based indoor classification system is proposed, which can achieve 72% accuracy at 23 ms latency.
    Data-driven discovery of multiscale chemical reactions governed by the law of mass action. (arXiv:2101.06589v3 [physics.chem-ph] UPDATED)
    (2 min) In this paper, we propose a data-driven method to discover multiscale chemical reactions governed by the law of mass action. First, we use a single matrix to represent the stoichiometric coefficients for both the reactants and products in a system without catalysis reactions. The negative entries in the matrix denote the stoichiometric coefficients of the reactants and the positive ones those of the products. Second, we find that conventional optimization methods usually get stuck in local minima and cannot find the true solution when learning multiscale chemical reactions. To overcome this difficulty, we propose a partial-parameters-freezing (PPF) technique to progressively determine the network parameters by exploiting the fact that the stoichiometric coefficients are integers. With this technique, the dimension of the search space is gradually reduced during training and the global minimum can eventually be obtained. Several numerical experiments, including the classical Michaelis-Menten kinetics, the hydrogen oxidation reactions, and the simplified GRI-3.0 mechanism, verify the good performance of our algorithm in learning multiscale chemical reactions. The code is available at https://github.com/JuntaoHuang/multiscale-chemical-reaction
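    The key trick is the partial-parameters-freezing (PPF) step: coefficients that have drifted close to an integer are snapped to that integer and excluded from further updates, shrinking the search space. A hedged PyTorch-style sketch of that step, under my own minimal reading of it:

        # Partial-parameters-freezing sketch: snap near-integer stoichiometric coefficients and
        # zero their gradients in later steps (illustrative, not the paper's code).
        import torch

        coeffs = torch.nn.Parameter(torch.tensor([1.97, -1.02, 0.46, 2.51]))
        frozen = torch.zeros_like(coeffs, dtype=torch.bool)

        def freeze_near_integers(tol=0.05):
            global frozen
            with torch.no_grad():
                near = (coeffs - coeffs.round()).abs() < tol
                coeffs[near] = coeffs[near].round()       # snap entries close to an integer
                frozen |= near

        freeze_near_integers()
        loss = (coeffs ** 2).sum()                        # stand-in for the mass-action fitting loss
        loss.backward()
        coeffs.grad[frozen] = 0.0                         # frozen entries no longer move
        print(coeffs.detach(), frozen)
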
    Integrating Unsupervised Clustering and Label-specific Oversampling to Tackle Imbalanced Multi-label Data. (arXiv:2109.12421v1 [cs.LG])
    (2 min) Multi-label datasets often contain a mixture of very frequent and very infrequent labels. This variation in label frequency, a type of class imbalance, creates a significant challenge for building efficient multi-label classification algorithms. In this paper, we tackle this problem by proposing a minority class oversampling scheme, UCLSO, which integrates Unsupervised Clustering and Label-Specific data Oversampling. Clustering is performed to identify the key distinct and locally connected regions of a multi-label dataset (irrespective of the label information). Next, for each label, we explore the distributions of minority points in the cluster sets. Only the minority points within a cluster are used to generate the synthetic minority points used for oversampling. Even though the cluster set is the same across all labels, the distributions of the synthetic minority points will vary across labels. The training dataset is augmented with the set of label-specific synthetic minority points, and classifiers are trained to predict the relevance of each label independently. Experiments using 12 multi-label datasets and several multi-label algorithms show that the proposed method performs very well compared to competing algorithms.
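    A simplified sketch of the two ingredients as described (cluster once, then interpolate between minority points of a given label inside each cluster); the cluster count and interpolation rule below are placeholders rather than the paper's exact settings.

        # Cluster-then-oversample sketch for one label of a multi-label dataset (illustrative).
        import numpy as np
        from sklearn.cluster import KMeans

        def label_specific_oversample(X, y_label, n_clusters=3, n_new=50, seed=0):
            # X: (n, d) features; y_label: (n,) binary relevance of one label.
            rng = np.random.default_rng(seed)
            clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
            synthetic = []
            for c in range(n_clusters):
                minority = X[(clusters == c) & (y_label == 1)]
                if len(minority) < 2:
                    continue                               # nothing to interpolate within this cluster
                for _ in range(n_new // n_clusters):
                    i, j = rng.choice(len(minority), size=2, replace=False)
                    lam = rng.random()
                    synthetic.append(lam * minority[i] + (1 - lam) * minority[j])
            return np.array(synthetic)

        X = np.random.rand(300, 8)
        y = (np.random.rand(300) < 0.1).astype(int)        # an infrequent label
        new_points = label_specific_oversample(X, y)
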
    Bayesian Nonparametric Dimensionality Reduction of Categorical Data for Predicting Severity of COVID-19 in Pregnant Women. (arXiv:2011.03715v2 [stat.AP] UPDATED)
    (2 min) The coronavirus disease (COVID-19) has rapidly spread throughout the world and while pregnant women present the same adverse outcome rates, they are underrepresented in clinical research. We collected clinical data of 155 test-positive COVID-19 pregnant women at Stony Brook University Hospital. Many of these collected data are of multivariate categorical type, where the number of possible outcomes grows exponentially as the dimension of data increases. We modeled the data within the unsupervised Bayesian framework and mapped them into a lower-dimensional space using latent Gaussian processes. The latent features in the lower dimensional space were further used for predicting if a pregnant woman would be admitted to a hospital due to COVID-19 or would remain with mild symptoms. We compared the prediction accuracy with the dummy/one-hot encoding of categorical data and found that the latent Gaussian process had better accuracy.
    Model-Free Learning of Optimal Deterministic Resource Allocations in Wireless Systems via Action-Space Exploration. (arXiv:2108.10352v2 [eess.SY] UPDATED)
    (2 min) Wireless systems resource allocation refers to perpetual and challenging nonconvex constrained optimization tasks, which are especially timely in modern communications and networking setups involving multiple users with heterogeneous objectives and imprecise or even unknown models and/or channel statistics. In this paper, we propose a technically grounded and scalable primal-dual deterministic policy gradient method for efficiently learning optimal parameterized resource allocation policies. Our method not only efficiently exploits gradient availability of popular universal policy representations, such as deep neural networks, but is also truly model-free, as it relies on consistent zeroth-order gradient approximations of the associated random network services constructed via low-dimensional perturbations in action space, thus fully bypassing any dependence on critics. Both theory and numerical simulations confirm the efficacy and applicability of the proposed approach, as well as its superiority over the current state of the art in terms of both achieving near-optimal performance and scalability.
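    The zeroth-order ingredient itself is simple to illustrate: perturb the decision variable in a few random directions and use finite differences of the observed objective as a surrogate gradient. A bare-bones sketch with an assumed black-box utility f (not the paper's constrained primal-dual algorithm):

        # Zeroth-order (finite-difference) gradient estimate via random perturbations (illustrative).
        import numpy as np

        def zo_gradient(f, a, n_dirs=4, delta=1e-2, rng=np.random.default_rng(0)):
            # Estimate grad f(a) from n_dirs random directions using two-sided finite differences.
            g = np.zeros_like(a)
            for _ in range(n_dirs):
                u = rng.standard_normal(a.shape)
                u /= np.linalg.norm(u)
                g += (f(a + delta * u) - f(a - delta * u)) / (2 * delta) * u
            return g / n_dirs

        f = lambda a: -np.sum((a - 1.0) ** 2)              # stand-in for an unknown network utility
        a = np.zeros(5)
        for _ in range(200):
            a += 0.1 * zo_gradient(f, a)                   # ascend the estimated gradient
        print(np.round(a, 2))                              # approaches the optimum at 1.0
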
    ConE: A Concurrent Edit Detection Tool for Large Scale Software Development. (arXiv:2101.06542v3 [cs.SE] UPDATED)
    (3 min) Modern, complex software systems are being continuously extended and adjusted. The developers responsible for this may come from different teams or organizations, and may be distributed over the world. This may make it difficult to keep track of what other developers are doing, which may result in multiple developers concurrently editing the same code areas. This, in turn, may lead to hard-to-merge changes or even merge conflicts, logical bugs that are difficult to detect, duplication of work, and wasted developer productivity. To address this, we explore the extent of this problem in the pull request based software development model. We study half a year of changes made to six large repositories in Microsoft in which at least 1,000 pull requests are created each month. We find that files concurrently edited in different pull requests are more likely to introduce bugs. Motivated by these findings, we design, implement, and deploy a service named ConE (Concurrent Edit Detector) that proactively detects pull requests containing concurrent edits, to help mitigate the problems caused by them. ConE has been designed to scale, and to minimize false alarms while still flagging relevant concurrently edited files. Key concepts of ConE include the detection of the Extent of Overlap between pull requests, and the identification of Rarely Concurrently Edited Files. To evaluate ConE, we report on its operational deployment on 234 repositories inside Microsoft. ConE assessed 26,000 pull requests and made 775 recommendations about conflicting changes, which were rated as useful in over 70% (554) of the cases. From interviews with 48 users we learned that they believed ConE would save time in conflict resolution and avoiding duplicate work, and that over 90% intend to keep using the service on a daily basis.
    Learning Multimodal Rewards from Rankings. (arXiv:2109.12750v1 [cs.LG])
    (2 min) Learning from human feedback has been shown to be a useful approach for acquiring robot reward functions. However, expert feedback is often assumed to be drawn from an underlying unimodal reward function. This assumption does not always hold, for example in settings where multiple experts provide data or when a single expert provides data for different tasks -- we thus go beyond learning a unimodal reward and focus on learning a multimodal reward function. We formulate multimodal reward learning as a mixture learning problem and develop a novel ranking-based learning approach, where the experts are only required to rank a given set of trajectories. Furthermore, as access to interaction data is often expensive in robotics, we develop an active querying approach to accelerate the learning process. We conduct experiments and user studies using a multi-task variant of OpenAI's LunarLander and a real Fetch robot, where we collect data from multiple users with different preferences. The results suggest that our approach can efficiently learn multimodal reward functions and improve data-efficiency over benchmark methods that we adapt to our learning problem.
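    As a stripped-down illustration of the ranking-based ingredient, reduced to a single reward mode and pairwise comparisons (the paper handles full rankings and a mixture of modes), a linear reward can be fit with a Bradley-Terry style loss:

        # Pairwise (Bradley-Terry) reward learning from trajectory comparisons (simplified sketch).
        import torch

        torch.manual_seed(0)
        true_w = torch.tensor([1.0, -2.0, 0.5])             # hidden reward weights over trajectory features
        feats = torch.randn(200, 2, 3)                       # 200 pairs of trajectory feature vectors
        prefs = (feats[:, 0] @ true_w > feats[:, 1] @ true_w).float()   # 1 if the first trajectory is preferred

        w = torch.zeros(3, requires_grad=True)
        opt = torch.optim.Adam([w], lr=0.05)
        for _ in range(300):
            r0, r1 = feats[:, 0] @ w, feats[:, 1] @ w
            # P(first preferred) = sigmoid(r0 - r1); maximize the likelihood of the observed preferences.
            loss = torch.nn.functional.binary_cross_entropy_with_logits(r0 - r1, prefs)
            opt.zero_grad(); loss.backward(); opt.step()

        print(w.detach())                                    # roughly aligned with the direction of true_w
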
    Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark. (arXiv:2109.12769v1 [cs.LG])
    (2 min) Developing new drugs for target diseases is a time-consuming and expensive task, so drug repurposing has become a popular topic in the drug development field. As more health claims data become available, many studies have been conducted on such data. Real-world data are noisy, sparse, and have many confounding factors. In addition, many studies have shown that drug effects are heterogeneous across the population. Many advanced machine learning models for estimating heterogeneous treatment effects (HTE) have emerged in recent years and have been applied in the econometrics and machine learning communities. These studies acknowledge medicine and drug development as the main application area, but there has been limited translational research from the HTE methodology to drug development. We aim to introduce the HTE methodology to the healthcare area and provide feasibility considerations for translating the methodology, with benchmark experiments on healthcare administrative claims data. We also use the benchmark experiments to show how to interpret and evaluate such models when they are applied to healthcare research. By introducing recent HTE techniques to a broad readership in the biomedical informatics community, we expect to promote the wide adoption of causal inference using machine learning. We also expect to demonstrate the feasibility of HTE for personalized drug effectiveness.
    A Sequential Framework Towards an Exact SDP Verification of Neural Networks. (arXiv:2010.08603v2 [cs.LG] UPDATED)
    (2 min) Although neural networks have been applied to several systems in recent years, they still cannot be used in safety-critical systems due to the lack of efficient techniques to certify their robustness. A number of techniques based on convex optimization have been proposed in the literature to study the robustness of neural networks, and the semidefinite programming (SDP) approach has emerged as a leading contender for the robust certification of neural networks. The major challenge to the SDP approach is that it is prone to a large relaxation gap. In this work, we address this issue by developing a sequential framework to shrink this gap to zero by adding non-convex cuts to the optimization problem via disjunctive programming. We analyze the performance of this sequential SDP method both theoretically and empirically, and show that it bridges the gap as the number of cuts increases.
    Distributionally Robust Multiclass Classification and Applications in Deep CNN Image Classifiers. (arXiv:2109.12772v1 [stat.ML])
    (2 min) We develop a Distributionally Robust Optimization (DRO) formulation for Multiclass Logistic Regression (MLR), which could tolerate data contaminated by outliers. The DRO framework uses a probabilistic ambiguity set defined as a ball of distributions that are close to the empirical distribution of the training set in the sense of the Wasserstein metric. We relax the DRO formulation into a regularized learning problem whose regularizer is a norm of the coefficient matrix. We establish out-of-sample performance guarantees for the solutions to our model, offering insights on the role of the regularizer in controlling the prediction error. We apply the proposed method in rendering deep CNN-based image classifiers robust to random and adversarial attacks. Specifically, using the MNIST and CIFAR-10 datasets, we demonstrate reductions in test error rate by up to 78.8% and loss by up to 90.8%. We also show that with a limited number of perturbed images in the training set, our method can improve the error rate by up to 49.49% and the loss by up to 68.93% compared to Empirical Risk Minimization (ERM), converging faster to an ideal loss/error rate as the number of perturbed images increases.
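    The relaxation described amounts, in essence, to multiclass logistic regression with a norm penalty on the coefficient matrix. A small PyTorch sketch of such a regularized objective; the specific norm and weight here are placeholders, not the norm the paper derives from the Wasserstein ambiguity set.

        # Multiclass logistic regression with a coefficient-matrix norm regularizer (sketch).
        import torch

        torch.manual_seed(0)
        X = torch.randn(500, 20)
        y = torch.randint(0, 3, (500,))
        W = torch.zeros(20, 3, requires_grad=True)
        b = torch.zeros(3, requires_grad=True)
        opt = torch.optim.Adam([W, b], lr=0.05)
        reg_weight = 0.1                                    # plays the role of the ambiguity-set radius

        for _ in range(200):
            logits = X @ W + b
            ce = torch.nn.functional.cross_entropy(logits, y)
            reg = torch.linalg.norm(W)                      # Frobenius norm as an example choice of matrix norm
            loss = ce + reg_weight * reg
            opt.zero_grad(); loss.backward(); opt.step()
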
    Curvature Injected Adaptive Momentum Optimizer for Convolutional Neural Networks. (arXiv:2109.12504v1 [cs.LG])
    (2 min) In this paper, we propose a new approach, hereafter referred to as AdaInject, for gradient descent optimizers that injects curvature information into the adaptive momentum. Specifically, the curvature information is used as a weight to inject the second-order moment into the update rule. The curvature information is captured through the short-term parameter history. The AdaInject approach boosts the parameter update by exploiting this curvature information. The proposed approach is generic in nature and can be integrated with any existing adaptive momentum stochastic gradient descent optimizer. The effectiveness of the AdaInject optimizer is tested through theoretical analysis as well as toy examples. We also show the convergence property of the proposed injection-based optimizer. Further, we demonstrate the efficacy of the AdaInject approach through extensive experiments with state-of-the-art optimizers, i.e., AdamInject, diffGradInject, RadamInject, and AdaBeliefInject, on four benchmark datasets. Different CNN models are used in the experiments. The largest improvement in top-1 classification error rate, $16.54\%$, is observed using the diffGradInject optimizer with the ResNeXt29 model on the CIFAR10 dataset. Overall, we observe very promising performance improvements of existing optimizers with the proposed AdaInject approach.
    Autoregressive neural-network wavefunctions for ab initio quantum chemistry. (arXiv:2109.12606v1 [physics.chem-ph])
    (2 min) Performing electronic structure calculations is a canonical many-body problem that has recently emerged as a challenging new paradigm for neural network quantum states (NNQS). Here, we parameterise the electronic wavefunction with a novel autoregressive neural network (ARN) that permits highly efficient and scalable sampling, whilst also embedding physical priors that reflect the structure of molecular systems without sacrificing expressibility. This allows us to perform electronic structure calculations on molecules with up to 30 spin-orbitals - which consider multiple orders of magnitude more Slater determinants than previous applications of conventional NNQS - and we find that our ansatz can outperform the de-facto gold-standard coupled cluster methods even in the presence of strong quantum correlations. With a highly expressive neural network for which sampling is no longer a computational bottleneck, we conclude that the barriers to further scaling are not associated with the wavefunction ansatz itself, but rather are inherent to any variational Monte Carlo approach.
    AsySQN: Faster Vertical Federated Learning Algorithms with Better Computation Resource Utilization. (arXiv:2109.12519v1 [cs.LG])
    (2 min) Vertical federated learning (VFL) is an effective paradigm for training emerging cross-organizational (e.g., across different corporations, companies and organizations) collaborative learning with privacy preservation. Stochastic gradient descent (SGD) methods are popular choices for training VFL models because of their low per-iteration computation. However, existing SGD-based VFL algorithms are communication-expensive due to a large number of communication rounds. Meanwhile, most existing VFL algorithms use synchronous computation, which seriously hampers computation resource utilization in real-world applications. To address the challenges of communication and computation resource utilization, we propose an asynchronous stochastic quasi-Newton (AsySQN) framework for VFL, under which three algorithms, i.e., AsySQN-SGD, -SVRG and -SAGA, are proposed. The proposed AsySQN-type algorithms take descent steps scaled by approximate Hessian information (without calculating the inverse Hessian matrix explicitly), converge much faster than SGD-based methods in practice, and thus can dramatically reduce the number of communication rounds. Moreover, the adopted asynchronous computation makes better use of computation resources. We theoretically prove the convergence rates of our proposed algorithms for strongly convex problems. Extensive numerical experiments on real-world datasets demonstrate the lower communication costs and better computation resource utilization of our algorithms compared with state-of-the-art VFL algorithms.
    A Study of Fake News Reading and Annotating in Social Media Context. (arXiv:2109.12523v1 [cs.HC])
    (2 min) The online spreading of fake news is a major issue threatening entire societies. Much of this spreading is enabled by new media formats, namely social networks and online media sites. Researchers and practitioners have been trying to counter this by characterizing fake news and devising automated methods for detecting it. The detection methods have so far had only limited success, mostly due to the complexity of news content and context and the lack of properly annotated datasets. One possible way to boost the efficiency of automated misinformation detection methods is to imitate the detection work of humans. It is also important to understand the news consumption behavior of online users. In this paper, we present an eye-tracking study in which we let 44 lay participants casually read through a social media feed containing posts with news articles, some of which were fake. In a second run, we asked the participants to decide on the truthfulness of these articles. We also describe a follow-up qualitative study with a similar scenario, this time with 7 expert fake news annotators. We describe both studies, the characteristics of the resulting dataset (which we hereby publish), and several findings.
    SAND-mask: An Enhanced Gradient Masking Strategy for the Discovery of Invariances in Domain Generalization. (arXiv:2106.02266v2 [cs.LG] UPDATED)
    (2 min) A major bottleneck in the real-world application of machine learning models is their failure to generalize to unseen domains whose data distribution is not i.i.d. with that of the training domains. This failure often stems from learning non-generalizable features in the training domains that are spuriously correlated with the data labels. To address this shortcoming, there has been a growing surge of interest in learning good explanations that are hard to vary, studied under the notion of Out-of-Distribution (OOD) Generalization. The search for good explanations that are invariant across different domains can be seen as finding local (or global) minima in the loss landscape that hold across all of the training domains. In this paper, we propose a masking strategy that determines a continuous weight based on the agreement of the gradients that flow through each edge of the network, in order to control the amount of update received by the edge in each step of optimization. In particular, our proposed technique, referred to as "Smoothed-AND (SAND)-masking", not only validates the agreement in the direction of gradients but also promotes the agreement among their magnitudes to further ensure the discovery of invariances across training domains. SAND-mask is validated on the DomainBed benchmark for domain generalization and significantly improves state-of-the-art accuracy on the Colored MNIST dataset while providing competitive results on other domain generalization datasets.
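    To make the masking idea concrete, here is a hedged sketch in the spirit of gradient-agreement masking, not the exact SAND-mask formula: per-parameter gradients from each training domain are compared, and the applied update is scaled by a continuous agreement weight.

        # Gradient-agreement masking across domains (illustrative; not the exact SAND-mask rule).
        import torch

        def agreement_masked_grad(domain_grads, tau=1.0):
            # domain_grads: list of per-domain gradients for the same parameter tensor.
            g = torch.stack(domain_grads)                   # (n_domains, *param_shape)
            mean_sign = torch.sign(g).mean(dim=0)           # +/-1 means all domains agree on the sign
            weight = torch.tanh(tau * mean_sign.abs())      # continuous mask favouring agreeing directions
            return weight * g.mean(dim=0)                   # scaled average gradient that is actually applied

        g_env1 = torch.tensor([0.5, -0.2, 0.3])
        g_env2 = torch.tensor([0.4, 0.6, 0.1])
        print(agreement_masked_grad([g_env1, g_env2]))      # the disagreeing middle entry is suppressed
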
    Testing Autonomous Systems with Believed Equivalence Refinement. (arXiv:2103.04578v3 [cs.AI] UPDATED)
    (2 min) Continuous engineering of autonomous driving functions commonly requires deploying vehicles in road testing to obtain inputs that cause problematic decisions. Although the discovery leads to producing an improved system, it also challenges the foundation of testing using equivalence classes and the associated relative test coverage criterion. In this paper, we propose believed equivalence, where the establishment of an equivalence class is initially based on expert belief and is subject to a set of available test cases having a consistent valuation. Upon a newly encountered test case that breaks the consistency, one may need to refine the established categorization in order to split the originally believed equivalence into two. Finally, we focus on modules implemented using deep neural networks where every category partitions an input over the real domain. We present both analytical and lazy methods to suggest the refinement. The concept is demonstrated in analyzing multiple autonomous driving modules, indicating the potential of our proposed approach.
    FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding. (arXiv:2109.12742v1 [cs.CL])
    (2 min) The few-shot natural language understanding (NLU) task has attracted much recent attention. However, prior methods have been evaluated under a disparate set of protocols, which hinders fair comparison and measuring progress of the field. To address this issue, we introduce an evaluation framework that improves previous evaluation procedures in three key aspects, i.e., test performance, dev-test correlation, and stability. Under this new evaluation framework, we re-evaluate several state-of-the-art few-shot methods for NLU tasks. Our framework reveals new insights: (1) both the absolute performance and relative gap of the methods were not accurately estimated in prior literature; (2) no single method dominates most tasks with consistent performance; (3) improvements of some methods diminish with a larger pretrained model; and (4) gains from different methods are often complementary and the best combined model performs close to a strong fully-supervised baseline. We open-source our toolkit, FewNLU, that implements our evaluation framework along with a number of state-of-the-art methods.
    Using Soft Labels to Model Uncertainty in Medical Image Segmentation. (arXiv:2109.12622v1 [cs.CV])
    (2 min) Medical image segmentation is inherently uncertain. For a given image, there may be multiple plausible segmentation hypotheses, and physicians will often disagree on lesion and organ boundaries. To be suited to real-world application, automatic segmentation systems must be able to capture this uncertainty and variability. Thus far, this has been addressed by building deep learning models that, through dropout, multiple heads, or variational inference, can produce a set - infinite, in some cases - of plausible segmentation hypotheses for any given image. However, in clinical practice, it may not be practical to browse all hypotheses. Furthermore, recent work shows that segmentation variability plateaus after a certain number of independent annotations, suggesting that a large enough group of physicians may be able to represent the whole space of possible segmentations. Inspired by this, we propose a simple method to obtain soft labels from the annotations of multiple physicians and train models that, for each image, produce a single well-calibrated output that can be thresholded at multiple confidence levels, according to each application's precision-recall requirements. We evaluated our method on the MICCAI 2021 QUBIQ challenge, showing that it performs well across multiple medical image segmentation tasks, produces well-calibrated predictions, and, on average, performs better at matching physicians' predictions than other physicians.
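    As a rough illustration of the soft-label idea (a hypothetical numpy sketch, not the challenge submission), the per-pixel fraction of annotators marking a pixel can serve as the training target, and the resulting probability map can then be thresholded at application-specific confidence levels:

```python
# Hypothetical sketch: soft labels from multiple annotators and confidence thresholding.
import numpy as np

rng = np.random.default_rng(0)
# Three annotators, each providing a binary mask for the same 8x8 image.
annotator_masks = (rng.random((3, 8, 8)) > 0.5).astype(np.float32)

# Soft label = per-pixel fraction of annotators marking the pixel as foreground.
soft_label = annotator_masks.mean(axis=0)

# A well-calibrated model trained against soft_label (e.g. with a cross-entropy loss)
# would output a probability map; here the soft label itself serves as a stand-in.
probability_map = soft_label

# Threshold at different confidence levels depending on precision/recall needs.
high_precision_mask = probability_map >= 0.9   # keep only pixels most annotators agree on
high_recall_mask = probability_map >= 0.3      # keep anything a minority of annotators marked
print(high_precision_mask.sum(), high_recall_mask.sum())
```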
    Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness?. (arXiv:2104.09425v2 [cs.LG] UPDATED)
    (3 min) While additional training data improves the robustness of deep neural networks against adversarial examples, it presents the challenge of curating a large number of specific real-world samples. We circumvent this challenge by using additional data from proxy distributions learned by state-of-the-art generative models. We first seek to formally understand the transfer of robustness from classifiers trained on proxy distributions to the real data distribution. We prove that the difference between the robustness of a classifier on the two distributions is upper bounded by the conditional Wasserstein distance between them. Motivated by our result, we next ask: how should one empirically select an appropriate generative model? We find that existing distance metrics, such as FID, fail to correctly determine the robustness transfer from proxy distributions. We propose a robust discrimination approach, which measures the distinguishability of synthetic and real samples under adversarial perturbations. Our approach accurately predicts the robustness transfer from different proxy distributions. After choosing a proxy distribution, the next question is: which samples are most beneficial? We successfully optimize this selection by estimating the importance of each sample in robustness transfer. Finally, using our selection criterion for proxy distribution and individual samples, we curate a set of the ten million most beneficial synthetic samples for robust training on the CIFAR-10 dataset. Using this set we improve robust accuracy by up to 7.5% and 6.7% in the $\ell_{\infty}$ and $\ell_2$ threat models, and certified robust accuracy by 7.6% in the $\ell_2$ threat model, over baselines not using proxy distributions on the CIFAR-10 dataset.
    Machine Learning based Medical Image Deepfake Detection: A Comparative Study. (arXiv:2109.12800v1 [cs.CV])
    (2 min) Deep generative networks in recent years have reinforced the need for caution while consuming various modalities of digital information. One avenue of deepfake creation is aligned with injection and removal of tumors from medical scans. Failure to detect medical deepfakes can lead to large setbacks on hospital resources or even loss of life. This paper attempts to address the detection of such attacks with a structured case study. We evaluate different machine learning algorithms and pretrained convolutional neural networks on distinguishing between tampered and untampered data. The findings of this work show near perfect accuracy in detecting instances of tumor injections and removals.
    LINDA: Multi-Agent Local Information Decomposition for Awareness of Teammates. (arXiv:2109.12508v1 [cs.LG])
    (2 min) In cooperative multi-agent reinforcement learning (MARL), where agents only have access to partial observations, efficiently leveraging local information is critical. During long-time observations, agents can build \textit{awareness} of teammates to alleviate the problem of partial observability. However, previous MARL methods usually neglect this kind of use of local information. To address this problem, we propose a novel framework, multi-agent \textit{Local INformation Decomposition for Awareness of teammates} (LINDA), with which agents learn to decompose local information and build awareness of each teammate. We model the awareness as random variables and perform representation learning to ensure the informativeness of awareness representations by maximizing the mutual information between the awareness and the actual trajectory of the corresponding agent. LINDA is agnostic to specific algorithms and can be flexibly integrated into different MARL methods. Extensive experiments show that the proposed framework learns informative awareness from local partial observations for better collaboration and significantly improves the learning performance, especially on challenging tasks.
    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. (arXiv:2108.03298v2 [cs.RO] UPDATED)
    (2 min) Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods makes assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Codebase, datasets, trained models, and more available at https://arise-initiative.github.io/robomimic-web/
    Physics-informed Convolutional Neural Networks for Temperature Field Prediction of Heat Source Layout without Labeled Data. (arXiv:2109.12482v1 [cs.LG])
    (0 min) Recently, surrogate models based on deep learning have attracted much attention for engineering analysis and optimization. As the construction of data pairs in most engineering problems is time-consuming, data acquisition is becoming the bottleneck limiting the predictive capability of most deep surrogate models, which also holds for surrogates for thermal analysis and design. To address this issue, this paper develops a physics-informed convolutional neural network (CNN) as a surrogate for thermal simulation. The network can learn a mapping from heat source layout to the steady-state temperature field without labeled data, which is equivalent to solving an entire family of partial differential equations (PDEs). To realize the physics-guided training without labeled data, we employ the heat conduction equation and the finite difference method to construct the loss function. Since the solution is sensitive to boundary conditions, we properly impose hard constraints by padding in the Dirichlet and Neumann boundary conditions. In addition, the neural network architecture is well-designed to improve the prediction precision of the problem at hand, and pixel-level online hard example mining is introduced to overcome the imbalance of optimization difficulty in the computation domain. The experiments demonstrate that the proposed method can provide predictions comparable to numerical methods and data-driven deep learning models. We also conduct various ablation studies to investigate the effectiveness of the network components and training methods proposed in this paper.
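    A minimal sketch of the label-free loss idea (hypothetical, assuming a uniform grid, constant conductivity, and ignoring the paper's boundary treatment): the steady-state heat-conduction residual is evaluated by applying a finite-difference Laplacian stencil to the predicted temperature field.

```python
# Hypothetical sketch of a physics-informed loss for steady-state heat conduction:
# residual of  k * laplacian(T) + q = 0  evaluated with a 5-point finite-difference stencil.
import torch
import torch.nn.functional as F

def heat_residual_loss(pred_T, source_q, dx=1.0, k=1.0):
    """pred_T, source_q: tensors of shape (batch, 1, H, W)."""
    stencil = torch.tensor([[0., 1., 0.],
                            [1., -4., 1.],
                            [0., 1., 0.]], dtype=pred_T.dtype).view(1, 1, 3, 3)
    laplacian = F.conv2d(pred_T, stencil) / dx ** 2        # interior points only (no padding)
    residual = k * laplacian + source_q[:, :, 1:-1, 1:-1]  # crop the source to match the interior
    return (residual ** 2).mean()                          # drive the PDE residual toward zero

# Toy usage: random "predicted" temperature field and heat-source layout.
T = torch.randn(2, 1, 16, 16, requires_grad=True)
q = torch.zeros(2, 1, 16, 16)
loss = heat_residual_loss(T, q)
loss.backward()
print(float(loss))
```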
    ReCal-Net: Joint Region-Channel-Wise Calibrated Network for Semantic Segmentation in Cataract Surgery Videos. (arXiv:2109.12448v1 [eess.IV])
    (2 min) Semantic segmentation in surgical videos is a prerequisite for a broad range of applications towards improving surgical outcomes and surgical video analysis. However, semantic segmentation in surgical videos involves many challenges. In particular, in cataract surgery, various features of the relevant objects such as blunt edges, color and context variation, reflection, transparency, and motion blur pose a challenge for semantic segmentation. In this paper, we propose a novel convolutional module, termed the \textit{ReCal} module, which can calibrate the feature maps by employing intra- and inter-region dependencies and channel-region cross-dependencies. This calibration strategy can effectively enhance semantic representation by correlating different representations of the same semantic label, considering a multi-angle local view centering around each pixel. Thus the proposed module can deal with distant visual characteristics of unique objects as well as cross-similarities in the visual characteristics of different objects. Moreover, we propose a novel network architecture based on the proposed module, termed ReCal-Net. Experimental results confirm the superiority of ReCal-Net compared to rival state-of-the-art approaches for all relevant objects in cataract surgery. Moreover, ablation studies reveal the effectiveness of the ReCal module in boosting semantic segmentation accuracy.
    Provable Low Rank Plus Sparse Matrix Separation Via Nonconvex Regularizers. (arXiv:2109.12713v1 [stat.ML])
    (0 min) This paper considers a large class of problems where we seek to recover a low rank matrix and/or sparse vector from some set of measurements. While methods based on convex relaxations suffer from a (possibly large) estimator bias, and other nonconvex methods require the rank or sparsity to be known a priori, we use nonconvex regularizers to minimize the rank and $l_0$ norm without the estimator bias from the convex relaxation. We present a novel analysis of the alternating proximal gradient descent algorithm applied to such problems, and bound the error between the iterates and the ground truth sparse and low rank matrices. The algorithm and error bound can be applied to sparse optimization, matrix completion, and robust principal component analysis as special cases of our results.
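    A simplified sketch of the alternating idea is given below (hypothetical; it uses plain hard thresholding as a stand-in for the paper's nonconvex proximal operators and omits the gradient steps of the actual alternating proximal gradient scheme):

```python
# Hypothetical sketch: alternating updates for M = L (low rank) + S (sparse),
# with hard thresholding standing in for nonconvex proximal operators.
import numpy as np

def sv_hard_threshold(X, tau):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[s < tau] = 0.0                        # proximal-style step for a nonconvex rank surrogate
    return U @ np.diag(s) @ Vt

def hard_threshold(X, tau):
    return X * (np.abs(X) > tau)            # proximal step for an l0-type penalty

rng = np.random.default_rng(0)
L_true = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))  # rank-5 component
S_true = hard_threshold(rng.standard_normal((50, 50)), 2.5)           # sparse component
M = L_true + S_true

L, S = np.zeros_like(M), np.zeros_like(M)
for _ in range(50):
    L = sv_hard_threshold(M - S, tau=5.0)   # update low-rank part with S fixed
    S = hard_threshold(M - L, tau=1.0)      # update sparse part with L fixed

print("rank(L) =", np.linalg.matrix_rank(L),
      "relative error =", np.linalg.norm(M - L - S) / np.linalg.norm(M))
```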
    Physics-Informed Neural Networks for Minimising Worst-Case Violations in DC Optimal Power Flow. (arXiv:2107.00465v2 [eess.SY] UPDATED)
    (0 min) Physics-informed neural networks exploit the existing models of the underlying physical systems to generate higher accuracy results with fewer data. Such approaches can help drastically reduce the computation time and generate a good estimate of computationally intensive processes in power systems, such as dynamic security assessment or optimal power flow. Combined with the extraction of worst-case guarantees for the neural network performance, such neural networks can be applied in safety-critical applications in power systems and build a high level of trust among power system operators. This paper takes the first step and applies, for the first time to our knowledge, Physics-Informed Neural Networks with Worst-Case Guarantees for the DC Optimal Power Flow problem. We look for guarantees related to (i) maximum constraint violations, (ii) maximum distance between predicted and optimal decision variables, and (iii) maximum sub-optimality in the entire input domain. In a range of PGLib-OPF networks, we demonstrate how physics-informed neural networks can be supplied with worst-case guarantees and how they can lead to reduced worst-case violations compared with conventional neural networks.
    Data Summarization via Bilevel Optimization. (arXiv:2109.12534v1 [cs.LG])
    (0 min) The increasing availability of massive data sets poses a series of challenges for machine learning. Prominent among these is the need to learn models under hardware or human resource constraints. In such resource-constrained settings, a simple yet powerful approach is to operate on small subsets of the data. Coresets are weighted subsets of the data that provide approximation guarantees for the optimization objective. However, existing coreset constructions are highly model-specific and are limited to simple models such as linear regression, logistic regression, and $k$-means. In this work, we propose a generic coreset construction framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem. In contrast to existing approaches, our framework does not require model-specific adaptations and applies to any twice differentiable model, including neural networks. We show the effectiveness of our framework for a wide range of models in various settings, including training non-convex models online and batch active learning.
    Under the Skin of Foundation NFT Auctions. (arXiv:2109.12321v1 [cs.SI])
    (2 min) Non-Fungible Tokens (NFTs) have gained a solid foothold within the crypto community, and substantial amounts of money have been allocated to their trades. In this paper, we studied one of the most prominent marketplaces dedicated to NFT auctions and trades, Foundation. We analyzed the activities on Foundation and identified several intriguing underlying dynamics that occur on this platform. Moreover, we performed social network analysis on a graph that we had created based on transferred NFTs on Foundation, and then described the characteristics of this graph. Lastly, we built a neural network-based similarity model for retrieving and clustering similar NFTs. We also showed that for most NFTs, their performances in auctions were comparable with the auction performance of other NFTs in their cluster.
    Efficient Non-linear Calculators. (arXiv:2109.12686v1 [cs.LG])
    (0 min) A novel algorithm for producing smooth nonlinearities on digital hardware is presented. The non-linearities are inherently quadratic and have both symmetrical and asymmetrical variants. The integer (and fixed point) implementation is highly amenable for use with digital gates on an ASIC or FPGA. The implementations are multiplier-less. Scaling of the non-linear output, as required in an LSTM cell, is integrated into the implementation. This too does not require a multiplier. The non-linearities are useful as activation functions in a variety of ANN architectures. The floating point mappings have been compared with other non-linearities and have been benchmarked. Results show that these functions should be considered in the ANN design phase. The hardware resource usage of the implementations has been thoroughly investigated. Our results make a strong case for implementations in edge applications. This document summarizes the findings and serves to give a quick overview of the outcomes of our research\footnote{The authors' peer-reviewed manuscripts (available at https://doi.org/10.1016/j.neucom.2021.02.030) offer more detail and may be better suited for a thorough consideration}.
    Misclassification-Aware Gaussian Smoothing and Mixed Augmentations improves Robustness against Domain Shifts. (arXiv:2104.01231v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks achieve high prediction accuracy when the train and test distributions coincide. In practice though, various types of corruptions can deviate from this setup and cause severe performance degradations. Few methods have been proposed to address generalization in the presence of unforeseen domain shifts. In this paper, we propose a misclassification-aware Gaussian smoothing approach, coupled with mixed data augmentations, for improving robustness of image classifiers against a variety of corruptions while still maintaining high clean accuracy. We show that our method improves upon the state-of-the-art in robustness and uncertainty calibration on several image classification benchmarks and network architectures.
    MINIMAL: Mining Models for Data Free Universal Adversarial Triggers. (arXiv:2109.12406v1 [cs.CL])
    (2 min) It is well known that natural language models are vulnerable to adversarial attacks, which are mostly input-specific in nature. Recently, it has been shown that there also exist input-agnostic attacks in NLP models, called universal adversarial triggers. However, existing methods to craft universal triggers are data intensive. They require large amounts of data samples to generate adversarial triggers, which are typically inaccessible to attackers. For instance, previous works take 3000 data samples per class for the SNLI dataset to generate adversarial triggers. In this paper, we present a novel data-free approach, MINIMAL, to mine input-agnostic adversarial triggers from models. Using the triggers produced with our data-free algorithm, we reduce the accuracy of Stanford Sentiment Treebank's positive class from 93.6% to 9.6%. Similarly, for the Stanford Natural Language Inference (SNLI) task, our single-word trigger reduces the accuracy of the entailment class from 90.95% to less than 0.6%. Despite being completely data-free, we get equivalent accuracy drops as data-dependent methods.
    Model reduction for the material point method via learning the deformation map and its spatial-temporal gradients. (arXiv:2109.12390v1 [cs.LG])
    (0 min) This work proposes a model-reduction approach for the material point method on nonlinear manifolds. The technique approximates the $\textit{kinematics}$ by approximating the deformation map in a manner that restricts deformation trajectories to reside on a low-dimensional manifold expressed from the extrinsic view via a parameterization function. By explicitly approximating the deformation map and its spatial-temporal gradients, the deformation gradient and the velocity can be computed simply by differentiating the associated parameterization function. Unlike classical model reduction techniques that build a subspace for a finite number of degrees of freedom, the proposed method approximates the entire deformation map with infinite degrees of freedom. Therefore, the technique supports resolution changes in the reduced simulation, attaining the challenging task of zero-shot super-resolution by generating material points unseen in the training data. The ability to generate material points also allows for adaptive quadrature rules for stress update. A family of projection methods is devised to generate $\textit{dynamics}$, i.e., at every time step, the methods perform three steps: (1) generate quadratures in the full space from the reduced space, (2) compute position and velocity updates in the full space, and (3) perform a least-squares projection of the updated position and velocity onto the low-dimensional manifold and its tangent space. Computational speedup is achieved via hyper-reduction, i.e., only a subset of the original material points are needed for dynamics update. Large-scale numerical examples with millions of material points illustrate the method's ability to gain an order-of-magnitude computational-cost saving -- indeed $\textit{real-time simulations}$ in some cases -- with negligible errors.
    Logo Generation Using Regional Features: A Faster R-CNN Approach to Generative Adversarial Networks. (arXiv:2109.12628v1 [cs.CV])
    (0 min) In this paper we introduce the Local Logo Generative Adversarial Network (LL-GAN) that uses regional features extracted from the Faster Regional Convolutional Neural Network (Faster R-CNN) to generate logos. We demonstrate the strength of this approach by training the framework on a small style-rich dataset collected online to generate large impressive logos. Our approach beats the state-of-the-art models (StyleGAN2, Self-Attention GANs) that suffer from mode collapse due to the size of the data.
    Deep Learning-Based Detection of the Acute Respiratory Distress Syndrome: What Are the Models Learning?. (arXiv:2109.12323v1 [cs.LG])
    (0 min) The acute respiratory distress syndrome (ARDS) is a severe form of hypoxemic respiratory failure with in-hospital mortality of 35-46%. High mortality is thought to be related in part to challenges in making a prompt diagnosis, which may in turn delay implementation of evidence-based therapies. A deep neural network (DNN) algorithm utilizing unbiased ventilator waveform data (VWD) may help to improve screening for ARDS. We first show that a convolutional neural network-based ARDS detection model can outperform prior work with random forest models in AUC (0.95+/-0.019 vs. 0.88+/-0.064), accuracy (0.84+/-0.026 vs 0.80+/-0.078), and specificity (0.81+/-0.06 vs 0.71+/-0.089). Frequency ablation studies imply that our model can learn features from low frequency domains typically used for expert feature engineering, as well as high-frequency information that may be difficult to manually featurize. Further experiments suggest that subtle, high-frequency components of physiologic signals may explain the superior performance of DL models over traditional ML when using physiologic waveform data. Our observations may enable improved interpretability of DL-based physiologic models and may improve the understanding of how high-frequency information in physiologic data impacts the performance of our DL model.
    Separate but Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data. (arXiv:2105.04727v3 [cs.SD] UPDATED)
    (2 min) We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive enhancement performance compared to IID training on a single device and that we can further facilitate the convergence speed and the overall performance using transfer learning on the server-side. Moreover, we show that we can effectively combine updates from clients trained locally with supervised and unsupervised losses. We also release a new dataset LibriFSD50K and its creation recipe in order to facilitate FL research for source separation problems.
    Revisiting minimum description length complexity in overparameterized models. (arXiv:2006.10189v2 [cs.LG] UPDATED)
    (2 min) Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP is defined via an optimality criterion over the encodings induced by a good Ridge estimator class. We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods and show that it is not just a function of parameter count, but rather a function of the singular values of the design or the kernel matrix and the signal-to-noise ratio. For a linear model with $n$ observations, $d$ parameters, and i.i.d. Gaussian predictors, MDL-COMP scales linearly with $d$ when $d < n$, but grows much more slowly once $d > n$. For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases. We also prove that MDL-COMP upper bounds the in-sample mean squared error (MSE). Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for optimizing test MSE with ridge regression in limited data settings, sometimes improving upon cross-validation and (always) saving computational costs. Finally, our findings also suggest that the recently observed double descent phenomenon in overparameterized models might be a consequence of the choice of non-ideal estimators.
    Equivariant Filters for Efficient Tracking in 3D Imaging. (arXiv:2103.10255v2 [cs.CV] UPDATED)
    (2 min) We demonstrate an object tracking method for 3D images with fixed computational cost and state-of-the-art performance. Previous methods predicted transformation parameters from convolutional layers. We instead propose an architecture that does not include either flattening of convolutional features or fully connected layers, but instead relies on equivariant filters to preserve transformations between inputs and outputs (e.g. rot./trans. of inputs rotate/translate outputs). The transformation is then derived in closed form from the outputs of the filters. This method is useful for applications requiring low latency, such as real-time tracking. We demonstrate our model on synthetically augmented adult brain MRI, as well as fetal brain MRI, which is the intended use-case.
    Overcoming Conflicting Data when Updating a Neural Semantic Parser. (arXiv:2010.12675v2 [cs.CL] UPDATED)
    (2 min) In this paper, we explore how to use a small amount of new data to update a task-oriented semantic parsing model when the desired output for some examples has changed. When making updates in this way, one potential problem that arises is the presence of conflicting data, or out-of-date labels in the original training set. To evaluate the impact of this understudied problem, we propose an experimental setup for simulating changes to a neural semantic parser. We show that the presence of conflicting data greatly hinders learning of an update, then explore several methods to mitigate its effect. Our multi-task and data selection methods lead to large improvements in model accuracy compared to a naive data-mixing strategy, and our best method closes 86% of the accuracy gap between this baseline and an oracle upper bound.
    ReINTEL Challenge 2020: A Comparative Study of Hybrid Deep Neural Network for Reliable Intelligence Identification on Vietnamese SNSs. (arXiv:2109.12777v1 [cs.LG])
    (2 min) The overwhelming abundance of data has created a misinformation crisis. Unverified sensationalism that is designed to grab the readers' short attention span, when crafted with malice, has caused irreparable damage to our society's structure. As a result, determining the reliability of an article has become a crucial task. After various ablation studies, we propose a multi-input model that can effectively leverage both tabular metadata and post content for the task. Applying state-of-the-art finetuning techniques for the pretrained component and training strategies for our complete model, we achieved an ROC score of 0.9462 on the VLSP private test set.
    BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation. (arXiv:2109.12271v1 [eess.IV])
    (2 min) Convolutional neural networks (CNNs) have recently achieved remarkable success in automatically identifying organs or lesions on 3D medical images. Meanwhile, vision transformer networks have exhibited exceptional performance in 2D image classification tasks. Compared with CNNs, transformer networks have an obvious advantage of extracting long-range features due to their self-attention algorithm. Therefore, in this paper we present a CNN-Transformer combined model called BiTr-Unet for brain tumor segmentation on multi-modal MRI scans. The proposed BiTr-Unet achieves good performance on the BraTS 2021 validation dataset with mean Dice score 0.9076, 0.8392 and 0.8231, and mean Hausdorff distance 4.5322, 13.4592 and 14.9963 for the whole tumor, tumor core, and enhancing tumor, respectively.
    Smart Home Energy Management: Sequence-to-Sequence Load Forecasting and Q-Learning. (arXiv:2109.12440v1 [eess.SY])
    (2 min) A smart home energy management system (HEMS) can contribute towards reducing the energy costs of customers; however, HEMS suffers from uncertainty in both energy generation and consumption patterns. In this paper, we propose sequence-to-sequence (Seq2Seq) learning-based supply and load prediction along with reinforcement learning-based HEMS control. We investigate how the prediction method affects the HEMS operation. First, we use Seq2Seq learning to predict photovoltaic (PV) power and home devices' load. We then apply Q-learning for offline optimization of HEMS based on the prediction results. Finally, we test the online performance of the trained Q-learning scheme with actual PV and load data. The Seq2Seq learning is compared with VARMA, SVR, and LSTM at both the prediction and operation levels. The simulation results show that Seq2Seq performs best, with lower prediction error and better online operation performance.
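    For readers unfamiliar with the control part, the sketch below shows the standard tabular Q-learning update on a toy battery-scheduling environment; it is purely illustrative and does not reproduce the paper's state, action, or cost definitions:

```python
# Hypothetical toy sketch of tabular Q-learning for a HEMS-style decision
# (charge / idle / discharge a battery), not the paper's formulation.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 3          # e.g. discretized net-load states; 3 battery actions
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(state, action):
    """Toy environment: random next state, cost depends on state and action (placeholder)."""
    next_state = rng.integers(n_states)
    cost = abs(state - 5) * 0.1 + (action - 1) ** 2 * 0.05
    return next_state, -cost          # reward = negative energy cost

state = rng.integers(n_states)
for _ in range(5000):
    action = rng.integers(n_actions) if rng.random() < eps else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Standard Q-learning update rule.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q.round(2))
```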
    Time Series Analysis and Modeling to Forecast: a Survey. (arXiv:2104.00164v2 [cs.LG] UPDATED)
    (2 min) Time series modeling for predictive purposes has been an active research area of machine learning for many years. However, no sufficiently comprehensive and substantive survey has been offered so far. This survey strives to meet this need. A unified presentation has been adopted throughout this compilation. A red thread guides the reader from time series preprocessing to forecasting. Time series decomposition is a major preprocessing task, to separate nonstationary effects (the deterministic components) from the remaining stochastic constituent, assumed to be stationary. The deterministic components are predictable and contribute to the prediction through estimations or extrapolation. Fitting the most appropriate model to the remaining stochastic component aims at capturing the relationship between past and future values, to allow prediction. We cover a sufficiently broad spectrum of models while nonetheless offering substantial methodological developments. We describe three major linear parametric models, together with two nonlinear extensions, and present five categories of nonlinear parametric models. Beyond conventional statistical models, we highlight six categories of deep neural networks appropriate for time series forecasting in a nonlinear framework. Finally, we point to new avenues of research for time series modeling and forecasting. We also report software made publicly available for the models presented.
    Physics perception in sloshing scenes with guaranteed thermodynamic consistency. (arXiv:2106.13301v2 [cs.CV] UPDATED)
    (2 min) Physics perception very often faces the problem that only limited data or partial measurements on the scene are available. In this work, we propose a strategy to learn the full state of sloshing liquids from measurements of the free surface. Our approach is based on recurrent neural networks (RNN) that project the limited information available to a reduced-order manifold so as to not only reconstruct the unknown information, but also to be capable of performing fluid reasoning about future scenarios in real time. To obtain physically consistent predictions, we train deep neural networks on the reduced-order manifold that, through the use of inductive biases, ensure the fulfillment of the principles of thermodynamics. RNNs learn from history the hidden information required to correlate the limited information with the latent space where the simulation occurs. Finally, a decoder returns data back to the high-dimensional manifold, so as to provide the user with insightful information in the form of augmented reality. This algorithm is connected to a computer vision system to test the performance of the proposed methodology with real information, resulting in a system capable of understanding and predicting future states of the observed fluid in real-time.
    AI Explainability 360: Impact and Design. (arXiv:2109.12151v1 [cs.LG])
    (2 min) As artificial intelligence and machine learning algorithms become increasingly prevalent in society, multiple stakeholders are calling for these algorithms to provide explanations. At the same time, these stakeholders, whether they be affected citizens, government regulators, domain experts, or system developers, have different explanation needs. To address these needs, in 2019, we created AI Explainability 360 (Arya et al. 2020), an open source software toolkit featuring ten diverse and state-of-the-art explainability methods and two evaluation metrics. This paper examines the impact of the toolkit with several case studies, statistics, and community feedback. The different ways in which users have experienced AI Explainability 360 have resulted in multiple types of impact and improvements in multiple metrics, highlighted by the adoption of the toolkit by the independent LF AI & Data Foundation. The paper also describes the flexible design of the toolkit, examples of its use, and the significant educational material and documentation available to its users.
    Opacus: User-Friendly Differential Privacy Library in PyTorch. (arXiv:2109.12298v1 [cs.LG])
    (2 min) We introduce Opacus, a free, open-source PyTorch library for training deep learning models with differential privacy (hosted at opacus.ai). Opacus is designed for simplicity, flexibility, and speed. It provides a simple and user-friendly API, and enables machine learning practitioners to make a training pipeline private by adding as little as two lines to their code. It supports a wide variety of layers, including multi-head attention, convolution, LSTM, and embedding, right out of the box, and it also provides the means for supporting other user-defined layers. Opacus computes batched per-sample gradients, providing better efficiency compared to the traditional "micro batch" approach. In this paper we present Opacus, detail the principles that drove its implementation and unique features, and compare its performance against other frameworks for differential privacy in ML.
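    For context, a typical usage pattern looks roughly like the following sketch (assuming a recent Opacus release; the exact API has changed across versions, so treat the call signatures as illustrative rather than authoritative):

```python
# Illustrative sketch of wrapping a training pipeline with Opacus
# (API details vary by version; see opacus.ai for the current interface).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,   # scale of Gaussian noise added to the clipped gradients
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

criterion = nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    criterion(model(x), y).backward()   # per-sample gradients are handled under the hood
    optimizer.step()
```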
    Beyond Pointwise Submodularity: Non-Monotone Adaptive Submodular Maximization subject to Knapsack and $k$-System Constraints. (arXiv:2104.04853v3 [cs.DS] UPDATED)
    (2 min) In this paper, we study the non-monotone adaptive submodular maximization problem subject to knapsack and $k$-system constraints. The input of our problem is a set of items, where each item has a particular state drawn from a known prior distribution. However, the state of an item is initially unknown; one must select an item in order to reveal its state. There is a utility function defined over items and states. Our objective is to sequentially select a group of items to maximize the expected utility. Although cardinality-constrained non-monotone adaptive submodular maximization has been well studied in the literature, whether there exists a constant approximation solution for the knapsack-constrained or $k$-system constrained adaptive submodular maximization problem remains an open problem. In fact, it has only been settled given the additional assumption of pointwise submodularity. In this paper, we remove the common assumption on pointwise submodularity and propose the first constant approximation solutions for both cases. Inspired by two recent studies on non-monotone adaptive submodular maximization, we develop a sampling-based randomized algorithm that achieves a $\frac{1}{10}$ approximation for the case of a knapsack constraint and a $\frac{1}{2k+4}$ approximation ratio for the case of a $k$-system constraint.
    Disentanglement based Active Learning. (arXiv:1912.07018v2 [cs.LG] UPDATED)
    (2 min) We propose Disentanglement based Active Learning (DAL), a new active learning technique based on self-supervision which leverages the concept of disentanglement. Instead of requesting labels from a human oracle, our method automatically labels the majority of the datapoints, thus drastically reducing the human labeling budget in Generative Adversarial Net (GAN) based active learning approaches. The proposed method uses Information Maximizing Generative Adversarial Nets (InfoGAN) to learn disentangled class category representations. Disagreement between active learner predictions and InfoGAN labels decides if the datapoints need to be human-labeled. We also introduce a label correction mechanism that aims to filter out label noise that occurs due to automatic labeling. Results on three benchmark datasets for the image classification task demonstrate that our method achieves better performance compared to existing GAN-based active learning approaches.
    Integrating Recurrent Neural Networks with Data Assimilation for Scalable Data-Driven State Estimation. (arXiv:2109.12269v1 [cs.LG])
    (2 min) Data assimilation (DA) is integrated with machine learning in order to perform entirely data-driven online state estimation. To achieve this, recurrent neural networks (RNNs) are implemented as surrogate models to replace key components of the DA cycle in numerical weather prediction (NWP), including the conventional numerical forecast model, the forecast error covariance matrix, and the tangent linear and adjoint models. It is shown how these RNNs can be initialized using DA methods to directly update the hidden/reservoir state with observations of the target system. The results indicate that these techniques can be applied to estimate the state of a system for the repeated initialization of short-term forecasts, even in the absence of a traditional numerical forecast model. Further, it is demonstrated how these integrated RNN-DA methods can scale to higher dimensions by applying domain localization and parallelization, providing a path for practical applications in NWP.
    Learning from Small Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v1 [cs.LG])
    (2 min) Motivated by the problem of learning when the number of training samples is small, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{minimum} distance over a group of transformations, which corresponds to defining similarity as the \textit{best} over the possible transformations, are not generally positive definite. Perhaps it is for this reason that they have neither previously been experimentally tested for their performance nor studied theoretically. Instead, previous attempts have employed kernels based on the \textit{average} distance over a group of transformations, which are trivially positive definite, but which generally yield both poor margins as well as poor performance, as we show. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the minimum distance in the small training sample set regime of interest, and that they do yield the best results in that regime. Another important property of CNNs is their ability to incorporate local features at multiple spatial scales, e.g., through max pooling. A third important property is their ability to provide the benefits of composition through the architecture of multiple layers. We show how these additional properties can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network (DNN) benchmarks for small sample sizes.
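    A toy sketch of the central kernel construction (hypothetical, using cyclic image translations as the transformation group): similarity is defined by the best match over all transformations, and the kernel is a Gaussian of that minimum distance.

```python
# Hypothetical sketch of a minimum-distance kernel over a group of transformations
# (here: cyclic translations of a small image), in the spirit of the paper's construction.
import numpy as np

def min_distance(x, y, shifts=range(-2, 3)):
    """Smallest Euclidean distance between x and any translated copy of y."""
    best = np.inf
    for dx in shifts:
        for dy in shifts:
            shifted = np.roll(np.roll(y, dx, axis=0), dy, axis=1)
            best = min(best, np.linalg.norm(x - shifted))
    return best

def min_distance_kernel(images, gamma=0.1):
    n = len(images)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-gamma * min_distance(images[i], images[j]) ** 2)
    return K  # positive definiteness is not guaranteed in general, only with high probability

rng = np.random.default_rng(0)
images = [rng.standard_normal((8, 8)) for _ in range(5)]
K = min_distance_kernel(images)
print(np.linalg.eigvalsh(K).round(3))   # inspect the spectrum for (near-)PSD behaviour
```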
    Spatial Aggregation and Temporal Convolution Networks for Real-time Kriging. (arXiv:2109.12144v1 [cs.LG])
    (2 min) Spatiotemporal kriging is an important application in spatiotemporal data analysis, aiming to recover/interpolate signals for unsampled/unobserved locations based on observed signals. The principle challenge for spatiotemporal kriging is how to effectively model and leverage the spatiotemporal dependencies within the data. Recently, graph neural networks (GNNs) have shown great promise for spatiotemporal kriging tasks. However, standard GNNs often require a carefully designed adjacency matrix and specific aggregation functions, which are inflexible for general applications/problems. To address this issue, we present SATCN -- Spatial Aggregation and Temporal Convolution Networks -- a universal and flexible framework to perform spatiotemporal kriging for various spatiotemporal datasets without the need for model specification. Specifically, we propose a novel spatial aggregation network (SAN) inspired by Principal Neighborhood Aggregation, which uses multiple aggregation functions to help one node gather diverse information from its neighbors. To exclude information from unsampled nodes, a masking strategy that prevents the unsampled sensors from sending messages to their neighborhood is introduced to SAN. We capture temporal dependencies by the temporal convolutional networks, which allows our model to cope with data of diverse sizes. To make SATCN generalizable to unseen nodes and even unseen graph structures, we employ an inductive strategy to train SATCN. We conduct extensive experiments on three real-world spatiotemporal datasets, including traffic speed and climate recordings. Our results demonstrate the superiority of SATCN over traditional and GNN-based kriging models.
    On the Generalization of Stochastic Gradient Descent with Momentum. (arXiv:1809.04564v2 [cs.LG] UPDATED)
    (2 min) While momentum-based methods, in conjunction with stochastic gradient descent (SGD), are widely used when training machine learning models, there is little theoretical understanding of the generalization error of such methods. In this work, we first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM), and show that it admits an upper bound on the generalization error. Thus, our results show that machine learning models can be trained for multiple epochs of SGDEM with a guarantee for generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper bound on the expected true risk, in terms of the number of training steps, the size of the training set, and the momentum parameter. Our experimental evaluations verify the consistency between the numerical results and our theoretical bounds. SGDEM improves the generalization error of SGDM when training ResNet-18 on ImageNet in practical distributed settings.
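    The SGDEM rule itself is simple; the hypothetical sketch below applies heavy-ball momentum only for an initial number of steps and then reverts to plain SGD (the paper's analysis concerns this schedule in general, not this toy problem):

```python
# Hypothetical sketch of SGD with early momentum (SGDEM): heavy-ball momentum is
# applied only for the first `momentum_steps` updates, then switched off.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 5)), rng.standard_normal(200)
w, v = np.zeros(5), np.zeros(5)
lr, beta, momentum_steps = 0.01, 0.9, 500

for t in range(2000):
    i = rng.integers(len(X))
    grad = (X[i] @ w - y[i]) * X[i]                 # stochastic gradient of the squared loss
    beta_t = beta if t < momentum_steps else 0.0    # early momentum, then plain SGD
    v = beta_t * v + grad
    w -= lr * v

print("final loss:", float(((X @ w - y) ** 2).mean()))
```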
    $\ell_p$-Norm Multiple Kernel One-Class Fisher Null-Space. (arXiv:2008.08642v2 [cs.LG] UPDATED)
    (2 min) The paper addresses the multiple kernel learning (MKL) problem for one-class classification (OCC). For this purpose, based on the Fisher null-space one-class classification principle, we present a multiple kernel learning algorithm where a general $\ell_p$-norm constraint ($p\geq1$) on kernel weights is considered. We cast the proposed one-class MKL task as a min-max saddle point Lagrangian optimisation problem and propose an efficient method to solve it. An extension of the proposed one-class MKL approach is also considered where several related one-class MKL tasks are learned jointly by constraining them to share common kernel weights. An extensive assessment of the proposed method on a range of data sets from different application domains confirms its merits against the baseline and several other algorithms.
    What Kinds of Functions do Deep Neural Networks Learn? Insights from Variational Spline Theory. (arXiv:2105.03361v3 [stat.ML] UPDATED)
    (2 min) We develop a variational framework to understand the properties of functions learned by fitting deep neural networks with rectified linear unit activations to data. We propose a new function space, which is reminiscent of classical bounded variation-type spaces, that captures the compositional structure associated with deep neural networks. We derive a representer theorem showing that deep ReLU networks are solutions to regularized data fitting problems over functions from this space. The function space consists of compositions of functions from the Banach spaces of second-order bounded variation in the Radon domain. These are Banach spaces with sparsity-promoting norms, giving insight into the role of sparsity in deep neural networks. The neural network solutions have skip connections and rank bounded weight matrices, providing new theoretical support for these common architectural choices. The variational problem we study can be recast as a finite-dimensional neural network training problem with regularization schemes related to the notions of weight decay and path-norm regularization. Finally, our analysis builds on techniques from variational spline theory, providing new connections between deep neural networks and splines.
    Accurate Remaining Useful Life Prediction with Uncertainty Quantification: a Deep Learning and Nonstationary Gaussian Process Approach. (arXiv:2109.12111v1 [cs.LG])
    (2 min) Remaining useful life (RUL) refers to the expected remaining lifespan of a component or system. Accurate RUL prediction is critical for prognostic and health management and for maintenance planning. In this work, we address three prevalent challenges in data-driven RUL prediction, namely the handling of high dimensional input features, the robustness to noise in sensor data and prognostic datasets, and the capturing of the time-dependency between system degradation and RUL prediction. We devise a highly accurate RUL prediction model with uncertainty quantification, which integrates and leverages the advantages of deep learning and nonstationary Gaussian process regression (DL-NSGPR). We examine and benchmark our model against other advanced data-driven RUL prediction models using the turbofan engine dataset from the NASA prognostic repository. Our computational experiments show that the DL-NSGPR predictions are highly accurate with root mean square error 1.7 to 6.2 times smaller than those of competing RUL models. Furthermore, the results demonstrate that RUL uncertainty bounds with the proposed DL-NSGPR are both valid and significantly tighter than other stochastic RUL prediction models. We unpack and discuss the reasons for this excellent performance of the DL-NSGPR.
    DiffMG: Differentiable Meta Graph Search for Heterogeneous Graph Neural Networks. (arXiv:2010.03250v3 [cs.LG] UPDATED)
    (2 min) In this paper, we propose a novel framework to automatically utilize task-dependent semantic information which is encoded in heterogeneous information networks (HINs). Specifically, we search for a meta graph, which can capture more complex semantic relations than a meta path, to determine how graph neural networks (GNNs) propagate messages along different types of edges. We formalize the problem within the framework of neural architecture search (NAS) and then perform the search in a differentiable manner. We design an expressive search space in the form of a directed acyclic graph (DAG) to represent candidate meta graphs for a HIN, and we propose task-dependent type constraint to filter out those edge types along which message passing has no effect on the representations of nodes that are related to the downstream task. The size of the search space we define is huge, so we further propose a novel and efficient search algorithm to make the total search cost on a par with training a single GNN once. Compared with existing popular NAS algorithms, our proposed search algorithm improves the search efficiency. We conduct extensive experiments on different HINs and downstream tasks to evaluate our method, and experimental results show that our method can outperform state-of-the-art heterogeneous GNNs and also improves efficiency compared with those methods which can implicitly learn meta paths.
    On the complexity of finding a local minimizer of a quadratic function over a polytope. (arXiv:2008.05558v4 [math.OC] UPDATED)
    (2 min) We show that unless P=NP, there cannot be a polynomial-time algorithm that finds a point within Euclidean distance $c^n$ (for any constant $c \ge 0$) of a local minimizer of an $n$-variate quadratic function over a polytope. This result (even with $c=0$) answers a question of Pardalos and Vavasis that appeared in 1992 on a list of seven open problems in complexity theory for numerical optimization. Our proof technique also implies that the problem of deciding whether a quadratic function has a local minimizer over an (unbounded) polyhedron, and that of deciding if a quartic polynomial has a local minimizer are NP-hard.
    On hallucinations in tomographic image reconstruction. (arXiv:2012.00646v3 [eess.IV] UPDATED)
    (2 min) Tomographic image reconstruction is generally an ill-posed linear inverse problem. Such ill-posed inverse problems are typically regularized using prior knowledge of the sought-after object property. Recently, deep neural networks have been actively investigated for regularizing image reconstruction problems by learning a prior for the object properties from training images. However, an analysis of the prior information learned by these deep networks and their ability to generalize to data that may lie outside the training distribution is still being explored. An inaccurate prior might lead to false structures being hallucinated in the reconstructed image and that is a cause for serious concern in medical imaging. In this work, we propose to illustrate the effect of the prior imposed by a reconstruction method by decomposing the image estimate into generalized measurement and null components. The concept of a hallucination map is introduced for the general purpose of understanding the effect of the prior in regularized reconstruction methods. Numerical studies are conducted corresponding to a stylized tomographic imaging modality. The behavior of different reconstruction methods under the proposed formalism is discussed with the help of the numerical studies.
    Provably-Robust Runtime Monitoring of Neuron Activation Patterns. (arXiv:2011.11959v2 [cs.LG] UPDATED)
    (2 min) For deep neural networks (DNNs) to be used in safety-critical autonomous driving tasks, it is desirable to monitor in operation time if the input for the DNN is similar to the data used in DNN training. While recent results in monitoring DNN activation patterns provide a sound guarantee due to building an abstraction out of the training data set, reducing false positives due to slight input perturbation has been an issue towards successfully adapting the techniques. We address this challenge by integrating formal symbolic reasoning inside the monitor construction process. The algorithm performs a sound worst-case estimate of neuron values with inputs (or features) subject to perturbation, before the abstraction function is applied to build the monitor. The provable robustness is further generalized to cases where monitoring a single neuron can use more than one bit, implying that one can record activation patterns with a fine-grained decision on the neuron value interval.
    State Representation Learning from Demonstration. (arXiv:1910.01738v2 [cs.LG] UPDATED)
    (2 min) Robots could learn their own state and world representation from perception and experience without supervision. This desirable goal is the main focus of our field of interest, state representation learning (SRL). Indeed, a compact representation of such a state helps robots make sense of their environment and interact with it. The properties of this representation have a strong impact on the adaptive capability of the agent. In this article we present an approach based on imitation learning. The idea is to train several policies that share the same representation to reproduce various demonstrations. To do so, we use a multi-head neural network with a shared state representation feeding a task-specific agent. If the demonstrations are diverse, the trained representation will eventually contain the information necessary for all tasks, while discarding irrelevant information. As such, it will potentially become a compact state representation useful for new tasks. We call this approach SRLfD (State Representation Learning from Demonstration). Our experiments confirm that when a controller takes SRLfD-based representations as input, it can achieve better performance than with other representation strategies and promote more efficient reinforcement learning (RL) than with an end-to-end RL strategy.
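    The shared-representation, multi-head setup can be sketched as follows (a hypothetical PyTorch illustration, not the authors' code): one encoder produces the state representation, and each demonstration task gets its own policy head trained by behaviour cloning.

```python
# Hypothetical sketch of a shared-encoder, multi-head imitation network for SRLfD.
import torch
from torch import nn

class MultiHeadImitator(nn.Module):
    def __init__(self, obs_dim, state_dim, action_dim, n_tasks):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, state_dim))            # shared state representation
        self.heads = nn.ModuleList([nn.Linear(state_dim, action_dim)      # one policy head per task
                                    for _ in range(n_tasks)])

    def forward(self, obs, task_id):
        state = self.encoder(obs)
        return self.heads[task_id](state)

model = MultiHeadImitator(obs_dim=32, state_dim=8, action_dim=4, n_tasks=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy behaviour-cloning step on one (observation, expert action) batch from task 1.
obs, expert_action = torch.randn(16, 32), torch.randn(16, 4)
loss = ((model(obs, task_id=1) - expert_action) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```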
    Beyond Robustness: A Taxonomy of Approaches towards Resilient Multi-Robot Systems. (arXiv:2109.12343v1 [cs.RO])
    (2 min) Robustness is key to engineering, automation, and science as a whole. However, the property of robustness is often underpinned by costly requirements such as over-provisioning, known uncertainty and predictive models, and known adversaries. These conditions are idealistic, and often not satisfiable. Resilience on the other hand is the capability to endure unexpected disruptions, to recover swiftly from negative events, and bounce back to normality. In this survey article, we analyze how resilience is achieved in networks of agents and multi-robot systems that are able to overcome adversity by leveraging system-wide complementarity, diversity, and redundancy - often involving a reconfiguration of robotic capabilities to provide some key ability that was not present in the system a priori. As society increasingly depends on connected automated systems to provide key infrastructure services (e.g., logistics, transport, and precision agriculture), providing the means to achieving resilient multi-robot systems is paramount. By enumerating the consequences of a system that is not resilient (fragile), we argue that resilience must become a central engineering design consideration. Towards this goal, the community needs to gain clarity on how it is defined, measured, and maintained. We address these questions across foundational robotics domains, spanning perception, control, planning, and learning. One of our key contributions is a formal taxonomy of approaches, which also helps us discuss the defining factors and stressors for a resilient system. Finally, this survey article gives insight as to how resilience may be achieved. Importantly, we highlight open problems that remain to be tackled in order to reap the benefits of resilient robotic systems.
    Predicting Hydroxyl Mediated Nucleophilic Degradation and Molecular Stability of RNA Sequences through the Application of Deep Learning Methods. (arXiv:2011.05136v3 [q-bio.QM] UPDATED)
    (3 min) The synthesis and efficient implementation of mRNA strands have been shown to have wide utility, especially recently in the development of COVID vaccines. However, the intrinsic chemical instability of mRNA poses a challenge due to the presence of 2'-hydroxyl groups in ribose sugars. The -OH group in the backbone structure enables a base-catalyzed nucleophilic attack by the deprotonated hydroxyl on the adjacent phosphorous and consequent self-hydrolysis of the phosphodiester bond. As expected for in-line hydrolytic cleavage reactions, the chemical stability of mRNA strands is highly dependent on external environmental factors, e.g. pH, temperature, oxidizers, etc. Predicting this chemical instability using a computational model will reduce the number of sequences synthesized and tested by identifying the most promising candidates, aiding the development of mRNA related therapies. This paper proposes and evaluates three deep learning models (Long Short Term Memory, Gated Recurrent Unit, and Graph Convolutional Networks) as methods to predict the reactivity and risk of degradation of mRNA sequences. The Stanford Open Vaccine dataset of 6034 mRNA sequences was used in this study. The training set consisted of 3029 of these sequences (length of 107 nucleotide bases) while the testing dataset consisted of 3005 sequences (length of 130 nucleotide bases), in structured (Lowest Entropy Base Pair Probability Matrix) and unstructured (Nodes and Edges) forms. The stability of mRNA strands was accurately predicted, with the Graph Convolutional Network being the best predictor of reactivity ($RMSE = 0.249$) while the Gated Recurrent Unit Network was the best at predicting risks of degradation ($RMSE = 0.266$). Combining all target variables, the GRU performed the best with 76% accuracy. Results suggest these models can be applied to understand and predict the chemical stability of mRNA in the near future.
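    As an illustration of the recurrent variant, the following is a hypothetical sketch of a GRU-based per-nucleotide regressor of the kind described (architecture sizes and target layout are placeholders, not those of the paper):

```python
# Hypothetical sketch of a GRU regressor mapping an encoded RNA sequence to
# per-position reactivity/degradation targets.
import torch
from torch import nn

class RNADegradationGRU(nn.Module):
    def __init__(self, vocab_size=4, embed_dim=32, hidden_dim=64, n_targets=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # A, C, G, U
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, n_targets)          # e.g. reactivity and degradation risk

    def forward(self, tokens):                                    # tokens: (batch, seq_len) integer codes
        h, _ = self.gru(self.embed(tokens))
        return self.head(h)                                       # (batch, seq_len, n_targets)

model = RNADegradationGRU()
tokens = torch.randint(0, 4, (8, 107))                            # batch of 107-base sequences
targets = torch.randn(8, 107, 2)
loss = nn.functional.mse_loss(model(tokens), targets)             # RMSE would be the square root of this
loss.backward()
print(float(loss))
```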
    Meta-learning based Alternating Minimization Algorithm for Non-convex Optimization. (arXiv:2009.04899v5 [cs.LG] UPDATED)
    (2 min) In this paper, we propose a novel solution for non-convex problems of multiple variables, especially those typically solved by an alternating minimization (AM) strategy that splits the original optimization problem into a set of sub-problems corresponding to each variable and then iteratively optimizes each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the original optimization problem, the optimization can be trapped in a spurious local minimum even when each sub-problem is optimally solved at each iteration. Meanwhile, learning-based approaches, such as deep unfolding algorithms, are highly limited by the lack of labelled data and restricted explainability. To tackle these issues, we propose a meta-learning based alternating minimization (MLAM) method, which aims to minimize a portion of the global losses over iterations instead of carrying out minimization on each sub-problem, and which learns an adaptive strategy to replace the handcrafted counterpart, resulting in superior performance. Meanwhile, the proposed MLAM still maintains the original algorithmic principle, which contributes to better interpretability. We evaluate the proposed method on two representative problems, namely a bi-linear inverse problem (matrix completion) and a non-linear problem (Gaussian mixture models). The experimental results validate that our proposed approach outperforms AM-based methods in standard settings and is able to achieve effective optimization in challenging cases where other compared methods typically fail.
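    For context, the kind of fixed, handcrafted update rule that MLAM aims to replace looks like classical alternating least squares for matrix completion; the sketch below is a generic AM baseline under assumed names and sizes, not the paper's method.

```python
# Illustrative baseline only: alternating least squares (ALS) for matrix completion.
# Each variable (U or V) is solved in closed form while the other is held fixed,
# using a fixed, handcrafted update rule at every iteration.
import numpy as np

def als_matrix_completion(M, mask, rank=5, lam=1e-2, iters=50):
    m, n = M.shape
    U = np.random.randn(m, rank)
    V = np.random.randn(n, rank)
    for _ in range(iters):
        for i in range(m):                       # ridge regression per row of U
            Vi = V[mask[i]]                      # factors of observed columns in row i
            U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(rank),
                                   Vi.T @ M[i, mask[i]])
        for j in range(n):                       # ridge regression per row of V
            Uj = U[mask[:, j]]                   # factors of observed rows in column j
            V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(rank),
                                   Uj.T @ M[mask[:, j], j])
    return U @ V.T

# Hypothetical usage on a synthetic low-rank matrix with 40% of entries observed.
rng = np.random.default_rng(0)
truth = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
mask = rng.random(truth.shape) < 0.4
M_hat = als_matrix_completion(truth * mask, mask, rank=5)
```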
    Efficient Estimation and Evaluation of Prediction Rules in Semi-Supervised Settings under Stratified Sampling. (arXiv:2010.09443v2 [stat.ML] UPDATED)
    (2 min) In many contemporary applications, large amounts of unlabeled data are readily available while labeled examples are limited. There has been substantial interest in semi-supervised learning (SSL) which aims to leverage unlabeled data to improve estimation or prediction. However, current SSL literature focuses primarily on settings where labeled data is selected randomly from the population of interest. Non-random sampling, while posing additional analytical challenges, is highly applicable to many real world problems. Moreover, no SSL methods currently exist for estimating the prediction performance of a fitted model under non-random sampling. In this paper, we propose a two-step SSL procedure for evaluating a prediction rule derived from a working binary regression model based on the Brier score and overall misclassification rate under stratified sampling. In step I, we impute the missing labels via weighted regression with nonlinear basis functions to account for nonrandom sampling and to improve efficiency. In step II, we augment the initial imputations to ensure the consistency of the resulting estimators regardless of the specification of the prediction model or the imputation model. The final estimator is then obtained with the augmented imputations. We provide asymptotic theory and numerical studies illustrating that our proposals outperform their supervised counterparts in terms of efficiency gain. Our methods are motivated by electronic health records (EHR) research and validated with a real data analysis of an EHR-based study of diabetic neuropathy.
    Channel State Information Based Localization with Deep Learning. (arXiv:2109.12398v1 [eess.SP])
    (2 min) Localization is one of the most important problems in various fields such as robotics and wireless communications. For instance, Unmanned Aerial Vehicles (UAVs) require precise position information for an adequate control strategy. This problem is handled very efficiently with integrated GPS units for outdoor applications. However, indoor applications require special treatment due to the unavailability of GPS signals. Another aspect of mobile robots such as UAVs is that there is constant wireless communication between the mobile robot and a computational unit. This communication is mainly used to obtain telemetry information or to compute control actions directly. The units responsible for this transmission are commercial wireless communication chipsets. On the receiver side, these units are responsible for compensating the diverse effects of the communication channel using various mathematical techniques. These techniques mainly require the Channel State Information (CSI) of the current channel in order to compensate for the channel itself. After the compensation, the chipset no longer has any use for the CSI. However, the locations of both the transmitter and receiver have a direct impact on CSI. Even though CSI contains such rich information about the environment, access to these data is blocked by commercial wireless chipsets, since they are manufactured to provide only the processed data bits to the user. However, with the IEEE 802.11n standardization, certain chipsets provide access to CSI. Therefore, CSI data became processable and could be integrated into localization schemes. In this project, a test environment was constructed for the localization task. Two routers with proper chipsets were assigned as transmitter and receiver and set up for CSI data collection. Finally, these data were processed with various deep learning models.
    Self-Replicating Neural Programs. (arXiv:2109.12786v1 [cs.NE])
    (2 min) In this work, a neural network is trained to replicate the code that trains it, using only its own output as input. A paradigm for evolutionary self-replication in neural programs is introduced, where program parameters are mutated and a program's ability to train itself more efficiently leads to greater reproductive success. This evolutionary paradigm is demonstrated to produce more efficient learning in the resulting organisms without any explicit guidance, solely through natural selection favoring organisms with faster reproductive maturity.
    Classification of COVID-19 from CXR Images in a 15-class Scenario: an Attempt to Avoid Bias in the System. (arXiv:2109.12453v1 [eess.IV])
    (2 min) As of June 2021, the World Health Organization (WHO) has reported 171.7 million confirmed cases including 3,698,621 deaths from COVID-19. Detecting COVID-19 and other lung diseases from Chest X-Ray (CXR) images can be very effective for emergency diagnosis and treatment as CXR is fast and cheap. The objective of this study is to develop a system capable of detecting COVID-19 along with 14 other lung diseases from CXRs in a fair and unbiased manner. The proposed system consists of a CXR image selection technique and a deep learning based model to classify 15 diseases including COVID-19. The proposed CXR selection technique aims to retain the maximum variation uniformly and eliminate poor quality CXRs with the goal of reducing the training dataset size without compromising classifier accuracy. More importantly, it reduces the often hidden bias and unfairness in decision making. The proposed solution exhibits a promising COVID-19 detection scheme in a more realistic situation than most existing studies as it deals with 15 lung diseases together. We hope the proposed method will have wider adoption in medical image classification and other related fields.
    A Meta-Learning Approach to the Optimal Power Flow Problem Under Topology Reconfigurations. (arXiv:2012.11524v2 [cs.LG] UPDATED)
    (2 min) Recently, there has been a surge of interest in adopting deep neural networks (DNNs) for solving the optimal power flow (OPF) problem in power systems. Computing optimal generation dispatch decisions using a trained DNN takes significantly less time when compared to using conventional optimization solvers. However, a major drawback of existing work is that the machine learning models are trained for a specific system topology. Hence, the DNN predictions are only useful as long as the system topology remains unchanged. Changes to the system topology (initiated by the system operator) would require retraining the DNN, which incurs significant training overhead and requires an extensive amount of training data (corresponding to the new system topology). To overcome this drawback, we propose a DNN-based OPF predictor that is trained using a meta-learning (MTL) approach. The key idea behind this approach is to find a common initialization vector that enables fast training for any system topology. The developed OPF predictor is validated through simulations using benchmark IEEE bus systems. The results show that the MTL approach achieves significant training speed-ups and requires only a few gradient steps with a few data samples to achieve high OPF prediction accuracy.
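    One concrete way to obtain such a common initialization is a Reptile-style meta-update over topology-specific tasks; the sketch below is one standard instantiation of the idea, not necessarily the paper's exact algorithm, and the task loader, inner steps, and learning rates are placeholders.

```python
# Sketch of the "common initialization" idea via Reptile-style meta-learning.
# Each "task" would correspond to OPF training data generated under one topology.
import copy
import torch
import torch.nn as nn

def reptile_step(meta_model, task_loader, inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    model = copy.deepcopy(meta_model)                 # start from the meta-initialization
    opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
    batches = iter(task_loader)                       # assumes >= inner_steps batches
    for _ in range(inner_steps):
        x, y = next(batches)                          # e.g. loads -> optimal dispatch
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    # Move the meta-initialization toward the task-adapted weights.
    with torch.no_grad():
        for p_meta, p_task in zip(meta_model.parameters(), model.parameters()):
            p_meta += meta_lr * (p_task - p_meta)

# Hypothetical predictor: 50 input features (loads), 20 outputs (dispatch set-points).
meta_model = nn.Sequential(nn.Linear(50, 256), nn.ReLU(), nn.Linear(256, 20))
# For each sampled topology: reptile_step(meta_model, loader_for_that_topology)
```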
    Joint Multi-Dimension Pruning via Numerical Gradient Update. (arXiv:2005.08931v2 [cs.CV] UPDATED)
    (2 min) We present joint multi-dimension pruning (abbreviated as JointPruning), an effective method of pruning a network on three crucial aspects simultaneously: spatial size, depth, and channels. To tackle these three naturally different dimensions, we propose a general framework that defines pruning as seeking the best pruning vector (i.e., the numerical values of layer-wise channel number, spatial size, and depth) and constructs a unique mapping from the pruning vector to the pruned network structures. We then optimize the pruning vector with gradient updates and model joint pruning as a numerical gradient optimization process. To overcome the challenge that there is no explicit function between the loss and the pruning vector, we propose a self-adapted stochastic gradient estimation to construct a gradient path from the network loss to the pruning vector and enable efficient gradient updates. We show that the joint strategy discovers a better status than previous studies that focused on individual dimensions solely, as our method is optimized collaboratively across the three dimensions in a single end-to-end training and is more efficient than previous exhaustive methods. Extensive experiments on the large-scale ImageNet dataset across a variety of network architectures (MobileNet V1&V2&V3 and ResNet) demonstrate the effectiveness of our proposed method. For instance, we achieve significant margins of 2.5% and 2.6% improvement over the state-of-the-art approach on the already compact MobileNet V1&V2 under an extremely large compression ratio.
    A Predictive Visual Analytics System for Studying Neurodegenerative Disease based on DTI Fiber Tracts. (arXiv:2010.07047v3 [cs.CV] UPDATED)
    (2 min) Diffusion tensor imaging (DTI) has been used to study the effects of neurodegenerative diseases on neural pathways, which may lead to more reliable and early diagnosis of these diseases as well as a better understanding of how they affect the brain. We introduce an intelligent visual analytics system for studying patient groups based on their labeled DTI fiber tract data and corresponding statistics. The system's AI-augmented interface guides the user through an organized and holistic analysis space, including the statistical feature space, the physical space, and the space of patients over different groups. We use a custom machine learning pipeline to help narrow down this large analysis space, and then explore it pragmatically through a range of linked visualizations. We conduct several case studies using real data from the research database of Parkinson's Progression Markers Initiative.
    Deep Learning Based Signal Enhancement of Low-Resolution Accelerometer for Fall Detection Systems. (arXiv:2012.03426v2 [eess.SP] UPDATED)
    (2 min) In the last two decades, fall detection (FD) systems have been developed as a popular assistive technology. Such systems automatically detect critical fall events and immediately alert medical professionals or caregivers. To support long-term FD services, various power-saving strategies have been implemented. Among them, a reduced sampling rate is a common approach for an energy-efficient system in the real-world. However, the performance of FD systems is diminished owing to low-resolution (LR) accelerometer signals. To improve the detection accuracy with LR accelerometer signals, several technical challenges must be considered, including misalignment, mismatch of effective features, and the degradation effects. In this work, a deep-learning-based accelerometer signal enhancement (ASE) model is proposed to improve the detection performance of LR-FD systems. This proposed model reconstructs high-resolution (HR) signals from the LR signals by learning the relationship between the LR and HR signals. The results show that the FD system using support vector machine and the proposed ASE model at an extremely low sampling rate (sampling rate < 2 Hz) achieved 97.34% and 90.52% accuracies in the SisFall and FallAllD datasets, respectively, while those without ASE models only achieved 95.92% and 87.47% accuracies in the SisFall and FallAllD datasets, respectively. This study demonstrates that the ASE model helps the FD systems tackle the technical challenges of LR signals and achieve better detection performance.
    Learning Stereopsis from Geometric Synthesis for 6D Object Pose Estimation. (arXiv:2109.12266v1 [cs.CV])
    (2 min) Current monocular-based 6D object pose estimation methods generally achieve less competitive results than RGBD-based methods, mostly due to the lack of 3D information. To bridge this gap, this paper proposes a 3D geometric volume based pose estimation method with a short-baseline two-view setting. By constructing a geometric volume in the 3D space, we combine the features from two adjacent images in the same 3D space. Then a network is trained to learn the distribution of the positions of object keypoints in the volume, and a robust soft RANSAC solver is deployed to solve the pose in closed form. To balance accuracy and cost, we propose a coarse-to-fine framework to improve the performance in an iterative way. The experiments show that our method outperforms state-of-the-art monocular-based methods and is robust across different objects and scenes, especially in severe occlusion situations.
    Curb Your Carbon Emissions: Benchmarking Carbon Emissions in Machine Translation. (arXiv:2109.12584v1 [cs.CL])
    (0 min) In recent times, there has been definitive progress in the field of NLP, with its applications growing as the utility of our language models increases with advances in their performance. However, these models require a large amount of computational power and data to train, consequently leading to large carbon footprints. Therefore, it is imperative that we study carbon efficiency and look for alternatives to reduce the overall environmental impact of training models, in particular large language models. In our work, we assess the performance of models for machine translation across multiple language pairs, examine the difference in computational power required to train these models for each of these language pairs, and analyze the various components of these models to identify aspects of the pipeline that can be optimized to reduce these carbon emissions.
    Density Estimation via Bayesian Inference Engines. (arXiv:2009.06182v4 [stat.ML] UPDATED)
    (2 min) We explain how effective automatic probability density function estimates can be constructed using contemporary Bayesian inference engines such as those based on no-U-turn sampling and expectation propagation. Extensive simulation studies demonstrate that the proposed density estimates have excellent comparative performance and scale well to very large sample sizes due to a binning strategy. Moreover, the approach is fully Bayesian and all estimates are accompanied by pointwise credible intervals. An accompanying package in the R language facilitates easy use of the new density estimates.
    Video Summarization Using Deep Neural Networks: A Survey. (arXiv:2101.06072v2 [cs.CV] UPDATED)
    (2 min) Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions.
    Sparse Plus Low Rank Matrix Decomposition: A Discrete Optimization Approach. (arXiv:2109.12701v1 [stat.ML])
    (2 min) We study the Sparse Plus Low Rank decomposition problem (SLR), which is the problem of decomposing a corrupted data matrix $\mathbf{D}$ into a sparse matrix $\mathbf{Y}$ containing the perturbations plus a low rank matrix $\mathbf{X}$. SLR is a fundamental problem in Operations Research and Machine Learning arising in many applications such as data compression, latent semantic indexing, collaborative filtering and medical imaging. We introduce a novel formulation for SLR that directly models the underlying discreteness of the problem. For this formulation, we develop an alternating minimization heuristic to compute high quality solutions and a novel semidefinite relaxation that provides meaningful bounds for the solutions returned by our heuristic. We further develop a custom branch and bound routine that leverages our heuristic and convex relaxation that solves small instances of SLR to certifiable near-optimality. Our heuristic can scale to $n=10000$ in hours, our relaxation can scale to $n=200$ in hours, and our branch and bound algorithm can scale to $n=25$ in minutes. Our numerical results demonstrate that our approach outperforms existing state-of-the-art approaches in terms of the MSE of the low rank matrix and that of the sparse matrix.
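    As a simplified illustration of the alternating structure (not the paper's actual heuristic, semidefinite relaxation, or branch-and-bound routine), one can alternate a truncated SVD fit for the low-rank part with hard thresholding of the residual for the sparse part, which makes the discrete rank and sparsity budgets explicit; the function name and budgets below are assumptions.

```python
# Simplified sketch of an alternating heuristic for D = X (low rank) + Y (sparse):
# fit X by a rank-r truncated SVD of (D - Y), then fit Y by keeping the k largest
# entries of the residual (D - X).
import numpy as np

def slr_heuristic(D, rank, k, iters=30):
    Y = np.zeros_like(D)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(D - Y, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]        # best rank-r fit to D - Y
        R = D - X
        Y = np.zeros_like(D)
        idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-k:], D.shape)
        Y[idx] = R[idx]                                  # keep the k largest residuals
    return X, Y
```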
    GPRInvNet: Deep Learning-Based Ground Penetrating Radar Data Inversion for Tunnel Lining. (arXiv:1912.05759v3 [cs.CV] UPDATED)
    (3 min) A DNN architecture referred to as GPRInvNet was proposed to tackle the challenges of mapping the ground-penetrating radar (GPR) B-Scan data to complex permittivity maps of subsurface structures. The GPRInvNet consisted of a trace-to-trace encoder and a decoder. It was specially designed to take into account the characteristics of GPR inversion when faced with complex GPR B-Scan data, as well as addressing the spatial alignment issues between time-series B-Scan data and spatial permittivity maps. It displayed the ability to fuse features from several adjacent traces on the B-Scan data to enhance each trace, and then further condense the features of each trace separately. As a result, the sensitive zones on the permittivity maps spatially aligned to the enhanced trace could be reconstructed accurately. The GPRInvNet has been utilized to reconstruct the permittivity map of tunnel linings. A diverse range of dielectric models of tunnel linings containing complex defects has been reconstructed using GPRInvNet. The results have demonstrated that the GPRInvNet is capable of effectively reconstructing complex tunnel lining defects with clear boundaries. Comparative results with existing baseline methods also demonstrated the superiority of the GPRInvNet. For the purpose of generalizing the GPRInvNet to real GPR data, some background noise patches recorded from practical model testing were integrated into the synthetic GPR data to retrain the GPRInvNet. The model testing has been conducted for validation, and experimental results revealed that the GPRInvNet had also achieved satisfactory results with regard to the real data.
    Emergent behavior and neural dynamics in artificial agents tracking turbulent plumes. (arXiv:2109.12434v1 [q-bio.NC])
    (2 min) Tracking a turbulent plume to locate its source is a complex control problem because it requires multi-sensory integration and must be robust to intermittent odors, changing wind direction, and variable plume statistics. This task is routinely performed by flying insects, often over long distances, in pursuit of food or mates. Several aspects of this remarkable behavior have been studied in detail in many experimental studies. Here, we take a complementary in silico approach, using artificial agents trained with reinforcement learning to develop an integrated understanding of the behaviors and neural computations that support plume tracking. Specifically, we use deep reinforcement learning (DRL) to train recurrent neural network (RNN) agents to locate the source of simulated turbulent plumes. Interestingly, the agents' emergent behaviors resemble those of flying insects, and the RNNs learn to represent task-relevant variables, such as head direction and time since last odor encounter. Our analyses suggest an intriguing experimentally testable hypothesis for tracking plumes in changing wind direction -- that agents follow local plume shape rather than the current wind direction. While reflexive short-memory behaviors are sufficient for tracking plumes in constant wind, longer timescales of memory are essential for tracking plumes that switch direction. At the level of neural dynamics, the RNNs' population activity is low-dimensional and organized into distinct dynamical structures, with some correspondence to behavioral modules. Our in silico approach provides key intuitions for turbulent plume tracking strategies and motivates future targeted experimental and theoretical developments.
    MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning. (arXiv:2109.12674v1 [cs.LG])
    (2 min) Driving safely requires multiple capabilities from human and intelligent agents, such as the generalizability to unseen environments, the decision making in complex multi-agent settings, and the safety awareness of the surrounding traffic. Despite the great success of reinforcement learning, most of the RL research studies each capability separately due to the lack of the integrated interactive environments. In this work, we develop a new driving simulation platform called MetaDrive for the study of generalizable reinforcement learning algorithms. MetaDrive is highly compositional, which can generate an infinite number of diverse driving scenarios from both the procedural generation and the real traffic data replay. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic. We open-source this simulator and maintain its development at: https://github.com/decisionforce/metadrive
    Accelerated nonlinear primal-dual hybrid gradient algorithms with applications to machine learning. (arXiv:2109.12222v1 [math.OC])
    (2 min) The primal-dual hybrid gradient (PDHG) algorithm is a first-order method that splits convex optimization problems with saddle-point structure into smaller subproblems. Those subproblems, unlike those obtained from most other splitting methods, can generally be solved efficiently because they involve simple operations such as matrix-vector multiplications or proximal mappings that are easy to evaluate. In order to work fast, however, the PDHG algorithm requires stepsize parameters fine-tuned for the problem at hand. Unfortunately, the stepsize parameters must often be estimated from quantities that are prohibitively expensive to compute for large-scale optimization problems, such as those in machine learning. In this paper, we introduce accelerated nonlinear variants of the PDHG algorithm that can achieve, for a broad class of optimization problems relevant to machine learning, an optimal rate of convergence with stepsize parameters that are simple to compute. We prove rigorous convergence results, including for problems posed on infinite-dimensional reflexive Banach spaces. We also provide practical implementations of accelerated nonlinear PDHG algorithms for solving several regression tasks in machine learning, including support vector machines without offset, kernel ridge regression, elastic net regularized linear regression, and the least absolute shrinkage and selection operator.
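    For reference, the classical (unaccelerated) PDHG iteration that this work builds on can be written down for the lasso in a few lines; the sketch below is a standard textbook instantiation with the usual tau*sigma*||K||^2 <= 1 stepsize rule, not the paper's accelerated nonlinear variants.

```python
# Classical PDHG applied to the lasso: min_x 0.5*||A x - b||^2 + lam*||x||_1,
# written as F(Ax) + G(x) with F(z) = 0.5*||z - b||^2 and G(x) = lam*||x||_1.
import numpy as np

def pdhg_lasso(A, b, lam, iters=500):
    L = np.linalg.norm(A, 2)                 # operator norm of K = A
    tau = sigma = 0.99 / L                   # ensures tau * sigma * ||K||^2 <= 1
    x = np.zeros(A.shape[1]); x_bar = x.copy()
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        y = (y + sigma * (A @ x_bar) - sigma * b) / (1 + sigma)           # prox of sigma*F*
        x_new = x - tau * (A.T @ y)
        x_new = np.sign(x_new) * np.maximum(np.abs(x_new) - tau * lam, 0)  # prox of tau*G
        x_bar = 2 * x_new - x                                              # extrapolation, theta = 1
        x = x_new
    return x
```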
    A Practical & Unified Notation for Information-Theoretic Quantities in ML. (arXiv:2106.12062v2 [cs.LG] UPDATED)
    (0 min) Information theory is of importance to machine learning, but the notation for information-theoretic quantities is sometimes opaque. The right notation can convey valuable intuitions and concisely express new ideas. We propose such a notation for machine learning users and expand it to include information-theoretic quantities between observed outcomes (events) and random variables. To demonstrate the value of our notation, first, we apply it to elegantly prove a version of Stirling's approximation for binomial coefficients mentioned by MacKay. Second, we apply the notation to a popular information-theoretic acquisition function in Bayesian active learning which selects the most informative (unlabelled) samples to be labelled by an expert and extend this acquisition function to the core-set problem, which consists of selecting the most informative samples \emph{given} the labels.
    Vision-Guided Forecasting -- Visual Context for Multi-Horizon Time Series Forecasting. (arXiv:2107.12674v2 [cs.CV] UPDATED)
    (0 min) Autonomous driving gained huge traction in recent years, due to its potential to change the way we commute. Much effort has been put into trying to estimate the state of a vehicle. Meanwhile, learning to forecast the state of a vehicle ahead introduces new capabilities, such as predicting dangerous situations. Moreover, forecasting brings new supervision opportunities by learning to predict a richer context, expressed by multiple horizons. Intuitively, a video stream originating from a front-facing camera is necessary because it encodes information about the upcoming road. Besides, historical traces of the vehicle's states give more context. In this paper, we tackle multi-horizon forecasting of vehicle states by fusing the two modalities. We design and experiment with 3 end-to-end architectures that exploit 3D convolutions for visual feature extraction and 1D convolutions for feature extraction from speed and steering angle traces. To demonstrate the effectiveness of our method, we perform extensive experiments on two publicly available real-world datasets, Comma2k19 and the Udacity challenge. We show that we are able to forecast a vehicle's state over various horizons, while outperforming the current state-of-the-art results on the related task of driving state estimation. We examine the contribution of vision features, and find that a model fed with vision features achieves an error that is 56.6% and 66.9% of the error of a model that doesn't use those features, on the Udacity and Comma2k19 datasets respectively.
    Development of Risk-Free COVID-19 Screening Algorithm from Routine Blood Test using Ensemble Machine Learning. (arXiv:2108.05660v2 [cs.LG] UPDATED)
    (0 min) The Reverse Transcription Polymerase Chain Reaction (RT-PCR) test is the silver-bullet diagnostic test to discern COVID infection. Rapid antigen detection is a screening test that can identify COVID-positive patients in as little as 15 minutes, but it has a lower sensitivity than the PCR tests. Despite the availability of multiple standardized test kits, many people become infected and either recover or die before being tested, owing to the shortage and cost of kits, the lack of indispensable specialists and labs, and turnaround times that are too slow for mass testing, especially in developing and underdeveloped countries. Motivated by the parametric deviations in the immunological and hematological profiles of COVID patients, this research work proposes a risk-free and highly accurate stacked ensemble machine learning model to identify a COVID patient from commonly available, widespread, and cheap routine blood tests, giving a promising accuracy, precision, recall, and F1-score of 100%. Analysis of the R-curve also shows the preciseness of the risk-free model to be implemented. The proposed method has the potential for large-scale, ubiquitous, low-cost screening applications. This can add an extra layer of protection in keeping the number of infected cases to a minimum and help control the pandemic by identifying asymptomatic or pre-symptomatic people early.
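    A stacked ensemble of this general shape can be expressed with scikit-learn's StackingClassifier; the base learners, meta-learner, and feature names below are placeholders and are not taken from the paper.

```python
# Sketch of a stacked ensemble over routine blood-test features (placeholder pipeline).
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                      # out-of-fold predictions feed the meta-learner
)
# X: routine blood-test features (e.g. WBC, lymphocyte count, CRP); y: RT-PCR label.
# stack.fit(X_train, y_train); preds = stack.predict(X_test)
```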
    FLEA: Provably Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v2 [cs.LG] UPDATED)
    (0 min) Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that - given enough data - FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half.
    Safety Metrics for Semantic Segmentation in Autonomous Driving. (arXiv:2105.10142v2 [cs.CV] UPDATED)
    (0 min) Within the context of autonomous driving, safety-related metrics for deep neural networks have been widely studied for image classification and object detection. In this paper, we further consider safety-aware correctness and robustness metrics specialized for semantic segmentation. The novelty of our proposal is to move beyond pixel-level metrics: Given two images with each having N pixels being class-flipped, the designed metrics should, depending on the clustering of pixels being class-flipped or the location of occurrence, reflect a different level of safety criticality. The result evaluated on an autonomous driving dataset demonstrates the validity and practicality of our proposed methodology.
    Contrastive Learning of Musical Representations. (arXiv:2103.09410v2 [cs.SD] UPDATED)
    (0 min) While deep learning has enabled great advances in many areas of music, labeled music datasets remain especially hard, expensive, and time-consuming to create. In this work, we introduce SimCLR to the music domain and contribute a large chain of audio data augmentations to form a simple framework for self-supervised, contrastive learning of musical representations: CLMR. This approach works on raw time-domain music data and requires no labels to learn useful representations. We evaluate CLMR in the downstream task of music classification on the MagnaTagATune and Million Song datasets and present an ablation study to test which of our music-related innovations over SimCLR are most effective. A linear classifier trained on the proposed representations achieves a higher average precision than supervised models on the MagnaTagATune dataset, and performs comparably on the Million Song dataset. Moreover, we show that CLMR's representations are transferable using out-of-domain datasets, indicating that our method has strong generalisability in music classification. Lastly, we show that the proposed method allows data-efficient learning on smaller labeled datasets: we achieve an average precision of 33.1% despite using only 259 labeled songs in the MagnaTagATune dataset (1% of the full dataset) during linear evaluation. To foster reproducibility and future research on self-supervised learning in music, we publicly release the pre-trained models and the source code of all experiments of this paper.
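    For readers unfamiliar with the underlying SimCLR objective, the sketch below shows the standard NT-Xent contrastive loss over two augmented views of each audio clip; the encoder, augmentations, and temperature are placeholders, and this is not CLMR's released code.

```python
# NT-Xent contrastive loss: two augmented views of each clip are pulled together
# and pushed away from every other clip in the batch.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # (2N, d), unit norm
    sim = z @ z.t() / temperature                           # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                   # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                    # positive = the other view

# z1, z2 would be projection-head outputs of two augmentations of the same waveforms.
loss = nt_xent(torch.randn(32, 128), torch.randn(32, 128))
```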
    MixNN: Protection of Federated Learning Against Inference Attacks by Mixing Neural Network Layers. (arXiv:2109.12550v1 [cs.LG])
    (2 min) Machine Learning (ML) has emerged as a core technology to provide learning models to perform complex tasks. Boosted by Machine Learning as a Service (MLaaS), the number of applications relying on ML capabilities is ever increasing. However, ML models are the source of different privacy violations through passive or active attacks from different entities. In this paper, we present MixNN, a proxy-based privacy-preserving system for federated learning that protects the privacy of participants against a curious or malicious aggregation server trying to infer sensitive attributes. MixNN receives the model updates from participants and mixes layers between participants before sending the mixed updates to the aggregation server. This mixing strategy drastically reduces privacy leakage without any trade-off in utility. Indeed, mixing the updates of the model has no impact on the result of the aggregation of the updates computed by the server. We experimentally evaluate MixNN and design a new attribute inference attack, Sim, exploiting the privacy vulnerability of the SGD algorithm to quantify privacy leakage in different settings (i.e., the aggregation server can conduct a passive or an active attack). We show that MixNN significantly limits the attribute inference compared to a baseline using noisy gradients (well known to damage utility) while keeping the same level of utility as classic federated learning.
    Training on Test Data with Bayesian Adaptation for Covariate Shift. (arXiv:2109.12746v1 [cs.LG])
    (2 min) When faced with distribution shift at test time, deep neural networks often make inaccurate predictions with unreliable uncertainty estimates. While improving the robustness of neural networks is one promising approach to mitigate this issue, an appealing alternative to robustifying networks against all possible test-time shifts is to instead directly adapt them to unlabeled inputs from the particular distribution shift we encounter at test time. However, this poses a challenging question: in the standard Bayesian model for supervised learning, unlabeled inputs are conditionally independent of model parameters when the labels are unobserved, so what can unlabeled data tell us about the model parameters at test time? In this paper, we derive a Bayesian model that provides for a well-defined relationship between unlabeled inputs under distributional shift and model parameters, and show how approximate inference in this model can be instantiated with a simple regularized entropy minimization procedure at test time. We evaluate our method on a variety of distribution shifts for image classification, including image corruptions, natural distribution shifts, and domain adaptation settings, and show that our method improves both accuracy and uncertainty estimation.
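    To make the test-time procedure concrete, here is a sketch of the core entropy-minimization loop on an unlabeled test batch, adapting only normalization parameters; the regularizer derived from the paper's Bayesian model is omitted, so this is a related simplified procedure rather than the paper's exact method.

```python
# Sketch: minimize the entropy of predictions on an unlabeled test batch,
# updating only BatchNorm parameters (assumes the model contains BatchNorm2d layers).
import torch
import torch.nn as nn

def adapt_on_batch(model, x, lr=1e-3):
    params = [p for m in model.modules() if isinstance(m, nn.BatchNorm2d)
              for p in m.parameters()]
    opt = torch.optim.SGD(params, lr=lr)
    probs = model(x).softmax(dim=1)                              # (batch, classes)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    opt.zero_grad()
    entropy.backward()                                           # adapt to the shift
    opt.step()
    return entropy.item()
```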
    Private Prediction Sets. (arXiv:2102.06202v2 [cs.LG] UPDATED)
    (0 min) In real-world settings involving consequential decision-making, the deployment of machine learning systems generally requires both reliable uncertainty quantification and protection of individuals' privacy. We present a framework that treats these two desiderata jointly. Our framework is based on conformal prediction, a methodology that augments predictive models to return prediction sets that provide uncertainty quantification -- they provably cover the true response with a user-specified probability, such as 90%. One might hope that when used with privately-trained models, conformal prediction would yield privacy guarantees for the resulting prediction sets; unfortunately this is not the case. To remedy this key problem, we develop a method that takes any pre-trained predictive model and outputs differentially private prediction sets. Our method follows the general approach of split conformal prediction; we use holdout data to calibrate the size of the prediction sets but preserve privacy by using a privatized quantile subroutine. This subroutine compensates for the noise introduced to preserve privacy in order to guarantee correct coverage. We evaluate the method on large-scale computer vision datasets.
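    To make the mechanics concrete, the sketch below shows ordinary (non-private) split conformal prediction for classification; the paper's contribution is to replace the calibration quantile with a differentially private subroutine, which is not shown here, and the score definition below is one common choice.

```python
# Split conformal prediction: calibrate a score threshold on held-out data so that
# prediction sets cover the true label with probability about 1 - alpha.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)   # conformal quantile index
    return np.sort(scores)[k - 1]

def prediction_set(test_probs, threshold):
    # Keep every label whose score falls below the calibrated threshold.
    return [np.where(1.0 - p <= threshold)[0] for p in test_probs]
```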
    On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models. (arXiv:2107.04154v2 [eess.AS] UPDATED)
    (0 min) Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybrid acoustic modeling (AM). In this framework, we show that LF-MMI is a powerful training criterion applicable to both limited-context and full-context models, for wordpiece/mono-char/bi-char/chenone units, with both HMM/CTC topologies. From this framework, we propose three novel training schemes: chenone(ch)/wordpiece(wp)-CTC-bMMI, and wordpiece(wp)-HMM-bMMI with different advantages in training performance, decoding efficiency and decoding time-stamp accuracy. The advantages of different training schemes are evaluated comprehensively on Librispeech, and wp-CTC-bMMI and ch-CTC-bMMI are evaluated on two real world ASR tasks to show their effectiveness. Besides, we also show bi-char(bc) HMM-MMI models can serve as better alignment models than traditional non-neural GMM-HMMs.
    Sublinear Regret for Learning POMDPs. (arXiv:2107.03635v3 [cs.LG] UPDATED)
    (0 min) We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
    Rethinking Hard-Parameter Sharing in Multi-Domain Learning. (arXiv:2107.11359v2 [cs.LG] UPDATED)
    (0 min) Hard parameter sharing in multi-domain learning (MDL) allows domains to share some of the model parameters in order to reduce storage cost while improving prediction accuracy. One common sharing practice is to share the bottom layers of a deep neural network among domains while using separate top layers for each domain. In this work, we revisit this common practice via an empirical study on image classification tasks over a diverse set of visual domains and make two surprising observations. (1) Using separate bottom-layer parameters can achieve significantly better performance than the common practice, and this phenomenon holds for different numbers of domains jointly trained on different backbone architectures with different quantities of domain-specific parameters. (2) A multi-domain model with a small proportion of domain-specific parameters from bottom layers can achieve competitive performance with independent models trained on each domain separately. Our observations suggest adopting the new strategy of using separate bottom-layer parameters as a stronger baseline for model design in MDL.
    Unbiased Monte Carlo Cluster Updates with Autoregressive Neural Networks. (arXiv:2105.05650v2 [cond-mat.stat-mech] UPDATED)
    (0 min) Efficient sampling of complex high-dimensional probability densities is a central task in computational science. Machine Learning techniques based on autoregressive neural networks have been recently shown to provide good approximations of probability distributions of interest in physics. In this work, we propose a systematic way to remove the intrinsic bias associated with these variational approximations, combining it with Markov-chain Monte Carlo in an automatic scheme to efficiently generate cluster updates, which is particularly useful for models for which no efficient cluster update scheme is known. Our approach is based on symmetry-enforced cluster updates building on the neural-network representation of conditional probabilities. We demonstrate that such finite-cluster updates are crucial to circumvent ergodicity problems associated with global neural updates. We test our method for first- and second-order phase transitions in classical spin systems, proving in particular its viability for critical systems, or in the presence of metastable states.
    Cascade RCNN for MIDOG Challenge. (arXiv:2109.01085v2 [cs.CV] UPDATED)
    (0 min) Mitotic counts are one of the key indicators of breast cancer prognosis. However, accurate mitotic cell counting is still a difficult problem and is laborious. Automated methods have been proposed for this task, but are usually dependent on the training images and show poor performance on unseen domains. In this work, we present a multi-stage mitosis detection method based on a Cascade RCNN developed to be sequentially more selective against false positives. On the preliminary test set, the algorithm scores an F1-score of 0.7492.
    Model-Free Learning of Safe yet Effective Controllers. (arXiv:2103.14600v2 [cs.RO] UPDATED)
    (0 min) We study the problem of learning safe control policies that are also effective; i.e., maximizing the probability of satisfying a linear temporal logic (LTL) specification of a task, and the discounted reward capturing the (classic) control performance. We consider unknown environments modeled as Markov decision processes. We propose a model-free reinforcement learning algorithm that learns a policy that first maximizes the probability of ensuring safety, then the probability of satisfying the given LTL specification and lastly, the sum of discounted Quality of Control rewards. Finally, we illustrate applicability of our RL-based approach.
    A k-mer Based Approach for SARS-CoV-2 Variant Identification. (arXiv:2108.03465v4 [q-bio.QM] UPDATED)
    (0 min) With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant and modeling the likelihood of breakthrough infections. It is well known that the spike protein contains most of the information/variation pertaining to coronavirus variants. In this paper, we use spike sequences to classify different variants of the coronavirus in humans. We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show that we can train our model to outperform the baseline algorithms using only a small number of training samples ($1\%$ of the data). Finally, we show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
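    As a concrete illustration of the kind of sequence featurization such classifiers typically consume, the sketch below counts overlapping k-mers of a spike protein sequence; the alphabet, choice of k, and the example sequence are placeholders, and this is not necessarily the paper's exact feature pipeline.

```python
# k-mer featurization: count overlapping length-k substrings of a protein sequence
# and use the counts as a fixed-length feature vector for a downstream classifier.
# Local ordering of amino acids is preserved within each k-mer.
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_features(sequence, k=3, alphabet=AMINO_ACIDS):
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]   # 20^k dimensions
    return [counts.get(kmer, 0) for kmer in vocab]

features = kmer_features("MFVFLVLLPLVSSQCVNLT", k=3)            # toy spike fragment
```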
    Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network. (arXiv:2007.02486v2 [stat.ML] UPDATED)
    (0 min) Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well overparametrized neural networks can recover the true target function in the presence of random noises. We establish a lower bound on the $L_2$ estimation error with respect to the GD iterations, which is away from zero without a delicate scheme of early stopping. In turn, through a comprehensive analysis of $\ell_2$-regularized GD trajectories, we prove that for overparametrized one-hidden-layer ReLU neural network with the $\ell_2$ regularization: (1) the output is close to that of the kernel ridge regression with the corresponding neural tangent kernel; (2) minimax {optimal} rate of $L_2$ estimation error can be achieved. Numerical experiments confirm our theory and further demonstrate that the $\ell_2$ regularization approach improves the training robustness and works for a wider range of neural networks.
    Electoral Programs of German Parties 2021: A Computational Analysis Of Their Comprehensibility and Likeability Based On SentiArt. (arXiv:2109.12500v1 [cs.CL])
    (2 min) The electoral programs of six German parties issued before the parliamentary elections of 2021 are analyzed using state-of-the-art computational tools for quantitative narrative, topic, and sentiment analysis. We compare different methods for computing the textual similarity of the programs (Jaccard bag similarity, Latent Semantic Analysis, doc2vec, and sBERT), with the representational and computational complexity increasing from the first to the fourth method. A new similarity measure for entire documents derived from the Fowlkes-Mallows score is applied to k-means clustering of sBERT-transformed sentences. Using novel indices of the readability and emotion potential of texts computed via SentiArt (Jacobs, 2019), our data shed light on the similarities and differences of the programs regarding their length, main ideas, comprehensibility, likeability, and semantic complexity. Among other things, they reveal that the programs of the SPD and CDU have the best chances of being comprehensible and likeable, all other things being equal, and they raise the important issue of which similarity measure is optimal for comparing texts such as electoral programs, which necessarily share a lot of words. While such analyses cannot replace qualitative analyses or a deep reading of the texts, they offer predictions that can be verified in empirical studies and may serve as a motivation for changing aspects of future electoral programs, potentially making them more comprehensible and/or likeable.
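    The simplest of the compared measures, Jaccard bag similarity, can be written in a few lines; the sketch below uses placeholder token lists rather than the actual program texts.

```python
# Jaccard bag (multiset) similarity between two tokenized documents:
# intersection over union of per-word counts.
from collections import Counter

def jaccard_bag_similarity(tokens_a, tokens_b):
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    intersection = sum((ca & cb).values())   # min count per word
    union = sum((ca | cb).values())          # max count per word
    return intersection / union if union else 0.0

# Placeholder token lists standing in for two tokenized electoral programs.
sim = jaccard_bag_similarity("we will lower taxes and rents".split(),
                             "we will lower rents and build housing".split())
```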
    Deep Exploration for Recommendation Systems. (arXiv:2109.12509v1 [cs.IR])
    (2 min) We investigate the design of recommendation systems that can efficiently learn from sparse and delayed feedback. Deep Exploration can play an important role in such contexts, enabling a recommendation system to much more quickly assess a user's needs and personalize service. We design an algorithm based on Thompson Sampling that carries out Deep Exploration. We demonstrate through simulations that the algorithm can substantially amplify the rate of positive feedback relative to common recommendation system designs in a scalable fashion. These results demonstrate promise that we hope will inspire engineering of production recommendation systems that leverage Deep Exploration.
    Quantization for Distributed Optimization. (arXiv:2109.12497v1 [cs.LG])
    (2 min) Massive amounts of data have made the training of large-scale machine learning models on a single worker inefficient. Distributed machine learning methods such as Parallel-SGD have received significant interest as a solution to tackle this problem. However, the performance of distributed systems does not scale linearly with the number of workers due to the high network communication cost for synchronizing gradients and parameters. Researchers have proposed techniques such as quantization and sparsification to alleviate this problem by compressing the gradients. Most of the compression schemes result in compressed gradients that cannot be directly aggregated with efficient protocols such as all-reduce. In this paper, we present a set of all-reduce compatible gradient compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD. We present the results of our experiments with the CIFAR10 dataset and observations derived during the process. Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks. Code is available at the repository: \url{https://github.com/vineeths96/Gradient-Compression}.
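    As one generic example of the kind of scheme that remains all-reduce compatible (not necessarily the exact methods proposed in the paper), stochastic uniform quantization maps every worker's gradient onto the same fixed grid, so quantized tensors can still be summed; the level count below is an assumption.

```python
# Unbiased stochastic uniform quantization of a gradient tensor to a fixed grid.
# Because all workers quantize onto the same grid, the results remain summable
# with all-reduce.
import torch

def stochastic_uniform_quantize(grad, num_levels=16):
    scale = grad.abs().max().clamp_min(1e-12)
    normalized = grad.abs() / scale * (num_levels - 1)        # map magnitudes to [0, L-1]
    lower = normalized.floor()
    prob_up = normalized - lower                               # round up with this probability
    quantized = lower + torch.bernoulli(prob_up)               # unbiased stochastic rounding
    return grad.sign() * quantized * scale / (num_levels - 1)

g = torch.randn(1000)
q = stochastic_uniform_quantize(g)        # E[q] == g, but at reduced precision
```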
    Auditing AI models for Verified Deployment under Semantic Specifications. (arXiv:2109.12456v1 [cs.LG])
    (2 min) Auditing trained deep learning (DL) models prior to deployment is vital in preventing unintended consequences. One of the biggest challenges in auditing is in understanding how we can obtain human-interpretable specifications that are directly useful to the end-user. We address this challenge through a sequence of semantically-aligned unit tests, where each unit test verifies whether a predefined specification (e.g., accuracy over 95%) is satisfied with respect to controlled and semantically aligned variations in the input space (e.g., in face recognition, the angle relative to the camera). We perform these unit tests by directly verifying the semantically aligned variations in an interpretable latent space of a generative model. Our framework, AuditAI, bridges the gap between interpretable formal verification and scalability. With evaluations on four different datasets, covering images of towers, chest X-rays, human faces, and ImageNet classes, we show how AuditAI allows us to obtain controlled variations for verification and certified training while addressing the limitations of verifying using only pixel-space perturbations. A blog post accompanying the paper is at this link https://developer.nvidia.com/blog/nvidia-research-auditing-ai-models-for-verified-deployment-under-semantic-specifications
    A Survey on Graph-Based Deep Learning for Computational Histopathology. (arXiv:2107.00272v2 [cs.LG] UPDATED)
    (0 min) With the remarkable success of representation learning for prediction problems, we have witnessed a rapid expansion of the use of machine learning and deep learning for the analysis of digital pathology and biopsy image patches. However, learning over patch-wise features using convolutional neural networks limits the ability of the model to capture global contextual information and comprehensively model tissue composition. The phenotypical and topological distribution of constituent histological entities play a critical role in tissue diagnosis. As such, graph data representations and deep learning have attracted significant attention for encoding tissue representations, and capturing intra- and inter- entity level interactions. In this review, we provide a conceptual grounding for graph analytics in digital pathology, including entity-graph construction and graph architectures, and present their current success for tumor localization and classification, tumor invasion and staging, image retrieval, and survival prediction. We provide an overview of these methods in a systematic manner organized by the graph representation of the input image, scale, and organ on which they operate. We also outline the limitations of existing techniques, and suggest potential future research directions in this domain.
    BioCopy: A Plug-And-Play Span Copy Mechanism in Seq2Seq Models. (arXiv:2109.12533v1 [cs.CL])
    (2 min) Copy mechanisms explicitly obtain unchanged tokens from the source (input) sequence to generate the target (output) sequence under the neural seq2seq framework. However, most of the existing copy mechanisms only consider single-word copying from the source sentences, which results in losing essential tokens while copying long spans. In this work, we propose a plug-and-play architecture, namely BioCopy, to alleviate the aforementioned problem. Specifically, in the training stage, we construct a BIO tag for each token and train the original model jointly with the BIO tags. In the inference stage, the model first predicts the BIO tag at each time step, then applies different mask strategies based on the predicted BIO label to narrow the scope of the probability distribution over the vocabulary list. Experimental results on two separate generative tasks show that adding BioCopy to the original model structure outperforms the baseline models on both tasks.
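    To illustrate the training-time tagging step, the sketch below assigns "B" to a target token that starts a span copied from the source, "I" to one that continues such a span, and "O" otherwise; the greedy bigram-based span check is a simplification, not the paper's exact construction.

```python
# Simplified BIO tag construction for copied spans in the target sequence.
def bio_tags(source_tokens, target_tokens):
    source_vocab = set(source_tokens)
    # Adjacent source pairs, used as a simple "continues the same span" check.
    follows = {(source_tokens[i], source_tokens[i + 1])
               for i in range(len(source_tokens) - 1)}
    tags = []
    for i, tok in enumerate(target_tokens):
        if tok not in source_vocab:
            tags.append("O")                  # not copied from the source
        elif i > 0 and tags[-1] in ("B", "I") and (target_tokens[i - 1], tok) in follows:
            tags.append("I")                  # continues a span present in the source
        else:
            tags.append("B")                  # starts a new copied span
    return tags

print(bio_tags("the red cat sat".split(), "a red cat ran".split()))
# ['O', 'B', 'I', 'O']
```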
    Online Dynamic Window (ODW) Assisted Two-stage LSTM Frameworks for Indoor Localization. (arXiv:2109.00126v2 [cs.LG] UPDATED)
    (0 min) Internet of Things (IoT)-based indoor localization has gained significant popularity recently to satisfy the ever-increasing requirements of indoor Location-based Services (LBS). In this context, Inertial Measurement Unit (IMU)-based localization is of interest as it provides a scalable solution independent of any proprietary sensors/modules. Existing IMU-based methodologies, however, are mainly developed based on statistical heading and step length estimation techniques that suffer from cumulative error issues and have extensive computational time requirements limiting their application for real-time indoor positioning. To address the aforementioned issues, we propose the Online Dynamic Window (ODW)-assisted two-stage Long Short Term Memory (LSTM) localization framework. Three ODWs are proposed, where the first model uses a Natural Language Processing (NLP)-inspired Dynamic Window (DW) approach, which significantly reduces the required computational time. The second framework is developed based on a Signal Processing Dynamic Windowing (SP-DW) approach to further reduce the required processing time of the two-stage LSTM-based model. The third ODW, referred to as the SP-NLP, combines the first two windowing mechanisms to further improve the overall achieved accuracy. Compared to the traditional LSTM-based positioning approaches, which suffer from either high tensor computation requirements or low accuracy, the proposed ODW-assisted models can perform indoor localization in a near-real time fashion with high accuracy. Performances of the proposed ODW-assisted models are evaluated based on a real Pedestrian Dead Reckoning (PDR) dataset. The results illustrate potentials of the proposed ODW-assisted techniques in achieving high classification accuracy with significantly reduced computational time, making them applicable for near real-time implementations.
    SAINT-ACC: Safety-Aware Intelligent Adaptive Cruise Control for Autonomous Vehicles Using Deep Reinforcement Learning. (arXiv:2104.06506v2 [cs.RO] UPDATED)
    (0 min) We present a novel adaptive cruise control (ACC) system namely SAINT-ACC: {S}afety-{A}ware {Int}elligent {ACC} system (SAINT-ACC) that is designed to achieve simultaneous optimization of traffic efficiency, driving safety, and driving comfort through dynamic adaptation of the inter-vehicle gap based on deep reinforcement learning (RL). A novel dual RL agent-based approach is developed to seek and adapt the optimal balance between traffic efficiency and driving safety/comfort by effectively controlling the driving safety model parameters and inter-vehicle gap based on macroscopic and microscopic traffic information collected from dynamically changing and complex traffic environments. Results obtained through over 12,000 simulation runs with varying traffic scenarios and penetration rates demonstrate that SAINT-ACC significantly enhances traffic flow, driving safety and comfort compared with a state-of-the-art approach.
    Random Walk-steered Majority Undersampling. (arXiv:2109.12423v1 [cs.LG])
    (2 min) In this work, we propose Random Walk-steered Majority Undersampling (RWMaU), which undersamples the majority points of a class-imbalanced dataset in order to balance the classes. Rather than marking the majority points which belong to the neighborhood of a few minority points, we are interested in perceiving the closeness of the majority points to the minority class. Random walk, a powerful tool for perceiving the proximities of connected points in a graph, is used to identify the majority points which lie close to the minority class of a class-imbalanced dataset. The visit frequencies and the order of visits of the majority points in the walks enable us to perceive an overall closeness of the majority points to the minority class. The ones lying close to the minority class are subsequently undersampled. Empirical evaluation on 21 datasets and 3 classifiers demonstrates substantial improvement in the performance of RWMaU over competing methods.
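    A hedged sketch of the random-walk scoring idea; the graph construction, walk lengths and removal rule below are illustrative assumptions, not the RWMaU reference code:

```python
# Majority points visited often in random walks started from minority points are
# treated as close to the minority class and removed first.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rw_undersample(X, y, minority_label, k=5, n_walks=10, n_steps=20, seed=0):
    rng = np.random.default_rng(seed)
    # k-nearest-neighbour graph over all points (column 0 is the point itself).
    _, neigh = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    visits = np.zeros(len(X))
    for start in minority_idx:                      # walks start from minority points
        for _ in range(n_walks):
            node = start
            for _ in range(n_steps):
                node = rng.choice(neigh[node, 1:])  # hop to a random neighbour
                visits[node] += 1
    # Drop the majority points with the highest visit counts until the classes balance.
    n_remove = max(len(majority_idx) - len(minority_idx), 0)
    most_visited = majority_idx[np.argsort(-visits[majority_idx])][:n_remove]
    keep = np.setdiff1d(np.arange(len(X)), most_visited)
    return X[keep], y[keep]
```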
    Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning. (arXiv:2011.00517v3 [cs.LG] UPDATED)
    (0 min) Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories. We introduce a dataset of such demonstrations in a crafting-based grid world. Our model consists of a high-level language generator and low-level policy, conditioned on language. We find that human demonstrations help solve the most complex tasks. We also find that incorporating natural language allows the model to generalize to unseen tasks in a zero-shot setting and to learn quickly from a few demonstrations. Generalization is not only reflected in the actions of the agent, but also in the generated natural language instructions in unseen tasks. Our approach also gives our trained agent interpretable behaviors because it is able to generate a sequence of high-level descriptions of its actions.
    Coreference Resolution for the Biomedical Domain: A Survey. (arXiv:2109.12424v1 [cs.CL])
    (0 min) Issues with coreference resolution are among the most frequently mentioned challenges for information extraction from the biomedical literature. Thus, the biomedical genre has long been the second most researched genre for coreference resolution after the news domain, and the subject of a great deal of research for NLP in general. In recent years this interest has grown enormously, leading to the development of a number of substantial datasets, of domain-specific contextual language models, and of several architectures. In this paper we review the state of the art of coreference resolution in the biomedical domain, with particular attention to these most recent developments.
    Prioritized Experience-based Reinforcement Learning with Human Guidance: Methodology and Application to Autonomous Driving. (arXiv:2109.12516v1 [cs.LG])
    (0 min) Reinforcement learning requires careful problem definition and considerable computational effort to solve optimization and control problems, which could impair its prospects. Introducing human guidance into reinforcement learning is a promising way to improve learning performance. In this paper, a comprehensive human guidance-based reinforcement learning framework is established. A novel prioritized experience replay mechanism that adapts to human guidance in the reinforcement learning process is proposed to boost the efficiency and performance of the reinforcement learning algorithm. To relieve the heavy workload on human participants, a behavior model is established based on an incremental online learning method to mimic human actions. We design two challenging autonomous driving tasks for evaluating the proposed algorithm. Experiments are conducted to assess the training and testing performance and learning mechanism of the proposed algorithm. Comparative results against the state of the art suggest the advantages of our algorithm in terms of learning efficiency, performance, and robustness.
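    A simplified sketch of a replay buffer that gives higher sampling priority to human-guided transitions, in the spirit of the mechanism described above; the priority formula and constants are assumptions, not the paper's design:

```python
# Human-guided transitions get a multiplicative priority bonus on top of the
# usual TD-error-based priority, so they are replayed more often.
import numpy as np

class GuidedPrioritizedBuffer:
    def __init__(self, capacity, human_bonus=2.0, alpha=0.6, seed=0):
        self.capacity, self.human_bonus, self.alpha = capacity, human_bonus, alpha
        self.rng = np.random.default_rng(seed)
        self.storage, self.priorities = [], []

    def add(self, transition, td_error, from_human):
        prio = (abs(td_error) + 1e-6) ** self.alpha
        if from_human:
            prio *= self.human_bonus              # boost human-guided experience
        if len(self.storage) >= self.capacity:    # drop the oldest entry when full
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(prio)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = self.rng.choice(len(self.storage), size=batch_size, p=p)
        return [self.storage[i] for i in idx], idx
```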
    Profiling Neural Blocks and Design Spaces for Mobile Neural Architecture Search. (arXiv:2109.12426v1 [cs.LG])
    (0 min) Neural architecture search automates neural network design and has achieved state-of-the-art results in many deep learning applications. While recent literature has focused on designing networks to maximize accuracy, little work has been conducted to understand the compatibility of architecture design spaces to varying hardware. In this paper, we analyze the neural blocks used to build Once-for-All (MobileNetV3), ProxylessNAS and ResNet families, in order to understand their predictive power and inference latency on various devices, including Huawei Kirin 9000 NPU, RTX 2080 Ti, AMD Threadripper 2990WX, and Samsung Note10. We introduce a methodology to quantify the friendliness of neural blocks to hardware and the impact of their placement in a macro network on overall network performance via only end-to-end measurements. Based on extensive profiling results, we derive design insights and apply them to hardware-specific search space reduction. We show that searching in the reduced search space generates better accuracy-latency Pareto frontiers than searching in the original search spaces, customizing architecture search according to the hardware. Moreover, insights derived from measurements lead to notably higher ImageNet top-1 scores on all search spaces investigated.
    Improving Question Answering Performance Using Knowledge Distillation and Active Learning. (arXiv:2109.12662v1 [cs.CL])
    (0 min) Contemporary question answering (QA) systems, including transformer-based architectures, suffer from increasing computational and model complexity which render them inefficient for real-world applications with limited resources. Further, training or even fine-tuning such models requires a vast amount of labeled data which is often not available for the task at hand. In this manuscript, we conduct a comprehensive analysis of the mentioned challenges and introduce suitable countermeasures. We propose a novel knowledge distillation (KD) approach to reduce the parameter and model complexity of a pre-trained BERT system and utilize multiple active learning (AL) strategies for immense reduction in annotation efforts. In particular, we demonstrate that our model achieves the performance of a 6-layer TinyBERT and DistilBERT, whilst using only 2% of their total parameters. Finally, by the integration of our AL approaches into the BERT framework, we show that state-of-the-art results on the SQuAD dataset can be achieved when we only use 20% of the training data.
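    For context, a standard temperature-scaled knowledge-distillation objective looks roughly like the following; the temperature, mixing weight and tensor names are illustrative, not the paper's exact setup:

```python
# Soft KL term to the teacher's temperature-scaled distribution plus a hard
# cross-entropy term to the gold labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale so gradient magnitudes stay comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```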
    AbstractDifferentiation.jl: Backend-Agnostic Differentiable Programming in Julia. (arXiv:2109.12449v1 [cs.MS])
    (0 min) No single Automatic Differentiation (AD) system is the optimal choice for all problems. This means that informed selection of an AD system, or a combination of systems, can be a problem-specific variable that greatly impacts performance. In the Julia programming language, the major AD systems target the same input and thus in theory can compose. Hitherto, switching between AD packages in the Julia Language required end-users to familiarize themselves with the user-facing API of the respective packages. Furthermore, implementing a new, usable AD package required AD package developers to write boilerplate code to define convenience API functions for end-users. As a response to these issues, we present AbstractDifferentiation.jl for the automatic generation of an extensive, unified, user-facing API for any AD package. By splitting the complexity between AD users and AD developers, AD package developers only need to implement one or two primitive definitions to support various utilities for AD users like Jacobians, Hessians and lazy product operators from native primitives such as pullbacks or pushforwards, thus removing tedious -- but so far inevitable -- boilerplate code, and enabling easy switching and composing between AD implementations for end-users.
    Just Train Twice: Improving Group Robustness without Training Group Information. (arXiv:2107.09044v2 [cs.LG] UPDATED)
    (0 min) Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO) require expensive group annotations for each training point, whereas approaches that do not use such group annotations typically achieve unsatisfactory worst-group accuracy. In this paper, we propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified. Intuitively, this upweights examples from groups on which standard ERM models perform poorly, leading to improved worst-group performance. Averaged over four image classification and natural language processing tasks with spurious correlations, JTT closes 75% of the gap in worst-group accuracy between standard ERM and group DRO, while only requiring group annotations on a small validation set in order to tune hyperparameters.
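    A compact illustration of the two-stage recipe with a simple off-the-shelf classifier; the identification model and the upweighting constant are placeholders rather than the paper's configuration:

```python
# Stage 1: a brief ERM run identifies the training examples it misclassifies.
# Stage 2: retrain from scratch with those examples upweighted, so groups on
# which plain ERM struggles contribute more to the final objective.
import numpy as np
from sklearn.linear_model import LogisticRegression

def jtt_fit(X, y, upweight=20.0):
    id_model = LogisticRegression(max_iter=50).fit(X, y)
    errors = id_model.predict(X) != y
    weights = np.where(errors, upweight, 1.0)
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
```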
    Deep Learning and Traffic Classification: Lessons learned from a commercial-grade dataset with hundreds of encrypted and zero-day applications. (arXiv:2104.03182v2 [cs.LG] UPDATED)
    (0 min) The increasing success of Machine Learning (ML) and Deep Learning (DL) has recently re-sparked interest towards traffic classification. While classification of known traffic is a well-investigated subject, and supervised classification tools (such as ML and DL models) are known to provide satisfactory performance, detection of unknown (or zero-day) traffic is more challenging and typically handled by unsupervised techniques (such as clustering algorithms). In this paper, we share our experience on a commercial-grade DL traffic classification engine that is able to (i) identify known applications from encrypted traffic, as well as (ii) handle unknown zero-day applications. In particular, our contribution for (i) is to perform a thorough assessment of state-of-the-art traffic classifiers in commercial-grade settings comprising a few thousand very fine-grained application labels, as opposed to the few tens of classes generally targeted in academic evaluations. Additionally, we contribute to (ii), the detection of zero-day applications, by proposing a novel technique, tailored for DL models, that is significantly more accurate and lightweight than the state of the art. Summarizing our main findings, we gather that (i) ML and DL models are both able to provide satisfactory solutions for classification of known traffic, but (ii) the non-linear feature extraction process of the DL backbone provides sizeable advantages for the detection of unknown classes.
    NICE: Robust Scheduling through Reinforcement Learning-Guided Integer Programming. (arXiv:2109.12171v1 [cs.LG])
    (0 min) Integer programs provide a powerful abstraction for representing a wide range of real-world scheduling problems. Despite their ability to model general scheduling problems, solving large-scale integer programs (IP) remains a computational challenge in practice. The incorporation of more complex objectives such as robustness to disruptions further exacerbates the computational challenge. We present NICE (Neural network IP Coefficient Extraction), a novel technique that combines reinforcement learning and integer programming to tackle the problem of robust scheduling. More specifically, NICE uses reinforcement learning to approximately represent complex objectives in an integer programming formulation. We use NICE to determine assignments of pilots to a flight crew schedule so as to reduce the impact of disruptions. We compare NICE with (1) a baseline integer programming formulation that produces a feasible crew schedule, and (2) a robust integer programming formulation that explicitly tries to minimize the impact of disruptions. Our experiments show that, across a variety of scenarios, NICE produces schedules resulting in 33% to 48% fewer disruptions than the baseline formulation. Moreover, in more severely constrained scheduling scenarios in which the robust integer program fails to produce a schedule within 90 minutes, NICE is able to build robust schedules in less than 2 seconds on average.
    MAML is a Noisy Contrastive Learner. (arXiv:2106.15367v2 [cs.LG] UPDATED)
    (0 min) Model-agnostic meta-learning (MAML) is one of the most popular and widely adopted meta-learning algorithms, achieving remarkable success in various learning problems. Yet, with the unique design of nested inner-loop and outer-loop updates, which respectively govern task-specific and meta-model-centric learning, the underlying learning objective of MAML remains implicit, impeding a more straightforward understanding of it. In this paper, we provide a new perspective on the working mechanism of MAML and discover that MAML is analogous to a meta-learner using a supervised contrastive objective function, where the query features are pulled towards the support features of the same class and pushed away from those of different classes; this contrastiveness is experimentally verified via an analysis based on cosine similarity. Moreover, our analysis reveals that the vanilla MAML algorithm has an undesirable interference term originating from the random initialization and the cross-task interaction. We therefore propose a simple but effective technique, the zeroing trick, to alleviate this interference. Extensive experiments on both the miniImagenet and Omniglot datasets demonstrate the consistent improvement brought by the proposed technique, validating its effectiveness.
    Equality of opportunity in travel behavior prediction with deep neural networks and discrete choice models. (arXiv:2109.12422v1 [stat.ML])
    (0 min) Although researchers increasingly adopt machine learning to model travel behavior, they predominantly focus on prediction accuracy, ignoring the ethical challenges embedded in machine learning algorithms. This study introduces an important missing dimension - computational fairness - to travel behavior analysis. We first operationalize computational fairness by equality of opportunity, then differentiate between the bias inherent in data and the bias introduced by modeling. We then demonstrate the prediction disparities in travel behavior modeling using the 2017 National Household Travel Survey (NHTS) and the 2018-2019 My Daily Travel Survey in Chicago. Empirically, deep neural network (DNN) and discrete choice models (DCM) reveal consistent prediction disparities across multiple social groups: both over-predict the false negative rate of frequent driving for the ethnic minorities, the low-income and the disabled populations, and falsely predict a higher travel burden of the socially disadvantaged groups and the rural populations than reality. Comparing DNN with DCM, we find that DNN can outperform DCM in prediction disparities because of DNN's smaller misspecification error. To mitigate prediction disparities, this study introduces an absolute correlation regularization method, which is evaluated with synthetic and real-world data. The results demonstrate the prevalence of prediction disparities in travel behavior modeling, and the disparities still persist regarding a variety of model specifics such as the number of DNN layers, batch size and weight initialization. Since these prediction disparities can exacerbate social inequity if prediction results without fairness adjustment are used for transportation policy making, we advocate for careful consideration of the fairness problem in travel behavior modeling, and the use of bias mitigation algorithms for fair transport decisions.
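    A hedged sketch of an absolute correlation regularizer of the kind described above; the error signal, group encoding and weighting are illustrative assumptions rather than the authors' formulation:

```python
# The penalty is the absolute Pearson correlation between a per-sample error
# signal and a protected-group indicator, added to the task loss.
import torch

def abs_corr_penalty(errors, group_indicator, eps=1e-8):
    """errors: (N,) per-sample error signal; group_indicator: (N,) 0/1 membership."""
    e = errors - errors.mean()
    g = group_indicator.float() - group_indicator.float().mean()
    corr = (e * g).sum() / (e.norm() * g.norm() + eps)
    return corr.abs()

# total_loss = task_loss + fairness_weight * abs_corr_penalty(per_sample_error, group)
```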
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v1 [cs.LG])
    (0 min) The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition, originally proposed for sampled representations, to mean representations and show that active variables are equally disentangled in both representations. Based on this new definition and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.
    Synthetic Data Generation for Fraud Detection using GANs. (arXiv:2109.12546v1 [cs.LG])
    (0 min) Detecting money laundering in gambling is becoming increasingly challenging for the gambling industry as consumers migrate to online channels. Although increasingly stringent regulations have been applied over the years to prevent money laundering in gambling, online gambling is still a channel for criminals to spend proceeds from crime. Alongside online gambling's growth, more concerns are being raised about its effects compared with gambling in traditional, physical formats, as its immediate, interactive nature might introduce higher levels of problem gambling or fraudulent behaviour. However, in most cases the main issue when organisations try to tackle those areas is the absence of high-quality data. Since fraud detection faces the significant problem of class imbalance, in this paper we propose a novel system based on Generative Adversarial Networks (GANs) for generating synthetic data in order to train a supervised classifier. Our framework, Synthetic Data Generation GAN (SDG-GAN), outperforms density-based over-sampling methods and improves the classification performance on benchmark datasets and a real-world gambling fraud dataset.
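    A minimal PyTorch sketch of using a GAN to synthesize minority-class rows before training a supervised classifier; layer sizes and the training schedule are illustrative assumptions, not the SDG-GAN configuration:

```python
import torch
import torch.nn as nn

def train_minority_gan(X_min, noise_dim=16, epochs=200, lr=1e-3):
    """X_min: (N, d) float tensor of real minority-class feature rows."""
    d = X_min.shape[1]
    G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, d))
    D = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    real, fake_lbl = torch.ones(len(X_min), 1), torch.zeros(len(X_min), 1)
    for _ in range(epochs):
        z = torch.randn(len(X_min), noise_dim)
        fake = G(z)
        # Discriminator step: real minority rows vs. generated rows.
        d_loss = bce(D(X_min), real) + bce(D(fake.detach()), fake_lbl)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: try to fool the discriminator.
        g_loss = bce(D(G(z)), real)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return G  # sample G(torch.randn(m, noise_dim)) for synthetic minority rows
```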
    Communication-Efficient Distributed Linear and Deep Generalized Canonical Correlation Analysis. (arXiv:2109.12400v1 [cs.DC])
    (0 min) Classic and deep learning-based generalized canonical correlation analysis (GCCA) algorithms seek low-dimensional common representations of data entities from multiple "views" (e.g., audio and image) using linear transformations and neural networks, respectively. When the views are acquired and stored at different locations, organizations and edge devices, computing GCCA in a distributed, parallel and efficient manner is well-motivated. However, existing distributed GCCA algorithms may incur prohibitively high communication overhead. This work puts forth a communication-efficient distributed framework for both linear and deep GCCA under the maximum variance (MAX-VAR) paradigm. The overhead issue is addressed by aggressively compressing (via quantization) the information exchanged between the distributed computing agents and a central controller. Compared to the unquantized version, the proposed algorithm consistently reduces the communication overhead by about 90% with virtually no loss in accuracy and convergence speed. Rigorous convergence analyses are also presented -- a nontrivial effort, since no existing generic result from quantized distributed optimization covers the special problem structure of GCCA. Our result shows that the proposed algorithms for both linear and deep GCCA converge to critical points at a sublinear rate, even under heavy quantization and stochastic approximations. In addition, it is shown that in the linear MAX-VAR case, the quantized algorithm approaches a global optimum at a geometric rate if the computing agents' updates meet a certain accuracy level. Synthetic and real data experiments are used to showcase the effectiveness of the proposed approach.
    Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment. (arXiv:1908.02388v3 [cs.LG] UPDATED)
    (0 min) This paper provides an empirical evaluation of recently developed exploration algorithms within the Arcade Learning Environment (ALE). We study the use of different reward bonuses that incentivize exploration in reinforcement learning. We do so by fixing the learning algorithm used and focusing only on the impact of the different exploration bonuses on the agent's performance. We use Rainbow, the state-of-the-art algorithm for value-based agents, and focus on some of the bonuses proposed in the last few years. We consider the impact these algorithms have on performance within the popular game Montezuma's Revenge, which has gathered a lot of interest from the exploration community, across the set of seven games identified by Bellemare et al. (2016) as challenging for exploration, and on easier games where exploration is not an issue. We find that, in our setting, recently developed bonuses do not provide significantly improved performance on Montezuma's Revenge or hard exploration games. We also find that existing bonus-based methods may negatively impact performance on games in which exploration is not an issue and may even perform worse than $\epsilon$-greedy exploration.
    Constructing Sub-scale Surrogate Model for Proppant Settling in Inclined Fractures from Simulation Data with Multi-fidelity Neural Network. (arXiv:2109.12311v1 [physics.flu-dyn])
    (0 min) Particle settling in inclined channels is an important phenomenon that occurs during hydraulic fracturing of shale gas production. Generally, in order to accurately simulate the large-scale (field-scale) proppant transport process, constructing a fast and accurate sub-scale proppant settling model, or surrogate model, becomes a critical issue, since mapping between physical parameters and proppant settling velocity is complex. Previously, particle settling has usually been investigated via high-fidelity experiments and meso-scale numerical simulations, both of which are time-consuming. In this work, a new method is proposed and utilized, i.e., the multi-fidelity neural network (MFNN), to construct a settling surrogate model, which could utilize both high-fidelity and low-fidelity (thus, less expensive) data. The results demonstrate that constructing the settling surrogate with the MFNN can reduce the need for high-fidelity data and thus computational cost by 80%, while the accuracy lost is less than 5% compared to a high-fidelity surrogate. Moreover, the investigated particle settling surrogate is applied in macro-scale proppant transport simulation, which shows that the settling model is significant to proppant transport and yields accurate results. This opens novel pathways for rapidly predicting proppant settling velocity in reservoir applications.
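    A hedged sketch of a multi-fidelity network in the spirit described above: a low-fidelity subnetwork is trained on cheap data, and a correction subnetwork learns a linear-plus-nonlinear mapping to the scarce high-fidelity data; layer sizes and the composition are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MultiFidelityNet(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.low = nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        # The high-fidelity correction sees the inputs plus the low-fidelity prediction.
        self.high_linear = nn.Linear(in_dim + 1, 1)
        self.high_nonlinear = nn.Sequential(
            nn.Linear(in_dim + 1, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, x):
        y_lo = self.low(x)
        h = torch.cat([x, y_lo], dim=1)
        y_hi = self.high_linear(h) + self.high_nonlinear(h)
        # Train y_lo against abundant low-fidelity data and y_hi against the
        # scarce high-fidelity data (e.g., with two MSE terms).
        return y_lo, y_hi
```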
    Structure-aware scale-adaptive networks for cancer segmentation in whole-slide images. (arXiv:2109.12617v1 [eess.IV])
    (0 min) Cancer segmentation in whole-slide images is a fundamental step for viable tumour burden estimation, which is of great value for cancer assessment. However, factors like vague boundaries or small regions dissociated from viable tumour areas make it a challenging task. Considering the usefulness of multi-scale features in various vision-related tasks, we present a structure-aware scale-adaptive feature selection method for efficient and accurate cancer segmentation. Based on a segmentation network with a popular encoder-decoder architecture, a scale-adaptive module is proposed for selecting more robust features to represent the vague, non-rigid boundaries. Furthermore, a structural similarity metric is proposed for better tissue structure awareness to deal with small region segmentation. In addition, advanced designs including several attention mechanisms and the selective-kernel convolutions are applied to the baseline network for comparative study purposes. Extensive experimental results show that the proposed structure-aware scale-adaptive networks achieve outstanding performance on liver cancer segmentation when compared to top ten submitted results in the challenge of PAIP 2019. Further evaluation on colorectal cancer segmentation shows that the scale-adaptive module improves the baseline network or outperforms the other excellent designs of attention mechanisms when considering the tradeoff between efficiency and accuracy.
    Multi-Transformer: A New Neural Network-Based Architecture for Forecasting S&P Volatility. (arXiv:2109.12621v1 [q-fin.CP])
    (0 min) Events such as the Financial Crisis of 2007-2008 or the COVID-19 pandemic caused significant losses to banks and insurance entities. They also demonstrated the importance of using accurate equity risk models and having a risk management function able to implement effective hedging strategies. Stock volatility forecasts play a key role in the estimation of equity risk and, thus, in the management actions carried out by financial institutions. Therefore, this paper has the aim of proposing more accurate stock volatility models based on novel machine and deep learning techniques. This paper introduces a neural network-based architecture, called Multi-Transformer. Multi-Transformer is a variant of Transformer models, which have already been successfully applied in the field of natural language processing. Indeed, this paper also adapts traditional Transformer layers in order to be used in volatility forecasting models. The empirical results obtained in this paper suggest that the hybrid models based on Multi-Transformer and Transformer layers are more accurate and, hence, they lead to more appropriate risk measures than other autoregressive algorithms or hybrid models based on feed forward layers or long short term memory cells.
    DziriBERT: a Pre-trained Language Model for the Algerian Dialect. (arXiv:2109.12346v1 [cs.CL])
    (0 min) Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, Arabic, etc.). Therefore, there are still a number of low-resource languages that need more attention from the community. In this paper, we study the Algerian dialect, which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets and pre-trained the first Algerian language model: DziriBERT. When compared to existing models, DziriBERT achieves the best results on two Algerian downstream datasets. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our model is publicly available to the community.
    Distributed Online Optimization with Byzantine Adversarial Agents. (arXiv:2109.12340v1 [math.OC])
    (0 min) We study the problem of unconstrained, discrete-time, online distributed optimization in a multi-agent system where some of the agents do not follow the prescribed update rule, either due to failures or malicious intentions. None of the agents have prior information about the identities of the faulty agents, and any agent can communicate only with its immediate neighbours. At each time step, a Lipschitz strongly convex cost function is revealed locally to all the agents, and the non-faulty agents update their states using their local information and the information obtained from their neighbours. We measure the performance of the online algorithm by comparing it to its offline version, in which the cost functions are known a priori. This difference is termed regret. Under sufficient conditions on the graph topology and the number and location of the adversaries, the defined regret grows sublinearly. We further conduct numerical experiments to validate our theoretical results.
    MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling. (arXiv:2109.12178v1 [cs.CV])
    (0 min) Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked Image-Region Modeling (MIRM). Both alignment and MIRM objectives mostly do not have ground truth. Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models. In this paper, we present Masked Language and Image Modeling (MLIM) for VLP. MLIM uses two loss functions: Masked Language Modeling (MLM) loss and image reconstruction (RECON) loss. We propose Modality Aware Masking (MAM) to boost cross-modality interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality. Using MLM + RECON tasks coupled with MAM, we present a simplified VLP methodology and show that it has better downstream task performance on a proprietary e-commerce multi-modal dataset.
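    A rough sketch of the joint MLM + RECON objective described above; the plain sum and tensor shapes are illustrative simplifications, and the paper's masking policy (MAM) and architecture are not reproduced here:

```python
# Masked-language-modeling loss on the masked text positions plus a pixel
# reconstruction loss on the masked image content.
import torch
import torch.nn.functional as F

def mlim_loss(text_logits, text_targets, text_mask, image_recon, image_target, image_mask):
    """text_logits : (B, L, V) token predictions, text_targets: (B, L) original ids,
       text_mask   : (B, L) 1 where a text token was masked,
       image_recon/image_target: (B, C, H, W), image_mask: (B, 1, H, W)."""
    mlm = F.cross_entropy(text_logits[text_mask.bool()], text_targets[text_mask.bool()])
    recon = F.mse_loss(image_recon * image_mask, image_target * image_mask)
    return mlm + recon
```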
    Soundata: A Python library for reproducible use of audio datasets. (arXiv:2109.12690v1 [cs.SD])
    (0 min) Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, validate that the dataset is complete and correct, and more. Soundata is based on and inspired by mirdata, and is designed to complement mirdata by working with environmental sound, bioacoustic and speech datasets, among others. Soundata was created to be easy to use, easy to contribute to, and to increase reproducibility and standardize usage of sound datasets in a flexible way.
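    A hedged usage sketch, assuming the loader interface mirrors mirdata's (initialize / download / validate); the dataset name is only an example, and the exact calls should be checked against the package documentation:

```python
import soundata

dataset = soundata.initialize("urbansound8k")  # dataset name used here as an example
dataset.download()                             # fetch audio and annotations
dataset.validate()                             # check files against the canonical index
clip_ids = dataset.clip_ids                    # standardized access to the dataset's clips
```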
    A Principled Approach to Failure Analysis and Model Repairment: Demonstration in Medical Imaging. (arXiv:2109.12347v1 [cs.LG])
    (0 min) Machine learning models commonly exhibit unexpected failures post-deployment due to either data shifts or uncommon situations in the training environment. Domain experts typically go through the tedious process of inspecting the failure cases manually, identifying failure modes and then attempting to fix the model. In this work, we aim to standardise and bring principles to this process through answering two critical questions: (i) how do we know that we have identified meaningful and distinct failure types?; (ii) how can we validate that a model has, indeed, been repaired? We suggest that the quality of the identified failure types can be validated through measuring the intra- and inter-type generalisation after fine-tuning and introduce metrics to compare different subtyping methods. Furthermore, we argue that a model can be considered repaired if it achieves high accuracy on the failure types while retaining performance on the previously correct data. We combine these two ideas into a principled framework for evaluating the quality of both the identified failure subtypes and model repairment. We evaluate its utility on a classification task and an object detection task. Our code is available at https://github.com/Rokken-lab6/Failure-Analysis-and-Model-Repairment
    NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU. (arXiv:2109.12191v1 [cs.LG])
    (0 min) Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm places computational overheads that can undo the benefit of batching in GPUs. Microbatching is a standard method to alleviate this and is fully supported in the TensorFlow Privacy library (TFDP). However, this technique, while improving training times also reduces the quality of the gradients and degrades the classification accuracy. Recent works that for example use the JAX framework show promise in also alleviating this but still show degradation in throughput from non-private to private SGD on CNNs, and have not yet shown ImageNet implementations. In our work, we argue that low batch sizes using group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.
    EllipseNet: Anchor-Free Ellipse Detection for Automatic Cardiac Biometrics in Fetal Echocardiography. (arXiv:2109.12474v1 [eess.IV])
    (0 min) As an important scan plane, the four-chamber view is routinely examined in both second-trimester perinatal screening and fetal echocardiographic examinations. The biometrics in this plane, including the cardio-thoracic ratio (CTR) and cardiac axis, are usually measured by sonographers for diagnosing congenital heart disease. However, due to common artifacts like acoustic shadowing, traditional manual measurements not only suffer from low efficiency but also yield inconsistent results depending on the operators' skills. In this paper, we present an anchor-free ellipse detection network, namely EllipseNet, which detects the cardiac and thoracic regions as ellipses and automatically calculates the CTR and cardiac axis for fetal cardiac biometrics in the 4-chamber view. In particular, we formulate the network to detect the center of each object as a point and regress the ellipses' parameters simultaneously. We define an intersection-over-union loss to further regulate the regression procedure. We evaluate EllipseNet on a clinical echocardiogram dataset with more than 2000 subjects. Experimental results show that the proposed framework outperforms several state-of-the-art methods. Source code will be available at https://git.openi.org.cn/capepoint/EllipseNet .
    Topic Model Robustness to Automatic Speech Recognition Errors in Podcast Transcripts. (arXiv:2109.12306v1 [cs.IR])
    (0 min) For a multilingual podcast streaming service, it is critical to be able to deliver relevant content to all users independent of language. Podcast content relevance is conventionally determined using various metadata sources. However, with the increasing quality of speech recognition in many languages, utilizing automatic transcriptions to provide better content recommendations becomes possible. In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine. Specifically, we explore how increasing transcription noise influences topics obtained from transcriptions in Danish, a low-resource language. First, we observe a baseline of cosine similarity scores between topic embeddings from automatic transcriptions and the descriptions of the podcasts written by the podcast creators. We then observe how the cosine similarities decrease as transcription noise increases and conclude that even when automatic speech recognition transcripts are erroneous, it is still possible to obtain high-quality topic embeddings from the transcriptions.
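    An illustrative, simplified pipeline (not the authors' code) for measuring how topic vectors drift as transcription noise grows, using LDA and cosine similarity:

```python
# Fit LDA on clean transcripts, re-infer topics on a noised version of the same
# documents, and track the per-document cosine similarity between the two
# topic vectors as a robustness signal.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def topic_similarity(clean_docs, noisy_docs, n_topics=10):
    vec = CountVectorizer(max_features=5000)
    X_clean = vec.fit_transform(clean_docs)
    X_noisy = vec.transform(noisy_docs)            # same vocabulary for both versions
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X_clean)
    T_clean = lda.transform(X_clean)               # document-topic distributions
    T_noisy = lda.transform(X_noisy)
    return float(np.diag(cosine_similarity(T_clean, T_noisy)).mean())
```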
    Scalable deeper graph neural networks for high-performance materials property prediction. (arXiv:2109.12283v1 [cond-mat.mtrl-sci])
    (0 min) Machine learning (ML) based materials discovery has emerged as one of the most promising approaches for breakthroughs in materials science. While heuristic knowledge-based descriptors have been combined with ML algorithms to achieve good performance, the complexity of the physicochemical mechanisms makes it urgently necessary to exploit representation learning from either compositions or structures for building highly effective materials machine learning models. Among these methods, graph neural networks have shown the best performance due to their capability to learn high-level features from crystal structures. However, all these models suffer from an inability to scale up due to the over-smoothing issue of their message-passing GNN architecture. Here we propose a novel graph attention neural network model, DeeperGATGNN, with differentiable group normalization and skip connections, which allows training very deep graph neural network models (e.g., 30 layers compared to 3-9 layers in previous works). Through systematic benchmark studies over six benchmark datasets for energy and band gap predictions, we show that our scalable DeeperGATGNN model needs little costly hyper-parameter tuning for different datasets and achieves state-of-the-art prediction performance on five properties out of six, with up to 10% improvement. Our work shows that to deal with the high complexity of mapping crystal material structures to their properties, large-scale very deep graph neural networks are needed to achieve robust performance.
    Short-Term Load Forecasting Using Time Pooling Deep Recurrent Neural Network. (arXiv:2109.12498v1 [cs.LG])
    (0 min) Integration of renewable energy sources and emerging loads like electric vehicles to smart grids brings more uncertainty to the distribution system management. Demand Side Management (DSM) is one of the approaches to reduce the uncertainty. Some applications like Nonintrusive Load Monitoring (NILM) can support DSM, however they require accurate forecasting on high resolution data. This is challenging when it comes to single loads like one residential household due to its high volatility. In this paper, we review some of the existing Deep Learning-based methods and present our solution using Time Pooling Deep Recurrent Neural Network. The proposed method augments data using time pooling strategy and can overcome overfitting problems and model uncertainties of data more efficiently. Simulation and implementation results show that our method outperforms the existing algorithms in terms of RMSE and MAE metrics.
    Dynamic Sequential Graph Learning for Click-Through Rate Prediction. (arXiv:2109.12541v1 [cs.IR])
    (0 min) Click-through rate prediction plays an important role in the field of recommender system and many other applications. Existing methods mainly extract user interests from user historical behaviors. However, behavioral sequences only contain users' directly interacted items, which are limited by the system's exposure, thus they are often not rich enough to reflect all the potential interests. In this paper, we propose a novel method, named Dynamic Sequential Graph Learning (DSGL), to enhance users or items' representations by utilizing collaborative information from the local sub-graphs associated with users or items. Specifically, we design the Dynamic Sequential Graph (DSG), i.e., a lightweight ego subgraph with timestamps induced from historical interactions. At every scoring moment, we construct DSGs for the target user and the candidate item respectively. Based on the DSGs, we perform graph convolutional operations iteratively in a bottom-up manner to obtain the final representations of the target user and the candidate item. As for the graph convolution, we design a Time-aware Sequential Encoding Layer that leverages the interaction time information as well as temporal dependencies to learn evolutionary user and item dynamics. Besides, we propose a Target-Preference Dual Attention Layer, composed of a preference-aware attention module and a target-aware attention module, to automatically search for parts of behaviors that are relevant to the target and alleviate the noise from unreliable neighbors. Results on real-world CTR prediction benchmarks demonstrate the improvements brought by DSGL.
    POSSE: Patterns of Systems During Software Encryption. (arXiv:2109.12162v1 [cs.CR])
    (0 min) This research recasts ransomware detection using performance monitoring and statistical machine learning. The work builds a test environment with 41 input variables to label and compares three computing states: idle, encryption and compression. A common goal of this behavioral detector seeks to anticipate and short-circuit the final step of hard-drive locking with encryption and the demand for payment to return the file system to its baseline. Comparing machine learning techniques, linear regression outperforms random forest, decision trees, and support vector machines (SVM). All algorithms classified the 3 possible classes (idle, encryption, and compression) with greater than 91% accuracy.
    Dynamic Adaptive Spatio-temporal Graph Convolution for fMRI Modelling. (arXiv:2109.12517v1 [cs.LG])
    (0 min) The characterisation of the brain as a functional network in which the connections between brain regions are represented by correlation values across time series has been very popular in the last years. Although this representation has advanced our understanding of brain function, it represents a simplified model of brain connectivity that has a complex dynamic spatio-temporal nature. Oversimplification of the data may hinder the merits of applying advanced non-linear feature extraction algorithms. To this end, we propose a dynamic adaptive spatio-temporal graph convolution (DAST-GCN) model to overcome the shortcomings of pre-defined static correlation-based graph structures. The proposed approach allows end-to-end inference of dynamic connections between brain regions via layer-wise graph structure learning module while mapping brain connectivity to a phenotype in a supervised learning framework. This leverages the computational power of the model, data and targets to represent brain connectivity, and could enable the identification of potential biomarkers for the supervised target in question. We evaluate our pipeline on the UKBiobank dataset for age and gender classification tasks from resting-state functional scans and show that it outperforms currently adapted linear and non-linear methods in neuroimaging. Further, we assess the generalizability of the inferred graph structure by transferring the pre-trained graph to an independent dataset for the same task. Our results demonstrate the task-robustness of the graph against different scanning parameters and demographics.
    TEMGNet: Deep Transformer-based Decoding of Upperlimb sEMG for Hand Gestures Recognition. (arXiv:2109.12379v1 [cs.LG])
    (0 min) There has been a surge of recent interest in Machine Learning (ML), particularly Deep Neural Network (DNN)-based models, to decode muscle activities from surface Electromyography (sEMG) signals for myoelectric control of neurorobotic systems. DNN-based models, however, require large training sets and, typically, have high structural complexity, i.e., they depend on a large number of trainable parameters. To address these issues, we developed a framework based on the Transformer architecture for processing sEMG signals. We propose a novel Vision Transformer (ViT)-based neural network architecture (referred to as the TEMGNet) to classify and recognize upperlimb hand gestures from sEMG to be used for myocontrol of prostheses. The proposed TEMGNet architecture is trained with a small dataset without the need for pre-training or fine-tuning. To evaluate the efficacy, following the recent literature, the second subset (exercise B) of the NinaPro DB2 dataset was utilized, where the proposed TEMGNet framework achieved a recognition accuracy of 82.93% and 82.05% for window sizes of 300ms and 200ms, respectively, outperforming its state-of-the-art counterparts. Moreover, the proposed TEMGNet framework is superior in terms of structural capacity while having seven times fewer trainable parameters. These characteristics and the high performance make DNN-based models promising approaches for myoelectric control of neurorobots.
    A Method for Medical Data Analysis Using the LogNNet for Clinical Decision Support Systems and Edge Computing in Healthcare. (arXiv:2108.02428v2 [cs.LG] UPDATED)
    (0 min) Edge computing is a fast-growing and much needed technology in healthcare. The problem of implementing artificial intelligence on edge devices is the complexity and high resource intensity of the most known neural network data analysis methods and algorithms. The difficulty of implementing these methods on low-power microcontrollers with small memory size calls for the development of new effective algorithms for neural networks. This study presents a new method for analyzing medical data based on the LogNNet neural network, which uses chaotic mappings to transform input information. The method effectively solves classification problems and calculates risk factors for the presence of a disease in a patient according to a set of medical health indicators. The efficiency of LogNNet in assessing perinatal risk is illustrated on cardiotocogram data obtained from the UC Irvine machine learning repository. The classification accuracy reaches ~91% with the ~3-10 kB of RAM used on the Arduino microcontroller. Using the LogNNet network trained on a publicly available database of the Israeli Ministry of Health, a service concept for COVID-19 express testing is provided. A classification accuracy of ~95% is achieved, and ~0.6 kB of RAM is used. In all examples, the model is tested using standard classification quality metrics: precision, recall, and F1-measure. The LogNNet architecture allows the implementation of artificial intelligence on medical peripherals of the Internet of Things with low RAM resources and can be used in clinical decision support systems.
    Graph-Based Spatial-Temporal Convolutional Network for Vehicle Trajectory Prediction in Autonomous Driving. (arXiv:2109.12764v1 [cs.LG])
    (0 min) Forecasting the trajectories of neighbor vehicles is a crucial step for decision making and motion planning of autonomous vehicles. This paper proposes a graph-based spatial-temporal convolutional network (GSTCN) to predict future trajectory distributions of all neighbor vehicles using past trajectories. This network tackles the spatial interactions using a graph convolutional network (GCN), and captures the temporal features with a convolutional neural network (CNN). The spatial-temporal features are encoded and decoded by a gated recurrent unit (GRU) network to generate future trajectory distributions. Besides, we propose a weighted adjacency matrix to describe the intensities of mutual influence between vehicles, and the ablation study demonstrates the effectiveness of our proposed scheme. Our network is evaluated on two real-world freeway trajectory datasets: I-80 and US-101 in the Next Generation Simulation (NGSIM). Comparisons in three aspects, including prediction errors, model sizes, and inference speeds, show that our network can achieve state-of-the-art performance.
    MIIDL: a Python package for microbial biomarkers identification powered by interpretable deep learning. (arXiv:2109.12204v1 [q-bio.QM])
    (0 min) Detecting microbial biomarkers used to predict disease phenotypes and clinical outcomes is crucial for disease early-stage screening and diagnosis. Most methods for biomarker identification are linear-based, which is very limited as biological processes are rarely fully linear. The introduction of machine learning to this field tends to bring a promising solution. However, identifying microbial biomarkers in an interpretable, data-driven and robust manner remains challenging. We present MIIDL, a Python package for the identification of microbial biomarkers based on interpretable deep learning. MIIDL innovatively applies convolutional neural networks, a variety of interpretability algorithms and plenty of pre-processing methods to provide a one-stop and robust pipeline for microbial biomarkers identification from high-dimensional and sparse data sets.
    Generalized multiscale feature extraction for remaining useful life prediction of bearings with generative adversarial networks. (arXiv:2109.12513v1 [cs.LG])
    (0 min) Bearing is a key component in industrial machinery and its failure may lead to unwanted downtime and economic loss. Hence, it is necessary to predict the remaining useful life (RUL) of bearings. Conventional data-driven approaches of RUL prediction require expert domain knowledge for manual feature extraction and may suffer from data distribution discrepancy between training and test data. In this study, we propose a novel generalized multiscale feature extraction method with generative adversarial networks. The adversarial training learns the distribution of training data from different bearings and is introduced for health stage division and RUL prediction. To capture the sequence feature from a one-dimensional vibration signal, we adapt a U-Net architecture that reconstructs features to process them with multiscale layers in the generator of the adversarial network. To validate the proposed method, comprehensive experiments on two rotating machinery datasets have been conducted to predict the RUL. The experimental results show that the proposed feature extraction method can effectively predict the RUL and outperforms the conventional RUL prediction approaches based on deep neural networks. The implementation code is available at https://github.com/opensuh/GMFE.
    L$^{2}$NAS: Learning to Optimize Neural Architectures via Continuous-Action Reinforcement Learning. (arXiv:2109.12425v1 [cs.LG])
    (0 min) Neural architecture search (NAS) has achieved remarkable results in deep neural network design. Differentiable architecture search converts the search over discrete architectures into a hyperparameter optimization problem which can be solved by gradient descent. However, questions have been raised regarding the effectiveness and generalizability of gradient methods for solving non-convex architecture hyperparameter optimization problems. In this paper, we propose L$^{2}$NAS, which learns to intelligently optimize and update architecture hyperparameters via an actor neural network based on the distribution of high-performing architectures in the search history. We introduce a quantile-driven training procedure which efficiently trains L$^{2}$NAS in an actor-critic framework via continuous-action reinforcement learning. Experiments show that L$^{2}$NAS achieves state-of-the-art results on NAS-Bench-201 benchmark as well as DARTS search space and Once-for-All MobileNetV3 search space. We also show that search policies generated by L$^{2}$NAS are generalizable and transferable across different training datasets with minimal fine-tuning.
    Neural Augmentation of Kalman Filter with Hypernetwork for Channel Tracking. (arXiv:2109.12561v1 [eess.SP])
    (0 min) We propose Hypernetwork Kalman Filter (HKF) for tracking applications with multiple different dynamics. The HKF combines generalization power of Kalman filters with expressive power of neural networks. Instead of keeping a bank of Kalman filters and choosing one based on approximating the actual dynamics, HKF adapts itself to each dynamics based on the observed sequence. Through extensive experiments on CDL-B channel model, we show that the HKF can be used for tracking the channel over a wide range of Doppler values, matching Kalman filter performance with genie Doppler information. At high Doppler values, it achieves around 2dB gain over genie Kalman filter. The HKF generalizes well to unseen Doppler, SNR values and pilot patterns unlike LSTM, which suffers from severe performance degradation.
    Long-Range Feature Propagating for Natural Image Matting. (arXiv:2109.12252v1 [cs.CV])
    (0 min) Natural image matting estimates the alpha values of unknown regions in the trimap. Recently, deep learning based methods propagate the alpha values from the known regions to unknown regions according to the similarity between them. However, we find that more than 50% of pixels in the unknown regions cannot be correlated to pixels in known regions due to the limitation of the small effective receptive fields of common convolutional neural networks, which leads to inaccurate estimation when the pixels in the unknown regions cannot be inferred only with pixels in the receptive fields. To solve this problem, we propose the Long-Range Feature Propagating Network (LFPNet), which learns the long-range context features outside the receptive fields for alpha matte estimation. Specifically, we first design the propagating module which extracts the context features from the downsampled image. Then, we present Center-Surround Pyramid Pooling (CSPP) that explicitly propagates the context features from the surrounding context image patch to the inner center image patch. Finally, we use the matting module which takes the image, trimap and context features to estimate the alpha matte. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on the AlphaMatting and Adobe Image Matting datasets.
    Finite State Machine Policies Modulating Trajectory Generator. (arXiv:2109.12696v1 [cs.RO])
    (0 min) Deep reinforcement learning (deep RL) has emerged as an effective tool for developing controllers for legged robots. However, a simple neural network representation is known for its poor extrapolation ability, making the learned behavior vulnerable to unseen perturbations or challenging terrains. Therefore, researchers have investigated a novel architecture, Policies Modulating Trajectory Generators (PMTG), which combines trajectory generators (TG) and feedback control signals to achieve more robust behaviors. In this work, we propose to extend the PMTG framework with a finite state machine PMTG by replacing simple TGs with asynchronous finite state machines (Async FSMs). This invention offers an explicit notion of contact events to the policy to negotiate unexpected perturbations. We demonstrated that the proposed architecture could achieve more robust behaviors in various scenarios, such as challenging terrains or external perturbations, on both simulated and real robots. The supplemental video can be found at: this http URL
    PETA: Photo Albums Event Recognition using Transformers Attention. (arXiv:2109.12499v1 [cs.CV])
    (0 min) In recent years the amount of personal photos captured has increased significantly, giving rise to new challenges in multi-image understanding and high-level image understanding. Event recognition in personal photo albums presents one challenging scenario where life events are recognized from a disordered collection of images, including both relevant and irrelevant images. Event recognition in images also presents the challenge of high-level image understanding, as opposed to low-level image object classification. In the absence of methods to analyze multiple inputs, previous methods adopted temporal mechanisms, including various forms of recurrent neural networks. However, their effective temporal window is local. In addition, they are not a natural choice given the disordered characteristic of photo albums. We address this gap with a tailor-made solution, combining the power of CNNs for image representation and transformers for album representation to perform global reasoning on the image collection, offering a practical and efficient solution for photo album event recognition. Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90% mAP on all datasets. We further explore the related image-importance task in event recognition, demonstrating how the learned attentions correlate with the human-annotated importance for this subjective task, thus opening the door for new applications.
    Multi-source Few-shot Domain Adaptation. (arXiv:2109.12391v1 [cs.CV])
    (0 min) Multi-source Domain Adaptation (MDA) aims to transfer predictive models from multiple, fully-labeled source domains to an unlabeled target domain. However, in many applications, relevant labeled source datasets may not be available, and collecting source labels can be as expensive as labeling the target data itself. In this paper, we investigate Multi-source Few-shot Domain Adaptation (MFDA): a new domain adaptation scenario with limited multi-source labels and unlabeled target data. As we show, existing methods often fail to learn discriminative features for both source and target domains in the MFDA setting. Therefore, we propose a novel framework, termed Multi-Source Few-shot Adaptation Network (MSFAN), which can be trained end-to-end in a non-adversarial manner. MSFAN operates by first using a type of prototypical, multi-domain, self-supervised learning to learn features that are not only domain-invariant but also class-discriminative. Second, MSFAN uses a small, labeled support set to enforce feature consistency and domain invariance across domains. Finally, prototypes from multiple sources are leveraged to learn better classifiers. Compared with state-of-the-art MDA methods, MSFAN improves the mean classification accuracy over different domain pairs on MFDA by 20.2%, 9.4%, and 16.2% on Office, Office-Home, and DomainNet, respectively.
    Cluster Analysis with Deep Embeddings and Contrastive Learning. (arXiv:2109.12714v1 [cs.LG])
    (0 min) Unsupervised disentangled representation learning is a long-standing problem in computer vision. This work proposes a novel framework for performing image clustering from deep embeddings by combining instance-level contrastive learning with a deep embedding based cluster center predictor. Our approach jointly learns representations and predicts cluster centers in an end-to-end manner. This is accomplished via a three-pronged approach that combines a clustering loss, an instance-wise contrastive loss, and an anchor loss. Our fundamental intuition is that using an ensemble loss that incorporates instance-level features and a clustering procedure focusing on semantic similarity reinforces learning better representations in the latent space. We observe that our method performs exceptionally well on popular vision datasets when evaluated using standard clustering metrics such as Normalized Mutual Information (NMI), in addition to producing geometrically well-separated cluster embeddings as defined by the Euclidean distance. Our framework performs on par with widely accepted clustering methods and outperforms the state-of-the-art contrastive learning method on the CIFAR-10 dataset with an NMI score of 0.772, a 7-8% improvement on the strong baseline.
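    As a rough illustration of how such an ensemble objective can be assembled, the sketch below combines an instance-wise contrastive term, a clustering term, and an anchor term; the loss weights, the temperature, and the exact form of the anchor loss are our assumptions, not the paper's published formulation.
    ```python
    import torch
    import torch.nn.functional as F

    def ensemble_loss(z_i, z_j, logits, targets, anchors, w=(1.0, 1.0, 1.0), tau=0.5):
        """Hypothetical combination of the three losses described in the abstract.

        z_i, z_j : L2-normalized embeddings of two augmented views, shape (B, D)
        logits   : soft cluster assignments, shape (B, K)
        targets  : pseudo-label cluster assignments, shape (B,)
        anchors  : cluster-center embeddings, shape (K, D)
        """
        # Instance-wise contrastive (NT-Xent style) loss between the two views.
        sim = z_i @ z_j.t() / tau                                  # (B, B) similarities
        labels = torch.arange(z_i.size(0), device=z_i.device)
        contrastive = F.cross_entropy(sim, labels)

        # Clustering loss: push soft assignments toward the current pseudo-labels.
        clustering = F.cross_entropy(logits, targets)

        # Anchor loss: pull each embedding toward its assigned cluster center.
        anchor = F.mse_loss(z_i, anchors[targets])

        return w[0] * contrastive + w[1] * clustering + w[2] * anchor
    ```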
    FedProc: Prototypical Contrastive Federated Learning on Non-IID data. (arXiv:2109.12273v1 [cs.LG])
    (0 min) Federated learning allows multiple clients to collaborate to train high-performance deep learning models while keeping the training data locally. However, when the local data of all clients are not independent and identically distributed (i.e., non-IID), it is challenging to implement this form of efficient collaborative learning. Although significant efforts have been dedicated to addressing this challenge, the effect on the image classification task is still not satisfactory. In this paper, we propose FedProc: prototypical contrastive federated learning, which is a simple and effective federated learning framework. The key idea is to utilize the prototypes as global knowledge to correct the local training of each client. We design a local network architecture and a global prototypical contrastive loss to regulate the training of local models, which makes local objectives consistent with the global optima. Eventually, the converged global model obtains good performance on non-IID data. Experimental results show that, compared to state-of-the-art federated learning methods, FedProc improves the accuracy by 1.6% to 7.9% with acceptable computation cost.
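    A minimal sketch of the idea of using global class prototypes to correct local training, assuming the server broadcasts one prototype vector per class; the coefficient mu, temperature tau, and exact loss form below are illustrative assumptions rather than FedProc's specification.
    ```python
    import torch
    import torch.nn.functional as F

    def local_loss(features, logits, labels, global_protos, mu=1.0, tau=0.5):
        """Hypothetical client objective: cross-entropy plus a prototypical contrastive
        term pulling each client's features toward the global prototype of its class."""
        ce = F.cross_entropy(logits, labels)

        # Similarities between normalized features and all global class prototypes.
        f = F.normalize(features, dim=1)            # (B, D)
        p = F.normalize(global_protos, dim=1)       # (C, D)
        sim = f @ p.t() / tau                       # (B, C)

        # Contrastive term: treat the ground-truth class prototype as the positive.
        proto_contrast = F.cross_entropy(sim, labels)

        return ce + mu * proto_contrast
    ```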
    MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions. (arXiv:2105.02702v3 [cs.SD] UPDATED)
    (0 min) In this paper, we introduce MIMII DUE, a new dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions. Conventional methods for anomalous sound detection face practical challenges because the distribution of features changes between the training and operational phases (called domain shift) due to various real-world factors. To check the robustness against domain shifts, we need a dataset that actually includes domain shifts, but such a dataset does not exist so far. The new dataset we created consists of the normal and abnormal operating sounds of five different types of industrial machines under two different operational/environmental conditions (source domain and target domain) independent of normal/abnormal, with domain shifts occurring between the two domains. Experimental results showed significant performance differences between the source and target domains, indicating that the dataset contains the domain shifts. These findings demonstrate that the dataset will be helpful for checking the robustness against domain shifts. The dataset is a subset of the dataset for DCASE 2021 Challenge Task 2 and freely available for download at https://zenodo.org/record/4740355
    When Should You Defend Your Classifier -- A Game-theoretical Analysis of Countermeasures against Adversarial Examples. (arXiv:2108.07602v2 [cs.LG] UPDATED)
    (0 min) Adversarial machine learning, i.e., increasing the robustness of machine learning algorithms against so-called adversarial examples, is now an established field. Yet, newly proposed methods are evaluated and compared under unrealistic scenarios where costs for adversary and defender are not considered and either all samples or no samples are adversarially perturbed. We scrutinize these assumptions and propose the advanced adversarial classification game, which incorporates all relevant parameters of an adversary and a defender. Especially, we take into account economic factors on both sides and the fact that all so far proposed countermeasures against adversarial examples reduce accuracy on benign samples. Analyzing the scenario in detail, where both players have two pure strategies, we identify all best responses and conclude that in practical settings, the most influential factor might be the maximum amount of adversarial examples.
    Decision Making For Celebrity Branding: An Opinion Mining Approach Based On Polarity And Sentiment Analysis Using Twitter Consumer-Generated Content (CGC). (arXiv:2109.12630v1 [cs.SI])
    (0 min) The volume of discussion about brands on social media gives digital marketers great opportunities to track and analyze consumers' feelings and views toward brands, products, influencers, services, and ad campaigns in CGC. The present study assesses and compares the performance of firms and celebrities (i.e., influencers who have appeared in those companies' ad campaigns) using automated sentiment analysis of social media CGC, exploring consumers' feelings toward them to determine which of the two influencers considered for each company resonated more closely with the corresponding corporation in consumers' minds. For this purpose, consumer tweets from the pages of brands and influencers were used to compare lexicon-based and machine learning approaches to sentiment analysis, via a Naive lexicon-based algorithm and the Naive Bayes algorithm (a machine learning method), and to assess the campaigns. The findings suggest that the two approaches differ in accuracy, with the machine learning method yielding higher accuracy. Finally, the results indicate which influencer was more appropriate given their presence in previous campaigns, helping a company choose the right influencer and run a better, more appropriate, and more efficient ad campaign in the future. Further studies are needed to improve the accuracy of the sentiment classification, and the approach should be applied to other types of social media CGC. The results inform decisions about which sentiment analysis methods are best suited to analyzing social media, and they show that companies should be aware of their consumers' sentiments and choose the right person every time they plan a campaign.
    Influence of Mobility Restrictions on Transmission of COVID-19 in the state of Maryland -- the USA. (arXiv:2109.12219v1 [stat.AP])
    (0 min) Background: The novel coronavirus, COVID-19, was first detected in the United States in January 2020. To curb the spread of the disease in mid-March, different states issued mandatory stay-at-home (SAH) orders. These nonpharmaceutical interventions were mandated based on prior experiences, such as the 1918 influenza epidemic. Hence, we decided to study the impact of restrictions on mobility on reducing COVID-19 transmission. Methods: We designed an ecological time series study with our exposure variable as mobility patterns in the state of Maryland for March–December 2020 and our outcome variable as the COVID-19 hospitalizations for the same period. We built an Extreme Gradient Boosting (XGBoost) ensemble machine learning model and regressed the lagged COVID-19 hospitalizations against mobility volume for different regions of Maryland. Results: We found an 18% increase in COVID-19 hospitalizations when mobility was increased by a factor of five, and a 43% increase when mobility was further increased by a factor of ten. Conclusion: The findings of our study demonstrated a positive linear relationship between mobility and the incidence of COVID-19 cases. These findings are partially consistent with other studies suggesting the benefits of mobility restrictions. However, a more detailed approach is needed to precisely understand the benefits and limitations of mobility restrictions as part of the response to the COVID-19 pandemic.
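    A minimal sketch of the kind of lagged regression described above; the file name, column names, 14-day lag, and hyperparameters are placeholders, not the study's actual specification.
    ```python
    import pandas as pd
    from xgboost import XGBRegressor

    # Illustrative only: column names and the 14-day lag are assumptions.
    df = pd.read_csv("maryland_mobility_hospitalizations.csv",
                     parse_dates=["date"]).sort_values("date")

    # Regress hospitalizations on mobility observed `lag` days earlier.
    lag = 14
    df["mobility_lagged"] = df["mobility_volume"].shift(lag)
    df = df.dropna()

    model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(df[["mobility_lagged"]], df["covid_hospitalizations"])
    ```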
    Long-Range Transformers for Dynamic Spatiotemporal Forecasting. (arXiv:2109.12218v1 [cs.LG])
    (0 min) Multivariate Time Series Forecasting (TSF) focuses on the prediction of future values based on historical context. In these problems, dependent variables provide additional information or early warning signs of changes in future behavior. State-of-the-art forecasting models rely on neural attention between timesteps. This allows for temporal learning but fails to consider distinct spatial relationships between variables. This paper addresses the problem by translating multivariate TSF into a novel spatiotemporal sequence formulation where each input token represents the value of a single variable at a given timestep. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, scales to high dimensional forecasting problems dominated by Graph Neural Networks that rely on predefined variable graphs. We achieve competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning spatial and temporal relationships purely from data.
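    A small sketch of the sequence formulation described above, where a T x N multivariate series is flattened so that each token carries one variable's value at one timestep; the exact token features used by the proposed model may differ.
    ```python
    import numpy as np

    def to_spatiotemporal_tokens(series, timestamps):
        """Flatten a (T timesteps x N variables) array into one token per variable per
        timestep, each token being (time, variable_id, value). Illustrative only."""
        T, N = series.shape
        tokens = []
        for t in range(T):
            for n in range(N):
                tokens.append((timestamps[t], n, series[t, n]))
        return np.array(tokens, dtype=float)   # shape (T * N, 3)

    # Example: 3 variables observed at 4 timesteps become a 12-token sequence.
    tokens = to_spatiotemporal_tokens(np.random.rand(4, 3), np.arange(4))
    ```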
    Regret-Optimal Filtering. (arXiv:2101.10357v2 [math.OC] UPDATED)
    (0 min) In this paper, we study the filtering problem of causally estimating a desired signal from a related observation signal, through the lens of regret optimization. Classical filter designs, such as $\mathcal H_2$ and $\mathcal H_\infty$, minimize the average and worst-case estimation errors, respectively. As a result, $\mathcal H_2$ filters are sensitive to inaccuracies in the underlying statistical model, and $\mathcal H_\infty$ filters are overly conservative since they safeguard against the worst case. To obtain filters that perform better under all possible environments, we propose to minimize the \emph{regret} of the causal filter relative to a clairvoyant one. More explicitly, we minimize the largest deviation of the squared estimation error of a causal filter from that of a clairvoyant estimator that also has access to future observations. In this sense, our proposed filter will have the smallest regret no matter what the true signal and the observation process are, and thus is more adaptive to different environments. When the underlying signals have linear time-invariant state-space structure, we provide an explicit construction for the regret-optimal filter. The solution is obtained by reducing regret-optimal filtering to a Nehari problem, i.e., approximating a non-causal operator by a causal one in spectral norm. The regret-optimal filter bears some resemblance to the Kalman and $\mathcal H_\infty$ filters: the regret-optimal filter expressed as a state-space model inherits the finite dimension of the original state-space, and its solution requires solving three algebraic Riccati equations. Our results and simulations demonstrate that regret optimality is a viable approach to filter design.
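    One way to write the regret criterion described above, in our own notation (the paper's precise normalization may differ): with $K$ a causal filter, $K_0$ the clairvoyant non-causal filter, $w$ the driving disturbance, and $e_K(w)$ the resulting estimation error,
    ```latex
    \mathrm{Regret}(K) \;=\; \sup_{w \neq 0}
    \frac{\|e_{K}(w)\|_2^{2} \;-\; \|e_{K_0}(w)\|_2^{2}}{\|w\|_2^{2}} ,
    ```
    and the regret-optimal filter minimizes this quantity over all causal $K$.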
    A Hard Label Black-box Adversarial Attack Against Graph Neural Networks. (arXiv:2108.09513v2 [cs.LG] UPDATED)
    (0 min) Graph Neural Networks (GNNs) have achieved state-of-the-art performance in various graph structure related tasks such as node classification and graph classification. However, GNNs are vulnerable to adversarial attacks. Existing works mainly focus on attacking GNNs for node classification; nevertheless, the attacks against GNNs for graph classification have not been well explored. In this work, we conduct a systematic study on adversarial attacks against GNNs for graph classification via perturbing the graph structure. In particular, we focus on the most challenging attack, i.e., the hard label black-box attack, where an attacker has no knowledge about the target GNN model and can only obtain predicted labels through querying the target model. To achieve this goal, we formulate our attack as an optimization problem, whose objective is to minimize the number of edges to be perturbed in a graph while maintaining a high attack success rate. The original optimization problem is intractable to solve, and we relax it to a tractable one, which is solved with a theoretical convergence guarantee. We also design a coarse-grained searching algorithm and a query-efficient gradient computation algorithm to decrease the number of queries to the target GNN model. Our experimental results on three real-world datasets demonstrate that our attack can effectively attack representative GNNs for graph classification with fewer queries and perturbations. We also evaluate the effectiveness of our attack under two defenses: one is a well-designed adversarial graph detector, and the other is that the target GNN model itself is equipped with a defense to prevent adversarial graph generation. Our experimental results show that such defenses are not effective enough, which highlights the need for more advanced defenses.
    Use of the Deep Learning Approach to Measure Alveolar Bone Level. (arXiv:2109.12115v1 [eess.IV])
    (0 min) Aim: The goal was to use a Deep Convolutional Neural Network to measure the radiographic alveolar bone level to aid periodontal diagnosis. Material and methods: A Deep Learning (DL) model was developed by integrating three segmentation networks (bone area, tooth, cementoenamel junction) and image analysis to measure the radiographic bone level and assign radiographic bone loss (RBL) stages. The percentage of RBL was calculated to determine the stage of RBL for each tooth. A provisional periodontal diagnosis was assigned using the 2018 periodontitis classification. RBL percentage, staging, and presumptive diagnosis were compared to the measurements and diagnoses made by the independent examiners. Results: The average Dice Similarity Coefficient (DSC) for segmentation was over 0.91. There was no significant difference in RBL percentage measurements determined by DL and examiners (p=0.65). The Area Under the Receiver Operating Characteristics Curve of RBL stage assignment for stages I, II and III was 0.89, 0.90 and 0.90, respectively. The accuracy of the case diagnosis was 0.85. Conclusion: The proposed DL model provides reliable RBL measurements and image-based periodontal diagnosis using periapical radiographic images. However, this model has to be further optimized and validated on a larger number of images to facilitate its application.
    EDDA: Explanation-driven Data Augmentation to Improve Explanation Faithfulness. (arXiv:2105.14162v3 [cs.LG] UPDATED)
    (0 min) Recent years have seen the introduction of a range of methods for post-hoc explainability of image classifier predictions. However, these post-hoc explanations may not always be faithful to classifier predictions, which poses a significant challenge when attempting to debug models based on such explanations. To this end, we seek a methodology that can improve the faithfulness of an explanation method with respect to model predictions which does not require ground truth explanations. We achieve this through a novel explanation-driven data augmentation (EDDA) technique that augments the training data with occlusions inferred from model explanations; this is based on the simple motivating principle that \emph{if} the explainer is faithful to the model \emph{then} occluding salient regions for the model prediction should decrease the model confidence in the prediction, while occluding non-salient regions should not change the prediction. To verify that the proposed augmentation method has the potential to improve faithfulness, we evaluate EDDA using a variety of datasets and classification models. We demonstrate empirically that our approach leads to a significant increase of faithfulness, which can facilitate better debugging and successful deployment of image classification models in real-world applications.
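    The motivating principle can be made concrete with a small check like the following; `model.predict_proba`, the binary saliency threshold, and the zero-fill occlusion are assumed stand-ins for whatever explainer and classifier are actually used.
    ```python
    import numpy as np

    def occlusion_consistent(image, saliency, model, threshold=0.5):
        """Illustrative check of the principle: occluding salient pixels should lower
        the predicted-class confidence; occluding non-salient pixels should not."""
        pred = model.predict_proba(image[None])[0]
        cls = pred.argmax()

        salient_mask = saliency > threshold
        occluded_salient = image.copy()
        occluded_salient[salient_mask] = 0          # occlude salient regions

        occluded_rest = image.copy()
        occluded_rest[~salient_mask] = 0            # occlude non-salient regions

        conf_salient = model.predict_proba(occluded_salient[None])[0, cls]
        conf_rest = model.predict_proba(occluded_rest[None])[0, cls]

        # Faithful explanation: confidence drops in the first case, holds in the second.
        return conf_salient < pred[cls] and abs(conf_rest - pred[cls]) < 0.1
    ```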
    On the Fairness of Swarm Learning in Skin Lesion Classification. (arXiv:2109.12176v1 [cs.DC])
    (0 min) in healthcare. However, the existing AI model may be biased in its decision making. The bias induced by the data itself, such as collecting data in certain subgroups only, can be mitigated by including more diversified data. Distributed and collaborative learning is an approach that involves training models on massive, heterogeneous, and distributed data sources, also known as nodes. In this work, we examine the fairness issue in Swarm Learning (SL), a recent edge-computing-based decentralized machine learning approach designed for heterogeneous illness detection in precision medicine. SL has achieved high performance in clinical applications, but no attempt has been made to evaluate whether SL can improve fairness. To address the problem, we present an empirical study comparing fairness among single-node training, SL, and centralized training. Specifically, we evaluate on a large publicly available skin lesion dataset, which contains samples from various subgroups. The experiments demonstrate that SL does not exacerbate the fairness problem compared to centralized training and improves both performance and fairness compared to single-node training. However, biases still exist in the SL model, and the implementation of SL is more complex than that of the other two strategies.
    Robotic Vision for Space Mining. (arXiv:2109.12109v1 [cs.RO])
    (0 min) Future Moon bases will likely be constructed using resources mined from the surface of the Moon. The difficulty of maintaining a human workforce on the Moon and communications lag with Earth means that mining will need to be conducted using collaborative robots with a high degree of autonomy. In this paper, we explore the utility of robotic vision towards addressing several major challenges in autonomous mining in the lunar environment: lack of satellite positioning systems, navigation in hazardous terrain, and delicate robot interactions. Specifically, we describe and report the results of robotic vision algorithms that we developed for Phase 2 of the NASA Space Robotics Challenge, which was framed in the context of autonomous collaborative robots for mining on the Moon. The competition provided a simulated lunar environment that exhibits the complexities alluded to above. We show how machine learning-enabled vision could help alleviate the challenges posed by the lunar environment. A robust multi-robot coordinator was also developed to achieve long-term operation and effective collaboration between robots.
    Cardiac Complication Risk Profiling for Cancer Survivors via Multi-View Multi-Task Learning. (arXiv:2109.12276v1 [cs.LG])
    (0 min) Complication risk profiling is a key challenge in the healthcare domain due to the complex interaction between heterogeneous entities (e.g., visit, disease, medication) in clinical data. With the availability of real-world clinical data such as electronic health records and insurance claims, many deep learning methods have been proposed for complication risk profiling. However, these existing methods face two open challenges. First, data heterogeneity: these methods leverage clinical data from a single view only, while the data can be considered from multiple views (e.g., sequence of clinical visits, set of clinical features). Second, generalized prediction: most of these methods focus on single-task learning, whereby each complication onset is predicted independently, leading to suboptimal models. We propose a multi-view multi-task network (MuViTaNet) for predicting the onset of multiple complications to tackle these issues. In particular, MuViTaNet complements patient representation by using a multi-view encoder to effectively extract information by considering clinical data as both sequences of clinical visits and sets of clinical features. In addition, it leverages additional information from both related labeled and unlabeled datasets to generate more generalized representations by using a new multi-task learning scheme for making more accurate predictions. The experimental results show that MuViTaNet outperforms existing methods for profiling the development of cardiac complications in breast cancer survivors. Furthermore, thanks to its multi-view multi-task architecture, MuViTaNet also provides an effective mechanism for interpreting its predictions from multiple perspectives, thereby helping clinicians discover the underlying mechanisms triggering the onset and make better clinical treatment decisions in real-world scenarios.
    Datasets for Studying Generalization from Easy to Hard Examples. (arXiv:2108.06011v2 [cs.LG] UPDATED)
    (0 min) We describe new datasets for studying generalization from easy to hard examples.
    Description and Discussion on DCASE 2021 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions. (arXiv:2106.04492v2 [eess.AS] UPDATED)
    (0 min) We present the task description and discussion on the results of the DCASE 2021 Challenge Task 2. In 2020, we organized an unsupervised anomalous sound detection (ASD) task, identifying whether a given sound was normal or anomalous without anomalous training data. In 2021, we organized an advanced unsupervised ASD task under domain-shift conditions, which focuses on the inevitable problem of the practical use of ASD systems. The main challenge of this task is to detect unknown anomalous sounds where the acoustic characteristics of the training and testing samples are different, i.e., domain-shifted. This problem frequently occurs due to changes in seasons, manufactured products, and/or environmental noise. We received 75 submissions from 26 teams, and several novel approaches have been developed in this challenge. On the basis of the analysis of the evaluation results, we found that there are two types of remarkable approaches that TOP-5 winning teams adopted: 1) ensemble approaches of ``outlier exposure'' (OE)-based detectors and ``inlier modeling'' (IM)-based detectors and 2) approaches based on IM-based detection for features learned in a machine-identification task.
    Hybrid Quantum Classical Graph Neural Networks for Particle Track Reconstruction. (arXiv:2109.12636v1 [quant-ph])
    (0 min) The Large Hadron Collider (LHC) at the European Organisation for Nuclear Research (CERN) will be upgraded to further increase the instantaneous rate of particle collisions (luminosity) and become the High Luminosity LHC (HL-LHC). This increase in luminosity will significantly increase the number of particles interacting with the detector. The interaction of particles with a detector is referred to as a "hit". The HL-LHC will yield many more detector hits, which will pose a combinatorial challenge for the reconstruction algorithms used to determine particle trajectories from those hits. This work explores the possibility of converting a novel Graph Neural Network model, which can optimally take into account the sparse nature of the tracking detector data and their complex geometry, into a Hybrid Quantum-Classical Graph Neural Network that benefits from using variational quantum layers. We show that this hybrid model can perform similarly to the classical approach. Also, we explore Parametrized Quantum Circuits (PQC) with different expressibility and entangling capacities, and compare their training performance in order to quantify the expected benefits. These results can be used to build a future roadmap for further developing circuit-based Hybrid Quantum-Classical Graph Neural Networks.
    Leveraging Pretrained Models for Automatic Summarization of Doctor-Patient Conversations. (arXiv:2109.12174v1 [cs.CL])
    (0 min) Fine-tuning pretrained models for automatically summarizing doctor-patient conversation transcripts presents many challenges: limited training data, significant domain shift, long and noisy transcripts, and high target summary variability. In this paper, we explore the feasibility of using pretrained transformer models for automatically summarizing doctor-patient conversations directly from transcripts. We show that fluent and adequate summaries can be generated with limited training data by fine-tuning BART on a specially constructed dataset. The resulting models greatly surpass the performance of an average human annotator and the quality of previously published work for the task. We evaluate multiple methods for handling long conversations, comparing them to the obvious baseline of truncating the conversation to fit the pretrained model length limit. We introduce a multistage approach that tackles the task by learning two fine-tuned models: one for summarizing conversation chunks into partial summaries, followed by one for rewriting the collection of partial summaries into a complete summary. Using a carefully chosen fine-tuning dataset, this method is shown to be effective at handling longer conversations, improving the quality of generated summaries. We conduct both an automatic evaluation (through ROUGE and two concept-based metrics focusing on medical findings) and a human evaluation (through qualitative examples from the literature, assessing hallucination, generalization, fluency, and general quality of the generated summaries).
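    A toy sketch of the two-stage chunk-then-rewrite idea, using an off-the-shelf BART summarizer as a stand-in; the paper fine-tunes its own models on a specially constructed dataset, and the chunk size and generation lengths below are arbitrary.
    ```python
    from transformers import pipeline

    # Stand-in model: one public BART summarizer reused for both stages,
    # purely to illustrate the chunk-then-rewrite structure.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def summarize_conversation(transcript, chunk_chars=3000):
        # Stage 1: summarize fixed-size chunks of the transcript into partial summaries.
        chunks = [transcript[i:i + chunk_chars]
                  for i in range(0, len(transcript), chunk_chars)]
        partial = [summarizer(c, max_length=128, truncation=True)[0]["summary_text"]
                   for c in chunks]
        # Stage 2: rewrite the concatenated partial summaries into one complete summary.
        return summarizer(" ".join(partial), max_length=256, truncation=True)[0]["summary_text"]
    ```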
    WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels. (arXiv:2107.01002v2 [cs.NI] UPDATED)
    (0 min) We introduce WiCluster, a new machine learning (ML) approach for passive indoor positioning using radio frequency (RF) channel state information (CSI). WiCluster can predict both a zone-level position and a precise 2D or 3D position, without using any precise position labels during training. Prior CSI-based indoor positioning work has relied on non-parametric approaches using digital signal processing (DSP) and, more recently, parametric approaches (e.g., fully supervised ML methods). However, these do not handle the complexity of real-world environments well and do not meet requirements for large-scale commercial deployments: the accuracy of DSP-based methods deteriorates significantly in non-line-of-sight conditions, while supervised ML methods need large amounts of hard-to-acquire centimeter-accuracy position labels. In contrast, WiCluster is precise, requires weaker label information that can be easily collected, and works well in non-line-of-sight conditions. Our first contribution is a novel dimensionality reduction method for charting. It combines a triplet loss with a multi-scale clustering loss to map the high-dimensional CSI representation to a 2D/3D latent space. Our second contribution is two weakly supervised losses that map this latent space into a Cartesian map, resulting in meter-accuracy position results. These losses only require simple-to-acquire priors: a sketch of the floorplan, approximate access-point locations and a few CSI packets that are labelled with the corresponding zone in the floorplan. Thirdly, we report results and a robustness study for 2D positioning in two single-floor office buildings and 3D positioning in a two-story home.
    Distributionally Robust Multi-Output Regression Ranking. (arXiv:2109.12803v1 [cs.LG])
    (0 min) Despite their empirical success, most existing listwise learning-to-rank (LTR) models are not built to be robust to errors in labeling or annotation, distributional data shift, or adversarial data perturbations. To fill this gap, we introduce a new listwise LTR model called Distributionally Robust Multi-output Regression Ranking (DRMRR). Different from existing methods, the scoring function of DRMRR was designed as a multivariate mapping from a feature vector to a vector of deviation scores, which captures local context information and cross-document interactions. DRMRR uses a Distributionally Robust Optimization (DRO) framework to minimize a multi-output loss function under the most adverse distributions in the neighborhood of the empirical data distribution defined by a Wasserstein ball. We show that this is equivalent to a regularized regression problem with a matrix norm regularizer. Our experiments were conducted on two real-world applications, medical document retrieval and drug response prediction, showing that DRMRR notably outperforms state-of-the-art LTR models. We also conducted a comprehensive analysis to assess the resilience of DRMRR against various types of noise: Gaussian noise, adversarial perturbations, and label poisoning. We show that DRMRR is not only able to achieve significantly better performance than other baselines, but can also maintain a relatively stable performance as more noise is added to the data.
    Reinforcement Learning and its Connections with Neuroscience and Psychology. (arXiv:2007.01099v5 [cs.LG] UPDATED)
    (0 min) Reinforcement learning methods have recently been very successful at performing complex sequential tasks like playing Atari games, Go and Poker. These algorithms have outperformed humans in several tasks by learning from scratch, using only scalar rewards obtained through interaction with their environment. While there certainly has been considerable independent innovation to produce such results, many core ideas in reinforcement learning are inspired by phenomena in animal learning, psychology and neuroscience. In this paper, we comprehensively review a large number of findings in both neuroscience and psychology that evidence reinforcement learning as a promising candidate for modeling learning and decision making in the brain. In doing so, we construct a mapping between various classes of modern RL algorithms and specific findings in both neurophysiological and behavioral literature. We then discuss the implications of this observed relationship between RL, neuroscience and psychology and its role in advancing research in both AI and brain science.
    Leveraging Multiple CNNs for Triaging Medical Workflow. (arXiv:2109.12783v1 [eess.IV])
    (0 min) High hospitalization rates due to the global spread of Covid-19 bring about a need for improvements to classical triaging workflows. To this end, convolutional neural networks (CNNs) can effectively differentiate critical from non-critical images so that critical cases may be addressed quickly, so long as there exists some representative image for the illness. We present a conglomerate neural network system consisting of multiple VGG16 CNNs; the system is trained on weighted skin disease images re-labelled as critical or non-critical, and then attaches to each input image a critical index between 0 and 10. A critical index offers a more comprehensive rating system compared to binary critical/non-critical labels. Results for batches of input images run through the trained network are promising: a batch is shown being re-ordered by the proposed architecture from most critical to least critical with reasonable accuracy.
    Robustness in Compressed Neural Networks for Object Detection. (arXiv:2102.05509v3 [cs.LG] UPDATED)
    (0 min) Model compression techniques make it possible to significantly reduce the computational cost associated with data processing by deep neural networks, with only a minor decrease in average accuracy. Simultaneously, reducing the model size may have a large effect on noisy cases or objects belonging to less frequent classes. It is a crucial problem from the perspective of the models' safety, especially for object detection in the autonomous driving setting, which is considered in this work. It was shown in the paper that the sensitivity of compressed models to different distortion types is nuanced, and some of the corruptions are heavily impacted by the compression methods (i.e., additive noise), while others (blur effect) are only slightly affected. A common way to improve the robustness of models is to use data augmentation, which was confirmed to positively affect models' robustness, also for highly compressed models. It was further shown that while data imbalance methods brought only a slight increase in accuracy for the baseline model (without compression), the impact was more striking at higher compression rates for the structured pruning. Finally, methods for handling data imbalance brought a significant improvement of the pruned models' worst-detected class accuracy.
    Explanation-Based Human Debugging of NLP Models: A Survey. (arXiv:2104.15135v2 [cs.CL] UPDATED)
    (0 min) Debugging a machine learning model is hard since the bug usually involves the training data and the learning process. This becomes even harder for an opaque deep learning model if we have no clue about how the model actually works. In this survey, we review papers that exploit explanations to enable humans to give feedback and debug NLP models. We call this problem explanation-based human debugging (EBHD). In particular, we categorize and discuss existing work along three dimensions of EBHD (the bug context, the workflow, and the experimental setting), compile findings on how EBHD components affect the feedback providers, and highlight open problems that could be future research directions.
    Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms. (arXiv:2109.12286v1 [cs.LG])
    (0 min) The hierarchical interaction between the actor and critic in actor-critic based reinforcement learning algorithms naturally lends itself to a game-theoretic interpretation. We adopt this viewpoint and model the actor and critic interaction as a two-player general-sum game with a leader-follower structure known as a Stackelberg game. Given this abstraction, we propose a meta-framework for Stackelberg actor-critic algorithms where the leader player follows the total derivative of its objective instead of the usual individual gradient. From a theoretical standpoint, we develop a policy gradient theorem for the refined update and provide a local convergence guarantee for the Stackelberg actor-critic algorithms to a local Stackelberg equilibrium. From an empirical standpoint, we demonstrate via simple examples that the learning dynamics we study mitigate cycling and accelerate convergence compared to the usual gradient dynamics given cost structures induced by actor-critic formulations. Finally, extensive experiments on OpenAI gym environments show that Stackelberg actor-critic algorithms always perform at least as well and often significantly outperform the standard actor-critic algorithm counterparts.
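    Concretely, if the leader's parameters are $\theta$ with loss $L_1$ and the follower's parameters are $w$ with loss $L_2$, the total-derivative update described above can be written (our notation) as
    ```latex
    \nabla_{\theta}^{\mathrm{total}} L_{1}\bigl(\theta, w^{*}(\theta)\bigr)
      \;=\; \nabla_{\theta} L_{1}
      \;-\; \nabla^{2}_{\theta w} L_{2}\,
            \bigl(\nabla^{2}_{w w} L_{2}\bigr)^{-1}
            \nabla_{w} L_{1},
    ```
    where $w^{*}(\theta)$ is the follower's best response, obtained by applying the implicit function theorem to $\nabla_{w} L_{2}(\theta, w^{*}(\theta)) = 0$; which of the actor or critic plays the leader depends on the chosen formulation.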
    A Rigorous Information-Theoretic Definition of Redundancy and Relevancy in Feature Selection Based on (Partial) Information Decomposition. (arXiv:2105.04187v2 [cs.IT] UPDATED)
    (0 min) Selecting a minimal feature set that is maximally informative about a target variable is a central task in machine learning and statistics. Information theory provides a powerful framework for formulating feature selection algorithms -- yet, a rigorous, information-theoretic definition of feature relevancy, which accounts for feature interactions such as redundant and synergistic contributions, is still missing. We argue that this lack is inherent to classical information theory which does not provide measures to decompose the information a set of variables provides about a target into unique, redundant, and synergistic contributions. Such a decomposition has been introduced only recently by the partial information decomposition (PID) framework. Using PID, we clarify why feature selection is a conceptually difficult problem when approached using information theory and provide a novel definition of feature relevancy and redundancy in PID terms. From this definition, we show that the conditional mutual information (CMI) maximizes relevancy while minimizing redundancy and propose an iterative, CMI-based algorithm for practical feature selection. We demonstrate the power of our CMI-based algorithm in comparison to the unconditional mutual information on benchmark examples and provide corresponding PID estimates to highlight how PID allows quantifying the information contribution of features and their interactions in feature-selection problems.
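    A rough sketch of greedy selection guided by conditional mutual information, using a crude binned plug-in estimator and the chain-rule identity I(X_i; Y | S) = I(X_i, S; Y) - I(S; Y); this only illustrates the iterative CMI idea and is not the estimator the paper uses.
    ```python
    import numpy as np

    def _mi(a, b):
        """Plug-in mutual information (in nats) between two non-negative integer arrays."""
        joint = np.zeros((a.max() + 1, b.max() + 1))
        for x, y in zip(a, b):
            joint[x, y] += 1
        joint /= joint.sum()
        pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
        nz = joint > 0
        return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

    def greedy_cmi_selection(X_disc, y, k=3):
        """Greedy sketch: repeatedly add the feature with the highest estimated
        I(X_i; Y | S). X_disc and y must already be discretized into small
        non-negative integer codes; S is encoded as one joint discrete variable."""
        selected, remaining = [], list(range(X_disc.shape[1]))
        s_code = np.zeros(len(y), dtype=int)       # empty conditioning set
        for _ in range(k):
            base = _mi(s_code, y)                   # I(S; Y)
            scores = {}
            for i in remaining:
                joint_code = s_code * (X_disc[:, i].max() + 1) + X_disc[:, i]
                scores[i] = _mi(joint_code, y) - base   # I(X_i; Y | S)
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
            s_code = s_code * (X_disc[:, best].max() + 1) + X_disc[:, best]
        return selected
    ```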
    Tensor Full Feature Measure and Its Nonconvex Relaxation Applications to Tensor Recovery. (arXiv:2109.12257v1 [cs.CV])
    (0 min) Tensor sparse modeling is a promising approach that has seen great success across science and engineering. Data in practical applications are often generated by multiple factors, so tensors are used to represent data containing the internal structure of multiple factors. However, different from the matrix case, constructing a reasonable sparsity measure for tensors is a relatively difficult and very important task. Therefore, in this paper, we propose a new tensor sparsity measure called the Tensor Full Feature Measure (FFM). It can simultaneously describe the feature information of each dimension of the tensor and the related features between two dimensions, and it connects the Tucker rank with the tensor tube rank. This measure can describe the sparse features of the tensor more comprehensively. On this basis, we establish its non-convex relaxation and apply FFM to low rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA). LRTC and TRPCA models based on FFM are proposed, and two efficient Alternating Direction Method of Multipliers (ADMM) algorithms are developed to solve the proposed models. A variety of real numerical experiments substantiate the superiority of the proposed methods over the state of the art.
    MCTS Based Agents for Multistage Single-Player Card Game. (arXiv:2109.12112v1 [cs.AI])
    (0 min) The article presents the use of Monte Carlo Tree Search algorithms for the card game Lord of the Rings. The main challenge was the complexity of the game mechanics, in which each round consists of 5 decision stages and 2 random stages. To test various decision-making algorithms, a game simulator has been implemented. The research covered an agent based on expert rules, an agent using flat Monte Carlo search, as well as complete MCTS-UCB. Moreover, different playout strategies have been compared. As a result of the experiments, an optimal (assuming a limited time) combination of algorithms was formulated. The developed MCTS-based method demonstrated an advantage over the agent with expert knowledge.
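    For reference, the two search styles compared above differ mainly in how simulations are allocated; a compressed sketch follows, where the exploration constant, the node attributes, and the `simulate` hook are assumptions for illustration.
    ```python
    import math

    def ucb_select(node, c=1.4):
        """UCB1 child selection as used in MCTS-UCB. Assumes each node exposes
        `children`, `visits`, and `total_reward` attributes."""
        return max(
            node.children,
            key=lambda ch: ch.total_reward / (ch.visits + 1e-9)
            + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)),
        )

    def flat_monte_carlo(state, legal_moves, simulate, n_playouts=100):
        """Flat Monte Carlo search: play out every legal move the same number of
        times with a default policy and pick the move with the best average result.
        `simulate(state, move)` is an assumed game-simulator hook returning a score."""
        def avg(move):
            return sum(simulate(state, move) for _ in range(n_playouts)) / n_playouts
        return max(legal_moves, key=avg)
    ```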
    AuGPT: Auxiliary Tasks and Data Augmentation for End-To-End Dialogue with Pre-Trained Language Models. (arXiv:2102.05126v2 [cs.CL] UPDATED)
    (0 min) Attention-based pre-trained language models such as GPT-2 brought considerable progress to end-to-end dialogue modelling. However, they also present considerable risks for task-oriented dialogue, such as lack of knowledge grounding or diversity. To address these issues, we introduce modified training objectives for language model finetuning, and we employ massive data augmentation via back-translation to increase the diversity of the training data. We further examine the possibilities of combining data from multiple sources to improve performance on the target dataset. We carefully evaluate our contributions with both human and automatic methods. Our model substantially outperforms the baseline on the MultiWOZ data and shows competitive performance with the state of the art in both automatic and human evaluation.
    Privacy Assessment of Federated Learning using Private Personalized Layers. (arXiv:2106.08060v2 [cs.CR] UPDATED)
    (0 min) Federated Learning (FL) is a collaborative scheme to train a learning model across multiple participants without sharing data. While FL is a clear step forward towards enforcing users' privacy, different inference attacks have been developed. In this paper, we quantify the utility and privacy trade-off of an FL scheme using private personalized layers. While this scheme has been proposed as a local adaptation to improve the accuracy of the model through local personalization, it also has the advantage of minimizing the information about the model exchanged with the server. However, the privacy of such a scheme has never been quantified. Our evaluations using a motion sensor dataset show that personalized layers speed up the convergence of the model and slightly improve the accuracy for all users compared to a standard FL scheme, while better preventing both attribute and membership inferences compared to an FL scheme using local differential privacy.
    Dynamically Sampled Nonlocal Gradients for Stronger Adversarial Attacks. (arXiv:2011.02707v4 [cs.LG] UPDATED)
    (0 min) The vulnerability of deep neural networks to small and even imperceptible perturbations has become a central topic in deep learning research. Although several sophisticated defense mechanisms have been introduced, most were later shown to be ineffective. However, a reliable evaluation of model robustness is mandatory for deployment in safety-critical scenarios. To overcome this problem we propose a simple yet effective modification to the gradient calculation of state-of-the-art first-order adversarial attacks. Normally, the gradient update of an attack is directly calculated for the given data point. This approach is sensitive to noise and small local optima of the loss function. Inspired by gradient sampling techniques from non-convex optimization, we propose Dynamically Sampled Nonlocal Gradient Descent (DSNGD). DSNGD calculates the gradient direction of the adversarial attack as the weighted average over past gradients of the optimization history. Moreover, distribution hyperparameters that define the sampling operation are automatically learned during the optimization scheme. We empirically show that by incorporating this nonlocal gradient information, we are able to give a more accurate estimation of the global descent direction on noisy and non-convex loss surfaces. In addition, we show that DSNGD-based attacks are on average 35% faster while achieving 0.9% to 27.1% higher success rates compared to their gradient descent-based counterparts.
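    A simplified sketch of the nonlocal gradient idea, averaging past gradients with exponentially decaying weights; the actual method additionally samples gradients from a learned distribution around the current iterate and learns the weighting, neither of which is shown here.
    ```python
    import torch

    def dsngd_direction(grad_history, decay=0.9):
        """Weighted average over past gradients from the optimization history,
        with exponentially decaying weights for older steps (illustrative only)."""
        stacked = torch.stack(grad_history)                       # (T, ...) past gradients
        exponents = torch.arange(len(grad_history) - 1, -1, -1, device=stacked.device)
        weights = decay ** exponents.float()
        weights = weights / weights.sum()
        return (weights.view(-1, *[1] * (stacked.dim() - 1)) * stacked).sum(dim=0)
    ```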
    sunny-as2: Enhancing SUNNY for Algorithm Selection. (arXiv:2009.03107v2 [cs.AI] UPDATED)
    (0 min) SUNNY is an Algorithm Selection (AS) technique originally tailored for Constraint Programming (CP). SUNNY schedules, from a portfolio of solvers, a subset of solvers to be run on a given CP problem. This approach has proved to be effective for CP problems, and its parallel version won many gold medals in the Open category of the MiniZinc Challenge -- the yearly international competition for CP solvers. In 2015, the ASlib benchmarks were released for comparing AS systems coming from disparate fields (e.g., ASP, QBF, and SAT) and SUNNY was extended to deal with generic AS problems. This led to the development of sunny-as2, an algorithm selector based on SUNNY for ASlib scenarios. A preliminary version of sunny-as2 was submitted to the Open Algorithm Selection Challenge (OASC) in 2017, where it turned out to be the best approach for the runtime minimization of decision problems. In this work, we present the technical advancements of sunny-as2, including: (i) wrapper-based feature selection; (ii) a training approach combining feature selection and neighbourhood size configuration; (iii) the application of nested cross-validation. We show how sunny-as2 performance varies depending on the considered AS scenarios, and we discuss its strengths and weaknesses. Finally, we also show how sunny-as2 improves on its preliminary version submitted to OASC.

2021-09-27

  • cs.CL updates on arXiv.org

    Calling Out Bluff: Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems. (arXiv:2007.06796v4 [cs.CL] UPDATED)
    (0 min) Automatic scoring engines have been used for scoring approximately fifteen million test-takers in just the last three years. This number is increasing further due to COVID-19 and the associated automation of education and testing. Despite such wide usage, the AI-based testing literature of these "intelligent" models is highly lacking. Most of the papers proposing new models rely only on quadratic weighted kappa (QWK) based agreement with human raters for showing model efficacy. However, this effectively ignores the highly multi-feature nature of essay scoring. Essay scoring depends on features like coherence, grammar, relevance, sufficiency, and vocabulary. To date, there has been no study testing Automated Essay Scoring (AES) systems holistically on all these features. With this motivation, we propose a model-agnostic adversarial evaluation scheme and associated metrics for AES systems to test their natural language understanding capabilities and overall robustness. We evaluate the current state-of-the-art AES models using the proposed scheme and report the results on five recent models. These models range from feature-engineering-based approaches to the latest deep learning algorithms. We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models. On the other hand, irrelevant content, on average, increases the scores, thus showing that the model evaluation strategy and rubrics should be reconsidered. We also ask 200 human raters to score both an original and an adversarial response to see whether humans can detect differences between the two and whether they agree with the scores assigned by the automatic scorer.
    Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality. (arXiv:2010.12730v3 [cs.CL] UPDATED)
    (0 min) Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, though robust to misspellings, pure character-level models often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement of the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We integrate it further with BERT through pre-training while keeping BERT transformer parameters fixed--and thus, providing a practical method. Finally, we show that incorporating our module to mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
    SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification. (arXiv:2004.14454v2 [cs.CL] UPDATED)
    (0 min) The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited in size and it might be biased towards offensive language as it was collected using keywords. In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner. SOLID contains over nine million English tweets labeled in a semi-supervised fashion. We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models, especially for the lower levels of the taxonomy.
    Semantic Source Code Search: A Study of the Past and a Glimpse at the Future. (arXiv:1908.06738v2 [cs.SE] UPDATED)
    (0 min) With the recent explosion in the size and complexity of source codebases and software projects, the need for efficient source code search engines has increased dramatically. Unfortunately, existing information retrieval-based methods fail to capture the query semantics and perform well only when the query contains syntax-based keywords. Consequently, such methods will perform poorly when given high-level natural language queries. In this paper, we review existing methods for building code search engines. We also outline the open research directions and the various obstacles that stand in the way of having a universal source code search engine.
    Learned Token Pruning for Transformers. (arXiv:2107.00910v2 [cs.CL] UPDATED)
    (0 min) Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes unimportant tokens as an input sequence passes through transformer layers. In particular, LTP prunes tokens with an attention score below a threshold value which is learned for each layer during training. Our threshold-based method allows the length of the pruned sequence to vary adaptively based on the input sequence, and avoids algorithmically expensive operations such as top-k token selection. We extensively test the performance of LTP on GLUE tasks and show that our method outperforms the prior state-of-the-art token pruning methods by up to ~2.5% higher accuracy with the same amount of FLOPs. In particular, LTP achieves up to 2.1x FLOPs reduction with less than 1% accuracy drop, which results in up to 1.9x and 2.0x throughput improvement on Intel Haswell CPUs and NVIDIA V100 GPUs, respectively. Furthermore, we demonstrate that LTP is more robust than prior methods to variations on input sentence lengths. Our code has been developed in PyTorch and has been open-sourced.
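    A schematic of the threshold-based pruning step, where a token's importance is taken to be the mean attention it receives and tokens below the layer's learned threshold are masked out for subsequent layers; tensor shapes and the importance statistic are our assumptions for illustration.
    ```python
    import torch

    def prune_tokens(hidden_states, attention_probs, threshold, attention_mask):
        """Illustrative threshold-based token pruning after one transformer layer.

        attention_probs: (batch, heads, query_len, key_len) attention weights
        attention_mask : (batch, key_len) 1 for active tokens, 0 for already-pruned ones
        """
        # Importance of each (key) token: mean attention received over heads and queries.
        importance = attention_probs.mean(dim=(1, 2))                 # (batch, key_len)
        keep = (importance >= threshold) & attention_mask.bool()      # (batch, key_len)
        # Hidden states are left intact; pruned tokens are simply masked out downstream.
        return hidden_states, keep.to(attention_mask.dtype)
    ```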
    Bilingual Alignment Pre-Training for Zero-Shot Cross-Lingual Transfer. (arXiv:2106.01732v2 [cs.CL] UPDATED)
    (0 min) Multilingual pre-trained models have achieved remarkable performance on cross-lingual transfer learning. Some multilingual models, such as mBERT, have been pre-trained on unlabeled corpora, therefore the embeddings of different languages in the models may not be aligned very well. In this paper, we aim to improve the zero-shot cross-lingual transfer performance by proposing a pre-training task named Word-Exchange Aligning Model (WEAM), which uses the statistical alignment information as the prior knowledge to guide cross-lingual word prediction. We evaluate our model on the multilingual machine reading comprehension task MLQA and the natural language inference task XNLI. The results show that WEAM can significantly improve the zero-shot performance.
    ValNorm Quantifies Semantics to Reveal Consistent Valence Biases Across Languages and Over Centuries. (arXiv:2006.03950v4 [cs.CY] UPDATED)
    (0 min) Word embeddings learn implicit biases from linguistic regularities captured by word co-occurrence statistics. By extending methods that quantify human-like biases in word embeddings, we introduce ValNorm, a novel intrinsic evaluation task and method to quantify the valence dimension of affect in human-rated word sets from social psychology. We apply ValNorm on static word embeddings from seven languages (Chinese, English, German, Polish, Portuguese, Spanish, and Turkish) and from historical English text spanning 200 years. ValNorm achieves consistently high accuracy in quantifying the valence of non-discriminatory, non-social group word sets. Specifically, ValNorm achieves a Pearson correlation of r=0.88 for human judgment scores of valence for 399 words collected to establish pleasantness norms in English. In contrast, we measure gender stereotypes using the same set of word embeddings and find that social biases vary across languages. Our results indicate that valence associations of non-discriminatory, non-social group words represent widely-shared associations, in seven languages and over 200 years.
    AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation. (arXiv:2109.12068v1 [cs.CL])
    (0 min) Transfer learning with a unified Transformer framework (T5) that converts all language problems into a text-to-text format has recently been proposed as a simple, yet effective, transfer learning approach. Although a multilingual version of the T5 model (mT5) has been introduced, it is not clear how well it can fare on non-English tasks involving diverse data. To investigate this question, we apply mT5 on a language with a wide variety of dialects--Arabic. For evaluation, we use an existing benchmark for Arabic language understanding and introduce a new benchmark for Arabic language generation (ARGEN). We also pre-train three powerful Arabic-specific text-to-text Transformer based models and evaluate them on the two benchmarks. Our new models perform significantly better than mT5 and exceed MARBERT, the current state-of-the-art Arabic BERT-based model, on Arabic language understanding. The models also set new SOTA on the generation benchmark. Our new models are publicly released at https://github.com/UBC-NLP/araT5, and ARGEN will be released through the same repository.
    OpenAttack: An Open-source Textual Adversarial Attack Toolkit. (arXiv:2009.09191v2 [cs.CL] UPDATED)
    (0 min) Textual adversarial attacking has received wide and increasing attention in recent years. Various attack models have been proposed, which are enormously distinct and implemented with different programming frameworks and settings. These facts hinder quick utilization and fair comparison of attack models. In this paper, we present an open-source textual adversarial attack toolkit named OpenAttack to solve these issues. Compared with existing other textual adversarial attack toolkits, OpenAttack has its unique strengths in support for all attack types, multilinguality, and parallel processing. Currently, OpenAttack includes 15 typical attack models that cover all attack types. Its highly inclusive modular design not only supports quick utilization of existing attack models, but also enables great flexibility and extensibility. OpenAttack has broad uses including comparing and evaluating attack models, measuring robustness of a model, assisting in developing new attack models, and adversarial training. Source code and documentation can be obtained at https://github.com/thunlp/OpenAttack.
    SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction. (arXiv:2109.12093v1 [cs.CL])
    (0 min) Stepping from sentence-level to document-level relation extraction, the research community confronts increasing text length and more complicated entity interactions. Consequently, it is more challenging to encode the key sources of information--relevant contexts and entity types. However, existing methods only implicitly learn to model these critical information sources while being trained for relation extraction. As a result, they suffer the problems of ineffective supervision and uninterpretable model predictions. In contrast, we propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction. Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately so as to enhance interpretability. By assessing model uncertainty, SAIS further boosts the performance via evidence-based data augmentation and ensemble inference while reducing the computational cost. Eventually, SAIS delivers state-of-the-art relation extraction results on three benchmarks (DocRED, CDR, and GDA) and achieves 5.04% relative gains in F1 score compared to the runner-up in evidence retrieval on DocRED.
    Statistically significant detection of semantic shifts using contextual word embeddings. (arXiv:2104.03776v2 [cs.CL] UPDATED)
    (0 min) Detecting lexical semantic change in smaller data sets, e.g. in historical linguistics and digital humanities, is challenging due to a lack of statistical power. This issue is exacerbated by non-contextual embedding models that produce one embedding per word and, therefore, mask the variability present in the data. In this article, we propose an approach to estimate semantic shift by combining contextual word embeddings with permutation-based statistical tests. We use the false discovery rate procedure to address the large number of hypothesis tests being conducted simultaneously. We demonstrate the performance of this approach in simulation where it achieves consistently high precision by suppressing false positives. We additionally analyze real-world data from SemEval-2020 Task 1 and the Liverpool FC subreddit corpus. We show that by taking sample variation into account, we can improve the robustness of individual semantic shift estimates without degrading overall performance.
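    A minimal sketch of the general recipe described here, assuming per-occurrence contextual embeddings (e.g. BERT token vectors) have already been extracted for each target word in two time periods: compute a shift statistic per word, obtain a p-value by permuting occurrences across periods, and control the false discovery rate with the Benjamini-Hochberg procedure. The word list, embeddings, and test statistic below are placeholders, not the paper's exact choices.

```python
import numpy as np

def permutation_pvalue(emb_t1, emb_t2, n_perm=2000, rng=None):
    """Two-sample permutation test on the shift between mean contextual embeddings."""
    rng = rng or np.random.default_rng(0)
    observed = np.linalg.norm(emb_t1.mean(0) - emb_t2.mean(0))
    pooled = np.vstack([emb_t1, emb_t2])
    n1 = len(emb_t1)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        a, b = pooled[idx[:n1]], pooled[idx[n1:]]
        if np.linalg.norm(a.mean(0) - b.mean(0)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses under the BH FDR procedure."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * (np.arange(1, m + 1) / m)
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# Hypothetical contextual embeddings per word and time period.
rng = np.random.default_rng(1)
words = ["plane", "gay", "cell"]
pvals = []
for w in words:
    t1 = rng.normal(size=(40, 16))                          # occurrences in the earlier corpus
    t2 = rng.normal(size=(35, 16)) + (w == "cell") * 0.8    # artificially shift one word
    pvals.append(permutation_pvalue(t1, t2))
print(dict(zip(words, benjamini_hochberg(pvals))))
```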
    GERNERMED -- An Open German Medical NER Model. (arXiv:2109.12104v1 [cs.CL])
    (0 min) The current state of adoption of well-structured electronic health records and integration of digital methods for storing medical patient data in structured formats can often be considered inferior to the use of traditional, unstructured text-based patient data documentation. Data mining in the field of medical data analysis often needs to rely solely on processing of unstructured data to retrieve relevant data. In natural language processing (NLP), statistical models have proven successful in various tasks like part-of-speech tagging, relation extraction (RE) and named entity recognition (NER). In this work, we present GERNERMED, the first open, neural NLP model for NER tasks dedicated to detecting medical entity types in German text data. Here, we avoid the conflict between protecting sensitive patient data from training-data extraction and publishing the statistical model weights by training our model on a custom dataset translated from publicly available datasets in a foreign language by a pretrained neural machine translation model. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/GERNERMED
    Summary Explorer: Visualizing the State of the Art in Text Summarization. (arXiv:2108.01879v2 [cs.CL] UPDATED)
    (0 min) This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems by compiling the outputs of 55 state-of-the-art single document summarization approaches on three benchmark datasets, and visually exploring them during a qualitative assessment. The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulness, and position bias), encapsulated in a guided assessment based on tailored visualizations. The tool complements existing approaches for locally debugging summarization models and improves upon them. The tool is available at https://tldr.webis.de/
    Advances and Challenges in Conversational Recommender Systems: A Survey. (arXiv:2101.09459v7 [cs.IR] UPDATED)
    (0 min) Recommender systems exploit interaction history to estimate user preference, having been heavily used in a wide range of industry applications. However, static recommendation models are difficult to answer two important questions well due to inherent shortcomings: (a) What exactly does a user like? (b) Why does a user like an item? The shortcomings are due to the way that static models learn user preference, i.e., without explicit instructions and active feedback from users. The recent rise of conversational recommender systems (CRSs) changes this situation fundamentally. In a CRS, users and the system can dynamically communicate through natural language interactions, which provide unprecedented opportunities to explicitly obtain the exact preference of users. Considerable efforts, spread across disparate settings and applications, have been put into developing CRSs. Existing models, technologies, and evaluation methods for CRSs are far from mature. In this paper, we provide a systematic review of the techniques used in current CRSs. We summarize the key challenges of developing CRSs in five directions: (1) Question-based user preference elicitation. (2) Multi-turn conversational recommendation strategies. (3) Dialogue understanding and generation. (4) Exploitation-exploration trade-offs. (5) Evaluation and user simulation. These research directions involve multiple research fields like information retrieval (IR), natural language processing (NLP), and human-computer interaction (HCI). Based on these research directions, we discuss some future challenges and opportunities. We provide a road map for researchers from multiple communities to get started in this area. We hope this survey can help to identify and address challenges in CRSs and inspire future research.
    CLIPort: What and Where Pathways for Robotic Manipulation. (arXiv:2109.12098v1 [cs.RO])
    (0 min) How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data, however these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better or comparable to single-task policies.
    Indirectly Supervised English Sentence Break Prediction Using Paragraph Break Probability Estimates. (arXiv:2109.12023v1 [cs.CL])
    (0 min) This report explores the use of paragraph break probability estimates to help predict the location of sentence breaks in English natural language text. We show that a sentence break predictor based almost solely on paragraph break probability estimates can achieve high accuracy on this task. This sentence break predictor is trained almost entirely on a large amount of naturally occurring text without sentence break annotations, with only a small amount of annotated data needed to tune two hyperparameters. We also show that even better results can be achieved across in-domain and out-of-domain test data, if paragraph break probability signals are combined with a support vector machine classifier trained on a somewhat larger amount of sentence-break-annotated data. Numerous related issues are addressed along the way.
    Investigating Post-pretraining Representation Alignment for Cross-Lingual Question Answering. (arXiv:2109.12028v1 [cs.CL])
    (0 min) Human knowledge is collectively encoded in the roughly 6500 languages spoken around the world, but it is not distributed equally across languages. Hence, for information-seeking question answering (QA) systems to adequately serve speakers of all languages, they need to operate cross-lingually. In this work we investigate the capabilities of multilingually pre-trained language models on cross-lingual QA. We find that explicitly aligning the representations across languages with a post-hoc fine-tuning step generally leads to improved performance. We additionally investigate the effect of data size as well as the language choice in this fine-tuning step, also releasing a dataset for evaluating cross-lingual QA systems. Code and dataset are publicly available here: https://github.com/ffaisal93/aligned_qa
    Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting. (arXiv:2101.00416v2 [cs.CL] UPDATED)
    (0 min) In this paper, we generalize text infilling (e.g., masked language models) by proposing Sequence Span Rewriting (SSR) as a self-supervised sequence-to-sequence (seq2seq) pre-training objective. SSR provides more fine-grained learning signals for text representations by supervising the model to rewrite imperfect spans to ground truth, and it is more consistent than text infilling with many downstream seq2seq tasks that rewrite a source sentence into a target sentence. Our experiments with T5 models on various seq2seq tasks show that SSR can substantially improve seq2seq pre-training. Moreover, we observe SSR is especially helpful for improving the pre-training of a small-size seq2seq model with a powerful imperfect span generator, which indicates a new perspective of transferring knowledge from a large model to a smaller model for seq2seq pre-training.
    Faithful Target Attribute Prediction in Neural Machine Translation. (arXiv:2109.12105v1 [cs.CL])
    (2 min) The training data used in NMT is rarely controlled with respect to specific attributes, such as word casing or gender, which can cause errors in translations. We argue that predicting the target word and attributes simultaneously is an effective way to ensure that translations are more faithful to the training data distribution with respect to these attributes. Experimental results on two tasks, uppercased input translation and gender prediction, show that this strategy helps mirror the training data distribution in testing. It also facilitates data augmentation on the task of uppercased input translation.
    Rethinking Crowd Sourcing for Semantic Similarity. (arXiv:2109.11969v1 [cs.CL])
    (2 min) Estimation of semantic similarity is crucial for a variety of natural language processing (NLP) tasks. In the absence of a general theory of semantic information, many papers rely on human annotators as the source of ground truth for semantic similarity estimation. This paper investigates the ambiguities inherent in crowd-sourced semantic labeling. It shows that annotators that treat semantic similarity as a binary category (two sentences are either similar or not similar and there is no middle ground) play the most important role in the labeling. The paper offers heuristics to filter out unreliable annotators and stimulates further discussions on human perception of semantic similarity.
    CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models. (arXiv:2109.11797v1 [cs.CV])
    (2 min) Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for quantities of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, our prompt tuning approach enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). All the data and code will be available to facilitate future research.
    Unsupervised Translation of German--Lower Sorbian: Exploring Training and Novel Transfer Methods on a Low-Resource Language. (arXiv:2109.12012v1 [cs.CL])
    (2 min) This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2021 Unsupervised Machine Translation task for German--Lower Sorbian (DE--DSB): a high-resource language to a low-resource one. Our system uses a transformer encoder-decoder architecture in which we make three changes to the standard training procedure. First, our training focuses on two languages at a time, contrasting with a wealth of research on multilingual systems. Second, we introduce a novel method for initializing the vocabulary of an unseen language, achieving improvements of 3.2 BLEU for DE$\rightarrow$DSB and 4.0 BLEU for DSB$\rightarrow$DE. Lastly, we experiment with the order in which offline and online back-translation are used to train an unsupervised system, finding that using online back-translation first works better for DE$\rightarrow$DSB by 2.76 BLEU. Our submissions ranked first (tied with another team) for DSB$\rightarrow$DE and third for DE$\rightarrow$DSB.
    Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction. (arXiv:2109.12008v1 [cs.CL])
    (2 min) State-of-the-art NLP models can adopt shallow heuristics that limit their generalization capability (McCoy et al., 2019). Such heuristics include lexical overlap with the training set in Named-Entity Recognition (Taillé et al., 2020) and Event or Type heuristics in Relation Extraction (Rosenman et al., 2020). In the more realistic end-to-end RE setting, we can expect yet another heuristic: the mere retention of training relation triples. In this paper, we propose several experiments confirming that retention of known facts is a key factor of performance on standard benchmarks. Furthermore, one experiment suggests that a pipeline model able to use intermediate type representations is less prone to over-rely on retention.
    Transformers Generalize Linearly. (arXiv:2109.12036v1 [cs.CL])
    (2 min) Natural language exhibits patterns of hierarchically governed dependencies, in which relations between words are sensitive to syntactic structure rather than linear ordering. While recurrent network models often fail to generalize in a hierarchically sensitive way (McCoy et al., 2020) when trained on ambiguous data, the improvement in performance of newer Transformer language models (Vaswani et al., 2017) on a range of syntactic benchmarks trained on large data sets (Goldberg, 2019; Warstadt et al., 2019) opens the question of whether these models might exhibit hierarchical generalization in the face of impoverished data. In this paper we examine patterns of structural generalization for Transformer sequence-to-sequence models and find that not only do Transformers fail to generalize hierarchically across a wide variety of grammatical mapping tasks, but they exhibit an even stronger preference for linear generalization than comparable recurrent networks.
    DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference. (arXiv:2109.11745v1 [cs.CL])
    (2 min) Large-scale pre-trained language models have shown remarkable results in diverse NLP applications. Unfortunately, these performance gains have been accompanied by a significant increase in computation time and model size, stressing the need to develop new or complementary strategies to increase the efficiency of these models. In this paper we propose DACT-BERT, a differentiable adaptive computation time strategy for BERT-like models. DACT-BERT adds an adaptive computational mechanism to BERT's regular processing pipeline, which controls the number of Transformer blocks that need to be executed at inference time. By doing this, the model learns to combine the most appropriate intermediate representations for the task at hand. Our experiments demonstrate that our approach, when compared to the baselines, excels on a reduced computational regime and is competitive in other less restrictive ones.
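    The abstract does not spell out the halting mechanism, so the toy sketch below only illustrates the generic adaptive-computation-time idea it builds on: accumulate a per-layer halting score and stop executing Transformer blocks once the accumulated value crosses a threshold. The layer and halting head here are random stand-ins, not DACT-BERT's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(h):
    """Stand-in for one Transformer block (here a random linear map plus nonlinearity)."""
    W = rng.normal(scale=0.1, size=(h.shape[-1], h.shape[-1]))
    return np.tanh(h @ W)

def halting_score(h):
    """Stand-in for a small learned head producing a halting probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-h.mean()))

def adaptive_forward(h, num_layers=12, threshold=0.99):
    cumulative = 0.0
    for i in range(num_layers):
        h = layer(h)
        cumulative += (1.0 - cumulative) * halting_score(h)
        if cumulative >= threshold:      # confident enough: skip the remaining blocks
            return h, i + 1
    return h, num_layers

hidden = rng.normal(size=(768,))
_, used = adaptive_forward(hidden)
print(f"Executed {used} of 12 blocks")
```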
    Dense Contrastive Visual-Linguistic Pretraining. (arXiv:2109.11778v1 [cs.CV])
    (2 min) Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, they tend to suffer from the problems of noisy labels and sparse semantic annotations, since the visual features were pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces the region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask Perturbation and Intra-/Inter-Adversarial Perturbation) are developed to improve the quality of negative samples used in contrastive learning. Overall, DCVLP allows cross-modality dense region contrastive learning in a self-supervised setting independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning on multimodal representation learning.
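    As a generic illustration of region-level contrastive learning (not DCVLP's exact objective), an InfoNCE-style loss treats each region feature and its perturbed view as a positive pair and all other regions in the batch as negatives. The features and the simple additive perturbation below are placeholders; the paper's mask and adversarial perturbations would play that role instead.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss: each anchor should match its own positive among all candidates."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # positives sit on the diagonal

rng = np.random.default_rng(0)
regions = rng.normal(size=(32, 256))                          # hypothetical region features
augmented = regions + 0.05 * rng.normal(size=regions.shape)   # placeholder perturbation
print(f"contrastive loss: {info_nce(regions, augmented):.3f}")
```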
    AES Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses. (arXiv:2109.11728v1 [cs.CL])
    (2 min) Deep-learning based Automatic Essay Scoring (AES) systems are being actively used by states and language testing agencies alike to evaluate millions of candidates for life-changing decisions ranging from college applications to visa approvals. However, little research has been put into understanding and interpreting the black-box nature of deep-learning based scoring algorithms. Previous studies indicate that scoring models can be easily fooled. In this paper, we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., a large change in output score with a small change in input essay content) and overstability (i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without requiring any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that rich linguistic features such as parts-of-speech and morphology are encoded by them. Further, we also find that the models have learnt dataset biases, making them oversensitive. To deal with these issues, we propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.
    iFacetSum: Coreference-based Interactive Faceted Summarization for Multi-Document Exploration. (arXiv:2109.11621v1 [cs.CL])
    (2 min) We introduce iFacetSum, a web application for exploring topical document sets. iFacetSum integrates interactive summarization together with faceted search, by providing a novel faceted navigation scheme that yields abstractive summaries for the user's selections. This approach offers both a comprehensive overview as well as concise details regarding subtopics of choice. Fine-grained facets are automatically produced based on cross-document coreference pipelines, rendering generic concepts, entities and statements surfacing in the source texts. We analyze the effectiveness of our application through small-scale user studies, which suggest the usefulness of our approach.
    A Diversity-Enhanced and Constraints-Relaxed Augmentation for Low-Resource Classification. (arXiv:2109.11834v1 [cs.CL])
    (2 min) Data augmentation (DA) aims to generate constrained and diversified data to improve classifiers in Low-Resource Classification (LRC). Previous studies mostly use a fine-tuned Language Model (LM) to strengthen the constraints but ignore the fact that the potential of diversity could improve the effectiveness of generated data. In LRC, strong constraints but weak diversity in DA result in the poor generalization ability of classifiers. To address this dilemma, we propose a Diversity-Enhanced and Constraints-Relaxed Augmentation (DECRA). Our DECRA has two essential components on top of a transformer-based backbone model. 1) A k-beta augmentation, an essential component of DECRA, is proposed to enhance the diversity in generating constrained data. It expands the changing scope and improves the degree of complexity of the generated data. 2) A masked language model loss, instead of fine-tuning, is used as a regularization. It relaxes constraints so that the classifier can be trained with more scattered generated data. The combination of these two components generates data that can reach or approach category boundaries and hence help the classifier generalize better. We evaluate our DECRA on three public benchmark datasets under low-resource settings. Extensive experiments demonstrate that our DECRA outperforms state-of-the-art approaches by 3.8% in the overall score.
    Detect and Perturb: Neutral Rewriting of Biased and Sensitive Text via Gradient-based Decoding. (arXiv:2109.11708v1 [cs.CL])
    (2 min) Written language carries explicit and implicit biases that can distract from meaningful signals. For example, letters of reference may describe male and female candidates differently, or their writing style may indirectly reveal demographic characteristics. At best, such biases distract from the meaningful content of the text; at worst they can lead to unfair outcomes. We investigate the challenge of re-generating input sentences to 'neutralize' sensitive attributes while maintaining the semantic meaning of the original text (e.g. is the candidate qualified?). We propose a gradient-based rewriting framework, Detect and Perturb to Neutralize (DEPEN), that first detects sensitive components and masks them for regeneration, then perturbs the generation model at decoding time under a neutralizing constraint that pushes the (predicted) distribution of sensitive attributes towards a uniform distribution. Our experiments in two different scenarios show that DEPEN can regenerate fluent alternatives that are neutral in the sensitive attribute while maintaining the semantics of other attributes.
    SD-QA: Spoken Dialectal Question Answering for the Real World. (arXiv:2109.12072v1 [cs.CL])
    (2 min) Question answering (QA) systems are now available through numerous commercial applications for a wide variety of domains, serving millions of users that interact with them via speech interfaces. However, current benchmarks in QA research do not account for the errors that speech recognition models might introduce, nor do they consider the language variations (dialects) of the users. To address this gap, we augment an existing QA dataset to construct a multi-dialect, spoken QA benchmark on five languages (Arabic, Bengali, English, Kiswahili, Korean) with more than 68k audio prompts in 24 dialects from 255 speakers. We provide baseline results showcasing the real-world performance of QA systems and analyze the effect of language variety and other sensitive speaker attributes on downstream performance. Last, we study the fairness of the ASR and QA models with respect to the underlying user populations. The dataset, model outputs, and code for reproducing all our experiments are available: https://github.com/ffaisal93/SD-QA.
    Progressive Adversarial Learning for Bootstrapping: A Case Study on Entity Set Expansion. (arXiv:2109.12082v1 [cs.CL])
    (2 min) Bootstrapping has become the mainstream method for entity set expansion. Conventional bootstrapping methods mostly define the expansion boundary using seed-based distance metrics, which heavily depend on the quality of selected seeds and are hard to adjust due to the extremely sparse supervision. In this paper, we propose BootstrapGAN, a new learning method for bootstrapping which jointly models the bootstrapping process and the boundary learning process in a GAN framework. Specifically, the expansion boundaries of different bootstrapping iterations are learned via different discriminator networks; the bootstrapping network is the generator to generate new positive entities, and the discriminator networks identify the expansion boundaries by trying to distinguish the generated entities from known positive entities. By iteratively performing the above adversarial learning, the generator and the discriminators can reinforce each other and be progressively refined along the whole bootstrapping process. Experiments show that BootstrapGAN achieves the new state-of-the-art entity set expansion performance.
    Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus. (arXiv:2109.12053v1 [cs.CL])
    (2 min) The development of automated approaches to linguistic acceptability has been greatly fostered by the availability of the English CoLA corpus, which has also been included in the widely used GLUE benchmark. However, this kind of research for languages other than English, as well as the analysis of cross-lingual approaches, has been hindered by the lack of resources with a comparable size in other languages. We have therefore developed the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments, which has been created following the same approach and the same steps as the English one. In this paper we describe the corpus creation, we detail its content, and we present the first experiments on this new resource. We compare in-domain and out-of-domain classification, and perform a specific evaluation of nine linguistic phenomena. We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformer-based approaches can benefit from using sentences in two languages during fine-tuning.
    Revisiting the Uniform Information Density Hypothesis. (arXiv:2109.11635v1 [cs.CL])
    (2 min) The uniform information density (UID) hypothesis posits a preference among language users for utterances structured such that information is distributed uniformly across a signal. While its implications on language production have been well explored, the hypothesis potentially makes predictions about language comprehension and linguistic acceptability as well. Further, it is unclear how uniformity in a linguistic signal -- or lack thereof -- should be measured, and over which linguistic unit, e.g., the sentence or language level, this uniformity should hold. Here we investigate these facets of the UID hypothesis using reading time and acceptability data. While our reading time results are generally consistent with previous work, they are also consistent with a weakly super-linear effect of surprisal, which would be compatible with UID's predictions. For acceptability judgments, we find clearer evidence that non-uniformity in information density is predictive of lower acceptability. We then explore multiple operationalizations of UID, motivated by different interpretations of the original hypothesis, and analyze the scope over which the pressure towards uniformity is exerted. The explanatory power of a subset of the proposed operationalizations suggests that the strongest trend may be a regression towards a mean surprisal across the language, rather than the phrase, sentence, or document -- a finding that supports a typical interpretation of UID, namely that it is the byproduct of language users maximizing the use of a (hypothetical) communication channel.
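    One common way to operationalize (non-)uniformity, assuming per-token surprisal values from a language model are already available, is to measure how much surprisal deviates from a mean computed at some level (the unit's own mean or a language-wide average). The sketch below shows two such simple measures on made-up surprisal sequences; it is illustrative and not the paper's exact set of operationalizations.

```python
import numpy as np

def surprisal_uniformity_measures(token_surprisals, language_mean=None):
    """Simple (non-)uniformity scores over a sequence of per-token surprisals (in bits)."""
    s = np.asarray(token_surprisals, dtype=float)
    measures = {
        "mean_surprisal": s.mean(),
        "variance_within_unit": s.var(),          # deviation from the unit's own mean
    }
    if language_mean is not None:
        # Deviation from a language-wide mean surprisal, one possible reading of UID.
        measures["mse_from_language_mean"] = np.mean((s - language_mean) ** 2)
    return measures

# Hypothetical surprisals for two sentences; the second is far less uniform.
even = [4.1, 4.3, 3.9, 4.0, 4.2]
spiky = [1.2, 9.5, 2.0, 8.8, 1.5]
print(surprisal_uniformity_measures(even, language_mean=4.0))
print(surprisal_uniformity_measures(spiky, language_mean=4.0))
```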
    Text-based NP Enrichment. (arXiv:2109.12085v1 [cs.CL])
    (2 min) Understanding the relations between entities denoted by NPs in text is a critical part of human-like natural language understanding. However, only a fraction of such relations is covered by NLP tasks and models nowadays. In this work, we establish the task of text-based NP enrichment (TNE), that is, enriching each NP with all the preposition-mediated relations that hold between this and the other NPs in the text. The relations are represented as triplets, each denoting two NPs linked via a preposition. Humans recover such relations seamlessly, while current state-of-the-art models struggle with them due to the implicit nature of the problem. We build the first large-scale dataset for the problem, provide the formal framing and scope of annotation, analyze the data, and report the result of fine-tuned neural language models on the task, demonstrating the challenge it poses to current technology. We created a webpage with the data, data-exploration UI, code, models, and demo to foster further research into this challenging text understanding problem at yanaiela.github.io/TNE/.
    How Does Knowledge Graph Embedding Extrapolate to Unseen Data: a Semantic Evidence View. (arXiv:2109.11800v1 [cs.CL])
    (3 min) Knowledge Graph Embedding (KGE) aims to learn representations for entities and relations. Most KGE models have achieved great success, especially in extrapolation scenarios: given an unseen triple (h, r, t), a trained model can still correctly predict t from (h, r, ?), or h from (?, r, t), which is an impressive ability. However, most existing KGE work focuses on designing delicate triple modeling functions, which mainly tell us how to measure the plausibility of observed triples; we have limited understanding of why these methods can extrapolate to unseen data and what the important factors are that help KGE extrapolate. Therefore, in this work we study KGE extrapolation from a data-relevant view, addressing two problems: 1. How does KGE extrapolate to unseen data? 2. How can we design a KGE model with better extrapolation ability? For problem 1, we first discuss the impact factors for extrapolation and, at the relation, entity and triple levels respectively, propose three Semantic Evidences (SEs), which can be observed from the training set and provide important semantic information for extrapolation to unseen data. We then verify the effectiveness of the SEs through extensive experiments on several typical KGE methods, and demonstrate that SEs play an important role in understanding the extrapolation ability of KGE. For problem 2, to make better use of the SE information for more extrapolative knowledge representation, we propose a novel GNN-based KGE model, called Semantic Evidence aware Graph Neural Network (SE-GNN). Finally, through extensive experiments on the FB15k-237 and WN18RR datasets, we show that SE-GNN achieves state-of-the-art performance on the Knowledge Graph Completion task and exhibits better extrapolation ability.
    Lacking the embedding of a word? Look it up into a traditional dictionary. (arXiv:2109.11763v1 [cs.CL])
    (2 min) Word embeddings are powerful dictionaries, which may easily capture language variations. However, these dictionaries fail to give sense to rare words, which are surprisingly often covered by traditional dictionaries. In this paper, we propose to use definitions retrieved in traditional dictionaries to produce word embeddings for rare words. For this purpose, we introduce two methods: Definition Neural Network (DefiNNet) and Define BERT (DefBERT). In our experiments, DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods devised for producing embeddings of unknown words. In fact, DefiNNet significantly outperforms FastText, which implements a method for the same task based on n-grams, and DefBERT significantly outperforms the BERT method for OOV words. Thus, definitions in traditional dictionaries are useful for building word embeddings for rare words.
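    The simplest baseline in this spirit (not necessarily DefiNNet or DefBERT themselves) builds a vector for a rare word by composing the embeddings of the content words in its dictionary definition, e.g. by averaging. The definition text, stopword list, and random vectors below are placeholders.

```python
import numpy as np

def embedding_from_definition(definition, word_vectors,
                              stopwords=frozenset({"a", "an", "the", "of", "that", "is", "with"})):
    """Average the embeddings of the definition's content words to stand in for an OOV word."""
    tokens = [t for t in definition.lower().split()
              if t in word_vectors and t not in stopwords]
    if not tokens:
        raise ValueError("No known content words in the definition")
    return np.mean([word_vectors[t] for t in tokens], axis=0)

# Hypothetical pretrained vectors; a real system would load e.g. 300-d static embeddings.
rng = np.random.default_rng(0)
vocab = ["small", "furry", "burrowing", "mammal", "long", "ears"]
word_vectors = {w: rng.normal(size=300) for w in vocab}

oov_vec = embedding_from_definition("a small furry burrowing mammal with long ears", word_vectors)
print(oov_vec.shape)  # (300,)
```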
    Simple and Effective Zero-shot Cross-lingual Phoneme Recognition. (arXiv:2109.11680v1 [cs.CL])
    (2 min) Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
    Robustness and Sensitivity of BERT Models Predicting Alzheimer's Disease from Text. (arXiv:2109.11888v1 [cs.CL])
    (2 min) Understanding robustness and sensitivity of BERT models predicting Alzheimer's disease from text is important for both developing better classification models and for understanding their capabilities and limitations. In this paper, we analyze how a controlled amount of desired and undesired text alterations impacts performance of BERT. We show that BERT is robust to natural linguistic variations in text. On the other hand, we show that BERT is not sensitive to removing clinically important information from text.
    CSAGN: Conversational Structure Aware Graph Network for Conversational Semantic Role Labeling. (arXiv:2109.11541v1 [cs.CL])
    (2 min) Conversational semantic role labeling (CSRL) is believed to be a crucial step towards dialogue understanding. However, it remains a major challenge for existing CSRL parsers to handle conversational structural information. In this paper, we present a simple and effective architecture for CSRL which aims to address this problem. Our model is based on a conversational structure-aware graph network which explicitly encodes the speaker dependent information. We also propose a multi-task learning method to further improve the model. Experimental results on benchmark datasets show that our model with our proposed training objectives significantly outperforms previous baselines.
    Document Automation Architectures and Technologies: A Survey. (arXiv:2109.11603v1 [cs.CL])
    (2 min) This paper surveys the current state of the art in document automation (DA). The objective of DA is to reduce the manual effort during the generation of documents by automatically integrating input from different sources and assembling documents conforming to defined templates. There have been reviews of commercial solutions of DA, particularly in the legal domain, but to date there has been no comprehensive review of the academic research on DA architectures and technologies. The current survey of DA reviews the academic literature and provides a clearer definition and characterization of DA and its features, identifies state-of-the-art DA architectures and technologies in academic research, and provides ideas that can lead to new research opportunities within the DA field in light of recent advances in artificial intelligence and deep neural networks.
  • cs.CV updates on arXiv.org

    MRI-based Alzheimer's disease prediction via distilling the knowledge in multi-modal data. (arXiv:2104.03618v2 [eess.IV] UPDATED)
    (0 min) Mild cognitive impairment (MCI) conversion prediction, i.e., identifying MCI patients at high risk of converting to Alzheimer's disease (AD), is essential for preventing or slowing the progression of AD. Although previous studies have shown that the fusion of multi-modal data can effectively improve the prediction accuracy, their applications are largely restricted by the limited availability or high cost of multi-modal data. Building an effective prediction model using only magnetic resonance imaging (MRI) remains a challenging research topic. In this work, we propose a multi-modal multi-instance distillation scheme, which aims to distill the knowledge learned from multi-modal data to an MRI-based network for MCI conversion prediction. In contrast to existing distillation algorithms, the proposed multi-instance probabilities demonstrate a superior capability of representing the complicated atrophy distributions, and can guide the MRI-based network to better explore the input MRI. To the best of our knowledge, this is the first study that attempts to improve an MRI-based prediction model by leveraging extra supervision distilled from multi-modal information. Experiments demonstrate the advantage of our framework, suggesting its potential in data-limited clinical settings.
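    As background on the generic distillation ingredient only (the paper's multi-instance scheme is more involved), a student trained on MRI alone can be pushed to match the softened predictions of a multi-modal teacher via a temperature-scaled KL term combined with the usual supervised loss. The logits, temperature, and weighting below are hypothetical.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) at temperature T + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)))
    ce = -np.log(softmax(student_logits)[label] + 1e-12)
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

# Hypothetical logits for a binary conversion-vs-no-conversion prediction.
teacher = [2.5, -1.0]   # network trained on multi-modal data
student = [0.8, -0.2]   # MRI-only network being distilled
print(f"loss: {distillation_loss(student, teacher, label=0):.3f}")
```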
    Faster Convergence in Deep-Predictive-Coding Networks to Learn Deeper Representations. (arXiv:2101.06848v4 [cs.AI] UPDATED)
    (0 min) Deep-predictive-coding networks (DPCNs) are hierarchical, generative models. They rely on feed-forward and feed-back connections to modulate latent feature representations of stimuli in a dynamic and context-sensitive manner. A crucial element of DPCNs is a forward-backward inference procedure to uncover sparse, invariant features. However, this inference is a major computational bottleneck. It severely limits the network depth due to learning stagnation. Here, we prove why this bottleneck occurs. We then propose a new forward-inference strategy based on accelerated proximal gradients. This strategy has faster theoretical convergence guarantees than the one used for DPCNs. It overcomes learning stagnation. We also demonstrate that it permits constructing deep and wide predictive-coding networks. Such convolutional networks implement receptive fields that capture well the entire classes of objects on which the networks are trained. This improves the feature representations compared with our lab's previous non-convolutional and convolutional DPCNs. It yields unsupervised object recognition that surpasses convolutional autoencoders and is on par with convolutional networks trained in a supervised manner.
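    The abstract does not give the exact update rule, but "accelerated proximal gradients" usually refers to FISTA-style iterations. As a stand-in for the faster sparse inference being described, the sketch below solves a standard sparse-coding problem, min_z 0.5*||x - D z||^2 + lam*||z||_1, with Nesterov momentum and soft-thresholding; the dictionary and signal are synthetic placeholders.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista_sparse_code(x, D, lam=0.1, n_iter=200):
    """FISTA for min_z 0.5*||x - D z||^2 + lam*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    y, t = z.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)
        z_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)   # momentum (acceleration) step
        z, t = z_next, t_next
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))                               # hypothetical dictionary
x = D @ (rng.normal(size=256) * (rng.random(256) < 0.05))    # signal with a sparse code
z_hat = fista_sparse_code(x, D)
print(f"non-zeros in recovered code: {int(np.count_nonzero(z_hat))}")
```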
    Multi-StyleGAN: Towards Image-Based Simulation of Time-Lapse Live-Cell Microscopy. (arXiv:2106.08285v3 [cs.CV] UPDATED)
    (0 min) Time-lapse fluorescent microscopy (TLFM) combined with predictive mathematical modelling is a powerful tool to study the inherently dynamic processes of life on the single-cell level. Such experiments are costly, complex and labour intensive. A complementary approach, and a step towards in silico experimentation, is to synthesise the imagery itself. Here, we propose Multi-StyleGAN as a descriptive approach to simulate time-lapse fluorescence microscopy imagery of living cells, based on a past experiment. This novel generative adversarial network synthesises a multi-domain sequence of consecutive timesteps. We showcase Multi-StyleGAN on imagery of multiple live yeast cells in microstructured environments and train on a dataset recorded in our laboratory. The simulation captures underlying biophysical factors and time dependencies, such as cell morphology, growth, physical interactions, as well as the intensity of a fluorescent reporter protein. An immediate application is to generate additional training and validation data for feature extraction algorithms or to aid and expedite development of advanced experimental techniques such as online monitoring or control of cells. Code and dataset are available at https://git.rwth-aachen.de/bcs/projects/tp/multi-stylegan.
    White blood cell subtype detection and classification. (arXiv:2108.04614v2 [cs.CV] UPDATED)
    (0 min) Machine learning has endless applications in the health care industry. White blood cell classification is one of the interesting and promising areas of research, and the classification of white blood cells plays an important part in medical diagnosis. In practice, white blood cell classification is performed by a haematologist, who takes a small smear of blood and carefully examines it under the microscope. The current procedures for identifying white blood cell subtypes are time-consuming and error-prone. Computer-aided detection and diagnosis of white blood cells helps avoid human error and reduces the time taken to classify them. In recent years, several deep learning approaches have been developed for white blood cell classification that are able to identify the cells but are unable to localize their positions in the blood cell image. Following this, the present research proposes to utilize the YOLOv3 object detection technique to localize and classify white blood cells with bounding boxes. With exhaustive experimental analysis, the proposed work is found to detect white blood cells with 99.2% accuracy and classify them with 90% accuracy.
    DeepStroke: An Efficient Stroke Screening Framework for Emergency Rooms with Multimodal Adversarial Deep Learning. (arXiv:2109.12065v1 [cs.CV])
    (0 min) In an emergency room (ER) setting, the diagnosis of stroke is a common challenge. Due to excessive execution time and cost, an MRI scan is usually not available in the ER. Clinical tests are commonly referred to in stroke screening, but neurologists may not be immediately available. We propose a novel multimodal deep learning framework, DeepStroke, to achieve computer-aided stroke presence assessment by recognizing the patterns of facial motion incoordination and speech inability for patients with suspicion of stroke in an acute setting. Our proposed DeepStroke takes video data for local facial paralysis detection and audio data for global speech disorder analysis. It further leverages a multi-modal lateral fusion to combine the low- and high-level features and provides mutual regularization for joint training. A novel adversarial training loss is also introduced to obtain identity-independent and stroke-discriminative features. Experiments on our video-audio dataset with actual ER patients show that the proposed approach outperforms state-of-the-art models and achieves better performance than ER doctors, attaining a 6.60% higher sensitivity and maintaining 4.62% higher accuracy when specificity is aligned. Meanwhile, each assessment can be completed in less than 6 minutes, demonstrating the framework's great potential for clinical implementation.
    ReGenMorph: Visibly Realistic GAN Generated Face Morphing Attacks by Attack Re-generation. (arXiv:2108.09130v2 [cs.CV] UPDATED)
    (0 min) Face morphing attacks aim at creating face images that are verifiable to be the face of multiple identities, which can lead to building faulty identity links in operations like border checks. When creating a morphed face detector (MFD), training on all possible attack types is essential to achieve good detection performance. Therefore, investigating new methods of creating morphing attacks drives the generalizability of MFDs. Morphing attacks have been created on the image level, by landmark interpolation, or on the latent-space level, by manipulating latent vectors in a generative adversarial network. The former results in varying blending artifacts and the latter in synthetic-like striping artifacts. This work presents the novel morphing pipeline ReGenMorph, which eliminates the LMA blending artifacts by using a GAN-based generation, as well as the manipulation in the latent space, resulting in visibly realistic morphed images compared to previous works. The generated ReGenMorph appearance is compared to recent morphing approaches and evaluated for face recognition vulnerability and attack detectability, whether as known or unknown attacks.
    MISA: Online Defense of Trojaned Models using Misattributions. (arXiv:2103.15918v2 [cs.CR] UPDATED)
    (0 min) Recent studies have shown that neural networks are vulnerable to Trojan attacks, where a network is trained to respond to specially crafted trigger patterns in the inputs in specific and potentially malicious ways. This paper proposes MISA, a new online approach to detect Trojan triggers for neural networks at inference time. Our approach is based on a novel notion called misattributions, which captures the anomalous manifestation of a Trojan activation in the feature space. Given an input image and the corresponding output prediction, our algorithm first computes the model's attribution on different features. It then statistically analyzes these attributions to ascertain the presence of a Trojan trigger. Across a set of benchmarks, we show that our method can effectively detect Trojan triggers for a wide variety of trigger patterns, including several recent ones for which there are no known defenses. Our method achieves 96% AUC for detecting images that include a Trojan trigger without any assumptions on the trigger pattern.
    Boosting Light-Weight Depth Estimation Via Knowledge Distillation. (arXiv:2105.06143v2 [cs.CV] UPDATED)
    (0 min) The advanced performance of depth estimation is achieved by employing large and complex neural networks. While performance is still being continuously improved, we argue that depth estimation also has to be efficient, since efficiency is a prerequisite for real-world applications. However, fast depth estimation tends to lower performance due to the trade-off between model capacity and accuracy. In this paper, we aim to achieve accurate depth estimation with a light-weight network. To this end, we first introduce a highly compact network that can estimate a depth map in real time. We then develop a knowledge distillation paradigm to further improve the performance. We observe that many real-world scenarios share the same scene scales, yielding similar depth histograms, so they are potentially valuable for developing a better learning strategy. Therefore, we propose to employ auxiliary unlabeled/labeled data to improve knowledge distillation. Through extensive and rigorous experiments, we show that our method can achieve comparable performance against state-of-the-art methods with only 1% of the parameters, and outperforms previous light-weight methods in terms of inference accuracy, computational efficiency and generalizability.
    Beyond the Hausdorff Metric in Digital Topology. (arXiv:2108.03114v2 [cs.CG] UPDATED)
    (0 min) Two objects may be close in the Hausdorff metric, yet have very different geometric and topological properties. We examine other methods of comparing digital images such that objects close in each of these measures have some similar geometric or topological property. Such measures may be combined with the Hausdorff metric to yield a metric in which close images are similar with respect to multiple properties.
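    For reference, the Hausdorff distance between two non-empty compact subsets A and B of a metric space (X, d), which this work takes as its starting point, is

```latex
d_H(A, B) = \max\Bigl\{\, \sup_{a \in A}\, \inf_{b \in B} d(a, b),\;\; \sup_{b \in B}\, \inf_{a \in A} d(a, b) \Bigr\}
```

    Two digital images can make this quantity small while still differing in properties such as connectivity, which is the kind of mismatch the additional comparison measures discussed above are meant to expose.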
    GSIP: Green Semantic Segmentation of Large-Scale Indoor Point Clouds. (arXiv:2109.11835v1 [cs.CV])
    (0 min) An efficient solution to semantic segmentation of large-scale indoor scene point clouds is proposed in this work. It is named GSIP (Green Segmentation of Indoor Point clouds) and its performance is evaluated on a representative large-scale benchmark -- the Stanford 3D Indoor Segmentation (S3DIS) dataset. GSIP has two novel components: 1) a room-style data pre-processing method that selects a proper subset of points for further processing, and 2) a new feature extractor which is extended from PointHop. For the former, sampled points of each room form an input unit. For the latter, the weaknesses of PointHop's feature extraction when extending it to large-scale point clouds are identified and fixed with a simpler processing pipeline. As compared with PointNet, which is a pioneering deep-learning-based solution, GSIP is green since it has significantly lower computational complexity and a much smaller model size. Furthermore, experiments show that GSIP outperforms PointNet in segmentation performance for the S3DIS dataset.
    CLIPort: What and Where Pathways for Robotic Manipulation. (arXiv:2109.12098v1 [cs.RO])
    (0 min) How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data, however these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better or comparable to single-task policies.
    Beyond Triplet Loss: Meta Prototypical N-tuple Loss for Person Re-identification. (arXiv:2006.04991v2 [cs.CV] UPDATED)
    (0 min) Person Re-identification (ReID) aims at matching a person of interest across images. In convolutional neural network (CNN) based approaches, loss design plays a vital role in pulling closer features of the same identity and pushing far apart features of different identities. In recent years, triplet loss achieves superior performance and is predominant in ReID. However, triplet loss considers only three instances of two classes in per-query optimization (with an anchor sample as query) and it is actually equivalent to a two-class classification. There is a lack of loss design which enables the joint optimization of multiple instances (of multiple classes) within per-query optimization for person ReID. In this paper, we introduce a multi-class classification loss, i.e., N-tuple loss, to jointly consider multiple (N) instances for per-query optimization. This in fact aligns better with the ReID test/inference process, which conducts the ranking/comparisons among multiple instances. Furthermore, for more efficient multi-class classification, we propose a new meta prototypical N-tuple loss. With the multi-class classification incorporated, our model achieves the state-of-the-art performance on the benchmark person ReID datasets.
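    In the multi-class reading described here, the per-query objective can be written as a softmax cross-entropy over the similarities between an anchor (query) feature and N class representatives, rather than over a single positive and negative. The sketch below is a generic version of such an N-way loss with one prototype per identity; it is not claimed to be the paper's exact meta prototypical formulation.

```python
import numpy as np

def n_tuple_loss(anchor, prototypes, positive_index, temperature=0.1):
    """Softmax cross-entropy over anchor-prototype similarities (one prototype per identity)."""
    a = anchor / np.linalg.norm(anchor)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = P @ a / temperature
    logits -= logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_index]

rng = np.random.default_rng(0)
num_identities, dim = 8, 128
prototypes = rng.normal(size=(num_identities, dim))      # e.g. a mean feature per identity
anchor = prototypes[3] + 0.1 * rng.normal(size=dim)      # query close to identity 3
print(f"loss: {n_tuple_loss(anchor, prototypes, positive_index=3):.3f}")
```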
    Towards Autonomous Crop-Agnostic Visual Navigation in Arable Fields. (arXiv:2109.11936v1 [cs.RO])
    (0 min) Autonomous navigation of a robot in agricultural fields is essential for every task from crop monitoring through to weed management and fertilizer application. Many current approaches rely on accurate GPS; however, such technology is expensive and also prone to failure (e.g. through lack of coverage). As such, navigation through sensors that can interpret their environment (such as cameras) is important to achieve the goal of autonomy in agriculture. In this paper, we introduce a purely vision-based navigation scheme which is able to reliably guide the robot through row-crop fields. Independent of any global localization or mapping, this approach is able to accurately follow the crop-rows and switch between the rows, only using on-board cameras. With the help of a novel crop-row detection and a novel crop-row switching technique, our navigation scheme can be deployed in a wide range of fields with different canopy types in various growth stages. We have extensively tested our approach in five different fields under various illumination conditions using our agricultural robotic platform (BonnBot-I). Our evaluations show that we achieved a navigation accuracy of 3.82 cm over five different crop fields.
    Is First Person Vision Challenging for Object Tracking?. (arXiv:2011.12263v2 [cs.CV] UPDATED)
    (0 min) Understanding human-object interactions is fundamental in First Person Vision (FPV). Tracking algorithms which follow the objects manipulated by the camera wearer can provide useful cues to effectively model such interactions. Despite a few previous attempts to exploit trackers in FPV applications, a methodical analysis of the performance of state-of-the-art visual trackers in this domain is still missing. In this short paper, we provide a recap of the first systematic study of object tracking in FPV. Our work extensively analyses the performance of recent and baseline FPV trackers with respect to different aspects. This is achieved through TREK-150, a novel benchmark dataset composed of 150 densely annotated video sequences. The results suggest that more research efforts should be devoted to this problem so that tracking could benefit FPV tasks. The full version of this paper is available at arXiv:2108.13665.
    Two-Stage Mesh Deep Learning for Automated Tooth Segmentation and Landmark Localization on 3D Intraoral Scans. (arXiv:2109.11941v1 [cs.CV])
    (0 min) Accurately segmenting teeth and identifying the corresponding anatomical landmarks on dental mesh models are essential in computer-aided orthodontic treatment. Manually performing these two tasks is time-consuming, tedious, and, more importantly, highly dependent on orthodontists' experiences due to the abnormality and large-scale variance of patients' teeth. Some machine learning-based methods have been designed and applied in the orthodontic field to automatically segment dental meshes (e.g., intraoral scans). In contrast, the number of studies on tooth landmark localization is still limited. This paper proposes a two-stage framework based on mesh deep learning (called TS-MDL) for joint tooth labeling and landmark identification on raw intraoral scans. Our TS-MDL first adopts an end-to-end iMeshSegNet method (i.e., a variant of the existing MeshSegNet with both improved accuracy and efficiency) to label each tooth on the downsampled scan. Guided by the segmentation outputs, our TS-MDL further selects each tooth's region of interest (ROI) on the original mesh to construct a light-weight variant of the pioneering PointNet (i.e., PointNet-Reg) for regressing the corresponding landmark heatmaps. Our TS-MDL was evaluated on a real-clinical dataset, showing promising segmentation and localization performance. Specifically, iMeshSegNet in the first stage of TS-MDL reached an average Dice similarity coefficient (DSC) of 0.953 ± 0.076, significantly outperforming the original MeshSegNet. In the second stage, PointNet-Reg achieved a mean absolute error (MAE) of 0.623 ± 0.718 mm in distances between the prediction and ground truth for 44 landmarks, which is superior compared with other networks for landmark detection. All these results suggest the potential usage of our TS-MDL in clinical practices.
    CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models. (arXiv:2109.11797v1 [cs.CV])
    (0 min) Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for quantities of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, our prompt tuning approach enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). All the data and code will be available to facilitate future research.
    Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives. (arXiv:2003.08854v3 [cs.RO] UPDATED)
    (0 min) Visuomotor control (VMC) is an effective means of achieving basic manipulation tasks such as pushing or pick-and-place from raw images. Conditioning VMC on desired goal states is a promising way of achieving versatile skill primitives. However, common conditioning schemes either rely on task-specific fine tuning - e.g. using one-shot imitation learning (IL) - or on sampling approaches using a forward model of scene dynamics i.e. model-predictive control (MPC), leaving deployability and planning horizon severely limited. In this paper we propose a conditioning scheme which avoids these pitfalls by learning the controller and its conditioning in an end-to-end manner. Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion and the distance to a given target observation. In contrast to related works, this enables our approach to efficiently perform complex manipulation tasks from raw image observations without predefined control primitives or test time demonstrations. We report significant improvements in task success over representative MPC and IL baselines. We also demonstrate our model's generalisation capabilities in challenging, unseen tasks featuring visual noise, cluttered scenes and unseen object geometries.
    ImplicitVol: Sensorless 3D Ultrasound Reconstruction with Deep Implicit Representation. (arXiv:2109.12108v1 [eess.IV])
    (0 min) The objective of this work is to achieve sensorless reconstruction of a 3D volume from a set of 2D freehand ultrasound images with deep implicit representation. In contrast to the conventional way that represents a 3D volume as a discrete voxel grid, we do so by parameterizing it as the zero level-set of a continuous function, i.e. implicitly representing the 3D volume as a mapping from the spatial coordinates to the corresponding intensity values. Our proposed model, termed ImplicitVol, takes a set of 2D scans and their estimated locations in 3D as input, jointly refining the estimated 3D locations and learning a full reconstruction of the 3D volume. When testing on real 2D ultrasound images, novel cross-sectional views sampled from ImplicitVol show significantly better visual quality than those sampled from existing reconstruction approaches, outperforming them by over 30% (NCC and SSIM) between the output and the ground-truth 3D volume on the test data. The code will be made publicly available.
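    The coordinate-to-intensity mapping above can be illustrated with a small coordinate MLP. The following is a minimal sketch under our own assumptions (architecture, loss, and training loop are illustrative, not the authors' ImplicitVol, which also refines the slice poses jointly):

```python
# Hedged sketch: an implicit volume as a coordinate MLP mapping (x, y, z) -> intensity,
# fitted from pixels of 2D slices with (estimated) 3D locations. Not the authors' code.
import torch
import torch.nn as nn

class ImplicitIntensity(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # intensity in [0, 1]
        )

    def forward(self, xyz):                       # xyz: (N, 3) spatial coordinates
        return self.net(xyz).squeeze(-1)

model = ImplicitIntensity()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
coords = torch.rand(1024, 3)                      # placeholder: sampled 3D pixel locations
target = torch.rand(1024)                         # placeholder: observed intensities
for _ in range(10):                               # toy fitting loop
    loss = torch.nn.functional.mse_loss(model(coords), target)
    opt.zero_grad(); loss.backward(); opt.step()
```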
    Learning-based Noise Component Map Estimation for Image Denoising. (arXiv:2109.11877v1 [eess.IV])
    (0 min) The problem of image denoising when images are corrupted by non-stationary noise is considered in this paper. Since in practice no a priori information on the noise is available, noise statistics should be pre-estimated for image denoising. In this paper, a deep convolutional neural network (CNN)-based method for estimating a map of local, patch-wise standard deviations of noise (the so-called sigma-map) is proposed. It achieves state-of-the-art accuracy in estimating the sigma-map for the case of non-stationary noise, as well as in estimating the noise variance for the case of additive white Gaussian noise. Extensive experiments on image denoising using estimated sigma-maps demonstrate that our method outperforms recent CNN-based blind image denoising methods by up to 6 dB in PSNR, as well as other state-of-the-art methods based on sigma-map estimation by up to 0.5 dB, while at the same time providing better usage flexibility. Comparison with the ideal case, when denoising is applied using the ground-truth sigma-map, shows that the difference in the corresponding PSNR values for most noise levels is within 0.1-0.2 dB and does not exceed 0.6 dB.
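    A sigma-map regressor can be sketched as a fully convolutional network that outputs one non-negative value per pixel. The architecture below is an assumption for illustration only, not the paper's network:

```python
# Hedged sketch: a small fully convolutional net regressing a per-pixel noise sigma-map.
import torch
import torch.nn as nn

class SigmaMapNet(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Softplus(),   # sigma must be non-negative
        )

    def forward(self, noisy):           # noisy: (B, 1, H, W)
        return self.body(noisy)         # predicted sigma-map: (B, 1, H, W)

net = SigmaMapNet()
sigma_map = net(torch.randn(1, 1, 64, 64))
print(sigma_map.shape)                  # torch.Size([1, 1, 64, 64])
```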
    Few-shot Learning Based on Multi-stage Transfer and Class-Balanced Loss for Diabetic Retinopathy Grading. (arXiv:2109.11806v1 [eess.IV])
    (0 min) Diabetic retinopathy (DR) is one of the major blindness-causing diseases currently known. Automatic grading of DR using deep learning methods not only speeds up the diagnosis of the disease but also reduces the rate of misdiagnosis. However, problems such as insufficient samples and imbalanced class distribution in DR datasets have constrained the improvement of grading performance. In this paper, we introduce the idea of multi-stage transfer into the grading task of DR. The new transfer learning technique leverages multiple datasets with different scales to enable the model to learn more feature representation information. Meanwhile, to cope with imbalanced DR datasets, we present a class-balanced loss function that performs well in natural image classification tasks, and adopt a simple and easy-to-implement training method for it. The experimental results show that the application of multi-stage transfer and class-balanced loss function can effectively improve the grading performance metrics such as accuracy and quadratic weighted kappa. In fact, our method has outperformed two state-of-the-art methods and achieved the best result on the DR grading task of IDRiD Sub-Challenge 2.
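    One widely used class-balanced loss from natural image classification is the "effective number of samples" weighting (Cui et al., 2019); whether this is the exact variant used in the paper is an assumption, but it illustrates the idea of down-weighting over-represented grades:

```python
# Sketch of a class-balanced cross-entropy with effective-number weights.
# The weighting w_c is proportional to (1 - beta) / (1 - beta^n_c) for class counts n_c.
import torch
import torch.nn.functional as F

def class_balanced_ce(logits, targets, samples_per_class, beta=0.9999):
    n = torch.as_tensor(samples_per_class, dtype=torch.float, device=logits.device)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))
    weights = weights / weights.sum() * len(n)          # normalize to sum to #classes
    return F.cross_entropy(logits, targets, weight=weights)

logits = torch.randn(8, 5)                              # 8 samples, 5 DR grades
targets = torch.randint(0, 5, (8,))
loss = class_balanced_ce(logits, targets, samples_per_class=[800, 200, 150, 60, 30])
```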
    A 3D Mesh-based Lifting-and-Projection Network for Human Pose Transfer. (arXiv:2109.11719v1 [cs.CV])
    (0 min) Human pose transfer has typically been modeled as a 2D image-to-image translation problem. This formulation ignores the human body shape prior in 3D space and inevitably causes implausible artifacts, especially when facing occlusion. To address this issue, we propose a lifting-and-projection framework to perform pose transfer in the 3D mesh space. The core of our framework is a foreground generation module, which consists of two novel networks: a lifting-and-projection network (LPNet) and an appearance detail compensating network (ADCNet). To leverage the human body shape prior, LPNet exploits the topological information of the body mesh to learn an expressive visual representation for the target person in the 3D mesh space. To preserve texture details, ADCNet is further introduced to enhance the feature produced by LPNet with the source foreground image. This design of the foreground generation module enables the model to better handle difficult cases such as those with occlusions. Experiments on the iPER and Fashion datasets empirically demonstrate that the proposed lifting-and-projection framework is effective and outperforms the existing image-to-image-based and mesh-based methods on the human pose transfer task in both self-transfer and cross-transfer settings.
    Tackling Inter-Class Similarity and Intra-Class Variance for Microscopic Image-based Classification. (arXiv:2109.11891v1 [cs.CV])
    (0 min) Automatic classification of aquatic microorganisms is based on the morphological features extracted from individual images. Current works on their classification do not consider the inter-class similarity and intra-class variance that cause misclassification. We are particularly interested in the case where variance within a class occurs due to discrete visual changes in microscopic images. In this paper, we propose to account for it by partitioning the classes with high variance based on the visual features. Our algorithm automatically decides the optimal number of sub-classes to be created and considers each of them as a separate class for training. This way, the network learns finer-grained visual features. Our experiments on two databases of freshwater benthic diatoms and marine plankton show that our method can outperform the state-of-the-art approaches for classification of these aquatic microorganisms.
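    The sub-class partitioning step can be sketched as clustering the features of a high-variance class and relabeling its samples with fresh pseudo-class ids. The implementation below is an assumed illustration (the paper's criterion for choosing the number of sub-classes is not reproduced):

```python
# Hedged sketch: split one high-variance class into k-means sub-classes for training.
import numpy as np
from sklearn.cluster import KMeans

def split_high_variance_class(features, labels, class_id, n_subclasses, next_free_label):
    """Relabel samples of `class_id` into `n_subclasses` pseudo-classes via k-means."""
    idx = np.where(labels == class_id)[0]
    clusters = KMeans(n_clusters=n_subclasses, n_init=10).fit_predict(features[idx])
    new_labels = labels.copy()
    new_labels[idx] = next_free_label + clusters        # sub-classes get fresh label ids
    return new_labels

feats = np.random.rand(100, 64)                          # placeholder visual features
labels = np.random.randint(0, 4, size=100)
labels_fine = split_high_variance_class(feats, labels, class_id=2,
                                         n_subclasses=3, next_free_label=4)
```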
    Fine-Grained Image Generation from Bangla Text Description using Attentional Generative Adversarial Network. (arXiv:2109.11749v1 [cs.CV])
    (0 min) Generating fine-grained, realistic images from text has many applications in the visual and semantic realm. Considering that, we propose a Bangla Attentional Generative Adversarial Network (AttnGAN) that allows intensified, multi-stage processing for high-resolution Bangla text-to-image generation. Our model can integrate the most specific details at different sub-regions of the image. We distinctively concentrate on the relevant words in the natural language description. This framework has achieved a better inception score on the CUB dataset. For the first time, a fine-grained image is generated from Bangla text using an attentional GAN. Bangla ranks 7th among the 100 most spoken languages, which motivates us to focus explicitly on this language and serve the needs of its many speakers. Moreover, Bangla has a more complex syntactic structure and fewer natural language processing resources, which further underscores the value of this work.
    Localizing Infinity-shaped fishes: Sketch-guided object localization in the wild. (arXiv:2109.11874v1 [cs.CV])
    (0 min) This work investigates the problem of sketch-guided object localization (SGOL), where human sketches are used as queries to conduct object localization in natural images. In this cross-modal setting, we first contribute a tough-to-beat baseline that, without any specific SGOL training, is able to outperform previous works on a fixed set of classes. The baseline is useful to analyze the performance of SGOL approaches based on available simple yet powerful methods. We advance prior art by proposing a sketch-conditioned DETR (DEtection TRansformer) architecture which avoids a hard classification and alleviates the domain gap between sketches and images to localize object instances. Although the main goal of SGOL is focused on object detection, we explored its natural extension to sketch-guided instance segmentation. This novel task allows us to move towards identifying objects at the pixel level, which is of key importance in several applications. We experimentally demonstrate that our model and its variants significantly advance over previous state-of-the-art results. All training and testing code of our model will be released to facilitate future research (https://github.com/priba/sgol_wild).
    Keypoints-Based Deep Feature Fusion for Cooperative Vehicle Detection of Autonomous Driving. (arXiv:2109.11615v1 [cs.CV])
    (0 min) Sharing collective perception messages (CPMs) between vehicles is investigated to decrease occlusions, so as to improve perception accuracy and the safety of autonomous driving. However, highly accurate data sharing with low communication overhead is a big challenge for collective perception, especially when real-time communication is required among connected and automated vehicles. In this paper, we propose an efficient and effective keypoints-based deep feature fusion framework, called FPV-RCNN, for collective perception, which is built on top of the 3D object detector PV-RCNN. We introduce a bounding box proposal matching module and a keypoints selection strategy to compress the CPM size and solve the multi-vehicle data fusion problem. Compared to a bird's-eye view (BEV) keypoints feature fusion, FPV-RCNN achieves improved detection accuracy by about 14% at a high evaluation criterion (IoU 0.7) on COMAP, a synthetic dataset dedicated to collective perception. Also, its performance is comparable to two raw data fusion baselines that have no data loss in sharing. Moreover, our method also significantly decreases the CPM size to less than 0.3 KB, which is about 50 times smaller than the BEV feature map sharing used in previous works. Even with a further decreased number of CPM feature channels, i.e., from 128 to 32, the detection performance only drops by about 1%. The code of our method is available at https://github.com/YuanYunshuang/FPV_RCNN.
    RSDet++: Point-based Modulated Loss for More Accurate Rotated Object Detection. (arXiv:2109.11906v1 [cs.CV])
    (0 min) We classify the discontinuity of loss in both five-parameter and eight-parameter rotated object detection methods as rotation sensitivity error (RSE), which results in performance degradation. We introduce a novel modulated rotation loss to alleviate the problem and propose a rotation sensitivity detection network (RSDet), which consists of an eight-parameter single-stage rotated object detector and the modulated rotation loss. Our proposed RSDet has several advantages: 1) it reformulates the rotated object detection problem as predicting the corners of objects, while most previous methods employ a five-parameter regression with different measurement units; 2) the modulated rotation loss achieves consistent improvement on both five-parameter and eight-parameter rotated object detection methods by solving the discontinuity of loss. To further improve the accuracy of our method on objects smaller than 10 pixels, we introduce a novel RSDet++, which consists of a point-based anchor-free rotated object detector and a modulated rotation loss. Extensive experiments demonstrate the effectiveness of both RSDet and RSDet++, which achieve competitive results on rotated object detection on the challenging benchmarks DOTA1.0, DOTA1.5, and DOTA2.0. We hope the proposed method can provide a new perspective for designing algorithms to solve rotated object detection and draw more attention to tiny objects. The code and models are available at: https://github.com/yangxue0827/RotationDetection.
    Lifelong 3D Object Recognition and Grasp Synthesis Using Dual Memory Recurrent Self-Organization Networks. (arXiv:2109.11544v1 [cs.RO])
    (0 min) Humans learn to recognize and manipulate new objects in lifelong settings without forgetting previously gained knowledge under non-stationary and sequential conditions. Autonomous agents likewise need to exhibit similar behavior, continually learning new object categories and adapting to new environments. In most conventional deep neural networks, this is not possible due to the problem of catastrophic forgetting, where newly gained knowledge overwrites existing representations. Furthermore, most state-of-the-art models excel either in recognizing objects or in grasp prediction, even though both tasks use visual input; combined architectures that tackle both tasks remain very limited. In this paper, we propose a hybrid model architecture consisting of a dynamically growing dual-memory recurrent neural network (GDM) and an autoencoder to tackle object recognition and grasping simultaneously. The autoencoder network is responsible for extracting a compact representation of a given object, which serves as input for GDM learning, and for predicting pixel-wise antipodal grasp configurations. The GDM part is designed to recognize the object at both the instance and category levels. We address the problem of catastrophic forgetting using intrinsic memory replay, where the episodic memory periodically replays the neural activation trajectories in the absence of external sensory information. To extensively evaluate the proposed model in a lifelong setting, we generate a synthetic dataset due to the lack of a sequential 3D object dataset. Experimental results demonstrate that the proposed model can learn both object representation and grasping simultaneously in continual learning scenarios.
    ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation. (arXiv:2109.12066v1 [cs.CV])
    (0 min) Real-world object sampling produces long-tailed distributions requiring exponentially more images for rare types. Zero-shot detection, which aims to detect unseen objects, is one direction to address this problem. A dataset such as COCO is extensively annotated across many images but with a sparse number of categories, and annotating all object classes across a diverse domain is expensive and challenging. To advance zero-shot detection, we develop a Vision-Language distillation method that aligns both image and text embeddings from a zero-shot pre-trained model such as CLIP to a modified semantic prediction head from a one-stage detector like YOLOv5. With this method, we are able to train an object detector that achieves state-of-the-art accuracy on the COCO zero-shot detection splits with fewer model parameters. During inference, our model can be adapted to detect any number of object classes without additional training. We also find that the improvements provided by the scaling of our method are consistent across various YOLOv5 scales. Furthermore, we develop a self-training method that provides a significant score improvement without needing extra images or labels.
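    The alignment idea can be sketched as pulling the detector's region embeddings towards the frozen CLIP text embeddings of their class names. The loss below is an illustrative assumption (the actual ZSD-YOLO head and losses differ in detail):

```python
# Hedged sketch of vision-language alignment distillation via cosine similarity.
import torch
import torch.nn.functional as F

def alignment_loss(region_embeds, class_ids, text_embeds):
    """region_embeds: (N, D) from the detector's semantic head,
       text_embeds:   (C, D) frozen CLIP text embeddings, one per class name."""
    region = F.normalize(region_embeds, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    target = text[class_ids]                              # (N, D) matching text embedding
    return (1.0 - (region * target).sum(dim=-1)).mean()   # 1 - cosine similarity

loss = alignment_loss(torch.randn(16, 512), torch.randint(0, 80, (16,)),
                      torch.randn(80, 512))
```

    At inference, classification then reduces to comparing region embeddings against text embeddings of whatever class names are supplied, which is why no retraining is needed for new classes.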
    Frequency Pooling: Shift-Equivalent and Anti-Aliasing Downsampling. (arXiv:2109.11839v1 [cs.CV])
    (0 min) Convolution utilizes a shift-equivalent prior of images, thus leading to great success in image processing tasks. However, commonly used poolings in convolutional neural networks (CNNs), such as max-pooling, average-pooling, and strided-convolution, are not shift-equivalent. Thus, the shift-equivalence of CNNs is destroyed when convolutions and poolings are stacked. Moreover, anti-aliasing is another essential property of poolings from the perspective of signal processing. However, recent poolings are neither shift-equivalent nor anti-aliasing. To address this issue, we propose a new pooling method that is shift-equivalent and anti-aliasing, named frequency pooling. Frequency pooling first transforms the features into the frequency domain, and then removes the frequency components beyond the Nyquist frequency. Finally, it transforms the features back to the spatial domain. We prove that frequency pooling is shift-equivalent and anti-aliasing based on the property of Fourier transform and Nyquist frequency. Experiments on image classification show that frequency pooling improves accuracy and robustness with respect to the shifts of CNNs.
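    The core operation can be sketched as an ideal low-pass filter in the Fourier domain followed by resampling at the lower resolution. The code below is a simplified illustration of that idea, not the paper's implementation:

```python
# Hedged sketch: downsample by cropping the centered 2D spectrum to the new Nyquist band.
import torch

def frequency_pool(x, factor=2):
    """Downsample (B, C, H, W) by `factor`: keep only frequencies below the new
    Nyquist limit, then transform back to the spatial domain at the smaller size."""
    B, C, H, W = x.shape
    h, w = H // factor, W // factor
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))        # centered spectrum
    top, left = (H - h) // 2, (W - w) // 2
    cropped = spec[..., top:top + h, left:left + w]                   # drop high frequencies
    pooled = torch.fft.ifft2(torch.fft.ifftshift(cropped, dim=(-2, -1))).real
    return pooled / (factor * factor)                                 # keep amplitudes comparable

y = frequency_pool(torch.randn(1, 3, 32, 32))
print(y.shape)   # torch.Size([1, 3, 16, 16])
```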
    Kronecker CP Decomposition with Fast Multiplication for Compressing RNNs. (arXiv:2008.09342v2 [cs.CV] UPDATED)
    (0 min) Recurrent neural networks (RNNs) are powerful in tasks oriented to sequential data, such as natural language processing and video recognition. However, since modern RNNs, including long short-term memory (LSTM) and gated recurrent unit (GRU) networks, have complex topologies and expensive space/computation complexity, compressing them has become a hot and promising topic in recent years. Among the many compression methods, tensor decomposition, e.g., tensor train (TT), block term (BT), tensor ring (TR) and hierarchical Tucker (HT), appears to be the most promising approach since a very high compression ratio might be obtained. Nevertheless, none of these tensor decomposition formats can provide both space and computation efficiency. In this paper, we consider compressing RNNs based on a novel Kronecker CANDECOMP/PARAFAC (KCP) decomposition, which is derived from Kronecker tensor (KT) decomposition, by proposing two fast algorithms for multiplication between the input and the tensor-decomposed weight. According to our experiments on the UCF11, Youtube Celebrities Face and UCF50 datasets, the proposed KCP-RNNs have accuracy comparable with those in other tensor-decomposed formats, and even a 278,219x compression ratio can be obtained by the low-rank KCP. More importantly, KCP-RNNs are efficient in both space and computation complexity compared with other tensor-decomposed ones under similar ranks. Besides, we find KCP has the best potential for parallel computing to accelerate the calculations in neural networks.
    SIM2REALVIZ: Visualizing the Sim2Real Gap in Robot Ego-Pose Estimation. (arXiv:2109.11801v1 [cs.RO])
    (0 min) The robotics community has started to rely heavily on increasingly realistic 3D simulators for large-scale training of robots on massive amounts of data. But once robots are deployed in the real world, the simulation gap, as well as changes in the real world (e.g. lighting, object displacements), lead to errors. In this paper, we introduce Sim2RealViz, a visual analytics tool to assist experts in understanding and reducing this gap for robot ego-pose estimation tasks, i.e. the estimation of a robot's position using trained models. Sim2RealViz displays details of a given model and the performance of its instances in both simulation and the real world. Experts can identify environment differences that impact model predictions at a given location and explore, through direct interactions with the model hypothesis, how to fix them. We detail the design of the tool, and present case studies on exploiting the regression-to-the-mean bias and how it can be addressed, and on how models are perturbed by the disappearance of landmarks such as bikes.
    MODNet-V: Improving Portrait Video Matting via Background Restoration. (arXiv:2109.11818v1 [cs.CV])
    (0 min) To address the challenging portrait video matting problem more precisely, existing works typically apply some matting priors that require additional user efforts to obtain, such as annotated trimaps or background images. In this work, we observe that instead of asking the user to explicitly provide a background image, we may recover it from the input video itself. To this end, we first propose a novel background restoration module (BRM) to recover the background image dynamically from the input video. BRM is extremely lightweight and can be easily integrated into existing matting models. By combining BRM with a recent image matting model, MODNet, we then present MODNet-V for portrait video matting. Benefiting from the strong background prior provided by BRM, MODNet-V has only 1/3 of the parameters of MODNet but achieves comparable or even better performance. Our design allows MODNet-V to be trained in an end-to-end manner on a single NVIDIA 3090 GPU. Finally, we introduce a new patch refinement module (PRM) to adapt MODNet-V for high-resolution videos while keeping MODNet-V lightweight and fast.
    Multi-View Video-Based 3D Hand Pose Estimation. (arXiv:2109.11747v1 [cs.CV])
    (0 min) Hand pose estimation (HPE) can be used for a variety of human-computer interaction applications such as gesture-based control for physical or virtual/augmented reality devices. Recent works have shown that videos or multi-view images carry rich information regarding the hand, allowing for the development of more robust HPE systems. In this paper, we present the Multi-View Video-Based 3D Hand (MuViHand) dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels. Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos. The videos have been simultaneously captured from six different angles with complex backgrounds and random levels of dynamic lighting. The data has been captured from 10 distinct animated subjects using 12 cameras in a semi-circle topology where six tracking cameras only focus on the hand and the other six fixed cameras capture the entire body. Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to learn both temporal and angular sequential information, and graph networks with U-Net architectures to estimate the final 3D pose information. We perform extensive experiments and show the challenging nature of this new dataset as well as the effectiveness of our proposed method. Ablation studies show the added value of each component in MuViHandNet, as well as the benefit of having temporal and sequential information in the dataset.
    Visual Scene Graphs for Audio Source Separation. (arXiv:2109.11955v1 [cs.CV])
    (0 min) State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each subgraph being associated with a unique sound obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal sub-graph embeddings of the visual graph using multi-head attention. These embeddings are used for conditioning an audio encoder-decoder towards source separation. Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds. In this paper, we also introduce an "in the wild" video dataset for sound source separation that contains multiple non-musical sources, which we call Audio Separation in the Wild (ASIW). This dataset is adapted from the AudioCaps dataset, and provides a challenging, natural, and daily-life setting for source separation. Thorough experiments on the proposed ASIW and the standard MUSIC datasets demonstrate state-of-the-art sound separation performance of our method against recent prior approaches.
    Paint4Poem: A Dataset for Artistic Visualization of Classical Chinese Poems. (arXiv:2109.11682v1 [cs.CV])
    (0 min) In this work we propose a new task: artistic visualization of classical Chinese poems, where the goal is to generate paintings of a certain artistic style for classical Chinese poems. For this purpose, we construct a new dataset called Paint4Poem. The first part of Paint4Poem consists of 301 high-quality poem-painting pairs collected manually from an influential modern Chinese artist, Feng Zikai. As its small scale poses challenges for effectively training poem-to-painting generation models, we introduce the second part of Paint4Poem, which consists of 3,648 caption-painting pairs collected manually from Feng Zikai's paintings and 89,204 poem-painting pairs collected automatically from the web. We expect the former to help learning the artist's painting style, as it contains most of his paintings, and the latter to help learning the semantic relevance between poems and paintings. Further, we analyze Paint4Poem regarding poem diversity, painting style, and the semantic relevance between poems and paintings. We create a benchmark for Paint4Poem: we train two representative text-to-image generation models, AttnGAN and MirrorGAN, and evaluate their performance regarding painting pictorial quality, painting stylistic relevance, and semantic relevance between poems and paintings. The results indicate that the models are able to generate paintings that have good pictorial quality and mimic Feng Zikai's style, but the reflection of poem semantics is limited. The dataset also poses many interesting research directions on this task, including transfer learning, few-shot learning, and text-to-image generation for low-resource data. The dataset is publicly available at https://github.com/paint4poem/paint4poem.
    Weakly-Supervised Monocular Depth Estimation with Resolution-Mismatched Data. (arXiv:2109.11573v1 [cs.CV])
    (0 min) Depth estimation from a single image is an active research topic in computer vision. The most accurate approaches are based on fully supervised learning models, which rely on a large amount of dense, high-resolution (HR) ground-truth depth maps. However, in practice, color images are usually captured at much higher resolution than depth maps, leading to a resolution-mismatch problem. In this paper, we propose a novel weakly-supervised framework to train a monocular depth estimation network to generate HR depth maps with resolution-mismatched supervision, i.e., the inputs are HR color images and the ground truth are low-resolution (LR) depth maps. The proposed weakly-supervised framework is composed of a weight-sharing monocular depth estimation network and a depth reconstruction network for distillation. Specifically, for the monocular depth estimation network, the input color image is first downsampled to obtain its LR version with the same resolution as the ground-truth depth. Then, both HR and LR color images are fed into the proposed monocular depth estimation network to obtain the corresponding estimated depth maps. We introduce three losses to train the network: 1) a reconstruction loss between the estimated LR depth and the ground-truth LR depth; 2) a reconstruction loss between the downsampled estimated HR depth and the ground-truth LR depth; 3) a consistency loss between the estimated LR depth and the downsampled estimated HR depth. In addition, we design a depth-to-depth reconstruction network. Through a distillation loss, the features of the two networks maintain structural consistency in affinity space, further improving the estimation network's performance. Experimental results demonstrate that our method achieves superior performance to unsupervised and semi-supervised learning-based schemes, and is competitive with or even better than supervised ones.
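    The three supervision terms listed above can be sketched directly. The loss weights, the L1 penalty, and the bilinear downsampling operator below are assumptions for illustration, not the paper's exact choices:

```python
# Hedged sketch of the three weak-supervision losses on resolution-mismatched depth.
import torch
import torch.nn.functional as F

def weak_depth_losses(depth_hr_pred, depth_lr_pred, depth_lr_gt):
    """depth_hr_pred: (B,1,H,W) predicted from the HR image,
       depth_lr_pred: (B,1,h,w) predicted from the downsampled image,
       depth_lr_gt:   (B,1,h,w) low-resolution ground truth."""
    hr_down = F.interpolate(depth_hr_pred, size=depth_lr_gt.shape[-2:],
                            mode='bilinear', align_corners=False)
    l_lr = F.l1_loss(depth_lr_pred, depth_lr_gt)         # 1) LR prediction vs LR GT
    l_hr = F.l1_loss(hr_down, depth_lr_gt)               # 2) downsampled HR pred vs LR GT
    l_cons = F.l1_loss(depth_lr_pred, hr_down.detach())  # 3) consistency between predictions
    return l_lr + l_hr + l_cons
```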
    Fast Point Voxel Convolution Neural Network with Selective Feature Fusion for Point Cloud Semantic Segmentation. (arXiv:2109.11614v1 [cs.CV])
    (0 min) We present a novel lightweight convolutional neural network for point cloud analysis. In contrast to many current CNNs which increase receptive field by downsampling point cloud, our method directly operates on the entire point sets without sampling and achieves good performances efficiently. Our network consists of point voxel convolution (PVC) layer as building block. Each layer has two parallel branches, namely the voxel branch and the point branch. For the voxel branch specifically, we aggregate local features on non-empty voxel centers to reduce geometric information loss caused by voxelization, then apply volumetric convolutions to enhance local neighborhood geometry encoding. For the point branch, we use Multi-Layer Perceptron (MLP) to extract fine-detailed point-wise features. Outputs from these two branches are adaptively fused via a feature selection module. Moreover, we supervise the output from every PVC layer to learn different levels of semantic information. The final prediction is made by averaging all intermediate predictions. We demonstrate empirically that our method is able to achieve comparable results while being fast and memory efficient. We evaluate our method on popular point cloud datasets for object classification and semantic segmentation tasks.
    How to find a good image-text embedding for remote sensing visual question answering?. (arXiv:2109.11848v1 [cs.CV])
    (0 min) Visual question answering (VQA) has recently been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone. VQA considers a question (in natural language, therefore easy to formulate) about an image and aims at providing an answer through a model based on computer vision and natural language processing methods. As such, a VQA model needs to jointly consider visual and textual features, which is frequently done through a fusion step. In this work, we study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity. Our findings indicate that more complex fusion mechanisms yield improved performance, yet that seeking a trade-off between model complexity and performance is worthwhile in practice.
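    Two of the simplest fusion operators commonly compared in VQA are element-wise product and concatenation of the visual and text embeddings; the sketch below uses these as illustrative examples and is not tied to the three strategies evaluated in the paper:

```python
# Hedged sketch of simple visual/text fusion heads for VQA answer classification.
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    def __init__(self, dim=512, n_answers=100, mode="product"):
        super().__init__()
        self.mode = mode
        in_dim = dim if mode == "product" else 2 * dim
        self.classifier = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(),
                                        nn.Linear(dim, n_answers))

    def forward(self, img_feat, txt_feat):
        if self.mode == "product":
            fused = img_feat * txt_feat                         # element-wise (Hadamard) fusion
        else:
            fused = torch.cat([img_feat, txt_feat], dim=-1)     # concatenation fusion
        return self.classifier(fused)

logits = SimpleFusion(mode="concat")(torch.randn(4, 512), torch.randn(4, 512))
```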
    SPNet: Multi-Shell Kernel Convolution for Point Cloud Semantic Segmentation. (arXiv:2109.11610v1 [cs.CV])
    (0 min) Feature encoding is essential for point cloud analysis. In this paper, we propose a novel point convolution operator named Shell Point Convolution (SPConv) for shape encoding and local context learning. Specifically, SPConv splits 3D neighborhood space into shells, aggregates local features on manually designed kernel points, and performs convolution on the shells. Moreover, SPConv incorporates a simple yet effective attention module that enhances local feature aggregation. Based upon SPConv, a deep neural network named SPNet is constructed to process large-scale point clouds. Poisson disk sampling and feature propagation are incorporated in SPNet for better efficiency and accuracy. We provided details of the shell design and conducted extensive experiments on challenging large-scale point cloud datasets. Experimental results show that SPConv is effective in local shape encoding, and our SPNet is able to achieve top-ranking performances in semantic segmentation tasks.
    Feasibility study of urban flood mapping using traffic signs for route optimization. (arXiv:2109.11712v1 [cs.CV])
    (0 min) Water events are the most frequent and costliest climate disasters around the world. In the U.S., an estimated 127 million people who live in coastal areas are at risk of substantial home damage from hurricanes or flooding. In flood emergency management, timely and effective spatial decision-making and intelligent routing depend on flood depth information at a fine spatiotemporal scale. In this paper, crowdsourcing is utilized to collect photos of submerged stop signs, and pair each photo with a pre-flood photo taken at the same location. Each photo pair is then analyzed using deep neural network and image processing to estimate the depth of floodwater in the location of the photo. Generated point-by-point depth data is converted to a flood inundation map and used by an A* search algorithm to determine an optimal flood-free path connecting points of interest. Results provide crucial information to rescue teams and evacuees by enabling effective wayfinding during flooding events.
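    The routing step can be sketched as a standard grid A* search that treats cells whose estimated depth exceeds a passable threshold as blocked. The grid abstraction, cost model, and threshold below are assumptions for illustration, not the paper's GIS pipeline:

```python
# Hedged sketch: A* over a depth grid, avoiding flooded cells.
import heapq
import itertools

def astar_flood_free(depth_grid, start, goal, max_depth=0.15):
    rows, cols = len(depth_grid), len(depth_grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])     # Manhattan heuristic
    counter = itertools.count()                                  # heap tie-breaker
    open_set = [(h(start), next(counter), start)]
    came_from, g_best = {start: None}, {start: 0}
    while open_set:
        _, _, cur = heapq.heappop(open_set)
        if cur == goal:                                          # reconstruct path
            path = [cur]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if depth_grid[nxt[0]][nxt[1]] > max_depth:           # flooded: impassable
                continue
            ng = g_best[cur] + 1
            if ng < g_best.get(nxt, float("inf")):
                g_best[nxt], came_from[nxt] = ng, cur
                heapq.heappush(open_set, (ng + h(nxt), next(counter), nxt))
    return None

grid = [[0.0, 0.0, 0.5], [0.0, 0.4, 0.0], [0.0, 0.0, 0.0]]       # depths in meters
print(astar_flood_free(grid, (0, 0), (2, 2)))
```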
    Training Automatic View Planner for Cardiac MR Imaging via Self-Supervision by Spatial Relationship between Views. (arXiv:2109.11715v1 [eess.IV])
    (0 min) View planning for the acquisition of cardiac magnetic resonance imaging (CMR) requires acquaintance with the cardiac anatomy and remains a challenging task in clinical practice. Existing approaches to its automation relied either on an additional volumetric image not typically acquired in clinic routine, or on laborious manual annotations of cardiac structural landmarks. This work presents a clinic-compatible and annotation-free system for automatic CMR view planning. The system mines the spatial relationship -- more specifically, locates and exploits the intersecting lines -- between the source and target views, and trains deep networks to regress heatmaps defined by these intersecting lines. As the spatial relationship is self-contained in properly stored data, e.g., in the DICOM format, the need for manual annotation is eliminated. Then, a multi-view planning strategy is proposed to aggregate information from the predicted heatmaps for all the source views of a target view, for a globally optimal prescription. The multi-view aggregation mimics the similar strategy practiced by skilled human prescribers. Experimental results on 181 clinical CMR exams show that our system achieves superior accuracy to existing approaches including conventional atlas-based and newer deep learning based ones, in prescribing four standard CMR views. The mean angle difference and point-to-plane distance evaluated against the ground truth planes are 5.98 degrees and 3.48 mm, respectively.
    Holistic Semi-Supervised Approaches for EEG Representation Learning. (arXiv:2109.11732v1 [cs.LG])
    (0 min) Recently, supervised methods, which often require substantial amounts of class labels, have achieved promising results for EEG representation learning. However, labeling EEG data is a challenging task. More recently, holistic semi-supervised learning approaches, which only require few output labels, have shown promising results in the field of computer vision. These methods, however, have not yet been adapted for EEG learning. In this paper, we adapt three state-of-the-art holistic semi-supervised approaches, namely MixMatch, FixMatch, and AdaMatch, as well as five classical semi-supervised methods for EEG learning. We perform rigorous experiments with all 8 methods on two public EEG-based emotion recognition datasets, namely SEED and SEED-IV. The experiments with different amounts of limited labeled samples show that the holistic approaches achieve strong results even when only 1 labeled sample is used per class. Further experiments show that in most cases, AdaMatch is the most effective method, followed by MixMatch and FixMatch.
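    FixMatch, one of the adapted holistic methods, combines a supervised loss with confidence-thresholded pseudo-labels from weak augmentations applied to strong augmentations. The sketch below shows a generic FixMatch-style update; the confidence threshold, loss weight, and EEG encoder/augmentations are placeholders rather than the paper's configuration:

```python
# Hedged sketch of one FixMatch-style training step.
import torch
import torch.nn.functional as F

def fixmatch_step(model, x_lab, y_lab, x_weak, x_strong, threshold=0.95, lam=1.0):
    """x_weak / x_strong: weakly and strongly augmented views of the same unlabeled EEG."""
    sup_loss = F.cross_entropy(model(x_lab), y_lab)
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=-1)           # pseudo-labels from the weak view
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()                 # keep only confident pseudo-labels
    unsup = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    unsup_loss = (unsup * mask).mean()
    return sup_loss + lam * unsup_loss
```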
    SAME: Deformable Image Registration based on Self-supervised Anatomical Embeddings. (arXiv:2109.11572v1 [eess.IV])
    (0 min) In this work, we introduce a fast and accurate method for unsupervised 3D medical image registration. This work is built on top of a recent algorithm SAM, which is capable of computing dense anatomical/semantic correspondences between two images at the pixel level. Our method is named SAME, which breaks down image registration into three steps: affine transformation, coarse deformation, and deep deformable registration. Using SAM embeddings, we enhance these steps by finding more coherent correspondences, and providing features and a loss function with better semantic guidance. We collect a multi-phase chest computed tomography dataset with 35 annotated organs for each patient and conduct inter-subject registration for quantitative evaluation. Results show that SAME outperforms widely-used traditional registration techniques (Elastix FFD, ANTs SyN) and learning based VoxelMorph method by at least 4.7% and 2.7% in Dice scores for two separate tasks of within-contrast-phase and across-contrast-phase registration, respectively. SAME achieves the comparable performance to the best traditional registration method, DEEDS (from our evaluation), while being orders of magnitude faster (from 45 seconds to 1.2 seconds).
    Long Short View Feature Decomposition via Contrastive Video Representation Learning. (arXiv:2109.11593v1 [cs.CV])
    (0 min) Self-supervised video representation methods typically focus on the representation of temporal attributes in videos. However, the role of stationary versus non-stationary attributes is less explored: Stationary features, which remain similar throughout the video, enable the prediction of video-level action classes. Non-stationary features, which represent temporally varying attributes, are more beneficial for downstream tasks involving more fine-grained temporal understanding, such as action segmentation. We argue that a single representation to capture both types of features is sub-optimal, and propose to decompose the representation space into stationary and non-stationary features via contrastive learning from long and short views, i.e. long video sequences and their shorter sub-sequences. Stationary features are shared between the short and long views, while non-stationary features aggregate the short views to match the corresponding long view. To empirically verify our approach, we demonstrate that our stationary features work particularly well on an action recognition downstream task, while our non-stationary features perform better on action segmentation. Furthermore, we analyse the learned representations and find that stationary features capture more temporally stable, static attributes, while non-stationary features encompass more temporally varying ones.
    Dual-Neighborhood Deep Fusion Network for Point Cloud Classification. (arXiv:2108.09228v2 [cs.CV] UPDATED)
    (0 min) Recently, deep neural networks have made remarkable achievements in 3D point cloud classification. However, existing classification methods are mainly implemented on idealized point clouds and suffer heavy degradation of performance in non-idealized scenarios. To handle this problem, a feature representation learning method, named Dual-Neighborhood Deep Fusion Network (DNDFN), is proposed to serve as an improved point cloud encoder for the task of non-idealized point cloud classification. DNDFN utilizes a trainable neighborhood learning method called TN-Learning to capture the global key neighborhood. Then, the global neighborhood is fused with the local neighborhood to help the network achieve more powerful reasoning ability. Besides, an Information Transfer Convolution (IT-Conv) is proposed for DNDFN to learn the edge information between point pairs and benefit the feature transfer procedure. The transmission of information in IT-Conv is similar to the propagation of information in a graph, which makes DNDFN closer to the human reasoning mode. Extensive experiments on existing benchmarks, especially non-idealized datasets, verify the effectiveness of DNDFN, and DNDFN achieves state-of-the-art results.
    SHD360: A Benchmark Dataset for Salient Human Detection in 360° Videos. (arXiv:2105.11578v5 [cs.CV] UPDATED)
    (0 min) Salient human detection (SHD) in dynamic 360° immersive videos is of great importance for various applications such as robotics, inter-human and human-object interaction in augmented reality. However, 360° video SHD has been seldom discussed in the computer vision community due to a lack of datasets with large-scale omnidirectional videos and rich annotations. To this end, we propose SHD360, the first 360° video SHD dataset, which contains various real-life daily scenes. Our SHD360 provides six-level hierarchical annotations for 6,268 key frames uniformly sampled from 37,403 omnidirectional video frames at 4K resolution. Specifically, each collected frame is labeled with a super-class, a sub-class, associated attributes (e.g., geometrical distortion), bounding boxes and per-pixel object-/instance-level masks. As a result, our SHD360 contains in total 16,238 salient human instances with manually annotated pixel-wise ground truth. Since so far there is no method proposed for 360° image/video SHD, we systematically benchmark 11 representative state-of-the-art salient object detection (SOD) approaches on our SHD360, and explore key issues derived from extensive experimental results. We hope our proposed dataset and benchmark can serve as a good starting point for advancing human-centric research towards 360° panoramic data. Our dataset and benchmark are publicly available at https://github.com/PanoAsh/SHD360.
    FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. (arXiv:2104.10956v3 [cs.CV] UPDATED)
    (0 min) Monocular 3D object detection is an important task for autonomous driving considering its advantage of low cost. It is much more challenging than conventional 2D cases due to its inherent ill-posed property, which is mainly reflected in the lack of depth information. Recent progress on 2D detection offers opportunities to better solve this problem. However, it is non-trivial to make a general adapted 2D detector work in this 3D task. In this paper, we study this problem with a practice built on a fully convolutional single-stage detector and propose a general framework FCOS3D. Specifically, we first transform the commonly defined 7-DoF 3D targets to the image domain and decouple them as 2D and 3D attributes. Then the objects are distributed to different feature levels with consideration of their 2D scales and assigned only according to the projected 3D-center for the training procedure. Furthermore, the center-ness is redefined with a 2D Gaussian distribution based on the 3D-center to fit the 3D target formulation. All of these make this framework simple yet effective, getting rid of any 2D detection or 2D-3D correspondence priors. Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020. Code and models are released at https://github.com/open-mmlab/mmdetection3d.
    Learnable Triangulation for Deep Learning-based 3D Reconstruction of Objects of Arbitrary Topology from Single RGB Images. (arXiv:2109.11844v1 [cs.CV])
    (0 min) We propose a novel deep reinforcement learning-based approach for 3D object reconstruction from monocular images. Prior works that use mesh representations are template-based. Thus, they are limited to the reconstruction of objects that have the same topology as the template. Methods that use volumetric grids as intermediate representations are computationally expensive, which limits their application in real-time scenarios. In this paper, we propose a novel end-to-end method that reconstructs 3D objects of arbitrary topology from a monocular image. It is composed of (1) a Vertex Generation Network (VGN), which predicts the initial 3D locations of the object's vertices from an input RGB image, (2) a differentiable triangulation layer, which learns in an unsupervised manner, using a novel reinforcement learning algorithm, the best triangulation of the object's vertices, and finally, (3) a hierarchical mesh refinement network that uses graph convolutions to refine the initial mesh. Our key contribution is the learnable triangulation process, which recovers in an unsupervised manner the topology of the input shape. Our experiments on the ShapeNet and Pix3D benchmarks show that the proposed method outperforms the state of the art in terms of visual quality, reconstruction accuracy, and computational time.
    An Orthogonal Classifier for Improving the Adversarial Robustness of Neural Networks. (arXiv:2105.09109v2 [cs.CV] UPDATED)
    (0 min) Neural networks are susceptible to artificially designed adversarial perturbations. Recent efforts have shown that imposing certain modifications on the classification layer can improve the robustness of neural networks. In this paper, we explicitly construct a dense orthogonal weight matrix whose entries have the same magnitude, thereby leading to a novel robust classifier. The proposed classifier avoids the undesired structural redundancy issue of previous work. Applying this classifier in standard training on clean data is sufficient to ensure the high accuracy and good robustness of the model. Moreover, when extra adversarial samples are used, better robustness can be further obtained with the help of a special worst-case loss. Experimental results show that our method is efficient and competitive with many state-of-the-art defensive approaches. Our code is available at https://github.com/MTandHJ/roboc.
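    One concrete example of a dense orthogonal matrix whose entries all share the same magnitude is a normalized Hadamard matrix; whether this matches the paper's exact construction is an assumption, but it makes the property easy to verify:

```python
# Hedged sketch: a normalized Hadamard matrix as a dense, equal-magnitude orthogonal matrix.
import numpy as np
from scipy.linalg import hadamard

d = 8                                    # feature dimension (power of two for this construction)
W = hadamard(d) / np.sqrt(d)             # rows are orthonormal, all entries are ±1/sqrt(d)
print(np.allclose(W @ W.T, np.eye(d)))   # True: W is orthogonal
print(np.unique(np.abs(W)))              # single value ~0.3536 -> equal-magnitude entries
```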
    From images in the wild to video-informed image classification. (arXiv:2109.12040v1 [cs.CV])
    (0 min) Image classifiers work effectively when applied to structured images, yet they often fail when applied to images with very high visual complexity. This paper describes experiments applying state-of-the-art object classifiers to a unique set of images in the wild with high visual complexity collected on the island of Bali. The text describes differences between actual images in the wild and images from ImageNet, and then discusses a novel approach combining informational cues particular to video with an ensemble of imperfect classifiers in order to improve classification results on video-sourced images of plants in the wild.
    Quantitative Matching of Forensic Evidence Fragments Utilizing 3D Microscopy Analysis of Fracture Surface Replicas. (arXiv:2109.11972v1 [eess.IV])
    (0 min) Fractured surfaces carry unique details that can provide an accurate quantitative comparison to support comparative forensic analysis of those fractured surfaces. In this study, a statistical comparison protocol was applied to a set of 3D topological images of fractured surface pairs and their replicas to provide confidence in the quantitative statistical comparison between fractured items and their replicas. A set of 10 fractured stainless steel samples was fractured from the same metal rod under controlled conditions and cast using a standard forensic casting technique. Six 3D topological maps with 50% overlap were acquired for each fractured pair. Spectral analysis was utilized to identify the correlation between topological surface features at different length scales of the surface topology. We selected two frequency bands over the critical wavelength (which is greater than two grain diameters) for statistical comparison. Our statistical model utilized a matrix-variate t-distribution that accounts for the image overlap to model the match and non-match population densities. A decision rule was developed to identify the probability of matched and unmatched pairs of surfaces. The proposed methodology correctly classified the fractured steel surfaces and their replicas with a posterior probability of match exceeding 99.96%. Moreover, the replication technique shows the potential to accurately replicate fracture surface topological details with a wavelength greater than 20 µm, which far exceeds the 50-200 µm range for comparison of most metallic alloys. The developed framework establishes the basis of forensic comparison of fractured articles and their replicas while providing a reliable quantitative statistical forensic comparison, utilizing fracture mechanics-based analysis of the fracture surface topology.
    Imaginative Walks: Generative Random Walk Deviation Loss for Improved Unseen Learning Representation. (arXiv:2104.09757v2 [cs.CV] UPDATED)
    (0 min) We propose a novel loss for generative models, dubbed as GRaWD (Generative Random Walk Deviation), to improve learning representations of unexplored visual spaces. Quality learning representation of unseen classes (or styles) is critical to facilitate novel image generation and better generative understanding of unseen visual classes, i.e., zero-shot learning (ZSL). By generating representations of unseen classes based on their semantic descriptions, e.g., attributes or text, generative ZSL attempts to differentiate unseen from seen categories. The proposed GRaWD loss is defined by constructing a dynamic graph that includes the seen class/style centers and generated samples in the current minibatch. Our loss initiates a random walk probability from each center through visual generations produced from hallucinated unseen classes. As a deviation signal, we encourage the random walk to eventually land after t steps in a feature representation that is difficult to classify as any of the seen classes. We demonstrate that the proposed loss can improve unseen class representation quality inductively on text-based ZSL benchmarks on CUB and NABirds datasets and attribute-based ZSL benchmarks on AWA2, SUN, and aPY datasets. In addition, we investigate the ability of the proposed loss to generate meaningful novel visual art on the WikiArt dataset. The results of experiments and human evaluations demonstrate that the proposed GRaWD loss can improve StyleGAN1 and StyleGAN2 generation quality and create novel art that is significantly more preferable. Our code is made publicly available at https://github.com/Vision-CAIR/GRaWD.
    AFP-SRC: Identification of Antifreeze Proteins Using Sparse Representation Classifier. (arXiv:2009.05277v3 [cs.CV] UPDATED)
    (0 min) Species living in extremely cold environments fight against the harsh conditions using antifreeze proteins (AFPs), which manipulate the freezing mechanism of water in more than one way. This remarkable property of AFPs turns out to be extremely useful in several industrial and medical applications. The lack of similarity in their structure and sequence makes their prediction an arduous task, and identifying them experimentally in the wet lab is time-consuming and expensive. In this research, we propose a computational framework for the prediction of AFPs which is essentially based on a sample-specific classification method using sparse reconstruction. A linear model and an over-complete dictionary matrix of known AFPs are used to predict a sparse class-label vector that provides a sample-association score. A delta rule is applied for the reconstruction of two pseudo-samples using the lower and upper parts of the sample-association vector, and based on the minimum recovery score, class labels are assigned. We compare our approach with contemporary methods on a standard dataset, and the proposed method is found to outperform them in terms of balanced accuracy and Youden's index. The MATLAB implementation of the proposed method is available at the author's GitHub page (https://github.com/Shujaat123/AFP-SRC).
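    A generic sparse-representation classifier (SRC) captures the core idea of reconstructing a query from an over-complete dictionary and assigning the class with the lowest reconstruction error. The sketch below is our own illustration in Python (the paper's delta-rule scoring over the sample-association vector is not reproduced):

```python
# Hedged sketch: sparse-representation classification via orthogonal matching pursuit.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(D, atom_labels, y, n_nonzero=10):
    """D: (d, n_atoms) dictionary of training samples (columns), y: (d,) query sample."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False).fit(D, y)
    x = omp.coef_                                       # sparse association vector
    residuals = {}
    for c in np.unique(atom_labels):
        xc = np.where(atom_labels == c, x, 0.0)         # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - D @ xc)       # class-wise reconstruction error
    return min(residuals, key=residuals.get)            # class with the smallest residual

D = np.random.randn(64, 200)                            # placeholder feature dictionary
labels = np.random.randint(0, 2, size=200)              # 0 = non-AFP, 1 = AFP
pred = src_predict(D, labels, np.random.randn(64))
```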
    DeepKoCo: Efficient latent planning with a task-relevant Koopman representation. (arXiv:2011.12690v3 [cs.LG] UPDATED)
    (0 min) This paper presents DeepKoCo, a novel model-based agent that learns a latent Koopman representation from images. This representation allows DeepKoCo to plan efficiently using linear control methods, such as linear model predictive control. Compared to traditional agents, DeepKoCo learns task-relevant dynamics, thanks to the use of a tailored lossy autoencoder network that allows DeepKoCo to learn latent dynamics that reconstruct and predict only observed costs, rather than all observed dynamics. As our results show, DeepKoCo achieves similar final performance as traditional model-free methods on complex control tasks while being considerably more robust to distractor dynamics, making the proposed agent more amenable for real-life applications.
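    The key property exploited for planning is that the latent dynamics are (approximately) linear, z_{t+1} ≈ A z_t + B u_t. The sketch below illustrates that idea with a flat observation vector and a one-step latent prediction loss; the encoder, dimensions, and loss are assumptions and omit DeepKoCo's image encoder and cost-prediction terms:

```python
# Hedged sketch: learn an encoder with linear latent (Koopman-style) dynamics.
import torch
import torch.nn as nn

class KoopmanLatentModel(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=16, act_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.A = nn.Linear(latent_dim, latent_dim, bias=False)   # linear latent dynamics
        self.B = nn.Linear(act_dim, latent_dim, bias=False)      # linear control term

    def forward(self, obs_t, act_t, obs_t1):
        z_t, z_t1 = self.encoder(obs_t), self.encoder(obs_t1)
        z_pred = self.A(z_t) + self.B(act_t)
        return ((z_pred - z_t1) ** 2).mean()                     # latent one-step prediction loss

model = KoopmanLatentModel()
loss = model(torch.randn(8, 64), torch.randn(8, 2), torch.randn(8, 64))
```

    Once A and B are fixed, standard linear MPC machinery can be applied directly in the latent space, which is what makes the linear structure attractive for planning.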
    Unaligned Image-to-Image Translation by Learning to Reweight. (arXiv:2109.11736v1 [cs.CV])
    (0 min) Unsupervised image-to-image translation aims at learning the mapping from the source to target domain without using paired images for training. An essential yet restrictive assumption for unsupervised image translation is that the two domains are aligned, e.g., for the selfie2anime task, the anime (selfie) domain must contain only anime (selfie) face images that can be translated to some images in the other domain. Collecting aligned domains can be laborious and needs lots of attention. In this paper, we consider the task of image translation between two unaligned domains, which may arise for various possible reasons. To solve this problem, we propose to select images based on importance reweighting and develop a method to learn the weights and perform translation simultaneously and automatically. We compare the proposed method with state-of-the-art image translation approaches and present qualitative and quantitative results on different tasks with unaligned domains. Extensive empirical evidence demonstrates the usefulness of the proposed problem formulation and the superiority of our method.
    Dense Contrastive Visual-Linguistic Pretraining. (arXiv:2109.11778v1 [cs.CV])
    (0 min) Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, they tend to suffer from the problems of noisy labels and sparse semantic annotations, based on the visual features having been pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces the region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask Perturbation and Intra-/Inter-Adversarial Perturbation) are developed to improve the quality of negative samples used in contrastive learning. Overall, DCVLP allows cross-modality dense region contrastive learning in a self-supervised setting independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning on multimodal representation learning.
    Quantifying point cloud realism through adversarially learned latent representations. (arXiv:2109.11775v1 [cs.CV])
    (0 min) Judging the quality of samples synthesized by generative models can be tedious and time consuming, especially for complex data structures, such as point clouds. This paper presents a novel approach to quantify the realism of local regions in LiDAR point clouds. Relevant features are learned from real-world and synthetic point clouds by training on a proxy classification task. Inspired by fair networks, we use an adversarial technique to discourage the encoding of dataset-specific information. The resulting metric can assign a quality score to samples without requiring any task specific annotations. In a series of experiments, we confirm the soundness of our metric by applying it in controllable task setups and on unseen data. Additional experiments show reliable interpolation capabilities of the metric between data with varying degree of realism. As one important application, we demonstrate how the local realism score can be used for anomaly detection in point clouds.
    Training dataset generation for bridge game registration. (arXiv:2109.11861v1 [cs.CV])
    (0 min) This paper presents a method for automatic generation of a training dataset for a deep convolutional neural network used for playing card detection. The solution makes it possible to skip the time-consuming processes of manually collecting images and labelling the recognised objects. The YOLOv4 network trained on the generated dataset achieved an efficiency of 99.8% in the card detection task. The proposed method is a part of a project that aims to automate the process of broadcasting duplicate bridge competitions using a vision system and neural networks.
    Adversarial Domain Feature Adaptation for Bronchoscopic Depth Estimation. (arXiv:2109.11798v1 [eess.IV])
    (0 min) Depth estimation from monocular images is an important task in localization and 3D reconstruction pipelines for bronchoscopic navigation. Various supervised and self-supervised deep learning-based approaches have proven themselves on this task for natural images. However, the lack of labeled data and the bronchial tissue's feature-scarce texture make the utilization of these methods ineffective on bronchoscopic scenes. In this work, we propose an alternative domain-adaptive approach. Our novel two-step structure first trains a depth estimation network with labeled synthetic images in a supervised manner; then adopts an unsupervised adversarial domain feature adaptation scheme to improve the performance on real images. The results of our experiments show that the proposed method improves the network's performance on real images by a considerable margin and can be employed in 3D reconstruction pipelines.
    A Learned Stereo Depth System for Robotic Manipulation in Homes. (arXiv:2109.11644v1 [cs.RO])
    (0 min) We present a passive stereo depth system that produces dense and accurate point clouds optimized for human environments, including dark, textureless, thin, reflective and specular surfaces and objects, at 2560x2048 resolution, with 384 disparities, in 30 ms. The system consists of an algorithm combining learned stereo matching with engineered filtering, a training and data-mixing methodology, and a sensor hardware design. Our architecture is 15x faster than approaches that perform similarly on the Middlebury and Flying Things Stereo Benchmarks. To effectively supervise the training of this model, we combine real data labelled using off-the-shelf depth sensors, as well as a number of different rendered, simulated labeled datasets. We demonstrate the efficacy of our system by presenting a large number of qualitative results in the form of depth maps and point-clouds, experiments validating the metric accuracy of our system and comparisons to other sensors on challenging objects and scenes. We also show the competitiveness of our algorithm compared to state-of-the-art learned models using the Middlebury and FlyingThings datasets.
    Scan-based Semantic Segmentation of LiDAR Point Clouds: An Experimental Study. (arXiv:2004.11803v3 [cs.CV] UPDATED)
    (0 min) Autonomous vehicles need to have a semantic understanding of the three-dimensional world around them in order to reason about their environment. State of the art methods use deep neural networks to predict semantic classes for each point in a LiDAR scan. A powerful and efficient way to process LiDAR measurements is to use two-dimensional, image-like projections. In this work, we perform a comprehensive experimental study of image-based semantic segmentation architectures for LiDAR point clouds. We demonstrate various techniques to boost the performance and to improve runtime as well as memory constraints. First, we examine the effect of network size and suggest that much faster inference times can be achieved at a very low cost to accuracy. Next, we introduce an improved point cloud projection technique that does not suffer from systematic occlusions. We use a cyclic padding mechanism that provides context at the horizontal field-of-view boundaries. In a third part, we perform experiments with a soft Dice loss function that directly optimizes for the intersection-over-union metric. Finally, we propose a new kind of convolution layer with a reduced amount of weight-sharing along one of the two spatial dimensions, addressing the large difference in appearance along the vertical axis of a LiDAR scan. We propose a final set of the above methods with which the model achieves an increase of 3.2% in mIoU segmentation performance over the baseline while requiring only 42% of the original inference time.
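    Of the techniques listed above, the soft Dice loss is simple enough to sketch; the version below is a generic per-class formulation over point-wise class probabilities (numpy), not necessarily the exact variant used in the paper.

        # Sketch only: soft Dice loss, a differentiable surrogate for the overlap/IoU statistic.
        import numpy as np

        def soft_dice_loss(probs, targets, eps=1e-6):
            """probs: (n_points, n_classes) softmax outputs; targets: one-hot, same shape."""
            intersection = (probs * targets).sum(axis=0)
            denominator = probs.sum(axis=0) + targets.sum(axis=0)
            dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
            return 1.0 - dice_per_class.mean()       # minimize 1 minus the mean Dice coefficient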
    Catadioptric Stereo on a Smartphone. (arXiv:2109.11872v1 [cs.CV])
    (0 min) We present a 3D printed adapter with planar mirrors for stereo reconstruction using front and back smartphone camera. The adapter presents a practical and low-cost solution for enabling any smartphone to be used as a stereo camera, which is currently only possible using high-end phones with expensive 3D sensors. Using the prototype version of the adapter, we experiment with parameters like the angles between cameras and mirrors and the distance to each camera (the stereo baseline). We find the most convenient configuration and calibrate the stereo pair. Based on the presented preliminary analysis, we identify possible improvements in the current design. To demonstrate the working prototype, we reconstruct a 3D human pose using 2D keypoint detections from the stereo pair and evaluate extracted body lengths. The result shows that the adapter can be used for anthropometric measurement of several body segments.
  • cs.IR updates on arXiv.org

    Advances and Challenges in Conversational Recommender Systems: A Survey. (arXiv:2101.09459v7 [cs.IR] UPDATED)
    (3 min) Recommender systems exploit interaction history to estimate user preference, having been heavily used in a wide range of industry applications. However, static recommendation models are difficult to answer two important questions well due to inherent shortcomings: (a) What exactly does a user like? (b) Why does a user like an item? The shortcomings are due to the way that static models learn user preference, i.e., without explicit instructions and active feedback from users. The recent rise of conversational recommender systems (CRSs) changes this situation fundamentally. In a CRS, users and the system can dynamically communicate through natural language interactions, which provide unprecedented opportunities to explicitly obtain the exact preference of users. Considerable efforts, spread across disparate settings and applications, have been put into developing CRSs. Existing models, technologies, and evaluation methods for CRSs are far from mature. In this paper, we provide a systematic review of the techniques used in current CRSs. We summarize the key challenges of developing CRSs in five directions: (1) Question-based user preference elicitation. (2) Multi-turn conversational recommendation strategies. (3) Dialogue understanding and generation. (4) Exploitation-exploration trade-offs. (5) Evaluation and user simulation. These research directions involve multiple research fields like information retrieval (IR), natural language processing (NLP), and human-computer interaction (HCI). Based on these research directions, we discuss some future challenges and opportunities. We provide a road map for researchers from multiple communities to get started in this area. We hope this survey can help to identify and address challenges in CRSs and inspire future research.
    Learning Dual Dynamic Representations on Time-Sliced User-Item Interaction Graphs for Sequential Recommendation. (arXiv:2109.11790v1 [cs.IR])
    (2 min) Sequential Recommendation aims to recommend items that a target user will interact with in the near future based on the historically interacted items. While modeling temporal dynamics is crucial for sequential recommendation, most of the existing studies concentrate solely on the user side while overlooking the sequential patterns existing in the counterpart, i.e., the item side. Although a few studies investigate the dynamics involved in the dual sides, the complex user-item interactions are not fully exploited from a global perspective to derive dynamic user and item representations. In this paper, we devise a novel Dynamic Representation Learning model for Sequential Recommendation (DRL-SRe). To better model the user-item interactions for characterizing the dynamics from both sides, the proposed model builds a global user-item interaction graph for each time slice and exploits time-sliced graph neural networks to learn user and item representations. Moreover, to enable the model to capture fine-grained temporal information, we propose an auxiliary temporal prediction task over consecutive time slices based on a temporal point process. Comprehensive experiments on three public real-world datasets demonstrate that DRL-SRe outperforms state-of-the-art sequential recommendation models by a large margin.
    Description-based Label Attention Classifier for Explainable ICD-9 Classification. (arXiv:2109.12026v1 [cs.LG])
    (2 min) ICD-9 coding is a relevant clinical billing task, where unstructured texts with information about a patient's diagnosis and treatments are annotated with multiple ICD-9 codes. Automated ICD-9 coding is an active research field, where CNN- and RNN-based model architectures represent the state-of-the-art approaches. In this work, we propose a description-based label attention classifier to improve the model explainability when dealing with noisy texts like clinical notes. We evaluate our proposed method with different transformer-based encoders on the MIMIC-III-50 dataset. Our method achieves strong results together with augmented explainability.
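    As a rough illustration of what a description-based label attention layer computes, the sketch below attends over token representations once per label using an embedding of that label's textual description (numpy; the names and the scoring head are illustrative assumptions, not the paper's architecture).

        # Sketch only: per-label attention pooling driven by label-description embeddings.
        import numpy as np

        def label_attention_logits(tokens, label_desc, w):
            """tokens: (seq_len, d) encoder outputs; label_desc: (n_labels, d) description
            embeddings; w: (d,) stand-in for a learned scoring head."""
            scores = label_desc @ tokens.T                               # (n_labels, seq_len) attention scores
            scores -= scores.max(axis=1, keepdims=True)
            attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
            label_repr = attn @ tokens                                   # (n_labels, d) label-specific pooling
            return label_repr @ w                                        # one logit per ICD-9 code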
    Modeling Dynamic Attributes for Next Basket Recommendation. (arXiv:2109.11654v1 [cs.AI])
    (2 min) Traditional approaches to next item and next basket recommendation typically extract users' interests based on their past interactions and associated static contextual information (e.g. a user id or item category). However, extracted interests can be inaccurate and become obsolete. Dynamic attributes, such as user income or item prices, change over time. Such dynamics can intrinsically reflect the evolution of users' interests. We argue that modeling such dynamic attributes can boost recommendation performance. However, properly integrating them into user interest models is challenging, since attribute dynamics can be diverse (e.g., time-interval-aware or periodic patterns) and they represent users' behaviors from different perspectives, which can happen asynchronously with interactions. Besides dynamic attributes, items in each basket contain complex interdependencies which might be beneficial but are nontrivial to capture effectively. To address these challenges, we propose a novel Attentive network to model Dynamic attributes (named AnDa). AnDa separately encodes dynamic attributes and basket item sequences. We design a periodic-aware encoder to allow the model to capture various temporal patterns from dynamic attributes. To effectively learn useful item relationships, an intra-basket attention module is proposed. Experimental results on three real-world datasets demonstrate that our method consistently outperforms the state-of-the-art.
    Multi-behavior Graph Contextual Aware Network for Session-based Recommendation. (arXiv:2109.11903v1 [cs.IR])
    (2 min) Predicting the next interaction of a short-term sequence is a challenging task in session-based recommendation (SBR). Multi-behavior session recommendation considers session sequences with multiple interaction types, such as clicks and purchases, to capture user intention more effectively. Despite the superior performance of existing multi-behavior based methods for SBR, there are still several severe limitations: (i) Almost all existing works concentrate on a single target type of next behavior and fail to model multiplex behavior sessions uniformly. (ii) Previous methods also ignore the semantic relations between the various next behaviors and the historical behavior sequence, which are significant signals for obtaining the current latent intention for SBR. (iii) The global cross-session item-item graph established by some existing models may incorporate semantic and context-level noise for multi-behavior session-based recommendation. To overcome limitations (i) and (ii), we propose two novel tasks for SBR, which require the incorporation of both historical behaviors and next behaviors into unified multi-behavior recommendation modeling. To this end, we design a Multi-behavior Graph Contextual Aware Network (MGCNet) for multi-behavior session-based recommendation for the two proposed tasks. Specifically, we build a multi-behavior global item transition graph based on all sessions involving all interaction types. Based on the global graph, MGCNet attaches the global interest representation to the final item representation based on local contextual intention to address limitation (iii). In the end, we utilize the next behavior information explicitly to guide the learning of general interest and current intention for SBR. Experiments on three public benchmark datasets show that MGCNet can outperform state-of-the-art models for multi-behavior session-based recommendation.
    Graph Learning Augmented Heterogeneous Graph Neural Network for Social Recommendation. (arXiv:2109.11898v1 [cs.IR])
    (2 min) Social recommendation based on social network has achieved great success in improving the performance of recommendation system. Since social network (user-user relations) and user-item interactions are both naturally represented as graph-structured data, Graph Neural Networks (GNNs) have thus been widely applied for social recommendation. In this work, we propose an end-to-end heterogeneous global graph learning framework, namely Graph Learning Augmented Heterogeneous Graph Neural Network (GL-HGNN) for social recommendation. GL-HGNN aims to learn a heterogeneous global graph that makes full use of user-user relations, user-item interactions and item-item similarities in a unified perspective. To this end, we design a Graph Learner (GL) method to learn and optimize user-user and item-item connections separately. Moreover, we employ a Heterogeneous Graph Neural Network (HGNN) to capture the high-order complex semantic relations from our learned heterogeneous global graph. To scale up the computation of graph learning, we further present the Anchor-based Graph Learner (AGL) to reduce computational complexity. Extensive experiments on four real-world datasets demonstrate the effectiveness of our model.
    GERNERMED -- An Open German Medical NER Model. (arXiv:2109.12104v1 [cs.CL])
    (2 min) The current state of adoption of well-structured electronic health records and the integration of digital methods for storing medical patient data in structured formats can often be considered inferior to the use of traditional, unstructured, text-based patient data documentation. Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant information. In natural language processing (NLP), statistical models have been shown to be successful in various tasks like part-of-speech tagging, relation extraction (RE) and named entity recognition (NER). In this work, we present GERNERMED, the first open, neural NLP model for NER tasks dedicated to detecting medical entity types in German text data. Here, we avoid the conflicting goals of protecting sensitive patient data from training data extraction and publishing the statistical model weights by training our model on a custom dataset that was translated from publicly available datasets in a foreign language by a pretrained neural machine translation model. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/GERNERMED
  • cs.LG updates on arXiv.org

    On the Generalization of Stochastic Gradient Descent with Momentum. (arXiv:2102.13653v2 [cs.LG] UPDATED)
    (0 min) While momentum-based methods, in conjunction with stochastic gradient descent (SGD), are widely used when training machine learning models, there is little theoretical understanding of the generalization error of such methods. In this work, we first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees when SGD with standard heavy-ball momentum (SGDM) is run for multiple epochs. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM), and show that it admits an upper bound on the generalization error. Thus, our results show that machine learning models can be trained for multiple epochs of SGDEM with a guarantee for generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper bound on the expected true risk, in terms of the number of training steps, the size of the training set, and the momentum parameter. Experimental evaluations verify the consistency between the numerical results and our theoretical bounds and the effectiveness of SGDEM for smooth Lipschitz loss functions.
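    For reference, the heavy-ball (SGDM) update discussed above looks as follows in code; the SGDEM variant modifies when the momentum term is active, and the exact schedule is specified in the paper rather than here (illustrative sketch only).

        # Sketch only: one SGD step with heavy-ball momentum; setting use_momentum=False
        # recovers plain SGD, which is one way an "early momentum" schedule can be realised.
        def sgdm_step(w, grad, velocity, lr=0.01, beta=0.9, use_momentum=True):
            velocity = beta * velocity + grad if use_momentum else grad
            return w - lr * velocity, velocity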
    Demonstration of Panda: A Weakly Supervised Entity Matching System. (arXiv:2106.10821v3 [cs.DB] UPDATED)
    (0 min) Entity matching (EM) refers to the problem of identifying tuple pairs in one or more relations that refer to the same real world entities. Supervised machine learning (ML) approaches, and deep learning based approaches in particular, typically achieve state-of-the-art matching results. However, these approaches require many labeled examples, in the form of matching and non-matching pairs, which are expensive and time-consuming to label. In this paper, we introduce Panda, a weakly supervised system specifically designed for EM. Panda uses the same labeling function abstraction as Snorkel, where labeling functions (LF) are user-provided programs that can generate large amounts of (somewhat noisy) labels quickly and cheaply, which can then be combined via a labeling model to generate accurate final predictions. To support users developing LFs for EM, Panda provides an integrated development environment (IDE) that lives in a modern browser architecture. Panda's IDE facilitates the development, debugging, and life-cycle management of LFs in the context of EM tasks, similar to how IDEs such as Visual Studio or Eclipse excel in general-purpose programming. Panda's IDE includes many novel features purpose-built for EM, such as smart data sampling, a built-in library of EM utility functions, automatically generated LFs, visual debugging of LFs, and finally, an EM-specific labeling model. We show in this demo that Panda IDE can greatly accelerate the development of high-quality EM solutions using weak supervision.
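    To give a feel for the labeling-function abstraction mentioned above, here are two toy Snorkel-style labeling functions for entity matching; the record fields and function names are hypothetical illustrations, not Panda's actual API.

        # Sketch only: labeling functions vote MATCH / NON_MATCH / ABSTAIN on a record pair;
        # a labeling model later combines these noisy votes into final predictions.
        MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

        def lf_same_phone(pair):
            """Vote MATCH when both records list an identical, non-empty phone number."""
            left, right = pair
            if left.get("phone") and left.get("phone") == right.get("phone"):
                return MATCH
            return ABSTAIN

        def lf_price_far_apart(pair):
            """Vote NON_MATCH when listed prices differ by more than 50%."""
            left, right = pair
            try:
                lo, hi = sorted([float(left["price"]), float(right["price"])])
            except (KeyError, ValueError):
                return ABSTAIN
            return NON_MATCH if hi > 1.5 * lo else ABSTAIN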
    Stochastic Polyak Stepsize with a Moving Target. (arXiv:2106.11851v2 [cs.LG] UPDATED)
    (0 min) We propose a new stochastic gradient method called MOTAPS (Moving Targetted Polyak Stepsize) that uses recorded past loss values to compute adaptive stepsizes. MOTAPS can be seen as a variant of the Stochastic Polyak (SP) stepsize, a method that also uses loss values to adjust the stepsize. The downside of the SP method is that it only converges when the interpolation condition holds. MOTAPS is an extension of SP that does not rely on the interpolation condition. The MOTAPS method uses $n$ auxiliary variables, one for each data point, that track the loss value of each data point. We provide a global convergence theory for SP, an intermediary method TAPS, and MOTAPS by showing that they can all be interpreted as special variants of online SGD. We also perform several numerical experiments on convex learning problems, and deep learning models for image classification and language translation. In all of our tasks we show that MOTAPS is competitive with the relevant baseline method.
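    For context, below is the classic stochastic Polyak stepsize on a single-sample loss (numpy, illustrative); as described above, MOTAPS replaces the fixed per-sample target with auxiliary variables that track a moving target.

        # Sketch only: SP step on one sample; the stepsize adapts to the current loss gap.
        import numpy as np

        def polyak_step(w, loss_i, grad_i, loss_i_star=0.0, gamma_max=1.0):
            """loss_i, grad_i: loss value and gradient on a single sample;
            loss_i_star: target (e.g. minimal attainable) loss for that sample."""
            gamma = (loss_i - loss_i_star) / (np.dot(grad_i, grad_i) + 1e-12)
            return w - min(gamma, gamma_max) * grad_i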
    S&P 500 Stock Price Prediction Using Technical, Fundamental and Text Data. (arXiv:2108.10826v2 [stat.ML] UPDATED)
    (0 min) We summarized both common and novel predictive models used for stock price prediction and combined them with technical indices, fundamental characteristics and text-based sentiment data to predict S&P stock prices. A 66.18% accuracy in S&P 500 index directional prediction and 62.09% accuracy in individual stock directional prediction was achieved by combining different machine learning models such as Random Forest and LSTM together into state-of-the-art ensemble models. The data we use contains weekly historical prices, finance reports, and text information from news items associated with 518 different common stocks issued by current and former S&P 500 large-cap companies, from January 1, 2000 to December 31, 2019. Our study's innovation includes utilizing deep language models to categorize and infer financial news item sentiment; fusing different models containing different combinations of variables and stocks to jointly make predictions; and overcoming the insufficient data problem for machine learning models in time series by using data across different stocks.
    LDLE: Low Distortion Local Eigenmaps. (arXiv:2101.11055v2 [math.SP] UPDATED)
    (0 min) We present Low Distortion Local Eigenmaps (LDLE), a manifold learning technique which constructs a set of low distortion local views of a dataset in lower dimension and registers them to obtain a global embedding. The local views are constructed using the global eigenvectors of the graph Laplacian and are registered using Procrustes analysis. The choice of these eigenvectors may vary across the regions. In contrast to existing techniques, LDLE can embed closed and non-orientable manifolds into their intrinsic dimension by tearing them apart. It also provides gluing instructions on the boundary of the torn embedding to help identify the topology of the original manifold. Our experimental results show that LDLE largely preserves distances up to a constant scale while other techniques produce higher distortion. We also demonstrate that LDLE produces high quality embeddings even when the data is noisy or sparse.
    Data Generation in Low Sample Size Setting Using Manifold Sampling and a Geometry-Aware VAE. (arXiv:2103.13751v2 [stat.ML] UPDATED)
    (0 min) We propose a new efficient way to sample from a Variational Autoencoder in the challenging low sample size setting. This method proves particularly well suited to performing data augmentation in such a low-data regime and is validated across various standard and real-life data sets. In particular, this scheme greatly improves classification results on the OASIS database, where balanced accuracy jumps from 80.7% for a classifier trained with the raw data to 88.6% when trained only with the synthetic data generated by our method. Such results were also observed on 3 standard data sets and with other classifiers. The code is available at https://github.com/clementchadebec/Data_Augmentation_with_VAE-DALI.
    Deep Network Approximation for Smooth Functions. (arXiv:2001.03040v8 [cs.LG] UPDATED)
    (0 min) This paper establishes the (nearly) optimal approximation error characterization of deep rectified linear unit (ReLU) networks for smooth functions in terms of both width and depth simultaneously. To that end, we first prove that multivariate polynomials can be approximated by deep ReLU networks of width $\mathcal{O}(N)$ and depth $\mathcal{O}(L)$ with an approximation error $\mathcal{O}(N^{-L})$. Through local Taylor expansions and their deep ReLU network approximations, we show that deep ReLU networks of width $\mathcal{O}(N\ln N)$ and depth $\mathcal{O}(L\ln L)$ can approximate $f\in C^s([0,1]^d)$ with a nearly optimal approximation error $\mathcal{O}(\|f\|_{C^s([0,1]^d)}N^{-2s/d}L^{-2s/d})$. Our estimate is non-asymptotic in the sense that it is valid for arbitrary width and depth specified by $N\in\mathbb{N}^+$ and $L\in\mathbb{N}^+$, respectively.
    Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations. (arXiv:2104.12384v2 [stat.ML] UPDATED)
    (0 min) We present a framework that allows for the non-asymptotic study of the $2$-Wasserstein distance between the invariant distribution of an ergodic stochastic differential equation and the distribution of its numerical approximation in the strongly log-concave case. This allows us to study in a unified way a number of different integrators proposed in the literature for the overdamped and underdamped Langevin dynamics. In addition, we analyse a novel splitting method for the underdamped Langevin dynamics which only requires one gradient evaluation per time step. Under an additional smoothness assumption on a $d$--dimensional strongly log-concave distribution with condition number $\kappa$, the algorithm is shown to produce with an $\mathcal{O}\big(\kappa^{5/4} d^{1/4}\epsilon^{-1/2} \big)$ complexity samples from a distribution that, in Wasserstein distance, is at most $\epsilon>0$ away from the target distribution.
    Fast Design Space Exploration of Nonlinear Systems: Part I. (arXiv:2104.01747v4 [cs.LG] UPDATED)
    (0 min) System design tools are often only available as input-output blackboxes: for a given design as input they compute an output representing system behavior. Blackboxes are intended to be run in the forward direction. This paper presents a new method of solving the inverse design problem: given requirements or constraints on the output, find an input that satisfies them while also optimizing an objective function. This problem is challenging for several reasons. First, blackboxes are not designed to be run in reverse. Second, inputs and outputs can be discrete and continuous. Third, finding designs concurrently satisfying a set of requirements is hard because designs satisfying individual requirements may conflict with each other. Fourth, blackbox evaluations can be expensive. Finally, blackboxes can sometimes fail to produce an output. This paper presents CNMA, a new method of solving the inverse problem that overcomes these challenges. CNMA tries to sample only the part of the design space relevant to solving the problem, leveraging the power of neural networks, Mixed Integer Linear Programs, and a new learning-from-failure feedback loop. The paper also presents a parallel version of CNMA that improves the efficiency and quality of solutions over the sequential version, and tries to steer it away from local optima. CNMA's performance is evaluated against conventional optimization methods for seven nonlinear design problems of 8 (two problems), 10, 15, 36 and 60 real-valued dimensions and one with 186 binary dimensions. Conventional methods evaluated are off-the-shelf implementations of Bayesian Optimization with Gaussian Processes, Nelder Mead and Random Search. The first two do not solve problems that are high-dimensional, have discrete and continuous variables or whose blackboxes can fail to return values. CNMA solves all problems, and surpasses the performance of conventional methods by 1%-87%.
    Towards A Measure Of General Machine Intelligence. (arXiv:2109.12075v1 [cs.AI])
    (0 min) To build increasingly general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure precisely how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, or how dissimilar it is from the system's prior knowledge and experience. If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) to solve tasks in that domain, current benchmarks do not quantitatively measure the efficiency of acquiring new skills, making it possible to brute-force skill acquisition by training with unlimited amounts of data and compute power. With this in mind, we first propose a common language of instruction, i.e. a programming language that allows the expression of programs in the form of directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these to define a numeric benchmark called the g-index to measure and compare the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.
    Scan-based Semantic Segmentation of LiDAR Point Clouds: An Experimental Study. (arXiv:2004.11803v3 [cs.CV] UPDATED)
    (0 min) Autonomous vehicles need to have a semantic understanding of the three-dimensional world around them in order to reason about their environment. State of the art methods use deep neural networks to predict semantic classes for each point in a LiDAR scan. A powerful and efficient way to process LiDAR measurements is to use two-dimensional, image-like projections. In this work, we perform a comprehensive experimental study of image-based semantic segmentation architectures for LiDAR point clouds. We demonstrate various techniques to boost the performance and to improve runtime as well as memory constraints. First, we examine the effect of network size and suggest that much faster inference times can be achieved at a very low cost to accuracy. Next, we introduce an improved point cloud projection technique that does not suffer from systematic occlusions. We use a cyclic padding mechanism that provides context at the horizontal field-of-view boundaries. In a third part, we perform experiments with a soft Dice loss function that directly optimizes for the intersection-over-union metric. Finally, we propose a new kind of convolution layer with a reduced amount of weight-sharing along one of the two spatial dimensions, addressing the large difference in appearance along the vertical axis of a LiDAR scan. We propose a final set of the above methods with which the model achieves an increase of 3.2% in mIoU segmentation performance over the baseline while requiring only 42% of the original inference time.
    Hamiltonian-based Neural ODE Networks on the SE(3) Manifold For Dynamics Learning and Control. (arXiv:2106.12782v2 [cs.RO] UPDATED)
    (0 min) Accurate models of robot dynamics are critical for safe and stable control and generalization to novel operational conditions. Hand-designed models, however, may be insufficiently accurate, even after careful parameter tuning. This motivates the use of machine learning techniques to approximate the robot dynamics over a training set of state-control trajectories. The dynamics of many robots, including ground, aerial, and underwater vehicles, are described in terms of their SE(3) pose and generalized velocity, and satisfy conservation of energy principles. This paper proposes a Hamiltonian formulation over the SE(3) manifold of the structure of a neural ordinary differential equation (ODE) network to approximate the dynamics of a rigid body. In contrast to a black-box ODE network, our formulation guarantees total energy conservation by construction. We develop energy shaping and damping injection control for the learned, potentially under-actuated SE(3) Hamiltonian dynamics to enable a unified approach for stabilization and trajectory tracking with various platforms, including pendulum, rigid-body, and quadrotor systems.
    A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research. (arXiv:2009.06520v2 [cs.SE] UPDATED)
    (0 min) An increasingly popular set of techniques adopted by software engineering (SE) researchers to automate development tasks are those rooted in the concept of Deep Learning (DL). The popularity of such techniques largely stems from their automated feature engineering capabilities, which aid in modeling software artifacts. However, due to the rapid pace at which DL techniques have been adopted, it is difficult to distill the current successes, failures, and opportunities of the current research landscape. In an effort to bring clarity to this crosscutting area of work, from its modern inception to the present, this paper presents a systematic literature review of research at the intersection of SE & DL. The review canvases work appearing in the most prominent SE and DL conferences and journals and spans 128 papers across 23 unique SE tasks. We center our analysis around the components of learning, a set of principles that govern the application of machine learning techniques (ML) to a given problem domain, discussing several aspects of the surveyed work at a granular level. The end result of our analysis is a research roadmap that both delineates the foundations of DL techniques applied to SE research, and highlights likely areas of fertile exploration for the future.
    The Open Catalyst 2020 (OC20) Dataset and Community Challenges. (arXiv:2010.09990v5 [cond-mat.mtrl-sci] UPDATED)
    (0 min) Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than related fields. To address this we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, Dimenet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, as well as a public leader board to encourage community contributions to solve these important tasks.
    An Empirical Study of Assumptions in Bayesian Optimisation. (arXiv:2012.03826v5 [cs.LG] UPDATED)
    (0 min) In this work we rigorously analyse assumptions inherent to black-box optimisation hyper-parameter tuning tasks. Our results on the Bayesmark benchmark indicate that heteroscedasticity and non-stationarity pose significant challenges for black-box optimisers. Based on these findings, we propose a Heteroscedastic and Evolutionary Bayesian Optimisation solver (HEBO). HEBO performs non-linear input and output warping, admits exact marginal log-likelihood optimisation and is robust to the values of learned parameters. We demonstrate HEBO's empirical efficacy on the NeurIPS 2020 Black-Box Optimisation challenge, where HEBO placed first. Upon further analysis, we observe that HEBO significantly outperforms existing black-box optimisers on 108 machine learning hyperparameter tuning tasks comprising the Bayesmark benchmark. Our findings indicate that the majority of hyper-parameter tuning tasks exhibit heteroscedasticity and non-stationarity, multi-objective acquisition ensembles with Pareto front solutions improve queried configurations, and robust acquisition maximisers afford empirical advantages relative to their non-robust counterparts. We hope these findings may serve as guiding principles for practitioners of Bayesian optimisation. All code is made available at https://github.com/huawei-noah/HEBO.
    GERNERMED -- An Open German Medical NER Model. (arXiv:2109.12104v1 [cs.CL])
    (0 min) The current state of adoption of well-structured electronic health records and the integration of digital methods for storing medical patient data in structured formats can often be considered inferior to the use of traditional, unstructured, text-based patient data documentation. Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant information. In natural language processing (NLP), statistical models have been shown to be successful in various tasks like part-of-speech tagging, relation extraction (RE) and named entity recognition (NER). In this work, we present GERNERMED, the first open, neural NLP model for NER tasks dedicated to detecting medical entity types in German text data. Here, we avoid the conflicting goals of protecting sensitive patient data from training data extraction and publishing the statistical model weights by training our model on a custom dataset that was translated from publicly available datasets in a foreign language by a pretrained neural machine translation model. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/GERNERMED
    Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends. (arXiv:2001.00378v2 [cs.SD] UPDATED)
    (0 min) Research on speech processing has traditionally considered the task of designing hand-engineered acoustic features (feature engineering) as a separate, distinct problem from the task of designing efficient machine learning (ML) models to make prediction and classification decisions. There are two main drawbacks to this approach: firstly, manual feature engineering is cumbersome and requires human knowledge; and secondly, the designed features might not be best for the objective at hand. This has motivated the adoption of a recent trend in the speech community towards the utilisation of representation learning techniques, which can automatically learn an intermediate representation of the input signal that better suits the task at hand and hence leads to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the representations are more useful and less dependent on human knowledge, making them very conducive for tasks like classification, prediction, etc. The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning by bringing together the scattered research across three distinct research areas including Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent reviews in speech have been conducted for ASR, SR, and SER; however, none of these has focused on representation learning from speech -- a gap that our survey aims to bridge.
    Quantitative analysis of phase transitions in two-dimensional XY models using persistent homology. (arXiv:2109.10960v1 [cond-mat.stat-mech] CROSS LISTED)
    (0 min) We use persistent homology and persistence images as an observable of three different variants of the two-dimensional XY model in order to identify and study their phase transitions. We examine models with the classical XY action, a topological lattice action, and an action with an additional nematic term. In particular, we introduce a new way of computing the persistent homology of lattice spin model configurations and, by considering the fluctuations in the output of logistic regression and k-nearest neighbours models trained on persistence images, we develop a methodology to extract estimates of the critical temperature and the critical exponent of the correlation length. We put particular emphasis on finite-size scaling behaviour and producing estimates with quantifiable error. For each model we successfully identify its phase transition(s) and are able to get an accurate determination of the critical temperatures and critical exponents of the correlation length.
    Combining Discrete Choice Models and Neural Networks through Embeddings: Formulation, Interpretability and Performance. (arXiv:2109.12042v1 [stat.ML])
    (0 min) This study proposes a novel approach that combines theory and data-driven choice models using Artificial Neural Networks (ANNs). In particular, we use continuous vector representations, called embeddings, for encoding categorical or discrete explanatory variables with a special focus on interpretability and model transparency. Although embedding representations within the logit framework have been conceptualized by Camara (2019), their dimensions do not have an absolute definitive meaning, hence offering limited behavioral insights. The novelty of our work lies in enforcing interpretability on the embedding vectors by formally associating each of their dimensions with a choice alternative. Thus, our approach brings benefits well beyond a more parsimonious representation than dummy encoding, as it provides behaviorally meaningful outputs that can be used in travel demand analysis and policy decisions. Additionally, in contrast to previously suggested ANN-based Discrete Choice Models (DCMs) that either sacrifice interpretability for performance or are only partially interpretable, our models preserve interpretability of the utility coefficients for all the input variables despite being based on ANN principles. The proposed models were tested on two real-world datasets and evaluated against benchmark and baseline models that use dummy encoding. The results of the experiments indicate that our models deliver state-of-the-art predictive performance, outperforming existing ANN-based models while drastically reducing the number of required network parameters.
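    A toy sketch of the core idea of tying embedding dimensions to choice alternatives is given below (numpy); the sizes, variable names and the way the embedding enters the utility are illustrative assumptions, not the paper's specification.

        # Sketch only: a categorical variable contributes one utility offset per alternative,
        # so each embedding weight reads as that category's effect on a specific alternative.
        import numpy as np

        rng = np.random.default_rng(0)
        n_categories, n_alternatives = 10, 3                    # e.g. trip purpose -> {car, bus, walk}
        E = rng.normal(size=(n_categories, n_alternatives))     # learnable embedding table

        def utilities(x_numeric, beta, category_id):
            """x_numeric: (n_alternatives, k) standard attributes; beta: (k,) their coefficients."""
            return x_numeric @ beta + E[category_id]            # interpretable per-alternative offsets

        def choice_probabilities(v):
            z = np.exp(v - v.max())
            return z / z.sum()                                  # multinomial logit over alternatives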
    A spatiotemporal machine learning approach to forecasting COVID-19 incidence at the county level in the United States. (arXiv:2109.12094v1 [stat.ML])
    (0 min) With COVID-19 affecting every country globally and changing everyday life, the ability to forecast the spread of the disease is more important than any previous epidemic. The conventional methods of disease-spread modeling, compartmental models, are based on the assumption of spatiotemporal homogeneity of the spread of the virus, which may cause forecasting to underperform, especially at high spatial resolutions. In this paper we approach the forecasting task with an alternative technique -- spatiotemporal machine learning. We present COVID-LSTM, a data-driven model based on a Long Short-term Memory deep learning architecture for forecasting COVID-19 incidence at the county-level in the US. We use the weekly number of new positive cases as temporal input, and hand-engineered spatial features from Facebook movement and connectedness datasets to capture the spread of the disease in time and space. COVID-LSTM outperforms the COVID-19 Forecast Hub's Ensemble model (COVIDhub-ensemble) on our 17-week evaluation period, making it the first model to be more accurate than the COVIDhub-ensemble over one or more forecast periods. Over the 4-week forecast horizon, our model is on average 50 cases per county more accurate than the COVIDhub-ensemble. We highlight that the underutilization of data-driven forecasting of disease spread prior to COVID-19 is likely due to the lack of sufficient data available for previous diseases, in addition to the recent advances in machine learning methods for spatiotemporal forecasting. We discuss the impediments to the wider uptake of data-driven forecasting, and whether it is likely that more deep learning-based models will be used in the future.
    CLIPort: What and Where Pathways for Robotic Manipulation. (arXiv:2109.12098v1 [cs.RO])
    (0 min) How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data, however these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better or comparable to single-task policies.
    ValNorm Quantifies Semantics to Reveal Consistent Valence Biases Across Languages and Over Centuries. (arXiv:2006.03950v4 [cs.CY] UPDATED)
    (0 min) Word embeddings learn implicit biases from linguistic regularities captured by word co-occurrence statistics. By extending methods that quantify human-like biases in word embeddings, we introduceValNorm, a novel intrinsic evaluation task and method to quantify the valence dimension of affect in human-rated word sets from social psychology. We apply ValNorm on static word embeddings from seven languages (Chinese, English, German, Polish, Portuguese, Spanish, and Turkish) and from historical English text spanning 200 years. ValNorm achieves consistently high accuracy in quantifying the valence of non-discriminatory, non-social group word sets. Specifically, ValNorm achieves a Pearson correlation of r=0.88 for human judgment scores of valence for 399 words collected to establish pleasantness norms in English. In contrast, we measure gender stereotypes using the same set of word embeddings and find that social biases vary across languages. Our results indicate that valence associations of non-discriminatory, non-social group words represent widely-shared associations, in seven languages and over 200 years.
    Sinkhorn Distributionally Robust Optimization. (arXiv:2109.11926v1 [math.OC])
    (0 min) We study distributionally robust optimization with Sinkhorn distance -- a variant of Wasserstein distance based on entropic regularization. We derive convex programming dual reformulations when the nominal distribution is an empirical distribution and a general distribution, respectively. Compared with Wasserstein DRO, it is computationally tractable for a larger class of loss functions, and its worst-case distribution is more reasonable. To solve the dual reformulation, we propose an efficient batch gradient descent with a bisection search algorithm. Finally, we provide various numerical examples using both synthetic and real data to demonstrate its competitive performance.
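    As background on the distance itself (not the DRO reformulation), the classic Sinkhorn iterations for entropy-regularized optimal transport can be sketched as follows (numpy).

        # Sketch only: Sinkhorn scaling iterations for entropic optimal transport.
        import numpy as np

        def sinkhorn_cost(a, b, C, reg=0.1, n_iter=200):
            """a, b: source/target probability vectors; C: (len(a), len(b)) cost matrix."""
            K = np.exp(-C / reg)
            u = np.ones_like(a)
            for _ in range(n_iter):
                v = b / (K.T @ u)            # scale to match the target marginal
                u = a / (K @ v)              # scale to match the source marginal
            P = u[:, None] * K * v[None, :]  # (approximately) optimal transport plan
            return np.sum(P * C)             # transport cost under that plan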
    mfEGRA: Multifidelity Efficient Global Reliability Analysis through Active Learning for Failure Boundary Location. (arXiv:1910.02497v5 [stat.ML] UPDATED)
    (0 min) This paper develops mfEGRA, a multifidelity active learning method using data-driven adaptively refined surrogates for failure boundary location in reliability analysis. This work addresses the issue of prohibitive cost of reliability analysis using Monte Carlo sampling for expensive-to-evaluate high-fidelity models by using cheaper-to-evaluate approximations of the high-fidelity model. The method builds on the Efficient Global Reliability Analysis (EGRA) method, which is a surrogate-based method that uses adaptive sampling for refining Gaussian process surrogates for failure boundary location using a single-fidelity model. Our method introduces a two-stage adaptive sampling criterion that uses a multifidelity Gaussian process surrogate to leverage multiple information sources with different fidelities. The method combines the expected feasibility criterion from EGRA with one-step lookahead information gain to refine the surrogate around the failure boundary. The computational savings from mfEGRA depend on the discrepancy between the different models, and the relative cost of evaluating the different models as compared to the high-fidelity model. We show that accurate estimation of reliability using mfEGRA leads to computational savings of $\sim$46% for an analytic multimodal test problem and 24% for a three-dimensional acoustic horn problem, when compared to single-fidelity EGRA. We also show the effect of using a priori drawn Monte Carlo samples in the implementation for the acoustic horn problem, where mfEGRA leads to computational savings of 45% for the three-dimensional case and 48% for a rarer event four-dimensional case as compared to single-fidelity EGRA.
    Adaptive Clustering-based Reduced-Order Modeling Framework: Fast and accurate modeling of localized history-dependent phenomena. (arXiv:2109.11897v1 [math.NA])
    (0 min) This paper proposes a novel Adaptive Clustering-based Reduced-Order Modeling (ACROM) framework to significantly improve and extend the recent family of clustering-based reduced-order models (CROMs). This adaptive framework enables the clustering-based domain decomposition to evolve dynamically throughout the problem solution, ensuring optimum refinement in regions where the relevant fields present steeper gradients. It offers a new route to fast and accurate material modeling of history-dependent nonlinear problems involving highly localized plasticity and damage phenomena. The overall approach is composed of three main building blocks: target clusters selection criterion, adaptive cluster analysis, and computation of cluster interaction tensors. In addition, an adaptive clustering solution rewinding procedure and a dynamic adaptivity split factor strategy are suggested to further enhance the adaptive process. The coined Adaptive Self-Consistent Clustering Analysis (ASCA) is shown to perform better than its static counterpart when capturing the multi-scale elasto-plastic behavior of a particle-matrix composite and predicting the associated fracture and toughness. Given the encouraging results shown in this paper, the ACROM framework sets the stage and opens new avenues to explore adaptivity in the context of CROMs.
    Discovering PDEs from Multiple Experiments. (arXiv:2109.11939v1 [stat.ML])
    (0 min) Automated model discovery of partial differential equations (PDEs) usually considers a single experiment or dataset to infer the underlying governing equations. In practice, experiments have inherent natural variability in parameters, initial and boundary conditions that cannot be simply averaged out. We introduce a randomised adaptive group Lasso sparsity estimator to promote grouped sparsity and implement it in a deep learning based PDE discovery framework. It creates a learning bias that encodes the a priori assumption that all experiments can be explained by the same underlying PDE terms with potentially different coefficients. Our experimental results show that more generalizable PDEs can be found from multiple highly noisy datasets through this grouped sparsity promotion, rather than by simply performing independent model discoveries.
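    The grouped-sparsity mechanism can be illustrated with the group soft-thresholding operator that underlies group Lasso: all coefficients of one candidate PDE term (across experiments) are shrunk jointly or zeroed out together. A minimal numpy sketch, not the paper's randomised adaptive estimator:

        # Sketch only: proximal (soft-thresholding) step of group Lasso; each row holds one
        # candidate PDE term with one coefficient per experiment.
        import numpy as np

        def group_soft_threshold(coeffs, lam):
            out = np.zeros_like(coeffs)
            for t in range(coeffs.shape[0]):
                norm = np.linalg.norm(coeffs[t])
                if norm > lam:                              # keep the term, shrink it jointly...
                    out[t] = (1.0 - lam / norm) * coeffs[t]
            return out                                      # ...otherwise the term is dropped everywhere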
    The Mirror Langevin Algorithm Converges with Vanishing Bias. (arXiv:2109.12077v1 [cs.DS])
    (0 min) The technique of modifying the geometry of a problem from Euclidean to Hessian metric has proved to be quite effective in optimization, and has been the subject of study for sampling. The Mirror Langevin Diffusion (MLD) is a sampling analogue of mirror flow in continuous time, and it has nice convergence properties under log-Sobolev or Poincare inequalities relative to the Hessian metric, as shown by Chewi et al. (2020). In discrete time, a simple discretization of MLD is the Mirror Langevin Algorithm (MLA) studied by Zhang et al. (2020), who showed a biased convergence bound with a non-vanishing bias term (does not go to zero as step size goes to zero). This raised the question of whether we need a better analysis or a better discretization to achieve a vanishing bias. Here we study the basic Mirror Langevin Algorithm and show it indeed has a vanishing bias. We apply mean-square analysis based on Li et al. (2019) and Li et al. (2021) to show the mixing time bound for MLA under the modified self-concordance condition introduced by Zhang et al. (2020).
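    For orientation, one commonly stated form of a single Mirror Langevin Algorithm step is sketched below; the callables grad_f, grad_phi, inv_grad_phi and sqrt_hess_phi (gradient of the potential, the mirror map, its inverse, and a square root of the mirror map's Hessian) are assumptions supplied by the user, and the discretization details may differ from the papers cited above.

        # Sketch only: Euler step of the mirror Langevin diffusion taken in the dual space.
        import numpy as np

        def mla_step(x, h, grad_f, grad_phi, inv_grad_phi, sqrt_hess_phi, rng):
            xi = rng.standard_normal(x.shape)
            y = grad_phi(x) - h * grad_f(x) + np.sqrt(2.0 * h) * (sqrt_hess_phi(x) @ xi)
            return inv_grad_phi(y)       # map the dual iterate back to the primal space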
    Optimal policy evaluation using kernel-based temporal difference methods. (arXiv:2109.12002v1 [stat.ML])
    (0 min) We study methods based on reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). We study a regularized form of the kernel least-squares temporal difference (LSTD) estimate; in the population limit of infinite data, it corresponds to the fixed point of a projected Bellman operator defined by the associated reproducing kernel Hilbert space. The estimator itself is obtained by computing the projected fixed point induced by a regularized version of the empirical operator; due to the underlying kernel structure, this reduces to solving a linear system involving kernel matrices. We analyze the error of this estimate in the $L^2(\mu)$-norm, where $\mu$ denotes the stationary distribution of the underlying Markov chain. Our analysis imposes no assumptions on the transition operator of the Markov chain, but rather only conditions on the reward function and population-level kernel LSTD solutions. We use empirical process theory techniques to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which shows that our rate is optimal in terms of the sample size $n$ and the effective horizon $H = (1 - \gamma)^{-1}$. Whereas existing worst-case theory predicts cubic scaling ($H^3$) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.
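    As a finite-dimensional point of reference for the kernel LSTD estimate discussed above, here is a regularized LSTD sketch in an explicit feature space (numpy); the kernel version replaces the features with an RKHS and the ridge term with a kernel-matrix regularizer, so this is an analogy rather than the paper's estimator.

        # Sketch only: ridge-regularized LSTD from transition tuples (s, r, s').
        import numpy as np

        def lstd_ridge(phi, phi_next, rewards, gamma=0.99, ridge=1e-3):
            """phi, phi_next: (n, d) features of states and successor states; rewards: (n,)."""
            n = len(rewards)
            A = phi.T @ (phi - gamma * phi_next) / n
            b = phi.T @ rewards / n
            theta = np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)
            return theta                 # value estimate: V(s) ~= phi(s) @ theta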
    A Graph Policy Network Approach for Volt-Var Control in Power Distribution Systems. (arXiv:2109.12073v1 [cs.LG])
    (0 min) Volt-var control (VVC) is the problem of operating power distribution systems within healthy regimes by controlling actuators in power systems. Existing works have mostly adopted the conventional routine of representing the power systems (a graph with tree topology) as vectors to train deep reinforcement learning (RL) policies. We propose a framework that combines RL with graph neural networks and study the benefits and limitations of graph-based policies in the VVC setting. Our results show that graph-based policies converge to the same rewards asymptotically, although at a slower rate than their vector-representation counterparts. We conduct further analysis on the impact of both observations and actions: on the observation end, we examine the robustness of graph-based policies to two typical data acquisition errors in power systems, namely sensor communication failure and measurement misalignment. On the action end, we show that actuators have various impacts on the system, so using a graph representation induced by the power system's topology may not be the optimal choice. In the end, we conduct a case study to demonstrate that the choice of readout function architecture and graph augmentation can further improve training performance and robustness.
    From images in the wild to video-informed image classification. (arXiv:2109.12040v1 [cs.CV])
    (0 min) Image classifiers work effectively when applied to structured images, yet they often fail when applied to images with very high visual complexity. This paper describes experiments applying state-of-the-art object classifiers to a unique set of images in the wild with high visual complexity collected on the island of Bali. The text describes differences between actual images in the wild and images from ImageNet, and then discusses a novel approach combining informational cues particular to video with an ensemble of imperfect classifiers in order to improve classification results on video-sourced images of plants in the wild.
    Noise-Induced Barren Plateaus in Variational Quantum Algorithms. (arXiv:2007.14384v4 [quant-ph] UPDATED)
    (0 min) Variational Quantum Algorithms (VQAs) may be a path to quantum advantage on Noisy Intermediate-Scale Quantum (NISQ) computers. A natural question is whether noise on NISQ devices places fundamental limitations on VQA performance. We rigorously prove a serious limitation for noisy VQAs, in that the noise causes the training landscape to have a barren plateau (i.e., vanishing gradient). Specifically, for the local Pauli noise considered, we prove that the gradient vanishes exponentially in the number of qubits $n$ if the depth of the ansatz grows linearly with $n$. These noise-induced barren plateaus (NIBPs) are conceptually different from noise-free barren plateaus, which are linked to random parameter initialization. Our result is formulated for a generic ansatz that includes as special cases the Quantum Alternating Operator Ansatz and the Unitary Coupled Cluster Ansatz, among others. For the former, our numerical heuristics demonstrate the NIBP phenomenon for a realistic hardware noise model.
    Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction. (arXiv:2109.12008v1 [cs.CL])
    (0 min) State-of-the-art NLP models can adopt shallow heuristics that limit their generalization capability (McCoy et al., 2019). Such heuristics include lexical overlap with the training set in Named-Entity Recognition (Taillé et al., 2020) and Event or Type heuristics in Relation Extraction (Rosenman et al., 2020). In the more realistic end-to-end RE setting, we can expect yet another heuristic: the mere retention of training relation triples. In this paper, we propose several experiments confirming that retention of known facts is a key factor of performance on standard benchmarks. Furthermore, one experiment suggests that a pipeline model able to use intermediate type representations is less prone to over-rely on retention.
    Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability. (arXiv:2109.11792v1 [cs.LG])
    (0 min) In the Bayesian reinforcement learning (RL) setting, a prior distribution over the unknown problem parameters -- the rewards and transitions -- is assumed, and a policy that optimizes the (posterior) expected return is sought. A common approximation, which has been recently popularized as meta-RL, is to train the agent on a sample of $N$ problem instances from the prior, with the hope that for large enough $N$, good generalization behavior to an unseen test instance will be obtained. In this work, we study generalization in Bayesian RL under the probably approximately correct (PAC) framework, using the method of algorithmic stability. Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense. Most stability results in the literature build on strong convexity of the regularized loss -- an approach that is not suitable for RL as Markov decision processes (MDPs) are not convex. Instead, building on recent results of fast convergence rates for mirror descent in regularized MDPs, we show that regularized MDPs satisfy a certain quadratic growth criterion, which is sufficient to establish stability. This result, which may be of independent interest, allows us to study the effect of regularization on generalization in the Bayesian RL setting.
    Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. (arXiv:2109.11978v1 [cs.RO])
    (0 min) In this work, we present and study a training set-up that achieves fast policy generation for real-world robotic tasks by using massive parallelism on a single workstation GPU. We analyze and discuss the impact of different training algorithm components in the massively parallel regime on the final policy performance and training times. In addition, we present a novel game-inspired curriculum that is well suited for training with thousands of simulated robots in parallel. We evaluate the approach by training the quadrupedal robot ANYmal to walk on challenging terrain. The parallel approach allows training policies for flat terrain in under four minutes, and in twenty minutes for uneven terrain. This represents a speedup of multiple orders of magnitude compared to previous work. Finally, we transfer the policies to the real robot to validate the approach. We open-source our training code to help accelerate further research in the field of learned legged locomotion.
    Visual Scene Graphs for Audio Source Separation. (arXiv:2109.11955v1 [cs.CV])
    (0 min) State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each subgraph being associated with a unique sound obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal sub-graph embeddings of the visual graph using multi-head attention. These embeddings are used for conditioning an audio encoder-decoder towards source separation. Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds. In this paper, we also introduce an "in the wild" video dataset for sound source separation that contains multiple non-musical sources, which we call Audio Separation in the Wild (ASIW). This dataset is adapted from the AudioCaps dataset, and provides a challenging, natural, and daily-life setting for source separation. Thorough experiments on the proposed ASIW and the standard MUSIC datasets demonstrate state-of-the-art sound separation performance of our method against recent prior approaches.
    Learning Multi-Layered GBDT Via Back Propagation. (arXiv:2109.11863v1 [cs.LG])
    (0 min) Deep neural networks are able to learn multi-layered representations via back propagation (BP). Although the gradient boosting decision tree (GBDT) is effective for modeling tabular data, it is non-differentiable with respect to its input, which prevents it from learning multi-layered representations. In this paper, we propose a framework for learning multi-layered GBDTs via BP. We approximate the gradient of the GBDT based on linear regression. Specifically, we use linear regression to replace the constant value at each leaf, ignoring the contribution of individual samples to the tree structure. In this way, we estimate the gradient for intermediate representations, which facilitates BP for multi-layered GBDTs. Experiments show the effectiveness of the proposed method in terms of performance and representation ability. To the best of our knowledge, this is the first work to optimize multi-layered GBDTs via BP. This work opens a new possibility for exploring deep tree-based learning and combining GBDTs with neural networks.
    Theory of overparametrization in quantum neural networks. (arXiv:2109.11676v1 [quant-ph])
    (0 min) The prospect of achieving quantum advantage with Quantum Neural Networks (QNNs) is exciting. Understanding how QNN properties (e.g., the number of parameters $M$) affect the loss landscape is crucial to the design of scalable QNN architectures. Here, we rigorously analyze the overparametrization phenomenon in QNNs with periodic structure. We define overparametrization as the regime where the QNN has more than a critical number of parameters $M_c$ that allows it to explore all relevant directions in state space. Our main results show that the dimension of the Lie algebra obtained from the generators of the QNN is an upper bound for $M_c$, and for the maximal rank that the quantum Fisher information and Hessian matrices can reach. Underparametrized QNNs have spurious local minima in the loss landscape that start disappearing when $M\geq M_c$. Thus, the overparametrization onset corresponds to a computational phase transition where the QNN trainability is greatly improved by a more favorable landscape. We then connect the notion of overparametrization to the QNN capacity, so that when a QNN is overparametrized, its capacity achieves its maximum possible value. We run numerical simulations for eigensolver, compilation, and autoencoding applications to showcase the overparametrization computational phase transition. We note that our results also apply to variational quantum algorithms and quantum optimal control.
    Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models. (arXiv:2109.11762v1 [cs.DC])
    (0 min) Deep Neural Networks have gained significant traction due to their wide applicability in different domains. DNN sizes and training samples are constantly growing, making training of such workloads more challenging. Distributed training is a solution to reduce the training time. High-performance distributed training platforms should leverage multi-dimensional hierarchical networks, which interconnect accelerators through different levels of the network, to dramatically reduce the number of expensive NICs required for the scale-out network. However, this comes at the expense of communication overhead between distributed accelerators to exchange gradients or input/output activations. In order to allow further scaling of the workloads, communication overhead needs to be minimized. In this paper, we motivate the fact that in training platforms, adding more intermediate network dimensions is beneficial for efficiently mitigating the excessive use of expensive NIC resources. Further, we address different challenges of DNN training on hierarchical networks. We discuss, when designing the interconnect, how to distribute network bandwidth resources across different dimensions in order to (i) maximize the bandwidth utilization of all dimensions, and (ii) minimize the overall training time for the target workload. We then implement a framework that, for a given workload, determines the best network configuration that maximizes performance, or performance-per-cost.
    How Does Knowledge Graph Embedding Extrapolate to Unseen Data: a Semantic Evidence View. (arXiv:2109.11800v1 [cs.CL])
    (0 min) Knowledge Graph Embedding (KGE) aims to learn representations for entities and relations. Most KGE models have achieved great success, especially in extrapolation scenarios: given an unseen triple (h, r, t), a trained model can still correctly predict t from (h, r, ?), or h from (?, r, t), which is an impressive ability. However, most existing KGE works focus on designing delicate triple-modeling functions, which mainly tell us how to measure the plausibility of observed triples; we have limited understanding of why these methods can extrapolate to unseen data and which factors help KGE extrapolate. Therefore, in this work we study KGE extrapolation from a data-relevant view, asking two questions: 1. How does KGE extrapolate to unseen data? 2. How can we design KGE models with better extrapolation ability? For question 1, we first discuss the impact factors for extrapolation and, at the relation, entity, and triple levels respectively, propose three Semantic Evidences (SEs), which can be observed from the training set and provide important semantic information for extrapolation to unseen data. We then verify the effectiveness of SEs through extensive experiments on several typical KGE methods, and demonstrate that SEs play an important role in understanding the extrapolation ability of KGE. For question 2, to make better use of the SE information for more extrapolative knowledge representation, we propose a novel GNN-based KGE model, called Semantic Evidence aware Graph Neural Network (SE-GNN). Finally, through extensive experiments on the FB15k-237 and WN18RR datasets, we show that SE-GNN achieves state-of-the-art performance on the Knowledge Graph Completion task and exhibits better extrapolation ability.
    Online Adaptation of Parameters using GRU-based Neural Network with BO for Accurate Driving Model. (arXiv:2109.11720v1 [cs.LG])
    (0 min) Testing self-driving cars in different areas requires surrounding cars with correspondingly different driving styles, such as aggressive or conservative styles. A method of numerically measuring and differentiating human driving styles to create a virtual driver with a certain driving style is in demand. However, most methods for measuring human driving styles require thresholds or labels to classify the driving styles, and some require additional questionnaires for drivers about their driving attitude. These limitations are not suitable for creating a large virtual testing environment. Driving models (DMs) simulate human driving styles. Calibrating a DM makes the simulated driving behavior closer to human driving behavior and enables the simulation of human-driven cars. Conventional DM-calibrating methods do not take into account that the parameters in a DM vary while driving. These "fixed" calibrating methods cannot reflect an actual interactive driving scenario. In this paper, we propose a DM-calibration method for measuring human driving styles to reproduce real car-following behavior more accurately. The method includes 1) an objective entropy weight method for measuring and clustering human driving styles, and 2) online adaptation of DM parameters based on deep learning by combining Bayesian optimization (BO) and a gated recurrent unit neural network. We conducted experiments to evaluate the proposed method, and the results indicate that it can easily be used to measure human driving styles. The experiments also showed that we can calibrate a corresponding DM in a virtual testing environment with up to 26% higher accuracy than with fixed calibration methods.
    Dimension-Free Rates for Natural Policy Gradient in Multi-Agent Reinforcement Learning. (arXiv:2109.11692v1 [cs.LG])
    (0 min) Cooperative multi-agent reinforcement learning is a decentralized paradigm in sequential decision making where agents distributed over a network iteratively collaborate with neighbors to maximize global (network-wide) notions of rewards. Exact computations typically involve a complexity that scales exponentially with the number of agents. To address this curse of dimensionality, we design a scalable algorithm based on the Natural Policy Gradient framework that uses local information and only requires agents to communicate with neighbors within a certain range. Under standard assumptions on the spatial decay of correlations for the transition dynamics of the underlying Markov process and the localized learning policy, we show that our algorithm converges to the globally optimal policy with a dimension-free statistical and computational complexity, incurring a localization error that does not depend on the number of agents and converges to zero exponentially fast as a function of the range of communication.
    Few-shot Learning Based on Multi-stage Transfer and Class-Balanced Loss for Diabetic Retinopathy Grading. (arXiv:2109.11806v1 [eess.IV])
    (0 min) Diabetic retinopathy (DR) is one of the major blindness-causing diseases currently known. Automatic grading of DR using deep learning methods not only speeds up the diagnosis of the disease but also reduces the rate of misdiagnosis. However, problems such as insufficient samples and imbalanced class distribution in DR datasets have constrained the improvement of grading performance. In this paper, we introduce the idea of multi-stage transfer into the grading task of DR. The new transfer learning technique leverages multiple datasets with different scales to enable the model to learn more feature representation information. Meanwhile, to cope with imbalanced DR datasets, we present a class-balanced loss function that performs well in natural image classification tasks, and adopt a simple and easy-to-implement training method for it. The experimental results show that the application of multi-stage transfer and class-balanced loss function can effectively improve the grading performance metrics such as accuracy and quadratic weighted kappa. In fact, our method has outperformed two state-of-the-art methods and achieved the best result on the DR grading task of IDRiD Sub-Challenge 2.
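    The abstract does not name the class-balanced loss, but the description matches the effective-number re-weighting of Cui et al. (2019). The sketch below assumes that choice; the class counts, beta, and batch sizes are placeholders rather than the paper's settings.
    ```python
    import torch
    import torch.nn.functional as F

    def class_balanced_ce(logits, targets, samples_per_class, beta=0.9999):
        """Cross-entropy re-weighted by the 'effective number of samples'
        (1 - beta^n_c) / (1 - beta) per class -- an assumption about which
        class-balanced loss the abstract refers to."""
        n_c = torch.as_tensor(samples_per_class, dtype=torch.float32,
                              device=logits.device)
        effective_num = 1.0 - torch.pow(beta, n_c)
        weights = (1.0 - beta) / effective_num
        weights = weights / weights.sum() * len(samples_per_class)  # normalise
        return F.cross_entropy(logits, targets, weight=weights)

    # illustrative DR grading setup: 5 grades with a heavily skewed distribution
    logits = torch.randn(8, 5)
    targets = torch.randint(0, 5, (8,))
    loss = class_balanced_ce(logits, targets,
                             samples_per_class=[1805, 370, 999, 193, 295])
    print(loss.item())
    ```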
    GeomGCL: Geometric Graph Contrastive Learning for Molecular Property Prediction. (arXiv:2109.11730v1 [cs.LG])
    (0 min) Recently, many efforts have been devoted to applying graph neural networks (GNNs) to molecular property prediction, a fundamental task for computational drug and material discovery. One of the major obstacles hindering successful molecular property prediction by GNNs is the scarcity of labeled data. Though graph contrastive learning (GCL) methods have achieved extraordinary performance with insufficient labeled data, most of them focus on designing data augmentation schemes for general graphs. However, the fundamental properties of a molecule can be altered by augmentation methods (such as random perturbation) applied to molecular graphs, and the critical geometric information of molecules remains rarely explored under current GNN and GCL architectures. To this end, we propose GeomGCL, a novel graph contrastive learning method that utilizes the geometry of a molecule across 2D and 3D views. Specifically, we first devise a dual-view geometric message passing network (GeomMPNN) to adaptively leverage the rich information of both the 2D and 3D graphs of a molecule. The incorporation of geometric properties at different levels greatly facilitates molecular representation learning. Then a novel geometric graph contrastive scheme is designed to make both geometric views collaboratively supervise each other, improving the generalization ability of GeomMPNN. We evaluate GeomGCL on various downstream property prediction tasks via a fine-tuning process. Experimental results on seven real-life molecular datasets demonstrate the effectiveness of our proposed GeomGCL against state-of-the-art baselines.
    Training dataset generation for bridge game registration. (arXiv:2109.11861v1 [cs.CV])
    (0 min) This paper presents a method for the automatic generation of a training dataset for a deep convolutional neural network used for playing-card detection. The solution makes it possible to skip the time-consuming processes of manually collecting images and labelling the recognised objects. The YOLOv4 network trained on the generated dataset achieved 99.8% efficiency in the card detection task. The proposed method is part of a project that aims to automate the broadcasting of duplicate bridge competitions using a vision system and neural networks.
    A dynamic programming algorithm for informative measurements and near-optimal path-planning. (arXiv:2109.11808v1 [cs.LG])
    (0 min) An informative measurement is the most efficient way to gain information about an unknown state. We give a first-principles derivation of a general-purpose dynamic programming algorithm that returns a sequence of informative measurements by sequentially maximizing the entropy of possible measurement outcomes. This algorithm can be used by an autonomous agent or robot to decide where best to measure next, planning a path corresponding to an optimal sequence of informative measurements. The algorithm is applicable to states and controls that are continuous or discrete, and to agent dynamics that are either stochastic or deterministic, including Markov decision processes. Recent results from approximate dynamic programming and reinforcement learning, including online approximations such as rollout and Monte Carlo tree search, allow an agent or robot to solve the measurement task in real time. The resulting near-optimal solutions include non-myopic paths and measurement sequences that can generally outperform, sometimes substantially, commonly used greedy heuristics such as maximizing the entropy of each measurement outcome. This is demonstrated for a global search problem, where online planning with an extended local search is found to reduce the number of measurements in the search by half.
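    For reference, the greedy baseline the abstract contrasts against (maximize the entropy of each next measurement outcome) is only a few lines of code. The sketch below implements that one-step criterion, not the paper's non-myopic dynamic-programming planner; the belief model and candidate sensors are illustrative.
    ```python
    import numpy as np

    def outcome_entropy(belief, likelihood):
        """Entropy of the predictive distribution over measurement outcomes.
        belief: p(state), shape (S,).  likelihood: p(outcome | state), shape (O, S)."""
        p_outcome = likelihood @ belief
        p_outcome = p_outcome[p_outcome > 0]
        return -(p_outcome * np.log(p_outcome)).sum()

    def greedy_measurement(belief, candidate_likelihoods):
        """Pick the candidate measurement whose outcome distribution has maximum
        entropy (one-step lookahead only; the paper plans whole sequences)."""
        scores = [outcome_entropy(belief, L) for L in candidate_likelihoods]
        return int(np.argmax(scores))

    # toy search problem: 4 hidden locations, binary sensors probing subsets
    belief = np.array([0.4, 0.3, 0.2, 0.1])
    sensors = [
        np.array([[1, 1, 0, 0], [0, 0, 1, 1]], float),   # "is it in {0, 1}?"
        np.array([[1, 0, 0, 0], [0, 1, 1, 1]], float),   # "is it location 0?"
    ]
    print("best sensor:", greedy_measurement(belief, sensors))
    ```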
    Predicting pigging operations in oil pipelines. (arXiv:2109.11812v1 [eess.SP])
    (0 min) This paper presents an innovative machine learning methodology that leverages long-term vibroacoustic measurements to perform automated predictions of the needed pigging operations in crude oil trunklines. Historical pressure signals have been collected by Eni (e-vpms monitoring system) for two years at discrete points at a relative distance of 30-35 km along an oil pipeline (100 km length, 16 inch diameter pipes) located in Northern Italy. In order to speed up the activity and to check the operation logs, a tool has been implemented to automatically highlight the historical pig operations performed on the line. Such a tool is capable of detecting, in the observed pressure measurements, the acoustic noise generated by the travelling pig. All the data sets have been reanalyzed and exploited, using field data validations to guide a decision tree regressor (DTR). Several statistical indicators, computed from the pressure head loss between line segments, are fed to the DTR, which automatically outputs probability values indicating the possible need for pigging the pipeline. The procedure is applied to the vibroacoustic signals of each pair of consecutive monitoring stations, such that the proposed predictive maintenance strategy is capable of tracking the conditions of individual pipeline sections, thus determining which portion of the conduit is subject to the highest occlusion levels in order to optimize the clean-up operations. Prediction accuracy is assessed by evaluating the typical metrics used in statistical analysis of regression problems, such as the Root Mean Squared Error (RMSE).
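    The decision-tree-regressor stage described here is a standard scikit-learn building block. The snippet below is only an illustrative stand-in: the feature names, synthetic target, and hyperparameters are placeholders, since the real indicators and labels from the Eni pipeline are not public.
    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Stand-ins for the statistical indicators computed from the pressure head
    # loss between consecutive monitoring stations (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))        # e.g. mean/std/trend/skew of head loss
    pigging_need = 1 / (1 + np.exp(-(1.5 * X[:, 0] + 0.8 * X[:, 2])))  # synthetic target
    y = pigging_need + 0.05 * rng.normal(size=500)

    dtr = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20)
    dtr.fit(X, y)

    # the regressor outputs a score interpreted as a pigging-need probability
    new_segment_features = rng.normal(size=(1, 4))
    print("predicted pigging probability:",
          float(dtr.predict(new_segment_features)[0]))
    ```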
    Non-Euclidean Self-Organizing Maps. (arXiv:2109.11769v1 [cs.LG])
    (0 min) Self-Organizing Maps (SOMs, Kohonen networks) belong to the unsupervised class of neural network models. In this paper, we present a generalized setup for non-Euclidean SOMs. Most data analysts take it for granted that some subregion of a flat space is an adequate data model; by assuming instead that the underlying geometry is non-Euclidean, we obtain a new degree of freedom for the techniques that translate similarities into spatial neighborhood relationships. We improve the traditional SOM algorithm by introducing topology-related extensions. Our approach can be successfully applied to dimension reduction, clustering, or finding similarities in big data (both hierarchical and non-hierarchical).
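    To make the "new degree of freedom" concrete, here is a compact sketch of the classical SOM with the grid metric isolated in a single function; the non-Euclidean variants discussed in the abstract amount to swapping that neighborhood geometry (e.g. for a hyperbolic or spherical metric), which this sketch only hints at in a comment. Grid size, learning-rate schedule, and data are placeholders.
    ```python
    import numpy as np

    def train_som(data, grid_shape=(10, 10), n_iters=2000, lr0=0.5, sigma0=3.0, seed=0):
        """Classical (Euclidean-grid) SOM; a non-Euclidean SOM would replace
        grid_distance with a distance in the chosen non-flat geometry."""
        rng = np.random.default_rng(seed)
        rows, cols = grid_shape
        coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
        weights = rng.normal(size=(rows * cols, data.shape[1]))

        def grid_distance(bmu_idx):
            # Euclidean distance on the flat 2-D grid -- the piece to generalise
            return np.linalg.norm(coords - coords[bmu_idx], axis=1)

        for t in range(n_iters):
            x = data[rng.integers(len(data))]
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))     # best-matching unit
            lr = lr0 * np.exp(-t / n_iters)
            sigma = sigma0 * np.exp(-t / n_iters)
            h = np.exp(-grid_distance(bmu) ** 2 / (2 * sigma ** 2))  # neighbourhood
            weights += lr * h[:, None] * (x - weights)
        return weights.reshape(rows, cols, -1)

    som = train_som(np.random.default_rng(1).normal(size=(500, 3)))
    print(som.shape)   # (10, 10, 3)
    ```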
    Deep Learning with Kernel Flow Regularization for Time Series Forecasting. (arXiv:2109.11649v1 [cs.LG])
    (0 min) Long Short-Term Memory (LSTM) neural networks have been widely used for time series forecasting problems. However, LSTMs are prone to overfitting and performance reduction during test phases. Several regularization techniques have been shown in the literature to prevent overfitting in neural networks. In this paper, we first introduce the application of kernel flow methods to time series forecasting in general. Afterward, we examine the effectiveness of applying kernel flow regularization to LSTM layers to avoid overfitting. We describe a regularization method that applies a kernel flow loss function to LSTM layers. In experimental results, we show that kernel flow outperforms baseline models on time series forecasting benchmarks. We also compare the effect of dropout and kernel flow regularization techniques on LSTMs. The experimental results illustrate that kernel flow achieves a regularization effect similar to dropout. They also show that the best results are obtained by using both kernel flow and dropout regularization with early stopping on the LSTM layers for some time series datasets (e.g. power-load demand forecasts).
    Deep Social Force. (arXiv:2109.12081v1 [cs.LG])
    (0 min) The Social Force model introduced by Helbing and Molnar in 1995 is a cornerstone of pedestrian simulation. This paper introduces a differentiable simulation of the Social Force model where the assumptions on the shapes of interaction potentials are relaxed with the use of universal function approximators in the form of neural networks. Classical force-based pedestrian simulations suffer from unnatural locking behavior on head-on collision paths. In addition, they cannot model the bias of pedestrians to avoid each other on the right or left depending on the geographic region. My experiments with more general interaction potentials show that potentials with a sharp tip in the front avoid locking. In addition, asymmetric interaction potentials lead to a left or right bias when pedestrians avoid each other.
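    The core idea described here, replacing a hand-crafted interaction potential with a neural network and obtaining the force as its negative gradient, can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the paper's Deep Social Force implementation; the network size and inputs are assumptions.
    ```python
    import torch
    import torch.nn as nn

    class MLPPotential(nn.Module):
        """Hypothetical pedestrian interaction potential parameterized by a small
        MLP, standing in for the universal function approximator the abstract
        describes."""
        def __init__(self, hidden=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, rel_pos):            # rel_pos: (N, 2) vectors to neighbours
            return self.net(rel_pos).sum()     # scalar potential energy

    def interaction_force(potential, rel_pos):
        """Force = negative gradient of the learned potential w.r.t. relative positions."""
        rel_pos = rel_pos.clone().requires_grad_(True)
        energy = potential(rel_pos)
        (grad,) = torch.autograd.grad(energy, rel_pos, create_graph=True)
        return -grad                            # (N, 2) repulsive/attractive forces

    potential = MLPPotential()
    rel = torch.randn(5, 2)                     # 5 neighbouring pedestrians
    print(interaction_force(potential, rel).shape)   # torch.Size([5, 2])
    ```
    Because the force is differentiable in the network parameters, the whole simulation can be trained end-to-end against observed trajectories, which is what makes asymmetric or sharp-tipped potentials learnable rather than hand-designed.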
    Evaluating Attacker Risk Behavior in an Internet of Things Ecosystem. (arXiv:2109.11592v1 [cs.CR])
    (0 min) In cybersecurity, attackers range from brash, unsophisticated script kiddies and cybercriminals to stealthy, patient advanced persistent threats. When modeling these attackers, we can observe that they demonstrate different risk-seeking and risk-averse behaviors. This work explores how an attacker's risk seeking or risk averse behavior affects their operations against detection-optimizing defenders in an Internet of Things ecosystem. Using an evaluation framework which uses real, parametrizable malware, we develop a game that is played by a defender against attackers with a suite of malware that is parameterized to be more aggressive and more stealthy. These results are evaluated under a framework of exponential utility according to their willingness to accept risk. We find that against a defender who must choose a single strategy up front, risk-seeking attackers gain more actual utility than risk-averse attackers, particularly in cases where the defender is better equipped than the two attackers anticipate. Additionally, we empirically confirm that high-risk, high-reward scenarios are more beneficial to risk-seeking attackers like cybercriminals, while low-risk, low-reward scenarios are more beneficial to risk-averse attackers like advanced persistent threats.
    Optimal Decision Making in High-Throughput Virtual Screening Pipelines. (arXiv:2109.11683v1 [math.OC])
    (0 min) Effective selection of potential candidates that meet certain conditions in a tremendously large search space has been one of the major concerns in many real-world applications. In addition to the nearly infinitely large search space, rigorous evaluation of a sample on a reliable experimental or computational platform is often prohibitively expensive, making the screening problem more challenging. In such cases, constructing a high-throughput screening (HTS) pipeline that pre-sifts the samples expected to be potential candidates through efficient earlier stages results in significant resource savings. However, to the best of our knowledge, despite many successful applications, optimal pipeline design and optimal pipeline operation have not been studied. In this study, we propose two optimization frameworks, applicable to most (if not all) screening campaigns involving experimental and/or computational evaluations, for optimally determining the screening thresholds of an HTS pipeline. We validate the proposed frameworks on both analytic and practical scenarios. In particular, we consider the optimal computational campaign for long non-coding RNA (lncRNA) classification as a practical example. To accomplish this, we built a high-throughput virtual screening (HTVS) pipeline for classifying lncRNA. The simulation results demonstrate that the proposed frameworks significantly reduce the effective selection cost per potential candidate and make HTS pipelines less sensitive to their structural variations. In addition to the validation, we provide insights on constructing a better HTS pipeline based on the simulation results.
    Optimisation of MCTS Player for The Lord of the Rings: The Card Game. (arXiv:2109.12001v1 [cs.LG])
    (0 min) The article presents research on the use of Monte-Carlo Tree Search (MCTS) methods to create an artificial player for the popular card game "The Lord of the Rings". The game is characterized by complicated rules, multi-stage round construction, and a high level of randomness. The study found that the best win probability is achieved by a strategy combining expert knowledge-based agents with MCTS agents at different decision stages. It is also beneficial to replace random playouts with playouts using expert knowledge. The results of the final experiments indicate that the relative effectiveness of the developed solution grows as the difficulty of the game increases.
    Untrained Graph Neural Networks for Denoising. (arXiv:2109.11700v1 [eess.SP])
    (0 min) A fundamental problem in signal processing is to denoise a signal. While there are many well-performing methods for denoising signals defined on regular supports, such as images defined on two-dimensional grids of pixels, many important classes of signals are defined over irregular domains such as graphs. This paper introduces two untrained graph neural network architectures for graph signal denoising, provides theoretical guarantees for their denoising capabilities in a simple setup, and numerically validates the theoretical results in more general scenarios. The two architectures differ on how they incorporate the information encoded in the graph, with one relying on graph convolutions and the other employing graph upsampling operators based on hierarchical clustering. Each architecture implements a different prior over the targeted signals. To numerically illustrate the validity of the theoretical results and to compare the performance of the proposed architectures with other denoising alternatives, we present several experimental results with real and synthetic datasets.
    Approximate Latent Force Model Inference. (arXiv:2109.11851v1 [cs.LG])
    (0 min) Physically-inspired latent force models offer an interpretable alternative to purely data driven tools for inference in dynamical systems. They carry the structure of differential equations and the flexibility of Gaussian processes, yielding interpretable parameters and dynamics-imposed latent functions. However, the existing inference techniques associated with these models rely on the exact computation of posterior kernel terms which are seldom available in analytical form. Most applications relevant to practitioners, such as Hill equations or diffusion equations, are hence intractable. In this paper, we overcome these computational problems by proposing a variational solution to a general class of non-linear and parabolic partial differential equation latent force models. Further, we show that a neural operator approach can scale our model to thousands of instances, enabling fast, distributed computation. We demonstrate the efficacy and flexibility of our framework by achieving competitive performance on several tasks where the kernels are of varying degrees of tractability.
    Optimization Strategies in Multi-Task Learning: Averaged or Separated Losses?. (arXiv:2109.11678v1 [cs.LG])
    (0 min) In Multi-Task Learning (MTL), it is common practice to train multi-task networks by optimizing an objective function that is a weighted average of the task-specific objective functions. Although the computational advantages of this strategy are clear, the complexity of the resulting loss landscape has not been studied in the literature. Arguably, its optimization may be more difficult than a separate optimization of the constituent task-specific objectives. In this work, we investigate the benefits of such an alternative by alternating independent gradient descent steps on the different task-specific objective functions, and we formulate a novel way to combine this approach with state-of-the-art optimizers. As the separation of task-specific objectives comes at the cost of increased computational time, we propose random task grouping as a trade-off between better optimization and computational efficiency. Experimental results over three well-known visual MTL datasets show better overall absolute performance on losses and standard metrics compared to an averaged objective function and other state-of-the-art MTL methods. In particular, our method shows the most benefits when dealing with tasks of different nature and it enables a wider exploration of the shared parameter space. We also show that our random grouping strategy allows trading off between these benefits and computational efficiency.
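    The two strategies being compared are easy to state in code. The sketch below shows one step of the averaged objective versus alternating independent steps on each task-specific loss, for a toy hard-parameter-sharing model; the model, tasks, weights, and optimizer are placeholders, not the paper's setup.
    ```python
    import torch
    import torch.nn as nn

    # Toy hard-parameter-sharing MTL model (shared trunk + 3 task heads).
    shared = nn.Linear(16, 32)
    heads = nn.ModuleList([nn.Linear(32, 1) for _ in range(3)])
    params = list(shared.parameters()) + list(heads.parameters())
    opt = torch.optim.SGD(params, lr=1e-2)

    x = torch.randn(64, 16)
    ys = [torch.randn(64, 1) for _ in range(3)]
    task_loss = lambda i: nn.functional.mse_loss(heads[i](torch.relu(shared(x))), ys[i])

    # Strategy 1: one step on the averaged objective
    opt.zero_grad()
    sum(task_loss(i) for i in range(3)).div(3).backward()
    opt.step()

    # Strategy 2: alternating independent steps, one per task-specific objective
    for i in range(3):
        opt.zero_grad()
        task_loss(i).backward()
        opt.step()
    ```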
    Indoor Localization Using Smartphone Magnetic with Multi-Scale TCN and LSTM. (arXiv:2109.11750v1 [eess.SP])
    (0 min) A novel magnetic localization approach based on a multi-scale temporal convolutional network (TCN) and a long short-term memory network (LSTM) is proposed. To enhance the discernibility of geomagnetic signals, a time-series preprocessing approach is first constructed. Next, the TCN is invoked to expand the feature dimensions while keeping the time-series characteristics required by the LSTM model. Then, a multi-scale time-series layer is constructed with multiple TCNs of different dilation factors to address the inconsistent time-series speed between the localization model and mobile users. A stacking framework of multi-scale TCN and LSTM is finally proposed for indoor magnetic localization. Experimental results demonstrate the effectiveness of the proposed algorithm for indoor localization.
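    A rough PyTorch sketch of the stacking idea (parallel dilated temporal convolutions with different dilation factors, concatenated and fed to an LSTM) is below. The layer sizes, dilation factors, and output head are assumptions for illustration, not the paper's exact architecture.
    ```python
    import torch
    import torch.nn as nn

    class MultiScaleTCNLSTM(nn.Module):
        """Sketch: parallel dilated Conv1d branches (different dilation factors)
        concatenated along channels, followed by an LSTM and a location head."""
        def __init__(self, in_ch=3, branch_ch=16, dilations=(1, 2, 4), hidden=64, out_dim=2):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv1d(in_ch, branch_ch, kernel_size=3, dilation=d, padding=d)
                for d in dilations
            ])
            self.lstm = nn.LSTM(branch_ch * len(dilations), hidden, batch_first=True)
            self.head = nn.Linear(hidden, out_dim)     # e.g. 2-D indoor coordinates

        def forward(self, x):              # x: (batch, time, channels), e.g. magnetometer axes
            x = x.transpose(1, 2)          # Conv1d expects (batch, channels, time)
            feats = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
            out, _ = self.lstm(feats.transpose(1, 2))
            return self.head(out[:, -1])   # predict location from the last time step

    model = MultiScaleTCNLSTM()
    print(model(torch.randn(8, 100, 3)).shape)   # torch.Size([8, 2])
    ```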
    Document Automation Architectures and Technologies: A Survey. (arXiv:2109.11603v1 [cs.CL])
    (0 min) This paper surveys the current state of the art in document automation (DA). The objective of DA is to reduce the manual effort during the generation of documents by automatically integrating input from different sources and assembling documents conforming to defined templates. There have been reviews of commercial solutions of DA, particularly in the legal domain, but to date there has been no comprehensive review of the academic research on DA architectures and technologies. The current survey of DA reviews the academic literature and provides a clearer definition and characterization of DA and its features, identifies state-of-the-art DA architectures and technologies in academic research, and provides ideas that can lead to new research opportunities within the DA field in light of recent advances in artificial intelligence and deep neural networks.
    Adversarial Neural Trip Recommendation. (arXiv:2109.11731v1 [cs.LG])
    (0 min) Trip recommender systems, which aim to recommend a trip consisting of several ordered Points of Interest (POIs), have long been treated as an important application for many location-based services. Currently, most prior work generates trips following pre-defined objectives based on constraint programming, which may fail to reflect the complex latent patterns hidden in human mobility data. Moreover, most of these methods have difficulty responding in real time when the number of POIs is large. To that end, we propose an Adversarial Neural Trip Recommendation (ANT) framework to tackle the above challenges. First, we devise a novel attention-based encoder-decoder trip generator that can learn the correlations among POIs and generate well-designed trips under given constraints. Another novelty of ANT is an adversarial learning strategy, integrated with reinforcement learning, that guides the trip generator to produce high-quality trips. For this purpose, we introduce a discriminator, which distinguishes the generated trips from real-life trips taken by users, to provide reward signals to optimize the generator. Moreover, we devise a novel pre-training scheme based on learning from demonstration, which speeds up convergence and yields an efficient training process. Extensive experiments on four real-world datasets validate the effectiveness and efficiency of our proposed ANT framework, demonstrating that ANT remarkably outperforms state-of-the-art baselines with short response times.
    Edge but not Least: Cross-View Graph Pooling. (arXiv:2109.11796v1 [cs.LG])
    (0 min) Graph neural networks have emerged as a powerful model for graph representation learning to undertake graph-level prediction tasks. Various graph pooling methods have been developed to coarsen an input graph into a succinct graph-level representation through aggregating node embeddings obtained via graph convolution. However, most graph pooling methods are heavily node-centric and are unable to fully leverage the crucial information contained in global graph structure. This paper presents a cross-view graph pooling (Co-Pooling) method to better exploit crucial graph structure information. The proposed Co-Pooling fuses pooled representations learnt from both node view and edge view. Through cross-view interaction, edge-view pooling and node-view pooling seamlessly reinforce each other to learn more informative graph-level representations. Co-Pooling has the advantage of handling various graphs with different types of node attributes. Extensive experiments on a total of 15 graph benchmark datasets validate the effectiveness of our proposed method, demonstrating its superior performance over state-of-the-art pooling methods on both graph classification and graph regression tasks.
    CSAGN: Conversational Structure Aware Graph Network for Conversational Semantic Role Labeling. (arXiv:2109.11541v1 [cs.CL])
    (0 min) Conversational semantic role labeling (CSRL) is believed to be a crucial step towards dialogue understanding. However, it remains a major challenge for existing CSRL parsers to handle conversational structural information. In this paper, we present a simple and effective architecture for CSRL which aims to address this problem. Our model is based on a conversational structure-aware graph network which explicitly encodes speaker-dependent information. We also propose a multi-task learning method to further improve the model. Experimental results on benchmark datasets show that our model with the proposed training objectives significantly outperforms previous baselines.
    Holistic Semi-Supervised Approaches for EEG Representation Learning. (arXiv:2109.11732v1 [cs.LG])
    (0 min) Recently, supervised methods, which often require substantial amounts of class labels, have achieved promising results for EEG representation learning. However, labeling EEG data is a challenging task. More recently, holistic semi-supervised learning approaches, which require only a few output labels, have shown promising results in the field of computer vision. These methods, however, have not yet been adapted for EEG learning. In this paper, we adapt three state-of-the-art holistic semi-supervised approaches, namely MixMatch, FixMatch, and AdaMatch, as well as five classical semi-supervised methods for EEG learning. We perform rigorous experiments with all 8 methods on two public EEG-based emotion recognition datasets, namely SEED and SEED-IV. The experiments with different amounts of limited labeled samples show that the holistic approaches achieve strong results even when only 1 labeled sample is used per class. Further experiments show that in most cases, AdaMatch is the most effective method, followed by MixMatch and FixMatch.
    Text Ranking and Classification using Data Compression. (arXiv:2109.11577v1 [cs.LG])
    (0 min) A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but their success depends on the compression tools used. We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting language-agnostic technique Zest. In applications, this approach simplifies configuration, avoiding careful feature extraction and large ML models. Our ablation studies confirm the value of individual enhancements we introduce. We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets.
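    The underlying conditional-entropy idea (not the paper's Zest system itself) can be sketched in a few lines: score a document against each class by how much the Zstandard-compressed size grows when the document is appended to that class's reference text. The corpora, compression level, and class names below are placeholders; zlib could be substituted if the zstandard package is unavailable.
    ```python
    import zstandard as zstd

    def compressed_size(data: bytes, level: int = 19) -> int:
        return len(zstd.ZstdCompressor(level=level).compress(data))

    def affinity(doc: str, reference: str) -> int:
        """Approximate conditional description length of doc given reference:
        the smaller the increase in compressed size, the closer the document is
        to the reference corpus (classic compression-based scoring, not Zest)."""
        ref = reference.encode()
        return compressed_size(ref + doc.encode()) - compressed_size(ref)

    def classify(doc: str, class_corpora: dict) -> str:
        return min(class_corpora, key=lambda c: affinity(doc, class_corpora[c]))

    corpora = {  # toy reference corpora; real ones would be far larger
        "sports": "the match ended with a late goal the striker scored twice " * 50,
        "finance": "the central bank raised interest rates and markets fell " * 50,
    }
    print(classify("the keeper saved a penalty before the final goal", corpora))
    ```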
    Distributed Deep Reinforcement Learning for Adaptive Medium Access and Modulation in Shared Spectrum. (arXiv:2109.11723v1 [eess.SP])
    (0 min) Spectrum scarcity has led to growth in the use of unlicensed spectrum for cellular systems. This motivates intelligent adaptive approaches to spectrum access for both WiFi and 5G that improve upon traditional carrier sensing and listen-before-talk methods. We study decentralized contention-based medium access for base stations (BSs) of a single Radio Access Technology (RAT) operating on unlicensed shared spectrum. We devise a learning-based algorithm for both contention and adaptive modulation that attempts to maximize a network-wide downlink throughput objective. We formulate and develop novel distributed implementations of two deep reinforcement learning approaches - Deep Q Networks and Proximal Policy Optimization - modelled on a two-stage Markov decision process. Empirically, we find the (proportional fairness) reward accumulated by the policy gradient approach to be significantly higher than even a genie-aided adaptive energy detection threshold. Our approaches are further validated by improved sum and peak throughput. The scalability of our approach to large networks is demonstrated via an improved cumulative reward earned on both indoor and outdoor layouts with a large number of BSs.
    Recurrent Neural Networks for Partially Observed Dynamical Systems. (arXiv:2109.11629v1 [cs.LG])
    (0 min) Complex nonlinear dynamics are ubiquitous in many fields. Moreover, we rarely have access to all of the relevant state variables governing the dynamics. Delay embedding allows us, in principle, to account for unobserved state variables. Here we provide an algebraic approach to delay embedding that permits explicit approximation of error. We also provide the asymptotic dependence of the first order approximation error on the system size. More importantly, this formulation of delay embedding can be directly implemented using a Recurrent Neural Network (RNN). This observation expands the interpretability of both delay embedding and RNN and facilitates principled incorporation of structure and other constraints into these approaches.
    MORSE-STF: A Privacy Preserving Computation System. (arXiv:2109.11726v1 [cs.CR])
    (0 min) Privacy-preserving machine learning has become a popular area of research due to increasing concern over data privacy. One way to achieve privacy-preserving machine learning is to use secure multi-party computation (MPC), where multiple distrusting parties can perform computations on data without revealing the data itself. We present Secure-TF, a privacy-preserving machine learning framework based on MPC. Our framework is able to support widely used machine learning models such as logistic regression, fully connected neural networks, and convolutional neural networks. We propose novel cryptographic protocols that have lower round complexity and less communication for computing sigmoid, ReLU, conv2D, and their derivatives, all of which are central building blocks for modern machine learning models. With our more efficient protocols, our system is able to outperform the previous state-of-the-art privacy-preserving machine learning framework in the WAN setting.
    Chess AI: Competing Paradigms for Machine Intelligence. (arXiv:2109.11602v1 [cs.AI])
    (0 min) Endgame studies have long served as a tool for testing human creativity and intelligence. We find that they can serve as a tool for testing machine ability as well. Two of the leading chess engines, Stockfish and Leela Chess Zero (LCZero), employ significantly different methods during play. We use Plaskett's Puzzle, a famous endgame study from the late 1970s, to compare the two engines. Our experiments show that Stockfish outperforms LCZero on the puzzle. We examine the algorithmic differences between the engines and use our observations as a basis for carefully interpreting the test results. Drawing inspiration from how humans solve chess problems, we ask whether machines can possess a form of imagination. On the theoretical side, we describe how Bellman's equation may be applied to optimize the probability of winning. To conclude, we discuss the implications of our work on artificial intelligence (AI) and artificial general intelligence (AGI), suggesting possible avenues for future research.
    Learning Generative Deception Strategies in Combinatorial Masking Games. (arXiv:2109.11637v1 [cs.GT])
    (0 min) Deception is a crucial tool in the cyberdefence repertoire, enabling defenders to leverage their informational advantage to reduce the likelihood of successful attacks. One way deception can be employed is through obscuring, or masking, some of the information about how systems are configured, increasing attackers' uncertainty about their targets. We present a novel game-theoretic model of the resulting defender-attacker interaction, where the defender chooses a subset of attributes to mask, while the attacker responds by choosing an exploit to execute. The strategies of both players have combinatorial structure with complex informational dependencies, and therefore even representing these strategies is not trivial. First, we show that the problem of computing an equilibrium of the resulting zero-sum defender-attacker game can be represented as a linear program with a combinatorial number of system configuration variables and constraints, and develop a constraint generation approach for solving this problem. Next, we present a novel highly scalable approach for approximately solving such games by representing the strategies of both players as neural networks. The key idea is to represent the defender's mixed strategy using a deep neural network generator, and then use an alternating gradient-descent-ascent algorithm, analogous to the training of Generative Adversarial Networks. Our experiments, as well as a case study, demonstrate the efficacy of the proposed approach.
    Bridging the Last Mile in Sim-to-Real Robot Perception via Bayesian Active Learning. (arXiv:2109.11547v1 [cs.RO])
    (0 min) Learning from synthetic data is popular in a variety of robotic vision tasks such as object detection, because large amounts of data can be generated without annotations by humans. However, when relying only on synthetic data, we encounter the well-known problem of the simulation-to-reality (Sim-to-Real) gap, which is hard to resolve completely in practice. For such cases, real human-annotated data is necessary to bridge this gap, and in our work we focus on how to acquire this data efficiently. Therefore, we propose a Sim-to-Real pipeline that relies on deep Bayesian active learning and aims to minimize the manual annotation effort. We devise a learning paradigm that autonomously selects the data that is considered useful for the human expert to annotate. To achieve this, a Bayesian Neural Network (BNN) object detector providing reliable uncertainty estimates is adapted to infer the informativeness of the unlabeled data, in order to perform active learning. In our experiments on two object detection data sets, we show that the labeling effort required to bridge the reality gap can be reduced to a small amount. Furthermore, we demonstrate the practical effectiveness of this idea in a grasping task on an assistive robot.
    Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection. (arXiv:2109.11641v1 [eess.AS])
    (0 min) In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.
    Lifelong 3D Object Recognition and Grasp Synthesis Using Dual Memory Recurrent Self-Organization Networks. (arXiv:2109.11544v1 [cs.RO])
    (0 min) Humans learn to recognize and manipulate new objects in lifelong settings without forgetting previously gained knowledge under non-stationary and sequential conditions. Autonomous agents need similar capabilities in order to continually learn new object categories and adapt to new environments. In most conventional deep neural networks, this is not possible due to the problem of catastrophic forgetting, where newly gained knowledge overwrites existing representations. Furthermore, most state-of-the-art models excel either at recognizing objects or at grasp prediction, even though both tasks use visual input; combined architectures that tackle both tasks are very limited. In this paper, we propose a hybrid model architecture consisting of a dynamically growing dual-memory recurrent neural network (GDM) and an autoencoder to tackle object recognition and grasping simultaneously. The autoencoder network is responsible for extracting a compact representation of a given object, which serves as input for GDM learning and is used to predict pixel-wise antipodal grasp configurations. The GDM part is designed to recognize the object at both the instance and category levels. We address the problem of catastrophic forgetting using intrinsic memory replay, where the episodic memory periodically replays neural activation trajectories in the absence of external sensory information. To extensively evaluate the proposed model in a lifelong setting, we generated a synthetic dataset due to the lack of sequential 3D object datasets. Experimental results demonstrate that the proposed model can learn both object representations and grasping simultaneously in continual learning scenarios.
    A Multi-Agent Deep Reinforcement Learning Coordination Framework for Connected and Automated Vehicles at Merging Roadways. (arXiv:2109.11672v1 [eess.SY])
    (0 min) The steady increase in the number of vehicles operating on the highways continues to exacerbate congestion, accidents, energy consumption, and greenhouse gas emissions. Emerging mobility systems, e.g., connected and automated vehicles (CAVs), have the potential to directly address these issues and improve transportation network efficiency and safety. In this paper, we consider a highway merging scenario and propose a framework for coordinating CAVs such that stop-and-go driving is eliminated. We use a decentralized form of the actor-critic approach to deep reinforcement learning: multi-agent deep deterministic policy gradient. We demonstrate the coordination of CAVs through numerical simulations and show that a smooth traffic flow is achieved by eliminating stop-and-go driving. Videos and plots of the simulation results can be found at the supplemental site: https://sites.google.com/view/ud-ids-lab/MADRL
    Remaining useful life prediction with uncertainty quantification: development of a highly accurate model for rotating machinery. (arXiv:2109.11579v1 [eess.SP])
    (0 min) Rotating machinery is essential to modern life, from power generation to transportation and a host of other industrial applications. Since such equipment generally operates under challenging working conditions, which can lead to untimely failures, accurate remaining useful life (RUL) prediction is essential for maintenance planning and to prevent catastrophic failures. In this work, we address current challenges in data-driven RUL prediction for rotating machinery. The challenges revolve around the accuracy and uncertainty quantification of the prediction, and the non-stationarity of the system degradation and RUL estimation given sensor data. We devise a novel architecture and RUL prediction model with uncertainty quantification, termed VisPro, which integrates time-frequency analysis, deep learning image recognition, and nonstationary Gaussian process regression. We analyze and benchmark the results obtained with our model against those of other advanced data-driven RUL prediction models for rotating machinery using the PHM12 bearing vibration dataset. The computational experiments show that (1) the VisPro predictions are highly accurate and provide significant improvements over existing prediction models (three times more accurate than the second-best model), and (2) the RUL uncertainty bounds are valid and informative. We identify and discuss the architectural and modeling choices made that explain this excellent predictive performance of VisPro.
    Explaining Local, Global, And Higher-Order Interactions In Deep Learning. (arXiv:2006.08601v4 [cs.LG] UPDATED)
    (0 min) We present a simple yet highly generalizable method for explaining interacting parts within a neural network's reasoning process. First, we design an algorithm based on cross derivatives for computing statistical interaction effects between individual features, which is generalized to both 2-way and higher-order (3-way or more) interactions. We present results side by side with a weight-based attribution technique, corroborating that cross derivatives are a superior metric for both 2-way and higher-order interaction detection. Moreover, we extend the use of cross derivatives as an explanatory device in neural networks to the computer vision setting by expanding Grad-CAM, a popular gradient-based explanatory tool for CNNs, to the higher order. While Grad-CAM can only explain the importance of individual objects in images, our method, which we call Taylor-CAM, can explain a neural network's relational reasoning across multiple objects. We show the success of our explanations both qualitatively and quantitatively, including with a user study. We will release all code as a tool package to facilitate explainable deep learning.
    Nishimori meets Bethe: a spectral method for node classification in sparse weighted graphs. (arXiv:2103.03561v2 [stat.ML] UPDATED)
    (0 min) This article unveils a new relation between the Nishimori temperature parametrizing a distribution P and the Bethe free energy on random Erdos-Renyi graphs with edge weights distributed according to P. Estimating the Nishimori temperature being a task of major importance in Bayesian inference problems, as a practical corollary of this new relation, a numerical method is proposed to accurately estimate the Nishimori temperature from the eigenvalues of the Bethe Hessian matrix of the weighted graph. The algorithm, in turn, is used to propose a new spectral method for node classification in weighted (possibly sparse) graphs. The superiority of the method over competing state-of-the-art approaches is demonstrated both through theoretical arguments and real-world data experiments.
    Multi-StyleGAN: Towards Image-Based Simulation of Time-Lapse Live-Cell Microscopy. (arXiv:2106.08285v3 [cs.CV] UPDATED)
    (0 min) Time-lapse fluorescent microscopy (TLFM) combined with predictive mathematical modelling is a powerful tool to study the inherently dynamic processes of life on the single-cell level. Such experiments are costly, complex and labour intensive. A complementary approach, and a step towards in silico experimentation, is to synthesise the imagery itself. Here, we propose Multi-StyleGAN as a descriptive approach to simulate time-lapse fluorescence microscopy imagery of living cells, based on a past experiment. This novel generative adversarial network synthesises a multi-domain sequence of consecutive timesteps. We showcase Multi-StyleGAN on imagery of multiple live yeast cells in microstructured environments and train on a dataset recorded in our laboratory. The simulation captures underlying biophysical factors and time dependencies, such as cell morphology, growth, physical interactions, as well as the intensity of a fluorescent reporter protein. An immediate application is to generate additional training and validation data for feature extraction algorithms or to aid and expedite the development of advanced experimental techniques such as online monitoring or control of cells. Code and dataset are available at https://git.rwth-aachen.de/bcs/projects/tp/multi-stylegan.
    CNN-based synthesis of realistic high-resolution LiDAR data. (arXiv:1907.00787v2 [eess.IV] UPDATED)
    (0 min) This paper presents a novel CNN-based approach for synthesizing high-resolution LiDAR point cloud data. Our approach generates semantically and perceptually realistic results with guidance from specialized loss functions. First, we utilize a modified per-point loss that addresses missing LiDAR point measurements. Second, we align the quality of our generated output with real-world sensor data by applying a perceptual loss. In large-scale experiments on real-world datasets, we evaluate both the geometric accuracy and semantic segmentation performance using our generated data vs. ground truth. In a mean opinion score test, we further assess the perceptual quality of our generated point clouds. Our results demonstrate a significant quantitative and qualitative improvement in both geometry and semantics over traditional non-CNN-based up-sampling methods.
    An Improved Frequent Directions Algorithm for Low-Rank Approximation via Block Krylov Iteration. (arXiv:2109.11703v1 [cs.LG])
    (0 min) Frequent Directions, a deterministic matrix sketching technique, has been proposed for tackling low-rank approximation problems. This method has a high degree of accuracy and practicality, but incurs high computational cost for large-scale data. Several recent works on the randomized version of Frequent Directions greatly improve the computational efficiency, but unfortunately sacrifice some precision. To remedy this issue, this paper aims to find a more accurate projection subspace to further improve the efficiency and effectiveness of the existing Frequent Directions techniques. Specifically, by utilizing the power of Block Krylov Iteration and the random projection technique, this paper presents a fast and accurate Frequent Directions algorithm named r-BKIFD. The rigorous theoretical analysis shows that the proposed r-BKIFD has a comparable error bound with the original Frequent Directions, and the approximation error can be arbitrarily small when the number of iterations is chosen appropriately. Extensive experimental results on both synthetic and real data further demonstrate the superiority of r-BKIFD over several popular Frequent Directions algorithms, in terms of both computational efficiency and accuracy.
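    For orientation, a minimal NumPy sketch of the deterministic Frequent Directions baseline that r-BKIFD refines; the Block Krylov Iteration and random-projection steps of the paper are not reproduced, and function names are illustrative.
        import numpy as np

        def frequent_directions(A, ell):
            # Sketch an n x d matrix A into an ell x d matrix B with A^T A ~= B^T B.
            n, d = A.shape
            B = np.zeros((2 * ell, d))
            next_free = 0
            for i in range(n):
                B[next_free] = A[i]
                next_free += 1
                if next_free == 2 * ell:                   # sketch is full: shrink it
                    _, s, Vt = np.linalg.svd(B, full_matrices=False)
                    delta = s[ell - 1] ** 2
                    s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
                    B = np.zeros_like(B)
                    B[:Vt.shape[0]] = s[:, None] * Vt
                    next_free = ell                        # upper rows are zero again
            return B[:ell]

        A = np.random.randn(5000, 50)
        B = frequent_directions(A, ell=10)
        err = np.linalg.norm(A.T @ A - B.T @ B, 2)         # covariance sketch error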
    AFP-SRC: Identification of Antifreeze Proteins Using Sparse Representation Classifier. (arXiv:2009.05277v3 [cs.CV] UPDATED)
    (0 min) Species living in extremely cold environments fight against the harsh conditions using antifreeze proteins (AFPs), which manipulate the freezing mechanism of water in more than one way. This remarkable property of AFPs turns out to be extremely useful in several industrial and medical applications. The lack of similarity in their structure and sequence makes their prediction an arduous task, and identifying them experimentally in the wet lab is time-consuming and expensive. In this research, we propose a computational framework for the prediction of AFPs which is essentially based on a sample-specific classification method using sparse reconstruction. A linear model and an over-complete dictionary matrix of known AFPs are used to predict a sparse class-label vector that provides a sample-association score. A delta rule is applied for the reconstruction of two pseudo-samples using the lower and upper parts of the sample-association vector, and class labels are assigned based on the minimum recovery score. We compare our approach with contemporary methods on a standard dataset, and the proposed method is found to outperform them in terms of balanced accuracy and Youden's index. The MATLAB implementation of the proposed method is available at the author's GitHub page (https://github.com/Shujaat123/AFP-SRC).
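    As a point of reference, a generic sparse-representation-classifier step in Python (the released implementation is in MATLAB, and the paper's delta-rule scoring over the AFP dictionary is its own refinement; the Lasso solver and names below are illustrative assumptions):
        import numpy as np
        from sklearn.linear_model import Lasso

        def src_predict(x, D, labels, alpha=0.01):
            # D: d x n dictionary whose columns are training samples; labels: class per column.
            # Reconstruct the query sparsely from the dictionary, then assign the class
            # whose columns yield the smallest reconstruction residual.
            lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
            lasso.fit(D, x)
            coef = lasso.coef_
            best_cls, best_err = None, np.inf
            for cls in np.unique(labels):
                mask = labels == cls
                resid = np.linalg.norm(x - D[:, mask] @ coef[mask])
                if resid < best_err:
                    best_cls, best_err = cls, resid
            return best_cls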
    Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning. (arXiv:2109.12021v1 [cs.AR])
    (0 min) Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and system-aware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms multiple state-of-the-art prefetchers over a wide range of workloads and system configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads. The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.
    ReNAS: Relativistic Evaluation of Neural Architecture Search. (arXiv:1910.01523v7 [cs.LG] UPDATED)
    (0 min) An effective and efficient architecture performance evaluation scheme is essential for the success of Neural Architecture Search (NAS). To save computational cost, most existing NAS algorithms train and evaluate intermediate neural architectures on a small proxy dataset with limited training epochs. But it is difficult to expect an accurate performance estimate of an architecture from such a coarse evaluation. This paper advocates a new neural architecture evaluation scheme, which aims to determine which architecture would perform better instead of accurately predicting the absolute architecture performance. Therefore, we propose a \textbf{relativistic} architecture performance predictor in NAS (ReNAS). We encode neural architectures into feature tensors and further refine the representations with the predictor. The proposed relativistic performance predictor can be deployed in discrete search methods to search for the desired architectures without additional evaluation. Experimental results on the NAS-Bench-101 dataset suggest that sampling 424 ($0.1\%$ of the entire search space) neural architectures and their corresponding validation performance is already enough for learning an accurate architecture performance predictor. The accuracies of our searched neural architectures on the NAS-Bench-101 and NAS-Bench-201 datasets are higher than those of the state-of-the-art methods, demonstrating the superiority of the proposed method.
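    A minimal PyTorch sketch of the pairwise ("relativistic") training idea, assuming a fixed-length architecture encoding is already available; the feature-tensor encoding and predictor used in ReNAS are not reproduced here.
        import torch
        import torch.nn as nn

        # Instead of regressing absolute accuracy, train the predictor so that for any
        # pair of architectures (i, j) with acc_i > acc_j, architecture i is scored higher.
        predictor = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
        opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

        def pairwise_hinge_step(feats, accs, margin=0.1):
            scores = predictor(feats).squeeze(-1)          # (B,)
            diff_s = scores[:, None] - scores[None, :]     # score_i - score_j
            diff_a = accs[:, None] - accs[None, :]         # acc_i  - acc_j
            sign = torch.sign(diff_a)
            loss = torch.relu(margin - sign * diff_s)[diff_a != 0].mean()
            opt.zero_grad(); loss.backward(); opt.step()
            return loss.item()

        # e.g. feats = torch.randn(424, 64); accs = torch.rand(424)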
    A black-box adversarial attack for poisoning clustering. (arXiv:2009.05474v2 [cs.LG] UPDATED)
    (0 min) Clustering algorithms play a fundamental role as tools in decision-making and sensible automation processes. Due to the widespread use of these applications, a robustness analysis of this family of algorithms against adversarial noise has become imperative. To the best of our knowledge, however, only a few works have currently addressed this problem. In an attempt to fill this gap, in this work, we propose a black-box adversarial attack for crafting adversarial samples to test the robustness of clustering algorithms. We formulate the problem as a constrained minimization program, general in its structure and customizable by the attacker according to her capability constraints. We do not assume any information about the internal structure of the victim clustering algorithm, and we allow the attacker to query it as a service only. In the absence of any derivative information, we perform the optimization with a custom approach inspired by the Abstract Genetic Algorithm (AGA). In the experimental part, we demonstrate the sensitivity of different single and ensemble clustering algorithms to our crafted adversarial samples in different scenarios. Furthermore, we perform a comparison of our algorithm with a state-of-the-art approach, showing that we are able to reach or even outperform its performance. Finally, to highlight the general nature of the generated noise, we show that our attacks are transferable even against supervised algorithms such as SVMs, random forests, and neural networks.
    Sample Efficient Model Evaluation. (arXiv:2109.12043v1 [cs.LG])
    (0 min) Labelling data is a major practical bottleneck in training and testing classifiers. Given a collection of unlabelled data points, we address how to select which subset to label to best estimate test metrics such as accuracy, $F_1$ score or micro/macro $F_1$. We consider two sampling-based approaches: the well-known Importance Sampling, and a novel application of Poisson Sampling that we introduce. For both approaches we derive the minimal-error sampling distributions and show how to approximate and use them to form estimators and confidence intervals. We show that Poisson Sampling outperforms Importance Sampling both theoretically and experimentally.
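    A plain importance-sampling version of such an estimator, assuming a proposal distribution q over the unlabelled pool is given; the paper's minimal-error choices of q and the Poisson Sampling alternative are not reproduced, and the names below are illustrative.
        import numpy as np

        def is_accuracy_estimate(q, sample_size, is_correct, rng=None):
            # q: length-N proposal over the pool; is_correct(i) queries the labelling
            # oracle and returns 1 if the classifier is correct on item i, else 0.
            # Returns an unbiased estimate of accuracy = (1/N) * sum_i correct_i.
            rng = rng or np.random.default_rng()
            N = len(q)
            idx = rng.choice(N, size=sample_size, replace=True, p=q)
            labels = np.array([is_correct(i) for i in idx], dtype=float)
            return float(np.mean(labels / (N * q[idx])))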
    Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives. (arXiv:2003.08854v3 [cs.RO] UPDATED)
    (0 min) Visuomotor control (VMC) is an effective means of achieving basic manipulation tasks such as pushing or pick-and-place from raw images. Conditioning VMC on desired goal states is a promising way of achieving versatile skill primitives. However, common conditioning schemes either rely on task-specific fine tuning - e.g. using one-shot imitation learning (IL) - or on sampling approaches using a forward model of scene dynamics i.e. model-predictive control (MPC), leaving deployability and planning horizon severely limited. In this paper we propose a conditioning scheme which avoids these pitfalls by learning the controller and its conditioning in an end-to-end manner. Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion and the distance to a given target observation. In contrast to related works, this enables our approach to efficiently perform complex manipulation tasks from raw image observations without predefined control primitives or test time demonstrations. We report significant improvements in task success over representative MPC and IL baselines. We also demonstrate our model's generalisation capabilities in challenging, unseen tasks featuring visual noise, cluttered scenes and unseen object geometries.
    Clustering multivariate functional data using unsupervised binary trees. (arXiv:2012.05973v3 [stat.ML] UPDATED)
    (0 min) We propose a model-based clustering algorithm for a general class of functional data for which the components could be curves or images. The random functional data realizations could be measured with error at discrete, and possibly random, points in the definition domain. The idea is to build a set of binary trees by recursive splitting of the observations. The number of groups is determined in a data-driven way. The new algorithm provides easily interpretable results and fast predictions for online data sets. Results on simulated datasets reveal good performance in various complex settings. The methodology is applied to the analysis of vehicle trajectories on a German roundabout.
    Reduced-Lead ECG Classifier Model Trained with DivideMix and Model Ensemble. (arXiv:2109.12063v1 [cs.LG])
    (0 min) Automatic diagnosis of multiple cardiac abnormalities from reduced-lead electrocardiogram (ECG) data is challenging. One of the reasons for this is the difficulty of defining labels from standard 12-lead data. Reduced-lead ECG data usually do not exhibit identical characteristics of cardiac abnormalities, which gives rise to a noisy label problem. Thus, there is an inconsistency in the annotated labels between the reduced-lead and 12-lead ECG data. To solve this, we propose deep neural network (DNN)-based ECG classifier models that incorporate DivideMix and stochastic weight averaging (SWA). DivideMix was used to refine the noisy labels by using two separate models. Besides DivideMix, we used a model ensemble technique, SWA, which also addresses the noisy label problem, to enhance the effect of the models generated by DivideMix. Our classifiers (ami_kagoshima) received scores of 0.49, 0.47, 0.48, 0.47, and 0.47 (ranked 9th, 10th, 10th, 11th, and 10th, respectively, out of 39 teams) for the 12-lead, 6-lead, 4-lead, 3-lead, and 2-lead versions, respectively, of the hidden test set with the challenge evaluation metric. We obtained scores of 0.701, 0.686, 0.693, 0.693, and 0.685 on the 10-fold cross validation, and 0.623, 0.593, 0.606, 0.612, and 0.601 on the hidden validation set for each lead combination.
    Weight-of-evidence 2.0 with shrinkage and spline-binning. (arXiv:2101.01494v3 [stat.ML] UPDATED)
    (0 min) In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise and interpretable. Linear modeling methods such as logistic regression are often adopted, since they offer an acceptable balance between precision and interpretability. Linear methods, however, are not well equipped to handle categorical predictors with high cardinality or to exploit non-linear relations in the data. As a solution, data preprocessing methods such as weight-of-evidence are typically used for transforming the predictors. The binning procedure that underlies the weight-of-evidence approach, however, has been little researched and typically relies on ad-hoc or expert-driven procedures. The objective in this paper, therefore, is to propose a formalized, data-driven and powerful method. To this end, we explore the discretization of continuous variables through the binning of spline functions, which allows for capturing non-linear effects in the predictor variables and yields highly interpretable predictors taking only a small number of discrete values. Moreover, we extend upon the weight-of-evidence approach and propose to estimate the proportions using shrinkage estimators. Together, this offers an improved ability to exploit both non-linear and categorical predictors for achieving increased classification precision, while maintaining interpretability of the resulting model and decreasing the risk of overfitting. We present the results of a series of experiments in a fraud detection setting, which illustrate the effectiveness of the presented approach. We facilitate reproduction of the presented results and adoption of the proposed approaches by providing both the dataset and the code for implementing the experiments and the presented approach.
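    For intuition, the basic weight-of-evidence transform with simple additive shrinkage; the paper's spline-based binning and dedicated shrinkage estimators go beyond this sketch, and the smoothing constant is illustrative.
        import numpy as np

        def woe_table(bin_ids, y, smoothing=0.5):
            # bin_ids: integer bin index per observation (e.g. from spline binning),
            # y: binary target (1 = event, 0 = non-event).
            # WoE(bin) = log( P(bin | event) / P(bin | non-event) ), with Laplace smoothing.
            woe = {}
            n_event, n_nonevent = (y == 1).sum(), (y == 0).sum()
            for b in np.unique(bin_ids):
                in_bin = bin_ids == b
                e = (y[in_bin] == 1).sum() + smoothing
                ne = (y[in_bin] == 0).sum() + smoothing
                woe[b] = np.log((e / n_event) / (ne / n_nonevent))
            return woe

        # usage: bin_ids could come from np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))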
    Distributed Estimation of Sparse Inverse Covariance Matrices. (arXiv:2109.12020v1 [cs.LG])
    (0 min) Learning the relationships between various entities from time-series data is essential in many applications. Gaussian graphical models have been studied to infer these relationships. However, existing algorithms process data in a batch at a central location, limiting their applications in scenarios where data is gathered by different agents. In this paper, we propose a distributed sparse inverse covariance algorithm to learn the network structure (i.e., dependencies among observed entities) in real-time from data collected by distributed agents. Our approach is built on an online graphical alternating minimization algorithm, augmented with a consensus term that allows agents to learn the desired structure cooperatively. We allow the system designer to select the number of communication rounds and optimization steps per data point. We characterize the rate of convergence of our algorithm and provide simulations on synthetic datasets.
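    For context, the centralised batch baseline of this structure-learning task (graphical lasso on pooled data); the paper's distributed, online, consensus-based updates are not shown here, and the data and regularization strength are placeholders.
        import numpy as np
        from sklearn.covariance import GraphicalLasso

        X = np.random.randn(500, 20)                 # stand-in for pooled agent data
        model = GraphicalLasso(alpha=0.1).fit(X)     # sparse inverse covariance fit
        precision = model.precision_                 # estimated precision matrix
        edges = np.abs(precision) > 1e-3             # recovered dependency graph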
    Local Intrinsic Dimensionality Signals Adversarial Perturbations. (arXiv:2109.11803v1 [cs.LG])
    (0 min) The vulnerability of machine learning models to adversarial perturbations has motivated a significant amount of research under the broad umbrella of adversarial machine learning. Sophisticated attacks may cause learning algorithms to learn decision functions or make decisions with poor predictive performance. In this context, there is a growing body of literature that uses local intrinsic dimensionality (LID), a local metric that describes the minimum number of latent variables required to describe each data point, for detecting adversarial samples and subsequently mitigating their effects. The research to date has tended to focus on using LID as a practical defence method often without fully explaining why LID can detect adversarial samples. In this paper, we derive a lower-bound and an upper-bound for the LID value of a perturbed data point and demonstrate that the bounds, in particular the lower-bound, have a positive correlation with the magnitude of the perturbation. Hence, we demonstrate that data points that are perturbed by a large amount would have large LID values compared to unperturbed samples, thus justifying its use in the prior literature. Furthermore, our empirical validation demonstrates the validity of the bounds on benchmark datasets.
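    The standard maximum-likelihood LID estimator from k-nearest-neighbour distances, the quantity that the derived bounds concern; shown only for reference, with an illustrative choice of k.
        import numpy as np

        def lid_mle(query, reference, k=20):
            # Hill-type MLE of local intrinsic dimensionality from k-NN distances:
            #   LID ~= -1 / mean_i log(r_i / r_k),  where r_1 <= ... <= r_k.
            d = np.linalg.norm(reference - query, axis=1)
            d = np.sort(d[d > 0])[:k]            # drop the query itself if present
            return -1.0 / np.mean(np.log(d / d[-1]))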
    Combing Policy Evaluation and Policy Improvement in a Unified f-Divergence Framework. (arXiv:2109.11867v1 [cs.LG])
    (0 min) The framework of deep reinforcement learning (DRL) provides a powerful and widely applicable mathematical formalization for sequential decision-making. In this paper, we start from studying the f-divergence between the learning policy and sampling policy and derive a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL). We highlight that the policy evaluation and policy improvement phases are induced by minimizing the f-divergence between the learning policy and sampling policy, which is distinct from the conventional DRL algorithm objective that maximizes the expected cumulative rewards. Besides, we convert this framework to a saddle-point optimization problem with a specific f function through the Fenchel conjugate, which consists of policy evaluation and policy improvement. Then we derive new policy evaluation and policy improvement methods in FRL. Our framework may give new insights for analyzing DRL algorithms. The FRL framework achieves two advantages: (1) policy evaluation and policy improvement processes are derived simultaneously by f-divergence; (2) the overestimation issue of the value function is alleviated. To evaluate the effectiveness of the FRL framework, we conduct experiments on Atari 2600 video games, which show that our framework matches or surpasses the DRL algorithms we tested.
    Description-based Label Attention Classifier for Explainable ICD-9 Classification. (arXiv:2109.12026v1 [cs.LG])
    (0 min) ICD-9 coding is a relevant clinical billing task, where unstructured texts with information about a patient's diagnosis and treatments are annotated with multiple ICD-9 codes. Automated ICD-9 coding is an active research field, where CNN- and RNN-based model architectures represent the state-of-the-art approaches. In this work, we propose a description-based label attention classifier to improve the model explainability when dealing with noisy texts like clinical notes. We evaluate our proposed method with different transformer-based encoders on the MIMIC-III-50 dataset. Our method achieves strong results together with augmented explainability.
    Is the Number of Trainable Parameters All That Actually Matters?. (arXiv:2109.11928v1 [stat.ML])
    (0 min) Recent work has identified simple empirical scaling laws for language models, linking compute budget, dataset size, model size, and autoregressive modeling loss. The validity of these simple power laws across orders of magnitude in model scale provides compelling evidence that larger models are also more capable models. However, scaling up models under the constraints of hardware and infrastructure is no easy feat, and rapidly becomes a hard and expensive engineering problem. We investigate ways to tentatively cheat scaling laws, and train larger models for cheaper. We emulate an increase in effective parameters, using efficient approximations: either by doping the models with frozen random parameters, or by using fast structured transforms in place of dense linear layers. We find that the scaling relationship between test loss and compute depends only on the actual number of trainable parameters; scaling laws cannot be deceived by spurious parameters.
    Learning Relative Interactions through Imitation. (arXiv:2109.12013v1 [cs.RO])
    (0 min) In this project we trained a neural network to perform specific interactions between a robot and objects in the environment, through imitation learning. In particular, we tackle the task of moving the robot to a fixed pose with respect to a certain object and later extend our method to handle any arbitrary pose around this object. We show that a simple network, with relatively little training data, is able to reach very good performance on the fixed-pose task, while more work is needed to perform the arbitrary-pose task satisfactorily. We also explore the effect of ambiguities in the sensor readings, in particular caused by symmetries in the target object, on the behaviour of the learned controller.
    A Generative Federated Learning Framework for Differential Privacy. (arXiv:2109.12062v1 [cs.LG])
    (0 min) In machine learning, differential privacy and federated learning concepts are gaining more and more importance in an increasingly interconnected world. While the former refers to the sharing of private data characterized by strict security rules to protect individual privacy, the latter refers to distributed learning techniques in which a central server exchanges information with different clients for machine learning purposes. In recent years, many studies have shown the possibility of bypassing the privacy shields of these systems and exploiting the vulnerabilities of machine learning models, making them leak the information with which they have been trained. In this work, we present the 3DGL framework, an alternative to the current federated learning paradigms. Its goal is to share generative models with high levels of $\varepsilon$-differential privacy. In addition, we propose DDP-$\beta$VAE, a deep generative model capable of generating synthetic data with high levels of utility and safety for the individual. We evaluate the 3DGL framework based on DDP-$\beta$VAE, showing how the overall system is resilient to the principal attacks in federated learning and improves the performance of distributed learning algorithms.
    Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts. (arXiv:2109.11817v1 [cs.LG])
    (0 min) Training large-scale mixture of experts models efficiently on modern hardware requires assigning datapoints in a batch to different experts, each with a limited capacity. Recently proposed assignment procedures lack a probabilistic interpretation and use biased estimators for training. As an alternative, we propose two unbiased estimators based on principled stochastic assignment procedures: one that skips datapoints which exceed expert capacity, and one that samples perfectly balanced assignments using an extension of the Gumbel-Matching distribution [29]. Both estimators are unbiased, as they correct for the used sampling procedure. On a toy experiment, we find the `skip'-estimator is more effective than the balanced sampling one, and both are more robust in solving the task than biased alternatives.
    A Bayesian Optimization Approach for Attenuation Correction in SPECT Brain Imaging. (arXiv:2109.11920v1 [physics.med-ph])
    (0 min) Photon attenuation and scatter are the two main physical factors affecting the diagnostic quality of SPECT in its applications in brain imaging. In this work, we present a novel Bayesian Optimization approach for Attenuation Correction (BOAC) in SPECT brain imaging. BOAC utilizes a prior model parametrizing the head geometry and exploits High Performance Computing (HPC) to reconstruct attenuation-corrected images without requiring prior anatomical information from complementary CT scans. BOAC is demonstrated in SPECT brain imaging using noisy and attenuated sinograms simulated from numerical phantoms. The quality of the tomographic images obtained with the proposed method is compared to that of images obtained without attenuation correction, employing appropriate image quality metrics. The quantitative results show the capacity of BOAC to provide images exhibiting higher contrast and fewer background artifacts compared to the non-attenuation-corrected MLEM images.
    Deep Bayesian Estimation for Dynamic Treatment Regimes with a Long Follow-up Time. (arXiv:2109.11929v1 [stat.ML])
    (0 min) Causal effect estimation for dynamic treatment regimes (DTRs) contributes to sequential decision making. However, censoring and time-dependent confounding under DTRs are challenging, as the amount of observational data declines over time due to a reducing sample size while the feature dimension increases over time. Long-term follow-up compounds these challenges. Another challenge is the highly complex relationships between confounders, treatments, and outcomes, which cause the traditional and commonly used linear methods to fail. We combine outcome regression models with treatment models for high-dimensional features using the uncensored subjects, which are small in sample size, and we fit deep Bayesian models for the outcome regression to reveal the complex relationships between confounders, treatments, and outcomes. In addition, the developed deep Bayesian models can model uncertainty and output the prediction variance, which is essential for safety-aware applications such as self-driving cars and medical treatment design. The experimental results on medical simulations of HIV treatment show the ability of the proposed method to obtain stable and accurate dynamic causal effect estimation from observational data, especially with long-term follow-up. Our technique provides practical guidance for sequential decision making and policy-making.
    Regret Lower Bound and Optimal Algorithm for High-Dimensional Contextual Linear Bandit. (arXiv:2109.11612v1 [cs.LG])
    (0 min) In this paper, we consider the multi-armed bandit problem with high-dimensional features. First, we prove a minimax lower bound, $\mathcal{O}\big((\log d)^{\frac{\alpha+1}{2}}T^{\frac{1-\alpha}{2}}+\log T\big)$, for the cumulative regret, in terms of horizon $T$, dimension $d$ and a margin parameter $\alpha\in[0,1]$, which controls the separation between the optimal and the sub-optimal arms. This new lower bound unifies existing regret bound results that have different dependencies on $T$ due to the use of different values of the margin parameter $\alpha$ implied by their assumptions. Second, we propose a simple and computationally efficient algorithm inspired by the general Upper Confidence Bound (UCB) strategy that achieves a regret upper bound matching the lower bound. The proposed algorithm uses a properly centered $\ell_1$-ball as the confidence set, in contrast to the commonly used ellipsoid confidence set. In addition, the algorithm does not require any forced sampling step and is thereby adaptive to the practically unknown margin parameter. Simulations and a real data analysis are conducted to compare the proposed method with existing ones in the literature.
    Discovering Novel Customer Features with Recurrent Neural Networks for Personality Based Financial Services. (arXiv:2109.11871v1 [cs.LG])
    (0 min) The micro-segmentation of customers in the finance sector is a non-trivial task and has been an atypical omission from recent scientific literature. Where traditional segmentation classifies customers based on coarse features such as demographics, micro-segmentation depicts more nuanced differences between individuals, bringing forth several advantages including the potential for improved personalization in financial services. AI and representation learning offer a unique opportunity to solve the problem of micro-segmentation. Although ubiquitous in many industries, the proliferation of AI in sensitive industries such as finance has become contingent on the imperatives of responsible AI. We had previously solved the micro-segmentation problem by extracting temporal features from the state space of a recurrent neural network (RNN). However, due to the inherent opacity of RNNs our solution lacked an explanation - one of the imperatives of responsible AI. In this study, we address this issue by extracting an explanation for and providing an interpretation of our temporal features. We investigate the state space of our RNN and through a linear regression model reconstruct the trajectories in the state space with high fidelity. We show that our linear regression coefficients have not only learned the rules used to create the RNN's output data but have also learned the relationships that were not directly evident in the raw data.
    Estimating R\'enyi's $\alpha$-Cross-Entropies in a Matrix-Based Way. (arXiv:2109.11737v1 [cs.IT])
    (0 min) Conventional information-theoretic quantities assume access to probability distributions. Estimating such distributions is not trivial. Here, we consider function-based formulations of cross entropy that sidestep this a priori estimation requirement. We propose three measures of R\'enyi's $\alpha$-cross-entropies in the setting of reproducing-kernel Hilbert spaces. Each measure has its appeals. We prove that we can estimate these measures in an unbiased, non-parametric, and minimax-optimal way. We do this via sample-constructed Gram matrices. This yields matrix-based estimators of R\'enyi's $\alpha$-cross-entropies. These estimators satisfy all of the axioms that R\'enyi established for divergences. Our cross-entropies can thus be used for assessing distributional differences. They are also appropriate for handling high-dimensional distributions, since the convergence rate of our estimator is independent of the sample dimensionality. Python code for implementing these measures can be found at https://github.com/isledge/MBRCE
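    The Gram-matrix entropy functional that such estimators build on, in a few lines; the three cross-entropy measures themselves combine Gram matrices of the two samples as defined in the paper, and the kernel bandwidth here is an assumption.
        import numpy as np

        def normalized_gram(X, sigma=1.0):
            # Gaussian-kernel Gram matrix normalised to unit trace.
            sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
            K = np.exp(-sq / (2 * sigma ** 2))
            return K / np.trace(K)

        def matrix_renyi_entropy(A, alpha=2.0):
            # S_alpha(A) = log(sum_i lambda_i^alpha) / (1 - alpha), in nats.
            lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
            return np.log(np.sum(lam ** alpha)) / (1.0 - alpha)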
    A data acquisition setup for data driven acoustic design. (arXiv:2109.12014v1 [cs.SD])
    (0 min) In this paper, we present a novel interdisciplinary approach to study the relationship between diffusive surface structures and their acoustic performance. Using computational design, surface structures are iteratively generated and 3D printed at 1:10 model scale. They originate from different fabrication typologies and are designed to have acoustic diffusion and absorption effects. An automated robotic process measures the impulse responses of these surfaces by positioning a microphone and a speaker at multiple locations. The collected data serves two purposes: first, as an exploratory catalogue of different spatio-temporal-acoustic scenarios and second, as data set for predicting the acoustic response of digitally designed surface geometries using machine learning. In this paper, we present the automated data acquisition setup, the data processing and the computational generation of diffusive surface structures. We describe first results of comparative studies of measured surface panels and conclude with steps of future research.
    Simple and Effective Zero-shot Cross-lingual Phoneme Recognition. (arXiv:2109.11680v1 [cs.CL])
    (0 min) Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
    Learning-based Noise Component Map Estimation for Image Denoising. (arXiv:2109.11877v1 [eess.IV])
    (0 min) A problem of image denoising when images are corrupted by non-stationary noise is considered in this paper. Since in practice no a priori information on the noise is available, noise statistics should be pre-estimated for image denoising. In this paper, a deep convolutional neural network (CNN) based method for estimation of a map of local, patch-wise, standard deviations of noise (a so-called sigma-map) is proposed. It achieves state-of-the-art performance in accuracy of sigma-map estimation for the case of non-stationary noise, as well as in estimation of the noise variance for the case of additive white Gaussian noise. Extensive experiments on image denoising using estimated sigma-maps demonstrate that our method outperforms recent CNN-based blind image denoising methods by up to 6 dB in PSNR, as well as other state-of-the-art methods based on sigma-map estimation by up to 0.5 dB, while at the same time providing better usage flexibility. Comparison with the ideal case, when denoising is applied using the ground-truth sigma-map, shows that the difference in the corresponding PSNR values for most noise levels is within 0.1-0.2 dB and does not exceed 0.6 dB.
    Parameter-Free Deterministic Reduction of the Estimation Bias in Continuous Control. (arXiv:2109.11788v1 [cs.LG])
    (0 min) Approximation of the value functions in value-based deep reinforcement learning systems induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We introduce a parameter-free, novel deep Q-learning variant to reduce this underestimation bias for continuous control. By obtaining fixed weights in computing the critic objective as a linear combination of the approximate critic functions, our Q-value update rule integrates the concepts of Clipped Double Q-learning and Maxmin Q-learning. We test the performance of our improvement on a set of MuJoCo and Box2D continuous control tasks and find that it improves the state-of-the-art and outperforms the baseline algorithms in the majority of the environments.
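    An illustrative form of such a combined critic target; the paper derives fixed weights analytically (hence parameter-free), which are not reproduced here, so the beta below is only a placeholder knob.
        import torch

        def combined_q_target(rewards, next_q1, next_q2, gamma=0.99, beta=0.5):
            # Mix the pessimistic Clipped Double Q component (min of two critics)
            # with a less pessimistic average of the critics.
            q_min = torch.min(next_q1, next_q2)
            q_avg = 0.5 * (next_q1 + next_q2)
            return rewards + gamma * (beta * q_min + (1.0 - beta) * q_avg)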
    The More, the Better? A Study on Collaborative Machine Learning for DGA Detection. (arXiv:2109.11830v1 [cs.CR])
    (0 min) Domain generation algorithms (DGAs) prevent the connection between a botnet and its master from being blocked by generating a large number of domain names. Promising single-data-source approaches have been proposed for separating benign from DGA-generated domains. Collaborative machine learning (ML) can be used in order to enhance a classifier's detection rate, reduce its false positive rate (FPR), and to improve the classifier's generalization capability to different networks. In this paper, we complement the research area of DGA detection by conducting a comprehensive collaborative learning study, including a total of 13,440 evaluation runs. In two real-world scenarios we evaluate a total of eleven different variations of collaborative learning using three different state-of-the-art classifiers. We show that collaborative ML can lead to a reduction in FPR by up to 51.7%. However, while collaborative ML is beneficial for DGA detection, not all approaches and classifier types profit equally. We round up our comprehensive study with a thorough discussion of the privacy threats implicated by the different collaborative ML approaches.
    Identifying Distributional Differences in Convective Evolution Prior to Rapid Intensification in Tropical Cyclones. (arXiv:2109.12029v1 [stat.ML])
    (0 min) Tropical cyclone (TC) intensity forecasts are issued by human forecasters who evaluate spatio-temporal observations (e.g., satellite imagery) and model output (e.g., numerical weather prediction, statistical models) to produce forecasts every 6 hours. Within these time constraints, it can be challenging to draw insight from such data. While high-capacity machine learning methods are well suited for prediction problems with complex sequence data, extracting interpretable scientific information with such methods is difficult. Here we leverage powerful AI prediction algorithms and classical statistical inference to identify patterns in the evolution of TC convective structure leading up to the rapid intensification of a storm, hence providing forecasters and scientists with key insight into TC behavior.
    Dimension Reduction for Data with Heterogeneous Missingness. (arXiv:2109.11765v1 [stat.ML])
    (0 min) Dimension reduction plays a pivotal role in analysing high-dimensional data. However, observations with missing values present serious difficulties in directly applying standard dimension reduction techniques. As a large number of dimension reduction approaches are based on the Gram matrix, we first investigate the effects of missingness on dimension reduction by studying the statistical properties of the Gram matrix with or without missingness, and then we present a bias-corrected Gram matrix with nice statistical properties under heterogeneous missingness. Extensive empirical results, on both simulated and publicly available real datasets, show that the proposed unbiased Gram matrix can significantly improve a broad spectrum of representative dimension reduction approaches.
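    For intuition, a common inverse-probability correction of the Gram matrix under entrywise missingness; the paper's bias-corrected Gram matrix targets more general heterogeneous missingness, while this sketch assumes independent, feature-wise observation probabilities.
        import numpy as np

        def ipw_gram(X, mask, p):
            # X: data with arbitrary values at missing positions, mask: observed indicator,
            # p: per-feature observation probabilities.  Off-diagonal inner products are
            # debiased by 1/p_j^2, diagonal entries by 1/p_j.
            X_obs = np.where(mask, X, 0.0)
            G = (X_obs / p) @ (X_obs / p).T
            np.fill_diagonal(G, (X_obs ** 2 / p).sum(axis=1))
            return G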
    Learning to maximize global influence from local observations. (arXiv:2109.11909v1 [cs.LG])
    (0 min) We study a family of online influence maximization problems where, in a sequence of rounds $t=1,\ldots,T$, a decision maker selects one from a large number of agents with the goal of maximizing influence. Upon choosing an agent, the decision maker shares a piece of information with the agent, which then spreads in an unobserved network over which the agents communicate. The goal of the decision maker is to select the sequence of agents in a way that maximizes the total number of influenced nodes in the network. In this work, we consider a scenario where the networks are generated independently for each $t$ according to some fixed but unknown distribution, so that the set of influenced nodes corresponds to the connected component of the random graph containing the vertex corresponding to the selected agent. Furthermore, we assume that the decision maker only has access to limited feedback: instead of making the unrealistic assumption that the entire network is observable, we suppose that the available feedback is generated based on a small neighborhood of the selected vertex. Our results show that such partial local observations can be sufficient for maximizing global influence. We model the underlying random graph as a sparse inhomogeneous Erd\H{o}s--R\'enyi graph, and study three specific families of random graph models in detail: stochastic block models, Chung--Lu models and Kronecker random graphs. We show that in these cases one may learn to maximize influence by merely observing the degree of the selected vertex in the generated random graph. We propose sequential learning algorithms that aim at maximizing influence, and provide their theoretical analysis in both the subcritical and supercritical regimes of all considered models.
    Discovering and Validating AI Errors With Crowdsourced Failure Reports. (arXiv:2109.11690v1 [cs.HC])
    (0 min) AI systems can fail to learn important behaviors, leading to real-world issues like safety concerns and biases. Discovering these systematic failures often requires significant developer attention, from hypothesizing potential edge cases to collecting evidence and validating patterns. To scale and streamline this process, we introduce crowdsourced failure reports, end-user descriptions of how or why a model failed, and show how developers can use them to detect AI errors. We also design and implement Deblinder, a visual analytics system for synthesizing failure reports that developers can use to discover and validate systematic failures. In semi-structured interviews and think-aloud studies with 10 AI practitioners, we explore the affordances of the Deblinder system and the applicability of failure reports in real-world settings. Lastly, we show how collecting additional data from the groups identified by developers can improve model performance.
    A Learned Stereo Depth System for Robotic Manipulation in Homes. (arXiv:2109.11644v1 [cs.RO])
    (0 min) We present a passive stereo depth system that produces dense and accurate point clouds optimized for human environments, including dark, textureless, thin, reflective and specular surfaces and objects, at 2560x2048 resolution, with 384 disparities, in 30 ms. The system consists of an algorithm combining learned stereo matching with engineered filtering, a training and data-mixing methodology, and a sensor hardware design. Our architecture is 15x faster than approaches that perform similarly on the Middlebury and Flying Things Stereo Benchmarks. To effectively supervise the training of this model, we combine real data labelled using off-the-shelf depth sensors, as well as a number of different rendered, simulated labeled datasets. We demonstrate the efficacy of our system by presenting a large number of qualitative results in the form of depth maps and point-clouds, experiments validating the metric accuracy of our system and comparisons to other sensors on challenging objects and scenes. We also show the competitiveness of our algorithm compared to state-of-the-art learned models using the Middlebury and FlyingThings datasets.
    Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience. (arXiv:2109.11767v1 [cs.LG])
    (0 min) Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm, essentially based on entropy regularization. SAC trains a policy by maximizing the trade-off between expected return and entropy (randomness in the policy). It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks, outperforming prior on-policy and off-policy methods. SAC works in an off-policy fashion where data are sampled uniformly from past experiences (stored in a buffer), which are then used to update the parameters of the policy and value function networks. We propose certain crucial modifications for boosting the performance of SAC and making it more sample efficient. In our proposed improved SAC, we first introduce a new prioritization scheme for selecting better samples from the experience replay buffer. Second, we use a mixture of the prioritized off-policy data with the latest on-policy data for training the policy and the value function networks. We compare our approach with the vanilla SAC and some recent variants of SAC and show that our approach outperforms the said algorithmic benchmarks. It is comparatively more stable and sample efficient when tested on a number of continuous control tasks in MuJoCo environments.
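    A sketch of the batch-construction idea, mixing prioritized draws from the replay buffer with the most recent on-policy transitions; the prioritization rule and mixing ratio used in the paper may differ, and both are placeholders here.
        import numpy as np

        def mixed_batch(buffer, priorities, batch_size, on_policy_frac=0.25, rng=None):
            # buffer: list of transitions, oldest first;
            # priorities: per-transition priorities (e.g. |TD error|).
            rng = rng or np.random.default_rng()
            n_on = int(batch_size * on_policy_frac)
            n_off = batch_size - n_on
            probs = priorities / priorities.sum()
            off_idx = rng.choice(len(buffer), size=n_off, p=probs)     # prioritized off-policy
            on_idx = np.arange(len(buffer) - n_on, len(buffer))        # latest on-policy data
            return [buffer[i] for i in np.concatenate([off_idx, on_idx])]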
    Efficient, Interpretable Atomistic Graph Neural Network Representation for Angle-dependent Properties and its Application to Optical Spectroscopy Prediction. (arXiv:2109.11576v1 [cs.LG])
    (0 min) Graph neural networks (GNNs) are attractive for learning properties of atomic structures thanks to their intuitive, physically informed graph encoding of atoms and bonds. However, conventional GNN encodings do not account for angular information, which is critical for describing complex atomic arrangements in disordered materials, interfaces, and molecular distortions. In this work, we extend the recently proposed ALIGNN encoding, which incorporates bond angles, to also include dihedral angles (ALIGNN-d), and we apply the model to capture the structures of aqua copper complexes for spectroscopy prediction. This simple extension is shown to lead to a memory-efficient graph representation capable of capturing the full geometric information of atomic structures. Specifically, the ALIGNN-d encoding is a sparse yet equally expressive representation compared to the dense, maximally-connected graph, in which all bonds are encoded. We also explore model interpretability based on ALIGNN-d by elucidating the relative contributions of individual structural components to the optical response of the copper complexes. Lastly, we briefly discuss future developments to validate the computational efficiency and to extend the interpretability of ALIGNN-d.
    ADVERSARIALuscator: An Adversarial-DRL Based Obfuscator and Metamorphic Malware SwarmGenerator. (arXiv:2109.11542v1 [cs.CR])
    (0 min) Advanced metamorphic malware and ransomware, by using obfuscation, could alter their internal structure with every attack. If such malware could intrude even into any of the IoT networks, then even if the original malware instance gets detected, by that time it can still infect the entire network. It is challenging to obtain training data for such evasive malware. Therefore, in this paper, we present ADVERSARIALuscator, a novel system that uses specialized Adversarial-DRL to obfuscate malware at the opcode level and create multiple metamorphic instances of the same. To the best of our knowledge, ADVERSARIALuscator is the first-ever system that adopts a Markov Decision Process-based approach to convert and find a solution to the problem of creating individual obfuscations at the opcode level. This is important as the machine language level is the least at which functionality could be preserved so as to mimic an actual attack effectively. ADVERSARIALuscator is also the first-ever system to use efficient continuous-action-control-capable deep reinforcement learning agents, such as Proximal Policy Optimization, in the area of cyber security. Experimental results indicate that ADVERSARIALuscator could raise the metamorphic probability of a corpus of malware by >0.45. Additionally, more than 33% of metamorphic instances generated by ADVERSARIALuscator were able to evade the most potent IDS. Hence ADVERSARIALuscator could be used to generate data representative of a swarm of very potent and coordinated AI-based metamorphic malware attacks. The generated data and simulations could be used to bolster the defenses of an IDS against an actual AI-based metamorphic attack from advanced malware and ransomware.
    DeepKoCo: Efficient latent planning with a task-relevant Koopman representation. (arXiv:2011.12690v3 [cs.LG] UPDATED)
    (0 min) This paper presents DeepKoCo, a novel model-based agent that learns a latent Koopman representation from images. This representation allows DeepKoCo to plan efficiently using linear control methods, such as linear model predictive control. Compared to traditional agents, DeepKoCo learns task-relevant dynamics, thanks to the use of a tailored lossy autoencoder network that allows DeepKoCo to learn latent dynamics that reconstruct and predict only observed costs, rather than all observed dynamics. As our results show, DeepKoCo achieves similar final performance as traditional model-free methods on complex control tasks while being considerably more robust to distractor dynamics, making the proposed agent more amenable for real-life applications.
    Adversarial Bandits with Knapsacks. (arXiv:1811.11881v7 [cs.DS] UPDATED)
    (0 min) We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version and use it as a subroutine to solve the latter.
    Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach. (arXiv:2002.06442v4 [cs.DB] UPDATED)
    (0 min) Due to the outstanding capability of capturing underlying data distributions, deep learning techniques have been recently utilized for a series of traditional database problems. In this paper, we investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection. Answering this problem accurately and efficiently is essential to many data management applications, especially for query optimization. Moreover, in some applications the estimated cardinality is supposed to be consistent and interpretable. Hence a monotonic estimation w.r.t. the query threshold is preferred. We propose a novel and generic method that can be applied to any data type and distance function. Our method consists of a feature extraction model and a regression model. The feature extraction model transforms original data and threshold to a Hamming space, in which a deep learning-based regression model is utilized to exploit the incremental property of cardinality w.r.t. the threshold for both accuracy and monotonicity. We develop a training strategy tailored to our model as well as techniques for fast estimation. We also discuss how to handle updates. We demonstrate the accuracy and the efficiency of our method through experiments, and show how it improves the performance of a query optimizer.
    Optimization-based Causal Estimation from Heterogenous Environments. (arXiv:2109.11990v1 [stat.ME])
    (0 min) This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association to the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments -- and ones that exhibit sufficient heterogeneity -- CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model.
    Learning-Based Path Planning for Long-Range Autonomous Valet Parking. (arXiv:2109.11661v1 [cs.LG])
    (0 min) In this paper, to reduce the congestion rate at the city center and increase the quality of experience (QoE) of each user, the framework of long-range autonomous valet parking (LAVP) is presented, where an Electric Autonomous Vehicle (EAV) is deployed in the city, which can pick up and drop off users at their required spots, and then drive to a car park outside the city center autonomously. In this framework, we aim to minimize the overall travel distance of the EAV while guaranteeing that all users are served, i.e., picked up and dropped off at their required spots, by optimizing the path planning of the EAV and the number of serving time slots. To this end, we first propose a learning-based algorithm, named the Double-Layer Ant Colony Optimization (DL-ACO) algorithm, to solve the above problem in an iterative way. Then, to make real-time decisions while considering the dynamic environment (i.e., the EAV may pick up and drop off users at different locations), we further present a deep reinforcement learning (DRL) based algorithm known as the deep Q network (DQN). The experimental results show that the DL-ACO and DQN-based algorithms both achieve considerable performance.
    SIM2REALVIZ: Visualizing the Sim2Real Gap in Robot Ego-Pose Estimation. (arXiv:2109.11801v1 [cs.RO])
    (0 min) The Robotics community has started to heavily rely on increasingly realistic 3D simulators for large-scale training of robots on massive amounts of data. But once robots are deployed in the real world, the simulation gap, as well as changes in the real world (e.g. lights, object displacements), leads to errors. In this paper, we introduce Sim2RealViz, a visual analytics tool to assist experts in understanding and reducing this gap for robot ego-pose estimation tasks, i.e. the estimation of a robot's position using trained models. Sim2RealViz displays details of a given model and the performance of its instances in both simulation and the real world. Experts can identify environment differences that impact model predictions at a given location and explore, through direct interactions with the model hypotheses, how to fix them. We detail the design of the tool, and present case studies on how the regression-to-the-mean bias can be exploited and addressed, and on how models are perturbed by the vanishing of landmarks such as bikes.
    Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment. (arXiv:2109.11679v1 [stat.ML])
    (0 min) Algorithmic recommendations and decisions have become ubiquitous in today's society. Many of these and other data-driven policies are based on known, deterministic rules to ensure their transparency and interpretability. This is especially true when such policies are used for public policy decision-making. For example, algorithmic pre-trial risk assessments, which serve as our motivating application, provide relatively simple, deterministic classification scores and recommendations to help judges make release decisions. Unfortunately, existing methods for policy learning are not applicable because they require existing policies to be stochastic rather than deterministic. We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy by minimizing the worst-case regret. The resulting policy is conservative but has a statistical safety guarantee, allowing the policy-maker to limit the probability of producing a worse outcome than the existing policy. We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations. Lastly, we apply the proposed methodology to a unique field experiment on pre-trial risk assessments. We derive new classification and recommendation rules that retain the transparency and interpretability of the existing risk assessment instrument while potentially leading to better overall outcomes at a lower cost.
    MLIMC: Machine learning-based implicit-solvent Monte Carlo. (arXiv:2109.12100v1 [q-bio.BM])
    (0 min) Monte Carlo (MC) methods are important computational tools for molecular structure optimization and prediction. When solvent effects are explicitly considered, MC methods become very expensive due to the large number of degrees of freedom associated with the water molecules and mobile ions. Alternatively, implicit-solvent MC can largely reduce the computational cost by applying a mean-field approximation to solvent effects while maintaining the atomic detail of the target molecule. The two most popular implicit-solvent models are the Poisson-Boltzmann (PB) model and the Generalized Born (GB) model; the GB model is an approximation to the PB model but is much faster in simulation time. In this work, we develop a machine learning-based implicit-solvent Monte Carlo (MLIMC) method that combines the advantages of both implicit-solvent models in accuracy and efficiency. Specifically, the MLIMC method uses a fast and accurate PB-based machine learning (PBML) scheme to compute the electrostatic solvation free energy at each step. We validate our MLIMC method on a benzene-water system and a protein-water system. We show that the proposed MLIMC method has great advantages in speed and accuracy for molecular structure optimization and prediction.

2021-09-24

  • cs.CL updates on arXiv.org

    ParaShoot: A Hebrew Question Answering Dataset. (arXiv:2109.11314v1 [cs.CL])
    (2 min) NLP research in Hebrew has largely focused on morphology and syntax, where rich annotated datasets in the spirit of Universal Dependencies are available. Semantic datasets, however, are in short supply, hindering crucial advances in the development of NLP technology in Hebrew. In this work, we present ParaShoot, the first question answering dataset in modern Hebrew. The dataset follows the format and crowdsourcing methodology of SQuAD, and contains approximately 3000 annotated examples, similar to other question-answering datasets in low-resource languages. We provide the first baseline results using recently-released BERT-style models for Hebrew, showing that there is significant room for improvement on this task.
    Active Learning for Argument Strength Estimation. (arXiv:2109.11319v1 [cs.LG])
    (2 min) High-quality arguments are an essential part of decision-making. Automatically predicting the quality of an argument is a complex task that has recently received much attention in argument mining. However, the annotation effort for this task is exceptionally high. Therefore, we test uncertainty-based active learning (AL) methods on two popular argument-strength data sets to estimate whether sample-efficient learning can be enabled. Our extensive empirical evaluation shows that uncertainty-based acquisition functions cannot surpass the accuracy reached with random acquisition on these data sets.
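    A minimal sketch of the kind of uncertainty-based acquisition evaluated above: score every unlabelled argument by the entropy of the model's predicted class distribution and send the most uncertain ones to annotators. The probability array and the budget below are illustrative placeholders, not values from the paper.

        import numpy as np

        def entropy_acquisition(probs: np.ndarray, budget: int) -> np.ndarray:
            """Return indices of the `budget` pool items with the highest predictive entropy.

            probs: (n_samples, n_classes) model probabilities for the unlabelled pool.
            """
            entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
            return np.argsort(-entropy)[:budget]

        # toy usage: 5 pool items, 3 hypothetical argument-strength classes
        pool_probs = np.array([
            [0.90, 0.05, 0.05],
            [0.34, 0.33, 0.33],
            [0.60, 0.20, 0.20],
            [0.50, 0.50, 0.00],
            [0.20, 0.40, 0.40],
        ])
        print(entropy_acquisition(pool_probs, budget=2))  # most uncertain items first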
    Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System. (arXiv:2109.11292v1 [cs.CL])
    (2 min) Consistency identification has achieved remarkable success in open-domain dialogue, where it can be used to prevent inconsistent response generation. However, in contrast to the rapid development in open-domain dialogue, few efforts have been made in the task-oriented direction. In this paper, we argue that the consistency problem is even more urgent in the task-oriented domain. To facilitate research, we introduce CI-ToD, a novel dataset for Consistency Identification in Task-oriented Dialog systems. In addition to a single label that allows the model to judge whether the system response is contradictory, we provide more fine-grained labels (i.e., Dialogue History Inconsistency, User Query Inconsistency and Knowledge Base Inconsistency) to encourage the model to identify which inconsistent sources lead to the contradiction. Empirical results show that state-of-the-art methods achieve only 51.3%, far behind the human performance of 93.2%, indicating that there is ample room for improving consistency identification ability. Finally, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide guidance for future directions. All datasets and models are publicly available at \url{https://github.com/yizhen20133868/CI-ToD}.
    Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models. (arXiv:2104.08066v2 [cs.CL] UPDATED)
    (2 min) A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make a V&L model inherit the capability of natural language understanding (NLU) from the original language model. To see how well this is achieved, we propose to evaluate V&L models using an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that the dual-stream scores are not much different from the single-stream scores, contrary to expectation. Further analysis shows that pre-training causes a performance drop in NLU tasks, with few exceptions. These results suggest that adopting a single-stream structure and devising the pre-training could be an effective method for improving the maintenance of language knowledge in V&L extensions.
    Finetuning Pretrained Transformers into Variational Autoencoders. (arXiv:2108.02446v2 [cs.CL] UPDATED)
    (2 min) Text variational autoencoders (VAEs) are notorious for posterior collapse, a phenomenon where the model's decoder learns to ignore signals from the encoder. Because posterior collapse is known to be exacerbated by expressive decoders, Transformers have seen limited adoption as components of text VAEs. Existing studies that incorporate Transformers into text VAEs (Li et al., 2020; Fang et al., 2021) mitigate posterior collapse using massive pretraining, a technique unavailable to most of the research community without extensive computing resources. We present a simple two-phase training scheme to convert a sequence-to-sequence Transformer into a VAE with just finetuning. The resulting language model is competitive with massively pretrained Transformer-based VAEs in some internal metrics while falling short on others. To facilitate training we comprehensively explore the impact of common posterior collapse alleviation techniques in the literature. We release our code for reproducibility.
    Exploring Decomposition for Table-based Fact Verification. (arXiv:2109.11020v1 [cs.CL])
    (2 min) Fact verification based on structured data is challenging as it requires models to understand both natural language and symbolic operations performed over tables. Although pre-trained language models have demonstrated a strong capability in verifying simple statements, they struggle with complex statements that involve multiple operations. In this paper, we improve fact verification by decomposing complex statements into simpler subproblems. Leveraging the programs synthesized by a weakly supervised semantic parser, we propose a program-guided approach to constructing a pseudo dataset for decomposition model training. The subproblems, together with their predicted answers, serve as the intermediate evidence to enhance our fact verification model. Experiments show that our proposed approach achieves the new state-of-the-art performance, an 82.7\% accuracy, on the TabFact benchmark.
    Pregroup Grammars, their Syntax and Semantics. (arXiv:2109.11237v1 [cs.CL])
    (2 min) Pregroup grammars were developed in 1999 and remained Lambek's preferred algebraic model of grammar. The set-theoretic semantics of pregroups, however, faces an ambiguity problem. In his latest book, Lambek suggests that this problem might be overcome using finite dimensional vector spaces rather than sets. What is the right notion of composition in this setting, direct sum or tensor product of spaces?
    Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline. (arXiv:2009.00590v2 [cs.CL] UPDATED)
    (2 min) Aligning sentences in a reference summary with their counterparts in source documents was shown as a useful auxiliary summarization task, notably for generating training data for salience detection. Despite its assessed utility, the alignment step was mostly approached with heuristic unsupervised methods, typically ROUGE-based, and was never independently optimized or evaluated. In this paper, we propose establishing summary-source alignment as an explicit task, while introducing two major novelties: (1) applying it at the more accurate proposition span level, and (2) approaching it as a supervised classification task. To that end, we created a novel training dataset for proposition-level alignment, derived automatically from available summarization evaluation data. In addition, we crowdsourced dev and test datasets, enabling model development and proper evaluation. Utilizing these data, we present a supervised proposition alignment baseline model, showing improved alignment-quality over the unsupervised approach.
    Conditional Poisson Stochastic Beam Search. (arXiv:2109.11034v1 [cs.CL])
    (2 min) Beam search is the default decoding strategy for many sequence generation tasks in NLP. The set of approximate K-best items returned by the algorithm is a useful summary of the distribution for many applications; however, the candidates typically exhibit high overlap and may give a highly biased estimate for expectations under our model. These problems can be addressed by instead using stochastic decoding strategies. In this work, we propose a new method for turning beam search into a stochastic process: Conditional Poisson stochastic beam search. Rather than taking the maximizing set at each iteration, we sample K candidates without replacement according to the conditional Poisson sampling design. We view this as a more natural alternative to the stochastic beam search (SBS) of Kool et al. (2019). Furthermore, we show how samples generated under the CPSBS design can be used to build consistent estimators and sample diverse sets from sequence models. In our experiments, we observe that CPSBS produces lower-variance and more efficient estimators than SBS, even showing improvements in high-entropy settings.
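    The exact conditional Poisson design is more involved; as a rough illustration of the underlying idea of replacing the deterministic top-K step of beam search with sampling without replacement, the toy step below draws K candidate expansions in proportion to their scores. This is plain weighted sampling without replacement, standing in for the paper's CP design, and the score values are made up.

        import numpy as np

        def stochastic_beam_step(log_scores: np.ndarray, k: int, rng=None) -> np.ndarray:
            """One decoding step that keeps k candidates sampled without replacement
            in proportion to exp(log_scores), instead of the deterministic top-k."""
            rng = rng or np.random.default_rng()
            probs = np.exp(log_scores - log_scores.max())
            probs /= probs.sum()
            return rng.choice(len(log_scores), size=k, replace=False, p=probs)

        # toy usage: scores for 6 beam expansions, keep 3
        scores = np.array([-1.2, -0.3, -2.5, -0.4, -3.0, -1.0])
        print(stochastic_beam_step(scores, k=3))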
    Improving Simultaneous Translation by Incorporating Pseudo-References with Fewer Reorderings. (arXiv:2010.11247v2 [cs.CL] UPDATED)
    (2 min) Simultaneous translation is vastly different from full-sentence translation, in the sense that it starts translation before the source sentence ends, with only a few words delay. However, due to the lack of large-scale, high-quality simultaneous translation datasets, most such systems are still trained on conventional full-sentence bitexts. This is far from ideal for the simultaneous scenario due to the abundance of unnecessary long-distance reorderings in those bitexts. We propose a novel method that rewrites the target side of existing full-sentence corpora into simultaneous-style translation. Experiments on Zh->En and Ja->En simultaneous translation show substantial improvements (up to +2.7 BLEU) with the addition of these generated pseudo-references.
    Token-Level Supervised Contrastive Learning for Punctuation Restoration. (arXiv:2107.09099v2 [cs.CL] UPDATED)
    (2 min) Punctuation is critical in understanding natural language text. Currently, most automatic speech recognition (ASR) systems do not generate punctuation, which affects the performance of downstream tasks, such as intent detection and slot filling. This gives rise to the need for punctuation restoration. Recent work in punctuation restoration heavily utilizes pre-trained language models without considering data imbalance when predicting punctuation classes. In this work, we address this problem by proposing a token-level supervised contrastive learning method that aims at maximizing the distance of representation of different punctuation marks in the embedding space. The result shows that training with token-level supervised contrastive learning obtains up to 3.2% absolute F1 improvement on the test set.
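    A minimal PyTorch sketch of a token-level supervised contrastive loss in the spirit of the paper: token embeddings that share a punctuation label are treated as positives and pulled together, all others are pushed apart. The tensor shapes and the temperature value are illustrative assumptions, not the paper's exact formulation.

        import torch
        import torch.nn.functional as F

        def token_supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
            """Supervised contrastive loss over token representations.

            embeddings: (n_tokens, dim) token vectors from the encoder
            labels:     (n_tokens,) punctuation class id per token
            """
            z = F.normalize(embeddings, dim=-1)
            sim = z @ z.t() / temperature                       # pairwise similarities
            mask = labels.unsqueeze(0) == labels.unsqueeze(1)   # positives share a label
            mask.fill_diagonal_(False)
            logits = sim - torch.eye(len(z), device=z.device) * 1e9  # exclude self-pairs
            log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
            pos_counts = mask.sum(dim=1).clamp(min=1)
            loss = -(log_prob * mask).sum(dim=1) / pos_counts
            return loss.mean()

        # toy usage: 4 tokens, 2 punctuation classes
        emb = torch.randn(4, 8)
        lab = torch.tensor([0, 1, 0, 1])
        print(token_supcon_loss(emb, lab))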
    Transformers: "The End of History" for NLP?. (arXiv:2105.00813v2 [cs.CL] UPDATED)
    (2 min) Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but fundamentally, they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain information sources, which was easy for pre-existing models. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice on two general types of tasks -- segmentation and segment labeling -- and on four datasets that these limitations are indeed harmful and that addressing them, even in some very simple and naive ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion on desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.
    Adapted End-to-End Coreference Resolution System for Anaphoric Identities in Dialogues. (arXiv:2109.00185v2 [cs.CL] UPDATED)
    (2 min) We present an effective system adapted from the end-to-end neural coreference resolution model, targeting the task of anaphora resolution in dialogues. Three aspects are specifically addressed in our approach: the support of singletons, encoding speakers and turns throughout dialogue interactions, and knowledge transfer utilizing existing resources. Despite the simplicity of our adaptation strategies, they are shown to have a significant impact on the final performance, with up to 27 F1 improvement over the baseline. Our final system ranks 1st on the leaderboard of the anaphora resolution track in the CRAC 2021 shared task, and achieves the best evaluation results on all four datasets.
    Alzheimers Dementia Detection using Acoustic & Linguistic features and Pre-Trained BERT. (arXiv:2109.11010v1 [cs.CL])
    (2 min) Alzheimer's disease is a fatal progressive brain disorder that worsens with time. There is a pressing need for inexpensive and quick clinical diagnostic techniques for early detection and care. In previous studies, various machine learning techniques and pre-trained deep learning models have been used in conjunction with the extraction of various acoustic and linguistic features. Our study focuses on three models for the classification task in the ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) 2021 Challenge. We use the well-balanced dataset provided by the ADReSS Challenge for training and validating our models. Model 1 uses various acoustic features from the eGeMAPS feature set, Model 2 uses various linguistic features that we generated from auto-generated transcripts, and Model 3 uses the auto-generated transcripts directly to extract features using a pre-trained BERT and TF-IDF. These models are described in detail in the models section.
    Model Bias in NLP -- Application to Hate Speech Classification. (arXiv:2109.09725v2 [cs.CL] UPDATED)
    (2 min) This document sums up our results for the NLP lecture at ETH in the spring semester 2021. In this work, a BERT-based neural network model (Devlin et al., 2018) is applied to the JIGSAW dataset (Jigsaw/Conversation AI, 2019) in order to create a model identifying hateful and toxic comments (strictly separated from offensive language) on online social platforms (English language), in this case Twitter. Three other neural network architectures and a GPT-2 (Radford et al., 2019) model are also applied to the provided data set in order to compare these different models. The trained BERT model is then applied to two different data sets to evaluate its generalisation power, namely another Twitter data set (Davidson et al., 2017) and the HASOC 2019 data set (Mandl et al., 2019), which includes Twitter and also Facebook comments; we focus on the English HASOC 2019 data. In addition, we show that by fine-tuning the trained BERT model on these two datasets and applying different transfer learning scenarios via retraining partial or all layers, the predictive scores improve compared to simply applying the model pre-trained on the JIGSAW data set. With our results, we obtain precisions from 64% to around 90% while still achieving acceptable recall values of at least the lower 60s%, showing that BERT is suitable for real use cases on social platforms.
    Boosting Cross-Lingual Transfer via Self-Learning with Uncertainty Estimation. (arXiv:2109.00194v2 [cs.CL] UPDATED)
    (2 min) Recent multilingual pre-trained language models have achieved remarkable zero-shot performance, where the model is only finetuned on one source language and directly evaluated on target languages. In this work, we propose a self-learning framework that further utilizes unlabeled data of target languages, combined with uncertainty estimation in the process to select high-quality silver labels. Three different uncertainties are adapted and analyzed specifically for cross-lingual transfer: Language Heteroscedastic/Homoscedastic Uncertainty (LEU/LOU) and Evidential Uncertainty (EVI). We evaluate our framework with these uncertainties on two cross-lingual tasks, Named Entity Recognition (NER) and Natural Language Inference (NLI), covering 40 languages in total, and outperform the baselines significantly, by 10 F1 on average for NER and 2.5 accuracy points for NLI.
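    As a rough stand-in for the uncertainty-guided selection of silver labels described above, the sketch below keeps only target-language predictions whose maximum class probability clears a confidence threshold. The LEU/LOU and EVI estimates used in the paper are more principled than this simple cut-off, and the threshold value is an illustrative assumption.

        import numpy as np

        def select_silver_labels(probs: np.ndarray, threshold: float = 0.9):
            """Keep confident predictions on unlabeled target-language data as silver labels.

            probs: (n_examples, n_classes) predicted probabilities.
            Returns (kept indices, silver label ids).
            """
            confidence = probs.max(axis=1)
            keep = np.where(confidence >= threshold)[0]
            return keep, probs[keep].argmax(axis=1)

        # toy usage: three target-language examples, two classes
        p = np.array([[0.97, 0.03], [0.55, 0.45], [0.08, 0.92]])
        idx, labels = select_silver_labels(p)
        print(idx, labels)  # only the confident examples (0 and 2) are kept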
    Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?. (arXiv:2109.11321v1 [cs.CL])
    (2 min) Large language models are known to suffer from the hallucination problem in that they are prone to output statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complements the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large language models by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models. The method is based on two steps, 1) a novel task querying for knowledge of memory colors, i.e. typical colors of well-known objects, and 2) filtering of model training data to clearly separate knowledge contributions. Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method. We find that our method can successfully be used to measure visual knowledge transfer capabilities in models and that our novel model architecture shows promising results for leveraging multimodal knowledge in a unimodal setting.
    N-LTP: An Open-source Neural Language Technology Platform for Chinese. (arXiv:2009.11616v4 [cs.CL] UPDATED)
    (2 min) We introduce \texttt{N-LTP}, an open-source neural language technology platform supporting six fundamental Chinese NLP tasks: {lexical analysis} (Chinese word segmentation, part-of-speech tagging, and named entity recognition), {syntactic parsing} (dependency parsing), and {semantic parsing} (semantic dependency parsing and semantic role labeling). Unlike existing state-of-the-art toolkits, such as \texttt{Stanza}, that adopt an independent model for each task, \texttt{N-LTP} adopts a multi-task framework with a shared pre-trained model, which has the advantage of capturing shared knowledge across relevant Chinese tasks. In addition, a knowledge distillation method \cite{DBLP:journals/corr/abs-1907-04829}, in which the single-task model teaches the multi-task model, is further introduced to encourage the multi-task model to surpass its single-task teacher. Finally, we provide a collection of easy-to-use APIs and a visualization tool that let users use the toolkit and view the processing results more easily and directly. To the best of our knowledge, this is the first toolkit to support six fundamental Chinese NLP tasks. Source code, documentation, and pre-trained models are available at \url{https://github.com/HIT-SCIR/ltp}.
    Automated Fact-Checking: A Survey. (arXiv:2109.11427v1 [cs.CL])
    (2 min) As online false information continues to grow, automated fact-checking has gained an increasing amount of attention in recent years. Researchers in the field of Natural Language Processing (NLP) have contributed to the task by building fact-checking datasets, devising automated fact-checking pipelines and proposing NLP methods to further research in the development of different components. This paper reviews relevant research on automated fact-checking covering both the claim detection and claim validation components.
    Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors. (arXiv:2107.01545v2 [eess.AS] UPDATED)
    (2 min) Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited. Experimental results showed that our method could produce accurate diarization results of an unseen number of speakers. Our method achieved 11.84 %, 28.33 %, and 19.49 % on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than the conventional end-to-end diarization methods.
    WRENCH: A Comprehensive Benchmark for Weak Supervision. (arXiv:2109.11377v1 [cs.LG])
    (2 min) Recent Weak Supervision (WS) approaches have had widespread success in easing the bottleneck of labeling training data for machine learning by synthesizing labels from multiple potentially noisy supervision sources. However, proper measurement and analysis of these approaches remain a challenge. First, datasets used in existing works are often private and/or custom, limiting standardization. Second, WS datasets with the same name and base data often vary in terms of the labels and weak supervision sources used, a significant "hidden" source of evaluation variance. Finally, WS studies often diverge in terms of the evaluation protocol and ablations used. To address these problems, we introduce a benchmark platform, WRENCH, for a thorough and standardized evaluation of WS approaches. It consists of 22 varied real-world datasets for classification and sequence tagging; a range of real, synthetic, and procedurally-generated weak supervision sources; and a modular, extensible framework for WS evaluation, including implementations for popular WS methods. We use WRENCH to conduct extensive comparisons over more than 100 method variants to demonstrate its efficacy as a benchmark platform. The code is available at \url{https://github.com/JieyuZ2/wrench}.
    Knowledge-Aware Graph-Enhanced GPT-2 for Dialogue State Tracking. (arXiv:2104.04466v3 [cs.CL] UPDATED)
    (2 min) Dialogue State Tracking is central to multi-domain task-oriented dialogue systems, responsible for extracting information from user utterances. We present a novel hybrid architecture that augments GPT-2 with representations derived from Graph Attention Networks in such a way to allow causal, sequential prediction of slot values. The model architecture captures inter-slot relationships and dependencies across domains that otherwise can be lost in sequential prediction. We report improvements in state tracking performance in MultiWOZ 2.0 against a strong GPT-2 baseline and investigate a simplified sparse training scenario in which DST models are trained only on session-level annotations but evaluated at the turn level. We further report detailed analyses to demonstrate the effectiveness of graph models in DST by showing that the proposed graph modules capture inter-slot dependencies and improve the predictions of values that are common to multiple domains.
    Dynamic Knowledge Distillation for Pre-trained Language Models. (arXiv:2109.11295v1 [cs.CL])
    (2 min) Knowledge distillation (KD) has been proved effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected teacher model on the pre-defined training dataset. In this paper, we explore a dynamic knowledge distillation that empowers the student to adjust the learning procedure according to its competency, with respect to student performance and learning efficiency. We explore dynamic adjustments along three dimensions: teacher model adoption, data selection, and KD objective adaptation. Experimental results show that (1) proper selection of the teacher model can boost the performance of the student model; (2) conducting KD with only the 10% most informative instances achieves comparable performance while greatly accelerating training; (3) the student performance can be boosted by adjusting the supervision contribution of different alignment objectives. We find dynamic knowledge distillation promising and provide discussions on potential future directions towards more efficient KD methods. Our code is available at https://github.com/lancopku/DynamicKD.
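    One of the three dynamic adjustments, teacher model adoption, could look roughly like the sketch below: for each training instance, pick the teacher that assigns the highest probability to the gold label. The teacher names, probabilities, and selection criterion are illustrative assumptions rather than the paper's exact procedure.

        import numpy as np

        def pick_teacher_per_instance(teacher_probs: dict, gold: np.ndarray) -> list:
            """Choose, per instance, the teacher most confident in the gold label.

            teacher_probs: maps teacher name -> (n_instances, n_classes) probabilities
            gold:          (n_instances,) gold label ids
            """
            names = list(teacher_probs)
            gold_probs = np.stack(
                [teacher_probs[n][np.arange(len(gold)), gold] for n in names]
            )                                   # (n_teachers, n_instances)
            best = gold_probs.argmax(axis=0)    # index of the best teacher per instance
            return [names[i] for i in best]

        # toy usage: two hypothetical teachers, three instances, two classes
        probs = {
            "teacher_base":  np.array([[0.70, 0.30], [0.40, 0.60], [0.90, 0.10]]),
            "teacher_large": np.array([[0.60, 0.40], [0.20, 0.80], [0.95, 0.05]]),
        }
        print(pick_teacher_per_instance(probs, gold=np.array([0, 1, 0])))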
    An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System. (arXiv:2109.11421v1 [cs.AI])
    (2 min) This research proposes an artificial intelligence algorithm comprising an ontology-based design, text mining, and natural language processing for automatically generating gap-fill multiple choice questions (MCQs). The simulation in this research demonstrated an application of the algorithm in generating gap-fill MCQs about software testing. The simulation results revealed that, using 103 online documents as inputs, the algorithm could automatically produce more than 16 thousand valid gap-fill MCQs covering a variety of topics in the software testing domain. Finally, in the discussion section of this paper we suggest how the proposed algorithm should be applied to produce gap-fill MCQs to be collected in a question pool used by an expert system.
    MARMOT: A Deep Learning Framework for Constructing Multimodal Representations for Vision-and-Language Tasks. (arXiv:2109.11526v1 [cs.CV])
    (2 min) Political activity on social media presents a data-rich window into political behavior, but the vast amount of data means that almost all content analyses of social media require a data labeling step. However, most automated machine classification methods ignore the multimodality of posted content, focusing either on text or images. State-of-the-art vision-and-language models are unusable for most political science research: they require all observations to have both image and text and require computationally expensive pretraining. This paper proposes a novel vision-and-language framework called multimodal representations using modality translation (MARMOT). MARMOT presents two methodological contributions: it can construct representations for observations missing image or text, and it replaces the computationally expensive pretraining with modality translation. MARMOT outperforms an ensemble text-only classifier in 19 of 20 categories in multilabel classifications of tweets reporting election incidents during the 2016 U.S. general election. Moreover, MARMOT shows significant improvements over the results of benchmark multimodal models on the Hateful Memes dataset, improving the best result set by VisualBERT in terms of accuracy from 0.6473 to 0.6760 and area under the receiver operating characteristic curve (AUC) from 0.7141 to 0.7530.
    Validating Label Consistency in NER Data Annotation. (arXiv:2101.08698v2 [cs.CL] UPDATED)
    (2 min) Data annotation plays a crucial role in ensuring that named entity recognition (NER) models are trained with the right information to learn from. Producing the most accurate labels is a challenge due to the complexity involved with annotation. Label inconsistency between multiple subsets of data annotation (e.g., training set and test set, or multiple training subsets) is an indicator of label mistakes. In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance. It can be used to validate the label consistency (or catch the inconsistency) in multiple sets of NER data annotation. In experiments, our method identified the label inconsistency of test data in the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes, respectively). It validated the consistency in the corrected versions of both datasets.
    Can Question Generation Debias Question Answering Models? A Case Study on Question-Context Lexical Overlap. (arXiv:2109.11256v1 [cs.CL])
    (2 min) Question answering (QA) models for reading comprehension have been demonstrated to exploit unintended dataset biases such as question-context lexical overlap. This hinders QA models from generalizing to under-represented samples such as questions with low lexical overlap. Question generation (QG), a method for augmenting QA datasets, can be a solution for such performance degradation if QG can properly debias QA datasets. However, we discover that recent neural QG models are biased towards generating questions with high lexical overlap, which can amplify the dataset bias. Moreover, our analysis reveals that data augmentation with these QG models frequently impairs the performance on questions with low lexical overlap, while improving that on questions with high lexical overlap. To address this problem, we use a synonym replacement-based approach to augment questions with low lexical overlap. We demonstrate that the proposed data augmentation approach is simple yet effective to mitigate the degradation problem with only 70k synthetic examples. Our data is publicly available at https://github.com/KazutoshiShinoda/Synonym-Replacement.
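    A rough sketch of the kind of synonym-replacement augmentation described above, using WordNet via NLTK (assuming the WordNet data is installed) to swap a few content words in a question and thereby lower its lexical overlap with the context; the paper's actual replacement rules may differ.

        import random
        from nltk.corpus import wordnet as wn  # assumes nltk's WordNet data is available

        def synonym_replace(question: str, n_swaps: int = 1, seed: int = 0) -> str:
            """Replace up to n_swaps words that have WordNet synonyms with one of them."""
            rng = random.Random(seed)
            tokens = question.split()
            candidates = [i for i, t in enumerate(tokens) if wn.synsets(t)]
            rng.shuffle(candidates)
            for i in candidates[:n_swaps]:
                lemmas = {l.name().replace("_", " ")
                          for s in wn.synsets(tokens[i]) for l in s.lemmas()}
                lemmas.discard(tokens[i])
                if lemmas:
                    tokens[i] = rng.choice(sorted(lemmas))
            return " ".join(tokens)

        # toy usage with a made-up question
        print(synonym_replace("Which city hosted the summer games", n_swaps=2))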
    Cross-linguistically Consistent Semantic and Syntactic Annotation of Child-directed Speech. (arXiv:2109.10952v1 [cs.CL])
    (2 min) While corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, semantic annotation for such corpora is still scarce and lacks a uniform standard. We compile two CDS corpora with sentential logical forms, one in English and the other in Hebrew. In compiling the corpora we employ a methodology that enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. The corpora are based on a sizable portion of Brown's Adam corpus from CHILDES (about 80% of its child-directed utterances) and on all child-directed utterances from Berman's Hebrew CHILDES corpus Hagar. We begin by annotating the corpora with the Universal Dependencies (UD) scheme for syntactic annotation, motivated by its applicability to a wide variety of domains and languages. We then proceed by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. The two representations have complementary strengths: UD structures are language-neutral and support direct annotation, whereas LFs are neutral as to the interface between syntax and semantics, and transparently encode semantic distinctions. We verify the quality of the UD annotation using an inter-annotator agreement study. We then demonstrate the utility of the compiled corpora through a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena.
    Corpus and Models for Lemmatisation and POS-tagging of Old French. (arXiv:2109.11442v1 [cs.CL])
    (2 min) Old French is a typical example of an under-resourced historical language that furthermore displays an important amount of linguistic variation. In this paper, we present the current results of a long-running project (2015-...) and describe how we approached the difficult question of providing lemmatisation and POS models for Old French with the help of neural taggers and the progressive constitution of dedicated corpora.
    Putting Words in BERT's Mouth: Navigating Contextualized Vector Spaces with Pseudowords. (arXiv:2109.11491v1 [cs.CL])
    (2 min) We present a method for exploring regions around individual points in a contextualized vector space (particularly, BERT space), as a way to investigate how these regions correspond to word senses. By inducing a contextualized "pseudoword" as a stand-in for a static embedding in the input layer, and then performing masked prediction of a word in the sentence, we are able to investigate the geometry of the BERT-space in a controlled manner around individual instances. Using our method on a set of carefully constructed sentences targeting ambiguous English words, we find substantial regularity in the contextualized space, with regions that correspond to distinct word senses; but between these regions there are occasionally "sense voids" -- regions that do not correspond to any intelligible sense.
    Named Entity Recognition and Classification on Historical Documents: A Survey. (arXiv:2109.11406v1 [cs.CL])
    (2 min) After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.
    Finding a Balanced Degree of Automation for Summary Evaluation. (arXiv:2109.11503v1 [cs.CL])
    (2 min) Human evaluation for summarization tasks is reliable but brings in issues of reproducibility and high costs. Automatic metrics are cheap and reproducible but sometimes poorly correlated with human judgment. In this work, we propose flexible semi-automatic to automatic summary evaluation metrics, following the Pyramid human evaluation method. The semi-automatic Lite2Pyramid retains the reusable human-labeled Summary Content Units (SCUs) for reference(s) but replaces the manual work of judging SCUs' presence in system summaries with a natural language inference (NLI) model. The fully automatic Lite3Pyramid further substitutes SCUs with automatically extracted Semantic Triplet Units (STUs) via a semantic role labeling (SRL) model. Finally, we propose in-between metrics, Lite2.xPyramid, where we use a simple regressor to predict how well the STUs can simulate SCUs and retain SCUs that are more difficult to simulate, which provides a smooth transition and balance between automation and manual evaluation. Compared to 15 existing metrics, we evaluate human-metric correlations on 3 existing meta-evaluation datasets and our newly collected PyrXSum (with 100/10 XSum examples/systems). It shows that Lite2Pyramid consistently has the best summary-level correlations; Lite3Pyramid works better than or comparably to other automatic metrics; and Lite2.xPyramid trades off small correlation drops for larger manual effort reduction, which can reduce costs for future data collection. Our code and data are publicly available at: https://github.com/ZhangShiyue/Lite2-3Pyramid
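    The Lite2Pyramid step of judging whether a human-written SCU is present in a system summary with an NLI model might be sketched as below; the checkpoint name (roberta-large-mnli) and the entailment-only decision rule are illustrative assumptions rather than the paper's exact setup.

        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        name = "roberta-large-mnli"  # example off-the-shelf NLI checkpoint
        tok = AutoTokenizer.from_pretrained(name)
        nli = AutoModelForSequenceClassification.from_pretrained(name)

        def scu_present(system_summary: str, scu: str) -> bool:
            """Treat the system summary as premise and the SCU as hypothesis;
            count the SCU as present if entailment is the top NLI label."""
            inputs = tok(system_summary, scu, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = nli(**inputs).logits
            label = nli.config.id2label[int(logits.argmax(dim=-1))]
            return label.lower() == "entailment"

        print(scu_present("The storm closed all schools in the city.",
                          "Schools were closed."))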
    Integrating Pattern- and Fact-based Fake News Detection via Model Preference Learning. (arXiv:2109.11333v1 [cs.CL])
    (2 min) To defend against fake news, researchers have developed various methods based on texts. These methods can be grouped as 1) pattern-based methods, which focus on shared patterns among fake news posts rather than the claim itself; and 2) fact-based methods, which retrieve evidence from external sources to verify the claim's veracity without considering patterns. The two groups of methods, which have different preferences for textual clues, actually play complementary roles in detecting fake news. However, few works consider their integration. In this paper, we study the problem of integrating pattern- and fact-based models into one framework via modeling their preference differences, i.e., making the pattern- and fact-based models focus on their respective preferred parts of a post and mitigate interference from non-preferred parts as much as possible. To this end, we build a Preference-aware Fake News Detection Framework (Pref-FEND), which learns the respective preferences of pattern- and fact-based models for joint detection. We first design a heterogeneous dynamic graph convolutional network to generate the respective preference maps, and then use these maps to guide the joint learning of pattern- and fact-based models for final prediction. Experiments on two real-world datasets show that Pref-FEND effectively captures model preferences and improves the performance of models based on patterns, facts, or both.
    Cross-Lingual Language Model Meta-Pretraining. (arXiv:2109.11129v1 [cs.CL])
    (2 min) The success of pretrained cross-lingual language models relies on two essential abilities, i.e., generalization ability for learning downstream tasks in a source language, and cross-lingual transferability for transferring the task knowledge to other languages. However, current methods jointly learn the two abilities in a single-phase cross-lingual pretraining process, resulting in a trade-off between generalization and cross-lingual transfer. In this paper, we propose cross-lingual language model meta-pretraining, which learns the two abilities in different training phases. Our method introduces an additional meta-pretraining phase before cross-lingual pretraining, where the model learns generalization ability on a large-scale monolingual corpus. Then, the model focuses on learning cross-lingual transfer on a multilingual corpus. Experimental results show that our method improves both generalization and cross-lingual transfer, and produces better-aligned representations across different languages.
    Fuzzy Generalised Quantifiers for Natural Language in Categorical Compositional Distributional Semantics. (arXiv:2109.11227v1 [cs.CL])
    (2 min) Recent work on compositional distributional models shows that bialgebras over finite dimensional vector spaces can be applied to treat generalised quantifiers for natural language. That technique requires one to construct the vector space over powersets, and therefore is computationally costly. In this paper, we overcome this problem by considering fuzzy versions of quantifiers along the lines of Zadeh, within the category of many valued relations. We show that this category is a concrete instantiation of the compositional distributional model. We show that the semantics obtained in this model is equivalent to the semantics of the fuzzy quantifiers of Zadeh. As a result, we are now able to treat fuzzy quantification without requiring a powerset construction.
    Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization. (arXiv:2109.11199v1 [cs.CL])
    (2 min) Within natural language processing tasks, linguistic knowledge can play an important role in helping the model learn strong representations and in better guiding natural language generation. In this work, we develop a neural network based abstractive multi-document summarization (MDS) model which leverages dependency parsing to capture cross-positional dependencies and grammatical structures. More concretely, we process the dependency information into a linguistic-guided attention mechanism and further fuse it with the multi-head attention for better feature representation. With the help of linguistic signals, sentence-level relations can be correctly captured, thus improving MDS performance. Our model has two versions based on Flat-Transformer and Hierarchical Transformer respectively. Empirical studies on both versions demonstrate that this simple but effective method outperforms existing works on the benchmark dataset. Extensive analyses examine different settings and configurations of the proposed model, providing a good reference to the community.
    Actionable Conversational Quality Indicators for Improving Task-Oriented Dialog Systems. (arXiv:2109.11064v1 [cs.CL])
    (2 min) Automatic dialog systems have become a mainstream part of online customer service. Many such systems are built, maintained, and improved by customer service specialists, rather than dialog systems engineers and computer programmers. As conversations between people and machines become commonplace, it is critical to understand what is working, what is not, and what actions can be taken to reduce the frequency of inappropriate system responses. These analyses and recommendations need to be presented in terms that directly reflect the user experience rather than the internal dialog processing. This paper introduces and explains the use of Actionable Conversational Quality Indicators (ACQIs), which are used both to recognize parts of dialogs that can be improved and to recommend how to improve them. This combines benefits of previous approaches, some of which have focused on producing dialog quality scores while others have sought to categorize the types of errors the dialog system is making. We demonstrate the effectiveness of using ACQIs on LivePerson internal dialog systems used in commercial customer service applications, and on the publicly available CMU LEGOv2 conversational dataset (Raux et al. 2005). We report on the annotation and analysis of conversational datasets showing which ACQIs are important to fix in various situations. The annotated datasets are then used to build a predictive model which uses a turn-based vector embedding of the message texts and achieves a 79% weighted average F1-measure at the task of finding the correct ACQI for a given conversation. We predict that if such a model worked perfectly, the range of potential improvement actions a bot-builder must consider at each turn could be reduced by an average of 81%.
    A Second Pandemic? Analysis of Fake News About COVID-19 Vaccines in Qatar. (arXiv:2109.11372v1 [cs.CL])
    (2 min) While COVID-19 vaccines are finally becoming widely available, a second pandemic that revolves around the circulation of anti-vaxxer fake news may hinder efforts to recover from the first one. With this in mind, we performed an extensive analysis of Arabic and English tweets about COVID-19 vaccines, with focus on messages originating from Qatar. We found that Arabic tweets contain a lot of false information and rumors, while English tweets are mostly factual. However, English tweets are much more propagandistic than Arabic ones. In terms of propaganda techniques, about half of the Arabic tweets express doubt, and 1/5 use loaded language, while English tweets are abundant in loaded language, exaggeration, fear, name-calling, doubt, and flag-waving. Finally, in terms of framing, Arabic tweets adopt a health and safety perspective, while in English economic concerns dominate.
    Joint speaker diarisation and tracking in switching state-space model. (arXiv:2109.11140v1 [cs.SD])
    (2 min) Speakers may move around while diarisation is being performed. When a microphone array is used, the instantaneous locations of where the sounds originated from can be estimated, and previous investigations have shown that such information can be complementary to speaker embeddings in the diarisation task. However, these approaches often assume that speakers are fairly stationary throughout a meeting. This paper relaxes this assumption, by proposing to explicitly track the movements of speakers while jointly performing diarisation within a unified model. A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers. The model is implemented as a particle filter. Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.
    The Current State of Finnish NLP. (arXiv:2109.11326v1 [cs.CL])
    (2 min) There are a lot of tools and resources available for processing Finnish. In this paper, we survey recent papers focusing on Finnish NLP related to many different subcategories of NLP such as parsing, generation, semantics and speech. NLP research is conducted in many different research groups in Finland, and it is frequently the case that NLP tools and models resulting from academic research are made available for others to use on platforms such as Github.
    Breaking BERT: Understanding its Vulnerabilities for Named Entity Recognition through Adversarial Attack. (arXiv:2109.11308v1 [cs.CL])
    (2 min) Both generic and domain-specific BERT models are widely used for natural language processing (NLP) tasks. In this paper we investigate the vulnerability of BERT models to variation in input data for Named Entity Recognition (NER) through adversarial attack. Experimental results show that the original as well as the domain-specific BERT models are highly vulnerable to entity replacement: They can be fooled in 89.2 to 99.4% of the cases to mislabel previously correct entities. BERT models are also vulnerable to variation in the entity context with 20.2 to 45.0% of entities predicted completely wrong and another 29.3 to 53.3% of entities predicted wrong partially. Often a single change is sufficient to fool the model. BERT models seem most vulnerable to changes in the local context of entities. Of the two domain-specific BERT models, the vulnerability of BioBERT is comparable to the original BERT model whereas SciBERT is even more vulnerable. Our results chart the vulnerabilities of BERT models for NER and emphasize the importance of further research into uncovering and reducing these weaknesses.
    The Volctrans GLAT System: Non-autoregressive Translation Meets WMT21. (arXiv:2109.11247v1 [cs.CL])
    (2 min) This paper describes Volctrans' submission to the WMT21 news translation shared task for German->English translation. We build a parallel (i.e., non-autoregressive) translation system using the Glancing Transformer, which enables fast and accurate parallel decoding in contrast to the currently prevailing autoregressive models. To the best of our knowledge, this is the first parallel translation system that can be scaled to such a practical scenario as the WMT competition. More importantly, our parallel translation system achieves the best BLEU score (35.0) on the German->English translation task, outperforming all strong autoregressive counterparts.
    Towards Universal Dense Retrieval for Open-domain Question Answering. (arXiv:2109.11085v1 [cs.CL])
    (2 min) In open-domain question answering, a model receives a text question as input and searches for the correct answer using a large evidence corpus. The retrieval step is especially difficult as typical evidence corpora have \textit{millions} of documents, each of which may or may not have the correct answer to the question. Very recently, dense models have replaced sparse methods as the de facto retrieval method. Rather than focusing on lexical overlap to determine similarity, dense methods build an encoding function that captures semantic similarity by learning from a small collection of question-answer or question-context pairs. In this paper, we investigate dense retrieval models in the context of open-domain question answering across different input distributions. To do this, first we introduce an entity-rich question answering dataset constructed from Wikidata facts and demonstrate dense models are unable to generalize to unseen input question distributions. Second, we perform analyses aimed at better understanding the source of the problem and propose new training techniques to improve out-of-domain performance on a wide variety of datasets. We encourage the field to further investigate the creation of a single, universal dense retrieval model that generalizes well across all input distributions.
    Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing. (arXiv:2109.11105v1 [cs.CL])
    (2 min) We aim to identify how different components in the KD pipeline affect the resulting performance and how much the optimal KD pipeline varies across different datasets/tasks, such as the data augmentation policy, the loss function, and the intermediate representation for transferring the knowledge between teacher and student. To tease apart their effects, we propose Distiller, a meta KD framework that systematically combines a broad range of techniques across different stages of the KD pipeline, which enables us to quantify each component's contribution. Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MI-$\alpha$ objective functions with better bias/variance trade-off for estimating the MI between the teacher and the student. On a diverse set of NLP datasets, the best Distiller configurations are identified via large-scale hyperparameter optimization. Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation, MI-$\alpha$ performs the best, and 3) data augmentation provides a large boost for small training datasets or small student networks. Moreover, we find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm that can recommend a good KD pipeline for a new dataset.
    Cluster-based Mention Typing for Named Entity Disambiguation. (arXiv:2109.11389v1 [cs.CL])
    (2 min) An entity mention in text such as "Washington" may correspond to many different named entities such as the city "Washington D.C." or the newspaper "Washington Post." The goal of named entity disambiguation is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g. location or person) of a mentioned entity can be correctly predicted from the context, it may increase the chance of selecting the right candidate by assigning low probability to unlikely ones. This paper proposes cluster-based mention typing for named entity disambiguation. The aim of mention typing is to predict the type of a given mention based on its context. Generally, manually curated type taxonomies such as Wikipedia categories are used. We introduce cluster-based mention typing, where named entities are clustered based on their contextual similarities and the cluster ids are assigned as types. The hyperlinked mentions and their context in Wikipedia are used in order to obtain these cluster-based types. Then, mention typing models are trained on these mentions, which have been labeled with their cluster-based types through distant supervision. At the named entity disambiguation phase, first the cluster-based types of a given mention are predicted and then these types are used as features in a ranking model to select the best entity among the candidates. We represent entities at multiple contextual levels and obtain different clusterings (and thus typing models) based on each level. As each clustering partitions the entity space differently, mention typing based on each clustering discriminates the mention differently. When predictions from all typing models are used together, our system achieves better or comparable results, based on randomization tests, with respect to the state-of-the-art levels on four de facto standard test sets.
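    A minimal sketch of the cluster-based typing idea: embed each entity by, for example, averaging the contexts of its hyperlinked Wikipedia mentions, cluster the vectors, and use the cluster id as the entity's type label for training the mention-typing model. The entity names and random vectors below are made up for illustration.

        import numpy as np
        from sklearn.cluster import KMeans

        # hypothetical context embeddings of hyperlinked Wikipedia mentions
        entity_vectors = {
            "Washington_D.C.":   np.random.rand(64),
            "Washington_Post":   np.random.rand(64),
            "George_Washington": np.random.rand(64),
            "Seattle":           np.random.rand(64),
        }

        # cluster entities by contextual similarity; the cluster id becomes
        # the "type" label used to train the mention-typing model
        names = list(entity_vectors)
        X = np.stack([entity_vectors[n] for n in names])
        cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        print(dict(zip(names, cluster_ids)))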
    BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles. (arXiv:2109.11087v1 [cs.CL])
    (2 min) A riddle is a question or statement with double or veiled meanings, followed by an unexpected answer. Solving riddles is a challenging task for both machines and humans, testing the capability of understanding figurative, creative natural language and reasoning with commonsense knowledge. We introduce BiRdQA, a bilingual multiple-choice question answering dataset with 6614 English riddles and 8751 Chinese riddles. For each riddle-answer pair, we provide four distractors with additional information from Wikipedia. The distractors are automatically generated at scale with minimal bias. Existing monolingual and multilingual QA models fail to perform well on our dataset, indicating that there is a long way to go before machines can beat humans at solving tricky riddles. The dataset has been released to the community.
    Using Sociolinguistic Variables to Reveal Changing Attitudes Towards Sexuality and Gender. (arXiv:2109.11061v1 [cs.CL])
    (2 min) Individuals signal aspects of their identity and beliefs through linguistic choices. Studying these choices in aggregate allows us to examine large-scale attitude shifts within a population. Here, we develop computational methods to study word choice within a sociolinguistic lexical variable -- alternate words used to express the same concept -- in order to test for change in the United States towards sexuality and gender. We examine two variables: i) referents to significant others, such as the word "partner" and ii) referents to an indefinite person, both of which could optionally be marked with gender. The linguistic choices in each variable allow us to study increased rates of acceptances of gay marriage and gender equality, respectively. In longitudinal analyses across Twitter and Reddit over 87M messages, we demonstrate that attitudes are changing but that these changes are driven by specific demographics within the United States. Further, in a quasi-causal analysis, we show that passages of Marriage Equality Acts in different states are drivers of linguistic change.
    Zero-Shot Information Extraction as a Unified Text-to-Triple Translation. (arXiv:2109.11171v1 [cs.CL])
    (2 min) We cast a suite of information extraction tasks into a text-to-triple translation framework. Instead of solving each task relying on task-specific datasets and models, we formalize the task as a translation between task-specific input text and output triples. By taking the task-specific input, we enable a task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about the task. We further demonstrate that a simple pre-training task of predicting which relational information corresponds to which input text is an effective way to produce task-specific outputs. This enables the zero-shot transfer of our framework to downstream tasks. We study the zero-shot performance of this framework on open information extraction (OIE2016, NYT, WEB, PENN), relation classification (FewRel and TACRED), and factual probe (Google-RE and T-REx). The model transfers non-trivially to most tasks and is often competitive with a fully supervised method without the need for any task-specific training. For instance, we significantly outperform the F1 score of the supervised open information extraction without needing to use its training set.
    Non-Parametric Online Learning from Human Feedback for Neural Machine Translation. (arXiv:2109.11136v1 [cs.CL])
    (2 min) We study the problem of online learning with human feedback in human-in-the-loop machine translation, in which human translators revise the machine-generated translations and the corrected translations are then used to improve the neural machine translation (NMT) system. However, previous methods require online model updating or additional translation memory networks to achieve high-quality performance, making them inflexible and inefficient in practice. In this paper, we propose a novel non-parametric online learning method that does not change the model structure. This approach introduces two k-nearest-neighbor (KNN) modules: one module memorizes the human feedback, i.e., the correct sentences provided by human translators, while the other adaptively balances the usage of the history of human feedback and the original NMT model. Experiments conducted on EMEA and JRC-Acquis benchmarks demonstrate that our proposed method obtains substantial improvements in translation accuracy and achieves better adaptation performance with fewer repeated human corrections.
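    As a rough illustration of the memorization module, the sketch below (an assumption, not the paper's implementation) stores human-corrected target tokens keyed by decoder context vectors and interpolates a k-nearest-neighbor distribution with the NMT output distribution; the fixed interpolation weight stands in for the adaptive balancing module.

        # Hedged sketch: kNN retrieval over memorized human corrections, blended with NMT probabilities.
        import numpy as np

        def knn_augmented_probs(p_nmt, query, datastore_keys, datastore_tokens,
                                vocab_size, k=4, temperature=10.0, lam=0.3):
            dists = np.linalg.norm(datastore_keys - query, axis=1)   # distance to stored contexts
            nn = np.argsort(dists)[:k]                               # k nearest corrections
            weights = np.exp(-dists[nn] / temperature)
            weights /= weights.sum()
            p_knn = np.zeros(vocab_size)
            for w, tok in zip(weights, datastore_tokens[nn]):
                p_knn[tok] += w                                      # aggregate weight per target token
            return (1.0 - lam) * p_nmt + lam * p_knn                 # interpolated next-token distribution

        # Toy usage: a 6-word vocabulary and 8 stored corrections.
        rng = np.random.default_rng(0)
        keys, toks = rng.normal(size=(8, 16)), rng.integers(0, 6, size=8)
        p = knn_augmented_probs(np.full(6, 1 / 6), rng.normal(size=16), keys, toks, vocab_size=6)
        print(p.sum())  # remains a valid probability distribution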
    Controlled Evaluation of Grammatical Knowledge in Mandarin Chinese Language Models. (arXiv:2109.11058v1 [cs.CL])
    (2 min) Prior work has shown that structural supervision helps English language models learn generalizations about syntactic phenomena such as subject-verb agreement. However, it remains unclear if such an inductive bias would also improve language models' ability to learn grammatical dependencies in typologically different languages. Here we investigate this question in Mandarin Chinese, which has a logographic, largely syllable-based writing system; different word order; and sparser morphology than English. We train LSTMs, Recurrent Neural Network Grammars, Transformer language models, and Transformer-parameterized generative parsing models on two Mandarin Chinese datasets of different sizes. We evaluate the models' ability to learn different aspects of Mandarin grammar that assess syntactic and semantic relationships. We find suggestive evidence that structural supervision helps with representing syntactic state across intervening content and improves performance in low-data settings, suggesting that the benefits of hierarchical inductive biases in acquiring dependency relationships may extend beyond English.
    Exploiting Curriculum Learning in Unsupervised Neural Machine Translation. (arXiv:2109.11177v1 [cs.CL])
    (2 min) Back-translation (BT) has become one of the de facto components in unsupervised neural machine translation (UNMT), as it is what explicitly gives UNMT its translation ability. However, all the pseudo bi-texts generated by BT are treated equally as clean data during optimization without considering their quality diversity, leading to slow convergence and limited translation performance. To address this problem, we propose a curriculum learning method to gradually utilize pseudo bi-texts based on their quality at multiple granularities. Specifically, we first apply cross-lingual word embeddings to calculate the potential translation difficulty (quality) of the monolingual sentences. Then, the sentences are fed into UNMT from easy to hard, batch by batch. Furthermore, considering that the quality of sentences/tokens within a particular batch is also diverse, we further adopt the model itself to calculate fine-grained quality scores, which serve as learning factors to balance the contributions of different parts when computing the loss and to encourage the UNMT model to focus on pseudo data with higher quality. Experimental results on WMT 14 En-Fr, WMT 16 En-De, WMT 16 En-Ro, and LDC En-Zh translation tasks demonstrate that the proposed method achieves consistent improvements with faster convergence.
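    A minimal sketch of the easy-to-hard scheduling step, assuming difficulty scores have already been computed (e.g., from cross-lingual word embeddings); the scores below are random placeholders.

        # Illustrative curriculum batching: lower difficulty scores are fed to training first.
        import numpy as np

        def curriculum_batches(sentences, difficulty, batch_size):
            order = np.argsort(difficulty)                # easy (low score) to hard (high score)
            for start in range(0, len(order), batch_size):
                yield [sentences[i] for i in order[start:start + batch_size]]

        sentences = [f"sent_{i}" for i in range(10)]
        difficulty = np.random.default_rng(0).random(10)  # placeholder for embedding-based scores
        for batch in curriculum_batches(sentences, difficulty, batch_size=4):
            print(batch)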
    Scalable Fact-checking with Human-in-the-Loop. (arXiv:2109.10992v1 [cs.CL])
    (2 min) Researchers have been investigating automated solutions for fact-checking on a variety of fronts. However, current approaches often overlook the fact that the amount of information released every day is escalating and that much of it overlaps. Intending to accelerate fact-checking, we bridge this gap by grouping similar messages and summarizing them into aggregated claims. Specifically, we first clean a set of social media posts (e.g., tweets) and build a graph of all posts based on their semantics; then, we apply two clustering methods to group the messages for further claim summarization. We evaluate the summaries both quantitatively with ROUGE scores and qualitatively with human evaluation. We also generate a graph of summaries to verify that there is no significant overlap among them. Our pipeline reduced 28,818 original messages to 700 summary claims, showing the potential to speed up the fact-checking process by organizing and selecting representative claims from massive, disorganized, and redundant messages.
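    The grouping step might look like the following hypothetical sketch: posts are embedded (random stand-ins here), clustered, and the post closest to each cluster centre is kept as the representative claim to be fact-checked. The cluster count and embeddings are assumptions for illustration only.

        # Hypothetical grouping of similar posts into representative claims.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        posts = [f"post_{i}" for i in range(200)]
        embeddings = rng.normal(size=(200, 32))           # stand-in for semantic post embeddings

        km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(embeddings)
        representatives = []
        for c in range(10):
            members = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
            representatives.append(posts[members[dists.argmin()]])
        print(representatives)                            # one candidate claim per group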
  • cs.CV updates on arXiv.org

    From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation. (arXiv:1907.10326v6 [cs.CV] UPDATED)
    (2 min) Estimating accurate depth from a single image is challenging because it is an ill-posed problem as infinitely many 3D scenes can be projected to the same 2D scene. However, recent works based on deep convolutional neural networks show great progress with plausible results. The convolutional neural networks are generally composed of two parts: an encoder for dense feature extraction and a decoder for predicting the desired depth. In the encoder-decoder schemes, repeated strided convolution and spatial pooling layers lower the spatial resolution of transitional outputs, and several techniques such as skip connections or multi-layer deconvolutional networks are adopted to recover the original resolution for effective dense prediction. In this paper, for more effective guidance of densely encoded features to the desired depth prediction, we propose a network architecture that utilizes novel local planar guidance layers located at multiple stages in the decoding phase. We show that the proposed method outperforms state-of-the-art works by a significant margin when evaluated on challenging benchmarks. We also provide results from an ablation study to validate the effectiveness of the proposed method.
    Weaving Attention U-net: A Novel Hybrid CNN and Attention-based Method for Organs-at-risk Segmentation in Head and Neck CT Images. (arXiv:2107.04847v2 [eess.IV] UPDATED)
    (2 min) In radiotherapy planning, manual contouring is labor-intensive and time-consuming. Accurate and robust automated segmentation models improve the efficiency and treatment outcome. We aim to develop a novel hybrid deep learning approach, combining convolutional neural networks (CNNs) and the self-attention mechanism, for rapid and accurate multi-organ segmentation on head and neck computed tomography (CT) images. Head and neck CT images with manual contours of 115 patients were retrospectively collected and used. We set the training/validation/testing ratio to 81/9/25 and used the 10-fold cross-validation strategy to select the best model parameters. The proposed hybrid model segmented ten organs-at-risk (OARs) altogether for each case. The performance of the model was evaluated by three metrics, i.e., the Dice Similarity Coefficient (DSC), Hausdorff distance 95% (HD95), and mean surface distance (MSD). We also tested the performance of the model on the Head and Neck 2015 challenge dataset and compared it against several state-of-the-art automated segmentation algorithms. The proposed method generated contours that closely resemble the ground truth for ten OARs. Our results of the new Weaving Attention U-net demonstrate superior or similar performance on the segmentation of head and neck CT images.
    Unseen Object Amodal Instance Segmentation via Hierarchical Occlusion Modeling. (arXiv:2109.11103v1 [cs.RO])
    (2 min) Instance-aware segmentation of unseen objects is essential for a robotic system in an unstructured environment. Although previous works achieved encouraging results, they were limited to segmenting the only visible regions of unseen objects. For robotic manipulation in a cluttered scene, amodal perception is required to handle the occluded objects behind others. This paper addresses Unseen Object Amodal Instance Segmentation (UOAIS) to detect 1) visible masks, 2) amodal masks, and 3) occlusions on unseen object instances. For this, we propose a Hierarchical Occlusion Modeling (HOM) scheme designed to reason about the occlusion by assigning a hierarchy to a feature fusion and prediction order. We evaluated our method on three benchmarks (tabletop, indoors, and bin environments) and achieved state-of-the-art (SOTA) performance. Robot demos for picking up occluded objects, codes, and datasets are available at https://sites.google.com/view/uoais
    Dynamic Gesture Recognition. (arXiv:2109.09396v2 [cs.CV] UPDATED)
    (2 min) Human-Machine Interaction (HMI) is an important research topic in machine learning that has been deeply investigated thanks to the rise of computing power in recent years. For the first time, it is possible to use machine learning to classify images and/or videos instead of traditional computer vision algorithms. The aim of this project is to build a symbiosis between a convolutional neural network (CNN) [1] and a recurrent neural network (RNN) [2] to recognize cultural/anthropological Italian sign language gestures from videos. The CNN extracts important features that are later used by the RNN. With RNNs we are able to store temporal information inside the model to provide contextual information from previous frames and enhance the prediction accuracy. Our novel approach uses different data augmentation techniques and regularization methods on RGB frames only to avoid overfitting and provide a small generalization error.
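    A generic CNN-to-RNN sketch of the kind of architecture described above (layer sizes, the number of classes, and clip shapes are placeholders, not the project's actual network): per-frame features from a small CNN are fed to an LSTM, whose final hidden state classifies the gesture.

        # Minimal CNN + LSTM video classifier sketch (PyTorch); all sizes are illustrative.
        import torch
        import torch.nn as nn

        class CnnRnnGesture(nn.Module):
            def __init__(self, num_classes=10, feat_dim=128, hidden=64):
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
                self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden, num_classes)

            def forward(self, clips):                     # clips: (B, T, 3, H, W)
                B, T = clips.shape[:2]
                feats = self.cnn(clips.flatten(0, 1)).view(B, T, -1)  # per-frame features
                _, (h, _) = self.rnn(feats)
                return self.head(h[-1])                   # classify from the last hidden state

        logits = CnnRnnGesture()(torch.randn(2, 8, 3, 64, 64))
        print(logits.shape)                               # torch.Size([2, 10])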
    D-DARTS: Distributed Differentiable Architecture Search. (arXiv:2108.09306v2 [cs.LG] UPDATED)
    (2 min) Differentiable ARchiTecture Search (DARTS) is one of the most trending Neural Architecture Search (NAS) methods, drastically reducing search cost by resorting to Stochastic Gradient Descent (SGD) and weight-sharing. However, it also greatly reduces the search space, thus excluding potentially promising architectures from being discovered. In this paper, we propose D-DARTS, a novel solution that addresses this problem by nesting several neural networks at the cell level instead of using weight-sharing to produce more diversified and specialized architectures. Moreover, we introduce a novel algorithm which can derive deeper architectures from a few trained cells, increasing performance and saving computation time. Our solution is able to provide state-of-the-art results on CIFAR-10, CIFAR-100 and ImageNet while using significantly fewer parameters than previous baselines, resulting in more hardware-efficient neural networks.
    Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark. (arXiv:2109.11243v1 [cs.CV])
    (2 min) Recognizing the emotional state of people is a basic but challenging task in video understanding. In this paper, we propose a new task in this field, named Pairwise Emotional Relationship Recognition (PERR). This task aims to recognize the emotional relationship between two interacting characters in a given video clip. It is different from the traditional emotion and social relation recognition tasks. A variety of information, including character appearance, behaviors, facial emotions, dialogues, background music, and subtitles, contributes differently to the final results, which makes the task more challenging but meaningful for developing more advanced multi-modal models. To facilitate the task, we develop a new dataset called Emotional RelAtionship of inTeractiOn (ERATO) based on dramas and movies. ERATO is a large-scale multi-modal dataset for the PERR task, with 31,182 video clips lasting about 203 video hours. Different from existing datasets, ERATO contains interaction-centric videos with multiple shots, varied video lengths, and multiple modalities including visual, audio and text. As a minor contribution, we propose a baseline model composed of a Synchronous Modal-Temporal Attention (SMTA) unit to fuse the multi-modal information for the PERR task. In contrast to other prevailing attention mechanisms, our proposed SMTA can steadily improve the performance by about 1%. We expect ERATO as well as our proposed SMTA to open up a new direction for the PERR task in video understanding and to further advance research on multi-modal fusion methodology.
    TransforMesh: A Transformer Network for Longitudinal modeling of Anatomical Meshes. (arXiv:2109.00532v2 [cs.CV] UPDATED)
    (2 min) The longitudinal modeling of neuroanatomical changes related to Alzheimer's disease (AD) is crucial for studying the progression of the disease. To this end, we introduce TransforMesh, a spatio-temporal network based on transformers that models longitudinal shape changes on 3D anatomical meshes. While transformer and mesh networks have recently shown impressive performances in natural language processing and computer vision, their application to medical image analysis has been very limited. To the best of our knowledge, this is the first work that combines transformer and mesh networks. Our results show that TransforMesh can model shape trajectories better than other baseline architectures that do not capture temporal dependencies. Moreover, we also explore the capabilities of TransforMesh in detecting structural anomalies of the hippocampus in patients developing AD.
    Diffusion-Based Representation Learning. (arXiv:2105.14257v2 [cs.LG] UPDATED)
    (2 min) Score-based methods represented as stochastic differential equations on a continuous time domain have recently proven successful as a non-adversarial generative model. Training such models relies on denoising score matching, which can be seen as multi-scale denoising autoencoders. Here, we augment the denoising score-matching framework to enable representation learning without any supervised signal. GANs and VAEs learn representations by directly transforming latent codes to data samples. In contrast, the introduced diffusion based representation learning relies on a new formulation of the denoising score-matching objective and thus encodes information needed for denoising. We illustrate how this difference allows for manual control of the level of details encoded in the representation. Using the same approach, we propose to learn an infinite-dimensional latent code which achieves improvements of state-of-the-art models on semi-supervised image classification. As a side contribution, we show how adversarial training in score-based models can improve sample quality and improve sampling speed using a new approximation of the prior at smaller noise scales.
    Few-Shot Action Localization without Knowing Boundaries. (arXiv:2106.04150v2 [cs.CV] UPDATED)
    (2 min) Learning to localize actions in long, cluttered, and untrimmed videos is a hard task that in the literature has typically been addressed by assuming the availability of large amounts of annotated training samples for each class -- either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only class labels are known for each video. In this paper, we go a step further and show that it is possible to learn to localize actions in untrimmed videos when a) only one/few trimmed examples of the target action are available at test time, and b) a large collection of videos with only class label annotation (some trimmed and some weakly annotated untrimmed ones) is available for training, with no overlap between the classes used during training and testing. To do so, we propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos (trimmed or untrimmed), and uses them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes. The TCAMs serve as temporal attention mechanisms to extract video-level representations of untrimmed videos and to temporally localize actions at test time. To the best of our knowledge, we are the first to propose a weakly-supervised, one/few-shot action localization network that can be trained in an end-to-end fashion. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that our method achieves performance comparable to or better than state-of-the-art fully-supervised, few-shot learning methods.
    Mixed-supervised segmentation: Confidence maximization helps knowledge distillation. (arXiv:2109.10902v1 [eess.IV])
    (2 min) Despite achieving promising results in a breadth of medical image segmentation tasks, deep neural networks require large training datasets with pixel-wise annotations. Obtaining these curated datasets is a cumbersome process which limits the application in scenarios where annotated images are scarce. Mixed supervision is an appealing alternative for mitigating this obstacle, where only a small fraction of the data contains complete pixel-wise annotations and other images have a weaker form of supervision. In this work, we propose a dual-branch architecture, where the upper branch (teacher) receives strong annotations, while the bottom one (student) is driven by limited supervision and guided by the upper branch. Combined with a standard cross-entropy loss over the labeled pixels, our novel formulation integrates two important terms: (i) a Shannon entropy loss defined over the less-supervised images, which encourages confident student predictions in the bottom branch; and (ii) a Kullback-Leibler (KL) divergence term, which transfers the knowledge of the strongly supervised branch to the less-supervised branch and guides the entropy (student-confidence) term to avoid trivial solutions. We show that the synergy between the entropy and KL divergence yields substantial improvements in performance. We also discuss an interesting link between Shannon-entropy minimization and standard pseudo-mask generation, and argue that the former should be preferred over the latter for leveraging information from unlabeled pixels. Quantitative and qualitative results on two publicly available datasets demonstrate that our method significantly outperforms other strategies for semantic segmentation within a mixed-supervision framework, as well as recent semi-supervised approaches. Moreover, we show that the branch trained with reduced supervision and guided by the top branch largely outperforms the latter.
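    The combination of the three loss terms described above can be sketched roughly as follows; this is a simplified assumption of the formulation, where loss weights, tensor shapes and the teacher handling are illustrative rather than the paper's exact settings.

        # Sketch of a mixed-supervision loss: cross-entropy on labeled pixels, Shannon entropy on
        # weakly supervised pixels, and a KL term distilling the strongly supervised (teacher) branch.
        import torch
        import torch.nn.functional as F

        def mixed_supervision_loss(student_logits, teacher_logits, labels,
                                   labeled_mask, w_ent=0.1, w_kl=1.0):
            # student_logits, teacher_logits: (B, C, H, W); labels: (B, H, W); labeled_mask: (B, H, W) bool
            ce = F.cross_entropy(student_logits, labels, reduction="none")
            ce = (ce * labeled_mask).sum() / labeled_mask.sum().clamp(min=1)

            p_student = F.softmax(student_logits, dim=1)
            entropy = -(p_student * torch.log(p_student + 1e-8)).sum(dim=1)
            entropy = (entropy * (~labeled_mask)).sum() / (~labeled_mask).sum().clamp(min=1)

            kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                          F.softmax(teacher_logits.detach(), dim=1),
                          reduction="batchmean")
            return ce + w_ent * entropy + w_kl * kl

        B, C, H, W = 2, 4, 8, 8
        loss = mixed_supervision_loss(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                                      torch.randint(0, C, (B, H, W)), torch.rand(B, H, W) > 0.5)
        print(loss)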
    Puzzle-CAM: Improved localization via matching partial and full features. (arXiv:2101.11253v4 [cs.CV] UPDATED)
    (0 min) Weakly-supervised semantic segmentation (WSSS) is introduced to narrow the gap for semantic segmentation performance from pixel-level supervision to image-level supervision. Most advanced approaches are based on class activation maps (CAMs) to generate pseudo-labels to train the segmentation network. The main limitation of WSSS is that the process of generating pseudo-labels from CAMs that use an image classifier is mainly focused on the most discriminative parts of the objects. To address this issue, we propose Puzzle-CAM, a process that minimizes differences between the features from separate patches and the whole image. Our method consists of a puzzle module and two regularization terms to discover the most integrated region in an object. Puzzle-CAM can activate the overall region of an object using image-level supervision without requiring extra parameters. In experiments, Puzzle-CAM outperformed previous state-of-the-art methods using the same labels for supervision on the PASCAL VOC 2012 dataset. Code associated with our experiments is available at https://github.com/OFRIN/PuzzleCAM.
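    The core puzzle idea can be paraphrased in a few lines: CAMs computed from image tiles are re-assembled and pushed (here with an L1 term) towards the CAM of the full image. The sketch below is an assumed simplification, with a stand-in convolution playing the role of the CAM network.

        # Rough sketch of a puzzle-style reconstruction loss between tiled and full-image CAMs.
        import torch
        import torch.nn.functional as F

        def puzzle_reconstruction_loss(model, images):
            # images: (B, 3, H, W) with H and W divisible by 2
            B, _, H, W = images.shape
            full_cam = model(images)                            # CAMs of the full images
            tiles = torch.cat([images[..., :H // 2, :W // 2], images[..., :H // 2, W // 2:],
                               images[..., H // 2:, :W // 2], images[..., H // 2:, W // 2:]], dim=0)
            q = torch.chunk(model(tiles), 4, dim=0)             # CAMs of the four quarters
            merged = torch.cat([torch.cat([q[0], q[1]], dim=3),
                                torch.cat([q[2], q[3]], dim=3)], dim=2)   # stitch tiles back together
            merged = F.interpolate(merged, size=full_cam.shape[-2:], mode="bilinear",
                                   align_corners=False)
            return F.l1_loss(merged, full_cam)

        cam_net = torch.nn.Conv2d(3, 5, 3, padding=1)           # stand-in "CAM" network with 5 classes
        print(puzzle_reconstruction_loss(cam_net, torch.randn(2, 3, 32, 32)))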
    Improving Generalization of Transfer Learning Across Domains Using Spatio-Temporal Features in Autonomous Driving. (arXiv:2103.08116v2 [cs.CV] UPDATED)
    (2 min) Practical learning-based autonomous driving models must be capable of generalizing learned behaviors from simulated to real domains, and from training data to unseen domains with unusual image properties. In this paper, we investigate transfer learning methods that achieve robustness to domain shifts by taking advantage of the invariance of spatio-temporal features across domains, and we propose a transfer learning method that improves generalization across domains via transfer of spatio-temporal features and salient data augmentation. Our model uses a CNN-LSTM network with Inception modules for image feature extraction. Our method runs in two phases: Phase 1 involves training on source domain data, while Phase 2 performs training on target domain data that has been supplemented by feature maps generated using the Phase 1 model. Our model significantly improves performance in unseen test cases for both simulation-to-simulation transfer and simulation-to-real transfer, by up to +37.3% in test accuracy and up to +40.8% in steering angle prediction, compared to other SOTA methods across multiple datasets.
    Learning to Downsample for Segmentation of Ultra-High Resolution Images. (arXiv:2109.11071v1 [cs.CV])
    (2 min) Segmentation of ultra-high resolution images with deep learning is challenging because of their enormous size, often millions or even billions of pixels. Typical solutions drastically downsample the image uniformly to meet memory constraints, implicitly assuming all pixels are equally important by sampling at the same density at all spatial locations. However, this assumption does not hold and compromises the performance of deep learning techniques that have proved powerful on standard-sized images. For example, with uniform downsampling (see the green boxed region in Fig. 1 of the paper), the rider and bike do not receive enough samples while the trees and buildings are oversampled, which has a negative effect on the segmentation prediction from the low-resolution downsampled image. In this work we show that learning the spatially varying downsampling strategy jointly with segmentation offers advantages in segmenting large images with a limited computational budget. Fig. 1 shows that our method adapts the sampling density over different locations so that more samples are collected from the small important regions and fewer from the others, which in turn leads to better segmentation accuracy. We show on two public and one local high-resolution datasets that our method consistently learns sampling locations that preserve more information and boost segmentation accuracy over baseline methods.
    Domain-guided Machine Learning for Remotely Sensed In-Season Crop Growth Estimation. (arXiv:2106.13323v2 [cs.LG] UPDATED)
    (2 min) Advanced machine learning techniques have been used in remote sensing (RS) applications such as crop mapping and yield prediction, but remain under-utilized for tracking crop progress. In this study, we demonstrate the use of agronomic knowledge of crop growth drivers in a Long Short-Term Memory-based, domain-guided neural network (DgNN) for in-season crop progress estimation. The DgNN uses a branched structure and attention to separate independent crop growth drivers and capture their varying importance throughout the growing season. The DgNN is implemented for corn, using RS data in Iowa for the period 2003-2019, with USDA crop progress reports used as ground truth. State-wide DgNN performance shows significant improvement over sequential and dense-only NN structures, and a widely-used Hidden Markov Model method. The DgNN had a 4.0% higher Nash-Sutcliffe efficiency over all growth stages and 39% more weeks with highest cosine similarity than the next best NN during test years. The DgNN and Sequential NN were more robust during periods of abnormal crop progress, though estimating the Silking-Grainfill transition was difficult for all methods. Finally, Uniform Manifold Approximation and Projection visualizations of layer activations showed how LSTM-based NNs separate crop growth time-series differently from a dense-only structure. Results from this study exhibit both the viability of NNs in crop growth stage estimation (CGSE) and the benefits of using domain knowledge. The DgNN methodology presented here can be extended to provide near-real time CGSE of other crops.
    LGD: Label-guided Self-distillation for Object Detection. (arXiv:2109.11496v1 [cs.CV])
    (2 min) In this paper, we propose the first self-distillation framework for general object detection, termed LGD (Label-Guided self-Distillation). Previous studies rely on a strong pretrained teacher to provide instructive knowledge for distillation. However, this could be unavailable in real-world scenarios. Instead, we generate instructive knowledge via inter-and-intra relation modeling among objects, requiring only student representations and regular labels. In detail, our framework involves sparse label-appearance encoding, inter-object relation adaptation and intra-object knowledge mapping to obtain the instructive knowledge. Modules in LGD are trained end-to-end with the student detector and are discarded at inference. Empirically, LGD obtains decent results on various detectors, datasets, and extended tasks such as instance segmentation. For example, on the MS-COCO dataset, LGD improves RetinaNet with ResNet-50 under 2x single-scale training from 36.2% to 39.0% mAP (+2.8%). For much stronger detectors like FCOS with ResNeXt-101 DCN v2 under 2x multi-scale training (46.1%), LGD achieves 47.9% (+1.8%). For pedestrian detection on the CrowdHuman dataset, LGD boosts mMR by 2.3% for Faster R-CNN with ResNet-50. Compared with the classical teacher-based method FGFI, LGD not only performs better without requiring a pretrained teacher but also incurs a 51% lower training cost beyond inherent student learning.
    Convolutional Neural Network for emotion recognition to assist psychiatrists and psychologists during the COVID-19 pandemic: experts opinion. (arXiv:2005.07649v2 [cs.CV] UPDATED)
    (3 min) A web application with real-time emotion recognition for psychologists and psychiatrists is presented. Mental health effects during COVID-19 quarantine need to be handled because society is being emotionally impacted. Human micro-expressions can describe genuine emotions that can be captured by Convolutional Neural Network (CNN) models. The challenge, however, is to deploy this under the poor performance of many users' computers and slow internet connections, i.e., to improve computational efficiency and reduce data transfer. To validate the computational efficiency premise, we compare CNN architecture results, collecting the floating-point operations per second (FLOPS), the Number of Parameters (NP) and accuracy of MobileNet, PeleeNet, Extended Deep Neural Network (EDNN), Inception-Based Deep Neural Network (IDNN) and our proposed Residual mobile-based Network model (ResmoNet). We also compare the trained models in terms of Main Memory Utilization (MMU) and Response Time to complete the Emotion recognition (RTE). In addition, we design a data transfer scheme that includes the raw emotion data and basic patient information. The web application was evaluated with the System Usability Scale (SUS) and a utility questionnaire by psychologists and psychiatrists. The ResmoNet model produced the lowest NP, FLOPS, and MMU results; only EDNN surpasses ResmoNet, by 0.01 s, in RTE. The optimizations to our model impacted its accuracy: IDNN and EDNN are 0.02 and 0.05 more accurate than our model, respectively. Finally, according to psychologists and psychiatrists, the web application has good usability (73.8 of 100) and utility (3.94 of 5).
    Stereo CenterNet based 3D Object Detection for Autonomous Driving. (arXiv:2103.11071v3 [cs.CV] UPDATED)
    (2 min) Recently, three-dimensional (3D) detection based on stereo images has progressed remarkably; however, most advanced methods adopt anchor-based two-dimensional (2D) detection or depth estimation to address this problem. Nevertheless, high computational cost inhibits these methods from achieving real-time performance. In this study, we propose a 3D object detection method, Stereo CenterNet (SC), using geometric information in stereo imagery. SC predicts the four semantic key points of the 3D bounding box of the object in space and utilizes 2D left and right boxes, 3D dimension, orientation, and key points to restore the bounding box of the object in the 3D space. Subsequently, we adopt an improved photometric alignment module to further optimize the position of the 3D bounding box. Experiments conducted on the KITTI dataset indicate that the proposed SC exhibits the best speed-accuracy trade-off among advanced methods without using extra data.
    Training Affective Computer Vision Models by Crowdsourcing Soft-Target Labels. (arXiv:2101.03477v2 [cs.CV] UPDATED)
    (3 min) Emotion classifiers traditionally predict discrete emotions. However, emotion expressions are often subjective, thus requiring a method to handle subjective labels. We explore the use of crowdsourcing to acquire reliable soft-target labels and evaluate an emotion detection classifier trained with these labels. We center our study on the Child Affective Facial Expression (CAFE) dataset, a gold standard collection of images depicting pediatric facial expressions along with 100 human labels per image. To test the feasibility of crowdsourcing to generate these labels, we used Microworkers to acquire labels for 207 CAFE images. We evaluate both unfiltered workers as well as workers selected through a short crowd filtration process. We then train two versions of a classifier on soft-target CAFE labels using the original 100 annotations provided with the dataset: (1) a classifier trained with traditional one-hot encoded labels, and (2) a classifier trained with vector labels representing the distribution of CAFE annotator responses. We compare the resulting softmax output distributions of the two classifiers with a 2-sample independent t-test of L1 distances between the classifier's output probability distribution and the distribution of human labels. While agreement with CAFE is weak for unfiltered crowd workers, the filtered crowd agree with the CAFE labels 100% of the time for many emotions. While the F1-score of the one-hot encoded classifier is much higher (94.33% vs. 78.68%) with respect to the ground truth CAFE labels, the output probability vector of the crowd-trained classifier more closely resembles the distribution of human labels (t=3.2827, p=0.0014). Reporting an emotion probability distribution accounts for the subjectivity of human interpretation. Crowdsourcing, including a sufficient filtering mechanism, is a feasible solution for acquiring soft-target labels.
    The Hilti SLAM Challenge Dataset. (arXiv:2109.11316v1 [cs.RO])
    (2 min) Accurate and robust pose estimation is a fundamental capability for autonomous systems to navigate, map and perform tasks. In particular, construction environments pose challenging problems for Simultaneous Localization and Mapping (SLAM) algorithms due to sparsity, varying illumination conditions, and dynamic objects. Current academic research in SLAM is focused on developing more accurate and robust algorithms, for example by fusing different sensor modalities. To help this research, we propose a new dataset, the Hilti SLAM Challenge Dataset. The sensor platform used to collect this dataset contains a number of visual, lidar and inertial sensors which have all been rigorously calibrated. All data is temporally aligned to support precise multi-sensor fusion. Each dataset includes accurate ground truth to allow direct testing of SLAM results. Raw data as well as intrinsic and extrinsic sensor calibration data from twelve datasets in various environments are provided. Each environment represents common scenarios found in building construction sites in various stages of completion.
    Semantic Segmentation-assisted Scene Completion for LiDAR Point Clouds. (arXiv:2109.11453v1 [cs.CV])
    (2 min) Outdoor scene completion is a challenging issue in 3D scene understanding, which plays an important role in intelligent robotics and autonomous driving. Due to the sparsity of LiDAR acquisition, 3D scene completion and semantic segmentation are far more complex. Since semantic features can provide constraints and semantic priors for completion tasks, the relationship between them is worth exploring. Therefore, we propose an end-to-end semantic segmentation-assisted scene completion network, including a 2D completion branch and a 3D semantic segmentation branch. Specifically, the network takes a raw point cloud as input, and merges the features from the segmentation branch into the completion branch hierarchically to provide semantic information. By adopting a BEV representation and 3D sparse convolution, we benefit from a lower number of operations while maintaining effective expressiveness. Besides, the decoder of the segmentation branch is used as an auxiliary and can be discarded in the inference stage to save computation. Extensive experiments demonstrate that our method achieves competitive performance on the SemanticKITTI dataset with low latency. Code and models will be released at https://github.com/jokester-zzz/SSA-SC.
    Boosting the Generalization Capability in Cross-Domain Few-shot Learning via Noise-enhanced Supervised Autoencoder. (arXiv:2108.05028v2 [cs.CV] UPDATED)
    (2 min) State of the art (SOTA) few-shot learning (FSL) methods suffer significant performance drop in the presence of domain differences between source and target datasets. The strong discrimination ability on the source dataset does not necessarily translate to high classification accuracy on the target dataset. In this work, we address this cross-domain few-shot learning (CDFSL) problem by boosting the generalization capability of the model. Specifically, we teach the model to capture broader variations of the feature distributions with a novel noise-enhanced supervised autoencoder (NSAE). NSAE trains the model by jointly reconstructing inputs and predicting the labels of inputs as well as their reconstructed pairs. Theoretical analysis based on intra-class correlation (ICC) shows that the feature embeddings learned from NSAE have stronger discrimination and generalization abilities in the target domain. We also take advantage of NSAE structure and propose a two-step fine-tuning procedure that achieves better adaption and improves classification performance in the target domain. Extensive experiments and ablation studies are conducted to demonstrate the effectiveness of the proposed method. Experimental results show that our proposed method consistently outperforms SOTA methods under various conditions.
    Angular Triplet Loss-based Camera Network for ReID. (arXiv:2005.05740v2 [cs.CV] UPDATED)
    (2 min) Person re-identification (ReID) is a challenging cross-camera retrieval task to identify pedestrians. Many complex network structures have been proposed recently, and many of them concentrate on multi-branch features to achieve high performance. However, they are too heavy-weight to deploy in real-world applications. Additionally, pedestrian images are often captured by different surveillance cameras, so the varied lighting, perspectives and resolutions result in inevitable multi-camera domain gaps for ReID. To address these issues, this paper proposes ATCN, a simple but effective angular triplet loss-based camera network, which is able to achieve compelling performance with only global features. In ATCN, a novel angular distance is introduced to learn a more discriminative feature representation in the embedding space. Meanwhile, a lightweight camera network is designed to transfer global features into more discriminative features. ATCN is designed to be simple and flexible so it can be easily deployed in practice. The experimental results on various benchmark datasets show that ATCN outperforms many SOTA approaches.
    How much "human-like" visual experience do current self-supervised learning algorithms need to achieve human-level object recognition?. (arXiv:2109.11523v1 [cs.CV])
    (2 min) This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like", natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is on the order of a million years of natural visual experience, in other words several orders of magnitude longer than a human lifetime. However, this estimate is quite sensitive to some underlying assumptions, underscoring the need to run carefully controlled human experiments. We discuss the main caveats surrounding our estimate and the implications of this rather surprising result.
    Multi-view analysis of unregistered medical images using cross-view transformers. (arXiv:2103.11390v2 [cs.CV] UPDATED)
    (2 min) Multi-view medical image analysis often depends on the combination of information from multiple views. However, differences in perspective or other forms of misalignment can make it difficult to combine views effectively, as registration is not always possible. Without registration, views can only be combined at a global feature level, by joining feature vectors after global pooling. We present a novel cross-view transformer method to transfer information between unregistered views at the level of spatial feature maps. We demonstrate this method on multi-view mammography and chest X-ray datasets. On both datasets, we find that a cross-view transformer that links spatial feature maps can outperform a baseline model that joins feature vectors after global pooling.
    Annotation-efficient deep learning for automatic medical image segmentation. (arXiv:2012.04885v3 [eess.IV] UPDATED)
    (2 min) Automatic medical image segmentation plays a critical role in scientific research and medical care. Existing high-performance deep learning methods typically rely on large training datasets with high-quality manual annotations, which are difficult to obtain in many clinical applications. Here, we introduce Annotation-effIcient Deep lEarning (AIDE), an open-source framework to handle imperfect training datasets. Methodological analyses and empirical evaluations are conducted, and we demonstrate that AIDE surpasses conventional fully-supervised models by presenting better performance on open datasets possessing scarce or noisy annotations. We further test AIDE in a real-life case study for breast tumor segmentation. Three datasets containing 11,852 breast images from three medical centers are employed, and AIDE, utilizing 10% training annotations, consistently produces segmentation maps comparable to those generated by fully-supervised counterparts or provided by independent radiologists. The 10-fold enhanced efficiency in utilizing expert labels has the potential to promote a wide range of biomedical applications.
    A Gradient Flow Framework For Analyzing Network Pruning. (arXiv:2009.11839v4 [cs.LG] UPDATED)
    (2 min) Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters. We use this framework to determine the relationship between pruning measures and evolution of model parameters, establishing several results related to pruning models early-on in training: (i) magnitude-based pruning removes parameters that contribute least to reduction in loss, resulting in models that converge faster than magnitude-agnostic methods; (ii) loss-preservation based pruning preserves first-order model evolution dynamics and is therefore appropriate for pruning minimally trained models; and (iii) gradient-norm based pruning affects second-order model evolution dynamics, such that increasing gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10/CIFAR-100. Code available at https://github.com/EkdeepSLubana/flowandprune.
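    As a companion to point (i), the snippet below shows plain global magnitude-based pruning, the measure that the analysis characterizes as removing the parameters contributing least to the reduction in loss. It is a generic illustration, not the paper's framework; the sparsity level and the choice to skip one-dimensional parameters are assumptions.

        # Generic global magnitude pruning: zero out the smallest-|w| weights across the model.
        import torch
        import torch.nn as nn

        def magnitude_prune(model, sparsity=0.5):
            weights = torch.cat([p.abs().flatten() for p in model.parameters() if p.dim() > 1])
            threshold = torch.quantile(weights, sparsity)        # global magnitude cutoff
            with torch.no_grad():
                for p in model.parameters():
                    if p.dim() > 1:                              # skip biases / norm parameters
                        p.mul_((p.abs() > threshold).float())    # zero out small weights
            return model

        model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
        magnitude_prune(model, sparsity=0.5)
        print(sum((p == 0).sum().item() for p in model.parameters()))  # number of pruned weights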
    Recent Advances of Continual Learning in Computer Vision: An Overview. (arXiv:2109.11369v1 [cs.CV])
    (2 min) In contrast to batch learning where all training data is available at once, continual learning represents a family of methods that accumulate knowledge and learn continuously with data available in sequential order. Similar to the human learning process with the ability of learning, fusing, and accumulating new knowledge coming at different time steps, continual learning is considered to have high practical significance. Hence, continual learning has been studied in various artificial intelligence tasks. In this paper, we present a comprehensive review of the recent progress of continual learning in computer vision. In particular, the works are grouped by their representative techniques, including regularization, knowledge distillation, memory, generative replay, parameter isolation, and a combination of the above techniques. For each category of these techniques, both its characteristics and applications in computer vision are presented. At the end of this overview, several subareas, where continuous knowledge accumulation is potentially helpful while continual learning has not been well studied, are discussed.
    Informative and Representative Triplet Selection for Multi-Label Remote Sensing Image Retrieval. (arXiv:2105.03647v2 [cs.CV] UPDATED)
    (3 min) Learning the similarity between remote sensing (RS) images forms the foundation for content based RS image retrieval (CBIR). Recently, deep metric learning approaches that map the semantic similarity of images into an embedding (metric) space have become very popular in RS. A common approach for learning the metric space relies on the selection of triplets of similar (positive) and dissimilar (negative) images to a reference image called an anchor. Choosing triplets is a difficult task, particularly for multi-label RS CBIR, where each training image is annotated by multiple class labels. To address this problem, in this paper we propose a novel triplet sampling method in the framework of deep neural networks (DNNs) defined for multi-label RS CBIR problems. The proposed method selects a small set of the most representative and informative triplets based on two main steps. In the first step, a set of anchors that are diverse to each other in the embedding space is selected from the current mini-batch using an iterative algorithm. In the second step, different sets of positive and negative images are chosen for each anchor by evaluating the relevancy, hardness and diversity of the images among each other based on a novel strategy. Experimental results obtained on two multi-label benchmark archives show that the selection of the most informative and representative triplets in the context of DNNs results in: i) reducing the computational complexity of the training phase of the DNNs without any significant loss in performance; and ii) an increase in learning speed, since informative triplets allow fast convergence. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/image-retrieval-from-triplets.
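    For orientation only, a heavily simplified stand-in for the triplet idea is sketched below: each anchor takes one hardest in-batch positive and one hardest negative (sharing at least one label versus sharing none) and a standard margin loss is applied in the embedding space. The paper's actual diversity- and hardness-aware sampling is more elaborate.

        # Simplified multi-label triplet loss with in-batch hard mining (illustrative only).
        import torch
        import torch.nn.functional as F

        def hard_triplet_loss(embeddings, label_sets, margin=0.2):
            # label_sets: one Python set of class labels per (multi-label) image
            emb = F.normalize(embeddings, dim=1)
            dists = torch.cdist(emb, emb)                          # pairwise distances
            losses = []
            for a in range(len(label_sets)):
                pos = [i for i in range(len(label_sets)) if i != a and label_sets[i] & label_sets[a]]
                neg = [i for i in range(len(label_sets)) if not (label_sets[i] & label_sets[a])]
                if not pos or not neg:
                    continue
                d_pos = dists[a, pos].max()                        # farthest (hardest) positive
                d_neg = dists[a, neg].min()                        # closest (hardest) negative
                losses.append(F.relu(d_pos - d_neg + margin))
            return torch.stack(losses).mean() if losses else embeddings.sum() * 0.0

        print(hard_triplet_loss(torch.randn(4, 8), [{0, 1}, {1}, {2}, {2, 3}]))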
    Towards Adversarially Robust and Domain Generalizable Stereo Matching by Rethinking DNN Feature Backbones. (arXiv:2108.00335v2 [cs.CV] UPDATED)
    (2 min) Stereo matching has recently witnessed remarkable progress using Deep Neural Networks (DNNs). But how robust are they? Although it is well known that DNNs often suffer from adversarial vulnerability with a catastrophic drop in performance, the situation is even worse in stereo matching. This paper first shows that a type of weak white-box attack can overwhelm state-of-the-art methods. The attack is learned by a proposed stereo-constrained projected gradient descent (PGD) method in stereo matching. This observation raises serious concerns for the deployment of DNN-based stereo matching. Parallel to the adversarial vulnerability, DNN-based stereo matching is typically trained under the so-called simulation-to-reality pipeline, and thus domain generalizability is an important problem. This paper proposes to rethink the learnable DNN-based feature backbone towards adversarially robust and domain generalizable stereo matching by completely removing it for matching. We compute the matching cost volume using the classic multi-scale census transform (i.e., local binary pattern) of the raw input stereo images, followed by a stacked Hourglass head sub-network solving the matching problem. In experiments, the proposed method is tested on the SceneFlow dataset and the KITTI2015 benchmark, with promising results. It significantly improves adversarial robustness while retaining accuracy comparable to state-of-the-art methods. It also shows better generalizability from simulation (SceneFlow) to real (KITTI) datasets when no fine-tuning is used.
    Deep Co-Training with Task Decomposition for Semi-Supervised Domain Adaptation. (arXiv:2007.12684v5 [cs.CV] UPDATED)
    (2 min) Semi-supervised domain adaptation (SSDA) aims to adapt models trained from a labeled source domain to a different but related target domain, from which unlabeled data and a small set of labeled data are provided. Current methods that treat source and target supervision without distinction overlook their inherent discrepancy, resulting in a source-dominated model that has not effectively used the target supervision. In this paper, we argue that the labeled target data needs to be distinguished for effective SSDA, and propose to explicitly decompose the SSDA task into two sub-tasks: a semi-supervised learning (SSL) task in the target domain and an unsupervised domain adaptation (UDA) task across domains. By doing so, the two sub-tasks can better leverage the corresponding supervision and thus yield very different classifiers. To integrate the strengths of the two classifiers, we apply the well-established co-training framework, in which the two classifiers exchange their high confident predictions to iteratively "teach each other" so that both classifiers can excel in the target domain. We call our approach Deep Co-training with Task decomposition (DeCoTa). DeCoTa requires no adversarial training and is easy to implement. Moreover, DeCoTa is well-founded on the theoretical condition of when co-training would succeed. As a result, DeCoTa achieves state-of-the-art results on several SSDA datasets, outperforming the prior art by a notable 4% margin on DomainNet. Code is available at https://github.com/LoyoYang/DeCoTa
    End-to-End AI-based MRI Reconstruction and Lesion Detection Pipeline for Evaluation of Deep Learning Image Reconstruction. (arXiv:2109.11524v1 [cs.CV])
    (2 min) Deep learning techniques have emerged as a promising approach to highly accelerated MRI. However, recent reconstruction challenges have shown several drawbacks in current deep learning approaches, including the loss of fine image details even with models that perform well in terms of global quality metrics. In this study, we propose an end-to-end deep learning framework for image reconstruction and pathology detection, which enables a clinically aware evaluation of deep learning reconstruction quality. The solution is demonstrated for a use case in detecting meniscal tears on knee MRI studies, ultimately finding a loss of fine image details with common reconstruction methods, expressed as a reduced ability to detect important pathology like meniscal tears. Despite the common practice of evaluating reconstruction methods quantitatively with metrics such as SSIM, the impaired pathology detection revealed by this automated, pathology-based evaluation suggests that existing quantitative metrics do not capture clinically important reconstruction outcomes.
    Graph Pruning for Model Compression. (arXiv:1911.09817v2 [cs.CV] UPDATED)
    (2 min) Previous AutoML pruning works utilized individual layer features to automatically prune filters. We analyze the correlation between two layers from different blocks which have a short-cut structure. It shows that, within one block, the deeper layer has many redundant filters which can be represented by filters in the earlier layer, so it is necessary to take information from other layers into consideration when pruning. In this paper, a novel pruning method, named GraphPruning, is proposed. Any series of the network is viewed as a graph. To automatically aggregate neighboring features for each node, a graph aggregator based on graph convolutional networks (GCN) is designed. In the training stage, a PruningNet that is given aggregated node features generates reasonable weights for any size of sub-network. Subsequently, the best configuration of the pruned network is searched for via reinforcement learning. Different from previous work, we take the node features from a well-trained graph aggregator, instead of hand-crafted features, as the states in reinforcement learning. Compared with other AutoML pruning works, our method achieves the state of the art under the same conditions on ImageNet-2012.
    DeepRare: Generic Unsupervised Visual Attention Models. (arXiv:2109.11439v1 [cs.CV])
    (2 min) The human visual system has been modeled in the engineering field, providing feature-engineered methods that detect contrasted/surprising/unusual data in images. This data is "interesting" for humans and leads to numerous applications. Deep neural networks (DNNs) have drastically improved algorithm efficiency on the main benchmark datasets. However, DNN-based models are counter-intuitive: surprising or unusual data is by definition difficult to learn because of its low occurrence probability. In reality, DNN-based models mainly learn top-down features such as faces, text, people, or animals which usually attract human attention, but they have low efficiency in extracting surprising or unusual data from images. In this paper, we propose a new visual attention model called DeepRare2021 (DR21) which combines the power of DNN feature extraction with the genericity of feature-engineered algorithms. This algorithm is an evolution of a previous version called DeepRare2019 (DR19) based on a common framework. DR21 1) does not need any training and uses the default ImageNet training, 2) is fast even on CPU, 3) is tested on four very different eye-tracking datasets, showing that DR21 is generic and is always within the top models on all datasets and metrics while no other model exhibits such regularity and genericity. Finally, DR21 4) is tested with several network architectures such as VGG16 (V16), VGG19 (V19) and MobileNetV2 (MN2), and 5) it provides explanation and transparency on which parts of the image are the most surprising at different levels despite the use of a DNN-based feature extractor. DeepRare2021 code can be found at https://github.com/numediart/VisualAttention-RareFamil.
    Self-supervised Learning for Semi-supervised Temporal Language Grounding. (arXiv:2109.11475v1 [cs.CV])
    (2 min) Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video. TLG is inherently a challenging task, as it requires comprehensive understanding of both video contents and text sentences. Previous works either tackle this task in a fully-supervised setting that requires a large amount of manual annotations or in a weakly supervised setting that cannot achieve satisfactory performance. To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework. STLG consists of two parts: (1) a pseudo label generation module that produces adaptive instant pseudo labels for unlabeled data based on predictions from a teacher model; (2) a self-supervised feature learning module with two sequential perturbations, i.e., time lagging and time scaling, for improving the video representation by inter-modal and intra-modal contrastive learning. We conduct experiments on the ActivityNet-CD-OOD and Charades-CD-OOD datasets and the results demonstrate that our proposed STLG framework achieves competitive performance compared to fully-supervised state-of-the-art methods with only a small portion of temporal annotations.
    Ultrafast Focus Detection for Automated Microscopy. (arXiv:2108.12050v2 [eess.IV] UPDATED)
    (2 min) Recent advances in scientific instruments have resulted in a dramatic increase in the volumes and velocities of data being generated in everyday laboratories. Scanning electron microscopy is one such example, where technological advancements are now overwhelming scientists with critical data for montaging, alignment, and image segmentation -- key practices for many scientific domains, including, for example, neuroscience, where they are used to derive the anatomical relationships of the brain. These instruments now necessitate equally advanced computing resources and techniques to realize their full potential. Here we present a fast out-of-focus detection algorithm for electron microscopy images collected serially and demonstrate that it can be used to provide near-real-time quality control for neurology research. Our technique, Multi-scale Histologic Feature Detection, adapts classical computer vision techniques and is based on detecting various fine-grained histologic features. We further exploit the inherent parallelism in the technique by employing GPGPU primitives in order to accelerate characterization. Tests demonstrate near-real-time detection of out-of-focus conditions. We deploy these capabilities as a funcX function and show that they can be applied as data are collected using an automated pipeline. We discuss extensions that enable scaling out to support multi-beam microscopes and integration with existing focus systems for purposes of implementing auto-focus.
    Multimodal Representation Learning via Maximization of Local Mutual Information. (arXiv:2103.04537v3 [cs.CV] UPDATED)
    (2 min) We propose and demonstrate a representation learning approach by maximizing the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information. We make use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Our experimental results in the downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
    Leveraging distributed contact force measurements for slip detection: a physics-based approach enabled by a data-driven tactile sensor. (arXiv:2109.11504v1 [cs.RO])
    (2 min) Grasping objects whose physical properties are unknown is still a great challenge in robotics. Most solutions rely entirely on visual data to plan the best grasping strategy. However, to match human abilities and be able to reliably pick and hold unknown objects, the integration of an artificial sense of touch in robotic systems is pivotal. This paper describes a novel model-based slip detection pipeline that can predict possibly failing grasps in real-time and signal a necessary increase in grip force. As such, the slip detector does not rely on manually collected data, but exploits physics to generalize across different tasks. To evaluate the approach, a state-of-the-art vision-based tactile sensor that accurately estimates distributed forces was integrated into a grasping setup composed of a six degrees-of-freedom cobot and a two-finger gripper. Results show that the system can reliably predict slip while manipulating objects of different shapes, materials, and weights. The sensor can detect both translational and rotational slip in various scenarios, making it suitable to improve the stability of a grasp.
    Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning. (arXiv:2106.05956v3 [cs.LG] UPDATED)
    (2 min) Inspired by BatchNorm, there has been an explosion of normalization layers for deep neural networks (DNNs). However, these alternative normalization layers have seen minimal use, partially due to a lack of guiding principles that can help identify when these layers can serve as a replacement for BatchNorm. To address this problem, we take a theoretical approach, generalizing the known beneficial mechanisms of BatchNorm to several recently proposed normalization techniques. Our generalized theory leads to the following set of principles: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric layers require explicit remedies; (ii) use of GroupNorm can ensure informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
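    For reference, the group-size axis discussed in principles (ii) and (iii) can be explored directly in a framework such as PyTorch, where LayerNorm-like and InstanceNorm-like behavior are the two extremes of GroupNorm (an illustrative sketch, not code from the paper):
    ```python
    import torch
    import torch.nn as nn

    x = torch.randn(8, 64, 32, 32)  # (batch, channels, height, width)

    norms = {
        "BatchNorm": nn.BatchNorm2d(64),          # statistics shared across the batch
        "GroupNorm(8)": nn.GroupNorm(8, 64),      # batch-independent, groups of 8 channels
        "LayerNorm-like": nn.GroupNorm(1, 64),    # a single group over all channels
        "InstanceNorm-like": nn.GroupNorm(64, 64) # one channel per group
    }
    for name, norm in norms.items():
        y = norm(x)
        # Activations-based layers keep per-sample statistics bounded, the mechanism
        # generalized from BatchNorm in the abstract above.
        print(name, round(y.mean().item(), 4), round(y.std().item(), 4))
    ```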
    Improving Tuberculosis (TB) Prediction using Synthetically Generated Computed Tomography (CT) Images. (arXiv:2109.11480v1 [eess.IV])
    (2 min) The evaluation of infectious disease processes on radiologic images is an important and challenging task in medical image analysis. Pulmonary infections can often be best imaged and evaluated through computed tomography (CT) scans, which are often not available in low-resource environments and difficult to obtain for critically ill patients. X-ray, by contrast, is inexpensive, widely available, and often accessible at the bedside, but offers only a simpler, two-dimensional image. We show that by relying on a model that learns to generate CT images from X-rays synthetically, we can improve the automatic disease classification accuracy and provide clinicians with a different look at the pulmonary disease process. Specifically, we investigate Tuberculosis (TB), a deadly bacterial infectious disease that predominantly affects the lungs, but also other organ systems. We show that relying on synthetically generated CT improves TB identification by 7.50% and distinguishes TB properties up to 12.16% better than the X-ray baseline.
    A Skeleton-Driven Neural Occupancy Representation for Articulated Hands. (arXiv:2109.11399v1 [cs.CV])
    (2 min) We present Hand ArticuLated Occupancy (HALO), a novel representation of articulated hands that bridges the advantages of 3D keypoints and neural implicit surfaces and can be used in end-to-end trainable architectures. Unlike existing statistical parametric hand models (e.g., MANO), HALO directly leverages the 3D joint skeleton as input and produces a neural occupancy volume representing the posed hand surface. The key benefits of HALO are (1) it is driven by 3D keypoints, which have benefits in terms of accuracy and are easier for neural networks to learn than the latent hand-model parameters; (2) it provides a differentiable volumetric occupancy representation of the posed hand; (3) it can be trained end-to-end, allowing the formulation of losses on the hand surface that benefit the learning of 3D keypoints. We demonstrate the applicability of HALO to the task of conditional generation of hands that grasp 3D objects. The differentiable nature of HALO is shown to improve the quality of the synthesized hands both in terms of physical plausibility and user preference.
    Adaptive Logit Adjustment Loss for Long-Tailed Visual Recognition. (arXiv:2104.06094v2 [cs.CV] UPDATED)
    (2 min) Data in the real world tends to exhibit a long-tailed label distribution, which poses great challenges for the training of neural networks in visual recognition. Existing methods tackle this problem mainly from the perspective of data quantity, i.e., the number of samples in each class. To be specific, they pay more attention to tail classes, like applying larger adjustments to the logit. However, in the training process, the quantity and difficulty of data are two intertwined and equally crucial problems. For some tail classes, the features of their instances are distinct and discriminative, which can still yield satisfactory accuracy; for some head classes, although with sufficient samples, the high semantic similarity with other classes and the lack of discriminative features will lead to poor accuracy. Based on these observations, we propose the Adaptive Logit Adjustment Loss (ALA Loss) to apply an adaptive adjusting term to the logit. The adaptive adjusting term is composed of two complementary factors: 1) a quantity factor, which pays more attention to tail classes, and 2) a difficulty factor, which adaptively pays more attention to hard instances in the training process. The difficulty factor can alleviate the over-optimization of tail yet easy instances and the under-optimization of head yet hard instances. The synergy of the two factors can not only advance the performance on tail classes even further, but also promote the accuracy on head classes. Unlike previous logit adjusting methods that are concerned only with data quantity, ALA Loss tackles the long-tailed problem from a more comprehensive, fine-grained and adaptive perspective. Extensive experimental results show that our method achieves state-of-the-art performance on challenging recognition benchmarks, including ImageNet-LT, iNaturalist 2018, and Places-LT.
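    The abstract does not give the exact form of the two factors, so the sketch below is only one plausible, hypothetical instantiation (not the paper's ALA Loss): the quantity factor is the log class prior used in standard logit adjustment, and the difficulty factor scales it by the model's lack of confidence on the true class.
    ```python
    import torch
    import torch.nn.functional as F

    def adaptive_logit_adjustment_loss(logits, targets, class_counts, tau=1.0):
        """Illustrative long-tail loss: a quantity factor (the log class prior,
        as in standard logit adjustment) is scaled by a difficulty factor
        (1 - p_true), so easy instances receive a weaker adjustment.
        This is NOT the exact ALA Loss formulation from the paper."""
        prior = class_counts.float() / class_counts.sum()
        quantity = tau * torch.log(prior)                    # [num_classes], very negative for rare classes
        with torch.no_grad():
            p_true = F.softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
        difficulty = (1.0 - p_true)[:, None]                 # [batch, 1], higher means harder
        # Adding log priors enlarges the effective margin required for rare classes.
        adjusted = logits + difficulty * quantity[None, :]
        return F.cross_entropy(adjusted, targets)

    # Toy usage: 5 classes with a long-tailed count vector.
    logits = torch.randn(16, 5, requires_grad=True)
    targets = torch.randint(0, 5, (16,))
    counts = torch.tensor([1000, 300, 80, 20, 5])
    adaptive_logit_adjustment_loss(logits, targets, counts).backward()
    ```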
    MARMOT: A Deep Learning Framework for Constructing Multimodal Representations for Vision-and-Language Tasks. (arXiv:2109.11526v1 [cs.CV])
    (2 min) Political activity on social media presents a data-rich window into political behavior, but the vast amount of data means that almost all content analyses of social media require a data labeling step. However, most automated machine classification methods ignore the multimodality of posted content, focusing either on text or images. State-of-the-art vision-and-language models are unusable for most political science research: they require all observations to have both image and text and require computationally expensive pretraining. This paper proposes a novel vision-and-language framework called multimodal representations using modality translation (MARMOT). MARMOT presents two methodological contributions: it can construct representations for observations missing image or text, and it replaces the computationally expensive pretraining with modality translation. MARMOT outperforms an ensemble text-only classifier in 19 of 20 categories in multilabel classifications of tweets reporting election incidents during the 2016 U.S. general election. Moreover, MARMOT shows significant improvements over the results of benchmark multimodal models on the Hateful Memes dataset, improving the best result set by VisualBERT in terms of accuracy from 0.6473 to 0.6760 and area under the receiver operating characteristic curve (AUC) from 0.7141 to 0.7530.
    Multi-resolution deep learning pipeline for dense large scale point clouds. (arXiv:2109.11311v1 [cs.CV])
    (2 min) Recent development of 3D sensors allows the acquisition of extremely dense 3D point clouds of large-scale scenes. The main challenge of processing such large point clouds remains the size of the data, which induces expensive computational and memory costs. In this context, the full-resolution cloud is particularly hard to process, and the details it brings are rarely exploited. Although fine-grained details are important for the detection of small objects, they can alter the local geometry of large structural parts and mislead deep learning networks. In this paper, we introduce a new generic deep learning pipeline to exploit the full precision of large-scale point clouds, but only for objects that require details. The core idea of our approach is to split up the process into multiple sub-networks which operate at different resolutions, each with its own specific classes to retrieve. Thus, the pipeline allows each class to benefit either from the noise and memory-cost reduction of sub-sampling or from fine-grained details.
    Revisit Geophysical Imaging in A New View of Physics-informed Generative Adversarial Learning. (arXiv:2109.11452v1 [eess.IV])
    (2 min) Seismic full waveform inversion (FWI) is a powerful geophysical imaging technique that produces high-resolution subsurface models by iteratively minimizing the misfit between the simulated and observed seismograms. Unfortunately, conventional FWI with a least-squares objective suffers from many drawbacks, such as the local-minima problem and the computation of explicit gradients. It is particularly challenging with contaminated measurements or poor starting models. Recent works relying on partial differential equations and neural networks show promising performance for two-dimensional FWI. Inspired by the competitive learning of generative adversarial networks, we propose an unsupervised learning paradigm that integrates the wave equation with a discriminator network to accurately estimate physically consistent models in a distribution sense. Our framework needs neither labelled training data nor pretraining of the network, and is flexible enough to achieve multi-parameter inversion with minimal user interaction. The proposed method faithfully recovers the well-known synthetic models and outperforms the classical algorithms. Furthermore, our work paves the way to sidestepping the local-minima issue by reducing the sensitivity to initial models and noise.
    Layered Neural Atlases for Consistent Video Editing. (arXiv:2109.11418v1 [cs.CV])
    (0 min) We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases, each providing a unified representation of the appearance of an object (or background) over the video. For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases, giving us a consistent parameterization of the video, along with an associated alpha (opacity) value. Importantly, we design our atlases to be interpretable and semantic, which facilitates easy and intuitive editing in the atlas domain, with minimal manual work required. Edits applied to a single 2D atlas (or input video frame) are automatically and consistently mapped back to the original video frames, while preserving occlusions, deformation, and other complex scene effects such as shadows and reflections. Our method employs a coordinate-based Multilayer Perceptron (MLP) representation for mappings, atlases, and alphas, which are jointly optimized on a per-video basis, using a combination of video reconstruction and regularization losses. By operating purely in 2D, our method does not require any prior 3D knowledge about scene geometry or camera poses, and can handle complex dynamic real world videos. We demonstrate various video editing applications, including texture mapping, video style transfer, image-to-video texture transfer, and segmentation/labeling propagation, all automatically produced by editing a single 2D atlas image.
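    A minimal sketch of the coordinate-based mapping described above (network widths, activations, and the omitted atlas/reconstruction networks and losses are assumptions): each normalized pixel coordinate (x, y, t) is mapped to a 2D atlas coordinate and an alpha value.
    ```python
    import torch
    import torch.nn as nn

    class AtlasMapping(nn.Module):
        """Maps a normalized video coordinate (x, y, t) to a 2D atlas UV
        coordinate and an alpha (opacity) value, as in coordinate-MLP
        video decompositions. Widths and depth here are illustrative."""
        def __init__(self, hidden=256, depth=4):
            super().__init__()
            layers, in_dim = [], 3
            for _ in range(depth):
                layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
                in_dim = hidden
            self.body = nn.Sequential(*layers)
            self.to_uv = nn.Linear(hidden, 2)      # 2D atlas coordinate
            self.to_alpha = nn.Linear(hidden, 1)   # opacity of this layer

        def forward(self, xyt):
            h = self.body(xyt)
            uv = torch.tanh(self.to_uv(h))          # keep UV in [-1, 1]
            alpha = torch.sigmoid(self.to_alpha(h)) # opacity in [0, 1]
            return uv, alpha

    # Query the mapping for a batch of pixels; an atlas MLP (not shown) would
    # then be sampled at `uv` to reconstruct the pixel color.
    uv, alpha = AtlasMapping()(torch.rand(1024, 3))
    ```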
    Adversarial Transfer Attacks With Unknown Data and Class Overlap. (arXiv:2109.11125v1 [cs.LG])
    (2 min) The ability to transfer adversarial attacks from one model (the surrogate) to another model (the victim) has been an issue of concern within the machine learning (ML) community. The ability to successfully evade unseen models represents an uncomfortable level of ease toward implementing attacks. In this work we note that as studied, current transfer attack research has an unrealistic advantage for the attacker: the attacker has the exact same training data as the victim. We present the first study of transferring adversarial attacks focusing on the data available to attacker and victim under imperfect settings without querying the victim, where there is some variable level of overlap in the exact data used or in the classes learned by each model. This threat model is relevant to applications in medicine, malware, and others. Under this new threat model attack success rate is not correlated with data or class overlap in the way one would expect, and varies with dataset. This makes it difficult for attacker and defender to reason about each other and contributes to the broader study of model robustness and security. We remedy this by developing a masked version of Projected Gradient Descent that simulates class disparity, which enables the attacker to reliably estimate a lower-bound on their attack's success.
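    For context, the sketch below is a standard (unmasked) L-infinity Projected Gradient Descent attack on a surrogate model; the paper's masked variant, which simulates class disparity between surrogate and victim, is not reproduced here.
    ```python
    import torch
    import torch.nn.functional as F

    def pgd_attack(model, x, y, eps=8 / 255, step=2 / 255, iters=10):
        """Standard L-infinity PGD on a surrogate model. The masked variant in
        the paper additionally restricts which output classes contribute to the
        loss; that masking is omitted in this sketch."""
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
        for _ in range(iters):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + step * grad.sign()                     # ascend the loss
                x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project to the eps-ball
                x_adv = x_adv.clamp(0, 1)                              # valid pixel range
        return x_adv.detach()

    # Toy usage with a linear surrogate standing in for the attacker's model.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
    x_adv = pgd_attack(model, x, y)
    ```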
    Towards Fine-grained 3D Face Dense Registration: An Optimal Dividing and Diffusing Method. (arXiv:2109.11204v1 [cs.CV])
    (2 min) Dense vertex-to-vertex correspondence between 3D faces is a fundamental and challenging issue for 3D&2D face analysis. While sparse landmarks have anatomically defined ground-truth correspondence, the dense vertex correspondences on most facial regions are unknown. In this view, the current literature commonly results in reasonable but diverse solutions, which deviate from the optimum of the 3D face dense registration problem. In this paper, we revisit dense registration through a dimension-degraded problem, i.e., proportional segmentation of a line, and employ an iterative dividing and diffusing method to reach the final solution uniquely. This method is then extended to 3D surfaces by formulating a local registration problem for dividing and a linear least-squares problem for diffusing, with constraints on fixed features. On this basis, we further propose a multi-resolution algorithm to accelerate the computational process. The proposed method is linked to a novel local scaling metric, whose physical meaning we illustrate as a smooth rearrangement of the local cells of 3D facial shapes. Extensive experiments on public datasets demonstrate the effectiveness of the proposed method in various aspects. Generally, the proposed method leads to coherent local registrations and elegant mesh grid routines for fine-grained 3D face dense registration, which benefits many downstream applications significantly. It can also be applied to dense correspondence for other formats of data, not limited to faces. The core code will be publicly available at https://github.com/NaughtyZZ/3D_face_dense_registration.
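    The dimension-degraded analogy can be illustrated numerically: with the endpoints held fixed, repeatedly moving every interior point to the midpoint of its neighbors (a diffusion step) converges to a proportional, evenly spaced segmentation of the line. This toy sketch shows only that 1D analogy, not the paper's 3D surface algorithm.
    ```python
    import numpy as np

    def diffuse_line(points, iters=500):
        """Jacobi-style diffusion on a 1D point chain: interior points move to
        the midpoint of their neighbors while the endpoints stay fixed. The
        fixed point of this iteration is an evenly spaced (proportional)
        segmentation of the interval between the endpoints."""
        p = np.asarray(points, dtype=float).copy()
        for _ in range(iters):
            new = p.copy()
            for i in range(1, len(p) - 1):
                new[i] = 0.5 * (p[i - 1] + p[i + 1])
            p = new
        return p

    # Unevenly placed interior points relax to a uniform segmentation of [0, 10].
    print(diffuse_line([0.0, 0.3, 0.5, 7.0, 9.0, 10.0]))
    # -> approximately [0, 2, 4, 6, 8, 10]
    ```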
    Towards practical object detection for weed spraying in precision agriculture. (arXiv:2109.11048v1 [cs.CV])
    (2 min) The evolution of smaller, faster processors and cheaper digital storage mechanisms across the last 4-5 decades has vastly increased the opportunity to integrate intelligent technologies in a wide range of practical environments to address a broad spectrum of tasks. One exciting application domain for such technologies is precision agriculture, where the ability to integrate on-board machine vision with data-driven actuation means that farmers can make decisions about crop care and harvesting at the level of the individual plant rather than the whole field. This makes sense both economically and environmentally. However, the key driver for this capability is fast and robust machine vision -- typically driven by machine learning (ML) solutions and dependent on accurate modelling. One critical challenge is that the bulk of ML-based vision research considers only metrics that evaluate the accuracy of object detection and do not assess practical factors. This paper introduces three metrics that highlight different aspects relevant for real-world deployment of precision weeding and demonstrates their utility through experimental results.
    OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification. (arXiv:2109.11159v1 [cs.CV])
    (2 min) Transformers have shown preferable performance on many vision tasks. However, for the task of person re-identification (ReID), vanilla transformers leave the rich contexts of high-order feature relations under-exploited and deteriorate local feature details, which are insufficient due to the dramatic variations among pedestrians. In this work, we propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for ReID. First, to strengthen the capacity of visual representation, instead of obtaining the attention matrix based on pairs of queries and isolated keys at each spatial location, we take a step further to model high-order statistics for the non-local mechanism. We share the attention weights in the corresponding layer of each order with a prior mixing mechanism to reduce the computation cost. Then, a convolution-based local relation perception module is proposed to extract local relations and 2D position information. Experimental results show that our model achieves promising, state-of-the-art performance on the Market-1501, DukeMTMC, MSMT17 and Occluded-Duke datasets.
    Rational Polynomial Camera Model Warping for Deep Learning Based Satellite Multi-View Stereo Matching. (arXiv:2109.11121v1 [eess.IV])
    (2 min) Satellite multi-view stereo (MVS) imagery is particularly suited for large-scale Earth surface reconstruction. Differing from the perspective camera model (pin-hole model) that is commonly used for close-range and aerial cameras, the cubic rational polynomial camera (RPC) model is the mainstream model for push-broom linear-array satellite cameras. However, the homography warping used in the prevailing learning-based MVS methods is only applicable to pin-hole cameras. In order to apply the SOTA learning-based MVS technology to the satellite MVS task for large-scale Earth surface reconstruction, RPC warping should be considered. In this work, we propose, for the first time, a rigorous RPC warping module. The rational polynomial coefficients are recorded as a tensor, and the RPC warping is formulated as a series of tensor transformations. Based on the RPC warping, we propose the deep learning based satellite MVS (SatMVS) framework for large-scale and wide-depth-range Earth surface reconstruction. We also introduce a large-scale satellite image dataset consisting of 519 images of 5120×5120 pixels, which we call the TLC SatMVS dataset. The satellite images were acquired from a three-line camera (TLC) that captures triple-view images simultaneously, forming a valuable supplement to the existing open-source WorldView-3 datasets with single-scanline images. Experiments show that the proposed RPC warping module and the SatMVS framework can achieve a superior reconstruction accuracy compared to the pin-hole fitting method and conventional MVS methods. Code and data are available at https://github.com/WHU-GPCV/SatMVS.
    End-to-End Dense Video Grounding via Parallel Regression. (arXiv:2109.11265v1 [cs.CV])
    (2 min) Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query. Existing methods often address this task in an indirect way, by casting it as a proposal-and-match or fusion-and-detection problem. Solving these surrogate problems often requires sophisticated label assignment during training and hand-crafted removal of near-duplicate results. Meanwhile, existing works typically focus on sparse video grounding with a single sentence as input, which could result in ambiguous localization due to its unclear description. In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input. From a perspective on video grounding as language conditioned regression, we present an end-to-end parallel decoding paradigm by re-purposing a Transformer-alike architecture (PRVG). The key design in our PRVG is to use languages as queries, and directly regress the moment boundaries based on language-modulated visual representations. Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes (sparse or dense grounding) and allows for efficient inference without any post-processing technique. In addition, we devise a robust proposal-level attention loss to guide the training of PRVG, which is invariant to moment duration and contributes to model convergence. We perform experiments on two video grounding benchmarks of ActivityNet Captions and TACoS, demonstrating that our PRVG can significantly outperform previous methods. We also perform in-depth studies to investigate the effectiveness of parallel regression paradigm on video grounding.
    A two-step machine learning approach for crop disease detection: an application of GAN and UAV technology. (arXiv:2109.11066v1 [cs.CV])
    (2 min) Automated plant diagnosis is a technology that promises large increases in cost-efficiency for agriculture. However, multiple problems reduce the effectiveness of drones, including the inverse relationship between resolution and speed and the lack of adequate labeled training data. This paper presents a two-step machine learning approach that analyzes low-fidelity and high-fidelity images in sequence, preserving efficiency as well as accuracy. Two data-generators are also used to minimize class imbalance in the high-fidelity dataset and to produce low-fidelity data that is representative of UAV images. The analysis of applications and methods is conducted on a database of high-fidelity apple tree images which are corrupted with class imbalance. The application begins by generating high-fidelity data using generative networks and then uses this novel data alongside the original high-fidelity data to produce low-fidelity images. A machine-learning identifier identifies plants and labels them as potentially diseased or not. A machine learning classifier is then given the potentially diseased plant images and returns actual diagnoses for these plants. The results show an accuracy of 96.3% for the high-fidelity system and a 75.5% confidence level for our low-fidelity system. Our drone technology shows promising results in accuracy when compared to labor-based methods of diagnosis.
    Towards Generalized and Incremental Few-Shot Object Detection. (arXiv:2109.11336v1 [cs.CV])
    (2 min) Real-world object detectors are highly desired to be equipped with learning expandability that can enlarge their detection classes incrementally. Moreover, learning from only a few annotated training samples further adds flexibility for the object detector, which is highly expected in many applications such as autonomous driving, robotics, etc. However, such a sequential learning scenario with few-shot training samples generally causes catastrophic forgetting and dramatic overfitting. In this paper, to address the above incremental few-shot learning issues, a novel Incremental Few-Shot Object Detection (iFSOD) method is proposed to enable effective continual learning from few-shot samples. Specifically, a Double-Branch Framework (DBF) is proposed to decouple the feature representations of base and novel (few-shot) classes, which facilitates both old-knowledge retention and new-class adaptation simultaneously. Furthermore, a progressive model updating rule is carried out to effectively preserve the long-term memory of old classes when adapting to sequential new classes. Moreover, an inter-task class separation loss is proposed to extend the decision region of newly arriving classes for better feature discrimination. We conduct experiments on both Pascal VOC and MS-COCO, which demonstrate that our method can effectively solve the problem of incremental few-shot detection and significantly improve the detection accuracy on both base and novel classes.
    PRANet: Point Cloud Registration with an Artificial Agent. (arXiv:2109.11349v1 [cs.CV])
    (2 min) Point cloud registration plays a critical role in a multitude of computer vision tasks, such as pose estimation and 3D localization. Recently, a plethora of deep learning methods were formulated that aim to tackle this problem. Most of these approaches find point or feature correspondences, from which the transformations are computed. We give a different perspective and frame the registration problem as a Markov Decision Process. Instead of directly searching for the transformation, the problem becomes one of finding a sequence of translation and rotation actions that is equivalent to this transformation. To this end, we propose an artificial agent trained end-to-end using deep supervised learning. In contrast to conventional reinforcement learning techniques, the observations are sampled i.i.d. and thus no experience replay buffer is required, resulting in a more streamlined training process. Experiments on ModelNet40 show results comparable or superior to the state of the art in the case of clean, noisy and partially visible datasets.
    Scene Graph Generation for Better Image Captioning?. (arXiv:2109.11398v1 [cs.CV])
    (2 min) We investigate the incorporation of visual relationships into the task of supervised image caption generation by proposing a model that leverages detected objects and auto-generated visual relationships to describe images in natural language. To do so, we first generate a scene graph from raw image pixels by identifying individual objects and visual relationships between them. This scene graph then serves as input to our graph-to-text model, which generates the final caption. In contrast to previous approaches, our model thus explicitly models the detection of objects and visual relationships in the image. For our experiments we construct a new dataset from the intersection of Visual Genome and MS COCO, consisting of images with both a corresponding gold scene graph and human-authored caption. Our results show that our methods outperform existing state-of-the-art end-to-end models that generate image descriptions directly from raw input pixels when compared in terms of the BLEU and METEOR evaluation metrics.
    Predicting the Timing of Camera Movements From the Kinematics of Instruments in Robotic-Assisted Surgery Using Artificial Neural Networks. (arXiv:2109.11192v1 [cs.LG])
    (2 min) Robotic-assisted surgeries benefit both surgeons and patients, however, surgeons frequently need to adjust the endoscopic camera to achieve good viewpoints. Simultaneously controlling the camera and the surgical instruments is impossible, and consequentially, these camera adjustments repeatedly interrupt the surgery. Autonomous camera control could help overcome this challenge, but most existing systems are reactive, e.g., by having the camera follow the surgical instruments. We propose a predictive approach for anticipating when camera movements will occur using artificial neural networks. We used the kinematic data of the surgical instruments, which were recorded during robotic-assisted surgical training on porcine models. We split the data into segments, and labeled each either as a segment that immediately precedes a camera movement, or one that does not. Due to the large class imbalance, we trained an ensemble of networks, each on a balanced sub-set of the training data. We found that the instruments' kinematic data can be used to predict when camera movements will occur, and evaluated the performance on different segment durations and ensemble sizes. We also studied how much in advance an upcoming camera movement can be predicted, and found that predicting a camera movement 0.25, 0.5, and 1 second before they occurred achieved 98%, 94%, and 84% accuracy relative to the prediction of an imminent camera movement. This indicates that camera movement events can be predicted early enough to leave time for computing and executing an autonomous camera movement and suggests that an autonomous camera controller for RAMIS may one day be feasible.
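    A sketch of the class-imbalance strategy described above (feature extraction from the kinematic segments is assumed to have already produced X and y; the network architecture, segment durations, and ensemble size in the paper are not reproduced): each member trains on all minority-class segments plus an equally sized random subsample of the majority class, and the members' predicted probabilities are averaged.
    ```python
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_balanced_ensemble(X, y, n_members=5, seed=0):
        """y == 1 marks segments that immediately precede a camera movement
        (the minority class). Each member sees a balanced subsample."""
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        members = []
        for m in range(n_members):
            sub_neg = rng.choice(neg, size=len(pos), replace=False)
            idx = np.concatenate([pos, sub_neg])
            clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                                random_state=seed + m)
            members.append(clf.fit(X[idx], y[idx]))
        return members

    def predict_proba(members, X):
        # Average the members' probabilities for the "camera movement" class.
        return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)

    # Toy imbalanced data standing in for kinematic segment features.
    X = np.random.randn(2000, 20)
    y = (np.random.rand(2000) < 0.05).astype(int)
    scores = predict_proba(train_balanced_ensemble(X, y), X)
    ```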
    Hierarchical Memory Matching Network for Video Object Segmentation. (arXiv:2109.11404v1 [cs.CV])
    (2 min) We present Hierarchical Memory Matching Network (HMMN) for semi-supervised video object segmentation. Based on a recent memory-based method [33], we propose two advanced memory read modules that enable us to perform memory reading in multiple scales while exploiting temporal smoothness. We first propose a kernel guided memory matching module that replaces the non-local dense memory read, commonly adopted in previous memory-based methods. The module imposes the temporal smoothness constraint in the memory read, leading to accurate memory retrieval. More importantly, we introduce a hierarchical memory matching scheme and propose a top-k guided memory matching module in which memory read on a fine-scale is guided by that on a coarse-scale. With the module, we perform memory read in multiple scales efficiently and leverage both high-level semantic and low-level fine-grained memory features to predict detailed object masks. Our network achieves state-of-the-art performance on the validation sets of DAVIS 2016/2017 (90.8% and 84.7%) and YouTube-VOS 2018/2019 (82.6% and 82.5%), and test-dev set of DAVIS 2017 (78.6%). The source code and model are available online: https://github.com/Hongje/HMMN.
    Clustering performance analysis using new correlation based cluster validity indices. (arXiv:2109.11172v1 [stat.ML])
    (2 min) There are various cluster validity measures used for evaluating clustering results. One of the main objectives of using these measures is to seek the optimal unknown number of clusters. Some measures work well for clusters with different densities, sizes and shapes. Yet, one of the weaknesses that those validity measures share is that they sometimes provide only one clear optimal number of clusters. That number is actually unknown and there might be more than one potential sub-optimal option that a user may wish to choose based on different applications. We develop two new cluster validity indices based on the correlation between the actual distance between a pair of data points and the distance between the centroids of the clusters in which the two points are located. Our proposed indices consistently yield several peaks at different numbers of clusters, which overcomes the weakness stated previously. Furthermore, the introduced correlation can also be used for evaluating the quality of a selected clustering result. Several experiments in different scenarios, including the well-known iris data set and a real-world marketing application, have been conducted in order to compare the proposed validity indices with several well-known ones.
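    A rough sketch of the quantity the proposed indices build on (the exact index definitions and the peak-selection procedure are in the paper, not here): the Pearson correlation between pairwise point distances and the distances between the centroids of the clusters those points fall in, evaluated over a range of cluster counts.
    ```python
    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.stats import pearsonr
    from sklearn.cluster import KMeans

    def centroid_distance_correlation(X, labels):
        """Correlate actual pairwise point distances with the distance between
        the centroids of the clusters each pair of points falls in."""
        uniq = np.unique(labels)
        centroids = np.vstack([X[labels == k].mean(axis=0) for k in uniq])
        point_dist = cdist(X, X)                  # pairwise point distances
        cent_dist = cdist(centroids, centroids)   # pairwise centroid distances
        label_idx = np.searchsorted(uniq, labels)
        pair_cent_dist = cent_dist[label_idx[:, None], label_idx[None, :]]
        iu = np.triu_indices(len(X), k=1)         # each unordered pair once
        return pearsonr(point_dist[iu], pair_cent_dist[iu])[0]

    # Score a range of cluster counts; peaks suggest plausible numbers of clusters.
    X = np.random.randn(300, 2) + np.repeat(np.array([[0, 0], [6, 0], [0, 6]]), 100, axis=0)
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(centroid_distance_correlation(X, labels), 3))
    ```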
    Cross Attention-guided Dense Network for Images Fusion. (arXiv:2109.11393v1 [cs.CV])
    (2 min) In recent years, various applications in computer vision have achieved substantial progress based on deep learning, which has been widely used for image fusion and shown to achieve adequate performance. However, owing to their limited ability to model the spatial correspondence of different source images, it remains a great challenge for existing unsupervised image fusion models to extract appropriate features and achieve adaptive and balanced fusion. In this paper, we propose a novel cross attention-guided image fusion network, which is a unified and unsupervised framework for multi-modal image fusion, multi-exposure image fusion, and multi-focus image fusion. Different from the existing self-attention module, our cross attention module focuses on modelling the cross-correlation between different source images. Using the proposed cross attention module as its core block, a densely connected cross attention-guided network is built to dynamically learn the spatial correspondence to derive better alignment of important details from different input images. Meanwhile, an auxiliary branch is also designed to model long-range information, and a merging network is attached to finally reconstruct the fused image. Extensive experiments have been carried out on publicly available datasets, and the results demonstrate that the proposed model outperforms the state-of-the-art quantitatively and qualitatively.
    Deep Learning Strategies for Industrial Surface Defect Detection Systems. (arXiv:2109.11304v1 [cs.CV])
    (2 min) Deep learning methods have proven to outperform traditional computer vision methods in various areas of image processing. However, the application of deep learning in industrial surface defect detection systems is challenging due to the insufficient amount of training data, the expensive data generation process, the small size, and the rare occurrence of surface defects. From literature and a polymer products manufacturing use case, we identify design requirements which reflect the aforementioned challenges. Addressing these, we conceptualize design principles and features informed by deep learning research. Finally, we instantiate and evaluate the gained design knowledge in the form of actionable guidelines and strategies based on an industrial surface defect detection use case. This article, therefore, contributes to academia as well as practice by (1) systematically identifying challenges for the industrial application of deep learning-based surface defect detection, (2) strategies to overcome these, and (3) an experimental case study assessing the strategies' applicability and usefulness.
    A Benchmark Comparison of Visual Place Recognition Techniques for Resource-Constrained Embedded Platforms. (arXiv:2109.11002v1 [cs.CV])
    (2 min) Visual Place Recognition (VPR) has been a subject of significant research over the last 15 to 20 years. VPR is a fundamental task for autonomous navigation as it enables self-localization within an environment. Although robots are often equipped with resource-constrained hardware, the computational requirements of and effects on VPR techniques have received little attention. In this work, we present a hardware-focused benchmark evaluation of a number of state-of-the-art VPR techniques on public datasets. We consider popular single board computers, including ODroid, UP and Raspberry Pi 3, in addition to a commodity desktop and laptop for reference. We present our analysis based on several key metrics, including place-matching accuracy, image encoding time, descriptor matching time and memory needs. Key questions addressed include: (1) How does the performance accuracy of a VPR technique change with processor architecture? (2) How does power consumption vary for different VPR techniques and embedded platforms? (3) How much does descriptor size matter in comparison to today's embedded platforms' storage? (4) How does the performance of a high-end platform relate to an on-board low-end embedded platform for VPR? The extensive analysis and results in this work serve not only as a benchmark for the VPR community, but also provide useful insights for real-world adoption of VPR applications.
    An Efficient and Scalable Collection of Fly-inspired Voting Units for Visual Place Recognition in Changing Environments. (arXiv:2109.10986v1 [cs.CV])
    (2 min) State-of-the-art visual place recognition performance is currently being achieved utilizing deep learning based approaches. Despite the recent efforts in designing lightweight convolutional neural network based models, these can still be too expensive for the most hardware restricted robot applications. Low-overhead VPR techniques would not only enable platforms equipped with low-end, cheap hardware but also reduce computation on more powerful systems, allowing these resources to be allocated for other navigation tasks. In this work, our goal is to provide an algorithm of extreme compactness and efficiency while achieving state-of-the-art robustness to appearance changes and small point-of-view variations. Our first contribution is DrosoNet, an exceptionally compact model inspired by the odor processing abilities of the fruit fly, Drosophila melanogaster. Our second and main contribution is a voting mechanism that leverages multiple small and efficient classifiers to achieve more robust and consistent VPR compared to a single one. We use DrosoNet as the baseline classifier for the voting mechanism and evaluate our models on five benchmark datasets, assessing moderate to extreme appearance changes and small to moderate viewpoint variations. We then compare the proposed algorithms to state-of-the-art methods, both in terms of precision-recall AUC results and computational efficiency.
    A Novel Factor Graph-Based Optimization Technique for Stereo Correspondence Estimation. (arXiv:2109.11077v1 [cs.CV])
    (2 min) Dense disparities among multiple views are essential for estimating the 3D architecture of a scene based on the geometrical relationship among the scene and the views or cameras. Scenes with large extents of heterogeneous textures, differing illumination among the multiple views, and occluding objects affect the accuracy of the estimated disparities. Markov random field (MRF) based methods for disparity estimation address these limitations using spatial dependencies among the observations and among the disparity estimates. These methods, however, are limited by spatially fixed and smaller neighborhood systems or cliques. In this work, we present a new factor graph-based probabilistic graphical model for disparity estimation that allows a larger and spatially variable neighborhood structure determined based on the local scene characteristics. We evaluated our method using the Middlebury benchmark stereo datasets and the Middlebury evaluation dataset version 3.0 and compared its performance with recent state-of-the-art disparity estimation algorithms. The new factor graph-based method provided disparity estimates with higher accuracy when compared to the recent non-learning- and learning-based disparity estimation algorithms. In addition to disparity estimation, our factor graph formulation can be useful for obtaining maximum a posteriori solutions to optimization problems with complex and variable dependency structures, as well as for other dense estimation problems such as optical flow estimation.
    Cross-Modal Coherence for Text-to-Image Retrieval. (arXiv:2109.11047v1 [cs.CV])
    (2 min) Common image-text joint understanding techniques presume that images and the associated text can universally be characterized by a single implicit model. However, co-occurring images and text can be related in qualitatively different ways, and explicitly modeling this could improve the performance of current joint understanding models. In this paper, we train a Cross-Modal Coherence Model for the text-to-image retrieval task. Our analysis shows that models trained with image-text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models. We also show via human evaluation that images retrieved by the proposed coherence-aware model are preferred over a coherence-agnostic baseline by a large margin. Our findings provide insights into the ways that different modalities communicate and the role of coherence relations in capturing commonsense inferences in text and imagery.
    T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression. (arXiv:2109.10948v1 [cs.CV])
    (2 min) 6D pose estimation is the task of predicting the translation and orientation of objects in a given input image, which is a crucial prerequisite for many robotics and augmented reality applications. Lately, the Transformer Network architecture, equipped with a multi-head self-attention mechanism, is emerging to achieve state-of-the-art results in many computer vision tasks. DETR, a Transformer-based model, formulated object detection as a set prediction problem and achieved impressive results without standard components like region of interest pooling, non-maximal suppression, and bounding box proposals. In this work, we propose T6D-Direct, a real-time single-stage direct method with a transformer-based architecture built on DETR to perform 6D multi-object pose direct estimation. We evaluate the performance of our method on the YCB-Video dataset. Our method achieves the fastest inference time, and the pose estimation accuracy is comparable to state-of-the-art methods.
    The CAMELS Multifield Dataset: Learning the Universe's Fundamental Parameters with Artificial Intelligence. (arXiv:2109.10915v1 [cs.LG])
    (2 min) We present the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) Multifield Dataset, CMD, a collection of hundreds of thousands of 2D maps and 3D grids containing many different properties of cosmic gas, dark matter, and stars from 2,000 distinct simulated universes at several cosmic times. The 2D maps and 3D grids represent cosmic regions that span $\sim$100 million light years and have been generated from thousands of state-of-the-art hydrodynamic and gravity-only N-body simulations from the CAMELS project. Designed to train machine learning models, CMD is the largest dataset of its kind containing more than 70 Terabytes of data. In this paper we describe CMD in detail and outline a few of its applications. We focus our attention on one such task, parameter inference, formulating the problems we face as a challenge to the community. We release all data and provide further technical details at https://camels-multifield-dataset.readthedocs.io.
    Learning Contrastive Representation for Semantic Correspondence. (arXiv:2109.10967v1 [cs.CV])
    (2 min) Dense correspondence across semantically related images has been extensively studied, but still faces two challenges: 1) large variations in appearance, scale and pose exist even for objects from the same category, and 2) labeling pixel-level dense correspondences is labor intensive and infeasible to scale. Most existing approaches focus on designing various matching approaches with fully-supervised ImageNet pretrained networks. On the other hand, while a variety of self-supervised approaches have been proposed to explicitly measure image-level similarities, correspondence matching at the pixel level remains under-explored. In this work, we propose a multi-level contrastive learning approach for semantic matching, which does not rely on any ImageNet pretrained model. We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects, while the performance can be further enhanced by regularizing cross-instance cycle-consistency at intermediate feature levels. Experimental results on the PF-PASCAL, PF-WILLOW, and SPair-71k benchmark datasets demonstrate that our method performs favorably against the state-of-the-art approaches. The source code and trained models will be made available to the public.
  • cs.IR updates on arXiv.org

    Towards Explainable Scientific Venue Recommendations. (arXiv:2109.11343v1 [cs.IR])
    (2 min) Selecting the best scientific venue (i.e., conference/journal) for the submission of a research article constitutes a multifaceted challenge. Important aspects to consider are the suitability of research topics, a venue's prestige, and the probability of acceptance. The selection problem is exacerbated through the continuous emergence of additional venues. Previously proposed approaches for supporting authors in this process rely on complex recommender systems, e.g., based on Word2Vec or TextCNN. These, however, often fail to provide an explanation for their recommendations. In this work, we propose an unsophisticated method that advances the state-of-the-art in two aspects: First, we enhance the interpretability of recommendations through non-negative matrix factorization based topic models; Second, we surprisingly can obtain competitive recommendation performance while using simpler learning methods.
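    A hedged sketch of the interpretable pipeline described above (the corpus, topic count, and scoring rule are assumptions; the paper's exact model may differ): abstracts are factorized into non-negative topics, each venue is profiled by the mean topic weights of its papers, and a new manuscript is matched to venues by cosine similarity, with the non-negative topic loadings serving as the explanation.
    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus: (abstract text, venue) pairs standing in for a real bibliography.
    papers = [
        ("graph neural networks for recommendation", "RecSysConf"),
        ("collaborative filtering with matrix factorization", "RecSysConf"),
        ("convolutional networks for image segmentation", "VisionConf"),
        ("object detection with transformers", "VisionConf"),
    ]
    texts, venues = zip(*papers)

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(texts)
    nmf = NMF(n_components=2, init="nndsvda", random_state=0)
    W = nmf.fit_transform(X)            # paper-by-topic loadings (non-negative)

    # Venue profile = mean topic vector of its papers.
    venue_names = sorted(set(venues))
    profiles = np.vstack([W[[v == name for v in venues]].mean(axis=0) for name in venue_names])

    # Recommend venues for a new abstract; the topic loadings provide the explanation.
    new_topics = nmf.transform(tfidf.transform(["image classification with convolutional networks"]))
    scores = cosine_similarity(new_topics, profiles)[0]
    for name, score in sorted(zip(venue_names, scores), key=lambda t: -t[1]):
        print(name, round(float(score), 3))
    ```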
    Graph Neural Network with Interaction Pattern for Group Recommendation. (arXiv:2109.11345v1 [cs.IR])
    (2 min) With the development of social platforms, people are more and more inclined to form groups to participate in activities, so group recommendation has gradually become a problem worthy of research. For group recommendation, an important issue is how to obtain characteristic representations of the group and the item from personal interaction histories, and then obtain the group's preference for the item. For this problem, we propose the model GIP4GR (Graph Neural Network with Interaction Pattern For Group Recommendation). Specifically, our model uses a graph neural network framework with powerful representation capabilities to represent group-user-item interactions in the topological structure of the graph, while analyzing the interaction pattern of the graph to adjust the feature output of the graph neural network; the feature representations of groups and items are then obtained to compute the group's preference for items. We conduct extensive experiments on two real-world datasets to illustrate the superior performance of our model.
    Transformers: "The End of History" for NLP?. (arXiv:2105.00813v2 [cs.CL] UPDATED)
    (2 min) Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but fundamentally, they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain information sources, which pre-existing models handled easily. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice on two general types of tasks -- segmentation and segment labeling -- and on four datasets that these limitations are indeed harmful and that addressing them, even in some very simple and naive ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion on desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.
    Dynamic inference of user context through social tag embedding for music recommendation. (arXiv:2109.11231v1 [cs.IR])
    (2 min) Music listening preferences at a given time depend on a wide range of contextual factors, such as the user's emotional state, location and activity at listening time, the day of the week, the time of day, etc. It is therefore of great importance to take them into account when recommending music. However, it is very difficult to develop context-aware recommender systems that consider these factors, both because of the difficulty of detecting some of them, such as emotional state, and because of the drawbacks derived from the inclusion of many factors, such as sparsity problems in contextual pre-filtering. This work proposes a method for detecting the user's contextual state when listening to music, based on the social tags of music items. The intrinsic characteristics of social tagging that allow for the description of items in multiple dimensions can be exploited to capture many contextual dimensions in the user's listening sessions. The embeddings of the tags of the first items played in each session are used to represent the context of that session. Recommendations are then generated based on both user preferences and the similarity of the items computed from tag embeddings. Social tags have been used extensively in many recommender systems; however, to our knowledge, they have hardly been used to dynamically infer contextual states.
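    A minimal sketch of the session-context idea (training the tag embeddings themselves, e.g. with word2vec over tag co-occurrences, is assumed to have already produced tag_vectors; the final scoring rule combining long-term preference and context is simplified here to context similarity only):
    ```python
    import numpy as np

    def item_vector(item_tags, tag_vectors, dim=32):
        """Represent an item as the mean embedding of its social tags."""
        vecs = [tag_vectors[t] for t in item_tags if t in tag_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def recommend(session_items, candidates, item_tags, tag_vectors, top_k=3):
        """Context = mean tag embedding of the first items played in the session;
        candidates are ranked by cosine similarity to that context vector."""
        context = np.mean([item_vector(item_tags[i], tag_vectors) for i in session_items], axis=0)
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        scored = [(c, cos(item_vector(item_tags[c], tag_vectors), context)) for c in candidates]
        return sorted(scored, key=lambda t: -t[1])[:top_k]

    # Toy data: pretrained-looking tag embeddings and tagged tracks.
    rng = np.random.default_rng(0)
    tag_vectors = {t: rng.normal(size=32) for t in ["chill", "workout", "piano", "rock", "focus"]}
    item_tags = {"t1": ["chill", "piano"], "t2": ["rock", "workout"],
                 "t3": ["focus", "piano"], "t4": ["workout"]}
    print(recommend(["t1"], ["t2", "t3", "t4"], item_tags, tag_vectors))
    ```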
    Breaking BERT: Understanding its Vulnerabilities for Named Entity Recognition through Adversarial Attack. (arXiv:2109.11308v1 [cs.CL])
    (2 min) Both generic and domain-specific BERT models are widely used for natural language processing (NLP) tasks. In this paper we investigate the vulnerability of BERT models to variation in input data for Named Entity Recognition (NER) through adversarial attack. Experimental results show that the original as well as the domain-specific BERT models are highly vulnerable to entity replacement: They can be fooled in 89.2 to 99.4% of the cases to mislabel previously correct entities. BERT models are also vulnerable to variation in the entity context with 20.2 to 45.0% of entities predicted completely wrong and another 29.3 to 53.3% of entities predicted wrong partially. Often a single change is sufficient to fool the model. BERT models seem most vulnerable to changes in the local context of entities. Of the two domain-specific BERT models, the vulnerability of BioBERT is comparable to the original BERT model whereas SciBERT is even more vulnerable. Our results chart the vulnerabilities of BERT models for NER and emphasize the importance of further research into uncovering and reducing these weaknesses.
    Exploring Heterogeneous Metadata for Video Recommendation with Two-tower Model. (arXiv:2109.11059v1 [cs.IR])
    (2 min) Online video services acquire new content on a daily basis to increase engagement, and improve the user experience. Traditional recommender systems solely rely on watch history, delaying the recommendation of newly added titles to the right customer. However, one can use the metadata information of a cold-start title to bootstrap the personalization. In this work, we propose to adopt a two-tower model, in which one tower is to learn the user representation based on their watch history, and the other tower is to learn the effective representations for titles using metadata. The contribution of this work can be summarized as: (1) we show the feasibility of using two-tower model for recommendations and conduct a series of offline experiments to show its performance for cold-start titles; (2) we explore different types of metadata (categorical features, text description, cover-art image) and an attention layer to fuse them; (3) with our Amazon proprietary data, we show that the attention layer can assign weights adaptively to different metadata with improved recommendation for warm- and cold-start items.
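    A compact sketch of a two-tower recommender along these lines (embedding sizes, the attention fusion, and the feature dimensions are assumptions; the proprietary setup in the paper is not reproduced): the user tower averages embeddings of watched titles, the item tower fuses categorical, text, and image metadata embeddings with a learned attention weighting, and relevance is their dot product.
    ```python
    import torch
    import torch.nn as nn

    class TwoTower(nn.Module):
        def __init__(self, n_titles=10_000, n_genres=50, d=64):
            super().__init__()
            self.title_emb = nn.Embedding(n_titles, d)   # user tower: watch history
            self.genre_emb = nn.Embedding(n_genres, d)   # item tower: categorical metadata
            self.text_proj = nn.Linear(300, d)           # item tower: text description vector
            self.image_proj = nn.Linear(512, d)          # item tower: cover-art feature vector
            self.attn = nn.Linear(d, 1)                  # learns weights over metadata fields

        def user_tower(self, history_ids):               # [B, H] watched title ids
            return self.title_emb(history_ids).mean(dim=1)

        def item_tower(self, genre_id, text_vec, image_vec):
            fields = torch.stack([self.genre_emb(genre_id),
                                  self.text_proj(text_vec),
                                  self.image_proj(image_vec)], dim=1)  # [B, 3, d]
            weights = torch.softmax(self.attn(fields), dim=1)          # [B, 3, 1]
            return (weights * fields).sum(dim=1)                       # attention-fused item vector

        def forward(self, history_ids, genre_id, text_vec, image_vec):
            return (self.user_tower(history_ids) *
                    self.item_tower(genre_id, text_vec, image_vec)).sum(-1)

    # Cold-start scoring: the item tower needs only metadata, no watch history for the title.
    model = TwoTower()
    score = model(torch.randint(0, 10_000, (4, 20)), torch.randint(0, 50, (4,)),
                  torch.randn(4, 300), torch.randn(4, 512))
    ```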
    Towards Universal Dense Retrieval for Open-domain Question Answering. (arXiv:2109.11085v1 [cs.CL])
    (2 min) In open-domain question answering, a model receives a text question as input and searches for the correct answer using a large evidence corpus. The retrieval step is especially difficult as typical evidence corpora have millions of documents, each of which may or may not have the correct answer to the question. Very recently, dense models have replaced sparse methods as the de facto retrieval method. Rather than focusing on lexical overlap to determine similarity, dense methods build an encoding function that captures semantic similarity by learning from a small collection of question-answer or question-context pairs. In this paper, we investigate dense retrieval models in the context of open-domain question answering across different input distributions. To do this, first we introduce an entity-rich question answering dataset constructed from Wikidata facts and demonstrate dense models are unable to generalize to unseen input question distributions. Second, we perform analyses aimed at better understanding the source of the problem and propose new training techniques to improve out-of-domain performance on a wide variety of datasets. We encourage the field to further investigate the creation of a single, universal dense retrieval model that generalizes well across all input distributions.
  • cs.LG updates on arXiv.org

    A Unified Single-loop Alternating Gradient Projection Algorithm for Nonconvex-Concave and Convex-Nonconcave Minimax Problems. (arXiv:2006.02032v2 [math.OC] UPDATED)
    (2 min) Much recent research effort has been directed to the development of efficient algorithms for solving minimax problems with theoretical convergence guarantees due to the relevance of these problems to a few emergent applications. In this paper, we propose a unified single-loop alternating gradient projection (AGP) algorithm for solving smooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. AGP employs simple gradient projection steps for updating the primal and dual variables alternatively at each iteration. We show that it can find an $\varepsilon$-stationary point of the objective function in $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) iterations under nonconvex-strongly concave (resp. nonconvex-concave) setting. Moreover, its gradient complexity to obtain an $\varepsilon$-stationary point of the objective function is bounded by $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp., $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under the strongly convex-nonconcave (resp., convex-nonconcave) setting. To the best of our knowledge, this is the first time that a simple and unified single-loop algorithm is developed for solving both nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. Moreover, the complexity results for solving the latter (strongly) convex-nonconcave minimax problems have never been obtained before in the literature. Numerical results show the efficiency of the proposed AGP algorithm. Furthermore, we extend the AGP algorithm by presenting a block alternating proximal gradient (BAPG) algorithm for solving more general multi-block nonsmooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. We can similarly establish the gradient complexity of the proposed algorithm under these four different settings.
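    A toy numerical sketch of the alternating projected steps described above, on a simple strongly-convex-strongly-concave test problem f(x, y) = 0.5||x||^2 + x^T A y - 0.5||y||^2 with both variables constrained to the unit ball (the problem, step size, and iteration count are illustrative; the paper analyzes the harder nonconvex-concave and convex-nonconcave settings):
    ```python
    import numpy as np

    def project_ball(v, radius=1.0):
        norm = np.linalg.norm(v)
        return v if norm <= radius else v * (radius / norm)

    def agp_quadratic(A, steps=2000, lr=0.05, seed=0):
        """Alternating gradient projection on f(x, y) = 0.5||x||^2 + x^T A y - 0.5||y||^2:
        a projected descent step in x followed by a projected ascent step in y."""
        rng = np.random.default_rng(seed)
        x, y = rng.normal(size=A.shape[0]), rng.normal(size=A.shape[1])
        for _ in range(steps):
            x = project_ball(x - lr * (x + A @ y))      # grad_x f = x + A y
            y = project_ball(y + lr * (A.T @ x - y))    # grad_y f = A^T x - y
        return x, y

    A = np.array([[2.0, 0.0], [0.0, -1.0]])
    x, y = agp_quadratic(A)
    print(np.round(x, 4), np.round(y, 4))  # both approach the saddle point at the origin
    ```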
    Fast and Efficient MMD-based Fair PCA via Optimization over Stiefel Manifold. (arXiv:2109.11196v1 [stat.ML])
    (2 min) This paper defines fair principal component analysis (PCA) as minimizing the maximum mean discrepancy (MMD) between dimensionality-reduced conditional distributions of different protected classes. The incorporation of MMD naturally leads to an exact and tractable mathematical formulation of fairness with good statistical properties. We formulate the problem of fair PCA subject to MMD constraints as a non-convex optimization over the Stiefel manifold and solve it using the Riemannian Exact Penalty Method with Smoothing (REPMS; Liu and Boumal, 2019). Importantly, we provide local optimality guarantees and explicitly show the theoretical effect of each hyperparameter in practical settings, extending previous results. Experimental comparisons based on synthetic and UCI datasets show that our approach outperforms prior work in explained variance, fairness, and runtime.
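    For reference, the empirical MMD^2 between the projected distributions of two protected groups can be computed with a Gaussian kernel as below (a sketch of the fairness criterion only; the Stiefel-manifold optimization with REPMS is not shown):
    ```python
    import numpy as np

    def rbf_mmd2(X, Y, gamma=1.0):
        """Biased empirical estimate of MMD^2 between samples X and Y under the
        Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
        def k(A, B):
            sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
            return np.exp(-gamma * sq)
        return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

    # Fair-PCA flavored usage: compare group distributions after projecting with V.
    rng = np.random.default_rng(0)
    group_a = rng.normal(0.0, 1.0, size=(200, 5))
    group_b = rng.normal(0.5, 1.0, size=(200, 5))
    V, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # an orthonormal point on the Stiefel manifold
    print(rbf_mmd2(group_a @ V, group_b @ V))     # small values = similar projected groups
    ```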
    Training Deep Spiking Auto-encoders without Bursting or Dying Neurons through Regularization. (arXiv:2109.11045v1 [cs.NE])
    (2 min) Spiking neural networks are a promising approach towards next-generation models of the brain in computational neuroscience. Moreover, compared to classic artificial neural networks, they could serve as an energy-efficient deployment of AI by enabling fast computation in specialized neuromorphic hardware. However, training deep spiking neural networks, especially in an unsupervised manner, is challenging and the performance of a spiking model is significantly hindered by dead or bursting neurons. Here, we apply end-to-end learning with membrane potential-based backpropagation to a spiking convolutional auto-encoder with multiple trainable layers of leaky integrate-and-fire neurons. We propose bio-inspired regularization methods to control the spike density in latent representations. In the experiments, we show that applying regularization on membrane potential and spiking output successfully avoids both dead and bursting neurons and significantly decreases the reconstruction error of the spiking auto-encoder. Training regularized networks on the MNIST dataset yields image reconstruction quality comparable to non-spiking baseline models (deterministic and variational auto-encoder) and indicates improvement upon earlier approaches. Importantly, we show that, unlike the variational auto-encoder, the spiking latent representations display structure associated with the image class.
    Correlation between Air and Urban Travelling with New Confirmed Cases of COVID-19 A Case Study. (arXiv:2010.01413v2 [stat.AP] UPDATED)
    (2 min) COVID-19, which began spreading in Iran on February 19, 2020, had infected 202,584 people and killed 9,507 by June 20, 2020. The immediate suggested solution to prevent the spread of this virus was to avoid traveling. In this study, the correlation between inter-city travel and new confirmed cases of COVID-19 in Iran is demonstrated. The data used in the study consisted of the daily inter-state traffic, air traffic data, and daily new COVID-19 confirmed cases. The data was used to train a regression model, and voting was used to show the highest correlation between travel between cities and new cases of COVID-19. Although the available data was very coarse and there was no detail of inner-city commutes, an accuracy of 81% was achieved, showing a positive correlation between the number of inter-state travels and the new cases of COVID-19. Consequently, the result suggests that one of the best ways to avoid the spread of the virus is limiting or eliminating travel.
    Performance and accuracy assessments of an incompressible fluid solver coupled with a deep Convolutional Neural Network. (arXiv:2109.09363v2 [physics.flu-dyn] UPDATED)
    (2 min) The resolution of the Poisson equation is usually one of the most computationally intensive steps for incompressible fluid solvers. Lately, Deep Learning, and especially Convolutional Neural Networks (CNN), has been introduced to solve this equation, leading to significant inference time reduction at the cost of a lack of guarantee on the accuracy of the solution. This drawback might lead to inaccuracies and potentially unstable simulations. It also makes a fair assessment of the CNN speedup impossible, for instance when changing the network architecture, since the comparison is made at different error levels. To circumvent this issue, a hybrid strategy is developed, which couples a CNN with a traditional iterative solver to ensure a user-defined accuracy level. The CNN hybrid method is tested on two flow cases, consisting of a variable-density plume with and without obstacles, demonstrating remarkable generalization capabilities and ensuring both the accuracy and stability of the simulations. The error distribution of the predictions using several network architectures is further investigated. Results show that the threshold of the hybrid strategy, defined as the mean divergence of the velocity field, ensures a consistent physical behavior of the CNN-based hybrid computational strategy. This strategy allows a systematic evaluation of the CNN performance at the same accuracy level for various network architectures. In particular, the importance of incorporating multiple scales in the network architecture is demonstrated, since it improves both the accuracy and the inference performance compared with feedforward CNN architectures; these networks can provide solutions 10 to 25 times faster than traditional iterative solvers.
    Performance of a Geometric Deep Learning Pipeline for HL-LHC Particle Tracking. (arXiv:2103.06995v2 [physics.data-an] UPDATED)
    (2 min) The Exa.TrkX project has applied geometric learning concepts such as metric learning and graph neural networks to HEP particle tracking. Exa.TrkX's tracking pipeline groups detector measurements to form track candidates and filters them. The pipeline, originally developed using the TrackML dataset (a simulation of an LHC-inspired tracking detector), has been demonstrated on other detectors, including DUNE Liquid Argon TPC and CMS High-Granularity Calorimeter. This paper documents new developments needed to study the physics and computing performance of the Exa.TrkX pipeline on the full TrackML dataset, a first step towards validating the pipeline using ATLAS and CMS data. The pipeline achieves tracking efficiency and purity similar to production tracking algorithms. Crucially for future HEP applications, the pipeline benefits significantly from GPU acceleration, and its computational requirements scale close to linearly with the number of particles in the event.
    Social-Media Activity Forecasting with Exogenous Information Signals. (arXiv:2109.11024v1 [cs.SI])
    (2 min) Due to their widespread adoption, social media platforms present an ideal environment for studying and understanding social behavior, especially on information spread. Modeling social media activity has numerous practical implications such as supporting efforts to analyze strategic information operations, designing intervention techniques to mitigate disinformation, or delivering critical information during disaster relief operations. In this paper we propose a modeling technique that forecasts topic-specific daily volume of social media activities by using both exogenous signals, such as news or armed conflicts records, and endogenous data from the social media platform we model. Empirical evaluations with real datasets from two different platforms and two different contexts each composed of multiple interrelated topics demonstrate the effectiveness of our solution.
    Does SLOPE outperform bridge regression?. (arXiv:1909.09345v3 [stat.ML] UPDATED)
    (2 min) A recently proposed SLOPE estimator (arXiv:1407.3824) has been shown to adaptively achieve the minimax $\ell_2$ estimation rate under high-dimensional sparse linear regression models (arXiv:1503.08393). Such minimax optimality holds in the regime where the sparsity level $k$, sample size $n$, and dimension $p$ satisfy $k/p \rightarrow 0$, $k\log p/n \rightarrow 0$. In this paper, we characterize the estimation error of SLOPE under the complementary regime where both $k$ and $n$ scale linearly with $p$, and provide new insights into the performance of SLOPE estimators. We first derive a concentration inequality for the finite sample mean square error (MSE) of SLOPE. The quantity that MSE concentrates around takes a complicated and implicit form. With delicate analysis of the quantity, we prove that among all SLOPE estimators, LASSO is optimal for estimating $k$-sparse parameter vectors that do not have tied non-zero components in the low noise scenario. On the other hand, in the large noise scenario, the family of SLOPE estimators are sub-optimal compared with bridge regression such as the Ridge estimator.
    D-DARTS: Distributed Differentiable Architecture Search. (arXiv:2108.09306v2 [cs.LG] UPDATED)
    (2 min) Differentiable ARchiTecture Search (DARTS) is one of the most trending Neural Architecture Search (NAS) methods, drastically reducing search cost by resorting to Stochastic Gradient Descent (SGD) and weight-sharing. However, it also greatly reduces the search space, thus excluding potentially promising architectures from being discovered. In this paper, we propose D-DARTS, a novel solution that addresses this problem by nesting several neural networks at the cell level instead of using weight-sharing to produce more diversified and specialized architectures. Moreover, we introduce a novel algorithm which can derive deeper architectures from a few trained cells, increasing performance and saving computation time. Our solution is able to provide state-of-the-art results on CIFAR-10, CIFAR-100 and ImageNet while using significantly fewer parameters than previous baselines, resulting in more hardware-efficient neural networks.
    Multivariate Time Series Imputation by Graph Neural Networks. (arXiv:2108.00298v2 [cs.LG] UPDATED)
    (2 min) Dealing with missing values and incomplete time series is an inevitable, labor-intensive and time-consuming task when handling data coming from real-world applications. Effective spatio-temporal representations would allow imputation methods to reconstruct missing temporal data by exploiting information coming from sensors at different locations. However, standard methods fall short in capturing the nonlinear time and space dependencies existing within networks of interconnected sensors and do not take full advantage of the available - and often strong - relational information. Notably, most state-of-the-art imputation methods based on deep learning do not explicitly model relational aspects and, in any case, do not exploit processing frameworks able to adequately represent structured spatio-temporal data. Conversely, graph neural networks have recently surged in popularity as both expressive and scalable tools for processing sequential data with relational inductive biases. In this work, we present the first assessment of graph neural networks in the context of multivariate time series imputation. In particular, we introduce a novel graph neural network architecture, named GRIL, which aims at reconstructing missing data in the different channels of a multivariate time series by learning spatial-temporal representations through message passing. Preliminary empirical results show that our model outperforms state-of-the-art methods in the imputation task on relevant benchmarks with mean absolute error improvements often higher than 20%.
    Domain-guided Machine Learning for Remotely Sensed In-Season Crop Growth Estimation. (arXiv:2106.13323v2 [cs.LG] UPDATED)
    (2 min) Advanced machine learning techniques have been used in remote sensing (RS) applications such as crop mapping and yield prediction, but remain under-utilized for tracking crop progress. In this study, we demonstrate the use of agronomic knowledge of crop growth drivers in a Long Short-Term Memory-based, domain-guided neural network (DgNN) for in-season crop progress estimation. The DgNN uses a branched structure and attention to separate independent crop growth drivers and capture their varying importance throughout the growing season. The DgNN is implemented for corn, using RS data in Iowa for the period 2003-2019, with USDA crop progress reports used as ground truth. State-wide DgNN performance shows significant improvement over sequential and dense-only NN structures, and a widely-used Hidden Markov Model method. The DgNN had a 4.0% higher Nash-Sutcliffe efficiency over all growth stages and 39% more weeks with the highest cosine similarity than the next best NN during test years. The DgNN and Sequential NN were more robust during periods of abnormal crop progress, though estimating the Silking-Grainfill transition was difficult for all methods. Finally, Uniform Manifold Approximation and Projection visualizations of layer activations showed how LSTM-based NNs separate crop growth time-series differently from a dense-only structure. Results from this study exhibit both the viability of NNs in crop growth stage estimation (CGSE) and the benefits of using domain knowledge. The DgNN methodology presented here can be extended to provide near-real time CGSE of other crops.
    A Profile-Based Binary Feature Extraction Method Using Frequent Itemsets for Improving Coronary Artery Disease Diagnosis. (arXiv:2109.10966v1 [cs.LG])
    (0 min) Recent years have seen growing interest in the diagnosis of Coronary Artery Disease (CAD) with machine learning methods to reduce the cost and health implications of conventional diagnosis. This paper introduces a CAD diagnosis method with a novel feature extraction technique called the Profile-Based Binary Feature Extraction (PBBFE). In this method, after partitioning numerical features, frequent itemsets are extracted by the Apriori algorithm and then used as features to increase the CAD diagnosis accuracy. The proposed method consists of two main phases. In the first phase, each patient is assigned a profile based on age, gender, and medical condition, and then all numerical features are discretized based on assigned profiles. All features then undergo a binarization process to become ready for feature extraction by Apriori. In the last step of this phase, frequent itemsets are extracted from the dataset by Apriori and used to build a new dataset. In the second phase, the Genetic Algorithm and the Support Vector Machine are used to identify the best subset of extracted features for classification. The proposed method was tested on the Z-Alizadeh Sani dataset, which is one of the richest databases in the field of CAD. Performance comparisons conducted on this dataset showed that the proposed method outperforms all major alternative methods with 98.35% accuracy, 100% sensitivity, and 94.25% specificity. The proposed method also achieved the highest accuracy on several other datasets.
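    To illustrate the kind of feature extraction described above, the sketch below enumerates frequent itemsets over binarized patient features by brute force; it is a stand-in for the Apriori algorithm's level-wise candidate pruning, and the feature names and support threshold are assumptions:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support=0.5, max_len=2):
    # Count itemsets of binarized features and keep those above min_support
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(t), k):
                counts[combo] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Each "transaction" is the set of binary features active for one patient (hypothetical)
patients = [{"age>60", "male", "high_bp"}, {"age>60", "high_bp"}, {"male", "high_bp"}]
print(frequent_itemsets(patients))   # frequent itemsets become new binary features
```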
    Token-Level Supervised Contrastive Learning for Punctuation Restoration. (arXiv:2107.09099v2 [cs.CL] UPDATED)
    (0 min) Punctuation is critical in understanding natural language text. Currently, most automatic speech recognition (ASR) systems do not generate punctuation, which affects the performance of downstream tasks, such as intent detection and slot filling. This gives rise to the need for punctuation restoration. Recent work in punctuation restoration heavily utilizes pre-trained language models without considering data imbalance when predicting punctuation classes. In this work, we address this problem by proposing a token-level supervised contrastive learning method that aims at maximizing the distance of representation of different punctuation marks in the embedding space. The result shows that training with token-level supervised contrastive learning obtains up to 3.2% absolute F1 improvement on the test set.
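    A minimal sketch of a token-level supervised contrastive objective of the kind described above, pulling together tokens with the same punctuation label and pushing apart the rest (the temperature, the NumPy formulation, and the mean over positives are assumptions, not the authors' exact loss):

```python
import numpy as np

def token_supcon_loss(embeddings, labels, tau=0.1):
    # Supervised contrastive loss over token embeddings
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                      # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(1, keepdims=True))
    loss, n = 0.0, 0
    for i, y in enumerate(labels):
        pos = [j for j, yj in enumerate(labels) if yj == y and j != i]
        if pos:                                         # tokens sharing label y are positives
            loss -= log_prob[i, pos].mean()
            n += 1
    return loss / max(n, 1)

emb = np.random.randn(6, 8)
labels = [0, 1, 0, 2, 1, 0]   # hypothetical punctuation classes per token
print(token_supcon_loss(emb, labels))
```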
    Orthogonal Graph Neural Networks. (arXiv:2109.11338v1 [cs.LG])
    (0 min) Graph neural networks (GNNs) have received tremendous attention due to their superiority in learning node representations. These models rely on message passing and feature transformation functions to encode the structural and feature information from neighbors. However, stacking more convolutional layers significantly decreases the performance of GNNs. Most recent studies attribute this limitation to the over-smoothing issue, where node embeddings converge to indistinguishable vectors. Through a number of experimental observations, we argue that the main factor degrading the performance is the unstable forward normalization and backward gradient resulting from the improper design of the feature transformation, especially for shallow GNNs where over-smoothing has not happened. Therefore, we propose a novel orthogonal feature transformation, named Ortho-GConv, which could generally augment the existing GNN backbones to stabilize the model training and improve the model's generalization performance. Specifically, we maintain the orthogonality of the feature transformation comprehensively from three perspectives, namely hybrid weight initialization, orthogonal transformation, and orthogonal regularization. By equipping the existing GNNs (e.g. GCN, JKNet, GCNII) with Ortho-GConv, we demonstrate the generality of the orthogonal feature transformation to enable stable training, and show its effectiveness for node and graph classification tasks.
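    Of the three components listed above, the orthogonal regularization term is the simplest to illustrate; a minimal sketch of a soft orthogonality penalty on a feature transformation matrix (the Frobenius-norm form is an assumption, not necessarily the exact Ortho-GConv regularizer):

```python
import numpy as np

def orthogonal_penalty(W):
    # Soft orthogonality regularizer ||W^T W - I||_F^2, added to the training loss
    wtw = W.T @ W
    return np.linalg.norm(wtw - np.eye(wtw.shape[0]), "fro") ** 2

W = np.random.randn(64, 32)          # hypothetical feature transformation weights
print(orthogonal_penalty(W))         # 0 when the columns of W are orthonormal
```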
    Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem. (arXiv:2109.11067v1 [cs.DC])
    (0 min) Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs that partitions one physical GPU into multiple GPU instances. With MIG, A100 can be the most cost-efficient GPU ever for serving Deep Neural Networks (DNNs). However, discovering the most efficient GPU partitions is challenging. The underlying problem is NP-hard; moreover, it is a new abstract problem, which we define as the Reconfigurable Machine Scheduling Problem (RMS). This paper studies serving DNNs with MIG, a new case of RMS. We further propose a solution, MIG-serving. MIG-serving is an algorithm pipeline that blends a variety of newly designed algorithms and customized classic algorithms, including a heuristic greedy algorithm, Genetic Algorithm (GA), and Monte Carlo Tree Search algorithm (MCTS). We implement MIG-serving on Kubernetes. Our experiments show that compared to using A100 as-is, MIG-serving can save up to 40% of GPUs while providing the same throughput.
    Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming. (arXiv:2109.11410v1 [cs.LG])
    (2 min) A critical bottleneck in supervised machine learning is the need for large amounts of labeled data which is expensive and time consuming to obtain. However, it has been shown that a small amount of labeled data, while insufficient to re-train a model, can be effectively used to generate human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data, in a paradigm that is now commonly referred to as data programming. However, previous approaches to automatically generate LFs make no attempt to further use the given labeled data for model training, thus giving up opportunities for improved performance. Moreover, since the LFs are generated from a relatively small labeled dataset, they are prone to being noisy, and naively aggregating these LFs can lead to very poor performance in practice. In this work, we propose an LF based reweighting framework \ouralgo{} to solve these two critical limitations. Our algorithm learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that our algorithm significantly outperforms prior approaches on several text classification datasets.
    Expected Scalarised Returns Dominance: A New Solution Concept for Multi-Objective Decision Making. (arXiv:2106.01048v2 [cs.LG] UPDATED)
    (2 min) In many real-world scenarios, the utility of a user is derived from the single execution of a policy. In this case, to apply multi-objective reinforcement learning, the expected utility of the returns must be optimised. Various scenarios exist where a user's preferences over objectives (also known as the utility function) are unknown or difficult to specify. In such scenarios, a set of optimal policies must be learned. However, settings where the expected utility must be maximised have been largely overlooked by the multi-objective reinforcement learning community and, as a consequence, a set of optimal solutions has yet to be defined. In this paper we address this challenge by proposing first-order stochastic dominance as a criterion to build solution sets to maximise expected utility. We also propose a new dominance criterion, known as expected scalarised returns (ESR) dominance, that extends first-order stochastic dominance to allow a set of optimal policies to be learned in practice. We then define a new solution concept called the ESR set, which is a set of policies that are ESR dominant. Finally, we define a new multi-objective distributional tabular reinforcement learning (MOT-DRL) algorithm to learn the ESR set in a multi-objective multi-armed bandit setting.
    Hierarchies of Planning and Reinforcement Learning for Robot Navigation. (arXiv:2109.11178v1 [cs.RO])
    (0 min) Solving robotic navigation tasks via reinforcement learning (RL) is challenging due to their sparse reward and long decision horizon nature. However, in many navigation tasks, high-level (HL) task representations, like a rough floor plan, are available. Previous work has demonstrated efficient learning by hierarchical approaches consisting of path planning in the HL representation and using sub-goals derived from the plan to guide the RL policy in the source task. However, these approaches usually neglect the complex dynamics and sub-optimal sub-goal-reaching capabilities of the robot during planning. This work overcomes these limitations by proposing a novel hierarchical framework that utilizes a trainable planning policy for the HL representation. Thereby, robot capabilities and environment conditions can be learned utilizing collected rollout data. We specifically introduce a planning policy based on value iteration with a learned transition model (VI-RL). In simulated robotic navigation tasks, VI-RL results in consistent strong improvement over vanilla RL, is on par with vanilla hierarchical RL on single layouts but more broadly applicable to multiple layouts, and is on par with trainable HL path planning baselines except for a parking task with difficult non-holonomic dynamics where it shows marked improvements.
    Boosting the Generalization Capability in Cross-Domain Few-shot Learning via Noise-enhanced Supervised Autoencoder. (arXiv:2108.05028v2 [cs.CV] UPDATED)
    (2 min) State of the art (SOTA) few-shot learning (FSL) methods suffer significant performance drop in the presence of domain differences between source and target datasets. The strong discrimination ability on the source dataset does not necessarily translate to high classification accuracy on the target dataset. In this work, we address this cross-domain few-shot learning (CDFSL) problem by boosting the generalization capability of the model. Specifically, we teach the model to capture broader variations of the feature distributions with a novel noise-enhanced supervised autoencoder (NSAE). NSAE trains the model by jointly reconstructing inputs and predicting the labels of inputs as well as their reconstructed pairs. Theoretical analysis based on intra-class correlation (ICC) shows that the feature embeddings learned from NSAE have stronger discrimination and generalization abilities in the target domain. We also take advantage of NSAE structure and propose a two-step fine-tuning procedure that achieves better adaption and improves classification performance in the target domain. Extensive experiments and ablation studies are conducted to demonstrate the effectiveness of the proposed method. Experimental results show that our proposed method consistently outperforms SOTA methods under various conditions.
    Dynamic Knowledge Distillation for Pre-trained Language Models. (arXiv:2109.11295v1 [cs.CL])
    (2 min) Knowledge distillation (KD) has been proved effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected teacher model on the pre-defined training dataset. In this paper, we explore a dynamic knowledge distillation approach that empowers the student to adjust the learning procedure according to its competency, with regard to student performance and learning efficiency. We explore dynamic adjustments along three aspects: teacher model adoption, data selection, and KD objective adaptation. Experimental results show that (1) proper selection of the teacher model can boost the performance of the student model; (2) conducting KD with 10% informative instances achieves comparable performance while greatly accelerating training; (3) the student performance can be boosted by adjusting the supervision contribution of different alignment objectives. We find dynamic knowledge distillation is promising and provide discussions on potential future directions towards more efficient KD methods. Our code is available at https://github.com/lancopku/DynamicKD.
    Learning Predictive and Interpretable Timeseries Summaries from ICU Data. (arXiv:2109.11043v1 [cs.LG])
    (2 min) Machine learning models that utilize patient data across time (rather than just the most recent measurements) have increased performance for many risk stratification tasks in the intensive care unit. However, many of these models and their learned representations are complex and therefore difficult for clinicians to interpret, creating challenges for validation. Our work proposes a new procedure to learn summaries of clinical time-series that are both predictive and easily understood by humans. Specifically, our summaries consist of simple and intuitive functions of clinical data (e.g. falling mean arterial pressure). Our learned summaries outperform traditional interpretable model classes and achieve performance comparable to state-of-the-art deep learning models on an in-hospital mortality classification task.
    How do Quadratic Regularizers Prevent Catastrophic Forgetting: The Role of Interpolation. (arXiv:2102.02805v3 [cs.LG] UPDATED)
    (2 min) Catastrophic forgetting undermines the effectiveness of deep neural networks (DNNs) in scenarios such as continual learning and lifelong learning. While several methods have been proposed to tackle this problem, there is limited work explaining why these methods work well. This paper has the goal of better explaining a popularly used technique for avoiding catastrophic forgetting: quadratic regularization. We show that quadratic regularizers prevent forgetting of past tasks by interpolating current and previous values of model parameters at every training iteration. Over multiple training iterations, this interpolation operation reduces the learning rates of more important model parameters, thereby minimizing their movement. Our analysis also reveals two drawbacks of quadratic regularization: (a) dependence of parameter interpolation on training hyperparameters, which often leads to training instability and (b) assignment of lower importance to deeper layers, which are generally where forgetting occurs in DNNs. Via a simple modification to the order of operations, we show these drawbacks can be easily avoided, resulting in 6.2% higher average accuracy at 4.5% lower average forgetting. Code available at \url{https://github.com/EkdeepSLubana/QRforgetting}
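    The interpolation argument can be made concrete by rearranging one SGD step on the regularized loss; in the sketch below (learning rate, regularization strength, and importance weights are all hypothetical), the update is an explicit convex combination of the current parameters and the previous-task parameters, with the mixing weight growing with per-parameter importance:

```python
import numpy as np

def quad_reg_update(theta, grad, theta_prev, importance, lr=0.1, lam=1.0):
    # One SGD step on task_loss + (lam/2) * importance * (theta - theta_prev)^2.
    # Rearranged, the step interpolates theta toward theta_prev; more important
    # parameters get a larger mixing weight and hence a smaller effective step.
    mix = lr * lam * importance
    return (1.0 - mix) * theta + mix * theta_prev - lr * grad

theta = np.array([1.0, 1.0])
theta_prev = np.array([0.0, 0.0])              # parameters after the previous task
importance = np.array([5.0, 0.1])              # e.g. Fisher-style importance weights
print(quad_reg_update(theta, np.array([0.2, 0.2]), theta_prev, importance))
```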
    Revisit Geophysical Imaging in A New View of Physics-informed Generative Adversarial Learning. (arXiv:2109.11452v1 [eess.IV])
    (2 min) Seismic full waveform inversion (FWI) is a powerful geophysical imaging technique that produces high-resolution subsurface models by iteratively minimizing the misfit between the simulated and observed seismograms. Unfortunately, conventional FWI with a least-squares function suffers from many drawbacks such as the local-minima problem and computation of explicit gradients. It is particularly challenging with contaminated measurements or poor starting models. Recent works relying on partial differential equations and neural networks show promising performance for two-dimensional FWI. Inspired by the competitive learning of generative adversarial networks, we propose an unsupervised learning paradigm that integrates the wave equation with a discriminator network to accurately estimate physically consistent models in a distribution sense. Our framework needs neither labelled training data nor pretraining of the network, and is flexible enough to achieve multi-parameter inversion with minimal user interaction. The proposed method faithfully recovers the well-known synthetic models and outperforms the classical algorithms. Furthermore, our work paves the way to sidestep the local-minima issue via reducing the sensitivity to initial models and noise.
    Reinforcement Learning Under Algorithmic Triage. (arXiv:2109.11328v1 [cs.LG])
    (2 min) Methods to learn under algorithmic triage have predominantly focused on supervised learning settings where each decision, or prediction, is independent of each other. Under algorithmic triage, a supervised learning model predicts a fraction of the instances and humans predict the remaining ones. In this work, we take a first step towards developing reinforcement learning models that are optimized to operate under algorithmic triage. To this end, we look at the problem through the framework of options and develop a two-stage actor-critic method to learn reinforcement learning models under triage. The first stage performs offline, off-policy training using human data gathered in an environment where the human has operated on their own. The second stage performs on-policy training to account for the impact that switching may have on the human policy, which may be difficult to anticipate from the above human data. Extensive simulation experiments in a synthetic car driving task show that the machine models and the triage policies trained using our two-stage method effectively complement human policies and outperform those provided by several competitive baselines.
    Generalisations and improvements of New Q-Newton's method Backtracking. (arXiv:2109.11395v1 [math.OC])
    (2 min) In this paper, we propose a general framework for the algorithm New Q-Newton's method Backtracking, developed in the author's previous work. For a symmetric, square real matrix $A$, we define $minsp(A):=\min _{||e||=1} ||Ae||$. Given a $C^2$ cost function $f:\mathbb{R}^m\rightarrow \mathbb{R}$, a real number $\tau >0$ and $m+1$ fixed real numbers $\delta _0,\ldots ,\delta _m$, we define, for each $x$ with $\nabla f(x)\neq 0$, the quantities $\kappa :=\min _{i\neq j}|\delta _i-\delta _j|$; $A(x):=\nabla ^2f(x)+\delta ||\nabla f(x)||^{\tau}Id$, where $\delta$ is the first element of $\{\delta _0,\ldots ,\delta _m\}$ for which $minsp(A(x))\geq \kappa ||\nabla f(x)||^{\tau}$; $e_1(x),\ldots ,e_m(x)$ an appropriately chosen orthonormal basis of $\mathbb{R}^m$; $$w(x)=\sum _{i=1}^m\frac{\langle \nabla f(x),e_i(x)\rangle}{||A(x)e_i(x)||}e_i(x);$$ (we can also normalise by $w(x)/\max \{1,||w(x)||\}$ when needed) and $\gamma (x)>0$ a learning rate chosen by Backtracking line search so that Armijo's condition is satisfied: $$f(x-\gamma (x)w(x))-f(x)\leq -\frac{1}{3}\gamma (x)\langle w(x),\nabla f(x)\rangle .$$ The update rule for our algorithm is $x\mapsto H(x)=x-\gamma (x)w(x)$. In New Q-Newton's method Backtracking, the choices are $\tau =1+\alpha >1$ and $e_1(x),\ldots ,e_m(x)$'s are eigenvectors of $\nabla ^2f(x)$. In this paper, we allow more flexibility and generality; for example, $\tau$ can be chosen to be $<1$, or $e_1(x),\ldots ,e_m(x)$'s are not necessarily eigenvectors of $\nabla ^2f(x)$. New Q-Newton's method Backtracking (as well as Backtracking gradient descent) is a special case, and some versions have flavours of quasi-Newton methods. Several versions allow good theoretical guarantees. An application to solving systems of polynomial equations is given.
    A Survey on Cost Types, Interaction Schemes, and Annotator Performance Models in Selection Algorithms for Active Learning in Classification. (arXiv:2109.11301v1 [cs.LG])
    (2 min) Pool-based active learning (AL) aims to optimize the annotation process (i.e., labeling) as the acquisition of annotations is often time-consuming and therefore expensive. For this purpose, an AL strategy queries annotations intelligently from annotators to train a high-performance classification model at a low annotation cost. Traditional AL strategies operate in an idealized framework. They assume a single, omniscient annotator who never gets tired and charges uniformly regardless of query difficulty. However, in real-world applications, we often face human annotators, e.g., crowd or in-house workers, who make annotation mistakes and can be reluctant to respond if tired or faced with complex queries. Recently, a wide range of novel AL strategies has been proposed to address these issues. They differ in at least one of the following three central aspects from traditional AL: (1) They explicitly consider (multiple) human annotators whose performances can be affected by various factors, such as missing expertise. (2) They generalize the interaction with human annotators by considering different query and annotation types, such as asking an annotator for feedback on an inferred classification rule. (3) They take more complex cost schemes regarding annotations and misclassifications into account. This survey provides an overview of these AL strategies and refers to them as real-world AL. Therefore, we introduce a general real-world AL strategy as part of a learning cycle and use its elements, e.g., the query and annotator selection algorithm, to categorize about 60 real-world AL strategies. Finally, we outline possible directions for future research in the field of AL.
    Alzheimers Dementia Detection using Acoustic & Linguistic features and Pre-Trained BERT. (arXiv:2109.11010v1 [cs.CL])
    (0 min) Alzheimer's disease is a fatal progressive brain disorder that worsens with time. Inexpensive and quick clinical diagnostic techniques for early detection and care are urgently needed. In previous studies, various Machine Learning techniques and Pre-trained Deep Learning models have been used in conjunction with the extraction of various acoustic and linguistic features. Our study focuses on three models for the classification task in the ADReSS (The Alzheimer's Dementia Recognition through Spontaneous Speech) 2021 Challenge. We use the well-balanced dataset provided by the ADReSS Challenge for training and validating our models. Model 1 uses various acoustic features from the eGeMAPS feature set, Model 2 uses various linguistic features that we generated from auto-generated transcripts, and Model 3 uses the auto-generated transcripts directly to extract features using a Pre-trained BERT and TF-IDF. These models are described in detail in the models section.
    Memory-Efficient Convex Optimization for Self-Dictionary Separable Nonnegative Matrix Factorization: A Frank-Wolfe Approach. (arXiv:2109.11135v1 [eess.SP])
    (2 min) Nonnegative matrix factorization (NMF) often relies on the separability condition for tractable algorithm design. Separability-based NMF is mainly handled by two types of approaches, namely, greedy pursuit and convex programming. A notable convex NMF formulation is the so-called self-dictionary multiple measurement vectors (SD-MMV), which can work without knowing the matrix rank a priori, and is arguably more resilient to error propagation relative to greedy pursuit. However, convex SD-MMV incurs a large memory cost that scales quadratically with the problem size. This memory challenge has been around for a decade, and has been a major obstacle to applying convex SD-MMV to big data analytics. This work proposes a memory-efficient algorithm for convex SD-MMV. Our algorithm capitalizes on the special update rules of a classic algorithm from the 1950s, namely, the Frank-Wolfe (FW) algorithm. It is shown that, under reasonable conditions, the FW algorithm solves the noisy SD-MMV problem with a memory cost that grows linearly with the amount of data. To handle noisier scenarios, a smoothed group sparsity regularizer is proposed to improve robustness while maintaining the low memory footprint with guarantees. The proposed approach presents the first linear memory complexity algorithmic framework for convex SD-MMV based NMF. The method is tested over a couple of unsupervised learning tasks, i.e., text mining and community detection, to showcase its effectiveness and memory efficiency.
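    For reference, the classic Frank-Wolfe iteration that the method builds on only ever queries a linear minimization oracle, which returns a single vertex of the feasible set per step; this sparsity is what keeps the memory footprint low. A generic sketch over the probability simplex (the toy quadratic objective is an assumption, not the SD-MMV formulation itself):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=100):
    # Frank-Wolfe over the probability simplex: the linear oracle picks one vertex,
    # so each iterate is a sparse convex combination of previously chosen vertices.
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # vertex minimizing the linearized objective
        gamma = 2.0 / (t + 2.0)        # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
    return x

# Toy quadratic: minimize ||x - b||^2 over the simplex (b is a made-up target)
b = np.array([0.1, 0.7, 0.2])
grad = lambda x: 2.0 * (x - b)
print(frank_wolfe_simplex(grad, np.ones(3) / 3.0))
```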
    Graph Neural Network with Interaction Pattern for Group Recommendation. (arXiv:2109.11345v1 [cs.IR])
    (2 min) With the development of social platforms, people are more and more inclined to form groups to participate in activities, so group recommendation has gradually become a problem worthy of research. For group recommendation, an important issue is how to obtain the feature representations of the group and the item from personal interaction histories, and thereby obtain the group's preference for the item. For this problem, we propose the model GIP4GR (Graph Neural Network with Interaction Pattern For Group Recommendation). Specifically, our model uses a graph neural network framework with powerful representation capabilities to represent the group-user-item interactions in the topological structure of the graph and, at the same time, analyzes the interaction pattern of the graph to adjust the feature output of the graph neural network; the feature representations of groups and items are then obtained to calculate the group's preference for items. We conducted extensive experiments on two real-world datasets to illustrate the superior performance of our model.
    Offline reinforcement learning with uncertainty for treatment strategies in sepsis. (arXiv:2107.04491v2 [cs.LG] UPDATED)
    (2 min) Guideline-based treatment for sepsis and septic shock is difficult because sepsis is a disparate range of life-threatening organ dysfunctions whose pathophysiology is not fully understood. Early intervention in sepsis is crucial for patient outcome, yet those interventions have adverse effects and are frequently overadministered. Greater personalization is necessary, as no single action is suitable for all patients. We present a novel application of reinforcement learning in which we identify optimal recommendations for sepsis treatment from data, estimate their confidence level, and identify treatment options infrequently observed in training data. Rather than a single recommendation, our method can present several treatment options. We examine learned policies and discover that reinforcement learning is biased against aggressive intervention due to the confounding relationship between mortality and level of treatment received. We mitigate this bias using subspace learning, and develop methodology that can yield more accurate learning policies across healthcare applications.
    Weighted Low Rank Matrix Approximation and Acceleration. (arXiv:2109.11057v1 [stat.ML])
    (2 min) Low-rank matrix approximation (LRMA) is one of the central concepts in machine learning, with applications in dimension reduction, de-noising, multivariate statistical methodology, and many more. A recent extension of LRMA is called low-rank matrix completion (LRMC). It solves the LRMA problem when some observations are missing and is especially useful for recommender systems. In this paper, we consider an element-wise weighted generalization of LRMA. The resulting weighted low-rank matrix approximation (WLRMA) technique therefore covers LRMC as a special case with binary weights. WLRMA has many applications. For example, it is an essential component of GLM optimization algorithms, where an exponential family is used to model the entries of a matrix, and the matrix of natural parameters admits a low-rank structure. We propose an algorithm for solving the weighted problem, as well as two acceleration techniques. Further, we develop a non-SVD modification of the proposed algorithm that is able to handle extremely high-dimensional data. We compare the performance of all the methods on a small simulation example as well as a real-data application.
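    A minimal sketch of one common heuristic for element-wise weighted low-rank approximation (not necessarily the algorithm proposed in the paper): alternately blend the observations with the current fit according to the weights, then re-project onto rank-r matrices with a truncated SVD. With binary weights this reduces to a simple matrix-completion scheme.

```python
import numpy as np

def weighted_low_rank(M, W, rank=2, n_iters=50):
    # Iterative weighted low-rank fit: blend data and current fit, then truncate SVD
    X = np.zeros_like(M)
    for _ in range(n_iters):
        blended = W * M + (1.0 - W) * X
        U, s, Vt = np.linalg.svd(blended, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return X

rng = np.random.default_rng(1)
M = rng.normal(size=(20, 10))
W = (rng.random((20, 10)) > 0.3).astype(float)   # binary weights: 1 = observed, 0 = missing
print(weighted_low_rank(M, W).shape)
```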
    An Exploration of Learnt Representations of W Jets. (arXiv:2109.10919v1 [hep-ph])
    (2 min) I present a Variational Autoencoder (VAE) trained on collider physics data (specifically boosted $W$ jets), with reconstruction error given by an approximation to the Earth Mover's Distance (EMD) between input and output jets. This VAE learns a concrete representation of the data manifold, with semantically meaningful and interpretable latent space directions which are hierarchically organized in terms of their relation to physical EMD scales in the underlying physical generative process. A hyperparameter $\beta$ controls the resolution at which the VAE is sensitive to structures in the data manifold. The variation of the latent space structure with $\beta$, and the scaling of some VAE properties, provide insight into scale dependent structure of the dataset and its information complexity. I introduce two measures of the dimensionality of the learnt representation that are calculated from this scaling.
    Improving significance of binary black hole mergers in Advanced LIGO data using deep learning : Confirmation of GW151216. (arXiv:2010.08584v3 [gr-qc] UPDATED)
    (3 min) We present a novel Machine Learning (ML) based strategy to search for binary black hole (BBH) mergers in data from ground-based gravitational wave (GW) observatories. This is the first ML-based search that not only recovers all the compact binary coalescences (CBCs) in the first GW transients catalog (GWTC-1), but also makes a clean detection of GW151216 by only adding a new coincident ranking statistic (MLStat) to a standard analysis that was used for GWTC-1. In CBC searches, reducing contamination by terrestrial and instrumental transients, which create a loud noise background by triggering numerous false alarms, is crucial to improving the sensitivity for detecting true events. The sheer volume of data and a large number of expected detections also prompts the use of ML techniques. We perform transfer learning to train "InceptionV3", a pre-trained deep neural network, along with curriculum learning to distinguish GW signals from noisy events by analysing their continuous wavelet transform (CWT) maps. MLStat incorporates information from this ML classifier into the coincident search likelihood used by the standard PyCBC search. This leads to at least an order of magnitude improvement in the inverse false-alarm-rate (IFAR) for the previously "low significance" events GW151012, GW170729 and GW151216. We also perform the parameter estimation of GW151216 using SEOBNRv4HM_ROM. We carry out an injection study to show that MLStat brings substantial improvement to the detection sensitivity of Advanced LIGO for all compact binary coalescences. The average improvement in the sensitive volume is ~10% for low chirp masses (0.8-5 Msun), and ~30% for higher masses (5-50 Msun). This work demonstrates the immense potential and readiness of MLStat for finding new sources in current data and the possibility of its adaptation in similar searches.
    Improved Convergence Guarantees for Learning Gaussian Mixture Models by EM and Gradient EM. (arXiv:2101.00575v2 [cs.LG] UPDATED)
    (2 min) We consider the problem of estimating the parameters of a Gaussian Mixture Model with K components of known weights, all with an identity covariance matrix. We make two contributions. First, at the population level, we present a sharper analysis of the local convergence of EM and gradient EM, compared to previous works. Assuming a separation of $\Omega(\sqrt{\log K})$, we prove convergence of both methods to the global optima from an initialization region larger than those of previous works. Specifically, the initial guess of each component can be as far as (almost) half its distance to the nearest Gaussian. This is essentially the largest possible contraction region. Our second contribution is improved sample size requirements for accurate estimation by EM and gradient EM. In previous works, the required number of samples had a quadratic dependence on the maximal separation between the K components, and the resulting error estimate increased linearly with this maximal separation. In this manuscript we show that both quantities depend only logarithmically on the maximal separation.
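    For concreteness, one EM iteration in the setting studied here (identity covariances, known weights) reduces to a responsibility-weighted mean update; a minimal NumPy sketch with toy data (the initialization and data are assumptions):

```python
import numpy as np

def em_step(X, mus, weights):
    # E-step: responsibilities under identity-covariance components with known weights
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)          # n x K squared distances
    log_r = np.log(weights) - 0.5 * d2
    log_r -= log_r.max(axis=1, keepdims=True)                      # for numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means
    return (r.T @ X) / r.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
mus = np.array([[-1.0, 0.0], [1.0, 0.0]])                          # initial guesses
for _ in range(20):
    mus = em_step(X, mus, weights=np.array([0.5, 0.5]))
print(np.round(mus, 2))
```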
    Causal Discovery in High-Dimensional Point Process Networks with Hidden Nodes. (arXiv:2109.10947v1 [stat.ML])
    (0 min) Thanks to technological advances leading to near-continuous time observations, emerging multivariate point process data offer new opportunities for causal discovery. However, a key obstacle in achieving this goal is that many relevant processes may not be observed in practice. Naive estimation approaches that ignore these hidden variables can generate misleading results because of the unadjusted confounding. To plug this gap, we propose a deconfounding procedure to estimate high-dimensional point process networks with only a subset of the nodes being observed. Our method allows flexible connections between the observed and unobserved processes. It also allows the number of unobserved processes to be unknown and potentially larger than the number of observed nodes. Theoretical analyses and numerical studies highlight the advantages of the proposed method in identifying causal interactions among the observed processes.
    Adapted End-to-End Coreference Resolution System for Anaphoric Identities in Dialogues. (arXiv:2109.00185v2 [cs.CL] UPDATED)
    (2 min) We present an effective system adapted from the end-to-end neural coreference resolution model, targeting on the task of anaphora resolution in dialogues. Three aspects are specifically addressed in our approach, including the support of singletons, encoding speakers and turns throughout dialogue interactions, and knowledge transfer utilizing existing resources. Despite the simplicity of our adaptation strategies, they are shown to bring significant impact to the final performance, with up to 27 F1 improvement over the baseline. Our final system ranks the 1st place on the leaderboard of the anaphora resolution track in the CRAC 2021 shared task, and achieves the best evaluation results on all four datasets.
    Adversarial Transfer Attacks With Unknown Data and Class Overlap. (arXiv:2109.11125v1 [cs.LG])
    (2 min) The ability to transfer adversarial attacks from one model (the surrogate) to another model (the victim) has been an issue of concern within the machine learning (ML) community. The ability to successfully evade unseen models represents an uncomfortable level of ease toward implementing attacks. In this work we note that as studied, current transfer attack research has an unrealistic advantage for the attacker: the attacker has the exact same training data as the victim. We present the first study of transferring adversarial attacks focusing on the data available to attacker and victim under imperfect settings without querying the victim, where there is some variable level of overlap in the exact data used or in the classes learned by each model. This threat model is relevant to applications in medicine, malware, and others. Under this new threat model attack success rate is not correlated with data or class overlap in the way one would expect, and varies with dataset. This makes it difficult for attacker and defender to reason about each other and contributes to the broader study of model robustness and security. We remedy this by developing a masked version of Projected Gradient Descent that simulates class disparity, which enables the attacker to reliably estimate a lower-bound on their attack's success.
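    The attack itself is built on standard L-infinity projected gradient descent; a minimal sketch is below, with the class masking folded into the surrogate loss gradient (the step sizes and this particular factoring are assumptions, not the authors' exact procedure):

```python
import numpy as np

def masked_pgd(x, loss_grad, eps=0.03, alpha=0.01, steps=10):
    # L-infinity PGD: `loss_grad` is assumed to return the gradient of a surrogate
    # loss already restricted (masked) to the classes the attacker shares with the victim.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid input range
    return x_adv

# Toy usage with a made-up surrogate gradient (pushes the input toward zeros)
x = np.random.rand(3, 32, 32).astype(np.float32)
loss_grad = lambda z: -z                           # hypothetical surrogate loss gradient
x_adv = masked_pgd(x, loss_grad)
print(np.abs(x_adv - x).max())                     # stays within the eps-ball
```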
    Novel Deep Learning Architecture for Heart Disease Prediction using Convolutional Neural Network. (arXiv:2105.10816v3 [cs.LG] UPDATED)
    (2 min) Healthcare is one of the most important aspects of human life. Heart disease is known to be one of the deadliest diseases which is hampering the lives of many people around the world. Heart disease must be detected early so the loss of lives can be prevented. The availability of large-scale data for medical diagnosis has helped developed complex machine learning and deep learning-based models for automated early diagnosis of heart diseases. The classical approaches have been limited in terms of not generalizing well to new data which have not been seen in the training set. This is indicated by a large gap in training and test accuracies. This paper proposes a novel deep learning architecture using a 1D convolutional neural network for classification between healthy and non-healthy persons to overcome the limitations of classical approaches. Various clinical parameters are used for assessing the risk profile in the patients which helps in early diagnosis. Various techniques are used to avoid overfitting in the proposed network. The proposed network achieves over 97% training accuracy and 96% test accuracy on the dataset. The accuracy of the model is compared in detail with other classification algorithms using various performance parameters which proves the effectiveness of the proposed architecture.
    SampleFix: Learning to Generate Functionally Diverse Fixes. (arXiv:1906.10502v3 [cs.SE] UPDATED)
    (2 min) Automatic program repair holds the potential of dramatically improving the productivity of programmers during the software development process and correctness of software in general. Recent advances in machine learning, deep learning, and NLP have rekindled the hope to eventually fully automate the process of repairing programs. However, previous approaches that aim to predict a single fix are prone to fail due to uncertainty about the true intent of the programmer. Therefore, we propose a generative model that learns a distribution over potential fixes. Our model is formulated as a deep conditional variational autoencoder that can efficiently sample fixes for a given erroneous program. In order to ensure diverse solutions, we propose a novel regularizer that encourages diversity over a semantic embedding space. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over the state-of-the-art approaches by fixing up to 45% of the erroneous programs. We additionally show that for 65% of the repaired programs, our approach was able to generate multiple programs with diverse functionalities.
    Quantization and Deployment of Deep Neural Networks on Microcontrollers. (arXiv:2105.13331v2 [cs.LG] UPDATED)
    (3 min) Embedding Artificial Intelligence onto low-power devices is a challenging task that has been partly overcome with recent advances in machine learning and hardware design. Presently, deep neural networks can be deployed on embedded targets to perform different tasks such as speech recognition, object detection or Human Activity Recognition. However, there is still room for optimization of deep neural networks onto embedded devices. These optimizations mainly address power consumption, memory and real-time constraints, but also an easier deployment at the edge. Moreover, there is still a need for a better understanding of what can be achieved for different use cases. This work focuses on quantization and deployment of deep neural networks onto low-power 32-bit microcontrollers. The quantization methods, relevant in the context of an embedded execution onto a microcontroller, are first outlined. Then, a new framework for end-to-end deep neural networks training, quantization and deployment is presented. This framework, called MicroAI, is designed as an alternative to existing inference engines (TensorFlow Lite for Microcontrollers and STM32CubeAI). Our framework can indeed be easily adjusted and/or extended for specific use cases. Execution using single precision 32-bit floating-point as well as fixed-point on 8- and 16-bit integers are supported. The proposed quantization method is evaluated with three different datasets (UCI-HAR, Spoken MNIST and GTSRB). Finally, a comparison study between MicroAI and both existing embedded inference engines is provided in terms of memory and power efficiency. On-device evaluation is done using ARM Cortex-M4F-based microcontrollers (Ambiq Apollo3 and STM32L452RE).
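    A minimal sketch of the simplest scheme in this family, per-tensor symmetric post-training quantization of float32 weights to 8-bit integers plus a scale (illustrative only; MicroAI's actual quantization modes may differ):

```python
import numpy as np

def quantize_int8(w):
    # Per-tensor symmetric quantization: store int8 values plus one float scale
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())   # worst-case quantization error
```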
    Critic Regularized Regression. (arXiv:2006.15134v3 [cs.LG] UPDATED)
    (2 min) Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces -- outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.
    Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning. (arXiv:2106.05956v3 [cs.LG] UPDATED)
    (2 min) Inspired by BatchNorm, there has been an explosion of normalization layers for deep neural networks (DNNs). However, these alternative normalization layers have seen minimal use, partially due to a lack of guiding principles that can help identify when these layers can serve as a replacement for BatchNorm. To address this problem, we take a theoretical approach, generalizing the known beneficial mechanisms of BatchNorm to several recently proposed normalization techniques. Our generalized theory leads to the following set of principles: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric layers require explicit remedies; (ii) use of GroupNorm can ensure informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
    Transformers: "The End of History" for NLP?. (arXiv:2105.00813v2 [cs.CL] UPDATED)
    (2 min) Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but fundamentally, they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain information sources, which was easy for pre-existing models. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice on two general types of tasks -- segmentation and segment labeling -- and on four datasets that these limitations are indeed harmful and that addressing them, even in some very simple and naive ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion on desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.
    Improving Sampling Accuracy of Stochastic Gradient MCMC Methods via Non-uniform Subsampling of Gradients. (arXiv:2002.08949v3 [cs.LG] UPDATED)
    (2 min) Many Markov Chain Monte Carlo (MCMC) methods leverage gradient information of the potential function of the target distribution to explore sample space efficiently. However, computing gradients can often be computationally expensive for large scale applications, such as those in contemporary machine learning. Stochastic Gradient (SG-)MCMC methods approximate gradients by stochastic ones, commonly via uniformly subsampled data points, and achieve improved computational efficiency, however at the price of introducing sampling error. We propose a non-uniform subsampling scheme to improve the sampling accuracy. The proposed exponentially weighted stochastic gradient (EWSG) is designed so that a non-uniform-SG-MCMC method mimics the statistical behavior of a batch-gradient-MCMC method, and hence the inaccuracy due to SG approximation is reduced. EWSG differs from classical variance reduction (VR) techniques as it focuses on the entire distribution instead of just the variance; nevertheless, its reduced local variance is also proved. EWSG can also be viewed as an extension of the importance sampling idea, successful for stochastic-gradient-based optimizations, to sampling tasks. In our practical implementation of EWSG, the non-uniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the MCMC algorithm. Numerical experiments are provided, not only to demonstrate EWSG's effectiveness, but also to guide hyperparameter choices, and validate our \emph{non-asymptotic global error bound} despite approximations in the implementation. Notably, while statistical accuracy is improved, convergence speed can be comparable to the uniform version, which renders EWSG a practical alternative to VR (but EWSG and VR can be combined too).
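    EWSG itself couples a Metropolis-Hastings chain on the data index to the MCMC sampler; the sketch below only shows the basic non-uniform subsampling idea it refines, i.e., an importance-weighted stochastic gradient that stays unbiased under a non-uniform index distribution (the interface, batch size, and toy loss are assumptions):

```python
import numpy as np

def nonuniform_sg(grad_fn, probs, n, batch=32, rng=np.random.default_rng(0)):
    # Draw indices with probabilities `probs` and reweight each per-example gradient
    # by 1/(n * p_i) so the estimator of the mean gradient is unbiased in expectation.
    idx = rng.choice(n, size=batch, p=probs)
    return sum(grad_fn(i) / (n * probs[i]) for i in idx) / batch

# Toy loss sum_i (theta - data_i)^2 evaluated at theta = 0
n = 1000
data = np.random.randn(n)
grad_fn = lambda i: np.array([2.0 * (0.0 - data[i])])   # per-example gradient (hypothetical)
p = np.abs(data) + 1e-3
p = p / p.sum()                                         # a non-uniform index distribution
print(nonuniform_sg(grad_fn, p, n))
```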
    Diffusion-Based Representation Learning. (arXiv:2105.14257v2 [cs.LG] UPDATED)
    (2 min) Score-based methods represented as stochastic differential equations on a continuous time domain have recently proven successful as a non-adversarial generative model. Training such models relies on denoising score matching, which can be seen as multi-scale denoising autoencoders. Here, we augment the denoising score-matching framework to enable representation learning without any supervised signal. GANs and VAEs learn representations by directly transforming latent codes to data samples. In contrast, the introduced diffusion based representation learning relies on a new formulation of the denoising score-matching objective and thus encodes information needed for denoising. We illustrate how this difference allows for manual control of the level of details encoded in the representation. Using the same approach, we propose to learn an infinite-dimensional latent code which achieves improvements of state-of-the-art models on semi-supervised image classification. As a side contribution, we show how adversarial training in score-based models can improve sample quality and improve sampling speed using a new approximation of the prior at smaller noise scales.
    Wearable-based Classification of Running Styles with Deep Learning. (arXiv:2109.00594v2 [cs.LG] UPDATED)
    (2 min) Automatic classification of running styles can enable runners to obtain feedback with the aim of optimizing performance in terms of minimizing energy expenditure, fatigue, and risk of injury. To develop a system capable of classifying running styles using wearables, we collect a dataset from 10 healthy runners performing 8 different pre-defined running styles. Five wearable devices are used to record accelerometer data from different parts of the lower body, namely left and right foot, left and right medial tibia, and lower back. Using the collected dataset, we develop a deep learning solution which consists of a Convolutional Neural Network and Long Short-Term Memory network to first automatically extract effective features, followed by learning temporal relationships. Score-level fusion is used to aggregate the classification results from the different sensors. Experiments show that the proposed model is capable of automatically classifying different running styles in a subject-dependent manner, outperforming several classical machine learning methods (following manual feature extraction) and a convolutional neural network baseline. Moreover, our study finds that subject-independent classification of running styles is considerably more challenging than a subject-dependent scheme, indicating a high level of personalization in such running styles. Finally, we demonstrate that by fine-tuning the model with as few as 5% subject-specific samples, a considerable performance boost is obtained.
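    A minimal sketch of this kind of pipeline is given below: a per-sensor CNN feeding an LSTM, with score-level fusion implemented as averaging per-sensor class probabilities. The layer sizes and window length are assumptions for illustration, not the paper's exact architecture.

        import torch
        import torch.nn as nn

        # Minimal sketch (architecture details are assumptions): a per-sensor CNN + LSTM
        # classifier over accelerometer windows, with score-level fusion implemented as
        # averaging the per-sensor class probabilities.
        class SensorNet(nn.Module):
            def __init__(self, in_channels=3, n_classes=8):
                super().__init__()
                self.conv = nn.Sequential(
                    nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
                    nn.MaxPool1d(2),
                )
                self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
                self.head = nn.Linear(64, n_classes)

            def forward(self, x):                  # x: (batch, channels, time)
                h = self.conv(x)                   # (batch, 32, time/2)
                h = h.transpose(1, 2)              # (batch, time/2, 32) for the LSTM
                _, (h_n, _) = self.lstm(h)         # last hidden state: (1, batch, 64)
                return self.head(h_n.squeeze(0))   # class logits

        # one network per wearable sensor; fuse their softmax scores
        sensors = ["left_foot", "right_foot", "left_tibia", "right_tibia", "lower_back"]
        models = {s: SensorNet() for s in sensors}
        batch = {s: torch.randn(4, 3, 128) for s in sensors}   # fake 3-axis windows
        probs = torch.stack([models[s](batch[s]).softmax(dim=-1) for s in sensors])
        fused = probs.mean(dim=0)                  # score-level fusion
        print(fused.argmax(dim=-1))                # predicted running style per example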
    Conditional Poisson Stochastic Beam Search. (arXiv:2109.11034v1 [cs.CL])
    (2 min) Beam search is the default decoding strategy for many sequence generation tasks in NLP. The set of approximate K-best items returned by the algorithm is a useful summary of the distribution for many applications; however, the candidates typically exhibit high overlap and may give a highly biased estimate for expectations under our model. These problems can be addressed by instead using stochastic decoding strategies. In this work, we propose a new method for turning beam search into a stochastic process: Conditional Poisson stochastic beam search (CPSBS). Rather than taking the maximizing set at each iteration, we sample K candidates without replacement according to the conditional Poisson sampling design. We view this as a more natural alternative to the stochastic beam search (SBS) of Kool et al. (2019). Furthermore, we show how samples generated under the CPSBS design can be used to build consistent estimators and sample diverse sets from sequence models. In our experiments, we observe that CPSBS produces lower-variance and more efficient estimators than SBS, even showing improvements in high entropy settings.
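    For intuition, the conditional Poisson design is independent Bernoulli inclusion conditioned on exactly K successes. The sketch below samples from it by rejection; the inclusion probabilities and the rejection loop are illustrative assumptions, not the paper's efficient implementation inside beam search.

        import numpy as np

        # Minimal sketch of conditional Poisson sampling: draw independent Bernoulli
        # inclusion indicators and condition on exactly K successes via rejection.
        def conditional_poisson_sample(inclusion_probs, K, rng):
            while True:
                picks = rng.random(len(inclusion_probs)) < inclusion_probs
                if picks.sum() == K:
                    return np.flatnonzero(picks)

        rng = np.random.default_rng(0)
        scores = np.array([2.0, 1.0, 0.5, 0.1, -1.0])      # hypothesis log-scores
        probs = 1 / (1 + np.exp(-scores))                  # illustrative inclusion probabilities
        print(conditional_poisson_sample(probs, K=3, rng=rng))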
    Making Human-Like Trade-offs in Constrained Environments by Learning from Demonstrations. (arXiv:2109.11018v1 [cs.AI])
    (2 min) Many real-life scenarios require humans to make difficult trade-offs: do we always follow all the traffic rules or do we violate the speed limit in an emergency? These scenarios force us to evaluate the trade-off between collective norms and our own personal objectives. To create effective AI-human teams, we must equip AI agents with a model of how humans make trade-offs in complex, constrained environments. These agents will be able to mirror human behavior or to draw human attention to situations where decision making could be improved. To this end, we propose a novel inverse reinforcement learning (IRL) method for learning implicit hard and soft constraints from demonstrations, enabling agents to quickly adapt to new settings. In addition, learning soft constraints over states, actions, and state features allows agents to transfer this knowledge to new domains that share similar aspects. We then use the constraint learning method to implement a novel system architecture that leverages a cognitive model of human decision making, multi-alternative decision field theory (MDFT), to orchestrate competing objectives. We evaluate the resulting agent on trajectory length, number of violated constraints, and total reward, demonstrating that our agent architecture is both general and achieves strong performance. Thus we are able to capture and replicate human-like trade-offs from demonstrations in environments where constraints are not explicit.
    FooBaR: Fault Fooling Backdoor Attack on Neural Network Training. (arXiv:2109.11249v1 [cs.CR])
    (2 min) Neural network implementations are known to be vulnerable to physical attack vectors such as fault injection attacks. Until now, these attacks have only been utilized during the inference phase with the intention to cause a misclassification. In this work, we explore a novel attack paradigm by injecting faults during the training phase of a neural network in a way that the resulting network can be attacked during deployment without the necessity of further faulting. In particular, we discuss attacks against ReLU activation functions that make it possible to generate a family of malicious inputs, which are called fooling inputs, to be used at inference time to induce controlled misclassifications. Such malicious inputs are obtained by mathematically solving a system of linear equations that would cause a particular behaviour on the attacked activation functions, similar to the one induced in training through faulting. We call such attacks fooling backdoors as the fault attacks at the training phase inject backdoors into the network that allow an attacker to produce fooling inputs. We evaluate our approach against multi-layer perceptron networks and convolutional networks on a popular image classification task obtaining high attack success rates (from 60% to 100%) and high classification confidence when as few as 25 neurons are attacked while preserving high accuracy on the originally intended classification task.
    TransforMesh: A Transformer Network for Longitudinal modeling of Anatomical Meshes. (arXiv:2109.00532v2 [cs.CV] UPDATED)
    (2 min) The longitudinal modeling of neuroanatomical changes related to Alzheimer's disease (AD) is crucial for studying the progression of the disease. To this end, we introduce TransforMesh, a spatio-temporal network based on transformers that models longitudinal shape changes on 3D anatomical meshes. While transformer and mesh networks have recently shown impressive performances in natural language processing and computer vision, their application to medical image analysis has been very limited. To the best of our knowledge, this is the first work that combines transformer and mesh networks. Our results show that TransforMesh can model shape trajectories better than other baseline architectures that do not capture temporal dependencies. Moreover, we also explore the capabilities of TransforMesh in detecting structural anomalies of the hippocampus in patients developing AD.
    Toward a Unified Framework for Debugging Gray-box Models. (arXiv:2109.11160v1 [cs.LG])
    (2 min) We are concerned with debugging concept-based gray-box models (GBMs). These models acquire task-relevant concepts appearing in the inputs and then compute a prediction by aggregating the concept activations. This work stems from the observation that in GBMs both the concepts and the aggregation function can be affected by different bugs, and that correcting these bugs requires different kinds of corrective supervision. To this end, we introduce a simple schema for identifying and prioritizing bugs in both components, discuss possible implementations and open problems. At the same time, we introduce a new loss function for debugging the aggregation step that extends existing approaches to align the model's explanations to GBMs by making them robust to how the concepts change during training.
    Distributed Principal Subspace Analysis for Partitioned Big Data: Algorithms, Analysis, and Implementation. (arXiv:2103.06406v2 [cs.LG] UPDATED)
    (3 min) Principal Subspace Analysis (PSA) is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA solutions are fast becoming irrelevant in the modern era of big data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed in the paper that can be used for distributed PSA in the case of data that are partitioned across either samples or (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communications cost as well as to study the effect of straggler machines on the proposed algorithms.
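    As a rough illustration of sample-wise partitioned subspace estimation (not the paper's specific algorithms), the sketch below runs a consensus-style power iteration: each node multiplies by its local covariance, nodes average their iterates via a doubly stochastic mixing matrix, and a QR step re-orthonormalizes. The complete-graph mixing matrix and data generation are assumptions for the toy.

        import numpy as np

        # Toy sketch of distributed subspace estimation with sample-wise partitioned data.
        rng = np.random.default_rng(0)
        d, r, n_nodes = 20, 3, 5
        data = [rng.normal(size=(200, d)) @ np.diag(np.linspace(3, 0.5, d)) for _ in range(n_nodes)]
        covs = [X.T @ X / len(X) for X in data]
        W = np.full((n_nodes, n_nodes), 1.0 / n_nodes)     # complete-graph mixing matrix (assumption)

        Q = [np.linalg.qr(rng.normal(size=(d, r)))[0] for _ in range(n_nodes)]
        for _ in range(100):
            local = [covs[i] @ Q[i] for i in range(n_nodes)]            # local power step
            mixed = [sum(W[i, j] * local[j] for j in range(n_nodes))    # consensus averaging
                     for i in range(n_nodes)]
            Q = [np.linalg.qr(M)[0] for M in mixed]

        # compare against the centralized principal subspace
        C = sum(covs) / n_nodes
        true_Q = np.linalg.eigh(C)[1][:, -r:]
        align = np.linalg.svd(true_Q.T @ Q[0], compute_uv=False).min()
        print("subspace alignment (close to 1 when subspaces agree):", round(align, 3))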
    Improving Generalization of Transfer Learning Across Domains Using Spatio-Temporal Features in Autonomous Driving. (arXiv:2103.08116v2 [cs.CV] UPDATED)
    (2 min) Practical learning-based autonomous driving models must be capable of generalizing learned behaviors from simulated to real domains, and from training data to unseen domains with unusual image properties. In this paper, we investigate transfer learning methods that achieve robustness to domain shifts by taking advantage of the invariance of spatio-temporal features across domains, and propose a transfer learning method to improve generalization across domains via transfer of spatio-temporal features and salient data augmentation. Our model uses a CNN-LSTM network with Inception modules for image feature extraction. Our method runs in two phases: Phase 1 involves training on source domain data, while Phase 2 performs training on target domain data that has been supplemented by feature maps generated using the Phase 1 model. Our model significantly improves performance in unseen test cases for both simulation-to-simulation transfer as well as simulation-to-real transfer by up to +37.3\% in test accuracy and up to +40.8\% in steering angle prediction, compared to other SOTA methods across multiple datasets.
    Joint speaker diarisation and tracking in switching state-space model. (arXiv:2109.11140v1 [cs.SD])
    (2 min) Speakers may move around while diarisation is being performed. When a microphone array is used, the instantaneous locations of where the sounds originated from can be estimated, and previous investigations have shown that such information can be complementary to speaker embeddings in the diarisation task. However, these approaches often assume that speakers are fairly stationary throughout a meeting. This paper relaxes this assumption, by proposing to explicitly track the movements of speakers while jointly performing diarisation within a unified model. A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers. The model is implemented as a particle filter. Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.
    Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits. (arXiv:2007.10229v3 [cs.LG] UPDATED)
    (2 min) We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than use a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that if learning rates are well chosen then the policy gradient algorithm is a transient Markov chain and the state of the chain converges on the optimal arm with logarithmic or poly-logarithmic regret.
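    As a concrete toy version of this setting, the sketch below runs a REINFORCE-style softmax policy gradient on a three-armed Bernoulli bandit; the particular state-dependent learning rate used is an illustrative assumption, not the schedule analyzed in the paper.

        import numpy as np

        # Toy sketch: softmax policy gradient on a Bernoulli bandit, with a learning rate
        # that depends on the current state of the algorithm rather than a fixed schedule.
        rng = np.random.default_rng(0)
        true_means = np.array([0.3, 0.5, 0.7])
        theta = np.zeros(3)                            # softmax preferences (the "state")
        regret = 0.0
        for t in range(20000):
            p = np.exp(theta - theta.max()); p /= p.sum()
            arm = rng.choice(3, p=p)
            reward = float(rng.random() < true_means[arm])
            regret += true_means.max() - true_means[arm]
            lr = 0.1 * (1.0 - p.max())                 # state-dependent learning rate (assumption)
            grad = -p.copy(); grad[arm] += 1.0         # d log p(arm) / d theta for the softmax policy
            theta += lr * reward * grad                # REINFORCE update
        print("empirical regret:", round(regret, 1))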
    Clustering performance analysis using new correlation based cluster validity indices. (arXiv:2109.11172v1 [stat.ML])
    (2 min) There are various cluster validity measures used for evaluating clustering results. One of the main objectives of using these measures is to seek the optimal unknown number of clusters. Some measures work well for clusters with different densities, sizes and shapes. Yet, one of the weaknesses that those validity measures share is that they sometimes provide only one clear optimal number of clusters. That number is actually unknown and there might be more than one potential sub-optimal option that a user may wish to choose based on different applications. We develop two new cluster validity indices based on a correlation between the actual distance between a pair of data points and the centroid distance of the clusters in which the two points are located. Our proposed indices consistently yield several peaks at different numbers of clusters, which overcomes the weakness previously stated. Furthermore, the introduced correlation can also be used for evaluating the quality of a selected clustering result. Several experiments in different scenarios, including the well-known iris data set and a real-world marketing application, have been conducted in order to compare the proposed validity indices with several well-known ones.
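    The core quantity is easy to compute. The sketch below (an illustration of the idea, not the paper's exact indices) correlates, over all point pairs, the actual pairwise distance with the distance between the centroids of the clusters the two points fall in, scanning the number of clusters on the iris data.

        import numpy as np
        from itertools import combinations
        from sklearn.cluster import KMeans
        from sklearn.datasets import load_iris

        # Sketch of the core correlation: actual pairwise distance vs. distance between
        # the centroids of the clusters the two points were assigned to.
        def centroid_distance_correlation(X, labels, centroids):
            actual, centroid = [], []
            for i, j in combinations(range(len(X)), 2):
                actual.append(np.linalg.norm(X[i] - X[j]))
                centroid.append(np.linalg.norm(centroids[labels[i]] - centroids[labels[j]]))
            return np.corrcoef(actual, centroid)[0, 1]

        X = load_iris().data
        for k in range(2, 7):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            print(k, round(centroid_distance_correlation(X, km.labels_, km.cluster_centers_), 3))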
    Boosting Cross-Lingual Transfer via Self-Learning with Uncertainty Estimation. (arXiv:2109.00194v2 [cs.CL] UPDATED)
    (2 min) Recent multilingual pre-trained language models have achieved remarkable zero-shot performance, where the model is only finetuned on one source language and directly evaluated on target languages. In this work, we propose a self-learning framework that further utilizes unlabeled data of target languages, combined with uncertainty estimation in the process to select high-quality silver labels. Three different uncertainties are adapted and analyzed specifically for cross-lingual transfer: Language Heteroscedastic/Homoscedastic Uncertainty (LEU/LOU) and Evidential Uncertainty (EVI). We evaluate our framework with uncertainties on two cross-lingual tasks, Named Entity Recognition (NER) and Natural Language Inference (NLI), covering 40 languages in total, and it outperforms the baselines significantly, by 10 F1 points on average for NER and 2.5 accuracy points for NLI.
    Parallelised Diffeomorphic Sampling-based Motion Planning. (arXiv:2108.11775v2 [cs.RO] UPDATED)
    (2 min) We propose Parallelised Diffeomorphic Sampling-based Motion Planning (PDMP). PDMP is a novel parallelised framework that uses bijective and differentiable mappings, or diffeomorphisms, to transform sampling distributions of sampling-based motion planners, in a manner akin to normalising flows. Unlike normalising flow models which use invertible neural network structures to represent these diffeomorphisms, we develop them from gradient information of desired costs, and encode desirable behaviour, such as obstacle avoidance. These transformed sampling distributions can then be used for sampling-based motion planning. A particular example is when we wish to imbue the sampling distribution with knowledge of the environment geometry, such that drawn samples are less prone to be in collisions. To this end, we propose to learn a continuous occupancy representation from environment occupancy data, such that the gradients of the representation define a valid diffeomorphism and are amenable to fast parallel evaluation. We use this to "morph" the sampling distribution to draw far fewer collision-prone samples. PDMP is able to leverage gradient information of costs, to inject specifications, in a manner similar to optimisation-based motion planning methods, but relies on drawing from a sampling distribution, retaining the tendency to find more global solutions, thereby bridging the gap between trajectory optimisation and sampling-based planning methods.
    VBridge: Connecting the Dots Between Features and Data to Explain Healthcare Models. (arXiv:2108.02550v2 [cs.HC] UPDATED)
    (2 min) Machine learning (ML) is increasingly applied to Electronic Health Records (EHRs) to solve clinical prediction tasks. Although many ML models perform promisingly, issues with model transparency and interpretability limit their adoption in clinical practice. Directly using existing explainable ML techniques in clinical settings can be challenging. Through literature surveys and collaborations with six clinicians with an average of 17 years of clinical experience, we identified three key challenges, including clinicians' unfamiliarity with ML features, lack of contextual information, and the need for cohort-level evidence. Following an iterative design process, we further designed and developed VBridge, a visual analytics tool that seamlessly incorporates ML explanations into clinicians' decision-making workflow. The system includes a novel hierarchical display of contribution-based feature explanations and enriched interactions that connect the dots between ML features, explanations, and data. We demonstrated the effectiveness of VBridge through two case studies and expert interviews with four clinicians, showing that visually associating model explanations with patients' situational records can help clinicians better interpret and use model predictions when making clinical decisions. We further derived a list of design implications for developing future explainable ML tools to support clinical decision-making.
    Knowledge-Aware Graph-Enhanced GPT-2 for Dialogue State Tracking. (arXiv:2104.04466v3 [cs.CL] UPDATED)
    (2 min) Dialogue State Tracking is central to multi-domain task-oriented dialogue systems, responsible for extracting information from user utterances. We present a novel hybrid architecture that augments GPT-2 with representations derived from Graph Attention Networks in such a way as to allow causal, sequential prediction of slot values. The model architecture captures inter-slot relationships and dependencies across domains that otherwise can be lost in sequential prediction. We report improvements in state tracking performance in MultiWOZ 2.0 against a strong GPT-2 baseline and investigate a simplified sparse training scenario in which DST models are trained only on session-level annotations but evaluated at the turn level. We further report detailed analyses to demonstrate the effectiveness of graph models in DST by showing that the proposed graph modules capture inter-slot dependencies and improve the predictions of values that are common to multiple domains.
    The Seven-League Scheme: Deep learning for large time step Monte Carlo simulations of stochastic differential equations. (arXiv:2009.03202v5 [math.NA] UPDATED)
    (2 min) We propose an accurate data-driven numerical scheme to solve Stochastic Differential Equations (SDEs), by taking large time steps. The SDE discretization is built up by means of a polynomial chaos expansion method, on the basis of accurately determined stochastic collocation (SC) points. By employing an artificial neural network to learn these SC points, we can perform Monte Carlo simulations with large time steps. Error analysis confirms that this data-driven scheme results in accurate SDE solutions in the sense of strong convergence, provided the learning methodology is robust and accurate. With a method variant called the compression-decompression collocation and interpolation technique, we can drastically reduce the number of neural network functions that have to be learned, so that computational speed is enhanced. Numerical experiments confirm a high-quality strong convergence error when using large time steps, and the novel scheme outperforms some classical numerical SDE discretizations. Some applications, here in financial option valuation, are also presented.
    Deep Co-Training with Task Decomposition for Semi-Supervised Domain Adaptation. (arXiv:2007.12684v5 [cs.CV] UPDATED)
    (2 min) Semi-supervised domain adaptation (SSDA) aims to adapt models trained from a labeled source domain to a different but related target domain, from which unlabeled data and a small set of labeled data are provided. Current methods that treat source and target supervision without distinction overlook their inherent discrepancy, resulting in a source-dominated model that has not effectively used the target supervision. In this paper, we argue that the labeled target data needs to be distinguished for effective SSDA, and propose to explicitly decompose the SSDA task into two sub-tasks: a semi-supervised learning (SSL) task in the target domain and an unsupervised domain adaptation (UDA) task across domains. By doing so, the two sub-tasks can better leverage the corresponding supervision and thus yield very different classifiers. To integrate the strengths of the two classifiers, we apply the well-established co-training framework, in which the two classifiers exchange their high confident predictions to iteratively "teach each other" so that both classifiers can excel in the target domain. We call our approach Deep Co-training with Task decomposition (DeCoTa). DeCoTa requires no adversarial training and is easy to implement. Moreover, DeCoTa is well-founded on the theoretical condition of when co-training would succeed. As a result, DeCoTa achieves state-of-the-art results on several SSDA datasets, outperforming the prior art by a notable 4% margin on DomainNet. Code is available at https://github.com/LoyoYang/DeCoTa
    Convolutional Neural Network for emotion recognition to assist psychiatrists and psychologists during the COVID-19 pandemic: experts opinion. (arXiv:2005.07649v2 [cs.CV] UPDATED)
    (3 min) A web application with real-time emotion recognition for psychologists and psychiatrists is presented. Mental health effects during COVID-19 quarantine need to be handled because society is being emotionally impacted. The human micro-expressions can describe genuine emotions that can be captured by Convolutional Neural Network (CNN) models. But the challenge is to implement it given the poor performance of many users' computers and slow internet connections, i.e., to improve computational efficiency and reduce data transfer. To validate the computational efficiency premise, we compare CNN architecture results, collecting the floating-point operations per second (FLOPS), the Number of Parameters (NP) and accuracy from MobileNet, PeleeNet, the Extended Deep Neural Network (EDNN), the Inception-Based Deep Neural Network (IDNN) and our proposed Residual mobile-based Network model (ResmoNet). Also, we compare the trained models' results in terms of Main Memory Utilization (MMU) and Response Time to complete the Emotion recognition (RTE). Besides, we design a data transfer scheme that includes the raw data of emotions and the basic patient information. The web application was evaluated with the System Usability Scale (SUS) and a utility questionnaire by psychologists and psychiatrists. The ResmoNet model yielded the lowest NP, FLOPS, and MMU; only EDNN surpasses ResmoNet in RTE, by 0.01 s. The optimizations to our model impacted its accuracy: IDNN and EDNN are 0.02 and 0.05 more accurate than our model, respectively. Finally, according to psychologists and psychiatrists, the web application has good usability (73.8 of 100) and utility (3.94 of 5).
    OneFlow: One-class flow for anomaly detection based on a minimal volume region. (arXiv:2010.03002v3 [cs.LG] UPDATED)
    (2 min) We propose OneFlow - a flow-based one-class classifier for anomaly (outlier) detection that finds a minimal volume bounding region. Contrary to density-based methods, OneFlow is constructed in such a way that its result typically does not depend on the structure of outliers. This is caused by the fact that during training the gradient of the cost function is propagated only over the points located near to the decision boundary (behavior similar to the support vectors in SVM). The combination of flow models and a Bernstein quantile estimator allows OneFlow to find a parametric form of bounding region, which can be useful in various applications including describing shapes from 3D point clouds. Experiments show that the proposed model outperforms related methods on real-world anomaly detection problems.
    Annotation-efficient deep learning for automatic medical image segmentation. (arXiv:2012.04885v3 [eess.IV] UPDATED)
    (2 min) Automatic medical image segmentation plays a critical role in scientific research and medical care. Existing high-performance deep learning methods typically rely on large training datasets with high-quality manual annotations, which are difficult to obtain in many clinical applications. Here, we introduce Annotation-effIcient Deep lEarning (AIDE), an open-source framework to handle imperfect training datasets. Methodological analyses and empirical evaluations are conducted, and we demonstrate that AIDE surpasses conventional fully-supervised models by presenting better performance on open datasets possessing scarce or noisy annotations. We further test AIDE in a real-life case study for breast tumor segmentation. Three datasets containing 11,852 breast images from three medical centers are employed, and AIDE, utilizing 10% training annotations, consistently produces segmentation maps comparable to those generated by fully-supervised counterparts or provided by independent radiologists. The 10-fold enhanced efficiency in utilizing expert labels has the potential to promote a wide range of biomedical applications.
    A Gradient Flow Framework For Analyzing Network Pruning. (arXiv:2009.11839v4 [cs.LG] UPDATED)
    (2 min) Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters. We use this framework to determine the relationship between pruning measures and evolution of model parameters, establishing several results related to pruning models early-on in training: (i) magnitude-based pruning removes parameters that contribute least to reduction in loss, resulting in models that converge faster than magnitude-agnostic methods; (ii) loss-preservation based pruning preserves first-order model evolution dynamics and is therefore appropriate for pruning minimally trained models; and (iii) gradient-norm based pruning affects second-order model evolution dynamics, such that increasing gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10/CIFAR-100. Code available at https://github.com/EkdeepSLubana/flowandprune.
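    As a small illustration of the simplest of these measures, the sketch below performs global magnitude-based pruning on a freshly initialized network; swapping the absolute-weight score for a loss-preservation or gradient-norm score would instantiate the other importance measures discussed above. The model and sparsity level are arbitrary placeholders.

        import torch
        import torch.nn as nn

        # Minimal sketch of magnitude-based pruning early in training: zero out the
        # globally smallest-magnitude weights in the weight matrices.
        def magnitude_prune(model, sparsity=0.5):
            all_weights = torch.cat([p.detach().abs().flatten()
                                     for p in model.parameters() if p.dim() > 1])
            threshold = torch.quantile(all_weights, sparsity)
            for p in model.parameters():
                if p.dim() > 1:                        # prune weight matrices, not biases
                    p.data.mul_((p.detach().abs() >= threshold).float())

        model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
        magnitude_prune(model, sparsity=0.7)
        kept = sum((p != 0).sum().item() for p in model.parameters() if p.dim() > 1)
        total = sum(p.numel() for p in model.parameters() if p.dim() > 1)
        print(f"kept {kept}/{total} weights")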
    An Unbiased Estimator of the Full-sky CMB Angular Power Spectrum at Large Scales using Neural Networks. (arXiv:2102.04327v2 [astro-ph.CO] UPDATED)
    (2 min) Accurate estimation of the Cosmic Microwave Background (CMB) angular power spectrum is enticing due to the prospect for precision cosmology it presents. Galactic foreground emissions, however, contaminate the CMB signal and need to be subtracted reliably in order to lessen systematic errors on the CMB temperature estimates. Typically bright foregrounds in a region lead to further uncertainty in temperature estimates in the area even after some foreground removal technique is performed and hence determining the underlying full-sky angular power spectrum poses a challenge. We explore the feasibility of utilizing artificial neural networks to predict the angular power spectrum of the full sky CMB temperature maps from the observed angular power spectrum of the partial sky in which CMB temperatures in some bright foreground regions are masked. We present our analysis at large angular scales with two different masks. We produce unbiased predictions of the full-sky angular power spectrum and recover the underlying theoretical power spectrum using neural networks. Our predictions are also uncorrelated to a large extent. We further show that the multipole-space covariances of the predictions of full-sky spectra made by the ANNs are much smaller than those of the estimates obtained using the pseudo-$C_\ell$ method.
    End-to-End AI-based MRI Reconstruction and Lesion Detection Pipeline for Evaluation of Deep Learning Image Reconstruction. (arXiv:2109.11524v1 [cs.CV])
    (2 min) Deep learning techniques have emerged as a promising approach to highly accelerated MRI. However, recent reconstruction challenges have shown several drawbacks in current deep learning approaches, including the loss of fine image details even using models that perform well in terms of global quality metrics. In this study, we propose an end-to-end deep learning framework for image reconstruction and pathology detection, which enables a clinically aware evaluation of deep learning reconstruction quality. The solution is demonstrated for a use case of detecting meniscal tears on knee MRI studies, ultimately showing that common reconstruction methods lose fine image details, expressed as a reduced ability to detect important pathology such as meniscal tears. Although reconstruction methods are commonly evaluated quantitatively with metrics such as SSIM, the impaired pathology detection revealed by this automated, pathology-based evaluation suggests that existing quantitative metrics do not capture clinically important reconstruction outcomes.
    How much "human-like" visual experience do current self-supervised learning algorithms need to achieve human-level object recognition?. (arXiv:2109.11523v1 [cs.CV])
    (2 min) This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like", natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is on the order of a million years of natural visual experience, in other words several orders of magnitude longer than a human lifetime. However, this estimate is quite sensitive to some underlying assumptions, underscoring the need to run carefully controlled human experiments. We discuss the main caveats surrounding our estimate and the implications of this rather surprising result.
    Partially Encrypted Machine Learning using Functional Encryption. (arXiv:1905.10214v5 [cs.LG] UPDATED)
    (2 min) Machine learning on encrypted data has received a lot of attention thanks to recent breakthroughs in homomorphic encryption and secure multi-party computation. It allows outsourcing computation to untrusted servers without sacrificing privacy of sensitive data. We propose a practical framework to perform partially encrypted and privacy-preserving predictions which combines adversarial training and functional encryption. We first present a new functional encryption scheme to efficiently compute quadratic functions so that the data owner controls what can be computed but is not involved in the calculation: it provides a decryption key which allows one to learn a specific function evaluation of some encrypted data. We then show how to use it in machine learning to partially encrypt neural networks with quadratic activation functions at evaluation time, and we provide a thorough analysis of the information leaks based on indistinguishability of data items of the same label. Last, since most encryption schemes cannot deal with the last thresholding operation used for classification, we propose a training method to prevent selected sensitive features from leaking, which adversarially optimizes the network against an adversary trying to identify these features. This is interesting for several existing works using partially encrypted machine learning as it comes with little reduction on the model's accuracy and significantly improves data privacy.
    RKHSMetaMod: An R package to estimate the Hoeffding decomposition of a complex model by solving RKHS ridge group sparse optimization problem. (arXiv:1905.13695v5 [stat.ML] UPDATED)
    (2 min) In this paper, we propose an R package, called RKHSMetaMod, that implements a procedure for estimating a meta-model of a complex model. The meta-model approximates the Hoeffding decomposition of the complex model and allows us to perform sensitivity analysis on it. It belongs to a reproducing kernel Hilbert space that is constructed as a direct sum of Hilbert spaces. The estimator of the meta-model is the solution of a penalized empirical least-squares minimization with the sum of the Hilbert norm and the empirical $L^2$-norm. This procedure, called RKHS ridge group sparse, allows both to select and estimate the terms in the Hoeffding decomposition, and therefore, to select and estimate the Sobol indices that are non-zero. The RKHSMetaMod package provides an interface from R statistical computing environment to the C++ libraries Eigen and GSL. In order to speed up the execution time and optimize the storage memory, except for a function that is written in R, all of the functions of this package are written using the efficient C++ libraries through RcppEigen and RcppGSL packages. These functions are then interfaced in the R environment in order to propose a user-friendly package.
    MARMOT: A Deep Learning Framework for Constructing Multimodal Representations for Vision-and-Language Tasks. (arXiv:2109.11526v1 [cs.CV])
    (2 min) Political activity on social media presents a data-rich window into political behavior, but the vast amount of data means that almost all content analyses of social media require a data labeling step. However, most automated machine classification methods ignore the multimodality of posted content, focusing either on text or images. State-of-the-art vision-and-language models are unusable for most political science research: they require all observations to have both image and text and require computationally expensive pretraining. This paper proposes a novel vision-and-language framework called multimodal representations using modality translation (MARMOT). MARMOT presents two methodological contributions: it can construct representations for observations missing image or text, and it replaces the computationally expensive pretraining with modality translation. MARMOT outperforms an ensemble text-only classifier in 19 of 20 categories in multilabel classifications of tweets reporting election incidents during the 2016 U.S. general election. Moreover, MARMOT shows significant improvements over the results of benchmark multimodal models on the Hateful Memes dataset, improving the best result set by VisualBERT in terms of accuracy from 0.6473 to 0.6760 and area under the receiver operating characteristic curve (AUC) from 0.7141 to 0.7530.
    Outlier-Robust Sparse Estimation via Non-Convex Optimization. (arXiv:2109.11515v1 [cs.LG])
    (2 min) We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that any approximate stationary point of the associated optimization problem yields a near-optimal solution for the underlying robust estimation task. As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks. The obtained algorithms are simple, practical, and succeed under broader distributional assumptions compared to prior work.
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v1 [math.OC])
    (2 min) We study nonlinear optimization problems with stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming algorithm, using a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of augmented Lagrangian and performs stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the "liminf" of the KKT residuals converges to zero almost surely. Our algorithm and analysis further develop the prior work \cite{Na2021Adaptive} by allowing nonlinear inequality constraints. We demonstrate the performance of the algorithm on a subset of nonlinear problems collected in the CUTEst test set.
    DeepAID: Interpreting and Improving Deep Learning-based Anomaly Detection in Security Applications. (arXiv:2109.11495v1 [cs.CR])
    (2 min) Unsupervised Deep Learning (DL) techniques have been widely used in various security-related anomaly detection applications, owing to the great promise of being able to detect unforeseen threats and superior performance provided by Deep Neural Networks (DNN). However, the lack of interpretability creates key barriers to the adoption of DL models in practice. Unfortunately, existing interpretation approaches are proposed for supervised learning models and/or non-security domains, which are unadaptable for unsupervised DL models and fail to satisfy special requirements in security domains. In this paper, we propose DeepAID, a general framework aiming to (1) interpret DL-based anomaly detection systems in security domains, and (2) improve the practicality of these systems based on the interpretations. We first propose a novel interpretation method for unsupervised DNNs by formulating and solving well-designed optimization problems with special constraints for security domains. Then, we provide several applications based on our Interpreter as well as a model-based extension Distiller to improve security systems by solving domain-specific problems. We apply DeepAID over three types of security-related anomaly detection systems and extensively evaluate our Interpreter with representative prior works. Experimental results show that DeepAID can provide high-quality interpretations for unsupervised DL models while meeting the special requirements of security domains. We also provide several use cases to show that DeepAID can help security operators to understand model decisions, diagnose system mistakes, give feedback to models, and reduce false positives.
    Exploring Machine Teaching with Children. (arXiv:2109.11434v1 [cs.LG])
    (2 min) Iteratively building and testing machine learning models can help children develop creativity, flexibility, and comfort with machine learning and artificial intelligence. We explore how children use machine teaching interfaces with a team of 14 children (aged 7-13 years) and adult co-designers. Children trained image classifiers and tested each other's models for robustness. Our study illuminates how children reason about ML concepts, offering these insights for designing machine teaching experiences for children: (i) ML metrics (e.g. confidence scores) should be visible for experimentation; (ii) ML activities should enable children to exchange models for promoting reflection and pattern recognition; and (iii) the interface should allow quick data inspection (e.g. images vs. gestures).
    Improving Tuberculosis (TB) Prediction using Synthetically Generated Computed Tomography (CT) Images. (arXiv:2109.11480v1 [eess.IV])
    (2 min) The evaluation of infectious disease processes on radiologic images is an important and challenging task in medical image analysis. Pulmonary infections can often be best imaged and evaluated through computed tomography (CT) scans, which are often not available in low-resource environments and difficult to obtain for critically ill patients. On the other hand, X-ray, a different type of imaging procedure, is inexpensive, often available at the bedside and more widely available, but offers a simpler, two dimensional image. We show that by relying on a model that learns to generate CT images from X-rays synthetically, we can improve the automatic disease classification accuracy and provide clinicians with a different look at the pulmonary disease process. Specifically, we investigate Tuberculosis (TB), a deadly bacterial infectious disease that predominantly affects the lungs, but also other organ systems. We show that relying on synthetically generated CT improves TB identification by 7.50% and distinguishes TB properties up to 12.16% better than the X-ray baseline.
    WRENCH: A Comprehensive Benchmark for Weak Supervision. (arXiv:2109.11377v1 [cs.LG])
    (2 min) Recent \emph{Weak Supervision (WS)} approaches have had widespread success in easing the bottleneck of labeling training data for machine learning by synthesizing labels from multiple potentially noisy supervision sources. However, proper measurement and analysis of these approaches remain a challenge. First, datasets used in existing works are often private and/or custom, limiting standardization. Second, WS datasets with the same name and base data often vary in terms of the labels and weak supervision sources used, a significant "hidden" source of evaluation variance. Finally, WS studies often diverge in terms of the evaluation protocol and ablations used. To address these problems, we introduce a benchmark platform, WRENCH, for a thorough and standardized evaluation of WS approaches. It consists of 22 varied real-world datasets for classification and sequence tagging; a range of real, synthetic, and procedurally-generated weak supervision sources; and a modular, extensible framework for WS evaluation, including implementations for popular WS methods. We use WRENCH to conduct extensive comparisons over more than 100 method variants to demonstrate its efficacy as a benchmark platform. The code is available at \url{https://github.com/JieyuZ2/wrench}.
    wsGAT: Weighted and Signed Graph Attention Networks for Link Prediction. (arXiv:2109.11519v1 [cs.SI])
    (2 min) Graph Neural Networks (GNNs) have been widely used to learn representations on graphs and tackle many real-world problems from a wide range of domains. In this paper we propose wsGAT, an extension of the Graph Attention Network (GAT) layers, meant to address the lack of GNNs that can handle graphs with signed and weighted links, which are ubiquitous, for instance, in trust and correlation networks. We first evaluate the performance of our proposal by comparing against GCNII in the weighted link prediction task, and against SGCN in the link sign prediction task. After that, we combine the two tasks and show their performance on predicting the signed weight of links, and their existence. Our results on real-world networks show that models with wsGAT layers outperform the ones with GCNII and SGCN layers, and that there is no loss in performance when signed weights are predicted.
    LSTM Hyper-Parameter Selection for Malware Detection: Interaction Effects and Hierarchical Selection Approach. (arXiv:2109.11500v1 [cs.CR])
    (2 min) Long-Short-Term-Memory (LSTM) networks have shown great promise in artificial intelligence (AI) based language modeling. Recently, LSTM networks have also become popular for designing AI-based Intrusion Detection Systems (IDS). However, their applicability in IDS has been studied largely in the default settings used in language models, whereas security applications offer distinct conditions and hence warrant careful consideration when applying such recurrent networks. Therefore, we conducted one of the most exhaustive works on LSTM hyper-parameters for IDS and experimented with approx. 150 LSTM configurations to determine the relative importance of its hyper-parameters, their interaction effects, and an optimal selection approach for designing an IDS. We conducted multiple analyses of the results of these experiments and empirically controlled for the interaction effects of different hyper-parameter covariate levels. We found that for security applications, especially for designing an IDS, neither the relative importance applicable to language models is valid, nor is the standard linear method for hyper-parameter selection ideal. We ascertained that the interaction effect plays a crucial role in determining the relative importance of hyper-parameters. We also discovered that after controlling for the interaction effect, the correct relative importance for LSTMs for an IDS is batch-size, followed by dropout ratio and padding. The findings are significant because when LSTM was first used for language models, the focus had mostly been on increasing the number of layers to enhance performance.
    Robin Hood and Matthew Effects -- Differential Privacy Has Disparate Impact on Synthetic Data. (arXiv:2109.11429v1 [cs.LG])
    (2 min) Generative models trained using Differential Privacy (DP) are increasingly used to produce and share synthetic data in a privacy-friendly manner. In this paper, we set out to analyze the impact of DP on these models vis-a-vis underrepresented classes and subgroups of data. We do so from two angles: 1) the size of classes and subgroups in the synthetic data, and 2) classification accuracy on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our experiments, conducted using three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN), show that DP results in opposite size distributions in the generated synthetic data. More precisely, it affects the gap between the majority and minority classes and subgroups, either reducing it (a "Robin Hood" effect) or increasing it ("Matthew" effect). However, both of these size shifts lead to similar disparate impacts on a classifier's accuracy, affecting disproportionately more the underrepresented subparts of the data. As a result, we call for caution when analyzing or training a model on synthetic data, or risk treating different subpopulations unevenly, which might also lead to unreliable conclusions.
    An Evaluation of Anomaly Detection and Diagnosis in Multivariate Time Series. (arXiv:2109.11428v1 [cs.LG])
    (2 min) Several techniques for multivariate time series anomaly detection have been proposed recently, but a systematic comparison on a common set of datasets and metrics is lacking. This paper presents a systematic and comprehensive evaluation of unsupervised and semi-supervised deep-learning based methods for anomaly detection and diagnosis on multivariate time series data from cyberphysical systems. Unlike previous works, we vary the model and post-processing of model errors, i.e. the scoring functions independently of each other, through a grid of 10 models and 4 scoring functions, comparing these variants to state of the art methods. In time-series anomaly detection, detecting anomalous events is more important than detecting individual anomalous time-points. Through experiments, we find that the existing evaluation metrics either do not take events into account, or cannot distinguish between a good detector and trivial detectors, such as a random or an all-positive detector. We propose a new metric to overcome these drawbacks, namely, the composite F-score ($Fc_1$), for evaluating time-series anomaly detection. Our study highlights that dynamic scoring functions work much better than static ones for multivariate time series anomaly detection, and the choice of scoring functions often matters more than the choice of the underlying model. We also find that a simple, channel-wise model - the Univariate Fully-Connected Auto-Encoder, with the dynamic Gaussian scoring function emerges as a winning candidate for both anomaly detection and diagnosis, beating state of the art algorithms.
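    A plausible instantiation of such a composite metric (an assumption about its form, for illustration only) combines point-wise precision with event-wise recall, where an event is a contiguous run of anomalous ground-truth points:

        import numpy as np

        # Sketch of a composite F-score in the spirit described above (assumed form:
        # harmonic mean of point-wise precision and event-wise recall).
        def composite_f1(y_true, y_pred):
            y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
            tp = np.sum((y_true == 1) & (y_pred == 1))
            fp = np.sum((y_true == 0) & (y_pred == 1))
            precision = tp / (tp + fp) if tp + fp else 0.0
            # split ground truth into contiguous anomalous events
            edges = np.flatnonzero(np.diff(np.concatenate(([0], y_true, [0]))))
            events = list(zip(edges[::2], edges[1::2]))
            detected = sum(y_pred[s:e].any() for s, e in events)
            recall = detected / len(events) if events else 0.0
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        y_true = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
        y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
        print(composite_f1(y_true, y_pred))   # both events detected, with one false positive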
    Coded Computation across Shared Heterogeneous Workers with Communication Delay. (arXiv:2109.11246v1 [cs.DC])
    (2 min) Distributed computing enables large-scale computation tasks to be processed over multiple workers in parallel. However, the randomness of communication and computation delays across workers causes the straggler effect, which may degrade the performance. Coded computation helps to mitigate the straggler effect, but the amount of redundant load and their assignment to the workers should be carefully optimized. In this work, we consider a multi-master heterogeneous-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation. The goal is to minimize the communication plus computation delay of the slowest task. We propose worker assignment, resource allocation and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process the encoded tasks of either a single master or multiple masters, respectively. Then, the non-convex delay minimization problem is solved by employing the Markov's inequality-based approximation, Karush-Kuhn-Tucker conditions, and successive convex approximation methods. Through extensive simulations, we show that the proposed algorithms can reduce the task completion delay compared to the benchmarks, and observe that dedicated and fractional worker assignment policies have different scopes of applications.
    Named Entity Recognition and Classification on Historical Documents: A Survey. (arXiv:2109.11406v1 [cs.CL])
    (2 min) After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.
    Learning Dynamics from Noisy Measurements using Deep Learning with a Runge-Kutta Constraint. (arXiv:2109.11446v1 [cs.LG])
    (2 min) Measurement noise is an integral part of collecting data of a physical process. Thus, noise removal is a necessary step to draw conclusions from these data, and it often becomes quite essential to construct dynamical models using these data. We discuss a methodology to learn differential equation(s) using noisy and sparsely sampled measurements. In our methodology, the main innovation is the integration of deep neural networks with a classical numerical integration method. Precisely, we aim at learning a neural network that implicitly represents the data and an additional neural network that models the vector fields of the dependent variables. We combine these two networks by enforcing the constraint that the data at the next time-steps can be given by following a numerical integration scheme such as the fourth-order Runge-Kutta scheme. The proposed framework to learn a model predicting the vector field is highly effective under noisy measurements. The approach can handle scenarios where dependent variables are not available at the same temporal grid. We demonstrate the effectiveness of the proposed method for learning models using data obtained from various differential equations. The proposed approach provides a promising methodology to learn dynamic models, where the first-principle understanding remains opaque.
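    The constraint itself is easy to state in code. The sketch below is a simplified, assumption-laden version (it fits the noisy measurements directly instead of learning a separate denoised representation): a small vector-field network is trained so that one RK4 step from each measurement reproduces the next one.

        import torch
        import torch.nn as nn

        # Minimal sketch: a network f models dx/dt = f(x), and training enforces that an
        # RK4 step from the (noisy) state at time t reproduces the measurement at t + dt.
        f = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

        def rk4_step(f, x, dt):
            k1 = f(x)
            k2 = f(x + 0.5 * dt * k1)
            k3 = f(x + 0.5 * dt * k2)
            k4 = f(x + dt * k3)
            return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

        # noisy measurements of dx/dt = -x, sampled at a fixed dt
        dt, t = 0.1, torch.arange(0, 3, 0.1)
        x = torch.exp(-t).unsqueeze(1) + 0.01 * torch.randn(len(t), 1)

        opt = torch.optim.Adam(f.parameters(), lr=1e-2)
        for epoch in range(500):
            pred_next = rk4_step(f, x[:-1], dt)            # Runge-Kutta constraint
            loss = ((pred_next - x[1:]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        print("learned f(1.0) ~", f(torch.tensor([[1.0]])).item(), "(true value -1.0)")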
    A two-step machine learning approach for crop disease detection: an application of GAN and UAV technology. (arXiv:2109.11066v1 [cs.CV])
    (2 min) Automated plant diagnosis is a technology that promises large increases in cost-efficiency for agriculture. However, multiple problems reduce the effectiveness of drones, including the inverse relationship between resolution and speed and the lack of adequate labeled training data. This paper presents a two-step machine learning approach that analyzes low-fidelity and high-fidelity images in sequence, preserving efficiency as well as accuracy. Two data-generators are also used to minimize class imbalance in the high-fidelity dataset and to produce low-fidelity data that is representative of UAV images. The analysis of applications and methods is conducted on a database of high-fidelity apple tree images which are corrupted with class imbalance. The application begins by generating high-fidelity data using generative networks and then uses this novel data alongside the original high-fidelity data to produce low-fidelity images. A machine-learning identifier identifies plants and labels them as potentially diseased or not. A machine learning classifier is then given the potentially diseased plant images and returns actual diagnoses for these plants. The results show an accuracy of 96.3% for the high-fidelity system and a 75.5% confidence level for our low-fidelity system. Our drone technology shows promising results in accuracy when compared to labor-based methods of diagnosis.
    Stochastic Normalizing Flows for Inverse Problems: a Markov Chains Viewpoint. (arXiv:2109.11375v1 [cs.LG])
    (2 min) To overcome topological constraints and improve the expressiveness of normalizing flow architectures, Wu, Köhler and Noé introduced stochastic normalizing flows which combine deterministic, learnable flow transformations with stochastic sampling methods. In this paper, we consider stochastic normalizing flows from a Markov chain point of view. In particular, we replace transition densities by general Markov kernels and establish proofs via Radon-Nikodym derivatives, which allows us to incorporate distributions without densities in a sound way. Further, we generalize the results for sampling from posterior distributions as required in inverse problems. The performance of the proposed conditional stochastic normalizing flow is demonstrated by numerical examples.
    Programming and Training Rate-Independent Chemical Reaction Networks. (arXiv:2109.11422v1 [cs.LG])
    (2 min) Embedding computation in biochemical environments incompatible with traditional electronics is expected to have wide-ranging impact in synthetic biology, medicine, nanofabrication and other fields. Natural biochemical systems are typically modeled by chemical reaction networks (CRNs), and CRNs can be used as a specification language for synthetic chemical computation. In this paper, we identify a class of CRNs called non-competitive (NC) whose equilibria are absolutely robust to reaction rates and kinetic rate law, because their behavior is captured solely by their stoichiometric structure. Unlike prior work on rate-independent CRNs, checking non-competition and using it as a design criterion is easy and promises robust output. We also present a technique to program NC-CRNs using well-founded deep learning methods, showing a translation procedure from rectified linear unit (ReLU) neural networks to NC-CRNs. In the case of binary weight ReLU networks, our translation procedure is surprisingly tight in the sense that a single bimolecular reaction corresponds to a single ReLU node and vice versa. This compactness argues that neural networks may be a fitting paradigm for programming rate-independent chemical computation. As proof of principle, we demonstrate our scheme with numerical simulations of CRNs translated from neural networks trained on traditional machine learning datasets (IRIS and MNIST), as well as tasks better aligned with potential biological applications including virus detection and spatial pattern formation.
    Learning the noise fingerprint of quantum devices. (arXiv:2109.11405v1 [quant-ph])
    (2 min) Noise sources unavoidably affect any quantum technological device. Noise's main features are expected to strictly depend on the physical platform on which the quantum device is realized, in the form of a distinguishable fingerprint. Noise sources are also expected to evolve and change over time. Here, we first identify and then characterize experimentally the noise fingerprint of IBM cloud-available quantum computers, by resorting to machine learning techniques designed to classify noise distributions using time-ordered sequences of measured outcome probabilities.
    Multidimensional Scaling: Approximation and Complexity. (arXiv:2109.11505v1 [cs.LG])
    (2 min) Metric Multidimensional scaling (MDS) is a classical method for generating meaningful (non-linear) low-dimensional embeddings of high-dimensional data. MDS has a long history in the statistics, machine learning, and graph drawing communities. In particular, the Kamada-Kawai force-directed graph drawing method is equivalent to MDS and is one of the most popular ways in practice to embed graphs into low dimensions. Despite its ubiquity, our theoretical understanding of MDS remains limited as its objective function is highly non-convex. In this paper, we prove that minimizing the Kamada-Kawai objective is NP-hard and give a provable approximation algorithm for optimizing it, which in particular is a PTAS on low-diameter graphs. We supplement this result with experiments suggesting possible connections between our greedy approximation algorithm and gradient-based methods.
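    To make the objective concrete, below is a minimal sketch of metric MDS with the Kamada-Kawai stress, $\sum_{i<j} (\|x_i - x_j\| - d_{ij})^2 / d_{ij}^2$, minimized by plain gradient descent. The step size, initialization, and use of gradient descent (rather than the paper's greedy approximation algorithm) are illustrative assumptions, not the paper's method.

    ```python
    import numpy as np

    def kamada_kawai_stress(X, D):
        """Stress objective: sum_{i<j} (||x_i - x_j|| - D_ij)^2 / D_ij^2."""
        diff = X[:, None, :] - X[None, :, :]          # (n, n, dim) pairwise differences
        dist = np.linalg.norm(diff, axis=-1)          # pairwise embedding distances
        iu = np.triu_indices(len(X), k=1)
        return np.sum(((dist[iu] - D[iu]) / D[iu]) ** 2)

    def stress_gradient(X, D):
        """Gradient of the stress with respect to the embedding X."""
        n = len(X)
        diff = X[:, None, :] - X[None, :, :]
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n)   # avoid division by zero on the diagonal
        w = 1.0 / (D + np.eye(n)) ** 2                     # Kamada-Kawai weights 1/d_ij^2
        coef = 2.0 * w * (dist - D) / dist
        np.fill_diagonal(coef, 0.0)
        return (coef[:, :, None] * diff).sum(axis=1)

    def mds_gradient_descent(D, dim=2, steps=2000, lr=1e-3, seed=0):
        """Embed an (n, n) target-distance matrix D into `dim` dimensions."""
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(D.shape[0], dim))
        for _ in range(steps):
            X -= lr * stress_gradient(X, D)
        return X
    ```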
    Unbiased Loss Functions for Multilabel Classification with Missing Labels. (arXiv:2109.11282v1 [cs.LG])
    (2 min) This paper considers binary and multilabel classification problems in a setting where labels are missing independently and with a known rate. Missing labels are a ubiquitous phenomenon in extreme multi-label classification (XMC) tasks, such as matching Wikipedia articles to a small subset out of the hundreds of thousands of possible tags, where no human annotator can possibly check the validity of all the negative samples. For this reason, propensity-scored precision -- an unbiased estimate for precision-at-k under a known noise model -- has become one of the standard metrics in XMC. Few methods take this problem into account already during the training phase, and all are limited to loss functions that can be decomposed into a sum of contributions from each individual label. A typical approach to training is to reduce the multilabel problem into a series of binary or multiclass problems, and it has been shown that, if the surrogate task is to be consistent for optimizing recall, the resulting loss function is not decomposable over labels. Therefore, this paper derives the unique unbiased estimators for the different multilabel reductions, including the non-decomposable ones. These estimators suffer from increased variance and may lead to ill-posed optimization problems, which we address by switching to convex upper-bounds. The theoretical considerations are further supplemented by an experimental study showing that the switch to unbiased estimators significantly alters the bias-variance trade-off and may thus require stronger regularization, which in some cases can negate the benefits of unbiased estimation.
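    For reference, the propensity-scored precision-at-k mentioned above is straightforward to compute once per-label propensities $p_l$ are available. The sketch below shows the plain per-instance estimator only (array names are illustrative; the paper's training-time reductions and unbiased losses are not shown).

    ```python
    import numpy as np

    def psp_at_k(scores, y_true, propensities, k=5):
        """Propensity-scored precision@k for one instance.

        scores       : (L,) predicted relevance scores over L labels
        y_true       : (L,) observed binary labels (positives may be missing)
        propensities : (L,) probability p_l that a true positive of label l is observed
        """
        topk = np.argsort(-scores)[:k]
        # each observed positive is up-weighted by 1/p_l to correct for missing labels
        return float(np.sum(y_true[topk] / propensities[topk]) / k)
    ```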
    Fast Density Estimation for Density-based Clustering Methods. (arXiv:2109.11383v1 [cs.LG])
    (2 min) Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning, since they can deal with non-hyperspherical clusters and are robust to outliers. However, the runtime of density-based algorithms is heavily dominated by finding neighbors and calculating the density of each point, which is time-consuming. To address this issue, this paper proposes a density-based clustering framework using fast principal component analysis, which can be applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. By applying this clustering framework to the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, an improved DBSCAN (called IDBSCAN) is obtained, which preserves the advantages of DBSCAN while greatly reducing the computation of redundant distances. Experiments on five benchmark datasets demonstrate that the proposed IDBSCAN algorithm improves computational efficiency significantly.
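    A minimal sketch of the pruning idea follows: because orthogonal projection can only shrink distances, the distance between points projected onto a principal component is a lower bound on the true Euclidean distance, so a pair whose projected distance already exceeds eps cannot be eps-neighbors. This illustrates the principle only; the paper's IDBSCAN framework is more involved.

    ```python
    import numpy as np

    def eps_neighbors_with_pca_pruning(X, eps):
        """Range queries (as used by DBSCAN) with a PCA-based pruning step.

        The distance between projections onto the first principal component is a
        lower bound on the full Euclidean distance, so pairs whose projected
        distance already exceeds eps are skipped without a full distance computation.
        """
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions
        z = Xc @ Vt[0]                                     # 1-D projected coordinates

        n = len(X)
        neighbors = [[] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if abs(z[i] - z[j]) > eps:                 # cheap lower-bound check
                    continue
                if np.linalg.norm(X[i] - X[j]) <= eps:     # exact check only if needed
                    neighbors[i].append(j)
                    neighbors[j].append(i)
        return neighbors
    ```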
    A survey of Bayesian Network structure learning. (arXiv:2109.11415v1 [cs.LG])
    (2 min) Bayesian Networks (BNs) have become increasingly popular over the last few decades as a tool for reasoning under uncertainty in fields as diverse as medicine, biology, epidemiology, economics and the social sciences. This is especially true in real-world areas where we seek to answer complex questions based on hypothetical evidence to determine actions for intervention. However, determining the graphical structure of a BN remains a major challenge, especially when modelling a problem under causal assumptions. Solutions to this problem include the automated discovery of BN graphs from data, constructing them based on expert knowledge, or a combination of the two. This paper provides a comprehensive review of combinatoric algorithms proposed for learning BN structure from data, describing 61 algorithms including prototypical, well-established and state-of-the-art approaches. The basic approach of each algorithm is described in consistent terms, and the similarities and differences between them highlighted. Methods of evaluating algorithms and their comparative performance are discussed including the consistency of claims made in the literature. Approaches for dealing with data noise in real-world datasets and incorporating expert knowledge into the learning process are also covered.
    Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora. (arXiv:2109.11086v1 [cs.SD])
    (2 min) In this study, we propose to investigate triplet loss for the purpose of an alternative feature representation for ASR. We consider a general non-semantic speech representation, which is trained with a self-supervised criterion based on triplet loss, called TRILL, for acoustic modeling to represent the acoustic characteristics of each audio. This strategy is then applied to the CHiME-4 corpus and the CRSS-UTDallas Fearless Steps Corpus, with emphasis on the 100-hour challenge corpus which consists of 5 selected NASA Apollo-11 channels. An analysis of the extracted embeddings provides the foundation needed to characterize training utterances into distinct groups based on acoustic distinguishing properties. Moreover, we also demonstrate that triplet-loss-based embedding performs better than i-Vector in acoustic modeling, confirming that the triplet loss is more effective than a speaker feature. With additional techniques such as pronunciation and silence probability modeling, plus multi-style training, we achieve a +5.42% and +3.18% relative WER improvement for the development and evaluation sets of the Fearless Steps Corpus. To explore generalization, we further test the same technique on the 1 channel track of CHiME-4 and observe a +11.90% relative WER improvement for real test data.
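    For readers unfamiliar with the objective, the generic triplet loss referenced above is sketched below in PyTorch; TRILL's actual training setup (e.g., how positives and negatives are sampled from unlabeled audio) differs and is not shown.

    ```python
    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=1.0):
        """Standard triplet loss: pull matching embeddings together and push
        non-matching embeddings at least `margin` farther away than the positives."""
        d_pos = F.pairwise_distance(anchor, positive)
        d_neg = F.pairwise_distance(anchor, negative)
        return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
    ```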
    IReEn: Reverse-Engineering of Black-Box Functions via Iterative Neural Program Synthesis. (arXiv:2006.10720v2 [cs.LG] UPDATED)
    (2 min) In this work, we investigate the problem of revealing the functionality of a black-box agent. Notably, we are interested in an interpretable and formal description of the behavior of such an agent. Ideally, this description would take the form of a program written in a high-level language. This task is also known as reverse engineering and plays a pivotal role in software engineering and computer security, and most recently also in interpretability. In contrast to prior work, we do not rely on privileged information about the black box, but rather investigate the problem under the weaker assumption of having only access to inputs and outputs of the program. We approach this problem by iteratively refining a candidate set using a generative neural program synthesis approach until we arrive at a functionally equivalent program. We assess the performance of our approach on the Karel dataset. Our results show that the proposed approach outperforms the state-of-the-art on this challenge by finding an approximately functionally equivalent program in 78% of cases -- even exceeding prior work that had privileged information about the black box.
    On the equivalence of different adaptive batch size selection strategies for stochastic gradient descent methods. (arXiv:2109.10933v1 [math.OC])
    (2 min) In this study, we demonstrate that the norm test and inner product/orthogonality test presented in \cite{Bol18} are equivalent in terms of the convergence rates associated with Stochastic Gradient Descent (SGD) methods if $\epsilon^2=\theta^2+\nu^2$ with specific choices of $\theta$ and $\nu$. Here, $\epsilon$ controls the relative statistical error of the norm of the gradient while $\theta$ and $\nu$ control the relative statistical error of the gradient in the direction of the gradient and in the direction orthogonal to the gradient, respectively. Furthermore, we demonstrate that the inner product/orthogonality test can be as inexpensive as the norm test in the best case scenario if $\theta$ and $\nu$ are optimally selected, but the inner product/orthogonality test will never be more computationally affordable than the norm test if $\epsilon^2=\theta^2+\nu^2$. Finally, we present two stochastic optimization problems to illustrate our results.
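    As a rough illustration, the norm test can be approximated in practice from per-sample gradients within a batch: the trace of the sample covariance divided by the batch size is compared against $\epsilon^2$ times the squared norm of the batch gradient, and the batch is grown when the test fails. The estimator and growth rule below follow common practice and are not taken from the cited paper; `sample_grad_fn` is a hypothetical helper.

    ```python
    import numpy as np

    def norm_test_satisfied(per_sample_grads, eps):
        """Sample-based norm test for a batch of per-sample gradients.

        per_sample_grads : (b, d) array with one gradient per sample
        Approximates E||g_S - grad F||^2 <= eps^2 * ||grad F||^2 by comparing
        tr(sample covariance)/b against eps^2 * ||g_S||^2.
        """
        b = per_sample_grads.shape[0]
        g = per_sample_grads.mean(axis=0)
        var = per_sample_grads.var(axis=0, ddof=1).sum()   # trace of sample covariance
        return var / b <= eps ** 2 * np.dot(g, g)

    def adapt_batch_size(sample_grad_fn, batch_size, eps=0.5, growth=2, max_b=4096):
        """Grow the batch geometrically until the norm test passes (the hypothetical
        `sample_grad_fn(b)` returns a (b, d) array of per-sample gradients)."""
        grads = sample_grad_fn(batch_size)
        while not norm_test_satisfied(grads, eps) and batch_size < max_b:
            batch_size = min(growth * batch_size, max_b)
            grads = sample_grad_fn(batch_size)
        return batch_size
    ```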
    Energy efficient distributed analytics at the edge of the network for IoT environments. (arXiv:2109.11386v1 [cs.DC])
    (2 min) Due to the pervasive diffusion of personal mobile and IoT devices, many "smart environments" (e.g., smart cities and smart factories) will be generators of huge amounts of data. Currently, analysis of this data is typically achieved through centralised cloud-based services. However, according to many studies, this approach may present significant issues from the standpoint of data ownership, as well as wireless network capacity. In this paper, we exploit the fog computing paradigm to move computation close to where data is produced. We exploit a well-known distributed machine learning framework (Hypothesis Transfer Learning), and perform data analytics on mobile nodes passing by IoT devices, in addition to fog gateways at the edge of the network infrastructure. We analyse the performance of different configurations of the distributed learning framework, in terms of (i) accuracy obtained in the learning task and (ii) energy spent to send data between the involved nodes. Specifically, we consider reference wireless technologies for communication between the different types of nodes we consider, e.g. LTE, Nb-IoT, 802.15.4, 802.11, etc. Our results show that collecting data through the mobile nodes and executing the distributed analytics using short-range communication technologies, such as 802.15.4 and 802.11, reduces the energy consumption of the system by up to $94\%$, with a loss in accuracy of at most $2\%$ w.r.t. a centralised cloud solution.
    Federated Feature Selection for Cyber-Physical Systems of Systems. (arXiv:2109.11323v1 [cs.LG])
    (2 min) Autonomous systems generate a huge amount of multimodal data that are collected and processed on the Edge, in order to enable AI-based services. The collected datasets are pre-processed in order to extract informative attributes, called features, which are used to feed AI algorithms. Due to the limited computational and communication resources of some CPS, like autonomous vehicles, selecting the subset of relevant features from a dataset is of the utmost importance, in order to improve the result achieved by learning methods and to reduce computation and communication costs. Precisely, feature selection is the candidate approach, which assumes that data contain a certain number of redundant or irrelevant attributes that can be eliminated. In this work, we propose, for the first time, a federated feature selection method suitable for being executed in a distributed manner. The quality of our method is confirmed by the promising results achieved on two different data sets. Specifically, our results show that a fleet of autonomous vehicles finds a consensus on the optimal set of features that they exploit to reduce data transmission up to 99% with negligible information loss.
    Tactile Grasp Refinement using Deep Reinforcement Learning and Analytic Grasp Stability Metrics. (arXiv:2109.11234v1 [cs.RO])
    (2 min) Reward functions are at the heart of every reinforcement learning (RL) algorithm. In robotic grasping, rewards are often complex and manually engineered functions that do not rely on well-justified physical models from grasp analysis. This work demonstrates that analytic grasp stability metrics constitute powerful optimization objectives for RL algorithms that refine grasps on a three-fingered hand using only tactile and joint position information. We outperform a binary-reward baseline by 42.9% and find that a combination of geometric and force-agnostic grasp stability metrics yields the highest average success rates of 95.4% for cuboids, 93.1% for cylinders, and 62.3% for spheres across wrist position errors between 0 and 7 centimeters and rotational errors between 0 and 14 degrees. In a second experiment, we show that grasp refinement algorithms trained with contact feedback (contact positions, normals, and forces) perform up to 6.6% better than a baseline that receives no tactile information.
    Multi-view Contrastive Self-Supervised Learning of Accounting Data Representations for Downstream Audit Tasks. (arXiv:2109.11201v1 [cs.LG])
    (2 min) International audit standards require the direct assessment of a financial statement's underlying accounting transactions, referred to as journal entries. Recently, driven by the advances in artificial intelligence, deep learning inspired audit techniques have emerged in the field of auditing vast quantities of journal entry data. Nowadays, the majority of such methods rely on a set of specialized models, each trained for a particular audit task. At the same time, when conducting a financial statement audit, audit teams are confronted with (i) challenging time-budget constraints, (ii) extensive documentation obligations, and (iii) strict model interpretability requirements. As a result, auditors prefer to harness only a single preferably `multi-purpose' model throughout an audit engagement. We propose a contrastive self-supervised learning framework designed to learn audit task invariant accounting data representations to meet this requirement. The framework encompasses deliberate interacting data augmentation policies that utilize the attribute characteristics of journal entry data. We evaluate the framework on two real-world datasets of city payments and transfer the learned representations to three downstream audit tasks: anomaly detection, audit sampling, and audit documentation. Our experimental results provide empirical evidence that the proposed framework offers the ability to increase the efficiency of audits by learning rich and interpretable `multi-task' representations.
    Quantum algorithms for group convolution, cross-correlation, and equivariant transformations. (arXiv:2109.11330v1 [quant-ph])
    (2 min) Group convolutions and cross-correlations, which are equivariant to the actions of group elements, are commonly used in mathematics to analyze or take advantage of symmetries inherent in a given problem setting. Here, we provide efficient quantum algorithms for performing linear group convolutions and cross-correlations on data stored as quantum states. Runtimes for our algorithms are logarithmic in the dimension of the group thus offering an exponential speedup compared to classical algorithms when input data is provided as a quantum state and linear operations are well conditioned. Motivated by the rich literature on quantum algorithms for solving algebraic problems, our theoretical framework opens a path for quantizing many algorithms in machine learning and numerical methods that employ group operations.
    Arbitrary-Depth Universal Approximation Theorems for Operator Neural Networks. (arXiv:2109.11354v1 [cs.LG])
    (2 min) The standard Universal Approximation Theorem for operator neural networks (NNs) holds for arbitrary width and bounded depth. Here, we prove that operator NNs of bounded width and arbitrary depth are universal approximators for continuous nonlinear operators. In our main result, we prove that for non-polynomial activation functions that are continuously differentiable at a point with a nonzero derivative, one can construct an operator NN of width five, whose inputs are real numbers with finite decimal representations, that is arbitrarily close to any given continuous nonlinear operator. We derive an analogous result for non-affine polynomial activation functions. We also show that depth has theoretical advantages by constructing operator ReLU NNs of depth $2k^3+8$ and constant width that cannot be well-approximated by any operator ReLU NN of depth $k$, unless its width is exponential in $k$.
    Multi-resolution deep learning pipeline for dense large scale point clouds. (arXiv:2109.11311v1 [cs.CV])
    (2 min) Recent development of 3D sensors allows the acquisition of extremely dense 3D point clouds of large-scale scenes. The main challenge of processing such large point clouds remains the size of the data, which induces expensive computational and memory costs. In this context, the full resolution cloud is particularly hard to process, and the details it brings are rarely exploited. Although fine-grained details are important for the detection of small objects, they can alter the local geometry of large structural parts and mislead deep learning networks. In this paper, we introduce a new generic deep learning pipeline to exploit the full precision of large scale point clouds, but only for objects that require details. The core idea of our approach is to split up the process into multiple sub-networks which operate on different resolutions, each with its own specific classes to retrieve. Thus, the pipeline allows each class to benefit either from the noise and memory cost reduction of sub-sampling or from fine-grained details.
    Enhancing Navigational Safety in Crowded Environments using Semantic-Deep-Reinforcement-Learning-based Navigation. (arXiv:2109.11288v1 [cs.RO])
    (2 min) Intelligent navigation among social crowds is an essential aspect of mobile robotics for applications such as delivery, health care, or assistance. Deep Reinforcement Learning emerged as an alternative planning method to conservative approaches and promises more efficient and flexible navigation. However, in highly dynamic environments employing different kinds of obstacle classes, safe navigation still presents a grand challenge. In this paper, we propose a semantic Deep-reinforcement-learning-based navigation approach that teaches object-specific safety rules by considering high-level obstacle information. In particular, the agent learns object-specific behavior by contemplating the specific danger zones to enhance safety around vulnerable object classes. We tested the approach against a benchmark obstacle avoidance approach and found an increase in safety. Furthermore, we demonstrate that the agent could learn to navigate more safely by keeping an individual safety distance dependent on the semantic information.
    Active Learning for Argument Strength Estimation. (arXiv:2109.11319v1 [cs.LG])
    (2 min) High-quality arguments are an essential part of decision-making. Automatically predicting the quality of an argument is a complex task that recently got much attention in argument mining. However, the annotation effort for this task is exceptionally high. Therefore, we test uncertainty-based active learning (AL) methods on two popular argument-strength data sets to estimate whether sample-efficient learning can be enabled. Our extensive empirical evaluation shows that uncertainty-based acquisition functions cannot surpass the accuracy reached with random acquisition on these data sets.
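    For concreteness, the uncertainty-based acquisition functions compared against random acquisition are typically of the form sketched below (least confidence and predictive entropy); the exact functions and batch sizes used in the paper may differ.

    ```python
    import numpy as np

    def least_confidence(probs):
        """Uncertainty = 1 - probability of the most likely class."""
        return 1.0 - probs.max(axis=1)

    def predictive_entropy(probs, eps=1e-12):
        """Uncertainty = Shannon entropy of the predicted class distribution."""
        return -np.sum(probs * np.log(probs + eps), axis=1)

    def select_batch(probs, k, acquisition=predictive_entropy, rng=None):
        """Pick k pool indices by an acquisition function, or at random if None."""
        if acquisition is None:
            rng = rng or np.random.default_rng()
            return rng.choice(len(probs), size=k, replace=False)
        return np.argsort(-acquisition(probs))[:k]
    ```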
    Improved Evolutionary Generative Adversarial Networks. (arXiv:2109.11078v1 [cs.NE])
    (2 min) Evolutionary generative adversarial networks (E-GAN) try to alleviate the mode collapse and vanishing gradients that plague generative adversarial networks by introducing evolutionary computation. However, because E-GAN lacks a reasonable evaluation mechanism, it did not achieve its design purpose, and its evolutionary step contains only mutation operators but no crossover operator, which is equally common in evolutionary computation. In this paper, we first point out the shortcomings of the diversity fitness function of E-GAN and propose a new one. We then propose a universal crossover operator based on knowledge distillation, which can be widely applied to evolutionary GANs and complements the missing crossover variation of E-GAN. Incorporating the new fitness function and crossover operator, we design an evolutionary GAN framework named improved evolutionary generative adversarial networks (IE-GAN) and combine it with E-GAN in a complete algorithm implementation. Experiments on various datasets demonstrate the effectiveness of IE-GAN and show that our method is competitive in terms of the quality of generated samples and time efficiency.
    ChannelAugment: Improving generalization of multi-channel ASR by training with input channel randomization. (arXiv:2109.11225v1 [eess.AS])
    (2 min) End-to-end (E2E) multi-channel ASR systems show state-of-the-art performance in far-field ASR tasks by joint training of a multi-channel front-end along with the ASR model. The main limitation of such systems is that they are usually trained with data from a fixed array geometry, which can lead to degradation in accuracy when a different array is used in testing. This makes it challenging to deploy these systems in practice, as it is costly to retrain and deploy different models for various array configurations. To address this, we present a simple and effective data augmentation technique, which is based on randomly dropping channels in the multi-channel audio input during training, in order to improve the robustness to various array configurations at test time. We call this technique ChannelAugment, in contrast to SpecAugment (SA) which drops time and/or frequency components of a single channel input audio. We apply ChannelAugment to the Spatial Filtering (SF) and Minimum Variance Distortionless Response (MVDR) neural beamforming approaches. For SF, we observe 10.6% WER improvement across various array configurations employing different numbers of microphones. For MVDR, we achieve a 74% reduction in training time without causing degradation of recognition accuracy.
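    A minimal sketch of the channel-randomization idea is shown below, assuming channels are zeroed out independently during training and that at least one channel is always kept; the paper's exact dropping policy may differ.

    ```python
    import torch

    def channel_augment(x, drop_prob=0.5, min_keep=1):
        """Randomly zero out microphone channels of a multi-channel waveform batch.

        x : (batch, channels, time) tensor
        Each channel is dropped independently with probability drop_prob, while at
        least `min_keep` channels per utterance are always retained.
        """
        b, c, _ = x.shape
        keep = torch.rand(b, c, device=x.device) > drop_prob
        for i in range(b):
            if keep[i].sum() < min_keep:                    # guarantee min_keep channels
                idx = torch.randperm(c, device=x.device)[:min_keep]
                keep[i, idx] = True
        return x * keep[:, :, None].to(x.dtype)
    ```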
    A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels. (arXiv:2109.11126v1 [cs.LG])
    (2 min) In some problem spaces, the high cost of obtaining ground truth labels necessitates use of lower quality reference datasets. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference labels. We also introduce a procedure that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality. Creating an AGTR requires domain knowledge, and malware family classification is a task with robust domain knowledge approaches that support the construction of an AGTR. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool to diagnose over-fitting in prior testing and evaluate changes whose impact could not be meaningfully quantified under previous data.
    Zero-Shot Information Extraction as a Unified Text-to-Triple Translation. (arXiv:2109.11171v1 [cs.CL])
    (2 min) We cast a suite of information extraction tasks into a text-to-triple translation framework. Instead of solving each task relying on task-specific datasets and models, we formalize the task as a translation between task-specific input text and output triples. By taking the task-specific input, we enable a task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about the task. We further demonstrate that a simple pre-training task of predicting which relational information corresponds to which input text is an effective way to produce task-specific outputs. This enables the zero-shot transfer of our framework to downstream tasks. We study the zero-shot performance of this framework on open information extraction (OIE2016, NYT, WEB, PENN), relation classification (FewRel and TACRED), and factual probe (Google-RE and T-REx). The model transfers non-trivially to most tasks and is often competitive with a fully supervised method without the need for any task-specific training. For instance, we significantly outperform the F1 score of the supervised open information extraction without needing to use its training set.
    PredictionNet: Real-Time Joint Probabilistic Traffic Prediction for Planning, Control, and Simulation. (arXiv:2109.11094v1 [cs.RO])
    (2 min) Predicting the future motion of traffic agents is crucial for safe and efficient autonomous driving. To this end, we present PredictionNet, a deep neural network (DNN) that predicts the motion of all surrounding traffic agents together with the ego-vehicle's motion. All predictions are probabilistic and are represented in a simple top-down rasterization that allows an arbitrary number of agents. Conditioned on a multilayer map with lane information, the network outputs future positions, velocities, and backtrace vectors jointly for all agents including the ego-vehicle in a single pass. Trajectories are then extracted from the output. The network can be used to simulate realistic traffic, and it produces competitive results on popular benchmarks. More importantly, it has been used to successfully control a real-world vehicle for hundreds of kilometers, by combining it with a motion planning/control subsystem. The network runs faster than real-time on an embedded GPU, and the system shows good generalization (across sensory modalities and locations) due to the choice of input representation. Furthermore, we demonstrate that by extending the DNN with reinforcement learning (RL), it can better handle rare or unsafe events like aggressive maneuvers and crashes.
    Predicting the Timing of Camera Movements From the Kinematics of Instruments in Robotic-Assisted Surgery Using Artificial Neural Networks. (arXiv:2109.11192v1 [cs.LG])
    (2 min) Robotic-assisted surgeries benefit both surgeons and patients, however, surgeons frequently need to adjust the endoscopic camera to achieve good viewpoints. Simultaneously controlling the camera and the surgical instruments is impossible, and consequently, these camera adjustments repeatedly interrupt the surgery. Autonomous camera control could help overcome this challenge, but most existing systems are reactive, e.g., by having the camera follow the surgical instruments. We propose a predictive approach for anticipating when camera movements will occur using artificial neural networks. We used the kinematic data of the surgical instruments, which were recorded during robotic-assisted surgical training on porcine models. We split the data into segments, and labeled each either as a segment that immediately precedes a camera movement, or one that does not. Due to the large class imbalance, we trained an ensemble of networks, each on a balanced sub-set of the training data. We found that the instruments' kinematic data can be used to predict when camera movements will occur, and evaluated the performance on different segment durations and ensemble sizes. We also studied how much in advance an upcoming camera movement can be predicted, and found that predicting a camera movement 0.25, 0.5, and 1 second before they occurred achieved 98%, 94%, and 84% accuracy relative to the prediction of an imminent camera movement. This indicates that camera movement events can be predicted early enough to leave time for computing and executing an autonomous camera movement and suggests that an autonomous camera controller for RAMIS may one day be feasible.
    Secure PAC Bayesian Regression via Real Shamir Secret Sharing. (arXiv:2109.11200v1 [cs.LG])
    (2 min) A common approach in machine learning is to generate a model from a huge amount of training data so that it predicts test data instances as accurately as possible. Nonetheless, concerns about data privacy are increasingly raised, but not always addressed. We present a secure protocol for obtaining a linear model, relying on a recently described technique called real number secret sharing. We take as our starting point the PAC Bayesian bounds and deduce a closed form for the model parameters which depends on the data and the prior from the PAC Bayesian bounds. Obtaining the model parameters requires solving a linear system. However, we consider the situation where several parties hold different data instances and are not willing to give up the privacy of the data. Hence, we suggest using real number secret sharing and multiparty computation to share the data and solve the linear regression in a secure way without violating the privacy of data. We suggest two methods, an inverse method and a Gaussian elimination method, and compare these methods at the end.
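    To illustrate the key primitive named in the title, a toy sketch of Shamir-style secret sharing over the reals is given below: a secret is hidden as the constant term of a random polynomial, shares are point evaluations, and Lagrange interpolation at zero recovers it. The actual protocol additionally runs the linear solve (inverse or Gaussian elimination method) under multiparty computation, which is not shown, and the noise scale here is an arbitrary choice.

    ```python
    import numpy as np

    def share_real(secret, n_parties=3, threshold=2, scale=1.0, rng=None):
        """Shamir-style sharing of a real number: shares are evaluations of a
        random degree-(threshold-1) polynomial whose constant term is the secret."""
        rng = rng or np.random.default_rng()
        coeffs = np.concatenate([[secret], rng.normal(scale=scale, size=threshold - 1)])
        xs = np.arange(1, n_parties + 1, dtype=float)
        shares = np.polyval(coeffs[::-1], xs)          # f(1), ..., f(n)
        return list(zip(xs, shares))

    def reconstruct_real(shares):
        """Lagrange interpolation at x = 0 recovers the secret from >= threshold shares."""
        secret = 0.0
        for i, (xi, yi) in enumerate(shares):
            li = 1.0
            for j, (xj, _) in enumerate(shares):
                if j != i:
                    li *= (0.0 - xj) / (xi - xj)
            secret += yi * li
        return secret
    ```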
    Learning to Downsample for Segmentation of Ultra-High Resolution Images. (arXiv:2109.11071v1 [cs.CV])
    (2 min) Segmentation of ultra-high resolution images with deep learning is challenging because of their enormous size, often millions or even billions of pixels. Typical solutions drastically downsample the image uniformly to meet memory constraints, implicitly assuming all pixels are equally important by sampling at the same density at all spatial locations. However, this assumption does not hold and compromises the performance of deep learning techniques that have proved powerful on standard-sized images. For example, with uniform downsampling (see the green boxed region in Fig. 1), the rider and bike do not receive enough samples while the trees and buildings are oversampled, which degrades the segmentation prediction obtained from the low-resolution downsampled image. In this work we show that learning the spatially varying downsampling strategy jointly with segmentation offers advantages in segmenting large images with a limited computational budget. Fig. 1 shows that our method adapts the sampling density over different locations so that more samples are collected from the small important regions and fewer from the others, which in turn leads to better segmentation accuracy. We show on two public and one local high-resolution datasets that our method consistently learns sampling locations preserving more information and boosting segmentation accuracy over baseline methods.
    Rank Overspecified Robust Matrix Recovery: Subgradient Method and Exact Recovery. (arXiv:2109.11154v1 [math.OC])
    (2 min) We study the robust recovery of a low-rank matrix from sparsely and grossly corrupted Gaussian measurements, with no prior knowledge on the intrinsic rank. We consider the robust matrix factorization approach. We employ a robust $\ell_1$ loss function and deal with the challenge of the unknown rank by using an overspecified factored representation of the matrix variable. We then solve the associated nonconvex nonsmooth problem using a subgradient method with diminishing stepsizes. We show that under a regularity condition on the sensing matrices and corruption, which we call restricted direction preserving property (RDPP), even with rank overspecified, the subgradient method converges to the exact low-rank solution at a sublinear rate. Moreover, our result is more general in the sense that it automatically speeds up to a linear rate once the factor rank matches the unknown rank. On the other hand, we show that the RDPP condition holds under generic settings, such as Gaussian measurements under independent or adversarial sparse corruptions, where the result could be of independent interest. Both the exact recovery and the convergence rate of the proposed subgradient method are numerically verified in the overspecified regime. Moreover, our experiment further shows that our particular design of diminishing stepsize effectively prevents overfitting for robust recovery under overparameterized models, such as robust matrix sensing and learning robust deep image prior. This regularization effect is worth further investigation.
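    A minimal sketch of the factored $\ell_1$ formulation and its subgradient iteration is given below; the initialization scale and the geometrically decaying stepsize schedule are illustrative choices rather than the paper's exact settings.

    ```python
    import numpy as np

    def robust_matrix_recovery(A, y, r_over, steps=500, eta0=1.0, q=0.98, seed=0):
        """Subgradient method for the l1-loss factored formulation
            min_{L,R} (1/m) * sum_i | y_i - <A_i, L R^T> |
        with an overspecified factor rank r_over and geometrically decaying steps.

        A : (m, d1, d2) sensing matrices,  y : (m,) possibly corrupted measurements
        """
        m, d1, d2 = A.shape
        rng = np.random.default_rng(seed)
        L = rng.normal(size=(d1, r_over)) / np.sqrt(d1)
        R = rng.normal(size=(d2, r_over)) / np.sqrt(d2)
        eta = eta0
        for _ in range(steps):
            res = y - np.einsum('mij,ij->m', A, L @ R.T)    # residuals y_i - <A_i, LR^T>
            s = np.sign(res)
            gL = -np.einsum('m,mij,jr->ir', s, A, R) / m    # subgradient w.r.t. L
            gR = -np.einsum('m,mij,ir->jr', s, A, L) / m    # subgradient w.r.t. R
            L -= eta * gL
            R -= eta * gR
            eta *= q                                        # diminishing stepsize
        return L @ R.T
    ```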
    On Bonus-Based Exploration Methods in the Arcade Learning Environment. (arXiv:2109.11052v1 [cs.LG])
    (2 min) Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-based exploration methods within a common evaluation framework. We combine Rainbow (Hessel et al., 2018) with different exploration bonuses and evaluate its performance on Montezuma's Revenge, Bellemare et al.'s set of hard exploration games with sparse rewards, and the whole Atari 2600 suite. We find that while exploration bonuses lead to higher scores on Montezuma's Revenge they do not provide meaningful gains over the simpler $\epsilon$-greedy scheme. In fact, we find that methods that perform best on that game often underperform $\epsilon$-greedy on easy exploration Atari 2600 games. We find that our conclusions remain valid even when hyperparameters are tuned for these easy-exploration games. Finally, we find that none of the methods surveyed benefit from additional training samples (1 billion frames, versus Rainbow's 200 million) on Bellemare et al.'s hard exploration games. Our results suggest that recent gains in Montezuma's Revenge may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.
    Predicting Stress in Remote Learning via Advanced Deep Learning Technologies. (arXiv:2109.11076v1 [cs.LG])
    (2 min) COVID-19 has driven most schools to remote learning through online meeting software such as Zoom and Google Meet. Although this trend helps students continue learning without in-person classes, it removes a vital tool that teachers use to teach effectively: visual cues. By not being able to see a student's face clearly, the teacher may not notice when the student needs assistance, or when the student is not paying attention. To help teachers address this challenge, this project proposes a machine learning based approach that provides real-time student mental state monitoring and classifications for the teachers to better conduct remote teaching. Using publicly available electroencephalogram (EEG) data collections, this research explored four different classification techniques: the classic deep neural network, the traditionally popular support vector machine, the latest convolutional neural network, and the XGBoost model, which has gained popularity recently. This study defined three mental classes: an engaged learning mode, a confused learning mode, and a relaxed mode. The experimental results from this project showed that these selected classifiers have varying potentials in classifying EEG signals for mental states. While some of the selected classifiers only yield around 50% accuracy with some delay, the best ones can achieve 80% accurate classification in real-time. This could be very beneficial for teachers in need of help making remote teaching adjustments, and for many other potential applications where in-person interactions are not possible.
    Automated Feature-Topic Pairing: Aligning Semantic and Embedding Spaces in Spatial Representation Learning. (arXiv:2109.11053v1 [cs.LG])
    (2 min) Automated characterization of spatial data is a critical form of geographical intelligence. As an emerging technique for characterization, Spatial Representation Learning (SRL) uses deep neural networks (DNNs) to learn non-linear embedded features of spatial data for characterization. However, SRL extracts features through the internal layers of DNNs, and thus suffers from a lack of semantic labels. Texts of spatial entities, on the other hand, provide semantic understanding of latent feature labels, but are not directly accessible to deep SRL models. How can we teach an SRL model to discover appropriate topic labels in texts and pair learned features with the labels? This paper formulates a new problem: feature-topic pairing, and proposes a novel Particle Swarm Optimization (PSO) based deep learning framework. Specifically, we formulate the feature-topic pairing problem as an automated alignment task between 1) a latent embedding feature space and 2) a textual semantic topic space. We decompose the alignment of the two spaces into: 1) point-wise alignment, denoting the correlation between a topic distribution and an embedding vector; 2) pair-wise alignment, denoting the consistency between a feature-feature similarity matrix and a topic-topic similarity matrix. We design a PSO based solver to simultaneously select an optimal set of topics and learn corresponding features based on the selected topics. We develop a closed loop algorithm to iterate between 1) minimizing losses of representation reconstruction and feature-topic alignment and 2) searching for the best topics. Finally, we present extensive experiments to demonstrate the enhanced performance of our method.
    A Novel Factor Graph-Based Optimization Technique for Stereo Correspondence Estimation. (arXiv:2109.11077v1 [cs.CV])
    (2 min) Dense disparities among multiple views are essential for estimating the 3D architecture of a scene based on the geometrical relationship among the scene and the views or cameras. Scenes with large extents of heterogeneous textures, scene illumination that differs among the multiple views, and occluding objects affect the accuracy of the estimated disparities. Markov random field (MRF) based methods for disparity estimation address these limitations using spatial dependencies among the observations and among the disparity estimates. These methods, however, are limited by spatially fixed and smaller neighborhood systems or cliques. In this work, we present a new factor graph-based probabilistic graphical model for disparity estimation that allows a larger and a spatially variable neighborhood structure determined based on the local scene characteristics. We evaluated our method using the Middlebury benchmark stereo datasets and the Middlebury evaluation dataset version 3.0 and compared its performance with recent state-of-the-art disparity estimation algorithms. The new factor graph-based method provided disparity estimates with higher accuracy when compared to the recent non-learning- and learning-based disparity estimation algorithms. In addition to disparity estimation, our factor graph formulation can be useful for obtaining maximum a posteriori solutions to optimization problems with complex and variable dependency structures as well as for other dense estimation problems such as optical flow estimation.
    Robust Generalization of Quadratic Neural Networks via Function Identification. (arXiv:2109.10935v1 [cs.LG])
    (2 min) A key challenge facing deep learning is that neural networks are often not robust to shifts in the underlying data distribution. We study this problem from the perspective of the statistical concept of parameter identification. Generalization bounds from learning theory often assume that the test distribution is close to the training distribution. In contrast, if we can identify the "true" parameters, then the model generalizes to arbitrary distribution shifts. However, neural networks are typically overparameterized, making parameter identification impossible. We show that for quadratic neural networks, we can identify the function represented by the model even though we cannot identify its parameters. Thus, we can obtain robust generalization bounds even in the overparameterized setting. We leverage this result to obtain new bounds for contextual bandits and transfer learning with quadratic neural networks. Overall, our results suggest that we can improve robustness of neural networks by designing models that can represent the true data generating process. In practice, the true data generating process is often very complex; thus, we study how our framework might connect to neural module networks, which are designed to break down complex tasks into compositions of simpler ones. We prove robust generalization bounds when individual neural modules are identifiable.
    Exploring Decomposition for Table-based Fact Verification. (arXiv:2109.11020v1 [cs.CL])
    (2 min) Fact verification based on structured data is challenging as it requires models to understand both natural language and symbolic operations performed over tables. Although pre-trained language models have demonstrated a strong capability in verifying simple statements, they struggle with complex statements that involve multiple operations. In this paper, we improve fact verification by decomposing complex statements into simpler subproblems. Leveraging the programs synthesized by a weakly supervised semantic parser, we propose a program-guided approach to constructing a pseudo dataset for decomposition model training. The subproblems, together with their predicted answers, serve as the intermediate evidence to enhance our fact verification model. Experiments show that our proposed approach achieves the new state-of-the-art performance, an 82.7\% accuracy, on the TabFact benchmark.
    On Optimal Robustness to Adversarial Corruption in Online Decision Problems. (arXiv:2109.10963v1 [stat.ML])
    (2 min) This paper considers two fundamental sequential decision-making problems: the problem of prediction with expert advice and the multi-armed bandit problem. We focus on stochastic regimes in which an adversary may corrupt losses, and we investigate what level of robustness can be achieved against adversarial corruptions. The main contribution of this paper is to show that optimal robustness can be expressed by a square-root dependency on the amount of corruption. More precisely, we show that two classes of algorithms, anytime Hedge with decreasing learning rate and algorithms with second-order regret bounds, achieve $O( \frac{\log N}{\Delta} + \sqrt{ \frac{C \log N }{\Delta} } )$-regret, where $N, \Delta$, and $C$ represent the number of experts, the gap parameter, and the corruption level, respectively. We further provide a matching lower bound, which means that this regret bound is tight up to a constant factor. For the multi-armed bandit problem, we also provide a nearly tight lower bound up to a logarithmic factor.
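    For concreteness, one of the two algorithm classes mentioned, anytime Hedge with a decreasing learning rate, can be sketched as follows; the particular schedule $\eta_t = \sqrt{\log N / t}$ is a standard anytime choice and may differ from the constants used in the paper.

    ```python
    import numpy as np

    def anytime_hedge(loss_stream, n_experts):
        """Anytime Hedge with a decreasing learning rate eta_t ~ sqrt(log N / t).

        loss_stream : iterable yielding a length-N vector of expert losses in [0, 1] per round
        Returns the sequence of probability distributions played over the experts.
        """
        cum_loss = np.zeros(n_experts)
        played = []
        for t, losses in enumerate(loss_stream, start=1):
            eta_t = np.sqrt(np.log(n_experts) / t)
            w = np.exp(-eta_t * (cum_loss - cum_loss.min()))   # shift for numerical stability
            played.append(w / w.sum())
            cum_loss += np.asarray(losses)
        return played
    ```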
    Multi-Objective Bayesian Optimization over High-Dimensional Search Spaces. (arXiv:2109.10964v1 [cs.LG])
    (2 min) The ability to optimize multiple competing objective functions with high sample efficiency is imperative in many applied problems across science and industry. Multi-objective Bayesian optimization (BO) achieves strong empirical performance on such problems, but even with recent methodological advances, it has been restricted to simple, low-dimensional domains. Most existing BO methods exhibit poor performance on search spaces with more than a few dozen parameters. In this work we propose MORBO, a method for multi-objective Bayesian optimization over high-dimensional search spaces. MORBO performs local Bayesian optimization within multiple trust regions simultaneously, allowing it to explore and identify diverse solutions even when the objective functions are difficult to model globally. We show that MORBO significantly advances the state-of-the-art in sample-efficiency for several high-dimensional synthetic and real-world multi-objective problems, including a vehicle design problem with 222 parameters, demonstrating that MORBO is a practical approach for challenging and important problems that were previously out of reach for BO methods.
    Quantile-based fuzzy C-means clustering of multivariate time series: Robust techniques. (arXiv:2109.11027v1 [stat.ME])
    (2 min) Three robust methods for clustering multivariate time series from the point of view of generating processes are proposed. The procedures are robust versions of a fuzzy C-means model based on: (i) estimates of the quantile cross-spectral density and (ii) the classical principal component analysis. Robustness to the presence of outliers is achieved by using the so-called metric, noise and trimmed approaches. The metric approach incorporates in the objective function a distance measure aimed at neutralizing the effect of the outliers, the noise approach builds an artificial cluster expected to contain the outlying series and the trimmed approach eliminates the most atypical series in the dataset. All the proposed techniques inherit the nice properties of the quantile cross-spectral density, as being able to uncover general types of dependence. Results from a broad simulation study including multivariate linear, nonlinear and GARCH processes indicate that the algorithms are substantially effective in coping with the presence of outlying series (i.e., series exhibiting a dependence structure different from that of the majority), clearly outperforming alternative procedures. The usefulness of the suggested methods is highlighted by means of two specific applications regarding financial and environmental series.
    Security Analysis of Capsule Network Inference using Horizontal Collaboration. (arXiv:2109.11041v1 [cs.LG])
    (2 min) Traditional convolutional neural networks (CNNs) have several drawbacks, like the Picasso effect and the loss of information in the pooling layer. The Capsule network (CapsNet) was proposed to address these challenges because its architecture can encode and preserve the spatial orientation of input images. Similar to traditional CNNs, CapsNet is also vulnerable to several malicious attacks, as studied by several researchers in the literature. However, most of these studies focus on single-device-based inference, but horizontally collaborative inference in state-of-the-art systems, like intelligent edge services in self-driving cars, voice controllable systems, and drones, nullifies most of these analyses. Horizontal collaboration implies partitioning the trained CNN models or CNN tasks across multiple end devices or edge nodes. Therefore, it is imperative to examine the robustness of the CapsNet against malicious attacks when deployed in horizontally collaborative environments. Towards this, we examine the robustness of the CapsNet when subjected to noise-based inference attacks in a horizontal collaborative environment. In this analysis, we perturbed the feature maps of the different layers of four DNN models, i.e., CapsNet, Mini-VGG, LeNet, and an in-house designed CNN (ConvNet) with the same number of parameters as CapsNet, using two types of noise-based attacks, i.e., the Gaussian noise attack and the FGSM noise attack. The experimental results show that, similar to the traditional CNNs, depending upon the attacker's access to the DNN layer, the classification accuracy of the CapsNet drops significantly. For example, when the Gaussian noise attack is performed at the DigitCap layer of the CapsNet, the maximum classification accuracy drop is approximately 97%.
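    The noise-based perturbation of intermediate feature maps can be emulated with a forward hook on the layer at the partition point, as in the sketch below; the layer name and evaluation helper are hypothetical, and the FGSM variant (which requires gradients) is not shown.

    ```python
    import torch

    def add_gaussian_noise_hook(module, sigma=0.5):
        """Register a forward hook that perturbs a layer's output feature maps,
        emulating an attacker sitting between two collaborating edge devices."""
        def hook(mod, inputs, output):
            return output + sigma * torch.randn_like(output)
        return module.register_forward_hook(hook)

    # usage sketch (names are hypothetical): perturb one partition point, then evaluate
    # handle = add_gaussian_noise_hook(model.layer_at_partition, sigma=0.5)
    # acc_under_attack = evaluate(model, test_loader)
    # handle.remove()
    ```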
    The CAMELS Multifield Dataset: Learning the Universe's Fundamental Parameters with Artificial Intelligence. (arXiv:2109.10915v1 [cs.LG])
    (2 min) We present the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) Multifield Dataset, CMD, a collection of hundreds of thousands of 2D maps and 3D grids containing many different properties of cosmic gas, dark matter, and stars from 2,000 distinct simulated universes at several cosmic times. The 2D maps and 3D grids represent cosmic regions that span $\sim$100 million light years and have been generated from thousands of state-of-the-art hydrodynamic and gravity-only N-body simulations from the CAMELS project. Designed to train machine learning models, CMD is the largest dataset of its kind containing more than 70 Terabytes of data. In this paper we describe CMD in detail and outline a few of its applications. We focus our attention on one such task, parameter inference, formulating the problems we face as a challenge to the community. We release all data and provide further technical details at https://camels-multifield-dataset.readthedocs.io.
    Mixed-supervised segmentation: Confidence maximization helps knowledge distillation. (arXiv:2109.10902v1 [eess.IV])
    (2 min) Despite achieving promising results in a breadth of medical image segmentation tasks, deep neural networks require large training datasets with pixel-wise annotations. Obtaining these curated datasets is a cumbersome process which limits the application in scenarios where annotated images are scarce. Mixed supervision is an appealing alternative for mitigating this obstacle, where only a small fraction of the data contains complete pixel-wise annotations and other images have a weaker form of supervision. In this work, we propose a dual-branch architecture, where the upper branch (teacher) receives strong annotations, while the bottom one (student) is driven by limited supervision and guided by the upper branch. Combined with a standard cross-entropy loss over the labeled pixels, our novel formulation integrates two important terms: (i) a Shannon entropy loss defined over the less-supervised images, which encourages confident student predictions in the bottom branch; and (ii) a Kullback-Leibler (KL) divergence term, which transfers the knowledge of the strongly supervised branch to the less-supervised branch and guides the entropy (student-confidence) term to avoid trivial solutions. We show that the synergy between the entropy and KL divergence yields substantial improvements in performance. We also discuss an interesting link between Shannon-entropy minimization and standard pseudo-mask generation, and argue that the former should be preferred over the latter for leveraging information from unlabeled pixels. Quantitative and qualitative results on two publicly available datasets demonstrate that our method significantly outperforms other strategies for semantic segmentation within a mixed-supervision framework, as well as recent semi-supervised approaches. Moreover, we show that the branch trained with reduced supervision and guided by the top branch largely outperforms the latter.
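    A minimal sketch of the three-term objective described above (cross-entropy on strongly labeled pixels, Shannon entropy on the weakly supervised images, and a teacher-to-student KL term) is given below; the loss weights are illustrative, not the paper's values.

    ```python
    import torch
    import torch.nn.functional as F

    def mixed_supervision_loss(student_logits_l, labels_l,
                               student_logits_u, teacher_logits_u,
                               lambda_ent=0.1, lambda_kl=1.0):
        """Dual-branch objective sketch for mixed-supervision segmentation."""
        # (i) standard cross-entropy over pixels with full annotations
        ce = F.cross_entropy(student_logits_l, labels_l)

        # (ii) Shannon entropy of the student's pixel-wise predictions (confidence term)
        p_s = F.softmax(student_logits_u, dim=1)
        entropy = -(p_s * torch.log(p_s.clamp_min(1e-8))).sum(dim=1).mean()

        # (iii) KL divergence transferring the teacher's knowledge to the student branch
        log_p_s = F.log_softmax(student_logits_u, dim=1)
        p_t = F.softmax(teacher_logits_u.detach(), dim=1)
        kl = F.kl_div(log_p_s, p_t, reduction='batchmean')

        return ce + lambda_ent * entropy + lambda_kl * kl
    ```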

2021-09-23

  • cs.CL updates on arXiv.org

    Simulated Annealing for Emotional Dialogue Systems. (arXiv:2109.10715v1 [cs.CL])
    (0 min) Explicitly modeling emotions in dialogue generation has important applications, such as building empathetic personal companions. In this study, we consider the task of expressing a specific emotion for dialogue generation. Previous approaches take the emotion as an input signal, which may be ignored during inference. We instead propose a search-based emotional dialogue system by simulated annealing (SA). Specifically, we first define a scoring function that combines contextual coherence and emotional correctness. Then, SA iteratively edits a general response and searches for a sentence with a higher score, enforcing the presence of the desired emotion. We evaluate our system on the NLPCC2017 dataset. Our proposed method shows 12% improvements in emotion accuracy compared with the previous state-of-the-art method, without hurting the generation quality (measured by BLEU).
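    The search procedure can be sketched as a generic simulated annealing loop over candidate responses, as below; the scoring function combining contextual coherence and emotional correctness and the word-level edit proposals are placeholders for the paper's actual components.

    ```python
    import math
    import random

    def simulated_annealing(initial_response, score_fn, propose_edit,
                            steps=200, t0=1.0, cooling=0.97, seed=0):
        """Generic SA search over candidate responses.

        score_fn     : response -> float combining coherence and emotion correctness
        propose_edit : response -> edited response (e.g., insert/replace/delete a word)
        Both arguments are placeholders standing in for the paper's components.
        """
        rng = random.Random(seed)
        current, current_score = initial_response, score_fn(initial_response)
        temperature = t0
        for _ in range(steps):
            candidate = propose_edit(current)
            cand_score = score_fn(candidate)
            delta = cand_score - current_score
            # accept improvements always, and worse edits with probability exp(delta / T)
            if delta >= 0 or rng.random() < math.exp(delta / max(temperature, 1e-8)):
                current, current_score = candidate, cand_score
            temperature *= cooling
        return current
    ```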
    Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach. (arXiv:2102.10242v3 [cs.CL] UPDATED)
    (0 min) Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only show a very weak correlation with the actual human evaluation in practice. To bridge such a gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances of off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
    GIPFA: Generating IPA Pronunciation from Audio. (arXiv:2006.07573v2 [cs.CL] UPDATED)
    (0 min) Transcribing spoken audio samples into the International Phonetic Alphabet (IPA) has long been reserved for experts. In this study, we examine the use of an Artificial Neural Network (ANN) model to automatically extract the IPA phonemic pronunciation of a word based on its audio pronunciation, hence its name Generating IPA Pronunciation From Audio (GIPFA). Based on the French Wikimedia dictionary, we trained our model which then correctly predicted 75% of the IPA pronunciations tested. Interestingly, by studying inference errors, the model made it possible to highlight possible errors in the dataset as well as to identify the closest phonemes in French.
    Multi-task Learning with Cross Attention for Keyword Spotting. (arXiv:2107.07634v2 [eess.AS] UPDATED)
    (0 min) Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase. Although a phoneme classifier can be used for KWS, exploiting a large amount of transcribed data for automatic speech recognition (ASR), there is a mismatch between the training criterion (phoneme recognition) and the target task (KWS). Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data. In this approach, an output of an acoustic model is split into two branches for the two tasks, one for phoneme transcription trained with the ASR data and one for keyword classification trained with the KWS data. In this paper, we introduce a cross attention decoder in the multi-task learning framework. Unlike the conventional multi-task learning approach with the simple split of the output layer, the cross attention decoder summarizes information from a phonetic encoder by performing cross attention between the encoder outputs and a trainable query sequence to predict a confidence score for the KWS task. Experimental results on KWS tasks show that the proposed approach achieves a 12% relative reduction in the false reject ratios compared to the conventional multi-task learning with split branches and a bi-directional long short-term memory decoder.
    Deep Sound Change: Deep and Iterative Learning, Convolutional Neural Networks, and Language Change. (arXiv:2011.05463v2 [cs.CL] UPDATED)
    (0 min) This paper proposes a framework for modeling sound change that combines deep learning and iterative learning. Acquisition and transmission of speech is modeled by training generations of Generative Adversarial Networks (GANs) on unannotated raw speech data. The paper argues that several properties of sound change emerge from the proposed architecture. GANs (Goodfellow et al. 2014 arXiv:1406.2661, Donahue et al. 2019 arXiv:1705.07904) are uniquely appropriate for modeling language change because the networks are trained on raw unsupervised acoustic data, contain no language-specific features and, as argued in Begu\v{s} (2020 arXiv:2006.03965), encode phonetic and phonological representations in their latent space and generate linguistically informative innovative data. The first generation of networks is trained on the relevant sequences in human speech from TIMIT. The subsequent generations are not trained on TIMIT, but on generated outputs from the previous generation and thus start learning from each other in an iterative learning task. The initial allophonic distribution is progressively being lost with each generation, likely due to pressures from the global distribution of aspiration in the training data. The networks show signs of a gradual shift in phonetic targets characteristic of a gradual phonetic sound change. At endpoints, the outputs superficially resemble a phonological change -- rule loss.
    An Attention Free Transformer. (arXiv:2105.14103v2 [cs.LG] UPDATED)
    (0 min) We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible to both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the idea of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.
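    The element-wise operation described above is compact enough to sketch directly. Below is a minimal, naive PyTorch rendering of an AFT-full-style layer, written from the abstract's description rather than the authors' implementation; it materializes the full position-bias tensor instead of using the memory-efficient reordering, and all sizes are illustrative.

    ```python
    import torch
    import torch.nn as nn

    class AFTFull(nn.Module):
        """Naive AFT-full-style layer: no dot-product attention, only learned
        position biases, a softmax over source positions, and a sigmoid query gate."""

        def __init__(self, dim: int, max_len: int):
            super().__init__()
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            # learned pairwise position biases w[t, t'], shared across feature dims
            self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))

        def forward(self, x):                                      # x: (B, T, dim)
            B, T, _ = x.shape
            q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
            w = self.pos_bias[:T, :T].unsqueeze(0).unsqueeze(-1)   # (1, T, T, 1)
            logits = k.unsqueeze(1) + w                            # (B, T, T, dim)
            weights = torch.softmax(logits, dim=2)                 # normalise over source positions t'
            ctx = (weights * v.unsqueeze(1)).sum(dim=2)            # (B, T, dim)
            return torch.sigmoid(q) * ctx                          # element-wise query gate

    layer = AFTFull(dim=64, max_len=128)
    print(layer(torch.randn(2, 16, 64)).shape)                     # torch.Size([2, 16, 64])
    ```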
    Probing Classifiers: Promises, Shortcomings, and Advances. (arXiv:2102.12452v4 [cs.CL] UPDATED)
    (0 min) Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple -- a classifier is trained to predict some linguistic property from a model's representations -- and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This article critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances.
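    In practice, a probing classifier is little more than a shallow model fit on frozen representations. A minimal sketch follows, with random arrays standing in for real hidden states and hypothetical part-of-speech-style labels; the split and classifier choice are illustrative defaults.

    ```python
    # Probe: predict a linguistic property from frozen model representations.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def probe(representations: np.ndarray, labels: np.ndarray) -> float:
        """representations: (n_tokens, hidden_dim) frozen embeddings; labels: (n_tokens,)."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            representations, labels, test_size=0.2, random_state=0
        )
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return accuracy_score(y_te, clf.predict(X_te))

    # Toy usage: random vectors stand in for real hidden states.
    reps = np.random.randn(1000, 768)
    tags = np.random.randint(0, 5, size=1000)
    print(f"probe accuracy: {probe(reps, tags):.3f}")
    ```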
    Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. (arXiv:2109.10686v1 [cs.CL])
    (0 min) There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which has both financial and environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear whether these findings transfer to downstream tasks within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that, aside from model size alone, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3) widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50\% fewer parameters and training 40\% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
    Coarse2Fine: Fine-grained Text Classification on Coarsely-grained Annotated Data. (arXiv:2109.10856v1 [cs.CL])
    (0 min) Existing text classification methods mainly focus on a fixed label set, whereas many real-world applications require extending to new fine-grained classes as the number of samples per label increases. To accommodate such requirements, we introduce a new problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave rich pre-trained generative language models into the iterative weak supervision strategy. Specifically, we first propose a label-conditioned finetuning formulation to attune these generators for our task. Furthermore, we devise a regularization objective based on the coarse-fine label constraints derived from our problem setting, giving us even further improvements over the prior formulation. Our framework uses the fine-tuned generative models to sample pseudo-training data for training the classifier, and bootstraps on real unlabeled data for model refinement. Extensive experiments and case studies on two real-world datasets demonstrate superior performance over SOTA zero-shot classification baselines.
    Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures. (arXiv:2007.08970v3 [cs.CL] UPDATED)
    (0 min) While mainstream machine learning methods are known to have limited ability to compositionally generalize, new architectures and techniques continue to be proposed to address this limitation. We investigate state-of-the-art techniques and architectures in order to assess their effectiveness in improving compositional generalization in semantic parsing tasks based on the SCAN and CFQ datasets. We show that masked language model (MLM) pre-training rivals SCAN-inspired architectures on primitive holdout splits. On a more complex compositional task, we show that pre-training leads to significant improvements in performance vs. comparable non-pre-trained models, whereas architectures proposed to encourage compositional generalization on SCAN or in the area of algorithm learning fail to lead to significant improvements. We establish a new state of the art on the CFQ compositional generalization benchmark using MLM pre-training together with an intermediate representation.
    NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media. (arXiv:2104.05893v2 [cs.CV] UPDATED)
    (0 min) Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes. We are motivated by the threat scenario where an image is used out of context to support a certain narrative. While some prior datasets for detecting image-text inconsistency generate samples via text manipulation, we propose a dataset where both image and text are unmanipulated but mismatched. We introduce several strategies for automatically retrieving convincing images for a given caption, capturing cases with inconsistent entities or semantic context. Our large-scale automatically generated NewsCLIPpings Dataset: (1) demonstrates that machine-driven image repurposing is now a realistic threat, and (2) provides samples that represent challenging instances of mismatch between text and image in news that are able to mislead humans. We benchmark several state-of-the-art multimodal models on our dataset and analyze their performance across different pretraining domains and visual backbones.
    Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing. (arXiv:2109.10540v1 [cs.CL])
    (0 min) In recent years, pretrained language models (PLMs) have achieved success on several downstream tasks, showing their power in modeling language. To better understand and leverage what PLMs have learned, several techniques have emerged to explore syntactic structures entailed by PLMs. However, few efforts have been made to explore the grounding capabilities of PLMs, which are also essential. In this paper, we highlight the ability of PLMs to discover which token should be grounded to which concept, if combined with our proposed erasing-then-awakening approach. Empirical studies on four datasets demonstrate that our approach can awaken latent grounding which is understandable to human experts, even though it is not exposed to such labels during training. More importantly, our approach shows great potential to benefit downstream semantic parsing models. Taking text-to-SQL as a case study, we successfully couple our approach with two off-the-shelf parsers, obtaining an absolute improvement of up to 9.8%.
    Pix2seq: A Language Modeling Framework for Object Detection. (arXiv:2109.10852v1 [cs.CV])
    (0 min) This paper presents Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we simply cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural net to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
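    The core move is discretizing box coordinates and class labels into tokens that a language model can emit. A hedged sketch of one plausible quantization scheme follows; the bin count, token ordering, and label id are illustrative assumptions, not the paper's exact vocabulary.

    ```python
    # Quantize a bounding box plus class label into a flat sequence of integer tokens,
    # in the spirit of detection-as-language-modeling.
    def box_to_tokens(box, label, image_size, num_bins=1000):
        """box = (ymin, xmin, ymax, xmax) in pixels; returns 5 integer tokens."""
        coord_tokens = [
            int(round(coord / image_size * (num_bins - 1))) for coord in box
        ]
        class_token = num_bins + label   # class tokens live after the coordinate bins
        return coord_tokens + [class_token]

    # Example: a 640x640 image with one box for a hypothetical class id 0.
    print(box_to_tokens((100, 150, 300, 400), label=0, image_size=640))
    ```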
    Keyword Extraction for Improved Document Retrieval in Conversational Search. (arXiv:2109.05979v2 [cs.CL] UPDATED)
    (0 min) Recent research has shown that mixed-initiative conversational search, based on the interaction between users and computers to clarify and improve a query, provides enormous advantages. Nonetheless, incorporating additional information provided by the user from the conversation poses some challenges. In fact, further interactions could confuse the system as a user might use words irrelevant to the information need but crucial for correct sentence construction in the context of multi-turn conversations. To this aim, in this paper, we have collected two conversational keyword extraction datasets and propose an end-to-end document retrieval pipeline incorporating them. Furthermore, we study the performance of two neural keyword extraction models, namely, BERT and sequence to sequence, in terms of extraction accuracy and human annotation. Finally, we study the effect of keyword extraction on the end-to-end neural IR performance and show that our approach beats state-of-the-art IR models. We make the two datasets publicly available to foster research in this area.
    On the Transferability of Adversarial Attacks against Neural Text Classifier. (arXiv:2011.08558v3 [cs.LG] UPDATED)
    (0 min) Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we present the first study to systematically investigate the transferability of adversarial examples for text classification models and explore how various factors, including network architecture, tokenization scheme, word embedding, and model capacity, affect the transferability of adversarial examples. Based on these studies, we propose a genetic algorithm to find an ensemble of models that can be used to induce adversarial examples to fool almost all existing models. Such adversarial examples reflect the defects of the learning process and the data bias in the training set. Finally, we derive word replacement rules that can be used for model diagnostics from these adversarial examples.
    MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization. (arXiv:2109.10650v1 [cs.CL])
    (2 min) One of the most challenging aspects of current single-document news summarization is that the summary often contains 'extrinsic hallucinations', i.e., facts that are not present in the source document, which are often derived via world knowledge. This causes summarization systems to act more like open-ended language models tending to hallucinate facts that are erroneous. In this paper, we mitigate this problem with the help of multiple supplementary resource documents assisting the task. We present a new dataset, MiRANews, and benchmark existing summarization models. In contrast to multi-document summarization, which addresses multiple events from several source documents, we still aim at generating a summary for a single document. We show via data analysis that it's not only the models which are to blame: more than 27% of facts mentioned in the gold summaries of MiRANews are better grounded in assisting documents than in the main source articles. An error analysis of generated summaries from pretrained models fine-tuned on MiRANews reveals that this has an even bigger effect on models: assisted summarization reduces hallucinations by 55% compared to single-document summarization models trained on the main article only. Our code and data are available at https://github.com/XinnuoXu/MiRANews.
    HyperExpan: Taxonomy Expansion with Hyperbolic Representation Learning. (arXiv:2109.10500v1 [cs.CL])
    (2 min) Taxonomies are valuable resources for many applications, but the limited coverage due to the expensive manual curation process hinders their general applicability. Prior works attempt to automatically expand existing taxonomies to improve their coverage by learning concept embeddings in Euclidean space, while taxonomies, inherently hierarchical, more naturally align with the geometric properties of a hyperbolic space. In this paper, we present HyperExpan, a taxonomy expansion algorithm that seeks to preserve the structure of a taxonomy in a more expressive hyperbolic embedding space and learn to represent concepts and their relations with a Hyperbolic Graph Neural Network (HGNN). Specifically, HyperExpan leverages position embeddings to exploit the structure of the existing taxonomies, and characterizes the concept profile information to support the inference on unseen concepts during training. Experiments show that our proposed HyperExpan outperforms baseline models with representation learning in a Euclidean feature space and achieves state-of-the-art performance on the taxonomy expansion benchmarks.
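    For readers unfamiliar with the hyperbolic setting, the distance such models rely on can be computed in the Poincaré ball model, where distances grow rapidly near the boundary and therefore suit tree-like taxonomies better than Euclidean distance. A minimal sketch of the standard formula (not tied to HyperExpan's implementation) follows.

    ```python
    # Distance between two points in the Poincaré ball model of hyperbolic space.
    import torch

    def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        """u, v: points with norm < 1 in the Poincare ball, shape (..., dim)."""
        sq_u = (u * u).sum(-1).clamp(max=1 - eps)
        sq_v = (v * v).sum(-1).clamp(max=1 - eps)
        sq_diff = ((u - v) ** 2).sum(-1)
        x = 1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v))
        return torch.acosh(x.clamp(min=1 + eps))

    root = torch.tensor([0.0, 0.0])
    leaf = torch.tensor([0.0, 0.95])
    print(poincare_distance(root, leaf))   # large: the leaf sits near the boundary
    ```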
    Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages. (arXiv:2109.10534v1 [cs.CL])
    (2 min) We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning. We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications, compared to models fine-tuned on individual languages. A first-of-its-kind detailed study is presented to track performance change as languages are added to a base language in a graded and greedy (in the sense of best boost of performance) manner, which reveals that careful selection of a subset of related languages can improve performance significantly more than utilizing all related languages. The Indo-Aryan (IA) language family is chosen for the study, the exact languages being Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi and Urdu. The script barrier is crossed by simple rule-based transliteration of the text of all languages to Devanagari. Experiments are performed on mBERT, IndicBERT, MuRIL and two RoBERTa-based LMs, the last two being pre-trained by us. Low resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning. Textual Entailment, Entity Classification, Section Title Prediction, tasks of IndicGLUE and POS tagging form our test bed. Compared to monolingual fine-tuning, we get a relative performance improvement of up to 150% in the downstream tasks. The surprise take-away is that for any language there is a particular combination of other languages which yields the best performance, and any additional language is in fact detrimental.
    A Simple Approach to Jointly Rank Passages and Select Relevant Sentences in the OBQA Context. (arXiv:2109.10497v1 [cs.CL])
    (2 min) In the open question answering (OBQA) task, how to select the relevant information from a large corpus is a crucial problem for reasoning and inference. Some datasets (e.g., HotpotQA) mainly focus on testing the model's reasoning ability at the sentence level. To overcome this challenge, many existing frameworks use a deep learning model to select relevant passages and then answer each question by matching a sentence in the corresponding passage. However, such frameworks require long inference times and fail to take advantage of the relationship between passages and sentences. In this work, we present a simple yet effective framework to address these problems by jointly ranking passages and selecting sentences. We propose consistency and similarity constraints to promote the correlation and interaction between passage ranking and sentence selection. In our experiments, we demonstrate that our framework can achieve competitive results and outperform the baseline by 28\% in terms of exact matching of relevant sentences on the HotpotQA dataset.
    BFClass: A Backdoor-free Text Classification Framework. (arXiv:2109.10855v1 [cs.CL])
    (2 min) Backdoor attack introduces artificial vulnerabilities into the model by poisoning a subset of the training data via injecting triggers and modifying labels. Various trigger design strategies have been explored to attack text classifiers, however, defending such attacks remains an open problem. In this work, we propose BFClass, a novel efficient backdoor-free training framework for text classification. The backbone of BFClass is a pre-trained discriminator that predicts whether each token in the corrupted input was replaced by a masked language model. To identify triggers, we utilize this discriminator to locate the most suspicious token from each training sample and then distill a concise set by considering their association strengths with particular labels. To recognize the poisoned subset, we examine the training samples with these identified triggers as the most suspicious token, and check if removing the trigger will change the poisoned model's prediction. Extensive experiments demonstrate that BFClass can identify all the triggers, remove 95% poisoned training samples with very limited false alarms, and achieve almost the same performance as the models trained on the benign training data.
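    A hedged sketch of the trigger-locating step: a publicly available ELECTRA-style discriminator scores how "replaced" each token looks, and the highest-scoring token is flagged as most suspicious. The model name and the single-token heuristic are assumptions for illustration, not the paper's exact pipeline.

    ```python
    # Use a pre-trained ELECTRA discriminator to flag the most suspicious token.
    import torch
    from transformers import ElectraTokenizerFast, ElectraForPreTraining

    tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
    model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

    def most_suspicious_token(text: str) -> str:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            scores = model(**inputs).logits[0]            # one "replaced" logit per token
        idx = int(scores.argmax())                        # highest score = most suspicious
        return tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])[idx]

    print(most_suspicious_token("the movie was great cf honestly"))
    ```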
    Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation. (arXiv:2109.10859v1 [cs.CL])
    (2 min) Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.
    Towards The Automatic Coding of Medical Transcripts to Improve Patient-Centered Communication. (arXiv:2109.10514v1 [cs.CL])
    (2 min) This paper aims to provide an approach for automatic coding of physician-patient communication transcripts to improve patient-centered communication (PCC). PCC is a central part of high-quality health care. To improve PCC, dialogues between physicians and patients have been recorded and tagged with predefined codes. Trained human coders have manually coded the transcripts. Since manual coding entails huge labor costs and is prone to human error, automatic coding methods should be considered for efficiency and effectiveness. We adopted three machine learning algorithms (Naïve Bayes, Random Forest, and Support Vector Machine) to categorize lines in transcripts into corresponding codes. The result showed that there is evidence to distinguish the codes, and this is considered to be sufficient for training of human annotators.
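    A minimal sketch of this kind of setup: TF-IDF features over transcript lines fed to the three named classifier families. The toy lines, codes, and feature choice are placeholders for illustration, not the study's data or exact configuration.

    ```python
    # Classify transcript lines into communication codes with three classic classifiers.
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import LinearSVC

    lines = [
        "how are you feeling today",
        "take this twice a day",
        "I am worried about the pain",
        "let's schedule a follow up",
    ]
    codes = ["rapport", "instruction", "concern", "logistics"]   # hypothetical codes

    for clf in (MultinomialNB(), RandomForestClassifier(), LinearSVC()):
        model = make_pipeline(TfidfVectorizer(), clf).fit(lines, codes)
        print(type(clf).__name__, model.predict(["does the pain get worse at night"]))
    ```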
    Enriching and Controlling Global Semantics for Text Summarization. (arXiv:2109.10616v1 [cs.CL])
    (2 min) Recently, Transformer-based models have been proven effective in the abstractive summarization task by creating fluent and informative summaries. Nevertheless, these models still suffer from the short-range dependency problem, causing them to produce summaries that miss the key points of the document. In this paper, we attempt to address this issue by introducing a neural topic model empowered with normalizing flow to capture the global semantics of the document, which are then integrated into the summarization model. In addition, to avoid the overwhelming effect of global semantics on contextualized representation, we introduce a mechanism to control the amount of global semantics supplied to the text generation module. Our method outperforms state-of-the-art summarization models on five common text summarization datasets, namely CNN/DailyMail, XSum, Reddit TIFU, arXiv, and PubMed.
    Grammatical Profiling for Semantic Change Detection. (arXiv:2109.10397v1 [cs.CL])
    (2 min) Semantics, morphology and syntax are strongly interdependent. However, the majority of computational methods for semantic change detection use distributional word representations which encode mostly semantics. We investigate an alternative method, grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words. We demonstrate that it can be used for semantic change detection and even outperforms some distributional semantic methods. We present an in-depth qualitative and quantitative analysis of the predictions made by our grammatical profiling system, showing that they are plausible and interpretable.
    Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network. (arXiv:2109.10724v1 [cs.SD])
    (2 min) Incremental text-to-speech (TTS) synthesis generates utterances in small linguistic units for the sake of real-time and low-latency applications. We previously proposed an incremental TTS method that leverages a large pre-trained language model to take unobserved future context into account without waiting for the subsequent segment. Although this method achieves comparable speech quality to that of a method that waits for the future context, it entails a huge amount of processing for sampling from the language model at each time step. In this paper, we propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model, instead of sampling words from the large-scale language model. We perform knowledge distillation from a GPT2-based context prediction network into a simple recurrent model by minimizing a teacher-student loss defined between the context embedding vectors of those models. Experimental results show that the proposed method requires about ten times less inference time to achieve comparable synthetic speech quality to that of our previous method, and it can perform incremental synthesis much faster than the average speaking speed of human English speakers, demonstrating the availability of our method to real-time applications.
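    A minimal sketch of the distillation step as described: a lightweight recurrent student is trained so that its context embedding matches the teacher's, via an MSE teacher-student loss. Module sizes and the random tensors standing in for GPT-2-based teacher embeddings are illustrative assumptions.

    ```python
    # Distill a context prediction network into a small recurrent student.
    import torch
    import torch.nn as nn

    class StudentContextPredictor(nn.Module):
        def __init__(self, vocab_size=10000, emb_dim=256, ctx_dim=768):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, ctx_dim, batch_first=True)

        def forward(self, token_ids):                  # (batch, seq_len)
            _, h = self.rnn(self.embed(token_ids))
            return h[-1]                               # (batch, ctx_dim) context embedding

    student = StudentContextPredictor()
    tokens = torch.randint(0, 10000, (8, 20))
    teacher_ctx = torch.randn(8, 768)                  # stands in for teacher context embeddings
    loss = nn.functional.mse_loss(student(tokens), teacher_ctx)   # teacher-student loss
    loss.backward()
    ```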
    The NiuTrans Machine Translation Systems for WMT21. (arXiv:2109.10485v1 [cs.CL])
    (2 min) This paper describes NiuTrans neural machine translation systems of the WMT 2021 news translation tasks. We made submissions to 9 language directions, including English$\leftrightarrow$$\{$Chinese, Japanese, Russian, Icelandic$\}$ and English$\rightarrow$Hausa tasks. Our primary systems are built on several effective variants of Transformer, e.g., Transformer-DLCL, ODE-Transformer. We also utilize back-translation, knowledge distillation, post-ensemble, and iterative fine-tuning techniques to enhance the model performance further.
    Unsupervised Contextualized Document Representation. (arXiv:2109.10509v1 [cs.CL])
    (2 min) Several NLP tasks need the effective representation of text documents. Arora et al. (2017) demonstrate that simple weighted averaging of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) further extends this from sentences to documents by employing soft and sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual character of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT-based (Devlin et al., 2019) word embeddings for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate our embeddings' effectiveness on other tasks, such as concept matching and sentence similarity. In addition, we show that SCDV+BERT(ctxd) outperforms fine-tuned BERT and other embedding approaches in scenarios with limited data and only a few labeled examples.
    Tecnologica cosa: Modeling Storyteller Personalities in Boccaccio's Decameron. (arXiv:2109.10506v1 [cs.CL])
    (2 min) We explore Boccaccio's Decameron to see how digital humanities tools can be used for tasks that have limited data in a language no longer in contemporary use: medieval Italian. We focus our analysis on the question: Do the different storytellers in the text exhibit distinct personalities? To answer this question, we curate and release a dataset based on the authoritative edition of the text. We use supervised classification methods to predict storytellers based on the stories they tell, confirming the difficulty of the task, and demonstrate that topic modeling can extract thematic storyteller "profiles."
    NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset. (arXiv:2109.10604v1 [cs.CL])
    (2 min) While diverse question answering (QA) datasets have been proposed and contributed significantly to the development of deep learning models for QA tasks, the existing datasets fall short in two aspects. First, we lack QA datasets covering complex questions that involve answers as well as the reasoning processes to get the answers. As a result, the state-of-the-art QA research on numerical reasoning still focuses on simple calculations and does not provide the mathematical expressions or evidence justifying the answers. Second, the QA community has contributed much effort to improving the interpretability of QA models. However, these models fail to explicitly show the reasoning process, such as the evidence order for reasoning and the interactions between different pieces of evidence. To address the above shortcomings, we introduce NOAHQA, a conversational and bilingual QA dataset with questions requiring numerical reasoning with compound mathematical expressions. With NOAHQA, we develop an interpretable reasoning graph as well as the appropriate evaluation metric to measure the answer quality. We evaluate the state-of-the-art QA models trained using existing QA datasets on NOAHQA and show that the best among them can only achieve 55.5 exact match scores, while the human performance is 89.7. We also present a new QA model for generating a reasoning graph, where the reasoning graph metric still has a large gap compared with human performance, e.g., a gap of 28 points.
    FCM: A Fine-grained Comparison Model for Multi-turn Dialogue Reasoning. (arXiv:2109.10510v1 [cs.CL])
    (2 min) Despite the success of neural dialogue systems in achieving high performance on the leader-board, they cannot meet users' requirements in practice, due to their poor reasoning skills. The underlying reason is that most neural dialogue models only capture the syntactic and semantic information, but fail to model the logical consistency between the dialogue history and the generated response. Recently, a new multi-turn dialogue reasoning task has been proposed, to facilitate dialogue reasoning research. However, this task is challenging, because there are only slight differences between the illogical response and the dialogue history. How to effectively solve this challenge is still worth exploring. This paper proposes a Fine-grained Comparison Model (FCM) to tackle this problem. Inspired by human's behavior in reading comprehension, a comparison mechanism is proposed to focus on the fine-grained differences in the representation of each response candidate. Specifically, each candidate representation is compared with the whole history to obtain a history consistency representation. Furthermore, the consistency signals between each candidate and the speaker's own history are considered to drive a model to prefer a candidate that is logically consistent with the speaker's history logic. Finally, the above consistency representations are employed to output a ranking list of the candidate responses for multi-turn dialogue reasoning. Experimental results on two public dialogue datasets show that our method obtains higher ranking scores than the baseline models.
    Contrastive Learning for Fair Representations. (arXiv:2109.10645v1 [cs.CL])
    (2 min) Trained classification models can unintentionally lead to biased representations and predictions, which can reinforce societal preconceptions and stereotypes. Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise. In this paper, we propose a method for mitigating bias in classifier training by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations, while instances sharing a protected attribute are forced further apart. In such a way our method learns representations which capture the task label in focused regions, while ensuring the protected attribute has diverse spread, and thus has limited impact on prediction and thereby results in fairer models. Extensive experimental results across four tasks in NLP and computer vision show (a) that our proposed method can achieve fairer representations and realises bias reductions compared with competitive baselines; and (b) that it can do so without sacrificing main task performance; (c) that it sets a new state-of-the-art performance in one task despite reducing the bias. Finally, our method is conceptually simple and agnostic to network architectures, and incurs minimal additional compute cost.
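    A hedged sketch of the training signal described above: one term pulls together representations that share a task label, while a second pushes apart representations that share the protected attribute. The temperature, weighting, and specific formulation are illustrative choices, not the paper's exact loss.

    ```python
    # Fairness-oriented contrastive objective over a batch of representations.
    import torch
    import torch.nn.functional as F

    def fair_contrastive_loss(z, y, a, tau=0.1, lam=1.0):
        """z: (n, d) representations; y: (n,) task labels; a: (n,) protected attribute."""
        z = F.normalize(z, dim=1)
        sim = z @ z.t() / tau                          # pairwise cosine similarities
        eye = torch.eye(len(z), dtype=torch.bool)
        same_y = (y.unsqueeze(0) == y.unsqueeze(1)) & ~eye
        same_a = (a.unsqueeze(0) == a.unsqueeze(1)) & ~eye
        pull = -sim[same_y].mean()                     # same class -> high similarity
        push = sim[same_a].mean()                      # same protected group -> low similarity
        return pull + lam * push

    z = torch.randn(16, 64, requires_grad=True)
    y = torch.randint(0, 2, (16,))
    a = torch.randint(0, 2, (16,))
    fair_contrastive_loss(z, y, a).backward()
    ```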
    COVR: A test-bed for Visually Grounded Compositional Generalization with real images. (arXiv:2109.10613v1 [cs.CL])
    (2 min) While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In this work, we propose COVR, a new test-bed for visually-grounded compositional generalization with real images. To create COVR, we use real images annotated with scene graphs, and propose an almost fully automatic procedure for generating question-answer pairs along with a set of context images. COVR focuses on questions that require complex reasoning, including higher-order operations such as quantification and aggregation. Due to the automatic generation process, COVR facilitates the creation of compositional splits, where models at test time need to generalize to new concepts and compositions in a zero- or few-shot setting. We construct compositional splits using COVR and demonstrate a myriad of cases where state-of-the-art pre-trained language-and-vision models struggle to compositionally generalize.
    OTEANN: Estimating the Transparency of Orthographies with an Artificial Neural Network. (arXiv:1912.13321v4 [cs.CL] UPDATED)
    (2 min) To transcribe spoken language to written medium, most alphabets enable an unambiguous sound-to-letter rule. However, some writing systems have distanced themselves from this simple concept and little work exists in Natural Language Processing (NLP) on measuring such distance. In this study, we use an Artificial Neural Network (ANN) model to evaluate the transparency between written words and their pronunciation, hence its name Orthographic Transparency Estimation with an ANN (OTEANN). Based on datasets derived from Wikimedia dictionaries, we trained and tested this model to score the percentage of correct predictions in phoneme-to-grapheme and grapheme-to-phoneme translation tasks. The scores obtained on 17 orthographies were in line with the estimations of other studies. Interestingly, the model also provided insight into typical mistakes made by learners who only consider the phonemic rule in reading and writing.
    Diarisation using Location tracking with agglomerative clustering. (arXiv:2109.10598v1 [cs.LG])
    (2 min) Previous works have shown that spatial location information can be complementary to speaker embeddings for a speaker diarisation task. However, the models used often assume that speakers are fairly stationary throughout a meeting. This paper proposes to relax this assumption, by explicitly modelling the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework. Kalman filters, which track the locations of speakers, are used to compute log-likelihood ratios that contribute to the cluster affinity computations for the AHC merging and stopping decisions. Experiments show that the proposed approach is able to yield improvements on a Microsoft rich meeting transcription task, compared to methods that do not use location information or that make stationarity assumptions.
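    For intuition, the sketch below implements a minimal constant-velocity Kalman filter over a speaker's 2-D location, the kind of tracker that could feed such affinity scores. The noise covariances, time step, and toy measurements are illustrative, not the paper's configuration.

    ```python
    # Constant-velocity Kalman filter tracking a 2-D speaker location.
    import numpy as np

    class LocationKalman:
        def __init__(self, q=0.01, r=0.1):
            self.x = np.zeros(4)                 # state: [px, py, vx, vy]
            self.P = np.eye(4)
            self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0   # constant-velocity model
            self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
            self.Q = q * np.eye(4)               # process noise
            self.R = r * np.eye(2)               # measurement noise

        def step(self, z):
            # predict
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            # update with observed location z = [px, py]
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ (z - self.H @ self.x)
            self.P = (np.eye(4) - K @ self.H) @ self.P
            return self.x[:2]                    # filtered position estimate

    kf = LocationKalman()
    for z in [np.array([0.0, 0.0]), np.array([0.1, 0.05]), np.array([0.22, 0.12])]:
        print(kf.step(z))
    ```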
    Evaluating Debiasing Techniques for Intersectional Biases. (arXiv:2109.10441v1 [cs.CL])
    (2 min) Bias is pervasive in NLP models, motivating the development of automatic debiasing techniques. Evaluation of NLP debiasing methods has largely been limited to binary attributes in isolation, e.g., debiasing with respect to binary gender or race; however, many corpora involve multiple such attributes, possibly with higher cardinality. In this paper we argue that a truly fair model must consider `gerrymandering' groups which comprise not only single attributes, but also intersectional groups. We evaluate a form of bias-constrained model which is new to NLP, as well as an extension of the iterative nullspace projection technique which can handle multiple protected attributes.
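    For reference, here is a minimal sketch of the iterative nullspace projection idea for a single binary protected attribute: repeatedly fit a linear classifier for the attribute and project representations onto the nullspace of its weight direction. This is a generic rendering of the technique, not the paper's multi-attribute extension.

    ```python
    # Iterative nullspace projection (INLP) sketch for one binary protected attribute.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def inlp(X, protected, n_iters=5):
        P = np.eye(X.shape[1])
        for _ in range(n_iters):
            clf = LogisticRegression(max_iter=1000).fit(X @ P, protected)
            w = clf.coef_ / np.linalg.norm(clf.coef_)       # (1, d) direction to remove
            P_step = np.eye(X.shape[1]) - w.T @ w            # rank-1 nullspace projection
            P = P_step @ P
        return X @ P, P

    X = np.random.randn(500, 50)
    protected = np.random.randint(0, 2, 500)
    X_debiased, P = inlp(X, protected)
    print(X_debiased.shape)
    ```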
    Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing. (arXiv:2109.10847v1 [cs.LG])
    (2 min) Recent progress in the Natural Language Processing domain has given us several State-of-the-Art (SOTA) pretrained models which can be finetuned for specific tasks. These large models with billions of parameters trained on numerous GPUs/TPUs over weeks are leading in the benchmark leaderboards. In this paper, we discuss the need for a benchmark for cost- and time-effective smaller models trained on a single GPU. This will enable researchers with resource constraints to experiment with novel and innovative ideas on tokenization, pretraining tasks, architecture, fine-tuning methods, etc. We set up Small-Bench NLP, a benchmark for small efficient neural language models trained on a single GPU. The Small-Bench NLP benchmark comprises eight NLP tasks on the publicly available GLUE datasets and a leaderboard to track the progress of the community. Our ELECTRA-DeBERTa (15M parameters) small model architecture achieves an average score of 81.53 which is comparable to that of BERT-Base's 82.20 (110M parameters). Our models, code and leaderboard are available at https://github.com/smallbenchnlp
    "Wait, I'm Still Talking!" Predicting the Dialogue Interaction Behavior Using Imagine-Then-Arbitrate Model. (arXiv:2002.09616v4 [cs.CL] UPDATED)
    (3 min) Producing natural and accurate responses like human beings is the ultimate goal of intelligent dialogue agents. So far, most of the past works concentrate on selecting or generating one pertinent and fluent response according to the current query and its context. These models work in a one-to-one environment, making one response to one utterance each round. However, in real human-human conversations, humans often sequentially send several short messages for readability instead of a long message in one turn. Thus, messages will not end with an explicit ending signal, which is crucial for agents to decide when to reply. So the first step for an intelligent dialogue agent is not replying but deciding if it should reply at the moment. To address this issue, in this paper, we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to help the agent decide whether to wait or to make a response directly. Our method has two imaginator modules and an arbitrator module. The two imaginators learn the agent's and the user's speaking styles respectively and generate possible utterances as the input of the arbitrator, combined with the dialogue history. The arbitrator then decides whether to wait or to make a response to the user directly. To verify the performance and effectiveness of our method, we prepared two dialogue datasets and compared our approach with several popular models. Experimental results show that our model performs well on the ending-prediction issue and outperforms baseline models.
    DialogueBERT: A Self-Supervised Learning based Dialogue Pre-training Encoder. (arXiv:2109.10480v1 [cs.CL])
    (2 min) With the rapid development of artificial intelligence, conversational bots have become prevalent on mainstream E-commerce platforms, where they can provide convenient and timely customer service. To satisfy the user, the conversational bots need to understand the user's intention, detect the user's emotion, and extract the key entities from the conversational utterances. However, understanding dialogues is regarded as a very challenging task. Different from common language understanding, utterances in dialogues appear alternately from different roles and are usually organized as hierarchical structures. To facilitate the understanding of dialogues, in this paper, we propose a novel contextual dialogue encoder (i.e. DialogueBERT) based on the popular pre-trained language model BERT. Five self-supervised learning pre-training tasks are devised for learning the particularity of dialogue utterances. Four different input embeddings are integrated to capture the relationship between utterances, including turn embedding, role embedding, token embedding and position embedding. DialogueBERT was pre-trained with 70 million dialogues in a real-world scenario, and then fine-tuned in three different downstream dialogue understanding tasks. Experimental results show that DialogueBERT achieves exciting results with 88.63% accuracy for intent recognition, 94.25% accuracy for emotion recognition and 97.04% F1 score for named entity recognition, which outperforms several strong baselines by a large margin.
    Predict-then-Decide: A Predictive Approach for Wait or Answer Task in Dialogue Systems. (arXiv:2005.13119v2 [cs.CL] UPDATED)
    (3 min) Different people have different habits of describing their intents in conversations. Some people tend to deliberate their intents in several successive utterances, i.e., they use several consistent messages for readability instead of a long sentence to express their question. This creates a predicament faced by the application of dialogue systems, especially in real-world industry scenarios, in which the dialogue system is unsure whether it should answer the user's query immediately or wait for further supplementary input. Motivated by such an interesting predicament, we define a novel Wait-or-Answer task for dialogue systems. We shed light on a new research topic about how the dialogue system can behave more intelligently in this Wait-or-Answer quandary. Further, we propose a predictive approach named Predict-then-Decide (PTD) to tackle this Wait-or-Answer task. More specifically, we take advantage of a decision model to help the dialogue system decide whether to wait or answer. The decision model makes its choice with the assistance of two ancillary prediction models: a user prediction model and an agent prediction model. The user prediction model tries to predict what the user would supplement and uses its prediction to persuade the decision model that the user has some information to add, so the dialogue system should wait. The agent prediction model tries to predict the answer of the dialogue system and convince the decision model that answering the user's query immediately is the superior choice since the user's input has come to an end. We conduct our experiments on two real-life scenarios and three public datasets. Experimental results on five datasets show our proposed PTD approach significantly outperforms the existing models in solving this Wait-or-Answer problem.
    RETRONLU: Retrieval Augmented Task-Oriented Semantic Parsing. (arXiv:2109.10410v1 [cs.CL])
    (2 min) While large pre-trained language models accumulate a lot of knowledge in their parameters, it has been demonstrated that augmenting them with non-parametric retrieval-based memory has a number of benefits, from accuracy improvements to data efficiency, for knowledge-focused tasks such as question answering. In this paper, we apply retrieval-based modeling ideas to the problem of multi-domain task-oriented semantic parsing for conversational assistants. Our approach, RetroNLU, extends a sequence-to-sequence model architecture with a retrieval component, used to fetch existing similar examples and provide them as an additional input to the model. In particular, we analyze two settings, where we augment an input with (a) retrieved nearest neighbor utterances (utterance-nn), and (b) ground-truth semantic parses of nearest neighbor utterances (semparse-nn). Our technique outperforms the baseline method by 1.5% absolute macro-F1, especially in the low-resource setting, matching the baseline model accuracy with only 40% of the data. Furthermore, we analyze the nearest neighbor retrieval component's quality and model sensitivity, and break down performance for semantic parses of different utterance complexity.
    Recursively Summarizing Books with Human Feedback. (arXiv:2109.10862v1 [cs.CL])
    (2 min) A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases ($\sim5\%$ of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.
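    A minimal sketch of the recursive decomposition: split a long text into chunks, summarize each chunk, join the summaries, and recurse until the text fits in one pass. The placeholder summarizer simply truncates; a real system would call a fine-tuned model, and the chunk sizes here are arbitrary.

    ```python
    # Recursive summarization by task decomposition.
    def summarize_passage(text: str) -> str:
        # placeholder: a real system would call a learned summarizer here
        return text[: max(1, len(text) // 4)]

    def recursive_summarize(text: str, chunk_chars=2000, max_chars=2000) -> str:
        if len(text) <= max_chars:
            return summarize_passage(text)
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        merged = " ".join(summarize_passage(c) for c in chunks)   # summaries of sections
        return recursive_summarize(merged, chunk_chars, max_chars)

    print(recursive_summarize("a very long book ... " * 500)[:80])
    ```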
    Salience-Aware Event Chain Modeling for Narrative Understanding. (arXiv:2109.10475v1 [cs.CL])
    (2 min) Storytelling, whether via fables, news reports, documentaries, or memoirs, can be thought of as the communication of interesting and related events that, taken together, form a concrete process. It is desirable to extract the event chains that represent such processes. However, this extraction remains a challenging problem. We posit that this is due to the nature of the texts from which chains are discovered. Natural language text interleaves a narrative of concrete, salient events with background information, contextualization, opinion, and other elements that are important for a variety of necessary discourse and pragmatics acts but are not part of the principal chain of events being communicated. We introduce methods for extracting this principal chain from natural language text, by filtering away non-salient events and supportive sentences. We demonstrate the effectiveness of our methods at isolating critical event chains by comparing their effect on downstream tasks. We show that by pre-training large language models on our extracted chains, we obtain improvements in two tasks that benefit from a clear understanding of event chains: narrative prediction and event-based temporal question answering. The demonstrated improvements and ablative studies confirm that our extraction method isolates critical event chains.
    Scalable and Efficient MoE Training for Multitask Multilingual Models. (arXiv:2109.10465v1 [cs.CL])
    (2 min) Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gain while consuming a much lower compute budget. However, supporting large scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation, which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support of efficient MoE training has been implemented and open-sourced with the DeepSpeed library.
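    To see why MoE compute is sublinear in parameters, the sketch below shows a layer with top-1 gating that routes each token to a single expert feed-forward network; total parameters grow with the number of experts while per-token compute stays roughly constant. Capacity factors, load-balancing losses, and parallelism are omitted, and all sizes are illustrative.

    ```python
    # Sparsely activated Mixture-of-Experts layer with top-1 gating.
    import torch
    import torch.nn as nn

    class MoELayer(nn.Module):
        def __init__(self, dim=256, hidden=1024, num_experts=8):
            super().__init__()
            self.gate = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
                for _ in range(num_experts)
            )

        def forward(self, x):                               # x: (tokens, dim)
            gate_probs = torch.softmax(self.gate(x), dim=-1)
            top_prob, top_idx = gate_probs.max(dim=-1)       # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = top_idx == e
                if mask.any():
                    out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
            return out

    layer = MoELayer()
    print(layer(torch.randn(32, 256)).shape)                 # torch.Size([32, 256])
    ```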
    Extracting Fine-Grained Knowledge Graphs of Scientific Claims: Dataset and Transformer-Based Results. (arXiv:2109.10453v1 [cs.CL])
    (2 min) Recent transformer-based approaches demonstrate promising results on relational scientific information extraction. Existing datasets focus on high-level description of how research is carried out. Instead we focus on the subtleties of how experimental associations are presented by building SciClaim, a dataset of scientific claims drawn from Social and Behavior Science (SBS), PubMed, and CORD-19 papers. Our novel graph annotation schema incorporates not only coarse-grained entity spans as nodes and relations as edges between them, but also fine-grained attributes that modify entities and their relations, for a total of 12,738 labels in the corpus. By including more label types and more than twice the label density of previous datasets, SciClaim captures causal, comparative, predictive, statistical, and proportional associations over experimental variables along with their qualifications, subtypes, and evidence. We extend work in transformer-based joint entity and relation extraction to effectively infer our schema, showing the promise of fine-grained knowledge graphs in scientific claims and beyond.
    Fairness-aware Class Imbalanced Learning. (arXiv:2109.10444v1 [cs.CL])
    (2 min) Class imbalance is a common challenge in many NLP tasks, and has clear connections to bias, in that bias in training data often leads to higher accuracy for majority groups at the expense of minority groups. However there has traditionally been a disconnect between research on class-imbalanced learning and mitigating bias, and only recently have the two been looked at through a common lens. In this work we evaluate long-tail learning methods for tweet sentiment and occupation classification, and extend a margin-loss based approach with methods to enforce fairness. We empirically show through controlled experiments that the proposed approaches help mitigate both class imbalance and demographic biases.
    What Would it Take to get Biomedical QA Systems into Practice?. (arXiv:2109.10415v1 [cs.CL])
    (2 min) Medical question answering (QA) systems have the potential to answer clinicians uncertainties about treatment and diagnosis on demand, informed by the latest evidence. However, despite the significant progress in general QA made by the NLP community, medical QA systems are still not widely used in clinical environments. One likely reason for this is that clinicians may not readily trust QA system outputs, in part because transparency, trustworthiness, and provenance have not been key considerations in the design of such models. In this paper we discuss a set of criteria that, if met, we argue would likely increase the utility of biomedical QA systems, which may in turn lead to adoption of such systems in practice. We assess existing models, tasks, and datasets with respect to these criteria, highlighting shortcomings of previously proposed approaches and pointing toward what might be more usable QA systems.
  • cs.CV updates on arXiv.org

    Rotor Localization and Phase Mapping of Cardiac Excitation Waves using Deep Neural Networks. (arXiv:2109.10472v1 [physics.med-ph])
    (0 min) The analysis of electrical impulse phenomena in cardiac muscle tissue is important for the diagnosis of heart rhythm disorders and other cardiac pathophysiology. Cardiac mapping techniques acquire numerous local temporal measurements and combine them to visualize the spread of electrophysiological wave phenomena across the heart surface. However, low spatial resolutions, sparse measurement locations, noise and other artifacts make it challenging to accurately visualize spatio-temporal activity. For instance, electro-anatomical catheter mapping is severely limited by the sparsity of the measurements and optical mapping is prone to noise and motion artifacts. In the past, several approaches have been proposed to obtain more reliable maps from noisy or sparse mapping data. Here, we demonstrate that deep learning can be used to compute phase maps and detect phase singularities from both noisy and sparse electrical mapping data with high precision and efficiency. The self-supervised deep learning approach is fundamentally different from classical phase mapping techniques. Rather than encoding a phase signal from time-series data, the network instead learns to directly associate short spatio-temporal sequences of electrical data with phase maps and the positions of phase singularities. Using this method, we were able to accurately compute phase maps and locate rotor cores even from extremely sparse and noisy data, generated from both optical mapping experiments and computer simulations. Neural networks are a promising alternative to conventional phase mapping and rotor core localization methods, that could be used in optical mapping studies in basic cardiovascular research as well as in the clinical setting for the analysis of atrial fibrillation.
    Parameter Efficient Multimodal Transformers for Video Representation Learning. (arXiv:2012.04124v2 [cs.CV] UPDATED)
    (0 min) The recent success of Transformers in the language domain has motivated adapting it to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements from Transformers, existing work typically fixes the language model and train only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces parameters of the Transformers up to 97$\%$, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.
    TACTIC: Joint Rate-Distortion-Accuracy Optimisation for Low Bitrate Compression. (arXiv:2109.10658v1 [cs.CV])
    (0 min) We present TACTIC: Task-Aware Compression Through Intelligent Coding. Our lossy compression model learns based on the rate-distortion-accuracy trade-off for a specific task. By considering what information is important for the follow-on problem, the system trades off visual fidelity for good task performance at a low bitrate. When compared against JPEG at the same bitrate, our approach is able to improve the accuracy of ImageNet subset classification by 4.5%. We also demonstrate the applicability of our approach to other problems, providing a 3.4% accuracy and 4.9% mean IoU improvements in performance over task-agnostic compression for semantic segmentation.
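    A hedged sketch of a task-aware rate-distortion-accuracy objective of the kind described: a weighted sum of an estimated bitrate, a reconstruction distortion, and a downstream task loss computed on the reconstruction. The weights and the stand-in rate term are illustrative assumptions, not the paper's trained entropy model.

    ```python
    # Combined rate-distortion-accuracy training objective.
    import torch
    import torch.nn.functional as F

    def task_aware_compression_loss(x, x_hat, bits_estimate, task_logits, task_labels,
                                    lambda_rate=0.01, lambda_task=1.0):
        distortion = F.mse_loss(x_hat, x)                      # visual fidelity
        rate = bits_estimate.mean()                            # estimated coding cost
        task = F.cross_entropy(task_logits, task_labels)       # downstream accuracy proxy
        return distortion + lambda_rate * rate + lambda_task * task

    x = torch.randn(4, 3, 64, 64)
    x_hat = x + 0.1 * torch.randn_like(x)
    loss = task_aware_compression_loss(
        x, x_hat, torch.rand(4), torch.randn(4, 10), torch.randint(0, 10, (4,))
    )
    print(loss.item())
    ```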
    ElasticFace: Elastic Margin Loss for Deep Face Recognition. (arXiv:2109.09416v2 [cs.CV] UPDATED)
    (0 min) Learning discriminative face features plays a major role in building high-performing face recognition models. The recent state-of-the-art face recognition solutions proposed to incorporate a fixed penalty margin on commonly used classification loss function, softmax loss, in the normalized hypersphere to increase the discriminative power of face recognition models, by minimizing the intra-class variation and maximizing the inter-class variation. Marginal softmax losses, such as ArcFace and CosFace, assume that the geodesic distance between and within the different identities can be equally learned using a fixed margin. However, such a learning objective is not realistic for real data with inconsistent inter- and intra-class variation, which might limit the discriminability and generalizability of the face recognition model. In this paper, we relax the fixed margin constraint by proposing elastic margin loss (ElasticFace) that allows flexibility in the push for class separability. The main idea is to utilize random margin values drawn from a normal distribution in each training iteration. This aims at giving the margin chances to extract and retract to allow space for flexible class separability learning. We demonstrate the superiority of our elastic margin loss over ArcFace and CosFace losses, using the same geometric transformation, on a large set of mainstream benchmarks. From a wider perspective, our ElasticFace has advanced the state-of-the-art face recognition performance on six out of nine mainstream benchmarks.
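    A hedged sketch of the elastic-margin idea layered on an ArcFace-style loss: instead of a fixed angular margin, margins are drawn from a normal distribution at each step. The scale and distribution parameters below are illustrative, not the paper's tuned values.

    ```python
    # ArcFace-style angular-margin loss with a randomly drawn (elastic) margin.
    import torch
    import torch.nn.functional as F

    def elastic_arcface_loss(embeddings, weights, labels, s=64.0, m_mean=0.5, m_std=0.05):
        """embeddings: (n, d); weights: (num_classes, d) class centres; labels: (n,)."""
        cos = F.normalize(embeddings) @ F.normalize(weights).t()       # cosine logits
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        margins = m_mean + m_std * torch.randn(len(labels), 1, device=cos.device)
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).bool()
        # add the sampled margin only to each sample's target-class angle
        logits = torch.where(one_hot, torch.cos(theta + margins), cos)
        return F.cross_entropy(s * logits, labels)

    emb = torch.randn(8, 128, requires_grad=True)
    W = torch.randn(100, 128, requires_grad=True)
    elastic_arcface_loss(emb, W, torch.randint(0, 100, (8,))).backward()
    ```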
    KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation. (arXiv:2109.10504v1 [cs.CV])
    (0 min) Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy relying on external object detectors to encode images in a multi-modal Transformer framework, which suffer from restrictive object concept space, limited image context and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision: 1.) Object-guided masked vision modeling task focuses on enforcing object-aware representation learning in the multi-modal Transformer; 2.) Phrase-region alignment task aims to improve cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performances over the existing pretraining strategies. The code is available in supplementary materials.
    Natural Language Video Localization with Learnable Moment Proposals. (arXiv:2109.10678v1 [cs.CV])
    (0 min) Given an untrimmed video and a natural language query, Natural Language Video Localization (NLVL) aims to identify the video moment described by the query. To address this task, existing methods can be roughly divided into two groups: 1) propose-and-rank models first define a set of hand-designed moment candidates and then find the best-matching one; 2) proposal-free models directly predict the two temporal boundaries of the referential moment from frames. Currently, almost all propose-and-rank methods have inferior performance to their proposal-free counterparts. In this paper, we argue that the propose-and-rank approach is underestimated due to its predefined design: 1) hand-designed rules are hard to guarantee complete coverage of the targeted segments; 2) densely sampled candidate moments cause redundant computation and degrade the performance of the ranking process. To this end, we propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals. The position and length of these proposals are dynamically adjusted during the training process. Moreover, a boundary-aware loss has been proposed to leverage frame-level information and further improve the performance. Extensive ablations on two challenging NLVL benchmarks demonstrate the effectiveness of LPNet over existing state-of-the-art methods.
    Deep Variational Clustering Framework for Self-labeling of Large-scale Medical Images. (arXiv:2109.10777v1 [cs.CV])
    (0 min) We propose a Deep Variational Clustering (DVC) framework for unsupervised representation learning and clustering of large-scale medical images. DVC simultaneously learns the multivariate Gaussian posterior through the probabilistic convolutional encoder and the likelihood distribution with the probabilistic convolutional decoder, and optimizes cluster label assignment. Here, the learned multivariate Gaussian posterior captures the latent distribution of a large set of unlabeled images. Then, we perform unsupervised clustering on top of the variational latent space using a clustering loss. In this approach, the probabilistic decoder helps to prevent the distortion of data points in the latent space and to preserve the local structure of the data-generating distribution. The training process can be considered a self-training process that refines the latent space while iteratively optimizing cluster assignments. We evaluated our proposed framework on three public datasets that represent different medical imaging modalities. Our experimental results show that our proposed framework generalizes better across different datasets. It achieves compelling results on several medical imaging benchmarks. Thus, our approach offers potential advantages over conventional deep unsupervised learning in real-world applications. The source code of the method and all the experiments are available publicly at: https://github.com/csfarzin/DVC
    An Ultra-Fast Method for Simulation of Realistic Ultrasound Images. (arXiv:2109.10353v1 [eess.IV])
    (0 min) Convolutional neural networks (CNNs) have attracted a rapidly growing interest in a variety of different processing tasks in the medical ultrasound community. However, the performance of CNNs is highly reliant on both the amount and fidelity of the training data. Therefore, scarce data is almost always a concern, particularly in the medical field, where clinical data is not easily accessible. The utilization of synthetic data is a popular approach to address this challenge. However, simulating a large number of images using packages such as Field II is time-consuming, and the distribution of simulated images is far from that of the real images. Herein, we introduce a novel ultra-fast ultrasound image simulation method based on the Fourier transform and evaluate its performance in a lesion segmentation task. We demonstrate that data augmentation using the images generated by the proposed method substantially outperforms Field II in terms of Dice similarity coefficient, while the simulation is almost 36000 times faster (both on CPU).
    HybridSDF: Combining Free Form Shapes and Geometric Primitives for effective Shape Manipulation. (arXiv:2109.10767v1 [cs.CV])
    (0 min) CAD modeling typically involves the use of simple geometric primitives, whereas recent advances in deep-learning based 3D surface modeling have opened new shape design avenues. Unfortunately, these advances have not yet been accepted by the CAD community because they cannot be integrated into engineering workflows. To remedy this, we propose a novel approach to effectively combine geometric primitives and free-form surfaces represented by implicit surfaces for accurate modeling that preserves interpretability, enforces consistency, and enables easy manipulation.
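    One common way to combine an analytic primitive with a learned free-form surface in the implicit setting is a Boolean union taken as the pointwise minimum of signed distances. The sketch below only illustrates that composition rule; the sphere primitive, the tiny MLP, and the union operation are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

def sphere_sdf(points, center, radius):
    """Analytic signed distance to a sphere primitive (negative inside)."""
    return torch.linalg.norm(points - center, dim=-1) - radius

class FreeFormSDF(nn.Module):
    """Tiny MLP standing in for a learned DeepSDF-style implicit surface."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, points):              # points: (N, 3)
        return self.net(points).squeeze(-1)

def hybrid_sdf(points, free_form, center, radius):
    # Boolean union: a point lies inside the combined shape if it is inside
    # either the primitive or the free-form part.
    return torch.minimum(sphere_sdf(points, center, radius), free_form(points))
```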
    NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media. (arXiv:2104.05893v2 [cs.CV] UPDATED)
    (0 min) Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes. We are motivated by the threat scenario where an image is used out of context to support a certain narrative. While some prior datasets for detecting image-text inconsistency generate samples via text manipulation, we propose a dataset where both image and text are unmanipulated but mismatched. We introduce several strategies for automatically retrieving convincing images for a given caption, capturing cases with inconsistent entities or semantic context. Our large-scale automatically generated NewsCLIPpings Dataset: (1) demonstrates that machine-driven image repurposing is now a realistic threat, and (2) provides samples that represent challenging instances of mismatch between text and image in news that are able to mislead humans. We benchmark several state-of-the-art multimodal models on our dataset and analyze their performance across different pretraining domains and visual backbones.
    Towards a Real-Time Facial Analysis System. (arXiv:2109.10393v1 [cs.CV])
    (0 min) Facial analysis is an active research area in computer vision, with many practical applications. Most of the existing studies focus on addressing one specific task and maximizing its performance. For a complete facial analysis system, one needs to solve these tasks efficiently to ensure a smooth experience. In this work, we present a system-level design of a real-time facial analysis system. With a collection of deep neural networks for object detection, classification, and regression, the system recognizes age, gender, facial expression, and facial similarity for each person that appears in the camera view. We investigate the parallelization and interplay of individual tasks. Results on common off-the-shelf architecture show that the system's accuracy is comparable to the state-of-the-art methods, and the recognition speed satisfies real-time requirements. Moreover, we propose a multitask network for jointly predicting the first three attributes, i.e., age, gender, and facial expression. Source code and trained models are available at https://github.com/mahehu/TUT-live-age-estimator.
    Joint Optical Neuroimaging Denoising with Semantic Tasks. (arXiv:2109.10499v1 [eess.IV])
    (0 min) Optical neuroimaging is a vital tool for understanding the brain structure and the connections between regions and nuclei. However, the image noise introduced in the sample preparation and the imaging system hinders the extraction of possible knowledge from the dataset, so denoising is usually necessary for optical neuroimaging. Supervised denoising methods often outperform unsupervised ones, but training supervised denoising models requires corresponding clean labels, which are not always available due to the high labeling cost. On the other hand, semantic labels, such as located soma positions, reconstructed neuronal fibers, and nuclei segmentation results, are generally available and accumulated from everyday neuroscience research. This work connects a supervised denoising model and a semantic segmentation model to form an end-to-end model, which can make use of the semantic labels while still providing a denoised image as an intermediate product. We use both supervised and self-supervised models for the denoising and introduce a new cost term for the joint denoising and segmentation setup. We test the proposed approach on both synthetic data and real-world data, including an optical neuroimaging dataset and an electron microscope dataset. The results show that the joint denoising result outperforms the one obtained using the denoising method alone, and that the joint model benefits the segmentation and other downstream tasks as well.
    Automatic Plane Adjustment of Orthopedic Intra-operative Flat Panel Detector CT-Volumes. (arXiv:2109.10731v1 [eess.IV])
    (0 min) Purpose: 3D acquisitions are often performed to assess the result in orthopedic trauma surgery. With a mobile C-arm system, these acquisitions can be performed intra-operatively, which reduces the number of required revision surgeries. However, due to the operating room setup, the acquisitions typically cannot be performed such that the acquired volumes are aligned to the anatomical regions. Thus, the multiplanar reconstructed (MPR) planes need to be adjusted manually during the review of the volume. In this paper, we present a detailed study of multi-task learning (MTL) regression networks to estimate the parameters of the MPR planes. Approach: First, various mathematical descriptions of rotation, including the Euler angle, quaternion, and matrix representations, are reviewed. Then, three different MTL network architectures based on PoseNet are compared with a single-task learning network. Results: Using a matrix description rather than the Euler angle description, the accuracy of the regressed normals improves from $7.7^{\circ}$ to $7.3^{\circ}$ in the mean value for single anatomies. The multi-head approach improves the regression of the plane position from $7.4mm$ to $6.1mm$, while the orientation does not benefit from this approach. Conclusions: The results show that a multi-head approach can lead to slightly better results than individual task networks. The most important benefit of the MTL approach is that it is a single network for standard plane regression for all body regions with a reduced number of stored parameters.
    Adversarial Attacks on Camera-LiDAR Models for 3D Car Detection. (arXiv:2103.09448v2 [cs.CV] UPDATED)
    (0 min) Most autonomous vehicles (AVs) rely on LiDAR and RGB camera sensors for perception. Using these point cloud and image data, perception models based on deep neural nets (DNNs) have achieved state-of-the-art performance in 3D detection. The vulnerability of DNNs to adversarial attacks has been heavily investigated in the RGB image domain and more recently in the point cloud domain, but rarely in both domains simultaneously. Multi-modal perception systems used in AVs can be divided into two broad types: cascaded models which use each modality independently, and fusion models which learn from different modalities simultaneously. We propose a universal and physically realizable adversarial attack for each type, and study and contrast their respective vulnerabilities to attacks. We place a single adversarial object with specific shape and texture on top of a car with the objective of making this car evade detection. Evaluating on the popular KITTI benchmark, our adversarial object made the host vehicle escape detection by each model type more than 50% of the time. The dense RGB input contributed more to the success of the adversarial attacks on both cascaded and fusion models.
    Few-shot Learning with Global Relatedness Decoupled-Distillation. (arXiv:2107.05583v2 [cs.CV] UPDATED)
    (0 min) Despite the success that metric learning based approaches have achieved in few-shot learning, recent works reveal the ineffectiveness of their episodic training mode. In this paper, we point out two potential reasons for this problem: 1) the random episodic labels can only provide limited supervision information, while the relatedness information between the query and support samples is not fully exploited; 2) the meta-learner is usually constrained by the limited contextual information of the local episode. To overcome these problems, we propose a new Global Relatedness Decoupled-Distillation (GRDD) method using the global category knowledge and the Relatedness Decoupled-Distillation (RDD) strategy. Our GRDD learns new visual concepts quickly by imitating the habit of humans, i.e. learning from the deep knowledge distilled from the teacher. More specifically, we first train a global learner on the entire base subset using category labels as supervision to leverage the global context information of the categories. Then, the well-trained global learner is used to simulate the query-support relatedness in global dependencies. Finally, the distilled global query-support relatedness is explicitly used to train the meta-learner using the RDD strategy, with the goal of making the meta-learner more discriminative. The RDD strategy aims to decouple the dense query-support relatedness into the groups of sparse decoupled relatedness. Moreover, only the relatedness of a single support sample with other query samples is considered in each group. By distilling the sparse decoupled relatedness group by group, sharper relatedness can be effectively distilled to the meta-learner, thereby facilitating the learning of a discriminative meta-learner. We conduct extensive experiments on the miniImagenet and CIFAR-FS datasets, which show the state-of-the-art performance of our GRDD method.
    Differentiable Surface Triangulation. (arXiv:2109.10695v1 [cs.CV])
    (0 min) Triangle meshes remain the most popular data representation for surface geometry. This ubiquitous representation is essentially a hybrid one that decouples continuous vertex locations from the discrete topological triangulation. Unfortunately, the combinatorial nature of the triangulation prevents taking derivatives over the space of possible meshings of any given surface. As a result, to date, mesh processing and optimization techniques have been unable to truly take advantage of modular gradient descent components of modern optimization frameworks. In this work, we present a differentiable surface triangulation that enables optimization for any per-vertex or per-face differentiable objective function over the space of underlying surface triangulations. Our method builds on the result that any 2D triangulation can be achieved by a suitably perturbed weighted Delaunay triangulation. We translate this result into a computational algorithm by proposing a soft relaxation of the classical weighted Delaunay triangulation and optimizing over vertex weights and vertex locations. We extend the algorithm to 3D by decomposing shapes into developable sets and differentiably meshing each set with suitable boundary constraints. We demonstrate the efficacy of our method on various planar and surface meshes on a range of difficult-to-optimize objective functions. Our code can be found online: https://github.com/mrakotosaon/diff-surface-triangulation.
    DyStyle: Dynamic Neural Network for Multi-Attribute-Conditioned Style Editing. (arXiv:2109.10737v1 [cs.CV])
    (0 min) Great diversity and photorealism have been achieved by unconditional GAN frameworks such as StyleGAN and its variations. In the meantime, persistent efforts have been made to enhance the semantic controllability of StyleGANs. For example, a dozen style manipulation methods have recently been proposed to perform attribute-conditioned style editing. Although some of these methods work well in manipulating the style codes along one attribute, the control accuracy when jointly manipulating multiple attributes tends to be problematic. To address these limitations, we propose a Dynamic Style Manipulation Network (DyStyle), whose structure and parameters vary by input samples, to perform nonlinear and adaptive manipulation of latent codes for flexible and precise attribute control. Additionally, a novel easy-to-hard training procedure is introduced for efficient and stable training of the DyStyle network. Extensive experiments have been conducted on faces and other objects. As a result, our approach demonstrates fine-grained disentangled edits along multiple numeric and binary attributes. Qualitative and quantitative comparisons with existing style manipulation methods verify the superiority of our method in terms of attribute control accuracy and identity preservation without compromising photorealism. The advantage of our method is even more significant for joint multi-attribute control. The source codes are made publicly available at https://github.com/phycvgan/DyStyle.
    Optimising the selection of samples for robust lidar camera calibration. (arXiv:2103.12287v2 [cs.CV] UPDATED)
    (0 min) We propose a robust calibration pipeline that optimises the selection of calibration samples for the estimation of calibration parameters that fit the entire scene. We minimise user error by automating the data selection process according to a metric, called Variability of Quality (VOQ) that gives a score to each calibration set of samples. We show that this VOQ score is correlated with the estimated calibration parameter's ability to generalise well to the entire scene, thereby overcoming the overfitting problems of existing calibration algorithms. Our approach has the benefits of simplifying the calibration process for practitioners of any calibration expertise level and providing an objective measure of the quality for our calibration pipeline's input and output data. We additionally use a novel method of assessing the accuracy of the calibration parameters. It involves computing reprojection errors for the entire scene to ensure that the parameters are well fitted to all features in the scene. Our proposed calibration pipeline takes 90s, and obtains an average reprojection error of 1-1.2cm, with standard deviation of 0.4-0.5cm over 46 poses evenly distributed in a scene. This process has been validated by experimentation on a high resolution, software definable lidar, Baraja Spectrum-Scan; and a low, fixed resolution lidar, Velodyne VLP-16. We have shown that despite the vast differences in lidar technologies, our proposed approach manages to estimate robust calibration parameters for both. Our code and data set used for this paper are made available as open-source.
    Self-supervised Mean Teacher for Semi-supervised Chest X-ray Classification. (arXiv:2103.03629v2 [cs.CV] UPDATED)
    (0 min) The training of deep learning models generally requires a large amount of annotated data for effective convergence and generalisation. However, obtaining high-quality annotations is a laboursome and expensive process due to the need of expert radiologists for the labelling task. The study of semi-supervised learning in medical image analysis is then of crucial importance given that it is much less expensive to obtain unlabelled images than to acquire images labelled by expert radiologists. Essentially, semi-supervised methods leverage large sets of unlabelled data to enable better training convergence and generalisation than using only the small set of labelled images. In this paper, we propose Self-supervised Mean Teacher for Semi-supervised (S$^2$MTS$^2$) learning that combines self-supervised mean-teacher pre-training with semi-supervised fine-tuning. The main innovation of S$^2$MTS$^2$ is the self-supervised mean-teacher pre-training based on the joint contrastive learning, which uses an infinite number of pairs of positive query and key features to improve the mean-teacher representation. The model is then fine-tuned using the exponential moving average teacher framework trained with semi-supervised learning. We validate S$^2$MTS$^2$ on the multi-label classification problems from Chest X-ray14 and CheXpert, and the multi-class classification from ISIC2018, where we show that it outperforms the previous SOTA semi-supervised learning methods by a large margin.
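    For context, the exponential-moving-average teacher update that mean-teacher frameworks rely on is a one-liner per parameter; a minimal sketch follows, where the momentum value is a typical placeholder rather than the paper's setting (and model buffers such as batch-norm statistics are ignored for brevity).

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Move the teacher weights a small step towards the student weights
    after every optimisation step on the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```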
    PVT: Point-Voxel Transformer for 3D Deep Learning. (arXiv:2108.06076v2 [cs.CV] UPDATED)
    (0 min) In this paper, we present an efficient and high-performance neural architecture, termed Point-Voxel Transformer (PVT), for 3D deep learning, which deeply integrates both 3D voxel-based and point-based self-attention computation to learn more discriminative features from 3D data. Specifically, we conduct multi-head self-attention (MSA) computation in voxels to obtain an efficient learning pattern and coarse-grained local features, while performing self-attention in points to provide finer-grained information about the global context. In addition, to reduce the cost of MSA computation with high efficiency, we design a cyclic shifted boxing scheme that limits the MSA computation to non-overlapping local boxes while preserving cross-box connections. Evaluated on the classification benchmark, our method not only achieves a state-of-the-art accuracy of 94.0% (no voting) but also outperforms previous Transformer-based models with a 7x measured speedup on average. On part and semantic segmentation, our model also obtains strong performance (86.5% and 68.2% mIoU, respectively). For the 3D object detection task, we replace the primitives in Frustum PointNet with the PVT block and achieve an improvement of 8.6% AP.
    A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric. (arXiv:2101.09028v3 [cs.CV] UPDATED)
    (0 min) Temporal Sentence Grounding in Videos (TSGV), i.e., grounding a natural language sentence describing complex human activities in a long and untrimmed video sequence, has received unprecedented attention over the last few years. Although each newly proposed method can plausibly achieve better performance than previous ones, current TSGV models still tend to capture the moment annotation biases and fail to take full advantage of multi-modal inputs. Even more incredibly, several extremely simple baselines without training can also achieve state-of-the-art performance. In this paper, we take a closer look at the existing evaluation protocols for TSGV, and find that both the prevailing dataset splits and evaluation metrics are the devils that cause unreliable benchmarking. To this end, we propose to re-organize two widely-used TSGV benchmarks (ActivityNet Captions and Charades-STA). Specifically, we deliberately make the ground-truth moment distributions different in the training and test splits, i.e., out-of-distribution (OOD) testing. Meanwhile, we introduce a new evaluation metric, dR@n,IoU@m, to calibrate the basic IoU scores by penalizing bias-influenced moment predictions and to alleviate the inflated evaluations caused by dataset annotation biases such as overlong ground-truth moments. Under our new evaluation protocol, we conduct extensive experiments and ablation studies on eight state-of-the-art TSGV methods. All the results demonstrate that the re-organized dataset splits and new metric can better monitor progress in TSGV. Our reorganized datasets are available at https://github.com/yytzsy/grounding_changing_distribution.
    AI in Osteoporosis. (arXiv:2109.10478v1 [cs.CV])
    (0 min) In this chapter we explore and evaluate methods for trabecular bone characterization and osteoporosis diagnosis with increased interest in sparse approximations. We first describe texture representation and classification techniques, patch-based methods such as Bag of Keypoints, and more recent deep neural networks. Then we introduce the concept of sparse representations for pattern recognition and we detail integrative sparse analysis methods and classifier decision fusion methods. We report cross-validation results on osteoporosis datasets of bone radiographs and compare the results produced by the different categories of methods. We conclude that advances in the AI and machine learning fields have enabled the development of methods that can be used as diagnostic tools in clinical settings.
    An Attention Free Transformer. (arXiv:2105.14103v2 [cs.LG] UPDATED)
    (0 min) We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible to both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the idea of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.
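    The operation described above can be written compactly; the sketch below is a didactic AFT-full-style layer based on the formula in the abstract (the layer sizes, the learned pairwise position-bias table, and the numerical stabilisation are assumptions about implementation details).

```python
import torch
import torch.nn as nn

class AFTFull(nn.Module):
    """Attention-free layer: keys are combined with learned pairwise position
    biases, the result weights the values, and the query gates the output
    element-wise (no dot-product attention)."""
    def __init__(self, dim, max_len):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x):                           # x: (B, T, d)
        B, T, d = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        w = self.pos_bias[:T, :T]                   # (T, T) pairwise biases
        # Y_t = sigmoid(Q_t) * sum_t' exp(K_t' + w_{t,t'}) V_t'
        #                    / sum_t' exp(K_t' + w_{t,t'})
        exp_k = torch.exp(k - k.amax(dim=1, keepdim=True))   # stabilised
        exp_w = torch.exp(w - w.amax(dim=1, keepdim=True))
        num = torch.matmul(exp_w, exp_k * v)        # (B, T, d)
        den = torch.matmul(exp_w, exp_k)            # (B, T, d)
        return torch.sigmoid(q) * num / den
```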
    Robust marginalization of baryonic effects for cosmological inference at the field level. (arXiv:2109.10360v1 [astro-ph.CO])
    (0 min) We train neural networks to perform likelihood-free inference from $(25\,h^{-1}{\rm Mpc})^2$ 2D maps containing the total mass surface density from thousands of hydrodynamic simulations of the CAMELS project. We show that the networks can extract information beyond one-point functions and power spectra from all resolved scales ($\gtrsim 100\,h^{-1}{\rm kpc}$) while performing a robust marginalization over baryonic physics at the field level: the model can infer the value of $\Omega_{\rm m} (\pm 4\%)$ and $\sigma_8 (\pm 2.5\%)$ from simulations completely different to the ones used to train it.
    PSGCNet: A Pyramidal Scale and Global Context Guided Network for Dense Object Counting in Remote Sensing Images. (arXiv:2012.03597v2 [cs.CV] UPDATED)
    (0 min) Object counting, which aims to count the accurate number of object instances in images, has been attracting more and more attention. However, challenges such as large scale variation, complex background interference, and non-uniform density distribution greatly limit the counting accuracy, particularly striking in remote sensing imagery. To mitigate the above issues, this paper proposes a novel framework for dense object counting in remote sensing images, which incorporates a pyramidal scale module (PSM) and a global context module (GCM), dubbed PSGCNet, where PSM is used to adaptively capture multi-scale information and GCM is to guide the model to select suitable scales generated from PSM. Moreover, a reliable supervision manner improved from Bayesian and Counting loss (BCL) is utilized to learn the density probability and then compute the count expectation at each annotation. It can relieve non-uniform density distribution to a certain extent. Extensive experiments on four remote sensing counting datasets demonstrate the effectiveness of the proposed method and the superiority of it compared with state-of-the-arts. Additionally, experiments extended on four commonly used crowd counting datasets further validate the generalization ability of the model. Code is available at https://github.com/gaoguangshuai/PSGCNet.
    Less is More: Learning from Synthetic Data with Fine-grained Attributes for Person Re-Identification. (arXiv:2109.10498v1 [cs.CV])
    (0 min) Person re-identification (re-ID) plays an important role in applications such as public security and video surveillance. Recently, learning from synthetic data, which benefits from the popularity of synthetic data engines, has attracted attention from both academia and the public eye. However, existing synthetic datasets are limited in quantity, diversity and realism, and cannot be efficiently used for the generalizable re-ID problem. To address this challenge, we construct and label a large-scale synthetic person dataset named FineGPR with a fine-grained attribute distribution. Moreover, aiming to fully exploit the potential of FineGPR and promote efficient training from millions of synthetic samples, we propose an attribute analysis pipeline, AOST, which learns the attribute distribution in the target domain and then applies a style transfer network to eliminate the gap between synthetic and real-world data, and thus can be freely deployed to new scenarios. Experiments conducted on benchmarks demonstrate that FineGPR with AOST outperforms (or is on par with) existing real and synthetic datasets, which suggests its feasibility for re-ID and proves the proverbial less-is-more principle. We hope this fine-grained dataset can advance research towards re-ID in real scenarios.
    The First Vision For Vitals (V4V) Challenge for Non-Contact Video-Based Physiological Estimation. (arXiv:2109.10471v1 [cs.CY])
    (0 min) Telehealth has the potential to offset the high demand for help during public health emergencies, such as the COVID-19 pandemic. Remote Photoplethysmography (rPPG) - the problem of non-invasively estimating blood volume variations in the microvascular tissue from video - would be well suited for these situations. Over the past few years a number of research groups have made rapid advances in remote PPG methods for estimating heart rate from digital video and obtained impressive results. How these various methods compare in naturalistic conditions, where spontaneous behavior, facial expressions, and illumination changes are present, is relatively unknown. To enable comparisons among alternative methods, the 1st Vision for Vitals Challenge (V4V) presented a novel dataset containing high-resolution videos time-locked with varied physiological signals from a diverse population. In this paper, we outline the evaluation protocol, the data used, and the results. V4V is to be held in conjunction with the 2021 International Conference on Computer Vision.
    Uncertainty-Aware Training for Cardiac Resynchronisation Therapy Response Prediction. (arXiv:2109.10641v1 [eess.IV])
    (0 min) Evaluation of predictive deep learning (DL) models beyond conventional performance metrics has become increasingly important for applications in sensitive environments like healthcare. Such models might have the capability to encode and analyse large sets of data but they often lack comprehensive interpretability methods, preventing clinical trust in predictive outcomes. Quantifying uncertainty of a prediction is one way to provide such interpretability and promote trust. However, relatively little attention has been paid to how to include such requirements into the training of the model. In this paper we: (i) quantify the data (aleatoric) and model (epistemic) uncertainty of a DL model for Cardiac Resynchronisation Therapy response prediction from cardiac magnetic resonance images, and (ii) propose and perform a preliminary investigation of an uncertainty-aware loss function that can be used to retrain an existing DL image-based classification model to encourage confidence in correct predictions and reduce confidence in incorrect predictions. Our initial results are promising, showing a significant increase in the (epistemic) confidence of true positive predictions, with some evidence of a reduction in false negative confidence.
    Coast Sargassum Level Estimation from Smartphone Pictures. (arXiv:2109.10390v1 [cs.CV])
    (0 min) Since 2011, significant and atypical arrivals of two species of surface-dwelling algae, Sargassum natans and Sargassum fluitans, have been detected in the Mexican Caribbean. This massive accumulation of algae has had a great environmental and economic impact. Therefore, for the government, ecologists, and local businesses, it is important to keep track of the amount of sargassum that arrives on the Caribbean coast. High-resolution satellite imagery is expensive or may be time delayed. Therefore, we propose to estimate the amount of sargassum based on ground-level smartphone photographs. From the computer vision perspective, the problem is quite difficult since no information about the 3D world is provided; consequently, we model it as a classification problem, where a set of five labels defines the amount. For this purpose, we have built a dataset with more than one thousand examples from public forums such as Facebook or Instagram, and we have tested several state-of-the-art convolutional networks. As a result, the VGG network trained with fine-tuning showed the best performance. Even though the achieved accuracy could be improved with more examples, the current prediction distribution is narrow, so the predictions are adequate for keeping a record and taking quick ecological actions.
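    The fine-tuning setup described above is a standard transfer-learning recipe; a minimal sketch with torchvision follows, where the choice of VGG-16, the frozen backbone, and the five-way head are assumptions about the details.

```python
import torch.nn as nn
from torchvision import models

def build_sargassum_classifier(num_levels=5, freeze_features=True):
    """ImageNet-pretrained VGG, re-headed to predict one of five
    sargassum-amount labels from a ground-level photograph."""
    model = models.vgg16(pretrained=True)
    if freeze_features:
        for p in model.features.parameters():
            p.requires_grad = False          # fine-tune the classifier head only
    in_features = model.classifier[6].in_features
    model.classifier[6] = nn.Linear(in_features, num_levels)
    return model
```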
    Early Lane Change Prediction for Automated Driving Systems Using Multi-Task Attention-based Convolutional Neural Networks. (arXiv:2109.10742v1 [cs.CV])
    (0 min) Lane change (LC) is one of the safety-critical manoeuvres in highway driving according to various road accident records. Thus, reliably predicting such manoeuvre in advance is critical for the safe and comfortable operation of automated driving systems. The majority of previous studies rely on detecting a manoeuvre that has been already started, rather than predicting the manoeuvre in advance. Furthermore, most of the previous works do not estimate the key timings of the manoeuvre (e.g., crossing time), which can actually yield more useful information for the decision making in the ego vehicle. To address these shortcomings, this paper proposes a novel multi-task model to simultaneously estimate the likelihood of LC manoeuvres and the time-to-lane-change (TTLC). In both tasks, an attention-based convolutional neural network (CNN) is used as a shared feature extractor from a bird's eye view representation of the driving environment. The spatial attention used in the CNN model improves the feature extraction process by focusing on the most relevant areas of the surrounding environment. In addition, two novel curriculum learning schemes are employed to train the proposed approach. The extensive evaluation and comparative analysis of the proposed method in existing benchmark datasets show that the proposed method outperforms state-of-the-art LC prediction models, particularly considering long-term prediction performance.
    Finding Facial Forgery Artifacts with Parts-Based Detectors. (arXiv:2109.10688v1 [cs.CV])
    (0 min) Manipulated videos, especially those where the identity of an individual has been modified using deep neural networks, are becoming an increasingly relevant threat in the modern day. In this paper, we seek to develop a generalizable, explainable solution to detecting these manipulated videos. To achieve this, we design a series of forgery detection systems that each focus on one individual part of the face. These parts-based detection systems, which can be combined and used together in a single architecture, meet all of our desired criteria - they generalize effectively between datasets and give us valuable insights into what the network is looking at when making its decision. We thus use these detectors to perform detailed empirical analysis on the FaceForensics++, Celeb-DF, and Facebook Deepfake Detection Challenge datasets, examining not just what the detectors find but also collecting and analyzing useful related statistics on the datasets themselves.
    A Quantitative Comparison of Epistemic Uncertainty Maps Applied to Multi-Class Segmentation. (arXiv:2109.10702v1 [cs.CV])
    (0 min) Uncertainty assessment has gained rapid interest in medical image analysis. A popular technique to compute epistemic uncertainty is the Monte-Carlo (MC) dropout technique. From a network with MC dropout and a single input, multiple outputs can be sampled. Various methods can be used to obtain epistemic uncertainty maps from those multiple outputs. In the case of multi-class segmentation, the number of methods is even larger, as epistemic uncertainty can be computed voxelwise per class or voxelwise per image. This paper highlights a systematic approach to define and quantitatively compare those methods in two different contexts: class-specific epistemic uncertainty maps (one value per image, voxel and class) and combined epistemic uncertainty maps (one value per image and voxel). We applied this quantitative analysis to a multi-class segmentation of the carotid artery lumen and vessel wall, on a multi-center, multi-scanner, multi-sequence dataset of magnetic resonance (MR) images. We validated our analysis over 144 sets of hyperparameters of a model. Our main analysis considers the relationship between the order of the voxels sorted according to their epistemic uncertainty values and the misclassification of the prediction. Under this consideration, the comparison of combined uncertainty maps reveals that the multi-class entropy and the multi-class mutual information statistically outperform the other combined uncertainty maps under study. In a class-specific scenario, the one-versus-all entropy statistically outperforms the class-wise entropy, the class-wise variance and the one-versus-all mutual information. The class-wise entropy statistically outperforms the other class-specific uncertainty maps in terms of calibration. We made a Python package available to reproduce our analysis on different data and tasks.
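    For reference, the two combined maps the comparison favours, multi-class (predictive) entropy and multi-class mutual information, can be computed directly from the MC-dropout samples; a minimal NumPy sketch follows, with the array layout an assumption.

```python
import numpy as np

def combined_uncertainty_maps(mc_probs, eps=1e-12):
    """mc_probs: (S, C, H, W) softmax outputs from S stochastic forward passes.
    Returns voxelwise predictive entropy and mutual information maps."""
    mean_p = mc_probs.mean(axis=0)                                    # (C, H, W)
    entropy = -(mean_p * np.log(mean_p + eps)).sum(axis=0)            # total uncertainty
    expected_entropy = -(mc_probs * np.log(mc_probs + eps)).sum(axis=1).mean(axis=0)
    mutual_info = entropy - expected_entropy                          # epistemic part
    return entropy, mutual_info
```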
    Incorporating Data Uncertainty in Object Tracking Algorithms. (arXiv:2109.10521v1 [eess.SY])
    (0 min) Methodologies for incorporating the uncertainties characteristic of data-driven object detectors into object tracking algorithms are explored. Object tracking methods rely on measurement error models, typically in the form of measurement noise, false positive rates, and missed detection rates. Each of these quantities, in general, can be dependent on object or measurement location. However, for detections generated from neural-network processed camera inputs, these measurement error statistics are not sufficient to represent the primary source of errors, namely a dissimilarity between run-time sensor input and the training data upon which the detector was trained. To this end, we investigate incorporating data uncertainty into object tracking methods so as to improve the ability to track objects, particularly those that are out-of-distribution w.r.t. the training data. The proposed methodologies are validated on an object tracking benchmark as well as in experiments with a real autonomous aircraft.
    Batch norm with entropic regularization turns deterministic autoencoders into generative models. (arXiv:2002.10631v2 [cs.LG] UPDATED)
    (0 min) The variational autoencoder is a well defined deep generative model that utilizes an encoder-decoder framework where an encoding neural network outputs a non-deterministic code for reconstructing an input. The encoder achieves this by sampling from a distribution for every input, instead of outputting a deterministic code per input. The great advantage of this process is that it allows the use of the network as a generative model for sampling from the data distribution beyond provided samples for training. We show in this work that utilizing batch normalization as a source for non-determinism suffices to turn deterministic autoencoders into generative models on par with variational ones, so long as we add a suitable entropic regularization to the training objective.
    Effective Model Compression via Stage-wise Pruning. (arXiv:2011.04908v2 [cs.CV] UPDATED)
    (0 min) Automated Machine Learning (Auto-ML) pruning methods aim at automatically searching for a pruning strategy to reduce the computational complexity of deep Convolutional Neural Networks (deep CNNs). However, some previous work found that the results of many Auto-ML pruning methods cannot even surpass the results of the uniform pruning method. In this paper, the ineffectiveness of Auto-ML pruning, which is caused by unfull and unfair training of the supernet, is shown. A deep supernet suffers from unfull training because it contains too many candidates. To overcome the unfull training, a stage-wise pruning (SWP) method is proposed, which splits a deep supernet into several stage-wise supernets to reduce the number of candidates and utilizes inplace distillation to supervise the stage training. Besides, a wide supernet is hit by unfair training since the sampling probability of each channel is unequal. Therefore, the fullnet and the tinynet are sampled in each training iteration to ensure that each channel can be sufficiently trained. Remarkably, the proxy performance of the subnets trained with SWP is closer to the actual performance than in most previous Auto-ML pruning work. Experiments show that SWP achieves state-of-the-art results on both CIFAR-10 and ImageNet under the mobile setting.
    Animal inspired Application of a Variant of Mel Spectrogram for Seismic Data Processing. (arXiv:2109.10733v1 [cs.CV])
    (0 min) Predicting disaster events from seismic data is of paramount importance and can save thousands of lives, especially in earthquake-prone areas and habitations around volcanic craters. The drastic rise in the number of seismic monitoring stations in recent years has allowed the collection of a huge quantity of data, outpacing the capacity of seismologists. Due to the complex nature of seismological data, it is often difficult for seismologists to detect subtle patterns with major implications. Machine learning algorithms have been demonstrated to be effective in classification and prediction tasks for seismic data. It is widely known that some animals can sense disasters like earthquakes from seismic signals well before the disaster strikes. The Mel spectrogram has been widely used for speech recognition as it scales the actual frequencies according to human hearing. In this paper, we propose a variant of the Mel spectrogram to scale the raw frequencies of seismic data to the hearing of such animals that can sense disasters from seismic signals. We use a computer vision algorithm along with clustering to classify unlabelled seismic data.
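    As a rough illustration, a rescaled mel-style filterbank over a seismic trace can be configured by restricting the frequency band; the sampling rate, number of bands, and animal-hearing band limits below are placeholders, not the paper's actual scaling.

```python
import numpy as np
import librosa

def animal_band_spectrogram(trace, sr=100, n_mels=64, fmin=0.1, fmax=40.0):
    """Mel-style spectrogram of a seismic trace with the filterbank confined
    to a low-frequency band assumed to approximate what disaster-sensing
    animals respond to (band limits are illustrative)."""
    spec = librosa.feature.melspectrogram(y=trace.astype(np.float32), sr=sr,
                                          n_mels=n_mels, fmin=fmin, fmax=fmax)
    return librosa.power_to_db(spec, ref=np.max)
```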
    SAT: 2D Semantics Assisted Training for 3D Visual Grounding. (arXiv:2105.11450v2 [cs.CV] UPDATED)
    (0 min) 3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region. Point clouds are sparse, noisy, and contain limited semantic information compared with 2D images. These inherent limitations make the 3D visual grounding problem more challenging. In this study, we propose 2D Semantics Assisted Training (SAT) that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding. The main idea is to learn auxiliary alignments between rich, clean 2D object representations and the corresponding objects or mentioned entities in 3D scenes. SAT takes 2D object semantics, i.e., object label, image feature, and 2D geometric feature, as the extra input in training but does not require such inputs during inference. By effectively utilizing 2D semantics in training, our approach boosts the accuracy on the Nr3D dataset from 37.7% to 49.2%, which significantly surpasses the non-SAT baseline with the identical network architecture and inference input. Our approach outperforms the state of the art by large margins on multiple 3D visual grounding datasets, i.e., +10.4% absolute accuracy on Nr3D, +9.9% on Sr3D, and +5.6% on ScanRef.
    DVC-P: Deep Video Compression with Perceptual Optimizations. (arXiv:2109.10849v1 [eess.IV])
    (0 min) Recent years have witnessed the significant development of learning-based video compression methods, which aim at optimizing objective or perceptual quality and bit rates. In this paper, we introduce deep video compression with perceptual optimizations (DVC-P), which aims at increasing perceptual quality of decoded videos. Our proposed DVC-P is based on Deep Video Compression (DVC) network, but improves it with perceptual optimizations. Specifically, a discriminator network and a mixed loss are employed to help our network trade off among distortion, perception and rate. Furthermore, nearest-neighbor interpolation is used to eliminate checkerboard artifacts which can appear in sequences encoded with DVC frameworks. Thanks to these two improvements, the perceptual quality of decoded sequences is improved. Experimental results demonstrate that, compared with the baseline DVC, our proposed method can generate videos with higher perceptual quality achieving 12.27% reduction in a perceptual BD-rate equivalent, on average.
    Hierarchical Multimodal Transformer to Summarize Videos. (arXiv:2109.10559v1 [cs.CV])
    (0 min) Although video summarization has achieved tremendous success benefiting from Recurrent Neural Networks (RNN), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits the performance. The Transformer is an effective model to deal with this problem, and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation and video captioning. Motivated by the great success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization, which can capture the dependencies among frames and shots, and summarize the video by exploiting the scene information formed by shots. Furthermore, we argue that both the audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as Hierarchical Multimodal Transformer (HMT). Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
    EAR-NET: Error Attention Refining Network For Retinal Vessel Segmentation. (arXiv:2107.01351v2 [eess.IV] UPDATED)
    (0 min) The precise detection of blood vessels in retinal images is crucial to the early diagnosis of retinal vascular diseases, e.g., diabetic, hypertensive and solar retinopathies. Existing works often fail in predicting abnormal areas, e.g., sudden brighter and darker areas, and are inclined to predict a pixel as background due to the significant class imbalance, leading to high accuracy and specificity but low sensitivity. To that end, we propose a novel error attention refining network (ERA-Net) that is capable of learning and predicting the potential false predictions in a two-stage manner for effective retinal vessel segmentation. In the refine stage, the proposed ERA-Net drives the model to focus on and refine the segmentation errors produced in the initial training stage. To achieve this, unlike most previous attention approaches that run in an unsupervised manner, we introduce a novel error attention mechanism which takes the differences between the ground truth and the initial segmentation masks as the ground truth to supervise the attention map learning. Experimental results demonstrate that our method achieves state-of-the-art performance on two common retinal blood vessel datasets.
    Pix2seq: A Language Modeling Framework for Object Detection. (arXiv:2109.10852v1 [cs.CV])
    (0 min) This paper presents Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we simply cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural net to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
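    A minimal sketch of the "detection as sequence generation" idea, quantising one box and its class label into discrete tokens; the number of bins, the coordinate ordering, and where class ids sit in the vocabulary are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def box_to_tokens(box, label, image_size, n_bins=1000):
    """Quantise one bounding box plus its class into 5 discrete tokens."""
    h, w = image_size
    ymin, xmin, ymax, xmax = box
    coords = np.array([ymin / h, xmin / w, ymax / h, xmax / w])
    coord_tokens = np.clip(np.round(coords * (n_bins - 1)).astype(int),
                           0, n_bins - 1)
    class_token = n_bins + int(label)      # class vocabulary placed after the bins
    return coord_tokens.tolist() + [class_token]

# e.g. box_to_tokens((12, 30, 180, 250), label=3, image_size=(256, 256))
```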
    Self-Training Based Unsupervised Cross-Modality Domain Adaptation for Vestibular Schwannoma and Cochlea Segmentation. (arXiv:2109.10674v1 [eess.IV])
    (0 min) With the advances of deep learning, many medical image segmentation studies achieve human-level performance in the fully supervised condition. However, it is extremely expensive to acquire annotations for every data point in medical fields, especially for magnetic resonance images (MRI) that comprise many different contrasts. Unsupervised methods can alleviate this problem; however, a performance drop is inevitable compared to fully supervised methods. In this work, we propose a self-training based unsupervised-learning framework that performs automatic segmentation of Vestibular Schwannoma (VS) and cochlea on high-resolution T2 scans. Our method consists of 4 main stages: 1) VS-preserving contrast conversion from contrast-enhanced T1 scans to high-resolution T2 scans, 2) training segmentation on generated T2 scans with annotations on T1 scans, 3) inferring pseudo-labels on non-annotated real T2 scans, and 4) boosting the generalizability of VS and cochlea segmentation by training with combined data (i.e., real T2 scans with pseudo-labels and generated T2 scans with true annotations). Our method showed a mean Dice score and Average Symmetric Surface Distance (ASSD) of 0.8570 (0.0705) and 0.4970 (0.3391) for VS, and 0.8446 (0.0211) and 0.1513 (0.0314) for cochlea, on the CrossMoDA2021 challenge validation phase leaderboard, outperforming most other approaches.
    Anticipative Video Transformer. (arXiv:2106.02036v2 [cs.CV] UPDATED)
    (0 min) We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal aggregation strategies, AVT has the advantage of both maintaining the sequential progression of observed actions while still capturing long-range dependencies--both critical for the anticipation task. Through extensive experiments, we show that AVT obtains the best reported performance on four popular action anticipation benchmarks: EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads; and it wins first place in the EpicKitchens-100 CVPR'21 challenge.
    Mask Guided Attention For Fine-Grained Patchy Image Classification. (arXiv:2102.02771v2 [cs.CV] UPDATED)
    (0 min) In this work, we present a novel mask guided attention (MGA) method for fine-grained patchy image classification. The key challenge of fine-grained patchy image classification is twofold: ultra-fine-grained inter-category variances among objects and very few data available for training. This motivates us to consider employing a more useful supervision signal to train a discriminative model with limited training samples. Specifically, the proposed MGA integrates a pre-trained semantic segmentation model that produces an auxiliary supervision signal, i.e., a patchy attention mask, enabling discriminative representation learning. The patchy attention mask drives the classifier to filter out the insignificant parts of images (e.g., common features between different categories), which enhances the robustness of MGA for fine-grained patchy image classification. We verify the effectiveness of our method on three publicly available patchy image datasets. Experimental results demonstrate that our MGA method achieves superior performance on the three datasets compared with state-of-the-art methods. In addition, our ablation study shows that MGA improves the accuracy by 2.25% and 2% on the SoyCultivarVein and BtfPIS datasets, indicating its practicality for solving fine-grained patchy image classification.
    Vehicle Behavior Prediction and Generalization Using Imbalanced Learning Techniques. (arXiv:2109.10656v1 [cs.RO])
    (0 min) The use of learning-based methods for vehicle behavior prediction is a promising research topic. However, many publicly available data sets suffer from class distribution skews which limits learning performance if not addressed. This paper proposes an interaction-aware prediction model consisting of an LSTM autoencoder and SVM classifier. Additionally, an imbalanced learning technique, the multiclass balancing ensemble is proposed. Evaluations show that the method enhances model performance, resulting in improved classification accuracy. Good generalization properties of learned models are important and therefore a generalization study is done where models are evaluated on unseen traffic data with dissimilar traffic behavior stemming from different road configurations. This is realized by using two distinct highway traffic recordings, the publicly available NGSIM US-101 and I80 data sets. Moreover, methods for encoding structural and static features into the learning process for improved generalization are evaluated. The resulting methods show substantial improvements in classification as well as generalization performance.
    Rapid detection and recognition of whole brain activity in a freely behaving Caenorhabditis elegans. (arXiv:2109.10474v1 [q-bio.QM])
    (0 min) Advanced volumetric imaging methods and genetically encoded activity indicators have permitted a comprehensive characterization of whole brain activity at single neuron resolution in Caenorhabditis elegans. The constant motion and deformation of the nematode's nervous system, however, pose a great challenge for consistent identification of densely packed neurons in a behaving animal. Here, we propose a cascade solution for long-term and rapid recognition of head ganglion neurons in a freely moving C. elegans. First, potential neuronal regions from a stack of fluorescence images are detected by a deep learning algorithm. Second, 2-dimensional neuronal regions are fused into 3-dimensional neuron entities. Third, by exploiting the neuronal density distribution surrounding a neuron and relative positional information between neurons, a multi-class artificial neural network transforms engineered neuronal feature vectors into digital neuronal identities. Under the constraint of a small number (20-40 volumes) of training samples, our bottom-up approach is able to process each volume - $1024 \times 1024 \times 18$ in voxels - in less than 1 second and achieves an accuracy of $91\%$ in neuronal detection and $74\%$ in neuronal recognition. Our work represents an important development towards a rapid and fully automated algorithm for decoding whole brain activity underlying natural animal behaviors.
    A deep neural network for multi-species fish detection using multiple acoustic cameras. (arXiv:2109.10664v1 [cs.CV])
    (0 min) Underwater acoustic cameras are high-potential devices for many applications in ecology, notably for fisheries management and monitoring. However, extracting high-value information from such data without a time-consuming reading of the entire dataset by an operator is still a challenge. Moreover, the analysis of acoustic imaging, due to its low signal-to-noise ratio, is a perfect training ground for experimenting with new approaches, especially concerning Deep Learning techniques. We hereby present a novel approach that takes advantage of both CNN (Convolutional Neural Network) and classical CV (Computer Vision) techniques, able to detect a generic class "fish" in acoustic video streams. The pipeline pre-treats the acoustic images to extract 2 features, in order to localise the signals and improve the detection performance. To ensure the performance from an ecological point of view, we also propose a two-step validation: one to validate the results of the trainings and one to test the method in a real-world scenario. The YOLOv3-based model was trained with data of fish from multiple species recorded by the two common acoustic cameras, DIDSON and ARIS, including species of high ecological interest, such as Atlantic salmon and European eels. The model we developed provides satisfactory results, detecting almost 80% of fish and minimizing the false positive rate; however, the model is much less efficient for eel detection on ARIS videos. The first CNN pipeline for fish monitoring exploiting video data from two models of acoustic cameras satisfies most of the required features. Many challenges remain, such as the automation of fish species identification through a multiclass model. However, the results point to a new solution for dealing with complex data, such as sonar data, which can also be applied in other cases where the signal-to-noise ratio is a challenge.
    Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images. (arXiv:2109.10778v1 [cs.CV])
    (0 min) Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and machine learning algorithm development. However, generating exhaustive and accurate annotations is labor-intensive, challenging, and costly. Drawing only coarse and approximate annotations is a much easier task, less costly, and it alleviates pathologists' workload. In this paper, we study the problem of refining these approximate annotations in digital pathology to obtain more accurate ones. Some previous works have explored obtaining machine learning models from these inaccurate annotations, but few of them tackle the refinement problem where the mislabeled regions should be explicitly identified and corrected, and all of them require an (often very large) number of training samples. We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need for external training data. Patches cropped from a WSI with inaccurate labels are processed jointly with a MIL framework, and a deep-attention mechanism is leveraged to discriminate mislabeled instances, mitigating their impact on the predictive model and refining the segmentation. Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming the state-of-the-art alternatives, even while learning from a single slide. These results demonstrate that LC-MIL is a promising, lightweight tool to provide fine-grained annotations from coarsely annotated pathology sets.
    FaceEraser: Removing Facial Parts for Augmented Reality. (arXiv:2109.10760v1 [cs.CV])
    (0 min) Our task is to remove all facial parts (e.g., eyebrows, eyes, mouth and nose), and then impose visual elements onto the ``blank'' face for augmented reality. Conventional object removal methods rely on image inpainting techniques (e.g., EdgeConnect, HiFill) that are trained in a self-supervised manner with randomly manipulated image pairs. Specifically, given a set of natural images, randomly masked images are used as inputs and the raw images are treated as ground truths. However, this technique does not satisfy the requirements of facial parts removal, as it is hard to obtain ``ground-truth'' images with real ``blank'' faces. To address this issue, we propose a novel data generation technique to produce paired training data that well mimic the ``blank'' faces. At the same time, we propose a novel network architecture that improves inpainting quality for our task. Finally, we demonstrate various face-oriented augmented reality applications on top of our facial parts removal model. Our method has been integrated into commercial products and its effectiveness has been verified with unconstrained user inputs. The source codes, pre-trained models and training data will be released for research purposes.
    Video-based Person Re-identification without Bells and Whistles. (arXiv:2105.10678v2 [cs.CV] UPDATED)
    (0 min) Video-based person re-identification (Re-ID) aims at matching the video tracklets with cropped video frames for identifying the pedestrians under different cameras. However, there exists severe spatial and temporal misalignment for those cropped tracklets due to the imperfect detection and tracking results generated with obsolete methods. To address this issue, we present a simple re-Detect and Link (DL) module which can effectively reduce those unexpected noise through applying the deep learning-based detection and tracking on the cropped tracklets. Furthermore, we introduce an improved model called Coarse-to-Fine Axial-Attention Network (CF-AAN). Based on the typical Non-local Network, we replace the non-local module with three 1-D position-sensitive axial attentions, in addition to our proposed coarse-to-fine structure. With the developed CF-AAN, compared to the original non-local operation, we can not only significantly reduce the computation cost but also obtain the state-of-the-art performance (91.3% in rank-1 and 86.5% in mAP) on the large-scale MARS dataset. Meanwhile, by simply adopting our DL module for data alignment, to our surprise, several baseline models can achieve better or comparable results with the current state-of-the-arts. Besides, we discover the errors not only for the identity labels of tracklets but also for the evaluation protocol for the test data of MARS. We hope that our work can help the community for the further development of invariant representation without the hassle of the spatial and temporal alignment and dataset noise. The code, corrected labels, evaluation protocol, and the aligned data will be available at https://github.com/jackie840129/CF-AAN.
    TA2N: Two-Stage Action Alignment Network for Few-shot Action Recognition. (arXiv:2107.04782v2 [cs.CV] UPDATED)
    (0 min) Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support). The majority of current approaches follow the metric learning paradigm, which learns to compare the similarity between videos. Recently, it has been observed that directly measuring this similarity is not ideal since different action instances may show distinctive temporal distribution, resulting in severe misalignment issues across query and support videos. In this paper, we address this problem from two distinct aspects -- action duration misalignment and action evolution misalignment. We address them sequentially through a Two-stage Action Alignment Network (TA2N). The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while dismissing the action-irrelevant feature (e.g., background). Next, the second stage coordinates the query feature to match the spatio-temporal action evolution of the support by performing temporal rearrangement and spatial offset prediction. Extensive experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition.
    DAFNe: A One-Stage Anchor-Free Deep Model for Oriented Object Detection. (arXiv:2109.06148v2 [cs.CV] UPDATED)
    (0 min) Object detection is a fundamental task in computer vision. While approaches for axis-aligned bounding box detection have made substantial progress in recent years, they perform poorly on oriented objects which are common in several real-world scenarios such as aerial view imagery and security camera footage. In these cases, a large part of a predicted bounding box will, undesirably, cover non-object related areas. Therefore, oriented object detection has emerged with the aim of generalizing object detection to arbitrary orientations. This enables a tighter fit to oriented objects, leading to a better separation of bounding boxes especially in case of dense object distributions. The vast majority of the work in this area has focused on complex two-stage anchor-based approaches. Anchors act as priors on the bounding box shape and require attentive hyper-parameter fine-tuning on a per-dataset basis, increased model size, and come with computational overhead. In this work, we present DAFNe: A Dense one-stage Anchor-Free deep Network for oriented object detection. As a one-stage model, DAFNe performs predictions on a dense grid over the input image, being architecturally simpler and faster, as well as easier to optimize than its two-stage counterparts. Furthermore, as an anchor-free model, DAFNe reduces the prediction complexity by refraining from employing bounding box anchors. Moreover, we introduce an orientation-aware generalization of the center-ness function for arbitrarily oriented bounding boxes to down-weight low-quality predictions and a center-to-corner bounding box prediction strategy that improves object localization performance. DAFNe improves the prediction accuracy over the previous best one-stage anchor-free model results on DOTA 1.0 by 4.65% mAP, setting the new state-of-the-art results by achieving 76.95% mAP.
    LDC-VAE: A Latent Distribution Consistency Approach to Variational AutoEncoders. (arXiv:2109.10640v1 [cs.LG])
    (0 min) Variational autoencoders (VAEs), as an important aspect of generative models, have received a lot of research interest and reached many successful applications. However, it is always a challenge to achieve consistency between the learned latent distribution and the prior latent distribution when optimizing the evidence lower bound (ELBO), which ultimately leads to unsatisfactory performance in data generation. In this paper, we propose a latent distribution consistency approach to avoid such substantial inconsistency between the posterior and prior latent distributions when optimizing the ELBO. We name our method latent distribution consistency VAE (LDC-VAE). We achieve this purpose by assuming that the real posterior distribution in latent space takes a Gibbs form, and approximating it with our encoder. However, such a Gibbs posterior has no analytical solution, and traditional approximation methods, such as iterative sampling-based MCMC, are time-consuming. To address this problem, we use Stein Variational Gradient Descent (SVGD) to approximate the Gibbs posterior. Meanwhile, we use SVGD to train a sampler network that can draw efficient samples from the Gibbs posterior. Comparative studies on popular image generation datasets show that our method achieves comparable or even better performance than several powerful improvements of VAEs.
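    The SVGD step referenced above follows the standard kernelized Stein update of Liu and Wang (2016). Below is a minimal numpy sketch of that update; the target here is a 2-D standard Gaussian, which is only a stand-in for the Gibbs-form posterior defined through the trained encoder (an assumption we cannot reproduce from the abstract).

```python
# Minimal SVGD sketch (not the LDC-VAE implementation): particles are pushed toward a
# target density with the update
#   phi(x_i) = mean_j [ k(x_j, x_i) * grad_logp(x_j) + grad_{x_j} k(x_j, x_i) ].
# Target here: a 2-D standard Gaussian (stand-in for the Gibbs posterior in the paper).
import numpy as np

def rbf_kernel(x):
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)        # pairwise squared distances
    h = np.median(d2) / np.log(len(x) + 1.0)                   # median bandwidth heuristic
    k = np.exp(-d2 / h)
    # grad of k(x_j, x_i) with respect to x_j, for every pair (j, i)
    grad_k = -2.0 / h * k[:, :, None] * (x[:, None, :] - x[None, :, :])
    return k, grad_k

def grad_log_p(x):                 # standard Gaussian target => grad log p(x) = -x
    return -x

rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, size=(200, 2))                 # start far from the target
step = 0.1
for _ in range(500):
    k, grad_k = rbf_kernel(particles)
    phi = (k @ grad_log_p(particles) + grad_k.sum(axis=0)) / len(particles)
    particles += step * phi

print("particle mean (should be ~0):", particles.mean(axis=0))
print("particle std  (should be ~1):", particles.std(axis=0))
```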
    Tighter risk certificates for neural networks. (arXiv:2007.12911v3 [cs.LG] UPDATED)
    (0 min) This paper presents an empirical study regarding training probabilistic neural networks using training objectives derived from PAC-Bayes bounds. In the context of probabilistic neural networks, the output of training is a probability distribution over network weights. We present two training objectives, used here for the first time in connection with training neural networks. These two training objectives are derived from tight PAC-Bayes bounds. We also re-implement a previously used training objective based on a classical PAC-Bayes bound, to compare the properties of the predictors learned using the different training objectives. We compute risk certificates for the learnt predictors, based on part of the data used to learn the predictors. We further experiment with different types of priors on the weights (both data-free and data-dependent priors) and neural network architectures. Our experiments on MNIST and CIFAR-10 show that our training methods produce competitive test set errors and non-vacuous risk bounds with much tighter values than previous results in the literature, showing promise not only to guide the learning algorithm through bounding the risk but also for model selection. These observations suggest that the methods studied here might be good candidates for self-certified learning, in the sense of using the whole data set for learning a predictor and certifying its risk on any unseen data (from the same distribution as the training data) potentially without the need for holding out test data.
    Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. (arXiv:2109.10686v1 [cs.CL])
    (0 min) There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which has both financial and environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, the scope covers only the upstream (pretraining) loss. Therefore, it is still unclear whether these findings transfer to downstream tasks within the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that aside from model size, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3) the widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50\% fewer parameters and training 40\% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
    Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data. (arXiv:2108.09020v2 [cs.LG] UPDATED)
    (0 min) Continual learning is the problem of learning and retaining knowledge through time over multiple tasks and environments. Research has primarily focused on the incremental classification setting, where new tasks/classes are added at discrete time intervals. Such an "offline" setting does not evaluate the ability of agents to learn effectively and efficiently, since an agent can perform multiple learning epochs without any time limitation when a task is added. We argue that "online" continual learning, where data is a single continuous stream without task boundaries, enables evaluating both information retention and online learning efficacy. In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online. Trained models are later evaluated on historical data to assess information retention. We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts. Through a large-scale analysis, we identify critical and previously unobserved phenomena of gradient-based optimization in continual learning, and propose effective strategies for improving gradient-based online continual learning with real data. The source code and dataset are available in: https://github.com/IntelLabs/continuallearning.
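    The "test-then-train" online protocol described above is easy to sketch: every incoming batch is scored before it is used for a single online update. The snippet below is a minimal illustration with a synthetic drifting stream and a linear model; it is not the paper's benchmark or code, and the drift model and batch sizes are assumptions.

```python
# Sketch of the online "test-then-train" protocol (not the paper's benchmark code):
# each incoming batch is evaluated first, then used for one online update. The
# synthetic stream with a slowly drifting mean stands in for natural distribution shift.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])
online_acc = []

for t in range(200):                              # 200 small batches arriving over time
    drift = 0.01 * t                              # slow, natural-looking distribution shift
    y = rng.integers(0, 2, size=32)
    X = rng.normal(loc=(y[:, None] * 2.0 + drift), size=(32, 5))
    if t > 0:                                     # test first ...
        online_acc.append(model.score(X, y))
    model.partial_fit(X, y, classes=classes)      # ... then train on the same batch

print("mean online accuracy:", float(np.mean(online_acc)))
```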
    Caption Enriched Samples for Improving Hateful Memes Detection. (arXiv:2109.10649v1 [cs.CV])
    (0 min) The recently introduced hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not. Specifically, both unimodal language models and multimodal vision-language models cannot reach the human level of performance. Motivated by the need to model the contrast between the image content and the overlaid text, we suggest applying an off-the-shelf image captioning tool to capture the former. We demonstrate that the incorporation of such automatic captions during fine-tuning improves the results for various unimodal and multimodal models. Moreover, in the unimodal case, continuing the pre-training of language models on augmented and original caption pairs is highly beneficial to the classification accuracy.
    Single Image Dehazing with An Independent Detail-Recovery Network. (arXiv:2109.10492v1 [cs.CV])
    (0 min) Single image dehazing is a prerequisite which affects the performance of many computer vision tasks and has attracted increasing attention in recent years. However, most existing dehazing methods emphasize more on haze removal but less on the detail recovery of the dehazed images. In this paper, we propose a single image dehazing method with an independent Detail Recovery Network (DRN), which considers capturing the details from the input image over a separate network and then integrates them into a coarse dehazed image. The overall network consists of two independent networks, named DRN and the dehazing network respectively. Specifically, the DRN aims to recover the dehazed image details through local and global branches respectively. The local branch can obtain local detail information through the convolution layer and the global branch can capture more global information by the Smooth Dilated Convolution (SDC). The detail feature map is fused into the coarse dehazed image to obtain the dehazed image with rich image details. Besides, we integrate the DRN, the physical-model-based dehazing network and the reconstruction loss into an end-to-end joint learning framework. Extensive experiments on the public image dehazing datasets (RESIDE-Indoor, RESIDE-Outdoor and the TrainA-TestA) illustrate the effectiveness of the modules in the proposed method and show that our method outperforms the state-of-the-art dehazing methods both quantitatively and qualitatively. The code is released in https://github.com/YanLi-LY/Dehazing-DRN.
    Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation. (arXiv:2109.10595v1 [cs.GR])
    (0 min) To the best of our knowledge, we present the first live system that generates personalized photorealistic talking-head animation driven only by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper body motions, where the former is generated by an autoregressive probabilistic model which models the head pose distribution of the target person. Upper body motions are deduced from head poses. In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to in-the-wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.
    MVM3Det: A Novel Method for Multi-view Monocular 3D Detection. (arXiv:2109.10473v1 [cs.CV])
    (0 min) Monocular 3D object detection encounters occlusion problems in many application scenarios, such as traffic monitoring, pedestrian monitoring, etc., which leads to serious false negatives. Multi-view object detection effectively solves this problem by combining data from different perspectives. However, due to label confusion and feature confusion, orientation estimation in multi-view 3D object detection, which is important for object tracking and intention prediction, is intractable. In this paper, we propose a novel multi-view 3D object detection method named MVM3Det which simultaneously estimates the 3D position and orientation of the object according to the multi-view monocular information. The method consists of two parts: 1) Position proposal network, which integrates the features from different perspectives into consistent global features through feature orthogonal transformation to estimate the position. 2) Multi-branch orientation estimation network, which introduces feature perspective pooling to overcome the two confusion problems during the orientation estimation. In addition, we present the first dataset for multi-view 3D object detection, named MVM3D. Compared with state-of-the-art (SOTA) methods on our dataset and the public WildTrack dataset, our method achieves very competitive results.
    Neural network relief: a pruning algorithm based on neural activity. (arXiv:2109.10795v1 [cs.LG])
    (0 min) Current deep neural networks (DNNs) are overparameterized and use most of their neuronal connections during inference for each task. The human brain, however, developed specialized regions for different tasks and performs inference with a small fraction of its neuronal connections. We propose an iterative pruning strategy introducing a simple importance-score metric that deactivates unimportant connections, tackling overparameterization in DNNs and modulating the firing patterns. The aim is to find the smallest number of connections that is still capable of solving a given task with comparable accuracy, i.e. a simpler subnetwork. We achieve comparable performance for LeNet architectures on MNIST, and significantly higher parameter compression than state-of-the-art algorithms for VGG and ResNet architectures on CIFAR-10/100 and Tiny-ImageNet. Our approach also performs well for the two different optimizers considered -- Adam and SGD. The algorithm is not designed to minimize FLOPs when considering current hardware and software implementations, although it performs reasonably when compared to the state of the art.
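    The prune-and-retrain loop described above can be sketched generically: score every connection, deactivate the lowest-scoring ones, and fine-tune what remains. The importance score used below, mean |activation| times |weight|, is only a stand-in assumption, not the activity-based metric defined in the paper, and the layer sizes are illustrative.

```python
# Generic sketch of iterative importance-score pruning (not the paper's exact metric):
# the lowest-scoring connections are deactivated each round. The score here,
# mean |activation| * |weight|, is a stand-in for the activity-based importance in the paper.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))                 # one dense layer's weights (in_dim x out_dim)
mask = np.ones_like(W, dtype=bool)            # True = connection still active
X = rng.normal(size=(1000, 64))               # a batch of input activations

for round_ in range(5):
    scores = np.abs(W) * np.mean(np.abs(X), axis=0)[:, None]   # |activation| * |weight|
    scores[~mask] = np.inf                    # already-pruned entries stay pruned
    k = int(0.2 * mask.sum())                 # deactivate 20% of the remaining connections
    idx = np.unravel_index(np.argsort(scores, axis=None)[:k], W.shape)
    mask[idx] = False
    W[~mask] = 0.0
    # ... here one would fine-tune the remaining connections before the next round ...
    print(f"round {round_}: {mask.mean():.1%} of connections remain")
```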
    GasHisSDB: A New Gastric Histopathology Image Dataset for Computer Aided Diagnosis of Gastric Cancer. (arXiv:2106.02473v5 [cs.CV] UPDATED)
    (0 min) Gastric cancer has turned out to be the fifth most common cancer globally, and early detection of gastric cancer is essential to save lives. Histopathological examination of gastric cancer is the gold standard for the diagnosis of gastric cancer. However, computer-aided diagnostic techniques are challenging to evaluate due to the scarcity of publicly available gastric histopathology image datasets. In this paper, a novel publicly available Gastric Histopathology Sub-size Image Database (GasHisSDB) is published to identify classifiers' performance. Specifically, two types of data are included: normal and abnormal, with a total of 245,196 tissue case images. This study also performed extensive experiments using traditional machine learning and deep learning methods to prove that the methods of different periods have discrepancies on GasHisSDB. To the best of our knowledge, it is the first publicly available gastric cancer histopathology dataset containing a large number of images for weakly supervised learning. We believe that GasHisSDB can attract researchers to explore new algorithms for the automated diagnosis of gastric cancer, which can help physicians and patients in the clinical setting. GasHisSDB is available at: https://gitee.com/neuhwm/GasHisSDB.git
    Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning. (arXiv:2109.10563v1 [cs.CV])
    (0 min) Due to difficulties in acquiring ground truth depth of equirectangular (360) images, the quality and quantity of equirectangular depth data today is insufficient to represent the various scenes in the world. Therefore, 360 depth estimation studies, which relied solely on supervised learning, are destined to produce unsatisfactory results. Although self-supervised learning methods focusing on equirectangular images (EIs) are introduced, they often have incorrect or non-unique solutions, causing unstable performance. In this paper, we propose 360 monocular depth estimation methods which improve on the areas that limited previous studies. First, we introduce a self-supervised 360 depth learning method that only utilizes gravity-aligned videos, which has the potential to eliminate the needs for depth data during the training procedure. Second, we propose a joint learning scheme realized by combining supervised and self-supervised learning. The weakness of each learning is compensated, thus leading to more accurate depth estimation. Third, we propose a non-local fusion block, which retains global information encoded by vision transformer when reconstructing the depths. With the proposed methods, we successfully apply the transformer to 360 depth estimations, to the best of our knowledge, which has not been tried before. On several benchmarks, our approach achieves significant improvements over previous works and establishes a state of the art.
    Application of Video-to-Video Translation Networks to Computational Fluid Dynamics. (arXiv:2109.10679v1 [physics.flu-dyn])
    (0 min) In recent years, the evolution of artificial intelligence, especially deep learning, has been remarkable, and its application to various fields has been growing rapidly. In this paper, I report the results of the application of generative adversarial networks (GANs), specifically video-to-video translation networks, to computational fluid dynamics (CFD) simulations. The purpose of this research is to reduce the computational cost of CFD simulations with GANs. The architecture of GANs in this research is a combination of the image-to-image translation networks (the so-called "pix2pix") and Long Short-Term Memory (LSTM). It is shown that the results of high-cost and high-accuracy simulations (with high-resolution computational grids) can be estimated from those of low-cost and low-accuracy simulations (with low-resolution grids). In particular, the time evolution of density distributions in the cases of a high-resolution grid is reproduced from that in the cases of a low-resolution grid through GANs, and the density inhomogeneity estimated from the image generated by GANs recovers the ground truth with good accuracy. Qualitative and quantitative comparisons of the results of the proposed method with those of several super-resolution algorithms are also presented.
    Efficient Context-Aware Network for Abdominal Multi-organ Segmentation. (arXiv:2109.10601v1 [eess.IV])
    (0 min) The contextual information present in abdominal CT scans is relatively consistent. In order to make full use of the overall 3D context, we develop a whole-volume-based coarse-to-fine framework for efficient and effective abdominal multi-organ segmentation. We propose a new efficientSegNet network, which is composed of an encoder, a decoder and a context block. For the decoder module, an anisotropic convolution, with a $k \times k \times 1$ intra-slice convolution and a $1 \times 1 \times k$ inter-slice convolution, is designed to reduce the computational burden. For the context block, we propose a strip pooling module to capture the anisotropic and long-range contextual information present in abdominal scenes. In quantitative evaluation on the FLARE2021 validation cases, this method achieves an average dice similarity coefficient (DSC) of 0.895 and an average normalized surface distance (NSD) of 0.775. The average running time is 9.8 s per case in the inference phase, and the maximum GPU memory used is 1017 MB.
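    The anisotropic decoder convolution described above (a $k \times k \times 1$ intra-slice convolution followed by a $1 \times 1 \times k$ inter-slice convolution) can be sketched as a small PyTorch module. Channel widths, $k = 3$, and the input shape below are illustrative assumptions; this is not the authors' efficientSegNet code.

```python
# Sketch of the anisotropic convolution described above (PyTorch): an intra-slice
# k x k x 1 convolution followed by an inter-slice 1 x 1 x k convolution.
# Channel widths and k = 3 are illustrative assumptions.
import torch
import torch.nn as nn

class AnisotropicConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.intra_slice = nn.Conv3d(in_ch, out_ch,
                                     kernel_size=(k, k, 1), padding=(k // 2, k // 2, 0))
        self.inter_slice = nn.Conv3d(out_ch, out_ch,
                                     kernel_size=(1, 1, k), padding=(0, 0, k // 2))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (batch, channels, H, W, slices)
        return self.act(self.inter_slice(self.act(self.intra_slice(x))))

block = AnisotropicConv(32, 32)
print(block(torch.randn(1, 32, 64, 64, 16)).shape)   # -> torch.Size([1, 32, 64, 64, 16])
```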
    Generating Compositional Color Representations from Text. (arXiv:2109.10477v1 [cs.CV])
    (0 min) We consider the cross-modal task of producing color representations for text phrases. Motivated by the fact that a significant fraction of user queries on an image search engine follow an (attribute, object) structure, we propose a generative adversarial network that generates color profiles for such bigrams. We design our pipeline to learn composition - the ability to combine seen attributes and objects to unseen pairs. We propose a novel dataset curation pipeline from existing public sources. We describe how a set of phrases of interest can be compiled using a graph propagation technique, and then mapped to images. While this dataset is specialized for our investigations on color, the method can be extended to other visual dimensions where composition is of interest. We provide detailed ablation studies that test the behavior of our GAN architecture with loss functions from the contrastive learning literature. We show that the generative model achieves lower Frechet Inception Distance than discriminative ones, and therefore predicts color profiles that better match those from real images. Finally, we demonstrate improved performance in image retrieval and classification, indicating the crucial role that color plays in these downstream tasks.
    A Method For Adding Motion-Blur on Arbitrary Objects By using Auto-Segmentation and Color Compensation Techniques. (arXiv:2109.10524v1 [cs.CV])
    (0 min) When dynamic objects are captured by a camera, motion blur inevitably occurs. Such blur is sometimes considered just noise; however, it can also add dynamism to the scene in photographs or videos. Unlike similar effects, such as defocus blur, which is now easily controlled even on smartphones, motion blur is still uncontrollable and produces undesired effects in photographs. In this paper, a unified framework to add motion blur on a per-object basis is proposed. In the method, multiple frames are captured without motion blur and they are accumulated to create motion blur on target objects. To capture images without motion blur, the shutter speed must be short; however, this makes the captured images dark, and thus the sensor gain should be increased to compensate. Since increasing the sensor gain causes severe noise in the image, we propose a color compensation algorithm based on a non-linear filtering technique. Another contribution is that our technique can be used to make HDR images for fast-moving objects by using multi-exposure images. In the experiments, the effectiveness of the method is confirmed by an ablation study on several datasets.
    Evaluating Transformer-based Semantic Segmentation Networks for Pathological Image Segmentation. (arXiv:2108.11993v2 [eess.IV] UPDATED)
    (0 min) Histopathology has played an essential role in cancer diagnosis. With the rapid advances in convolutional neural networks (CNNs), various CNN-based automated pathological image segmentation approaches have been developed for computer-assisted pathological image analysis. In the past few years, Transformer neural networks (Transformer) have shown the unique merit of capturing the global long-distance dependencies across the entire image as a new deep learning paradigm. Such merit is appealing for exploring spatially heterogeneous pathological images. However, there have been very few, if any, studies that have systematically evaluated the current Transformer-based approaches in pathological image segmentation. To assess the performance of Transformer segmentation models on whole slide images (WSI), we quantitatively evaluated six prevalent transformer-based models on tumor segmentation, using the widely used PAIP liver histopathological dataset. For a more comprehensive analysis, we also compare the transformer-based models with six major traditional CNN-based models. The results show that the Transformer-based models exhibit a general superior performance over the CNN-based models. In particular, Segmenter, Swin-Transformer and TransUNet, all Transformer-based, came out as the best performers among the twelve evaluated models.
  • cs.IR updates on arXiv.org

    Investigating Entropy for Extractive Document Summarization. (arXiv:2109.10886v1 [cs.IR])
    (0 min) Automatic text summarization aims to cut down readers' time and cognitive effort by reducing the content of a text document without compromising on its essence. Ergo, informativeness is the prime attribute of a document summary generated by an algorithm, and selecting sentences that capture the essence of a document is the primary goal of extractive document summarization. In this paper, we employ Shannon entropy to capture informativeness of sentences. We employ Non-negative Matrix Factorization (NMF) to reveal probability distributions for computing entropy of terms, topics, and sentences in latent space. We present an information theoretic interpretation of the computed entropy, which is the bedrock of the proposed E-Summ algorithm, an unsupervised method for extractive document summarization. The algorithm systematically applies an information theoretic principle for selecting informative sentences from important topics in the document. The proposed algorithm is generic and fast, and hence amenable to use for summarization of documents in real time. Furthermore, it is domain- and collection-independent and agnostic to the language of the document. Benefiting from strictly positive NMF factor matrices, the E-Summ algorithm is transparent and explainable too. We use the standard ROUGE toolkit for performance evaluation of the proposed method on four well-known public datasets. We also perform quantitative assessment of E-Summ summary quality by computing its semantic similarity w.r.t. the original document. Our investigation reveals that though using NMF and an information theoretic approach for document summarization promises efficient, explainable, and language-independent text summarization, it needs to be bolstered to match the performance of deep neural methods.
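    The NMF-plus-entropy idea above can be sketched with scikit-learn: factorize a sentence-term matrix, normalize the sentence-topic factors into probability distributions, and compute Shannon entropy per sentence. The toy "document", the number of topics, and the ranking direction below are assumptions for illustration; the full E-Summ selection procedure is not reproduced.

```python
# Sketch of the NMF + entropy computation described above (not the E-Summ code):
# factor a sentence-term matrix, turn sentence-topic factors into probability
# distributions, and rank sentences by Shannon entropy.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

sentences = [
    "The algorithm selects informative sentences from important topics.",
    "Entropy of terms and topics is computed in the latent NMF space.",
    "The method is unsupervised, fast, and language independent.",
    "ROUGE is used for evaluation on public benchmark datasets.",
]
A = TfidfVectorizer().fit_transform(sentences)              # sentence-term matrix
W = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(A)

P = W / np.clip(W.sum(axis=1, keepdims=True), 1e-12, None)  # sentence-topic distributions
entropy = -(P * np.log2(np.clip(P, 1e-12, None))).sum(axis=1)
ranking = np.argsort(entropy)                                # low entropy = more "focused" sentence
print("sentence ranking by topic entropy:", ranking)
```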
    Inference for Hit Enrichment Curves, with Applications to Drug Discovery. (arXiv:1912.09526v2 [stat.AP] UPDATED)
    (0 min) In virtual screening for drug discovery, hit enrichment curves are widely used to assess the performance of ranking algorithms with regard to their ability to identify early enrichment. Unfortunately, researchers almost never consider the uncertainty associated with estimating such curves before declaring differences between performance of competing algorithms. Appropriate inference is complicated by two sources of correlation that are often overlooked: correlation across different testing fractions within a single algorithm, and correlation between competing algorithms. Additionally, researchers are often interested in making comparisons along the entire curve, not only at a few testing fractions. We develop inferential procedures to address both the needs of those interested in a few testing fractions, as well as those interested in the entire curve. For the former, four hypothesis testing and (pointwise) confidence interval procedures are investigated, and a newly developed EmProc approach is found to be most effective. For inference along entire curves, EmProc-based confidence bands are recommended for simultaneous coverage and minimal width. Our inferential procedures trivially extend to enrichment factors, as well.
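    The hit enrichment curve itself is straightforward to compute: rank compounds by score and, for each testing fraction, report the share of all true hits recovered so far. The sketch below uses synthetic scores and labels; the paper's inferential procedures (EmProc tests, confidence bands) are not reproduced.

```python
# Sketch of a hit enrichment curve (not the paper's inferential procedures): rank
# compounds by a score and report, per testing fraction, the share of true hits recovered.
import numpy as np

def hit_enrichment_curve(scores, hits, fractions):
    order = np.argsort(-scores)                       # best-scored compounds first
    hits_sorted = np.asarray(hits)[order]
    recovered = np.cumsum(hits_sorted) / hits_sorted.sum()
    n_tested = np.maximum(1, (np.asarray(fractions) * len(scores)).astype(int))
    return recovered[n_tested - 1]

rng = np.random.default_rng(0)
labels = rng.random(10_000) < 0.01                    # ~1% true hits (synthetic)
scores = labels * 1.0 + rng.normal(scale=1.0, size=10_000)   # mildly informative scores
fractions = np.array([0.001, 0.01, 0.05, 0.1])
print(dict(zip(fractions.tolist(), hit_enrichment_curve(scores, labels, fractions))))
```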
    A Survey on Reinforcement Learning for Recommender Systems. (arXiv:2109.10665v1 [cs.IR])
    (0 min) Recommender systems have been widely applied in different real-life scenarios to help us find useful information. Recently, Reinforcement Learning (RL) based recommender systems have become an emerging research topic. They often surpass traditional recommendation models, and even most deep learning-based methods, owing to their interactive nature and autonomous learning ability. Nevertheless, applying RL in recommender systems poses various challenges. Toward this end, we first provide a thorough overview, comparison, and summarization of RL approaches for five typical recommendation scenarios, following three main categories of RL: value-function, policy search, and Actor-Critic. Then, we systematically analyze the challenges and relevant solutions on the basis of existing literature. Finally, in discussing the open issues and limitations of RL for recommendation, we highlight some potential research directions in this field.
    Context-aware Tree-based Deep Model for Recommender Systems. (arXiv:2109.10602v1 [cs.IR])
    (0 min) How to predict precise user preference and how to make efficient retrieval from a big corpus are two major challenges of large-scale industrial recommender systems. In tree-based methods, a tree structure T is adopted as an index and each item in the corpus is attached to a leaf node of T. The recommendation problem is then converted into a hierarchical retrieval problem solved efficiently by a beam search process. In this paper, we argue that the tree index used to support efficient retrieval in tree-based methods also has rich hierarchical information about the corpus. Furthermore, we propose a novel context-aware tree-based deep model (ConTDM) for recommender systems. In ConTDM, a context-aware user preference prediction model M is designed to utilize both horizontal and vertical contexts on T. Horizontally, a graph convolutional layer is used to enrich the representation of both users and nodes on T with their neighbors. Vertically, a parent fusion layer is designed in M to transmit the user preference representation in higher levels of T to the current level, reflecting the fact that tree-based methods generate the candidate set from coarse to fine during beam search retrieval. Besides, we argue that the proposed user preference model in ConTDM can be conveniently extended to other tree-based methods for recommender systems. Both experiments on large-scale real-world datasets and an online A/B test in a large-scale industrial application show the significant improvements brought by ConTDM.
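    The hierarchical beam-search retrieval that such tree-based recommenders rely on can be sketched in a few lines: items sit at the leaves of a binary tree, and at each level only the top-B nodes under a preference model are expanded. The scoring function below is a pseudo-random stand-in for the learned model M; the tree depth and beam size are assumptions.

```python
# Sketch of beam-search retrieval over a tree index (not ConTDM itself): at every
# level only the top-B nodes under a preference model are expanded down to the leaves.
import numpy as np

rng = np.random.default_rng(0)
depth, beam = 10, 4
n_items = 2 ** depth                                   # items = leaves of a full binary tree

def preference(user, nodes, level):
    # Stand-in for the learned context-aware model M: deterministic pseudo-random scores.
    return np.array([hash((user, level, n)) % 1000 for n in nodes])

def retrieve(user, beam_size=beam):
    candidates = [0]                                   # start at the root (node id 0)
    for level in range(depth):
        children = [c for n in candidates for c in (2 * n + 1, 2 * n + 2)]
        scores = preference(user, children, level)
        keep = np.argsort(-scores)[:beam_size]
        candidates = [children[i] for i in keep]
    first_leaf = 2 ** depth - 1                        # leaves start at this node id
    return [n - first_leaf for n in candidates]        # map leaf node ids to item ids

print("recommended item ids:", retrieve(user=42))
```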
    Astronomical Pipeline Provenance: A Use Case Evaluation. (arXiv:2109.10759v1 [astro-ph.IM])
    (0 min) In this decade astronomy is undergoing a paradigm shift to handle data from next generation observatories such as the Square Kilometre Array (SKA) or the Vera C. Rubin Observatory (LSST). Producing real time data streams of up to 10 TB/s and data products of the order of 600 Pbytes/year, the SKA will be the biggest civil data producing machine of the world that demands novel solutions on how these data volumes can be stored and analysed. Through the use of complex, automated pipelines the provenance of this real time data processing is key to establish confidence within the system, its final data products, and ultimately its scientific results. The intention of this paper is to lay the foundation for making an automated provenance generation tool for astronomical/data-processing pipelines. We therefore present a use case analysis, specific to the astronomical needs which addresses the issues of trust and reproducibility as well as other ulterior use cases which are of interest to astronomers. This analysis is subsequently used as the basis to discuss the requirements, challenges, and opportunities involved in designing both the tool and the associated provenance model.
    Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection. (arXiv:2109.10739v1 [cs.IR])
    (0 min) Over the last few years, contextualized pre-trained transformer models such as BERT have provided substantial improvements on information retrieval tasks. Recent approaches based on pre-trained transformer models such as BERT, fine-tune dense low-dimensional contextualized representations of queries and documents in embedding space. While these dense retrievers enjoy substantial retrieval effectiveness improvements compared to sparse retrievers, they are computationally intensive, requiring substantial GPU resources, and dense retrievers are known to be more expensive from both time and resource perspectives. In addition, sparse retrievers have been shown to retrieve complementary information with respect to dense retrievers, leading to proposals for hybrid retrievers. These hybrid retrievers leverage low-cost, exact-matching based sparse retrievers along with dense retrievers to bridge the semantic gaps between query and documents. In this work, we address this trade-off between the cost and utility of sparse vs dense retrievers by proposing a classifier to select a suitable retrieval strategy (i.e., sparse vs. dense vs. hybrid) for individual queries. Leveraging sparse retrievers for queries which can be answered with sparse retrievers decreases the number of calls to GPUs. Consequently, while utility is maintained, query latency decreases. Although we use less computational resources and spend less time, we still achieve improved performance. Our classifier can select between sparse and dense retrieval strategies based on the query alone. We conduct experiments on the MS MARCO passage dataset demonstrating an improved range of efficiency/effectiveness trade-offs between purely sparse, purely dense or hybrid retrieval strategies, allowing an appropriate strategy to be selected based on a target latency and resource budget.
    Keyword Extraction for Improved Document Retrieval in Conversational Search. (arXiv:2109.05979v2 [cs.CL] UPDATED)
    (0 min) Recent research has shown that mixed-initiative conversational search, based on the interaction between users and computers to clarify and improve a query, provides enormous advantages. Nonetheless, incorporating additional information provided by the user from the conversation poses some challenges. In fact, further interactions could confuse the system as a user might use words irrelevant to the information need but crucial for correct sentence construction in the context of multi-turn conversations. To this aim, in this paper, we have collected two conversational keyword extraction datasets and propose an end-to-end document retrieval pipeline incorporating them. Furthermore, we study the performance of two neural keyword extraction models, namely, BERT and sequence to sequence, in terms of extraction accuracy and human annotation. Finally, we study the effect of keyword extraction on the end-to-end neural IR performance and show that our approach beats state-of-the-art IR models. We make the two datasets publicly available to foster research in this area.
    Combining Lexical and Dense Retrieval for Computationally Efficient Multi-hop Question Answering. (arXiv:2106.08433v2 [cs.IR] UPDATED)
    (0 min) In simple open-domain question answering (QA), dense retrieval has become one of the standard approaches for retrieving the relevant passages to infer an answer. Recently, dense retrieval also achieved state-of-the-art results in multi-hop QA, where aggregating information from multiple pieces of information and reasoning over them is required. Despite their success, dense retrieval methods are computationally intensive, requiring multiple GPUs to train. In this work, we introduce a hybrid (lexical and dense) retrieval approach that is highly competitive with the state-of-the-art dense retrieval models, while requiring substantially less computational resources. Additionally, we provide an in-depth evaluation of dense retrieval methods on limited computational resource settings, something that is missing from the current literature.
    Why Don't You Click: Neural Correlates of Non-Click Behaviors in Web Search. (arXiv:2109.10560v1 [cs.IR])
    (0 min) Web search heavily relies on click-through behavior as an essential feedback signal for performance improvement and evaluation. Traditionally, click is usually treated as a positive implicit feedback signal of relevance or usefulness, while non-click (especially non-click after examination) is regarded as a signal of irrelevance or uselessness. However, there are many cases where users do not click on any search results but still satisfy their information need with the contents of the results shown on the Search Engine Result Page (SERP). This raises the problem of measuring result usefulness and modeling user satisfaction in "Zero-click" search scenarios. Previous works have solved this issue by (1) detecting user satisfaction for abandoned SERP with context information and (2) considering result-level click necessity with external assessors' annotations. However, few works have investigated the reason behind non-click behavior and estimated the usefulness of non-click results. A challenge for this research question is how to collect valuable feedback for non-click results. With neuroimaging technologies, we design a lab-based user study and reveal differences in brain signals while examining non-click search results with different usefulness levels. The findings in significant brain regions and the electroencephalogram (EEG) spectrum also suggest that the process of usefulness judgment might involve similar cognitive functions of relevance perception and satisfaction decoding. Inspired by these findings, we conduct supervised learning tasks to estimate the usefulness of non-click results with brain signals and conventional information (i.e., content and context factors). Results show that it is feasible to utilize brain signals to improve usefulness estimation performance and enhance human-computer interaction in "Zero-click" search scenarios.
    Unsupervised Contextualized Document Representation. (arXiv:2109.10509v1 [cs.CL])
    (0 min) Several NLP tasks need the effective representation of text documents. Arora et al. (2017) demonstrate that simple weighted averaging of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) further extends this from sentences to documents by employing soft and sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual character of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT (Devlin et al., 2019) word embeddings for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate our embeddings' effectiveness on other tasks, such as concept matching and sentence similarity. In addition, we show that SCDV+BERT(ctxd) outperforms fine-tuned BERT and other embedding approaches in scenarios with limited data and only a few labeled examples.
    Generating Compositional Color Representations from Text. (arXiv:2109.10477v1 [cs.CV])
    (0 min) We consider the cross-modal task of producing color representations for text phrases. Motivated by the fact that a significant fraction of user queries on an image search engine follow an (attribute, object) structure, we propose a generative adversarial network that generates color profiles for such bigrams. We design our pipeline to learn composition - the ability to combine seen attributes and objects to unseen pairs. We propose a novel dataset curation pipeline from existing public sources. We describe how a set of phrases of interest can be compiled using a graph propagation technique, and then mapped to images. While this dataset is specialized for our investigations on color, the method can be extended to other visual dimensions where composition is of interest. We provide detailed ablation studies that test the behavior of our GAN architecture with loss functions from the contrastive learning literature. We show that the generative model achieves lower Frechet Inception Distance than discriminative ones, and therefore predicts color profiles that better match those from real images. Finally, we demonstrate improved performance in image retrieval and classification, indicating the crucial role that color plays in these downstream tasks.
    RETRONLU: Retrieval Augmented Task-Oriented Semantic Parsing. (arXiv:2109.10410v1 [cs.CL])
    (0 min) While large pre-trained language models accumulate a lot of knowledge in their parameters, it has been demonstrated that augmenting it with non-parametric retrieval-based memory has a number of benefits from accuracy improvements to data efficiency for knowledge-focused tasks, such as question answering. In this paper, we are applying retrieval-based modeling ideas to the problem of multi-domain task-oriented semantic parsing for conversational assistants. Our approach, RetroNLU, extends a sequence-to-sequence model architecture with a retrieval component, used to fetch existing similar examples and provide them as an additional input to the model. In particular, we analyze two settings, where we augment an input with (a) retrieved nearest neighbor utterances (utterance-nn), and (b) ground-truth semantic parses of nearest neighbor utterances (semparse-nn). Our technique outperforms the baseline method by 1.5% absolute macro-F1, especially at the low resource setting, matching the baseline model accuracy with only 40% of the data. Furthermore, we analyze the nearest neighbor retrieval component's quality, model sensitivity and break down the performance for semantic parses of different utterance complexity.
  • cs.LG updates on arXiv.org

    Encoders and Ensembles for Task-Free Continual Learning. (arXiv:2105.13327v4 [cs.LG] UPDATED)
    (2 min) We present an architecture that is effective for continual learning in an especially demanding setting, where task boundaries do not exist or are unknown. Our architecture comprises an encoder, pre-trained on a separate dataset, and an ensemble of simple one-layer classifiers. Two main innovations are required to make this combination work. First, the provision of suitably generic pre-trained encoders has been made possible thanks to recent progress in self-supervised training methods. Second, pairing each classifier in the ensemble with a key, where the key-space is identical to the latent space of the encoder, allows them to be used collectively, yet selectively, via k-nearest neighbour lookup. We show that models trained with the encoders-and-ensembles architecture are state-of-the-art for the task-free setting on standard image classification continual learning benchmarks, and improve on prior state-of-the-art by a large margin in the most challenging cases. We also show that the architecture learns well in a fully incremental setting, where one class is learned at a time, and we demonstrate its effectiveness in this setting with up to 100 classes. Finally, we show that the architecture works in a task-free continual learning context where the data distribution changes gradually, and existing approaches requiring knowledge of task boundaries cannot be applied.
    GRAND: Graph Neural Diffusion. (arXiv:2106.10934v2 [cs.LG] UPDATED)
    (2 min) We present Graph Neural Diffusion (GRAND) that approaches deep learning on graphs as a continuous diffusion process and treats Graph Neural Networks (GNNs) as discretisations of an underlying PDE. In our model, the layer structure and topology correspond to the discretisation choices of temporal and spatial operators. Our approach allows a principled development of a broad new class of GNNs that are able to address the common plights of graph learning models such as depth, oversmoothing, and bottlenecks. Key to the success of our models are stability with respect to perturbations in the data and this is addressed for both implicit and explicit discretisation schemes. We develop linear and nonlinear versions of GRAND, which achieve competitive results on many standard graph benchmarks.
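    The view of GNN layers as discretisation steps of a diffusion PDE can be illustrated with an explicit Euler scheme on node features, X <- X + tau * (A_hat - I) X. In the sketch below a fixed row-normalized adjacency A_hat stands in for GRAND's learned, attention-based diffusivity; the graph, step size, and number of steps are assumptions.

```python
# Sketch of the diffusion view of GNNs described above (not the GRAND model): an
# explicit Euler discretisation of graph heat flow, X <- X + tau * (A_hat - I) X,
# where each step plays the role of one "layer".
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)                                  # undirected graph
np.fill_diagonal(A, 0.0)
A_hat = A / np.clip(A.sum(axis=1, keepdims=True), 1.0, None)   # row-normalized adjacency

X = rng.normal(size=(n, d))                             # initial node features
X0 = X.copy()
tau, steps = 0.2, 50
for _ in range(steps):
    X = X + tau * (A_hat - np.eye(n)) @ X               # one explicit Euler diffusion step

print("node-feature std before:", round(float(X0.std()), 3),
      "after diffusion:", round(float(X.std()), 3))     # features smooth toward consensus
```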
    Identifying Potential Exomoon Signals with Convolutional Neural Networks. (arXiv:2109.10503v1 [astro-ph.EP])
    (2 min) Targeted observations of possible exomoon host systems will remain difficult to obtain and time-consuming to analyze in the foreseeable future. As such, time-domain surveys such as Kepler, K2 and TESS will continue to play a critical role as the first step in identifying candidate exomoon systems, which may then be followed-up with premier ground- or space-based telescopes. In this work, we train an ensemble of convolutional neural networks (CNNs) to identify candidate exomoon signals in single-transit events observed by Kepler. Our training set consists of ${\sim}$27,000 examples of synthetic, planet-only and planet+moon single transits, injected into Kepler light curves. We achieve up to 88\% classification accuracy with individual CNN architectures and 97\% precision in identifying the moons in the validation set when the CNN ensemble is in total agreement. We then apply the CNN ensemble to light curves from 1880 Kepler Objects of Interest with periods $>10$ days ($\sim$57,000 individual transits), and further test the accuracy of the CNN classifier by injecting planet transits into each light curve, thus quantifying the extent to which residual stellar activity may result in false positive classifications. We find a small fraction of these transits contain moon-like signals, though we caution against strong inferences of the exomoon occurrence rate from this result. We conclude by discussing some ongoing challenges to utilizing neural networks for the exomoon search.
    Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction. (arXiv:2107.14432v2 [cs.LG] UPDATED)
    (2 min) We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, AdaHessian, and create a new class of optimizers, which are named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad and Group AdaHessian, etc., accordingly. We establish theoretically proven convergence guarantees in the stochastic convex settings, based on primal-dual methods. We evaluate the regularized effect of our new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results reveal that compared with the original optimizers with the post-processing procedure which uses the magnitude pruning method, the performance of the models can be significantly improved on the same sparsity level. Furthermore, in comparison to the cases without magnitude pruning, our methods can achieve extremely high sparsity with significantly better or highly competitive performance.
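    The effect of a group-lasso regularizer inside an optimizer can be illustrated with the group soft-thresholding (proximal) step applied after a plain gradient update: each weight group is shrunk and is zeroed entirely once its norm falls below a threshold. The step size, regularization strength, and group structure below are assumptions; the paper's Group Adam / Group AMSGrad derivations are not reproduced.

```python
# Sketch of a group-lasso proximal step (numpy), not the paper's adaptive optimizers:
# after a gradient step, each weight group is shrunk toward zero and dropped entirely
# when its norm falls below the threshold lr * lam.
import numpy as np

def group_lasso_prox(w, groups, lam, lr):
    w = w.copy()
    for g in groups:                                    # g = indices of one weight group
        norm = np.linalg.norm(w[g])
        scale = max(0.0, 1.0 - lr * lam / norm) if norm > 0 else 0.0
        w[g] *= scale                                   # group soft-thresholding
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=12)
grad = rng.normal(size=12)
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

w = group_lasso_prox(w - 0.1 * grad, groups, lam=15.0, lr=0.1)   # SGD step, then prox
print("group norms after prox:",
      [round(float(np.linalg.norm(w[g])), 3) for g in groups])
```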
    Minimising quantifier variance under prior probability shift. (arXiv:2107.08209v4 [stat.ML] UPDATED)
    (2 min) For the binary prevalence quantification problem under prior probability shift, we determine the asymptotic variance of the maximum likelihood estimator. We find that it is a function of the Brier score for the regression of the class label on the features under the test data set distribution. This observation suggests that optimising the accuracy of a base classifier, as measured by the Brier score, on the training data set helps to reduce the variance of the related quantifier on the test data set. Therefore, we also point out training criteria for the base classifier that imply optimisation of both of the Brier scores on the training and the test data sets.
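    The Brier score referred to above is, for a binary classifier, simply the mean squared difference between the predicted positive-class probability and the 0/1 label; a two-line sketch follows.

```python
# Brier score for a binary classifier: mean squared difference between the predicted
# positive-class probability and the 0/1 label (lower is better).
import numpy as np

def brier_score(prob_positive, labels):
    prob_positive = np.asarray(prob_positive, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((prob_positive - labels) ** 2))

print(brier_score([0.9, 0.2, 0.65], [1, 0, 1]))   # -> 0.0575
```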
    An Attention Free Transformer. (arXiv:2105.14103v2 [cs.LG] UPDATED)
    (2 min) We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible to both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the idea of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.
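    The element-wise AFT operation described above can be written in a few lines: keys are combined with pairwise learned position biases, exponentiated, used to average the values, and the result is gated by the sigmoid of the query. The sketch below covers the AFT-full form only; the sequence length, feature dimension, and the random "learned" tensors are illustrative assumptions, and this is not Apple's implementation.

```python
# Sketch of the AFT-full operation described above (numpy, not the authors' code):
#   Y_t = sigmoid(Q_t) * sum_t'[ exp(K_t' + w_{t,t'}) * V_t' ] / sum_t'[ exp(K_t' + w_{t,t'}) ]
import numpy as np

def aft_full(Q, K, V, w):
    # Q, K, V: (T, d); w: (T, T) pairwise learned position biases.
    logits = w[:, :, None] + K[None, :, :]            # (T, T, d): bias + key
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability (cancels in the ratio)
    weights = np.exp(logits)
    num = (weights * V[None, :, :]).sum(axis=1)       # position-weighted sum of values
    den = weights.sum(axis=1)
    return 1.0 / (1.0 + np.exp(-Q)) * (num / den)     # element-wise sigmoid(Q) gate

rng = np.random.default_rng(0)
T, d = 8, 16
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
w = rng.normal(size=(T, T)) * 0.1
print(aft_full(Q, K, V, w).shape)                     # -> (8, 16)
```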
    DAFNe: A One-Stage Anchor-Free Deep Model for Oriented Object Detection. (arXiv:2109.06148v2 [cs.CV] UPDATED)
    (3 min) Object detection is a fundamental task in computer vision. While approaches for axis-aligned bounding box detection have made substantial progress in recent years, they perform poorly on oriented objects which are common in several real-world scenarios such as aerial view imagery and security camera footage. In these cases, a large part of a predicted bounding box will, undesirably, cover non-object related areas. Therefore, oriented object detection has emerged with the aim of generalizing object detection to arbitrary orientations. This enables a tighter fit to oriented objects, leading to a better separation of bounding boxes especially in case of dense object distributions. The vast majority of the work in this area has focused on complex two-stage anchor-based approaches. Anchors act as priors on the bounding box shape and require attentive hyper-parameter fine-tuning on a per-dataset basis, increased model size, and come with computational overhead. In this work, we present DAFNe: A Dense one-stage Anchor-Free deep Network for oriented object detection. As a one-stage model, DAFNe performs predictions on a dense grid over the input image, being architecturally simpler and faster, as well as easier to optimize than its two-stage counterparts. Furthermore, as an anchor-free model, DAFNe reduces the prediction complexity by refraining from employing bounding box anchors. Moreover, we introduce an orientation-aware generalization of the center-ness function for arbitrarily oriented bounding boxes to down-weight low-quality predictions and a center-to-corner bounding box prediction strategy that improves object localization performance. DAFNe improves the prediction accuracy over the previous best one-stage anchor-free model results on DOTA 1.0 by 4.65% mAP, setting the new state-of-the-art results by achieving 76.95% mAP.
    Tianshou: a Highly Modularized Deep Reinforcement Learning Library. (arXiv:2107.14171v2 [cs.LG] UPDATED)
    (0 min) We present Tianshou, a highly modularized Python library for deep reinforcement learning (DRL) that uses PyTorch as its backend. Tianshou provides a flexible, reliable, yet simple implementation of a modular DRL library, and has supported more than 20 classic algorithms succinctly through a unified interface. To facilitate related research and prove Tianshou's reliability, we have released Tianshou's benchmark on MuJoCo environments, covering eight classic algorithms with state-of-the-art performance. We open-sourced Tianshou at https://github.com/thu-ml/tianshou/, which has received over 3.7k stars and become one of the most popular PyTorch-based DRL libraries.
    Cross-Subject Statistical Shift Estimation for Generalized Electroencephalography-based Mental Workload Assessment. (arXiv:1906.08823v2 [cs.LG] UPDATED)
    (0 min) Assessment of mental workload in real-world conditions is key to ensuring the performance of workers executing tasks that demand sustained attention. Previous literature has employed electroencephalography (EEG) to this end despite having observed that EEG correlates of mental workload vary across subjects and physical strain, thus making it difficult to devise models capable of simultaneously presenting reliable performance across users. Domain adaptation comprises a set of strategies that aim to improve the performance of machine learning systems on data unseen at training time. Such methods, however, might rely on assumptions about the considered data distributions, which typically do not hold for applications of EEG data. Motivated by this observation, in this work we propose a strategy to estimate two types of discrepancies between multiple data distributions, namely marginal and conditional shifts, observed on data collected from different subjects. Besides shedding light on the assumptions that hold for a particular dataset, the estimates of statistical shifts obtained with the proposed approach can be used for investigating other aspects of a machine learning pipeline, such as quantitatively assessing the effectiveness of domain adaptation strategies. In particular, we consider EEG data collected from individuals performing mental tasks while running on a treadmill and pedaling on a stationary bike and explore the effects of different normalization strategies commonly used to mitigate cross-subject variability. We show the effects that different normalization schemes have on statistical shifts and their relationship with the accuracy of mental workload prediction as assessed on unseen participants at training time.
    Uncertainty Bounds for Multivariate Machine Learning Predictions on High-Strain Brittle Fracture. (arXiv:2012.15739v2 [cond-mat.mtrl-sci] UPDATED)
    (0 min) Simulation of the crack network evolution on high strain rate impact experiments performed in brittle materials is very compute-intensive. The cost increases even more if multiple simulations are needed to account for the randomness in crack length, location, and orientation, which is inherently found in real-world materials. Constructing a machine learning emulator can make the process faster by orders of magnitude. There has been little work, however, on assessing the error associated with their predictions. Estimating these errors is imperative for meaningful overall uncertainty quantification. In this work, we extend the heteroscedastic uncertainty estimates to bound a multiple output machine learning emulator. We find that the response prediction is accurate within its predicted errors, but with a somewhat conservative estimate of uncertainty.
    Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures. (arXiv:2007.08970v3 [cs.CL] UPDATED)
    (0 min) While mainstream machine learning methods are known to have limited ability to compositionally generalize, new architectures and techniques continue to be proposed to address this limitation. We investigate state-of-the-art techniques and architectures in order to assess their effectiveness in improving compositional generalization in semantic parsing tasks based on the SCAN and CFQ datasets. We show that masked language model (MLM) pre-training rivals SCAN-inspired architectures on primitive holdout splits. On a more complex compositional task, we show that pre-training leads to significant improvements in performance vs. comparable non-pre-trained models, whereas architectures proposed to encourage compositional generalization on SCAN or in the area of algorithm learning fail to lead to significant improvements. We establish a new state of the art on the CFQ compositional generalization benchmark using MLM pre-training together with an intermediate representation.
    Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach. (arXiv:2102.10242v3 [cs.CL] UPDATED)
    (0 min) Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only show a very weak correlation with the actual human evaluation in practice. To bridge such a gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances of off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
    Practical distributed quantum information processing with LOCCNet. (arXiv:2101.12190v2 [quant-ph] UPDATED)
    (0 min) Distributed quantum information processing is essential for building quantum networks and enabling more extensive quantum computations. In this regime, several spatially separated parties share a multipartite quantum system, and the most natural set of operations is Local Operations and Classical Communication (LOCC). As a pivotal part in quantum information theory and practice, LOCC has led to many vital protocols such as quantum teleportation. However, designing practical LOCC protocols is challenging due to LOCC's intractable structure and limitations set by near-term quantum devices. Here we introduce LOCCNet, a machine learning framework facilitating protocol design and optimization for distributed quantum information processing tasks. As applications, we explore various quantum information tasks such as entanglement distillation, quantum state discrimination, and quantum channel simulation. We discover protocols with evident improvements, in particular, for entanglement distillation with quantum states of interest in quantum information. Our approach opens up new opportunities for exploring entanglement and its applications with machine learning, which will potentially sharpen our understanding of the power and limitations of LOCC. An implementation of LOCCNet is available in Paddle Quantum, a quantum machine learning Python package based on PaddlePaddle deep learning platform.
    Improved error rates for sparse (group) learning with Lipschitz loss functions. (arXiv:1910.08880v7 [stat.ML] UPDATED)
    (0 min) We study a family of sparse estimators defined as minimizers of some empirical Lipschitz loss function -- examples include the hinge loss, the logistic loss and the quantile regression loss -- with a convex, sparse or group-sparse regularization. In particular, we consider the L1 norm on the coefficients, its sorted Slope version, and the Group L1-L2 extension. We propose a new theoretical framework that uses common assumptions in the literature to simultaneously derive new high-dimensional L2 estimation upper bounds for all three regularization schemes. For L1 and Slope regularizations, our bounds scale as $(k^*/n) \log(p/k^*)$ -- $n\times p$ is the size of the design matrix and $k^*$ the dimension of the theoretical loss minimizer $\boldsymbol{\beta}^*$ -- and match the optimal minimax rate achieved for the least-squares case. For Group L1-L2 regularization, our bounds scale as $(s^*/n) \log\left( G / s^* \right) + m^* / n$ -- $G$ is the total number of groups and $m^*$ the number of coefficients in the $s^*$ groups which contain $\boldsymbol{\beta}^*$ -- and improve over the least-squares case. We show that, when the signal is strongly group-sparse, Group L1-L2 is superior to L1 and Slope. In addition, we adapt our approach to the sub-Gaussian linear regression framework and reach the optimal minimax rate for Lasso, and an improved rate for Group-Lasso. Finally, we release an accelerated proximal algorithm that computes the nine main convex estimators of interest when the number of variables is of the order of $100,000s$.
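    As a concrete member of this estimator family, an L1-penalised logistic regression (Lipschitz loss plus L1 regularisation) can be fit with scikit-learn; this is only an illustrative sketch, and the Slope and Group L1-L2 variants, as well as the authors' accelerated proximal algorithm, are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# High-dimensional toy data with a sparse informative support.
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=0)

# Logistic loss (Lipschitz) with an L1 penalty; the 'saga' solver handles L1.
est = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
est.fit(X, y)
print("non-zero coefficients:", int((est.coef_ != 0).sum()))
```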
    SCSS-Net: Solar Corona Structures Segmentation by Deep Learning. (arXiv:2109.10834v1 [astro-ph.SR])
    (0 min) Structures in the solar corona are the main drivers of space weather processes that might directly or indirectly affect the Earth. Thanks to the most recent space-based solar observatories, with capabilities to acquire high-resolution images continuously, the structures in the solar corona can be monitored over the years with a time resolution of minutes. For this purpose, we have developed a method for automatic segmentation of solar corona structures observed in the EUV spectrum that is based on a deep learning approach utilizing Convolutional Neural Networks. The available input datasets have been examined together with our own dataset based on the manual annotation of the target structures. Indeed, the input dataset is the main limitation of the developed model's performance. Our SCSS-Net model provides results for coronal holes and active regions that could be compared with other generally used methods for automatic segmentation. Moreover, it provides a universal procedure to identify structures in the solar corona with the help of the transfer learning technique. The outputs of the model can then be used for further statistical studies of connections between solar activity and the influence of space weather on Earth.
    Scalable and Efficient MoE Training for Multitask Multilingual Models. (arXiv:2109.10465v1 [cs.CL])
    (0 min) Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gain while consuming a much lower compute budget. However, supporting large-scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation, which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support of efficient MoE training has been implemented and open-sourced with the DeepSpeed library.
    Batch norm with entropic regularization turns deterministic autoencoders into generative models. (arXiv:2002.10631v2 [cs.LG] UPDATED)
    (0 min) The variational autoencoder is a well defined deep generative model that utilizes an encoder-decoder framework where an encoding neural network outputs a non-deterministic code for reconstructing an input. The encoder achieves this by sampling from a distribution for every input, instead of outputting a deterministic code per input. The great advantage of this process is that it allows the use of the network as a generative model for sampling from the data distribution beyond provided samples for training. We show in this work that utilizing batch normalization as a source for non-determinism suffices to turn deterministic autoencoders into generative models on par with variational ones, so long as we add a suitable entropic regularization to the training objective.
    ES-ENAS: Blackbox Optimization over Hybrid Spaces via Combinatorial and Continuous Evolution. (arXiv:2101.07415v3 [cs.LG] UPDATED)
    (0 min) We consider the problem of efficient blackbox optimization over a large hybrid search space, consisting of a mixture of a high dimensional continuous space and a complex combinatorial space. Such examples arise commonly in evolutionary computation, but also, more recently, in neuroevolution and architecture search for Reinforcement Learning (RL) policies. In this paper, we introduce ES-ENAS, a simple joint optimization procedure that combines Evolutionary Strategies (ES) and combinatorial optimization techniques in a highly scalable and intuitive way, inspired by the one-shot or supernet paradigm introduced in Efficient Neural Architecture Search (ENAS). Our main insight is noticing that ES is already a highly distributed algorithm involving hundreds of blackbox evaluations which can not only be used for training neural network weights, but also for feedback to a combinatorial optimizer. Through this relatively simple marriage between two different lines of research, we are able to gain the best of both worlds, and empirically demonstrate our approach by optimizing BBOB functions over hybrid spaces as well as combinatorial neural network architectures via edge pruning and quantization on popular RL benchmarks. Due to the modularity of the algorithm, we are also able to incorporate a wide variety of popular techniques, ranging from the use of different continuous and combinatorial optimizers to constrained optimization.
    CC-Cert: A Probabilistic Approach to Certify General Robustness of Neural Networks. (arXiv:2109.10696v1 [cs.LG])
    (0 min) In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks -- small modifications of the input that change the predictions. Besides rigorously studied $\ell_p$-bounded additive perturbations, recently proposed semantic perturbations (e.g. rotation, translation) raise serious concerns about deploying ML systems in the real world. Therefore, it is important to provide provable guarantees for deep learning models against semantically meaningful input transformations. In this paper, we propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds that can be used in general attack settings. We estimate the probability that a model fails if the attack is sampled from a certain distribution. Our theoretical findings are supported by experimental results on different datasets.
    Diarisation using Location tracking with agglomerative clustering. (arXiv:2109.10598v1 [cs.LG])
    (0 min) Previous works have shown that spatial location information can be complementary to speaker embeddings for a speaker diarisation task. However, the models used often assume that speakers are fairly stationary throughout a meeting. This paper proposes to relax this assumption, by explicitly modelling the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework. Kalman filters, which track the locations of speakers, are used to compute log-likelihood ratios that contribute to the cluster affinity computations for the AHC merging and stopping decisions. Experiments show that the proposed approach is able to yield improvements on a Microsoft rich meeting transcription task, compared to methods that do not use location information or that make stationarity assumptions.
    On the Treatment of Optimization Problems with L1 Penalty Terms via Multiobjective Continuation. (arXiv:2012.07483v2 [math.OC] UPDATED)
    (0 min) We present a novel algorithm that allows us to gain detailed insight into the effects of sparsity in linear and nonlinear optimization, which is of great importance in many scientific areas such as image and signal processing, medical imaging, compressed sensing, and machine learning (e.g., for the training of neural networks). Sparsity is an important feature to ensure robustness against noisy data, but also to find models that are interpretable and easy to analyze due to the small number of relevant terms. It is common practice to enforce sparsity by adding the $\ell_1$-norm as a weighted penalty term. In order to gain a better understanding and to allow for an informed model selection, we directly solve the corresponding multiobjective optimization problem (MOP) that arises when we minimize the main objective and the $\ell_1$-norm simultaneously. As this MOP is in general non-convex for nonlinear objectives, the weighting method will fail to provide all optimal compromises. To avoid this issue, we present a continuation method which is specifically tailored to MOPs with two objective functions one of which is the $\ell_1$-norm. Our method can be seen as a generalization of well-known homotopy methods for linear regression problems to the nonlinear case. Several numerical examples - including neural network training - demonstrate our theoretical findings and the additional insight that can be gained by this multiobjective approach.
    An artificial neural network approach to bifurcating phenomena in computational fluid dynamics. (arXiv:2109.10765v1 [physics.flu-dyn])
    (0 min) This work deals with the investigation of bifurcating fluid phenomena using a reduced order modelling setting aided by artificial neural networks. We discuss the POD-NN approach dealing with non-smooth solution sets of nonlinear parametrized PDEs. Thus, we study the Navier-Stokes equations describing: (i) the Coanda effect in a channel, and (ii) the lid driven triangular cavity flow, in a physical/geometrical multi-parametrized setting, considering the effects of the domain's configuration on the position of the bifurcation points. Finally, we propose a reduced manifold-based bifurcation diagram for a non-intrusive recovery of the evolution of the critical points. Exploiting such a detection tool, we are able to efficiently obtain information about the flow pattern behaviour, from symmetry breaking profiles to attaching/spreading vortices, even at high Reynolds numbers.
    Unsupervised Contextualized Document Representation. (arXiv:2109.10509v1 [cs.CL])
    (0 min) Several NLP tasks need the effective representation of text documents. Arora et al. (2017) demonstrate that simple weighted averaging of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) further extends this from sentences to documents by employing soft and sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual character of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT-based (Devlin et al., 2019) word embeddings for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate our embeddings' effectiveness on other tasks, such as concept matching and sentence similarity. In addition, we show that SCDV+BERT(ctxd) outperforms fine-tuned BERT and other embedding approaches in scenarios with limited data and only a few examples.
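    A hedged sketch of the two ingredients combined here, contextualized token embeddings from BERT followed by soft clustering, using Hugging Face transformers and scikit-learn; the SCDV weighting and sparsification steps are omitted, and the model name and cluster count below are illustrative choices.

```python
import torch
from sklearn.mixture import GaussianMixture
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

docs = ["the bank raised interest rates", "we sat on the river bank"]
token_vecs = []
with torch.no_grad():
    for d in docs:
        out = model(**tok(d, return_tensors="pt"))
        token_vecs.append(out.last_hidden_state[0])   # (seq_len, 768)
token_vecs = torch.cat(token_vecs).numpy()

# Soft clustering of contextualized token vectors, as in SCDV; document vectors
# would then be built from cluster-probability-weighted averages (not shown).
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm.fit(token_vecs)
print(gmm.predict_proba(token_vecs).shape)
```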
    Assisted Learning for Organizations with Limited Data. (arXiv:2109.09307v2 [cs.LG] UPDATED)
    (0 min) We develop an assisted learning framework for assisting organization-level learners to improve their learning performance with limited and imbalanced data. In particular, learners at the organization level usually have sufficient computation resource, but are subject to stringent collaboration policy and information privacy. Their limited imbalanced data often cause biased inference and sub-optimal decision-making. In our assisted learning framework, an organizational learner purchases assistance service from a service provider and aims to enhance its model performance within a few assistance rounds. We develop effective stochastic training algorithms for assisted deep learning and assisted reinforcement learning. Different from existing distributed algorithms that need to frequently transmit gradients or models, our framework allows the learner to only occasionally share information with the service provider, and still achieve a near-oracle model as if all the data were centralized.
    Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing. (arXiv:2109.10847v1 [cs.LG])
    (0 min) Recent progress in the Natural Language Processing domain has given us several State-of-the-Art (SOTA) pretrained models which can be finetuned for specific tasks. These large models with billions of parameters trained on numerous GPUs/TPUs over weeks are leading in the benchmark leaderboards. In this paper, we discuss the need for a benchmark for cost- and time-effective smaller models trained on a single GPU. This will enable researchers with resource constraints to experiment with novel and innovative ideas on tokenization, pretraining tasks, architecture, fine-tuning methods, etc. We set up Small-Bench NLP, a benchmark for small, efficient neural language models trained on a single GPU. The Small-Bench NLP benchmark comprises eight NLP tasks on the publicly available GLUE datasets and a leaderboard to track the progress of the community. Our ELECTRA-DeBERTa (15M parameters) small model architecture achieves an average score of 81.53, which is comparable to BERT-Base's 82.20 (110M parameters). Our models, code and leaderboard are available at https://github.com/smallbenchnlp
    Copyright in Generative Deep Learning. (arXiv:2105.09266v4 [cs.CY] UPDATED)
    (0 min) Machine-generated artworks are now part of the contemporary art scene: they are attracting significant investments and they are presented in exhibitions together with those created by human artists. These artworks are mainly based on generative deep learning techniques, which have seen a formidable development and remarkable refinement in very recent years. Given the inherent characteristics of these techniques, a series of novel legal problems arise. In this article, we consider a set of key questions in the area of generative deep learning for the arts, including the following: is it possible to use copyrighted works as a training set for generative models? How do we legally store their copies in order to perform the training process? Who (if anyone) will own the copyright on the generated data? We try to answer these questions considering the law in force in both the United States of America and the European Union, and potential future alternatives. We then extend our analysis to code generation, which is an emerging area of generative deep learning. Finally, we also formulate a set of practical guidelines for artists and developers working on deep-learning-generated art, as well as some policy suggestions for policymakers.
    Symmetry-Aware Reservoir Computing. (arXiv:2102.00310v4 [cs.NE] UPDATED)
    (0 min) We demonstrate that matching the symmetry properties of a reservoir computer (RC) to the data being processed dramatically increases its processing power. We apply our method to the parity task, a challenging benchmark problem that highlights inversion and permutation symmetries, and to a chaotic system inference task that presents an inversion symmetry rule. For the parity task, our symmetry-aware RC obtains zero error using an exponentially reduced neural network and training data, greatly speeding up the time to result and outperforming hand-crafted artificial neural networks. When both symmetries are respected, we find that the network size $N$ necessary to obtain zero error for 50 different RC instances scales linearly with the parity-order $n$. Moreover, some symmetry-aware RC instances perform zero-error classification with only $N=1$ for $n \leq 7$. Furthermore, we show that a symmetry-aware RC only needs a training data set with size on the order of $(n+n/2)$ to obtain such performance, an exponential reduction in comparison to a regular RC, which requires a training data set with size on the order of $n 2^n$ to contain all $2^n$ possible $n$-bit-long sequences. For the inference task, we show that a symmetry-aware RC presents a normalized root-mean-square error three orders of magnitude smaller than regular RCs. For both tasks, our RC approach respects the symmetries by adjusting only the input and the output layers, and not by problem-based modifications to the neural network. We anticipate that generalizations of our procedure can be applied in information processing for problems with known symmetries.
    Geo-Context Aware Study of Vision-Based Autonomous Driving Models and Spatial Video Data. (arXiv:2109.10895v1 [cs.HC])
    (0 min) Vision-based deep learning (DL) methods have made great progress in learning autonomous driving models from large-scale crowd-sourced video datasets. They are trained to predict instantaneous driving behaviors from video data captured by on-vehicle cameras. In this paper, we develop a geo-context aware visualization system for the study of Autonomous Driving Model (ADM) predictions together with large-scale ADM video data. The visual study is seamlessly integrated with the geographical environment by combining DL model performance with geospatial visualization techniques. Model performance measures can be studied together with a set of geospatial attributes over map views. Users can also discover and compare prediction behaviors of multiple DL models in both city-wide and street-level analysis, together with road images and video contents. Therefore, the system provides a new visual exploration platform for DL model designers in autonomous driving. Use cases and domain expert evaluation show the utility and effectiveness of the visualization system.
    Towards The Automatic Coding of Medical Transcripts to Improve Patient-Centered Communication. (arXiv:2109.10514v1 [cs.CL])
    (0 min) This paper aims to provide an approach for automatic coding of physician-patient communication transcripts to improve patient-centered communication (PCC). PCC is a central part of high-quality health care. To improve PCC, dialogues between physicians and patients have been recorded and tagged with predefined codes. Trained human coders have manually coded the transcripts. Since manual coding entails huge labor costs and is prone to human error, automatic coding methods should be considered for efficiency and effectiveness. We adopted three machine learning algorithms (Naïve Bayes, Random Forest, and Support Vector Machine) to categorize lines in transcripts into corresponding codes. The results showed that there is evidence that the codes can be distinguished, which is considered sufficient for the training of human annotators.
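    A minimal scikit-learn sketch of one of the three classifiers mentioned, applied to transcript lines; the example lines and code labels below are invented stand-ins, since the actual transcripts and coding scheme are not given in the abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented stand-ins for transcript lines and their communication codes.
lines = ["how are you feeling today",
         "take this medication twice a day",
         "i am worried about the results",
         "let us schedule a follow up visit"]
codes = ["rapport", "instruction", "concern", "logistics"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(lines, codes)
print(clf.predict(["please take the pills after breakfast"]))
```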
    Physics-informed Neural Networks-based Model Predictive Control for Multi-link Manipulators. (arXiv:2109.10793v1 [math.OC])
    (0 min) We discuss nonlinear model predictive control (NMPC) for multi-body dynamics via physics-informed machine learning methods. Physics-informed neural networks (PINNs) are a promising tool to approximate (partial) differential equations. PINNs are not suited for control tasks in their original form since they are not designed to handle variable control actions or variable initial values. We thus present the idea of enhancing PINNs by adding control actions and initial conditions as additional network inputs. The high-dimensional input space is subsequently reduced via a sampling strategy and a zero-hold assumption. This strategy enables the controller design based on a PINN as an approximation of the underlying system dynamics. The additional benefit is that the sensitivities are easily computed via automatic differentiation, thus leading to efficient gradient-based algorithms. Finally, we present our results using our PINN-based MPC to solve a tracking problem for a complex mechanical system, a multi-link manipulator.
    Inductive logic programming at 30. (arXiv:2102.10556v2 [cs.AI] UPDATED)
    (0 min) Inductive logic programming (ILP) is a form of logic-based machine learning. The goal is to induce a hypothesis (a logic program) that generalises given training examples. As ILP turns 30, we review the last decade of research. We focus on (i) new meta-level search methods, (ii) techniques for learning recursive programs, (iii) new approaches for predicate invention, and (iv) the use of different technologies. We conclude by discussing current limitations of ILP and directions for future research.
    Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning. (arXiv:2011.03186v3 [cs.LG] UPDATED)
    (0 min) The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches in differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC-classes in the realizable setting, but falls short in explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill in this gap by introducing the Tsybakov Noise Condition (TNC) and establish stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning for saving privacy budget, and empirical studies show the effectiveness of this new idea. The novel components in the proofs include a more refined analysis of the majority voting classifier -- which could be of independent interest -- and an observation that the synthetic "student" learning problem is nearly realizable by construction under the Tsybakov noise condition.
    Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images. (arXiv:2109.10778v1 [cs.CV])
    (0 min) Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and machine learning algorithm development. However, generating exhaustive and accurate annotations is labor-intensive, challenging, and costly. Drawing only coarse and approximate annotations is a much easier task, less costly, and it alleviates pathologists' workload. In this paper, we study the problem of refining these approximate annotations in digital pathology to obtain more accurate ones. Some previous works have explored obtaining machine learning models from these inaccurate annotations, but few of them tackle the refinement problem where the mislabeled regions should be explicitly identified and corrected, and all of them require an (often very large) number of training samples. We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need for external training data. Patches cropped from a WSI with inaccurate labels are processed jointly with a MIL framework, and a deep-attention mechanism is leveraged to discriminate mislabeled instances, mitigating their impact on the predictive model and refining the segmentation. Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming the state-of-the-art alternatives, even while learning from a single slide. These results demonstrate that LC-MIL is a promising, lightweight tool to provide fine-grained annotations from coarsely annotated pathology sets.
    SGD May Never Escape Saddle Points. (arXiv:2107.11774v2 [cs.LG] UPDATED)
    (0 min) Stochastic gradient descent (SGD) has been deployed to solve highly non-linear and non-convex machine learning problems such as the training of deep neural networks. However, previous works on SGD often rely on restrictive and unrealistic assumptions about the nature of noise in SGD. In this work, we mathematically construct examples that defy previous understandings of SGD. For example, our constructions show that: (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over the flat ones; and (4) AMSGrad may converge to a local maximum. We also show the relevance of our result to deep learning by presenting a minimal neural network example. Our result suggests that the noise structure of SGD might be more important than the loss landscape in neural network training and that future research should focus on deriving the actual noise structure in deep learning.
    Multi-task Learning with Cross Attention for Keyword Spotting. (arXiv:2107.07634v2 [eess.AS] UPDATED)
    (0 min) Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase. Although a phoneme classifier can be used for KWS, exploiting a large amount of transcribed data for automatic speech recognition (ASR), there is a mismatch between the training criterion (phoneme recognition) and the target task (KWS). Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data. In this approach, an output of an acoustic model is split into two branches for the two tasks, one for phoneme transcription trained with the ASR data and one for keyword classification trained with the KWS data. In this paper, we introduce a cross attention decoder in the multi-task learning framework. Unlike the conventional multi-task learning approach with the simple split of the output layer, the cross attention decoder summarizes information from a phonetic encoder by performing cross attention between the encoder outputs and a trainable query sequence to predict a confidence score for the KWS task. Experimental results on KWS tasks show that the proposed approach achieves a 12% relative reduction in the false reject ratios compared to the conventional multi-task learning with split branches and a bi-directional long short-term memory decoder.
    Recursively Summarizing Books with Human Feedback. (arXiv:2109.10862v1 [cs.CL])
    (0 min) A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases ($\sim5\%$ of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.
    Inference for Hit Enrichment Curves, with Applications to Drug Discovery. (arXiv:1912.09526v2 [stat.AP] UPDATED)
    (0 min) In virtual screening for drug discovery, hit enrichment curves are widely used to assess the performance of ranking algorithms with regard to their ability to identify early enrichment. Unfortunately, researchers almost never consider the uncertainty associated with estimating such curves before declaring differences between the performance of competing algorithms. Appropriate inference is complicated by two sources of correlation that are often overlooked: correlation across different testing fractions within a single algorithm, and correlation between competing algorithms. Additionally, researchers are often interested in making comparisons along the entire curve, not only at a few testing fractions. We develop inferential procedures to address the needs both of those interested in a few testing fractions and of those interested in the entire curve. For the former, four hypothesis testing and (pointwise) confidence interval procedures are investigated, and a newly developed EmProc approach is found to be most effective. For inference along entire curves, EmProc-based confidence bands are recommended for simultaneous coverage and minimal width. Our inferential procedures trivially extend to enrichment factors, as well.
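    For readers unfamiliar with the metric, a hit enrichment curve can be computed pointwise as below; this is a generic sketch of the point estimate only, on top of which the paper's confidence intervals and simultaneous bands would be constructed.

```python
import numpy as np

def hit_enrichment_curve(scores, is_hit, fractions):
    """Fraction of all hits recovered within the top tested fraction of
    compounds when compounds are ranked by the algorithm's score."""
    order = np.argsort(-np.asarray(scores))
    hits_sorted = np.asarray(is_hit, dtype=float)[order]
    n, total_hits = len(hits_sorted), hits_sorted.sum()
    return [hits_sorted[: max(1, int(round(f * n)))].sum() / total_hits
            for f in fractions]

rng = np.random.default_rng(0)
scores = rng.random(1000)
is_hit = rng.random(1000) < 0.05
print(hit_enrichment_curve(scores, is_hit, fractions=[0.01, 0.05, 0.10]))
```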
    Quantifying Model Predictive Uncertainty with Perturbation Theory. (arXiv:2109.10888v1 [cs.LG])
    (0 min) We propose a framework for predictive uncertainty quantification of a neural network that replaces the conventional Bayesian notion of weight probability density function (PDF) with a physics based potential field representation of the model weights in a Gaussian reproducing kernel Hilbert space (RKHS) embedding. This allows us to use perturbation theory from quantum physics to formulate a moment decomposition problem over the model weight-output relationship. The extracted moments reveal successive degrees of regularization of the weight potential field around the local neighborhood of the model output. Such localized moments represent well the PDF tails and provide significantly greater accuracy of the model's predictive uncertainty than the central moments characterized by Bayesian and ensemble methods or their variants. We show that this consequently leads to a better ability to detect false model predictions of test data that has undergone a covariate shift away from the training PDF learned by the model. We evaluate our approach against baseline uncertainty quantification methods on several benchmark datasets that are corrupted using common distortion techniques. Our approach provides fast model predictive uncertainty estimates with much greater precision and calibration.
    Anticipative Video Transformer. (arXiv:2106.02036v2 [cs.CV] UPDATED)
    (0 min) We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal aggregation strategies, AVT has the advantage of both maintaining the sequential progression of observed actions while still capturing long-range dependencies--both critical for the anticipation task. Through extensive experiments, we show that AVT obtains the best reported performance on four popular action anticipation benchmarks: EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads; and it wins first place in the EpicKitchens-100 CVPR'21 challenge.
    On Resource-Efficient Bayesian Network Classifiers and Deep Neural Networks. (arXiv:2010.11773v2 [cs.LG] UPDATED)
    (0 min) We present two methods to reduce the complexity of Bayesian network (BN) classifiers. First, we introduce quantization-aware training using the straight-through gradient estimator to quantize the parameters of BNs to few bits. Second, we extend a recently proposed differentiable tree-augmented naive Bayes (TAN) structure learning approach by also considering the model size. Both methods are motivated by recent developments in the deep learning community, and they provide effective means to trade off between model size and prediction accuracy, which is demonstrated in extensive experiments. Furthermore, we contrast quantized BN classifiers with quantized deep neural networks (DNNs) for small-scale scenarios which have hardly been investigated in the literature. We show Pareto optimal models with respect to model size, number of operations, and test error and find that both model classes are viable options.
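    The straight-through estimator mentioned in the first method can be sketched in PyTorch as below; the uniform quantiser and bit-width are illustrative assumptions, and the paper applies the idea to Bayesian network parameters rather than to an arbitrary tensor.

```python
import torch

def quantize_ste(w, num_bits=4):
    """Uniform quantization in the forward pass with a straight-through
    (identity) gradient in the backward pass, via the detach trick."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-8
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()   # forward: w_q, backward: gradient w.r.t. w

w = torch.randn(8, requires_grad=True)
loss = quantize_ste(w).pow(2).sum()
loss.backward()                      # gradients flow as if no quantization
print(w.grad)
```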
    The First Vision For Vitals (V4V) Challenge for Non-Contact Video-Based Physiological Estimation. (arXiv:2109.10471v1 [cs.CY])
    (0 min) Telehealth has the potential to offset the high demand for help during public health emergencies, such as the COVID-19 pandemic. Remote Photoplethysmography (rPPG) - the problem of non-invasively estimating blood volume variations in the microvascular tissue from video - would be well suited for these situations. Over the past few years a number of research groups have made rapid advances in remote PPG methods for estimating heart rate from digital video and obtained impressive results. How these various methods compare in naturalistic conditions, where spontaneous behavior, facial expressions, and illumination changes are present, is relatively unknown. To enable comparisons among alternative methods, the 1st Vision for Vitals Challenge (V4V) presented a novel dataset containing high-resolution videos time-locked with varied physiological signals from a diverse population. In this paper, we outline the evaluation protocol, the data used, and the results. V4V is to be held in conjunction with the 2021 International Conference on Computer Vision.
    Adaptive Neural Message Passing for Inductive Learning on Hypergraphs. (arXiv:2109.10683v1 [cs.LG])
    (0 min) Graphs are the most ubiquitous data structures for representing relational datasets and performing inferences in them. They model, however, only pairwise relations between nodes and are not designed for encoding the higher-order relations. This drawback is mitigated by hypergraphs, in which an edge can connect an arbitrary number of nodes. Most hypergraph learning approaches convert the hypergraph structure to that of a graph and then deploy existing geometric deep learning methods. This transformation leads to information loss, and sub-optimal exploitation of the hypergraph's expressive power. We present HyperMSG, a novel hypergraph learning framework that uses a modular two-level neural message passing strategy to accurately and efficiently propagate information within each hyperedge and across the hyperedges. HyperMSG adapts to the data and task by learning an attention weight associated with each node's degree centrality. Such a mechanism quantifies both local and global importance of a node, capturing the structural properties of a hypergraph. HyperMSG is inductive, allowing inference on previously unseen nodes. Further, it is robust and outperforms state-of-the-art hypergraph learning methods on a wide range of tasks and datasets. Finally, we demonstrate the effectiveness of HyperMSG in learning multimodal relations through detailed experimentation on a challenging multimedia dataset.
    Introducing Symmetries to Black Box Meta Reinforcement Learning. (arXiv:2109.10781v1 [cs.LG])
    (0 min) Meta reinforcement learning (RL) attempts to discover new RL algorithms automatically from environment interaction. In so-called black-box approaches, the policy and the learning algorithm are jointly represented by a single neural network. These methods are very flexible, but they tend to underperform in terms of generalisation to new, unseen environments. In this paper, we explore the role of symmetries in meta-generalisation. We show that a recent successful meta RL approach that meta-learns an objective for backpropagation-based learning exhibits certain symmetries (specifically the reuse of the learning rule, and invariance to input and output permutations) that are not present in typical black-box meta RL systems. We hypothesise that these symmetries can play an important role in meta-generalisation. Building off recent work in black-box supervised meta learning, we develop a black-box meta RL system that exhibits these same symmetries. We show through careful experimentation that incorporating these symmetries can lead to algorithms with a greater ability to generalise to unseen action & observation spaces, tasks, and environments.
    Unsupervised Movement Detection in Indoor Positioning Systems. (arXiv:2109.10757v1 [cs.LG])
    (0 min) In recent years, the usage of indoor positioning systems for manufacturing processes has become increasingly popular. Typically, the production hall is equipped with satellites that receive position data of sensors that can be pinned on components, load carriers or industrial trucks. This enables a company e.g. to reduce search efforts and to optimize individual system processes. In our research context, a sensor only sends position information when it is moved. However, various circumstances, e.g. disrupting factors nearby, frequently cause data to be sent undesirably. This has a negative impact on the data quality, the energy consumption, and the reliability of the whole system. Motivated by this, we aim to distinguish between actual movements and signals that were undesirably sent, which is particularly challenging due to the susceptibility of indoor systems in terms of noise and measuring errors. Therefore, we propose two novel unsupervised classification algorithms suitable for this task. Depending on the question of interest, they rely either on a distance-based or on a time-based criterion, which allows us to make use of all essential information. Furthermore, we propose an approach to combine both classifications and to aggregate them on spatial production areas. This enables us to generate a comprehensive map of the underlying production hall with the sole usage of the position data. Aside from the analysis and detection of the underlying movement structure, the user benefits from a better understanding of their own system processes and from the detection of problematic system areas, which leads to a more efficient usage of positioning systems. Since all our approaches are constructed with unsupervised techniques, they are readily applicable in practice and do not require more information than the output data of the positioning system.
    Vehicle Behavior Prediction and Generalization Using Imbalanced Learning Techniques. (arXiv:2109.10656v1 [cs.RO])
    (0 min) The use of learning-based methods for vehicle behavior prediction is a promising research topic. However, many publicly available data sets suffer from class distribution skews, which limit learning performance if not addressed. This paper proposes an interaction-aware prediction model consisting of an LSTM autoencoder and an SVM classifier. Additionally, an imbalanced learning technique, the multiclass balancing ensemble, is proposed. Evaluations show that the method enhances model performance, resulting in improved classification accuracy. Good generalization properties of learned models are important, and therefore a generalization study is conducted where models are evaluated on unseen traffic data with dissimilar traffic behavior stemming from different road configurations. This is realized by using two distinct highway traffic recordings, the publicly available NGSIM US-101 and I80 data sets. Moreover, methods for encoding structural and static features into the learning process for improved generalization are evaluated. The resulting methods show substantial improvements in classification as well as generalization performance.
    Application of Video-to-Video Translation Networks to Computational Fluid Dynamics. (arXiv:2109.10679v1 [physics.flu-dyn])
    (0 min) In recent years, the evolution of artificial intelligence, especially deep learning, has been remarkable, and its application to various fields has been growing rapidly. In this paper, I report the results of the application of generative adversarial networks (GANs), specifically video-to-video translation networks, to computational fluid dynamics (CFD) simulations. The purpose of this research is to reduce the computational cost of CFD simulations with GANs. The architecture of GANs in this research is a combination of the image-to-image translation networks (the so-called "pix2pix") and Long Short-Term Memory (LSTM). It is shown that the results of high-cost and high-accuracy simulations (with high-resolution computational grids) can be estimated from those of low-cost and low-accuracy simulations (with low-resolution grids). In particular, the time evolution of density distributions in the cases of a high-resolution grid is reproduced from that in the cases of a low-resolution grid through GANs, and the density inhomogeneity estimated from the image generated by GANs recovers the ground truth with good accuracy. Qualitative and quantitative comparisons of the results of the proposed method with those of several super-resolution algorithms are also presented.
    Neural network relief: a pruning algorithm based on neural activity. (arXiv:2109.10795v1 [cs.LG])
    (0 min) Current deep neural networks (DNNs) are overparameterized and use most of their neuronal connections during inference for each task. The human brain, however, developed specialized regions for different tasks and performs inference with a small fraction of its neuronal connections. We propose an iterative pruning strategy introducing a simple importance-score metric that deactivates unimportant connections, tackling overparameterization in DNNs and modulating the firing patterns. The aim is to find the smallest number of connections that is still capable of solving a given task with comparable accuracy, i.e. a simpler subnetwork. We achieve comparable performance for LeNet architectures on MNIST, and significantly higher parameter compression than state-of-the-art algorithms for VGG and ResNet architectures on CIFAR-10/100 and Tiny-ImageNet. Our approach also performs well for the two different optimizers considered -- Adam and SGD. The algorithm is not designed to minimize FLOPs when considering current hardware and software implementations, although it performs reasonably when compared to the state of the art.
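    One plausible, simplified reading of an activity-based importance score, sketched for a single fully connected layer: connections whose average contribution |w_ij * a_j| to the pre-activations is smallest are deactivated. The paper's actual metric and its iterative schedule differ in detail; this is only an illustrative stand-in.

```python
import numpy as np

def prune_connections(W, activations, keep_ratio=0.2):
    """W: (out, in) weight matrix; activations: (batch, in) layer inputs.
    Keeps roughly the top `keep_ratio` fraction of connections by importance."""
    mean_abs_act = np.abs(activations).mean(axis=0)       # (in,)
    importance = np.abs(W) * mean_abs_act[None, :]        # |w_ij| * mean |a_j|
    k = max(1, int(keep_ratio * W.size))
    threshold = np.sort(importance, axis=None)[-k]
    mask = importance >= threshold
    return W * mask, mask

W = np.random.randn(64, 128)
acts = np.random.randn(256, 128)
W_pruned, mask = prune_connections(W, acts)
print("kept connections:", int(mask.sum()), "of", mask.size)
```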
    Deep Augmented MUSIC Algorithm for Data-Driven DoA Estimation. (arXiv:2109.10581v1 [eess.SP])
    (0 min) Direction of arrival (DoA) estimation is a crucial task in sensor array signal processing, giving rise to various successful model-based (MB) algorithms as well as recently developed data-driven (DD) methods. This paper introduces a new hybrid MB/DD DoA estimation architecture, based on the classical multiple signal classification (MUSIC) algorithm. Our approach augments crucial aspects of the original MUSIC structure with specifically designed neural architectures, allowing it to overcome certain limitations of the purely MB method, such as its inability to successfully localize coherent sources. The deep augmented MUSIC algorithm is shown to outperform its unaltered version with a superior resolution.
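    For context, the classical model-based MUSIC pseudo-spectrum that the hybrid architecture augments can be sketched as follows for a uniform linear array with half-wavelength spacing; this is a textbook formulation, not the paper's augmented network.

```python
import numpy as np

def music_spectrum(X, n_sources, angles_deg):
    """X: (n_sensors, n_snapshots) complex snapshots from a uniform linear array."""
    n_sensors = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
    En = eigvecs[:, : n_sensors - n_sources]       # noise-subspace eigenvectors
    spectrum = []
    for theta in np.deg2rad(angles_deg):
        a = np.exp(-1j * np.pi * np.arange(n_sensors) * np.sin(theta))
        spectrum.append(1.0 / np.abs(a.conj() @ En @ En.conj().T @ a))
    return np.array(spectrum)

# DoA estimates are read off as the peaks of the pseudo-spectrum over the grid.
```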
    Investigating and Modeling the Dynamics of Long Ties. (arXiv:2109.10523v1 [cs.SI])
    (0 min) Long ties, the social ties that bridge different communities, are widely believed to play crucial roles in spreading novel information in social networks. However, some existing network theories and prediction models indicate that long ties might dissolve quickly or eventually become redundant, thus putting into question the long-term value of long ties. Our empirical analysis of real-world dynamic networks shows that contrary to such reasoning, long ties are more likely to persist than other social ties, and that many of them constantly function as social bridges without being embedded in local networks. Using a novel cost-benefit analysis model combined with machine learning, we show that long ties are highly beneficial, which instinctively motivates people to expend extra effort to maintain them. This partly explains why long ties are more persistent than what has been suggested by many existing theories and models. Overall, our study suggests the need for social interventions that can promote the formation of long ties, such as mixing people with diverse backgrounds.
    Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2109.10632v1 [cs.AI])
    (0 min) Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents. As environments grow in size, effective credit assignment becomes increasingly harder and often results in infeasible learning times. Still, in many real-world settings, there exist simplified underlying dynamics that can be leveraged for more scalable solutions. In this work, we exploit such locality structures effectively whilst maintaining global cooperation. We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards in the Centralized Training Decentralized Execution paradigm. Additionally, we provide a direct reward decomposition method for finding these local rewards when only a global signal is provided. We test our method empirically, showing it scales well compared to other methods, significantly improving performance and convergence speed.
    Backdoor Attacks on Federated Learning with Lottery Ticket Hypothesis. (arXiv:2109.10512v1 [cs.LG])
    (0 min) Edge devices in federated learning usually have much more limited computation and communication resources compared to servers in a data center. Recently, advanced model compression methods, like the Lottery Ticket Hypothesis, have already been implemented on federated learning to reduce the model size and communication cost. However, backdoor attacks can compromise its implementation in the federated learning scenario. The malicious edge device trains the client model with poisoned private data and uploads parameters to the center, embedding a backdoor into the globally shared model after unwitting aggregative optimization. During the inference phase, the model with backdoors classifies samples with a certain trigger as one target category, while showing a slight decrease in inference accuracy on clean samples. In this work, we empirically demonstrate that Lottery Ticket models are equally vulnerable to backdoor attacks as the original dense models, and backdoor attacks can influence the structure of extracted tickets. Based on the similarities between tickets, we provide a feasible defense for federated learning against backdoor attacks on various datasets.
    Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods. (arXiv:2109.10736v1 [cs.LG])
    (0 min) In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies. We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have a high variance, a significant underestimation bias arises. To minimize the underestimation, we introduce a parameter-free, novel deep Q-learning variant. Our Q-value update rule combines the notions behind Clipped Double Q-learning and Maxmin Q-learning by computing the critic objective through the nested combination of maximum and minimum operators to bound the approximate value estimates. We evaluate our modification on the suite of several OpenAI Gym continuous control tasks, improving the state-of-the-art in every environment tested.
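    For reference, the Clipped Double Q-learning target that the proposed update rule builds on bootstraps with the minimum of two target-critic estimates; the paper's nested maximum/minimum combination over further estimates is not reproduced here.

```python
import numpy as np

def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Standard Clipped Double Q-learning target: take the minimum of two
    target critics to curb overestimation of the bootstrapped value."""
    return reward + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

r = np.array([1.0, 0.0]); d = np.array([0.0, 1.0])
print(clipped_double_q_target(r, d, np.array([5.0, 3.0]), np.array([4.0, 6.0])))
```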
    Decentralized Learning of Tree-Structured Gaussian Graphical Models from Noisy Data. (arXiv:2109.10642v1 [cs.LG])
    (0 min) This paper studies the decentralized learning of tree-structured Gaussian graphical models (GGMs) from noisy data. In decentralized learning, the data set is distributed across different machines (sensors), and GGMs are widely used to model complex networks such as gene regulatory networks and social networks. The proposed decentralized learning uses the Chow-Liu algorithm for estimating the tree-structured GGM. In previous works, upper bounds on the probability of incorrect tree-structure recovery were mostly derived without modeling any practical noise, for simplicity. This paper instead investigates the effects of three common types of noisy channels: Gaussian, erasure, and binary symmetric. For the Gaussian channel, to satisfy a failure probability upper bound $\delta > 0$ in recovering a $d$-node tree structure, our proposed theorem requires a smallest sample size $n$ of only $\mathcal{O}(\log(\frac{d}{\delta}))$ samples, compared with the $\mathcal{O}(\log^4(\frac{d}{\delta}))$ samples of the previous literature \cite{Nikolakakis}, by using the positive-correlation-coefficient assumption adopted in several important works in the literature. Moreover, the approximately bounded Gaussian random variable assumption does not appear in \cite{Nikolakakis}. Given some knowledge about the tree structure, the proposed algorithmic bound achieves markedly better performance at small sample sizes (e.g., $n < 2000$) compared with the formulaic bounds. Finally, we validate our theoretical results by performing simulations on synthetic data sets.
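    Since the estimator named above is the Chow-Liu algorithm, a minimal sketch may help orient the reader: for Gaussian data, pairwise mutual information is a monotone function of the absolute correlation, so the tree is a maximum-weight spanning tree over absolute empirical correlations. The noise handling and sample-complexity guarantees are the paper's contribution, not this sketch's.

        import numpy as np
        import networkx as nx

        def chow_liu_tree(samples):
            # samples: (n, d) array of possibly noisy observations.
            rho = np.corrcoef(samples, rowvar=False)
            G = nx.Graph()
            d = rho.shape[0]
            for i in range(d):
                for j in range(i + 1, d):
                    G.add_edge(i, j, weight=abs(rho[i, j]))
            return nx.maximum_spanning_tree(G)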
    Imitation Learning of Stabilizing Policies for Nonlinear Systems. (arXiv:2109.10854v1 [math.OC])
    (0 min) There has been a recent interest in imitation learning methods that are guaranteed to produce a stabilizing control law with respect to a known system. Work in this area has generally considered linear systems and controllers, for which stabilizing imitation learning takes the form of a biconvex optimization problem. In this paper it is demonstrated that the same methods developed for linear systems and controllers can be readily extended to polynomial systems and controllers using sum of squares techniques. A projected gradient descent algorithm and an alternating direction method of multipliers algorithm are proposed as heuristics for solving the stabilizing imitation learning problem, and their performance is illustrated through numerical experiments.
    Multi-Slice Clustering for 3-order Tensor Data. (arXiv:2109.10803v1 [cs.LG])
    (0 min) Several methods for the triclustering of three-dimensional data require the cluster size in each dimension to be specified, which introduces a certain degree of arbitrariness. To address this issue, we propose a new method, namely multi-slice clustering (MSC), for 3-order tensor data sets. In each dimension, or tensor mode, we analyse the spectral decomposition of every tensor slice, i.e. a matrix. We then define a similarity measure between matrix slices up to a threshold (precision) parameter and, from that, identify a cluster. The intersection of all partial clusters provides the desired triclustering. The effectiveness of our algorithm is shown on both synthetic and real-world data sets.
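    A rough sketch of the per-mode step, under loudly stated assumptions: each slice is summarized by its top singular vector, and slices whose signatures agree up to the precision threshold are grouped. The signature choice and the seeding at the first slice are simplifications, not the paper's exact procedure.

        import numpy as np

        def slice_cluster(tensor, mode=0, tau=0.9):
            # Unstack the 3-order tensor along `mode` into matrix slices.
            slices = np.moveaxis(tensor, mode, 0)
            # Spectral signature of each slice: its top left singular vector.
            sigs = [np.linalg.svd(s, full_matrices=False)[0][:, 0]
                    for s in slices]
            ref = sigs[0]  # seed the cluster at the first slice (simplified)
            return [i for i, v in enumerate(sigs) if abs(ref @ v) >= tau]

    Running this for each of the three modes and intersecting the resulting index sets gives the tricluster described above.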
    Deep Variational Clustering Framework for Self-labeling of Large-scale Medical Images. (arXiv:2109.10777v1 [cs.CV])
    (0 min) We propose a Deep Variational Clustering (DVC) framework for unsupervised representation learning and clustering of large-scale medical images. DVC simultaneously learns the multivariate Gaussian posterior through a probabilistic convolutional encoder and the likelihood distribution through a probabilistic convolutional decoder, while optimizing cluster label assignment. Here, the learned multivariate Gaussian posterior captures the latent distribution of a large set of unlabeled images. We then perform unsupervised clustering on top of the variational latent space using a clustering loss. In this approach, the probabilistic decoder helps to prevent the distortion of data points in the latent space and to preserve the local structure of the data-generating distribution. The training process can be viewed as self-training that refines the latent space while iteratively optimizing cluster assignments. We evaluated our proposed framework on three public datasets representing different medical imaging modalities. Our experimental results show that the proposed framework generalizes better across datasets and achieves compelling results on several medical imaging benchmarks. Thus, our approach offers potential advantages over conventional deep unsupervised learning in real-world applications. The source code of the method and all the experiments are available publicly at: https://github.com/csfarzin/DVC
    Early Lane Change Prediction for Automated Driving Systems Using Multi-Task Attention-based Convolutional Neural Networks. (arXiv:2109.10742v1 [cs.CV])
    (0 min) Lane change (LC) is one of the safety-critical manoeuvres in highway driving according to various road accident records. Thus, reliably predicting such manoeuvres in advance is critical for the safe and comfortable operation of automated driving systems. The majority of previous studies rely on detecting a manoeuvre that has already started, rather than predicting the manoeuvre in advance. Furthermore, most previous works do not estimate the key timings of the manoeuvre (e.g., crossing time), which can yield more useful information for decision making in the ego vehicle. To address these shortcomings, this paper proposes a novel multi-task model to simultaneously estimate the likelihood of LC manoeuvres and the time-to-lane-change (TTLC). In both tasks, an attention-based convolutional neural network (CNN) is used as a shared feature extractor over a bird's-eye-view representation of the driving environment. The spatial attention used in the CNN model improves the feature extraction process by focusing on the most relevant areas of the surrounding environment. In addition, two novel curriculum learning schemes are employed to train the proposed approach. Extensive evaluation and comparative analysis on existing benchmark datasets show that the proposed method outperforms state-of-the-art LC prediction models, particularly in long-term prediction performance.
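    To make the multi-task setup concrete, here is a minimal PyTorch sketch of a shared feature extractor with one classification head and one regression head; the layer sizes are assumptions, and the spatial attention module is omitted for brevity.

        import torch.nn as nn

        class LCNet(nn.Module):
            def __init__(self):
                super().__init__()
                # Shared CNN backbone over the bird's-eye-view input.
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(8), nn.Flatten())
                self.lc_head = nn.Linear(16 * 64, 3)    # left / keep / right
                self.ttlc_head = nn.Linear(16 * 64, 1)  # time-to-lane-change

            def forward(self, bev):
                z = self.backbone(bev)
                return self.lc_head(z), self.ttlc_head(z)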
    Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. (arXiv:2109.10686v1 [cs.CL])
    (0 min) There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has both financial and environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, its scope is limited to the upstream (pretraining) loss. Therefore, it is still unclear whether this set of findings transfers to downstream tasks within the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) aside from model size alone, model shape matters for downstream fine-tuning; (2) scaling protocols operate differently at different compute regions; (3) the widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50\% fewer parameters and training 40\% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
    Spectral Methods for Data Science: A Statistical Perspective. (arXiv:2012.08496v2 [stat.ML] UPDATED)
    (0 min) Spectral methods have emerged as a simple yet surprisingly effective approach for extracting information from massive, noisy and incomplete data. In a nutshell, spectral methods refer to a collection of algorithms built upon the eigenvalues (resp. singular values) and eigenvectors (resp. singular vectors) of some properly designed matrices constructed from data. A diverse array of applications has been found in machine learning, data science, and signal processing. Due to their simplicity and effectiveness, spectral methods are not only used as a stand-alone estimator, but also frequently employed to initialize other more sophisticated algorithms to improve performance. While the studies of spectral methods can be traced back to classical matrix perturbation theory and methods of moments, the past decade has witnessed tremendous theoretical advances in demystifying their efficacy through the lens of statistical modeling, with the aid of non-asymptotic random matrix theory. This monograph aims to present a systematic, comprehensive, yet accessible introduction to spectral methods from a modern statistical perspective, highlighting their algorithmic implications in diverse large-scale applications. In particular, our exposition gravitates around several central questions that span various applications: how to characterize the sample efficiency of spectral methods in reaching a target level of statistical accuracy, and how to assess their stability in the face of random noise, missing data, and adversarial corruptions? In addition to conventional $\ell_2$ perturbation analysis, we present a systematic $\ell_{\infty}$ and $\ell_{2,\infty}$ perturbation theory for eigenspace and singular subspaces, which has only recently become available owing to a powerful "leave-one-out" analysis framework.
    Natural Typing Recognition via Surface Electromyography. (arXiv:2109.10743v1 [cs.HC])
    (0 min) By using a computer keyboard as a finger recording device, we construct the largest existing dataset for gesture recognition via surface electromyography (sEMG), and use deep learning to achieve over 90% character-level accuracy on reconstructing typed text entirely from measured muscle potentials. We prioritize the temporal structure of the EMG signal instead of the spatial structure of the electrode layout, using network architectures inspired by those used for real-time spoken language transcription. Our architecture recognizes the rapid movements of natural computer typing, which occur at irregular intervals and often overlap in time. The extensive size of our dataset also allows us to study gesture recognition after synthetically downgrading the spatial or temporal resolution, showing the system capabilities necessary for real-time gesture recognition.
    Cram\'er-Rao bound-informed training of neural networks for quantitative MRI. (arXiv:2109.10535v1 [cs.LG])
    (0 min) Neural networks are increasingly used to estimate parameters in quantitative MRI, in particular in magnetic resonance fingerprinting. Their advantages over the gold-standard non-linear least squares fitting are their superior speed and their immunity to the non-convexity of many fitting problems. We find, however, that in heterogeneous parameter spaces, i.e. in spaces in which the variance of the estimated parameters varies considerably, good performance is hard to achieve and requires arduous tweaking of the loss function, hyperparameters, and the distribution of the training data in parameter space. Here, we address these issues with a theoretically well-founded loss function: the Cram\'er-Rao bound (CRB) provides a theoretical lower bound for the variance of an unbiased estimator, and we propose to normalize the squared error with the respective CRB. With this normalization, we balance the contributions of hard-to-estimate and not-so-hard-to-estimate parameters and areas in parameter space, and avoid a dominance of the former in the overall training loss. Further, the CRB-based loss function equals one for a maximally efficient unbiased estimator, which we consider the ideal estimator; hence, the proposed CRB-based loss function provides an absolute evaluation metric. We compare a network trained with the CRB-based loss with a network trained with the commonly used mean squared error loss and demonstrate the advantages of the former in numerical, phantom, and in vivo experiments.
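    The proposed normalization translates directly into a loss function; a minimal sketch, assuming the CRB of each parameter is available per sample:

        import torch

        def crb_loss(theta_hat, theta, crb):
            # Squared error normalized by the Cramer-Rao bound of each
            # parameter: a maximally efficient unbiased estimator attains
            # a loss of ~1, giving an absolute scale across heterogeneous
            # parameters.
            return ((theta_hat - theta) ** 2 / crb).mean()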
    Emulating Aerosol Microphysics with Machine Learning. (arXiv:2109.10593v1 [cs.LG])
    (0 min) Aerosol particles play an important role in the climate system by absorbing and scattering radiation and influencing cloud properties. They are also one of the biggest sources of uncertainty for climate modeling. Many climate models do not include aerosols in sufficient detail. In order to achieve higher accuracy, aerosol microphysical properties and processes have to be accounted for. This is done in the ECHAM-HAM global climate aerosol model using the M7 microphysics model, but increased computational costs make it very expensive to run at higher resolutions or for a longer time. We aim to use machine learning to approximate the microphysics model at sufficient accuracy while reducing the computational cost by being fast at inference time. The original M7 model is used to generate input-output pairs on which to train a neural network. By using a special logarithmic transform we are able to learn the variables' tendencies, achieving an average $R^2$ score of $89\%$. On a GPU we achieve a speed-up of 120x compared to the original model.
    Fully probabilistic design for knowledge fusion between Bayesian filters under uniform disturbances. (arXiv:2109.10596v1 [cs.LG])
    (0 min) This paper considers the problem of Bayesian transfer learning-based knowledge fusion between linear state-space processes driven by uniform state and observation noise processes. The target task conditions on probabilistic state predictor(s) supplied by the source filtering task(s) to improve its own state estimate. A joint model of the target and source(s) is not required and is not elicited. The resulting decision-making problem for choosing the optimal conditional target filtering distribution under incomplete modelling is solved via fully probabilistic design (FPD), i.e. via appropriate minimization of Kullback-Leibler divergence (KLD). The resulting FPD-optimal target learner is robust, in the sense that it can reject poor-quality source knowledge. In addition, the fact that this Bayesian transfer learning (BTL) scheme does not depend on a model of interaction between the source and target tasks ensures robustness to the misspecification of such a model. The latter is a problem that affects conventional transfer learning methods. The properties of the proposed BTL scheme are demonstrated via extensive simulations, and in comparison with two contemporary alternatives.
    Enabling Efficiency-Precision Trade-offs for Label Trees in Extreme Classification. (arXiv:2106.00730v2 [cs.LG] UPDATED)
    (0 min) Extreme multi-label classification (XMC) aims to learn a model that can tag data points with a subset of relevant labels from an extremely large label set. Real world e-commerce applications like personalized recommendations and product advertising can be formulated as XMC problems, where the objective is to predict for a user a small subset of items from a catalog of several million products. For such applications, a common approach is to organize these labels into a tree, enabling training and inference times that are logarithmic in the number of labels. While training a model once a label tree is available is well studied, designing the structure of the tree is a difficult task that is not yet well understood, and can dramatically impact both model latency and statistical performance. Existing approaches to tree construction fall at an extreme point, either optimizing exclusively for statistical performance, or for latency. We propose an efficient information theory inspired algorithm to construct intermediary operating points that trade off between the benefits of both. Our algorithm enables interpolation between these objectives, which was not previously possible. We corroborate our theoretical analysis with numerical results, showing that on the Wiki-500K benchmark dataset our method can reduce a proxy for expected latency by up to 28% while maintaining the same accuracy as Parabel. On several datasets derived from e-commerce customer logs, our modified label tree is able to improve this expected latency metric by up to 20% while maintaining the same accuracy. Finally, we discuss challenges in realizing these latency improvements in deployed models.
    On the Transferability of Adversarial Attacks against Neural Text Classifier. (arXiv:2011.08558v3 [cs.LG] UPDATED)
    (0 min) Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we present the first study to systematically investigate the transferability of adversarial examples for text classification models and explore how various factors, including network architecture, tokenization scheme, word embedding, and model capacity, affect the transferability of adversarial examples. Based on these studies, we propose a genetic algorithm to find an ensemble of models that can be used to induce adversarial examples to fool almost all existing models. Such adversarial examples reflect the defects of the learning process and the data bias in the training set. Finally, we derive word replacement rules that can be used for model diagnostics from these adversarial examples.
    Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings. (arXiv:2109.10538v1 [cs.LG])
    (0 min) $t$-SNE is an embedding method that the data science community has widely adopted. Two interesting characteristics of $t$-SNE are its structure-preservation property and its answer to the crowding problem, whereby all neighbors in high-dimensional space cannot be represented correctly in low-dimensional space. $t$-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the area of a cluster is proportional to its size in number, and relationships between clusters are materialized by closeness on the embedding. The algorithm is non-parametric, so two initializations would lead to two different embeddings. In a forensic approach, analysts would like to compare two or more datasets using their embeddings. One approach is to learn a parametric model over an embedding built with a subset of the data. While this approach is highly scalable, points can be mapped to exactly the same position, making them indistinguishable, and such a model can adapt neither to new outliers nor to concept drift. This paper presents a methodology for reusing an embedding to create a new one in which cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and the second relative to the match with the support embedding. The proposed algorithm has the same complexity as the original $t$-SNE when embedding new items, and a lower one when the embedded dataset is sliced into sub-pieces. The method showed promising results on a real-world dataset, allowing us to observe the birth, evolution and death of clusters. The proposed approach facilitates identifying significant trends and changes, empowering the monitoring of high-dimensional datasets' dynamics.
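    The paper's two-cost optimization is not available in standard libraries, but a rough stand-in conveys the idea: initialize the new run from the coordinates of the previous embedding (nearest old point for unseen items) so that cluster positions stay roughly comparable across runs. scikit-learn's TSNE accepts an array-valued init for this purpose.

        import numpy as np
        from sklearn.manifold import TSNE

        def reembed(X_new, X_old, Y_old):
            # Map each new point to its nearest old point and reuse that
            # point's 2-D coordinates (plus jitter) as the initialization.
            idx = np.argmin(((X_new[:, None, :] - X_old[None, :, :]) ** 2)
                            .sum(-1), axis=1)
            init = Y_old[idx] + 1e-4 * np.random.randn(len(X_new), 2)
            return TSNE(n_components=2, init=init).fit_transform(X_new)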
    Benchmarking Lane-changing Decision-making for Deep Reinforcement Learning. (arXiv:2109.10490v1 [cs.LG])
    (0 min) The development of autonomous driving has attracted extensive attention in recent years, and it is essential to evaluate the performance of autonomous driving. However, testing on the road is expensive and inefficient. Virtual testing is the primary way to validate and verify self-driving cars, and the basis of virtual testing is to build simulation scenarios. In this paper, we propose a training, testing, and evaluation pipeline for the lane-changing task from the perspective of deep reinforcement learning. First, we design lane-change scenarios for training and testing, where the test scenarios include stochastic and deterministic parts. Then, we deploy a set of benchmarks consisting of learning and non-learning approaches. We train several state-of-the-art deep reinforcement learning methods in the designed training scenarios and provide the benchmark metric evaluation results of the trained models in the test scenarios. The designed lane-changing scenarios and benchmarks are both publicly released to provide a consistent experimental environment for the lane-changing task.
    MEPG: A Minimalist Ensemble Policy Gradient Framework for Deep Reinforcement Learning. (arXiv:2109.10552v1 [cs.LG])
    (0 min) Ensemble reinforcement learning (RL) introduces multiple value and policy functions to mitigate instability in Q-learning and to learn a robust policy. In this paper, we seek a simple yet novel ensemble deep RL algorithm that addresses the resulting resource consumption issue; specifically, we consider integrating multiple models into a single model. To this end, we propose the \underline{M}inimalist \underline{E}nsemble \underline{P}olicy \underline{G}radient framework (MEPG), which introduces a minimalist ensemble-consistent Bellman update, and we find that a single value network is sufficient in our framework. Moreover, we show theoretically that the policy evaluation phase in MEPG is mathematically equivalent to a deep Gaussian process. To verify the effectiveness of the MEPG framework, we conduct experiments on the OpenAI Gym simulator, which show that MEPG matches or outperforms state-of-the-art ensemble methods and model-free methods without additional computational resource costs.
    A unified interpretation of the Gaussian mechanism for differential privacy through the sensitivity index. (arXiv:2109.10528v1 [cs.CR])
    (0 min) The Gaussian mechanism (GM) represents a universally employed tool for achieving differential privacy (DP), and a large body of work has been devoted to its analysis. We argue that the three prevailing interpretations of the GM, namely $(\varepsilon, \delta)$-DP, f-DP and R\'enyi DP can be expressed by using a single parameter $\psi$, which we term the sensitivity index. $\psi$ uniquely characterises the GM and its properties by encapsulating its two fundamental quantities: the sensitivity of the query and the magnitude of the noise perturbation. With strong links to the ROC curve and the hypothesis-testing interpretation of DP, $\psi$ offers the practitioner a powerful method for interpreting, comparing and communicating the privacy guarantees of Gaussian mechanisms.
    The Curse Revisited: a Newly Quantified Concept of Meaningful Distances for Learning from High-Dimensional Noisy Data. (arXiv:2109.10569v1 [cs.LG])
    (0 min) Distances between data points are widely used in point cloud representation learning. Yet, it is no secret that under the effect of noise, these distances, and thus the models based upon them, may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we characterize such effects in high-dimensional data using an asymptotic probabilistic expression. Furthermore, while it has previously been argued that neighborhood queries become meaningless and unstable when there is poor relative discrimination between the furthest and closest point, we conclude that this is not necessarily the case when the ground truth data is explicitly separated from the noise. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be true even when we observe this discrimination to be poor. We include thorough empirical verification of our results, as well as experiments showing, interestingly, that the phase shift we derive, at which neighbors become random or not, coincides with the phase shift at which common dimensionality reduction methods perform poorly or well in finding low-dimensional representations of high-dimensional data with dense noise.
    Solving Large Steiner Tree Problems in Graphs for Cost-Efficient Fiber-To-The-Home Network Expansion. (arXiv:2109.10617v1 [cs.AI])
    (0 min) The expansion of Fiber-To-The-Home (FTTH) networks creates high costs due to expensive excavation procedures. Optimizing the planning process and minimizing the cost of the earth excavation work therefore lead to large savings. Mathematically, the FTTH network problem can be described as a minimum Steiner Tree problem. Even though the Steiner Tree problem has already been investigated intensively in the last decades, it might be further optimized with the help of new computing paradigms and emerging approaches. This work studies upcoming technologies, such as Quantum Annealing, Simulated Annealing and nature-inspired methods like Evolutionary Algorithms or slime-mold-based optimization. Additionally, we investigate partitioning and simplifying methods. Evaluated on several real-life problem instances, we could outperform a traditional, widely-used baseline (NetworkX Approximate Solver) on most of the domains. Prior partitioning of the initial graph and the presented slime-mold-based approach were especially valuable for a cost-efficient approximation. Quantum Annealing seems promising, but was limited by the number of available qubits.
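    For reference, the baseline mentioned above is available directly in NetworkX; the toy trench graph below is an illustrative assumption.

        import networkx as nx
        from networkx.algorithms.approximation import steiner_tree

        # Toy trench segments with excavation costs as edge weights.
        G = nx.Graph()
        G.add_weighted_edges_from([(0, 1, 4.0), (1, 2, 1.5), (0, 2, 6.0),
                                   (2, 3, 2.0)])
        terminals = [0, 3]  # e.g., central office and a home to connect
        T = steiner_tree(G, terminals, weight="weight")
        print(sorted(T.edges()))  # approximate minimum-cost subtree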
    A Latent Restoring Force Approach to Nonlinear System Identification. (arXiv:2109.10681v1 [stat.ML])
    (0 min) Identification of nonlinear dynamic systems remains a significant challenge across engineering. This work suggests an approach based on Bayesian filtering to extract and identify the contribution of an unknown nonlinear term in the system, which can be seen as an alternative viewpoint on restoring-force-surface-type approaches. To achieve this identification, this contribution, the nonlinear restoring force, is initially modelled as a Gaussian process in time. That Gaussian process is converted into a state-space model and combined with the linear dynamic component of the system. Then, by inference of the filtering and smoothing distributions, the internal states of the system and the nonlinear restoring force can be extracted. With these states in hand, a nonlinear model can be constructed. The approach is demonstrated to be effective in both a simulated case study and on an experimental benchmark dataset.
    An automatic differentiation system for the age of differential privacy. (arXiv:2109.10573v1 [cs.LG])
    (0 min) We introduce Tritium, an automatic differentiation-based sensitivity analysis framework for differentially private (DP) machine learning (ML). Optimal noise calibration in this setting requires efficient Jacobian matrix computations and tight bounds on the L2-sensitivity. Our framework achieves these objectives by relying on a functional analysis-based method for sensitivity tracking, which we briefly outline. This approach interoperates naturally and seamlessly with static graph-based automatic differentiation, which enables order-of-magnitude improvements in compilation times compared to previous work. Moreover, we demonstrate that optimising the sensitivity of the entire computational graph at once yields substantially tighter estimates of the true sensitivity compared to interval bound propagation techniques. Our work naturally befits recent developments in DP such as individual privacy accounting, aiming to offer improved privacy-utility trade-offs, and represents a step towards the integration of accessible machine learning tooling with advanced privacy accounting systems.
    Digital Signal Processing Using Deep Neural Networks. (arXiv:2109.10404v1 [eess.SP])
    (0 min) Currently there is great interest in the utility of deep neural networks (DNNs) for the physical layer of radio frequency (RF) communications. In this manuscript, we describe a custom DNN specially designed to solve problems in the RF domain. Our model leverages the mechanisms of feature extraction and attention through the combination of an autoencoder convolutional network with a transformer network, to accomplish several important communications network and digital signals processing (DSP) tasks. We also present a new open dataset and physical data augmentation model that enables training of DNNs that can perform automatic modulation classification, infer and correct transmission channel effects, and directly demodulate baseband RF signals.
    SoK: Machine Learning Governance. (arXiv:2109.10870v1 [cs.CR])
    (0 min) The application of machine learning (ML) in computer systems introduces not only many benefits but also risks to society. In this paper, we develop the concept of ML governance to balance such benefits and risks, with the aim of achieving responsible applications of ML. Our approach first systematizes research towards ascertaining ownership of data and models, thus fostering a notion of identity specific to ML systems. Building on this foundation, we use identities to hold principals accountable for failures of ML systems through both attribution and auditing. To increase trust in ML systems, we then survey techniques for developing assurance, i.e., confidence that the system meets its security requirements and does not exhibit certain known failures. This leads us to highlight the need for techniques that allow a model owner to manage the life cycle of their system, e.g., to patch or retire their ML system. Put altogether, our systematization of knowledge standardizes the interactions between principals involved in the deployment of ML throughout its life cycle. We highlight opportunities for future work, e.g., to formalize the resulting game between ML principals.
    Robust marginalization of baryonic effects for cosmological inference at the field level. (arXiv:2109.10360v1 [astro-ph.CO])
    (0 min) We train neural networks to perform likelihood-free inference from $(25\,h^{-1}{\rm Mpc})^2$ 2D maps containing the total mass surface density from thousands of hydrodynamic simulations of the CAMELS project. We show that the networks can extract information beyond one-point functions and power spectra from all resolved scales ($\gtrsim 100\,h^{-1}{\rm kpc}$) while performing a robust marginalization over baryonic physics at the field level: the model can infer the value of $\Omega_{\rm m} (\pm 4\%)$ and $\sigma_8 (\pm 2.5\%)$ from simulations completely different to the ones used to train it.
    Fairness without Imputation: A Decision Tree Approach for Fair Prediction with Missing Values. (arXiv:2109.10431v1 [cs.LG])
    (0 min) We investigate the fairness concerns of training a machine learning model using data with missing values. Even though there are a number of fairness intervention methods in the literature, most of them require a complete training set as input. In practice, data can have missing values, and data missing patterns can depend on group attributes (e.g. gender or race). Simply applying off-the-shelf fair learning algorithms to an imputed dataset may lead to an unfair model. In this paper, we first theoretically analyze different sources of discrimination risks when training with an imputed dataset. Then, we propose an integrated approach based on decision trees that does not require a separate process of imputation and learning. Instead, we train a tree with missing incorporated as attribute (MIA), which does not require explicit imputation, and we optimize a fairness-regularized objective function. We demonstrate that our approach outperforms existing fairness intervention methods applied to an imputed dataset, through several experiments on real-world datasets.
    Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent. (arXiv:2006.01738v3 [cs.LG] UPDATED)
    (0 min) We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
    Classification with Nearest Disjoint Centroids. (arXiv:2109.10436v1 [cs.LG])
    (0 min) In this paper, we develop a new classification method based on nearest centroids, called the nearest disjoint centroid classifier. Our method differs from the nearest centroid classifier in two aspects: (1) the centroids are defined based on disjoint subsets of features instead of all the features, and (2) the distance is induced by the dimensionality-normalized norm instead of the Euclidean norm. We provide a few theoretical results regarding our method. In addition, we propose a simple algorithm based on adapted k-means clustering that can find the disjoint subsets of features used in our method, and extend the algorithm to perform feature selection. We evaluate and compare the performance of our method to other closely related classifiers on both simulated data and real-world gene expression datasets. The results demonstrate that our method is able to outperform other competing classifiers by having smaller misclassification rates and/or using fewer features in various settings and situations.
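    A minimal sketch of the prediction rule, assuming the disjoint feature subsets and centroids have already been fitted: each class is scored on its own feature subset, with the squared distance divided by the subset size (the dimensionality-normalized norm) so that subsets of different sizes remain comparable.

        import numpy as np

        def predict(x, centroids, feature_sets):
            # centroids[k] lives on feature_sets[k], a disjoint index subset.
            dists = [np.sum((x[S] - c) ** 2) / len(S)
                     for c, S in zip(centroids, feature_sets)]
            return int(np.argmin(dists))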
    Learned Benchmarks for Subseasonal Forecasting. (arXiv:2109.10399v1 [physics.ao-ph])
    (0 min) We develop a subseasonal forecasting toolkit of simple learned benchmark models that outperform both operational practice and state-of-the-art machine learning and deep learning methods. Our new models include (a) Climatology++, an adaptive alternative to climatology that, for precipitation, is 9% more accurate and 250% more skillful than the United States operational Climate Forecasting System (CFSv2); (b) CFSv2++, a learned CFSv2 correction that improves temperature and precipitation accuracy by 7-8% and skill by 50-275%; and (c) Persistence++, an augmented persistence model that combines CFSv2 forecasts with lagged measurements to improve temperature and precipitation accuracy by 6-9% and skill by 40-130%. Across the contiguous U.S., our Climatology++, CFSv2++, and Persistence++ toolkit consistently outperforms standard meteorological baselines, state-of-the-art machine and deep learning methods, and the European Centre for Medium-Range Weather Forecasts ensemble. Overall, we find that augmenting traditional forecasting approaches with learned enhancements yields an effective and computationally inexpensive strategy for building the next generation of subseasonal forecasting benchmarks.
    Differentiable Scaffolding Tree for Molecular Optimization. (arXiv:2109.10469v1 [cs.LG])
    (0 min) The structural design of functional molecules, also called molecular optimization, is an essential chemical science and engineering task with important applications, such as drug discovery. Deep generative models and combinatorial optimization methods achieve initial success but still struggle with directly modeling discrete chemical structures and often heavily rely on brute-force enumeration. The challenge comes from the discrete and non-differentiable nature of molecule structures. To address this, we propose differentiable scaffolding tree (DST) that utilizes a learned knowledge network to convert discrete chemical structures to locally differentiable ones. DST enables a gradient-based optimization on a chemical graph structure by back-propagating the derivatives from the target properties through a graph neural network (GNN). Our empirical studies show the gradient-based molecular optimizations are both effective and sample efficient. Furthermore, the learned graph parameters can also provide an explanation that helps domain experts understand the model output.
    A Spectral Approach to Off-Policy Evaluation for POMDPs. (arXiv:2109.10502v1 [cs.LG])
    (0 min) We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes, where the evaluation policy depends only on observable variables but the behavior policy depends on latent states (Tennenholtz et al. (2020a)). Prior work on this problem uses a causal identification strategy based on one-step observable proxies of the hidden state, which relies on the invertibility of certain one-step moment matrices. In this work, we relax this requirement by using spectral methods and extending one-step proxies both into the past and future. We empirically compare our OPE methods to existing ones and demonstrate their improved prediction accuracy and greater generality. Lastly, we derive a separate Importance Sampling (IS) algorithm which relies on rank, distinctness, and positivity conditions, and not on the strict sufficiency conditions of observable trajectories with respect to the reward and hidden-state structure required by Tennenholtz et al. (2020a).
    Semi-parametric Bayesian Additive Regression Trees. (arXiv:2108.07636v2 [stat.ML] UPDATED)
    (0 min) We propose a new semi-parametric model based on Bayesian Additive Regression Trees (BART). In our approach, the response variable is approximated by a linear predictor and a BART model, where the first component is responsible for estimating the main effects and BART accounts for the non-specified interactions and non-linearities. The novelty in our approach lies in the way we change tree generation moves in BART to deal with confounding between the parametric and non-parametric components when they have covariates in common. Through synthetic and real-world examples, we demonstrate that the performance of the new semi-parametric BART is competitive when compared to regression models and other tree-based methods. The implementation of the proposed method is available at https://github.com/ebprado/SP-BART.
    Exploring Adversarial Examples for Efficient Active Learning in Machine Learning Classifiers. (arXiv:2109.10770v1 [cs.LG])
    (0 min) Machine learning researchers have long noticed that model training is more effective and efficient when the training samples are densely sampled around the underlying decision boundary. While this observation has already been widely applied in a range of machine learning security techniques, it has lacked a theoretical analysis of its correctness. To address this gap, we first add particular perturbations to original training examples using adversarial attack methods, so that the generated examples lie approximately on the decision boundary of the ML classifiers. We then investigate the connections between active learning and these particular training examples. Through analyzing various representative classifiers such as k-NN classifiers, kernel methods, and deep neural networks, we establish a theoretical foundation for the observation. As a result, our theoretical proofs support more efficient active learning methods aided by adversarial examples, in contrast to previous works where adversarial examples were often treated as destructive. Experimental results show that the established theoretical foundation can guide better active learning strategies based on adversarial examples.
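    One hedged reading of the resulting active-learning strategy, sketched in PyTorch: examples whose small FGSM perturbation already flips the model's prediction lie close to the decision boundary, so they are queried for labels first. The perturbation budget and selection rule are illustrative assumptions.

        import torch

        def select_queries(model, pool, eps=0.05, k=32):
            pool = pool.clone().requires_grad_(True)
            logits = model(pool)
            pred = logits.argmax(1)
            # FGSM step away from the current prediction.
            loss = torch.nn.functional.cross_entropy(logits, pred)
            loss.backward()
            adv = pool + eps * pool.grad.sign()
            flipped = model(adv).argmax(1) != pred  # near-boundary points
            return torch.nonzero(flipped).flatten()[:k]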
    Deep Policies for Online Bipartite Matching: A Reinforcement Learning Approach. (arXiv:2109.10380v1 [cs.LG])
    (0 min) From assigning computing tasks to servers and advertisements to users, sequential online matching problems arise in a wide variety of domains. The challenge in online matching lies in making irrevocable assignments while there is uncertainty about future inputs. In the theoretical computer science literature, most policies are myopic or greedy in nature. In real-world applications where the matching process is repeated on a regular basis, the underlying data distribution can be leveraged for better decision-making. We present an end-to-end Reinforcement Learning framework for deriving better matching policies based on trial-and-error on historical data. We devise a set of neural network architectures, design feature representations, and empirically evaluate them across two online matching problems: Edge-Weighted Online Bipartite Matching and Online Submodular Bipartite Matching. We show that most of the learning approaches perform significantly better than classical greedy algorithms on four synthetic and real-world datasets. Our code is publicly available at https://github.com/lyeskhalil/CORL.git.
    ENERO: Efficient Real-Time Routing Optimization. (arXiv:2109.10883v1 [cs.NI])
    (0 min) Wide Area Networks (WANs) are a key infrastructure in today's society. During the last years, WANs have seen a considerable increase in network traffic as well as in the number of network applications. To enable the deployment of emergent network applications (e.g., vehicular networks, the Internet of Things), existing Traffic Engineering (TE) solutions must achieve high-performance real-time network operation. In addition, TE solutions must be able to adapt to dynamic scenarios (e.g., changes in the traffic matrix or topology link failures). However, current TE technologies rely on hand-crafted heuristics or computationally expensive solvers, which are not suitable for highly dynamic TE scenarios. In this paper we propose Enero, an efficient real-time TE engine based on a two-stage optimization process. In the first stage, it leverages Deep Reinforcement Learning (DRL) to optimize the routing configuration by generating a long-term TE strategy; we integrated a Graph Neural Network (GNN) into the DRL agent to enable efficient TE on dynamic networks. In the second stage, Enero uses a local search algorithm to improve the DRL solution without adding computational overhead to the optimization process. Enero offers a lower bound on performance, enabling the network operator to know the worst-case performance of the DRL agent; we believe this lower bound will ease the path toward deploying DRL-based solutions in real-world network scenarios. The experimental results indicate that Enero is able to operate on real-world dynamic network topologies in 4.5 seconds on average for topologies of up to 100 edges.
    Learning by Examples Based on Multi-level Optimization. (arXiv:2109.10824v1 [cs.LG])
    (0 min) Learning by examples, which learns to solve a new problem by looking into how similar problems are solved, is an effective learning method in human learning. When students learn a new topic, they identify exemplar topics that are similar to the new one and study them to deepen their understanding of it. We aim to investigate whether this powerful learning skill can be borrowed from humans to improve machine learning as well. In this work, we propose a novel learning approach called Learning By Examples (LBE). Our approach automatically retrieves a set of training examples that are similar to query examples and predicts labels for the query examples by using the class labels of the retrieved examples. We propose a three-level optimization framework to formulate LBE, involving three stages of learning: learning a Siamese network to retrieve similar examples; learning a matching network to make predictions on query examples by leveraging the class labels of retrieved similar examples; and learning the "ground-truth" similarities between training examples by minimizing the validation loss. We develop an efficient algorithm to solve the LBE problem and conduct extensive experiments on various benchmarks, where the results demonstrate the effectiveness of our method in both supervised and few-shot learning.
    Adversarial Attacks on Camera-LiDAR Models for 3D Car Detection. (arXiv:2103.09448v2 [cs.CV] UPDATED)
    (0 min) Most autonomous vehicles (AVs) rely on LiDAR and RGB camera sensors for perception. Using these point cloud and image data, perception models based on deep neural nets (DNNs) have achieved state-of-the-art performance in 3D detection. The vulnerability of DNNs to adversarial attacks has been heavily investigated in the RGB image domain and more recently in the point cloud domain, but rarely in both domains simultaneously. Multi-modal perception systems used in AVs can be divided into two broad types: cascaded models which use each modality independently, and fusion models which learn from different modalities simultaneously. We propose a universal and physically realizable adversarial attack for each type, and study and contrast their respective vulnerabilities to attacks. We place a single adversarial object with specific shape and texture on top of a car with the objective of making this car evade detection. Evaluating on the popular KITTI benchmark, our adversarial object made the host vehicle escape detection by each model type more than 50% of the time. The dense RGB input contributed more to the success of the adversarial attacks on both cascaded and fusion models.
    Towards a Real-Time Facial Analysis System. (arXiv:2109.10393v1 [cs.CV])
    (0 min) Facial analysis is an active research area in computer vision, with many practical applications. Most of the existing studies focus on addressing one specific task and maximizing its performance. For a complete facial analysis system, one needs to solve these tasks efficiently to ensure a smooth experience. In this work, we present a system-level design of a real-time facial analysis system. With a collection of deep neural networks for object detection, classification, and regression, the system recognizes age, gender, facial expression, and facial similarity for each person that appears in the camera view. We investigate the parallelization and interplay of individual tasks. Results on common off-the-shelf architecture show that the system's accuracy is comparable to the state-of-the-art methods, and the recognition speed satisfies real-time requirements. Moreover, we propose a multitask network for jointly predicting the first three attributes, i.e., age, gender, and facial expression. Source code and trained models are available at https://github.com/mahehu/TUT-live-age-estimator.
    A Workflow for Offline Model-Free Robotic Reinforcement Learning. (arXiv:2109.10813v1 [cs.LG])
    (0 min) Offline reinforcement learning (RL) enables learning control policies by utilizing only prior experience, without any online interaction. This can allow robots to acquire generalizable skills from large and diverse datasets, without any costly or unsafe online data collection. Despite recent algorithmic advances in offline RL, applying these methods to real-world problems has proven challenging. Although offline RL methods can learn from prior data, there is no clear and well-understood process for making various design choices, from model architecture to algorithm hyperparameters, without actually evaluating the learned policies online. In this paper, our aim is to develop a practical workflow for using offline RL analogous to the relatively well-understood workflows for supervised learning problems. To this end, we devise a set of metrics and conditions that can be tracked over the course of offline training, and can inform the practitioner about how the algorithm and model architecture should be adjusted to improve final performance. Our workflow is derived from a conceptual understanding of the behavior of conservative offline RL algorithms and cross-validation in supervised learning. We demonstrate the efficacy of this workflow in producing effective policies without any online tuning, both in several simulated robotic learning scenarios and for three tasks on two distinct real robots, focusing on learning manipulation skills with raw image observations with sparse binary rewards. Explanatory video and additional results can be found at sites.google.com/view/offline-rl-workflow
    RETRONLU: Retrieval Augmented Task-Oriented Semantic Parsing. (arXiv:2109.10410v1 [cs.CL])
    (0 min) While large pre-trained language models accumulate a lot of knowledge in their parameters, it has been demonstrated that augmenting them with non-parametric, retrieval-based memory has a number of benefits, from accuracy improvements to data efficiency, for knowledge-focused tasks such as question answering. In this paper, we apply retrieval-based modeling ideas to the problem of multi-domain task-oriented semantic parsing for conversational assistants. Our approach, RetroNLU, extends a sequence-to-sequence model architecture with a retrieval component used to fetch existing similar examples and provide them as additional input to the model. In particular, we analyze two settings, where we augment an input with (a) retrieved nearest-neighbor utterances (utterance-nn), and (b) ground-truth semantic parses of nearest-neighbor utterances (semparse-nn). Our technique outperforms the baseline method by 1.5% absolute macro-F1, especially in the low-resource setting, matching the baseline model's accuracy with only 40% of the data. Furthermore, we analyze the quality of the nearest-neighbor retrieval component and the model's sensitivity to it, and break down performance for semantic parses of utterances of varying complexity.
    Pix2seq: A Language Modeling Framework for Object Detection. (arXiv:2109.10852v1 [cs.CV])
    (0 min) This paper presents Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we simply cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural net to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
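    The core of the framework is the discretization of object descriptions into token sequences; a minimal sketch, where the bin count and vocabulary layout are illustrative assumptions:

        def box_to_tokens(box, label, bins=1000, image_size=640):
            # Quantize each box coordinate into one of `bins` integer
            # tokens, then append a class token placed after the
            # coordinate vocabulary.
            quant = [min(int(c / image_size * bins), bins - 1) for c in box]
            return quant + [bins + label]

        print(box_to_tokens([120.0, 64.5, 300.0, 410.2], label=3))

    A sequence model trained on concatenations of such per-object token runs can then emit detections the same way a language model emits text.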
    BFClass: A Backdoor-free Text Classification Framework. (arXiv:2109.10855v1 [cs.CL])
    (0 min) Backdoor attacks introduce artificial vulnerabilities into a model by poisoning a subset of the training data via injecting triggers and modifying labels. Various trigger design strategies have been explored to attack text classifiers; however, defending against such attacks remains an open problem. In this work, we propose BFClass, a novel, efficient, backdoor-free training framework for text classification. The backbone of BFClass is a pre-trained discriminator that predicts whether each token in the corrupted input was replaced by a masked language model. To identify triggers, we utilize this discriminator to locate the most suspicious token in each training sample and then distill a concise set by considering their association strengths with particular labels. To recognize the poisoned subset, we examine the training samples with these identified triggers as the most suspicious token, and check whether removing the trigger changes the poisoned model's prediction. Extensive experiments demonstrate that BFClass can identify all the triggers, remove 95% of poisoned training samples with very limited false alarms, and achieve almost the same performance as models trained on benign training data.
    Sharp Analysis of Random Fourier Features in Classification. (arXiv:2109.10623v1 [stat.ML])
    (0 min) We study the theoretical properties of random Fourier features classification with Lipschitz continuous loss functions such as support vector machine and logistic regression. Utilizing the regularity condition, we show for the first time that random Fourier features classification can achieve $O(1/\sqrt{n})$ learning rate with only $\Omega(\sqrt{n} \log n)$ features, as opposed to $\Omega(n)$ features suggested by previous results. Our study covers the standard feature sampling method for which we reduce the number of features required, as well as a problem-dependent sampling method which further reduces the number of features while still keeping the optimal generalization property. Moreover, we prove that the random Fourier features classification can obtain a fast $O(1/n)$ learning rate for both sampling schemes under Massart's low noise assumption. Our results demonstrate the potential effectiveness of random Fourier features approximation in reducing the computational complexity (roughly from $O(n^3)$ in time and $O(n^2)$ in space to $O(n^2)$ and $O(n\sqrt{n})$ respectively) without having to trade-off the statistical prediction accuracy. In addition, the achieved trade-off in our analysis is at least the same as the optimal results in the literature under the worst case scenario and significantly improves the optimal results under benign regularity conditions.
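    For readers unfamiliar with the construction, random Fourier features approximate a shift-invariant kernel with an explicit finite-dimensional map, so a linear classifier on the features approximates the kernel machine; a standard sketch for the Gaussian kernel:

        import numpy as np

        def rff_features(X, D=256, gamma=1.0, seed=0):
            # z(x) = sqrt(2/D) * cos(Wx + b) with W ~ N(0, 2*gamma*I) and
            # b ~ Uniform[0, 2*pi) approximates the Gaussian kernel
            # k(x, y) = exp(-gamma * ||x - y||^2) via z(x) @ z(y).
            rng = np.random.default_rng(seed)
            W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
            b = rng.uniform(0, 2 * np.pi, size=D)
            return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    The paper's point is how small D can be: $\Omega(\sqrt{n} \log n)$ features already suffice for the $O(1/\sqrt{n})$ learning rate.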
    AI in Osteoporosis. (arXiv:2109.10478v1 [cs.CV])
    (0 min) In this chapter we explore and evaluate methods for trabecular bone characterization and osteoporosis diagnosis with increased interest in sparse approximations. We first describe texture representation and classification techniques, patch-based methods such as Bag of Keypoints, and more recent deep neural networks. Then we introduce the concept of sparse representations for pattern recognition and we detail integrative sparse analysis methods and classifier decision fusion methods. We report cross-validation results on osteoporosis datasets of bone radiographs and compare the results produced by the different categories of methods. We conclude that advances in the AI and machine learning fields have enabled the development of methods that can be used as diagnostic tools in clinical settings.
    Uncertainty-Aware Training for Cardiac Resynchronisation Therapy Response Prediction. (arXiv:2109.10641v1 [eess.IV])
    (0 min) Evaluation of predictive deep learning (DL) models beyond conventional performance metrics has become increasingly important for applications in sensitive environments like healthcare. Such models might have the capability to encode and analyse large sets of data but they often lack comprehensive interpretability methods, preventing clinical trust in predictive outcomes. Quantifying uncertainty of a prediction is one way to provide such interpretability and promote trust. However, relatively little attention has been paid to how to include such requirements into the training of the model. In this paper we: (i) quantify the data (aleatoric) and model (epistemic) uncertainty of a DL model for Cardiac Resynchronisation Therapy response prediction from cardiac magnetic resonance images, and (ii) propose and perform a preliminary investigation of an uncertainty-aware loss function that can be used to retrain an existing DL image-based classification model to encourage confidence in correct predictions and reduce confidence in incorrect predictions. Our initial results are promising, showing a significant increase in the (epistemic) confidence of true positive predictions, with some evidence of a reduction in false negative confidence.
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v2 [cs.LG] UPDATED)
    (0 min) Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack strategy provides the attacker with greater control over the target model, and increases the complexity of detecting that the model is being systematically attacked. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task, an exemplary machine learning problem in the audio domain. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and by injecting imperceptible perturbations to the inputs.
    Tecnologica cosa: Modeling Storyteller Personalities in Boccaccio's Decameron. (arXiv:2109.10506v1 [cs.CL])
    (0 min) We explore Boccaccio's Decameron to see how digital humanities tools can be used for tasks that have limited data in a language no longer in contemporary use: medieval Italian. We focus our analysis on the question: Do the different storytellers in the text exhibit distinct personalities? To answer this question, we curate and release a dataset based on the authoritative edition of the text. We use supervised classification methods to predict storytellers based on the stories they tell, confirming the difficulty of the task, and demonstrate that topic modeling can extract thematic storyteller "profiles."
    Tighter risk certificates for neural networks. (arXiv:2007.12911v3 [cs.LG] UPDATED)
    (0 min) This paper presents an empirical study regarding training probabilistic neural networks using training objectives derived from PAC-Bayes bounds. In the context of probabilistic neural networks, the output of training is a probability distribution over network weights. We present two training objectives, used here for the first time in connection with training neural networks. These two training objectives are derived from tight PAC-Bayes bounds. We also re-implement a previously used training objective based on a classical PAC-Bayes bound, to compare the properties of the predictors learned using the different training objectives. We compute risk certificates for the learnt predictors, based on part of the data used to learn the predictors. We further experiment with different types of priors on the weights (both data-free and data-dependent priors) and neural network architectures. Our experiments on MNIST and CIFAR-10 show that our training methods produce competitive test set errors and non-vacuous risk bounds with much tighter values than previous results in the literature, showing promise not only to guide the learning algorithm through bounding the risk but also for model selection. These observations suggest that the methods studied here might be good candidates for self-certified learning, in the sense of using the whole data set for learning a predictor and certifying its risk on any unseen data (from the same distribution as the training data) potentially without the need for holding out test data.
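    For readers unfamiliar with how a certificate is extracted from a PAC-Bayes-kl bound, the sketch below inverts the binary KL divergence by bisection, which is the usual numerical step. The bound form is the classical one the abstract refers to; all numeric inputs are illustrative.

```python
# Sketch: a risk certificate from the classical PAC-Bayes-kl bound
#   kl(emp_risk || true_risk) <= (KL(Q || P) + log(2*sqrt(n)/delta)) / n,
# inverting the binary KL with bisection. All numbers are illustrative.
import math

def binary_kl(q, p):
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(emp_risk, rhs, tol=1e-9):
    """Largest p >= emp_risk with kl(emp_risk || p) <= rhs."""
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

n, delta = 50000, 0.025          # certificate sample size, confidence
emp_risk, kl_qp = 0.02, 1500.0   # empirical risk, KL(posterior || prior)
rhs = (kl_qp + math.log(2 * math.sqrt(n) / delta)) / n
print("risk certificate:", kl_inverse(emp_risk, rhs))
```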
    Selecting Datasets for Evaluating an Enhanced Deep Learning Framework. (arXiv:2109.10442v1 [cs.LG])
    (0 min) A framework was developed to address limitations of existing techniques for analysing sequences. This work describes the steps followed to select suitable datasets characterised by discrete irregular sequential patterns. To identify, select, explore, and evaluate candidate datasets drawn from more than 400 research articles, an interquartile range method for outlier detection was applied, and Billauer's peak-detection algorithm was adapted to locate periodic peaks in the data. The developed framework was then tested on the most appropriate datasets. The research concluded that the financial domain of daily currency exchange rates is the most suitable kind of dataset for evaluating the designed deep learning framework, as it exhibits high levels of discrete irregular patterns.
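    The interquartile range rule referenced above is simple enough to state in a few lines; the sketch below flags outliers in a toy series of exchange-rate-like values (the 1.5 multiplier is the conventional choice, not necessarily the one used in this work).

```python
# Sketch of the interquartile range (IQR) outlier rule used for dataset
# screening: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
import numpy as np

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(1.0, 0.05, 500), [2.5, -1.0]])  # toy rates

q1, q3 = np.percentile(series, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = series[(series < lo) | (series > hi)]
print(f"bounds: [{lo:.3f}, {hi:.3f}], outliers: {outliers}")
```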
    Entropic Issues in Likelihood-Based OOD Detection. (arXiv:2109.10794v1 [stat.ML])
    (0 min) Deep generative models trained by maximum likelihood remain very popular methods for reasoning about data probabilistically. However, it has been observed that they can assign higher likelihoods to out-of-distribution (OOD) data than in-distribution data, thus calling into question the meaning of these likelihood values. In this work we provide a novel perspective on this phenomenon, decomposing the average likelihood into a KL divergence term and an entropy term. We argue that the latter can explain the curious OOD behaviour mentioned above, suppressing likelihood values on datasets with higher entropy. Although our idea is simple, we have not seen it explored yet in the literature. This analysis provides further explanation for the success of OOD detection methods based on likelihood ratios, as the problematic entropy term cancels out in expectation. Finally, we discuss how this observation relates to recent success in OOD detection with manifold-supported models, for which the above decomposition does not hold.
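    The decomposition underlying the argument can be written in one line: for data distribution $q$ and model $p_\theta$,

```latex
\mathbb{E}_{x \sim q}\left[\log p_\theta(x)\right]
  = -\,\mathrm{KL}\!\left(q \,\middle\|\, p_\theta\right) \;-\; H(q)
```

    So a high-entropy dataset is penalised in average likelihood even when the model fits it no worse in the KL sense; in a likelihood ratio between two models evaluated on the same data, the $H(q)$ term cancels, which is the cancellation the abstract credits for ratio-based OOD detection.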
    LDC-VAE: A Latent Distribution Consistency Approach to Variational AutoEncoders. (arXiv:2109.10640v1 [cs.LG])
    (0 min) Variational autoencoders (VAEs), an important class of generative models, have received considerable research interest and achieved many successful applications. However, achieving consistency between the learned latent distribution and the prior latent distribution when optimizing the evidence lower bound (ELBO) remains a challenge, and failing to do so leads to unsatisfactory performance in data generation. In this paper, we propose a latent distribution consistency approach to avoid such substantial inconsistency between the posterior and prior latent distributions during ELBO optimization. We name our method latent distribution consistency VAE (LDC-VAE). We achieve this by assuming that the true posterior distribution in latent space takes a Gibbs form, and approximating it with our encoder. However, the Gibbs posterior has no analytical solution, and traditional approximation methods, such as iterative sampling-based MCMC, are time-consuming. To address this problem, we use Stein Variational Gradient Descent (SVGD) to approximate the Gibbs posterior. Meanwhile, we use SVGD to train a sampler network that can draw efficient samples from the Gibbs posterior. Comparative studies on popular image generation datasets show that our method achieves comparable or even better performance than several powerful improvements of VAEs.
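    Since SVGD does the heavy lifting in LDC-VAE, a minimal sketch of its particle update may help; here the target is a toy Gaussian whose score is known in closed form, standing in for the Gibbs posterior (the kernel bandwidth and step size are arbitrary choices).

```python
# Sketch of a Stein Variational Gradient Descent (SVGD) update for sampling
# from an unnormalised (Gibbs-style) target; the toy 2-D Gaussian target
# stands in for the Gibbs posterior, whose score would come from the model.
import numpy as np

def rbf_kernel(X, h=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * h ** 2))
    # gradient of k(x_j, x_i) w.r.t. x_j, summed over j in the update below
    gradK = -(X[:, None, :] - X[None, :, :]) / h ** 2 * K[:, :, None]
    return K, gradK

def grad_log_p(X):                      # toy target: N(0, I)
    return -X

particles = np.random.default_rng(0).normal(size=(100, 2)) + 3.0
for _ in range(500):
    K, gradK = rbf_kernel(particles)
    # phi(x_i) = mean_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K @ grad_log_p(particles) + gradK.sum(axis=0)) / len(particles)
    particles += 0.1 * phi

print("particle mean ~ 0:", particles.mean(axis=0))
```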
    Personalized Online Machine Learning. (arXiv:2109.10452v1 [stat.ML])
    (0 min) In this work, we introduce the Personalized Online Super Learner (POSL) -- an online ensembling algorithm for streaming data whose optimization procedure accommodates varying degrees of personalization. Namely, POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized (i.e., optimization with respect to baseline covariate subject ID) to many individuals (i.e., optimization with respect to common baseline covariates). As an online algorithm, POSL learns in real-time. POSL can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed algorithms that are never updated during the procedure, pooled algorithms that learn from many individuals' time-series, and individualized algorithms that learn from within a single time-series. POSL's ensembling of this hybrid of base learning strategies depends on the amount of data collected, the stationarity of the time-series, and the mutual characteristics of a group of time-series. In essence, POSL decides whether to learn across samples, through time, or both, based on the underlying (unknown) structure in the data. For a wide range of simulations that reflect realistic forecasting scenarios, and in a medical data application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for time-series data and adjust to changing data-generating environments. We further cultivate POSL's practicality by extending it to settings where time-series enter/exit dynamically over chronological time.
    A Hierarchical Network-Oriented Analysis of User Participation in Misinformation Spread on WhatsApp. (arXiv:2109.10462v1 [cs.SI])
    (0 min) WhatsApp emerged as a major communication platform in many countries in the recent years. Despite offering only one-to-one and small group conversations, WhatsApp has been shown to enable the formation of a rich underlying network, crossing the boundaries of existing groups, and with structural properties that favor information dissemination at large. Indeed, WhatsApp has reportedly been used as a forum of misinformation campaigns with significant social, political and economic consequences in several countries. In this article, we aim at complementing recent studies on misinformation spread on WhatsApp, mostly focused on content properties and propagation dynamics, by looking into the network that connects users sharing the same piece of content. Specifically, we present a hierarchical network-oriented characterization of the users engaged in misinformation spread by focusing on three perspectives: individuals, WhatsApp groups and user communities, i.e., groupings of users who, intentionally or not, share the same content disproportionately often. By analyzing sharing and network topological properties, our study offers valuable insights into how WhatsApp users leverage the underlying network connecting different groups to gain large reach in the spread of misinformation on the platform.
    Achieving Counterfactual Fairness for Causal Bandit. (arXiv:2109.10458v1 [cs.LG])
    (0 min) In online recommendation, customers arrive in a sequential and stochastic manner from an underlying distribution, and the online decision model recommends a chosen item for each arriving individual based on some strategy. We study how to recommend an item at each step to maximize the expected reward while achieving user-side fairness for customers, i.e., customers who share similar profiles will receive a similar reward regardless of their sensitive attributes and the items being recommended. By incorporating causal inference into bandits and adopting soft intervention to model the arm selection strategy, we first propose the d-separation based UCB algorithm (D-UCB), which exploits the d-separation set to reduce the amount of exploration needed to achieve low cumulative regret. Building on that, we then propose the fair causal bandit (F-UCB) for achieving counterfactual individual fairness. Both theoretical analysis and empirical evaluation demonstrate the effectiveness of our algorithms.
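    The D-UCB and F-UCB algorithms build causal machinery on top of a standard upper-confidence-bound loop; the sketch below shows only that skeleton, with Bernoulli rewards and the usual $\sqrt{2\log t / n_a}$ exploration bonus, which is the quantity the d-separation set helps to shrink.

```python
# Sketch of a standard UCB arm-selection loop (the paper's D-UCB/F-UCB add
# causal d-separation and fairness machinery on top of this skeleton).
import math
import random

true_means = [0.3, 0.5, 0.7]               # unknown to the algorithm
counts = [0] * len(true_means)
sums = [0.0] * len(true_means)

for t in range(1, 5001):
    if 0 in counts:                         # play each arm once first
        arm = counts.index(0)
    else:
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
               for a in range(len(true_means))]
        arm = max(range(len(true_means)), key=lambda a: ucb[a])
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

print("pulls per arm:", counts)             # most pulls go to the best arm
```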
    Self-Supervised Learning to Prove Equivalence Between Programs via Semantics-Preserving Rewrite Rules. (arXiv:2109.10476v1 [cs.LG])
    (0 min) We target the problem of synthesizing proofs of semantic equivalence between two programs made of sequences of statements with complex symbolic expressions. We propose a neural network architecture based on the transformer to generate axiomatic proofs of equivalence between program pairs. We generate expressions which include scalars and vectors and support multi-typed rewrite rules to prove equivalence. For training the system, we develop an original training technique, which we call self-supervised sample selection. This incremental training improves the quality, generalizability and extensibility of the learned model. We study the effectiveness of the system to generate proofs of increasing length, and we demonstrate how transformer models learn to represent complex and verifiable symbolic reasoning. Our system, S4Eq, achieves 97% proof success on 10,000 pairs of programs while ensuring zero false positives by design.
    On Topology Optimization and Routing in Integrated Access and Backhaul Networks: A Genetic Algorithm-based Approach. (arXiv:2102.07252v2 [cs.NI] UPDATED)
    (0 min) In this paper, we study the problem of topology optimization and routing in integrated access and backhaul (IAB) networks, as one of the promising techniques for evolving 5G networks. We study the problem from different perspectives. We develop efficient genetic algorithm-based schemes for both IAB node placement and non-IAB backhaul link distribution, and evaluate the effect of routing on bypassing temporal blockages. Here, concentrating on millimeter wave-based communications, we study the service coverage probability, defined as the probability of the event that the user equipments' (UEs) minimum rate requirements are satisfied. Moreover, we study the effect of different parameters such as the antenna gain, blockage and tree foliage on the system performance. Finally, we summarize the recent Rel-16 as well as the upcoming Rel-17 3GPP discussions on routing in IAB networks, and discuss the main challenges for enabling mesh-based IAB networks. As we show, with a proper network topology, IAB is an attractive approach to enable the network densification required by 5G and beyond.
    DPPIN: A Biological Repository of Dynamic Protein-Protein Interaction Network Data. (arXiv:2107.02168v2 [cs.LG] UPDATED)
    (0 min) Nowadays, many network representation learning algorithms and downstream network mining tasks have already paid attention to dynamic networks or temporal networks, which are more suitable for real-world complex scenarios by modeling evolving patterns and temporal dependencies between node interactions. Moreover, representing and mining temporal networks have a wide range of applications, such as fraud detection, social network analysis, and drug discovery. To contribute to the network representation learning and network mining research community, in this paper, we generate a new biological repository of dynamic protein-protein interaction network data (i.e., DPPIN), which consists of twelve dynamic network datasets describing protein-level interactions of yeast cells at different scales. We first introduce the generation process of DPPIN. To demonstrate the value of our published repository DPPIN, we then list the potential applications that would benefit from it. Furthermore, we design dynamic local clustering, dynamic spectral clustering, dynamic subgraph matching, dynamic node classification, and dynamic graph classification experiments, where the network datasets of DPPIN indicate future research opportunities for some tasks by presenting challenges to state-of-the-art baseline algorithms. Finally, we identify future directions for improving the utility of this repository and welcome constructive inputs from the community. All resources of this work are deployed and publicly available at https://github.com/DongqiFu/DPPIN.
    Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data. (arXiv:2108.09020v2 [cs.LG] UPDATED)
    (0 min) Continual learning is the problem of learning and retaining knowledge through time over multiple tasks and environments. Research has primarily focused on the incremental classification setting, where new tasks/classes are added at discrete time intervals. Such an "offline" setting does not evaluate the ability of agents to learn effectively and efficiently, since an agent can perform multiple learning epochs without any time limitation when a task is added. We argue that "online" continual learning, where data is a single continuous stream without task boundaries, enables evaluating both information retention and online learning efficacy. In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online. Trained models are later evaluated on historical data to assess information retention. We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts. Through a large-scale analysis, we identify critical and previously unobserved phenomena of gradient-based optimization in continual learning, and propose effective strategies for improving gradient-based online continual learning with real data. The source code and dataset are available in: https://github.com/IntelLabs/continuallearning.
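    The evaluation protocol described above (score each incoming batch first, then train on it) is easy to express with any model that supports incremental updates; here is a sketch with scikit-learn's SGDClassifier on synthetic data rather than the paper's benchmark.

```python
# Sketch of the "test-then-train" (prequential) protocol of online continual
# learning: each incoming batch is first scored, then used for one update.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=5000, random_state=0)
model = SGDClassifier(random_state=0)
classes = np.unique(y)

online_acc, batch = [], 100
for start in range(0, len(X), batch):
    Xb, yb = X[start:start + batch], y[start:start + batch]
    if start > 0:                                   # test first ...
        online_acc.append(model.score(Xb, yb))
    model.partial_fit(Xb, yb, classes=classes)      # ... then train
print("online accuracy:", np.mean(online_acc))
```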
    Towards Automatic Bias Detection in Knowledge Graphs. (arXiv:2109.10697v1 [cs.AI])
    (0 min) With the recent surge in social applications relying on knowledge graphs, the need for techniques to ensure fairness in KG based methods is becoming increasingly evident. Previous works have demonstrated that KGs are prone to various social biases, and have proposed multiple methods for debiasing them. However, in such studies, the focus has been on debiasing techniques, while the relations to be debiased are specified manually by the user. As manual specification is itself susceptible to human cognitive bias, there is a need for a system capable of quantifying and exposing biases, that can support more informed decisions on what to debias. To address this gap in the literature, we describe a framework for identifying biases present in knowledge graph embeddings, based on numerical bias metrics. We illustrate the framework with three different bias measures on the task of profession prediction, and it can be flexibly extended to further bias definitions and applications. The relations flagged as biased can then be handed to decision makers for judgement upon subsequent debiasing.
    Coarse2Fine: Fine-grained Text Classification on Coarsely-grained Annotated Data. (arXiv:2109.10856v1 [cs.CL])
    (0 min) Existing text classification methods mainly focus on a fixed label set, whereas many real-world applications require extending to new fine-grained classes as the number of samples per label increases. To accommodate such requirements, we introduce a new problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave in rich pre-trained generative language models into the iterative weak supervision strategy. Specifically, we first propose a label-conditioned finetuning formulation to attune these generators for our task. Furthermore, we devise a regularization objective based on the coarse-fine label constraints derived from our problem setting, giving us even further improvements over the prior formulation. Our framework uses the fine-tuned generative models to sample pseudo-training data for training the classifier, and bootstraps on real unlabeled data for model refinement. Extensive experiments and case studies on two real-world datasets demonstrate superior performance over SOTA zero-shot classification baselines.
    High-dimensional Bayesian Optimization for CNN Auto Pruning with Clustering and Rollback. (arXiv:2109.10591v1 [cs.LG])
    (0 min) Pruning has been widely used to slim convolutional neural network (CNN) models to achieve a good trade-off between accuracy and model size, so that the pruned models become feasible for power-constrained devices such as mobile phones. This process can be automated to avoid expensive hand-crafted effort and to explore a large pruning space so that a high-performance pruning policy can be found efficiently. Nowadays, reinforcement learning (RL) and Bayesian optimization (BO)-based auto pruners are widely used due to their solid theoretical foundation, universality, and high compressing quality. However, the RL agent suffers from long training times and high variance of results, while the BO agent is time-consuming for high-dimensional design spaces. In this work, we propose an enhanced BO agent that obtains significant acceleration for auto pruning in high-dimensional design spaces. To achieve this, a novel clustering algorithm is proposed to reduce the dimension of the design space and speed up the search process. Then, a roll-back algorithm is proposed to recover the high-dimensional design space so that higher pruning accuracy can be obtained. We validate our proposed method on ResNet, MobileNet, and VGG models, and our experiments show that the proposed method significantly improves the accuracy of BO when pruning very deep CNN models. Moreover, our method achieves lower variance and shorter time than the RL-based counterpart.
    A Robust Asymmetric Kernel Function for Bayesian Optimization, with Application to Image Defect Detection in Manufacturing Systems. (arXiv:2109.10898v1 [stat.ML])
    (0 min) Some response surface functions in complex engineering systems are highly nonlinear, lack a closed form, and are expensive to evaluate. To tackle this challenge, Bayesian optimization, which conducts sequential design via a posterior distribution over the objective function, is a critical method used to find the global optimum of black-box functions. Kernel functions play an important role in shaping the posterior distribution of the estimated function. Widely used kernel functions, e.g., the radial basis function (RBF), are vulnerable and susceptible to outliers; their presence can cause the Gaussian process surrogate model to behave erratically. In this paper, we propose a robust kernel function, the Asymmetric Elastic Net Radial Basis Function (AEN-RBF). Its validity as a kernel function and its computational complexity are evaluated. Compared to the baseline RBF kernel, we prove theoretically that AEN-RBF achieves smaller mean squared prediction error under mild conditions. The proposed AEN-RBF kernel function can also achieve faster convergence to the global optimum. We further show that the AEN-RBF kernel function is less sensitive to outliers, and hence improves the robustness of the corresponding Bayesian optimization with Gaussian processes. Through extensive evaluations carried out on synthetic and real-world optimization problems, we show that AEN-RBF outperforms existing benchmark kernel functions.
    Real-Time Privacy-Preserving Data Release for Smart Meters. (arXiv:1906.06427v3 [eess.SP] UPDATED)
    (0 min) Smart Meters (SMs) are a fundamental component of smart grids, but they carry sensitive information about users, such as the occupancy status of houses, and have therefore raised serious concerns about leakage of consumers' private information. In particular, we focus on real-time privacy threats, i.e., potential attackers that try to infer sensitive data from the data reported by SMs in an online fashion. We adopt an information-theoretic privacy measure and show that it effectively limits the performance of any real-time attacker. Using this privacy measure, we propose a general formulation to design a privatization mechanism that can provide a target level of privacy by adding a minimal amount of distortion to the SMs' measurements. In addition, to cope with different applications, a flexible distortion measure is considered. This formulation leads to a general loss function, which is optimized using a deep learning adversarial framework, where two neural networks -- referred to as the releaser and the adversary -- are trained with opposite goals. An exhaustive empirical study is then performed to validate the performance of the proposed approach on the occupancy detection privacy problem, assuming the attacker has either limited or full access to the training dataset.
    Blindness of score-based methods to isolated components and mixing proportions. (arXiv:2008.10087v2 [stat.ML] UPDATED)
    (0 min) Statistical tasks such as density estimation and approximate Bayesian inference often involve densities with unknown normalising constants. Score-based methods, including score matching, are popular techniques as they are free of normalising constants. Although these methods enjoy theoretical guarantees, a little-known fact is that they suffer from practical failure modes when the unnormalised distribution of interest has isolated components -- they cannot discover isolated components or identify the correct mixing proportions between components. We demonstrate these findings using simple distributions and present heuristic attempts to address these issues. We hope to bring the attention of theoreticians and practitioners to these issues when developing new algorithms and applications.
    Beyond Discriminant Patterns: On the Robustness of Decision Rule Ensembles. (arXiv:2109.10432v1 [cs.LG])
    (0 min) Local decision rules are commonly understood to be more explainable, due to the local nature of the patterns involved. With numerical optimization methods such as gradient boosting, ensembles of local decision rules can gain good predictive performance on data involving global structure. Meanwhile, machine learning models are being increasingly used to solve problems in high-stake domains including healthcare and finance. Here, there is an emerging consensus regarding the need for practitioners to understand whether and how those models could perform robustly in the deployment environments, in the presence of distributional shifts. Past research on local decision rules has focused mainly on maximizing discriminant patterns, without due consideration of robustness against distributional shifts. In order to fill this gap, we propose a new method to learn and ensemble local decision rules, that are robust both in the training and deployment environments. Specifically, we propose to leverage causal knowledge by regarding the distributional shifts in subpopulations and deployment environments as the results of interventions on the underlying system. We propose two regularization terms based on causal knowledge to search for optimal and stable rules. Experiments on both synthetic and benchmark datasets show that our method is effective and robust against distributional shifts in multiple environments.
    Privacy-Preserving Machine Learning: Methods, Challenges and Directions. (arXiv:2108.04417v2 [cs.LG] UPDATED)
    (0 min) Machine learning (ML) is increasingly being adopted in a wide variety of application domains. Usually, a well-performing ML model relies on a large volume of training data and high-powered computational resources. Such a need for, and use of, huge volumes of data raises serious privacy concerns because of the potential risks of leakage of highly privacy-sensitive information; further, the evolving regulatory environments that increasingly restrict access to and use of privacy-sensitive data add significant challenges to fully benefiting from the power of ML for data-driven applications. A trained ML model may also be vulnerable to adversarial attacks such as membership, attribute, or property inference attacks and model inversion attacks. Hence, well-designed privacy-preserving ML (PPML) solutions are critically needed for many emerging applications. Increasingly, significant research efforts from both academia and industry can be seen in PPML areas that aim toward integrating privacy-preserving techniques into the ML pipeline or specific algorithms, or designing various PPML architectures. In particular, existing PPML research cross-cuts ML, systems and applications design, as well as security and privacy areas; hence, there is a critical need to understand the state-of-the-art research, related challenges, and a research roadmap for future work in the PPML area. In this paper, we systematically review and summarize existing privacy-preserving approaches and propose a Phase, Guarantee, and Utility (PGU) triad-based model to understand and guide the evaluation of various PPML solutions by decomposing their privacy-preserving functionalities. We discuss the unique characteristics and challenges of PPML and outline possible research directions that leverage as well as benefit multiple research communities such as ML, distributed systems, security, and privacy.
    Updating Embeddings for Dynamic Knowledge Graphs. (arXiv:2109.10896v1 [cs.LG])
    (0 min) Data in Knowledge Graphs often represents part of the current state of the real world. Thus, to stay up-to-date the graph data needs to be updated frequently. To utilize information from Knowledge Graphs, many state-of-the-art machine learning approaches use embedding techniques. These techniques typically compute an embedding, i.e., vector representations of the nodes as input for the main machine learning algorithm. If a graph update occurs later on -- specifically when nodes are added or removed -- the training has to be done all over again. This is undesirable, because of the time it takes and also because downstream models which were trained with these embeddings have to be retrained if they change significantly. In this paper, we investigate embedding updates that do not require full retraining and evaluate them in combination with various embedding models on real dynamic Knowledge Graphs covering multiple use cases. We study approaches that place newly appearing nodes optimally according to local information, but notice that this does not work well. However, we find that if we continue the training of the old embedding, interleaved with epochs during which we only optimize for the added and removed parts, we obtain good results in terms of typical metrics used in link prediction. This performance is obtained much faster than with a complete retraining and hence makes it possible to maintain embeddings for dynamic Knowledge Graphs.
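    A minimal sketch of the interleaving idea: keep training the old embedding, but alternate full epochs with "local" epochs in which gradients are masked so only newly added entities move. The objective and edge list below are toy placeholders, not a particular KG embedding model.

```python
# Sketch of the interleaved update scheme described above: alternate full
# training epochs with "local" epochs where gradients are masked so that only
# newly added entities move. Loss and data are toy stand-ins.
import torch

n_old, n_new, dim = 1000, 50, 32
emb = torch.nn.Embedding(n_old + n_new, dim)
with torch.no_grad():
    emb.weight[:n_old] = torch.randn(n_old, dim)   # stand-in for old vectors

# toy edge list: each new node is linked to a random old node
edges = torch.stack([torch.arange(n_old, n_old + n_new),
                     torch.randint(0, n_old, (n_new,))])

new_mask = torch.zeros(n_old + n_new, 1)
new_mask[n_old:] = 1.0
opt = torch.optim.Adam(emb.parameters(), lr=1e-2)

def epoch(local_only: bool):
    opt.zero_grad()
    # placeholder objective: pull linked embeddings together
    loss = (emb(edges[0]) - emb(edges[1])).pow(2).sum(-1).mean()
    loss.backward()
    if local_only:
        emb.weight.grad *= new_mask        # freeze old rows this epoch
    opt.step()
    return loss.item()

for e in range(20):
    epoch(local_only=(e % 2 == 1))          # interleave full / local epochs
```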
    Improved Multi-label Classification with Frequent Label-set Mining and Association. (arXiv:2109.10797v1 [cs.LG])
    (0 min) Multi-label (ML) data deals with multiple classes associated with individual samples at the same time. This leads to several classes co-occurring repeatedly, which indicates some existing correlation among them. In this article, the correlation among classes has been explored to improve the classification performance of existing ML classifiers. A novel approach of frequent label-set mining has been proposed to extract these correlated classes from the label-sets of the data. Both co-presence (CP) and co-absence (CA) of classes have been taken into consideration. The rules mined from the ML data have been further used to incorporate class correlation information into existing ML classifiers. The soft scores generated by an ML classifier are modified through a novel approach using the CP-CA rules. A concept of certain and uncertain scores has been defined here, where the proposed method aims to improve the uncertain scores with the help of the certain scores and their corresponding CP-CA rules. This has been experimentally analysed on ten ML datasets for three existing ML classifiers, showing substantial improvement in their overall performance.
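    Counting co-presence and co-absence supports for label pairs, the raw material for the CP-CA rules, takes only a few lines; the threshold and data below are toy values.

```python
# Sketch of frequent label-set mining over multi-label data, counting both
# co-presence (CP) and co-absence (CA) of label pairs.
from itertools import combinations

labels = ["A", "B", "C"]
samples = [{"A", "B"}, {"A", "B"}, {"C"}, {"A", "B", "C"}, set(), {"C"}]

min_support = 0.3
cp_rules, ca_rules = [], []
for l1, l2 in combinations(labels, 2):
    cp = sum(1 for s in samples if l1 in s and l2 in s) / len(samples)
    ca = sum(1 for s in samples if l1 not in s and l2 not in s) / len(samples)
    if cp >= min_support:
        cp_rules.append(((l1, l2), cp))
    if ca >= min_support:
        ca_rules.append(((l1, l2), ca))

print("co-presence:", cp_rules)    # pairs that frequently appear together
print("co-absence:", ca_rules)     # pairs that are frequently absent together
```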
    Generating Compositional Color Representations from Text. (arXiv:2109.10477v1 [cs.CV])
    (0 min) We consider the cross-modal task of producing color representations for text phrases. Motivated by the fact that a significant fraction of user queries on an image search engine follow an (attribute, object) structure, we propose a generative adversarial network that generates color profiles for such bigrams. We design our pipeline to learn composition - the ability to combine seen attributes and objects to unseen pairs. We propose a novel dataset curation pipeline from existing public sources. We describe how a set of phrases of interest can be compiled using a graph propagation technique, and then mapped to images. While this dataset is specialized for our investigations on color, the method can be extended to other visual dimensions where composition is of interest. We provide detailed ablation studies that test the behavior of our GAN architecture with loss functions from the contrastive learning literature. We show that the generative model achieves lower Frechet Inception Distance than discriminative ones, and therefore predicts color profiles that better match those from real images. Finally, we demonstrate improved performance in image retrieval and classification, indicating the crucial role that color plays in these downstream tasks.
    Transient Non-Stationarity and Generalisation in Deep Reinforcement Learning. (arXiv:2006.05826v4 [cs.LG] UPDATED)
    (0 min) Non-stationarity can arise in Reinforcement Learning (RL) even in stationary environments. For example, most RL algorithms collect new data throughout training, using a non-stationary behaviour policy. Due to the transience of this non-stationarity, it is often not explicitly addressed in deep RL and a single neural network is continually updated. However, we find evidence that neural networks exhibit a memory effect where these transient non-stationarities can permanently impact the latent representation and adversely affect generalisation performance. Consequently, to improve generalisation of deep RL agents, we propose Iterated Relearning (ITER). ITER augments standard RL training by repeated knowledge transfer of the current policy into a freshly initialised network, which thereby experiences less non-stationarity during training. Experimentally, we show that ITER improves performance on the challenging generalisation benchmarks ProcGen and Multiroom.
    Learning through structure: towards deep neuromorphic knowledge graph embeddings. (arXiv:2109.10376v1 [cs.NE])
    (0 min) Computing latent representations for graph-structured data is a ubiquitous learning task in many industrial and academic applications ranging from molecule synthesis to social network analysis and recommender systems. Knowledge graphs are among the most popular and widely used data representations related to the Semantic Web. Next to structuring factual knowledge in a machine-readable format, knowledge graphs serve as the backbone of many artificial intelligence applications and allow the ingestion of context information into various learning algorithms. Graph neural networks attempt to encode graph structures in low-dimensional vector spaces via a message passing heuristic between neighboring nodes. Over recent years, a multitude of different graph neural network architectures have demonstrated ground-breaking performance in many learning tasks. In this work, we propose a strategy to map deep graph learning architectures for knowledge graph reasoning to neuromorphic architectures. Based on the insight that randomly initialized and untrained (i.e., frozen) graph neural networks are able to preserve local graph structures, we compose a frozen neural network with shallow knowledge graph embedding models. We experimentally show that, already on conventional computing hardware, this leads to a significant speedup and memory reduction while maintaining a competitive performance level. Moreover, we extend the frozen architecture to spiking neural networks, introducing a novel, event-based and highly sparse knowledge graph embedding algorithm that is suitable for implementation in neuromorphic hardware.
    Automated classification of plasma regions using 3D particle energy distributions. (arXiv:1908.05715v4 [physics.space-ph] UPDATED)
    (0 min) We investigate the properties of the ion sky maps produced by the Dual Ion Spectrometers (DIS) from the Fast Plasma Investigation (FPI). We have trained a convolutional neural network classifier to predict four regions crossed by the MMS on the dayside magnetosphere: solar wind, ion foreshock, magnetosheath, and magnetopause using solely DIS spectrograms. The accuracy of the classifier is >98%. We use the classifier to detect mixed plasma regions, in particular to find the bow shock regions. A similar approach can be used to identify the magnetopause crossings and reveal regions prone to magnetic reconnection. Data processing through the trained classifier is fast and efficient and thus can be used for classification for the whole MMS database.
    Causal Inference in Non-linear Time-series using Deep Networks and Knockoff Counterfactuals. (arXiv:2109.10817v1 [cs.LG])
    (0 min) Estimating causal relations is vital to understanding the complex interactions in multivariate time series. Non-linear coupling of variables is one of the major challenges in the accurate estimation of cause-effect relations. In this paper, we propose to use deep autoregressive networks (DeepAR) in tandem with counterfactual analysis to infer nonlinear causal relations in multivariate time series. We extend the concept of Granger causality using probabilistic forecasting with DeepAR. Since deep networks can handle neither missing input nor out-of-distribution intervention, we propose to use the Knockoffs framework (Barber and Candès, 2015) for generating intervention variables and, consequently, counterfactual probabilistic forecasting. Knockoff samples are independent of their output given the observed variables and exchangeable with their counterpart variables without changing the underlying distribution of the data. We test our method on synthetic as well as real-world time series datasets. Overall, our method outperforms the widely used vector autoregressive Granger causality and PCMCI in detecting nonlinear causal dependencies in multivariate time series.
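    As a point of reference, the sketch below shows the Granger-causality logic in its simplest linear form: predict $y$ from its own past with and without the past of a candidate cause $x$, and compare forecast errors. The paper replaces the linear model with DeepAR and uses knockoff copies of $x$ as the intervention, neither of which is reproduced here.

```python
# Sketch of the Granger-causality logic the paper builds on, in linear form:
# a clear drop in forecast error when x's past is added suggests that x
# Granger-causes y. The paper's DeepAR + knockoff machinery is not shown.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
T, lag = 2000, 2
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(lag, T):                    # x drives y with a one-step delay
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

def lagged(a, lags):
    return np.column_stack([a[lags - k:-k or None] for k in range(1, lags + 1)])

target = y[lag:]
own = lagged(y, lag)                       # y's own past only
both = np.hstack([own, lagged(x, lag)])    # plus the candidate cause's past

mse_own = np.mean((target - LinearRegression().fit(own, target).predict(own)) ** 2)
mse_both = np.mean((target - LinearRegression().fit(both, target).predict(both)) ** 2)
print(f"MSE own past: {mse_own:.4f}  with x's past: {mse_both:.4f}")
```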
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v2 [cs.LG] UPDATED)
    (0 min) Though learning has become a core technology of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced solutions. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern learning problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization (ERM), even obtaining a model that satisfies statistical constraints can be challenging, all the more so a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained, finite dimensional, and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap, i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem, and provide a practical constrained learning algorithm. These results establish a constrained counterpart of classical learning theory and enable the explicit use of constraints in learning. We illustrate this algorithm and theory in rate-constrained learning applications.
    Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation. (arXiv:2108.12211v2 [cs.DC] UPDATED)
    (0 min) Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime targets despite significant performance variance. This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs and, thus, allows for deriving effective rescaling decisions. For this, Enel incorporates descriptive properties that capture the respective execution context, considers statistics from individual dataflow tasks, and propagates predictions through the job graph to eventually find an optimized new scale-out. Our evaluation of Enel with four iterative Spark jobs shows that our approach is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
    Synthetic Data -- Anonymisation Groundhog Day. (arXiv:2011.07018v5 [cs.LG] UPDATED)
    (0 min) Synthetic data has been advertised as a silver-bullet solution to privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset but, at the same time, provides perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques. Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymisation techniques. Furthermore, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict. Because it is impossible to predict what signals a synthetic dataset will preserve and what information will be lost, synthetic data leads to a highly variable privacy gain and unpredictable utility loss. In summary, we find that synthetic data is far from the holy grail of privacy-preserving data publishing.

2021-09-22

  • cs.CL updates on arXiv.org

    Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models. (arXiv:2009.13267v4 [cs.CL] UPDATED)
    (2 min) The discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score has been studied before for autoregressive neural machine translation (NMT) and resulted in alternative training algorithms (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Wu et al., 2018). However, MLE training remains the de facto approach for autoregressive NMT because of its computational efficiency and stability. Despite this mismatch between the training objective and task measure, we notice that the samples drawn from an MLE-trained NMT model support the desired distribution: there are samples with much higher BLEU scores compared to the beam decoding output. To benefit from this observation, we train an energy-based model to mimic the behavior of the task measure (i.e., the energy-based model assigns lower energy to samples with higher BLEU scores), which results in a re-ranking algorithm based on the samples drawn from NMT: energy-based re-ranking (EBR). We use both marginal energy models (over the target sentence) and joint energy models (over both source and target sentences). Our EBR with the joint energy model consistently improves the performance of Transformer-based NMT: +4 BLEU points on IWSLT'14 German-English, +3.0 BLEU points on Sinhala-English, and +1.2 BLEU on WMT'16 English-German tasks.
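    The reranking step itself is a one-liner once candidates and an energy model exist; the sketch below uses toy stand-ins for both (in the paper, the energy network is trained so that lower energy tracks higher BLEU).

```python
# Sketch of the energy-based reranking (EBR) step: draw samples from the NMT
# model, score each with a learned energy network, and return the lowest-
# energy hypothesis. The sampler and energy model here are placeholders.
from typing import Callable, List

def energy_rerank(hypotheses: List[str],
                  energy: Callable[[str], float]) -> str:
    return min(hypotheses, key=energy)

# Toy stand-ins: candidates sampled from an NMT model, and a dummy energy.
samples = ["the cat sat on mat", "the cat sat on the mat", "cat sat mat"]
toy_energy = lambda hyp: -len(hyp.split())   # placeholder preference
print(energy_rerank(samples, toy_energy))
```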
    Contextual Lexicon-Based Approach for Hate Speech and Offensive Language Detection. (arXiv:2104.12265v4 [cs.CL] UPDATED)
    (2 min) This paper provides a new approach for offensive language and hate speech detection on social media. Our approach incorporates an offensive lexicon composed of implicit and explicit offensive and swearing expressions annotated with binary classes: context-dependent and context-independent offensive. Due to the severity of hate speech and offensive comments in Brazil, and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate the proposed method. Nevertheless, our proposal may be applied to any other language or domain. Based on the obtained results, the proposed approach showed high performance, overcoming the current baselines for European and Brazilian Portuguese.
    DiaKG: an Annotated Diabetes Dataset for Medical Knowledge Graph Construction. (arXiv:2105.15033v2 [cs.CL] UPDATED)
    (2 min) Knowledge Graph has been proven effective in modeling structured information and conceptual knowledge, especially in the medical domain. However, the lack of high-quality annotated corpora remains a crucial problem for advancing the research and applications on this task. In order to accelerate the research for domain-specific knowledge graphs in the medical domain, we introduce DiaKG, a high-quality Chinese dataset for Diabetes knowledge graph, which contains 22,050 entities and 6,890 relations in total. We implement recent typical methods for Named Entity Recognition and Relation Extraction as a benchmark to evaluate the proposed dataset thoroughly. Empirical results show that the DiaKG is challenging for most existing methods and further analysis is conducted to discuss future research direction for improvements. We hope the release of this dataset can assist the construction of diabetes knowledge graphs and facilitate AI-based applications.
    Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study. (arXiv:2106.09700v2 [cs.CL] UPDATED)
    (2 min) Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.
    Informed Sampling for Diversity in Concept-to-Text NLG. (arXiv:2004.14364v2 [cs.CL] UPDATED)
    (2 min) Deep-learning models for language generation tasks tend to produce repetitive output. Various methods have been proposed to encourage lexical diversity during decoding, but this often comes at a cost to the perceived fluency and adequacy of the output. In this work, we propose to ameliorate this cost by using an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce. Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output. We focus our experiments on concept-to-text generation where models are sensitive to the inclusion of irrelevant words due to the strict relation between input and output. Our analysis shows that previous methods for diversity underperform in this setting, while human evaluation suggests that our proposed method achieves a high level of diversity with minimal effect to the output's fluency and adequacy.
    Non-autoregressive Mandarin-English Code-switching Speech Recognition. (arXiv:2104.02258v2 [cs.CL] UPDATED)
    (2 min) Mandarin-English code-switching (CS) is frequently used among East and Southeast Asian people. However, the intra-sentence switching between two very different languages makes recognizing CS speech challenging. Meanwhile, recent successful non-autoregressive (NAR) ASR models remove the need for left-to-right beam decoding in autoregressive (AR) models and achieve outstanding performance and fast inference speed, but they have not been applied to Mandarin-English CS speech recognition. This paper takes advantage of the Mask-CTC NAR ASR framework to tackle the CS speech recognition issue. We further propose changing the Mandarin output target of the encoder to Pinyin for faster encoder training, and introduce a Pinyin-to-Mandarin decoder to learn contextualized information. Moreover, we use word embedding label smoothing to regularize the decoder with contextualized information, and projection matrix regularization to bridge the gap between the encoder and decoder. We evaluate these methods on the SEAME corpus and achieve exciting results.
    A Simple Geometric Method for Cross-Lingual Linguistic Transformations with Pre-trained Autoencoders. (arXiv:2104.03630v2 [cs.CL] UPDATED)
    (2 min) Powerful sentence encoders trained for multiple languages are on the rise. These systems are capable of embedding a wide range of linguistic properties into vector representations. While explicit probing tasks can be used to verify the presence of specific linguistic properties, it is unclear whether the vector representations can be manipulated to indirectly steer such properties. For efficient learning, we investigate the use of a geometric mapping in embedding space to transform linguistic properties, without any tuning of the pre-trained sentence encoder or decoder. We validate our approach on three linguistic properties using a pre-trained multilingual autoencoder and analyze the results in both monolingual and cross-lingual settings.
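    One common instantiation of such a geometric mapping is an orthogonal map fitted by Procrustes analysis on paired embeddings; the sketch below recovers a synthetic rotation. Whether the paper uses exactly this solver is not stated in the abstract, so treat it as an assumption.

```python
# Sketch of a geometric mapping between two sets of sentence embeddings:
# fit an orthogonal W minimising ||XW - Y||_F (Procrustes) from paired
# source/target-property vectors, then apply it to steer new vectors.
# The embeddings here are random stand-ins, not a real encoder's output.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))             # e.g. present-tense embeddings
Y = X @ np.linalg.qr(rng.normal(size=(64, 64)))[0]  # paired counterparts

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt                                 # closed-form orthogonal solution

print("fit error:", np.linalg.norm(X @ W - Y))
x_new = rng.normal(size=64)
x_transformed = x_new @ W                  # steer a new sentence vector
```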
    Relation-Guided Pre-Training for Open-Domain Question Answering. (arXiv:2109.10346v1 [cs.CL])
    (2 min) Answering complex open-domain questions requires understanding the latent relations between the entities involved. However, we found that existing QA datasets are extremely imbalanced in some types of relations, which hurts generalization performance on questions with long-tail relations. To remedy this problem, in this paper we propose a Relation-Guided Pre-Training (RGPT-QA) framework. We first generate a relational QA dataset covering a wide range of relations from both Wikidata triplets and Wikipedia hyperlinks. We then pre-train a QA model to infer the latent relations from the question, and then conduct extractive QA to get the target answer entity. We demonstrate that by pre-training with the proposed RGPT-QA technique, the popular open-domain QA model, Dense Passage Retriever (DPR), achieves 2.2%, 2.4%, and 6.3% absolute improvements in Exact Match accuracy on Natural Questions, TriviaQA, and WebQuestions. In particular, we show that RGPT-QA improves significantly on questions with long-tail relations.
    The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation. (arXiv:2101.00421v3 [cs.CL] UPDATED)
    (2 min) Machine translation systems are vulnerable to domain mismatch, especially in a low-resource scenario. Out-of-domain translations are often of poor quality and prone to hallucinations, due to exposure bias and the decoder acting as a language model. We adopt two approaches to alleviate this problem: lexical shortlisting restricted by IBM statistical alignments, and hypothesis re-ranking based on similarity. The methods are computationally cheap, widely known, but not extensively experimented on domain adaptation. We demonstrate success on low-resource out-of-domain test sets, however, the methods are ineffective when there is sufficient data or too great domain mismatch. This is due to both the IBM model losing its advantage over the implicitly learned neural alignment, and issues with subword segmentation of out-of-domain words.
    You should evaluate your language model on marginal likelihood over tokenisations. (arXiv:2109.02550v2 [cs.CL] UPDATED)
    (2 min) Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one best, especially on out-of-domain data. We link this difference in perplexity to the tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.
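    In symbols, the proposal is to evaluate the marginal over tokenisations rather than a single canonical one; a generic importance-sampling form of the estimator (the paper compares several variants, which may differ in detail) is:

```latex
p(x) \;=\; \sum_{t \,\in\, \mathcal{T}(x)} p(x, t),
\qquad
\hat{p}(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{p(x, t_i)}{q(t_i \mid x)},
\qquad t_i \sim q(\cdot \mid x),
```

    where $\mathcal{T}(x)$ is the set of tokenisations of the string $x$ and $q(t \mid x)$ is a proposal distribution over those tokenisations.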
    Condenser: a Pre-training Architecture for Dense Retrieval. (arXiv:2104.08253v2 [cs.CL] UPDATED)
    (2 min) Pre-trained Transformer language models (LM) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient text comparison and retrieval. However, dense encoders require a lot of data and sophisticated techniques to effectively train and suffer in low data situations. This paper finds a key reason is that standard LMs' internal attention structure is not ready-to-use for dense encoders, which needs to aggregate text information into the dense representation. We propose to pre-train towards dense encoder with a novel Transformer architecture, Condenser, where LM prediction CONditions on DENSE Representation. Our experiments show Condenser improves over standard LM by large margins on various text retrieval and similarity tasks.
    Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents. (arXiv:2109.10341v1 [cs.CL])
    (2 min) Document-level neural machine translation (DocNMT) delivers coherent translations by incorporating cross-sentence context. However, for most language pairs there's a shortage of parallel documents, although parallel sentences are readily available. In this paper, we study whether and how contextual modeling in DocNMT is transferable from sentences to documents in a zero-shot fashion (i.e. no parallel documents for student languages) through multilingual modeling. Using simple concatenation-based DocNMT, we explore the effect of 3 factors on multilingual transfer: the number of document-supervised teacher languages, the data schedule for parallel documents at training, and the data condition of parallel documents (genuine vs. backtranslated). Our experiments on Europarl-7 and IWSLT-10 datasets show the feasibility of multilingual transfer for DocNMT, particularly on document-specific metrics. We observe that more teacher languages and adequate data schedule both contribute to better transfer quality. Surprisingly, the transfer is less sensitive to the data condition and multilingual DocNMT achieves comparable performance with both back-translated and genuine document pairs.
    Open-Retrieval Conversational Machine Reading. (arXiv:2102.08633v2 [cs.CL] UPDATED)
    (2 min) In conversational machine reading, systems need to interpret natural language rules, answer high-level questions such as "May I qualify for VA health care benefits?", and ask follow-up clarification questions whose answer is necessary to answer the original question. However, existing works assume the rule text is provided for each user question, which neglects the essential retrieval step in real scenarios. In this work, we propose and investigate an open-retrieval setting of conversational machine reading. In the open-retrieval setting, the relevant rule texts are unknown so that a system needs to retrieve question-relevant evidence from a collection of rule texts, and answer users' high-level questions according to multiple retrieved rule texts in a conversational manner. We propose MUDERN, a Multi-passage Discourse-aware Entailment Reasoning Network which extracts conditions in the rule texts through discourse segmentation, conducts multi-passage entailment reasoning to answer user questions directly, or asks clarification follow-up questions to inquiry more information. On our created OR-ShARC dataset, MUDERN achieves the state-of-the-art performance, outperforming existing single-passage conversational machine reading models as well as a new multi-passage conversational machine reading baseline by a large margin. In addition, we conduct in-depth analyses to provide new insights into this new setting and our model.
    TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. (arXiv:2109.10282v1 [cs.CL])
    (2 min) Text recognition is a long-standing research problem for document digitization. Existing approaches for text recognition are usually built on a CNN for image understanding and an RNN for character-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.
    Identifying Offensive Expressions of Opinion in Context. (arXiv:2104.12227v4 [cs.CL] UPDATED)
    (2 min) Classic information extraction techniques consist of building questions and answers about facts. It remains a challenge for subjective information extraction systems to identify opinions and feelings in context. In sentiment-based NLP tasks, there are few resources for information extraction, above all for offensive or hateful opinions in context. To fill this important gap, this short paper provides a new cross-lingual and contextual offensive lexicon, which consists of explicit and implicit offensive and swearing expressions of opinion, annotated into two classes: context-dependent and context-independent offensive. In addition, we provide markers to identify hate speech. The annotation approach was evaluated at the expression level and achieves high inter-annotator agreement. The offensive lexicon is available in Portuguese and English.
    TranslateLocally: Blazing-fast translation running on the local CPU. (arXiv:2109.10194v1 [cs.CL])
    (2 min) Every day, millions of people sacrifice their privacy and browsing habits in exchange for online machine translation. Companies and governments with confidentiality requirements often ban online translation or pay a premium to disable logging. To bring control back to the end user and demonstrate speed, we developed translateLocally. Running locally on a desktop or laptop CPU, translateLocally delivers cloud-like translation speed and quality even on 10-year-old hardware. The open-source software is based on Marian and runs on Linux, Windows, and macOS.
    Not all parameters are born equal: Attention is mostly what you need. (arXiv:2010.11859v2 [cs.CL] UPDATED)
    (2 min) Transformers are widely used in state-of-the-art machine translation, but the key to their success is still unknown. To gain insight into this, we consider three groups of parameters: embeddings, attention, and feed forward neural network (FFN) layers. We examine the relative importance of each by performing an ablation study where we initialise them at random and freeze them, so that their weights do not change over the course of the training. Through this, we show that the attention and FFN are equally important and fulfil the same functionality in a model. We show that the decision about whether a component is frozen or allowed to train is at least as important for the final model performance as its number of parameters. At the same time, the number of parameters alone is not indicative of a component's importance. Finally, while the embedding layer is the least essential for machine translation tasks, it is the most important component for language modelling tasks.
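    To illustrate the freezing protocol described above, here is a minimal PyTorch sketch that re-initialises one parameter group at random and keeps it frozen for the whole training run; the model checkpoint and the "attn" name filter are illustrative stand-ins, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Minimal sketch: re-initialise one parameter group at random and freeze it,
# so its weights never change during training. The checkpoint and the name
# filter are illustrative; real parameter names depend on the architecture.
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

def ablate_group(model, name_filter):
    for name, param in model.named_parameters():
        if name_filter in name:
            torch.nn.init.normal_(param, mean=0.0, std=0.02)  # random re-init
            param.requires_grad = False                        # never updated

ablate_group(model, "attn")  # e.g. ablate the attention parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters remaining: {trainable:,}")
```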
    From English to Signal Temporal Logic. (arXiv:2109.10294v1 [cs.CL])
    (2 min) Formal methods provide very powerful tools and techniques for the design and analysis of complex systems. Their practical application remains however limited, due to the widely accepted belief that formal methods require extensive expertise and a steep learning curve. Writing correct formal specifications in the form of logical formulas is still considered a difficult and error-prone task. In this paper we propose DeepSTL, a tool and technique for the translation of informal requirements, given as free English sentences, into Signal Temporal Logic (STL), a formal specification language for cyber-physical systems, used both by academia and advanced research labs in industry. A major challenge in devising such a translator is the lack of publicly available informal requirements and formal specifications. We propose a two-step workflow to address this challenge. We first design a grammar-based generation technique for synthetic data, where each output is a random STL formula and its associated set of possible English translations. In the second step, we use a state-of-the-art transformer-based neural translation technique to train an accurate attentional translator of English to STL. The experimental results show high translation quality for patterns of English requirements that are well represented in the training data, making this workflow a promising basis for more complex translation tasks.
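    As a rough illustration of the first step (grammar-based synthetic data), the toy Python sketch below samples a random STL formula together with a template-based English paraphrase; the grammar, signal names, and templates are invented for illustration and are not DeepSTL's.

```python
import random

# Toy grammar-based data generation: sample a random STL formula together with
# a template-based English paraphrase. Grammar and templates are illustrative.
SIGNALS = ["speed", "temperature", "pressure"]

def atom():
    s, c = random.choice(SIGNALS), random.randint(1, 100)
    op, phrase = random.choice([("<", "stays below"), (">", "exceeds")])
    return f"({s} {op} {c})", f"the {s} {phrase} {c}"

def formula(depth=2):
    if depth == 0 or random.random() < 0.4:
        return atom()
    a, b = random.randint(0, 5), random.randint(6, 20)
    stl, eng = formula(depth - 1)
    if random.random() < 0.5:  # globally vs. eventually
        return f"G[{a},{b}] {stl}", f"at all times between {a} and {b}, {eng}"
    return f"F[{a},{b}] {stl}", f"at some time between {a} and {b}, {eng}"

stl, english = formula()
print(english, "->", stl)
```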
    Audiomer: A Convolutional Transformer for Keyword Spotting. (arXiv:2109.10252v1 [cs.LG])
    (2 min) Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to the extremely large sequence length of audio waveforms or reach competitive performance after feature extraction through Fourier-based methods, incurring a loss-floor. In this work, we introduce an architecture, Audiomer, where we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting with raw audio waveforms, outperforming all previous methods while also being computationally cheaper and much more parameter- and data-efficient. Audiomer allows for deployment in compute-constrained devices and training on smaller datasets.
    Knowledge Distillation with Noisy Labels for Natural Language Understanding. (arXiv:2109.10147v1 [cs.CL])
    (2 min) Knowledge Distillation (KD) is extensively used to compress and deploy large pre-trained language models on edge devices for real-world applications. However, one neglected area of research is the impact of noisy (corrupted) labels on KD. We present, to the best of our knowledge, the first study on KD with noisy labels in Natural Language Understanding (NLU). We document the scope of the problem and present two methods to mitigate the impact of label noise. Experiments on the GLUE benchmark show that our methods are effective even under high noise levels. Nevertheless, our results indicate that more research is necessary to cope with label noise in the KD setting.
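    For context, the sketch below shows a standard distillation objective of the kind such studies build on: a temperature-scaled KL term towards the teacher plus cross-entropy towards the (possibly noisy) hard labels. The paper's specific noise-mitigation methods are not reproduced here.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: temperature-scaled KL to the teacher plus
    cross-entropy to the (possibly noisy) hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# toy usage with a batch of 8 examples and 3 classes
s = torch.randn(8, 3)          # student logits
t = torch.randn(8, 3)          # teacher logits
y = torch.randint(0, 3, (8,))  # labels, possibly corrupted
print(kd_loss(s, t, y).item())
```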
    The Trade-offs of Domain Adaptation for Neural Language Models. (arXiv:2109.10274v1 [cs.CL])
    (2 min) In this paper, we connect language model adaptation with concepts of machine learning theory. We consider a training setup with a large out-of-domain set and a small in-domain set. As a first contribution, we derive how the benefit of training a model on either set depends on the size of the sets and the distance between their underlying distributions. As a second contribution, we show how the most popular data selection techniques -- importance sampling, intelligent data selection and influence functions -- can be cast in a common framework which highlights their similarities as well as their subtle differences.
    Multi-Task Learning with Sentiment, Emotion, and Target Detection to Recognize Hate Speech and Offensive Language. (arXiv:2109.10255v1 [cs.CL])
    (3 min) The recognition of hate speech and offensive language (HOF) is commonly formulated as a classification task to decide if a text contains HOF. We investigate whether HOF detection can profit by taking into account the relationships between HOF and similar concepts: (a) HOF is related to sentiment analysis because hate speech is typically a negative statement and expresses a negative opinion; (b) it is related to emotion analysis, as expressed hate points to the author experiencing (or pretending to experience) anger while the addressees experience (or are intended to experience) fear. (c) Finally, one constituting element of HOF is the mention of a targeted person or group. On this basis, we hypothesize that HOF detection shows improvements when being modeled jointly with these concepts, in a multi-task learning setup. We base our experiments on existing data sets for each of these concepts (sentiment, emotion, target of HOF) and evaluate our models as a participant (as team IMS-SINAI) in the HASOC FIRE 2021 English Subtask 1A. Based on model-selection experiments in which we consider multiple available resources and submissions to the shared task, we find that the combination of the CrowdFlower emotion corpus, the SemEval 2016 Sentiment Corpus, and the OffensEval 2019 target detection data leads to an F1 of .79 in a multi-head multi-task learning model based on BERT, compared to .7895 for plain BERT. On the HASOC 2019 test data, this result is more substantial, with an increase of 2pp in F1 and a considerable increase in recall. Across both data sets (2019, 2021), the recall is particularly increased for the class of HOF (6pp for the 2019 data and 3pp for the 2021 data), showing that MTL with emotion, sentiment, and target identification is an appropriate approach for early warning systems that might be deployed on social media platforms.
    RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation. (arXiv:2109.10164v1 [cs.CL])
    (2 min) Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models), especially for large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and the engineering effort required for setting up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection ensures that all teacher layers are taken into account in the training process, while reducing the computational cost of intermediate layer distillation. We also show that it acts as a regularizer, improving the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets. We show that our proposed RAIL-KD approach considerably outperforms other state-of-the-art intermediate layer KD methods in both performance and training time.
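    A minimal sketch of the random layer-mapping idea follows, under the assumption that teacher layers are sampled uniformly at each step and aligned to the student's intermediate layers with an MSE loss; the projection and sampling details are illustrative, not the paper's exact configuration.

```python
import random
import torch
import torch.nn as nn

def rail_kd_loss(teacher_hidden, student_hidden, proj):
    """Randomly map teacher layers onto student layers and align with MSE.
    teacher_hidden / student_hidden: lists of [batch, seq, dim] tensors.
    Sampling and projection choices here are illustrative assumptions."""
    k = len(student_hidden)
    picked = sorted(random.sample(range(len(teacher_hidden)), k))
    loss = 0.0
    for t_idx, s_h in zip(picked, student_hidden):
        loss = loss + nn.functional.mse_loss(proj(teacher_hidden[t_idx]), s_h)
    return loss / k

# toy usage: 12-layer teacher distilled into a 4-layer student, hidden size 64
teacher_hidden = [torch.randn(2, 10, 64) for _ in range(12)]
student_hidden = [torch.randn(2, 10, 64) for _ in range(4)]
proj = nn.Identity()  # replace with nn.Linear if dimensions differ
print(rail_kd_loss(teacher_hidden, student_hidden, proj).item())
```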
    Are Transformers a Modern Version of ELIZA? Observations on French Object Verb Agreement. (arXiv:2109.10133v1 [cs.CL])
    (2 min) Many recent works have demonstrated that unsupervised sentence representations of neural networks encode syntactic information by observing that neural language models are able to predict the agreement between a verb and its subject. We take a critical look at this line of research by showing that it is possible to achieve high accuracy on this agreement task with simple surface heuristics, indicating a possible flaw in our assessment of neural networks' syntactic ability. Our fine-grained analyses of results on the long-range French object-verb agreement show that contrary to LSTMs, Transformers are able to capture a non-trivial amount of grammatical structure.
    Does Vision-and-Language Pretraining Improve Lexical Grounding?. (arXiv:2109.10246v1 [cs.CL])
    (2 min) Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world. Vision-and-Language (VL) models, trained jointly on text and image or video data, have been offered as a response to such criticisms. However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text-only counterparts. This paper compares the semantic representations learned via VL vs. text-only pretraining for two recent VL models using a suite of analyses (clustering, probing, and performance on a commonsense question answering task) in a language-only setting. We find that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.
    Not All Comments are Equal: Insights into Comment Moderation from a Topic-Aware Model. (arXiv:2109.10033v1 [cs.CL])
    (2 min) Moderation of reader comments is a significant problem for online news platforms. Here, we experiment with models for automatic moderation, using a dataset of comments from a popular Croatian newspaper. Our analysis shows that while comments that violate the moderation rules mostly share common linguistic and thematic features, their content varies across the different sections of the newspaper. We therefore make our models topic-aware, incorporating semantic features from a topic model into the classification decision. Our results show that topic information improves the performance of the model, increases its confidence in correct outputs, and helps us understand the model's outputs.
    ConvFiT: Conversational Fine-Tuning of Pretrained Language Models. (arXiv:2109.10126v1 [cs.CL])
    (2 min) Transformer-based language models (LMs) pretrained on large text collections have been shown to store a wealth of semantic knowledge. However, 1) they are not effective as sentence encoders when used off-the-shelf, and 2) thus typically lag behind conversationally pretrained (e.g., via response selection) encoders on conversational tasks such as intent detection (ID). In this work, we propose ConvFiT, a simple and efficient two-stage procedure which turns any pretrained LM into a universal conversational encoder (after Stage 1 ConvFiT-ing) and a task-specialised sentence encoder (after Stage 2). We demonstrate that 1) full-blown conversational pretraining is not required, and that LMs can be quickly transformed into effective conversational encoders with much smaller amounts of unannotated data; 2) pretrained LMs can be fine-tuned into task-specialised sentence encoders, optimised for the fine-grained semantics of a particular task. Consequently, such specialised sentence encoders allow for treating ID as a simple semantic similarity task based on interpretable nearest neighbours retrieval. We validate the robustness and versatility of the ConvFiT framework with such similarity-based inference on the standard ID evaluation sets: ConvFiT-ed LMs achieve state-of-the-art ID performance across the board, with particular gains in the most challenging, few-shot setups.
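    The similarity-based inference step can be sketched as nearest-neighbour classification over sentence embeddings; the generic off-the-shelf encoder below is only a stand-in for a ConvFiT-ed LM.

```python
from sentence_transformers import SentenceTransformer, util

# Sketch of similarity-based intent detection: encode labelled examples once,
# then classify a query by its nearest neighbour. The encoder here is a
# generic stand-in; a ConvFiT-ed LM would take its place.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = ["book a table for two", "what's the weather tomorrow"]
train_intents = ["restaurant_booking", "weather_query"]
train_emb = encoder.encode(train_texts, convert_to_tensor=True)

query_emb = encoder.encode("reserve a table tonight", convert_to_tensor=True)
scores = util.cos_sim(query_emb, train_emb)[0]
print(train_intents[int(scores.argmax())])  # nearest-neighbour intent
```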
    Representation Learning for Short Text Clustering. (arXiv:2109.09894v1 [cs.CL])
    (2 min) Effective representation learning is critical for short text clustering due to the sparse, high-dimensional and noisy attributes of short text corpora. Existing pre-trained models (e.g., Word2vec and BERT) have greatly improved the expressiveness of short text representations with more condensed, low-dimensional and continuous features compared to the traditional Bag-of-Words (BoW) model. However, these models are trained for general purposes and thus are suboptimal for the short text clustering task. In this paper, we propose two methods to exploit the unsupervised autoencoder (AE) framework to further tune the short text representations based on these pre-trained text models for optimal clustering performance. In our first method, Structural Text Network Graph Autoencoder (STN-GAE), we exploit the structural text information in the corpus by constructing a text network, and then adopt a graph convolutional network as encoder to fuse the structural features with the pre-trained text features for text representation learning. In our second method, Soft Cluster Assignment Autoencoder (SCA-AE), we adopt an extra soft cluster assignment constraint on the latent space of the autoencoder to encourage the learned text representations to be more clustering-friendly. We tested the two methods on seven popular short text datasets, and the experimental results show that when only using the pre-trained model for short text clustering, BERT performs better than BoW and Word2vec. However, once we further tune the pre-trained representations, a proposed method like SCA-AE can greatly increase clustering performance, with an accuracy improvement of up to 14% compared to using BERT alone.
    NADE: A Benchmark for Robust Adverse Drug Events Extraction in Face of Negations. (arXiv:2109.10080v1 [cs.CL])
    (2 min) Adverse Drug Event (ADE) extraction models can rapidly examine large collections of social media texts, detecting mentions of drug-related adverse reactions and triggering medical investigations. However, despite the recent advances in NLP, it is currently unknown if such models are robust in the face of negation, which is pervasive across language varieties. In this paper we evaluate three state-of-the-art systems, showing their fragility against negation, and then we introduce two possible strategies to increase the robustness of these models: a pipeline approach, relying on a specific component for negation detection; and an augmentation of an ADE extraction dataset to artificially create negated samples and further train the models. We show that both strategies bring significant increases in performance, lowering the number of spurious entities predicted by the models. Our dataset and code will be publicly released to encourage research on the topic.
    BERTweetFR : Domain Adaptation of Pre-Trained Language Models for French Tweets. (arXiv:2109.10234v1 [cs.CL])
    (2 min) We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialized using the general-domain French language model CamemBERT which follows the base architecture of RoBERTa. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks of offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task is first created and annotated by our team, filling in the gap of such analytic datasets in French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.
    Learning Kernel-Smoothed Machine Translation with Retrieved Examples. (arXiv:2109.09991v1 [cs.CL])
    (2 min) How can neural machine translation (NMT) models be effectively adapted to emerging cases without retraining? Despite the great success of neural machine translation, updating deployed models online remains a challenge. Existing non-parametric approaches that retrieve similar examples from a database to guide the translation process are promising but are prone to overfitting the retrieved examples. In this work, we propose to learn Kernel-Smoothed Translation with Example Retrieval (KSTER), an effective approach to adapt neural machine translation models online. Experiments on domain adaptation and multi-domain machine translation datasets show that even without expensive retraining, KSTER is able to achieve improvements of 1.1 to 1.5 BLEU over the best existing online adaptation methods. The code and trained models are released at https://github.com/jiangqn/KSTER.
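    In the spirit of retrieval-augmented decoding, the sketch below shows kernel-smoothed interpolation between the model distribution and a retrieval distribution built from retrieved target tokens; the fixed Gaussian bandwidth and mixing weight are simplifications, since KSTER learns these quantities.

```python
import torch

def kernel_smoothed_probs(model_probs, neighbor_dists, neighbor_tokens,
                          vocab_size, bandwidth=10.0, lam=0.3):
    """Interpolate the NMT model distribution with a retrieval distribution.
    Retrieved target tokens are weighted by a kernel over representation
    distances. Bandwidth and mixing weight are fixed here for illustration."""
    kernel = torch.softmax(-neighbor_dists / bandwidth, dim=-1)
    retrieval_probs = torch.zeros(vocab_size)
    retrieval_probs.scatter_add_(0, neighbor_tokens, kernel)
    return (1 - lam) * model_probs + lam * retrieval_probs

# toy usage with a 10-word vocabulary and 3 retrieved examples
model_probs = torch.softmax(torch.randn(10), dim=-1)
dists = torch.tensor([1.2, 3.4, 7.8])   # distances to retrieved examples
tokens = torch.tensor([2, 2, 5])        # their target tokens
print(kernel_smoothed_probs(model_probs, dists, tokens, vocab_size=10))
```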
    How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings. (arXiv:2109.10179v1 [cs.CL])
    (2 min) How do neural networks "perceive" speech sounds from unknown languages? Does the typological similarity between the model's training language (L1) and an unknown language (L2) have an impact on the model representations of L2 speech signals? To answer these questions, we present a novel experimental design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWEs) -- vector representations of variable-duration spoken-word segments. First, we train monolingual AWE models on seven Indo-European languages with various degrees of typological similarity. We then employ RSA to quantify the cross-lingual similarity by simulating native and non-native spoken-word processing using AWEs. Our experiments show that typological similarity indeed affects the representational similarity of the models in our study. We further discuss the implications of our work on modeling speech processing and language similarity with neural networks.
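    A minimal sketch of the RSA computation on two embedding spaces follows: build pairwise distance matrices over the same item list and correlate them; the random toy data stands in for acoustic word embeddings from two models.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(embeddings_a, embeddings_b):
    """Representational similarity analysis in its simplest form: correlate
    the pairwise distance structure of two embedding spaces computed over
    the same list of items (here, spoken-word segments)."""
    rdm_a = pdist(embeddings_a, metric="cosine")  # condensed upper triangle
    rdm_b = pdist(embeddings_b, metric="cosine")
    return spearmanr(rdm_a, rdm_b).correlation

# toy usage: 20 items embedded by two different AWE models
rng = np.random.default_rng(0)
a = rng.normal(size=(20, 64))
b = a + rng.normal(scale=0.1, size=(20, 64))  # a perturbed copy of a
print(rsa_score(a, b))
```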
    Negation-Instance Based Evaluation of End-to-End Negation Resolution. (arXiv:2109.10013v1 [cs.CL])
    (2 min) In this paper, we revisit the task of negation resolution, which includes the subtasks of cue detection (e.g. "not", "never") and scope resolution. In the context of previous shared tasks, a variety of evaluation metrics have been proposed. Subsequent works usually use different subsets of these, including variations and custom implementations, rendering meaningful comparisons between systems difficult. Examining the problem both from a linguistic perspective and from a downstream viewpoint, we here argue for a negation-instance based approach to evaluating negation resolution. Our proposed metrics correspond to expectations over per-instance scores and hence are intuitively interpretable. To render research comparable and to foster future work, we provide results for a set of current state-of-the-art systems for negation resolution on three English corpora, and make our implementation of the evaluation scripts publicly available.
    One Source, Two Targets: Challenges and Rewards of Dual Decoding. (arXiv:2109.10197v1 [cs.CL])
    (2 min) Machine translation is generally understood as generating one target text from an input source document. In this paper, we consider a stronger requirement: to jointly generate two texts so that each output side effectively depends on the other. As we discuss, such a device serves several practical purposes, from multi-target machine translation to the generation of controlled variations of the target text. We present an analysis of possible implementations of dual decoding, and experiment with four applications. Viewing the problem from multiple angles allows us to better highlight the challenges of dual decoding and to also thoroughly analyze the benefits of generating matched, rather than independent, translations.
    Data Augmentation Methods for Anaphoric Zero Pronouns. (arXiv:2109.09825v1 [cs.CL])
    (2 min) In pro-drop languages such as Arabic, Chinese, Italian, Japanese, Spanish, and many others, unrealized (null) arguments in certain syntactic positions can refer to a previously introduced entity, and are thus called anaphoric zero pronouns. The existing resources for studying anaphoric zero pronoun interpretation are, however, still limited. In this paper, we use five data augmentation methods to generate and detect anaphoric zero pronouns automatically. We use the augmented data as additional training materials for two anaphoric zero pronoun systems for Arabic. Our experimental results show that data augmentation improves the performance of the two systems, surpassing the state-of-the-art results.
    StreamSide: A Fully-Customizable Open-Source Toolkit for Efficient Annotation of Meaning Representations. (arXiv:2109.09853v1 [cs.CL])
    (2 min) This demonstration paper presents StreamSide, an open-source toolkit for annotating multiple kinds of meaning representations. StreamSide supports frame-based annotation schemes, e.g., Abstract Meaning Representation (AMR), and frameless annotation schemes, e.g., Widely Interpretable Semantic Representation (WISeR). Moreover, it supports both sentence-level and document-level annotation by allowing annotators to create multi-rooted graphs for input text. It can open and automatically convert between several types of input formats including plain text, Penman notation, and its own JSON format, which enables richer annotation. It features reference frames for AMR predicate argument structures, and also concept-to-text alignment. StreamSide is released under the Apache 2.0 license, and is completely open-source so that it can be customized to annotate enriched meaning representations in different languages (e.g., Uniform Meaning Representations). All StreamSide resources are publicly distributed through our open source project at: https://github.com/emorynlp/StreamSide.
    On the Difficulty of Segmenting Words with Attention. (arXiv:2109.10107v1 [cs.CL])
    (2 min) Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition, attention can be used to locate and segment the words. We show, however, that even on monolingual data this approach is brittle. In our experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task. Models trained to predict words from either phones or speech (i.e., the opposite direction needed to generalize to new data), yield much worse results, suggesting that attention-based segmentation is only useful in limited scenarios.
    Blindness to Modality Helps Entailment Graph Mining. (arXiv:2109.10227v1 [cs.CL])
    (2 min) Understanding linguistic modality is widely seen as important for downstream tasks such as Question Answering and Knowledge Graph Population. Entailment Graph learning might also be expected to benefit from attention to modality. We build Entailment Graphs using a news corpus filtered with a modality parser, and show that stripping modal modifiers from predicates in fact increases performance. This suggests that for some tasks, the pragmatics of modal modification of predicates allows them to contribute as evidence of entailment.
    Improving Span Representation for Domain-adapted Coreference Resolution. (arXiv:2109.09811v1 [cs.LG])
    (2 min) Recent work has shown fine-tuning neural coreference models can produce strong performance when adapting to different domains. However, at the same time, this can require a large amount of annotated target examples. In this work, we focus on supervised domain adaptation for clinical notes, proposing the use of concept knowledge to more efficiently adapt coreference models to a new domain. We develop methods to improve the span representations via (1) a retrofitting loss to incentivize span representations to satisfy a knowledge-based distance function and (2) a scaffolding loss to guide the recovery of knowledge from the span representation. By integrating these losses, our model is able to improve our baseline precision and F-1 score. In particular, we show that incorporating knowledge with end-to-end coreference models results in better performance on the most challenging, domain-specific spans.
    Generalization in Text-based Games via Hierarchical Reinforcement Learning. (arXiv:2109.09968v1 [cs.CL])
    (2 min) Deep reinforcement learning provides a promising approach for text-based games in studying natural language communication between humans and artificial agents. However, generalization still remains a big challenge, as the agents depend critically on the complexity and variety of training tasks. In this paper, we address this problem by introducing a hierarchical framework built upon the knowledge graph-based RL agent. At the high level, a meta-policy decomposes the whole game into a set of subtasks specified by textual goals and selects one of them based on the knowledge graph (KG). At the low level, a sub-policy then conducts goal-conditioned reinforcement learning. We carry out experiments on games with various difficulty levels and show that the proposed method enjoys favorable generalizability.
    SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. (arXiv:2109.10086v1 [cs.IR])
    (2 min) In neural Information Retrieval (IR), ongoing research is directed towards improving the first retriever in ranking pipelines. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors methods has proven to work well. Meanwhile, there has been a growing interest in learning \emph{sparse} representations for documents and queries that could inherit the desirable properties of bag-of-words models, such as the exact matching of terms and the efficiency of inverted indexes. Introduced recently, the SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches. In this paper, we build on SPLADE and propose several significant improvements in terms of effectiveness and/or efficiency. More specifically, we modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation. We also report results on the BEIR benchmark. Overall, SPLADE is considerably improved with more than 9% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
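    A sketch of how SPLADE-style sparse vectors are formed from an MLM head is shown below (log-saturated ReLU over vocabulary logits, max-pooled over tokens as in SPLADE v2); the plain BERT checkpoint is only a stand-in, so the resulting weights are not meaningful without SPLADE training, and the regularisation and distillation losses are omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Sketch of SPLADE-style sparse encoding: pass text through an MLM head,
# apply log(1 + ReLU(.)) to the vocabulary logits, and max-pool over tokens.
# An untrained (for retrieval) BERT backbone is used purely for illustration.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def splade_vector(text):
    batch = tok(text, return_tensors="pt")
    logits = mlm(**batch).logits                   # [1, seq_len, vocab]
    weights = torch.log1p(torch.relu(logits))      # saturate activations
    weights = weights * batch["attention_mask"].unsqueeze(-1)
    return weights.max(dim=1).values.squeeze(0)    # [vocab], term importances

vec = splade_vector("neural information retrieval with sparse representations")
top = vec.topk(5)
print([tok.convert_ids_to_tokens(i.item()) for i in top.indices])
```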
    Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you?. (arXiv:2109.10052v1 [cs.CL])
    (2 min) In this paper, we investigate what types of stereotypical information are captured by pretrained language models. We present the first dataset comprising stereotypical attributes of a range of social groups and propose a method to elicit stereotypes encoded by pretrained language models in an unsupervised fashion. Moreover, we link the emergent stereotypes to their manifestation as basic emotions as a means to study their emotional effects in a more generalized manner. To demonstrate how our methods can be used to analyze emotion and stereotype shifts due to linguistic experience, we use fine-tuning on news sources as a case study. Our experiments expose how attitudes towards different social groups vary across models and how quickly emotions and stereotypes can shift at the fine-tuning stage.
    InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works. (arXiv:2109.10104v1 [cs.CL])
    (2 min) Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature. Such automated approaches enable quantitative studies on large corpora which would not be feasible by manual inspection alone. However, due to copyright restrictions, the availability of relevant digitized literary works is limited. Derived Text Formats (DTFs) have been proposed as a solution. Here, textual materials are transformed in such a way that copyright-critical features are removed, but the use of certain analytical methods remains possible. Contextualized word embeddings produced by transformer-encoders (like BERT) are promising candidates for DTFs because they allow for state-of-the-art performance on various analytical tasks and, at first sight, do not disclose the original text. However, in this paper we demonstrate that under certain conditions the reconstruction of the original copyrighted text becomes feasible and its publication in the form of contextualized word representations is not safe. Our attempts to invert BERT suggest that publishing parts of the encoder together with the contextualized embeddings is critical, since this allows generating data to train a decoder with a reconstruction accuracy sufficient to violate copyright laws.
    Inspecting the Factuality of Hallucinated Entities in Abstractive Summarization. (arXiv:2109.09784v1 [cs.CL])
    (2 min) State-of-the-art abstractive summarization systems often generate \emph{hallucinations}; i.e., content that is not directly inferable from the source text. Despite being assumed incorrect, many of the hallucinated contents are consistent with world knowledge (factual hallucinations). Including these factual hallucinations into a summary can be beneficial in providing additional background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method is based on an entity's prior and posterior probabilities according to pre-trained and finetuned masked language models, respectively. Empirical results suggest that our method vastly outperforms three strong baselines in both accuracy and F1 scores and has a strong correlation with human judgments on factuality classification tasks. Furthermore, our approach can provide insight into whether a particular hallucination is caused by the summarizer's pre-training or fine-tuning step.
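    A heavily simplified sketch of the prior/posterior idea for a single-token entity follows: the prior is the masked-LM probability of the entity given the summary alone, while the posterior is approximated here by also prepending the source text; the paper instead uses a fine-tuned, source-conditioned model, and the final comparison rule below is purely illustrative.

```python
from transformers import pipeline

# Simplified sketch: prior = MLM probability of the entity given the summary
# alone; "posterior" approximated by prepending the source document. The
# actual method relies on a fine-tuned, source-conditioned masked LM, and
# handles multi-token entities; this toy version does not.
fill = pipeline("fill-mask", model="bert-base-uncased")

source = "The museum is located in Paris and welcomes two million visitors a year."
summary_tpl = "The museum is in [MASK]."
entity = "paris"  # single-token entity for bert-base-uncased

prior = fill(summary_tpl, targets=[entity])[0]["score"]
posterior = fill(source + " " + summary_tpl, targets=[entity])[0]["score"]
print(f"prior={prior:.4f} posterior={posterior:.4f} "
      f"looks_factual={posterior > prior}")  # illustrative decision rule
```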
    A Comprehensive Review on Summarizing Financial News Using Deep Learning. (arXiv:2109.10118v1 [cs.CL])
    (2 min) Investors make investment decisions depending on several factors such as fundamental analysis, technical analysis, and quantitative analysis. Another factor on which investors can base investment decisions is sentiment analysis of news headlines, the sole purpose of this study. Natural Language Processing techniques are typically used to deal with such a large amount of data and extract valuable information from it. NLP algorithms convert raw text into numerical representations that machines can easily understand and interpret. This conversion can be done using various embedding techniques. In this research, the embedding techniques used are BoW, TF-IDF, Word2Vec, BERT, GloVe, and FastText, whose outputs are then fed to deep learning models such as RNN and LSTM. This work aims to evaluate these models' performance in order to choose a robust model for identifying the significant factors influencing the prediction. It was expected that Deep Learning would achieve the desired results or better accuracy than the state of the art. The models are compared on their outputs to determine which performs better.
    Intensionalizing Abstract Meaning Representations: Non-Veridicality and Scope. (arXiv:2109.09858v1 [cs.CL])
    (2 min) Abstract Meaning Representation (AMR) is a graphical meaning representation language designed to represent propositional information about argument structure. However, at present it is unable to satisfyingly represent non-veridical intensional contexts, often licensing inappropriate inferences. In this paper, we show how to resolve the problem of non-veridicality without appealing to layered graphs through a mapping from AMRs into Simply-Typed Lambda Calculus (STLC). At least for some cases, this requires the introduction of a new role :content which functions as an intensional operator. The translation proposed is inspired by the formal linguistics literature on the event semantics of attitude reports. Next, we address the interaction of quantifier scope and intensional operators in so-called de re/de dicto ambiguities. We adopt a scope node from the literature and provide an explicit multidimensional semantics utilizing Cooper storage which allows us to derive the de re and de dicto scope readings as well as intermediate scope readings which prove difficult for accounts without a scope node.
    Dependency Induction Through the Lens of Visual Perception. (arXiv:2109.09790v1 [cs.CL])
    (2 min) Most previous work on grammar induction focuses on learning phrasal or dependency structure purely from text. However, because the signal provided by text alone is limited, recently introduced visually grounded syntax models make use of multimodal information leading to improved performance in constituency grammar induction. However, as compared to dependency grammars, constituency grammars do not provide a straightforward way to incorporate visual information without enforcing language-specific heuristics. In this paper, we propose an unsupervised grammar induction model that leverages word concreteness and a structural vision-based heuristic to jointly learn constituency-structure and dependency-structure grammars. Our experiments find that concreteness is a strong indicator for learning dependency grammars, improving the direct attachment score (DAS) by over 50\% as compared to state-of-the-art models trained on pure text. Next, we propose an extension of our model that leverages both word concreteness and visual semantic role labels in constituency and dependency parsing. Our experiments show that the proposed extension outperforms the current state-of-the-art visually grounded models in constituency parsing even with a smaller grammar size.
    Language Identification with a Reciprocal Rank Classifier. (arXiv:2109.09862v1 [cs.CL])
    (2 min) Language identification is a critical component of language processing pipelines (Jauhiainen et al., 2019) and is not a solved problem in real-world settings. We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of copious training data. The key idea for classification is that the reciprocal of the rank in a frequency table makes an effective additive feature score, hence the term Reciprocal Rank Classifier (RRC). The key finding for language classification is that ranked lists of words and frequencies of characters form a sufficient and robust representation of the regularities of key languages and their orthographies. We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set. When trained on Wikipedia but applied to Twitter, the macro-averaged F1-score of a conventionally trained SVM classifier drops from 90.9% to 77.7%. By contrast, the macro F1-score of RRC drops only from 93.1% to 90.6%. These classifiers are compared with those from fastText and langid. The RRC performs better than these established systems in most experiments, especially on short Wikipedia texts and Twitter. The RRC classifier can be improved for particular domains and conversational situations by adding words to the ranked lists. Using new terms learned from such conversations, we demonstrate a further 7.9% increase in accuracy of sample message classification, and a 1.7% increase for conversation classification. Surprisingly, this made results on Twitter data slightly worse. The RRC classifier is available as an open source Python package (https://github.com/LivePersonInc/lplangid).
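    A minimal sketch of the reciprocal-rank scoring idea with toy ranked word lists is shown below; the released classifier additionally uses character frequencies and much longer ranked lists per language.

```python
# Minimal sketch of reciprocal-rank scoring with toy ranked word lists.
# The real RRC also uses character frequencies and far longer lists.
RANKED_WORDS = {
    "en": ["the", "of", "and", "to", "in"],
    "fr": ["de", "la", "le", "et", "les"],
}

def rrc_score(text, ranked_words):
    ranks = {w: r + 1 for r, w in enumerate(ranked_words)}
    # each known word contributes 1/rank as an additive feature score
    return sum(1.0 / ranks[w] for w in text.lower().split() if w in ranks)

def identify(text):
    return max(RANKED_WORDS, key=lambda lang: rrc_score(text, RANKED_WORDS[lang]))

print(identify("the cat sat on the mat"))  # -> en
print(identify("le chat et la souris"))    # -> fr
```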
    BERT Has Uncommon Sense: Similarity Ranking for Word Sense BERTology. (arXiv:2109.09780v1 [cs.CL])
    (2 min) An important question concerning contextualized word embedding (CWE) models like BERT is how well they can represent different word senses, especially those in the long tail of uncommon senses. Rather than build a WSD system as in previous work, we investigate contextualized embedding neighborhoods directly, formulating a query-by-example nearest neighbor retrieval task and examining ranking performance for words and senses in different frequency bands. In an evaluation on two English sense-annotated corpora, we find that several popular CWE models all outperform a random baseline even for proportionally rare senses, without explicit sense supervision. However, performance varies considerably even among models with similar architectures and pretraining regimes, with especially large differences for rare word senses, revealing that CWE models are not all created equal when it comes to approximating word senses in their native representations.
    Transforming Fake News: Robust Generalisable News Classification Using Transformers. (arXiv:2109.09796v1 [cs.CL])
    (2 min) As online news has become increasingly popular and fake news increasingly prevalent, the ability to audit the veracity of online news content has become more important than ever. Such a task represents a binary classification challenge, for which transformers have achieved state-of-the-art results. Using the publicly available ISOT and Combined Corpus datasets, this study explores transformers' abilities to identify fake news, with particular attention given to investigating generalisation to unseen datasets with varying styles, topics and class distributions. Moreover, we explore the idea that opinion-based news articles cannot be classified as real or fake due to their subjective nature and often sensationalised language, and propose a novel two-step classification pipeline to remove such articles from both model training and the final deployed inference system. Experiments over the ISOT and Combined Corpus datasets show that transformers achieve an increase in F1 scores of up to 4.9% for out of distribution generalisation compared to baseline approaches, with a further increase of 10.1% following the implementation of our two-step classification pipeline. To the best of our knowledge, this study is the first to investigate generalisation of transformers in this context.
    Something Old, Something New: Grammar-based CCG Parsing with Transformer Models. (arXiv:2109.10044v1 [cs.CL])
    (2 min) This report describes the parsing problem for Combinatory Categorial Grammar (CCG), showing how a combination of Transformer-based neural models and a symbolic CCG grammar can lead to substantial gains over existing approaches. The report also documents a 20-year research program, showing how NLP methods have evolved over this time. The staggering accuracy improvements provided by neural models for CCG parsing can be seen as a reflection of the improvements seen in NLP more generally. The report provides a minimal introduction to CCG and CCG parsing, with many pointers to the relevant literature. It then describes the CCG supertagging problem, and some recent work from Tian et al. (2020) which applies Transformer-based models to supertagging with great effect. I use this existing model to develop a CCG multitagger, which can serve as a front-end to an existing CCG parser. Simply using this new multitagger provides substantial gains in parsing accuracy. I then show how a Transformer-based model from the parsing literature can be combined with the grammar-based CCG parser, setting a new state-of-the-art for the CCGbank parsing task of almost 93% F-score for labelled dependencies, with complete sentence accuracies of over 50%.
    DisCoDisCo at the DISRPT2021 Shared Task: A System for Discourse Segmentation, Classification, and Connective Detection. (arXiv:2109.09777v1 [cs.CL])
    (2 min) This paper describes our submission to the DISRPT2021 Shared Task on Discourse Unit Segmentation, Connective Detection, and Relation Classification. Our system, called DisCoDisCo, is a Transformer-based neural classifier which enhances contextualized word embeddings (CWEs) with hand-crafted features, relying on tokenwise sequence tagging for discourse segmentation and connective detection, and a feature-rich, encoder-less sentence pair classifier for relation classification. Our results for the first two tasks outperform SOTA scores from the previous 2019 shared task, and results on relation classification suggest strong performance on the new 2021 benchmark. Ablation tests show that including features beyond CWEs is helpful for both tasks, and a partial evaluation of multiple pre-trained Transformer-based language models indicates that models pre-trained on the Next Sentence Prediction (NSP) task are optimal for relation classification.
  • cs.CV updates on arXiv.org

    Learning to Prompt for Vision-Language Models. (arXiv:2109.01134v2 [cs.CV] UPDATED)
    (3 min) Vision-language pre-training has recently emerged as a promising alternative for representation learning. It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders. Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks, since visual concepts can be generated from natural language prompts. In this paper, we identify that a major challenge of deploying such models in practice is prompt engineering. This is because designing a proper prompt, especially for the context words surrounding a class name, requires domain expertise and typically takes a significant amount of time for word tuning, since a slight change in wording can have a huge impact on performance. Moreover, different downstream tasks require specific designs, further hampering the efficiency of deployment. To overcome this challenge, we propose a novel approach named context optimization (CoOp). The main idea is to model context in prompts using continuous representations and perform end-to-end learning from data while keeping the pre-trained parameters fixed. In this way, the design of task-relevant prompts can be fully automated. Experiments on 11 datasets show that CoOp effectively turns pre-trained vision-language models into data-efficient visual learners, requiring as few as one or two shots to beat hand-crafted prompts by a decent margin and gaining significant improvements when using more shots (e.g., at 16 shots the average gain is around 17% with the highest reaching over 50%). CoOp also exhibits strong robustness to distribution shift.
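    A sketch of the core mechanism under simplifying assumptions: a few learnable context vectors shared across classes are prepended to frozen class-name embeddings, and only those vectors receive gradients; the embeddings and the final similarity computation below are placeholders rather than CLIP's actual encoders.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """Sketch of CoOp-style context optimisation: M learnable context vectors
    shared across classes are prepended to frozen class-name embeddings.
    Only the context vectors are trained; encoders stay frozen (placeholders
    here)."""
    def __init__(self, class_name_emb, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # trainable
        self.register_buffer("cls_emb", class_name_emb)          # frozen

    def forward(self):
        n_cls = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # [n_cls, n_ctx + 1, dim]: context tokens followed by the class token
        return torch.cat([ctx, self.cls_emb.unsqueeze(1)], dim=1)

# toy usage with 3 classes in a 512-dim embedding space
class_name_emb = torch.randn(3, 512)     # stand-in for class-name token embeddings
prompt = LearnableContext(class_name_emb)
prompts = prompt()                       # in CoOp, fed to the frozen text encoder
image_feat = torch.randn(1, 512)         # stand-in for a frozen image encoder output
logits = image_feat @ prompts.mean(dim=1).t()  # crude stand-in for the real scoring
print(prompts.shape, logits.shape)
```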
    PDFNet: Pointwise Dense Flow Network for Urban-Scene Segmentation. (arXiv:2109.10083v1 [cs.CV])
    (2 min) In recent years, using a deep convolutional neural network (CNN) as a feature encoder (or backbone) is the most commonly observed architectural pattern in several computer vision methods, and semantic segmentation is no exception. The two major drawbacks of this architectural pattern are: (i) the networks often fail to capture small classes such as wall, fence, pole, traffic light, traffic sign, and bicycle, which are crucial for autonomous vehicles to make accurate decisions. (ii) due to the arbitrarily increasing depth, the networks require massive labeled data and additional regularization techniques to converge and to prevent the risk of over-fitting, respectively. While regularization techniques come at minimal cost, the collection of labeled data is an expensive and laborious process. In this work, we address these two drawbacks by proposing a novel lightweight architecture named point-wise dense flow network (PDFNet). In PDFNet, we employ dense, residual, and multiple shortcut connections to allow a smooth gradient flow to all parts of the network. The extensive experiments on Cityscapes and CamVid benchmarks demonstrate that our method significantly outperforms baselines in capturing small classes and in few-data regimes. Moreover, our method achieves considerable performance in classifying out-of-training-distribution samples, evaluated in a Cityscapes-to-KITTI setting.
    Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism. (arXiv:2003.03955v3 [cs.CV] UPDATED)
    (0 min) Food retrieval is an important task to perform analysis of food-related information, where we are interested in retrieving relevant information about the queried food item such as ingredients, cooking instructions, etc. In this paper, we investigate cross-modal retrieval between food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another. Two major challenges in addressing this problem are 1) large intra-variance and small inter-variance across cross-modal food data; and 2) difficulties in obtaining discriminative recipe representations. To address these two problems, we propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities. Besides, we exploit a self-attention mechanism to improve the embedding of recipes. We evaluate the performance of the proposed method on the large-scale Recipe1M dataset, and show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
    Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR. (arXiv:2109.09628v2 [cs.CV] UPDATED)
    (0 min) Self-supervised monocular depth prediction provides a cost-effective solution to obtain the 3D location of each pixel. However, the existing approaches usually lead to unsatisfactory accuracy, which is critical for autonomous robots. In this paper, we propose a novel two-stage network to advance the self-supervised monocular dense depth learning by leveraging low-cost sparse (e.g. 4-beam) LiDAR. Unlike the existing methods that use sparse LiDAR mainly in a manner of time-consuming iterative post-processing, our model fuses monocular image features and sparse LiDAR features to predict initial depth maps. Then, an efficient feed-forward refine network is further designed to correct the errors in these initial depth maps in pseudo-3D space with real-time performance. Extensive experiments show that our proposed model significantly outperforms all the state-of-the-art self-supervised methods, as well as the sparse-LiDAR-based methods on both self-supervised monocular depth prediction and completion tasks. With the accurate dense depth prediction, our model outperforms the state-of-the-art sparse-LiDAR-based method (Pseudo-LiDAR++) by more than 68% for the downstream task monocular 3D object detection on the KITTI Leaderboard.
    Probabilistic and Geometric Depth: Detecting Objects in Perspective. (arXiv:2107.14160v2 [cs.CV] UPDATED)
    (2 min) 3D object detection is an important capability needed in various practical applications such as driver assistance systems. Monocular 3D detection, as a representative general setting among image-based approaches, provides a more economical solution than conventional settings relying on LiDARs. It has drawn increasing attention recently but still yields unsatisfactory results. This paper first presents a systematic study on this problem. We observe that the current monocular 3D detection can be simplified as an instance depth estimation problem: The inaccurate instance depth blocks all the other 3D attribute predictions from improving the overall detection performance. However, recent methods directly estimate the depth based on isolated instances or pixels while ignoring the geometric relations across different objects. These geometric relations can be valuable constraints as the key information about depth is not directly manifest in the monocular image. Therefore, we construct geometric relation graphs across predicted objects and use the graph to facilitate depth estimation. As the preliminary depth estimation of each instance is usually inaccurate in this ill-posed setting, we incorporate a probabilistic representation to capture the uncertainty. It provides an important indicator to identify confident predictions and further guide the depth propagation. Despite the simplicity of the basic idea, our method obtains significant improvements on KITTI and nuScenes benchmarks, achieving 1st place out of all monocular vision-only methods while still maintaining real-time efficiency. Code and models will be released at https://github.com/open-mmlab/mmdetection3d.
    Multifield Cosmology with Artificial Intelligence. (arXiv:2109.09747v1 [astro-ph.CO])
    (0 min) Astrophysical processes such as feedback from supernovae and active galactic nuclei modify the properties and spatial distribution of dark matter, gas, and galaxies in a poorly understood way. This uncertainty is one of the main theoretical obstacles to extract information from cosmological surveys. We use 2,000 state-of-the-art hydrodynamic simulations from the CAMELS project spanning a wide variety of cosmological and astrophysical models and generate hundreds of thousands of 2-dimensional maps for 13 different fields: from dark matter to gas and stellar properties. We use these maps to train convolutional neural networks to extract the maximum amount of cosmological information while marginalizing over astrophysical effects at the field level. Although our maps only cover a small area of $(25~h^{-1}{\rm Mpc})^2$, and the different fields are contaminated by astrophysical effects in very different ways, our networks can infer the values of $\Omega_{\rm m}$ and $\sigma_8$ with a few percent level precision for most of the fields. We find that the marginalization performed by the network retains a wealth of cosmological information compared to a model trained on maps from gravity-only N-body simulations that are not contaminated by astrophysical effects. Finally, we train our networks on multifields -- 2D maps that contain several fields as different colors or channels -- and find that not only they can infer the value of all parameters with higher accuracy than networks trained on individual fields, but they can constrain the value of $\Omega_{\rm m}$ with higher accuracy than the maps from the N-body simulations.
    StereOBJ-1M: Large-scale Stereo Image Dataset for 6D Object Pose Estimation. (arXiv:2109.10115v1 [cs.CV])
    (0 min) We present a large-scale stereo RGB image object pose estimation dataset named the $\textbf{StereOBJ-1M}$ dataset. The dataset is designed to address challenging cases such as object transparency, translucency, and specular reflection, in addition to the common challenges of occlusion, symmetry, and variations in illumination and environments. In order to collect data of sufficient scale for modern deep learning models, we propose a novel method for efficiently annotating pose data in a multi-view fashion that allows data capturing in complex and flexible environments. Fully annotated with 6D object poses, our dataset contains over 396K frames and over 1.5M annotations of 18 objects recorded in 183 scenes constructed in 11 different environments. The 18 objects include 8 symmetric objects, 7 transparent objects, and 8 reflective objects. We benchmark two state-of-the-art pose estimation frameworks on StereOBJ-1M as baselines for future work. We also propose a novel object-level pose optimization method for computing 6D pose from keypoint predictions in multiple images.
    Feature Correlation Aggregation: on the Path to Better Graph Neural Networks. (arXiv:2109.09300v1 [cs.LG] CROSS LISTED)
    (0 min) Prior to the introduction of Graph Neural Networks (GNNs), modeling and analyzing irregular data, particularly graphs, was thought to be the Achilles' heel of deep learning. The core concept of GNNs is to find a representation by recursively aggregating the representations of a central node and those of its neighbors, and its success has been demonstrated by many GNN designs. However, most of them only use the first-order information between a node and its neighbors. In this paper, we introduce a central node permutation variant function through a frustratingly simple and innocent-looking modification to the core operation of a GNN, namely the Feature cOrrelation aGgregation (FOG) module, which learns second-order information from the feature correlation between a node and its neighbors in the pipeline. By adding FOG into existing variants of GNNs, we empirically verify that this second-order information complements the features generated by original GNNs across a broad set of benchmarks. A tangible boost in model performance is observed, with the model surpassing previous state-of-the-art results by a significant margin while employing fewer parameters (e.g., a 33.116% improvement on a real-world molecular dataset using graph convolutional networks).
    Modeling Annotation Uncertainty with Gaussian Heatmaps in Landmark Localization. (arXiv:2109.09533v2 [cs.CV] UPDATED)
    (2 min) In landmark localization, due to ambiguities in defining their exact position, landmark annotations may suffer from large observer variabilities, which result in uncertain annotations. To model the annotation ambiguities of the training dataset, we propose to learn anisotropic Gaussian parameters modeling the shape of the target heatmap during optimization. Furthermore, our method models the prediction uncertainty of individual samples by fitting anisotropic Gaussian functions to the predicted heatmaps during inference. Besides state-of-the-art results, our experiments on datasets of hand radiographs and lateral cephalograms also show that Gaussian functions are correlated with both localization accuracy and observer variability. As a final experiment, we show the importance of integrating the uncertainty into decision making by measuring the influence of the predicted location uncertainty on the classification of anatomical abnormalities in lateral cephalograms.
    CondNet: Conditional Classifier for Scene Segmentation. (arXiv:2109.10322v1 [cs.CV])
    (2 min) The fully convolutional network (FCN) has achieved tremendous success in dense visual recognition tasks, such as scene segmentation. The last layer of an FCN is typically a global classifier (1x1 convolution) that assigns each pixel a semantic label. We empirically show that this global classifier, which ignores intra-class distinctions, may lead to sub-optimal results. In this work, we present a conditional classifier to replace the traditional global classifier, where the kernels of the classifier are generated dynamically, conditioned on the input. The main advantages of the new classifier are: (i) it attends to intra-class distinctions, leading to stronger dense recognition capability; (ii) it is simple and flexible enough to be integrated into almost arbitrary FCN architectures to improve prediction. Extensive experiments demonstrate that the proposed classifier performs favourably against the traditional classifier on the FCN architecture. The framework equipped with the conditional classifier (called CondNet) achieves new state-of-the-art performance on two datasets. The code and models are available at https://git.io/CondNet.
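    A minimal sketch of the dynamic-kernel idea described above (a 1x1 classifier whose kernels are generated from the input) is given below; the global-average-pooling context and the linear kernel generator are assumptions for illustration, not CondNet's actual generator.

```python
import numpy as np

def conditional_classifier(feat, w_gen, b_gen):
    """Toy dynamic 1x1 classifier in the spirit of a conditional classifier.

    feat:  (C, H, W) feature map from an FCN backbone.
    w_gen: (K*C, C) and b_gen: (K,) parameters of a linear kernel generator
           (hypothetical stand-ins for whatever generator the paper uses).
    Returns per-pixel logits of shape (K, H, W).
    """
    C, H, W = feat.shape
    context = feat.mean(axis=(1, 2))                # global context vector, (C,)
    kernels = (w_gen @ context).reshape(-1, C)      # input-conditioned 1x1 kernels, (K, C)
    return np.einsum('kc,chw->khw', kernels, feat) + b_gen[:, None, None]

feat = np.random.randn(16, 32, 32)
K = 5
logits = conditional_classifier(feat, np.random.randn(K * 16, 16), np.zeros(K))
```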
    Deep Distributionally Robust Learning for Calibrated Uncertainties under Domain Shift. (arXiv:2010.05784v2 [cs.LG] UPDATED)
    (2 min) We propose a deep distributionally robust learning (DRL) framework for calibrated uncertainties under domain shifts. We consider cases where the source (training) distribution differs significantly from the target (test) distribution. In addition to the standard class predictor, our framework contains a binary domain classifier which estimates the density ratio between the source and target domains. We implement both with neural networks and train them end-to-end. The framework is demonstrated to generate calibrated uncertainties that benefit many downstream tasks, including unsupervised domain adaptation (UDA) and semi-supervised learning (SSL), where methods such as self-training and FixMatch use uncertainties to select confident pseudo-labels. Our experiments show that the introduction of DRL to these methods leads to significant improvements in cross-domain performance. We also demonstrate that the produced density ratio estimates agree with human selection frequencies, suggesting a match with human-perceived uncertainties. The source code of this work will be made publicly available.
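    For readers unfamiliar with density-ratio estimation from a domain classifier, the snippet below shows the standard conversion from a classifier's target-domain probability to a density ratio, assuming the classifier was trained on a balanced source/target mixture; it is a generic illustration rather than the paper's implementation.

```python
import numpy as np

def density_ratio(p_target, eps=1e-6):
    """Estimate p_target(x) / p_source(x) from a binary domain classifier.

    p_target: classifier probability that x comes from the target domain,
    assuming a balanced source/target training mixture.
    """
    p = np.clip(p_target, eps, 1.0 - eps)
    return p / (1.0 - p)

print(density_ratio(np.array([0.2, 0.5, 0.9])))   # ratios 0.25, 1.0, 9.0
```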
    Quadruple Augmented Pyramid Network for Multi-class COVID-19 Segmentation via CT. (arXiv:2103.05546v2 [eess.IV] UPDATED)
    (2 min) COVID-19, a new strain of coronavirus disease, has been one of the most serious and infectious diseases in the world. Chest CT is essential for prognostication, diagnosing this disease, and assessing complications. In this paper, a multi-class COVID-19 CT segmentation method is proposed, aiming to help radiologists estimate the extent of affected lung volume. We utilize four augmented pyramid networks on an encoder-decoder segmentation framework. The Quadruple Augmented Pyramid Network (QAP-Net) not only enables the CNN to capture features from CT images of varying size, but also acts as spatial interconnections and down-sampling to transfer sufficient feature information for semantic segmentation. Experimental results achieve competitive segmentation performance with a Dice score of 0.8163, which outperforms other state-of-the-art methods, demonstrating that the proposed framework can segment consolidation as well as ground-glass areas in COVID-19 chest CT efficiently and accurately.
    Self-supervised Representation Learning for Reliable Robotic Monitoring of Fruit Anomalies. (arXiv:2109.10135v1 [cs.RO])
    (2 min) Data augmentation can be a simple yet powerful tool for autonomous robots to fully utilise available data for self-supervised identification of atypical scenes or objects. State-of-the-art augmentation methods arbitrarily embed structural peculiarity in focal objects on typical images so that classifying these artefacts can provide guidance for learning representations for the detection of anomalous visual inputs. In this paper, however, we argue that learning such structure-sensitive representations can be a suboptimal approach to some classes of anomaly (e.g., unhealthy fruits) which are better recognised by a different type of visual element such as "colour". We thus propose Channel Randomisation as a novel data augmentation method for restricting neural network models to learn encoding of "colour irregularity" whilst predicting channel-randomised images to ultimately build reliable fruit-monitoring robots identifying atypical fruit qualities. Our experiments show that (1) the colour-based alternative can better learn representations for consistently accurate identification of fruit anomalies in various fruit species, and (2) validation accuracy can be monitored for early stopping of training due to positive correlation between the colour-learning task and fruit anomaly detection. Moreover, the proposed approach is evaluated on a new anomaly dataset Riseholme-2021, consisting of 3.5K strawberry images collected from a mobile robot, which we share with the community to encourage active agri-robotics research.
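    The following is a toy numpy sketch of what the channel-randomisation augmentation described above could look like; returning the permutation as a self-supervised label is one plausible target, and the details may differ from the authors' exact setup.

```python
import numpy as np

def channel_randomise(image, rng=None):
    """Randomly permute the colour channels of an (H, W, 3) image.

    Returns the permuted image plus the permutation index, which could serve
    as a self-supervised label for a colour-sensitive pretext task.
    """
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(3)
    return image[..., perm], perm

img = np.random.rand(64, 64, 3)
aug, label = channel_randomise(img)
```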
    Homography augmented momentum contrastive learning for SAR image retrieval. (arXiv:2109.10329v1 [cs.CV])
    (0 min) Deep learning-based image retrieval has received increasing attention in computer vision. Representation embeddings extracted by deep neural networks (DNNs) not only capture the semantic information of an image but also scale to large image retrieval tasks. In this work, we propose a deep learning-based image retrieval approach that uses homography-transformation-augmented contrastive learning to perform large-scale synthetic aperture radar (SAR) image search. Moreover, we propose a training method for the DNNs, induced by contrastive learning, that does not require any labeling procedure, which makes large-scale datasets tractable with relative ease. Finally, we verify the performance of the proposed method by conducting experiments on polarimetric SAR image datasets.
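    A minimal numpy sketch of homography augmentation for building contrastive view pairs is shown below; the random-perturbation parameterisation and the nearest-neighbour warping are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np

def random_homography(rng, jitter=0.05):
    """Identity homography plus a small random perturbation (a hypothetical
    sampling scheme; the paper may parameterise it differently)."""
    H = np.eye(3)
    H[:2, :] += rng.uniform(-jitter, jitter, size=(2, 3))
    H[2, :2] += rng.uniform(-1e-3, 1e-3, size=2)
    return H

def warp(image, H):
    """Nearest-neighbour inverse warp of an (H, W) or (H, W, C) image by H."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(H) @ pts
    src = np.rint(src[:2] / src[2]).astype(int)
    x_s, y_s = src[0].reshape(h, w), src[1].reshape(h, w)
    valid = (x_s >= 0) & (x_s < w) & (y_s >= 0) & (y_s < h)
    out = np.zeros_like(image)
    out[valid] = image[y_s[valid], x_s[valid]]
    return out

rng = np.random.default_rng(0)
img = np.random.rand(64, 64)
view1 = warp(img, random_homography(rng))
view2 = warp(img, random_homography(rng))   # two positive views for contrastive learning
```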
    Trust Your Robots! Predictive Uncertainty Estimation of Neural Networks with Sparse Gaussian Processes. (arXiv:2109.09690v2 [cs.RO] UPDATED)
    (0 min) This paper presents a probabilistic framework to obtain both reliable and fast uncertainty estimates for predictions with Deep Neural Networks (DNNs). Our main contribution is a practical and principled combination of DNNs with sparse Gaussian Processes (GPs). We prove theoretically that DNNs can be seen as a special case of sparse GPs, namely mixtures of GP experts (MoE-GP), and we devise a learning algorithm that brings the derived theory into practice. In experiments from two different robotic tasks -- inverse dynamics of a manipulator and object detection on a micro-aerial vehicle (MAV) -- we show the effectiveness of our approach in terms of predictive uncertainty, improved scalability, and run-time efficiency on a Jetson TX2. We thus argue that our approach can pave the way towards reliable and fast robot learning systems with uncertainty awareness.
    Towards Precise Pruning Points Detection using Semantic-Instance-Aware Plant Models for Grapevine Winter Pruning Automation. (arXiv:2109.07247v1 [cs.RO] CROSS LISTED)
    (0 min) Grapevine winter pruning is a complex task that requires skilled workers to execute correctly, which makes it time-consuming. It requires about 80-120 hours per hectare annually, so an automated robotic system that speeds up the process would be a crucial tool in large vineyards. We describe (a) a novel expert-annotated dataset for grapevine segmentation, (b) a state-of-the-art neural network implementation, and (c) the generation of pruning points following agronomic rules, leveraging the simplified structure of the plant. With this approach, we are able to generate a set of pruning points on the canes, paving the way towards correct automation of grapevine winter pruning.
    DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers. (arXiv:2109.10060v1 [cs.CV])
    (0 min) Dynamic networks have shown promising capability in reducing theoretical computation complexity by adapting their architectures to the input during inference. However, their practical runtime usually lags behind the theoretical acceleration due to inefficient sparsity. Here, we explore a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels, while keeping the parameters stored statically and contiguously in hardware to prevent the extra burden of sparse computation. Based on this scheme, we present the dynamic slimmable network (DS-Net) and the dynamic slice-able network (DS-Net++) by input-dependently adjusting the filter numbers of CNNs and multiple dimensions in both CNNs and Transformers, respectively. To ensure sub-network generality and routing fairness, we propose a disentangled two-stage optimization scheme with training techniques such as in-place bootstrapping (IB), multi-view consistency (MvCo) and sandwich gate sparsification (SGS) to train the supernet and gates separately. Extensive experiments on 4 datasets and 3 different network architectures demonstrate our method consistently outperforms state-of-the-art static and dynamic model compression methods by a large margin (up to 6.6%). Typically, DS-Net++ achieves 2-4x computation reduction and 1.62x real-world acceleration over MobileNet, ResNet-50 and Vision Transformer, with minimal accuracy drops (0.1-0.3%) on ImageNet. Code release: https://github.com/changlin31/DS-Net
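    To make the contiguous-slicing idea concrete, here is a toy numpy sketch of weight slicing for a single linear layer; the slicing ratio and the layer choice are illustrative assumptions, not the DS-Net++ gating scheme.

```python
import numpy as np

def sliced_linear(x, W, b, ratio):
    """Toy dynamic weight slicing for a linear layer.

    Keeps only the first `ratio` fraction of output units, so the retained
    weights stay contiguous in memory (no sparse gather needed).
    x: (N, C_in), W: (C_out, C_in), b: (C_out,).
    """
    k = max(1, int(round(ratio * W.shape[0])))
    return x @ W[:k].T + b[:k]

x = np.random.randn(8, 32)
W, b = np.random.randn(64, 32), np.zeros(64)
easy = sliced_linear(x, W, b, ratio=0.25)   # (8, 16) for "easy" inputs
hard = sliced_linear(x, W, b, ratio=1.0)    # (8, 64) for "hard" inputs
```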
    From Culture to Clothing: Discovering the World Events Behind A Century of Fashion Images. (arXiv:2102.01690v2 [cs.CV] UPDATED)
    (0 min) Fashion is intertwined with external cultural factors, but identifying these links remains a manual process limited to only the most salient phenomena. We propose a data-driven approach to identify specific cultural factors affecting the clothes people wear. Using large-scale datasets of news articles and vintage photos spanning a century, we present a multi-modal statistical model to detect influence relationships between happenings in the world and people's choice of clothing. Furthermore, on two image datasets we apply our model to improve the concrete vision tasks of visual style forecasting and photo timestamping. Our work is a first step towards a computational, scalable, and easily refreshable approach to link culture to clothing.
    Comparison of single and multitask learning for predicting cognitive decline based on MRI data. (arXiv:2109.10266v1 [cs.LG])
    (0 min) The Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) is a neuropsychological tool that has been designed to assess the severity of cognitive symptoms of dementia. Personalized prediction of the changes in ADAS-Cog scores could help in timing therapeutic interventions in dementia and at-risk populations. In the present work, we compared single and multitask learning approaches to predict the changes in ADAS-Cog scores based on T1-weighted anatomical magnetic resonance imaging (MRI). In contrast to most machine learning-based methods for predicting ADAS-Cog changes, we stratified the subjects based on their baseline diagnoses and evaluated the prediction performances in each group. Our experiments indicated a positive relationship between the predicted and observed ADAS-Cog score changes in each diagnostic group, suggesting that T1-weighted MRI has a predictive value for evaluating cognitive decline in the entire AD continuum. We further studied whether correction of the differences in the magnetic field strength of MRI would improve the ADAS-Cog score prediction. The partial least squares-based domain adaptation slightly improved the prediction performance, but the improvement was marginal. In summary, this study demonstrated that ADAS-Cog change could be, to some extent, predicted based on anatomical MRI. Based on this study, the recommended method for learning the predictive models is single-task regularized linear regression due to its simplicity and good performance. It appears important to combine the training data across all subject groups for the most effective predictive models.
    Data InStance Prior (DISP) in Generative Adversarial Networks. (arXiv:2012.04256v2 [cs.CV] UPDATED)
    (0 min) Recent advances in generative adversarial networks (GANs) have shown remarkable progress in generating high-quality images. However, this gain in performance depends on the availability of a large amount of training data. In limited data regimes, training typically diverges, and therefore the generated samples are of low quality and lack diversity. Previous works have addressed training in low data setting by leveraging transfer learning and data augmentation techniques. We propose a novel transfer learning method for GANs in the limited data domain by leveraging informative data prior derived from self-supervised/supervised pre-trained networks trained on a diverse source domain. We perform experiments on several standard vision datasets using various GAN architectures (BigGAN, SNGAN, StyleGAN2) to demonstrate that the proposed method effectively transfers knowledge to domains with few target images, outperforming existing state-of-the-art techniques in terms of image quality and diversity. We also show the utility of data instance prior in large-scale unconditional image generation.
    CI-dataset and DetDSCI methodology for detecting too small and too large critical infrastructures in satellite images: Airports and electrical substations as case study. (arXiv:2105.11844v2 [cs.CV] UPDATED)
    (0 min) The detection of critical infrastructures in large territories represented by aerial and satellite images is of high importance in several fields such as security, anomaly detection, land use planning and land use change detection. However, the detection of such infrastructures is complex as they have highly variable shapes and sizes: some infrastructures, such as electrical substations, are too small while others, such as airports, are too large. Moreover, airports can have either a small or a very large surface area with completely different shapes, which makes their correct detection challenging. As far as we know, these limitations have not been tackled yet in previous works. This paper presents (1) a smart Critical Infrastructure dataset, named CI-dataset, organised into two scales, small- and large-scale critical infrastructures, and (2) a two-level resolution-independent critical infrastructure detection (DetDSCI) methodology that first determines the spatial resolution of the input image using a classification model, then analyses the image using the appropriate detector for that spatial resolution. The present study targets two representative classes, airports and electrical substations. Our experiments show that the DetDSCI methodology achieves up to 37.53% F1 improvement with respect to Faster R-CNN, one of the most influential detection models.
    Unsupervised Cycle-consistent Generative Adversarial Networks for Pan-sharpening. (arXiv:2109.09395v2 [cs.CV] UPDATED)
    (0 min) Deep learning based pan-sharpening has received significant research interest in recent years. Most existing methods fall into the supervised learning framework, in which they down-sample the multi-spectral (MS) and panchromatic (PAN) images and regard the original MS images as ground truths to form training samples. Although impressive performance can be achieved, these methods have difficulty generalizing to the original full-scale images due to the scale gap, which limits their practicality. In this paper, we propose an unsupervised generative adversarial framework that learns from the full-scale images without ground truths to alleviate this problem. We extract the modality-specific features from the PAN and MS images with a two-stream generator, perform fusion in the feature domain, and then reconstruct the pan-sharpened images. Furthermore, we introduce a novel hybrid loss based on the cycle-consistency and adversarial scheme to improve the performance. Comparison experiments with state-of-the-art methods are conducted on GaoFen-2 and WorldView-3 satellites. Results demonstrate that the proposed method can greatly improve pan-sharpening performance on the full-scale images, which clearly shows its practical value. Codes and datasets will be made publicly available.
    Learning Interpretable Concept Groups in CNNs. (arXiv:2109.10078v1 [cs.CV])
    (0 min) We propose a novel training methodology -- Concept Group Learning (CGL) -- that encourages training of interpretable CNN filters by partitioning filters in each layer into concept groups, each of which is trained to learn a single visual concept. We achieve this through a novel regularization strategy that forces filters in the same group to be active in similar image regions for a given layer. We additionally use a regularizer to encourage a sparse weighting of the concept groups in each layer so that a few concept groups can have greater importance than others. We quantitatively evaluate CGL's model interpretability using standard interpretability evaluation techniques and find that our method increases interpretability scores in most cases. Qualitatively we compare the image regions that are most active under filters learned using CGL versus filters learned without CGL and find that CGL activation regions more strongly concentrate around semantically relevant features.
    Multi-Source Video Domain Adaptation with Temporal Attentive Moment Alignment. (arXiv:2109.09964v1 [cs.CV])
    (0 min) Multi-Source Domain Adaptation (MSDA) is a more practical domain adaptation setting for real-world scenarios. It relaxes the assumption in conventional Unsupervised Domain Adaptation (UDA) that source data are sampled from a single domain and match a uniform data distribution. MSDA is more difficult due to the existence of different domain shifts between distinct domain pairs. When considering videos, the negative transfer can be provoked by spatial-temporal features and can be formulated into a more challenging Multi-Source Video Domain Adaptation (MSVDA) problem. In this paper, we address the MSVDA problem by proposing a novel Temporal Attentive Moment Alignment Network (TAMAN), which aims for effective feature transfer by dynamically aligning both spatial and temporal feature moments. TAMAN further constructs robust global temporal features by attending to dominant domain-invariant local temporal features with high local classification confidence and a low disparity between global and local feature discrepancies. To facilitate future research on the MSVDA problem, we introduce comprehensive benchmarks covering extensive MSVDA scenarios. Empirical results demonstrate a superior performance of the proposed TAMAN across multiple MSVDA benchmarks.
    Adversarial Example Detection for DNN Models: A Review and Experimental Comparison. (arXiv:2105.00203v2 [cs.CV] UPDATED)
    (0 min) Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision based applications, such as security surveillance systems, autonomous vehicles and healthcare. Such safety-critical applications can only be deployed successfully once they are able to overcome safety-critical challenges. Among these challenges are the defense against and/or the detection of adversarial examples (AEs). Adversaries can carefully craft small, often imperceptible, noise called perturbations to be added to the clean image to generate the AE. The aim of the AE is to fool the DL model, which makes it a potential risk for DL applications. Many test-time evasion attacks and countermeasures, i.e., defense or detection methods, have been proposed in the literature. Moreover, the few published reviews and surveys present the taxonomy of threats and countermeasure methods mostly theoretically, with little focus on AE detection methods. In this paper, we focus on image classification tasks and attempt to provide a survey of detection methods for test-time evasion attacks on neural network classifiers. A detailed discussion of such methods is provided, with experimental results for eight state-of-the-art detectors under different scenarios on four datasets. We also provide potential challenges and future perspectives for this research direction.
    Survey: Transformer based Video-Language Pre-training. (arXiv:2109.09920v1 [cs.CV])
    (2 min) Inspired by the success of transformer-based pre-training methods on natural language tasks and further computer vision tasks, researchers have begun to apply Transformers to video processing. This survey aims to give a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the Transformer structure as background knowledge, including the attention mechanism, position encoding, etc. We then describe the typical paradigm of pre-training and fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations and compare their performances. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.
    Tensor Pooling Driven Instance Segmentation Framework for Baggage Threat Recognition. (arXiv:2108.09603v2 [cs.CV] UPDATED)
    (2 min) Automated systems designed for screening contraband items from the X-ray imagery are still facing difficulties with high clutter, concealment, and extreme occlusion. In this paper, we addressed this challenge using a novel multi-scale contour instance segmentation framework that effectively identifies the cluttered contraband data within the baggage X-ray scans. Unlike standard models that employ region-based or keypoint-based techniques to generate multiple boxes around objects, we propose to derive proposals according to the hierarchy of the regions defined by the contours. The proposed framework is rigorously validated on three public datasets, dubbed GDXray, SIXray, and OPIXray, where it outperforms the state-of-the-art methods by achieving the mean average precision score of 0.9779, 0.9614, and 0.8396, respectively. Furthermore, to the best of our knowledge, this is the first contour instance segmentation framework that leverages multi-scale information to recognize cluttered and concealed contraband data from the colored and grayscale security X-ray imagery.
    Real-Time Trash Detection for Modern Societies using CCTV to Identifying Trash by utilizing Deep Convolutional Neural Network. (arXiv:2109.09611v2 [cs.CV] UPDATED)
    (2 min) To keep the environment clean, modern societies need a modern solution for trash pollution, including strict action against people caught throwing trash. The evolution of artificial intelligence (AI), especially deep learning, provides an excellent opportunity to develop real-time trash detection using CCTV cameras. This project performs real-time trash detection using a deep convolutional neural network (CNN) model that detects eight classes: masks, tissue papers, shoppers, boxes, automobile parts, pampers, bottles, and juice boxes. After detecting trash, the camera records a ten-second video of the person who threw it. A challenging part of this work was preparing a complex custom dataset, which was time-consuming and consists of more than 2,100 images. The CNN model was created, labeled, and trained, and detection time, accuracy, and mean average precision (mAP) were used to benchmark both models' performance. In the experimental phase, the mAP and accuracy of the improved CNN model were superior in all aspects. The model is deployed on a CCTV camera to detect trash in real time.
    Oriented Object Detection in Aerial Images Based on Area Ratio of Parallelogram. (arXiv:2109.10187v1 [cs.CV])
    (2 min) Rotated object detection is a challenging task in aerial images, as objects in aerial images are displayed in arbitrary orientations and are usually densely packed. Although considerable progress has been made, existing regression-based rotation detectors still suffer from the problem of discontinuous boundaries, which is directly caused by angular periodicity or corner ordering. In this paper, we propose a simple and effective framework to address these challenges. Instead of directly regressing the five parameters (coordinates of the central point, width, height, and rotation angle) or the four vertices, we use the area ratio of parallelogram (ARP) to accurately describe a multi-oriented object. Specifically, we regress the coordinates of the center point, the height and width of the minimum circumscribed rectangle of the oriented object, and three area ratios $\lambda_1$, $\lambda_2$ and $\lambda_3$. This facilitates offset learning and avoids the issues of angular periodicity and label point ordering for oriented objects. To further remedy the confusion for nearly horizontal objects, we employ the area ratio between the object and its horizontal bounding box (minimum circumscribed rectangle) to guide the selection of horizontal or oriented detection for each object. We also propose a rotation-efficient IoU loss (R-EIoU) to connect the horizontal bounding box with the three area ratios and improve the accuracy of the rotated bounding box. Experimental results on three remote sensing datasets, including HRSC2016, DOTA and UCAS-AOD, and on the scene text dataset ICDAR2015 show that our method achieves superior detection performance compared with many state-of-the-art approaches. The code and model will be released upon publication of the paper.
    SemCal: Semantic LiDAR-Camera Calibration using Neural Mutual Information Estimator. (arXiv:2109.10270v1 [cs.CV])
    (2 min) This paper proposes SemCal: an automatic, targetless, extrinsic calibration algorithm for a LiDAR and camera system using semantic information. We leverage a neural information estimator to estimate the mutual information (MI) of semantic information extracted from each sensor measurement, facilitating semantic-level data association. By using a matrix exponential formulation of the $se(3)$ transformation and a kernel-based sampling method to sample from the camera measurement based on LiDAR projected points, we can formulate the LiDAR-Camera calibration problem as a novel differentiable objective function that supports gradient-based optimization methods. We also introduce a semantic-based initial calibration method using 2D MI-based image registration and a Perspective-n-Point (PnP) solver. To evaluate performance, we demonstrate the robustness of our method and quantitatively analyze the accuracy using a synthetic dataset. We also evaluate our algorithm qualitatively on two benchmark datasets, an urban dataset (KITTI360) and an off-road dataset (RELLIS-3D), using both hand-annotated ground truth labels as well as labels predicted by state-of-the-art deep learning models, showing improvement over recent comparable calibration approaches.
    Cross-Descriptor Visual Localization and Mapping. (arXiv:2012.01377v2 [cs.CV] UPDATED)
    (2 min) Visual localization and mapping is the key technology underlying the majority of mixed reality and robotics systems. Most state-of-the-art approaches rely on local features to establish correspondences between images. In this paper, we present three novel scenarios for localization and mapping which require the continuous update of feature representations and the ability to match across different feature types. While localization and mapping is a fundamental computer vision problem, the traditional setup supposes the same local features are used throughout the evolution of a map. Thus, whenever the underlying features are changed, the whole process is repeated from scratch. However, this is typically impossible in practice, because raw images are often not stored and re-building the maps could lead to loss of the attached digital content. To overcome the limitations of current approaches, we present the first principled solution to cross-descriptor localization and mapping. Our data-driven approach is agnostic to the feature descriptor type, has low computational requirements, and scales linearly with the number of description algorithms. Extensive experiments demonstrate the effectiveness of our approach on state-of-the-art benchmarks for a variety of handcrafted and learned features.
    Comparing object recognition in humans and deep convolutional neural networks -- An eye tracking study. (arXiv:2108.00107v2 [cs.CV] UPDATED)
    (2 min) Deep convolutional neural networks (DCNNs) and the ventral visual pathway share vast architectural and functional similarities in visual challenges such as object recognition. Recent insights have demonstrated that both hierarchical cascades can be compared in terms of both exerted behavior and underlying activation. However, these approaches ignore key differences in spatial priorities of information processing. In this proof-of-concept study, we demonstrate a comparison of human observers (N = 45) and three feedforward DCNNs through eye tracking and saliency maps. The results reveal fundamentally different resolutions in both visualization methods that need to be considered for an insightful comparison. Moreover, we provide evidence that a DCNN with biologically plausible receptive field sizes called vNet reveals higher agreement with human viewing behavior as contrasted with a standard ResNet architecture. We find that image-specific factors such as category, animacy, arousal, and valence have a direct link to the agreement of spatial object recognition priorities in humans and DCNNs, while other measures such as difficulty and general image properties do not. With this approach, we try to open up new perspectives at the intersection of biological and computer vision research.
    Weak Adaptation Learning -- Addressing Cross-domain Data Insufficiency with Weak Annotator. (arXiv:2102.07358v3 [cs.LG] UPDATED)
    (2 min) Data quantity and quality are crucial factors for data-driven learning methods. In some target problem domains, there are not many data samples available, which could significantly hinder the learning process. While data from similar domains may be leveraged to help through domain adaptation, obtaining high-quality labeled data for those source domains themselves could be difficult or costly. To address such challenges on data insufficiency for classification problem in a target domain, we propose a weak adaptation learning (WAL) approach that leverages unlabeled data from a similar source domain, a low-cost weak annotator that produces labels based on task-specific heuristics, labeling rules, or other methods (albeit with inaccuracy), and a small amount of labeled data in the target domain. Our approach first conducts a theoretical analysis on the error bound of the trained classifier with respect to the data quantity and the performance of the weak annotator, and then introduces a multi-stage weak adaptation learning method to learn an accurate classifier by lowering the error bound. Our experiments demonstrate the effectiveness of our approach in learning an accurate classifier with limited labeled data in the target domain and unlabeled data in the source domain.
    LOTR: Face Landmark Localization Using Localization Transformer. (arXiv:2109.10057v1 [cs.CV])
    (2 min) This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in the feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts the landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces the smooth-Wing loss function, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches.
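    For reference, the standard Wing loss that the smooth-Wing loss modifies can be written as below; this is the commonly cited formulation (Feng et al.), not the paper's smooth-Wing variant, and the parameter values are the usual defaults rather than the authors' settings.

```python
import numpy as np

def wing_loss(x, w=10.0, eps=2.0):
    """Standard Wing loss on residuals x.

    Its gradient jumps at x = 0 and at |x| = w, which is the kind of
    discontinuity the smooth-Wing loss is designed to remove.
    """
    x = np.abs(x)
    C = w - w * np.log(1.0 + w / eps)
    return np.where(x < w, w * np.log(1.0 + x / eps), x - C)

print(wing_loss(np.array([0.5, 5.0, 50.0])))
```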
    3D Point Cloud Completion with Geometric-Aware Adversarial Augmentation. (arXiv:2109.10161v1 [cs.CV])
    (2 min) With the popularity of 3D sensors in self-driving and other robotics applications, extensive research has focused on designing novel neural network architectures for accurate 3D point cloud completion. However, unlike in point cloud classification and reconstruction, the role of adversarial samples in 3D point cloud completion has seldom been explored. In this work, we show that training with adversarial samples can improve the performance of neural networks on 3D point cloud completion tasks. We propose a novel approach to generate adversarial samples that benefits performance on both clean and adversarial samples. In contrast to the PGD-k attack, our method generates adversarial samples that keep the geometric features of clean samples and contain few outliers. In particular, we use principal directions to constrain the adversarial perturbations for each input point: the gradient components in the mean direction of the principal directions are taken as adversarial perturbations. In addition, we also investigate the effect of using the minimum curvature direction. Furthermore, we adopt attack-strength accumulation and auxiliary Batch Normalization layers to speed up the training process and alleviate the distribution mismatch between clean and adversarial samples. Experimental results show that training with the adversarial samples crafted by our method effectively enhances the performance of PCN on the ShapeNet dataset.
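    The numpy sketch below shows one rough reading of the principal-direction constraint: for each point, the gradient is projected onto the mean of the local principal directions estimated from its nearest neighbours. The neighbourhood size and the use of the first two singular vectors are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def constrained_perturbation(points, grad, k=8):
    """Keep only the gradient component along the mean of each point's local
    principal directions (an illustrative reading of the abstract).

    points, grad: (N, 3) arrays; k: number of nearest neighbours for local PCA.
    """
    N = points.shape[0]
    out = np.zeros_like(grad)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    for i in range(N):
        nn = np.argsort(d2[i])[1:k + 1]                  # k nearest neighbours
        local = points[nn] - points[nn].mean(0)
        _, _, Vt = np.linalg.svd(local, full_matrices=False)
        direction = Vt[:2].mean(0)                       # mean of the two principal directions
        direction /= np.linalg.norm(direction) + 1e-12
        out[i] = np.dot(grad[i], direction) * direction  # projected gradient component
    return out

pts = np.random.randn(128, 3)
pert = constrained_perturbation(pts, np.random.randn(128, 3))
```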
    Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images. (arXiv:2004.14487v3 [cs.CV] UPDATED)
    (2 min) The connection between visual input and tactile sensing is critical for object manipulation tasks such as grasping and pushing. In this work, we introduce the challenging task of estimating a set of tactile physical properties from visual information. We aim to build a model that learns the complex mapping between visual information and tactile physical properties. We construct a first of its kind image-tactile dataset with over 400 multiview image sequences and the corresponding tactile properties. A total of fifteen tactile physical properties across categories including friction, compliance, adhesion, texture, and thermal conductance are measured and then estimated by our models. We develop a cross-modal framework comprised of an adversarial objective and a novel visuo-tactile joint classification loss. Additionally, we develop a neural architecture search framework capable of selecting optimal combinations of viewing angles for estimating a given physical property.
    Towards Discovery and Attribution of Open-world GAN Generated Images. (arXiv:2105.04580v2 [cs.CV] UPDATED)
    (2 min) With the recent progress in Generative Adversarial Networks (GANs), it is imperative for media and visual forensics to develop detectors which can identify and attribute images to the model generating them. Existing works have shown to attribute images to their corresponding GAN sources with high accuracy. However, these works are limited to a closed set scenario, failing to generalize to GANs unseen during train time and are therefore, not scalable with a steady influx of new GANs. We present an iterative algorithm for discovering images generated from previously unseen GANs by exploiting the fact that all GANs leave distinct fingerprints on their generated images. Our algorithm consists of multiple components including network training, out-of-distribution detection, clustering, merge and refine steps. Through extensive experiments, we show that our algorithm discovers unseen GANs with high accuracy and also generalizes to GANs trained on unseen real datasets. We additionally apply our algorithm to attribution and discovery of GANs in an online fashion as well as to the more standard task of real/fake detection. Our experiments demonstrate the effectiveness of our approach to discover new GANs and can be used in an open-world setup.
    Scale-aware direct monocular odometry. (arXiv:2109.10077v1 [cs.RO])
    (2 min) We present a framework for direct monocular odometry based on depth prediction from a deep neural network. In contrast with existing methods where depth information is only partially exploited, we formulate a novel depth prediction residual which allows us to incorporate multi-view depth information. In addition, we propose to use a truncated robust cost function which prevents considering inconsistent depth estimations. The photometric and depth-prediction measurements are integrated in a tightly-coupled optimization leading to a scale-aware monocular system which does not accumulate scale drift. We demonstrate the validity of our proposal evaluating it on the KITTI odometry dataset and comparing it with state-of-the-art monocular and stereo SLAM systems. Experiments show that our proposal largely outperforms classic monocular SLAM, being 5 to 9 times more precise, with an accuracy which is closer to that of stereo systems.
    Skeleton-Graph: Long-Term 3D Motion Prediction From 2D Observations Using Deep Spatio-Temporal Graph CNNs. (arXiv:2109.10257v1 [cs.CV])
    (2 min) Several applications, such as autonomous driving, augmented reality and virtual reality, require a precise prediction of the 3D human pose. Recently, a new problem was introduced in the field: predicting 3D human poses from observed 2D poses. We propose Skeleton-Graph, a deep spatio-temporal graph CNN model that predicts future 3D skeleton poses in a single pass from the 2D ones. Unlike prior works, Skeleton-Graph focuses on modeling the interaction between the skeleton joints by exploiting their spatial configuration. This is achieved by formulating the problem as a graph structure while learning a suitable graph adjacency kernel. By design, Skeleton-Graph predicts the future 3D poses without long-term divergence, unlike prior works. We also introduce a new metric that measures the divergence of predictions over the long term. Our results show an FDE improvement of at least 27% and an ADE improvement of 4% on the GTA-IM and PROX datasets in comparison with prior works. Also, our predictions show 88% and 93% less divergence in long-term motion prediction compared with prior works on the GTA-IM and PROX datasets, respectively. https://github.com/abduallahmohamed/Skeleton-Graph.git
    VPN: Video Provenance Network for Robust Content Attribution. (arXiv:2109.10038v1 [cs.CV])
    (2 min) We present VPN - a content attribution method for recovering provenance information from videos shared online. Platforms, and users, often transform video into different quality, codecs, sizes, shapes, etc. or slightly edit its content such as adding text or emoji, as they are redistributed online. We learn a robust search embedding for matching such video, invariant to these transformations, using full-length or truncated video queries. Once matched against a trusted database of video clips, associated information on the provenance of the clip is presented to the user. We use an inverted index to match temporal chunks of video using late-fusion to combine both visual and audio features. In both cases, features are extracted via a deep neural network trained using contrastive learning on a dataset of original and augmented video clips. We demonstrate high accuracy recall over a corpus of 100,000 videos.
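    As a simplified illustration of indexing temporal chunks, the snippet below builds an inverted index over quantised chunk embeddings, using random-hyperplane hashing as a stand-in quantiser; the embedding dimension, hashing scheme and function names are assumptions, not the VPN implementation.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
PLANES = rng.standard_normal((12, 512))          # fixed LSH hyperplanes for 512-d embeddings

def quantise(emb):
    """Hash a chunk embedding to a discrete token via random-hyperplane LSH."""
    bits = (PLANES @ emb > 0).astype(int)
    return int("".join(map(str, bits)), 2)

index = defaultdict(list)                        # token -> list of (video_id, chunk_id)

def add_chunk(video_id, chunk_id, emb):
    index[quantise(emb)].append((video_id, chunk_id))

def query_chunk(emb):
    return index.get(quantise(emb), [])

add_chunk("vid_0", 3, np.random.randn(512))
print(query_chunk(np.random.randn(512)))         # likely [] for an unrelated query
```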
    Deep Edge-Aware Interactive Colorization against Color-Bleeding Effects. (arXiv:2107.01619v2 [cs.CV] UPDATED)
    (2 min) Deep neural networks for automatic image colorization often suffer from the color-bleeding artifact, a problematic color spreading near the boundaries between adjacent objects. Such color-bleeding artifacts degrade the realism of generated outputs, limiting the applicability of colorization models in practice. Although previous approaches have attempted to address this problem in an automatic manner, they tend to work only in limited cases where a high contrast of gray-scale values is given in the input image. Alternatively, leveraging user interactions would be a promising approach for solving these color-bleeding artifacts. In this paper, we propose a novel edge-enhancing network for regions of interest, guided by simple user scribbles indicating where to enhance. In addition, our method requires a minimal amount of effort from users to achieve satisfactory enhancement. Experimental results demonstrate that our interactive edge-enhancing approach effectively reduces color-bleeding artifacts compared with existing baselines across various datasets.
    Self-Supervised Action-Space Prediction for Automated Driving. (arXiv:2109.10024v1 [cs.RO])
    (2 min) Making informed driving decisions requires reliable prediction of other vehicles' trajectories. In this paper, we present a novel learned multi-modal trajectory prediction architecture for automated driving. It achieves kinematically feasible predictions by casting the learning problem into the space of accelerations and steering angles -- by performing action-space prediction, we can leverage valuable model knowledge. Additionally, the dimensionality of the action manifold is lower than that of the state manifold, whose intrinsically correlated states are more difficult to capture in a learned manner. For the purpose of action-space prediction, we present the simple Feed-Forward Action-Space Prediction (FFW-ASP) architecture. Then, we build on this notion and introduce the novel Self-Supervised Action-Space Prediction (SSP-ASP) architecture that outputs future environment context features in addition to trajectories. A key element in the self-supervised architecture is that, based on an observed action history and past context features, future context features are predicted prior to future trajectories. The proposed methods are evaluated on real-world datasets containing urban intersections and roundabouts, and show accurate predictions, outperforming state-of-the-art for kinematically feasible predictions in several prediction metrics.
    Data-driven controllers and the need for perception systems in underwater manipulation. (arXiv:2109.10327v1 [cs.RO])
    (2 min) The underwater environment poses a complex problem for developing autonomous capabilities for Underwater Vehicle Manipulator Systems (UVMSs). The modeling of UVMSs is a complicated and costly process due to the highly nonlinear dynamics and the presence of unknown hydrodynamical effects. This is aggravated in tasks where the manipulation of objects is necessary, as this may not only introduce external disturbances that can lead to a fast degradation of the control system performance, but also requires coordination with a vision system for the correct grasping and operation of the object. In this article, we introduce a control strategy for UVMSs working with unknown payloads. The proposed control strategy is based on a data-driven optimal controller. We present a number of experimental results showing the benefits of the proposed strategy. Furthermore, we include a discussion regarding the visual perception requirements for the UVMS in order to achieve full autonomy in underwater manipulation tasks with unknown payloads.
    Survey on Semantic Stereo Matching / Semantic Depth Estimation. (arXiv:2109.10123v1 [cs.CV])
    (2 min) Stereo matching is one of the most widely used techniques for inferring depth from stereo images, owing to its robustness and speed. It has become one of the major topics of research since it finds applications in autonomous driving, robotic navigation, 3D reconstruction, and many other fields. Finding pixel correspondences in non-textured, occluded and reflective areas is the major challenge in stereo matching. Recent developments have shown that semantic cues from image segmentation can be used to improve the results of stereo matching. Many deep neural network architectures have been proposed to leverage the advantages of semantic segmentation in stereo matching. This paper aims to give a comparison among state-of-the-art networks both in terms of accuracy and in terms of speed, which is of high importance in real-time applications.
    MESSFN : a Multi-level and Enhanced Spectral-Spatial Fusion Network for Pan-sharpening. (arXiv:2109.09937v1 [cs.CV])
    (2 min) Dominant pan-sharpening frameworks simply concatenate the MS stream and the PAN stream once at a specific level. This way of fusion neglects the multi-level spectral-spatial correlation between the two streams, which is vital to improving the fusion performance. In consideration of this, we propose a Multi-level and Enhanced Spectral-Spatial Fusion Network (MESSFN) with the following innovations. First, to fully exploit and strengthen the above correlation, a Hierarchical Multi-level Fusion Architecture (HMFA) is carefully designed. A novel Spectral-Spatial (SS) stream is established to hierarchically derive and fuse the multi-level prior spectral and spatial expertise from the MS stream and the PAN stream. This helps the SS stream master a joint spectral-spatial representation in the hierarchical network for better modeling the fusion relationship. Second, to provide superior expertise, two feature extraction blocks are specially developed based on the intrinsic characteristics of the MS image and the PAN image. In the MS stream, a Residual Spectral Attention Block (RSAB) is proposed to mine the potential spectral correlations between different spectra of the MS image through adjacent cross-spectrum interaction. In the PAN stream, a Residual Multi-scale Spatial Attention Block (RMSAB) is proposed to capture multi-scale information and reconstruct precise high-frequency details from the PAN image through an improved spatial attention-based inception structure. The spectral and spatial feature representations are thus enhanced. Extensive experiments on two datasets demonstrate that the proposed network is competitive with or better than state-of-the-art methods. Our code can be found on GitHub.
    KDFNet: Learning Keypoint Distance Field for 6D Object Pose Estimation. (arXiv:2109.10127v1 [cs.CV])
    (2 min) We present KDFNet, a novel method for 6D object pose estimation from RGB images. To handle occlusion, many recent works have proposed to localize 2D keypoints through pixel-wise voting and solve a Perspective-n-Point (PnP) problem for pose estimation, which achieves leading performance. However, such voting process is direction-based and cannot handle long and thin objects where the direction intersections cannot be robustly found. To address this problem, we propose a novel continuous representation called Keypoint Distance Field (KDF) for projected 2D keypoint locations. Formulated as a 2D array, each element of the KDF stores the 2D Euclidean distance between the corresponding image pixel and a specified projected 2D keypoint. We use a fully convolutional neural network to regress the KDF for each keypoint. Using this KDF encoding of projected object keypoint locations, we propose to use a distance-based voting scheme to localize the keypoints by calculating circle intersections in a RANSAC fashion. We validate the design choices of our framework by extensive ablation experiments. Our proposed method achieves state-of-the-art performance on Occlusion LINEMOD dataset with an average ADD(-S) accuracy of 50.3% and TOD dataset mug subset with an average ADD accuracy of 75.72%. Extensive experiments and visualizations demonstrate that the proposed method is able to robustly estimate the 6D pose in challenging scenarios including occlusion.
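    The KDF itself is easy to state precisely from the description above: each entry stores the Euclidean distance from a pixel to the projected 2D keypoint. A minimal numpy version follows; the distance-based voting and RANSAC circle-intersection steps are omitted, and the image size and keypoint are example values.

```python
import numpy as np

def keypoint_distance_field(shape, keypoint):
    """Compute the KDF described in the abstract: a 2D array whose entry (y, x)
    is the Euclidean distance from pixel (x, y) to the projected 2D keypoint."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    return np.sqrt((xs - keypoint[0]) ** 2 + (ys - keypoint[1]) ** 2)

kdf = keypoint_distance_field((480, 640), keypoint=(320.0, 240.0))
```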
    FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging. (arXiv:2109.09658v2 [cs.CV] UPDATED)
    (2 min) The recent advancements in artificial intelligence (AI), combined with the extensive amount of data generated by today's clinical systems, have led to the development of imaging AI solutions across the whole value chain of medical imaging, including image reconstruction, medical image segmentation, image-based diagnosis and treatment planning. Notwithstanding the successes and future potential of AI in medical imaging, many stakeholders are concerned about the potential risks and ethical implications of imaging AI solutions, which are perceived as complex, opaque, and difficult to comprehend, utilise, and trust in critical clinical applications. Despite these concerns and risks, there are currently no concrete guidelines and best practices for guiding future AI developments in medical imaging towards increased trust, safety and adoption. To bridge this gap, this paper introduces a careful selection of guiding principles drawn from the accumulated experiences, consensus, and best practices from five large European projects on AI in Health Imaging. These guiding principles are named FUTURE-AI and their building blocks consist of (i) Fairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness and (vi) Explainability. In a step-by-step approach, these guidelines are further translated into a framework of concrete recommendations for specifying, developing, evaluating, and deploying technically, clinically and ethically trustworthy AI solutions into clinical practice.
    Single Person Pose Estimation: A Survey. (arXiv:2109.10056v1 [cs.CV])
    (2 min) Human pose estimation in unconstrained images and videos is a fundamental computer vision task. To illustrate the evolutionary path of the techniques, in this survey we summarize representative human pose estimation methods in a structured taxonomy, with a particular focus on deep learning models and the single-person image setting. Specifically, we examine and survey all the components of a typical human pose estimation pipeline, including data augmentation, model architecture and backbone, supervision representation, post-processing, standard datasets, and evaluation metrics. To envisage future directions, we finally discuss the key unsolved problems and potential trends for human pose estimation.
    TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. (arXiv:2109.10282v1 [cs.CL])
    (2 min) Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.
    Unsupervised Abstract Reasoning for Raven's Problem Matrices. (arXiv:2109.10011v1 [cs.CV])
    (2 min) Raven's Progressive Matrices (RPM) is highly correlated with human intelligence, and it has been widely used to measure the abstract reasoning ability of humans. In this paper, to study the abstract reasoning capability of deep neural networks, we propose the first unsupervised learning method for solving RPM problems. Since the ground truth labels are not allowed, we design a pseudo target based on the prior constraints of the RPM formulation to approximate the ground truth label, which effectively converts the unsupervised learning strategy into a supervised one. However, the correct answer is wrongly labelled by the pseudo target, and thus the noisy contrast will lead to inaccurate model training. To alleviate this issue, we propose to improve the model performance with negative answers. Moreover, we develop a decentralization method to adapt the feature representation to different RPM problems. Extensive experiments on three datasets demonstrate that our method even outperforms some of the supervised approaches. Our code is available at https://github.com/visiontao/ncd.
    ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators. (arXiv:2108.09432v2 [cs.CV] UPDATED)
    (2 min) This paper introduces an unsupervised loss for training parametric deformation shape generators. The key idea is to enforce the preservation of local rigidity among the generated shapes. Our approach builds on an approximation of the as-rigid-as-possible (or ARAP) deformation energy. We show how to develop the unsupervised loss via a spectral decomposition of the Hessian of the ARAP energy. Our loss nicely decouples pose and shape variations through a robust norm. The loss admits simple closed-form expressions. It is easy to train and can be plugged into any standard generation models, e.g., variational auto-encoder (VAE) and auto-decoder (AD). Experimental results show that our approach outperforms existing shape generation approaches considerably on public benchmark datasets of various shape categories such as human, animal and bone.
    Encoding Clinical Priori in 3D Convolutional Neural Networks for Prostate Cancer Detection in bpMRI. (arXiv:2011.00263v4 [eess.IV] UPDATED)
    (2 min) We hypothesize that anatomical priors can be viable mediums to infuse domain-specific clinical knowledge into state-of-the-art convolutional neural networks (CNN) based on the U-Net architecture. We introduce a probabilistic population prior which captures the spatial prevalence and zonal distinction of clinically significant prostate cancer (csPCa), in order to improve its computer-aided detection (CAD) in bi-parametric MR imaging (bpMRI). To evaluate performance, we train 3D adaptations of the U-Net, U-SEResNet, UNet++ and Attention U-Net using 800 institutional training-validation scans, paired with radiologically-estimated annotations and our computed prior. For 200 independent testing bpMRI scans with histologically-confirmed delineations of csPCa, our proposed method of encoding clinical priori demonstrates a strong ability to improve patient-based diagnosis (up to 8.70% increase in AUROC) and lesion-level detection (average increase of 1.08 pAUC between 0.1-10 false positives per patient) across all four architectures.
    Automated segmentation and extraction of posterior eye segment using OCT scans. (arXiv:2109.10000v1 [eess.IV])
    (2 min) This paper proposes an automated method for the segmentation and extraction of the posterior segment of the human eye, including the vitreous, retina, choroid, and sclera compartments, using multi-vendor optical coherence tomography (OCT) scans. The proposed method works in two phases. It first extracts the retinal pigment epithelium (RPE) layer by applying an adaptive thresholding technique to identify the retina-choroid junction. Then, it exploits a structure-tensor-guided approach to extract the inner limiting membrane (ILM) and choroidal stroma (CS) layers, locating the vitreous-retina and choroid-sclera junctions in the candidate OCT scan. Furthermore, these three junction boundaries are utilized to conduct posterior eye compartmentalization effectively for OCT scans of both healthy and diseased eyes. The proposed framework is evaluated over 1000 OCT scans, where it obtained mean intersection over union (IoU) and mean Dice similarity coefficient (DSC) scores of 0.874 and 0.930, respectively.
    OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association. (arXiv:2103.02440v2 [cs.CV] UPDATED)
    (2 min) Many image-based perception tasks can be formulated as detecting, associating and tracking semantic keypoints, e.g., human body pose estimation and tracking. In this work, we present a general framework that jointly detects and forms spatio-temporal keypoint associations in a single stage, making this the first real-time pose detection and tracking algorithm. We present a generic neural network architecture that uses Composite Fields to detect and construct a spatio-temporal pose which is a single, connected graph whose nodes are the semantic keypoints (e.g., a person's body joints) in multiple frames. For the temporal associations, we introduce the Temporal Composite Association Field (TCAF) which requires an extended network architecture and training method beyond previous Composite Fields. Our experiments show competitive accuracy while being an order of magnitude faster on multiple publicly available datasets such as COCO, CrowdPose and the PoseTrack 2017 and 2018 datasets. We also show that our method generalizes to any class of semantic keypoints such as car and animal parts to provide a holistic perception framework that is well suited for urban mobility such as self-driving cars and delivery robots.
    Does Vision-and-Language Pretraining Improve Lexical Grounding?. (arXiv:2109.10246v1 [cs.CL])
    (2 min) Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world. Vision-and-Language (VL) models, trained jointly on text and image or video data, have been offered as a response to such criticisms. However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text-only counterparts. This paper compares the semantic representations learned via VL vs. text-only pretraining for two recent VL models using a suite of analyses (clustering, probing, and performance on a commonsense question answering task) in a language-only setting. We find that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.
    Bayesian Confidence Calibration for Epistemic Uncertainty Modelling. (arXiv:2109.10092v1 [cs.CV])
    (2 min) Modern neural networks have been found to be miscalibrated in terms of confidence calibration, i.e., their predicted confidence scores do not reflect the observed accuracy or precision. Recent work has introduced methods for post-hoc confidence calibration for classification as well as for object detection to address this issue. Especially in safety critical applications, it is crucial to obtain a reliable self-assessment of a model. But what if the calibration method itself is uncertain, e.g., due to an insufficient knowledge base? We introduce Bayesian confidence calibration - a framework to obtain calibrated confidence estimates in conjunction with an uncertainty of the calibration method. Commonly, Bayesian neural networks (BNN) are used to indicate a network's uncertainty about a certain prediction. BNNs are interpreted as neural networks that use distributions instead of weights for inference. We transfer this idea of using distributions to confidence calibration. For this purpose, we use stochastic variational inference to build a calibration mapping that outputs a probability distribution rather than a single calibrated estimate. Using this approach, we achieve state-of-the-art calibration performance for object detection. Finally, we show that this additional type of uncertainty can be used as a sufficient criterion for covariate shift detection. All code is open source and available at https://github.com/EFS-OpenSource/calibration-framework.
    An Optimal Control Framework for Joint-channel Parallel MRI Reconstruction without Coil Sensitivities. (arXiv:2109.09738v1 [eess.IV])
    (2 min) Goal: This work aims at developing a novel calibration-free fast parallel MRI (pMRI) reconstruction method incorporating a discrete-time optimal control framework. The reconstruction model is designed to learn a regularization that combines channels and extracts features by leveraging the information sharing among channels of multi-coil images. We propose to recover both magnitude and phase information by taking advantage of structured multilayer convolutional networks in image and Fourier spaces. Methods: We develop a novel variational model with a learnable objective function that integrates an adaptive multi-coil image combination operator and effective image regularization in the image and Fourier spaces. We cast the reconstruction network as a structured discrete-time optimal control system, resulting in an optimal control formulation of parameter training where the parameters of the objective function play the role of control variables. We demonstrate that the Lagrangian method for solving the control problem is equivalent to back-propagation, ensuring the local convergence of the training algorithm. Results: We conduct a large number of numerical experiments of the proposed method with comparisons to several state-of-the-art pMRI reconstruction networks on real pMRI datasets. The numerical results clearly demonstrate the promising performance of the proposed method. Conclusion: The proposed method provides a general deep network design and training framework for efficient joint-channel pMRI reconstruction. Significance: By learning a multi-coil image combination operator and performing regularization in both the image domain and k-space domain, the proposed method achieves a highly efficient image reconstruction network for pMRI.
    Learning PAC-Bayes Priors for Probabilistic Neural Networks. (arXiv:2109.10304v1 [cs.LG])
    (2 min) Recent works have investigated deep learning models trained by optimising PAC-Bayes bounds, with priors that are learnt on subsets of the data. This combination has been shown to lead not only to accurate classifiers, but also to remarkably tight risk certificates, bearing promise towards self-certified learning (i.e. use all the data to learn a predictor and certify its quality). In this work, we empirically investigate the role of the prior. We experiment on 6 datasets with different strategies and amounts of data to learn data-dependent PAC-Bayes priors, and we compare them in terms of their effect on test performance of the learnt predictors and tightness of their risk certificate. We ask what is the optimal amount of data which should be allocated for building the prior and show that the optimum may be dataset dependent. We demonstrate that using a small percentage of the prior-building data for validation of the prior leads to promising results. We include a comparison of underparameterised and overparameterised models, along with an empirical study of different training objectives and regularisation strategies to learn the prior distribution.
    Integrated Construction of Multimodal Atlases with Structural Connectomes in the Space of Riemannian Metrics. (arXiv:2109.09808v1 [cs.CV])
    (2 min) The structural network of the brain, or structural connectome, can be represented by fiber bundles generated by a variety of tractography methods. While such methods give qualitative insights into brain structure, there is controversy over whether they can provide quantitative information, especially at the population level. In order to enable population-level statistical analysis of the structural connectome, we propose representing a connectome as a Riemannian metric, which is a point on an infinite-dimensional manifold. We equip this manifold with the Ebin metric, a natural metric structure for this space, to get a Riemannian manifold along with its associated geometric properties. We then use this Riemannian framework to apply object-oriented statistical analysis to define an atlas as the Fr\'echet mean of a population of Riemannian metrics. This formulation ties into the existing framework for diffeomorphic construction of image atlases, allowing us to construct a multimodal atlas by simultaneously integrating complementary white matter structure details from DWMRI and cortical details from T1-weighted MRI. We illustrate our framework with 2D data examples of connectome registration and atlas formation. Finally, we build an example 3D multimodal atlas using T1 images and connectomes derived from diffusion tensors estimated from a subset of subjects from the Human Connectome Project.
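    For reference, the Fréchet mean that defines the atlas is the minimizer of the sum of squared geodesic distances over the population; with $d$ denoting the geodesic distance induced by the Ebin metric on the space of Riemannian metrics (notation ours, not the paper's):

    \bar{g} \;=\; \operatorname*{arg\,min}_{g}\; \sum_{i=1}^{N} d\big(g, g_i\big)^{2}.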
    Physics-based Human Motion Estimation and Synthesis from Videos. (arXiv:2109.09913v1 [cs.CV])
    (2 min) Human motion synthesis is an important problem with applications in graphics, gaming and simulation environments for robotics. Existing methods require accurate motion capture data for training, which is costly to obtain. Instead, we propose a framework for training generative models of physically plausible human motion directly from monocular RGB videos, which are much more widely available. At the core of our method is a novel optimization formulation that corrects imperfect image-based pose estimations by enforcing physics constraints and reasons about contacts in a differentiable way. This optimization yields corrected 3D poses and motions, as well as their corresponding contact forces. Results show that our physically-corrected motions significantly outperform prior work on pose estimation. We can then use these to train a generative model to synthesize future motion. We demonstrate both qualitatively and quantitatively significantly improved motion estimation, synthesis quality and physical plausibility achieved by our method on the large scale Human3.6m dataset \cite{h36m_pami} as compared to prior kinematic and physics-based methods. By enabling learning of motion synthesis from video, our method paves the way for large-scale, realistic and diverse motion synthesis.
    Multi-Domain Few-Shot Learning and Dataset for Agricultural Applications. (arXiv:2109.09952v1 [cs.CV])
    (2 min) Automatic classification of pests and plants (both healthy and diseased) is of paramount importance in agriculture to improve yield. Conventional deep learning models based on convolutional neural networks require thousands of labeled examples per category. In this work we propose a method to learn from a few samples to automatically classify different pests, plants, and their diseases, using Few-Shot Learning (FSL). We learn a feature extractor to generate embeddings and then update the embeddings using Transformers. Using Mahalanobis distance, a class-covariance-based metric, we then calculate the similarity of the transformed embeddings with the embedding of the image to be classified. Using our proposed architecture, we conduct extensive experiments on multiple datasets showing the effectiveness of our proposed model. We conduct 42 experiments in total to comprehensively analyze the model and it achieves up to 14% and 24% performance gains on few-shot image classification benchmarks on two datasets. We also compile a new FSL dataset containing images of healthy and diseased plants taken in real-world settings. Using our proposed architecture which has been shown to outperform several existing FSL architectures in agriculture, we provide strong baselines on our newly proposed dataset.
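    The classification rule sketched above can be illustrated in a few lines: estimate a shared covariance from the (transformed) support embeddings, then assign a query to the nearest class centroid under the Mahalanobis distance. The pooled covariance estimator and ridge term below are simplifying assumptions, not the paper's exact procedure.

    import numpy as np

    def mahalanobis_classify(support, labels, query, eps=1e-3):
        """support: (n, d) embeddings, labels: (n,), query: (d,)."""
        classes = np.unique(labels)
        centroids = np.stack([support[labels == c].mean(axis=0) for c in classes])

        # Class-covariance-style estimate: pool residuals around the class centroids.
        centered = support - centroids[np.searchsorted(classes, labels)]
        cov = centered.T @ centered / len(support) + eps * np.eye(support.shape[1])
        cov_inv = np.linalg.inv(cov)

        diffs = centroids - query                              # (k, d)
        dists = np.einsum("kd,de,ke->k", diffs, cov_inv, diffs)
        return classes[np.argmin(dists)]

    rng = np.random.default_rng(0)
    support = rng.normal(size=(25, 8))                         # toy 5-way 5-shot task
    labels = np.repeat(np.arange(5), 5)
    print(mahalanobis_classify(support, labels, rng.normal(size=8)))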
    IgNet. A Super-precise Convolutional Neural Network. (arXiv:2109.09939v1 [cs.CV])
    (2 min) Convolutional neural networks (CNN) are known to be an effective means to detect and analyze images. Their power is essentially based on the ability to extract images' common features. There exist, however, images involving unique, irregular features or details. Such is a collection of unusual children's drawings reflecting the kids' imagination and individuality. These drawings were analyzed by means of a CNN constructed with Keras-TensorFlow. The same problem - on a significantly higher level - was solved with a newly developed family of networks called IgNet that is described in this paper. It proved able to learn all the categorical characteristics of the drawings with 100% accuracy. In the case of a regression task (learning the young artists' ages) IgNet performed with an error of no more than 0.4%. The principles of IgNet design that made it possible to reach such substantial results with a rather simple network topology are discussed.
    MetaMedSeg: Volumetric Meta-learning for Few-Shot Organ Segmentation. (arXiv:2109.09734v1 [eess.IV])
    (2 min) The lack of sufficient annotated image data is a common issue in medical image segmentation. For some organs and densities, the annotation may be scarce, leading to poor model training convergence, while other organs have plenty of annotated data. In this work, we present MetaMedSeg, a gradient-based meta-learning algorithm that redefines the meta-learning task for volumetric medical data with the goal of capturing the variety between the slices. We also explore different weighting schemes for gradient aggregation, arguing that different tasks might have different complexity, and hence, contribute differently to the initialization. We propose an importance-aware weighting scheme to train our model. In the experiments, we present an evaluation on the Medical Decathlon dataset by extracting 2D slices from CT and MRI volumes of different organs and performing semantic segmentation. The results show that our proposed volumetric task definition leads to up to 30% improvement in terms of IoU compared to related baselines. The proposed update rule is also shown to improve the performance for complex scenarios where the data distribution of the target organ is very different from the source organs.
    AutoPhoto: Aesthetic Photo Capture using Reinforcement Learning. (arXiv:2109.09923v1 [cs.CV])
    (2 min) The process of capturing a well-composed photo is difficult and it takes years of experience to master. We propose a novel pipeline for an autonomous agent to automatically capture an aesthetic photograph by navigating within a local region in a scene. Instead of classical optimization over heuristics such as the rule-of-thirds, we adopt a data-driven aesthetics estimator to assess photo quality. A reinforcement learning framework is used to optimize the model with respect to the learned aesthetics metric. We train our model in simulation with indoor scenes, and we demonstrate that our system can capture aesthetic photos in both simulation and real world environments on a ground robot. To our knowledge, this is the first system that can automatically explore an environment to capture an aesthetic photo with respect to a learned aesthetic estimator.
    Skin Deep Unlearning: Artefact and Instrument Debiasing in the Context of Melanoma Classification. (arXiv:2109.09818v1 [cs.CV])
    (2 min) Convolutional Neural Networks have demonstrated dermatologist-level performance in the classification of melanoma and other skin lesions, but prediction irregularities due to biases seen within the training data are an issue that should be addressed before widespread deployment is possible. In this work, we robustly remove bias and spurious variation from an automated melanoma classification pipeline using two leading bias unlearning techniques. We show that the biases introduced by surgical markings and rulers presented in previous studies can be reasonably mitigated using these bias removal methods. We also demonstrate the generalisation benefits of unlearning spurious variation relating to the imaging instrument used to capture lesion images. Contributions of this work include the application of different debiasing techniques for artefact bias removal and the concept of instrument bias unlearning for domain generalisation in melanoma detection. Our experimental results provide evidence that the effects of each of the aforementioned biases are notably reduced, with different debiasing techniques excelling at different tasks.
    Object Detection in Thermal Spectrum for Advanced Driver-Assistance Systems (ADAS). (arXiv:2109.09854v1 [cs.CV])
    (2 min) Object detection in the thermal infrared spectrum provides a more reliable data source in low-lighting and varying weather conditions, as it is useful both in-cabin and outside for pedestrian, animal, and vehicular detection as well as for detecting street signs and lighting poles. This paper explores and adapts a state-of-the-art object detection and classification framework to thermal vision with seven distinct classes for advanced driver-assistance systems (ADAS). The trained network variants on public datasets are validated on test data with three different test approaches which include test-time with no augmentation, test-time augmentation, and test-time with model ensembling. Additionally, the efficacy of the trained networks is tested on locally gathered novel test data captured with an uncooled LWIR prototype thermal camera in challenging weather and environmental scenarios. The performance analysis of the trained models is investigated by computing precision, recall, and mean average precision (mAP) scores. Furthermore, the trained model architecture is optimized using the TensorRT inference accelerator and deployed on resource-constrained edge hardware (Nvidia Jetson Nano) to explicitly reduce the inference time on GPU as well as edge devices for further real-time onboard installations.
    Enforcing Mutual Consistency of Hard Regions for Semi-supervised Medical Image Segmentation. (arXiv:2109.09960v1 [cs.CV])
    (2 min) In this paper, we propose a novel mutual consistency network (MC-Net+) to effectively exploit the unlabeled hard regions for semi-supervised medical image segmentation. The MC-Net+ model is motivated by the observation that deep models trained with limited annotations are prone to output highly uncertain and easily mis-classified predictions in the ambiguous regions (e.g. adhesive edges or thin branches) for the image segmentation task. Leveraging these region-level challenging samples can make the semi-supervised segmentation model training more effective. Therefore, our proposed MC-Net+ model consists of two new designs. First, the model contains one shared encoder and multiple slightly different decoders (i.e. using different up-sampling strategies). The statistical discrepancy of multiple decoders' outputs is computed to denote the model's uncertainty, which indicates the unlabeled hard regions. Second, a new mutual consistency constraint is enforced between one decoder's probability output and the other decoders' soft pseudo labels. In this way, we minimize the model's uncertainty during training and force the model to generate invariant and low-entropy results in such challenging areas of unlabeled data, in order to learn a generalized feature representation. We compare the segmentation results of MC-Net+ with five state-of-the-art semi-supervised approaches on three public medical datasets. Extensive experiments with two common semi-supervised settings demonstrate the superior performance of our model over other existing methods, which sets a new state of the art for semi-supervised medical image segmentation.
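    A minimal reading of the mutual consistency idea: each decoder's probability output is pulled towards sharpened (soft pseudo-label) versions of the other decoders' outputs. The temperature sharpening and mean-squared penalty below are illustrative choices and not necessarily those used in MC-Net+.

    import torch
    import torch.nn.functional as F

    def sharpen(p, T=0.5):
        """Soft pseudo-label via temperature sharpening of a probability map."""
        p = p.clamp_min(1e-6) ** (1.0 / T)
        return p / p.sum(dim=1, keepdim=True)

    def mutual_consistency_loss(probs):
        """probs: list of (B, C, ...) softmax outputs from slightly different decoders."""
        loss = 0.0
        for i, p_i in enumerate(probs):
            for j, p_j in enumerate(probs):
                if i != j:
                    loss = loss + F.mse_loss(p_i, sharpen(p_j).detach())
        return loss / (len(probs) * (len(probs) - 1))

    outs = [torch.softmax(torch.randn(2, 2, 16, 16), dim=1) for _ in range(3)]
    print(mutual_consistency_loss(outs).item())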
    AirDOS: Dynamic SLAM benefits from Articulated Objects. (arXiv:2109.09903v1 [cs.RO])
    (2 min) Dynamic Object-aware SLAM (DOS) exploits object-level information to enable robust motion estimation in dynamic environments. It has attracted increasing attention with the recent success of learning-based models. Existing methods mainly focus on identifying and excluding dynamic objects from the optimization. In this paper, we show that feature-based visual SLAM systems can also benefit from the presence of dynamic articulated objects by taking advantage of two observations: (1) The 3D structure of an articulated object remains consistent over time; (2) The points on the same object follow the same motion. In particular, we present AirDOS, a dynamic object-aware system that introduces rigidity and motion constraints to model articulated objects. By jointly optimizing the camera pose, object motion, and the object 3D structure, we can rectify the camera pose estimation, preventing tracking loss, and generate 4D spatio-temporal maps for both dynamic objects and static scenes. Experiments show that our algorithm improves the robustness of visual SLAM algorithms in challenging crowded urban environments. To the best of our knowledge, AirDOS is the first dynamic object-aware SLAM system demonstrating that camera pose estimation can be improved by incorporating dynamic articulated objects.
    Well Googled is Half Done: Multimodal Forecasting of New Fashion Product Sales with Image-based Google Trends. (arXiv:2109.09824v1 [cs.CV])
    (2 min) This paper investigates the effectiveness of systematically probing Google Trends against textual translations of visual aspects as exogenous knowledge to predict the sales of brand-new fashion items, where past sales data is not available, but only an image and a few metadata are available. In particular, we propose GTM-Transformer, standing for Google Trends Multimodal Transformer, whose encoder works on the representation of the exogenous time series, while the decoder forecasts the sales using the Google Trends encoding, and the available visual and metadata information. Our model works in a non-autoregressive manner, avoiding the compounding effect of the first-step errors. As a second contribution, we present the VISUELLE dataset, which is the first publicly available dataset for the task of new fashion product sales forecasting, containing the sales of 5577 new products sold between 2016-2019, derived from genuine historical data of Nunalie, an Italian fast-fashion company. Our dataset is equipped with images of products, metadata, related sales, and associated Google Trends. We use VISUELLE to compare our approach against state-of-the-art alternatives and numerous baselines, showing that GTM-Transformer is the most accurate in terms of both percentage and absolute error. It is worth noting that the addition of exogenous knowledge boosts the forecasting accuracy by 1.5% in terms of WAPE, showing the importance of exploiting Google Trends. The code and dataset are both available at https://github.com/HumaticsLAB/GTM-Transformer.
    On the Importance of Distractors for Few-Shot Classification. (arXiv:2109.09883v1 [cs.CV])
    (2 min) Few-shot classification aims at classifying categories of a novel task by learning from just a few (typically, 1 to 5) labelled examples. An effective approach to few-shot classification involves a prior model trained on a large-sample base domain, which is then finetuned over the novel few-shot task to yield generalizable representations. However, task-specific finetuning is prone to overfitting due to the lack of enough training examples. To alleviate this issue, we propose a new finetuning approach based on contrastive learning that reuses unlabelled examples from the base domain in the form of distractors. Unlike the nature of unlabelled data used in prior works, distractors belong to classes that do not overlap with the novel categories. We demonstrate for the first time that inclusion of such distractors can significantly boost few-shot generalization. Our technical novelty includes a stochastic pairing of examples sharing the same category in the few-shot task and a weighting term that controls the relative influence of task-specific negatives and distractors. An important aspect of our finetuning objective is that it is agnostic to distractor labels and hence applicable to various base domain settings. Compared to state-of-the-art approaches, our method shows accuracy gains of up to $12\%$ in cross-domain and up to $5\%$ in unsupervised prior-learning settings.
    Balanced-MixUp for Highly Imbalanced Medical Image Classification. (arXiv:2109.09850v1 [cs.CV])
    (2 min) Highly imbalanced datasets are ubiquitous in medical image classification problems. In such problems, it is often the case that rare classes associated with less prevalent diseases are severely under-represented in labeled databases, typically resulting in poor performance of machine learning algorithms due to overfitting in the learning process. In this paper, we propose a novel mechanism for sampling training data based on the popular MixUp regularization technique, which we refer to as Balanced-MixUp. In short, Balanced-MixUp simultaneously performs regular (i.e., instance-based) and balanced (i.e., class-based) sampling of the training data. The resulting two sets of samples are then mixed-up to create a more balanced training distribution from which a neural network can effectively learn without heavily under-fitting the minority classes. We experiment with a highly imbalanced dataset of retinal images (55K samples, 5 classes) and a long-tail dataset of gastro-intestinal video frames (10K images, 23 classes), using two CNNs of varying representation capabilities. Experimental results demonstrate that applying Balanced-MixUp outperforms other conventional sampling schemes and loss functions specifically designed to deal with imbalanced data. Code is released at https://github.com/agaldran/balanced_mixup.
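    One way to read the sampling scheme: draw one mini-batch with ordinary instance-based sampling and one with class-balanced sampling, then MixUp the two. The Beta parameter, the pairing, and the toy data below are illustrative assumptions, not the paper's exact recipe.

    import numpy as np

    def balanced_mixup_batch(X, y, n_classes, batch_size=8, alpha=0.2, rng=None):
        """Mix an instance-sampled batch with a class-balanced batch (toy sketch)."""
        rng = rng or np.random.default_rng()
        idx_inst = rng.integers(0, len(X), size=batch_size)        # instance-based draw
        cls = rng.choice(np.unique(y), size=batch_size)            # class-based draw
        idx_bal = np.array([rng.choice(np.flatnonzero(y == c)) for c in cls])

        lam = rng.beta(alpha, alpha, size=(batch_size, 1))
        onehot = np.eye(n_classes)
        x_mix = lam * X[idx_inst] + (1.0 - lam) * X[idx_bal]
        y_mix = lam * onehot[y[idx_inst]] + (1.0 - lam) * onehot[y[idx_bal]]
        return x_mix, y_mix

    X = np.random.randn(200, 32)
    y = np.random.choice(5, size=200, p=[0.6, 0.2, 0.1, 0.05, 0.05])  # imbalanced labels
    xb, yb = balanced_mixup_batch(X, y, n_classes=5)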
    Viewpoint Invariant Dense Matching for Visual Geolocalization. (arXiv:2109.09827v1 [cs.CV])
    (2 min) In this paper we propose a novel method for image matching based on dense local features and tailored for visual geolocalization. Dense local features matching is robust against changes in illumination and occlusions, but not against viewpoint shifts which are a fundamental aspect of geolocalization. Our method, called GeoWarp, directly embeds invariance to viewpoint shifts in the process of extracting dense features. This is achieved via a trainable module which learns from the data an invariance that is meaningful for the task of recognizing places. We also devise a new self-supervised loss and two new weakly supervised losses to train this module using only unlabeled data and weak labels. GeoWarp is implemented efficiently as a re-ranking method that can be easily embedded into pre-existing visual geolocalization pipelines. Experimental validation on standard geolocalization benchmarks demonstrates that GeoWarp boosts the accuracy of state-of-the-art retrieval architectures. The code and trained models are available at https://github.com/gmberton/geo_warp
    Augmenting Depth Estimation with Geospatial Context. (arXiv:2109.09879v1 [cs.CV])
    (2 min) Modern cameras are equipped with a wide array of sensors that enable recording the geospatial context of an image. Taking advantage of this, we explore depth estimation under the assumption that the camera is geocalibrated, a problem we refer to as geo-enabled depth estimation. Our key insight is that if capture location is known, the corresponding overhead viewpoint offers a valuable resource for understanding the scale of the scene. We propose an end-to-end architecture for depth estimation that uses geospatial context to infer a synthetic ground-level depth map from a co-located overhead image, then fuses it inside of an encoder/decoder style segmentation network. To support evaluation of our methods, we extend a recently released dataset with overhead imagery and corresponding height maps. Results demonstrate that integrating geospatial context significantly reduces error compared to baselines, both at close ranges and when evaluating at much larger distances than existing benchmarks consider.
    Source-Free Domain Adaptive Fundus Image Segmentation with Denoised Pseudo-Labeling. (arXiv:2109.09735v1 [eess.IV])
    (2 min) Domain adaptation typically requires access to source domain data to utilize their distribution information for domain alignment with the target data. However, in many real-world scenarios, the source data may not be accessible during model adaptation in the target domain due to privacy issues. This paper studies the practical yet challenging source-free unsupervised domain adaptation problem, in which only an existing source model and the unlabeled target data are available for model adaptation. We present a novel denoised pseudo-labeling method for this problem, which effectively makes use of the source model and unlabeled target data to promote model self-adaptation from pseudo labels. Importantly, considering that the pseudo labels generated from the source model are inevitably noisy due to domain shift, we further introduce two complementary pixel-level and class-level denoising schemes with uncertainty estimation and prototype estimation to reduce noisy pseudo labels and select reliable ones to enhance the pseudo-labeling efficacy. Experimental results on cross-domain fundus image segmentation show that without using any source images or altering source training, our approach achieves comparable or even higher performance than state-of-the-art source-dependent unsupervised domain adaptation methods.
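    A stripped-down view of the pixel-level denoising step: keep a pseudo-label only where the predictive uncertainty is low. The Monte-Carlo-dropout entropy used below is an assumed uncertainty estimator, not necessarily the one in the paper.

    import torch

    def denoised_pseudo_labels(model, image, n_mc=8, max_entropy=0.3):
        """Return pseudo-labels plus a mask of low-uncertainty (reliable) pixels."""
        model.train()                       # keep dropout active for MC sampling
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(image), dim=1) for _ in range(n_mc)])
        mean_p = probs.mean(dim=0)                                   # (B, C, H, W)
        entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(dim=1)
        return mean_p.argmax(dim=1), entropy < max_entropy

    dummy = torch.nn.Sequential(torch.nn.Dropout2d(0.1), torch.nn.Conv2d(3, 2, 1))
    labels, reliable = denoised_pseudo_labels(dummy, torch.randn(1, 3, 64, 64))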
    Robust Estimation of Reflection Symmetry in Noisy and Partial 3D Point Clouds. (arXiv:2109.09927v1 [cs.CV])
    (2 min) Detecting the reflection symmetry plane of an object represented by a 3D point cloud is a fundamental problem in 3D computer vision and geometry processing due to its various applications such as compression, object detection, robotic grasping, 3D surface reconstruction, etc. There exist several efficient approaches for solving this problem for clean 3D point clouds. However, this problem becomes difficult to solve in the presence of outliers and missing parts due to occlusions while scanning the objects through 3D scanners. The existing methods try to overcome these challenges mostly by voting-based techniques but fail in challenging settings. In this work, we propose a statistical estimator for the plane of reflection symmetry that is robust to outliers and missing parts. We pose the problem of finding the optimal estimator as an optimization problem on a 2-sphere that quickly converges to the global solution. We further propose a 3D point descriptor that is invariant to 3D reflection symmetry using the spectral properties of the geodesic distance matrix constructed from the neighbors of a point. This helps us in decoupling the chicken-and-egg problem of finding the optimal symmetry plane and correspondences between the reflective symmetric points. We show that the proposed approach achieves state-of-the-art performance on the benchmark dataset.
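    For concreteness (our notation, not the paper's): reflecting a point $x \in \mathbb{R}^3$ across a plane with unit normal $n$ passing through a point $c$ is

    x' \;=\; x - 2\,\langle x - c,\, n\rangle\, n, \qquad \|n\| = 1,

    so after centering the cloud at $c$, estimating the symmetry plane amounts to searching over unit normals $n$ on the 2-sphere for the one whose induced reflection best matches the observed points under a robust discrepancy.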
    Estimating and Exploiting the Aleatoric Uncertainty in Surface Normal Estimation. (arXiv:2109.09881v1 [cs.CV])
    (2 min) Surface normal estimation from a single image is an important task in 3D scene understanding. In this paper, we address two limitations shared by the existing methods: the inability to estimate the aleatoric uncertainty and lack of detail in the prediction. The proposed network estimates the per-pixel surface normal probability distribution. We introduce a new parameterization for the distribution, such that its negative log-likelihood is the angular loss with learned attenuation. The expected value of the angular error is then used as a measure of the aleatoric uncertainty. We also present a novel decoder framework where pixel-wise multi-layer perceptrons are trained on a subset of pixels sampled based on the estimated uncertainty. The proposed uncertainty-guided sampling prevents the bias in training towards large planar surfaces and improves the quality of prediction, especially near object boundaries and on small structures. Experimental results show that the proposed method outperforms the state-of-the-art in ScanNet and NYUv2, and that the estimated uncertainty correlates well with the prediction error. Code is available at https://github.com/baegwangbin/surface_normal_uncertainty.
    Unsupervised Domain Adaptation with Semantic Consistency across Heterogeneous Modalities for MRI Prostate Lesion Segmentation. (arXiv:2109.09736v1 [eess.IV])
    (2 min) Any novel medical imaging modality that differs from previous protocols e.g. in the number of imaging channels, introduces a new domain that is heterogeneous from previous ones. This common medical imaging scenario is rarely considered in the domain adaptation literature, which handles shifts across domains of the same dimensionality. In our work we rely on stochastic generative modeling to translate across two heterogeneous domains at pixel space and introduce two new loss functions that promote semantic consistency. Firstly, we introduce a semantic cycle-consistency loss in the source domain to ensure that the translation preserves the semantics. Secondly, we introduce a pseudo-labelling loss, where we translate target data to source, label them by a source-domain network, and use the generated pseudo-labels to supervise the target-domain network. Our results show that this allows us to extract systematically better representations for the target domain. In particular, we address the challenge of enhancing performance on VERDICT-MRI, an advanced diffusion-weighted imaging technique, by exploiting labeled mp-MRI data. When compared to several unsupervised domain adaptation approaches, our approach yields substantial improvements, that consistently carry over to the semi-supervised and supervised learning settings.
  • cs.IR updates on arXiv.org

    WorldKG: A World-Scale Geographic Knowledge Graph. (arXiv:2109.10036v1 [cs.IR])
    (2 min) OpenStreetMap is a rich source of openly available geographic information. However, the representation of geographic entities, e.g., buildings, mountains, and cities, within OpenStreetMap is highly heterogeneous, diverse, and incomplete. As a result, this rich data source is hardly usable for real-world applications. This paper presents WorldKG -- a new geographic knowledge graph aiming to provide a comprehensive semantic representation of geographic entities in OpenStreetMap. We describe the WorldKG knowledge graph, including its ontology that builds the semantic dataset backbone, the extraction procedure of the ontology and geographic entities from OpenStreetMap, and the methods to enhance entity annotation. We perform statistical and qualitative dataset assessment, demonstrating the large scale and high precision of the semantic geographic information in WorldKG.
    SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. (arXiv:2109.10086v1 [cs.IR])
    (2 min) In neural Information Retrieval (IR), ongoing research is directed towards improving the first retriever in ranking pipelines. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors methods has proven to work well. Meanwhile, there has been a growing interest in learning \emph{sparse} representations for documents and queries, that could inherit from the desirable properties of bag-of-words models such as the exact matching of terms and the efficiency of inverted indexes. Introduced recently, the SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches. In this paper, we build on SPLADE and propose several significant improvements in terms of effectiveness and/or efficiency. More specifically, we modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation. We also report results on the BEIR benchmark. Overall, SPLADE is considerably improved with more than $9$\% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
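    A commonly cited form of the SPLADE term weighting, included here only as a hedged sketch (the pooling modification introduced in this paper may differ): pass token-level masked-LM logits through a log-saturated ReLU and pool over token positions to obtain one sparse, vocabulary-sized vector per text.

    import torch

    def splade_representation(mlm_logits, attention_mask):
        """mlm_logits: (B, L, V) masked-LM logits; returns a sparse (B, V) weight vector."""
        weights = torch.log1p(torch.relu(mlm_logits))              # log-saturation
        weights = weights.masked_fill(attention_mask.unsqueeze(-1) == 0, 0.0)
        return weights.max(dim=1).values                           # max pooling over tokens

    reps = splade_representation(torch.randn(2, 16, 30522),
                                 torch.ones(2, 16, dtype=torch.long))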
    Towards a computational definition of the tresillo rhythm and its tracing in popular music. (arXiv:2109.10256v1 [cs.IR])
    (2 min) This paper discusses the use and popularity of a rhythm, which henceforth is referred to as the "Tresillo rhythm". We first define and formalize the Tresillo rhythm. Given a mathematical representation of the rhythm, it is then traced in the US Billboard Top 20 Charts of the last 20 years. To detect and determine the use of this rhythm in a song, we compute the similarity of a song with this rhythm. The calculated similarity then indicates how similar the rhythm of a pop song is to the previously defined Tresillo rhythm. To assess and cross-validate the computed rhythm similarity, two different formalizations of the Tresillo rhythm have been compiled and several different approaches to calculating rhythm similarity have been tested and compared. This similarity measure is then used to conduct an empirical study of the usage of the Tresillo rhythm in the US Billboard Top 20 Charts of the past 20 years (1999-2019). Finally, we discuss some of the possible reasons for the observed trend.
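    As one possible formalization (ours, for illustration only): the Tresillo's 3-3-2 grouping can be written as a binary onset pattern over eight pulses, and a bar of a song can be scored by its best cosine similarity against all rotations of that pattern.

    import numpy as np

    TRESILLO = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)   # 3-3-2 over eight pulses

    def rhythm_similarity(onsets, reference=TRESILLO):
        """Max cosine similarity between a binary onset vector and all rotations
        of the reference pattern (a simple stand-in for the paper's measures)."""
        onsets = np.asarray(onsets, dtype=float)
        best = 0.0
        for shift in range(len(reference)):
            ref = np.roll(reference, shift)
            sim = ref @ onsets / (np.linalg.norm(ref) * np.linalg.norm(onsets) + 1e-9)
            best = max(best, sim)
        return best

    print(rhythm_similarity([1, 0, 0, 1, 0, 0, 1, 0]))            # identical pattern -> ~1.0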
    When expertise gone missing: Uncovering the loss of prolific contributors in Wikipedia. (arXiv:2109.09979v1 [cs.SI])
    (2 min) The success of planetary-scale online collaborative platforms such as Wikipedia hinges on the active and continued participation of their voluntary contributors. The phenomenal success of Wikipedia as a valued multilingual source of information is a testament to the possibilities of collective intelligence. Specifically, the sustained and prudent contributions of experienced prolific editors play a crucial role in operating the platform smoothly for decades. However, it has been brought to light that the growth of Wikipedia is stagnating: the number of editors faces a steady decline over time. This decreasing productivity and ever-increasing attrition rate among both newcomer and experienced editors are a major concern not only for the future of this platform but also for several industry-scale information retrieval systems such as Siri and Alexa, which depend on Wikipedia as a knowledge store. In this paper, we study the ongoing crisis in which experienced and prolific editors withdraw. We performed an extensive analysis of editor activities and their language usage to identify features that can forecast prolific Wikipedians who are at risk of ceasing voluntary services. To the best of our knowledge, this is the first work which proposes a scalable prediction pipeline towards detecting the prolific Wikipedians who might be at risk of retiring from the platform, thereby potentially enabling moderators to launch appropriate incentive mechanisms to retain such `would-be missing' valued Wikipedians.
    The Vision and the Perspective of Digital Tourism. (arXiv:2109.09795v1 [cs.DL])
    (2 min) The dynamics of the modern information society change the usual areas of human activity and generate various innovations based on the widespread use of Information and Communication Technologies (ICTs). Virtually every activity today is technology-related. Under these conditions, scientific activity is also changing. Digitalization processes act as an integrative force across various scientific directions, forming the basis for interdisciplinary scientific research. The study of their formation is an important scientific task aimed at predicting the development of both science and society as a whole. In this study, based on the integrated use of ICTs, we consider methods for analyzing the terminology base of various interdisciplinary research directions, using tourism in the digital age as an example. The development of scientific interest in the area of digital tourism in Russian and global scientific discourses is also compared. The purpose of this paper is to demonstrate the relevance of scientific study in the field of tourism digitalization, to identify the generic directions and trends of digital tourism, and to specify technologies for the implementation of digital tourism using the case study of St. Petersburg.
    Condenser: a Pre-training Architecture for Dense Retrieval. (arXiv:2104.08253v2 [cs.CL] UPDATED)
    (2 min) Pre-trained Transformer language models (LM) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient text comparison and retrieval. However, dense encoders require large amounts of data and sophisticated techniques to train effectively, and they suffer in low-data situations. This paper finds that a key reason is that standard LMs' internal attention structure is not ready-to-use for dense encoders, which need to aggregate text information into the dense representation. We propose to pre-train towards a dense encoder with a novel Transformer architecture, Condenser, where LM prediction CONditions on DENSE Representation. Our experiments show Condenser improves over standard LM by large margins on various text retrieval and similarity tasks.
    Homography augmented momentum contrastive learning for SAR image retrieval. (arXiv:2109.10329v1 [cs.CV])
    (2 min) Deep learning-based image retrieval has been emphasized in computer vision. Representation embedding extracted by deep neural networks (DNNs) not only aims at containing semantic information of the image, but also can manage large-scale image retrieval tasks. In this work, we propose a deep learning-based image retrieval approach using homography transformation augmented contrastive learning to perform large-scale synthetic aperture radar (SAR) image search tasks. Moreover, we propose a training method for the DNNs induced by contrastive learning that does not require any labeling procedure. This may enable tractability of large-scale datasets with relative ease. Finally, we verify the performance of the proposed method by conducting experiments on the polarimetric SAR image datasets.
    Deviation-Based Learning. (arXiv:2109.09816v1 [econ.TH])
    (2 min) We propose deviation-based learning, a new approach to training recommender systems. In the beginning, the recommender and rational users have different pieces of knowledge, and the recommender needs to learn the users' knowledge to make better recommendations. The recommender learns users' knowledge by observing whether each user followed or deviated from her recommendations. We show that learning frequently stalls if the recommender always recommends a choice: users tend to follow the recommendation blindly, and their choices do not reflect their knowledge. Social welfare and the learning rate are improved drastically if the recommender abstains from recommending a choice when she predicts that multiple arms will produce a similar payoff.
    Merchant Category Identification Using Credit Card Transactions. (arXiv:2011.02602v1 [cs.LG] CROSS LISTED)
    (2 min) Digital payment volume has proliferated in recent years with the rapid growth of small businesses and online shops. When processing these digital transactions, recognizing each merchant's real identity (i.e., business type) is vital to ensure the integrity of payment processing systems. Conventionally, this problem is formulated as a time series classification problem solely using the merchant transaction history. However, with the large scale of the data, and changing behaviors of merchants and consumers over time, it is extremely challenging to achieve satisfying performance from off-the-shelf classification methods. In this work, we approach this problem from a multi-modal learning perspective, where we use not only the merchant time series data but also the information of merchant-merchant relationship (i.e., affinity) to verify the self-reported business type (i.e., merchant category) of a given merchant. Specifically, we design two individual encoders, where one is responsible for encoding temporal information and the other is responsible for affinity information, and a mechanism to fuse the outputs of the two encoders to accomplish the identification task. Our experiments on real-world credit card transaction data between 71,668 merchants and 433,772,755 customers have demonstrated the effectiveness and efficiency of the proposed model.
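    The two-encoder design can be sketched as follows: one encoder for the transaction time series, one for an affinity representation of the merchant, and a fusion layer feeding the category classifier. All module choices and sizes here (GRU encoder, concatenation fusion) are illustrative assumptions rather than the paper's architecture.

    import torch
    import torch.nn as nn

    class MerchantCategoryNet(nn.Module):
        def __init__(self, ts_features=4, affinity_dim=64, hidden=32, n_categories=10):
            super().__init__()
            self.temporal = nn.GRU(ts_features, hidden, batch_first=True)  # time-series encoder
            self.affinity = nn.Sequential(nn.Linear(affinity_dim, hidden), nn.ReLU())
            self.head = nn.Linear(2 * hidden, n_categories)                # fuse by concatenation

        def forward(self, ts, aff):
            _, h = self.temporal(ts)                     # h: (1, B, hidden)
            fused = torch.cat([h[-1], self.affinity(aff)], dim=-1)
            return self.head(fused)

    model = MerchantCategoryNet()
    logits = model(torch.randn(8, 30, 4), torch.randn(8, 64))  # 30 days x 4 features per merchant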
  • cs.LG updates on arXiv.org

    LOTR: Face Landmark Localization Using Localization Transformer. (arXiv:2109.10057v1 [cs.CV])
    (2 min) This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in the feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts the landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces the smooth-Wing loss function, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches.
    FakeWake: Understanding and Mitigating Fake Wake-up Words of Voice Assistants. (arXiv:2109.09958v1 [cs.LG])
    (2 min) In the area of the Internet of Things (IoT), voice assistants have become an important interface to operate smart speakers, smartphones, and even automobiles. To save power and protect user privacy, voice assistants send commands to the cloud only if a small set of pre-registered wake-up words are detected. However, voice assistants are shown to be vulnerable to the FakeWake phenomena, whereby they are inadvertently triggered by innocent-sounding fuzzy words. In this paper, we present a systematic investigation of the FakeWake phenomena from three aspects. To start with, we design the first fuzzy word generator to automatically and efficiently produce fuzzy words instead of searching through a swarm of audio materials. We manage to generate 965 fuzzy words covering the 8 most popular English and Chinese smart speakers. To explain the causes underlying the FakeWake phenomena, we construct an interpretable tree-based decision model, which reveals phonetic features that contribute to false acceptance of fuzzy words by wake-up word detectors. Finally, we propose remedies to mitigate the effect of FakeWake. The results show that the strengthened models are not only resilient to fuzzy words but also achieve better overall performance on the original training datasets.
    Learning low-degree functions from a logarithmic number of random queries. (arXiv:2109.10162v1 [cs.LG])
    (2 min) We prove that for any integer $n\in\mathbb{N}$, $d\in\{1,\ldots,n\}$ and any $\varepsilon,\delta\in(0,1)$, a bounded function $f:\{-1,1\}^n\to[-1,1]$ of degree at most $d$ can be learned with probability at least $1-\delta$ and $L_2$-error $\varepsilon$ using $\log(\tfrac{n}{\delta})\,\varepsilon^{-d-1} C^{d^{3/2}\sqrt{\log d}}$ random queries for a universal finite constant $C>1$.
    ApproxIFER: A Model-Agnostic Approach to Resilient and Robust Prediction Serving Systems. (arXiv:2109.09868v1 [cs.LG])
    (2 min) Due to the surge of cloud-assisted AI services, the problem of designing resilient prediction serving systems that can effectively cope with stragglers/failures and minimize response delays has attracted much interest. The common approach for tackling this problem is replication, which assigns the same prediction task to multiple workers. This approach, however, is very inefficient and incurs significant resource overheads. Hence, a learning-based approach known as the parity model (ParM) has been recently proposed which learns models that can generate parities for a group of predictions in order to reconstruct the predictions of the slow/failed workers. While this learning-based approach is more resource-efficient than replication, it is tailored to the specific model hosted by the cloud and is particularly suitable for a small number of queries (typically less than four) and tolerating very few (typically one) stragglers. Moreover, ParM does not handle Byzantine adversarial workers. We propose a different approach, named Approximate Coded Inference (ApproxIFER), that does not require training of any parity models, hence it is agnostic to the model hosted by the cloud and can be readily applied to different data domains and model architectures. Compared with earlier works, ApproxIFER can handle a general number of stragglers and scales significantly better with the number of queries. Furthermore, ApproxIFER is robust against Byzantine workers. Our extensive experiments on a large number of datasets and model architectures also show significant accuracy improvement by up to 58% over the parity model approaches.
    Vaccine allocation policy optimization and budget sharing mechanism using Thompson sampling. (arXiv:2109.10004v1 [math.OC])
    (2 min) The optimal allocation of vaccines to population subgroups over time is a challenging health care management problem. In the context of a pandemic, the interaction between vaccination policies adopted by multiple agents and the cooperation (or lack thereof) creates a complex environment that affects the global transmission dynamics of the disease. In this study, we take the perspective of decision-making agents that aim to minimize the size of their susceptible populations and must allocate vaccine under limited supply. We assume that vaccine efficiency rates are unknown to agents and we propose an optimization policy based on Thompson sampling to learn mean vaccine efficiency rates over time. Furthermore, we develop a budget-balanced resource sharing mechanism to promote cooperation among agents. We apply the proposed framework to the COVID-19 pandemic. We use a raster model of the world where agents represent the main countries worldwide and interact in a global mobility network to generate multiple problem instances. Our numerical results show that the proposed vaccine allocation policy achieves a larger reduction in the number of susceptible individuals, infections and deaths globally compared to a population-based policy. In addition, we show that, under a fixed global vaccine allocation budget, most countries can reduce their national number of infections and deaths by sharing their budget with countries with which they have a relatively high mobility exchange. The proposed framework can be used to improve policy-making in health care management by national and global health authorities.
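    A toy version of the Thompson-sampling step for unknown efficiency rates, treating each vaccine's efficiency as a Bernoulli parameter with a Beta posterior; the paper's allocation model over population subgroups and mobility networks is considerably richer.

    import numpy as np

    rng = np.random.default_rng(0)
    true_efficiency = [0.55, 0.70, 0.62]          # unknown to the decision-making agent
    alpha = np.ones(3)                            # Beta posterior parameters per vaccine
    beta = np.ones(3)

    for t in range(1000):
        sampled = rng.beta(alpha, beta)           # Thompson sampling: draw plausible rates
        v = int(np.argmax(sampled))               # allocate the next dose to the best draw
        success = rng.random() < true_efficiency[v]
        alpha[v] += success                       # posterior update for the chosen vaccine
        beta[v] += 1 - success

    print(alpha / (alpha + beta))                 # posterior-mean efficiency estimates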
    Neural networks with trainable matrix activation functions. (arXiv:2109.09948v1 [cs.LG])
    (2 min) The training process of neural networks usually optimizes the weights and bias parameters of linear transformations, while nonlinear activation functions are pre-specified and fixed. This work develops a systematic approach to constructing matrix activation functions whose entries are generalized from ReLU. The activation is based on matrix-vector multiplications using only scalar multiplications and comparisons. The proposed activation functions depend on parameters that are trained along with the weights and bias vectors. Neural networks based on this approach are simple and efficient and are shown to be robust in numerical experiments.
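    A simple instance consistent with this description (though likely less general than the paper's construction): a diagonal matrix activation whose entries switch between two trainable slopes based on a sign comparison, recovering ReLU at initialization.

    import torch
    import torch.nn as nn

    class TrainableDiagonalActivation(nn.Module):
        """y_i = a_i * x_i if x_i > 0 else b_i * x_i, with a, b trained with the weights."""
        def __init__(self, width):
            super().__init__()
            self.pos_slope = nn.Parameter(torch.ones(width))    # a, initialized to 1
            self.neg_slope = nn.Parameter(torch.zeros(width))   # b, initialized to 0 (ReLU)

        def forward(self, x):
            return torch.where(x > 0, self.pos_slope * x, self.neg_slope * x)

    net = nn.Sequential(nn.Linear(8, 16), TrainableDiagonalActivation(16), nn.Linear(16, 1))
    out = net(torch.randn(4, 8))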
    Stabilizing Elastic Weight Consolidation method in practical ML tasks and using weight importances for neural network pruning. (arXiv:2109.10021v1 [cs.LG])
    (2 min) This paper is devoted to the features of the practical application of the Elastic Weight Consolidation method. Here we compare, more rigorously than before, the known methodologies for calculating the importance of weights when applied to networks with fully connected and convolutional layers. We also point out the problems that arise when applying the Elastic Weight Consolidation method in multilayer neural networks with convolutional layers and self-attention layers, and propose a method to overcome these problems. In addition, we note an interesting fact about the use of various types of weight importance in the neural network pruning task.
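    For readers new to the method under discussion: Elastic Weight Consolidation adds, for each parameter, a quadratic penalty weighted by its estimated importance (typically the diagonal Fisher information) around the previous task's solution. A minimal rendering, with the importance values left as inputs:

    import torch

    def ewc_penalty(model, old_params, importances, lam=1.0):
        """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_i*)^2."""
        penalty = 0.0
        for name, p in model.named_parameters():
            penalty = penalty + (importances[name] * (p - old_params[name]) ** 2).sum()
        return 0.5 * lam * penalty

    model = torch.nn.Linear(4, 2)
    old = {n: p.detach().clone() for n, p in model.named_parameters()}
    fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder importances
    # total_loss = task_loss + ewc_penalty(model, old, fisher, lam=10.0)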
    Multiblock-Networks: A Neural Network Analog to Component Based Methods for Multi-Source Data. (arXiv:2109.10279v1 [cs.LG])
    (2 min) Training predictive models on datasets from multiple sources is a common, yet challenging setup in applied machine learning. Even though model interpretation has attracted more attention in recent years, many modeling approaches still focus mainly on performance. To further improve the interpretability of machine learning models, we suggest the adoption of concepts and tools from the well-established framework of component based multiblock analysis, also known as chemometrics. Nevertheless, artificial neural networks provide greater flexibility in model architecture and thus, often deliver superior predictive performance. In this study, we propose a setup to transfer the concepts of component based statistical models, including multiblock variants of principal component regression and partial least squares regression, to neural network architectures. Thereby, we combine the flexibility of neural networks with the concepts for interpreting block relevance in multiblock methods. In two use cases we demonstrate how the concept can be implemented in practice, and compare it to both common feed-forward neural networks without blocks, as well as statistical component based multiblock methods. Our results underline that multiblock networks allow for basic model interpretation while matching the performance of ordinary feed-forward neural networks.
    Deep Distributionally Robust Learning for Calibrated Uncertainties under Domain Shift. (arXiv:2010.05784v2 [cs.LG] UPDATED)
    (2 min) We propose a deep distributionally robust learning framework for calibrated uncertainties under domain shifts. We consider cases where the source (training) distribution differs significantly from the target (test) distribution. In addition to the standard class predictor, our framework contains a binary domain classifier which estimates the density ratio between the source and target domains. We incorporate both with neural networks and train them end-to-end. The framework is demonstrated to generate calibrated uncertainties that benefit many downstream tasks, including unsupervised domain adaptation (UDA) and semi-supervised learning (SSL) where methods such as self-training and FixMatch use uncertainties to select confident pseudo-labels. Our experiments show that the introduction of DRL to these methods leads to significant improvements in cross-domain performance. We also demonstrate that the produced density ratio estimates show agreement with the human selection frequencies, suggesting a match with the human perceived uncertainties. The source code of this work will be made publicly available.
    Towards a Fairness-Aware Scoring System for Algorithmic Decision-Making. (arXiv:2109.10053v1 [cs.LG])
    (2 min) Scoring systems, as simple classification models, have significant advantages in interpretability and transparency when making predictions. They facilitate human decision-making by allowing users to make a quick prediction by hand by adding and subtracting a few point scores, and thus have been widely used in various fields such as medical diagnosis in Intensive Care Units. However, the (un)fairness issues in these models have long been criticized, and the use of biased data in the construction of scoring systems heightens this concern. In this paper, we propose a general framework to create data-driven fairness-aware scoring systems. Our approach is first to develop a social welfare function that incorporates both efficiency and equity. Then, we translate the social welfare maximization problem in economics into the empirical risk minimization task in the machine learning community to derive a fairness-aware scoring system with the help of mixed integer programming. We show that the proposed framework provides practitioners and policymakers with great flexibility to select their desired fairness requirements and also allows them to customize their own requirements by imposing various operational constraints. Experimental evidence on several real data sets verifies that the proposed scoring system can achieve the optimal welfare of stakeholders and balance interpretability, fairness, and efficiency.
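    To make the scoring-system idea concrete: a prediction is made by adding a handful of integer point scores and comparing the total to a threshold. The features, points, and threshold below are invented for illustration and are not derived from the paper's framework.

    # Hypothetical scoring system: predict "high risk" if the total score reaches 2 points.
    SCORES = {"age_over_65": 2, "abnormal_lab": 3, "prior_admission": 1, "vaccinated": -2}

    def predict_high_risk(patient, threshold=2):
        total = sum(points for feature, points in SCORES.items() if patient.get(feature))
        return total >= threshold

    print(predict_high_risk({"age_over_65": True, "vaccinated": True}))   # 2 - 2 = 0 -> False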
    Prediction of severe thunderstorm events with ensemble deep learning and radar data. (arXiv:2109.09791v1 [cs.LG])
    (2 min) The problem of nowcasting extreme weather events can be addressed by applying either numerical methods for the solution of dynamic model equations or data-driven artificial intelligence algorithms. Within this latter framework, the present paper illustrates how a deep learning method, exploiting videos of radar reflectivity frames as input, can be used to realize a warning machine able to sound timely alarms of possible severe thunderstorm events. From a technical viewpoint, the computational core of this approach is the use of a value-weighted skill score for both transforming the probabilistic outcomes of the deep neural network into binary classification and assessing the forecasting performances. The warning machine has been validated against weather radar data recorded in the Liguria region, in Italy.
    IG-RL: Inductive Graph Reinforcement Learning for Massive-Scale Traffic Signal Control. (arXiv:2003.05738v6 [cs.LG] UPDATED)
    (0 min) Scaling adaptive traffic-signal control involves dealing with combinatorial state and action spaces. Multi-agent reinforcement learning attempts to address this challenge by distributing control to specialized agents. However, specialization hinders generalization and transferability, and the computational graphs underlying neural-network architectures -- dominant in the multi-agent setting -- do not offer the flexibility to handle an arbitrary number of entities, which changes both between road networks and over time as vehicles traverse the network. We introduce Inductive Graph Reinforcement Learning (IG-RL), based on graph-convolutional networks, which adapts to the structure of any road network to learn detailed representations of traffic controllers and their surroundings. Our decentralized approach enables learning of a transferable, adaptive traffic-signal-control policy. After being trained on an arbitrary set of road networks, our model can generalize to new road networks, traffic distributions, and traffic regimes, with no additional training and a constant number of parameters, enabling greater scalability compared to prior methods. Furthermore, our approach can exploit the granularity of available data by capturing the (dynamic) demand at both the lane and the vehicle levels. The proposed method is tested on both road networks and traffic settings never experienced during training. We compare IG-RL to multi-agent reinforcement learning and domain-specific baselines. In both synthetic road networks and in a larger experiment involving the control of the 3,971 traffic signals of Manhattan, we show that different instantiations of IG-RL outperform baselines.
    Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR. (arXiv:2109.09628v2 [cs.CV] UPDATED)
    (0 min) Self-supervised monocular depth prediction provides a cost-effective solution to obtain the 3D location of each pixel. However, the existing approaches usually lead to unsatisfactory accuracy, which is critical for autonomous robots. In this paper, we propose a novel two-stage network to advance the self-supervised monocular dense depth learning by leveraging low-cost sparse (e.g. 4-beam) LiDAR. Unlike the existing methods that use sparse LiDAR mainly in a manner of time-consuming iterative post-processing, our model fuses monocular image features and sparse LiDAR features to predict initial depth maps. Then, an efficient feed-forward refine network is further designed to correct the errors in these initial depth maps in pseudo-3D space with real-time performance. Extensive experiments show that our proposed model significantly outperforms all the state-of-the-art self-supervised methods, as well as the sparse-LiDAR-based methods on both self-supervised monocular depth prediction and completion tasks. With the accurate dense depth prediction, our model outperforms the state-of-the-art sparse-LiDAR-based method (Pseudo-LiDAR++) by more than 68% for the downstream task monocular 3D object detection on the KITTI Leaderboard.
    Homography augmented momentum contrastive learning for SAR image retrieval. (arXiv:2109.10329v1 [cs.CV])
    (0 min) Deep learning-based image retrieval has received growing attention in computer vision. Representation embeddings extracted by deep neural networks (DNNs) not only aim to capture the semantic information of an image but also make large-scale image retrieval tasks manageable. In this work, we propose a deep learning-based image retrieval approach that uses homography transformation augmented contrastive learning to perform large-scale synthetic aperture radar (SAR) image search tasks. Moreover, we propose a training method for the DNNs, induced by contrastive learning, that does not require any labeling procedure. This makes large-scale datasets tractable with relative ease. Finally, we verify the performance of the proposed method by conducting experiments on polarimetric SAR image datasets.
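    As a rough illustration of the homography augmentation step, the sketch below (not from the paper; the perturbation magnitude, OpenCV usage, and function names are our own assumptions) warps an image with a small random homography to create a second "view" for contrastive training.

        import numpy as np
        import cv2  # OpenCV

        def random_homography_view(img, max_shift=0.05, seed=None):
            """Warp an image with a small random homography to create a contrastive 'view'."""
            rng = np.random.default_rng(seed)
            h, w = img.shape[:2]
            # Original corner points and randomly jittered target corners.
            src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
            jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * [w, h]
            dst = (src + jitter).astype(np.float32)
            H, _ = cv2.findHomography(src, dst)
            return cv2.warpPerspective(img, H, (w, h))

        # Two differently warped views of the same SAR chip could then be fed to the
        # two branches of a momentum-contrastive (MoCo-style) encoder.
        img = (np.random.rand(128, 128) * 255).astype(np.uint8)  # stand-in for a SAR patch
        view_a = random_homography_view(img, seed=0)
        view_b = random_homography_view(img, seed=1)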
    Audiomer: A Convolutional Transformer for Keyword Spotting. (arXiv:2109.10252v1 [cs.LG])
    (0 min) Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to the extremely long sequence lengths of audio waveforms or reach competitive performance only after feature extraction through Fourier-based methods, incurring a loss floor. In this work, we introduce an architecture, Audiomer, in which we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting with raw audio waveforms, outperforming all previous methods while also being computationally cheaper and much more parameter- and data-efficient. Audiomer allows for deployment in compute-constrained devices and training on smaller datasets.
    Deconvolutional Density Network: Modeling Free-Form Conditional Distributions. (arXiv:2105.14367v2 [cs.LG] UPDATED)
    (0 min) Conditional density estimation (CDE) is the task of estimating the probability of an event conditioned on some inputs. A neural network (NN) can be used to compute the output distribution over a continuous domain, but it is difficult to explicitly approximate a free-form distribution without knowing its general form a priori. In order to fit an arbitrary conditional distribution, discretizing the continuous domain into bins is an effective strategy, as long as the bins are sufficiently narrow and the data sufficiently large. However, collecting enough data is often hard and falls far short of that ideal in many circumstances, especially in multivariate CDE, due to the curse of dimensionality. In this paper, we demonstrate the benefits of modeling free-form conditional distributions using a deconvolution-based neural net framework, coping with data deficiency problems in discretization. It has the advantage of being flexible while also exploiting the hierarchical smoothness offered by the deconvolution layers. We compare our method to a number of other density-estimation approaches and show that our Deconvolutional Density Network (DDN) outperforms the competing methods on many univariate and multivariate tasks.
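    A minimal sketch of the discretization step that underlies bin-based CDE: the continuous target is binned so a network can be trained with a categorical (histogram-like) output. The DDN's deconvolutional architecture is not shown here, and the helper names are our own.

        import numpy as np

        def bin_targets(y, n_bins, lo=None, hi=None):
            """Discretize a continuous target into equal-width bins and return one-hot labels."""
            lo = y.min() if lo is None else lo
            hi = y.max() if hi is None else hi
            edges = np.linspace(lo, hi, n_bins + 1)
            # np.digitize returns 1..n_bins for interior points; clip handles the edges.
            idx = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
            onehot = np.eye(n_bins)[idx]
            return idx, onehot, edges

        y = np.random.randn(1000)
        idx, onehot, edges = bin_targets(y, n_bins=32)
        # A network trained with cross-entropy on `onehot` predicts a histogram-like
        # conditional distribution; dividing the softmax output by the bin width gives a density.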
    On the Estimation of Information Measures of Continuous Distributions. (arXiv:2002.02851v2 [cs.IT] UPDATED)
    (0 min) The estimation of information measures of continuous distributions based on samples is a fundamental problem in statistics and machine learning. In this paper, we analyze estimates of differential entropy in $K$-dimensional Euclidean space, computed from a finite number of samples, when the probability density function belongs to a predetermined convex family $\mathcal{P}$. First, estimating differential entropy to any accuracy is shown to be infeasible if the differential entropy of densities in $\mathcal{P}$ is unbounded, clearly showing the necessity of additional assumptions. Subsequently, we investigate sufficient conditions that enable confidence bounds for the estimation of differential entropy. In particular, we provide confidence bounds for simple histogram based estimation of differential entropy from a fixed number of samples, assuming that the probability density function is Lipschitz continuous with known Lipschitz constant and known, bounded support. Our focus is on differential entropy, but we provide examples that show that similar results hold for mutual information and relative entropy as well.
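    A small worked example of the histogram-based plug-in estimate of differential entropy discussed above (illustrative only; the paper's confidence bounds and assumptions on the density class are not reproduced here).

        import numpy as np

        def histogram_diff_entropy(samples, n_bins=30):
            """Plug-in differential entropy estimate (in nats) from an equal-width histogram."""
            counts, edges = np.histogram(samples, bins=n_bins)
            width = edges[1] - edges[0]
            p = counts / counts.sum()        # empirical bin probabilities
            p = p[p > 0]                     # ignore empty bins (0 * log 0 := 0)
            # h(X) ~ -sum_i p_i * log(p_i / width) = -sum_i p_i * log(p_i) + log(width)
            return -np.sum(p * np.log(p)) + np.log(width)

        x = np.random.normal(0.0, 1.0, size=100_000)
        print(histogram_diff_entropy(x))     # close to 0.5 * log(2 * pi * e) ~ 1.4189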
    Shape Inference and Grammar Induction for Example-based Procedural Generation. (arXiv:2109.10217v1 [cs.AI])
    (0 min) Designers increasingly rely on procedural generation for automatic generation of content in various industries. These techniques require extensive knowledge of the desired content, and about how to actually implement such procedural methods. Algorithms for learning interpretable generative models from example content could alleviate both difficulties. We propose SIGI, a novel method for inferring shapes and inducing a shape grammar from grid-based 3D building examples. This interpretable grammar is well-suited for co-creative design. Applied to Minecraft buildings, we show how the shape grammar can be used to automatically generate new buildings in a similar style.
    Merchant Category Identification Using Credit Card Transactions. (arXiv:2011.02602v1 [cs.LG] CROSS LISTED)
    (0 min) Digital payment volume has proliferated in recent years with the rapid growth of small businesses and online shops. When processing these digital transactions, recognizing each merchant's real identity (i.e., business type) is vital to ensure the integrity of payment processing systems. Conventionally, this problem is formulated as a time series classification problem solely using the merchant transaction history. However, with the large scale of the data, and changing behaviors of merchants and consumers over time, it is extremely challenging to achieve satisfying performance from off-the-shelf classification methods. In this work, we approach this problem from a multi-modal learning perspective, where we use not only the merchant time series data but also the information of merchant-merchant relationship (i.e., affinity) to verify the self-reported business type (i.e., merchant category) of a given merchant. Specifically, we design two individual encoders, where one is responsible for encoding temporal information and the other is responsible for affinity information, and a mechanism to fuse the outputs of the two encoders to accomplish the identification task. Our experiments on real-world credit card transaction data between 71,668 merchants and 433,772,755 customers have demonstrated the effectiveness and efficiency of the proposed model.
    FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging. (arXiv:2109.09658v2 [cs.CV] UPDATED)
    (0 min) The recent advancements in artificial intelligence (AI), combined with the extensive amount of data generated by today's clinical systems, have led to the development of imaging AI solutions across the whole value chain of medical imaging, including image reconstruction, medical image segmentation, image-based diagnosis and treatment planning. Notwithstanding the successes and future potential of AI in medical imaging, many stakeholders are concerned about the potential risks and ethical implications of imaging AI solutions, which are perceived as complex, opaque, and difficult to comprehend, utilise, and trust in critical clinical applications. Despite these concerns and risks, there are currently no concrete guidelines and best practices for guiding future AI developments in medical imaging towards increased trust, safety and adoption. To bridge this gap, this paper introduces a careful selection of guiding principles drawn from the accumulated experiences, consensus, and best practices of five large European projects on AI in Health Imaging. These guiding principles are named FUTURE-AI, and their building blocks consist of (i) Fairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness and (vi) Explainability. In a step-by-step approach, these guidelines are further translated into a framework of concrete recommendations for specifying, developing, evaluating, and deploying technically, clinically and ethically trustworthy AI solutions into clinical practice.
    CI-dataset and DetDSCI methodology for detecting too small and too large critical infrastructures in satellite images: Airports and electrical substations as case study. (arXiv:2105.11844v2 [cs.CV] UPDATED)
    (0 min) The detection of critical infrastructures in large territories represented by aerial and satellite images is of high importance in several fields, such as security, anomaly detection, land use planning and land use change detection. However, detecting such infrastructures is complex because they have highly variable shapes and sizes: some infrastructures, such as electrical substations, are very small, while others, such as airports, are very large. Moreover, airports can have either a small or a very large surface area with completely different shapes, which makes their correct detection challenging. As far as we know, these limitations have not been tackled in previous works. This paper presents (1) a smart Critical Infrastructure dataset, named CI-dataset, organised into two scales, small- and large-scale critical infrastructures, and (2) a two-level resolution-independent critical infrastructure detection (DetDSCI) methodology that first determines the spatial resolution of the input image using a classification model, then analyses the image using the appropriate detector for that spatial resolution. The present study targets two representative classes, airports and electrical substations. Our experiments show that the DetDSCI methodology achieves up to 37.53% F1 improvement with respect to Faster R-CNN, one of the most influential detection models.
    Adaptive Reliability Analysis for Multi-fidelity Models using a Collective Learning Strategy. (arXiv:2109.10219v1 [cs.LG])
    (0 min) In many fields of science and engineering, models with different fidelities are available. Physical experiments or detailed simulations that accurately capture the behavior of the system are regarded as high-fidelity models with low model uncertainty; however, they are expensive to run. On the other hand, simplified physical experiments or numerical models are seen as low-fidelity models that are cheaper to evaluate. Although low-fidelity models are often not suitable for direct use in reliability analysis due to their low accuracy, they can offer information about the trend of the high-fidelity model, thus providing the opportunity to explore the design space at a low cost. This study presents a new approach called adaptive multi-fidelity Gaussian process for reliability analysis (AMGPRA). Contrary to selecting training points and information sources in two separate stages, as done in the state-of-the-art mfEGRA method, the proposed approach finds the optimal training point and information source simultaneously using the novel collective learning function (CLF). CLF is able to assess the global impact of a candidate training point from an information source, and it accommodates any learning function that satisfies a certain profile. In this context, CLF provides a new direction for quantifying the impact of new training points and can be easily extended with new learning functions to adapt to different reliability problems. The performance of the proposed method is demonstrated by three mathematical examples and one engineering problem concerning the wind reliability of transmission towers. It is shown that the proposed method achieves similar or higher accuracy with reduced computational costs compared to state-of-the-art single and multi-fidelity methods. A key application of AMGPRA is high-fidelity fragility modeling using complex and costly physics-based computational models.
    An Optimal Control Framework for Joint-channel Parallel MRI Reconstruction without Coil Sensitivities. (arXiv:2109.09738v1 [eess.IV])
    (0 min) Goal: This work aims at developing a novel calibration-free fast parallel MRI (pMRI) reconstruction method built on a discrete-time optimal control framework. The reconstruction model is designed to learn a regularization that combines channels and extracts features by leveraging the information shared among channels of multi-coil images. We propose to recover both magnitude and phase information by taking advantage of structured multilayer convolutional networks in image and Fourier spaces. Methods: We develop a novel variational model with a learnable objective function that integrates an adaptive multi-coil image combination operator and effective image regularization in the image and Fourier spaces. We cast the reconstruction network as a structured discrete-time optimal control system, resulting in an optimal control formulation of parameter training where the parameters of the objective function play the role of control variables. We demonstrate that the Lagrangian method for solving the control problem is equivalent to back-propagation, ensuring the local convergence of the training algorithm. Results: We conduct a large number of numerical experiments of the proposed method with comparisons to several state-of-the-art pMRI reconstruction networks on real pMRI datasets. The numerical results clearly demonstrate the promising performance of the proposed method. Conclusion: The proposed method provides a general deep network design and training framework for efficient joint-channel pMRI reconstruction. Significance: By learning a multi-coil image combination operator and performing regularization in both the image domain and k-space domain, the proposed method achieves a highly efficient image reconstruction network for pMRI.
    Inconsistency in Conference Peer Review: Revisiting the 2014 NeurIPS Experiment. (arXiv:2109.09774v1 [cs.DL])
    (0 min) In this paper we revisit the 2014 NeurIPS experiment that examined inconsistency in conference peer review. We determine that 50% of the variation in reviewer quality scores was subjective in origin. Further, with seven years having passed since the experiment, we find that for accepted papers, there is no correlation between quality scores and the impact of the paper as measured by citation count. We trace the fate of rejected papers, recovering where these papers were eventually published. For these papers we find a correlation between quality scores and impact. We conclude that the reviewing process for the 2014 conference was good at identifying poor papers, but poor at identifying good papers. We give some suggestions for improving the reviewing process but also warn against removing the subjective element. Finally, we suggest that the real conclusion of the experiment is that the community should place less onus on the notion of 'top-tier conference publications' when assessing the quality of individual researchers. For NeurIPS 2021, the PCs are repeating the experiment, as well as conducting new ones.
    Fast TreeSHAP: Accelerating SHAP Value Computation for Trees. (arXiv:2109.09847v1 [cs.LG])
    (0 min) SHAP (SHapley Additive exPlanation) values are one of the leading tools for interpreting machine learning models, with strong theoretical guarantees (consistency, local accuracy) and a wide availability of implementations and use cases. Even though computing SHAP values takes exponential time in general, TreeSHAP takes polynomial time on tree-based models. While the speedup is significant, TreeSHAP can still dominate the computation time of industry-level machine learning solutions on datasets with millions or more entries, causing delays in post-hoc model diagnosis and interpretation service. In this paper we present two new algorithms, Fast TreeSHAP v1 and v2, designed to improve the computational efficiency of TreeSHAP for large datasets. We empirically find that Fast TreeSHAP v1 is 1.5x faster than TreeSHAP while keeping the memory cost unchanged. Similarly, Fast TreeSHAP v2 is 2.5x faster than TreeSHAP, at the cost of a slightly higher memory usage, thanks to the pre-computation of expensive TreeSHAP steps. We also show that Fast TreeSHAP v2 is well-suited for multi-time model interpretations, resulting in as high as 3x faster explanation of newly incoming samples.
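    For orientation, this is what a baseline TreeSHAP computation looks like with the widely used shap package (assuming it is installed); the paper's Fast TreeSHAP v1/v2 algorithms target the same values with lower runtime and are not part of this sketch.

        import numpy as np
        import shap                                  # pip install shap
        from sklearn.ensemble import RandomForestRegressor

        X = np.random.rand(500, 8)
        y = X[:, 0] * 2.0 + np.sin(X[:, 1] * 3.0) + 0.1 * np.random.randn(500)

        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

        # Standard (polynomial-time) TreeSHAP; Fast TreeSHAP v1/v2 compute the same values
        # but reduce runtime on large datasets by pre-computing expensive steps.
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)       # shape: (n_samples, n_features)
        print(shap_values.shape)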
    KDFNet: Learning Keypoint Distance Field for 6D Object Pose Estimation. (arXiv:2109.10127v1 [cs.CV])
    (0 min) We present KDFNet, a novel method for 6D object pose estimation from RGB images. To handle occlusion, many recent works have proposed to localize 2D keypoints through pixel-wise voting and solve a Perspective-n-Point (PnP) problem for pose estimation, which achieves leading performance. However, such a voting process is direction-based and cannot handle long and thin objects, for which the direction intersections cannot be robustly found. To address this problem, we propose a novel continuous representation called Keypoint Distance Field (KDF) for projected 2D keypoint locations. Formulated as a 2D array, each element of the KDF stores the 2D Euclidean distance between the corresponding image pixel and a specified projected 2D keypoint. We use a fully convolutional neural network to regress the KDF for each keypoint. Using this KDF encoding of projected object keypoint locations, we propose a distance-based voting scheme to localize the keypoints by calculating circle intersections in a RANSAC fashion. We validate the design choices of our framework by extensive ablation experiments. Our proposed method achieves state-of-the-art performance on the Occlusion LINEMOD dataset with an average ADD(-S) accuracy of 50.3% and on the TOD dataset mug subset with an average ADD accuracy of 75.72%. Extensive experiments and visualizations demonstrate that the proposed method is able to robustly estimate the 6D pose in challenging scenarios including occlusion.
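    A minimal sketch of the KDF representation itself (our own illustrative code; the regression network and the RANSAC circle-intersection voting are omitted).

        import numpy as np

        def keypoint_distance_field(height, width, keypoint_xy):
            """KDF: each pixel stores its 2D Euclidean distance to a projected keypoint."""
            kx, ky = keypoint_xy
            ys, xs = np.mgrid[0:height, 0:width]
            return np.sqrt((xs - kx) ** 2 + (ys - ky) ** 2)

        kdf = keypoint_distance_field(480, 640, keypoint_xy=(320.5, 200.0))
        # A fully convolutional network regresses one such field per keypoint; a distance-based
        # RANSAC vote then localizes the keypoint at the intersection of circles of radius kdf.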
    Neural Distance Embeddings for Biological Sequences. (arXiv:2109.09740v1 [q-bio.QM])
    (0 min) The development of data-dependent heuristics and representations for biological sequences that reflect their evolutionary distance is critical for large-scale biological research. However, popular machine learning approaches, based on continuous Euclidean spaces, have struggled with the discrete combinatorial formulation of the edit distance that models evolution and the hierarchical relationship that characterises real-world datasets. We present Neural Distance Embeddings (NeuroSEED), a general framework to embed sequences in geometric vector spaces, and illustrate the effectiveness of the hyperbolic space that captures the hierarchical structure and provides an average 22% reduction in embedding RMSE against the best competing geometry. The capacity of the framework and the significance of these improvements are then demonstrated devising supervised and unsupervised NeuroSEED approaches to multiple core tasks in bioinformatics. Benchmarked with common baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets. As an example for hierarchical clustering, the proposed pretrained and from-scratch methods match the quality of competing baselines with 30x and 15x runtime reduction, respectively.
    DeepTimeAnomalyViz: A Tool for Visualizing and Post-processing Deep Learning Anomaly Detection Results for Industrial Time-Series. (arXiv:2109.10082v1 [cs.LG])
    (0 min) Industrial processes are monitored by a large number of various sensors that produce time-series data. Deep Learning offers the possibility of creating anomaly detection methods that can aid in preventing malfunctions and increasing efficiency. But creating such a solution can be a complicated task, with factors such as inference speed, amount of available data, number of sensors, and many more influencing the feasibility of such an implementation. We introduce the DeTAVIZ interface, a web-browser-based visualization tool for quick exploration and assessment of the feasibility of DL-based anomaly detection in a given problem. Provided with a pool of pretrained models and simulation results, DeTAVIZ allows the user to easily and quickly iterate through multiple post-processing options, compare different models, and manually optimise towards a chosen metric.
    Multifield Cosmology with Artificial Intelligence. (arXiv:2109.09747v1 [astro-ph.CO])
    (0 min) Astrophysical processes such as feedback from supernovae and active galactic nuclei modify the properties and spatial distribution of dark matter, gas, and galaxies in a poorly understood way. This uncertainty is one of the main theoretical obstacles to extracting information from cosmological surveys. We use 2,000 state-of-the-art hydrodynamic simulations from the CAMELS project spanning a wide variety of cosmological and astrophysical models and generate hundreds of thousands of 2-dimensional maps for 13 different fields: from dark matter to gas and stellar properties. We use these maps to train convolutional neural networks to extract the maximum amount of cosmological information while marginalizing over astrophysical effects at the field level. Although our maps only cover a small area of $(25~h^{-1}{\rm Mpc})^2$, and the different fields are contaminated by astrophysical effects in very different ways, our networks can infer the values of $\Omega_{\rm m}$ and $\sigma_8$ with a few percent level precision for most of the fields. We find that the marginalization performed by the network retains a wealth of cosmological information compared to a model trained on maps from gravity-only N-body simulations that are not contaminated by astrophysical effects. Finally, we train our networks on multifields -- 2D maps that contain several fields as different colors or channels -- and find that they can not only infer the value of all parameters with higher accuracy than networks trained on individual fields, but also constrain the value of $\Omega_{\rm m}$ with higher accuracy than the maps from the N-body simulations.
    A Simple Unified Framework for Anomaly Detection in Deep Reinforcement Learning. (arXiv:2109.09889v1 [cs.LG])
    (0 min) Abnormal states in deep reinforcement learning (RL) are states that are beyond the scope of an RL policy. Such states may make the RL system unsafe and impede its deployment in real scenarios. In this paper, we propose a simple yet effective anomaly detection framework for deep RL algorithms that simultaneously considers random, adversarial and out-of-distribution (OOD) state outliers. In particular, we attain the class-conditional distributions for each action class under the Gaussian assumption, and rely on these distributions to discriminate between inliers and outliers based on Mahalanobis Distance (MD) and Robust Mahalanobis Distance. We conduct extensive experiments on Atari games that verify the effectiveness of our detection strategies. To the best of our knowledge, we present the first detailed study of statistical and adversarial anomaly detection in deep RL algorithms. This simple unified anomaly detection paves the way towards deploying safe RL systems in real-world applications.
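    A minimal sketch of the detection idea: fit class-conditional Gaussians over feature vectors (one class per action) and score new states by their minimum Mahalanobis distance. The function names and the shared-covariance choice are our own assumptions, not the paper's exact formulation.

        import numpy as np

        def fit_class_gaussians(features, actions, n_actions, eps=1e-6):
            """Per-action mean and (shared) covariance under the Gaussian assumption."""
            means = np.stack([features[actions == a].mean(axis=0) for a in range(n_actions)])
            centered = features - means[actions]
            cov = centered.T @ centered / len(features) + eps * np.eye(features.shape[1])
            return means, np.linalg.inv(cov)

        def mahalanobis_score(x, means, cov_inv):
            """Anomaly score: minimum squared Mahalanobis distance over action classes."""
            diffs = means - x
            return np.min(np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs))

        feats = np.random.randn(1000, 16)              # stand-in for policy-network features
        acts = np.random.randint(0, 4, size=1000)      # actions taken in those states
        means, cov_inv = fit_class_gaussians(feats, acts, n_actions=4)
        print(mahalanobis_score(np.random.randn(16), means, cov_inv))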
    Multitask learning over graphs: An Approach for Distributed, Streaming Machine Learning. (arXiv:2001.02112v2 [eess.SP] UPDATED)
    (0 min) The problem of learning simultaneously several related tasks has received considerable attention in several domains, especially in machine learning, with the so-called multitask learning problem or learning-to-learn problem [1], [2]. Multitask learning is an approach to inductive transfer learning (using what is learned for one problem to assist in another problem) and helps improve generalization performance relative to learning each task separately by using the domain information contained in the training signals of related tasks as an inductive bias. Several strategies have been derived within this community under the assumption that all data are available beforehand at a fusion center. However, recent years have witnessed an increasing ability to collect data in a distributed and streaming manner. This requires the design of new strategies for learning jointly multiple tasks from streaming data over distributed (or networked) systems. This article provides an overview of multitask strategies for learning and adaptation over networks. The working hypothesis for these strategies is that agents are allowed to cooperate with each other in order to learn distinct, though related, tasks. The article shows how cooperation steers the network limiting point and how different cooperation rules allow promoting different task relatedness models. It also explains how and when cooperation over multitask networks outperforms non-cooperative strategies.
    Improving Span Representation for Domain-adapted Coreference Resolution. (arXiv:2109.09811v1 [cs.LG])
    (0 min) Recent work has shown that fine-tuning neural coreference models can produce strong performance when adapting to different domains. However, this can require a large amount of annotated target examples. In this work, we focus on supervised domain adaptation for clinical notes, proposing the use of concept knowledge to more efficiently adapt coreference models to a new domain. We develop methods to improve the span representations via (1) a retrofitting loss that incentivizes span representations to satisfy a knowledge-based distance function and (2) a scaffolding loss that guides the recovery of knowledge from the span representation. By integrating these losses, our model is able to improve our baseline precision and F1 score. In particular, we show that incorporating knowledge with end-to-end coreference models results in better performance on the most challenging, domain-specific spans.
    IgNet. A Super-precise Convolutional Neural Network. (arXiv:2109.09939v1 [cs.CV])
    (0 min) Convolutional neural networks (CNNs) are known to be an effective means to detect and analyze images. Their power is essentially based on the ability to extract common image features. There exist, however, images involving unique, irregular features or details. One such case is a collection of unusual children's drawings reflecting the children's imagination and individuality. These drawings were analyzed with a CNN constructed using Keras-TensorFlow. The same problem - on a significantly higher level - was solved with a newly developed family of networks called IgNet, described in this paper. It proved able to learn all the categorical characteristics of the drawings with 100% accuracy. In the case of a regression task (learning the young artists' ages), IgNet performed with an error of no more than 0.4%. The principles of the IgNet design that made it possible to reach such substantial results with a rather simple network topology are discussed.
    Measuring Ethics in AI with AI: A Methodology and Dataset Construction. (arXiv:2107.11913v2 [cs.AI] UPDATED)
    (0 min) Recently, the use of sound measures and metrics in Artificial Intelligence has become a subject of interest for academia, government, and industry. Efforts towards measuring different phenomena have gained traction in the AI community, as illustrated by the publication of several influential field reports and policy documents. These metrics are designed to help decision makers inform themselves about the fast-moving and impactful influences of key advances in Artificial Intelligence in general and Machine Learning in particular. In this paper we propose to use such newfound capabilities of AI technologies to augment our AI measuring capabilities. We do so by training a model to classify publications related to ethical issues and concerns. In our methodology we use an expert, manually curated dataset as the training set and then evaluate a large set of research papers. Finally, we highlight the implications of AI metrics, in particular their contribution towards developing trustful and fair AI-based tools and technologies. Keywords: AI Ethics; AI Fairness; AI Measurement; Ethics in Computer Science.
    LS3: Latent Space Safe Sets for Long-Horizon Visuomotor Control of Sparse Reward Iterative Tasks. (arXiv:2107.04775v2 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) has shown impressive success in exploring high-dimensional environments to learn complex tasks, but can often exhibit unsafe behaviors and require extensive environment interaction when exploration is unconstrained. A promising strategy for learning in dynamically uncertain environments is requiring that the agent can robustly return to learned safe sets, where task success (and therefore safety) can be guaranteed. While this approach has been successful in low-dimensions, enforcing this constraint in environments with visual observations is exceedingly challenging. We present a novel continuous representation for safe sets by framing it as a binary classification problem in a learned latent space, which flexibly scales to image observations. We then present a new algorithm, Latent Space Safe Sets (LS3), which uses this representation for long-horizon tasks with sparse rewards. We evaluate LS3 on 4 domains, including a challenging sequential pushing task in simulation and a physical cable routing task. We find that LS3 can use prior task successes to restrict exploration and learn more efficiently than prior algorithms while satisfying constraints. See https://tinyurl.com/latent-ss for code and supplementary material.
    Online Multi-horizon Transaction Metric Estimation with Multi-modal Learning in Payment Networks. (arXiv:2109.10020v1 [cs.LG])
    (0 min) Predicting metrics associated with entities' transactional behavior within payment processing networks is essential for system monitoring. Multivariate time series, aggregated from past transaction history, can provide valuable insights for such prediction. The general multivariate time series prediction problem has been well studied and applied across several domains, including manufacturing, medicine, and entomology. However, new domain-related challenges associated with the data, such as concept drift and multi-modality, have surfaced in addition to the real-time requirements of handling payment transaction data at scale. In this work, we study the problem of multivariate time series prediction for estimating transaction metrics associated with entities in a payment transaction database. We propose a model with five unique components to estimate the transaction metrics from multi-modality data. Four of these components capture interaction, temporal, scale, and shape perspectives, and the fifth component fuses these perspectives together. We also propose a hybrid offline/online training scheme to address concept drift in the data and fulfill the real-time requirements. Combining the estimation model with a graphical user interface, the prototype transaction metric estimation system has demonstrated its potential benefit as a tool for improving a payment processing company's system monitoring capability.
    Learning with Density Matrices and Random Features. (arXiv:2102.04394v3 [cs.LG] UPDATED)
    (2 min) A density matrix describes the statistical state of a quantum system. It is a powerful formalism to represent both the quantum and classical uncertainty of quantum systems and to express different statistical operations such as measurement, system combination and expectations as linear algebra operations. This paper explores how density matrices can be used as a building block to build machine learning models exploiting their ability to straightforwardly combine linear algebra and probability. One of the main results of the paper is to show that density matrices coupled with random Fourier features could approximate arbitrary probability distributions over $\mathbb{R}^n$. Based on this finding the paper builds different models for density estimation, classification and regression. These models are differentiable, so it is possible to integrate them with other differentiable components, such as deep learning architectures and to learn their parameters using gradient-based optimization. In addition, the paper presents optimization-less training strategies based on estimation and model averaging. The models are evaluated in benchmark tasks and the results are reported and discussed.
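    A minimal numpy sketch of the core construction, assuming a Gaussian-kernel random Fourier feature map and an unnormalized density score; the paper's differentiable training, normalization, and model variants are not reproduced.

        import numpy as np

        rng = np.random.default_rng(0)
        D, d, gamma = 256, 2, 2.0                    # feature dim, input dim, kernel width

        W = rng.normal(0.0, np.sqrt(2 * gamma), size=(D, d))
        b = rng.uniform(0.0, 2 * np.pi, size=D)

        def phi(x):
            """Random Fourier feature map, normalized to a unit-norm 'quantum state'."""
            z = np.sqrt(2.0 / D) * np.cos(x @ W.T + b)
            return z / np.linalg.norm(z, axis=-1, keepdims=True)

        X = rng.normal(size=(2000, d))               # training samples
        rho = phi(X).T @ phi(X) / len(X)             # density matrix: average of outer products

        def density_score(x):
            """Unnormalized density estimate: expectation of the projector onto phi(x)."""
            z = phi(np.atleast_2d(x))
            return np.einsum('ij,jk,ik->i', z, rho, z)

        print(density_score(np.zeros(d)), density_score(np.array([4.0, 4.0])))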
    Mine Your Own vieW: Self-Supervised Learning Through Across-Sample Prediction. (arXiv:2102.10106v2 [cs.LG] UPDATED)
    (2 min) State-of-the-art methods for self-supervised learning (SSL) build representations by maximizing the similarity between different transformed "views" of a sample. Without sufficient diversity in the transformations used to create views, however, it can be difficult to overcome nuisance variables in the data and build rich representations. This motivates the use of the dataset itself to find similar, yet distinct, samples to serve as views for one another. In this paper, we introduce Mine Your Own vieW (MYOW), a new approach for self-supervised learning that looks within the dataset to define diverse targets for prediction. The idea behind our approach is to actively mine views, finding samples that are neighbors in the representation space of the network, and then predict, from one sample's latent representation, the representation of a nearby sample. After showing the promise of MYOW on benchmarks used in computer vision, we highlight the power of this idea in a novel application in neuroscience where SSL has yet to be applied. When tested on multi-unit neural recordings, we find that MYOW outperforms other self-supervised approaches in all examples (in some cases by more than 10%), and often surpasses the supervised baseline. With MYOW, we show that it is possible to harness the diversity of the data to build rich views and leverage self-supervision in new domains where augmentations are limited or unknown.
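    A minimal sketch of the across-sample view-mining step, using scikit-learn's NearestNeighbors as a stand-in for the paper's mining procedure (the encoder, predictor, and training loop are omitted).

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def mine_views(representations, k=1):
            """For each sample, return indices of its k nearest (distinct) neighbors in
            representation space, to be used as across-sample prediction targets."""
            nn = NearestNeighbors(n_neighbors=k + 1).fit(representations)
            _, idx = nn.kneighbors(representations)
            return idx[:, 1:]                        # drop column 0 (the sample itself)

        reps = np.random.randn(512, 64)              # encoder outputs for a batch or memory bank
        neighbor_idx = mine_views(reps, k=1)
        # Training then predicts reps[neighbor_idx[i, 0]] from sample i's latent representation.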
    Search For Deep Graph Neural Networks. (arXiv:2109.10047v1 [cs.LG])
    (0 min) Current GNN-oriented NAS methods focus on the search for different layer aggregation components with shallow and simple architectures, which are limited by the 'over-smoothing' problem. To further explore the benefits from structural diversity and depth of GNN architectures, we propose a GNN generation pipeline with a novel two-stage search space, which aims at automatically generating high-performance yet transferable deep GNN models in a block-wise manner. Meanwhile, to alleviate the 'over-smoothing' problem, we incorporate multiple flexible residual connections in our search space and apply identity mapping in the basic GNN layers. For the search algorithm, we use deep Q-learning with an epsilon-greedy exploration strategy and reward reshaping. Extensive experiments on real-world datasets show that our generated GNN models outperform existing manually designed and NAS-based ones.
    Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study. (arXiv:2106.09700v2 [cs.CL] UPDATED)
    (2 min) Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.
    Feature Correlation Aggregation: on the Path to Better Graph Neural Networks. (arXiv:2109.09300v1 [cs.LG] CROSS LISTED)
    (2 min) Prior to the introduction of Graph Neural Networks (GNNs), modeling and analyzing irregular data, particularly graphs, was thought to be the Achilles' heel of deep learning. The core concept of GNNs is to find a representation by recursively aggregating the representations of a central node and those of its neighbors, and its success has been demonstrated by many GNN designs. However, most of them only focus on using the first-order information between a node and its neighbors. In this paper, we introduce a central node permutation variant function through a frustratingly simple and innocent-looking modification to the core operation of a GNN, namely the Feature cOrrelation aGgregation (FOG) module, which learns second-order information from the feature correlation between a node and its neighbors in the pipeline. By adding FOG into existing variants of GNNs, we empirically verify that this second-order information complements the features generated by original GNNs across a broad set of benchmarks. A tangible boost in performance of the model is observed, where the model surpasses previous state-of-the-art results by a significant margin while employing fewer parameters (e.g., a 33.116% improvement on a real-world molecular dataset using graph convolutional networks).
    Learning PAC-Bayes Priors for Probabilistic Neural Networks. (arXiv:2109.10304v1 [cs.LG])
    (0 min) Recent works have investigated deep learning models trained by optimising PAC-Bayes bounds, with priors that are learnt on subsets of the data. This combination has been shown to lead not only to accurate classifiers, but also to remarkably tight risk certificates, bearing promise towards self-certified learning (i.e. use all the data to learn a predictor and certify its quality). In this work, we empirically investigate the role of the prior. We experiment on 6 datasets with different strategies and amounts of data to learn data-dependent PAC-Bayes priors, and we compare them in terms of their effect on test performance of the learnt predictors and tightness of their risk certificate. We ask what is the optimal amount of data which should be allocated for building the prior and show that the optimum may be dataset dependent. We demonstrate that using a small percentage of the prior-building data for validation of the prior leads to promising results. We include a comparison of underparameterised and overparameterised models, along with an empirical study of different training objectives and regularisation strategies to learn the prior distribution.
    Online Deterministic Annealing for Classification and Clustering. (arXiv:2102.05836v3 [cs.LG] UPDATED)
    (2 min) Inherent in virtually every iterative machine learning algorithm is the problem of hyper-parameter tuning, which includes three major design parameters: (a) the complexity of the model, e.g., the number of neurons in a neural network, (b) the initial conditions, which heavily affect the behavior of the algorithm, and (c) the dissimilarity measure used to quantify its performance. We introduce an online prototype-based learning algorithm that can be viewed as a progressively growing competitive-learning neural network architecture for classification and clustering. The learning rule of the proposed approach is formulated as an online gradient-free stochastic approximation algorithm that solves a sequence of appropriately defined optimization problems, simulating an annealing process. The annealing nature of the algorithm contributes to avoiding poor local minima, offers robustness with respect to the initial conditions, and provides a means to progressively increase the complexity of the learning model, through an intuitive bifurcation phenomenon. The proposed approach is interpretable, requires minimal hyper-parameter tuning, and allows online control over the performance-complexity trade-off. Finally, we show that Bregman divergences appear naturally as a family of dissimilarity measures that play a central role in both the performance and the computational complexity of the learning algorithm. Experimental results illustrate the properties and evaluate the performance of the proposed learning algorithm.
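    A minimal sketch of a single online competitive-learning prototype update with a squared-Euclidean (Bregman) dissimilarity; the annealing schedule and the bifurcation mechanism of the paper are not shown, and the learning rate is an arbitrary choice.

        import numpy as np

        def online_prototype_step(prototypes, x, lr=0.05):
            """One stochastic competitive-learning step: move the closest prototype toward x.
            Squared Euclidean distance is one member of the Bregman family discussed above."""
            d = np.sum((prototypes - x) ** 2, axis=1)
            winner = np.argmin(d)
            prototypes[winner] += lr * (x - prototypes[winner])
            return prototypes

        rng = np.random.default_rng(0)
        protos = rng.normal(size=(4, 2))             # initial prototypes
        for x in rng.normal(size=(1000, 2)):         # stream of observations
            protos = online_prototype_step(protos, x)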
    Recurrent Localization Networks applied to the Lippmann-Schwinger Equation. (arXiv:2102.00063v2 [physics.comp-ph] UPDATED)
    (2 min) The bulk of computational approaches for modeling physical systems in materials science derive from either analytical (i.e. physics based) or data-driven (i.e. machine-learning based) origins. In order to combine the strengths of these two approaches, we advance a novel machine learning approach for solving equations of the generalized Lippmann-Schwinger (L-S) type. In this paradigm, a given problem is converted into an equivalent L-S equation and solved as an optimization problem, where the optimization procedure is calibrated to the problem at hand. As part of a learning-based loop unrolling, we use a recurrent convolutional neural network to iteratively solve the governing equations for a field of interest. This architecture leverages the generalizability and computational efficiency of machine learning approaches, but also permits a physics-based interpretation. We demonstrate our learning approach on the two-phase elastic localization problem, where it achieves excellent accuracy on the predictions of the local (i.e., voxel-level) elastic strains. Since numerous governing equations can be converted into an equivalent L-S form, the proposed architecture has potential applications across a range of multiscale materials phenomena.
    Uncertainty Toolbox: an Open-Source Library for Assessing, Visualizing, and Improving Uncertainty Quantification. (arXiv:2109.10254v1 [cs.LG])
    (2 min) With increasing deployment of machine learning systems in various real-world tasks, there is a greater need for accurate quantification of predictive uncertainty. While the common goal in uncertainty quantification (UQ) in machine learning is to approximate the true distribution of the target data, many works in UQ tend to be disjoint in the evaluation metrics utilized, and disparate implementations for each metric lead to numerical results that are not directly comparable across different works. To address this, we introduce Uncertainty Toolbox, an open-source python library that helps to assess, visualize, and improve UQ. Uncertainty Toolbox additionally provides pedagogical resources, such as a glossary of key terms and an organized collection of key paper references. We hope that this toolbox is useful for accelerating and uniting research efforts in uncertainty in machine learning.
    Transferability of Graph Neural Networks: an Extended Graphon Approach. (arXiv:2109.10096v1 [cs.LG])
    (0 min) We study spectral graph convolutional neural networks (GCNNs), where filters are defined as continuous functions of the graph shift operator (GSO) through functional calculus. A spectral GCNN is not tailored to one specific graph and can be transferred between different graphs. It is hence important to study the GCNN transferability: the capacity of the network to have approximately the same repercussion on different graphs that represent the same phenomenon. Transferability ensures that GCNNs trained on certain graphs generalize if the graphs in the test set represent the same phenomena as the graphs in the training set. In this paper, we consider a model of transferability based on graphon analysis. Graphons are limit objects of graphs, and, in the graph paradigm, two graphs represent the same phenomenon if both approximate the same graphon. Our main contributions can be summarized as follows: 1) we prove that any fixed GCNN with continuous filters is transferable under graphs that approximate the same graphon, 2) we prove transferability for graphs that approximate unbounded graphon shift operators, which are defined in this paper, and, 3) we obtain non-asymptotic approximation results, proving linear stability of GCNNs. This extends current state-of-the-art results which show asymptotic transferability for polynomial filters under graphs that approximate bounded graphons.
    Whittle index based Q-learning for restless bandits with average reward. (arXiv:2004.14427v3 [cs.LG] UPDATED)
    (2 min) A novel reinforcement learning algorithm is introduced for multiarmed restless bandits with average reward, using the paradigms of Q-learning and Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments. The numerical experiments show excellent empirical performance of the proposed scheme.
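    For context, a generic tabular average-reward (differential) Q-learning update is sketched below; the Whittle-index structure that the paper exploits to shrink the search space is not reproduced, and all names and step sizes are illustrative.

        import numpy as np

        def differential_q_step(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
            """One tabular average-reward Q-learning update: the temporal-difference error
            subtracts a running estimate rho of the average reward per step."""
            td = r - rho + np.max(Q[s_next]) - Q[s, a]
            Q[s, a] += alpha * td
            rho += beta * td                      # update the average-reward estimate
            return Q, rho

        n_states, n_actions = 10, 2
        Q, rho = np.zeros((n_states, n_actions)), 0.0
        Q, rho = differential_q_step(Q, rho, s=0, a=1, r=1.0, s_next=3)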
    Quantifying and Mitigating Privacy Risks of Contrastive Learning. (arXiv:2102.04140v2 [cs.LG] UPDATED)
    (0 min) Data is the key factor that has driven the development of machine learning (ML) during the past decade. However, high-quality data, in particular labeled data, is often hard and expensive to collect. To leverage large-scale unlabeled data, self-supervised learning, represented by contrastive learning, is introduced. The objective of contrastive learning is to map different views derived from a training sample (e.g., through data augmentation) closer in their representation space, while mapping views derived from different samples more distant. In this way, a contrastive model learns to generate informative representations for data samples, which are then used to perform downstream ML tasks. Recent research has shown that machine learning models are vulnerable to various privacy attacks. However, most of the current efforts concentrate on models trained with supervised learning. Meanwhile, data samples' informative representations learned with contrastive learning may cause severe privacy risks as well. In this paper, we perform the first privacy analysis of contrastive learning through the lens of membership inference and attribute inference. Our experimental results show that contrastive models trained on image datasets are less vulnerable to membership inference attacks but more vulnerable to attribute inference attacks compared to supervised models. The former is due to the fact that contrastive models are less prone to overfitting, while the latter is caused by contrastive models' capability of representing data samples expressively. To remedy this situation, we propose the first privacy-preserving contrastive learning mechanism, Talos, relying on adversarial training. Empirical results show that Talos can successfully mitigate attribute inference risks for contrastive models while maintaining their membership privacy and model utility.
    NADE: A Benchmark for Robust Adverse Drug Events Extraction in Face of Negations. (arXiv:2109.10080v1 [cs.CL])
    (0 min) Adverse Drug Event (ADE) extraction models can rapidly examine large collections of social media texts, detecting mentions of drug-related adverse reactions and triggering medical investigations. However, despite the recent advances in NLP, it is currently unknown whether such models are robust in the face of negation, which is pervasive across language varieties. In this paper we evaluate three state-of-the-art systems, showing their fragility against negation, and then we introduce two possible strategies to increase the robustness of these models: a pipeline approach, relying on a specific component for negation detection, and an augmentation of an ADE extraction dataset to artificially create negated samples and further train the models. We show that both strategies bring significant increases in performance, lowering the number of spurious entities predicted by the models. Our dataset and code will be publicly released to encourage research on the topic.
    iRNN: Integer-only Recurrent Neural Network. (arXiv:2109.09828v1 [cs.LG])
    (0 min) Recurrent neural networks (RNNs) are used in many real-world text and speech applications. They include complex modules such as recurrence, exponential-based activation, gate interaction, unfoldable normalization, bi-directional dependence, and attention. The interaction between these elements prevents running them on integer-only operations without a significant performance drop. Deploying RNNs that include layer normalization and attention on integer-only arithmetic is still an open problem. We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach supports layer normalization, attention, and an adaptive piecewise linear approximation of activations, to serve a wide range of RNNs on various applications. The proposed method is proven to work on RNN-based language models and automatic speech recognition. Our iRNN maintains similar performance to its full-precision counterpart, its deployment on smartphones improves the runtime performance by $2\times$, and it reduces the model size by $4\times$.
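    A crude illustration of computing an activation with integer-only arithmetic via a fixed piecewise linear approximation; this is our own stand-in, whereas the paper fits its piecewise approximation adaptively and handles the full quantized RNN pipeline.

        import numpy as np

        def int_pw_sigmoid(x_q, in_scale=8, out_scale=256):
            """Piecewise-linear sigmoid stand-in using only integer arithmetic.
            x_q is a fixed-point integer with real value x = x_q / in_scale; the result is an
            integer approximating out_scale * sigmoid(x) via out_scale * clip(x/4 + 1/2, 0, 1)."""
            y = (x_q * out_scale) // (4 * in_scale) + out_scale // 2
            return np.clip(y, 0, out_scale).astype(np.int32)

        x_q = np.array([-64, -8, 0, 8, 64], dtype=np.int32)   # real values -8, -1, 0, 1, 8
        print(int_pw_sigmoid(x_q))                            # [0, 64, 128, 192, 256]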
    Short-term traffic prediction using physics-aware neural networks. (arXiv:2109.10253v1 [cs.LG])
    (2 min) In this work, we propose an algorithm performing short-term predictions of the flux of vehicles on a stretch of road, using past measurements of the flux. This algorithm is based on a physics-aware recurrent neural network. A discretization of a macroscopic traffic flow model (using the so-called Traffic Reaction Model) is embedded in the architecture of the network and yields flux predictions based on estimated and predicted space-time dependent traffic parameters. These parameters are themselves obtained using a succession of LSTMs and simple recurrent neural networks. Besides, on top of the predictions, the algorithm yields a smoothing of its inputs which is also physically constrained by the macroscopic traffic flow model. The algorithm is tested on raw flux measurements obtained from loop detectors.
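    For intuition about embedding a discretized macroscopic model, the sketch below advances a generic LWR-type conservation law with a Lax-Friedrichs step; this is not the Traffic Reaction Model used in the paper, and all parameter values are arbitrary.

        import numpy as np

        def lwr_lax_friedrichs_step(rho, dt, dx, v_free=30.0, rho_max=0.2):
            """One Lax-Friedrichs step for the LWR conservation law rho_t + q(rho)_x = 0
            with a Greenshields flux q = v_free * rho * (1 - rho / rho_max).
            Boundary cells are kept fixed for simplicity."""
            q = v_free * rho * (1.0 - rho / rho_max)
            new = rho.copy()
            new[1:-1] = 0.5 * (rho[2:] + rho[:-2]) - dt / (2.0 * dx) * (q[2:] - q[:-2])
            return new

        rho = np.full(100, 0.05)                 # vehicles per meter on 100 road cells
        rho[40:60] = 0.15                        # a denser platoon of vehicles
        for _ in range(50):                      # CFL: dt * v_free / dx = 0.3 < 1
            rho = lwr_lax_friedrichs_step(rho, dt=0.5, dx=50.0)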
    Competing Bandits: The Perils of Exploration Under Competition. (arXiv:2007.10144v5 [cs.GT] UPDATED)
    (3 min) Most online platforms strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We study the interplay between exploration and competition: how such platforms balance the exploration for learning and the competition for users. Here users play three distinct roles: they are customers that generate revenue, they are sources of data for learning, and they are self-interested agents which choose among the competing platforms. We consider a stylized duopoly model in which two firms face the same multi-armed bandit problem. Users arrive one by one and choose between the two firms, so that each firm makes progress on its bandit problem only if it is chosen. Through a mix of theoretical results and numerical simulations, we study whether and to what extent competition incentivizes the adoption of better bandit algorithms, and whether it leads to welfare increases for users. We find that stark competition induces firms to commit to a "greedy" bandit algorithm that leads to low welfare. However, weakening competition by providing firms with some "free" users incentivizes better exploration strategies and increases welfare. We investigate two channels for weakening the competition: relaxing the rationality of users and giving one firm a first-mover advantage. Our findings are closely related to the "competition vs. innovation" relationship, and elucidate the first-mover advantage in the digital economy.
    Open-Retrieval Conversational Machine Reading. (arXiv:2102.08633v2 [cs.CL] UPDATED)
    (2 min) In conversational machine reading, systems need to interpret natural language rules, answer high-level questions such as "May I qualify for VA health care benefits?", and ask follow-up clarification questions whose answer is necessary to answer the original question. However, existing works assume the rule text is provided for each user question, which neglects the essential retrieval step in real scenarios. In this work, we propose and investigate an open-retrieval setting of conversational machine reading. In the open-retrieval setting, the relevant rule texts are unknown so that a system needs to retrieve question-relevant evidence from a collection of rule texts, and answer users' high-level questions according to multiple retrieved rule texts in a conversational manner. We propose MUDERN, a Multi-passage Discourse-aware Entailment Reasoning Network which extracts conditions in the rule texts through discourse segmentation, conducts multi-passage entailment reasoning to answer user questions directly, or asks clarification follow-up questions to inquiry more information. On our created OR-ShARC dataset, MUDERN achieves the state-of-the-art performance, outperforming existing single-passage conversational machine reading models as well as a new multi-passage conversational machine reading baseline by a large margin. In addition, we conduct in-depth analyses to provide new insights into this new setting and our model.
    A Simple Geometric Method for Cross-Lingual Linguistic Transformations with Pre-trained Autoencoders. (arXiv:2104.03630v2 [cs.CL] UPDATED)
    (2 min) Powerful sentence encoders trained for multiple languages are on the rise. These systems are capable of embedding a wide range of linguistic properties into vector representations. While explicit probing tasks can be used to verify the presence of specific linguistic properties, it is unclear whether the vector representations can be manipulated to indirectly steer such properties. For efficient learning, we investigate the use of a geometric mapping in embedding space to transform linguistic properties, without any tuning of the pre-trained sentence encoder or decoder. We validate our approach on three linguistic properties using a pre-trained multilingual autoencoder and analyze the results in both monolingual and cross-lingual settings.
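    A minimal sketch of the general idea, assuming the geometric mapping is a simple translation along a mean-difference direction in embedding space (the paper's exact mapping may differ); the encoder outputs are replaced here by random vectors.

```python
# Minimal sketch (not necessarily the paper's exact mapping): steering a linguistic
# property by translating sentence embeddings along a mean-difference direction.
import numpy as np

def property_offset(embs_with, embs_without):
    """Estimate a direction encoding a binary property (e.g. past vs. present tense)."""
    return embs_with.mean(axis=0) - embs_without.mean(axis=0)

def apply_property(embedding, offset, strength=1.0):
    """Translate an embedding along the property direction before decoding."""
    return embedding + strength * offset

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
embs_with    = rng.normal(size=(100, 512)) + 0.5   # hypothetical "property present" embeddings
embs_without = rng.normal(size=(100, 512))         # hypothetical "property absent" embeddings
offset = property_offset(embs_with, embs_without)
steered = apply_property(rng.normal(size=512), offset)
```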
    mGNN: Generalizing the Graph Neural Networks to the Multilayer Case. (arXiv:2109.10119v1 [cs.LG])
    (2 min) Networks are a powerful tool to model complex systems, and the definition of many Graph Neural Networks (GNN), Deep Learning algorithms that can handle networks, has opened a new way to approach many real-world problems that would otherwise be hard or even intractable to tackle. In this paper, we propose mGNN, a framework meant to generalize GNNs to the case of multi-layer networks, i.e., networks that can model multiple kinds of interactions and relations between nodes. Our approach is general (i.e., not task specific) and has the advantage of extending any type of GNN without any computational overhead. We test the framework on three different tasks (node and network classification, link prediction) to validate it.
    Self-Supervised Action-Space Prediction for Automated Driving. (arXiv:2109.10024v1 [cs.RO])
    (2 min) Making informed driving decisions requires reliable prediction of other vehicles' trajectories. In this paper, we present a novel learned multi-modal trajectory prediction architecture for automated driving. It achieves kinematically feasible predictions by casting the learning problem into the space of accelerations and steering angles -- by performing action-space prediction, we can leverage valuable model knowledge. Additionally, the dimensionality of the action manifold is lower than that of the state manifold, whose intrinsically correlated states are more difficult to capture in a learned manner. For the purpose of action-space prediction, we present the simple Feed-Forward Action-Space Prediction (FFW-ASP) architecture. Then, we build on this notion and introduce the novel Self-Supervised Action-Space Prediction (SSP-ASP) architecture that outputs future environment context features in addition to trajectories. A key element in the self-supervised architecture is that, based on an observed action history and past context features, future context features are predicted prior to future trajectories. The proposed methods are evaluated on real-world datasets containing urban intersections and roundabouts, and show accurate predictions, outperforming state-of-the-art for kinematically feasible predictions in several prediction metrics.
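    A hedged sketch of why action-space predictions are kinematically feasible by construction: integrating predicted accelerations and steering angles through a vehicle model always yields a physically consistent trajectory. The kinematic bicycle model, wheelbase and time step below are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: rolling out predicted accelerations and steering angles with a
# kinematic bicycle model, so every predicted trajectory is feasible by construction.
# Wheelbase and time step are illustrative assumptions, not the paper's values.
import numpy as np

def rollout_bicycle(x, y, yaw, v, accels, steers, wheelbase=2.7, dt=0.1):
    """Integrate action-space predictions (a_t, delta_t) into a state trajectory."""
    traj = []
    for a, delta in zip(accels, steers):
        x   += v * np.cos(yaw) * dt
        y   += v * np.sin(yaw) * dt
        yaw += v / wheelbase * np.tan(delta) * dt
        v   += a * dt
        traj.append((x, y, yaw, v))
    return np.array(traj)

# Example: constant mild acceleration with a slight left turn.
trajectory = rollout_bicycle(0.0, 0.0, 0.0, 10.0,
                             accels=np.full(30, 0.5),
                             steers=np.full(30, 0.02))
```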
    Long-Term Exploration in Persistent MDPs. (arXiv:2109.10173v1 [cs.LG])
    (2 min) Exploration is an essential part of reinforcement learning, and its quality often limits the quality of the learned policy. Hard-exploration environments are defined by a huge state space and sparse rewards. In such conditions, an exhaustive exploration of the environment is often impossible, and the successful training of an agent requires a lot of interaction steps. In this paper, we propose an exploration method called Rollback-Explore (RbExplore), which utilizes the concept of the persistent Markov decision process, in which agents during training can roll back to visited states. We test our algorithm in the hard-exploration Prince of Persia game, without rewards and domain knowledge. At all used levels of the game, our agent outperforms or shows results comparable with state-of-the-art curiosity methods with knowledge-based intrinsic motivation: ICM and RND. An implementation of RbExplore can be found at https://github.com/cds-mipt/RbExplore.
    A Novel Structured Natural Gradient Descent for Deep Learning. (arXiv:2109.10100v1 [cs.LG])
    (2 min) Natural gradient descent (NGD) has provided deep insights and powerful tools for deep neural networks. However, the computation of the Fisher information matrix becomes more and more difficult as the network structure grows large and complex. This paper proposes a new optimization method whose main idea is to accurately replace natural gradient optimization by reconstructing the network. More specifically, we reconstruct the structure of the deep neural network and optimize the new network using traditional gradient descent (GD); optimizing the reconstructed network with GD achieves the effect of natural gradient descent on the original network. Experimental results show that our optimization method can accelerate the convergence of deep network models and achieve better performance than GD while sharing its computational simplicity.
    Data Augmentation Methods for Anaphoric Zero Pronouns. (arXiv:2109.09825v1 [cs.CL])
    (2 min) In pro-drop languages such as Arabic, Chinese, Italian, Japanese, Spanish, and many others, unrealized (null) arguments in certain syntactic positions can refer to a previously introduced entity, and are thus called anaphoric zero pronouns. However, the existing resources for studying anaphoric zero pronoun interpretation are still limited. In this paper, we use five data augmentation methods to generate and detect anaphoric zero pronouns automatically. We use the augmented data as additional training materials for two anaphoric zero pronoun systems for Arabic. Our experimental results show that data augmentation improves the performance of the two systems, surpassing the state-of-the-art results.
    Graph Neural Networks for Graph Drawing. (arXiv:2109.10061v1 [cs.LG])
    (2 min) Graph Drawing techniques have been developed in the last few years with the purpose of producing aesthetically pleasing node-link layouts. Recently, the employment of differentiable loss functions has paved the road to the massive usage of Gradient Descent and related optimization algorithms. In this paper, we propose a novel framework for the development of Graph Neural Drawers (GND), machines that rely on neural computation for constructing efficient and complex maps. GND are Graph Neural Networks (GNNs) whose learning process can be driven by any provided loss function, such as the ones commonly employed in Graph Drawing. Moreover, we prove that this mechanism can be guided by loss functions computed by means of Feedforward Neural Networks, on the basis of supervision hints that express beauty properties, like the minimization of crossing edges. In this context, we show that GNNs can nicely be enriched by positional features to deal also with unlabelled vertexes. We provide a proof-of-concept by constructing a loss function for the edge-crossing and provide quantitative and qualitative comparisons among different GNN models working under the proposed framework.
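    A minimal sketch of the mechanism described above, using a classic differentiable stress loss in place of the paper's learned edge-crossing loss: any model that outputs node coordinates (here the coordinates are optimised directly) can be trained by gradient descent on a Graph-Drawing-style objective.

```python
# Minimal sketch (not the paper's exact GND architecture or loss): node coordinates
# produced by any differentiable model can be trained with a Graph-Drawing objective;
# here a classic stress loss on pairwise distances, optimised directly on coordinates.
import torch

def stress_loss(coords, target_dist):
    """coords: (n, 2) layout; target_dist: (n, n) graph-theoretic distances."""
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)        # pairwise differences (n, n, 2)
    euclid = diff.norm(dim=-1)
    weights = 1.0 / (target_dist + torch.eye(len(coords)))  # avoid division by zero on the diagonal
    return (weights * (euclid - target_dist) ** 2).sum()

# Toy usage: directly optimise coordinates for a 4-cycle.
target = torch.tensor([[0., 1., 2., 1.],
                       [1., 0., 1., 2.],
                       [2., 1., 0., 1.],
                       [1., 2., 1., 0.]])
coords = torch.randn(4, 2, requires_grad=True)
opt = torch.optim.Adam([coords], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = stress_loss(coords, target)
    loss.backward()
    opt.step()
```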
    Signal Classification using Smooth Coefficients of Multiple wavelets. (arXiv:2109.09988v1 [stat.ML])
    (2 min) Classification of time series signals has become an important task with many practical applications. Existing classifiers may classify signals accurately, but that accuracy can decline when a reduced number of attributes is used. Transforming the data and then reducing its dimensionality may improve the quality of the analysis, decrease the time required for classification, and simplify models. We propose an approach that chooses suitable wavelets to transform the data and then combines the outputs of these transforms into a single dataset to which ensemble classifiers are applied. We demonstrate this on different data sets, across different classifiers, and using differing evaluation methods. Our experimental results demonstrate the effectiveness of the proposed technique compared to approaches that use either raw signal data or a single wavelet transform.
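    A hedged sketch of the pipeline, assuming pywt and scikit-learn: each signal is transformed with several wavelets, the smooth (approximation) coefficients are concatenated into a feature matrix, and an ensemble classifier is trained on it. The wavelet families, decomposition level and classifier are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of the pipeline: smooth (approximation) coefficients from several
# wavelets are concatenated and fed to an ensemble classifier. Wavelet choices,
# level and classifier are illustrative assumptions.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def smooth_features(signals, wavelets=("db4", "sym5", "coif3"), level=3):
    feats = []
    for wav in wavelets:
        approx = [pywt.wavedec(s, wav, level=level)[0] for s in signals]  # approximation coefficients
        feats.append(np.vstack(approx))
    return np.hstack(feats)

# Toy usage with synthetic signals from two classes.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
class0 = np.sin(2 * np.pi * 5 * t) + rng.normal(scale=0.3, size=(50, 256))
class1 = np.sin(2 * np.pi * 12 * t) + rng.normal(scale=0.3, size=(50, 256))
X = smooth_features(np.vstack([class0, class1]))
y = np.array([0] * 50 + [1] * 50)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```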
    Survey on Semantic Stereo Matching / Semantic Depth Estimation. (arXiv:2109.10123v1 [cs.CV])
    (2 min) Stereo matching is one of the widely used techniques for inferring depth from stereo images owing to its robustness and speed. It has become one of the major topics of research since it finds its applications in autonomous driving, robotic navigation, 3D reconstruction, and many other fields. Finding pixel correspondences in non-textured, occluded and reflective areas is the major challenge in stereo matching. Recent developments have shown that semantic cues from image segmentation can be used to improve the results of stereo matching. Many deep neural network architectures have been proposed to leverage the advantages of semantic segmentation in stereo matching. This paper compares state-of-the-art networks in terms of both accuracy and speed, the latter being of particular importance in real-time applications.
    A deep generative model for probabilistic energy forecasting in power systems: normalizing flows. (arXiv:2106.09370v4 [cs.LG] UPDATED)
    (3 min) Greater direct electrification of end-use sectors with a higher share of renewables is one of the pillars to power a carbon-neutral society by 2050. However, in contrast to conventional power plants, renewable energy is subject to uncertainty, raising challenges for its interaction with power systems. Scenario-based probabilistic forecasting models have become a vital tool to equip decision-makers. This paper presents to the power systems forecasting practitioners a recent deep learning technique, the normalizing flows, to produce accurate scenario-based probabilistic forecasts that are crucial to face the new challenges in power systems applications. The strength of this technique is to directly learn the stochastic multivariate distribution of the underlying process by maximizing the likelihood. Through comprehensive empirical evaluations using the open data of the Global Energy Forecasting Competition 2014, we demonstrate that this methodology is competitive with other state-of-the-art deep learning generative models: generative adversarial networks and variational autoencoders. The models producing weather-based wind, solar power, and load scenarios are properly compared in terms of forecast value by considering the case study of an energy retailer and quality using several complementary metrics. The numerical experiments are simple and easily reproducible. Thus, we hope it will encourage other forecasting practitioners to test and use normalizing flows in power system applications such as bidding on electricity markets, scheduling power systems with high renewable energy sources penetration, energy management of virtual power plants or microgrids, and unit commitment.
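    A minimal, self-contained sketch of the core mechanism, maximum-likelihood training of an invertible map via the change-of-variables formula, using a single affine coupling layer; the paper's actual architecture and its conditioning on weather variables are not reproduced, and the data below is synthetic.

```python
# Minimal sketch of a normalizing flow: one affine coupling layer trained by
# maximum likelihood under a standard normal base distribution. Not the paper's
# architecture; dimensions and data are illustrative.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # keep the scale well-behaved
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                 # log |det Jacobian| of the coupling
        return torch.cat([x1, z2], dim=-1), log_det

dim = 24                                        # e.g. one day of hourly load values
flow = AffineCoupling(dim)
base = torch.distributions.Normal(0.0, 1.0)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
data = torch.randn(512, dim) * 0.5 + 1.0        # stand-in for historical scenarios
for _ in range(200):
    z, log_det = flow(data)
    loss = -(base.log_prob(z).sum(dim=-1) + log_det).mean()   # negative log-likelihood
    opt.zero_grad(); loss.backward(); opt.step()
```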
    Learning to Forecast Dynamical Systems from Streaming Data. (arXiv:2109.09703v2 [math.DS] UPDATED)
    (2 min) Kernel analog forecasting (KAF) is a powerful methodology for data-driven, non-parametric forecasting of dynamically generated time series data. This approach has a rigorous foundation in Koopman operator theory and it produces good forecasts in practice, but it suffers from the heavy computational costs common to kernel methods. This paper proposes a streaming algorithm for KAF that only requires a single pass over the training data. This algorithm dramatically reduces the costs of training and prediction without sacrificing forecasting skill. Computational experiments demonstrate that the streaming KAF method can successfully forecast several classes of dynamical systems (periodic, quasi-periodic, and chaotic) in both data-scarce and data-rich regimes. The overall methodology may have wider interest as a new template for streaming kernel regression.
    Communication-efficient Distributed Cooperative Learning with Compressed Beliefs. (arXiv:2102.07767v2 [cs.LG] UPDATED)
    (2 min) We study the problem of distributed cooperative learning, where a group of agents seeks to agree on a set of hypotheses that best describes a sequence of private observations. In the scenario where the set of hypotheses is large, we propose a belief update rule where agents share compressed (either sparse or quantized) beliefs with an arbitrary positive compression rate. Our algorithm leverages a unified communication rule that enables agents to access wide-ranging compression operators as black-box modules. We prove the almost sure asymptotic exponential convergence of beliefs around the set of optimal hypotheses. Additionally, we show a non-asymptotic, explicit, and linear concentration rate in probability of the beliefs on the optimal hypothesis set. We provide numerical experiments to illustrate the communication benefits of our method. The simulation results show that the number of transmitted bits can be reduced to 5-10% of the non-compressed method in the studied scenarios.
    Multi-future Merchant Transaction Prediction. (arXiv:2007.05303v1 [cs.LG] CROSS LISTED)
    (2 min) The multivariate time series generated from merchant transaction history can provide critical insights for payment processing companies. The capability of predicting merchants' future is crucial for fraud detection and recommendation systems. Conventionally, this problem is formulated to predict one multivariate time series under the multi-horizon setting. However, real-world applications often require more than one future trend prediction considering the uncertainties, where more than one multivariate time series needs to be predicted. This problem is called multi-future prediction. In this work, we combine the two research directions and propose to study this new problem: multi-future, multi-horizon and multivariate time series prediction. This problem is crucial as it has broad use cases in the financial industry to reduce the risk while improving user experience by providing alternative futures. This problem is also challenging as now we not only need to capture the patterns and insights from the past but also train a model that has a strong inference capability to project multiple possible outcomes. To solve this problem, we propose a new model using convolutional neural networks and a simple yet effective encoder-decoder structure to learn the time series pattern from multiple perspectives. We use experiments on real-world merchant transaction data to demonstrate the effectiveness of our proposed model. We also provide extensive discussions on different model design choices in our experimental section.
    Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents. (arXiv:2109.10341v1 [cs.CL])
    (2 min) Document-level neural machine translation (DocNMT) delivers coherent translations by incorporating cross-sentence context. However, for most language pairs there's a shortage of parallel documents, although parallel sentences are readily available. In this paper, we study whether and how contextual modeling in DocNMT is transferable from sentences to documents in a zero-shot fashion (i.e. no parallel documents for student languages) through multilingual modeling. Using simple concatenation-based DocNMT, we explore the effect of 3 factors on multilingual transfer: the number of document-supervised teacher languages, the data schedule for parallel documents at training, and the data condition of parallel documents (genuine vs. backtranslated). Our experiments on Europarl-7 and IWSLT-10 datasets show the feasibility of multilingual transfer for DocNMT, particularly on document-specific metrics. We observe that more teacher languages and adequate data schedule both contribute to better transfer quality. Surprisingly, the transfer is less sensitive to the data condition and multilingual DocNMT achieves comparable performance with both back-translated and genuine document pairs.
    Understanding and Scheduling Weight Decay. (arXiv:2011.11152v4 [cs.LG] UPDATED)
    (2 min) Weight decay is a popular and even necessary regularization technique for training deep neural networks that generalize well. Previous work usually interpreted weight decay as a Gaussian prior from the Bayesian perspective. However, weight decay sometimes shows mysterious behaviors beyond the conventional understanding. For example, the optimal weight decay value tends to be zero given long enough training time. Moreover, existing work typically failed to recognize the importance of scheduling weight decay during training. Our work aims at theoretically understanding novel behaviors of weight decay and designing schedulers for weight decay in deep learning. This paper mainly has three contributions. First, we propose a novel theoretical interpretation of weight decay from the perspective of learning dynamics. Second, we propose a novel weight-decay linear scaling rule for large-batch training that proportionally increases weight decay rather than the learning rate as the batch size increases. Third, we provide an effective learning-rate-aware scheduler for weight decay, called the Stable Weight Decay (SWD) method, which, to the best of our knowledge, is the first practical design for weight decay scheduling. In our various experiments, the SWD method often makes improvements over $L_{2}$ Regularization and Decoupled Weight Decay.
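    A hedged sketch of the weight-decay linear scaling rule mentioned above (the full SWD scheduler is not reproduced): when the batch size grows by a factor k, scale the decoupled weight decay coefficient by k while keeping the learning rate fixed. The base values below are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of the weight-decay linear scaling rule: scale weight decay with
# batch size instead of the learning rate. Base values are illustrative assumptions.
import torch

base_batch, base_wd, lr = 128, 5e-4, 1e-3

def scaled_weight_decay(batch_size):
    """Weight decay grows proportionally to the batch size (linear scaling rule)."""
    return base_wd * batch_size / base_batch

model = torch.nn.Linear(10, 2)
batch_size = 1024
optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                              weight_decay=scaled_weight_decay(batch_size))
```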
    Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks. (arXiv:2109.10312v1 [cs.RO])
    (2 min) In this paper, we study the problem of learning a repertoire of low-level skills from raw images that can be sequenced to complete long-horizon visuomotor tasks. Reinforcement learning (RL) is a promising approach for acquiring short-horizon skills autonomously. However, the focus of RL algorithms has largely been on the success of those individual skills, more so than learning and grounding a large repertoire of skills that can be sequenced to complete extended multi-stage tasks. The latter demands robustness and persistence, as errors in skills can compound over time, and may require the robot to have a number of primitive skills in its repertoire, rather than just one. To this end, we introduce EMBR, a model-based RL method for learning primitive skills that are suitable for completing long-horizon visuomotor tasks. EMBR learns and plans using a learned model, critic, and success classifier, where the success classifier serves both as a reward function for RL and as a grounding mechanism to continuously detect if the robot should retry a skill when unsuccessful or under perturbations. Further, the learned model is task-agnostic and trained using data from all skills, enabling the robot to efficiently learn a number of distinct primitives. These visuomotor primitive skills and their associated pre- and post-conditions can then be directly combined with off-the-shelf symbolic planners to complete long-horizon tasks. On a Franka Emika robot arm, we find that EMBR enables the robot to complete three long-horizon visuomotor tasks at 85% success rate, such as organizing an office desk, a file cabinet, and drawers, which require sequencing up to 12 skills, involve 14 unique learned primitives, and demand generalization to novel objects.
    Power up! Robust Graph Convolutional Network via Graph Powering. (arXiv:1905.10029v2 [cs.LG] UPDATED)
    (2 min) Graph convolutional networks (GCNs) are powerful tools for graph-structured data. However, they have been recently shown to be vulnerable to topological attacks. To enhance adversarial robustness, we go beyond spectral graph theory to robust graph theory. By challenging the classical graph Laplacian, we propose a new convolution operator that is provably robust in the spectral domain and is incorporated in the GCN architecture to improve expressivity and interpretability. By extending the original graph to a sequence of graphs, we also propose a robust training paradigm that encourages transferability across graphs that span a range of spatial and spectral characteristics. The proposed approaches are demonstrated in extensive experiments to simultaneously improve performance in both benign and adversarial situations.
    SFFDD: Deep Neural Network with Enriched Features for Failure Prediction with Its Application to Computer Disk Driver. (arXiv:2109.09856v1 [cs.LG])
    (2 min) A classification technique incorporating a novel feature derivation method is proposed for predicting failure of a system or device from multivariate time series sensor data. We treat the multivariate time series sensor data as images for both visualization and computation. Failure follows various patterns that are closely related to the root causes. Different predefined transformations are applied to the original sensor data to better characterize the failure patterns. In addition to feature derivation, an ensemble method is used to further improve performance. Moreover, a general deep neural network architecture is proposed to handle multiple types of data with less manual feature engineering. We apply the proposed method to the early prediction of computer disk drive failure in order to improve storage system availability and avoid data loss. The classification accuracy is largely improved with the enriched features, named smart features.
    Self-supervised Representation Learning for Reliable Robotic Monitoring of Fruit Anomalies. (arXiv:2109.10135v1 [cs.RO])
    (2 min) Data augmentation can be a simple yet powerful tool for autonomous robots to fully utilise available data for self-supervised identification of atypical scenes or objects. State-of-the-art augmentation methods arbitrarily embed structural peculiarity in focal objects on typical images so that classifying these artefacts can provide guidance for learning representations for the detection of anomalous visual inputs. In this paper, however, we argue that learning such structure-sensitive representations can be a suboptimal approach to some classes of anomaly (e.g., unhealthy fruits) which are better recognised by a different type of visual element such as "colour". We thus propose Channel Randomisation as a novel data augmentation method for restricting neural network models to learn encoding of "colour irregularity" whilst predicting channel-randomised images to ultimately build reliable fruit-monitoring robots identifying atypical fruit qualities. Our experiments show that (1) the colour-based alternative can better learn representations for consistently accurate identification of fruit anomalies in various fruit species, and (2) validation accuracy can be monitored for early stopping of training due to positive correlation between the colour-learning task and fruit anomaly detection. Moreover, the proposed approach is evaluated on a new anomaly dataset Riseholme-2021, consisting of 3.5K strawberry images collected from a mobile robot, which we share with the community to encourage active agri-robotics research.
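    A minimal sketch, assuming Channel Randomisation amounts to applying a random permutation of the colour channels and using the permutation index as the pretext label for the network to predict; the exact augmentation in the paper may differ.

```python
# Minimal sketch, assuming the augmentation is a random permutation of colour
# channels whose index serves as a self-supervised pretext label.
import numpy as np

def channel_randomise(image, rng=None):
    """image: HxWx3 array; returns the channel-permuted image and the permutation index."""
    rng = np.random.default_rng() if rng is None else rng
    perms = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
    idx = int(rng.integers(len(perms)))
    return image[:, :, list(perms[idx])], idx

# Toy usage on a random stand-in image.
img = np.random.rand(64, 64, 3)
augmented, pretext_label = channel_randomise(img)
```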
    Clinical Validation of Single-Chamber Model-Based Algorithms Used to Estimate Respiratory Compliance. (arXiv:2109.10224v1 [q-bio.QM])
    (2 min) Non-invasive estimation of respiratory physiology using computational algorithms promises to be a valuable technique for future clinicians to detect detrimental changes in patient pathophysiology. However, few clinical algorithms used to non-invasively analyze lung physiology have undergone rigorous validation in a clinical setting, and are often validated either using mechanical devices, or with small clinical validation datasets using 2-8 patients. This work aims to improve this situation by first, establishing an open, and clinically validated dataset comprising data from both mechanical lungs and nearly 40,000 breaths from 18 intubated patients. Next, we use this data to evaluate 15 different algorithms that use the "single chamber" model of estimating respiratory compliance. We evaluate these algorithms under varying clinical scenarios patients typically experience during hospitalization. In particular, we explore algorithm performance under four different types of patient ventilator asynchrony. We also analyze algorithms under varying ventilation modes to benchmark algorithm performance and to determine if ventilation mode has any impact on the algorithm. Our approach yields several advances by 1) showing which specific algorithms work best clinically under varying mode and asynchrony scenarios, 2) developing a simple mathematical method to reduce variance in algorithmic results, and 3) presenting additional insights about single-chamber model algorithms. We hope that our paper, approach, dataset, and software framework can thus be used by future researchers to improve their work and allow future integration of "single chamber" algorithms into clinical practice.
    Unsupervised Abstract Reasoning for Raven's Problem Matrices. (arXiv:2109.10011v1 [cs.CV])
    (2 min) Raven's Progressive Matrices (RPM) is highly correlated with human intelligence, and it has been widely used to measure the abstract reasoning ability of humans. In this paper, to study the abstract reasoning capability of deep neural networks, we propose the first unsupervised learning method for solving RPM problems. Since ground truth labels are not available, we design a pseudo target based on the prior constraints of the RPM formulation to approximate the ground truth label, which effectively converts the unsupervised learning strategy into a supervised one. However, the pseudo target sometimes mislabels the correct answer, and the resulting noisy contrast can lead to inaccurate model training. To alleviate this issue, we propose to improve the model performance with negative answers. Moreover, we develop a decentralization method to adapt the feature representation to different RPM problems. Extensive experiments on three datasets demonstrate that our method even outperforms some of the supervised approaches. Our code is available at https://github.com/visiontao/ncd.
    Modeling Annotation Uncertainty with Gaussian Heatmaps in Landmark Localization. (arXiv:2109.09533v2 [cs.CV] UPDATED)
    (2 min) In landmark localization, due to ambiguities in defining their exact position, landmark annotations may suffer from large observer variabilities, which result in uncertain annotations. To model the annotation ambiguities of the training dataset, we propose to learn anisotropic Gaussian parameters modeling the shape of the target heatmap during optimization. Furthermore, our method models the prediction uncertainty of individual samples by fitting anisotropic Gaussian functions to the predicted heatmaps during inference. Besides state-of-the-art results, our experiments on datasets of hand radiographs and lateral cephalograms also show that Gaussian functions are correlated with both localization accuracy and observer variability. As a final experiment, we show the importance of integrating the uncertainty into decision making by measuring the influence of the predicted location uncertainty on the classification of anatomical abnormalities in lateral cephalograms.
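    A minimal sketch of an anisotropic Gaussian target heatmap parameterised by a 2x2 covariance matrix, which is the quantity the method learns or fits per landmark; the parameterisation below is an assumed form, not necessarily the paper's exact one.

```python
# Minimal sketch (assumed form, not the paper's exact parameterisation): an
# anisotropic Gaussian heatmap for one landmark, shaped by a 2x2 covariance matrix.
import numpy as np

def anisotropic_heatmap(height, width, center, cov):
    """center: (x, y); cov: 2x2 covariance describing the annotation uncertainty."""
    ys, xs = np.mgrid[0:height, 0:width]
    d = np.stack([xs - center[0], ys - center[1]], axis=-1)          # (H, W, 2) offsets
    inv = np.linalg.inv(cov)
    maha = np.einsum("hwi,ij,hwj->hw", d, inv, d)                    # Mahalanobis distance squared
    return np.exp(-0.5 * maha)

# Example: uncertainty elongated along the x-axis and tilted by correlation.
hm = anisotropic_heatmap(128, 128, center=(64, 50),
                         cov=np.array([[40.0, 10.0], [10.0, 15.0]]))
```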
    Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics. (arXiv:2109.09833v1 [cs.LG])
    (2 min) In this paper, we characterize the noise of stochastic gradients and analyze the noise-induced dynamics during training deep neural networks by gradient-based optimizers. Specifically, we firstly show that the stochastic gradient noise possesses finite variance, and therefore the classical Central Limit Theorem (CLT) applies; this indicates that the gradient noise is asymptotically Gaussian. Such an asymptotic result validates the widely accepted assumption of Gaussian noise. We clarify that the recently observed phenomenon of heavy tails within gradient noise may not be an intrinsic property, but a consequence of insufficient mini-batch size; the gradient noise, which is a sum of a limited number of i.i.d. random variables, has not reached the asymptotic regime of the CLT and thus deviates from Gaussian. We quantitatively measure the goodness of the Gaussian approximation of the noise, which supports our conclusion. Secondly, we analyze the noise-induced dynamics of stochastic gradient descent using the Langevin equation, giving the momentum hyperparameter of the optimizer a physical interpretation. We then proceed to demonstrate the existence of the steady-state distribution of stochastic gradient descent and approximate the distribution at a small learning rate.
    Non-parametric Kernel-Based Estimation of Probability Distributions for Precipitation Modeling. (arXiv:2109.09961v1 [stat.AP])
    (2 min) The probability distribution of precipitation amount strongly depends on geography, climate zone, and time scale considered. Closed-form parametric probability distributions are not sufficiently flexible to provide accurate and universal models for precipitation amount over different time scales. In this paper we derive non-parametric estimates of the cumulative distribution function (CDF) of precipitation amount for wet time intervals. The CDF estimates are obtained by integrating the kernel density estimator leading to semi-explicit CDF expressions for different kernel functions. We investigate kernel-based CDF estimation with an adaptive plug-in bandwidth (KCDE), using both synthetic data sets and reanalysis precipitation data from the island of Crete (Greece). We show that KCDE provides better estimates of the probability distribution than the standard empirical (staircase) estimate and kernel-based estimates that use the normal reference bandwidth. We also demonstrate that KCDE enables the simulation of non-parametric precipitation amount distributions by means of the inverse transform sampling method.
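    A sketch of kernel-based CDF estimation with a Gaussian kernel: integrating the kernel density estimate gives an average of Gaussian CDFs centred at the data points. The adaptive plug-in bandwidth used in the paper is replaced by Silverman's rule of thumb for simplicity.

```python
# Sketch of kernel-based CDF estimation (KCDE) with a Gaussian kernel; the paper's
# adaptive plug-in bandwidth is replaced here by Silverman's rule of thumb.
import numpy as np
from scipy.stats import norm

def kcde(x_eval, data, bandwidth=None):
    data = np.asarray(data, dtype=float)
    if bandwidth is None:                      # Silverman's rule (illustrative choice)
        bandwidth = 1.06 * data.std(ddof=1) * len(data) ** (-1 / 5)
    z = (np.asarray(x_eval)[:, None] - data[None, :]) / bandwidth
    return norm.cdf(z).mean(axis=1)            # average of Gaussian CDFs centred at the data

# Toy usage on synthetic "wet-day" precipitation amounts (mm).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=5.0, size=500)
grid = np.linspace(0, 60, 200)
cdf_estimate = kcde(grid, sample)
```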
    Weak Signals in the Mobility Landscape: Car Sharing in Ten European Cities. (arXiv:2109.09832v1 [cs.LG])
    (2 min) Car sharing is one of the pillars of a smart transportation infrastructure, as it is expected to reduce traffic congestion, parking demands and pollution in our cities. From the point of view of demand modelling, car sharing is a weak signal in the city landscape: only a small percentage of the population uses it, and thus it is difficult to study reliably with traditional techniques such as household travel diaries. In this work, we depart from these traditional approaches and we leverage web-based, digital records about vehicle availability in 10 European cities for one of the major active car sharing operators. We discuss which sociodemographic and urban activity indicators are associated with variations in car sharing demand, which forecasting approach (among the most popular in the related literature) is better suited to predict pickup and drop-off events, and how the spatio-temporal information about vehicle availability can be used to infer how different zones in a city are used by customers. We conclude the paper by presenting a direct application of the analysis of the dataset, aimed at identifying where to locate maintenance facilities within the car sharing operation area.
    Fast nonlinear risk assessment for autonomous vehicles using learned conditional probabilistic models of agent futures. (arXiv:2109.09975v1 [cs.LG])
    (2 min) This paper presents fast non-sampling based methods to assess the risk for trajectories of autonomous vehicles when probabilistic predictions of other agents' futures are generated by deep neural networks (DNNs). The presented methods address a wide range of representations for uncertain predictions including both Gaussian and non-Gaussian mixture models to predict both agent positions and control inputs conditioned on the scene contexts. We show that the problem of risk assessment when Gaussian mixture models (GMMs) of agent positions are learned can be solved rapidly to arbitrary levels of accuracy with existing numerical methods. To address the problem of risk assessment for non-Gaussian mixture models of agent position, we propose finding upper bounds on risk using nonlinear Chebyshev's Inequality and sums-of-squares (SOS) programming; they are both of interest as the former is much faster while the latter can be arbitrarily tight. These approaches only require higher order statistical moments of agent positions to determine upper bounds on risk. To perform risk assessment when models are learned for agent control inputs as opposed to positions, we propagate the moments of uncertain control inputs through the nonlinear motion dynamics to obtain the exact moments of uncertain position over the planning horizon. To this end, we construct deterministic linear dynamical systems that govern the exact time evolution of the moments of uncertain position in the presence of uncertain control inputs. The presented methods are demonstrated on realistic predictions from DNNs trained on the Argoverse and CARLA datasets and are shown to be effective for rapidly assessing the probability of low probability events.
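    A hedged sketch in the spirit of the moment-based bounds described above (the paper's nonlinear Chebyshev inequality and SOS programs are not reproduced): bound the probability that an agent lies within a given radius of the ego position using only the first two moments of the agent's predicted position.

```python
# Hedged sketch of a moment-based risk bound (not the paper's exact inequality or
# SOS program): an upper bound on P(||X - ego|| <= radius) from mean and covariance.
import numpy as np

def chebyshev_collision_bound(mean, cov, ego, radius):
    """Multivariate Chebyshev-style bound on the collision probability."""
    d = np.asarray(mean) - np.asarray(ego)
    dist = np.linalg.norm(d)
    if dist <= radius:
        return 1.0                                     # mean inside the risk region: bound is vacuous
    # ||X - ego|| <= radius implies ||X - mean|| >= dist - radius, so
    # P(collision) <= trace(cov) / (dist - radius)^2.
    return min(1.0, np.trace(cov) / (dist - radius) ** 2)

bound = chebyshev_collision_bound(mean=[8.0, 2.0],
                                  cov=np.diag([0.5, 0.3]),
                                  ego=[0.0, 0.0], radius=2.0)
```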
    Learning Adaptive Control for SE(3) Hamiltonian Dynamics. (arXiv:2109.09974v1 [cs.RO])
    (2 min) Fast adaptive control is a critical component for reliable robot autonomy in rapidly changing operational conditions. While a robot dynamics model may be obtained from first principles or learned from data, updating its parameters is often too slow for online adaptation to environment changes. This motivates the use of machine learning techniques to learn disturbance descriptors from trajectory data offline as well as the design of adaptive control to estimate and compensate the disturbances online. This paper develops adaptive geometric control for rigid-body systems, such as ground, aerial, and underwater vehicles, that satisfy Hamilton's equations of motion over the SE(3) manifold. Our design consists of an offline system identification stage, followed by an online adaptive control stage. In the first stage, we learn a Hamiltonian model of the system dynamics using a neural ordinary differential equation (ODE) network trained from state-control trajectory data with different disturbance realizations. The disturbances are modeled as a linear combination of nonlinear descriptors. In the second stage, we design a trajectory tracking controller with disturbance compensation from an energy-based perspective. An adaptive control law is employed to adjust the disturbance model online proportional to the geometric tracking errors on the SE(3) manifold. We verify our adaptive geometric controller for trajectory tracking on a fully-actuated pendulum and an under-actuated quadrotor.
    Learning Interpretable Concept Groups in CNNs. (arXiv:2109.10078v1 [cs.CV])
    (2 min) We propose a novel training methodology -- Concept Group Learning (CGL) -- that encourages training of interpretable CNN filters by partitioning filters in each layer into concept groups, each of which is trained to learn a single visual concept. We achieve this through a novel regularization strategy that forces filters in the same group to be active in similar image regions for a given layer. We additionally use a regularizer to encourage a sparse weighting of the concept groups in each layer so that a few concept groups can have greater importance than others. We quantitatively evaluate CGL's model interpretability using standard interpretability evaluation techniques and find that our method increases interpretability scores in most cases. Qualitatively we compare the image regions that are most active under filters learned using CGL versus filters learned without CGL and find that CGL activation regions more strongly concentrate around semantically relevant features.
    Unsupervised Domain Adaptation with Semantic Consistency across Heterogeneous Modalities for MRI Prostate Lesion Segmentation. (arXiv:2109.09736v1 [eess.IV])
    (2 min) Any novel medical imaging modality that differs from previous protocols, e.g. in the number of imaging channels, introduces a new domain that is heterogeneous with respect to previous ones. This common medical imaging scenario is rarely considered in the domain adaptation literature, which handles shifts across domains of the same dimensionality. In our work we rely on stochastic generative modeling to translate across two heterogeneous domains at pixel space and introduce two new loss functions that promote semantic consistency. Firstly, we introduce a semantic cycle-consistency loss in the source domain to ensure that the translation preserves the semantics. Secondly, we introduce a pseudo-labelling loss, where we translate target data to source, label them by a source-domain network, and use the generated pseudo-labels to supervise the target-domain network. Our results show that this allows us to extract systematically better representations for the target domain. In particular, we address the challenge of enhancing performance on VERDICT-MRI, an advanced diffusion-weighted imaging technique, by exploiting labeled mp-MRI data. When compared to several unsupervised domain adaptation approaches, our approach yields substantial improvements that consistently carry over to the semi-supervised and supervised learning settings.
    Scenario generation for market risk models using generative neural networks. (arXiv:2109.10072v1 [cs.LG])
    (2 min) In this research, we show how to expand existing approaches that use generative adversarial networks (GANs) as economic scenario generators (ESG) to a whole internal model - with enough risk factors to model the full range of investments of an insurance company over a one-year horizon, as required in Solvency 2. For validation of this approach as well as for optimisation of the GAN architecture, we develop new performance measures and provide a consistent, data-driven framework. Finally, we demonstrate that the results of a GAN-based ESG are similar to regulatory approved internal models in Europe. Therefore, GAN-based models can be seen as an assumption-free data-driven alternative way of market risk modelling.
    Meta-Model Structure Selection: Building Polynomial NARX Model for Regression and Classification. (arXiv:2109.09917v1 [cs.LG])
    (2 min) This work presents a new meta-heuristic approach to select the structure of polynomial NARX models for regression and classification problems. The method takes into account the complexity of the model and the contribution of each term to build parsimonious models by proposing a new cost function formulation. The robustness of the new algorithm is tested on several simulated and experimental systems with different nonlinear characteristics. The obtained results show that the proposed algorithm is capable of identifying the correct model, for cases where the proper model structure is known, and of determining parsimonious models for experimental data, even for systems for which traditional and contemporary methods habitually fail. The new algorithm is validated against classical methods such as FROLS and recent randomized approaches.
    Well Googled is Half Done: Multimodal Forecasting of New Fashion Product Sales with Image-based Google Trends. (arXiv:2109.09824v1 [cs.CV])
    (2 min) This paper investigates the effectiveness of systematically probing Google Trends against textual translations of visual aspects as exogenous knowledge to predict the sales of brand-new fashion items, where past sales data is not available, but only an image and a few metadata are available. In particular, we propose GTM-Transformer, standing for Google Trends Multimodal Transformer, whose encoder works on the representation of the exogenous time series, while the decoder forecasts the sales using the Google Trends encoding, and the available visual and metadata information. Our model works in a non-autoregressive manner, avoiding the compounding effect of the first-step errors. As a second contribution, we present the VISUELLE dataset, which is the first publicly available dataset for the task of new fashion product sales forecasting, containing the sales of 5577 new products sold between 2016 and 2019, derived from genuine historical data of Nunalie, an Italian fast-fashion company. Our dataset is equipped with images of products, metadata, related sales, and associated Google Trends. We use VISUELLE to compare our approach against state-of-the-art alternatives and numerous baselines, showing that GTM-Transformer is the most accurate in terms of both percentage and absolute error. It is worth noting that the addition of exogenous knowledge boosts the forecasting accuracy by 1.5% in terms of WAPE, showing the importance of exploiting Google Trends. The code and dataset are both available at https://github.com/HumaticsLAB/GTM-Transformer.
    Chemical-Reaction-Aware Molecule Representation Learning. (arXiv:2109.09888v1 [cs.LG])
    (2 min) Molecule representation learning (MRL) methods aim to embed molecules into a real vector space. However, existing SMILES-based (Simplified Molecular-Input Line-Entry System) or GNN-based (Graph Neural Networks) MRL methods either take SMILES strings as input that have difficulty in encoding molecule structure information, or over-emphasize the importance of GNN architectures but neglect their generalization ability. Here we propose using chemical reactions to assist learning molecule representation. The key idea of our approach is to preserve the equivalence of molecules with respect to chemical reactions in the embedding space, i.e., forcing the sum of reactant embeddings and the sum of product embeddings to be equal for each chemical equation. This constraint is proven effective to 1) keep the embedding space well-organized and 2) improve the generalization ability of molecule embeddings. Moreover, our model can use any GNN as the molecule encoder and is thus agnostic to GNN architectures. Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks, e.g., 17.4% absolute Hit@1 gain in chemical reaction prediction, 2.3% absolute AUC gain in molecule property prediction, and 18.5% relative RMSE gain in graph-edit-distance prediction, respectively, over the best baseline method. The code is available at https://github.com/hwwang55/MolR.
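    A minimal sketch of the reaction-equivalence constraint: for each chemical equation, the sum of reactant embeddings is pushed towards the sum of product embeddings. Any GNN could produce the embeddings; random tensors stand in for them here, and in practice a contrastive term over mismatched reactions would be added to prevent the embeddings from collapsing.

```python
# Minimal sketch of the reaction-equivalence loss: for each reaction, the sum of
# reactant embeddings should equal the sum of product embeddings. Random tensors
# stand in for GNN outputs; a contrastive term (omitted) would prevent collapse.
import torch

def reaction_loss(reactant_embs, product_embs):
    """Both arguments: lists of (n_molecules_i, d) tensors, one entry per reaction."""
    losses = [(r.sum(dim=0) - p.sum(dim=0)).pow(2).sum()
              for r, p in zip(reactant_embs, product_embs)]
    return torch.stack(losses).mean()

# Toy usage with random embeddings standing in for GNN outputs (d = 16).
reactants = [torch.randn(2, 16, requires_grad=True), torch.randn(3, 16, requires_grad=True)]
products  = [torch.randn(1, 16, requires_grad=True), torch.randn(2, 16, requires_grad=True)]
loss = reaction_loss(reactants, products)
loss.backward()
```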
    MetaMedSeg: Volumetric Meta-learning for Few-Shot Organ Segmentation. (arXiv:2109.09734v1 [eess.IV])
    (2 min) The lack of sufficient annotated image data is a common issue in medical image segmentation. For some organs and densities, the annotation may be scarce, leading to poor model training convergence, while other organs have plenty of annotated data. In this work, we present MetaMedSeg, a gradient-based meta-learning algorithm that redefines the meta-learning task for the volumetric medical data with the goal to capture the variety between the slices. We also explore different weighting schemes for gradients aggregation, arguing that different tasks might have different complexity, and hence, contribute differently to the initialization. We propose an importance-aware weighting scheme to train our model. In the experiments, we present an evaluation of the medical decathlon dataset by extracting 2D slices from CT and MRI volumes of different organs and performing semantic segmentation. The results show that our proposed volumetric task definition leads to up to 30% improvement in terms of IoU compared to related baselines. The proposed update rule is also shown to improve the performance for complex scenarios where the data distribution of the target organ is very different from the source organs.
    Towards Energy-Efficient and Secure Edge AI: A Cross-Layer Framework. (arXiv:2109.09829v1 [cs.CR])
    (2 min) Security and privacy concerns, along with the amount of data that needs to be processed on a regular basis, have pushed processing to the edge of computing systems. Deploying advanced Neural Networks (NN), such as deep neural networks (DNNs) and spiking neural networks (SNNs), that offer state-of-the-art results on resource-constrained edge devices is challenging due to the stringent memory and power/energy constraints. Moreover, these systems are required to maintain correct functionality under diverse security and reliability threats. This paper first discusses existing approaches to address energy efficiency, reliability, and security issues at different system layers, i.e., hardware (HW) and software (SW). Afterward, we discuss how to further improve the performance (latency) and the energy efficiency of Edge AI systems through HW/SW-level optimizations, such as pruning, quantization, and approximation. To address reliability threats (like permanent and transient faults), we highlight cost-effective mitigation techniques, like fault-aware training and mapping. Moreover, we briefly discuss effective detection and protection techniques to address security threats (like model and data corruption). Towards the end, we discuss how these techniques can be combined in an integrated cross-layer framework for realizing robust and energy-efficient Edge AI systems.
    Context-Specific Representation Abstraction for Deep Option Learning. (arXiv:2109.09876v1 [cs.LG])
    (2 min) Hierarchical reinforcement learning has focused on discovering temporally extended actions, such as options, that can provide benefits in problems requiring extensive exploration. One promising approach that learns these options end-to-end is the option-critic (OC) framework. We examine and show in this paper that OC does not decompose a problem into simpler sub-problems, but instead increases the size of the search over policy space with each option considering the entire state space during learning. This issue can result in practical limitations of this method, including sample inefficient learning. To address this problem, we introduce Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), a new framework that considers both temporal abstraction and context-specific representation abstraction to effectively reduce the size of the search over policy space. Specifically, our method learns a factored belief state representation that enables each option to learn a policy over only a subsection of the state space. We test our method against hierarchical, non-hierarchical, and modular recurrent neural network baselines, demonstrating significant sample efficiency improvements in challenging partially observable environments.
    Identifying biases in legal data: An algorithmic fairness perspective. (arXiv:2109.09946v1 [cs.CY])
    (2 min) The need to address representation biases and sentencing disparities in legal case data has long been recognized. Here, we study the problem of identifying and measuring biases in large-scale legal case data from an algorithmic fairness perspective. Our approach utilizes two regression models: A baseline that represents the decisions of a "typical" judge as given by the data and a "fair" judge that applies one of three fairness concepts. Comparing the decisions of the "typical" judge and the "fair" judge allows for quantifying biases across demographic groups, as we demonstrate in four case studies on criminal data from Cook County (Illinois).
    Learning offline: memory replay in biological and artificial reinforcement learning. (arXiv:2109.10034v1 [cs.LG])
    (2 min) Learning to act in an environment to maximise rewards is among the brain's key functions. This process has often been conceptualised within the framework of reinforcement learning, which has also gained prominence in machine learning and artificial intelligence (AI) as a way to optimise decision-making. A common aspect of both biological and machine reinforcement learning is the reactivation of previously experienced episodes, referred to as replay. Replay is important for memory consolidation in biological neural networks, and is key to stabilising learning in deep neural networks. Here, we review recent developments concerning the functional roles of replay in the fields of neuroscience and AI. Complementary progress suggests how replay might support learning processes, including generalisation and continual learning, affording opportunities to transfer knowledge across the two fields to advance the understanding of biological and artificial learning and memory.
    StereOBJ-1M: Large-scale Stereo Image Dataset for 6D Object Pose Estimation. (arXiv:2109.10115v1 [cs.CV])
    (2 min) We present a large-scale stereo RGB image object pose estimation dataset named the $\textbf{StereOBJ-1M}$ dataset. The dataset is designed to address challenging cases such as object transparency, translucency, and specular reflection, in addition to the common challenges of occlusion, symmetry, and variations in illumination and environments. In order to collect data of sufficient scale for modern deep learning models, we propose a novel method for efficiently annotating pose data in a multi-view fashion that allows data capturing in complex and flexible environments. Fully annotated with 6D object poses, our dataset contains over 396K frames and over 1.5M annotations of 18 objects recorded in 183 scenes constructed in 11 different environments. The 18 objects include 8 symmetric objects, 7 transparent objects, and 8 reflective objects. We benchmark two state-of-the-art pose estimation frameworks on StereOBJ-1M as baselines for future work. We also propose a novel object-level pose optimization method for computing 6D pose from keypoint predictions in multiple images.
    Graph Random Neural Network for Semi-Supervised Learning on Graphs. (arXiv:2005.11079v4 [cs.LG] UPDATED)
    (2 min) We study the problem of semi-supervised learning on graphs, for which graph neural networks (GNNs) have been extensively explored. However, most existing GNNs inherently suffer from the limitations of over-smoothing, non-robustness, and weak-generalization when labeled nodes are scarce. In this paper, we propose a simple yet effective framework -- GRAPH RANDOM NEURAL NETWORKS (GRAND) -- to address these issues. In GRAND, we first design a random propagation strategy to perform graph data augmentation. Then we leverage consistency regularization to optimize the prediction consistency of unlabeled nodes across different data augmentations. Extensive experiments on graph benchmark datasets suggest that GRAND significantly outperforms state-of-the-art GNN baselines on semi-supervised node classification. Finally, we show that GRAND mitigates the issues of over-smoothing and non-robustness, exhibiting better generalization behavior than existing GNNs. The source code of GRAND is publicly available at https://github.com/Grand20/grand.
    Robustness Analysis of Deep Learning Frameworks on Mobile Platforms. (arXiv:2109.09869v1 [cs.LG])
    (2 min) With the recent increase in the computational power of modern mobile devices, machine learning-based heavy tasks such as face detection and speech recognition are now integral parts of such devices. This requires frameworks to execute machine learning models (e.g., Deep Neural Networks) on mobile devices. Although there exist studies on the accuracy and performance of these frameworks, the quality of on-device deep learning frameworks, in terms of their robustness, has not been systematically studied yet. In this paper, we empirically compare two on-device deep learning frameworks with three adversarial attacks on three different model architectures. We also use both the quantized and unquantized variants for each architecture. The results show that, in general, neither of the deep learning frameworks is better than the other in terms of robustness, and there is not a significant difference between the PC and mobile frameworks either. However, in cases like the Boundary attack, the mobile version is more robust than the PC version. In addition, quantization improves robustness in all cases when moving from PC to mobile.
    Modelling Adversarial Noise for Adversarial Defense. (arXiv:2109.09901v1 [cs.LG])
    (2 min) Deep neural networks have been demonstrated to be vulnerable to adversarial noise, promoting the development of defenses against adversarial attacks. Traditionally, adversarial defenses typically focus on directly exploiting adversarial examples to remove adversarial noise or train an adversarially robust target model. Motivated by the observation that the relationship between adversarial and natural data can help infer clean data from adversarial data and thereby obtain the correct final prediction, in this paper we model adversarial noise by learning the transition relationship in the label space, so that adversarial labels can be used to improve adversarial accuracy. Specifically, we introduce a transition matrix to relate adversarial labels and true labels. By exploiting the transition matrix, we can directly infer clean labels from adversarial labels. Then, we propose to employ a deep neural network (i.e., transition network) to model the instance-dependent transition matrix from adversarial noise. In addition, we conduct joint adversarial training on the target model and the transition network to achieve optimal performance. Empirical evaluations on benchmark datasets demonstrate that our method could significantly improve adversarial accuracy in comparison to state-of-the-art methods.
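    A hedged sketch of the label-space transition idea (the instance-dependent transition network is not reproduced): given a transition matrix T with T[i, j] = P(adversarial label j | true label i), clean class probabilities can be inferred from the predicted adversarial label distribution by inverting the linear relation.

```python
# Hedged sketch (not the paper's transition network): inferring clean class
# probabilities from adversarial predictions via a known transition matrix T,
# where T[i, j] = P(adversarial label j | true label i), so p_adv = T^T p_clean.
import numpy as np

def infer_clean_probs(adv_probs, T):
    """adv_probs: (C,) predicted distribution over adversarial labels; T: (C, C)."""
    clean = np.linalg.solve(T.T, adv_probs)        # invert the linear relation
    clean = np.clip(clean, 0.0, None)              # project back to valid probabilities
    return clean / clean.sum()

# Toy usage with an illustrative 3-class transition matrix.
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
adv_pred = np.array([0.15, 0.70, 0.15])
clean_pred = infer_clean_probs(adv_pred, T)
```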
    Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models. (arXiv:2009.13267v4 [cs.CL] UPDATED)
    (2 min) The discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score has been studied before for autoregressive neural machine translation (NMT) and resulted in alternative training algorithms (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Wu et al., 2018). However, MLE training remains the de facto approach for autoregressive NMT because of its computational efficiency and stability. Despite this mismatch between the training objective and task measure, we notice that the samples drawn from an MLE-trained NMT model support the desired distribution -- there are samples with much higher BLEU score compared to the beam decoding output. To benefit from this observation, we train an energy-based model to mimic the behavior of the task measure (i.e., the energy-based model assigns lower energy to samples with higher BLEU score), which results in a re-ranking algorithm based on the samples drawn from NMT: energy-based re-ranking (EBR). We use both marginal energy models (over target sentence) and joint energy models (over both source and target sentences). Our EBR with the joint energy model consistently improves the performance of the Transformer-based NMT: +4 BLEU points on IWSLT'14 German-English, +3.0 BLEU points on Sinhala-English, +1.2 BLEU on WMT'16 English-German tasks.
    Consistency of spectral clustering for directed network community detection. (arXiv:2109.10319v1 [stat.ML])
    (2 min) Directed networks appear in various areas, such as biology, sociology, physiology and computer science. However, at present, most network analysis ignores the direction. In this paper, we construct a spectral clustering method based on the singular value decomposition of the adjacency matrix to detect communities in the directed stochastic block model (DiSBM). By considering a sparsity parameter, under some mild conditions, we show the proposed approach can consistently recover hidden row and column communities for different scalings of degrees. By considering the degree heterogeneity of both row and column nodes, we further establish a theoretical framework for the directed degree corrected stochastic block model (DiDCSBM). We show that the spectral clustering method stably yields consistent community detection for row clusters and column clusters under mild constraints on the degree heterogeneity. Our theoretical results under DiSBM and DiDCSBM provide some innovations on some special directed networks, such as directed networks with balanced clusters, directed networks with nodes enjoying similar degrees, and the directed Erd\"os-R\'enyi graph. Furthermore, our theoretical results under DiDCSBM are consistent with those under DiSBM when DiDCSBM degenerates to DiSBM.
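    A minimal sketch of SVD-based spectral clustering for a directed adjacency matrix, in the spirit of the method described above (the sparsity regularisation and degree corrections from the paper are omitted): left singular vectors give row (sending) communities and right singular vectors give column (receiving) communities.

```python
# Minimal sketch of SVD-based spectral clustering for a directed adjacency matrix;
# regularisation and degree corrections from the paper are omitted.
import numpy as np
from sklearn.cluster import KMeans

def directed_spectral_clustering(adj, k):
    u, _, vt = np.linalg.svd(adj)
    row_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(u[:, :k])
    col_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vt[:k, :].T)
    return row_labels, col_labels

# Toy usage: a random directed graph with two planted blocks.
rng = np.random.default_rng(0)
block = lambda p, shape: (rng.random(shape) < p).astype(float)
A = np.block([[block(0.3, (20, 20)), block(0.02, (20, 20))],
              [block(0.02, (20, 20)), block(0.3, (20, 20))]])
rows, cols = directed_spectral_clustering(A, k=2)
```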
    AutoGCL: Automated Graph Contrastive Learning via Learnable View Generators. (arXiv:2109.10259v1 [cs.LG])
    (2 min) Contrastive learning has been widely applied to graph representation learning, where the view generators play a vital role in generating effective contrastive samples. Most of the existing contrastive learning methods employ pre-defined view generation methods, e.g., node drop or edge perturbation, which usually cannot adapt to input data or preserve the original semantic structures well. To address this issue, we propose a novel framework named Automated Graph Contrastive Learning (AutoGCL) in this paper. Specifically, AutoGCL employs a set of learnable graph view generators orchestrated by an auto augmentation strategy, where every graph view generator learns a probability distribution of graphs conditioned by the input. While the graph view generators in AutoGCL preserve the most representative structures of the original graph in generation of every contrastive sample, the auto augmentation learns policies to introduce adequate augmentation variances in the whole contrastive learning procedure. Furthermore, AutoGCL adopts a joint training strategy to train the learnable view generators, the graph encoder, and the classifier in an end-to-end manner, resulting in topological heterogeneity yet semantic similarity in the generation of contrastive samples. Extensive experiments on semi-supervised learning, unsupervised learning, and transfer learning demonstrate the superiority of our AutoGCL framework over the state-of-the-arts in graph contrastive learning. In addition, the visualization results further confirm that the learnable view generators can deliver more compact and semantically meaningful contrastive samples compared against the existing view generation methods.
    Informed Sampling for Diversity in Concept-to-Text NLG. (arXiv:2004.14364v2 [cs.CL] UPDATED)
    (2 min) Deep-learning models for language generation tasks tend to produce repetitive output. Various methods have been proposed to encourage lexical diversity during decoding, but this often comes at a cost to the perceived fluency and adequacy of the output. In this work, we propose to ameliorate this cost by using an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce. Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output. We focus our experiments on concept-to-text generation, where models are sensitive to the inclusion of irrelevant words due to the strict relation between input and output. Our analysis shows that previous methods for diversity underperform in this setting, while human evaluation suggests that our proposed method achieves a high level of diversity with minimal effect on the output's fluency and adequacy.
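    The decoding-time idea can be illustrated with a small sketch: a meta-classifier's per-token quality scores gate which tokens remain eligible for sampling. The threshold and score arrays below are placeholders for whatever the trained classifier would produce.

    ```python
    # Restrict sampling at one decoding step to tokens the meta-classifier approves of.
    import numpy as np

    def informed_sample(next_token_probs, quality_scores, threshold=0.5):
        """next_token_probs, quality_scores: 1-D arrays over the vocabulary."""
        masked = np.where(quality_scores >= threshold, next_token_probs, 0.0)
        if masked.sum() == 0.0:              # no approved token carries mass: fall back
            masked = next_token_probs
        probs = masked / masked.sum()
        return np.random.choice(len(probs), p=probs)
    ```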
    Generalized Optimization: A First Step Towards Category Theoretic Learning Theory. (arXiv:2109.10262v1 [math.OC])
    (2 min) The Cartesian reverse derivative is a categorical generalization of reverse-mode automatic differentiation. We use this operator to generalize several optimization algorithms, including a straightforward generalization of gradient descent and a novel generalization of Newton's method. We then explore which properties of these algorithms are preserved in this generalized setting. First, we show that the transformation invariances of these algorithms are preserved: while generalized Newton's method is invariant to all invertible linear transformations, generalized gradient descent is invariant only to orthogonal linear transformations. Next, we show that we can express the change in loss of generalized gradient descent with an inner product-like expression, thereby generalizing the non-increasing and convergence properties of the gradient descent optimization flow. Finally, we include several numerical experiments to illustrate the ideas in the paper and demonstrate how we can use them to optimize polynomial functions over an ordered ring.
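    For reference, the two base algorithms being generalised look as follows in the ordinary real-valued setting (plain NumPy, a toy scalar polynomial); the categorical generalisation in the paper is built on the Cartesian reverse derivative operator instead.

    ```python
    # Ordinary gradient descent vs. Newton's method on f(x) = x^4 - 3x^2 + 2.
    import numpy as np

    def f(x):
        return x**4 - 3 * x**2 + 2

    def grad(x):
        return 4 * x**3 - 6 * x

    def hess(x):
        return 12 * x**2 - 6

    x_gd, x_nt = 2.0, 2.0
    for _ in range(50):
        x_gd -= 0.01 * grad(x_gd)            # gradient descent step
        x_nt -= grad(x_nt) / hess(x_nt)      # Newton step (scalar Hessian)

    print(f"gradient descent -> x={x_gd:.4f}, f={f(x_gd):.4f}")
    print(f"Newton's method  -> x={x_nt:.4f}, f={f(x_nt):.4f}")
    ```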
    Ensembles of Localised Models for Time Series Forecasting. (arXiv:2012.15059v2 [cs.LG] UPDATED)
    (2 min) With large quantities of data typically available nowadays, forecasting models that are trained across sets of time series, known as Global Forecasting Models (GFM), are regularly outperforming traditional univariate forecasting models that work on isolated series. As GFMs usually share the same set of parameters across all time series, they often have the problem of not being localised enough to a particular series, especially in situations where datasets are heterogeneous. We study how ensembling techniques can be used with generic GFMs and univariate models to solve this issue. Our work systematises and compares relevant current approaches, namely clustering series and training separate submodels per cluster, the so-called ensemble of specialists approach, and building heterogeneous ensembles of global and local models. We fill some gaps in the existing GFM localisation approaches, in particular by incorporating varied clustering techniques such as feature-based clustering, distance-based clustering and random clustering, and generalise them to use different underlying GFM model types. We then propose a new methodology of clustered ensembles where we train multiple GFMs on different clusters of series, obtained by changing the number of clusters and cluster seeds. Using Feed-forward Neural Networks, Recurrent Neural Networks, and Pooled Regression models as the underlying GFMs, in our evaluation on eight publicly available datasets, the proposed models are able to achieve significantly higher accuracy than baseline GFM models and univariate forecasting methods.
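    A rough sketch of the clustered-ensemble idea, assuming a list of univariate series: cluster by simple summary features, then fit one model per cluster. The per-cluster learner here is a plain linear autoregressor for brevity, whereas the paper uses neural and pooled-regression GFMs and varies cluster counts and seeds.

    ```python
    # Feature-based clustering of series, then one global model per cluster.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    def lagged_matrix(series, lags=4):
        X = np.column_stack([series[i:len(series) - lags + i] for i in range(lags)])
        y = series[lags:]
        return X, y

    def fit_clustered_ensemble(series_list, n_clusters=3, lags=4):
        feats = np.array([[s.mean(), s.std(), s[-1] - s[0]] for s in series_list])
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        models = {}
        for c in range(n_clusters):
            Xs, ys = [], []
            for s, lab in zip(series_list, labels):
                if lab == c:
                    X, y = lagged_matrix(s, lags)
                    Xs.append(X)
                    ys.append(y)
            if Xs:                               # skip clusters that received no series
                models[c] = LinearRegression().fit(np.vstack(Xs), np.concatenate(ys))
        return labels, models
    ```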
    Weak Adaptation Learning -- Addressing Cross-domain Data Insufficiency with Weak Annotator. (arXiv:2102.07358v3 [cs.LG] UPDATED)
    (2 min) Data quantity and quality are crucial factors for data-driven learning methods. In some target problem domains, there are not many data samples available, which could significantly hinder the learning process. While data from similar domains may be leveraged to help through domain adaptation, obtaining high-quality labeled data for those source domains themselves could be difficult or costly. To address such challenges of data insufficiency for classification problems in a target domain, we propose a weak adaptation learning (WAL) approach that leverages unlabeled data from a similar source domain, a low-cost weak annotator that produces labels based on task-specific heuristics, labeling rules, or other methods (albeit with inaccuracy), and a small amount of labeled data in the target domain. Our approach first conducts a theoretical analysis of the error bound of the trained classifier with respect to the data quantity and the performance of the weak annotator, and then introduces a multi-stage weak adaptation learning method to learn an accurate classifier by lowering the error bound. Our experiments demonstrate the effectiveness of our approach in learning an accurate classifier with limited labeled data in the target domain and unlabeled data in the source domain.
    Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits. (arXiv:2109.09855v1 [cs.LG])
    (2 min) We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. The goal is to sequentially choose actions for arms so as to maximize the expected value of the cumulative rewards collected. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy. Our policy is well-defined even if the underlying MDPs are not indexable. We prove that it is asymptotically optimal when the activation budget and number of arms are scaled up, while keeping their ratio as a constant. For the case when the system parameters are unknown, we develop a learning algorithm. Our learning algorithm uses the principle of optimism in the face of uncertainty and further uses a generative model in order to fully exploit the structure of Occupancy-Measured-Reward Index Policy. We call it the R(MA)^2B-UCB algorithm. As compared with the existing algorithms, R(MA)^2B-UCB performs close to an offline optimum policy, and also achieves a sub-linear regret with a low computational complexity. Experimental results show that R(MA)^2B-UCB outperforms the existing algorithms in both regret and run time.
    Molecular Energy Learning Using Alternative Blackbox Matrix-Matrix Multiplication Algorithm for Exact Gaussian Process. (arXiv:2109.09817v1 [physics.chem-ph])
    (2 min) We present an application of the blackbox matrix-matrix multiplication (BBMM) algorithm to scale up the Gaussian Process (GP) training of molecular energies in the molecular-orbital based machine learning (MOB-ML) framework. An alternative implementation of BBMM (AltBBMM) is also proposed to train more efficiently (over four-fold speedup) with the same accuracy and transferability as the original BBMM implementation. The training of MOB-ML was limited to 220 molecules, and BBMM and AltBBMM scale the training of MOB-ML up by over 30 times to 6500 molecules (more than a million pair energies). The accuracy and transferability of both algorithms are examined on the benchmark datasets of organic molecules with 7 and 13 heavy atoms. These lower-scaling implementations of the GP preserve the state-of-the-art learning efficiency in the low-data regime while extending it to the large-data regime with better accuracy than other available machine learning works on molecular energies.
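    The matrix-free idea underlying BBMM-style GP training can be illustrated generically: solve (K + sigma^2 I) alpha = y with conjugate gradients, touching the kernel matrix only through matrix-vector products. This is a hedged NumPy/SciPy sketch, not the MOB-ML or AltBBMM implementation.

    ```python
    # Exact GP regression mean via conjugate gradients with a matrix-free kernel matvec.
    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def rbf_kernel_matvec(X, v, lengthscale=1.0, block=256):
        """Compute K @ v for an RBF kernel in row blocks, never storing the full K."""
        n = X.shape[0]
        out = np.empty(n)
        sq = np.sum(X**2, axis=1)
        for start in range(0, n, block):
            stop = min(start + block, n)
            d2 = sq[start:stop, None] - 2 * X[start:stop] @ X.T + sq[None, :]
            out[start:stop] = np.exp(-0.5 * d2 / lengthscale**2) @ v
        return out

    def gp_predict_mean(X, y, X_star, noise=1e-2, lengthscale=1.0):
        n = X.shape[0]
        A = LinearOperator((n, n), matvec=lambda v: rbf_kernel_matvec(X, v, lengthscale) + noise * v,
                           dtype=np.float64)
        alpha, _ = cg(A, y, maxiter=1000)        # solve (K + noise*I) alpha = y
        d2 = np.sum(X_star**2, 1)[:, None] - 2 * X_star @ X.T + np.sum(X**2, 1)[None, :]
        K_star = np.exp(-0.5 * d2 / lengthscale**2)
        return K_star @ alpha
    ```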
    Demonstration-Efficient Guided Policy Search via Imitation of Robust Tube MPC. (arXiv:2109.09910v1 [cs.RO])
    (2 min) We propose a demonstration-efficient strategy to compress a computationally expensive Model Predictive Controller (MPC) into a more computationally efficient representation based on a deep neural network and Imitation Learning (IL). By generating a Robust Tube variant (RTMPC) of the MPC and leveraging properties of the tube, we introduce a data augmentation method that enables high demonstration-efficiency and is capable of compensating for the distribution shifts typically encountered in IL. Our approach opens the possibility of zero-shot transfer from a single demonstration collected in a nominal domain, such as a simulation or a robot in a lab/controlled environment, to a domain with bounded model errors/perturbations. Numerical and experimental evaluations performed on a trajectory tracking MPC for a quadrotor show that our method outperforms strategies commonly employed in IL, such as DAgger and Domain Randomization, in terms of demonstration-efficiency and robustness to perturbations unseen during training.
    SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization. (arXiv:2109.09831v1 [cs.LG])
    (2 min) Algorithm parameters, in particular hyperparameters of machine learning algorithms, can substantially impact their performance. To support users in determining well-performing hyperparameter configurations for their algorithms, datasets and applications at hand, SMAC3 offers a robust and flexible framework for Bayesian Optimization, which can improve performance within a few evaluations. It offers several facades and pre-sets for typical use cases, such as optimizing hyperparameters, solving low dimensional continuous (artificial) global optimization problems and configuring algorithms to perform well across multiple problem instances. The SMAC3 package is available under a permissive BSD-license at https://github.com/automl/SMAC3.
    Description of Corner Cases in Automated Driving: Goals and Challenges. (arXiv:2109.09607v2 [cs.LG] UPDATED)
    (2 min) Scaling the distribution of automated vehicles requires handling various unexpected and possibly dangerous situations, termed corner cases (CC). Since many modules of automated driving systems are based on machine learning (ML), CC are an essential part of the data for their development. However, there is only a limited amount of CC data in large-scale data collections, which makes them challenging in the context of ML. A better understanding of CC can benefit both offline applications, e.g., dataset analysis, and online methods, e.g., the performance of automated driving systems. While there are knowledge-based descriptions and taxonomies for CC, there is little research on machine-interpretable descriptions. In this extended abstract, we give a brief overview of the challenges and goals of such a description.
    Metamorphic Relation Prioritization for Effective Regression Testing. (arXiv:2109.09798v1 [cs.SE])
    (2 min) Metamorphic testing (MT) is widely used for testing programs that face the oracle problem. It uses a set of metamorphic relations (MRs), which are relations among multiple inputs and their corresponding outputs, to determine whether the program under test is faulty. Typically, MRs vary in their ability to detect faults in the program under test, and some MRs tend to detect the same set of faults. In this paper, we propose approaches to prioritize MRs to improve the efficiency and effectiveness of MT for regression testing. We present two MR prioritization approaches: (1) fault-based and (2) coverage-based. To evaluate these MR prioritization approaches, we conduct experiments on three complex open-source software systems. Our results show that our MR prioritization approaches significantly outperform the current practice of executing the source and follow-up test cases of the MRs in an ad-hoc manner in terms of fault detection effectiveness. Further, fault-based MR prioritization reduces the number of source and follow-up test cases that need to be executed as well as the average time taken to detect a fault, which would save time and cost during the testing process.
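    One plausible reading of fault-based prioritization is a greedy, set-cover-style ordering: repeatedly pick the MR that detects the most not-yet-covered faults. The sketch below assumes a precomputed mapping from MRs to the faults they detect; the paper's exact procedure may differ.

    ```python
    # Greedy fault-based ordering of metamorphic relations.
    def prioritize_mrs(fault_matrix):
        """fault_matrix: dict MR-name -> set of fault ids detected by that MR."""
        remaining = {mr: set(faults) for mr, faults in fault_matrix.items()}
        order = []
        while remaining:
            # Pick the MR that detects the most not-yet-covered faults.
            best = max(remaining, key=lambda mr: len(remaining[mr]))
            order.append(best)
            covered = remaining.pop(best)
            for mr in remaining:
                remaining[mr] -= covered
        return order

    # Example: MR2 first (3 new faults), then MR1, then MR3.
    print(prioritize_mrs({"MR1": {1, 2}, "MR2": {2, 3, 4}, "MR3": {4}}))
    ```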
    MBB: Model-Based Baseline for Efficient Reinforcement Learning. (arXiv:2011.02073v3 [cs.LG] UPDATED)
    (2 min) Model-free reinforcement learning (RL) is capable of learning control policies for high-dimensional, complex robotic tasks, but tends to be data-inefficient. Model-based RL tends to be more data-efficient but often struggles to learn a high-dimensional model that is good enough for policy improvement, which limits its use to learning simple models for restrictive domains. Optimal control generates solutions without collecting any data, assuming an accurate model of the system and environment is known, which is often true in many control theory applications. However, optimal control cannot be scaled to problems with a high-dimensional state space. In this paper, we propose a novel approach to alleviate the data inefficiency of model-free RL in high-dimensional problems by warm-starting the learning process using a lower-dimensional model-based solution. In particular, we initialize a baseline function for the high-dimensional RL problem via supervision from a lower-dimensional value function, which can be obtained by solving a lower-dimensional problem with a known, approximate model using "classical" techniques such as value iteration or optimal control. Our approach therefore implicitly exploits the model priors from the simplified problem space to facilitate policy learning in high-dimensional RL tasks. We demonstrate our approach on two representative robotic learning tasks and observe significant improvements in policy performance and learning efficiency. We also evaluate our method empirically on a third task.
    Trust Your Robots! Predictive Uncertainty Estimation of Neural Networks with Sparse Gaussian Processes. (arXiv:2109.09690v2 [cs.RO] UPDATED)
    (2 min) This paper presents a probabilistic framework to obtain both reliable and fast uncertainty estimates for predictions with Deep Neural Networks (DNNs). Our main contribution is a practical and principled combination of DNNs with sparse Gaussian Processes (GPs). We prove theoretically that DNNs can be seen as a special case of sparse GPs, namely mixtures of GP experts (MoE-GP), and we devise a learning algorithm that brings the derived theory into practice. In experiments from two different robotic tasks -- inverse dynamics of a manipulator and object detection on a micro-aerial vehicle (MAV) -- we show the effectiveness of our approach in terms of predictive uncertainty, improved scalability, and run-time efficiency on a Jetson TX2. We thus argue that our approach can pave the way towards reliable and fast robot learning systems with uncertainty awareness.
    Dynamic Sampling and Selective Masking for Communication-Efficient Federated Learning. (arXiv:2003.09603v2 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) is a novel machine learning setting that enables on-device intelligence via decentralized training and federated optimization. The rapid development of deep neural networks provides powerful techniques for modeling complex problems and, in the federated setting, gives rise to federated deep learning. However, the tremendous number of model parameters places a heavy load on the communication network. This paper introduces two approaches for improving communication efficiency: dynamic sampling and top-$k$ selective masking. The former dynamically controls the fraction of client models selected in each round, while the latter selects the parameters with the top-$k$ largest differences for federated updating. Experiments on convolutional image classification and recurrent language modeling are conducted on three public datasets to show the effectiveness of our proposed methods.
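    The top-$k$ selective masking step can be sketched in a few lines: each client transmits only the k parameters that changed the most relative to the current global model. The flat parameter vectors and toy values are illustrative simplifications.

    ```python
    # Transmit only the k parameters with the largest absolute change.
    import numpy as np

    def topk_selective_mask(local_params, global_params, k):
        diff = local_params - global_params
        idx = np.argsort(np.abs(diff))[-k:]          # indices of the k largest changes
        return idx, local_params[idx]                # sparse update to transmit

    def apply_sparse_update(global_params, idx, values):
        updated = global_params.copy()
        updated[idx] = values
        return updated

    global_w = np.zeros(10)
    local_w = np.array([0.0, 0.5, -0.1, 2.0, 0.0, -1.5, 0.2, 0.0, 0.05, 0.0])
    idx, vals = topk_selective_mask(local_w, global_w, k=3)
    print(apply_sparse_update(global_w, idx, vals))  # only the 3 largest changes applied
    ```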
    Comparison of single and multitask learning for predicting cognitive decline based on MRI data. (arXiv:2109.10266v1 [cs.LG])
    (2 min) The Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) is a neuropsychological tool that has been designed to assess the severity of cognitive symptoms of dementia. Personalized prediction of the changes in ADAS-Cog scores could help in timing therapeutic interventions in dementia and at-risk populations. In the present work, we compared single and multitask learning approaches to predict the changes in ADAS-Cog scores based on T1-weighted anatomical magnetic resonance imaging (MRI). In contrast to most machine learning-based methods for predicting ADAS-Cog changes, we stratified the subjects based on their baseline diagnoses and evaluated the prediction performance in each group. Our experiments indicated a positive relationship between the predicted and observed ADAS-Cog score changes in each diagnostic group, suggesting that T1-weighted MRI has predictive value for evaluating cognitive decline in the entire AD continuum. We further studied whether correcting for the differences in the magnetic field strength of MRI would improve the ADAS-Cog score prediction. The partial least squares-based domain adaptation slightly improved the prediction performance, but the improvement was marginal. In summary, this study demonstrated that ADAS-Cog change can, to some extent, be predicted based on anatomical MRI. Based on this study, the recommended method for learning the predictive models is single-task regularized linear regression due to its simplicity and good performance. It appears important to combine the training data across all subject groups for the most effective predictive models.
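    The recommended single-task model amounts to a regularized linear regression over MRI-derived features; a minimal stand-in with synthetic features looks like this (the upstream extraction of features from T1-weighted MRI is assumed).

    ```python
    # Ridge regression predicting a score change from precomputed imaging features.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))              # stand-in for MRI-derived features
    y = X[:, 0] * 0.5 + rng.normal(size=200)    # stand-in for ADAS-Cog change

    model = Ridge(alpha=1.0)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print("cross-validated R^2:", scores.mean())
    ```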
    3KG: Contrastive Learning of 12-Lead Electrocardiograms using Physiologically-Inspired Augmentations. (arXiv:2106.04452v2 [physics.med-ph] UPDATED)
    (2 min) We propose 3KG, a physiologically-inspired contrastive learning approach that generates views using 3D augmentations of the 12-lead electrocardiogram. We evaluate representation quality by fine-tuning a linear layer for the downstream task of 23-class diagnosis on the PhysioNet 2020 challenge training data and find that 3KG achieves a $9.1\%$ increase in mean AUC over the best self-supervised baseline when trained on $1\%$ of labeled data. Our empirical analysis shows that combining spatial and temporal augmentations produces the strongest representations. In addition, we investigate the effect of this physiologically-inspired pretraining on downstream performance on different disease subgroups and find that 3KG makes the greatest gains for conduction and rhythm abnormalities. Our method allows for flexibility in incorporating other self-supervised strategies and highlights the potential for similar modality-specific augmentations for other biomedical signals.
    Deep Directed Information-Based Learning for Privacy-Preserving Smart Meter Data Release. (arXiv:2011.11421v2 [cs.LG] UPDATED)
    (2 min) The explosion of data collection has raised serious privacy concerns in users due to the possibility that sharing data may also reveal sensitive information. The main goal of a privacy-preserving mechanism is to prevent a malicious third party from inferring sensitive information while keeping the shared data useful. In this paper, we study this problem in the context of time series data, and smart meter (SM) power consumption measurements in particular. Although Mutual Information (MI) between private and released variables has been used as a common information-theoretic privacy measure, it fails to capture the causal time dependencies present in the power consumption time series data. To overcome this limitation, we introduce Directed Information (DI) as a more meaningful measure of privacy in the considered setting and propose a novel loss function. The optimization is then performed using an adversarial framework where two Recurrent Neural Networks (RNNs), referred to as the releaser and the adversary, are trained with opposite goals. Our empirical studies on real-world data sets of SM measurements, in the worst-case scenario where an attacker has access to all the training data used by the releaser, validate the proposed method and show the existing trade-offs between privacy and utility.
    Towards Discovery and Attribution of Open-world GAN Generated Images. (arXiv:2105.04580v2 [cs.CV] UPDATED)
    (2 min) With the recent progress in Generative Adversarial Networks (GANs), it is imperative for media and visual forensics to develop detectors that can identify and attribute images to the model generating them. Existing works have been shown to attribute images to their corresponding GAN sources with high accuracy. However, these works are limited to a closed-set scenario, failing to generalize to GANs unseen during training, and are therefore not scalable with a steady influx of new GANs. We present an iterative algorithm for discovering images generated from previously unseen GANs by exploiting the fact that all GANs leave distinct fingerprints on their generated images. Our algorithm consists of multiple components including network training, out-of-distribution detection, clustering, merge and refine steps. Through extensive experiments, we show that our algorithm discovers unseen GANs with high accuracy and also generalizes to GANs trained on unseen real datasets. We additionally apply our algorithm to attribution and discovery of GANs in an online fashion as well as to the more standard task of real/fake detection. Our experiments demonstrate the effectiveness of our approach to discover new GANs, and it can be used in an open-world setup.
    Relation-Guided Pre-Training for Open-Domain Question Answering. (arXiv:2109.10346v1 [cs.CL])
    (2 min) Answering complex open-domain questions requires understanding the latent relations between the involved entities. However, we found that existing QA datasets are extremely imbalanced in some types of relations, which hurts the generalization performance on questions with long-tail relations. To remedy this problem, in this paper, we propose a Relation-Guided Pre-Training (RGPT-QA) framework. We first generate a relational QA dataset covering a wide range of relations from both Wikidata triplets and Wikipedia hyperlinks. We then pre-train a QA model to infer the latent relations from the question, and then conduct extractive QA to get the target answer entity. We demonstrate that by pretraining with the proposed RGPT-QA technique, the popular open-domain QA model, Dense Passage Retriever (DPR), achieves 2.2%, 2.4%, and 6.3% absolute improvements in Exact Match accuracy on Natural Questions, TriviaQA, and WebQuestions. In particular, we show that RGPT-QA improves significantly on questions with long-tail relations.
    Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images. (arXiv:2004.14487v3 [cs.CV] UPDATED)
    (2 min) The connection between visual input and tactile sensing is critical for object manipulation tasks such as grasping and pushing. In this work, we introduce the challenging task of estimating a set of tactile physical properties from visual information. We aim to build a model that learns the complex mapping between visual information and tactile physical properties. We construct a first of its kind image-tactile dataset with over 400 multiview image sequences and the corresponding tactile properties. A total of fifteen tactile physical properties across categories including friction, compliance, adhesion, texture, and thermal conductance are measured and then estimated by our models. We develop a cross-modal framework comprised of an adversarial objective and a novel visuo-tactile joint classification loss. Additionally, we develop a neural architecture search framework capable of selecting optimal combinations of viewing angles for estimating a given physical property.
    Reproducibility Study: Comparing Rewinding and Fine-tuning in Neural Network Pruning. (arXiv:2109.09670v2 [cs.LG] UPDATED)
    (2 min) Scope of reproducibility: We are reproducing Comparing Rewinding and Fine-tuning in Neural Networks from arXiv:2003.02389. In this work the authors compare three different approaches to retraining neural networks after pruning: 1) fine-tuning, 2) rewinding weights as in arXiv:1803.03635 and 3) a new, original method involving learning rate rewinding, building upon Lottery Ticket Hypothesis. We reproduce the results of all three approaches, but we focus on verifying their approach, learning rate rewinding, since it is newly proposed and is described as a universal alternative to other methods. We used CIFAR10 for most reproductions along with additional experiments on the larger CIFAR100, which extends the results originally provided by the authors. We have also extended the list of tested network architectures to include Wide ResNets. The new experiments led us to discover the limitations of learning rate rewinding which can worsen pruning results on large architectures. Results: We were able to reproduce the exact results reported by the authors in all originally reported scenarios. However, extended results on larger Wide Residual Networks have demonstrated the limitations of the newly proposed learning rate rewinding -- we observed a previously unreported accuracy degradation for low sparsity ranges. Nevertheless, the general conclusion of the paper still holds and was indeed reproduced.
    CondNet: Conditional Classifier for Scene Segmentation. (arXiv:2109.10322v1 [cs.CV])
    (2 min) The fully convolutional network (FCN) has achieved tremendous success in dense visual recognition tasks, such as scene segmentation. The last layer of an FCN is typically a global classifier (1x1 convolution) that assigns a semantic label to each pixel. We empirically show that this global classifier, ignoring the intra-class distinction, may lead to sub-optimal results. In this work, we present a conditional classifier to replace the traditional global classifier, where the kernels of the classifier are generated dynamically conditioned on the input. The main advantages of the new classifier are: (i) it attends to the intra-class distinction, leading to stronger dense recognition capability; (ii) the conditional classifier is simple and flexible enough to be integrated into almost any FCN architecture to improve the prediction. Extensive experiments demonstrate that the proposed classifier performs favourably against the traditional classifier on the FCN architecture. The framework equipped with the conditional classifier (called CondNet) achieves new state-of-the-art performance on two datasets. The code and models are available at https://git.io/CondNet.
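    A hedged PyTorch sketch of a conditional classifier in this spirit: the per-class 1x1 kernels are generated from the input's pooled features and applied to that same input's feature map. The layer sizes and kernel-generation head are illustrative, not the exact CondNet design.

    ```python
    # Dynamically generated 1x1 classification kernels, one set per input sample.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionalClassifier(nn.Module):
        def __init__(self, in_channels, num_classes):
            super().__init__()
            self.num_classes = num_classes
            # Generates num_classes * in_channels weights (one 1x1 kernel per class).
            self.kernel_gen = nn.Linear(in_channels, num_classes * in_channels)

        def forward(self, feats):                      # feats: (B, C, H, W)
            b, c, h, w = feats.shape
            context = F.adaptive_avg_pool2d(feats, 1).flatten(1)          # (B, C)
            kernels = self.kernel_gen(context).view(b, self.num_classes, c, 1, 1)
            logits = []
            for i in range(b):                         # per-sample dynamic 1x1 conv
                logits.append(F.conv2d(feats[i:i + 1], kernels[i]))
            return torch.cat(logits, dim=0)            # (B, num_classes, H, W)

    x = torch.randn(2, 64, 32, 32)
    print(ConditionalClassifier(64, 19)(x).shape)      # torch.Size([2, 19, 32, 32])
    ```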
    Neural forecasting at scale. (arXiv:2109.09705v2 [cs.LG] UPDATED)
    (2 min) We study the problem of efficiently scaling ensemble-based deep neural networks for time series (TS) forecasting on a large set of time series. Current state-of-the-art deep ensemble models have high memory and computational requirements, hampering their use to forecast millions of TS in practical scenarios. We propose N-BEATS(P), a global multivariate variant of the N-BEATS model designed to allow simultaneous training of multiple univariate TS forecasting models. Our model addresses the practical limitations of related models, reducing the training time by half and memory requirement by a factor of 5, while keeping the same level of accuracy. We have performed multiple experiments detailing the various ways to train our model and have obtained results that demonstrate its capacity to support zero-shot TS forecasting, i.e., to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, which provides an efficient and reliable solution to forecast at scale even in difficult forecasting conditions.
    Morphence: Moving Target Defense Against Adversarial Examples. (arXiv:2108.13952v3 [cs.LG] UPDATED)
    (2 min) Robustness to adversarial examples of machine learning models remains an open topic of research. Attacks often succeed by repeatedly probing a fixed target model with adversarial examples purposely crafted to fool it. In this paper, we introduce Morphence, an approach that shifts the defense landscape by making a model a moving target against adversarial examples. By regularly moving the decision function of a model, Morphence makes it significantly challenging for repeated or correlated attacks to succeed. Morphence deploys a pool of models generated from a base model in a manner that introduces sufficient randomness when it responds to prediction queries. To ensure repeated or correlated attacks fail, the deployed pool of models automatically expires after a query budget is reached and the model pool is seamlessly replaced by a new model pool generated in advance. We evaluate Morphence on two benchmark image classification datasets (MNIST and CIFAR10) against five reference attacks (2 white-box and 3 black-box). In all cases, Morphence consistently outperforms the thus-far effective defense, adversarial training, even in the face of strong white-box attacks, while preserving accuracy on clean data.
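    The moving-target mechanism can be caricatured as a model pool with a query budget: each query is answered by a randomly chosen pool member, and the pool is regenerated once the budget is spent. Pool generation and scheduling details are simplified relative to Morphence itself.

    ```python
    # Serve predictions from a rotating pool of models with a per-pool query budget.
    import random

    class MovingTargetDefense:
        def __init__(self, make_model_pool, pool_size=5, query_budget=1000):
            self.make_model_pool = make_model_pool      # callable -> list of model callables
            self.pool_size = pool_size
            self.query_budget = query_budget
            self._reset_pool()

        def _reset_pool(self):
            self.pool = self.make_model_pool(self.pool_size)
            self.queries_left = self.query_budget

        def predict(self, x):
            if self.queries_left <= 0:                  # budget reached: swap in a fresh pool
                self._reset_pool()
            self.queries_left -= 1
            return random.choice(self.pool)(x)          # randomness across queries
    ```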
    Learning to Prompt for Vision-Language Models. (arXiv:2109.01134v2 [cs.CV] UPDATED)
    (3 min) Vision-language pre-training has recently emerged as a promising alternative for representation learning. It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images and raw text with two separate encoders. Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks, since visual concepts can be generated directly from natural language, known as prompting. In this paper, we identify that a major challenge of deploying such models in practice is prompt engineering. This is because designing a proper prompt, especially for the context words surrounding a class name, requires domain expertise and typically takes a significant amount of time for word tuning, since a slight change in wording could have a huge impact on performance. Moreover, different downstream tasks require specific designs, further hampering the efficiency of deployment. To overcome this challenge, we propose a novel approach named context optimization (CoOp). The main idea is to model the context in prompts using continuous representations and perform end-to-end learning from data while keeping the pre-trained parameters fixed. In this way, the design of task-relevant prompts can be fully automated. Experiments on 11 datasets show that CoOp effectively turns pre-trained vision-language models into data-efficient visual learners, requiring as few as one or two shots to beat hand-crafted prompts by a decent margin, and is able to gain significant improvements when using more shots (e.g., at 16 shots the average gain is around 17%, with the highest reaching over 50%). CoOp also exhibits strong robustness to distribution shift.
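    The essence of context optimization is that the prompt's context tokens become continuous, learnable vectors prepended to each class-name embedding, trained end-to-end while the pre-trained encoders stay frozen. The sketch below uses a stand-in text encoder and made-up dimensions rather than the actual CLIP modules.

    ```python
    # Learnable continuous prompt context with a frozen text encoder (stand-in module).
    import torch
    import torch.nn as nn

    class LearnablePrompt(nn.Module):
        def __init__(self, text_encoder, class_name_embeddings, n_ctx=16, dim=512):
            super().__init__()
            self.text_encoder = text_encoder.eval()                  # frozen
            for p in self.text_encoder.parameters():
                p.requires_grad_(False)
            self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learned context vectors
            self.register_buffer("cls_emb", class_name_embeddings)   # (K, L, dim), fixed

        def forward(self):
            k = self.cls_emb.shape[0]
            ctx = self.ctx.unsqueeze(0).expand(k, -1, -1)            # same context for all classes
            prompts = torch.cat([ctx, self.cls_emb], dim=1)          # (K, n_ctx + L, dim)
            return self.text_encoder(prompts)                        # class text features
    ```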
    Introduction to Neural Network Verification. (arXiv:2109.10317v1 [cs.LG])
    (2 min) Deep learning has transformed the way we think of software and what it can do. But deep neural networks are fragile and their behaviors are often surprising. In many settings, we need to provide formal guarantees on the safety, security, correctness, or robustness of neural networks. This book covers foundational ideas from formal verification and their adaptation to reasoning about neural networks and deep learning.
    Discovery of temporal structure intricacy in arterial blood pressure waveforms representing acuity of liver transplant and forecasting short term surgical outcome via unsupervised manifold learning. (arXiv:2109.10258v1 [q-bio.QM])
    (3 min) Background: The arterial blood pressure (ABP) waveform evolves across each consecutive pulse during liver transplant surgery. We hypothesized that quantifying this waveform evolution reflects 1) the acuity of the recipient undergoing liver transplant and 2) the intraoperative dynamics that forecast short-term surgical outcomes. Methods: In this prospective observational single-cohort study on living donor liver transplant surgery, we extracted the waveform morphological evolution from the ABP data with an unsupervised manifold learning waveform analysis. Two quantitative indices, trend movement and fluctuation movement, were developed to represent the slow-varying and fast-varying dynamics, respectively. We investigated the associations with the liver disease acuity represented by the Model for End-Stage Liver Disease (MELD) score and the primary outcome, early allograft failure (EAF), as well as the recently developed EAF scores, including the Liver Graft Assessment Following Transplantation (L-GrAFT) score, the Early Allograft Failure Simplified Estimation (EASE) score, and the Model for Early Allograft Function (MEAF) score. Results: Sixty recipients were enrolled. The presurgical trend movement was correlated with the MELD scores. It decreased in the anhepatic phase. The neohepatic trend movement correlated with the L-GrAFT score, the EASE score, and the MEAF score. Among the constituents of the EAF scores, the trend movement correlated most with the postoperative day 7 bilirubin. Conclusions: The ABP waveform evolution intricacy in the presurgical phase reflects the recipient's acuity, while that in the neohepatic phase reveals the short-term surgical outcome calculated from laboratory data on postoperative days 7-10. The waveform evolution reflects the intraoperative contribution to the early outcome.
    Assured Neural Network Architectures for Control and Identification of Nonlinear Systems. (arXiv:2109.10298v1 [cs.LG])
    (2 min) In this paper, we consider the problem of automatically designing a Rectified Linear Unit (ReLU) Neural Network (NN) architecture (number of layers and number of neurons per layer) with the assurance that it is sufficiently parametrized to control a nonlinear system; i.e. control the system to satisfy a given formal specification. This is unlike current techniques, which provide no assurances on the resultant architecture. Moreover, our approach requires only limited knowledge of the underlying nonlinear system and specification. We assume only that the specification can be satisfied by a Lipschitz-continuous controller with a known bound on its Lipschitz constant; the specific controller need not be known. From this assumption, we bound the number of affine functions needed to construct a Continuous Piecewise Affine (CPWA) function that can approximate any Lipschitz-continuous controller that satisfies the specification. Then we connect this CPWA to a NN architecture using the authors' recent results on the Two-Level Lattice (TLL) NN architecture; the TLL architecture was shown to be parameterized by the number of affine functions present in the CPWA function it realizes.
    Utterance Clustering Using Stereo Audio Channels. (arXiv:2009.05076v2 [eess.AS] UPDATED)
    (2 min) Utterance clustering is one of the actively researched topics in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining the left- and right-channel audio signals in a few different ways, and embedded features (also called d-vectors) were then extracted from those processed audio signals. This study applied the Gaussian mixture model to supervised utterance clustering. In the training phase, a parameter-sharing Gaussian mixture model was trained for each speaker. In the testing phase, the speaker with the maximum likelihood was selected as the detected speaker. Results of experiments with real audio recordings of multi-person discussion sessions showed that the proposed method using multichannel audio signals achieved significantly better performance than a conventional method with mono audio signals in more complicated conditions.
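    The supervised clustering step can be sketched with scikit-learn: fit one Gaussian mixture per speaker on that speaker's d-vectors, then label a test d-vector with the speaker whose mixture assigns it the highest likelihood. The d-vector extraction and stereo-channel combination are assumed to have happened upstream, and the parameter-sharing detail is omitted.

    ```python
    # One GMM per speaker; classify by maximum log-likelihood.
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmms(dvectors_by_speaker, n_components=4):
        """dvectors_by_speaker: dict speaker-id -> array of shape (n_utterances, dim)."""
        return {spk: GaussianMixture(n_components=n_components).fit(vecs)
                for spk, vecs in dvectors_by_speaker.items()}

    def detect_speaker(gmms, dvector):
        dvector = dvector.reshape(1, -1)
        # score_samples returns the per-sample log-likelihood under the mixture.
        return max(gmms, key=lambda spk: gmms[spk].score_samples(dvector)[0])
    ```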

2021-09-21

  • cs.CL updates on arXiv.org

    CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization. (arXiv:2109.09209v1 [cs.CL])
    (2 min) We study generating abstractive summaries that are faithful and factually consistent with the given articles. A novel contrastive learning formulation is presented, which leverages both reference summaries, as positive training data, and automatically generated erroneous summaries, as negative training data, to train summarization systems that are better at distinguishing between them. We further design four types of strategies for creating negative samples, to resemble errors made commonly by two state-of-the-art models, BART and PEGASUS, found in our new human annotations of summary errors. Experiments on XSum and CNN/Daily Mail show that our contrastive learning framework is robust across datasets and models. It consistently produces more factual summaries than strong comparisons with post error correction, entailment-based reranking, and unlikelihood training, according to QA-based factuality evaluation. Human judges echo the observation and find that our model summaries correct more errors.
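    A generic contrastive objective over summary representations, for illustration only (not CLIFF's exact formulation): pull the article representation toward reference summaries and away from automatically corrupted ones.

    ```python
    # InfoNCE-style loss over positive (reference) and negative (erroneous) summary vectors.
    import torch
    import torch.nn.functional as F

    def contrastive_summary_loss(article_vec, pos_vecs, neg_vecs, temperature=0.1):
        """article_vec: (d,), pos_vecs: (P, d), neg_vecs: (N, d)."""
        a = F.normalize(article_vec, dim=-1)
        pos = F.normalize(pos_vecs, dim=-1)
        neg = F.normalize(neg_vecs, dim=-1)
        pos_sim = pos @ a / temperature                 # similarities to positives, (P,)
        neg_sim = neg @ a / temperature                 # similarities to negatives, (N,)
        logits = torch.cat([pos_sim, neg_sim])
        log_probs = F.log_softmax(logits, dim=0)
        return -log_probs[: len(pos_sim)].mean()        # encourage positives over negatives
    ```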
    Joint Passage Ranking for Diverse Multi-Answer Retrieval. (arXiv:2104.08445v2 [cs.CL] UPDATED)
    (2 min) We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a given question. This task requires joint modeling of retrieved passages, as models should not repeatedly retrieve passages containing the same answer at the cost of missing a different valid answer. In this paper, we introduce JPR, the first joint passage retrieval model for multi-answer retrieval. JPR makes use of an autoregressive reranker that selects a sequence of passages, each conditioned on previously selected passages. JPR is trained to select passages that cover new answers at each timestep and uses a tree-decoding algorithm to enable flexibility in the degree of diversity. Compared to prior approaches, JPR achieves significantly better answer coverage on three multi-answer datasets. When combined with downstream question answering, the improved retrieval enables larger answer generation models since they need to consider fewer passages, establishing a new state-of-the-art.
    A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio. (arXiv:2107.02852v2 [eess.AS] UPDATED)
    (2 min) Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, considering joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulated data. In this paper, we present our recent study comparing such modular and joint approaches to SA-ASR on real monaural recordings. We develop state-of-the-art SA-ASR systems for both modular and joint approaches by leveraging large-scale training data, including 75 thousand hours of ASR training data and the VoxCeleb corpus for speaker representation learning. We also propose a new pipeline that performs the E2E SA-ASR model after speaker clustering. Our evaluation on the AMI meeting corpus reveals that after fine-tuning with a small amount of real data, the joint system performs 8.9--29.9% better in accuracy compared to the best modular system, while the modular system performs better before such fine-tuning. We also conduct various error analyses to show the remaining issues for monaural SA-ASR.
    Finetuning Pretrained Transformers into RNNs. (arXiv:2103.13076v2 [cs.CL] UPDATED)
    (2 min) Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process has lower training cost relative to training these recurrent variants from scratch. As many models for natural language tasks are increasingly dependent on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.
    MirrorWiC: On Eliciting Word-in-Context Representations from Pretrained Language Models. (arXiv:2109.09237v1 [cs.CL])
    (2 min) Recent work indicated that pretrained language models (PLMs) such as BERT and RoBERTa can be transformed into effective sentence and word encoders even via simple self-supervised techniques. Inspired by this line of work, in this paper we propose a fully unsupervised approach to improving word-in-context (WiC) representations in PLMs, achieved via a simple and efficient WiC-targeted fine-tuning procedure: MirrorWiC. The proposed method leverages only raw texts sampled from Wikipedia, assuming no sense-annotated data, and learns context-aware word representations within a standard contrastive learning setup. We experiment with a series of standard and comprehensive WiC benchmarks across multiple languages. Our proposed fully unsupervised MirrorWiC models obtain substantial gains over off-the-shelf PLMs across all monolingual, multilingual and cross-lingual setups. Moreover, on some standard WiC benchmarks, MirrorWiC is even on-par with supervised models fine-tuned with in-task data and sense labels.
    Unified and Multilingual Author Profiling for Detecting Haters. (arXiv:2109.09233v1 [cs.CL])
    (2 min) This paper presents a unified user profiling framework to identify hate speech spreaders by processing their tweets regardless of the language. The framework encodes the tweets with sentence transformers and applies an attention mechanism to select important tweets for learning user profiles. Furthermore, the attention layer helps to explain why a user is a hate speech spreader by producing attention weights at both token and post level. Our proposed model outperformed the state-of-the-art multilingual transformer models.
    Preventing Author Profiling through Zero-Shot Multilingual Back-Translation. (arXiv:2109.09133v1 [cs.CL])
    (2 min) Documents as short as a single sentence may inadvertently reveal sensitive information about their authors, including e.g. their gender or ethnicity. Style transfer is an effective way of transforming texts in order to remove any information that enables author profiling. However, for a number of current state-of-the-art approaches the improved privacy is accompanied by an undesirable drop in the down-stream utility of the transformed data. In this paper, we propose a simple, zero-shot way to effectively lower the risk of author profiling through multilingual back-translation using off-the-shelf translation models. We compare our models with five representative text style transfer models on three datasets across different domains. Results from both an automatic and a human evaluation show that our approach achieves the best overall performance while requiring no training data. We are able to lower the adversarial prediction of gender and race by up to $22\%$ while retaining $95\%$ of the original utility on downstream tasks.
    Towards Zero-Label Language Learning. (arXiv:2109.09193v1 [cs.CL])
    (2 min) This paper explores zero-label learning in Natural Language Processing (NLP), whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. At the core of our framework is a novel approach for better leveraging the powerful pretrained language models. Specifically, inspired by the recent success of few-shot inference on GPT-3, we present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations. Our method enables zero-label learning as we train task-specific models solely on the synthetic data, yet we achieve better or comparable results from strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, our approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
    FST Morphological Analyser and Generator for Mapud\"ungun. (arXiv:2109.09176v1 [cs.CL])
    (2 min) Following the Mapuche grammar by Smeets, this article describes the main morphophonological aspects of Mapud\"ungun, explaining what triggers them and the contexts where they arise. We present a computational approach producing a finite state morphological analyser (and generator) capable of classifying and appropriately tagging all the components (roots and suffixes) that interact in a Mapuche word form. The bulk of the article focuses on presenting details about the morphology of the Mapud\"ungun verb and its formalisation using FOMA. A system evaluation process and its results are also presented in this article.
    An Alignment-Agnostic Model for Chinese Text Error Correction. (arXiv:2104.07190v2 [cs.CL] UPDATED)
    (2 min) This paper investigates how to correct Chinese text errors involving mistaken, missing and redundant characters, which are common for Chinese native speakers. Most existing models based on the detect-correct framework can correct mistaken-character errors, but they cannot deal with missing or redundant characters. The reason is that the lengths of sentences before and after correction are not the same, leading to an inconsistency between model inputs and outputs. Although Seq2Seq-based or sequence tagging methods provide solutions to this problem and have achieved relatively good results in English contexts, they do not perform well in Chinese contexts according to our experimental results. In our work, we propose a novel detect-correct framework that is alignment-agnostic, meaning that it can handle both aligned and non-aligned text, and it can also serve as a cold-start model when no annotated data are provided. Experimental results on three datasets demonstrate that our method is effective and achieves the best performance among existing published models.
    Do Long-Range Language Models Actually Use Long-Range Context?. (arXiv:2109.09115v1 [cs.CL])
    (2 min) Language models are generally trained on short, truncated input sequences, which limits their ability to use discourse-level information present in long-range context to improve their predictions. Recent efforts to improve the efficiency of self-attention have led to a proliferation of long-range Transformer language models, which can process much longer sequences than models of the past. However, the ways in which such models take advantage of the long-range context remain unclear. In this paper, we perform a fine-grained analysis of two long-range Transformer language models (including the \emph{Routing Transformer}, which achieves state-of-the-art perplexity on the PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to 8K tokens. Our results reveal that providing long-range context (i.e., beyond the previous 2K tokens) to these models only improves their predictions on a small set of tokens (e.g., those that can be copied from the distant context) and does not help at all for sentence-level prediction tasks. Finally, we discover that PG-19 contains a variety of different document types and domains, and that long-range context helps most for literary novels (as opposed to textbooks or magazines).
    One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages. (arXiv:2102.02585v3 [cs.CL] UPDATED)
    (2 min) Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored, and non-English fastText models were trained using subword sizes optimized for English and German word analogy tasks. In our work, we find the optimal subword sizes on the English, German, Czech, Italian, Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We then propose a simple n-gram coverage model and we show that it predicts better-than-default subword sizes on the Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We show that the optimization of fastText's subword sizes matters and results in a 14% improvement on the Czech word analogy task. We also show that expensive parameter optimization can be replaced by a simple n-gram coverage model that consistently improves the accuracy of fastText models on the word analogy tasks by up to 3% compared to the default subword sizes, and that it is within 1% accuracy of the optimal subword sizes.
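    The knob being optimized is simply the character n-gram range of a fastText model; with gensim's implementation it is exposed as min_n and max_n (the toy corpus and the specific sizes below are placeholders).

    ```python
    # Train fastText models with different subword (character n-gram) size ranges.
    from gensim.models import FastText

    sentences = [["subword", "sizes", "matter"],
                 ["analogy", "tasks", "differ", "by", "language"]]

    # Default-like setting (subwords of length 3-6) vs. a narrower custom range.
    default_model = FastText(sentences, vector_size=100, min_n=3, max_n=6, min_count=1)
    custom_model = FastText(sentences, vector_size=100, min_n=2, max_n=4, min_count=1)

    print(default_model.wv["subword"].shape, custom_model.wv["subword"].shape)
    ```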
    Conditional probing: measuring usable information beyond a baseline. (arXiv:2109.09234v1 [cs.CL])
    (2 min) Probing experiments investigate the extent to which neural representations make properties -- like part-of-speech -- predictable. One suggests that a representation encodes a property if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. Instead of using baselines as a point of comparison, we're interested in measuring information that is contained in the representation but not in the baseline. For example, current methods can detect when a representation is more useful than the word identity (a baseline) for predicting part-of-speech; however, they cannot detect when the representation is predictive of just the aspects of part-of-speech not explainable by the word identity. In this work, we extend a theory of usable information called $\mathcal{V}$-information and propose conditional probing, which explicitly conditions on the information in the baseline. In a case study, we find that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.
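    A rough accuracy-gap analogue of the idea: train one probe on the baseline alone and one on the concatenation of baseline and contextual representation, and inspect the gain. The paper's $\mathcal{V}$-information formulation is more careful than this sketch, and the random features below stand in for real embeddings.

    ```python
    # Compare a probe on the baseline with a probe on [baseline; representation].
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    baseline = rng.normal(size=(500, 50))       # e.g. non-contextual word embeddings
    contextual = rng.normal(size=(500, 50))     # e.g. a deeper layer's hidden states
    labels = rng.integers(0, 5, size=500)       # e.g. part-of-speech tags

    acc_baseline = cross_val_score(LogisticRegression(max_iter=1000),
                                   baseline, labels, cv=3).mean()
    both = np.hstack([baseline, contextual])
    acc_conditional = cross_val_score(LogisticRegression(max_iter=1000),
                                      both, labels, cv=3).mean()
    print("gain beyond the baseline:", acc_conditional - acc_baseline)
    ```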
    UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims. (arXiv:2109.09232v1 [cs.CL])
    (2 min) Identifying check-worthy claims is often the first step of automated fact-checking systems. Tackling this task in a multilingual setting has been understudied. Encoding inputs with multilingual text representations could be one approach to multilingual check-worthiness detection. However, this approach could suffer if cultural bias exists within the communities on determining what is check-worthy. In this paper, we propose a language identification task as an auxiliary task to mitigate unintended bias. With this purpose, we experiment with joint training using the datasets from CLEF-2021 CheckThat!, which contain tweets in English, Arabic, Bulgarian, Spanish and Turkish. Our results show that joint training of language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.
    Measuring the Volatility of the Political agenda in Public Opinion and News Media. (arXiv:1808.09037v2 [cs.CY] UPDATED)
    (2 min) Recent election surprises, regime changes, and political shocks indicate that political agendas have become more fast-moving and volatile. The ability to measure the complex dynamics of agenda change and capture the nature and extent of volatility in political systems is therefore more crucial than ever before. This study proposes a definition and operationalization of volatility that combines insights from political science, communications, information theory, and computational techniques. The proposed measures of fractionalization and agenda change encompass the shifting salience of issues in the agenda as a whole and allow the study of agendas across different domains. We evaluate these metrics and compare them to other measures, such as issue-level survival rates and the Pedersen Index, using public-opinion poll data to measure public agendas and traditional media content to measure media agendas in the UK and Germany. We show how these measures complement existing approaches and could be employed in future agenda-setting research.
    Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries. (arXiv:2109.09195v1 [cs.CL])
    (2 min) Current pre-trained models applied to summarization are prone to factual inconsistencies which either misrepresent the source text or introduce extraneous information. Thus, comparing the factual consistency of summaries is necessary as we develop improved models. However, the optimal human evaluation setup for factual consistency has not been standardized. To address this issue, we crowdsourced evaluations for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols, on 100 articles from each of the CNN-Daily Mail and XSum datasets over four state-of-the-art models, to determine the most reliable evaluation framework. We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design. Our crowdsourcing templates and summary evaluations will be publicly available to facilitate future research on factual consistency in summarization.
    MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers. (arXiv:2109.00799v2 [cs.CL] UPDATED)
    (2 min) Developing automatic Math Word Problem (MWP) solvers has been an interest of NLP researchers since the 1960s. Over the last few years, there has been a growing number of datasets and deep learning-based methods proposed for effectively solving MWPs. However, most existing methods are benchmarked solely on one or two datasets, varying in different configurations, which leads to a lack of unified, standardized, fair, and comprehensive comparison between methods. This paper presents MWPToolkit, the first open-source framework for solving MWPs. In MWPToolkit, we decompose the procedure of existing MWP solvers into multiple core components and decouple their models into highly reusable modules. We also provide a hyper-parameter search function to boost performance. In total, we implement and compare 17 MWP solvers on 4 widely-used single-equation generation benchmarks and 2 multiple-equation generation benchmarks. These features enable our MWPToolkit to be suitable for researchers to reproduce advanced baseline models and develop new MWP solvers quickly. Code and documents are available at https://github.com/LYH-YF/MWPToolkit.
    Language Models are Few-Shot Butlers. (arXiv:2104.07972v2 [cs.CL] UPDATED)
    (2 min) Pretrained language models demonstrate strong performance in most NLP tasks when fine-tuned on small task-specific datasets. Hence, these autoregressive models constitute ideal agents to operate in text-based environments where language understanding and generative capabilities are essential. Nonetheless, collecting expert demonstrations in such environments is a time-consuming endeavour. We introduce a two-stage procedure to learn from a small set of demonstrations and further improve by interacting with an environment. We show that language models fine-tuned with only 1.2% of the expert demonstrations and a simple reinforcement learning algorithm achieve a 51% absolute improvement in success rate over existing methods in the ALFWorld environment.
    IGA : An Intent-Guided Authoring Assistant. (arXiv:2104.07000v2 [cs.CL] UPDATED)
    (2 min) While large-scale pretrained language models have significantly improved writing assistance functionalities such as autocomplete, more complex and controllable writing assistants have yet to be explored. We leverage advances in language modeling to build an interactive writing assistant that generates and rephrases text according to fine-grained author specifications. Users provide input to our Intent-Guided Assistant (IGA) in the form of text interspersed with tags that correspond to specific rhetorical directives (e.g., adding description or contrast, or rephrasing a particular sentence). We fine-tune a language model on a dataset heuristically-labeled with author intent, which allows IGA to fill in these tags with generated text that users can subsequently edit to their liking. A series of automatic and crowdsourced evaluations confirm the quality of IGA's generated outputs, while a small-scale user study demonstrates author preference for IGA over baseline methods in a creative writing task. We release our dataset, code, and demo to spur further research into AI-assisted writing.
    Context-Aware Interaction Network for Question Matching. (arXiv:2104.08451v2 [cs.CL] UPDATED)
    (2 min) Impressive milestones have been achieved in text matching by adopting a cross-attention mechanism to capture pertinent semantic connections between two sentence representations. However, regular cross-attention focuses on word-level links between the two input sequences, neglecting the importance of contextual information. We propose a context-aware interaction network (COIN) to properly align two sequences and infer their semantic relationship. Specifically, each interaction block includes (1) a context-aware cross-attention mechanism to effectively integrate contextual information when aligning two sequences, and (2) a gate fusion layer to flexibly interpolate aligned representations. We apply multiple stacked interaction blocks to produce alignments at different levels and gradually refine the attention results. Experiments on two question matching datasets and detailed analyses demonstrate the effectiveness of our model.
    BERT meets LIWC: Exploring State-of-the-Art Language Models for Predicting Communication Behavior in Couples' Conflict Interactions. (arXiv:2106.01536v2 [cs.CL] UPDATED)
    (3 min) Many processes in psychology are complex, such as dyadic interactions between two interacting partners (e.g., patient-therapist, intimate relationship partners). Nevertheless, many basic questions about interactions are difficult to investigate because dyadic processes can occur within a person and between partners, are based on multimodal aspects of behavior, and unfold rapidly. Current analyses are mainly based on the behavioral coding method, whereby human coders annotate behavior based on a coding schema. But coding is labor-intensive, expensive, slow, and focuses on only a few modalities. Current approaches in psychology use LIWC for analyzing couples' interactions. However, advances in natural language processing such as BERT could enable the development of systems to potentially automate behavioral coding, which in turn could substantially improve psychological research. In this work, we train machine learning models to automatically predict positive and negative communication behavioral codes of 368 German-speaking Swiss couples during an 8-minute conflict interaction on a fine-grained scale (10-second sequences) using linguistic features and paralinguistic features derived with openSMILE. Our results show that both simpler TF-IDF features and more complex BERT features performed better than LIWC, and that adding paralinguistic features did not improve performance. These results suggest it might be time to consider modern alternatives to LIWC, the de facto linguistic feature set in psychology, for prediction tasks in couples research. This work is a further step towards the automated coding of couples' behavior, which could enhance couple research and therapy and be utilized for other dyadic interactions as well.
    Translatotron 2: Robust direct speech-to-speech translation. (arXiv:2107.08661v3 [cs.CL] UPDATED)
    (2 min) We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the three previous components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retaining the source speaker's voice and, unlike the original Translatotron, is not able to generate speech in a different speaker's voice, making the model more robust for production deployment by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.
    NatCat: Weakly Supervised Text Classification with Naturally Annotated Resources. (arXiv:2009.14335v2 [cs.CL] UPDATED)
    (2 min) We describe NatCat, a large-scale resource for text classification constructed from three data sources: Wikipedia, Stack Exchange, and Reddit. NatCat consists of document-category pairs derived from manual curation that occurs naturally within online communities. To demonstrate its usefulness, we build general purpose text classifiers by training on NatCat and evaluate them on a suite of 11 text classification tasks (CatEval), reporting large improvements compared to prior work. We benchmark different modeling choices and resource combinations and show how tasks benefit from particular NatCat data sources.
    Adversarial Training with Contrastive Learning in NLP. (arXiv:2109.09075v1 [cs.CL])
    (2 min) For years, adversarial training has been extensively studied in natural language processing (NLP) settings. The main goal is to make models robust so that similar inputs yield semantically similar outcomes, which is not a trivial problem since there is no objective measure of semantic similarity in language. Previous works use an external pre-trained NLP model to tackle this challenge, introducing an extra training stage with huge memory consumption during training. However, the recently popular approach of contrastive learning in language processing hints at a convenient way of obtaining such similarity restrictions. The main advantage of the contrastive learning approach is that it aims for similar data points to be mapped close to each other and farther from different ones in the representation space. In this work, we propose adversarial training with contrastive learning (ATCL) to adversarially train a language processing task using the benefits of contrastive learning. The core idea is to make linear perturbations in the embedding space of the input via fast gradient methods (FGM) and train the model to keep the original and perturbed representations close via contrastive learning. In NLP experiments, we applied ATCL to language modeling and neural machine translation tasks. The results show not only an improvement in quantitative (perplexity and BLEU) scores when compared to the baselines, but also that ATCL achieves good qualitative results at the semantic level for both tasks without using a pre-trained model.
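    A minimal sketch of the recipe as described in the abstract is given below: FGM perturbs the input embeddings along the normalized gradient of the task loss, and an in-batch contrastive (InfoNCE-style) term keeps original and perturbed representations close. The model interface (embed, forward_from_embeddings), the loss weighting, and the hyperparameters are assumptions for illustration, not the authors' code.

```python
# Sketch of the ATCL idea: FGM perturbation of input embeddings plus a contrastive term
# pulling the original and perturbed sentence representations together. The model hooks
# below (embed, forward_from_embeddings) are assumed interfaces, not a real library API.
import torch
import torch.nn.functional as F

def fgm_perturbation(embeds, task_loss, epsilon=1.0):
    """FGM: epsilon * grad / ||grad|| computed on the input embeddings."""
    grad, = torch.autograd.grad(task_loss, embeds, retain_graph=True)
    return epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)

def contrastive_loss(z_orig, z_pert, temperature=0.1):
    """In-batch InfoNCE: each original representation should match its own perturbed version."""
    z1 = F.normalize(z_orig, dim=-1)
    z2 = F.normalize(z_pert, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def atcl_step(model, batch, alpha=1.0):
    embeds = model.embed(batch)                               # assumed hook: input embeddings
    logits, z_orig = model.forward_from_embeddings(embeds)    # assumed hook: logits + pooled repr
    task_loss = F.cross_entropy(logits, batch["labels"])
    delta = fgm_perturbation(embeds, task_loss)
    _, z_pert = model.forward_from_embeddings(embeds + delta.detach())
    return task_loss + alpha * contrastive_loss(z_orig, z_pert)
```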
    Knowledge-Enhanced Evidence Retrieval for Counterargument Generation. (arXiv:2109.09057v1 [cs.CL])
    (2 min) Finding counterevidence to statements is key to many tasks, including counterargument generation. We build a system that, given a statement, retrieves counterevidence from diverse sources on the Web. At the core of this system is a natural language inference (NLI) model that determines whether a candidate sentence is valid counterevidence or not. Most NLI models to date, however, lack proper reasoning abilities necessary to find counterevidence that involves complex inference. Thus, we present a knowledge-enhanced NLI model that aims to handle causality- and example-based inference by incorporating knowledge graphs. Our NLI model outperforms baselines for NLI tasks, especially for instances that require the targeted inference. In addition, this NLI model further improves the counterevidence retrieval system, notably finding complex counterevidence better.
    A Simple Post-Processing Technique for Improving Readability Assessment of Texts using Word Mover's Distance. (arXiv:2103.07277v2 [cs.CL] UPDATED)
    (2 min) Assessing the proper difficulty levels of reading materials or texts in general is the first step towards effective comprehension and learning. In this study, we improve the conventional methodology of automatic readability assessment by incorporating the Word Mover's Distance (WMD) of ranked texts as an additional post-processing technique to further ground the difficulty level given by a model. Results of our experiments on three multilingual datasets in Filipino, German, and English show that the post-processing technique outperforms previous vanilla and ranking-based models using SVM.
    What BERT Based Language Models Learn in Spoken Transcripts: An Empirical Study. (arXiv:2109.09105v1 [cs.CL])
    (2 min) Language Models (LMs) have been ubiquitously leveraged in various tasks including spoken language understanding (SLU). Spoken language requires careful understanding of speaker interactions, dialog states and speech-induced multimodal behaviors to generate a meaningful representation of the conversation. In this work, we propose to dissect SLU into three representative properties: conversational (disfluency, pause, overtalk), channel (speaker-type, turn-tasks) and ASR (insertion, deletion, substitution). We probe BERT based language models (BERT, RoBERTa) trained on spoken transcripts to investigate their ability to understand multifarious properties in the absence of any speech cues. Empirical results indicate that LMs are surprisingly good at capturing conversational properties such as pause prediction and overtalk detection from lexical tokens. On the downside, the LMs score low on turn-tasks and ASR error prediction. Additionally, pre-training the LM on spoken transcripts restrains its linguistic understanding. Finally, we establish the efficacy and transferability of the mentioned properties on two benchmark datasets: the Switchboard Dialog Act and Disfluency datasets.
    "You made me feel this way": Investigating Partners' Influence in Predicting Emotions in Couples' Conflict Interactions using Speech Data. (arXiv:2106.01526v2 [cs.CL] UPDATED)
    (0 min) How romantic partners interact with each other during a conflict influences how they feel at the end of the interaction and is predictive of whether the partners stay together in the long term. Hence, understanding the emotions of each partner is important. Yet the approaches currently used rely on self-reports, which are burdensome and hence limit the frequency of data collection. Automatic emotion prediction could address this challenge. Insights from psychology research indicate that partners' behaviors influence each other's emotions in conflict interaction, and hence the behavior of both partners could be considered to better predict each partner's emotion. However, it is yet to be investigated how doing so compares to only using each partner's own behavior in terms of emotion prediction performance. In this work, we used BERT to extract linguistic features (i.e., what partners said) and openSMILE to extract paralinguistic features (i.e., how they said it) from a data set of 368 German-speaking Swiss couples (N = 736 individuals) who were videotaped during an 8-minute conflict interaction in the laboratory. Based on those features, we trained machine learning models to predict if partners feel positive or negative after the conflict interaction. Our results show that including the behavior of the other partner improves the prediction performance. Furthermore, for men, considering how their female partners spoke is most important, and for women, considering what their male partner said is most important in getting better prediction performance. This work is a step towards automatically recognizing each partner's emotion based on the behavior of both, which would enable a better understanding of couples in research, therapy, and the real world.
    Programming Puzzles. (arXiv:2106.05784v2 [cs.LG] UPDATED)
    (0 min) We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input $x$ which makes $f$ output "True". The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f(x)$ is all that is needed to test a candidate solution $x$. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping. We develop baseline enumerative program synthesis and GPT-3 solvers that are capable of solving easy puzzles -- even without access to any reference solutions -- by learning from their own past solutions. Based on a small user study, we find puzzle difficulty to correlate between human programmers and the baseline AI solvers.
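    The puzzle format is easy to illustrate: a puzzle is just a Python function f, and a candidate x solves it if and only if f(x) returns True. The puzzle below is invented for illustration and is not taken from the P3 dataset.

```python
# A toy example of the puzzle format described above: the verifier f fully specifies the
# puzzle, and checking a candidate solution only requires evaluating f(x).
def f(x: str) -> bool:
    """Find a string of length 10 whose characters are all distinct digits."""
    return len(x) == 10 and x.isdigit() and len(set(x)) == 10

candidate = "0123456789"
print(f(candidate))  # True -> the candidate solves the puzzle
```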
    Tweet Sentiment Quantification: An Experimental Re-Evaluation. (arXiv:2011.08091v3 [cs.CL] UPDATED)
    (0 min) Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well-known that solving quantification by means of "classify and count" (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We now re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and much more robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. The results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and they provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
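    For context, the sketch below shows the "classify and count" baseline criticized above together with its standard adjusted variant (often called ACC), which corrects the raw count using the classifier's true and false positive rates. The binary case is shown and the numbers are illustrative; this is not the authors' code.

```python
# Sketch of the "classify and count" (CC) quantifier and its standard adjusted variant (ACC),
# shown for the binary case. Predictions and rates below are illustrative placeholders.
def classify_and_count(predictions):
    """Estimated prevalence of the positive class = fraction of items classified positive."""
    return sum(predictions) / len(predictions)

def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct CC using the classifier's true/false positive rates estimated on held-out data."""
    cc = classify_and_count(predictions)
    if tpr - fpr == 0:
        return cc
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))

preds = [1, 0, 1, 1, 0, 0, 0, 1]                              # classifier decisions on unlabelled tweets
print(classify_and_count(preds))                              # 0.5
print(adjusted_classify_and_count(preds, tpr=0.8, fpr=0.2))   # 0.5 after correction in this example
```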
    NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned. (arXiv:2101.00133v2 [cs.CL] UPDATED)
    (0 min) We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing retrieval corpora or the parameters of learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA.
    Training Dynamic based data filtering may not work for NLP datasets. (arXiv:2109.09191v1 [cs.CL])
    (0 min) The recent increase in dataset size has brought about significant advances in natural language understanding. These large datasets are usually collected through automation (search engines or web crawlers) or crowdsourcing which inherently introduces incorrectly labeled data. Training on these datasets leads to memorization and poor generalization. Thus, it is pertinent to develop techniques that help in the identification and isolation of mislabelled data. In this paper, we study the applicability of the Area Under the Margin (AUM) metric to identify and remove/rectify mislabelled examples in NLP datasets. We find that mislabelled samples can be filtered using the AUM metric in NLP datasets but it also removes a significant number of correctly labeled points and leads to the loss of a large amount of relevant language information. We show that models rely on the distributional information instead of relying on syntactic and semantic representations.
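    The AUM statistic referred to above can be sketched as follows: for each example, average over training epochs the margin between the logit of the assigned label and the largest other logit; consistently low or negative values flag likely mislabelled examples. This is an illustrative reimplementation with made-up logits, not the paper's code.

```python
# Sketch of the Area Under the Margin (AUM) statistic: per example, the margin between the
# assigned-label logit and the largest other logit, averaged over training epochs.
import numpy as np

def aum(logits_over_epochs, assigned_label):
    """logits_over_epochs: array of shape (num_epochs, num_classes) for one example."""
    margins = []
    for logits in logits_over_epochs:
        assigned = logits[assigned_label]
        others = np.delete(logits, assigned_label)
        margins.append(assigned - others.max())
    return float(np.mean(margins))

# A cleanly labeled example keeps a positive margin; a mislabelled one drifts negative.
clean = np.array([[2.0, 0.5, 0.1], [3.0, 0.4, 0.2]])
noisy = np.array([[0.3, 2.5, 0.1], [0.2, 3.1, 0.4]])
print(aum(clean, assigned_label=0))   # positive
print(aum(noisy, assigned_label=0))   # negative -> candidate for removal or relabelling
```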
    DRILL: Dynamic Representations for Imbalanced Lifelong Learning. (arXiv:2105.08445v2 [cs.CL] UPDATED)
    (0 min) Continual or lifelong learning has been a long-standing challenge in machine learning to date, especially in natural language processing (NLP). Although state-of-the-art language models such as BERT have ushered in a new era in this field due to their outstanding performance in multitask learning scenarios, they suffer from forgetting when being exposed to a continuous stream of data with shifting data distributions. In this paper, we introduce DRILL, a novel continual learning architecture for open-domain text classification. DRILL leverages a biologically inspired self-organizing neural architecture to selectively gate latent language representations from BERT in a task-incremental manner. We demonstrate in our experiments that DRILL outperforms current methods in a realistic scenario of imbalanced, non-stationary data without prior knowledge about task boundaries. To the best of our knowledge, DRILL is the first of its kind to use a self-organizing neural architecture for open-domain lifelong learning in NLP.
    When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting. (arXiv:2104.09691v3 [cs.CL] UPDATED)
    (0 min) Since the seminal work of Mikolov et al. (2013a) and Bojanowski et al. (2017), word representations of shallow log-bilinear language models have found their way into many NLP applications. Mikolov et al. (2018) introduced a positional log-bilinear language model, which has characteristics of an attention-based language model and which has reached state-of-the-art performance on the intrinsic word analogy task. However, the positional model has never been evaluated on qualitative criteria or extrinsic tasks and its speed is impractical. We outline the similarities between the attention mechanism and the positional model, and we propose a constrained positional model, which adapts the sparse attention mechanism of Dai et al. (2018). We evaluate the positional and constrained positional models on three novel qualitative criteria and on the extrinsic language modeling task of Botha and Blunsom (2014). We show that the positional and constrained positional models contain interpretable information about word order and outperform the subword model of Bojanowski et al. (2017) on language modeling. We also show that the constrained positional model outperforms the positional model on language modeling and is twice as fast.
    Enhancing Language Models with Plug-and-Play Large-Scale Commonsense. (arXiv:2109.02572v2 [cs.CL] UPDATED)
    (0 min) We study how to enhance language models (LMs) with textual commonsense knowledge. Previous work (e.g., KnowBERT) has focused on integrating entity knowledge from knowledge graphs. In order to introduce the external entity embeddings, they learn to jointly represent the original sentences and external knowledge by pre-training on a large-scale corpus. However, when switching to textual commonsense, unlike the light entity embeddings, the encoding of commonsense descriptions is heavy. Therefore, the pre-training for learning to jointly represent the target sentence and external commonsense descriptions is unaffordable. On the other hand, since pre-trained LMs for representing the target sentences alone are readily available, is it feasible to introduce commonsense knowledge in downstream tasks by fine-tuning them only? In this paper, we propose a plug-and-play method for large-scale commonsense integration without pre-training. Our method is inspired by the observation that in regular fine-tuning for downstream tasks, where no external knowledge is introduced, the variation in the parameters of the language model is minor. Our method starts from a pre-trained LM that represents the target sentences only (e.g., BERT). We argue that the pre-training for joint representation learning can be avoided if the joint representation has only a small impact on the parameters of the starting LM. Previous methods such as KnowBERT proposed complex modifications to the vanilla LM to introduce external knowledge. Our model, COOK-Transformer (COmmOnsense Knowledge-enhanced Transformer), on the other hand, hardly changes the vanilla LM except for adding a knowledge token in each Transformer layer. In a variety of experiments, COOK-Transformer-based BERT/RoBERTa improve their performance without any pre-training.
    SYSML: StYlometry with Structure and Multitask Learning: Implications for Darknet Forum Migrant Analysis. (arXiv:2104.00764v2 [cs.CL] UPDATED)
    (0 min) Darknet market forums are frequently used to exchange illegal goods and services between parties who use encryption to conceal their identities. The Tor network is used to host these markets, which guarantees additional anonymization from IP and location tracking, making it challenging to link malicious users who operate multiple accounts (sybils). Additionally, users migrate to new forums when one is closed, making it difficult to link users across multiple forums. We develop a novel stylometry-based multitask learning approach for natural language and interaction modeling using graph embeddings to construct low-dimensional representations of short episodes of user activity for authorship attribution. We provide a comprehensive evaluation of our methods across four different darknet forums demonstrating its efficacy over the state-of-the-art, with a lift of up to 2.5X on Mean Retrieval Rank and 2X on Recall@10.
    AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples. (arXiv:2104.08639v2 [cs.CL] UPDATED)
    (0 min) Capturing word meaning in context and distinguishing between correspondences and variations across languages is key to building successful multilingual and cross-lingual text representation models. However, existing multilingual evaluation datasets that evaluate lexical semantics "in-context" have various limitations. In particular, 1) their language coverage is restricted to high-resource languages and skewed in favor of only a few language families and areas, 2) their design makes the task solvable via superficial cues, which results in artificially inflated (and sometimes super-human) performances of pretrained encoders on many target languages and limits their usefulness for model probing and diagnostics, and 3) they offer little support for cross-lingual evaluation. In order to address these gaps, we present AM2iCo (Adversarial and Multilingual Meaning in Context), a wide-coverage cross-lingual and multilingual evaluation set; it aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts for 14 language pairs. We conduct a series of experiments in a wide range of setups and demonstrate the challenging nature of AM2iCo. The results reveal that current SotA pretrained encoders substantially lag behind human performance, and the largest gaps are observed for low-resource languages and languages dissimilar to English.
    SAS: Self-Augmented Strategy for Language Model Pre-training. (arXiv:2106.07176v2 [cs.CL] UPDATED)
    (0 min) The core of a self-supervised learning method for pre-training language models includes the design of appropriate data augmentation and corresponding pre-training task(s). Most data augmentations in language model pre-training are context-independent. The seminal contextualized augmentation recently proposed by the ELECTRA requires a separate generator, which leads to extra computation cost as well as the challenge in adjusting the capability of its generator relative to that of the other model component(s). We propose a self-augmented strategy (SAS) that uses a single forward pass through the model to augment the input data for model training in the next epoch. Essentially our strategy eliminates a separate generator network and uses only one network to generate the data augmentation and undertake two pre-training tasks (the MLM task and the RTD task) jointly, which naturally avoids the challenge in adjusting the generator's capability as well as reduces the computation cost. Additionally, our SAS is a general strategy such that it can seamlessly incorporate many new techniques emerging recently or in the future, such as the disentangled attention mechanism recently proposed by the DeBERTa model. Our experiments show that our SAS is able to outperform the ELECTRA and other state-of-the-art models in the GLUE tasks with the same or less computation cost.
    Bilateral Personalized Dialogue Generation with Dynamic Persona-Aware Fusion. (arXiv:2106.07857v2 [cs.CL] UPDATED)
    (0 min) Generating personalized responses is one of the major challenges in natural human-robot interaction. Current research in this field mainly focuses on generating responses consistent with the robot's pre-assigned persona, while ignoring the user's persona. Such responses may be inappropriate or even offensive, which may lead to a bad user experience. Therefore, we propose a bilateral personalized dialogue generation (BPDG) method with dynamic persona-aware fusion via multi-task transfer learning to generate responses consistent with both personas. The proposed method aims to accomplish three learning tasks: 1) an encoder is trained on dialogue utterances augmented with the corresponding personalized attributes and relative position (language model task), 2) a dynamic persona-aware fusion module predicts the persona presence to adaptively fuse the contextual and bilateral persona encodings (persona prediction task), and 3) a decoder generates natural, fluent and personalized responses (dialogue generation task). To make the generated responses more personalized and bilaterally persona-consistent, the Conditional Mutual Information Maximum (CMIM) criterion is adopted to select the final response from the generated candidates. The experimental results show that the proposed method outperforms several state-of-the-art methods in terms of both automatic and manual evaluations.
    Neural News Recommendation with Collaborative News Encoding and Structural User Encoding. (arXiv:2109.00750v2 [cs.CL] UPDATED)
    (0 min) Automatic news recommendation has gained much attention from the academic community and industry. Recent studies reveal that the key to this task lies within the effective representation learning of both news and users. Existing works typically encode news title and content separately while neglecting their semantic interaction, which is inadequate for news text comprehension. Besides, previous models encode user browsing history without leveraging the structural correlation of user browsed news to reflect user interests explicitly. In this work, we propose a news recommendation framework consisting of collaborative news encoding (CNE) and structural user encoding (SUE) to enhance news and user representation learning. CNE equipped with bidirectional LSTMs encodes news title and content collaboratively with cross-selection and cross-attention modules to learn semantic-interactive news representations. SUE utilizes graph convolutional networks to extract cluster-structural features of user history, followed by intra-cluster and inter-cluster attention modules to learn hierarchical user interest representations. Experiment results on the MIND dataset validate the effectiveness of our model to improve the performance of news recommendation. Our code is released at https://github.com/Veason-silverbullet/NNR.
    Text is NOT Enough: Integrating Visual Impressions into Open-domain Dialogue Generation. (arXiv:2109.05778v2 [cs.CL] UPDATED)
    (0 min) Open-domain dialogue generation in natural language processing (NLP) is by default a pure-language task, which aims to satisfy the human need for daily communication on open-ended topics by producing related and informative responses. In this paper, we point out that hidden images, named visual impressions (VIs), can be explored from text-only data to enhance dialogue understanding and help generate better responses. Besides, the semantic dependency between a dialogue post and its response is complicated, e.g., few word alignments and some topic transitions. Therefore, their visual impressions are not shared, and it is more reasonable to integrate the response visual impressions (RVIs) into the decoder, rather than the post visual impressions (PVIs). However, neither the response nor its RVIs are given directly at test time. To handle the above issues, we propose a framework to explicitly construct VIs based on pure-language dialogue datasets and utilize them for better dialogue understanding and generation. Specifically, we obtain a group of images (PVIs) for each post based on a pre-trained word-image mapping model. These PVIs are used in a co-attention encoder to get a post representation with both visual and textual information. Since the RVIs are not provided directly during testing, we design a cascade decoder that consists of two sub-decoders. The first sub-decoder predicts the content words in the response and applies the word-image mapping model to get those RVIs. Then, the second sub-decoder generates the response based on the post and RVIs. Experimental results on two open-domain dialogue datasets show that our proposed approach achieves superior performance over competitive baselines.
    Probing Across Time: What Does RoBERTa Know and When?. (arXiv:2104.07885v2 [cs.CL] UPDATED)
    (0 min) Models of language trained on very large corpora have been demonstrated to be useful for NLP. As fixed artifacts, they have become the object of intense study, with many researchers "probing" the extent to which they acquire and readily demonstrate linguistic abstractions, factual and commonsense knowledge, and reasoning abilities. Building on this line of work, we consider a new question: for the types of knowledge a language model learns, when during (pre)training are they acquired? We plot probing performance across iterations, using RoBERTa as a case study. Among our findings: linguistic knowledge is acquired fast, stably, and robustly across domains. Facts and commonsense are slower and more domain-sensitive. Reasoning abilities are, in general, not stably acquired. As new datasets, pretraining protocols, and probes emerge, we believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.
    Multivalent Entailment Graphs for Question Answering. (arXiv:2104.07846v2 [cs.CL] UPDATED)
    (0 min) Drawing inferences between open-domain natural language predicates is a necessity for true language understanding. There has been much progress in unsupervised learning of entailment graphs for this purpose. We make three contributions: (1) we reinterpret the Distributional Inclusion Hypothesis to model entailment between predicates of different valencies, like DEFEAT(Biden, Trump) entails WIN(Biden); (2) we actualize this theory by learning unsupervised Multivalent Entailment Graphs of open-domain predicates; and (3) we demonstrate the capabilities of these graphs on a novel question answering task. We show that directional entailment is more helpful for inference than bidirectional similarity on questions of fine-grained semantics. We also show that drawing on evidence across valencies answers more questions than by using only the same valency evidence.
    What if This Modified That? Syntactic Interventions via Counterfactual Embeddings. (arXiv:2105.14002v2 [cs.CL] UPDATED)
    (0 min) Neural language models exhibit impressive performance on a variety of tasks, but their internal reasoning may be difficult to understand. Prior art aims to uncover meaningful properties within model representations via probes, but it is unclear how faithfully such probes portray information that the models actually use. To overcome such limitations, we propose a technique, inspired by causal analysis, for generating counterfactual embeddings within models. In experiments testing our technique, we produce evidence that suggests some BERT-based models use a tree-distance-like representation of syntax in downstream prediction tasks.
    Semi-Supervised and Unsupervised Sense Annotation via Translations. (arXiv:2106.06462v2 [cs.CL] UPDATED)
    (0 min) Acquisition of multilingual training data continues to be a challenge in word sense disambiguation (WSD). To address this problem, unsupervised approaches have been proposed to automatically generate sense annotations for training supervised WSD systems. We present three new methods for creating sense-annotated corpora which leverage translations, parallel bitexts, lexical resources, as well as contextual and synset embeddings. Our semi-supervised method applies machine translation to transfer existing sense annotations to other languages. Our two unsupervised methods refine sense annotations produced by a knowledge-based WSD system via lexical translations in a parallel corpus. We obtain state-of-the-art results on standard WSD benchmarks.
    Explaining Answers with Entailment Trees. (arXiv:2104.08661v2 [cs.CL] UPDATED)
    (0 min) Our goal, in the context of open-domain textual question-answering (QA), is to explain answers by showing the line of reasoning from what is known to the answer, rather than simply showing a fragment of textual evidence (a "rationale"). If this could be done, new opportunities for understanding and debugging the system's reasoning become possible. Our approach is to generate explanations in the form of entailment trees, namely a tree of multipremise entailment steps from facts that are known, through intermediate conclusions, to the hypothesis of interest (namely the question + answer). To train a model with this skill, we created ENTAILMENTBANK, the first dataset to contain multistep entailment trees. Given a hypothesis (question + answer), we define three increasingly difficult explanation tasks: generate a valid entailment tree given (a) all relevant sentences (b) all relevant and some irrelevant sentences, or (c) a corpus. We show that a strong language model can partially solve these tasks, in particular when the relevant sentences are included in the input (e.g., 35% of trees for (a) are perfect), and with indications of generalization to other domains. This work is significant as it provides a new type of dataset (multistep entailments) and baselines, offering a new avenue for the community to generate richer, more systematic explanations.
    How to Select One Among All? An Extensive Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding. (arXiv:2109.05696v2 [cs.CL] UPDATED)
    (0 min) Knowledge Distillation (KD) is a model compression algorithm that helps transfer the knowledge of a large neural network into a smaller one. Even though KD has shown promise on a wide range of Natural Language Processing (NLP) applications, little is understood about how one KD algorithm compares to another and whether these approaches can be complementary to each other. In this work, we evaluate various KD algorithms on in-domain, out-of-domain and adversarial testing. We propose a framework to assess the adversarial robustness of multiple KD algorithms. Moreover, we introduce a new KD algorithm, Combined-KD, which takes advantage of two promising approaches (a better training scheme and more efficient data augmentation). Our extensive experimental results show that Combined-KD achieves state-of-the-art results on the GLUE benchmark, out-of-domain generalization, and adversarial robustness compared to competitive methods.
    Hierarchical Relation-Guided Type-Sentence Alignment for Long-Tail Relation Extraction with Distant Supervision. (arXiv:2109.09036v1 [cs.CL])
    (0 min) Distant supervision uses triple facts in knowledge graphs to label a corpus for relation extraction, leading to wrong labeling and long-tail problems. Some works use the hierarchy of relations for knowledge transfer to long-tail relations. However, a coarse-grained relation often implies only an attribute (e.g., domain or topic) of the distant fact, making it hard to discriminate relations based solely on sentence semantics. One solution is resorting to entity types, but open questions remain about how to fully leverage the information of entity types and how to align multi-granular entity types with sentences. In this work, we propose a novel model to enrich distantly-supervised sentences with entity types. It consists of (1) a pairwise type-enriched sentence encoding module injecting both context-free and context-related backgrounds to alleviate sentence-level wrong labeling, and (2) a hierarchical type-sentence alignment module enriching a sentence with the triple fact's basic attributes to support long-tail relations. Our model achieves new state-of-the-art results in overall and long-tail performance on benchmarks.
    Feature Engineering for US State Legislative Hearings: Stance, Affiliation, Engagement and Absentees. (arXiv:2109.08855v1 [cs.IR])
    (0 min) In US State government legislatures, most of the activity occurs in committees made up of lawmakers discussing bills. When analyzing, classifying or summarizing these committee proceedings, some important features become broadly interesting. In this paper, we engineer four useful features, two applying to lawmakers (engagement and absence), and two to non-lawmakers (stance and affiliation). We propose a system to automatically track the affiliation of organizations in public comments and whether the organizational representative supports or opposes the bill. The model tracking affiliation achieves an F1 of 0.872 while the support determination has an F1 of 0.979. Additionally, a metric to compute legislator engagement and absenteeism is also proposed and as proof-of-concept, a list of the most and least engaged legislators over one full California legislative session is presented.
    Emily: Developing An Emotion-affective Open-Domain Chatbot with Knowledge Graph-based Persona. (arXiv:2109.08875v1 [cs.CL])
    (0 min) In this paper, we describe approaches for developing Emily, an emotion-affective open-domain chatbot. Emily can perceive a user's negative emotion state and offer support by positively converting the user's emotion state. This is done by fine-tuning a pretrained dialogue model on data capturing dialogue contexts and desirable emotion state transitions across turns. Emily can differentiate a general open-domain dialogue utterance from questions relating to personal information. By leveraging a question-answering approach based on knowledge graphs to handle personal information, Emily maintains personality consistency. We evaluate Emily against a few state-of-the-art open-domain chatbots and show the effects of the proposed approaches in affecting emotions and addressing personality inconsistency.
    Efficient Variational Graph Autoencoders for Unsupervised Cross-domain Prerequisite Chains. (arXiv:2109.08722v1 [cs.LG])
    (0 min) Prerequisite chain learning helps people acquire new knowledge efficiently. While people may quickly determine learning paths over concepts in a domain, finding such paths in other domains can be challenging. We introduce Domain-Adversarial Variational Graph Autoencoders (DAVGAE) to solve this cross-domain prerequisite chain learning task efficiently. Our novel model consists of a variational graph autoencoder (VGAE) and a domain discriminator. The VGAE is trained to predict concept relations through link prediction, while the domain discriminator takes both source and target domain data as input and is trained to predict domain labels. Most importantly, this method only needs simple homogeneous graphs as input, compared with the current state-of-the-art model. We evaluate our model on the LectureBankCD dataset, and results show that our model outperforms recent graph-based benchmarks while using only 1/10 of the graph scale and 1/3 of the computation time.
    Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs. (arXiv:2109.08925v1 [cs.CL])
    (0 min) Complex Query Answering (CQA) is an important reasoning task on knowledge graphs. Current CQA learning models have been shown to be able to generalize from atomic operators to more complex formulas, which can be regarded as combinatorial generalizability. In this paper, we present EFO-1-QA, a new dataset to benchmark the combinatorial generalizability of CQA models by including 301 different query types, which is 20 times larger than existing datasets. Besides, our work, for the first time, provides a benchmark to evaluate and analyze the impact of different operators and normal forms by using (a) 7 choices of the operator systems and (b) 9 forms of complex queries. Specifically, we provide a detailed study of the combinatorial generalizability of two commonly used operators, i.e., projection and intersection, and justify the impact of the forms of queries given the canonical choice of operators. Our code and data can provide an effective pipeline to benchmark CQA models.
    On-device neural speech synthesis. (arXiv:2109.08710v1 [eess.AS])
    (0 min) Recent advances in text-to-speech (TTS) synthesis, such as Tacotron and WaveRNN, have made it possible to construct a fully neural network based TTS system by coupling the two components together. Such a system is conceptually simple as it only takes grapheme or phoneme input, uses Mel-spectrograms as an intermediate feature, and directly generates speech samples. The system achieves quality equal or close to that of natural speech. However, the high computational cost of the system and issues with robustness have limited its usage in real-world speech synthesis applications and products. In this paper, we present key modeling improvements and optimization strategies that enable deploying these models, not only on GPU servers, but also on mobile devices. The proposed system can generate high-quality 24 kHz speech at 5x faster than real time on a server and 3x faster than real time on mobile devices.
    BERT-Beta: A Proactive Probabilistic Approach to Text Moderation. (arXiv:2109.08805v1 [cs.CL])
    (0 min) Text moderation for user generated content, which helps to promote healthy interaction among users, has been widely studied and many machine learning models have been proposed. In this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. Specifically, we propose a new concept, text toxicity propensity, to characterize the extent to which a text tends to attract toxic comments. Beta regression is then introduced to do the probabilistic modeling, which is demonstrated to function well in comprehensive experiments. We also propose an explanation method to communicate the model decision clearly. Both propensity scoring and interpretation benefit text moderation in a novel manner. Finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.
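    A minimal sketch of beta regression for a score constrained to (0, 1) is shown below: the mean is a sigmoid of a linear predictor and the loss is the Beta negative log-likelihood. The features and targets are random placeholders (the paper works from text representations), so this is an assumption-laden illustration rather than the proposed model.

```python
# Sketch of beta regression for a propensity in (0, 1): mean via sigmoid of a linear
# predictor, a shared learned precision, and the Beta negative log-likelihood as the loss.
# Inputs below are placeholders, not the paper's BERT-based features.
import torch
from torch.distributions import Beta

class BetaRegression(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = torch.nn.Linear(dim, 1)
        self.log_precision = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x):
        mu = torch.sigmoid(self.linear(x)).squeeze(-1)        # mean in (0, 1)
        phi = torch.exp(self.log_precision) + 1e-4            # precision > 0
        return Beta(mu * phi, (1 - mu) * phi)

model = BetaRegression(dim=16)
x = torch.randn(8, 16)                       # stand-in for text encodings
y = torch.rand(8).clamp(1e-3, 1 - 1e-3)      # observed toxicity propensities in (0, 1)
loss = -model(x).log_prob(y).mean()
loss.backward()
```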
    No Keyword is an Island: In search of covert associations. (arXiv:2103.17114v2 [cs.CL] UPDATED)
    (2 min) This paper describes how corpus-assisted discourse analysis based on keyword (KW) identification and interpretation can benefit from employing Market basket analysis (MBA) after KW extraction. MBA is a data mining technique used originally in marketing that can reveal consistent associations between items in a shopping cart, but also between keywords in a corpus of many texts. By identifying recurring associations between KWs we can compensate for the lack of wider context, which is a major issue impeding the interpretation of isolated KWs (especially when analyzing large data). To showcase the advantages of MBA in "re-contextualizing" keywords within the discourse, a pilot study on the topic of migration was conducted, contrasting anti-system and center-right Czech internet media. The results show that MBA is useful in identifying the dominant strategy of anti-system news portals: to weave in a confounding ideological undercurrent and connect the concept of migrants to a multitude of other topics (i.e., flooding the discourse).
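    The core MBA statistics (support, confidence, and lift of keyword associations) can be sketched in a few lines over per-text keyword sets, i.e., the association step applied after keyword extraction. The keywords and values below are invented for illustration and are not drawn from the study.

```python
# Sketch of market-basket statistics over per-text keyword sets: support of an itemset,
# confidence and lift of a rule antecedent -> consequent. Keywords are illustrative.
def support(texts, itemset):
    return sum(itemset <= kws for kws in texts) / len(texts)

def confidence(texts, antecedent, consequent):
    return support(texts, antecedent | consequent) / support(texts, antecedent)

def lift(texts, antecedent, consequent):
    return confidence(texts, antecedent, consequent) / support(texts, consequent)

texts = [  # each element: the set of keywords extracted from one article
    {"migrants", "borders", "EU"},
    {"migrants", "crime"},
    {"migrants", "crime", "EU"},
    {"elections", "EU"},
]
print(support(texts, {"migrants"}))                  # 0.75
print(confidence(texts, {"migrants"}, {"crime"}))    # ~0.67
print(lift(texts, {"migrants"}, {"crime"}))          # ~1.33 -> co-occur more than chance
```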
    Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition. (arXiv:2109.09161v1 [cs.CL])
    (2 min) Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework. A Representation Aggregation Module is designed to aggregate acoustic and linguistic representation, and an Embedding Attention Module is introduced to incorporate acoustic information into BERT, which can effectively facilitate the cooperation of two pre-trained models and thus boost the representation learning. Extensive experiments show that our Wav-BERT significantly outperforms the existing approaches and achieves state-of-the-art performance on low-resource speech recognition.
    Exemplar-Controllable Paraphrasing and Translation using Bitext. (arXiv:2010.05856v2 [cs.CL] UPDATED)
    (2 min) Most prior work on exemplar-based syntactically controlled paraphrase generation relies on automatically-constructed large-scale paraphrase datasets, which are costly to create. We sidestep this prerequisite by adapting models from prior work to be able to learn solely from bilingual text (bitext). Despite only using bitext for training, and in near zero-shot conditions, our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions. To evaluate these tasks quantitatively, we create three novel evaluation datasets. Our experimental results show that our models achieve competitive results on controlled paraphrase generation and strong performance on controlled machine translation. Analysis shows that our models learn to disentangle semantics and syntax in their latent representations, but still suffer from semantic drift.
    Can images help recognize entities? A study of the role of images for Multimodal NER. (arXiv:2010.12712v2 [cs.CL] UPDATED)
    (0 min) Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context. While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the model's ability to leverage multimodal interactions remains poorly understood. In this work, we conduct in-depth analyses of existing multimodal fusion techniques from different perspectives and describe the scenarios where adding information from the image does not always boost performance. We also study the use of captions as a way to enrich the context for MNER. Experiments on three datasets from popular social platforms expose the bottleneck of existing multimodal models and the situations where using captions is beneficial.
    Augmenting semantic lexicons using word embeddings and transfer learning. (arXiv:2109.09010v1 [cs.CL])
    (0 min) Sentiment-aware intelligent systems are essential to a wide array of applications including marketing, political campaigns, recommender systems, behavioral economics, social psychology, and national security. These sentiment-aware intelligent systems are driven by language models which broadly fall into two paradigms: 1. Lexicon-based and 2. Contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researchers to readily determine which words and phrases contribute most to a change in measured sentiment. A challenge for any lexicon-based approach is that the lexicon needs to be routinely expanded with new words and expressions. Crowdsourcing annotations for semantic dictionaries may be an expensive and time-consuming task. Here, we propose two models for predicting sentiment scores to augment semantic lexicons at a relatively low cost using word embeddings and transfer learning. Our first model establishes a baseline employing a simple and shallow neural network initialized with pre-trained word embeddings using a non-contextual approach. Our second model improves upon our baseline, featuring a deep Transformer-based network that brings to bear word definitions to estimate their lexical polarity. Our evaluation shows that both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost.
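    The first, non-contextual model can be sketched as a shallow feed-forward regressor over frozen pre-trained word vectors that predicts a word's sentiment score. The embedding matrix and annotated scores below are random placeholders rather than the paper's data or architecture details.

```python
# Sketch of a shallow, non-contextual lexicon scorer: an MLP over frozen pre-trained word
# embeddings regressing a sentiment/happiness score. All inputs are placeholders.
import torch

class LexiconScorer(torch.nn.Module):
    def __init__(self, embedding_matrix):
        super().__init__()
        self.embed = torch.nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(embedding_matrix.size(1), 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

    def forward(self, word_ids):
        return self.mlp(self.embed(word_ids)).squeeze(-1)

pretrained = torch.randn(1000, 300)            # stand-in for word2vec/GloVe vectors
model = LexiconScorer(pretrained)
word_ids = torch.tensor([3, 17, 256])          # known words with annotated scores
scores = torch.tensor([6.2, 3.1, 7.8])         # e.g., 1-9 happiness ratings
loss = torch.nn.functional.mse_loss(model(word_ids), scores)
loss.backward()
```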
    SpeechNAS: Towards Better Trade-off between Latency and Accuracy for Large-Scale Speaker Verification. (arXiv:2109.08839v1 [cs.SD])
    (2 min) Recently, the x-vector has been a successful and popular approach for speaker verification, which employs a time delay neural network (TDNN) and statistics pooling to extract speaker-characterizing embeddings from variable-length utterances. Improvement upon the x-vector has been an active research area, and numerous neural networks have been elaborately designed based on the x-vector, e.g., extended TDNN (E-TDNN), factorized TDNN (F-TDNN), and densely connected TDNN (D-TDNN). In this work, we try to identify the optimal architectures from a TDNN based search space employing neural architecture search (NAS), named SpeechNAS. Leveraging recent advances in speaker recognition, such as high-order statistics pooling, a multi-branch mechanism, D-TDNN and angular additive margin softmax (AAM) loss with a minimum hyper-spherical energy (MHE), SpeechNAS automatically discovers five network architectures, from SpeechNAS-1 to SpeechNAS-5, with various numbers of parameters and GFLOPs, on the large-scale text-independent speaker recognition dataset VoxCeleb1. Our best derived neural network achieves an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1, which surpasses previous TDNN based state-of-the-art approaches by a large margin. Code and trained weights are available at https://github.com/wentaozhu/speechnas.git
    Weakly Supervised Explainable Phrasal Reasoning with Neural Fuzzy Logic. (arXiv:2109.08927v1 [cs.CL])
    (2 min) Natural language inference (NLI) aims to determine the logical relationship between two sentences among the target labels Entailment, Contradiction, and Neutral. In recent years, deep learning models have become a prevailing approach to NLI, but they lack interpretability and explainability. In this work, we address the explainability for NLI by weakly supervised logical reasoning, and propose an Explainable Phrasal Reasoning (EPR) approach. Our model first detects phrases as the semantic unit and aligns corresponding phrases. Then, the model predicts the NLI label for the aligned phrases, and induces the sentence label by fuzzy logic formulas. Our EPR is almost everywhere differentiable and thus the system can be trained end-to-end in a weakly supervised manner. We annotated a corpus and developed a set of metrics to evaluate phrasal reasoning. Results show that our EPR yields much more meaningful explanations in terms of F scores than previous studies. To the best of our knowledge, we are the first to develop a weakly supervised phrasal reasoning model for the NLI task.
    DuRecDial 2.0: A Bilingual Parallel Corpus for Conversational Recommendation. (arXiv:2109.08877v1 [cs.CL])
    (2 min) In this paper, we provide a bilingual parallel human-to-human recommendation dialog dataset (DuRecDial 2.0) to enable researchers to explore the challenging task of multilingual and cross-lingual conversational recommendation. The difference between DuRecDial 2.0 and existing conversational recommendation datasets is that the data items (Profile, Goal, Knowledge, Context, Response) in DuRecDial 2.0 are annotated in two languages, both English and Chinese, while other datasets are built with the setting of a single language. We collect 8.2k dialogs aligned across English and Chinese (16.5k dialogs and 255k utterances in total) that are annotated by crowdsourced workers with a strict quality control procedure. We then build monolingual, multilingual, and cross-lingual conversational recommendation baselines on DuRecDial 2.0. Experiment results show that the use of additional English data can bring performance improvements for Chinese conversational recommendation, indicating the benefits of DuRecDial 2.0. Finally, this dataset provides a challenging testbed for future studies of monolingual, multilingual, and cross-lingual conversational recommendation.
    TVRecap: A Dataset for Generating Stories with Character Descriptions. (arXiv:2109.08833v1 [cs.CL])
    (2 min) We introduce TVRecap, a story generation dataset that requires generating detailed TV show episode recaps from a brief summary and a set of documents describing the characters involved. Unlike other story generation datasets, TVRecap contains stories that are authored by professional screenwriters and that feature complex interactions among multiple characters. Generating stories in TVRecap requires drawing relevant information from the lengthy provided documents about characters based on the brief summary. In addition, by swapping the input and output, TVRecap can serve as a challenging testbed for abstractive summarization. We create TVRecap from fan-contributed websites, which allows us to collect 26k episode recaps with 1868.7 tokens on average. Empirically, we take a hierarchical story generation approach and find that the neural model that uses oracle content selectors for character descriptions demonstrates the best performance on automatic metrics, showing the potential of our dataset to inspire future research on story generation with constraints. Qualitative analysis shows that the best-performing model sometimes generates content that is unfaithful to the short summaries, suggesting promising directions for future work.
    When a Computer Cracks a Joke: Automated Generation of Humorous Headlines. (arXiv:2109.08702v1 [cs.CL])
    (2 min) Automated news generation has become a major interest for news agencies in recent years. Oftentimes, headlines for such automatically generated news articles are unimaginative, as they have been generated with ready-made templates. We present a computationally creative approach for headline generation that can generate humorous versions of existing headlines. We evaluate our system with human judges and compare the results to human-authored humorous titles. The headlines produced by the system are considered funny 36% of the time by human evaluators.
    DyLex: Incorporating Dynamic Lexicons into BERT for Sequence Labeling. (arXiv:2109.08818v1 [cs.CL])
    (2 min) Incorporating lexical knowledge into deep learning models has proven to be very effective for sequence labeling tasks. However, previous works commonly have difficulty dealing with large-scale dynamic lexicons, which often cause excessive matching noise and problems with frequent updates. In this paper, we propose DyLex, a plug-in lexicon incorporation approach for BERT-based sequence labeling tasks. Instead of leveraging embeddings of words in the lexicon as in conventional methods, we adopt word-agnostic tag embeddings to avoid re-training the representation while updating the lexicon. Moreover, we employ an effective supervised lexical knowledge denoising method to smooth out matching noise. Finally, we introduce a column-wise attention based knowledge fusion mechanism to guarantee the pluggability of the proposed framework. Experiments on ten datasets across three tasks show that the proposed framework achieves new SOTA results, even with very large-scale lexicons.
    Structured Pattern Pruning Using Regularization. (arXiv:2109.08814v1 [cs.CL])
    (2 min) Iterative Magnitude Pruning (IMP) is a network pruning method that repeats the process of removing weights with the least magnitudes and retraining the model. When visualizing the weight matrices of language models pruned by IMP, previous research has shown that a structured pattern emerges, wherein the resulting surviving weights tend to prominently cluster in a select few rows and columns of the matrix. Though the need for further research in utilizing these structured patterns for potential performance gains has previously been indicated, it has yet to be thoroughly studied. We propose SPUR (Structured Pattern pruning Using Regularization), a novel pruning mechanism that preemptively induces structured patterns in compression by adding a regularization term to the objective function in the IMP. Our results show that SPUR can significantly preserve model performance under high sparsity settings regardless of the language or the task. Our contributions are as follows: (i) We propose SPUR, a network pruning mechanism that improves upon IMP regardless of the language or the task. (ii) We are the first to empirically verify the efficacy of "structured patterns" observed previously in pruning research. (iii) SPUR is a resource-efficient mechanism in that it does not require significant additional computations.
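    A minimal sketch of what adding a structure-inducing regularizer to the IMP objective could look like; the group-norm form and its coefficient are assumptions, not the exact SPUR term:

    ```python
    import torch

    def row_col_group_regularizer(weight: torch.Tensor) -> torch.Tensor:
        """Group-lasso style penalty that encourages surviving weights to concentrate
        in a few rows and columns of a 2D weight matrix (the 'structured pattern')."""
        row_norms = weight.norm(p=2, dim=1)   # one norm per output row
        col_norms = weight.norm(p=2, dim=0)   # one norm per input column
        return row_norms.sum() + col_norms.sum()

    def training_loss(task_loss: torch.Tensor, model: torch.nn.Module, lam: float = 1e-4):
        """Objective used between IMP prune/retrain rounds: task loss plus the
        structure-inducing penalty summed over all 2D weight matrices."""
        reg = sum(row_col_group_regularizer(p) for n, p in model.named_parameters()
                  if p.dim() == 2 and "weight" in n)
        return task_loss + lam * reg

    # Usage sketch: loss = training_loss(cross_entropy(logits, labels), model)
    ```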
    MM-Deacon: Multimodal molecular domain embedding analysis via contrastive learning. (arXiv:2109.08830v1 [cs.LG])
    (2 min) Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have been popular as an alternative to traditional expert-designed features to encode molecules. However, these approaches only utilize a single modality for representing molecules. Driven by the fact that a given molecule can be described through different modalities such as Simplified Molecular Line Entry System (SMILES), The International Union of Pure and Applied Chemistry (IUPAC), and The IUPAC International Chemical Identifier (InChI), we propose a multimodal molecular embedding generation approach called MM-Deacon (multimodal molecular domain embedding analysis via contrastive learning). MM-Deacon is trained using SMILES and IUPAC molecule representations as two different modalities. First, SMILES and IUPAC strings are encoded by using two different transformer-based language models independently, then the contrastive loss is utilized to bring these encoded representations from different modalities closer to each other if they belong to the same molecule, and to push embeddings farther from each other if they belong to different molecules. We evaluate the robustness of our molecule embeddings on molecule clustering, cross-modal molecule search, drug similarity assessment and drug-drug interaction tasks.
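    A minimal sketch of the cross-modal contrastive objective described above, using a symmetric InfoNCE loss between SMILES and IUPAC embeddings; the temperature and normalization details are assumptions rather than MM-Deacon's exact formulation:

    ```python
    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(smiles_emb, iupac_emb, temperature: float = 0.07):
        """Symmetric InfoNCE over a batch: row i of each modality describes molecule i.
        Matching (SMILES_i, IUPAC_i) pairs are pulled together; all other pairs in the
        batch act as negatives."""
        s = F.normalize(smiles_emb, dim=-1)
        i = F.normalize(iupac_emb, dim=-1)
        logits = s @ i.t() / temperature                     # (batch, batch) similarity matrix
        targets = torch.arange(s.size(0), device=s.device)   # diagonal entries are positives
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Example with a random 8-molecule batch of 256-d embeddings from each encoder.
    loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    ```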
    A Machine Learning Pipeline to Examine Political Bias with Congressional Speeches. (arXiv:2109.09014v1 [cs.CY])
    (0 min) Computational methods to model political bias in social media involve several challenges due to the heterogeneity, high dimensionality, multiple modalities, and scale of the data. Political bias in social media has been studied from multiple viewpoints, such as media bias, political ideology, echo chambers, and controversies, using machine learning pipelines. Most current methods rely heavily on manually labeled ground-truth data for the underlying political bias prediction tasks. Limitations of such methods include human-intensive labeling, labels tied to a specific problem, and the inability to determine the near-future bias state of a social media conversation. In this work, we address these problems and present machine learning approaches to study political bias in two ideologically diverse social media forums, Gab and Twitter, without relying on human-annotated data. Our proposed methods exploit transcripts collected from political speeches in the US Congress to label the data and achieve accuracies of 70.5% and 65.1% on Twitter and Gab data, respectively, for predicting political bias. We also present a machine learning approach that combines features from cascades and text to forecast a cascade's political bias with an accuracy of about 85%.
    Complex Temporal Question Answering on Knowledge Graphs. (arXiv:2109.08935v1 [cs.IR])
    (0 min) Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class of practical importance, but have not received much attention in research. This work presents EXAQT, the first end-to-end system for answering complex temporal questions that have multiple entities and predicates, and associated temporal conditions. EXAQT answers natural language questions over KGs in two stages, one geared towards high recall, the other towards precision at top ranks. The first step computes question-relevant compact subgraphs within the KG, and judiciously enhances them with pertinent temporal facts, using Group Steiner Trees and fine-tuned BERT models. The second step constructs relational graph convolutional networks (R-GCNs) from the first step's output, and enhances the R-GCNs with time-aware entity embeddings and attention over temporal relations. We evaluate EXAQT on TimeQuestions, a large dataset of 16k temporal questions we compiled from a variety of general purpose KG-QA benchmarks. Results show that EXAQT outperforms three state-of-the-art systems for answering complex questions over KGs, thereby justifying specialized treatment of temporal QA.
    Back-translation for Large-Scale Multilingual Machine Translation. (arXiv:2109.08712v1 [cs.CL])
    (2 min) This paper describes our approach to the shared task on large-scale multilingual machine translation at the Sixth Conference on Machine Translation (WMT-21). This work aims to build a single multilingual translation system under the hypothesis that a universal cross-language representation leads to better multilingual translation performance. We extend the exploration of different back-translation methods from bilingual translation to multilingual translation. Better performance is obtained by the constrained sampling method, which differs from the findings for bilingual translation. Besides, we also explore the effect of vocabularies and the amount of synthetic data. Surprisingly, smaller vocabularies perform better, and the extensive monolingual English data offers only a modest improvement. We submitted to both small tasks and achieved second place.
    Towards Zero and Few-shot Knowledge-seeking Turn Detection in Task-orientated Dialogue Systems. (arXiv:2109.08820v1 [cs.CL])
    (2 min) Most prior work on task-oriented dialogue systems is restricted to supporting domain APIs. However, users may have requests that are out of the scope of these APIs. This work focuses on identifying such user requests. Existing methods for this task mainly rely on fine-tuning pre-trained models on large annotated data. We propose a novel method, REDE, based on adaptive representation learning and density estimation. REDE can be applied to zero-shot cases, and quickly learns a high-performing detector with only a few shots by updating less than 3K parameters. We demonstrate REDE's competitive performance on DSTC9 data and our newly collected test set.
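    One generic way to realize "representation learning plus density estimation" for out-of-scope turn detection (a sketch, not the REDE method itself) is to fit a Gaussian to in-scope utterance embeddings and flag turns with a large Mahalanobis distance:

    ```python
    import numpy as np

    class DensityTurnDetector:
        """Fit a single Gaussian to embeddings of in-scope (API-covered) utterances;
        score new utterances by squared Mahalanobis distance. A high distance marks
        a likely knowledge-seeking / out-of-scope turn."""

        def fit(self, in_scope_embeddings: np.ndarray):
            x = np.asarray(in_scope_embeddings, dtype=float)
            self.mean = x.mean(axis=0)
            cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])  # regularized covariance
            self.prec = np.linalg.inv(cov)
            return self

        def score(self, embeddings: np.ndarray) -> np.ndarray:
            d = np.asarray(embeddings, dtype=float) - self.mean
            return np.einsum("ij,jk,ik->i", d, self.prec, d)  # squared Mahalanobis distance

    # Usage: detector = DensityTurnDetector().fit(train_emb)
    #        is_knowledge_seeking = detector.score(test_emb) > threshold
    ```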
    Relating Neural Text Degeneration to Exposure Bias. (arXiv:2109.08705v1 [cs.CL])
    (2 min) This work focuses on relating two mysteries in neural-based text generation: exposure bias and text degeneration. Despite the long time since exposure bias was first discussed and the numerous studies on its remedy, to our knowledge its impact on text generation has not yet been verified. Text degeneration is a problem that the widely used pre-trained language model GPT-2 was recently found to suffer from (Holtzman et al., 2020). Motivated by the unknown cause of text degeneration, in this paper we attempt to relate these two mysteries. Specifically, we first qualitatively and quantitatively identify mistakes made before text degeneration occurs. Then we investigate the significance of the mistakes by inspecting the hidden states in GPT-2. Our results show that text degeneration is likely to be partly caused by exposure bias. We also study the self-reinforcing mechanism of text degeneration, explaining why the mistakes amplify. In sum, our study provides a more concrete foundation for further investigation of the exposure bias and text degeneration problems.
    Semi-Supervised Few-Shot Intent Classification and Slot Filling. (arXiv:2109.08754v1 [cs.CL])
    (0 min) Intent classification (IC) and slot filling (SF) are two fundamental tasks in modern Natural Language Understanding (NLU) systems. Collecting and annotating large amounts of data to train deep learning models for such systems is not scalable. This problem can be addressed by learning from few examples using fast supervised meta-learning techniques such as prototypical networks. In this work, we systematically investigate how contrastive learning and unsupervised data augmentation methods can benefit these existing supervised meta-learning pipelines for jointly modelled IC/SF tasks. Through extensive experiments across standard IC/SF benchmarks (SNIPS and ATIS), we show that our proposed semi-supervised approaches outperform standard supervised meta-learning methods: contrastive losses in conjunction with prototypical networks consistently outperform the existing state-of-the-art for both IC and SF tasks, while data augmentation strategies primarily improve few-shot IC by a significant margin.
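    For reference, a minimal prototypical-network episode, the supervised meta-learning baseline the paper builds on; the contrastive and augmentation losses it proposes would be added on top of this objective:

    ```python
    import torch
    import torch.nn.functional as F

    def prototypical_loss(support_emb, support_labels, query_emb, query_labels):
        """One few-shot episode: class prototypes are mean support embeddings,
        and queries are classified by negative squared distance to each prototype."""
        classes = support_labels.unique()
        protos = torch.stack([support_emb[support_labels == c].mean(0) for c in classes])
        dists = torch.cdist(query_emb, protos) ** 2          # (num_query, num_classes)
        # Map query labels onto the episode's class indices.
        targets = torch.stack([(classes == y).nonzero().squeeze() for y in query_labels])
        return F.cross_entropy(-dists, targets)

    # Example: a 2-way 3-shot episode with 4 queries and 32-d utterance embeddings.
    loss = prototypical_loss(torch.randn(6, 32), torch.tensor([0, 0, 0, 1, 1, 1]),
                             torch.randn(4, 32), torch.tensor([0, 1, 1, 0]))
    ```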
    Towards Joint Intent Detection and Slot Filling via Higher-order Attention. (arXiv:2109.08890v1 [cs.CL])
    (0 min) Intent detection (ID) and slot filling (SF) are two major tasks in spoken language understanding (SLU). Recently, attention mechanisms have been shown to be effective in jointly optimizing these two tasks in an interactive manner. However, the latest attention-based works have concentrated only on first-order attention designs, leaving higher-order attention mechanisms unexplored. In this paper, we propose a BiLinear attention block, which leverages bilinear pooling to simultaneously exploit both the contextual and channel-wise bilinear attention distributions to capture the second-order interactions between the input intent or slot features. Higher- and even infinite-order interactions are built by stacking multiple blocks and applying an Exponential Linear Unit (ELU) to each block. Before the decoding stage, we introduce a Dynamic Feature Fusion Layer to implicitly fuse intent and slot information in a more effective way. Technically, instead of simply concatenating intent and slot features, we first compute two correlation matrices to weight the two features. Furthermore, we present a Higher-order Attention Network for the SLU tasks. Experiments on two benchmark datasets show that our approach yields improvements over the state-of-the-art approach. We also provide a discussion demonstrating the effectiveness of the proposed approach.
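    A minimal sketch of a low-rank bilinear attention between a pooled intent vector and token-level slot features; the projection rank and gating are assumptions, and the paper's block additionally includes channel-wise attention, ELU-gated stacking, and the dynamic fusion layer:

    ```python
    import torch
    import torch.nn as nn

    class BilinearAttention(nn.Module):
        """Both inputs are projected, multiplied elementwise (the second-order
        interaction), and turned into attention weights over time steps."""

        def __init__(self, dim: int, rank: int = 64):
            super().__init__()
            self.proj_intent = nn.Linear(dim, rank)
            self.proj_slot = nn.Linear(dim, rank)
            self.score = nn.Linear(rank, 1)

        def forward(self, intent_feat, slot_feats):
            # intent_feat: (batch, dim); slot_feats: (batch, seq_len, dim)
            joint = torch.tanh(self.proj_intent(intent_feat)).unsqueeze(1) \
                  * torch.tanh(self.proj_slot(slot_feats))               # (batch, seq_len, rank)
            attn = torch.softmax(self.score(joint).squeeze(-1), dim=-1)  # (batch, seq_len)
            return torch.bmm(attn.unsqueeze(1), slot_feats).squeeze(1)   # attended slot context

    # Example: fuse a pooled intent vector with token-level slot features.
    ctx = BilinearAttention(dim=128)(torch.randn(2, 128), torch.randn(2, 10, 128))
    ```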
    ReaSCAN: Compositional Reasoning in Language Grounding. (arXiv:2109.08994v1 [cs.CL])
    (0 min) The ability to compositionally map language to referents, relations, and actions is an essential component of language understanding. The recent gSCAN dataset (Ruis et al. 2020, NeurIPS) is an inspiring attempt to assess the capacity of models to learn this kind of grounding in scenarios involving navigational instructions. However, we show that gSCAN's highly constrained design means that it does not require compositional interpretation and that many details of its instructions and scenarios are not required for task success. To address these limitations, we propose ReaSCAN, a benchmark dataset that builds off gSCAN but requires compositional language interpretation and reasoning about entities and relations. We assess two models on ReaSCAN: a multi-modal baseline and a state-of-the-art graph convolutional neural model. These experiments show that ReaSCAN is substantially harder than gSCAN for both neural architectures. This suggests that ReaSCAN can serve as a valuable benchmark for advancing our understanding of models' compositional generalization and reasoning capabilities.
    The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task. (arXiv:2109.08724v1 [cs.CL])
    (2 min) This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip translation, and pseudo post-editing of the MT output. We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline. Our system is also the top-ranking system on the MT MCC metric for the English-German language pair.
    Dependency distance minimization predicts compression. (arXiv:2109.08900v1 [cs.CL])
    (0 min) Dependency distance minimization (DDm) is a well-established principle of word order. It has been predicted theoretically that DDm implies compression, namely the minimization of word lengths. This is a second order prediction because it links a principle with another principle, rather than a principle and a manifestation as in a first order prediction. Here we test that second order prediction with a parallel collection of treebanks controlling for annotation style with Universal Dependencies and Surface-Syntactic Universal Dependencies. To test it, we use a recently introduced score that has many mathematical and statistical advantages with respect to the widely used sum of dependency distances. We find that the prediction is confirmed by the new score when word lengths are measured in phonemes, independently of the annotation style, but not when word lengths are measured in syllables. In contrast, one of the most widely used scores, i.e. the sum of dependency distances, fails to confirm that prediction, showing the weakness of raw dependency distances for research on word order. Finally, our findings expand the theory of natural communication by linking two distinct levels of organization, namely syntax (word order) and word internal structure.
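    For reference, the raw score the paper argues against, the sum of dependency distances, can be computed directly from head indices:

    ```python
    def sum_dependency_distances(heads):
        """Sum of |position(head) - position(dependent)| over all dependency arcs.
        `heads` lists, for each word (1-indexed positions), the position of its head,
        with 0 marking the root (which contributes no arc)."""
        return sum(abs(head - dep) for dep, head in enumerate(heads, start=1) if head != 0)

    # "She gave him a book": heads = [2, 0, 2, 5, 2]  ->  distances 1 + 1 + 1 + 3 = 6
    print(sum_dependency_distances([2, 0, 2, 5, 2]))
    ```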
    Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes. (arXiv:2109.08828v1 [cs.CL])
    (2 min) Empathy is a complex cognitive ability based on reasoning about others' affective states. In order to better understand others and express stronger empathy in dialogues, we argue that two issues must be tackled at the same time: (i) identifying which word is the cause of the other's emotion in his or her utterance and (ii) reflecting those specific words in the response generation. However, previous approaches for recognizing emotion cause words in text require sub-utterance level annotations, which can be demanding. Taking inspiration from social cognition, we leverage a generative estimator to infer emotion cause words from utterances with no word-level labels. We also introduce a novel method based on pragmatics to make dialogue models focus on targeted words in the input during generation. Our method is applicable to any dialogue model on the fly, with no additional training. We show our approach improves multiple best-performing dialogue agents at generating more focused empathetic responses, in terms of both automatic and human evaluation.
    Solar cell patent classification method based on keyword extraction and deep neural network. (arXiv:2109.08796v1 [cs.IR])
    (2 min) With the growing impact of ESG on businesses, research related to renewable energy is receiving great attention. Solar cells are one such area, so solar cell patent analysis has considerable research value. Accurately analyzing and classifying patent documents can reveal important technical relationships, describe business trends in the technology, and inspire new industrial solutions that inform investment decisions. Therefore, patent documents must be analyzed carefully to exploit their value. To address the solar cell patent classification problem, we propose a keyword extraction method and a deep neural network-based classification method. First, solar cell patents are preprocessed. The KeyBERT algorithm is then used to extract keywords and key phrases from the patent abstracts to construct a lexical dictionary. We then build a solar cell patent classification model based on a deep neural network and use it to classify the patents, achieving a training accuracy above 95% and a validation accuracy of about 87.5%. These results show that the deep neural network method can classify complex and difficult solar cell patents with good accuracy.
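    The KeyBERT step described above can be sketched with the library's public API; the backbone default, n-gram range, and example abstract are assumptions for illustration:

    ```python
    # pip install keybert
    from keybert import KeyBERT

    # Hypothetical patent abstract used only for illustration.
    abstract = ("A photovoltaic cell comprising a perovskite absorber layer and a "
                "transparent conductive oxide electrode with improved conversion efficiency.")

    kw_model = KeyBERT()  # uses a default sentence-transformers backbone
    keywords = kw_model.extract_keywords(
        abstract,
        keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
        stop_words="english",
        top_n=5,
    )
    print(keywords)  # list of (phrase, similarity score) pairs for the lexical dictionary
    ```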
    Text Detoxification using Large Pre-trained Neural Models. (arXiv:2109.08914v1 [cs.CL])
    (0 min) We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.
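    A simplified sketch of the second method's core idea, masking a flagged word and letting BERT propose an in-context substitute via a fill-mask pipeline; the single-token replacement and the example word are simplifying assumptions (the paper allows variable-length replacements and filters candidates for toxicity):

    ```python
    # pip install transformers
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    def replace_word(sentence: str, flagged_word: str) -> str:
        """Mask a flagged word and let BERT propose an in-context substitute.
        Only the single highest-probability candidate is taken here."""
        masked = sentence.replace(flagged_word, fill_mask.tokenizer.mask_token, 1)
        best = fill_mask(masked)[0]  # highest-probability candidate
        return best["sequence"]

    # Hypothetical example with a mildly negative word standing in for a toxic one.
    print(replace_word("that was a terrible idea", "terrible"))
    ```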
  • cs.CV updates on arXiv.org

    Analyzing and Mitigating JPEG Compression Defects in Deep Learning. (arXiv:2011.08932v2 [cs.CV] UPDATED)
    (0 min) With the proliferation of deep learning methods, many computer vision problems which were considered academic are now viable in the consumer setting. One drawback of consumer applications is lossy compression, which is necessary from an engineering standpoint to efficiently and cheaply store and transmit user images. Despite this, there has been little study of the effect of compression on deep neural networks and benchmark datasets are often losslessly compressed or compressed at high quality. Here we present a unified study of the effects of JPEG compression on a range of common tasks and datasets. We show that there is a significant penalty on common performance metrics for high compression. We test several methods for mitigating this penalty, including a novel method based on artifact correction which requires no labels to train.
    3D-CNN for Facial Micro- and Macro-expression Spotting on Long Video Sequences using Temporal Oriented Reference Frame. (arXiv:2105.06340v3 [cs.CV] UPDATED)
    (0 min) Facial expression spotting is the preliminary step for micro- and macro-expression analysis. The task of reliably spotting such expressions in video sequences is currently unsolved. The current best systems depend upon optical flow methods to extract regional motion features, before categorisation of that motion into a specific class of facial movement. Optical flow is susceptible to drift error, which introduces a serious problem for motions with long-term dependencies, such as high frame-rate macro-expression. We propose a purely deep learning solution which, rather than tracking frame differential motion, compares via a convolutional model, each frame with two temporally local reference frames. Reference frames are sampled according to calculated micro- and macro-expression durations. We show that our solution achieves state-of-the-art performance (F1-score of 0.105) in a dataset of high frame-rate (200 fps) long video sequences (SAMM-LV) and is competitive in a low frame-rate (30 fps) dataset (CAS(ME)2). In this paper, we document our deep learning model and parameters, including how we use local contrast normalisation, which we show is critical for optimal results. We surpass a limitation in existing methods, and advance the state of deep learning in the domain of facial expression spotting.
    Learning to Group: A Bottom-Up Framework for 3D Part Discovery in Unseen Categories. (arXiv:2002.06478v4 [cs.CV] UPDATED)
    (0 min) We address the problem of discovering 3D parts for objects in unseen categories. Learning the geometric prior of parts and transferring this prior to unseen categories poses fundamental challenges for data-driven shape segmentation approaches. Formulated as a contextual bandit problem, we propose a learning-based agglomerative clustering framework which learns a grouping policy to progressively group small part proposals into bigger ones in a bottom-up fashion. At the core of our approach is restricting the local context for extracting part-level features, which encourages generalizability to unseen categories. On the large-scale fine-grained 3D part dataset PartNet, we demonstrate that our method can transfer knowledge of parts learned from 3 training categories to 21 unseen testing categories without seeing any annotated samples. Quantitative comparisons against four shape segmentation baselines show that our approach achieves state-of-the-art performance.
    DocFormer: End-to-End Transformer for Document Understanding. (arXiv:2106.11539v2 [cs.CV] UPDATED)
    (0 min) We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).
    Meta-DETR: Image-Level Few-Shot Object Detection with Inter-Class Correlation Exploitation. (arXiv:2103.11731v3 [cs.CV] UPDATED)
    (0 min) Few-shot object detection has been extensively investigated by incorporating meta-learning into region-based detection frameworks. Despite its success, the said paradigm is constrained by several factors, such as (i) low-quality region proposals for novel classes and (ii) negligence of the inter-class correlation among different classes. Such limitations hinder the generalization of base-class knowledge for the detection of novel-class objects. In this work, we design Meta-DETR, a novel few-shot detection framework that incorporates correlational aggregation for meta-learning into DETR detection frameworks. Meta-DETR works entirely at image level without any region proposals, which circumvents the constraint of inaccurate proposals in prevalent few-shot detection frameworks. Besides, Meta-DETR can simultaneously attend to multiple support classes within a single feed-forward. This unique design allows capturing the inter-class correlation among different classes, which significantly reduces the misclassification of similar classes and enhances knowledge generalization to novel classes. Experiments over multiple few-shot object detection benchmarks show that the proposed Meta-DETR outperforms state-of-the-art methods by large margins. The implementation codes will be released at https://github.com/ZhangGongjie/Meta-DETR.
    DeepPoint: A Deep Learning Model for 3D Reconstruction in Point Clouds via mmWave Radar. (arXiv:2109.09188v1 [eess.IV])
    (0 min) Recent research has shown that mmWave radar sensing is effective for object detection in low-visibility environments, which makes it an ideal technique in autonomous navigation systems such as autonomous vehicles. However, due to the characteristics of radar signals such as sparsity, low resolution, specularity, and high noise, it is still quite challenging to reconstruct 3D object shapes via mmWave radar sensing. Built on our recently proposed 3DRIMR (3D Reconstruction and Imaging via mmWave Radar), we introduce in this paper DeepPoint, a deep learning model that generates 3D objects in point cloud format and significantly outperforms the original 3DRIMR design. The model adopts a conditional Generative Adversarial Network (GAN) based deep neural network architecture. It takes as input the 2D depth images of an object generated by 3DRIMR's Stage 1, and outputs smooth and dense 3D point clouds of the object. The model consists of a novel generator network that utilizes a sequence of DeepPoint blocks or layers to extract essential features of the union of multiple rough and sparse input point clouds of an object when observed from various viewpoints, given that those input point clouds may contain many incorrect points due to the imperfect generation process of 3DRIMR's Stage 1. The design of DeepPoint adopts a deep structure to capture the global features of input point clouds, and it relies on an optimally chosen number of DeepPoint blocks and skip connections to achieve performance improvement over the original 3DRIMR design. Our experiments demonstrate that this model significantly outperforms the original 3DRIMR and other standard techniques in reconstructing 3D objects.
    MUSE: Textual Attributes Guided Portrait Painting Generation. (arXiv:2011.04761v2 [cs.CV] UPDATED)
    (0 min) We propose a novel approach, MUSE, to illustrate textual attributes visually via portrait generation. MUSE takes as input a set of attributes written in text, in addition to facial features extracted from a photo of the subject. We propose 11 attribute types to represent inspirations from a subject's profile, emotion, story, and environment. We propose a novel stacked neural network architecture that extends an image-to-image generative model to accept textual attributes. Experiments show that our approach significantly outperforms several state-of-the-art methods that do not use textual attributes, with the Inception Score increased by 6% and the Fréchet Inception Distance (FID) decreased by 11%. We also propose a new attribute reconstruction metric to evaluate whether the generated portraits preserve the subject's attributes. Experiments show that our approach can accurately illustrate 78% of textual attributes, which also helps MUSE capture the subject in a more creative and expressive way.
    MRDet: A Multi-Head Network for Accurate Oriented Object Detection in Aerial Images. (arXiv:2012.13135v2 [cs.CV] UPDATED)
    (0 min) Objects in aerial images usually have arbitrary orientations and are densely located over the ground, making them extremely challenging to detect. Many recently developed methods attempt to solve these issues by estimating an extra orientation parameter and placing dense anchors, which results in high model complexity and computational cost. In this paper, we propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors. The AO-RPN is very efficient, with only a small increase in parameters over the original RPN. Furthermore, to obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network to accomplish them. Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately. For convenience, we name it MRDet, short for Multi-head Rotated object Detector. We test the proposed MRDet on two challenging benchmarks, i.e., DOTA and HRSC2016, and compare it with several state-of-the-art methods. Our method achieves very promising results that clearly demonstrate its effectiveness.
    TransVOS: Video Object Segmentation with Transformers. (arXiv:2106.00588v2 [cs.CV] UPDATED)
    (0 min) Recently, Space-Time Memory Network (STM) based methods have achieved state-of-the-art performance in semi-supervised video object segmentation (VOS). A crucial problem in this task is how to model the dependency both among different frames and inside every frame. However, most of these methods neglect the spatial relationships (inside each frame) and do not make full use of the temporal relationships (among different frames). In this paper, we propose a new transformer-based framework, termed TransVOS, introducing a vision transformer to fully exploit and model both the temporal and spatial relationships. Moreover, most STM-based approaches employ two separate encoders to extract features of two significant inputs, i.e., reference sets (history frames with predicted masks) and query frame (current frame), respectively, increasing the models' parameters and complexity. To slim the popular two-encoder pipeline while keeping the effectiveness, we design a single two-path feature extractor to encode the above two inputs in a unified way. Extensive experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets.
    UMPNet: Universal Manipulation Policy Network for Articulated Objects. (arXiv:2109.05668v2 [cs.CV] UPDATED)
    (0 min) We introduce the Universal Manipulation Policy Network (UMPNet) -- a single image-based policy network that infers closed-loop action sequences for manipulating arbitrary articulated objects. To infer a wide range of action trajectories, the policy supports 6DoF action representation and varying trajectory length. To handle a diverse set of objects, the policy learns from objects with different articulation structures and generalizes to unseen objects or categories. The policy is trained with self-guided exploration without any human demonstrations, scripted policy, or pre-defined goal conditions. To support effective multi-step interaction, we introduce a novel Arrow-of-Time action attribute that indicates whether an action will change the object state back to the past or forward into the future. With the Arrow-of-Time inference at each interaction step, the learned policy is able to select actions that consistently lead towards or away from a given state, thereby, enabling both effective state exploration and goal-conditioned manipulation. Video is available at https://youtu.be/KqlvcL9RqKM
    AirLoop: Lifelong Loop Closure Detection. (arXiv:2109.08975v1 [cs.RO])
    (0 min) Loop closure detection is an important building block that ensures the accuracy and robustness of simultaneous localization and mapping (SLAM) systems. Due to their generalization ability, CNN-based approaches have received increasing attention. Although they normally benefit from training on datasets that are diverse and reflective of the environments, new environments often emerge after the model is deployed. It is therefore desirable to incorporate the data newly collected during operation for incremental learning. Nevertheless, simply finetuning the model on new data is infeasible since it may cause the model's performance on previously learned data to degrade over time, which is also known as the problem of catastrophic forgetting. In this paper, we present AirLoop, a method that leverages techniques from lifelong learning to minimize forgetting when training loop closure detection models incrementally. We experimentally demonstrate the effectiveness of AirLoop on TartanAir, Nordland, and RobotCar datasets. To the best of our knowledge, AirLoop is one of the first works to achieve lifelong learning of deep loop closure detectors.
    Carrying out CNN Channel Pruning in a White Box. (arXiv:2104.11883v3 [cs.CV] UPDATED)
    (0 min) Channel pruning has long been studied to compress CNNs, as it significantly reduces the overall computation. Prior works implement channel pruning in an unexplainable manner, tending to reduce the final classification error while failing to consider the internal influence of each channel. In this paper, we conduct channel pruning in a white box. Through deep visualization of feature maps activated by different channels, we observe that different channels contribute differently to different categories in image classification. Inspired by this, we choose to preserve channels that contribute to most categories. Specifically, to model the contribution of each channel to differentiating categories, we develop a class-wise mask for each channel, implemented in a dynamic training manner w.r.t. the input image's category. On the basis of the learned class-wise masks, we perform a global voting mechanism to remove channels with less category discrimination. Lastly, a fine-tuning process is conducted to recover the performance of the pruned model. To the best of our knowledge, this is the first time that CNN interpretability theory has been used to guide channel pruning. Extensive experiments on representative image classification tasks demonstrate the superiority of our White-Box over many state-of-the-art methods. For instance, on CIFAR-10, it reduces FLOPs by 65.23% with even a 0.62% accuracy improvement for ResNet-110. On ILSVRC-2012, White-Box achieves a 45.6% FLOPs reduction with only a small loss of 0.83% in top-1 accuracy for ResNet-50.
    ComicGAN: Text-to-Comic Generative Adversarial Network. (arXiv:2109.09120v1 [cs.CV])
    (0 min) Drawing and annotating comic illustrations is a complex and difficult process. No existing machine learning algorithms have been developed to create comic illustrations based on descriptions of illustrations or the dialogue in comics. Moreover, it is not known whether a generative adversarial network (GAN) can generate original comics that correspond to the dialogue and/or descriptions. GANs are successful in producing photo-realistic images, but this technology does not necessarily translate to the generation of flawless comics. What is more, comic evaluation is a prominent challenge, as common metrics such as the Inception Score will not perform comparably, since they are designed to work on photos. In this paper: 1. We implement ComicGAN, a novel text-to-comic pipeline based on a text-to-image GAN that synthesizes comics according to text descriptions. 2. We describe an in-depth empirical study of the technical difficulties of comic generation using GANs. ComicGAN has two novel features: (i) text description creation from labels via permutation and augmentation, and (ii) custom image encoding with Convolutional Neural Networks. We extensively evaluate the proposed ComicGAN in two scenarios, namely image generation from descriptions and image generation from dialogue. Our results on 1000 Dilbert comic panels and 6000 descriptions show that synthetic comic panels from text inputs resemble original Dilbert panels. Novel methods for text description creation and custom image encoding brought improvements in Fréchet Inception Distance, detail, and overall image quality over baseline algorithms. Generating illustrations from descriptions produced clear comics, including characters and colours specified in the descriptions.
    Manifold-preserved GANs. (arXiv:2109.08955v1 [cs.CV])
    (0 min) Generative Adversarial Networks (GANs) have been widely adopted in various fields. However, existing GANs are generally unable to preserve the manifold of the data space, mainly due to the simple representation the discriminator uses for real/generated data. To address this open challenge, this paper proposes Manifold-preserved GANs (MaF-GANs), which generalize Wasserstein GANs to a high-dimensional form. Specifically, to improve the representation of data, the discriminator in MaF-GANs is designed to map data into a high-dimensional manifold. Furthermore, to stabilize the training of MaF-GANs, an operation called Topological Consistency is proposed, which provides a precise and universal solution for any K-Lipschitz continuity. The effectiveness of the proposed method is justified by both theoretical analysis and empirical results. When adopting DCGAN as the backbone on CelebA (256*256), the proposed method achieved 12.43 FID, which outperforms state-of-the-art models such as Realness GAN (23.51 FID) by a large margin. Code will be made publicly available.
    SNE-RoadSeg+: Rethinking Depth-Normal Translation and Deep Supervision for Freespace Detection. (arXiv:2107.14599v2 [cs.CV] UPDATED)
    (0 min) Freespace detection is a fundamental component of autonomous driving perception. Recently, deep convolutional neural networks (DCNNs) have achieved impressive performance for this task. In particular, SNE-RoadSeg, our previously proposed method based on a surface normal estimator (SNE) and a data-fusion DCNN (RoadSeg), has achieved impressive performance in freespace detection. However, SNE-RoadSeg is computationally intensive, and it is difficult to execute in real time. To address this problem, we introduce SNE-RoadSeg+, an upgraded version of SNE-RoadSeg. SNE-RoadSeg+ consists of 1) SNE+, a module for more accurate surface normal estimation, and 2) RoadSeg+, a data-fusion DCNN that can greatly minimize the trade-off between accuracy and efficiency with the use of deep supervision. Extensive experimental results have demonstrated the effectiveness of our SNE+ for surface normal estimation and the superior performance of our SNE-RoadSeg+ over all other freespace detection approaches. Specifically, our SNE-RoadSeg+ runs in real time, and meanwhile, achieves the state-of-the-art performance on the KITTI road benchmark. Our project page is at https://www.sne-roadseg.site/sne-roadseg-plus.
    Robust Automated Framework for COVID-19 Disease Identification from a Multicenter Dataset of Chest CT Scans. (arXiv:2109.09241v1 [eess.IV])
    (3 min) The objective of this study is to develop a robust deep learning-based framework to distinguish COVID-19, Community-Acquired Pneumonia (CAP), and Normal cases based on chest CT scans acquired in different imaging centers using various protocols, and radiation doses. We showed that while our proposed model is trained on a relatively small dataset acquired from only one imaging center using a specific scanning protocol, the model performs well on heterogeneous test sets obtained by multiple scanners using different technical parameters. We also showed that the model can be updated via an unsupervised approach to cope with the data shift between the train and test sets and enhance the robustness of the model upon receiving a new external dataset from a different center. We adopted an ensemble architecture to aggregate the predictions from multiple versions of the model. For initial training and development purposes, an in-house dataset of 171 COVID-19, 60 CAP, and 76 Normal cases was used, which contained volumetric CT scans acquired from one imaging center using a constant standard radiation dose scanning protocol. To evaluate the model, we collected four different test sets retrospectively to investigate the effects of the shifts in the data characteristics on the model's performance. Among the test cases, there were CT scans with similar characteristics as the train set as well as noisy low-dose and ultra-low dose CT scans. In addition, some test CT scans were obtained from patients with a history of cardiovascular diseases or surgeries. The entire test dataset used in this study contained 51 COVID-19, 28 CAP, and 51 Normal cases. Experimental results indicate that our proposed framework performs well on all test sets achieving total accuracy of 96.15% (95%CI: [91.25-98.74]), COVID-19 sensitivity of 96.08% (95%CI: [86.54-99.5]), CAP sensitivity of 92.86% (95%CI: [76.50-99.19]).
    Contrastive Counterfactual Visual Explanations With Overdetermination. (arXiv:2106.14556v2 [cs.CV] UPDATED)
    (0 min) A novel explainable AI method called CLEAR Image is introduced in this paper. CLEAR Image is based on the view that a satisfactory explanation should be contrastive, counterfactual and measurable. CLEAR Image explains an image's classification probability by contrasting the image with a corresponding image generated automatically via adversarial learning. This enables both salient segmentation and perturbations that faithfully determine each segment's importance. CLEAR Image was successfully applied to a medical imaging case study where it outperformed methods such as Grad-CAM and LIME by an average of 27% using a novel pointing game metric. CLEAR Image excels in identifying cases of "causal overdetermination" where there are multiple patches in an image, any one of which is sufficient by itself to cause the classification probability to be close to one.
    Hierarchical Phenotyping and Graph Modeling of Spatial Architecture in Lymphoid Neoplasms. (arXiv:2106.16174v2 [q-bio.QM] UPDATED)
    (0 min) The cells and their spatial patterns in the tumor microenvironment (TME) play a key role in tumor evolution, and yet the latter remains an understudied topic in computational pathology. This study, to the best of our knowledge, is among the first to hybridize local and global graph methods to profile orchestration and interaction of cellular components. To address the challenge in hematolymphoid cancers, where the cell classes in TME may be unclear, we first implemented cell-level unsupervised learning and identified two new cell subtypes. Local cell graphs or supercells were built for each image by considering the individual cell's geospatial location and classes. Then, we applied supercell level clustering and identified two new cell communities. In the end, we built global graphs to abstract spatial interaction patterns and extract features for disease diagnosis. We evaluate the proposed algorithm on H&E slides of 60 hematolymphoid neoplasms and further compared it with three cell level graph-based algorithms, including the global cell graph, cluster cell graph, and FLocK. The proposed algorithm achieved a mean diagnosis accuracy of 0.703 with the repeated 5-fold cross-validation scheme. In conclusion, our algorithm shows superior performance over the existing methods and can be potentially applied to other cancer types.
    A Pose-only Solution to Visual Reconstruction and Navigation. (arXiv:2103.01530v3 [cs.CV] UPDATED)
    (0 min) Visual navigation and three-dimensional (3D) scene reconstruction are essential for robots to interact with the surrounding environment. Large-scale scenes and critical camera motions are great challenges facing the research community in achieving this goal. We propose a pose-only imaging geometry framework and algorithms that can help solve these challenges. The representation is a linear function of camera global translations, which allows for efficient and robust camera motion estimation. As a result, the spatial feature coordinates can be analytically reconstructed and do not require nonlinear optimization. Experiments demonstrate that the computational efficiency of recovering the scene and associated camera poses is improved by 2-4 orders of magnitude. This solution might be promising for unlocking real-time 3D visual computing in many forefront applications.
    Action-Conditioned 3D Human Motion Synthesis with Transformer VAE. (arXiv:2104.05670v2 [cs.CV] UPDATED)
    (0 min) We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences. In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence. Here we learn an action-aware latent representation for human motions by training a generative variational autoencoder (VAE). By sampling from this latent space and querying a certain duration through a series of positional encodings, we synthesize variable-length motion sequences conditioned on a categorical action. Specifically, we design a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets. We evaluate our approach on the NTU RGB+D, HumanAct12 and UESTC datasets and show improvements over the state of the art. Furthermore, we present two use cases: improving action recognition through adding our synthesized data to training, and motion denoising. Code and models are available on our project page.
    Understanding Cognitive Fatigue from fMRI Scans with Self-supervised Learning. (arXiv:2106.15009v3 [cs.CV] UPDATED)
    (0 min) Functional magnetic resonance imaging (fMRI) is a neuroimaging technique that records neural activations in the brain by capturing the blood oxygen level in different regions based on the task performed by a subject. Given fMRI data, the problem of predicting the state of cognitive fatigue in a person has not been investigated to its full extent. This paper proposes tackling this issue as a multi-class classification problem by dividing the state of cognitive fatigue into six different levels, ranging from no-fatigue to extreme fatigue conditions. We built a spatio-temporal model that uses convolutional neural networks (CNN) for spatial feature extraction and a long short-term memory (LSTM) network for temporal modeling of 4D fMRI scans. We also applied a self-supervised method called MoCo (Momentum Contrast) to pre-train our model on a public dataset BOLD5000 and fine-tuned it on our labeled dataset to predict cognitive fatigue. Our novel dataset contains fMRI scans from Traumatic Brain Injury (TBI) patients and healthy controls (HCs) while performing a series of N-back cognitive tasks. This method establishes a state-of-the-art technique to analyze cognitive fatigue from fMRI data and beats previous approaches to solve this problem.
    Identifying Autism Spectrum Disorder Based on Individual-Aware Down-Sampling and Multi-Modal Learning. (arXiv:2109.09129v1 [eess.IV])
    (0 min) Autism Spectrum Disorder (ASD) is a set of neurodevelopmental conditions that affect patients' social abilities. In recent years, deep learning methods have been employed to detect ASD through functional MRI (fMRI). However, existing approaches concentrate solely on abnormal brain functional connections and ignore the importance of regional activities. Due to this biased prior knowledge, previous diagnosis models suffer from inter-site heterogeneity and inter-individual phenotypic differences. To address this issue, we propose a novel feature extraction method for fMRI that can learn a personalized low-resolution representation of the entire brain network, covering both functional connections and regional activities. First, we abstract the brain imaging as a graph structure, where nodes represent brain areas and edges denote functional connections, and downsample it to a sparse network by hierarchical graph pooling. Subsequently, by assigning each subject the extracted features and building edges through inter-individual non-imaging characteristics, we build a population graph. The non-identically distributed node features are further recalibrated to node embeddings learned by graph convolutional networks. By these means, our framework can extract features directly and efficiently from the entire fMRI and be aware of implicit inter-individual differences. We have evaluated our framework on the ABIDE-I dataset with 10-fold cross-validation. The present model achieves a mean classification accuracy of 85.95% and a mean AUC of 0.92, which is better than state-of-the-art methods.
    Violence Detection in Videos. (arXiv:2109.08941v1 [cs.CV])
    (0 min) In recent years, there has been a tremendous increase in the amount of video content uploaded to social networking and video sharing websites like Facebook and YouTube. As a result, the risk of children being exposed to adult and violent content on the web has also increased. To address this issue, an approach to automatically detect violent content in videos is proposed in this work. A novel attempt is also made to detect the category of violence present in a video. A system that can automatically detect violence in both Hollywood movies and videos from the web is extremely useful not only for parental control but also for applications related to movie ratings, video surveillance, genre classification, and so on. Here, both audio and visual features are used to detect violence. MFCC features are used as audio cues; Blood, Motion, and SentiBank features are used as visual cues. Binary SVM classifiers are trained on each of these features to detect violence. Late fusion using a weighted sum of classification scores is performed to obtain the final classification scores for each of the violence classes targeted by the system. To determine the optimal weights for each violence class, an approach based on grid search is employed. Publicly available datasets, mainly Violent Scene Detection (VSD), are used for classifier training, weight calculation, and testing. The performance of the system is evaluated on two classification tasks: multi-class classification and binary classification. The results obtained for binary classification are better than the baseline results from MediaEval-2014.
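    A minimal sketch of the late-fusion step described above: a weighted sum of per-feature classifier scores, with the weights chosen by grid search on validation data (the score threshold and grid step are assumptions):

    ```python
    import itertools
    import numpy as np

    def late_fusion(scores_per_feature, weights):
        """Weighted sum of per-feature classifier scores (e.g., MFCC, Blood, Motion,
        and SentiBank SVM outputs), one score vector per feature stream."""
        return sum(w * np.asarray(s, dtype=float) for w, s in zip(weights, scores_per_feature))

    def grid_search_weights(scores_per_feature, labels, step: float = 0.25):
        """Pick the weight combination (summing to 1) that maximizes accuracy when the
        fused score is thresholded at 0; this would be repeated per violence class."""
        grid = np.arange(0.0, 1.0 + 1e-9, step)
        best_w, best_acc = None, -1.0
        for w in itertools.product(grid, repeat=len(scores_per_feature)):
            if abs(sum(w) - 1.0) > 1e-9:
                continue  # only consider convex combinations of the streams
            pred = late_fusion(scores_per_feature, w) > 0
            acc = float(np.mean(pred == np.asarray(labels, dtype=bool)))
            if acc > best_acc:
                best_w, best_acc = w, acc
        return best_w, best_acc

    # Toy example: two feature streams scored on four validation clips.
    streams = [[0.8, -0.2, 0.5, -0.7], [0.1, -0.6, 0.9, -0.3]]
    print(grid_search_weights(streams, labels=[1, 0, 1, 0]))
    ```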
    YinYang-Net: Complementing Face and Body Information for Wild Gender Recognition. (arXiv:2107.06847v2 [cs.CV] UPDATED)
    (0 min) Soft biometrics inference in surveillance scenarios is a topic of interest for various applications, particularly in security-related areas. However, soft biometric analysis is not extensively reported in wild conditions. In particular, previous works on gender recognition report their results in face datasets, with relatively good image quality and frontal poses. Given the uncertainty of the availability of the facial region in wild conditions, we consider that these methods are not adequate for surveillance settings. To overcome these limitations, we: 1) present frontal and wild face versions of three well-known surveillance datasets; and 2) propose YinYang-Net (YY-Net), a model that effectively and dynamically complements facial and body information, which makes it suitable for gender recognition in wild conditions. The frontal and wild face datasets derive from widely used Pedestrian Attribute Recognition (PAR) sets (PETA, PA-100K, and RAP), using a pose-based approach to filter the frontal samples and facial regions. This approach retrieves the facial region of images with varying image/subject conditions, where the state-of-the-art face detectors often fail. YY-Net combines facial and body information through a learnable fusion matrix and a channel-attention sub-network, focusing on the most influential body parts according to the specific image/subject features. We compare it with five PAR methods, consistently obtaining state-of-the-art results on gender recognition, and reducing the prediction errors by up to 24% in frontal samples. The announced PAR datasets versions and YY-Net serve as the basis for wild soft biometrics classification and are available in https://github.com/Tiago-Roxo.
    Efficient Pairwise Neuroimage Analysis using the Soft Jaccard Index and 3D Keypoint Sets. (arXiv:2103.06966v3 [eess.IV] UPDATED)
    (0 min) We propose a novel pairwise distance measure between image keypoint sets, for the purpose of large-scale medical image indexing. Our measure generalizes the Jaccard index to account for soft set equivalence (SSE) between keypoint elements, via an adaptive kernel framework modeling uncertainty in keypoint appearance and geometry. A new kernel is proposed to quantify the variability of keypoint geometry in location and scale. Our distance measure may be estimated between $O(N^2)$ image pairs in $O(N~\log~N)$ operations via keypoint indexing. Experiments report the first results for the task of predicting family relationships from medical images, using 1010 T1-weighted MRI brain volumes of 434 families including monozygotic and dizygotic twins, siblings and half-siblings sharing 100%-25% of their polymorphic genes. Soft set equivalence and the keypoint geometry kernel improve upon standard hard set equivalence (HSE) and appearance kernels alone in predicting family relationships. Monozygotic twin identification is near 100%, and three subjects with uncertain genotyping are automatically paired with their self-reported families, the first reported practical application of image-based family identification. Our distance measure can also be used to predict group categories; for example, sex is predicted with an AUC of 0.97. Software is provided for efficient fine-grained curation of large, generic image datasets.
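    A toy version of a "soft" Jaccard index, where hard keypoint equality is replaced by a Gaussian kernel on descriptor distance; the kernel choice is an assumption, and the paper's measure additionally models keypoint location and scale:

    ```python
    import numpy as np

    def soft_jaccard(set_a, set_b, sigma: float = 1.0):
        """Toy soft Jaccard between two keypoint sets (rows = keypoint descriptors).
        Each keypoint's soft membership in the other set is its best kernel match."""
        a, b = np.asarray(set_a, float), np.asarray(set_b, float)
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)            # pairwise squared distances
        sim = np.exp(-d2 / (2 * sigma ** 2))                           # soft-equivalence kernel
        inter = 0.5 * (sim.max(axis=1).sum() + sim.max(axis=0).sum())  # soft intersection size
        return inter / (len(a) + len(b) - inter)                       # Jaccard: |A n B| / |A u B|

    # Two 2-D "descriptors" per set: one near-duplicate pair, one unmatched keypoint.
    print(soft_jaccard([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.1], [5.0, 5.0]]))
    ```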
    The shape and simplicity biases of adversarially robust ImageNet-trained CNNs. (arXiv:2006.09373v5 [cs.CV] UPDATED)
    (0 min) Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially-robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across AlexNet, GoogLeNet, and ResNet-50 architectures. We found that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of 'robustifying' the network. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features i.e. textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings e.g. why R networks benefit from much larger capacity (Xie and Yuille, 2020) and can act as a strong image prior in image synthesis (Santurkar et al., 2019).
    Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a Class-imbalance View. (arXiv:2010.16010v2 [cs.CV] UPDATED)
    (0 min) Recent studies have pointed out that many well-developed Visual Question Answering (VQA) models are heavily affected by the language prior problem, which refers to making predictions based on the co-occurrence pattern between textual questions and answers instead of reasoning about visual content. To tackle it, most existing methods focus on enhancing visual feature learning to reduce this superficial textual shortcut's influence on VQA model decisions. However, limited effort has been devoted to providing an explicit interpretation of its inherent cause. The field thus lacks good guidance for moving forward in a purposeful way, resulting in confusion about how to construct models that overcome this non-trivial problem. In this paper, we propose to interpret the language prior problem in VQA from a class-imbalance view. Concretely, we design a novel interpretation scheme whereby the loss of mis-predicted frequent and sparse answers of the same question type is distinctly exhibited during the late training phase. It explicitly reveals why a VQA model tends to produce a frequent yet obviously wrong answer to a given question whose right answer is sparse in the training set. Based upon this observation, we further develop a novel loss re-scaling approach to assign different weights to each answer based on the training data statistics for computing the final loss. We apply our approach to three baselines, and the experimental results on two VQA-CP benchmark datasets clearly demonstrate its effectiveness. In addition, we also justify the validity of the class-imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
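    A minimal sketch of the loss re-scaling idea, assuming an inverse-frequency weighting of answers computed from training-set statistics; the exact weighting scheme in the paper may differ, and all names here are illustrative.

        import torch
        import torch.nn.functional as F
        from collections import Counter

        def answer_weights(train_answers, num_classes, smooth=1.0):
            # Inverse-frequency weights computed from training-set statistics.
            counts = Counter(train_answers)
            freq = torch.tensor([counts.get(c, 0) + smooth for c in range(num_classes)],
                                dtype=torch.float)
            w = freq.sum() / (num_classes * freq)   # rare answers get weight > 1
            return w

        def rescaled_loss(logits, targets, weights):
            # Standard cross-entropy with per-answer (per-class) weights.
            return F.cross_entropy(logits, targets, weight=weights)

        # Toy usage with a heavily imbalanced answer distribution.
        num_answers = 10
        train_answers = [0] * 900 + [1] * 90 + [2] * 10
        w = answer_weights(train_answers, num_answers)
        logits = torch.randn(4, num_answers)
        targets = torch.tensor([0, 1, 2, 2])
        print(rescaled_loss(logits, targets, w))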
    Parallel Residual Bi-Fusion Feature Pyramid Network for Accurate Single-Shot Object Detection. (arXiv:2012.01724v2 [cs.CV] UPDATED)
    (0 min) We propose the Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN) for fast and accurate single-shot object detection. The Feature Pyramid (FP) is widely used in recent visual detection; however, the top-down pathway of the FP cannot preserve accurate localization due to pooling shifting. The advantage of the FP is weakened as deeper backbones with more layers are used. To address this issue, we propose a new parallel FP structure with bi-directional (top-down and bottom-up) fusion and associated improvements to retain high-quality features for accurate localization. Our method is particularly suitable for detecting small objects. We provide the following design improvements: (1) A parallel bi-fusion FP structure with a Bottom-up Fusion Module (BFM) to detect both small and large objects at once with high accuracy. (2) A COncatenation and RE-organization (CORE) module provides a bottom-up pathway for feature fusion, which leads to a bi-directional fusion FP that can recover lost information from lower-layer feature maps. (3) The CORE feature is further purified to retain richer contextual information. Such purification is performed with CORE in a few iterations in both top-down and bottom-up pathways. (4) Adding a residual design to CORE leads to a new Re-CORE module that enables easy training and integration with a wide range of (deeper or lighter) backbones. The proposed network achieves state-of-the-art performance on the UAVDT17 and MS COCO datasets.
    Baby Robot: Improving the Motor Skills of Toddlers. (arXiv:2109.09223v1 [cs.RO])
    (0 min) This article introduces "Baby Robot", a robot aiming to improve the motor skills of babies and toddlers. The authors developed a car-like toy that moves autonomously using reinforcement learning and computer vision techniques. The robot's behaviour is to escape from a target baby that has been previously recognized, or at least detected, while avoiding obstacles, so that the safety of the baby is not compromised. A myriad of commercial toys with a similar mobility-improvement purpose are on the market; however, none of them offers intelligent autonomous movement, as they perform simple, repetitive trajectories at best. Two crawling toys -- one representing "Baby Robot" -- were tested in a real environment against regular toys in order to check how they improved the toddlers' mobility. These real-life experiments were conducted with our proposed robot in a kindergarten, where a group of children interacted with the toys. Significant improvements in the motion skills of participants were detected.
    Distilling Virtual Examples for Long-tailed Recognition. (arXiv:2103.15042v3 [cs.CV] UPDATED)
    (0 min) We tackle the long-tailed visual recognition problem from the knowledge distillation perspective by proposing a Distill the Virtual Examples (DiVE) method. Specifically, by treating the predictions of a teacher model as virtual examples, we prove that distilling from these virtual examples is equivalent to label distribution learning under certain constraints. We show that when the virtual example distribution becomes flatter than the original input distribution, the under-represented tail classes will receive significant improvements, which is crucial in long-tailed recognition. The proposed DiVE method can explicitly tune the virtual example distribution to become flat. Extensive experiments on three benchmark datasets, including the large-scale iNaturalist ones, justify that the proposed DiVE method can significantly outperform state-of-the-art methods. Furthermore, additional analyses and experiments verify the virtual example interpretation, and demonstrate the effectiveness of tailored designs in DiVE for long-tailed problems.
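    A hedged sketch of distilling from a flattened teacher distribution, which is one simple way to realize the "flatter virtual example distribution" described above; the temperature and power parameters are illustrative assumptions, not the paper's settings.

        import torch
        import torch.nn.functional as F

        def flattened_distillation_loss(student_logits, teacher_logits, T=3.0, power=0.5):
            """Distill from 'virtual examples' given by a flattened teacher distribution.

            Raising the temperature T (and optionally applying a power < 1) makes the
            teacher distribution flatter than the input label distribution, which is
            the property the abstract argues helps under-represented tail classes.
            """
            with torch.no_grad():
                p_teacher = F.softmax(teacher_logits / T, dim=1)
                p_teacher = p_teacher.pow(power)
                p_teacher = p_teacher / p_teacher.sum(dim=1, keepdim=True)
            log_p_student = F.log_softmax(student_logits / T, dim=1)
            # KL(teacher || student), scaled by T^2 as in standard distillation.
            return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)

        student_logits = torch.randn(8, 100)
        teacher_logits = torch.randn(8, 100)
        print(flattened_distillation_loss(student_logits, teacher_logits))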
    RSI-Net: Two-Stream Deep Neural Network Integrating GCN and Atrous CNN for Semantic Segmentation of High-resolution Remote Sensing Images. (arXiv:2109.09148v1 [cs.CV])
    (0 min) For semantic segmentation of remote sensing images (RSI), the trade-off between representation power and location accuracy is quite important. How to achieve this trade-off effectively is an open question; current approaches that utilize attention schemes or very deep models result in complex models with large memory consumption. Compared with the widely used convolutional neural network (CNN) with fixed square kernels, a graph convolutional network (GCN) can explicitly utilize correlations between adjacent land covers and conduct flexible convolution on arbitrarily irregular image regions. However, the problems of large variations in target scale and blurred boundaries cannot be easily solved by the GCN, while a densely connected atrous convolution network (DenseAtrousCNet) with multi-scale atrous convolutions can expand the receptive field and obtain global image information. Inspired by the advantages of both GCN and Atrous CNN, a two-stream deep neural network for semantic segmentation of RSI (RSI-Net) is proposed in this paper to obtain improved performance by effectively modeling and propagating spatial contextual structure and by employing a novel decoding scheme that combines image-level and graph-level features. Extensive experiments are conducted on the Vaihingen, Potsdam and Gaofen RSI datasets, where the comparison results demonstrate the superior performance of RSI-Net in terms of overall accuracy, F1 score and kappa coefficient when compared with six state-of-the-art RSI semantic segmentation methods.
    A Study of the Generalizability of Self-Supervised Representations. (arXiv:2109.09150v1 [cs.LG])
    (0 min) Recent advancements in self-supervised learning (SSL) made it possible to learn generalizable visual representations from unlabeled data. The performance of Deep Learning models fine-tuned on pretrained SSL representations is on par with models fine-tuned on the state-of-the-art supervised learning (SL) representations. Irrespective of the progress made in SSL, its generalizability has not been studied extensively. In this article, we perform a deeper analysis of the generalizability of pretrained SSL and SL representations by conducting a domain-based study for transfer learning classification tasks. The representations are learned from the ImageNet source data, which are then fine-tuned using two types of target datasets: similar to the source dataset, and significantly different from the source dataset. We study generalizability of the SSL and SL-based models via their prediction accuracy as well as prediction confidence. In addition to this, we analyze the attribution of the final convolutional layer of these models to understand how they reason about the semantic identity of the data. We show that the SSL representations are more generalizable as compared to the SL representations. We explain the generalizability of the SSL representations by investigating its invariance property, which is shown to be better than that observed in the SL representations.
    Low-resolution Human Pose Estimation. (arXiv:2109.09090v1 [cs.CV])
    (0 min) Human pose estimation has achieved significant progress on images with high imaging resolution. However, low-resolution imagery brings nontrivial challenges which are still under-studied. To fill this gap, we start by investigating existing methods and reveal that the dominant heatmap-based methods suffer more severe performance degradation at low resolution, and that offset learning is an effective strategy. Building on this observation, in this work we propose a novel Confidence-Aware Learning (CAL) method which further addresses two fundamental limitations of existing offset learning methods: inconsistent training and testing, and decoupled heatmap and offset learning. Specifically, CAL selectively weighs the learning of heatmap and offset with respect to the ground truth and the most confident prediction, whilst capturing the statistical importance of the model output in a mini-batch learning manner. Extensive experiments conducted on the COCO benchmark show that our method significantly outperforms the state-of-the-art methods for low-resolution human pose estimation.
    CaTGrasp: Learning Category-Level Task-Relevant Grasping in Clutter from Simulation. (arXiv:2109.09163v1 [cs.RO])
    (0 min) Task-relevant grasping is critical for industrial assembly, where downstream manipulation tasks constrain the set of valid grasps. Learning how to perform this task, however, is challenging, since task-relevant grasp labels are hard to define and annotate. There is also yet no consensus on proper representations for modeling or off-the-shelf tools for performing task-relevant grasps. This work proposes a framework to learn task-relevant grasping for industrial objects without the need of time-consuming real-world data collection or manual annotation. To achieve this, the entire framework is trained solely in simulation, including supervised training with synthetic label generation and self-supervised, hand-object interaction. In the context of this framework, this paper proposes a novel, object-centric canonical representation at the category level, which allows establishing dense correspondence across object instances and transferring task-relevant grasps to novel instances. Extensive experiments on task-relevant grasping of densely-cluttered industrial objects are conducted in both simulation and real-world setups, demonstrating the effectiveness of the proposed framework. Code and data will be released upon acceptance at https://sites.google.com/view/catgrasp.
    Traffic-Net: 3D Traffic Monitoring Using a Single Camera. (arXiv:2109.09165v1 [cs.CV])
    (0 min) Computer Vision has played a major role in Intelligent Transportation Systems (ITS) and traffic surveillance. With the rapid growth of automated vehicles and increasingly crowded cities, automated and advanced traffic management systems (ATMS) using video surveillance infrastructure have evolved through the implementation of Deep Neural Networks. In this research, we provide a practical platform for real-time traffic monitoring, including 3D vehicle/pedestrian detection, speed detection, trajectory estimation, congestion detection, as well as monitoring the interaction of vehicles and pedestrians, all using a single CCTV traffic camera. We adapt a custom YOLOv5 deep neural network model for vehicle/pedestrian detection and an enhanced SORT tracking algorithm. For the first time, a hybrid satellite-ground based inverse perspective mapping (SG-IPM) method for camera auto-calibration is also developed, which leads to accurate 3D object detection and visualisation. We also develop a hierarchical traffic modelling solution based on short- and long-term temporal video data streams to understand the traffic flow, bottlenecks, and risky spots for vulnerable road users. Several experiments on real-world scenarios and comparisons with the state of the art are conducted using various traffic monitoring datasets, including MIO-TCD, UA-DETRAC and GRAM-RTM, collected from highways, intersections, and urban areas under different lighting and weather conditions.
    Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking. (arXiv:2004.01382v2 [cs.CV] UPDATED)
    (0 min) Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of twelve state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.
    Human Recognition based on Retinal Bifurcations and Modified Correlation Function. (arXiv:2109.08977v1 [eess.IV])
    (0 min) Nowadays, high security is an important issue for most secure places, and recent advances have increased the need for high-security systems. The need for high security in controlling access and permitting only authorized people to enter secure places therefore increases and extends the use of conventional recognition methods. Therefore, a novel identification method using retinal images is proposed in this paper. For this purpose, new mathematical functions are applied to corners and bifurcations. To evaluate the proposed method we use 40 retinal images from the DRIVE database, 20 normal retinal images from the STARE database and 140 normal retinal images from a locally collected database, and the accuracy rate is 99.34 percent.
    The Unreasonable Effectiveness of the Final Batch Normalization Layer. (arXiv:2109.09016v1 [cs.CV])
    (0 min) Early-stage disease indications are rarely recorded in real-world domains, such as Agriculture and Healthcare, and yet their accurate identification is critical at that point in time. In this type of highly imbalanced classification problem, which encompasses complex features, deep learning (DL) is much needed because of its strong detection capabilities. At the same time, DL is observed in practice to favor majority over minority classes and consequently suffer from inaccurate detection of the targeted early-stage indications. In this work, we extend the study done by Kocaman et al., 2020, showing that the final BN layer, when placed before the softmax output layer, has a considerable impact in highly imbalanced image classification problems and also undermines the role of the softmax outputs as an uncertainty measure. This current study addresses additional hypotheses and reports on the following findings: (i) the performance gain after adding the final BN layer in highly imbalanced settings could still be achieved after removing this additional BN layer in inference; (ii) there is a certain threshold for the imbalance ratio upon which the progress gained by the final BN layer reaches its peak; (iii) the batch size also plays a role and affects the outcome of the final BN application; (iv) the impact of the BN application is also reproducible on other datasets and when utilizing much simpler neural architectures; (v) the reported BN effect occurs only with a single majority class and multiple minority classes, i.e., no improvements are evident when there are two majority classes; and finally, (vi) utilizing this BN layer with sigmoid activation has almost no impact when dealing with strongly imbalanced image classification tasks.
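    A minimal sketch of the configuration under study, i.e. a BatchNorm layer placed immediately before the softmax output of a toy classifier; the architecture and layer sizes are illustrative only.

        import torch
        import torch.nn as nn

        class ClassifierWithFinalBN(nn.Module):
            """Toy classifier with a BatchNorm1d layer placed right before the
            softmax output, the configuration discussed in the abstract."""
            def __init__(self, in_features=512, num_classes=5, final_bn=True):
                super().__init__()
                self.backbone = nn.Sequential(
                    nn.Linear(in_features, 256), nn.ReLU(),
                    nn.Linear(256, num_classes),
                )
                # The extra, final BN layer; it can be dropped at inference
                # to probe finding (i) from the abstract.
                self.final_bn = nn.BatchNorm1d(num_classes) if final_bn else nn.Identity()

            def forward(self, x):
                logits = self.backbone(x)
                logits = self.final_bn(logits)
                return logits  # softmax / cross-entropy applied by the loss

        model = ClassifierWithFinalBN()
        x = torch.randn(16, 512)
        print(model(x).shape)  # torch.Size([16, 5])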
    Joint Distribution Alignment via Adversarial Learning for Domain Adaptive Object Detection. (arXiv:2109.09033v1 [cs.CV])
    (0 min) Unsupervised domain adaptive object detection aims to adapt a well-trained detector from its original source domain with rich labeled data to a new target domain with unlabeled data. Recently, mainstream approaches perform this task through adversarial learning, yet still suffer from two limitations. First, they mainly align marginal distribution by unsupervised cross-domain feature matching, and ignore each feature's categorical and positional information that can be exploited for conditional alignment; Second, they treat all classes as equally important for transferring cross-domain knowledge and ignore that different classes usually have different transferability. In this paper, we propose a joint adaptive detection framework (JADF) to address the above challenges. First, an end-to-end joint adversarial adaptation framework for object detection is proposed, which aligns both marginal and conditional distributions between domains without introducing any extra hyperparameter. Next, to consider the transferability of each object class, a metric for class-wise transferability assessment is proposed, which is incorporated into the JADF objective for domain adaptation. Further, an extended study from unsupervised domain adaptation (UDA) to unsupervised few-shot domain adaptation (UFDA) is conducted, where only a few unlabeled training images are available in unlabeled target domain. Extensive experiments validate that JADF is effective in both the UDA and UFDA settings, achieving significant performance gains over existing state-of-the-art cross-domain detection methods.
    Unsupervised 3D Pose Estimation for Hierarchical Dance Video Recognition. (arXiv:2109.09166v1 [cs.CV])
    (0 min) Dance experts often view dance as a hierarchy of information, spanning low-level (raw images, image sequences), mid-levels (human poses and bodypart movements), and high-level (dance genre). We propose a Hierarchical Dance Video Recognition framework (HDVR). HDVR estimates 2D pose sequences, tracks dancers, and then simultaneously estimates corresponding 3D poses and 3D-to-2D imaging parameters, without requiring ground truth for 3D poses. Unlike most methods that work on a single person, our tracking works on multiple dancers, under occlusions. From the estimated 3D pose sequence, HDVR extracts body part movements, and therefrom dance genre. The resulting hierarchical dance representation is explainable to experts. To overcome noise and interframe correspondence ambiguities, we enforce spatial and temporal motion smoothness and photometric continuity over time. We use an LSTM network to extract 3D movement subsequences from which we recognize the dance genre. For experiments, we have identified 154 movement types, of 16 body parts, and assembled a new University of Illinois Dance (UID) Dataset, containing 1143 video clips of 9 genres covering 30 hours, annotated with movement and genre labels. Our experimental results demonstrate that our algorithms outperform the state-of-the-art 3D pose estimation methods, which also enhances our dance recognition performance.
    DECORAS: detection and characterization of radio-astronomical sources using deep learning. (arXiv:2109.09077v1 [astro-ph.IM])
    (0 min) We present DECORAS, a deep learning based approach to detect both point and extended sources from Very Long Baseline Interferometry (VLBI) observations. Our approach is based on an encoder-decoder neural network architecture that uses a low number of convolutional layers to provide a scalable solution for source detection. In addition, DECORAS performs source characterization in terms of the position, effective radius and peak brightness of the detected sources. We have trained and tested the network with images that are based on realistic Very Long Baseline Array (VLBA) observations at 20 cm. Also, these images have not gone through any prior de-convolution step and are directly related to the visibility data via a Fourier transform. We find that the source catalog generated by DECORAS has a better overall completeness and purity, when compared to a traditional source detection algorithm. DECORAS is complete at the 7.5$\sigma$ level, and has an almost factor of two improvement in reliability at 5.5$\sigma$. We find that DECORAS can recover the position of the detected sources to within 0.61 $\pm$ 0.69 mas, and the effective radius and peak surface brightness are recovered to within 20 per cent for 98 and 94 per cent of the sources, respectively. Overall, we find that DECORAS provides a reliable source detection and characterization solution for future wide-field VLBI surveys.
    Towards robustness under occlusion for face recognition. (arXiv:2109.09083v1 [cs.CV])
    (0 min) In this paper, we evaluate the effects of occlusions on the performance of a face recognition pipeline that uses a ResNet backbone. The classifier was trained on a subset of the CelebA-HQ dataset containing 5,478 images from 307 classes, achieving a top-1 error rate of 17.91%. We designed 8 different occlusion masks which were applied to the input images. This caused a significant drop in the classifier's performance: its error rate for each mask became at least two times worse than before. In order to increase robustness under occlusions, we followed two approaches. The first is image inpainting using the pre-trained pluralistic image completion network. The second is CutMix, a regularization strategy consisting of mixing training images and their labels using rectangular patches, making the classifier more robust against input corruptions. Both strategies proved effective, and interesting results were observed. In particular, the CutMix approach makes the network more robust without requiring additional steps at application time, though its training time is considerably longer. Our datasets containing the different occlusion masks as well as their inpainted counterparts are made publicly available to promote research in the field.
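    A minimal sketch of the CutMix-style augmentation described above, mixing rectangular patches between images in a batch and mixing their labels proportionally to the pasted area; hyperparameters and shapes are illustrative.

        import numpy as np
        import torch

        def cutmix(images, labels, alpha=1.0):
            """Mix rectangular patches between images in a batch (CutMix-style).

            Returns mixed images plus both label sets and the mixing ratio lam,
            so the loss can be computed as lam * loss(y_a) + (1 - lam) * loss(y_b).
            """
            lam = np.random.beta(alpha, alpha)
            perm = torch.randperm(images.size(0))
            _, _, h, w = images.shape
            # Patch size follows the mixing ratio.
            cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
            cy, cx = np.random.randint(h), np.random.randint(w)
            y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
            x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
            mixed = images.clone()
            mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
            # Adjust lam to the exact area that was actually pasted.
            lam = 1 - ((y2 - y1) * (x2 - x1) / (h * w))
            return mixed, labels, labels[perm], lam

        imgs = torch.randn(8, 3, 224, 224)
        lbls = torch.randint(0, 307, (8,))
        mixed, y_a, y_b, lam = cutmix(imgs, lbls)
        print(mixed.shape, lam)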
    HPTQ: Hardware-Friendly Post Training Quantization. (arXiv:2109.09113v1 [cs.CV])
    (0 min) Neural network quantization enables the deployment of models on edge devices. An essential requirement for their hardware efficiency is that the quantizers are hardware-friendly: uniform, symmetric, and with power-of-two thresholds. To the best of our knowledge, current post-training quantization methods do not support all of these constraints simultaneously. In this work, we introduce a hardware-friendly post training quantization (HPTQ) framework, which addresses this problem by synergistically combining several known quantization methods. We perform a large-scale study on four tasks: classification, object detection, semantic segmentation and pose estimation over a wide variety of network architectures. Our extensive experiments show that competitive results can be obtained under hardware-friendly constraints.
    Unsupervised domain adaptation for cross-modality liver segmentation via joint adversarial learning and self-learning. (arXiv:2109.05664v2 [cs.CV] UPDATED)
    (0 min) Liver segmentation on images acquired using computed tomography (CT) and magnetic resonance imaging (MRI) plays an important role in the clinical management of liver diseases. Compared to MRI, CT images of the liver are more abundant and readily available. However, MRI can provide richer quantitative information about the liver than CT. Thus, it is desirable to achieve unsupervised domain adaptation for transferring the learned knowledge from the source domain containing labeled CT images to the target domain containing unlabeled MR images. In this work, we report a novel unsupervised domain adaptation framework for cross-modality liver segmentation via joint adversarial learning and self-learning. We propose joint semantic-aware and shape-entropy-aware adversarial learning with a post-situ identification manner to implicitly align the distribution of task-related features extracted from the target domain with those from the source domain. In the proposed framework, a network is trained with the above two adversarial losses in an unsupervised manner, and then a mean completer of pseudo-label generation is employed to produce pseudo-labels to train the next network (the desired model). Additionally, semantic-aware adversarial learning and two self-learning methods, including pixel-adaptive mask refinement and student-to-partner learning, are proposed to train the desired model. To improve the robustness of the desired model, a low-signal augmentation function is proposed to transform MRI images as the input of the desired model to handle hard samples. Using public data sets, our experiments demonstrated that the proposed unsupervised domain adaptation framework outperformed four supervised learning methods, achieving a Dice score of 0.912 ± 0.037 (mean ± standard deviation).
    Sk-Unet Model with Fourier Domain for Mitosis Detection. (arXiv:2109.00957v2 [eess.IV] UPDATED)
    (0 min) Mitotic count is the most important morphological feature of breast cancer grading. Many deep learning-based methods have been proposed but suffer from domain shift. In this work, we construct a Fourier-based segmentation model for mitosis detection to address the problem. Swapping the low-frequency spectrum of source and target images is shown to be effective in alleviating the discrepancy between different scanners. Our Fourier-based segmentation method achieves an F1 score of 0.7456 on the preliminary test set.
    Underwater Image Enhancement Using Convolutional Neural Network. (arXiv:2109.08916v1 [eess.IV])
    (0 min) This work proposes a method for underwater image enhancement using the principle of histogram equalization. Since underwater images have a strong global dominant colour, their colourfulness and contrast are often degraded. Before applying histogram equalization, the image is converted from a colour image to a grayscale image for further operations. Histogram equalization is a technique for adjusting image intensities to enhance contrast. The colours of the image are retained using a convolutional neural network model which is trained on datasets of underwater images to give better results.
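    A minimal sketch of the classical equalization step this method builds on, here applied to the luminance channel so the chroma is untouched; the CNN-based colour restoration described in the abstract is not reproduced, and the YCrCb-luminance variant shown is an assumption rather than the paper's exact pipeline.

        import cv2
        import numpy as np

        def equalize_luminance(bgr_image):
            """Histogram-equalize the brightness of a colour image.

            The image is converted to YCrCb, the luminance (Y) channel is
            equalized, and the chroma channels are left untouched so the
            colours are retained; the abstract additionally restores colour
            with a CNN, which is not reproduced here.
            """
            ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
            ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
            return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

        img = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)  # stand-in image
        enhanced = equalize_luminance(img)
        print(enhanced.shape)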
    Towards to Robust and Generalized Medical Image Segmentation Framework. (arXiv:2108.03823v2 [cs.CV] UPDATED)
    (0 min) To mitigate radiologists' workload, computer-aided diagnosis with the capability to review and analyze medical images is gradually being deployed. Deep learning-based region of interest segmentation is among the most exciting use cases. However, this paradigm is restricted in real-world clinical applications due to poor robustness and generalization. The issue is exacerbated by a lack of training data. In this paper, we address the challenge from the representation learning point of view. We find that collapsed representations, one of the main causes of poor robustness and generalization, can be avoided through transfer learning. Therefore, we propose a novel two-stage framework for robust generalized segmentation. In particular, an unsupervised Tile-wise AutoEncoder (T-AE) pretraining architecture is introduced to learn meaningful representations that improve the generalization and robustness of the downstream tasks. Furthermore, the learned knowledge is transferred to the segmentation benchmark. Coupled with an image reconstruction network, the representation continues to be decoded, encouraging the model to capture more semantic features. Experiments on lung segmentation across multiple chest X-ray datasets are conducted. Empirically, the experimental results demonstrate the superior generalization capability of the proposed framework on unseen domains in terms of high performance and robustness to corruption, especially in the scenario of limited training data.
    Unmasking Face Embeddings by Self-restrained Triplet Loss for Accurate Masked Face Recognition. (arXiv:2103.01716v2 [cs.CV] UPDATED)
    (0 min) Using the face as a biometric identity trait is motivated by the contactless nature of the capture process and the high accuracy of the recognition algorithms. Following the outbreak of the current COVID-19 pandemic, wearing a face mask has been imposed in public places to keep the pandemic under control. However, face occlusion due to wearing a mask presents an emerging challenge for face recognition systems. In this paper, we present a solution to improve masked face recognition performance. Specifically, we propose the Embedding Unmasking Model (EUM), operated on top of existing face recognition models. We also propose a novel loss function, the Self-restrained Triplet (SRT) loss, which enables the EUM to produce embeddings similar to those of unmasked faces of the same identities. The evaluation results on three face recognition models, two real masked datasets, and two synthetically generated masked face datasets show that our proposed approach significantly improves performance in most experimental settings.
    Assessing bikeability with street view imagery and computer vision. (arXiv:2105.08499v3 [cs.CV] UPDATED)
    (0 min) Studies evaluating bikeability usually compute spatial indicators shaping cycling conditions and conflate them in a quantitative index. Much research involves site visits or conventional geospatial approaches, and few studies have leveraged street view imagery (SVI) for conducting virtual audits. These have assessed a limited range of aspects, and not all have been automated using computer vision (CV). Furthermore, studies have not yet zeroed in on gauging the usability of these technologies thoroughly. We investigate, with experiments at a fine spatial scale and across multiple geographies (Singapore and Tokyo), whether we can use SVI and CV to assess bikeability comprehensively. Extending related work, we develop an exhaustive index of bikeability composed of 34 indicators. The results suggest that SVI and CV are adequate to evaluate bikeability in cities comprehensively. As they outperformed non-SVI counterparts by a wide margin, SVI indicators are also found to be superior in assessing urban bikeability, and potentially can be used independently, replacing traditional techniques. However, the paper exposes some limitations, suggesting that the best way forward is combining both SVI and non-SVI approaches. The new bikeability index presents a contribution in transportation and urban analytics, and it is scalable to assess cycling appeal widely.
    Classic versus deep learning approaches to address computer vision challenges. (arXiv:2101.09744v2 [cs.CV] UPDATED)
    (0 min) Computer vision and image processing address many challenging applications. While the last decade has seen deep neural network architectures revolutionizing those fields, early methods relied on 'classic', i.e., non-learned approaches. In this study, we explore the differences between classic and deep learning (DL) algorithms to gain new insight regarding which is more suitable for a given application. The focus is on two challenging ill-posed problems, namely faint edge detection and multispectral image registration, studying recent state-of-the-art DL and classic solutions. While those DL algorithms outperform classic methods in terms of accuracy and development time, they tend to have higher resource requirements and are unable to perform outside their training space. Moreover, classic algorithms are more transparent, which facilitates their adoption for real-life applications. As both classes of approaches have unique strengths and limitations, the choice of a solution is clearly application dependent.
    LODE: Deep Local Deblurring and A New Benchmark. (arXiv:2109.09149v1 [cs.CV])
    (0 min) While recent deep deblurring algorithms have achieved remarkable progress, most existing methods focus on the global deblurring problem, where the image blur mostly arises from severe camera shake. We argue that the local blur, which is mostly derived from moving objects with a relatively static background, is prevalent but remains under-explored. In this paper, we first lay the data foundation for local deblurring by constructing, for the first time, a LOcal-DEblur (LODE) dataset consisting of 3,700 real-world captured locally blurred images and their corresponding ground-truth. Then, we propose a novel framework, termed BLur-Aware DEblurring network (BladeNet), which contains three components: the Local Blur Synthesis module generates locally blurred training pairs, the Local Blur Perception module automatically captures the locally blurred region and the Blur-guided Spatial Attention module guides the deblurring network with spatial attention. This framework is flexible such that it can be combined with many existing SotA algorithms. We carry out extensive experiments on REDS and LODE datasets showing that BladeNet improves PSNR by 2.5dB over SotAs for local deblurring while keeping comparable performance for global deblurring. We will publish the dataset and codes.
    Evaluating explainable artificial intelligence methods for multi-label deep learning classification tasks in remote sensing. (arXiv:2104.01375v2 [cs.LG] UPDATED)
    (0 min) Although deep neural networks hold the state-of-the-art in several remote sensing tasks, their black-box operation hinders the understanding of their decisions, concealing any bias and other shortcomings in datasets and model performance. To this end, we have applied explainable artificial intelligence (XAI) methods in remote sensing multi-label classification tasks towards producing human-interpretable explanations and improve transparency. In particular, we utilized and trained deep learning models with state-of-the-art performance in the benchmark BigEarthNet and SEN12MS datasets. Ten XAI methods were employed towards understanding and interpreting models' predictions, along with quantitative metrics to assess and compare their performance. Numerous experiments were performed to assess the overall performance of XAI methods for straightforward prediction cases, competing multiple labels, as well as misclassification cases. According to our findings, Occlusion, Grad-CAM and Lime were the most interpretable and reliable XAI methods. However, none delivers high-resolution outputs, while apart from Grad-CAM, both Lime and Occlusion are computationally expensive. We also highlight different aspects of XAI performance and elaborate with insights on black-box decisions in order to improve transparency, understand their behavior and reveal, as well, datasets' particularities.
    Reveal of Vision Transformers Robustness against Adversarial Attacks. (arXiv:2106.03734v2 [cs.CV] UPDATED)
    (0 min) The main component of the vanilla vision transformer (ViT) is the attention block, which brings the power of modeling the global context of the input image. For better performance, ViT needs large-scale training data. To overcome this data hunger limitation, many ViT-based networks, or hybrid-ViTs, have been proposed to include local context during training. The robustness of ViTs and their variants against adversarial attacks has not been investigated in the literature as widely as that of CNNs. This work studies the robustness of ViT variants 1) against different Lp-based adversarial attacks in comparison with CNNs, 2) under adversarial examples (AEs) after applying preprocessing defense methods, and 3) under adaptive attacks using the expectation over transformation (EOT) framework. To that end, we run a set of experiments on 1000 images from ImageNet-1k and then provide an analysis that reveals that vanilla ViTs and hybrid-ViTs are more robust than CNNs. For instance, we found that 1) vanilla ViTs and hybrid-ViTs are more robust than CNNs under Lp-based attacks and under adaptive attacks; 2) unlike hybrid-ViTs, vanilla ViTs do not respond to preprocessing defenses that mainly reduce high-frequency components. Furthermore, feature maps, attention maps, and Grad-CAM visualizations, together with image quality measures and the perturbations' energy spectrum, are provided for a deeper understanding of attention-based models.
    TFPnP: Tuning-free Plug-and-Play Proximal Algorithm with Applications to Inverse Imaging Problems. (arXiv:2012.05703v3 [cs.CV] UPDATED)
    (0 min) Plug-and-Play (PnP) is a non-convex optimization framework that combines proximal algorithms, for example, the alternating direction method of multipliers (ADMM), with advanced denoising priors. Over the past few years, great empirical success has been obtained by PnP algorithms, especially for the ones that integrate deep learning-based denoisers. However, a key challenge of PnP approaches is the need for manual parameter tweaking as it is essential to obtain high-quality results across the high discrepancy in imaging conditions and varying scene content. In this work, we present a class of tuning-free PnP proximal algorithms that can determine parameters such as denoising strength, termination time, and other optimization-specific parameters automatically. A core part of our approach is a policy network for automated parameter search which can be effectively learned via a mixture of model-free and model-based deep reinforcement learning strategies. We demonstrate, through rigorous numerical and visual experiments, that the learned policy can customize parameters to different settings, and is often more efficient and effective than existing handcrafted criteria. Moreover, we discuss several practical considerations of PnP denoisers, which together with our learned policy yield state-of-the-art results. This advanced performance is prevalent on both linear and nonlinear exemplar inverse imaging problems, and in particular shows promising results on compressed sensing MRI, sparse-view CT, single-photon imaging, and phase retrieval.
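    A hedged sketch of a generic Plug-and-Play ADMM loop in which the prior's proximal step is a denoiser and the per-iteration parameters (denoising strength, penalty) are supplied by an external policy, as a stand-in for the learned policy network; all function names and the toy schedule are illustrative.

        import numpy as np

        def pnp_admm(y, forward_prox, denoiser, policy, n_iters=20, x0=None):
            """Generic Plug-and-Play ADMM loop.

            forward_prox(z, rho): proximal operator of the data-fidelity term.
            denoiser(v, sigma):   any denoiser used as the prior's proximal step.
            policy(k, x):         returns (sigma_k, rho_k), e.g. from a learned
                                  policy network as in tuning-free PnP; here it
                                  could simply be a fixed schedule.
            """
            x = np.zeros_like(y) if x0 is None else x0.copy()
            v = x.copy()
            u = np.zeros_like(x)              # scaled dual variable
            for k in range(n_iters):
                sigma_k, rho_k = policy(k, x)
                x = forward_prox(v - u, rho_k)    # data-fidelity proximal step
                v = denoiser(x + u, sigma_k)      # denoiser as prior proximal step
                u = u + x - v                     # dual update
            return x

        # Toy usage: a denoising problem with an identity forward model.
        y = np.random.randn(64, 64)
        fwd = lambda z, rho: (y + rho * z) / (1.0 + rho)   # prox of 0.5||x - y||^2
        den = lambda v, s: v                               # placeholder denoiser
        pol = lambda k, x: (0.1 * 0.9 ** k, 1.0)           # fixed decaying schedule
        print(pnp_admm(y, fwd, den, pol).shape)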
    Capsule networks with non-iterative cluster routing. (arXiv:2109.09213v1 [cs.CV])
    (0 min) Capsule networks use routing algorithms to flow information between consecutive layers. In the existing routing procedures, capsules produce predictions (termed votes) for capsules of the next layer. In a nutshell, the next-layer capsule's input is a weighted sum over all the votes it receives. In this paper, we propose non-iterative cluster routing for capsule networks. In the proposed cluster routing, capsules produce vote clusters instead of individual votes for next-layer capsules, and each vote cluster sends its centroid to a next-layer capsule. Generally speaking, the next-layer capsule's input is a weighted sum over the centroid of each vote cluster it receives. The centroid that comes from a cluster with a smaller variance is assigned a larger weight in the weighted sum process. Compared with the state-of-the-art capsule networks, the proposed capsule networks achieve the best accuracy on the Fashion-MNIST and SVHN datasets with fewer parameters, and achieve the best accuracy on the smallNORB and CIFAR-10 datasets with a moderate number of parameters. The proposed capsule networks also produce capsules with disentangled representation and generalize well to images captured at novel viewpoints. The proposed capsule networks also preserve 2D spatial information of an input image in the capsule channels: if the capsule channels are rotated, the object reconstructed from these channels will be rotated by the same transformation. Codes are available at https://github.com/ZHAOZHIHAO/ClusterRouting.
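    A minimal sketch of the centroid-combination step described above: each vote cluster contributes its centroid, weighted inversely by the cluster's variance; tensor shapes and the exact weighting are illustrative assumptions.

        import torch

        def cluster_routing(votes, eps=1e-6):
            """Combine vote clusters into a next-layer capsule input.

            votes: tensor of shape (num_clusters, votes_per_cluster, capsule_dim).
            Each cluster sends its centroid; centroids from low-variance clusters
            receive larger weights in the weighted sum, as described in the abstract.
            """
            centroids = votes.mean(dim=1)                     # (C, D)
            variances = votes.var(dim=1).mean(dim=-1)         # (C,) one scalar per cluster
            weights = 1.0 / (variances + eps)
            weights = weights / weights.sum()
            return (weights.unsqueeze(-1) * centroids).sum(dim=0)   # (D,)

        votes = torch.randn(8, 16, 32)        # 8 clusters of 16 votes, 32-D capsules
        print(cluster_routing(votes).shape)   # torch.Size([32])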
    A Cascaded Zoom-In Network for Patterned Fabric Defect Detection. (arXiv:2108.06760v2 [cs.CV] UPDATED)
    (2 min) Nowadays, Deep Convolutional Neural Networks (DCNNs) are widely used in fabric defect detection, which comes at the cost of expensive training and complex model parameters. With the observation that most fabrics are defect-free in practice, a two-step Cascaded Zoom-In Network (CZI-Net) is proposed for patterned fabric defect detection. In the CZI-Net, Aggregated HOG (A-HOG) and SIFT features are used instead of simple convolution filters for feature extraction. Moreover, in order to extract more distinctive features, a feature representation layer and a fully connected layer are included in the CZI-Net. In practice, most defect-free fabrics involve only the first step of our method and avoid the costly computation of the second step, which makes fabric detection very fast. More importantly, we propose the Locality-constrained Reconstruction Error (LCRE) in the first step, and the Restrictive Locality-constrained Coding (RLC) and Bag-of-Indexes (BoI) methods in the second step. We also analyse the connections between different coding methods and conclude that the index of visual words plays an essential role in coding methods. In conclusion, experiments based on real-world datasets are implemented and demonstrate that our proposed method is not only computationally simple but also achieves high detection accuracy.
    Sketch Your Own GAN. (arXiv:2108.02774v2 [cs.CV] UPDATED)
    (2 min) Can a user create a deep generative model by sketching a single example? Traditionally, creating a GAN model has required the collection of a large-scale dataset of exemplars and specialized knowledge in deep learning. In contrast, sketching is possibly the most universally accessible way to convey a visual concept. In this work, we present a method, GAN Sketching, for rewriting GANs with one or more sketches, to make GANs training easier for novice users. In particular, we change the weights of an original GAN model according to user sketches. We encourage the model's output to match the user sketches through a cross-domain adversarial loss. Furthermore, we explore different regularization methods to preserve the original model's diversity and image quality. Experiments have shown that our method can mold GANs to match shapes and poses specified by sketches while maintaining realism and diversity. Finally, we demonstrate a few applications of the resulting GAN, including latent space interpolation and image editing.
    Wasserstein Distances, Geodesics and Barycenters of Merge Trees. (arXiv:2107.07789v2 [cs.GR] UPDATED)
    (2 min) This paper presents a unified computational framework for the estimation of distances, geodesics and barycenters of merge trees. We extend recent work on the edit distance [106] and introduce a new metric, called the Wasserstein distance between merge trees, which is purposely designed to enable efficient computations of geodesics and barycenters. Specifically, our new distance is strictly equivalent to the L2-Wasserstein distance between extremum persistence diagrams, but it is restricted to a smaller solution space, namely, the space of rooted partial isomorphisms between branch decomposition trees. This enables a simple extension of existing optimization frameworks [112] for geodesics and barycenters from persistence diagrams to merge trees. We introduce a task-based algorithm which can be generically applied to distance, geodesic, barycenter or cluster computation. The task-based nature of our approach enables further accelerations with shared-memory parallelism. Extensive experiments on public ensembles and SciVis contest benchmarks demonstrate the efficiency of our approach -- with barycenter computations in the orders of minutes for the largest examples -- as well as its qualitative ability to generate representative barycenter merge trees, visually summarizing the features of interest found in the ensemble. We show the utility of our contributions with dedicated visualization applications: feature tracking, temporal reduction and ensemble clustering. We provide a lightweight C++ implementation that can be used to reproduce our results.
    Edge Prior Augmented Networks for Motion Deblurring on Naturally Blurry Images. (arXiv:2109.08915v1 [cs.CV])
    (2 min) Motion deblurring has witnessed rapid development in recent years, and most recent methods address it using deep learning techniques, with the help of different kinds of prior knowledge. Since deblurring is essentially expected to improve image sharpness, edge information can serve as an important prior. However, edges have not yet been seriously taken into consideration in previous methods when designing deep models. To this end, we present a novel framework that incorporates edge prior knowledge into deep models, termed Edge Prior Augmented Networks (EPAN). EPAN has a content-based main branch and an edge-based auxiliary branch, which are constructed as a Content Deblurring Net (CDN) and an Edge Enhancement Net (EEN), respectively. EEN is designed to augment CDN in the deblurring process via an attentive fusion mechanism, where edge features are mapped as spatial masks to guide content features in a feature-based hierarchical manner. An edge-guided loss function is proposed to further regulate the optimization of EPAN by enforcing focus on edge areas. Besides, we design a dual-camera-based image capturing setting to build a new dataset, Real Object Motion Blur (ROMB), with paired sharp and naturally blurry images of fast-moving cars, so as to better train motion deblurring models and benchmark the capability of motion deblurring algorithms in practice. Extensive experiments on the proposed ROMB and other existing datasets demonstrate that EPAN outperforms state-of-the-art approaches qualitatively and quantitatively.
    Hebbian Semi-Supervised Learning in a Sample Efficiency Setting. (arXiv:2103.09002v2 [cs.NE] CROSS LISTED)
    (2 min) We propose to address the issue of sample efficiency, in Deep Convolutional Neural Networks (DCNN), with a semi-supervised training strategy that combines Hebbian learning with gradient descent: all internal layers (both convolutional and fully connected) are pre-trained using an unsupervised approach based on Hebbian learning, and the last fully connected layer (the classification layer) is trained using Stochastic Gradient Descent (SGD). In fact, as Hebbian learning is an unsupervised learning method, its potential lies in the possibility of training the internal layers of a DCNN without labels. Only the final fully connected layer has to be trained with labeled examples. We performed experiments on various object recognition datasets, in different regimes of sample efficiency, comparing our semi-supervised (Hebbian for internal layers + SGD for the final fully connected layer) approach with end-to-end supervised backprop training, and with semi-supervised learning based on Variational Auto-Encoder (VAE). The results show that, in regimes where the number of available labeled samples is low, our semi-supervised approach outperforms the other approaches in almost all the cases.
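    A hedged sketch of an unsupervised Hebbian update (Oja's rule) for a single linear layer, of the kind that could pre-train internal layers without labels before a final classifier is trained with SGD; the actual Hebbian rule used in the paper may differ, and the data here is a random stand-in.

        import numpy as np

        def oja_update(W, x, lr=1e-3):
            """One Oja's-rule Hebbian update for a linear layer y = W x.

            Weights move toward input directions that excite their unit, with a
            decay term that keeps them bounded; no labels are used.
            """
            y = W @ x
            return W + lr * (np.outer(y, x) - (y ** 2)[:, None] * W)

        rng = np.random.default_rng(0)
        W = rng.normal(scale=0.1, size=(64, 784))   # 64 hidden units, 784 inputs
        for _ in range(1000):                       # unsupervised pre-training pass
            x = rng.normal(size=784)                # stand-in for an image vector
            W = oja_update(W, x)
        # A supervised classifier (trained with SGD) would then be fit on W @ x features.
        print(W.shape)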
    Learned Image Coding for Machines: A Content-Adaptive Approach. (arXiv:2108.09992v2 [eess.IV] UPDATED)
    (2 min) Today, according to the Cisco Annual Internet Report (2018-2023), the fastest-growing category of Internet traffic is machine-to-machine communication. In particular, machine-to-machine communication of images and videos represents a new challenge and opens up new perspectives in the context of data compression. One possible solution approach consists of adapting current human-targeted image and video coding standards to the use case of machine consumption. Another approach consists of developing completely new compression paradigms and architectures for machine-to-machine communications. In this paper, we focus on image compression and present an inference-time content-adaptive finetuning scheme that optimizes the latent representation of an end-to-end learned image codec, aimed at improving the compression efficiency for machine-consumption. The conducted experiments show that our online finetuning brings an average bitrate saving (BD-rate) of -3.66% with respect to our pretrained image codec. In particular, at low bitrate points, our proposed method results in a significant bitrate saving of -9.85%. Overall, our pretrained-and-then-finetuned system achieves -30.54% BD-rate over the state-of-the-art image/video codec Versatile Video Coding (VVC).
    Learning to Cut by Watching Movies. (arXiv:2108.04294v2 [cs.CV] UPDATED)
    (2 min) Video content creation keeps growing at an incredible pace; yet, creating engaging stories remains challenging and requires non-trivial video editing expertise. Many video editing components are astonishingly hard to automate, primarily due to the lack of raw video materials. This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility. Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that trigger cuts. To do this, we first collected a data source of more than 10K videos, from which we extract more than 255K cuts. We devise a model that learns to discriminate between real and artificial cuts via contrastive learning. We set up a new task and a set of baselines to benchmark video cut generation. We observe that our proposed model outperforms the baselines by large margins. To demonstrate our model in real-world applications, we conduct human studies on a collection of unedited videos. The results show that our model does a better job at cutting than random and alternative baselines.
    RECALL: Replay-based Continual Learning in Semantic Segmentation. (arXiv:2108.03673v2 [cs.CV] UPDATED)
    (2 min) Deep networks achieve outstanding results in semantic segmentation; however, they need to be trained in a single shot with a large amount of data. Continual learning settings, where new classes are learned in incremental steps and previous training data is no longer available, are challenging due to the catastrophic forgetting phenomenon. Existing approaches typically fail when several incremental steps are performed or in the presence of a distribution shift of the background class. We tackle these issues by recreating the no longer available data for the old classes and outlining a content inpainting scheme on the background class. We propose two sources for replay data. The first resorts to a generative adversarial network to sample from the class space of past learning steps. The second relies on web-crawled data to retrieve images containing examples of old classes from online databases. In both scenarios no samples of past steps are stored, thus avoiding privacy concerns. Replay data are then blended with new samples during the incremental steps. Our approach, RECALL, outperforms state-of-the-art methods.
    DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling. (arXiv:2105.12441v3 [cs.LG] UPDATED)
    (2 min) Since 2014 transfer learning has become the key driver for the improvement of spatial saliency prediction; however, with stagnant progress in the last 3-5 years. We conduct a large-scale transfer learning study which tests different ImageNet backbones, always using the same read out architecture and learning protocol adopted from DeepGaze II. By replacing the VGG19 backbone of DeepGaze II with ResNet50 features we improve the performance on saliency prediction from 78% to 85%. However, as we continue to test better ImageNet models as backbones (such as EfficientNetB5) we observe no additional improvement on saliency prediction. By analyzing the backbones further, we find that generalization to other datasets differs substantially, with models being consistently overconfident in their fixation predictions. We show that by combining multiple backbones in a principled manner a good confidence calibration on unseen datasets can be achieved. This new model, "DeepGaze IIE", yields a significant leap in benchmark performance in and out-of-domain with a 15 percent point improvement over DeepGaze II to 93% on MIT1003, marking a new state of the art on the MIT/Tuebingen Saliency Benchmark in all available metrics (AUC: 88.3%, sAUC: 79.4%, CC: 82.4%).
    XIRL: Cross-embodiment Inverse Reinforcement Learning. (arXiv:2106.03911v2 [cs.RO] UPDATED)
    (2 min) We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences. Prior to our work, producing rewards from self-supervised embeddings typically required alignment with a reference trajectory, which may be difficult to acquire under stark embodiment differences. We show empirically that if the embeddings are aware of task progress, simply taking the negative distance between the current state and goal state in the learned embedding space is useful as a reward for training policies with reinforcement learning. We find our learned reward function not only works for embodiments seen during training, but also generalizes to entirely new embodiments. Additionally, when transferring real-world human demonstrations to a simulated robot, we find that XIRL is more sample efficient than current best methods. Qualitative results, code, and datasets are available at https://x-irl.github.io
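    A minimal sketch of the reward construction described above, once an embedding network is available: the reward is the negative distance between the embedded current frame and the embedded goal frame; the placeholder encoder below is purely illustrative and not the paper's network.

        import torch

        def embedding_reward(encoder, frame, goal_frame):
            """Reward = negative distance to the goal in the learned embedding space.

            encoder: any frozen visual embedding network (the learned XIRL-style
            embedding is assumed to be given; here it is a placeholder).
            """
            with torch.no_grad():
                z = encoder(frame.unsqueeze(0))
                z_goal = encoder(goal_frame.unsqueeze(0))
            return -torch.norm(z - z_goal, p=2).item()

        # Toy usage with a placeholder encoder.
        encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
        frame = torch.randn(3, 64, 64)
        goal = torch.randn(3, 64, 64)
        print(embedding_reward(encoder, frame, goal))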
    Atrial Fibrillation: A Medical and Technological Review. (arXiv:2109.08974v1 [cs.LG])
    (2 min) Atrial Fibrillation (AF) is the most common type of arrhythmia (Greek a-, loss + rhythmos, rhythm = loss of rhythm) leading to hospitalization in the United States. Though sometimes AF is asymptomatic, it increases the risk of stroke and heart failure in patients, in addition to lowering the health-related quality of life (HRQOL). AF-related care costs the healthcare system between $6.0 to $26 billion each year. Early detection of AF and clinical attention can help improve symptoms and HRQOL of the patient, as well as bring down the cost of care. However, the prevalent paradigm of AF detection depends on electrocardiogram (ECG) recorded at a single point in time and does not shed light on the relation of the symptoms with heart rhythm or AF. In the recent decade, due to the democratization of health monitors and the advent of high-performing computers, Machine Learning algorithms have been proven effective in identifying AF, from the ECG of patients. This paper provides an overview of the symptoms of AF, its diagnosis, and future prospects for research in the field.
    Higher Order Recurrent Space-Time Transformer. (arXiv:2104.08665v2 [cs.CV] UPDATED)
    (0 min) Endowing visual agents with predictive capability is a key step towards video intelligence at scale. The predominant modeling paradigm for this is sequence learning, mostly implemented through LSTMs. Feed-forward Transformer architectures have replaced recurrent model designs in ML applications of language processing and also partly in computer vision. In this paper we investigate on the competitiveness of Transformer-style architectures for video predictive tasks. To do so we propose HORST, a novel higher order recurrent layer design whose core element is a spatial-temporal decomposition of self-attention for video. HORST achieves state of the art competitive performance on Something-Something-V2 early action recognition and EPIC-Kitchens-55 action anticipation, without exploiting a task specific design. We believe this is promising evidence of causal predictive capability that we attribute to our recurrent higher order design of self-attention.
    Surgical data science for safe cholecystectomy: a protocol for segmentation of hepatocystic anatomy and assessment of the critical view of safety. (arXiv:2106.10916v2 [eess.IV] UPDATED)
    (0 min) Minimally invasive image-guided surgery heavily relies on vision. Deep learning models for surgical video analysis could therefore support visual tasks such as assessing the critical view of safety (CVS) in laparoscopic cholecystectomy (LC), potentially contributing to surgical safety and efficiency. However, the performance, reliability and reproducibility of such models are deeply dependent on the quality of data and annotations used in their development. Here, we present a protocol, checklists, and visual examples to promote consistent annotation of hepatocystic anatomy and CVS criteria. We believe that sharing annotation guidelines can help build trustworthy multicentric datasets for assessing generalizability of performance, thus accelerating the clinical translation of deep learning models for surgical video analysis.
    iWave3D: End-to-end Brain Image Compression with Trainable 3-D Wavelet Transform. (arXiv:2109.08942v1 [eess.IV])
    (2 min) With the rapid development of whole brain imaging technology, a large number of brain images have been produced, which creates a great demand for efficient brain image compression methods. At present, the most commonly used compression methods are all based on the 3-D wavelet transform, such as JP3D. However, traditional 3-D wavelet transforms are designed manually with certain assumptions on the signal, but brain images are not as ideal as assumed. Moreover, they are not directly optimized for the compression task. In order to solve these problems, we propose a trainable 3-D wavelet transform based on the lifting scheme, in which the predict and update steps are replaced by 3-D convolutional neural networks. Then the proposed transform is embedded into an end-to-end compression scheme called iWave3D, which is trained on a large number of brain images to directly minimize the rate-distortion loss. Experimental results demonstrate that our method outperforms JP3D significantly by 2.012 dB in terms of average BD-PSNR.
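    The lifting scheme the abstract refers to (split, predict, update) has a compact generic form. Below is a minimal 1-D sketch in Python, where the linear `predict` and `update` callables are placeholders for the 3-D CNNs that iWave3D learns; it shows the structure of a lifting step, not the paper's network.

        import numpy as np

        def lifting_forward(x, predict, update):
            """One lifting step: split into even/odd samples, predict the odd
            samples from the even ones, then update the evens with the residual."""
            even, odd = x[0::2], x[1::2]
            detail = odd - predict(even)    # in iWave3D, predict is a 3-D CNN
            approx = even + update(detail)  # in iWave3D, update is a 3-D CNN
            return approx, detail

        def lifting_inverse(approx, detail, predict, update):
            even = approx - update(detail)
            odd = detail + predict(even)
            x = np.empty(even.size + odd.size)
            x[0::2], x[1::2] = even, odd
            return x

        # Linear (Haar-like) stand-ins for the learned predict/update networks.
        predict = lambda even: even
        update = lambda detail: 0.5 * detail

        signal = np.arange(8, dtype=float)
        a, d = lifting_forward(signal, predict, update)
        assert np.allclose(lifting_inverse(a, d, predict, update), signal)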
    Small Lesion Segmentation in Brain MRIs with Subpixel Embedding. (arXiv:2109.08791v1 [eess.IV])
    (2 min) We present a method to segment MRI scans of the human brain into ischemic stroke lesion and normal tissues. We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network. Our embedding network learns features that can resolve detailed structures in the brain without the need for high-resolution training images, which are often unavailable and expensive to acquire. Alternatively, the encoder-decoder learns global structures by means of striding and max pooling. Our embedding network complements the encoder-decoder architecture by guiding the decoder with fine-grained details lost to spatial downsampling during the encoder stage. Unlike previous works, our decoder outputs at 2 times the input resolution, where a single pixel in the input resolution is predicted by four neighboring subpixels in our output. To obtain the output at the original scale, we propose a learnable downsampler (as opposed to hand-crafted ones e.g. bilinear) that combines subpixel predictions. Our approach improves the baseline architecture by approximately 11.7% and achieves the state of the art on the ATLAS public benchmark dataset with a smaller memory footprint and faster runtime than the best competing method. Our source code has been made available at: https://github.com/alexklwong/subpixel-embedding-segmentation.
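    To make the "four neighboring subpixels per input pixel" idea concrete, here is a minimal sketch in Python of combining a 2x-resolution prediction back to the original scale with per-location weights; the softmax-normalized random weights stand in for what the paper's learnable downsampler would produce, and the exact form of that module is an assumption here.

        import numpy as np

        def learnable_downsample(pred_2x, weights):
            """Combine each 2x2 block of subpixel predictions into one output pixel.

            pred_2x : (2H, 2W) prediction at twice the input resolution.
            weights : (H, W, 4) per-pixel combination weights (sum to 1 per pixel),
                      a stand-in for the learnable downsampler described above.
            """
            H, W = pred_2x.shape[0] // 2, pred_2x.shape[1] // 2
            blocks = pred_2x.reshape(H, 2, W, 2).transpose(0, 2, 1, 3).reshape(H, W, 4)
            return (blocks * weights).sum(axis=-1)

        rng = np.random.default_rng(0)
        pred = rng.random((8, 8))
        logits = rng.standard_normal((4, 4, 4))
        w = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        out = learnable_downsample(pred, w)      # (4, 4), original resolution
        print(out.shape)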
    A Studious Approach to Semi-Supervised Learning. (arXiv:2109.08924v1 [cs.CV])
    (2 min) The problem of learning from few labeled examples while using large amounts of unlabeled data has been approached by various semi-supervised methods. Although these methods can achieve superior performance, the models are often not deployable due to the large number of parameters. This paper is an ablation study of distillation in a semi-supervised setting, which not only reduces the number of model parameters but does so while improving performance over the baseline supervised model and improving generalization. After the supervised pretraining, the network is used as a teacher model, and a student network is trained over the soft labels that the teacher model generates over the entire unlabeled data. We find that the fewer the labels, the more this approach benefits from a smaller student network. This brings forward the potential of distillation as an effective solution to enhance performance in semi-supervised computer vision tasks while maintaining deployability.
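    The distillation step described above is usually implemented as a cross-entropy between the teacher's soft labels and the student's predictions. The sketch below uses the standard temperature-scaled formulation; the temperature value and loss weighting are assumptions, not values taken from the paper.

        import numpy as np

        def softmax(z, T=1.0):
            z = z / T
            z = z - z.max(axis=-1, keepdims=True)
            e = np.exp(z)
            return e / e.sum(axis=-1, keepdims=True)

        def distillation_loss(student_logits, teacher_logits, T=2.0):
            """Cross-entropy between teacher soft labels and student predictions
            over a batch of unlabeled examples (standard KD formulation)."""
            p_teacher = softmax(teacher_logits, T)
            p_student = softmax(student_logits, T)
            return -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))

        rng = np.random.default_rng(0)
        teacher = rng.standard_normal((16, 10))   # teacher logits on an unlabeled batch
        student = rng.standard_normal((16, 10))   # smaller student's logits
        print(distillation_loss(student, teacher))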
    Efficient Urban-scale Point Clouds Segmentation with BEV Projection. (arXiv:2109.09074v1 [cs.CV])
    (2 min) Point cloud analysis has attracted increasing research attention in recent years, yet 3D semantic segmentation remains a challenge. Most deep point cloud models learn directly on 3D point clouds and therefore suffer from severe sparsity and an extreme data processing load on urban-scale data. To tackle this challenge, we propose to transfer the 3D point clouds to a dense bird's-eye-view projection. In this case, the segmentation task is simplified because class imbalance is reduced and various 2D segmentation methods can be leveraged. We further design an attention-based fusion network that can conduct multi-modal learning on the projected images. Finally, the 2D outputs are remapped to generate 3D semantic segmentation results. To demonstrate the benefits of our method, we conduct various experiments on the SensatUrban dataset, in which our model presents competitive evaluation results (61.17% mIoU and 91.37% Overall Accuracy). We hope our work can inspire further exploration in point cloud analysis.
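    The projection step the abstract relies on can be sketched very compactly. The Python snippet below rasterizes (x, y, z) points into a dense bird's-eye-view height map; the cell size, grid extent, and the choice of maximum height as the per-cell feature are illustrative assumptions, not the paper's configuration.

        import numpy as np

        def points_to_bev(points, cell=0.5, grid=(64, 64)):
            """Rasterize (x, y, z) points into a bird's-eye-view height map.

            Each point falls into a (cell x cell) ground-plane bin; we keep the
            maximum z per bin so the result is a dense 2D image."""
            bev = np.full(grid, -np.inf)
            ix = np.clip((points[:, 0] / cell).astype(int), 0, grid[0] - 1)
            iy = np.clip((points[:, 1] / cell).astype(int), 0, grid[1] - 1)
            np.maximum.at(bev, (ix, iy), points[:, 2])
            bev[np.isinf(bev)] = 0.0            # empty cells
            return bev

        rng = np.random.default_rng(0)
        cloud = rng.random((1000, 3)) * [32.0, 32.0, 5.0]
        print(points_to_bev(cloud).shape)       # (64, 64) image ready for 2D segmentation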
    A survey on deep learning approaches for breast cancer diagnosis. (arXiv:2109.08853v1 [eess.IV])
    (2 min) Deep learning has introduced several learning-based methods to recognize breast tumours and presents high applicability in breast cancer diagnostics. It has presented itself as a practical component of Computer-Aided Diagnostic (CAD) systems to further assist radiologists in diagnostics for different modalities. A deep learning network trained on images provided by hospitals or public databases can perform classification, detection, and segmentation of lesion types. Significant progress has been made in recognizing tumours in 2D images, but recognizing them in 3D images remains a frontier so far. The interconnection of deep learning networks across different fields of study helps propel discoveries toward more efficient, accurate, and robust networks. In this review paper, the following topics will be explored: (i) theory and application of deep learning, (ii) progress of 2D, 2.5D, and 3D CNN approaches in breast tumour recognition from a performance metric perspective, and (iii) challenges faced in CNN approaches.
    SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction. (arXiv:2109.08963v1 [cs.CV])
    (2 min) Although transformers have achieved great progress on computer vision tasks, scale variation in dense image prediction is still a key challenge. Few effective multi-scale techniques have been applied in transformers, and there are two main limitations in current methods. On the one hand, the self-attention module in the vanilla transformer fails to sufficiently exploit the diversity of semantic information because of its rigid mechanism. On the other hand, it is hard to build attention and interaction among different levels due to the heavy computational burden. To alleviate this problem, we first revisit the multi-scale problem in dense prediction, verifying the significance of diverse semantic representation and multi-scale interaction, and exploring the adaptation of the transformer to a pyramidal structure. Inspired by these findings, we propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI) and Attention Refinement Function (ARF). ISP explores the semantic diversity in different receptive spaces. CDI builds global attention and interaction among different levels in a decoupled space, which also solves the problem of heavy computation. Besides, ARF is further added to refine the attention in the transformer. Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state of the art by a significant margin in dense image prediction tasks. Furthermore, the proposed components are all plug-and-play and can be embedded in other methods.
    Ontology-based n-ball Concept Embeddings Informing Few-shot Image Classification. (arXiv:2109.09063v1 [cs.CV])
    (2 min) We propose a novel framework named ViOCE that integrates ontology-based background knowledge in the form of $n$-ball concept embeddings into a neural network based vision architecture. The approach consists of two components - converting symbolic knowledge of an ontology into continuous space by learning n-ball embeddings that capture properties of subsumption and disjointness, and guiding the training and inference of a vision model using the learnt embeddings. We evaluate ViOCE using the task of few-shot image classification, where it demonstrates superior performance on two standard benchmarks.
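    The subsumption and disjointness properties that the n-ball embeddings are trained to capture reduce to simple geometric tests once a ball is represented as a center and a radius. The sketch below shows those standard checks in Python; the training procedure that produces such balls, and the toy concept values, are not from the paper.

        import numpy as np

        def subsumes(center_a, radius_a, center_b, radius_b):
            """Ball A subsumes ball B iff B lies entirely inside A."""
            return np.linalg.norm(center_a - center_b) + radius_b <= radius_a

        def disjoint(center_a, radius_a, center_b, radius_b):
            """Balls are disjoint iff their centers are farther apart than the sum of radii."""
            return np.linalg.norm(center_a - center_b) >= radius_a + radius_b

        animal = (np.array([0.0, 0.0]), 2.0)    # toy concept balls, for illustration only
        dog    = (np.array([0.5, 0.0]), 0.5)
        rock   = (np.array([5.0, 0.0]), 1.0)
        print(subsumes(*animal, *dog))          # True: "dog" is subsumed by "animal"
        print(disjoint(*animal, *rock))         # True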
    Object Tracking by Jointly Exploiting Frame and Event Domain. (arXiv:2109.09052v1 [cs.CV])
    (2 min) Inspired by the complementarity between conventional frame-based and bio-inspired event-based cameras, we propose a multi-modal approach to fuse visual cues from the frame and event domains to enhance single-object tracking performance, especially in degraded conditions (e.g., scenes with high dynamic range, low light, and fast-motion objects). The proposed approach can effectively and adaptively combine meaningful information from both domains. The effectiveness of our approach is enforced by a newly designed cross-domain attention scheme, which enhances features based on self- and cross-domain attention; its adaptiveness is guarded by a specially designed weighting scheme, which adaptively balances the contribution of the two domains. To exploit event-based visual cues in single-object tracking, we construct a large-scale frame-event-based dataset, which we subsequently employ to train a novel frame-event fusion model. Extensive experiments show that the proposed approach outperforms state-of-the-art frame-based tracking methods by at least 10.4% and 11.9% in terms of representative success rate and precision rate, respectively. Besides, the effectiveness of each key component of our approach is evidenced by our thorough ablation study.
    V-SlowFast Network for Efficient Visual Sound Separation. (arXiv:2109.08867v1 [cs.CV])
    (2 min) The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new lightweight yet efficient three-stream framework, V-SlowFast, that operates on the Visual frame, Slow spectrogram, and Fast spectrogram. The Slow spectrogram captures the coarse temporal resolution while the Fast spectrogram contains the fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms the previous state of the art in single-frame-based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture variant, which achieves a 74.2% reduction in the number of model parameters and an 81.4% reduction in GMACs compared to the previous multi-stage models. Project page: https://ly-zhu.github.io/V-SlowFast.
    Unsupervised Domain Adaptation for Semantic Segmentation via Low-level Edge Information Transfer. (arXiv:2109.08912v1 [cs.CV])
    (2 min) Unsupervised domain adaptation for semantic segmentation aims to make models trained on synthetic data (source domain) adapt to real images (target domain). Previous feature-level adversarial learning methods only consider adapting models on the high-level semantic features. However, the large domain gap between source and target domains in the high-level semantic features makes accurate adaptation difficult. In this paper, we present the first attempt at explicitly using low-level edge information, which has a small inter-domain gap, to guide the transfer of semantic information. To this end, a semantic-edge domain adaptation architecture is proposed, which uses an independent edge stream to process edge information, thereby generating high-quality semantic boundaries over the target domain. Then, an edge consistency loss is presented to align target semantic predictions with produced semantic boundaries. Moreover, we further propose two entropy reweighting methods for semantic adversarial learning and self-supervised learning, respectively, which can further enhance the adaptation performance of our architecture. Comprehensive experiments on two UDA benchmark datasets demonstrate the superiority of our architecture compared with state-of-the-art methods.
    Efficient Hybrid Transformer: Learning Global-local Context for Urban Sence Segmentation. (arXiv:2109.08937v1 [cs.CV])
    (2 min) Semantic segmentation of fine-resolution urban scene images plays a vital role in extensive practical applications, such as land cover mapping, urban change detection, environmental protection and economic assessment. Driven by rapid developments in deep learning technologies, convolutional neural networks (CNNs) have dominated the semantic segmentation task for many years. Convolutional neural networks adopt hierarchical feature representations and are strong at local context extraction. However, the local property of the convolution layer limits the network from capturing global information that is crucial for improving fine-resolution image segmentation. Recently, Transformers have become a hot topic in the computer vision domain. The Vision Transformer demonstrates a great capability for global information modelling, boosting many vision tasks, such as image classification, object detection and especially semantic segmentation. In this paper, we propose an efficient hybrid Transformer (EHT) for semantic segmentation of urban scene images. EHT takes advantage of both CNNs and the Transformer, learning global-local context to strengthen the feature representation. Extensive experiments demonstrate that EHT has higher efficiency with competitive accuracy compared with state-of-the-art benchmark methods. Specifically, the proposed EHT achieves a 67.0% mIoU on the UAVid test set and outperforms other lightweight models significantly. The code will be available soon.
    Random Multi-Channel Image Synthesis for Multiplexed Immunofluorescence Imaging. (arXiv:2109.09004v1 [eess.IV])
    (2 min) Multiplex immunofluorescence (MxIF) is an emerging imaging technique that provides the high sensitivity and specificity of single-cell mapping. With a tenet of 'seeing is believing', MxIF enables iteratively staining and imaging an extensive panel of antibodies, which provides comprehensive biomarkers to segment and group different cells on a single tissue section. However, considerable depletion of the scarce tissue is inevitable from extensive rounds of staining and bleaching ('missing tissue'). Moreover, the immunofluorescence (IF) imaging can globally fail for particular rounds ('missing stain'). In this work, we focus on the 'missing stain' issue. It would be appealing to develop digital image synthesis approaches to restore missing stain images without losing more tissue physically. Herein, we aim to develop image synthesis approaches for eleven MxIF structural molecular markers (i.e., epithelial and stromal) on real samples. We propose a novel multi-channel high-resolution image synthesis approach, called pixN2N-HD, to tackle possible missing stain scenarios via a high-resolution generative adversarial network (GAN). Our contribution is three-fold: (1) a single deep network framework is proposed to tackle missing stain in MxIF; (2) the proposed 'N-to-N' strategy reduces a theoretical four years of computational time to 20 hours when covering all possible missing-stain scenarios, with up to five missing stains (e.g., '(N-1)-to-1', '(N-2)-to-2'); and (3) this work is the first comprehensive experimental study investigating cross-stain synthesis in MxIF. Our results elucidate a promising direction of advancing MxIF imaging with deep image synthesis.
    Simple and Efficient Unpaired Real-world Super-Resolution using Image Statistics. (arXiv:2109.09071v1 [eess.IV])
    (2 min) Learning a super-resolution (SR) network without paired low-resolution (LR) and high-resolution (HR) images is difficult because direct supervision through the corresponding HR counterpart is unavailable. Recently, many real-world SR studies take advantage of the unpaired image-to-image translation technique. That is, they use two or more generative adversarial networks (GANs), each of which translates images from one domain to another, e.g., from the HR domain to the LR domain. However, it is not easy to stably learn such a translation with GANs using unpaired data. In this study, we present a simple and efficient method for training a real-world SR network. To stably train the network, we use statistics of image patches, such as means and variances. Our real-world SR framework consists of two GANs, one for translating HR images to LR images (degradation task) and the other for translating LR to HR (SR task). We argue that the unpaired image translation using GANs can be learned efficiently with our proposed data sampling strategy, namely, variance matching. We test our method on the NTIRE 2020 real-world SR dataset. Our method outperforms the current state-of-the-art method in terms of the SSIM metric and produces comparable results on the LPIPS metric.
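    The abstract names the sampling strategy (variance matching) without spelling it out, so the following Python sketch is only a plausible reading: pair unpaired HR and LR patches whose pixel variances are close. The patch sizes and the matching threshold are illustrative assumptions, not the paper's values.

        import numpy as np

        def sample_variance_matched_pairs(hr_patches, lr_patches, tol=0.05):
            """Pair unpaired HR and LR patches whose pixel variances are close.

            For every HR patch we keep the LR patch with the nearest variance,
            provided the gap is below `tol` (an illustrative threshold)."""
            hr_var = np.array([p.var() for p in hr_patches])
            lr_var = np.array([p.var() for p in lr_patches])
            pairs = []
            for i, v in enumerate(hr_var):
                j = int(np.argmin(np.abs(lr_var - v)))
                if abs(lr_var[j] - v) < tol:
                    pairs.append((i, j))
            return pairs

        rng = np.random.default_rng(0)
        hr = [rng.random((32, 32)) for _ in range(8)]
        lr = [rng.random((8, 8)) for _ in range(8)]
        print(sample_variance_matched_pairs(hr, lr))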
    HYouTube: Video Harmonization Dataset. (arXiv:2109.08809v1 [cs.CV])
    (2 min) Video composition aims to generate a composite video by combining the foreground of one video with the background of another video, but the inserted foreground may be incompatible with the background in terms of color and illumination. Video harmonization aims to adjust the foreground of a composite video to make it compatible with the background. So far, video harmonization has only received limited attention and there is no public dataset for video harmonization. In this work, we construct a new video harmonization dataset HYouTube by adjusting the foreground of real videos to create synthetic composite videos. Considering the domain gap between real composite videos and synthetic composite videos, we additionally create 100 real composite videos via copy-and-paste. Datasets are available at https://github.com/bcmi/Video-Harmonization-Dataset-HYouTube.
    Domain Composition and Attention for Unseen-Domain Generalizable Medical Image Segmentation. (arXiv:2109.08852v1 [eess.IV])
    (2 min) Domain generalizable models are attracting increasing attention in medical image analysis since data is commonly acquired from different institutes with various imaging protocols and scanners. To tackle this challenging domain generalization problem, we propose a Domain Composition and Attention-based network (DCA-Net) to improve the ability of domain representation and generalization. First, we present a domain composition method that represents a certain domain by a linear combination of a set of basis representations (i.e., a representation bank). Second, a novel plug-and-play parallel domain preceptor is proposed to learn these basis representations, and we introduce a divergence constraint function to encourage the basis representations to be as divergent as possible. Then, a domain attention module is proposed to learn the linear combination coefficients of the basis representations. The result of the linear combination is used to calibrate the feature maps of an input image, which enables the model to generalize to different and even unseen domains. We validate our method on a public prostate MRI dataset acquired from six different institutions with apparent domain shift. Experimental results show that our proposed model can generalize well to different and even unseen domains, and it outperforms state-of-the-art methods on the multi-domain prostate segmentation task.
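    The composition-and-calibration idea stated above (attention coefficients over a basis bank, then feature calibration) can be sketched in a few lines. In the Python sketch below, the attention module that produces the logits is not shown, and the channel-wise scaling used for calibration is an assumption about the calibration form, not taken from the paper.

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def domain_calibrate(feature_map, basis_bank, attn_logits):
            """Calibrate features with an attention-weighted mix of basis vectors.

            feature_map : (C, H, W) input features.
            basis_bank  : (K, C) bank of basis domain representations.
            attn_logits : (K,) scores from a (not shown) domain attention module."""
            coeffs = softmax(attn_logits)          # linear combination coefficients
            domain_vec = coeffs @ basis_bank       # (C,) composed domain representation
            return feature_map * domain_vec[:, None, None]

        rng = np.random.default_rng(0)
        feats = rng.random((16, 8, 8))
        bank = rng.random((4, 16))
        print(domain_calibrate(feats, bank, rng.standard_normal(4)).shape)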
    Learning to Regrasp by Learning to Place. (arXiv:2109.08817v1 [cs.RO])
    (2 min) In this paper, we explore whether a robot can learn to regrasp a diverse set of objects to achieve various desired grasp poses. Regrasping is needed whenever a robot's current grasp pose fails to perform desired manipulation tasks. Endowing robots with such an ability has applications in many domains such as manufacturing or domestic services. Yet, it is a challenging task due to the large diversity of geometry in everyday objects and the high dimensionality of the state and action space. In this paper, we propose a system for robots to take partial point clouds of an object and the supporting environment as inputs and output a sequence of pick-and-place operations to transform an initial object grasp pose to the desired object grasp poses. The key techniques include a neural stable-placement predictor and a regrasp-graph-based solution that leverages and changes the surrounding environment. We introduce a new and challenging synthetic dataset for learning and evaluating the proposed approach. On this dataset, we show that our system is able to achieve a 73.3% success rate in regrasping diverse objects.
    S$^3$VAADA: Submodular Subset Selection for Virtual Adversarial Active Domain Adaptation. (arXiv:2109.08901v1 [cs.LG])
    (2 min) Unsupervised domain adaptation (DA) methods have focused on achieving maximal performance through aligning features from source and target domains without using labeled data in the target domain. However, in real-world scenarios it might be feasible to get labels for a small proportion of the target data. In these scenarios, it is important to select maximally-informative samples to label and find an effective way to combine them with the existing knowledge from source data. Towards achieving this, we propose S$^3$VAADA which i) introduces a novel submodular criterion to select a maximally informative subset to label and ii) enhances a cluster-based DA procedure through novel improvements to effectively utilize all the available data for improving generalization on the target. Our approach consistently outperforms the competing state-of-the-art approaches on datasets with varying degrees of domain shifts.
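    Submodular subset selection is typically solved greedily, which carries the usual (1 - 1/e) approximation guarantee for monotone submodular objectives. The sketch below uses facility-location coverage as a generic stand-in for the paper's criterion, which is not reproduced here.

        import numpy as np

        def greedy_submodular_select(similarity, budget):
            """Greedy facility-location selection of `budget` informative points.

            similarity : (N, N) pairwise similarity matrix over the unlabeled pool."""
            n = similarity.shape[0]
            selected, coverage = [], np.zeros(n)
            for _ in range(budget):
                # marginal gain of adding each candidate to the current selection
                gains = np.maximum(similarity, coverage[None, :]).sum(axis=1) - coverage.sum()
                gains[selected] = -np.inf           # never pick the same point twice
                best = int(np.argmax(gains))
                selected.append(best)
                coverage = np.maximum(coverage, similarity[best])
            return selected

        rng = np.random.default_rng(0)
        feats = rng.random((50, 8))
        sim = feats @ feats.T
        print(greedy_submodular_select(sim, budget=5))   # indices to send for labeling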
    Clean-label Backdoor Attack against Deep Hashing based Retrieval. (arXiv:2109.08868v1 [cs.CV])
    (2 min) Deep hashing has become a popular method in large-scale image retrieval due to its computational and storage efficiency. However, recent works raise security concerns about deep hashing. Although existing works focus on the vulnerability of deep hashing in terms of adversarial perturbations, we identify a more pressing threat, the backdoor attack, which arises when the attacker has access to the training data. A backdoored deep hashing model behaves normally on original query images, while returning images with the target label when the trigger is present, which makes the attack hard to detect. In this paper, we uncover this security concern by utilizing clean-label data poisoning. To the best of our knowledge, this is the first attempt at a backdoor attack against deep hashing models. To craft the poisoned images, we first generate the targeted adversarial patch as the backdoor trigger. Furthermore, we propose confusing perturbations to disturb the hashing code learning, such that the hashing model can learn more about the trigger. The confusing perturbations are imperceptible and generated by dispersing the images with the target label in the Hamming space. We have conducted extensive experiments to verify the efficacy of our backdoor attack under various settings. For instance, it can achieve 63% targeted mean average precision on ImageNet under a 48-bit code length with only 40 poisoned images.
    FastHyMix: Fast and Parameter-free Hyperspectral Image Mixed Noise Removal. (arXiv:2109.08879v1 [eess.IV])
    (2 min) Hyperspectral imaging with high spectral resolution plays an important role in finding objects, identifying materials, or detecting processes. The decrease of the widths of spectral bands leads to a decrease in the signal-to-noise ratio (SNR) of measurements. The decreased SNR reduces the reliability of measured features or information extracted from HSIs. Furthermore, the image degradations linked with various mechanisms also result in different types of noise, such as Gaussian noise, impulse noise, deadlines, and stripes. This paper introduces a fast and parameter-free hyperspectral image mixed noise removal method (termed FastHyMix), which characterizes the complex distribution of mixed noise by using a Gaussian mixture model and exploits two main characteristics of hyperspectral data, namely low-rankness in the spectral domain and high correlation in the spatial domain. The Gaussian mixture model enables us to make a good estimation of Gaussian noise intensity and the location of sparse noise. The proposed method takes advantage of the low-rankness using subspace representation and the spatial correlation of HSIs by adding a powerful deep image prior, which is extracted from a neural denoising network. An exhaustive array of experiments and comparisons with state-of-the-art denoisers were carried out. The experimental results show significant improvement in both synthetic and real datasets. A MATLAB demo of this work will be available at https://github.com/LinaZhuang for the sake of reproducibility.
    Homogeneous and Heterogeneous Relational Graph for Visible-infrared Person Re-identification. (arXiv:2109.08811v1 [cs.CV])
    (2 min) Visible-infrared person re-identification (VI Re-ID) aims to match person images between the visible and infrared modalities. Existing VI Re-ID methods mainly focus on extracting homogeneous structural relationships from a single image, while ignoring the heterogeneous correlation between cross-modality images. The homogeneous and heterogeneous structured relationships are crucial to learning effective identity representation and cross-modality matching. In this paper, we separately model the homogeneous structural relationship by a modality-specific graph within each individual modality and then mine the heterogeneous structural correlation in these two modality-specific graphs. First, the homogeneous structured graph (HOSG) mines one-vs.-rest relations between an arbitrary node (local feature) and all the remaining nodes within a visible or infrared image to learn effective identity representation. Second, to find cross-modality identity-consistent correspondence, the heterogeneous graph alignment module (HGAM) further measures the relational edge strength by route search between two-modality local node features. Third, we propose the cross-modality cross-correlation (CMCC) loss to extract the modality invariance in the heterogeneous global graph representation. CMCC computes the mutual information between modalities and expels semantic redundancy. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate that our method outperforms the state of the art with gains of 13.73% and 9.45% in Rank-1/mAP. The code is available at https://github.com/fegnyujian/Homogeneous-and-Heterogeneous-Relational-Graph.
    Towards High-Quality Temporal Action Detection with Sparse Proposals. (arXiv:2109.08847v1 [cs.CV])
    (2 min) Temporal Action Detection (TAD) is an essential and challenging topic in video understanding, aiming to localize the temporal segments containing human action instances and predict the action categories. Previous works rely heavily upon dense candidates, either by designing varying anchors or by enumerating all combinations of boundaries on video sequences; therefore, they involve complicated pipelines and sensitive hand-crafted designs. Recently, with the resurgence of the Transformer, query-based methods have become rising solutions for their simplicity and flexibility. However, there still exists a performance gap between query-based methods and well-established methods. In this paper, we identify that the main challenge lies in the large variation of action durations and the ambiguous boundaries of short action instances; moreover, the quadratic cost of global attention prevents query-based methods from building multi-scale feature maps. Towards high-quality temporal action detection, we introduce Sparse Proposals to interact with the hierarchical features. In our method, named SP-TAD, each proposal attends to a local segment feature in the temporal feature pyramid. The local interaction enables utilization of high-resolution features to preserve action instance details. Extensive experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds. For example, we achieve state-of-the-art performance on THUMOS14 (45.7% on mAP@0.6, 33.4% on mAP@0.7 and 53.5% on mAP@Avg) and competitive results on ActivityNet-1.3 (32.99% on mAP@Avg). Code will be made available at https://github.com/wjn922/SP-TAD.
    Measuring the rogue wave pattern triggered from Gaussian perturbations by deep learning. (arXiv:2109.08909v1 [cs.CV])
    (2 min) Weak Gaussian perturbations on a plane wave background can trigger numerous rogue waves (RWs) due to modulational instability. Numerical simulations showed that these rogue waves seemed to have a similar unit structure. However, to the best of our knowledge, there is no related result proving that these rogue waves have similar patterns for different perturbations, partly because it is hard to measure the rogue-wave pattern automatically. In this work, we address these problems from the perspective of computer vision using deep neural networks. We propose a Rogue Wave Detection Network (RWD-Net) model to automatically and accurately detect RWs in the images, which directly indicates that they share similar computer-vision patterns. For this purpose, we have also designed the related dataset, termed the Rogue Wave Dataset-10K (RWD-10K), which has 10,191 RW images with bounding-box annotations for each RW unit. In our detection experiments, we obtain 99.29% average precision on the test split of the RWD-10K dataset. Finally, we derive our novel metric, the density of RW units (DRW), to characterize the evolution of Gaussian perturbations and obtain statistical results on them.
    SpeechNAS: Towards Better Trade-off between Latency and Accuracy for Large-Scale Speaker Verification. (arXiv:2109.08839v1 [cs.SD])
    (2 min) Recently, the x-vector has been a successful and popular approach for speaker verification; it employs a time-delay neural network (TDNN) and statistics pooling to extract a speaker-characterizing embedding from variable-length utterances. Improvement upon the x-vector has been an active research area, and numerous neural networks have been elaborately designed based on the x-vector, e.g., extended TDNN (E-TDNN), factorized TDNN (F-TDNN), and densely connected TDNN (D-TDNN). In this work, we try to identify the optimal architectures from a TDNN-based search space by employing neural architecture search (NAS), named SpeechNAS. Leveraging recent advances in speaker recognition, such as high-order statistics pooling, the multi-branch mechanism, D-TDNN, and the angular additive margin softmax (AAM) loss with a minimum hyper-spherical energy (MHE), SpeechNAS automatically discovers five network architectures, from SpeechNAS-1 to SpeechNAS-5, with various numbers of parameters and GFLOPs on the large-scale text-independent speaker recognition dataset VoxCeleb1. Our derived best neural network achieves an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1, which surpasses previous TDNN-based state-of-the-art approaches by a large margin. Code and trained weights are available at https://github.com/wentaozhu/speechnas.git
    Memory Regulation and Alignment toward Generalizer RGB-Infrared Person. (arXiv:2109.08843v1 [cs.CV])
    (2 min) The domain shift, arising from a non-negligible modality gap and non-overlapping identity classes between training and test sets, is a major issue in RGB-Infrared person re-identification. A key to tackling the inherent domain-shift issue is to enforce that the data distributions of the two domains be similar. However, RGB-IR ReID always demands discriminative features, leading to over-reliance on the feature sensitivity of seen classes, e.g., via attention-based feature alignment or metric learning. Therefore, predicting the unseen query category from predefined training classes may not be accurate and leads to a sub-optimal adversarial gradient. In this paper, we uncover this in a more explainable way and propose a novel multi-granularity memory regulation and alignment module (MG-MRA) to solve this issue. By explicitly incorporating a latent variable attribute, from fine-grained to coarse semantic granularity, into intermediate features, our method could alleviate the over-confidence of the model about discriminative features of seen classes. Moreover, instead of matching discriminative features by traversing nearest neighbors, sparse attributes, i.e., global structural patterns, are recollected with respect to features and assigned to measure pair-wise image similarity in hashing. Extensive experiments on RegDB and SYSU-MM01 show the superiority of the proposed method, which outperforms existing state-of-the-art methods. Our code is available at https://github.com/Chenfeng1271/MGMRA.
    The Report on China-Spain Joint Clinical Testing for Rapid COVID-19 Risk Screening by Eye-region Manifestations. (arXiv:2109.08807v1 [eess.IV])
    (3 min) Background: The worldwide surge in coronavirus cases has led to a surge in COVID-19 testing demand. Rapid, accurate, and cost-effective COVID-19 screening tests working at a population level are in imperative demand globally. Methods: Based on the eye symptoms of COVID-19, we developed and tested a COVID-19 rapid prescreening model using the eye-region images captured in China and Spain with cellphone cameras. The convolutional neural network (CNN)-based model was trained on these eye images to complete the binary classification task of identifying COVID-19 cases. The performance was measured using the area under the receiver-operating-characteristic curve (AUC), sensitivity, specificity, accuracy, and F1. The application programming interface was open access. Findings: The multicenter study included 2436 pictures corresponding to 657 subjects (155 COVID-19 infections, 23.6%) in the development dataset (train and validation) and 2138 pictures corresponding to 478 subjects (64 COVID-19 infections, 13.4%) in the test dataset. The image-level performance of the COVID-19 prescreening model in the China-Spain multicenter study achieved an AUC of 0.913 (95% CI, 0.898-0.927), with a sensitivity of 0.695 (95% CI, 0.643-0.748), a specificity of 0.904 (95% CI, 0.891-0.919), an accuracy of 0.875 (0.861-0.889), and an F1 of 0.611 (0.568-0.655). Interpretation: The CNN-based model for COVID-19 rapid prescreening has reliable specificity and sensitivity. This system provides a low-cost, fully self-performed, non-invasive, real-time feedback solution for continuous surveillance and large-scale rapid prescreening for COVID-19. Funding: This project is supported by Aimomics (Shanghai) Intelligent
    Self-Adaptive Partial Domain Adaptation. (arXiv:2109.08829v1 [cs.CV])
    (2 min) Partial Domain Adaptation (PDA) aims to solve a more practical cross-domain learning problem that assumes the target label space is a subset of the source label space. However, the mismatched label space causes significant negative transfer. A traditional solution is to use soft weights to increase the weights of the shared part of the source domain and reduce those of the source outlier part. But this still learns features of outliers and leads to negative transfer. The other mainstream idea is to separate the source domain into shared and outlier parts with hard binary weights, but this cannot correct entangled shared and outlier classes. In this paper, we propose an end-to-end Self-Adaptive Partial Domain Adaptation (SAPDA) Network. A class-weight evaluation mechanism is introduced to dynamically self-rectify the weights of shared, outlier, and confused classes, so that higher-confidence samples receive larger weights. This also greatly reduces the negative transfer caused by the label-space mismatch. Moreover, our strategy can efficiently measure the transferability of samples in a broader sense, so that our method can likewise achieve competitive results on the unsupervised DA task. A large number of experiments on multiple benchmarks demonstrate the effectiveness of our SAPDA.
    Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts. (arXiv:2109.08857v1 [cs.NE])
    (2 min) Evolutionary algorithms have been used in the digital art scene since the 1970s. A popular application of genetic algorithms is to optimize the procedural placement of vector graphic primitives to resemble a given painting. In recent years, deep learning-based approaches have also been proposed to generate procedural drawings, which can be optimized using gradient descent. In this work, we revisit the use of evolutionary algorithms for computational creativity. We find that modern evolution strategies (ES) algorithms, when tasked with the placement of shapes, offer large improvements in both quality and efficiency compared to traditional genetic algorithms, and even comparable to gradient-based methods. We demonstrate that ES is also well suited at optimizing the placement of shapes to fit the CLIP model, and can produce diverse, distinct geometric abstractions that are aligned with human interpretation of language. Videos and demo: https://es-clip.github.io/
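    To show the shape of such an ES loop, here is a self-contained toy in Python: it evolves the parameters of a few axis-aligned rectangles to match a target image using pixel MSE as the fitness. The rectangle renderer, the (mu, lambda)-style update, and the MSE fitness are all simplifications standing in for the paper's vector-graphic renderer and CLIP-based objective, which are not reproduced here.

        import numpy as np

        def render(params, size=32):
            """Draw axis-aligned grey rectangles; a toy stand-in for a real renderer."""
            canvas = np.zeros((size, size))
            for x0, y0, x1, y1, c in params.reshape(-1, 5):
                xa, xb = sorted((int(abs(x0)) % size, int(abs(x1)) % size))
                ya, yb = sorted((int(abs(y0)) % size, int(abs(y1)) % size))
                canvas[xa:xb + 1, ya:yb + 1] = np.clip(c, 0.0, 1.0)
            return canvas

        def evolve(target, n_shapes=8, pop=32, iters=200, sigma=2.0, seed=0):
            """Simple ES on shape parameters: sample a population around the mean,
            score each candidate, keep the best quarter, and recenter the mean."""
            rng = np.random.default_rng(seed)
            mean = rng.random(n_shapes * 5) * 32
            for _ in range(iters):
                cand = mean + sigma * rng.standard_normal((pop, mean.size))
                fit = np.array([-np.mean((render(c) - target) ** 2) for c in cand])
                elite = cand[np.argsort(fit)[-pop // 4:]]
                mean = elite.mean(axis=0)
            return mean

        target = np.zeros((32, 32)); target[8:24, 8:24] = 1.0
        best = evolve(target)
        print(np.mean((render(best) - target) ** 2))   # reconstruction error after evolution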
    ChipQA: No-Reference Video Quality Prediction via Space-Time Chips. (arXiv:2109.08726v1 [eess.IV])
    (2 min) We propose a new model for no-reference video quality assessment (VQA). Our approach uses a new idea of highly-localized space-time (ST) slices called Space-Time Chips (ST Chips). ST Chips are localized cuts of video data along directions that implicitly capture motion. We use perceptually-motivated bandpass and normalization models to first process the video data, and then select oriented ST Chips based on how closely they fit parametric models of natural video statistics. We show that the parameters that describe these statistics can be used to reliably predict the quality of videos, without the need for a reference video. The proposed method implicitly models ST video naturalness, and deviations from naturalness. We train and test our model on several large VQA databases, and show that our model achieves state-of-the-art performance at reduced cost, without requiring motion computation.
    Computational Imaging and Artificial Intelligence: The Next Revolution of Mobile Vision. (arXiv:2109.08880v1 [cs.CV])
    (2 min) Signal capture stands at the forefront of perceiving and understanding the environment, and thus imaging plays a pivotal role in mobile vision. Recent explosive progress in Artificial Intelligence (AI) has shown great potential to develop advanced mobile platforms with new imaging devices. Traditional imaging systems based on the "capturing images first and processing afterwards" mechanism cannot meet this unprecedented demand. In contrast, Computational Imaging (CI) systems are designed to capture high-dimensional data in an encoded manner to provide more information for mobile vision systems. Thanks to AI, CI can now be used in real systems by integrating deep learning algorithms into the mobile vision platform to achieve the closed loop of intelligent acquisition, processing and decision making, thus leading to the next revolution of mobile vision. Starting from the history of mobile vision using digital cameras, this work first introduces the advances of CI in diverse applications and then conducts a comprehensive review of current research topics combining CI and AI. Motivated by the fact that most existing studies only loosely connect CI and AI (usually using AI to improve the performance of CI, with only limited works deeply connecting them), in this work, we propose a framework to deeply integrate CI and AI by using the example of self-driving vehicles with high-speed communication, edge computing and traffic planning. Finally, we look ahead to the future of CI plus AI by investigating new materials, brain science and new computing techniques to shed light on new directions of mobile vision systems.
    Locally Weighted Mean Phase Angle (LWMPA) Based Tone Mapping Quality Index (TMQI-3). (arXiv:2109.08774v1 [eess.IV])
    (2 min) High Dynamic Range (HDR) images are the ones that contain a greater range of luminosity as compared to the standard images. HDR images have a higher detail and clarity of structure, objects, and color, which the standard images lack. HDR images are useful in capturing scenes that pose high brightness, darker areas, and shadows, etc. An HDR image comprises multiple narrow-range-exposure images combined into one high-quality image. As these HDR images cannot be displayed on standard display devices, the real challenge comes while converting these HDR images to Low dynamic range (LDR) images. The conversion of HDR image to LDR image is performed using Tone-mapped operators (TMOs). This conversion results in the loss of much valuable information in structure, color, naturalness, and exposures. The loss of information in the LDR image may not directly be visible to the human eye. To calculate how good an LDR image is after conversion, various metrics have been proposed previously. Some are not noise resilient, some work on separate color channels (Red, Green, and Blue one by one), and some lack capacity to identify the structure. To deal with this problem, we propose a metric in this paper called the Tone Mapping Quality Index (TMQI-3), which evaluates the quality of the LDR image based on its objective score. TMQI-3 is noise resilient, takes account of structure and naturalness, and works on all three color channels combined into one luminosity component. This eliminates the need to use multiple metrics at the same time. We compute results for several HDR and LDR images from the literature and show that our quality index metric performs better than the baseline models.
    WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data. (arXiv:2109.08770v1 [cs.CV])
    (2 min) The transition to green energy grids depends on detailed wind and solar forecasts to optimize the siting and scheduling of renewable energy generation. Operational forecasts from numerical weather prediction models, however, only have a spatial resolution of 10 to 20-km, which leads to sub-optimal usage and development of renewable energy farms. Weather scientists have been developing super-resolution methods to increase the resolution, but often rely on simple interpolation techniques or computationally expensive differential equation-based models. Recently, machine learning-based models, specifically the physics-informed resolution-enhancing generative adversarial network (PhIREGAN), have outperformed traditional downscaling methods. We provide a thorough and extensible benchmark of leading deep learning-based super-resolution techniques, including the enhanced super-resolution generative adversarial network (ESRGAN) and an enhanced deep super-resolution (EDSR) network, on wind and solar data. We accompany the benchmark with a novel public, processed, and machine learning-ready dataset for benchmarking super-resolution methods on wind and solar data.
    Auto White-Balance Correction for Mixed-Illuminant Scenes. (arXiv:2109.08750v1 [cs.CV])
    (2 min) Auto white balance (AWB) is applied by camera hardware at capture time to remove the color cast caused by the scene illumination. The vast majority of white-balance algorithms assume a single light source illuminates the scene; however, real scenes often have mixed lighting conditions. This paper presents an effective AWB method to deal with such mixed-illuminant scenes. A unique departure from conventional AWB, our method does not require illuminant estimation, as is the case in traditional camera AWB modules. Instead, our method proposes to render the captured scene with a small set of predefined white-balance settings. Given this set of rendered images, our method learns to estimate weighting maps that are used to blend the rendered images to generate the final corrected image. Through extensive experiments, we show this proposed method produces promising results compared to other alternatives for single- and mixed-illuminant scene color correction. Our source code and trained models are available at https://github.com/mahmoudnafifi/mixedillWB.
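    The final blending step described above (weighting maps applied to a small set of renderings under predefined WB settings) is easy to illustrate. In the Python sketch below, the weight maps are random placeholders standing in for the maps the method learns to estimate, and normalization to sum to one per pixel is an assumption about the blending form.

        import numpy as np

        def blend_wb_renderings(renderings, weight_maps):
            """Blend images rendered with fixed white-balance presets.

            renderings  : (K, H, W, 3) the same scene under K predefined WB settings.
            weight_maps : (K, H, W) per-pixel weights, normalized to sum to 1 per pixel."""
            w = weight_maps / (weight_maps.sum(axis=0, keepdims=True) + 1e-8)
            return (renderings * w[..., None]).sum(axis=0)

        rng = np.random.default_rng(0)
        imgs = rng.random((3, 64, 64, 3))    # e.g. daylight / tungsten / shade presets
        maps = rng.random((3, 64, 64))       # placeholder for estimated weighting maps
        print(blend_wb_renderings(imgs, maps).shape)   # (64, 64, 3) corrected image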
    Unsupervised View-Invariant Human Posture Representation. (arXiv:2109.08730v1 [cs.CV])
    (2 min) Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract a view-invariant 3D human pose representation from a 2D image without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints and their equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations for two downstream tasks. We perform comparative experiments that show improvements over the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also show the efficiency of transferring the learned representations from NTU RGB+D to obtain the first ever unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset, QMAR, and marginally improve on the state-of-the-art supervised results for this dataset. We also carry out ablation studies to examine the contributions of the different components of our proposed network.
    Self-supervised learning methods and applications in medical imaging analysis: A survey. (arXiv:2109.08685v1 [eess.IV])
    (2 min) The limited availability of high-quality annotated medical imaging datasets is a major problem that hinders machine learning applications in medical imaging analysis and impedes their advancement. Self-supervised learning is a recent training paradigm that enables learning robust representations without the need for human annotation, which can be considered an effective solution to the scarcity of annotated medical data. This article reviews the state-of-the-art research directions in self-supervised learning approaches for image data, with a focus on their applications in medical imaging analysis. The article covers a set of the most recent self-supervised learning methods from the computer vision field, as they are applicable to medical imaging analysis, and categorizes them as predictive, generative and contrastive approaches. Moreover, the article covers 40 of the most recent studies in the field of self-supervised learning in medical imaging analysis, aiming to shed light on recent innovations in the field. Ultimately, the article concludes with possible future research directions in the field.
    Segmentation of Brain MRI using an Altruistic Harris Hawks' Optimization algorithm. (arXiv:2109.08688v1 [eess.IV])
    (2 min) Segmentation is an essential requirement in medicine when digital images are used in illness diagnosis, especially in posterior tasks such as analysis and disease identification. Efficient segmentation of brain Magnetic Resonance Images (MRIs) is of prime concern to radiologists due to their poor illumination and other conditions related to the acquisition of the images. Thresholding is a popular method for segmentation that uses the histogram of an image to label different homogeneous groups of pixels into different classes. However, the computational cost increases exponentially with the number of thresholds. In this paper, we perform multi-level thresholding using an evolutionary metaheuristic, an improved version of the Harris Hawks Optimization (HHO) algorithm that combines chaotic initialization and the concept of altruism. Further, for fitness assignment we use a hybrid objective function in which, along with cross-entropy minimization, we apply a new entropy function and weight the two objective functions to form a new hybrid approach. The HHO was originally designed to solve numerical optimization problems. Earlier statistical results and comparisons demonstrated that HHO provides very promising results compared with well-established metaheuristic techniques. In this article, altruism has been incorporated into the HHO algorithm to enhance its exploitation capabilities. We evaluate the proposed method over 10 benchmark images from the WBA database of the Harvard Medical School and 8 benchmark images from the Brainweb dataset using some standard evaluation metrics.
    Asymmetric 3D Context Fusion for Universal Lesion Detection. (arXiv:2109.08684v1 [eess.IV])
    (2 min) Modeling 3D context is essential for high-performance 3D medical image analysis. Although 2D networks benefit from large-scale 2D supervised pretraining, they are weak in capturing 3D context. 3D networks are strong in 3D context yet lack supervised pretraining. As an emerging technique, the 3D context fusion operator, which enables conversion from 2D pretrained networks, leverages the advantages of both and has achieved great success. Existing 3D context fusion operators are designed to be spatially symmetric, i.e., performing identical operations on each 2D slice like convolutions. However, these operators are not truly equivariant to translation, especially when only a few 3D slices are used as inputs. In this paper, we propose a novel asymmetric 3D context fusion operator (A3D), which uses different weights to fuse 3D context from different 2D slices. Notably, A3D is NOT translation-equivariant while it significantly outperforms existing symmetric context fusion operators without introducing large computational overhead. We validate the effectiveness of the proposed method by extensive experiments on the DeepLesion benchmark, a large-scale public dataset for universal lesion detection from computed tomography (CT). The proposed A3D consistently outperforms symmetric context fusion operators by considerable margins, and establishes a new state of the art on DeepLesion. To facilitate open research, our code and model in PyTorch are available at https://github.com/M3DV/AlignShift.
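    The asymmetry the abstract highlights (different weights per 2D slice, rather than one shared weight) can be shown with a trivially small fusion. The Python sketch below fuses D slice features with per-slice scalar weights; the actual A3D operator is richer, so this is only an illustration of the asymmetric weighting, not the paper's operator.

        import numpy as np

        def asymmetric_slice_fusion(volume_feats, slice_weights):
            """Fuse 3D context from D slices using per-slice (non-shared) weights.

            volume_feats  : (D, C, H, W) features of D consecutive 2D slices.
            slice_weights : (D,) one weight per slice; a spatially symmetric operator
                            would apply the same weight to every slice."""
            w = slice_weights[:, None, None, None]
            return (volume_feats * w).sum(axis=0)       # (C, H, W) fused context

        rng = np.random.default_rng(0)
        feats = rng.random((3, 8, 16, 16))              # three input slices
        print(asymmetric_slice_fusion(feats, np.array([0.2, 0.6, 0.2])).shape)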
  • cs.IR updates on arXiv.org

    Tweet Sentiment Quantification: An Experimental Re-Evaluation. (arXiv:2011.08091v3 [cs.CL] UPDATED)
    (3 min) Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well known that solving quantification by means of "classify and count" (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We now re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and much more robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. The results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and they provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
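    To make the "classify and count" baseline and its standard correction concrete, here is a binary-class sketch in Python. The adjusted variant is the well-known ACC correction using the classifier's true/false positive rates; it is one example of the "more accurate quantification methods" the abstract alludes to, not necessarily one of the methods evaluated in the paper.

        import numpy as np

        def classify_and_count(pred_labels):
            """Naive prevalence estimate: fraction of items classified positive."""
            return float(np.mean(pred_labels))

        def adjusted_classify_and_count(pred_labels, tpr, fpr):
            """ACC correction: p = (cc - fpr) / (tpr - fpr), clipped to [0, 1].

            tpr / fpr are the classifier's true/false positive rates, typically
            estimated via cross-validation on the training data."""
            cc = classify_and_count(pred_labels)
            return float(np.clip((cc - fpr) / (tpr - fpr + 1e-12), 0.0, 1.0))

        preds = np.array([1, 0, 1, 1, 0, 0, 0, 1])      # classifier decisions on a sample
        print(classify_and_count(preds))                 # 0.5
        print(adjusted_classify_and_count(preds, tpr=0.8, fpr=0.2))  # corrected estimate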
    Understanding the Effectiveness of Reviews in E-commerce Top-N Recommendation. (arXiv:2106.09665v7 [cs.IR] UPDATED)
    (2 min) Modern E-commerce websites contain heterogeneous sources of information, such as numerical ratings, textual reviews and images. This information can be utilized to assist recommendation. Through textual reviews, a user explicitly expresses her affinity towards the item. Previous researchers found that by using the information extracted from these reviews, we can better profile the users' explicit preferences as well as the item features, leading to improved recommendation performance. However, most of the previous algorithms only utilized the review information for the explicit-feedback problem, i.e., rating prediction, and when it comes to implicit-feedback ranking problems such as top-N recommendation, the usage of review information has not been fully explored. Seeing this gap, in this work, we investigate the effectiveness of textual review information for top-N recommendation under E-commerce settings. We adapt several SOTA review-based rating prediction models for top-N recommendation tasks and compare them to existing top-N recommendation models in terms of both performance and efficiency. We find that models utilizing only review information cannot achieve better performance than a vanilla implicit-feedback matrix factorization method. When review information is utilized as a regularizer or as auxiliary information, the performance of the implicit-feedback matrix factorization method can be further improved. However, the optimal model structure to utilize textual reviews for E-commerce top-N recommendation is yet to be determined.
    Can the Crowd Judge Truthfulness? A Longitudinal Study on Recent Misinformation about COVID-19. (arXiv:2107.11755v2 [cs.IR] UPDATED)
    (3 min) Recently, the misinformation problem has been addressed with a crowdsourcing-based approach: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of non-experts is exploited. We study whether crowdsourcing is an effective and reliable method to assess truthfulness during a pandemic, targeting statements related to COVID-19, thus addressing (mis)information that is both related to a sensitive and personal issue and very recent as compared to when the judgment is done. In our experiments, crowd workers are asked to assess the truthfulness of statements and to provide evidence for their assessments. Besides showing that the crowd is able to accurately judge the truthfulness of the statements, we report results on worker behavior, agreement among workers, the effect of aggregation functions, of scale transformations, and of worker background and bias. We perform a longitudinal study by re-launching the task multiple times with both novice and experienced workers, deriving important insights on how behavior and quality change over time. Our results show that: workers are able to detect and objectively categorize online (mis)information related to COVID-19; both crowdsourced and expert judgments can be transformed and aggregated to improve quality; worker background and other signals (e.g., source of information, behavior) impact the quality of the data. The longitudinal study demonstrates that the time span has a major effect on the quality of the judgments, for both novice and experienced workers. Finally, we provide an extensive failure analysis of the statements misjudged by the crowd workers.
    Polar Deconvolution of Mixed Signals. (arXiv:2010.10508v3 [cs.IR] UPDATED)
    (2 min) The signal demixing problem seeks to separate a superposition of multiple signals into its constituent components. This paper studies a two-stage approach that first decompresses and subsequently deconvolves the noisy and undersampled observations of the superposition using two convex programs. Probabilistic error bounds are given on the accuracy with which this process approximates the individual signals. The theory of polar convolution of convex sets and gauge functions plays a central role in the analysis and solution process. If the measurements are random and the noise is bounded, this approach stably recovers low-complexity and mutually incoherent signals, with high probability and with near-optimal sample complexity. We develop an efficient algorithm, based on level-set and conditional-gradient methods, that solves the convex optimization problems with sublinear iteration complexity and linear space requirements. Numerical experiments on both real and synthetic data confirm the theory and the efficiency of the approach.
    Evaluating Fairness in Argument Retrieval. (arXiv:2108.10442v2 [cs.IR] UPDATED)
    (2 min) Existing commercial search engines often struggle to represent different perspectives of a search query. Argument retrieval systems address this limitation of search engines and provide both positive (PRO) and negative (CON) perspectives about a user's information need on a controversial topic (e.g., climate change). The effectiveness of such argument retrieval systems is typically evaluated based on topical relevance and argument quality, without taking into account the often differing number of documents shown for the argument stances (PRO or CON). Therefore, systems may retrieve relevant passages, but with a biased exposure of arguments. In this work, we analyze a range of non-stochastic fairness-aware ranking and diversity metrics to evaluate the extent to which argument stances are fairly exposed in argument retrieval systems. Using the official runs of the argument retrieval task Touch\'e at CLEF 2020, as well as synthetic data to control the amount and order of argument stances in the rankings, we show that systems with the best effectiveness in terms of topical relevance are not necessarily the most fair or the most diverse in terms of argument stance. The relationships we found between (un)fairness and diversity metrics shed light on how to evaluate group fairness -- in addition to topical relevance -- in argument retrieval settings.
    Multi-class Text Classification using BERT-based Active Learning. (arXiv:2104.14289v2 [cs.IR] UPDATED)
    (2 min) Text Classification finds interesting applications in the pickup and delivery services industry where customers require one or more items to be picked up from a location and delivered to a certain destination. Classifying these customer transactions into multiple categories helps understand the market needs for different customer segments. Each transaction is accompanied by a text description provided by the customer to describe the products being picked up and delivered which can be used to classify the transaction. BERT-based models have proven to perform well in Natural Language Understanding. However, the product descriptions provided by the customers tend to be short, incoherent and code-mixed (Hindi-English) text which demands fine-tuning of such models with manually labelled data to achieve high accuracy. Collecting this labelled data can prove to be expensive. In this paper, we explore Active Learning strategies to label transaction descriptions cost effectively while using BERT to train a transaction classification model. On TREC-6, AG's News Corpus and an internal dataset, we benchmark the performance of BERT across different Active Learning strategies in Multi-Class Text Classification.
    "Don't Downvote A\$\$\$\$\$\$s!!": An Exploration of Reddit's Advice Communities. (arXiv:2109.09044v1 [cs.HC])
(2 min) Advice forums are a crowdsourced way to reinforce cultural norms and moral behavior. Sites like Reddit contain massive amounts of natural language human interaction, with rules and norms unique to each individual subreddit community. To explore this data, we created a dataset with the top 1,000 posts from each of two such forums, r/AmItheAsshole and r/relationships, and extracted natural language features including sentiment, similarity, word frequency, and demographics using both algorithmic and manual methods. Further, we developed a method to extract demographic information from the subreddits, examined how the post authors' self-disclosures reflect the unique communities in which their posts are shared, and discussed how the authors' language use choices might be related to broader social patterns. We observed some differences between the subreddits in terms of word frequency, demographics disclosure, and gendered language. In general, both subreddits had more female posters than male, and posters tended to use more words about the opposite gender than about their own. Gender-diverse posters were uncommon. Implications for future research include a more careful, inclusive focus on identity and disclosure and how that interacts with advice-seeking behavior in online communities.
    Inductive Conformal Recommender System. (arXiv:2109.08949v1 [cs.LG])
(2 min) Traditional recommendation algorithms develop techniques that can help people to choose desirable items. However, in many real-world applications, along with a set of recommendations, it is also essential to quantify each recommendation's (un)certainty. The conformal recommender system uses the experience of a user to output a set of recommendations, each associated with a precise confidence value. Given a significance level $\varepsilon$, it provides a bound $\varepsilon$ on the probability of making a wrong recommendation. The conformal framework uses a key concept called the nonconformity measure, which measures the strangeness of an item with respect to other items. One of the significant design challenges of any conformal recommendation framework is integrating the nonconformity measure with the recommendation algorithm. In this paper, we introduce an inductive variant of a conformal recommender system. We propose and analyze different nonconformity measures in the inductive setting. We also provide theoretical proofs on the error bound and the time complexity. Extensive empirical analysis on ten benchmark datasets demonstrates that the inductive variant substantially improves computation time while preserving accuracy.
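    The following is a minimal, hedged sketch of the inductive conformal recipe the abstract refers to: nonconformity scores are computed on a held-out calibration set, and a candidate item is recommended only if its p-value exceeds the significance level $\varepsilon$. The `predict` function and the choice of nonconformity (negated predicted preference) are illustrative stand-ins, not the measures proposed in the paper.

```python
import numpy as np

def conformal_recommend(predict, calib_pairs, candidate_items, user, eps=0.1):
    """Recommend items whose nonconformity is not unusually high relative to a
    calibration set; nonconformity here is simply the negated predicted preference."""
    calib_scores = np.array([-predict(u, i) for (u, i) in calib_pairs])
    recommended = []
    for item in candidate_items:
        alpha = -predict(user, item)
        # p-value: fraction of calibration examples at least as nonconforming as this one
        p_value = (np.sum(calib_scores >= alpha) + 1) / (len(calib_scores) + 1)
        if p_value > eps:            # keep items that cannot be rejected at level eps
            recommended.append(item)
    return recommended

predict = lambda u, i: 1.0 / (1.0 + abs(u - i))      # stand-in preference model
print(conformal_recommend(predict, [(0, 1), (0, 2), (1, 3)], [0, 1, 5, 9], user=0))
```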
    A Comprehensive Overview of Recommender System and Sentiment Analysis. (arXiv:2109.08794v1 [cs.AI])
(2 min) Recommender systems have proven crucial in many fields and are widely used across various domains. Most conventional recommender systems rely on the numeric rating given by a user to reflect their opinion about a consumed item; however, these ratings are not available in many domains. As a result, a new source of information, represented by user-generated reviews, is incorporated in the recommendation process to compensate for the lack of these ratings. The reviews contain rich and abundant information related to the whole item or to a specific feature, which can be extracted using sentiment analysis. This paper gives a comprehensive overview to help researchers who aim to work with recommender systems and sentiment analysis. It includes a background on the recommender system concept, including the phases, approaches, and performance metrics used in recommender systems. It then discusses the sentiment analysis concept and highlights its main points, including levels and approaches, with a focus on aspect-based sentiment analysis.
    Fast query-by-example speech search using separable model. (arXiv:2109.08870v1 [eess.AS])
(2 min) Traditional Query-by-Example (QbE) speech search approaches usually rely on frame-level features, while state-of-the-art approaches tend to use models based on acoustic word embeddings (AWEs) to transform variable-length audio signals into fixed-length feature vector representations. However, these approaches cannot meet the requirements of search quality and speed at the same time. In this paper, we propose a novel fast QbE speech search method based on separable models to address this problem. First, a QbE speech search training framework is introduced. Second, we design a novel model inference scheme based on RepVGG which can efficiently improve QbE search quality. Third, we modify and improve our QbE speech search model according to the proposed model inference scheme. Experiments on a keywords dataset show that our proposed method can improve the GPU Real-Time Factor (RTF) from 1/150 to 1/2300 by applying the separable model scheme alone, and that it outperforms other state-of-the-art methods.
    Feature Engineering for US State Legislative Hearings: Stance, Affiliation, Engagement and Absentees. (arXiv:2109.08855v1 [cs.IR])
(2 min) In US State government legislatures, most of the activity occurs in committees made up of lawmakers discussing bills. When analyzing, classifying or summarizing these committee proceedings, several features are of broad interest. In this paper, we engineer four useful features, two applying to lawmakers (engagement and absence), and two to non-lawmakers (stance and affiliation). We propose a system to automatically track the affiliation of organizations in public comments and whether the organizational representative supports or opposes the bill. The model tracking affiliation achieves an F1 of 0.872 while the support determination has an F1 of 0.979. Additionally, a metric to compute legislator engagement and absenteeism is proposed and, as proof of concept, a list of the most and least engaged legislators over one full California legislative session is presented.
    Solar cell patent classification method based on keyword extraction and deep neural network. (arXiv:2109.08796v1 [cs.IR])
(2 min) With the growing impact of ESG on businesses, research related to renewable energy is receiving great attention. Solar cells are one such area, and the research value of solar cell patent analysis is accordingly very high. Being able to accurately analyze and classify patent documents can reveal important technical relationships, describe the business trends in a technology, and, when it comes to investment, inspire new industrial solutions and inform important decisions. Therefore, we must carefully analyze patent documents and exploit the value of patents. To solve the solar cell patent classification problem, we propose a keyword extraction method and a deep neural network-based solar cell patent classification method. First, solar cell patents are preprocessed. The KeyBERT algorithm is then used to extract keywords and key phrases from the patent abstracts to construct a lexical dictionary. We then build a solar cell patent classification model based on a deep neural network. Finally, we use this model to classify the patents, achieving a training accuracy greater than 95% and a validation accuracy of about 87.5%. This shows that the deep neural network method can not only handle the classification of complex and difficult solar cell patents, but also achieves a good classification effect.
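    For readers unfamiliar with the keyword-extraction step, the snippet below sketches how candidate keyphrases could be pulled from a patent abstract with the keybert package, assuming its KeyBERT().extract_keywords interface; the abstract text and parameters are invented, and the paper's actual preprocessing and dictionary construction may differ.

```python
# Hedged sketch: extracting candidate keywords from a (made-up) patent abstract with
# the keybert package; the extracted phrases would then seed the lexical dictionary.
from keybert import KeyBERT

abstract = ("A photovoltaic cell with a perovskite absorber layer and an "
            "anti-reflective coating for improved conversion efficiency.")
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(abstract, keyphrase_ngram_range=(1, 2),
                                     stop_words="english", top_n=5)
print(keywords)   # list of (phrase, similarity score) pairs
```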
    Complex Temporal Question Answering on Knowledge Graphs. (arXiv:2109.08935v1 [cs.IR])
    (2 min) Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class of practical importance, but have not received much attention in research. This work presents EXAQT, the first end-to-end system for answering complex temporal questions that have multiple entities and predicates, and associated temporal conditions. EXAQT answers natural language questions over KGs in two stages, one geared towards high recall, the other towards precision at top ranks. The first step computes question-relevant compact subgraphs within the KG, and judiciously enhances them with pertinent temporal facts, using Group Steiner Trees and fine-tuned BERT models. The second step constructs relational graph convolutional networks (R-GCNs) from the first step's output, and enhances the R-GCNs with time-aware entity embeddings and attention over temporal relations. We evaluate EXAQT on TimeQuestions, a large dataset of 16k temporal questions we compiled from a variety of general purpose KG-QA benchmarks. Results show that EXAQT outperforms three state-of-the-art systems for answering complex questions over KGs, thereby justifying specialized treatment of temporal QA.
    BERT-Beta: A Proactive Probabilistic Approach to Text Moderation. (arXiv:2109.08805v1 [cs.CL])
    (2 min) Text moderation for user generated content, which helps to promote healthy interaction among users, has been widely studied and many machine learning models have been proposed. In this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. Specifically, we propose a new concept {\it text toxicity propensity} to characterize the extent to which a text tends to attract toxic comments. Beta regression is then introduced to do the probabilistic modeling, which is demonstrated to function well in comprehensive experiments. We also propose an explanation method to communicate the model decision clearly. Both propensity scoring and interpretation benefit text moderation in a novel manner. Finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.
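    To make the modeling choice concrete, here is a hedged numpy/scipy sketch of a plain beta regression likelihood (logit mean link, scalar precision) of the kind the abstract alludes to; the feature matrix stands in for whatever text representation is used, and this is not the paper's BERT-based implementation.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def beta_nll(params, X, y):
    """Negative log-likelihood of a beta regression with a logit mean link:
    mu = sigmoid(X @ beta) is the mean propensity and phi > 0 the precision;
    targets y must lie strictly inside (0, 1)."""
    beta, phi = params[:-1], np.exp(params[-1])
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    a, b = mu * phi, (1.0 - mu) * phi
    ll = (gammaln(phi) - gammaln(a) - gammaln(b)
          + (a - 1) * np.log(y) + (b - 1) * np.log1p(-y))
    return -np.sum(ll)

# Toy fit on synthetic propensity-like targets in (0, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
mu_true = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -0.5, 0.2]))))
y = np.clip(mu_true + 0.05 * rng.normal(size=200), 1e-3, 1 - 1e-3)
fit = minimize(beta_nll, x0=np.zeros(4), args=(X, y), method="L-BFGS-B")
print(fit.x[:-1])   # roughly recovers the regression coefficients
```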
  • cs.LG updates on arXiv.org

    DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling. (arXiv:2105.12441v3 [cs.LG] UPDATED)
(2 min) Since 2014, transfer learning has been the key driver of improvements in spatial saliency prediction; however, progress has stagnated in the last 3-5 years. We conduct a large-scale transfer learning study which tests different ImageNet backbones, always using the same read-out architecture and learning protocol adopted from DeepGaze II. By replacing the VGG19 backbone of DeepGaze II with ResNet50 features we improve the performance on saliency prediction from 78% to 85%. However, as we continue to test better ImageNet models as backbones (such as EfficientNetB5) we observe no additional improvement on saliency prediction. By analyzing the backbones further, we find that generalization to other datasets differs substantially, with models being consistently overconfident in their fixation predictions. We show that by combining multiple backbones in a principled manner a good confidence calibration on unseen datasets can be achieved. This new model, "DeepGaze IIE", yields a significant leap in benchmark performance in and out-of-domain with a 15 percentage point improvement over DeepGaze II to 93% on MIT1003, marking a new state of the art on the MIT/Tuebingen Saliency Benchmark in all available metrics (AUC: 88.3%, sAUC: 79.4%, CC: 82.4%).
    Binary Classification as a Phase Separation Process. (arXiv:2009.02467v3 [stat.ML] UPDATED)
(2 min) We propose a new binary classification model called the Phase Separation Binary Classifier (PSBC). It consists of a discretization of a nonlinear reaction-diffusion equation coupled with an Ordinary Differential Equation, and is inspired by the behavior of fluids, namely how binary fluids phase separate. Thus, parameters and hyperparameters have physical meaning, whose effects are studied in several different scenarios. The PSBC's equations can be seen as a dynamical system whose coefficients are trainable weights, with an architecture similar to that of a Recurrent Neural Network. As such, forward propagation amounts to an initial value problem. Boundary conditions are also present, bearing similarity to padding techniques in Computer Vision. Model compression is exploited in several ways, with weight sharing taking place both across and within layers. The model is tested on pairs of digits from the classical MNIST database. An associated multiclass classifier is also constructed using a combination of Ensemble Learning and one-versus-one techniques. It is also shown how the PSBC can be combined with other methods - like aggregation and PCA - in order to construct better binary classifiers. The role of boundary conditions and viscosity is thoroughly studied in the case of digits ``0'' and ``1''.
    Towards Zero and Few-shot Knowledge-seeking Turn Detection in Task-orientated Dialogue Systems. (arXiv:2109.08820v1 [cs.CL])
    (2 min) Most prior work on task-oriented dialogue systems is restricted to supporting domain APIs. However, users may have requests that are out of the scope of these APIs. This work focuses on identifying such user requests. Existing methods for this task mainly rely on fine-tuning pre-trained models on large annotated data. We propose a novel method, REDE, based on adaptive representation learning and density estimation. REDE can be applied to zero-shot cases, and quickly learns a high-performing detector with only a few shots by updating less than 3K parameters. We demonstrate REDE's competitive performance on DSTC9 data and our newly collected test set.
    Reinforcement Learning for Robust Missile Autopilot Design. (arXiv:2011.12956v2 [cs.LG] UPDATED)
(0 min) Designing missiles' autopilot controllers has been a complex task, given the extensive flight envelope and the nonlinear flight dynamics. A solution that can excel both in nominal performance and in robustness to uncertainties is still to be found. While Control Theory often resorts to parameter-scheduling procedures, Reinforcement Learning has presented interesting results in ever more complex tasks, from videogames to robotic tasks with continuous action domains. However, it still lacks clear insights on how to find adequate reward functions and exploration strategies. To the best of our knowledge, this work pioneers the use of Reinforcement Learning as a framework for flight control. In fact, it aims at training a model-free agent that can control the longitudinal flight of a missile, achieving optimal performance and robustness to uncertainties. To that end, under TRPO's methodology, the collected experience is augmented according to HER, stored in a replay buffer and sampled according to its significance. Not only does this work enhance the concept of prioritized experience replay into BPER, but it also reformulates HER, activating them both only when the training progress converges to suboptimal policies, in what is proposed as the SER methodology. Besides, the Reward Engineering process is carefully detailed. The results show that it is possible both to achieve optimal performance and to improve the agent's robustness to uncertainties (with little damage to nominal performance) by further training it in non-nominal environments, therefore validating the proposed approach and encouraging future research in this field.
    Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking. (arXiv:2004.01382v2 [cs.CV] UPDATED)
    (0 min) Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of twelve state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.
    A Conceptual Framework for Externally-influenced Agents: An Assisted Reinforcement Learning Review. (arXiv:2007.01544v2 [cs.AI] UPDATED)
    (0 min) A long-term goal of reinforcement learning agents is to be able to perform tasks in complex real-world scenarios. The use of external information is one way of scaling agents to more complex problems. However, there is a general lack of collaboration or interoperability between different approaches using external information. In this work, while reviewing externally-influenced methods, we propose a conceptual framework and taxonomy for assisted reinforcement learning, aimed at fostering collaboration by classifying and comparing various methods that use external information in the learning process. The proposed taxonomy details the relationship between the external information source and the learner agent, highlighting the process of information decomposition, structure, retention, and how it can be used to influence agent learning. As well as reviewing state-of-the-art methods, we identify current streams of reinforcement learning that use external information in order to improve the agent's performance and its decision-making process. These include heuristic reinforcement learning, interactive reinforcement learning, learning from demonstration, transfer learning, and learning from multiple sources, among others. These streams of reinforcement learning operate with the shared objective of scaffolding the learner agent. Lastly, we discuss further possibilities for future work in the field of assisted reinforcement learning systems.
    Doubly infinite residual neural networks: a diffusion process approach. (arXiv:2007.03253v2 [stat.ML] UPDATED)
(0 min) Modern neural networks (NNs) featuring a large number of layers (depth) and units per layer (width) have achieved remarkable performance across many domains. While there exists a vast literature on the interplay between infinitely wide NNs and Gaussian processes, little is known about analogous interplays with respect to infinitely deep NNs. NNs with independent and identically distributed (i.i.d.) initializations exhibit undesirable forward and backward propagation properties as the number of layers increases. To overcome these drawbacks, Peluchetti and Favaro (2020) considered fully-connected residual networks (ResNets) with the network's parameters initialized by means of distributions that shrink as the number of layers increases, thus establishing an interplay between infinitely deep ResNets and solutions to stochastic differential equations, i.e. diffusion processes, and showing that infinitely deep ResNets do not suffer from undesirable forward-propagation properties. In this paper, we review the results of Peluchetti and Favaro (2020), extending them to convolutional ResNets, and we establish analogous backward-propagation results, which directly relate to the problem of training fully-connected deep ResNets. We then investigate the more general setting of doubly infinite NNs, where both the network's width and depth grow unboundedly. We focus on doubly infinite fully-connected ResNets, for which we consider i.i.d. initializations. Under this setting, we show that the dynamics of quantities of interest converge, at initialization, to deterministic limits. This allows us to provide analytical expressions for inference, both in the case of weakly trained and fully trained ResNets. Our results highlight the limited expressive power of doubly infinite ResNets when the unscaled network's parameters are i.i.d. and the residual blocks are shallow.
    S$^3$VAADA: Submodular Subset Selection for Virtual Adversarial Active Domain Adaptation. (arXiv:2109.08901v1 [cs.LG])
(0 min) Unsupervised domain adaptation (DA) methods have focused on achieving maximal performance by aligning features from the source and target domains without using labeled data in the target domain. However, in real-world scenarios it might be feasible to obtain labels for a small proportion of the target data. In these scenarios, it is important to select maximally-informative samples to label and find an effective way to combine them with the existing knowledge from source data. Towards achieving this, we propose S$^3$VAADA, which i) introduces a novel submodular criterion to select a maximally informative subset to label and ii) enhances a cluster-based DA procedure through novel improvements to effectively utilize all the available data for improving generalization on the target. Our approach consistently outperforms the competing state-of-the-art approaches on datasets with varying degrees of domain shift.
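    The paper's submodular criterion is specific to S$^3$VAADA, but the underlying mechanism can be illustrated with a generic greedy maximizer of a facility-location objective over unlabeled points, a standard stand-in for ``pick a maximally informative subset to label''; the similarity matrix and budget below are toy values.

```python
import numpy as np

def greedy_submodular_selection(similarity, budget):
    """Greedily pick `budget` indices maximizing the facility-location objective
    f(S) = sum_i max_{j in S} similarity[i, j], a classic submodular surrogate
    for how well the selected subset "covers" the unlabeled pool."""
    n = similarity.shape[0]
    selected, best_cover = [], np.zeros(n)
    for _ in range(budget):
        gains = [np.maximum(best_cover, similarity[:, j]).sum() - best_cover.sum()
                 for j in range(n)]
        j_star = int(np.argmax(gains))
        selected.append(j_star)
        best_cover = np.maximum(best_cover, similarity[:, j_star])
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))
sim = feats @ feats.T          # toy similarity between unlabeled points
sim -= sim.min()               # keep it non-negative so the objective is monotone
print(greedy_submodular_selection(sim, budget=5))
```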
    What BERT Based Language Models Learn in Spoken Transcripts: An Empirical Study. (arXiv:2109.09105v1 [cs.CL])
(0 min) Language Models (LMs) have been ubiquitously leveraged in various tasks including spoken language understanding (SLU). Spoken language requires careful understanding of speaker interactions, dialog states and speech-induced multimodal behaviors to generate a meaningful representation of the conversation. In this work, we propose to dissect SLU into three representative properties: conversational (disfluency, pause, overtalk), channel (speaker-type, turn-tasks) and ASR (insertion, deletion, substitution). We probe BERT-based language models (BERT, RoBERTa) trained on spoken transcripts to investigate their ability to understand these multifarious properties in the absence of any speech cues. Empirical results indicate that the LMs are surprisingly good at capturing conversational properties such as pause prediction and overtalk detection from lexical tokens. On the downside, the LMs score low on turn-tasks and ASR error prediction. Additionally, pre-training the LM on spoken transcripts restrains its linguistic understanding. Finally, we establish the efficacy and transferability of the mentioned properties on two benchmark datasets: the Switchboard Dialog Act and Disfluency datasets.
    A Machine Learning Pipeline to Examine Political Bias with Congressional Speeches. (arXiv:2109.09014v1 [cs.CY])
(0 min) Computational methods to model political bias in social media involve several challenges due to the heterogeneity, high dimensionality, multiple modalities, and scale of the data. Political bias in social media has been studied from multiple viewpoints, such as media bias, political ideology, echo chambers, and controversies, using machine learning pipelines. Most current methods rely heavily on manually-labeled ground-truth data for the underlying political bias prediction tasks. Limitations of such methods include human-intensive labeling, labels related to only a specific problem, and the inability to determine the near-future bias state of a social media conversation. In this work, we address these problems and present machine learning approaches to study political bias in two ideologically diverse social media forums, Gab and Twitter, without the availability of human-annotated data. Our proposed methods exploit transcripts collected from political speeches in the US Congress to label the data, and achieve accuracies of 70.5% and 65.1% on Twitter and Gab data respectively in predicting political bias. We also present a machine learning approach that combines features from cascades and text to forecast a cascade's political bias with an accuracy of about 85%.
    DECORAS: detection and characterization of radio-astronomical sources using deep learning. (arXiv:2109.09077v1 [astro-ph.IM])
    (0 min) We present DECORAS, a deep learning based approach to detect both point and extended sources from Very Long Baseline Interferometry (VLBI) observations. Our approach is based on an encoder-decoder neural network architecture that uses a low number of convolutional layers to provide a scalable solution for source detection. In addition, DECORAS performs source characterization in terms of the position, effective radius and peak brightness of the detected sources. We have trained and tested the network with images that are based on realistic Very Long Baseline Array (VLBA) observations at 20 cm. Also, these images have not gone through any prior de-convolution step and are directly related to the visibility data via a Fourier transform. We find that the source catalog generated by DECORAS has a better overall completeness and purity, when compared to a traditional source detection algorithm. DECORAS is complete at the 7.5$\sigma$ level, and has an almost factor of two improvement in reliability at 5.5$\sigma$. We find that DECORAS can recover the position of the detected sources to within 0.61 $\pm$ 0.69 mas, and the effective radius and peak surface brightness are recovered to within 20 per cent for 98 and 94 per cent of the sources, respectively. Overall, we find that DECORAS provides a reliable source detection and characterization solution for future wide-field VLBI surveys.
    Learning to Group: A Bottom-Up Framework for 3D Part Discovery in Unseen Categories. (arXiv:2002.06478v4 [cs.CV] UPDATED)
(0 min) We address the problem of discovering 3D parts for objects in unseen categories. Being able to learn the geometry prior of parts and transfer this prior to unseen categories poses fundamental challenges for data-driven shape segmentation approaches. Formulated as a contextual bandit problem, we propose a learning-based agglomerative clustering framework which learns a grouping policy to progressively group small part proposals into bigger ones in a bottom-up fashion. At the core of our approach is to restrict the local context for extracting part-level features, which encourages generalizability to unseen categories. On the large-scale fine-grained 3D part dataset, PartNet, we demonstrate that our method can transfer knowledge of parts learned from 3 training categories to 21 unseen testing categories without seeing any annotated samples. Quantitative comparisons against four shape segmentation baselines show that our approach achieves state-of-the-art performance.
    Lifelong Robotic Reinforcement Learning by Retaining Experiences. (arXiv:2109.09180v1 [cs.LG])
(0 min) Multi-task learning ideally allows robots to acquire a diverse repertoire of useful skills. However, many multi-task reinforcement learning efforts assume the robot can collect data from all tasks at all times. In reality, the tasks that the robot learns arrive sequentially, depending on the user and the robot's current environment. In this work, we study a practical sequential multi-task RL problem that is motivated by the practical constraints of physical robotic systems, and derive an approach that effectively leverages the data and policies learned for previous tasks to cumulatively grow the robot's skill-set. In a series of simulated robotic manipulation experiments, our approach requires less than half the samples needed to learn each task from scratch, while avoiding impractical round-robin data collection. On a Franka Emika Panda robot arm, our approach incrementally learns ten challenging tasks, including bottle capping and block insertion.
    Multimodal Classification: Current Landscape, Taxonomy and Future Directions. (arXiv:2109.09020v1 [cs.LG])
    (0 min) Multimodal classification research has been gaining popularity in many domains that collect more data from multiple sources including satellite imagery, biometrics, and medicine. However, the lack of consistent terminology and architectural descriptions makes it difficult to compare different existing solutions. We address these challenges by proposing a new taxonomy for describing such systems based on trends found in recent publications on multimodal classification. Many of the most difficult aspects of unimodal classification have not yet been fully addressed for multimodal datasets including big data, class imbalance, and instance level difficulty. We also provide a discussion of these challenges and future directions.
    Unsupervised Behaviour Discovery with Quality-Diversity Optimisation. (arXiv:2106.05648v2 [cs.NE] UPDATED)
    (0 min) Quality-Diversity algorithms refer to a class of evolutionary algorithms designed to find a collection of diverse and high-performing solutions to a given problem. In robotics, such algorithms can be used for generating a collection of controllers covering most of the possible behaviours of a robot. To do so, these algorithms associate a behavioural descriptor to each of these behaviours. Each behavioural descriptor is used for estimating the novelty of one behaviour compared to the others. In most existing algorithms, the behavioural descriptor needs to be hand-coded, thus requiring prior knowledge about the task to solve. In this paper, we introduce: Autonomous Robots Realising their Abilities, an algorithm that uses a dimensionality reduction technique to automatically learn behavioural descriptors based on raw sensory data. The performance of this algorithm is assessed on three robotic tasks in simulation. The experimental results show that it performs similarly to traditional hand-coded approaches without the requirement to provide any hand-coded behavioural descriptor. In the collection of diverse and high-performing solutions, it also manages to find behaviours that are novel with respect to more features than its hand-coded baselines. Finally, we introduce a variant of the algorithm which is robust to the dimensionality of the behavioural descriptor space.
    Asymmetric 3D Context Fusion for Universal Lesion Detection. (arXiv:2109.08684v1 [eess.IV])
    (0 min) Modeling 3D context is essential for high-performance 3D medical image analysis. Although 2D networks benefit from large-scale 2D supervised pretraining, it is weak in capturing 3D context. 3D networks are strong in 3D context yet lack supervised pretraining. As an emerging technique, \emph{3D context fusion operator}, which enables conversion from 2D pretrained networks, leverages the advantages of both and has achieved great success. Existing 3D context fusion operators are designed to be spatially symmetric, i.e., performing identical operations on each 2D slice like convolutions. However, these operators are not truly equivariant to translation, especially when only a few 3D slices are used as inputs. In this paper, we propose a novel asymmetric 3D context fusion operator (A3D), which uses different weights to fuse 3D context from different 2D slices. Notably, A3D is NOT translation-equivariant while it significantly outperforms existing symmetric context fusion operators without introducing large computational overhead. We validate the effectiveness of the proposed method by extensive experiments on DeepLesion benchmark, a large-scale public dataset for universal lesion detection from computed tomography (CT). The proposed A3D consistently outperforms symmetric context fusion operators by considerable margins, and establishes a new \emph{state of the art} on DeepLesion. To facilitate open research, our code and model in PyTorch are available at https://github.com/M3DV/AlignShift.
    Adversarial examples attack based on random warm restart mechanism and improved Nesterov momentum. (arXiv:2105.05029v2 [cs.LG] UPDATED)
(0 min) Deep learning has achieved great success in the field of computer vision, but some studies have pointed out that deep learning models are vulnerable to adversarial example attacks and can be led to make false decisions. This challenges the further development of deep learning, and urges researchers to pay more attention to the relationship between adversarial example attacks and deep learning security. This work focuses on adversarial examples, optimizes their generation from the perspective of adversarial robustness, and takes the perturbations added to adversarial examples as the optimization parameter. We propose the RWR-NM-PGD attack algorithm, based on a random warm restart mechanism and improved Nesterov momentum, from the perspective of gradient optimization. The algorithm introduces improved Nesterov momentum, using its ability to accelerate convergence and improve the gradient update direction to speed up the generation of adversarial examples. In addition, a random warm restart mechanism is used for optimization, and the projected gradient descent algorithm is used to limit the range of the generated perturbations in each warm restart, yielding a stronger attack. Experiments on two public datasets show that the proposed algorithm can improve the success rate of attacks on deep learning models without extra time cost. Compared with the benchmark attack methods, the proposed algorithm achieves a better attack success rate for both normally trained and defense models. Our method has an average attack success rate of 46.3077%, which is 27.19% higher than I-FGSM and 9.27% higher than PGD. The attack results on 13 defense models show that the proposed attack algorithm is superior to the benchmark algorithms in attack universality and transferability.
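    As a rough sketch of the family of attacks involved (iterative FGSM/PGD with a momentum term and projection onto the $\varepsilon$-ball), the following PyTorch snippet may help; it is a generic momentum-PGD illustration with made-up hyperparameters, not the RWR-NM-PGD algorithm itself, which additionally uses random warm restarts and a Nesterov-style update.

```python
import torch

def momentum_pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10, mu=1.0):
    """Iterative FGSM/PGD-style attack with a momentum term; inputs assumed in [0, 1]."""
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # normalized-gradient momentum accumulation
        g = mu * g + grad / grad.abs().mean().clamp_min(1e-12)
        with torch.no_grad():
            x_adv = x_adv + alpha * g.sign()
            # project back into the eps-ball around x and the valid pixel range
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

# Toy usage with a stand-in linear model on random "images".
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = momentum_pgd(model, x, y)
print((x_adv - x).abs().max())   # stays within the eps-ball
```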
    Efficient Variational Graph Autoencoders for Unsupervised Cross-domain Prerequisite Chains. (arXiv:2109.08722v1 [cs.LG])
    (0 min) Prerequisite chain learning helps people acquire new knowledge efficiently. While people may quickly determine learning paths over concepts in a domain, finding such paths in other domains can be challenging. We introduce Domain-Adversarial Variational Graph Autoencoders (DAVGAE) to solve this cross-domain prerequisite chain learning task efficiently. Our novel model consists of a variational graph autoencoder (VGAE) and a domain discriminator. The VGAE is trained to predict concept relations through link prediction, while the domain discriminator takes both source and target domain data as input and is trained to predict domain labels. Most importantly, this method only needs simple homogeneous graphs as input, compared with the current state-of-the-art model. We evaluate our model on the LectureBankCD dataset, and results show that our model outperforms recent graph-based benchmarks while using only 1/10 of graph scale and 1/3 computation time.
    Learned Image Coding for Machines: A Content-Adaptive Approach. (arXiv:2108.09992v2 [eess.IV] UPDATED)
    (0 min) Today, according to the Cisco Annual Internet Report (2018-2023), the fastest-growing category of Internet traffic is machine-to-machine communication. In particular, machine-to-machine communication of images and videos represents a new challenge and opens up new perspectives in the context of data compression. One possible solution approach consists of adapting current human-targeted image and video coding standards to the use case of machine consumption. Another approach consists of developing completely new compression paradigms and architectures for machine-to-machine communications. In this paper, we focus on image compression and present an inference-time content-adaptive finetuning scheme that optimizes the latent representation of an end-to-end learned image codec, aimed at improving the compression efficiency for machine-consumption. The conducted experiments show that our online finetuning brings an average bitrate saving (BD-rate) of -3.66% with respect to our pretrained image codec. In particular, at low bitrate points, our proposed method results in a significant bitrate saving of -9.85%. Overall, our pretrained-and-then-finetuned system achieves -30.54% BD-rate over the state-of-the-art image/video codec Versatile Video Coding (VVC).
    Development of patients triage algorithm from nationwide COVID-19 registry data based on machine learning. (arXiv:2109.09001v1 [cs.LG])
(0 min) A prompt severity assessment model for patients confirmed to be infected with an infectious disease could enable efficient diagnosis and alleviate the burden on the medical system. This paper presents the development of a severity assessment model using machine learning techniques and its application to SARS-CoV-2 patients. Here, we highlight that our model only requires patients' basic personal data, allowing them to judge their own severity. We selected a boosting-based decision tree model as the classifier and interpreted mortality as a probability score after modeling. Specifically, the hyperparameters that determine the structure of the tree model were tuned using Bayesian optimization without any medical knowledge. As a result, we measured model performance and identified the variables affecting severity through the model. Finally, we aim to establish a medical system that allows patients to check their own severity and informs them to visit the appropriate clinic center based on the past treatment details of other patients with similar severity.
    Programming Puzzles. (arXiv:2106.05784v2 [cs.LG] UPDATED)
    (0 min) We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input $x$ which makes $f$ output "True". The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f(x)$ is all that is needed to test a candidate solution $x$. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping. We develop baseline enumerative program synthesis and GPT-3 solvers that are capable of solving easy puzzles -- even without access to any reference solutions -- by learning from their own past solutions. Based on a small user study, we find puzzle difficulty to correlate between human programmers and the baseline AI solvers.
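    A toy puzzle in the same spirit (invented here, not taken from the P3 dataset) shows how little machinery is needed: the verifier is just a Python function, and checking a candidate solution is a single call.

```python
def f(x: str) -> bool:
    """A toy puzzle in the P3 style: find a string that is its own reverse,
    has length 5, and contains the letter 'b'."""
    return x == x[::-1] and len(x) == 5 and "b" in x

x = "abcba"        # a candidate solution
assert f(x)        # evaluating f(x) is the whole test: True means the puzzle is solved
```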
    Augmenting semantic lexicons using word embeddings and transfer learning. (arXiv:2109.09010v1 [cs.CL])
    (0 min) Sentiment-aware intelligent systems are essential to a wide array of applications including marketing, political campaigns, recommender systems, behavioral economics, social psychology, and national security. These sentiment-aware intelligent systems are driven by language models which broadly fall into two paradigms: 1. Lexicon-based and 2. Contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researchers to readily determine which words and phrases contribute most to a change in measured sentiment. A challenge for any lexicon-based approach is that the lexicon needs to be routinely expanded with new words and expressions. Crowdsourcing annotations for semantic dictionaries may be an expensive and time-consuming task. Here, we propose two models for predicting sentiment scores to augment semantic lexicons at a relatively low cost using word embeddings and transfer learning. Our first model establishes a baseline employing a simple and shallow neural network initialized with pre-trained word embeddings using a non-contextual approach. Our second model improves upon our baseline, featuring a deep Transformer-based network that brings to bear word definitions to estimate their lexical polarity. Our evaluation shows that both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost.
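    A minimal sketch of the non-contextual baseline idea, assuming pre-trained word vectors are available (random vectors stand in for them here): regress the seed lexicon's human scores on the embeddings, then score out-of-lexicon words with the fitted model. The vocabulary and scores below are invented.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Stand-in embeddings: in practice these would be pre-trained word vectors
# (e.g., word2vec/GloVe); random vectors here just illustrate the pipeline.
rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "awful", "okay", "new_word"]
emb = {w: rng.normal(size=50) for w in vocab}

# Seed lexicon: (word, human sentiment score) pairs used for training.
seed = {"good": 0.8, "great": 0.9, "bad": -0.7, "awful": -0.9, "okay": 0.1}

X = np.stack([emb[w] for w in seed])
y = np.array(list(seed.values()))
model = Ridge(alpha=1.0).fit(X, y)

# Score an out-of-lexicon word by projecting its embedding through the model.
print(model.predict(emb["new_word"].reshape(1, -1)))
```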
    Locally-symplectic neural networks for learning volume-preserving dynamics. (arXiv:2109.09151v1 [math-ph])
(0 min) We propose locally-symplectic neural networks, LocSympNets, for learning volume-preserving dynamics. The construction of LocSympNets stems from the theorem on the local Hamiltonian description of the vector field of a volume-preserving dynamical system and from splitting methods based on symplectic integrators. Modified gradient modules of the recently proposed symplecticity-preserving neural networks SympNets are used to construct locally-symplectic modules, whose composition results in volume-preserving neural networks. LocSympNets are studied numerically considering linear and nonlinear dynamics, i.e., the semi-discretized advection equation and the Euler equations of the motion of a free rigid body, respectively. LocSympNets are able to learn linear and nonlinear dynamics to a high degree of accuracy. When learning a single trajectory of the rigid-body dynamics, LocSympNets learn both invariants of the system with absolute relative errors below 1% in long-time predictions, and they produce qualitatively good short-time predictions when learning the whole system from randomly sampled data.
    Graph Neural Networks for Decentralized Multi-Robot Submodular Action Selection. (arXiv:2105.08601v2 [cs.RO] UPDATED)
(0 min) In this paper, we develop a learning-based approach for decentralized submodular maximization. We focus on applications where robots are required to jointly select actions, e.g., motion primitives, to maximize team submodular objectives with local communications only. Such applications are essential for large-scale multi-robot coordination such as multi-robot motion planning for area coverage, environment exploration, and target tracking. But current decentralized submodular maximization algorithms either require assumptions on the inter-robot communication or lose their suboptimality guarantees. In this work, we propose a general-purpose learning architecture for submodular maximization at scale, with decentralized communications. Particularly, our learning architecture leverages a graph neural network (GNN) to capture local interactions of the robots and learns decentralized decision-making for the robots. We train the learning model by imitating an expert solution and implement the resulting model for decentralized action selection involving local observations and communications only. We demonstrate the performance of our GNN-based learning approach in a scenario of active target coverage with large networks of robots. The simulation results show our approach nearly matches the coverage performance of the expert algorithm, and yet runs several orders of magnitude faster with up to 50 robots. Moreover, its coverage performance is superior to existing decentralized greedy algorithms. The results also exhibit our approach's generalization capability in previously unseen scenarios, e.g., larger environments and larger networks of robots.
    Optimal Ensemble Construction for Multi-Study Prediction with Applications to COVID-19 Excess Mortality Estimation. (arXiv:2109.09164v1 [stat.ML])
    (3 min) It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets and applying standard statistical learning methods can result in poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown $\textit{multi-study ensembling}$ to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multi-study ensembling uses a two-stage $\textit{stacking}$ strategy which fits study-specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model-fitting stage, potentially resulting in a loss of efficiency. We therefore propose $\textit{optimal ensemble construction}$, an $\textit{all-in-one}$ approach to multi-study stacking whereby we jointly estimate ensemble weights as well as parameters associated with each study-specific model. We prove that limiting cases of our approach yield existing methods such as multi-study stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the proposed loss function. We compare our approach to standard methods by applying it to a multi-country COVID-19 dataset for baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. Importantly, our approach outperforms multi-study stacking and other standard methods in this application. We further characterize the method's performance in data-driven and other simulations. Our method remains competitive with or outperforms multi-study stacking and other earlier methods across a range of between-study heterogeneity levels.
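    For orientation, the two-stage multi-study stacking baseline that the abstract contrasts against can be sketched as follows: fit one model per study, then estimate non-negative ensemble weights on held-out data. This is a toy illustration of the baseline, not the paper's all-in-one estimator, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.optimize import nnls

def multi_study_stack(studies, X_holdout, y_holdout):
    """Stage 1: fit a model per study. Stage 2: weight them by non-negative
    least squares on a held-out set (the stacking step)."""
    models = [LinearRegression().fit(X, y) for X, y in studies]
    P = np.column_stack([m.predict(X_holdout) for m in models])
    weights, _ = nnls(P, y_holdout)
    predict = lambda X_new: np.column_stack([m.predict(X_new) for m in models]) @ weights
    return weights, predict

# Toy heterogeneous studies sharing a common signal.
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
studies = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ (beta + 0.3 * rng.normal(size=3)) + 0.1 * rng.normal(size=100)
    studies.append((X, y))
X_hold = rng.normal(size=(50, 3))
weights, predict = multi_study_stack(studies, X_hold, X_hold @ beta)
print(weights)
```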
    UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims. (arXiv:2109.09232v1 [cs.CL])
(2 min) Identifying check-worthy claims is often the first step of automated fact-checking systems. Tackling this task in a multilingual setting has been understudied. Encoding inputs with multilingual text representations could be one approach to solving multilingual check-worthiness detection. However, this approach could suffer if cultural bias exists within the communities on determining what is check-worthy. In this paper, we propose a language identification task as an auxiliary task to mitigate unintended bias. With this purpose, we experiment with joint training using the datasets from CLEF-2021 CheckThat!, which contain tweets in English, Arabic, Bulgarian, Spanish and Turkish. Our results show that joint training of language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.
    Hebbian Semi-Supervised Learning in a Sample Efficiency Setting. (arXiv:2103.09002v2 [cs.NE] CROSS LISTED)
    (2 min) We propose to address the issue of sample efficiency, in Deep Convolutional Neural Networks (DCNN), with a semi-supervised training strategy that combines Hebbian learning with gradient descent: all internal layers (both convolutional and fully connected) are pre-trained using an unsupervised approach based on Hebbian learning, and the last fully connected layer (the classification layer) is trained using Stochastic Gradient Descent (SGD). In fact, as Hebbian learning is an unsupervised learning method, its potential lies in the possibility of training the internal layers of a DCNN without labels. Only the final fully connected layer has to be trained with labeled examples. We performed experiments on various object recognition datasets, in different regimes of sample efficiency, comparing our semi-supervised (Hebbian for internal layers + SGD for the final fully connected layer) approach with end-to-end supervised backprop training, and with semi-supervised learning based on Variational Auto-Encoder (VAE). The results show that, in regimes where the number of available labeled samples is low, our semi-supervised approach outperforms the other approaches in almost all the cases.
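    A stripped-down sketch of the recipe, assuming a single fully connected hidden layer rather than a DCNN: the hidden weights are pre-trained without labels using an Oja-style Hebbian rule, and only the readout is trained on a small labeled subset (logistic regression stands in for SGD on the final layer). All sizes and data are toy values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                 # inputs (labels ignored during pre-training)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 1) Unsupervised Hebbian pre-training of a hidden layer with an Oja-style rule:
#    for each neuron j, dw_j = eta * h_j * (x - h_j * w_j).
W = rng.normal(scale=0.1, size=(20, 8))
eta = 0.005
for _ in range(5):
    for x in X:
        h = x @ W                                   # hidden activations
        W += eta * (np.outer(x, h) - W * (h ** 2))  # Hebbian term + Oja decay

# 2) Supervised training of only the readout layer, using few labeled samples.
H = np.maximum(X @ W, 0)                            # rectified features from the Hebbian layer
clf = LogisticRegression(max_iter=1000).fit(H[:100], y[:100])   # only 100 labels
print(clf.score(H[100:], y[100:]))
```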
    Topology, Convergence, and Reconstruction of Predictive States. (arXiv:2109.09203v1 [cond-mat.stat-mech])
(0 min) Predictive equivalence in discrete stochastic processes has been applied with great success to identify randomness and structure in statistical physics and chaotic dynamical systems, and to infer hidden Markov models. We examine the conditions under which predictive states can be reliably reconstructed from time-series data, showing that convergence of predictive states can be achieved from empirical samples in the weak topology of measures. Moreover, predictive states may be represented in Hilbert spaces that replicate the weak topology. We mathematically explain how these representations are particularly beneficial when reconstructing high-memory processes and connect them to reproducing kernel Hilbert spaces.
    AHAR: Adaptive CNN for Energy-efficient Human Activity Recognition in Low-power Edge Devices. (arXiv:2102.01875v2 [cs.LG] UPDATED)
(0 min) Human Activity Recognition (HAR) is one of the key applications of health monitoring that requires continuous use of wearable devices to track daily activities. This paper proposes an Adaptive CNN for energy-efficient HAR (AHAR) suitable for low-power edge devices. Unlike traditional early-exit architectures that make the exit decision based on classification confidence, AHAR proposes a novel adaptive architecture that uses an output block predictor to select a portion of the baseline architecture to use during the inference phase. Experimental results show that traditional early-exit architectures suffer from performance loss whereas our adaptive architecture provides similar or better performance than the baseline one while being energy-efficient. We validate our methodology in classifying locomotion activities from two datasets, Opportunity and w-HAR. Compared to the fog/cloud computing approaches for the Opportunity dataset, our baseline and adaptive architectures show comparable weighted F1 scores of 91.79% and 91.57%, respectively. For the w-HAR dataset, our baseline and adaptive architectures outperform the state-of-the-art works with weighted F1 scores of 97.55% and 97.64%, respectively. Evaluation on real hardware shows that our baseline architecture is significantly more energy-efficient (422.38x less) and memory-efficient (14.29x less) than the works on the Opportunity dataset. For the w-HAR dataset, our baseline architecture requires 2.04x less energy and 2.18x less memory compared to the state-of-the-art work. Moreover, experimental results show that our adaptive architecture is 12.32% (Opportunity) and 11.14% (w-HAR) more energy-efficient than our baseline while providing similar (Opportunity) or better (w-HAR) performance with no significant memory overhead.
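    The adaptive idea can be sketched as a lightweight selector choosing, per input window, between a small sub-network and the full baseline; the layer sizes and selector below are hypothetical, and both branches are executed here for simplicity, whereas a real deployment would skip the unused one to save energy.

```python
import torch
import torch.nn as nn

class AdaptiveHAR(nn.Module):
    """Hedged sketch: a tiny predictor decides, per sample, whether the small
    sub-network suffices or the full baseline should be used."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.small = nn.Sequential(nn.Conv1d(3, 8, 5), nn.ReLU(),
                                   nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, n_classes))
        self.big = nn.Sequential(nn.Conv1d(3, 32, 5), nn.ReLU(), nn.Conv1d(32, 32, 5), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, n_classes))
        self.selector = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(3, 1))

    def forward(self, x):
        use_big = torch.sigmoid(self.selector(x)) > 0.5     # per-sample block choice
        out_small, out_big = self.small(x), self.big(x)      # (both run here for clarity)
        return torch.where(use_big, out_big, out_small)

model = AdaptiveHAR()
print(model(torch.randn(4, 3, 128)).shape)   # (4, n_classes)
```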
    The Optimization of the Constant Flow Parallel Micropump Using RBF Neural Network. (arXiv:2109.08717v1 [cs.LG])
(0 min) The objective of this work is to optimize the performance of a constant flow parallel mechanical displacement micropump, which has parallel pump chambers and incorporates passive check valves. The critical task is to minimize the pressure pulse caused by regurgitation, which negatively impacts the constant flow rate, during the reciprocating motion when the left and right pumps interchange their roles of aspiration and transfusion. Previous works attempted to solve this issue via the mechanical design of passive check valves. In this work, the novel concept of overlap time is proposed, and the issue is addressed from the perspective of control theory by implementing an RBF neural network trained by both unsupervised and supervised learning. The experimental results indicate that the pressure pulse is optimized to the range of 0.15 - 0.25 MPa, which is a significant improvement compared to the maximum pump working pressure of 40 MPa.
    Machine Learning Methods for Identifying Atrial Fibrillation Cases and Their Predictors in Patients With Hypertrophic Cardiomyopathy: The HCM-AF-Risk Model. (arXiv:2109.09207v1 [cs.LG])
(0 min) Hypertrophic cardiomyopathy (HCM) patients have a high incidence of atrial fibrillation (AF) and increased stroke risk, even when their scores for congestive heart failure, hypertension, age, diabetes, and previous stroke/transient ischemic attack are low. Hence, there is a need to understand the pathophysiology of AF and stroke in HCM. In this retrospective study, we develop and apply a data-driven, machine learning based method to identify AF cases, and clinical and imaging features associated with AF, using electronic health record data. HCM patients with documented paroxysmal/persistent/permanent AF (n = 191) were considered AF cases, and the remaining patients in sinus rhythm (n = 640) were tagged as No-AF. We evaluated 93 clinical variables and selected the most informative variables for distinguishing AF from No-AF cases based on a 2-sample t test and the information gain criterion. We identified 18 highly informative variables that are positively (n = 11) and negatively (n = 7) correlated with AF in HCM. Next, patient records were represented via these 18 variables. Data imbalance resulting from the relatively low number of AF cases was addressed via a combination of oversampling and under-sampling strategies. We trained and tested multiple classifiers under this sampling approach, showing effective classification. Specifically, an ensemble of logistic regression and naive Bayes classifiers, trained based on the 18 variables and corrected for data imbalance, proved most effective for separating AF from No-AF cases (sensitivity = 0.74, specificity = 0.70, C-index = 0.80). Our model is the first machine learning based method for identification of AF cases in HCM. This model demonstrates good performance, addresses data imbalance, and suggests that AF is associated with a more severe cardiac HCM phenotype.
    Learning by Turning: Neural Architecture Aware Optimisation. (arXiv:2102.07227v2 [cs.NE] UPDATED)
    (0 min) Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is ~ square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.
    A Physics-Informed Deep Learning Paradigm for Traffic State Estimation and Fundamental Diagram Discovery. (arXiv:2106.03142v3 [cs.LG] UPDATED)
    (0 min) Traffic state estimation (TSE) bifurcates into two main categories, model-driven and data-driven (e.g., machine learning, ML) approaches, while each suffers from either deficient physics or small data. To mitigate these limitations, recent studies introduced hybrid methods, such as physics-informed deep learning (PIDL), which contains both model-driven and data-driven components. This paper contributes an improved paradigm, called physics-informed deep learning with a fundamental diagram learner (PIDL+FDL), which integrates ML terms into the model-driven component to learn a functional form of a fundamental diagram (FD), i.e., a mapping from traffic density to flow or velocity. The proposed PIDL+FDL has the advantages of performing the TSE learning, model parameter discovery, and FD discovery simultaneously. This paper focuses on highway TSE with observed data from loop detectors, using traffic density or velocity as traffic variables. We demonstrate the use of PIDL+FDL to solve popular first-order and second-order traffic flow models and reconstruct the FD relation as well as model parameters that are outside the FD term. We then evaluate the PIDL+FDL-based TSE using the Next Generation SIMulation (NGSIM) dataset. The experimental results show the superiority of the PIDL+FDL in terms of improved estimation accuracy and data efficiency over advanced baseline TSE methods, and additionally, the capacity to properly learn the unknown underlying FD relation.
    Scenario adaptive disruption prediction study for next generation burning-plasma tokamaks. (arXiv:2109.08956v1 [physics.plasm-ph])
    (0 min) Next generation high performance (HP) tokamaks risk damage from unmitigated disruptions at high current and power. Achieving reliable disruption prediction for a device's HP operation based on its low performance (LP) data is key to success. In this letter, through explorative data analysis and dedicated numerical experiments on multiple existing tokamaks, we demonstrate how the operational regimes of tokamaks can affect the power of a trained disruption predictor. First, our results suggest data-driven disruption predictors trained on abundant LP discharges work poorly on the HP regime of the same tokamak, which is a consequence of the distinct distributions of the tightly correlated signals related to disruptions in these two regimes. Second, we find that matching operational parameters among tokamaks strongly improves cross-machine accuracy, which implies that our model learns from the underlying scalings of dimensionless physics parameters like q_{95} and \beta_{p}, and confirms the importance of these parameters in disruption physics and cross machine domain matching from the data-driven perspective. Finally, our results show how, in the absence of HP data from the target devices, the best predictivity of the HP regime for the target machine can be achieved by combining LP data from the target with HP data from other machines. These results provide a possible disruption predictor development strategy for next generation tokamaks, such as ITER and SPARC, and highlight the importance of developing, on existing machines, baseline scenario discharges of future tokamaks to collect more relevant disruptive data.
    Greedy UnMixing for Q-Learning in Multi-Agent Reinforcement Learning. (arXiv:2109.09034v1 [cs.LG])
    (0 min) This paper introduces Greedy UnMix (GUM) for cooperative multi-agent reinforcement learning (MARL). Greedy UnMix aims to avoid scenarios where MARL methods fail due to overestimation of values as part of the large joint state-action space. It addresses this via a conservative Q-learning approach that restricts the state-marginal in the dataset to avoid unobserved joint state-action spaces, whilst concurrently attempting to unmix or simplify the problem space under the centralized training with decentralized execution paradigm. We demonstrate adherence to Q-function lower bounds in Q-learning for MARL scenarios, and demonstrate superior performance to existing Q-learning MARL approaches as well as more general MARL algorithms over a set of benchmark MARL tasks, despite its relative simplicity compared with state-of-the-art approaches.
    A Simple Generative Network. (arXiv:2106.09330v4 [cs.LG] UPDATED)
    (0 min) Generative neural networks are able to mimic intricate probability distributions such as those of handwritten text, natural images, etc. Since their inception, several models have been proposed. The most successful of these were based on relatively complex adversarial (GAN), auto-encoding (VAE) and maximum mean discrepancy (MMD) architectures and schemes. Surprisingly, a very simple architecture (a single feed-forward neural network) in conjunction with an obvious optimization goal (Kullback-Leibler divergence) was apparently overlooked. This paper demonstrates that such a model (denoted SGN for its simplicity) is able to generate samples visually and quantitatively competitive with the aforementioned state-of-the-art methods.
    Impact of GPU uncertainty on the training of predictive deep neural networks. (arXiv:2109.01451v2 [cs.LG] UPDATED)
    (0 min) [retracted] We found out that the difference was dependent on the Chainer library, and does not replicate with another library (pytorch) which indicates that the results are probably due to a bug in Chainer, rather than being hardware-dependent. -- old abstract Deep neural networks often present uncertainties such as hardware- and software-derived noise and randomness. We studied the effects of such uncertainty on learning outcomes, with a particular focus on the function of graphics processing units (GPUs), and found that GPU-induced uncertainty increased learning accuracy of a certain deep neural network. When training a predictive deep neural network using only the CPU without the GPU, the learning error is higher than when training the same number of epochs using the GPU, suggesting that the GPU plays a different role in the learning process than just increasing the computational speed. Because this effect cannot be observed in learning by a simple autoencoder, it could be a phenomenon specific to certain types of neural networks. GPU-specific computational processing is more indeterminate than that by CPUs, and hardware-derived uncertainties, which are often considered obstacles that need to be eliminated, might, in some cases, be successfully incorporated into the training of deep neural networks. Moreover, such uncertainties might be interesting phenomena to consider in brain-related computational processing, which comprises a large mass of uncertain signals.
    Releasing Graph Neural Networks with Differential Privacy Guarantees. (arXiv:2109.08907v1 [cs.LG])
    (0 min) With the increasing popularity of Graph Neural Networks (GNNs) in several sensitive applications like healthcare and medicine, concerns have been raised over the privacy aspects of trained GNNs. More notably, GNNs are vulnerable to privacy attacks, such as membership inference attacks, even if only blackbox access to the trained model is granted. To build defenses, differential privacy has emerged as a mechanism to disguise the sensitive data in training datasets. Following the strategy of Private Aggregation of Teacher Ensembles (PATE), recent methods leverage a large ensemble of teacher models. These teachers are trained on disjoint subsets of private data and are employed to transfer knowledge to a student model, which is then released with privacy guarantees. However, splitting graph data into many disjoint training sets may destroy the structural information and adversely affect accuracy. We propose a new graph-specific scheme of releasing a student GNN, which avoids splitting private training data altogether. The student GNN is trained using public data, partly labeled privately using the teacher GNN models trained exclusively for each query node. We theoretically analyze our approach in the Rényi differential privacy framework and provide privacy guarantees. Besides, we show the solid experimental performance of our method compared to several baselines, including the PATE baseline adapted for graph-structured data. Our anonymized code is available.
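    For context, below is a minimal sketch of the generic PATE noisy-aggregation step that this line of work builds on (a noisy argmax over teacher votes); it is not the paper's graph-specific scheme, and the vote counts are synthetic.
```python
# Generic PATE-style noisy label aggregation sketch (not the paper's method).
import numpy as np

def noisy_aggregate(teacher_votes, num_classes, epsilon, rng):
    """teacher_votes: per-teacher predicted labels for one query node."""
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=2.0 / epsilon, size=num_classes)  # Laplace noise
    return int(np.argmax(counts))                                 # released label

rng = np.random.default_rng(0)
votes = rng.integers(0, 3, size=50)        # 50 teachers, 3 classes (synthetic)
print(noisy_aggregate(votes, num_classes=3, epsilon=1.0, rng=rng))
```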
    Generalized Translation and Scale Invariant Online Algorithm for Adversarial Multi-Armed Bandits. (arXiv:2109.09212v1 [cs.LG])
    (0 min) We study the adversarial multi-armed bandit problem and create a completely online algorithmic framework that is invariant under arbitrary translations and scales of the arm losses. We study the expected performance of our algorithm against a generic competition class, which makes it applicable for a wide variety of problem scenarios. Our algorithm works from a universal prediction perspective and the performance measure used is the expected regret against arbitrary arm selection sequences, which is the difference between our losses and a competing loss sequence. The competition class can be designed to include fixed arm selections, switching bandits, contextual bandits, or any other competition of interest. The sequences in the competition class are generally determined by the specific application at hand and should be designed accordingly. Our algorithm neither uses nor needs any preliminary information about the loss sequences and is completely online. Its performance bounds are the second order bounds in terms of sum of the squared losses, where any affine transform of the losses has no effect on the normalized regret.
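    The following is not the paper's algorithm: it is a minimal Exp3 baseline for the same adversarial multi-armed bandit setting, included only as a concrete point of reference. The adversary and parameters are synthetic.
```python
# Classical Exp3 sketch for adversarial bandits (reference baseline only).
import numpy as np

def exp3(loss_fn, n_arms, horizon, eta, rng):
    weights = np.ones(n_arms)
    total_loss = 0.0
    for t in range(horizon):
        probs = weights / weights.sum()
        arm = rng.choice(n_arms, p=probs)
        loss = loss_fn(t, arm)                      # observed loss in [0, 1]
        total_loss += loss
        est = np.zeros(n_arms)
        est[arm] = loss / probs[arm]                # importance-weighted loss estimate
        weights *= np.exp(-eta * est)
        weights /= weights.max()                    # keep weights numerically stable
    return total_loss

rng = np.random.default_rng(0)
adversary = lambda t, arm: float(arm != t % 3)      # arm (t mod 3) is best at round t
print(exp3(adversary, n_arms=3, horizon=1000, eta=0.05, rng=rng))
```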
    Predicting speech intelligibility from EEG using a dilated convolutional network. (arXiv:2105.06844v3 [eess.AS] UPDATED)
    (0 min) Objective: Currently, only behavioral speech understanding tests are available, which require active participation of the person being tested. As this is infeasible for certain populations, an objective measure of speech intelligibility is required. Recently, brain imaging data has been used to establish a relationship between stimulus and brain response. Linear models have been successfully linked to speech intelligibility but require per-subject training. We present a deep-learning-based model incorporating dilated convolutions that operates in a match/mismatch paradigm. The accuracy of the model's match/mismatch predictions can be used as a proxy for speech intelligibility without subject-specific (re)training. Approach: We evaluated the performance of the model as a function of input segment length, EEG frequency band and receptive field size while comparing it to multiple baseline models. Next, we evaluated performance on held-out data and finetuning. Finally, we established a link between the accuracy of our model and the state-of-the-art behavioral MATRIX test. Main results: The dilated convolutional model significantly outperformed the baseline models for every input segment length, for all EEG frequency bands except the delta and theta band, and receptive field sizes between 250 and 500 ms. Additionally, finetuning significantly increased the accuracy on a held-out dataset. Finally, a significant correlation (r=0.59, p=0.0154) was found between the speech reception threshold estimated using the behavioral MATRIX test and our objective method. Significance: Our method is the first to predict the speech reception threshold from EEG for unseen subjects, contributing to objective measures of speech intelligibility.
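    To make the match/mismatch setup concrete, here is a minimal PyTorch sketch in the spirit of the abstract: EEG and a speech envelope are embedded with dilated convolutions and compared by cosine similarity. All layer sizes, sampling rates, and the mismatch construction are illustrative assumptions, not the authors' exact model.
```python
# Dilated-convolution match/mismatch sketch (illustrative, not the paper's model).
import torch
import torch.nn as nn

class DilatedMatchMismatch(nn.Module):
    def __init__(self, eeg_channels=64, hidden=16):
        super().__init__()
        def stack(in_ch):
            # Padding equals dilation so the sequence length is preserved.
            return nn.Sequential(
                nn.Conv1d(in_ch, hidden, 3, dilation=1, padding=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, 3, dilation=3, padding=3), nn.ReLU(),
                nn.Conv1d(hidden, hidden, 3, dilation=9, padding=9), nn.ReLU(),
            )
        self.eeg_net = stack(eeg_channels)
        self.env_net = stack(1)

    def forward(self, eeg, env_match, env_mismatch):
        e = self.eeg_net(eeg)
        sims = [torch.cosine_similarity(e, self.env_net(env), dim=1).mean(dim=-1)
                for env in (env_match, env_mismatch)]
        return torch.stack(sims, dim=1)              # logits over {match, mismatch}

model = DilatedMatchMismatch()
eeg = torch.randn(8, 64, 320)                        # 8 EEG segments (synthetic)
env = torch.randn(8, 1, 320)                         # matching speech envelopes
print(model(eeg, env, torch.roll(env, 160, dims=-1)).shape)   # torch.Size([8, 2])
```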
    Learning to Regrasp by Learning to Place. (arXiv:2109.08817v1 [cs.RO])
    (0 min) In this paper, we explore whether a robot can learn to regrasp a diverse set of objects to achieve various desired grasp poses. Regrasping is needed whenever a robot's current grasp pose fails to perform desired manipulation tasks. Endowing robots with such an ability has applications in many domains such as manufacturing or domestic services. Yet, it is a challenging task due to the large diversity of geometry in everyday objects and the high dimensionality of the state and action space. In this paper, we propose a system for robots to take partial point clouds of an object and the supporting environment as inputs and output a sequence of pick-and-place operations to transform an initial object grasp pose to the desired object grasp poses. The key technique includes a neural stable placement predictor and a regrasp graph based solution through leveraging and changing the surrounding environment. We introduce a new and challenging synthetic dataset for learning and evaluating the proposed approach. In this dataset, we show that our system is able to achieve 73.3% success rate of regrasping diverse objects.
    A Study of the Generalizability of Self-Supervised Representations. (arXiv:2109.09150v1 [cs.LG])
    (2 min) Recent advancements in self-supervised learning (SSL) made it possible to learn generalizable visual representations from unlabeled data. The performance of Deep Learning models fine-tuned on pretrained SSL representations is on par with models fine-tuned on the state-of-the-art supervised learning (SL) representations. Irrespective of the progress made in SSL, its generalizability has not been studied extensively. In this article, we perform a deeper analysis of the generalizability of pretrained SSL and SL representations by conducting a domain-based study for transfer learning classification tasks. The representations are learned from the ImageNet source data, which are then fine-tuned using two types of target datasets: similar to the source dataset, and significantly different from the source dataset. We study generalizability of the SSL and SL-based models via their prediction accuracy as well as prediction confidence. In addition to this, we analyze the attribution of the final convolutional layer of these models to understand how they reason about the semantic identity of the data. We show that the SSL representations are more generalizable as compared to the SL representations. We explain the generalizability of the SSL representations by investigating its invariance property, which is shown to be better than that observed in the SL representations.
    MM-Deacon: Multimodal molecular domain embedding analysis via contrastive learning. (arXiv:2109.08830v1 [cs.LG])
    (0 min) Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have been popular as an alternative to traditional expert-designed features to encode molecules. However, these approaches only utilize a single modality for representing molecules. Driven by the fact that a given molecule can be described through different modalities such as the Simplified Molecular-Input Line-Entry System (SMILES), International Union of Pure and Applied Chemistry (IUPAC) nomenclature, and the IUPAC International Chemical Identifier (InChI), we propose a multimodal molecular embedding generation approach called MM-Deacon (multimodal molecular domain embedding analysis via contrastive learning). MM-Deacon is trained using SMILES and IUPAC molecule representations as two different modalities. First, SMILES and IUPAC strings are encoded by using two different transformer-based language models independently, then the contrastive loss is utilized to bring these encoded representations from different modalities closer to each other if they belong to the same molecule, and to push embeddings farther from each other if they belong to different molecules. We evaluate the robustness of our molecule embeddings on molecule clustering, cross-modal molecule search, drug similarity assessment and drug-drug interaction tasks.
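    A minimal sketch of a symmetric cross-modal contrastive (InfoNCE-style) objective of the kind described: positive pairs sit on the diagonal of the similarity matrix, everything else is a negative. Plain linear layers stand in for the paper's transformer encoders, and the temperature is an assumed value.
```python
# Symmetric cross-modal contrastive loss sketch (stand-in encoders).
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(z_smiles, z_iupac, temperature=0.07):
    z1 = F.normalize(z_smiles, dim=-1)
    z2 = F.normalize(z_iupac, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) cross-modal similarities
    targets = torch.arange(z1.size(0))            # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

enc_smiles, enc_iupac = nn.Linear(128, 64), nn.Linear(128, 64)   # stand-in encoders
x_smiles, x_iupac = torch.randn(32, 128), torch.randn(32, 128)   # batch of 32 molecules
loss = contrastive_loss(enc_smiles(x_smiles), enc_iupac(x_iupac))
loss.backward()
print(float(loss))
```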
    Multi-class Text Classification using BERT-based Active Learning. (arXiv:2104.14289v2 [cs.IR] UPDATED)
    (2 min) Text Classification finds interesting applications in the pickup and delivery services industry where customers require one or more items to be picked up from a location and delivered to a certain destination. Classifying these customer transactions into multiple categories helps understand the market needs for different customer segments. Each transaction is accompanied by a text description provided by the customer to describe the products being picked up and delivered which can be used to classify the transaction. BERT-based models have proven to perform well in Natural Language Understanding. However, the product descriptions provided by the customers tend to be short, incoherent and code-mixed (Hindi-English) text which demands fine-tuning of such models with manually labelled data to achieve high accuracy. Collecting this labelled data can prove to be expensive. In this paper, we explore Active Learning strategies to label transaction descriptions cost effectively while using BERT to train a transaction classification model. On TREC-6, AG's News Corpus and an internal dataset, we benchmark the performance of BERT across different Active Learning strategies in Multi-Class Text Classification.
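    For readers new to the setup, here is a minimal pool-based active learning loop with least-confidence sampling, one of the standard strategies typically benchmarked in work like this; a logistic regression model stands in for BERT and the data are synthetic.
```python
# Pool-based active learning sketch with least-confidence sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)
labeled = list(range(40))                                  # small seed set
pool = [i for i in range(len(y)) if i not in labeled]

for round_ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    conf = clf.predict_proba(X[pool]).max(axis=1)          # model confidence on pool
    query = [pool[i] for i in np.argsort(conf)[:50]]       # 50 least-confident items
    labeled += query                                       # "annotate" them
    pool = [i for i in pool if i not in query]
    print(f"round {round_}: accuracy on remaining pool = {clf.score(X[pool], y[pool]):.3f}")
```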
    Identifying Ventricular Arrhythmias and Their Predictors by Applying Machine Learning Methods to Electronic Health Records in Patients With Hypertrophic Cardiomyopathy(HCM-VAr-Risk Model). (arXiv:2109.09210v1 [cs.LG])
    (3 min) Clinical risk stratification for sudden cardiac death (SCD) in hypertrophic cardiomyopathy (HC) employs rules derived from American College of Cardiology Foundation/American Heart Association (ACCF/AHA) guidelines or the HCM Risk-SCD model (C-index of 0.69), which utilize a few clinical variables. We assessed whether data-driven machine learning methods that consider a wider range of variables can effectively identify HC patients with ventricular arrhythmias (VAr) that lead to SCD. We scanned the electronic health records of 711 HC patients for sustained ventricular tachycardia or ventricular fibrillation. Patients with ventricular tachycardia or ventricular fibrillation (n = 61) were tagged as VAr cases and the remaining (n = 650) as non-VAr. The 2-sample t test and information gain criterion were used to identify the most informative clinical variables that distinguish VAr from non-VAr; patient records were reduced to include only these variables. Data imbalance stemming from the low number of VAr cases was addressed by applying a combination of over- and under-sampling strategies. We trained and tested multiple classifiers under this sampling approach, showing effective classification. We evaluated 93 clinical variables, of which 22 proved predictive of VAr. The ensemble of logistic regression and naive Bayes classifiers, trained based on these 22 variables and corrected for data imbalance, was most effective in separating VAr from non-VAr cases (sensitivity = 0.73, specificity = 0.76, C-index = 0.83). Our method (HCM-VAr-Risk Model) identified 12 new predictors of VAr, in addition to 10 established SCD predictors. In conclusion, this is the first application of machine learning for identifying HC patients with VAr, using clinical attributes.
    Improving Fairness for Data Valuation in Federated Learning. (arXiv:2109.09046v1 [cs.LG])
    (2 min) Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. The success of federated learning depends largely on the participation of data owners. To sustain and encourage data owners' participation, it is crucial to fairly evaluate the quality of the data provided by the data owners and reward them correspondingly. Federated Shapley value, recently proposed by Wang et al. [Federated Learning, 2020], is a measure for data value under the framework of federated learning that satisfies many desired properties for data valuation. However, there are still factors of potential unfairness in the design of federated Shapley value because two data owners with the same local data may not receive the same evaluation. We propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. The design depends on completing a matrix consisting of all the possible contributions by different subsets of the data owners. It is shown under mild conditions that this matrix is approximately low-rank by leveraging concepts and tools from optimization. Both theoretical analysis and empirical evaluation verify that the proposed measure does improve fairness in many circumstances.
    Deep Inertial Navigation using Continuous Domain Adaptation and Optimal Transport. (arXiv:2106.15178v2 [cs.LG] UPDATED)
    (0 min) In this paper, we propose a new strategy for learning inertial robotic navigation models. The proposed strategy enhances the generalisability of end-to-end inertial modelling, and is aimed at wheeled robotic deployments. Concretely, the paper describes the following. (1) Using precision robotics, we empirically characterise the effect of changing the sensor position during navigation on the distribution of raw inertial signals, as well as the corresponding impact on learnt latent spaces. (2) We propose neural architectures and algorithms to assimilate knowledge from an indexed set of sensor positions in order to enhance the robustness and generalisability of robotic inertial tracking in the field. Our scheme of choice uses continuous domain adaptation (DA) and optimal transport (OT). (3) In our evaluation, continuous OT DA outperforms a continuous adversarial DA baseline, while also showing quantifiable learning benefits over simple data augmentation. We will release our dataset to help foster future research.
    Learning and Adaptation for Millimeter-Wave Beam Tracking and Training: a Dual Timescale Variational Framework. (arXiv:2107.05466v2 [cs.LG] UPDATED)
    (0 min) Millimeter-wave vehicular networks incur enormous beam-training overhead to enable narrow-beam communications. This paper proposes a learning and adaptation framework in which the dynamics of the communication beams are learned and then exploited to design adaptive beam-tracking and training with low overhead: on a long-timescale, a deep recurrent variational autoencoder (DR-VAE) uses noisy beam-training feedback to learn a probabilistic model of beam dynamics and enable predictive beam-tracking; on a short-timescale, an adaptive beam-training procedure is formulated as a partially observable (PO-) Markov decision process (MDP) and optimized via point-based value iteration (PBVI) by leveraging beam-training feedback and a probabilistic prediction of the strongest beam pair provided by the DR-VAE. In turn, beam-training feedback is used to refine the DR-VAE via stochastic gradient ascent in a continuous process of learning and adaptation. The proposed DR-VAE learning framework learns accurate beam dynamics: it reduces the Kullback-Leibler divergence between the ground truth and the learned model of beam dynamics by 95% over the Baum-Welch algorithm and a naive learning approach that neglects feedback errors. Numerical results on a line-of-sight (LOS) scenario with multipath reveal that the proposed dual timescale approach yields near-optimal spectral efficiency, and improves it by 130% over a policy that scans exhaustively over the dominant beam pairs, and by 20% over a state-of-the-art POMDP policy. Finally, a low-complexity policy is proposed by reducing the POMDP to an error-robust MDP, and is shown to perform well in regimes with infrequent feedback errors.
    WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU. (arXiv:2108.13976v2 [cs.LG] UPDATED)
    (0 min) Deep reinforcement learning (RL) is a powerful framework to train decision-making models in complex dynamical environments. However, RL can be slow as it learns through repeated interaction with a simulation of the environment. Accelerating RL requires both algorithmic and engineering innovations. In particular, there are key systems engineering bottlenecks when using RL in complex environments that feature multiple agents or high-dimensional state, observation, or action spaces, for example. We present WarpDrive, a flexible, lightweight, and easy-to-use open-source RL framework that implements end-to-end multi-agent RL on a single GPU (Graphics Processing Unit), building on PyCUDA and PyTorch. Using the extreme parallelization capability of GPUs, WarpDrive enables orders-of-magnitude faster RL compared to common implementations that blend CPU simulations and GPU models. Our design runs simulations and the agents in each simulation in parallel. It eliminates data copying between CPU and GPU. It also uses a single simulation data store on the GPU that is safely updated in-place. Together, this allows the user to run thousands of concurrent multi-agent simulations and train on extremely large batches of experience. For example, WarpDrive yields 2.9 million environment steps/second with 2000 environments and 1000 agents (at least 100x higher throughput compared to a CPU implementation) in a benchmark Tag simulation. WarpDrive provides a lightweight Python interface and environment wrappers to simplify usage and promote flexibility and extensions. As such, WarpDrive provides a framework for building high-throughput RL systems.
    Time Series Data Augmentation for Deep Learning: A Survey. (arXiv:2002.12478v3 [cs.LG] UPDATED)
    (0 min) Deep learning has recently performed remarkably well on many time series analysis tasks. The superior performance of deep neural networks relies heavily on a large amount of training data to avoid overfitting. However, the labeled data of many real-world time series applications may be limited, such as classification in medical time series and anomaly detection in AIOps. As an effective way to enhance the size and quality of the training data, data augmentation is crucial to the successful application of deep learning models on time series data. In this paper, we systematically review different data augmentation methods for time series. We propose a taxonomy for the reviewed methods, and then provide a structured review for these methods by highlighting their strengths and limitations. We also empirically compare different data augmentation methods for different tasks including time series classification, anomaly detection, and forecasting. Finally, we discuss and highlight five future directions to provide useful research guidance.
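    As a concrete illustration, here are minimal implementations of three common time-domain augmentations of the kind surveyed in such work: jittering, magnitude scaling, and window slicing. The parameter values are illustrative defaults, not recommendations from the paper.
```python
# Three basic time-series augmentations: jitter, scale, window slice.
import numpy as np

def jitter(x, sigma=0.03, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)        # additive noise

def scale(x, sigma=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return x * rng.normal(1.0, sigma)                       # random magnitude scaling

def window_slice(x, ratio=0.9, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    length = int(len(x) * ratio)
    start = rng.integers(0, len(x) - length + 1)
    crop = x[start:start + length]
    # Stretch the crop back to the original length by linear interpolation.
    return np.interp(np.linspace(0, length - 1, len(x)), np.arange(length), crop)

series = np.sin(np.linspace(0, 6 * np.pi, 200))
augmented = window_slice(scale(jitter(series)))
print(augmented.shape)      # (200,)
```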
    Conflict-free collective stochastic decision making by orbital angular momentum of photons through quantum interference. (arXiv:2107.00877v2 [quant-ph] UPDATED)
    (0 min) In recent cross-disciplinary studies involving both optics and computing, single-photon-based decision-making has been demonstrated by utilizing the wave-particle duality of light to solve multi-armed bandit problems. Furthermore, entangled-photon-based decision-making has managed to solve a competitive multi-armed bandit problem in such a way that conflicts of decisions among players are avoided while ensuring equality. However, as these studies are based on the polarization of light, the number of available choices is limited to two, corresponding to two orthogonal polarization states. Here we propose a scalable principle to solve competitive decision-making situations by using the orbital angular momentum of photons based on its high dimensionality, which theoretically allows an unlimited number of arms. Moreover, by extending the Hong-Ou-Mandel effect to more than two states, we theoretically establish an experimental configuration able to generate multi-photon states with orbital angular momentum and conditions that provide conflict-free selections at every turn. We numerically examine total rewards regarding three-armed bandit problems, for which the proposed strategy accomplishes almost the theoretical maximum, which is greater than a conventional mixed strategy intending to realize Nash equilibrium. This is thanks to the quantum interference effect that achieves no-conflict selections, even in the exploring phase to find the best arms.
    Ontology-based n-ball Concept Embeddings Informing Few-shot Image Classification. (arXiv:2109.09063v1 [cs.CV])
    (2 min) We propose a novel framework named ViOCE that integrates ontology-based background knowledge in the form of $n$-ball concept embeddings into a neural network based vision architecture. The approach consists of two components - converting symbolic knowledge of an ontology into continuous space by learning n-ball embeddings that capture properties of subsumption and disjointness, and guiding the training and inference of a vision model using the learnt embeddings. We evaluate ViOCE using the task of few-shot image classification, where it demonstrates superior performance on two standard benchmarks.
    Online Multiobjective Minimax Optimization and Applications. (arXiv:2108.03837v2 [cs.LG] UPDATED)
    (0 min) We introduce a simple but general online learning framework, in which at every round, an adaptive adversary introduces a new game, consisting of an action space for the learner, an action space for the adversary, and a vector valued objective function that is convex-concave in every coordinate. The learner and the adversary then play in this game. The learner's goal is to play so as to minimize the maximum coordinate of the cumulative vector-valued loss. The resulting one-shot game is not convex-concave, and so the minimax theorem does not apply. Nevertheless, we give a simple algorithm that can compete with the setting in which the adversary must announce their action first, with optimally diminishing regret. We demonstrate the power of our simple framework by using it to derive optimal bounds and algorithms across a variety of domains. This includes no regret learning: we can recover optimal algorithms and bounds for minimizing external regret, internal regret, adaptive regret, multigroup regret, subsequence regret, and a notion of regret in the sleeping experts setting. Next, we use it to derive a variant of Blackwell's Approachability Theorem, which we term "Fast Polytope Approachability". Finally, we are able to recover recently derived algorithms and bounds for online adversarial multicalibration and related notions (mean-conditioned moment multicalibration, and prediction interval multivalidity).
    Relative Entropy-Regularized Optimal Transport on a Graph: a new algorithm and an experimental comparison. (arXiv:2108.10004v2 [cs.LG] UPDATED)
    (0 min) Following [21, 23], the present work investigates a new relative entropy-regularized algorithm for solving the optimal transport on a graph problem within the randomized shortest paths formalism. More precisely, a unit flow is injected into a set of input nodes and collected from a set of output nodes while minimizing the expected transportation cost together with a paths relative entropy regularization term, providing a randomized routing policy. The main advantage of this new formulation is the fact that it can easily accommodate edge flow capacity constraints which commonly occur in real-world problems. The resulting optimal routing policy, i.e., the probability distribution of following an edge in each node, is Markovian and is computed by constraining the input and output flows to the prescribed marginal probabilities thanks to a variant of the algorithm developed in [8]. In addition, experimental comparisons with other recently developed techniques show that the distance measure between nodes derived from the introduced model provides competitive results on semi-supervised classification tasks.
    A Comparative Study of Algorithms for Intelligent Traffic Signal Control. (arXiv:2109.00937v2 [eess.SY] UPDATED)
    (0 min) In this paper, methods have been explored to effectively optimise traffic signal control to minimise waiting times and queue lengths, thereby increasing traffic flow. The traffic intersection was first defined as a Markov Decision Process, and a state representation, actions and rewards were chosen. Simulation of Urban MObility (SUMO) was used to simulate an intersection and then compare a Round Robin Scheduler, a Feedback Control mechanism and two Reinforcement Learning techniques - Deep Q Network (DQN) and Advantage Actor-Critic (A2C), as the policy for the traffic signal in the simulation under different scenarios. Finally, the methods were tested on a simulation of a real-world intersection in Bengaluru, India.
    Sibyl: Understanding and Addressing the Usability Challenges of Machine Learning In High-Stakes Decision Making. (arXiv:2103.02071v2 [cs.HC] CROSS LISTED)
    (0 min) Machine learning (ML) is being applied to a diverse and ever-growing set of domains. In many cases, domain experts - who often have no expertise in ML or data science - are asked to use ML predictions to make high-stakes decisions. Multiple ML usability challenges can appear as a result, such as lack of user trust in the model, inability to reconcile human-ML disagreement, and ethical concerns about oversimplification of complex problems to a single algorithm output. In this paper, we investigate the ML usability challenges that arise in the domain of child welfare screening through a series of collaborations with child welfare screeners. Following the iterative design process between the ML scientists, visualization researchers, and domain experts (child screeners), we first identified four key ML challenges and honed in on one promising explainable ML technique to address them (local factor contributions). Then we implemented and evaluated our visual analytics tool, Sibyl, to increase the interpretability and interactivity of local factor contributions. The effectiveness of our tool is demonstrated by two formal user studies with 12 non-expert participants and 13 expert participants, respectively. Valuable feedback was collected, from which we composed a list of design implications as a useful guideline for researchers who aim to develop an interpretable and interactive visualization tool for ML prediction models deployed for child welfare screeners and other similar domain experts.
    ComicGAN: Text-to-Comic Generative Adversarial Network. (arXiv:2109.09120v1 [cs.CV])
    (2 min) Drawing and annotating comic illustrations is a complex and difficult process. No existing machine learning algorithms have been developed to create comic illustrations based on descriptions of illustrations, or the dialogue in comics. Moreover, it is not known if a generative adversarial network (GAN) can generate original comics that correspond to the dialogue and/or descriptions. GANs are successful in producing photo-realistic images, but this technology does not necessarily translate to generation of flawless comics. What is more, comic evaluation is a prominent challenge as common metrics such as Inception Score will not perform comparably, as they are designed to work on photos. In this paper: 1. We implement ComicGAN, a novel text-to-comic pipeline based on a text-to-image GAN that synthesizes comics according to text descriptions. 2. We describe an in-depth empirical study of the technical difficulties of comic generation using GANs. ComicGAN has two novel features: (i) text description creation from labels via permutation and augmentation, and (ii) custom image encoding with Convolutional Neural Networks. We extensively evaluate the proposed ComicGAN in two scenarios, namely image generation from descriptions, and image generation from dialogue. Our results on 1000 Dilbert comic panels and 6000 descriptions show that synthetic comic panels from text inputs resemble original Dilbert panels. Novel methods for text description creation and custom image encoding brought improvements to Frechet Inception Distance, detail, and overall image quality over baseline algorithms. Generating illustrations from descriptions provided clear comics including characters and colours that were specified in the descriptions.
    Sketch Your Own GAN. (arXiv:2108.02774v2 [cs.CV] UPDATED)
    (0 min) Can a user create a deep generative model by sketching a single example? Traditionally, creating a GAN model has required the collection of a large-scale dataset of exemplars and specialized knowledge in deep learning. In contrast, sketching is possibly the most universally accessible way to convey a visual concept. In this work, we present a method, GAN Sketching, for rewriting GANs with one or more sketches, to make GANs training easier for novice users. In particular, we change the weights of an original GAN model according to user sketches. We encourage the model's output to match the user sketches through a cross-domain adversarial loss. Furthermore, we explore different regularization methods to preserve the original model's diversity and image quality. Experiments have shown that our method can mold GANs to match shapes and poses specified by sketches while maintaining realism and diversity. Finally, we demonstrate a few applications of the resulting GAN, including latent space interpolation and image editing.
    Atrial Fibrillation: A Medical and Technological Review. (arXiv:2109.08974v1 [cs.LG])
    (2 min) Atrial Fibrillation (AF) is the most common type of arrhythmia (Greek a-, loss + rhythmos, rhythm = loss of rhythm) leading to hospitalization in the United States. Though sometimes AF is asymptomatic, it increases the risk of stroke and heart failure in patients, in addition to lowering the health-related quality of life (HRQOL). AF-related care costs the healthcare system between $6.0 and $26 billion each year. Early detection of AF and clinical attention can help improve symptoms and HRQOL of the patient, as well as bring down the cost of care. However, the prevalent paradigm of AF detection depends on an electrocardiogram (ECG) recorded at a single point in time and does not shed light on the relation of the symptoms with heart rhythm or AF. In the past decade, due to the democratization of health monitors and the advent of high-performing computers, Machine Learning algorithms have proven effective in identifying AF from the ECG of patients. This paper provides an overview of the symptoms of AF, its diagnosis, and future prospects for research in the field.
    Combining Reinforcement Learning with Model Predictive Control for On-Ramp Merging. (arXiv:2011.08484v2 [cs.RO] UPDATED)
    (0 min) We consider the problem of designing an algorithm to allow a car to autonomously merge on to a highway from an on-ramp. Two broad classes of techniques have been proposed to solve motion planning problems in autonomous driving: Model Predictive Control (MPC) and Reinforcement Learning (RL). In this paper, we first establish the strengths and weaknesses of state-of-the-art MPC and RL-based techniques through simulations. We show that the performance of the RL agent is worse than that of the MPC solution from the perspective of safety and robustness to out-of-distribution traffic patterns, i.e., traffic patterns which were not seen by the RL agent during training. On the other hand, the performance of the RL agent is better than that of the MPC solution when it comes to efficiency and passenger comfort. We subsequently present an algorithm which blends the model-free RL agent with the MPC solution and show that it provides better trade-offs between all metrics -- passenger comfort, efficiency, crash rate and robustness.
    Task-Driven Out-of-Distribution Detection with Statistical Guarantees for Robot Learning. (arXiv:2106.13703v2 [cs.RO] UPDATED)
    (0 min) Our goal is to perform out-of-distribution (OOD) detection, i.e., to detect when a robot is operating in environments that are drawn from a different distribution than the environments used to train the robot. We leverage Probably Approximately Correct (PAC)-Bayes theory in order to train a policy with a guaranteed bound on performance on the training distribution. Our key idea for OOD detection then relies on the following intuition: violation of the performance bound on test environments provides evidence that the robot is operating OOD. We formalize this via statistical techniques based on p-values and concentration inequalities. The resulting approach (i) provides guaranteed confidence bounds on OOD detection, and (ii) is task-driven and sensitive only to changes that impact the robot's performance. We demonstrate our approach on a simulated example of grasping objects with unfamiliar poses or shapes. We also present both simulation and hardware experiments for a drone performing vision-based obstacle avoidance in unfamiliar environments (including wind disturbances and different obstacle densities). Our examples demonstrate that we can perform task-driven OOD detection within just a handful of trials. Comparisons with baselines also demonstrate the advantages of our approach in terms of providing statistical guarantees and being insensitive to task-irrelevant distribution shifts.
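    To make the statistical idea concrete, here is a minimal sketch of the kind of test described: a guaranteed success-rate bound on the training distribution is compared against the empirical success rate over a handful of test trials, and a Hoeffding-style p-value quantifies the evidence for operating OOD. The numbers are illustrative, not the paper's.
```python
# Hoeffding-style p-value for bound-violation OOD detection (illustrative).
import numpy as np

def ood_p_value(successes, trials, guaranteed_rate):
    """P(seeing a success rate this low if the true rate met the bound)."""
    observed = successes / trials
    if observed >= guaranteed_rate:
        return 1.0                               # no evidence of violation
    # Hoeffding tail bound for a Bernoulli mean falling below its guarantee.
    return float(np.exp(-2.0 * trials * (guaranteed_rate - observed) ** 2))

# e.g. a PAC-Bayes bound guarantees >= 90% success in-distribution,
# but only 3 of 10 test trials succeed:
print(ood_p_value(successes=3, trials=10, guaranteed_rate=0.9))  # ~7e-4, likely OOD
```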
    The shape and simplicity biases of adversarially robust ImageNet-trained CNNs. (arXiv:2006.09373v5 [cs.CV] UPDATED)
    (0 min) Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially-robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across AlexNet, GoogLeNet, and ResNet-50 architectures. We found that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of 'robustifying' the network. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features i.e. textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings e.g. why R networks benefit from much larger capacity (Xie and Yuille, 2020) and can act as a strong image prior in image synthesis (Santurkar et al., 2019).
    Diagrammatic Differentiation for Quantum Machine Learning. (arXiv:2103.07960v3 [quant-ph] UPDATED)
    (0 min) We introduce diagrammatic differentiation for tensor calculus by generalising the dual number construction from rigs to monoidal categories. Applying this to ZX diagrams, we show how to calculate diagrammatically the gradient of a linear map with respect to a phase parameter. For diagrams of parametrised quantum circuits, we get the well-known parameter-shift rule at the basis of many variational quantum algorithms. We then extend our method to the automatic differentiation of hybrid classical-quantum circuits, using diagrams with bubbles to encode arbitrary non-linear operators. Moreover, diagrammatic differentiation comes with an open-source implementation in DisCoPy, the Python library for monoidal categories. Diagrammatic gradients of classical-quantum circuits can then be simplified using the PyZX library and executed on quantum hardware via the tket compiler. This opens the door to many practical applications harnessing both the structure of string diagrams and the computational power of quantum machine learning.
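    The parameter-shift rule mentioned here can be checked on the simplest possible case: the expectation of Z after an RY(theta) rotation of |0>, which equals cos(theta). The sketch below uses plain NumPy rather than DisCoPy or any quantum SDK.
```python
# Parameter-shift rule sanity check on a single-qubit expectation value.
import numpy as np

def expectation_z(theta):
    # |psi> = RY(theta)|0>, so <psi|Z|psi> = cos(theta)
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    Z = np.diag([1.0, -1.0])
    return psi @ Z @ psi

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    # d f / d theta = [f(theta + pi/2) - f(theta - pi/2)] / 2
    return 0.5 * (f(theta + shift) - f(theta - shift))

theta = 0.7
print(parameter_shift_grad(expectation_z, theta))   # matches the exact gradient
print(-np.sin(theta))                                # d/dtheta cos(theta)
```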
    Co-occurrence of medical conditions: Exposing patterns through probabilistic topic modeling of SNOMED codes. (arXiv:2109.09199v1 [cs.LG])
    (0 min) Patients associated with multiple co-occurring health conditions often face aggravated complications and less favorable outcomes. Co-occurring conditions are especially prevalent among individuals suffering from kidney disease, an increasingly widespread condition affecting 13% of the general population in the US. This study aims to identify and characterize patterns of co-occurring medical conditions in patients, employing a probabilistic framework. Specifically, we apply topic modeling in a non-traditional way to find associations across SNOMED CT codes assigned and recorded in the EHRs of >13,000 patients diagnosed with kidney disease. Unlike most prior work on topic modeling, we apply the method to codes rather than to natural language. Moreover, we quantitatively evaluate the topics, assessing their tightness and distinctiveness, and also assess the medical validity of our results. Our experiments show that each topic is succinctly characterized by a few highly probable and unique disease codes, indicating that the topics are tight. Furthermore, inter-topic distance between each pair of topics is typically high, illustrating distinctiveness. Last, most coded conditions grouped together within a topic are indeed reported to co-occur in the medical literature. Notably, our results uncover a few indirect associations among conditions that have hitherto not been reported as correlated in the medical literature.
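    A minimal sketch of topic modeling over codes rather than natural language, as described here: each patient is a "document" whose tokens are diagnosis codes, and LDA recovers co-occurrence topics. The codes and their groupings below are made-up illustrations, not real clinical data.
```python
# LDA over code "documents" instead of natural-language text (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

patients = [                               # each string = one patient's code list
    "10001 10002 10003",
    "10001 10002",
    "20001 20002 10003",
    "20001 20002",
]
counts = CountVectorizer(token_pattern=r"\S+").fit_transform(patients)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).round(2))      # per-patient topic mixtures
```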
    A Unified Convergence Analysis for Shuffling-Type Gradient Methods. (arXiv:2002.08246v2 [math.OC] UPDATED)
    (0 min) In this paper, we propose a unified convergence analysis for a class of generic shuffling-type gradient methods for solving finite-sum optimization problems. Our analysis works with any sampling without replacement strategy and covers many known variants such as randomized reshuffling, deterministic or randomized single permutation, and cyclic and incremental gradient schemes. We focus on two different settings: strongly convex and nonconvex problems, but also discuss the non-strongly convex case. Our main contribution consists of new non-asymptotic and asymptotic convergence rates for a wide class of shuffling-type gradient methods in both nonconvex and convex settings. We also study uniformly randomized shuffling variants with different learning rates and model assumptions. While our rate in the nonconvex case is new and significantly improved over existing works under standard assumptions, the rate on the strongly convex one matches the existing best-known rates prior to this paper up to a constant factor without imposing a bounded gradient condition. Finally, we empirically illustrate our theoretical results via two numerical examples: nonconvex logistic regression and neural network training examples. As byproducts, our results suggest some appropriate choices for diminishing learning rates in certain shuffling variants.
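    For a concrete feel for two of the shuffling schemes covered by this analysis, the sketch below contrasts randomized reshuffling (a fresh permutation every epoch) with a single fixed permutation on a least-squares objective; problem size and step size are illustrative.
```python
# Shuffling-type SGD sketch: randomized reshuffling vs. a single permutation.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
b = A @ rng.normal(size=10) + 0.01 * rng.normal(size=200)

def shuffled_sgd(epochs=50, lr=0.01, reshuffle=True):
    x = np.zeros(10)
    perm = rng.permutation(len(b))               # fixed permutation case
    for _ in range(epochs):
        if reshuffle:
            perm = rng.permutation(len(b))       # randomized reshuffling
        for i in perm:                           # one pass, sampling without replacement
            grad_i = (A[i] @ x - b[i]) * A[i]
            x -= lr * grad_i
    return 0.5 * np.mean((A @ x - b) ** 2)

print("random reshuffling :", shuffled_sgd(reshuffle=True))
print("single permutation :", shuffled_sgd(reshuffle=False))
```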
    Score-Based Explanations in Data Management and Machine Learning: An Answer-Set Programming Approach to Counterfactual Analysis. (arXiv:2106.10562v2 [cs.AI] UPDATED)
    (0 min) We describe some recent approaches to score-based explanations for query answers in databases and outcomes from classification models in machine learning. The focus is on work done by the author and collaborators. Special emphasis is placed on declarative approaches based on answer-set programming to the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.
    A Data-Driven Convergence Bidding Strategy Based on Reverse Engineering of Market Participants' Performance: A Case of California ISO. (arXiv:2109.09238v1 [eess.SP])
    (0 min) Convergence bidding, a.k.a. virtual bidding, has been widely adopted in wholesale electricity markets in recent years. It provides opportunities for market participants to arbitrage on the difference between the day-ahead market locational marginal prices and the real-time market locational marginal prices. Given the fact that convergence bids (CBs) have a significant impact on the operation of electricity markets, it is important to understand how market participants strategically select their CBs in the real world. We address this open problem with a focus on the electricity market that is operated by the California ISO. In this regard, we use the publicly available electricity market data to learn, characterize, and evaluate different types of convergence bidding strategies that are currently used by market participants. Our analysis includes developing a data-driven reverse engineering method that we apply to three years of real-world data. Our analysis involves feature selection and density-based data clustering. It results in identifying three main clusters of CB strategies in the California ISO market. Different characteristics and the performance of each cluster of strategies are analyzed. Interestingly, we unmask a common real-world strategy that does not match any of the existing strategic convergence bidding methods in the literature. Next, we build upon the lessons learned from the existing real-world strategies to propose a new CB strategy that can significantly outperform them. The new strategy has three steps: net profit maximization by capturing price spikes, dynamic node labeling, and a strategy selection algorithm. We show through case studies that the annual net profit for the most lucrative market participants can increase by over 40% if the proposed convergence bidding strategy is used.
    Traffic-Net: 3D Traffic Monitoring Using a Single Camera. (arXiv:2109.09165v1 [cs.CV])
    (0 min) Computer Vision has played a major role in Intelligent Transportation Systems (ITS) and traffic surveillance. Along with the rapidly growing number of automated vehicles and increasingly crowded cities, automated and advanced traffic management systems (ATMS) using video surveillance infrastructures have evolved through the implementation of Deep Neural Networks. In this research, we provide a practical platform for real-time traffic monitoring, including 3D vehicle/pedestrian detection, speed detection, trajectory estimation, congestion detection, as well as monitoring the interaction of vehicles and pedestrians, all using a single CCTV traffic camera. We adapt a custom YOLOv5 deep neural network model for vehicle/pedestrian detection and an enhanced SORT tracking algorithm. For the first time, a hybrid satellite-ground based inverse perspective mapping (SG-IPM) method for camera auto-calibration is also developed, which leads to accurate 3D object detection and visualisation. We also develop a hierarchical traffic modelling solution based on short- and long-term temporal video data streams to understand the traffic flow, bottlenecks, and risky spots for vulnerable road users. Several experiments on real-world scenarios and comparisons with state-of-the-art are conducted using various traffic monitoring datasets, including MIO-TCD, UA-DETRAC and GRAM-RTM collected from highways, intersections, and urban areas under different lighting and weather conditions.
    Translatotron 2: Robust direct speech-to-speech translation. (arXiv:2107.08661v3 [cs.CL] UPDATED)
    (0 min) We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects all the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pause. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retain the source speaker's voice, but unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.
    Rethnicity: Predicting Ethnicity from Names. (arXiv:2109.09228v1 [cs.LG])
    (0 min) I provide an R package, \texttt{rethnicity}, for predicting ethnicity from names. I use the Bidirectional LSTM as the model and Florida Voter Registration as training data. Special care is given for the accuracy of minority groups, by adjusting the imbalance in the dataset. I also compare the availability, accuracy, and performance with other solutions for predicting ethnicity from names. Sample code snippet and analysis of the DIME dataset are also shown as applications of the package.
    Tweet Sentiment Quantification: An Experimental Re-Evaluation. (arXiv:2011.08091v3 [cs.CL] UPDATED)
    (0 min) Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called ``prevalence'') of sentiment-related classes (such as \textsf{Positive}, \textsf{Neutral}, \textsf{Negative}) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well-known that solving quantification by means of ``classify and count'' (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We now re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and much more robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. The results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and they provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
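    For readers unfamiliar with quantification, the sketch below contrasts the classify-and-count (CC) baseline named in the abstract with its standard adjusted variant (ACC), which corrects CC using the classifier's true- and false-positive rates; a binary synthetic dataset stands in for tweets.
```python
# Classify-and-count (CC) vs. adjusted classify-and-count (ACC) sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred_tr, pred_te = clf.predict(X_tr), clf.predict(X_te)

tpr = pred_tr[y_tr == 1].mean()            # estimated from labelled data
fpr = pred_tr[y_tr == 0].mean()

cc = pred_te.mean()                                  # classify and count
acc = np.clip((cc - fpr) / (tpr - fpr), 0, 1)        # adjusted classify and count
print(f"true prevalence {y_te.mean():.3f}  CC {cc:.3f}  ACC {acc:.3f}")
```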
    Multiscale Manifold Warping. (arXiv:2109.09222v1 [cs.LG])
    (0 min) Many real-world applications require aligning two temporal sequences, including bioinformatics, handwriting recognition, activity recognition, and human-robot coordination. Dynamic Time Warping (DTW) is a popular alignment method, but can fail on high-dimensional real-world data where the dimensions of aligned sequences are often unequal. In this paper, we show that exploiting the multiscale manifold latent structure of real-world data can yield improved alignment. We introduce a novel framework called Warping on Wavelets (WOW) that integrates DTW with a multi-scale manifold learning framework called Diffusion Wavelets. We present a theoretical analysis of the WOW family of algorithms and show that it outperforms previous state-of-the-art methods, such as canonical time warping (CTW) and manifold warping, on several real-world datasets.
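    For reference, here is a minimal implementation of classical DTW, the building block the abstract starts from: the alignment cost between two 1-D sequences is computed by dynamic programming (not the paper's WOW method).
```python
# Classical dynamic time warping (DTW) by dynamic programming.
import numpy as np

def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Best of match, insertion, and deletion moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.sin(np.linspace(0, 2 * np.pi, 80))
b = np.sin(np.linspace(0, 2 * np.pi, 120))          # same shape, different length
print(dtw(a, b), dtw(a, -b))                        # the aligned pair costs far less
```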
    RobustTAD: Robust Time Series Anomaly Detection via Decomposition and Convolutional Neural Networks. (arXiv:2002.09545v2 [cs.LG] UPDATED)
    (0 min) The monitoring and management of numerous and diverse time series data at Alibaba Group calls for an effective and scalable time series anomaly detection service. In this paper, we propose RobustTAD, a Robust Time series Anomaly Detection framework that integrates robust seasonal-trend decomposition and a convolutional neural network for time series data. The seasonal-trend decomposition can effectively handle complicated patterns in time series while significantly simplifying the architecture of the neural network, which is an encoder-decoder architecture with skip connections. This architecture can effectively capture multi-scale information from time series, which is very useful in anomaly detection. Due to the limited labeled data in time series anomaly detection, we systematically investigate data augmentation methods in both the time and frequency domains. We also introduce label-based and value-based weights in the loss function by exploiting the unbalanced nature of the time series anomaly detection problem. Compared with widely used forecasting-based anomaly detection algorithms, decomposition-based algorithms, traditional statistical algorithms, as well as recent neural network based algorithms, RobustTAD performs significantly better on public benchmark datasets. It is deployed as a public online service and widely adopted in different business scenarios at Alibaba Group.
    Omnidirectional Transfer for Quasilinear Lifelong Learning. (arXiv:2004.12908v9 [cs.AI] UPDATED)
    (0 min) In biological learning, data are used to improve performance not only on the current task, but also on previously encountered and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on all tasks (including past and future) with any new data. We propose omnidirectional transfer learning algorithms, which include two special cases of interest: decision forests and deep networks. Our key insight is the development of the omni-voter layer, which ensembles representations learned independently on all tasks to jointly decide how to proceed on any given new data point, thereby improving performance on both past and future tasks. Our algorithms demonstrate omnidirectional transfer in a variety of simulated and real data scenarios, including tabular data, image data, spoken data, and adversarial tasks. Moreover, they do so with quasilinear space and time complexity.
    ARCA23K: An audio dataset for investigating open-set label noise. (arXiv:2109.09227v1 [cs.SD])
    (0 min) The availability of audio data on sound sharing platforms such as Freesound gives users access to large amounts of annotated audio. Utilising such data for training is becoming increasingly popular, but the problem of label noise that is often prevalent in such datasets requires further investigation. This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprised of over 23000 labelled Freesound clips. Unlike past datasets such as FSDKaggle2018 and FSDnoisy18K, ARCA23K facilitates the study of label noise in a more controlled manner. We describe the entire process of creating the dataset such that it is fully reproducible, meaning researchers can extend our work with little effort. We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise. Experiments are carried out in which we study the impact of label noise in terms of classification performance and representation learning.
    Unified and Multilingual Author Profiling for Detecting Haters. (arXiv:2109.09233v1 [cs.CL])
    (0 min) This paper presents a unified user profiling framework to identify hate speech spreaders by processing their tweets regardless of the language. The framework encodes the tweets with sentence transformers and applies an attention mechanism to select important tweets for learning user profiles. Furthermore, the attention layer helps to explain why a user is a hate speech spreader by producing attention weights at both the token and post levels. Our proposed model outperformed the state-of-the-art multilingual transformer models.
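    The attention mechanism described here, scoring tweets so that important ones dominate the user profile, can be pictured as a small pooling layer. The following is a hedged toy sketch of that idea with a random scoring vector standing in for learned weights; it is not the authors' architecture.

    ```python
    import numpy as np

    def attention_pool(tweet_embeddings, scoring_vector):
        """Weight each tweet embedding by a score and return the user profile.
        tweet_embeddings: (num_tweets, dim); scoring_vector: (dim,), assumed learned."""
        scores = tweet_embeddings @ scoring_vector           # one scalar per tweet
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                             # softmax over tweets
        profile = weights @ tweet_embeddings                 # weighted average profile
        return profile, weights                              # weights double as explanations

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(5, 8))                 # 5 tweets, 8-dim sentence embeddings
    profile, attn = attention_pool(emb, rng.normal(size=8))
    print(attn)                                   # per-tweet importance weights
    ```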
    Hierarchical clustering that takes advantage of both density-peak and density-connectivity. (arXiv:1810.03393v2 [cs.LG] UPDATED)
    (0 min) This paper focuses on density-based clustering, particularly the Density Peak (DP) algorithm and the density-connectivity-based DBSCAN, and proposes a new method which takes advantage of the individual strengths of these two methods to yield a density-based hierarchical clustering algorithm. Our investigation begins by formally defining the types of clusters DP and DBSCAN are designed to detect, and then identifies the kinds of distributions in which DP and DBSCAN individually fail to detect all clusters in a dataset. These identified weaknesses inspire us to formally define a new kind of cluster and to propose a new method called DC-HDP that overcomes these weaknesses and identifies clusters with arbitrary shapes and varied densities. In addition, the new method produces a richer clustering result in terms of hierarchy or dendrogram for a better understanding of cluster structures. Our empirical evaluation shows that DC-HDP produces the best clustering results on 14 datasets in comparison with 7 state-of-the-art clustering algorithms.
    An Accelerated Variance-Reduced Conditional Gradient Sliding Algorithm for First-order and Zeroth-order Optimization. (arXiv:2109.08858v1 [cs.LG])
    (0 min) The conditional gradient algorithm (also known as the Frank-Wolfe algorithm) has recently regained popularity in the machine learning community due to its projection-free property for solving constrained problems. Although many variants of the conditional gradient algorithm have been proposed to improve performance, they depend on first-order information (gradients) to optimize. Naturally, these algorithms are unable to function properly in the increasingly popular field of zeroth-order optimization, where only zeroth-order information (function values) is available. To fill this gap, we propose a novel Accelerated variance-Reduced Conditional gradient Sliding (ARCS) algorithm for finite-sum problems, which can use either first-order or zeroth-order information to optimize. To the best of our knowledge, ARCS is the first zeroth-order conditional gradient sliding-type algorithm for solving convex problems in zeroth-order optimization. In first-order optimization, the convergence results of ARCS substantially outperform previous algorithms in terms of the number of gradient oracle queries. Finally, we validate the superiority of ARCS through experiments on real-world datasets.
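    For readers unfamiliar with the conditional gradient (Frank-Wolfe) idea of staying projection-free by solving a linear subproblem over the constraint set, the sketch below shows the classic first-order method on an L1 ball; it is purely illustrative and unrelated to the ARCS variance-reduction machinery.

    ```python
    import numpy as np

    def frank_wolfe_l1(grad_f, x0, radius=1.0, iters=200):
        """Projection-free minimization over the L1 ball of the given radius.
        Each step solves min_s <grad, s> over the ball, which has a closed form:
        put all the mass on the coordinate with the largest |gradient|."""
        x = x0.copy()
        for t in range(iters):
            g = grad_f(x)
            s = np.zeros_like(x)
            i = np.argmax(np.abs(g))
            s[i] = -radius * np.sign(g[i])    # linear minimization oracle
            gamma = 2.0 / (t + 2.0)           # standard step size
            x = (1 - gamma) * x + gamma * s   # convex combination stays feasible
        return x

    # Toy convex problem: minimize ||x - b||^2 over the unit L1 ball.
    b = np.array([0.8, -0.1, 0.3])
    x_star = frank_wolfe_l1(lambda x: 2 * (x - b), np.zeros(3))
    print(x_star)
    ```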
    Harnessing the Power of Ego Network Layers for Link Prediction in Online Social Networks. (arXiv:2109.09190v1 [cs.SI])
    (0 min) Being able to recommend links between users in online social networks is important for users to connect with like-minded individuals as well as for the platforms themselves and third parties leveraging social media information to grow their business. Predictions are typically based on unsupervised or supervised learning, often leveraging simple yet effective graph topological information, such as the number of common neighbors. However, we argue that richer information about the personal social structure of individuals might lead to better predictions. In this paper, we propose to leverage well-established social cognitive theories to improve link prediction performance. According to these theories, individuals arrange their social relationships along, on average, five concentric circles of decreasing intimacy. We postulate that relationships in different circles have different importance in predicting new links. In order to validate this claim, we focus on popular feature-extraction prediction algorithms (both unsupervised and supervised) and we extend them to include social-circles awareness. We validate the prediction performance of these circle-aware algorithms against several benchmarks (including their baseline versions as well as node-embedding- and GNN-based link prediction), leveraging two Twitter datasets comprising a community of video gamers and generic users. We show that social-awareness generally provides significant improvements in prediction performance, also beating state-of-the-art solutions like node2vec and SEAL, without increasing the computational complexity. Finally, we show that social-awareness can be used in place of a classifier (which may be costly or impractical) for targeting a specific category of users.
    Evaluating explainable artificial intelligence methods for multi-label deep learning classification tasks in remote sensing. (arXiv:2104.01375v2 [cs.LG] UPDATED)
    (2 min) Although deep neural networks hold the state-of-the-art in several remote sensing tasks, their black-box operation hinders the understanding of their decisions, concealing any bias and other shortcomings in datasets and model performance. To this end, we have applied explainable artificial intelligence (XAI) methods to remote sensing multi-label classification tasks to produce human-interpretable explanations and improve transparency. In particular, we utilized and trained deep learning models with state-of-the-art performance on the benchmark BigEarthNet and SEN12MS datasets. Ten XAI methods were employed towards understanding and interpreting the models' predictions, along with quantitative metrics to assess and compare their performance. Numerous experiments were performed to assess the overall performance of XAI methods for straightforward prediction cases, competing multiple labels, as well as misclassification cases. According to our findings, Occlusion, Grad-CAM and Lime were the most interpretable and reliable XAI methods. However, none delivers high-resolution outputs, while apart from Grad-CAM, both Lime and Occlusion are computationally expensive. We also highlight different aspects of XAI performance and elaborate with insights on black-box decisions in order to improve transparency, understand their behavior and reveal, as well, the datasets' particularities.
    From Human Explanation to Model Interpretability: A Framework Based on Weight of Evidence. (arXiv:2104.13299v2 [cs.AI] UPDATED)
    (2 min) We take inspiration from the study of human explanation to inform the design and evaluation of interpretability methods in machine learning. First, we survey the literature on human explanation in philosophy, cognitive science, and the social sciences, and propose a list of design principles for machine-generated explanations that are meaningful to humans. Using the concept of weight of evidence from information theory, we develop a method for generating explanations that adhere to these principles. We show that this method can be adapted to handle high-dimensional, multi-class settings, yielding a flexible framework for generating explanations. We demonstrate that these explanations can be estimated accurately from finite samples and are robust to small perturbations of the inputs. We also evaluate our method through a qualitative user study with machine learning practitioners, where we observe that the resulting explanations are usable despite some participants struggling with background concepts like prior class probabilities. Finally, we conclude by surfacing design implications for interpretability tools in general.
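    One concrete way to read "weight of evidence" is as the log-likelihood ratio that a piece of evidence contributes for a hypothesis versus its complement. The snippet below computes that generic quantity from given probabilities; it is an illustration of the concept, not the authors' estimator.

    ```python
    import numpy as np

    def weight_of_evidence(p_e_given_h, p_e_given_not_h):
        """WoE(e : h) = log P(e | h) - log P(e | not h), in nats.
        Positive values mean the evidence e favors hypothesis h."""
        return float(np.log(p_e_given_h) - np.log(p_e_given_not_h))

    # Hypothetical example: a word appears in 30% of spam emails and 5% of ham.
    print(weight_of_evidence(0.30, 0.05))   # ~1.79 nats in favor of "spam"
    print(weight_of_evidence(0.05, 0.30))   # symmetric evidence against
    ```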
    Language Models are Few-Shot Butlers. (arXiv:2104.07972v2 [cs.CL] UPDATED)
    (2 min) Pretrained language models demonstrate strong performance in most NLP tasks when fine-tuned on small task-specific datasets. Hence, these autoregressive models constitute ideal agents to operate in text-based environments where language understanding and generative capabilities are essential. Nonetheless, collecting expert demonstrations in such environments is a time-consuming endeavour. We introduce a two-stage procedure to learn from a small set of demonstrations and further improve by interacting with an environment. We show that language models fine-tuned with only 1.2% of the expert demonstrations and a simple reinforcement learning algorithm achieve a 51% absolute improvement in success rate over existing methods in the ALFWorld environment.
    Speech Denoising Without Clean Training Data: A Noise2Noise Approach. (arXiv:2104.03838v2 [cs.SD] UPDATED)
    (2 min) This paper tackles the problem of the heavy dependence of clean speech data required by deep learning based audio-denoising methods by showing that it is possible to train deep speech denoising networks using only noisy speech samples. Conventional wisdom dictates that in order to achieve good speech denoising performance, there is a requirement for a large quantity of both noisy speech samples and perfectly clean speech samples, resulting in a need for expensive audio recording equipment and extremely controlled soundproof recording studios. These requirements pose significant challenges in data collection, especially in economically disadvantaged regions and for low resource languages. This work shows that speech denoising deep neural networks can be successfully trained utilizing only noisy training audio. Furthermore, it is revealed that such training regimes achieve superior denoising performance over conventional training regimes utilizing clean training audio targets, in cases involving complex noise distributions and low Signal-to-Noise ratios (high noise environments). This is demonstrated through experiments studying the efficacy of our proposed approach over both real-world noises and synthetic noises using the 20 layered Deep Complex U-Net architecture.
    Leveraging Non-uniformity in First-order Non-convex Optimization. (arXiv:2105.06072v2 [cs.LG] UPDATED)
    (2 min) Classical global convergence results for first-order methods rely on uniform smoothness and the Łojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NŁ). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to $O(e^{-t})$ while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings.
    Equivariant geometric learning for digital rock physics: estimating formation factor and effective permeability tensors from Morse graph. (arXiv:2104.05608v2 [physics.geo-ph] UPDATED)
    (2 min) We present an SE(3)-equivariant graph neural network (GNN) approach that directly predicts the formation factor and effective permeability from micro-CT images. FFT solvers are established to compute both the formation factor and effective permeability, while the topology and geometry of the pore space are represented by a persistence-based Morse graph. Together, they constitute the database for training, validating, and testing the neural networks. While the graph and Euclidean convolutional approaches both employ neural networks to generate low-dimensional latent space to represent the features of the micro-structures for forward predictions, the SE(3) equivariant neural network is found to generate more accurate predictions, especially when the training data is limited. Numerical experiments have also shown that the new SE(3) approach leads to predictions that fulfill the material frame indifference whereas the predictions from classical convolutional neural networks (CNN) may suffer from spurious dependence on the coordinate system of the training data. Comparisons among predictions inferred from training the CNN and those from graph convolutional neural networks (GNN) with and without the equivariant constraint indicate that the equivariant graph neural network seems to perform better than the CNN and GNN without enforcing equivariant constraints.
    Utilizing Self-supervised Representations for MOS Prediction. (arXiv:2104.03017v3 [eess.AS] UPDATED)
    (2 min) Speech quality assessment has been a critical issue in speech processing for decades. Existing automatic evaluations usually require clean references or parallel ground truth data, which is infeasible when the amount of data soars. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception. However, such a test is expensive and time-consuming because crowd work is necessary. It thus becomes highly desirable to develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data. In this paper, we use self-supervised pre-trained models for MOS prediction. We show that their representations can distinguish between clean and noisy audio. Then, we fine-tune these pre-trained models followed by simple linear layers in an end-to-end manner. The experimental results show that our framework outperforms the two previous state-of-the-art models by a significant margin on Voice Conversion Challenge 2018 and achieves comparable or superior performance on Voice Conversion Challenge 2016. We also conducted an ablation study to further investigate how each module benefits the task. The experiments are implemented with publicly available toolkits and the results are reproducible.
    Identification of AC Networks via Online Learning. (arXiv:2003.06210v3 [eess.SY] UPDATED)
    (2 min) The increasing penetration of intermittent distributed energy resources in power networks calls for novel planning and control methodologies which hinge on detailed knowledge of the grid. However, reliable information concerning the system topology and parameters may be missing or outdated for temporally varying electric distribution networks. This paper proposes an online learning procedure to estimate the network admittance matrix capturing topological information and line parameters. We start off by providing a recursive identification algorithm exploiting phasor measurements of voltages and currents. With the goal of accelerating convergence, we subsequently complement our base algorithm with a design-of-experiment procedure which maximizes the information content of data at each step by computing optimal voltage excitations. Our approach improves on existing techniques, and its effectiveness is substantiated by numerical studies on realistic testbeds.
    Dynamic Silos: Increased Modularity in Intra-organizational Communication Networks during the Covid-19 Pandemic. (arXiv:2104.00641v4 [stat.ML] UPDATED)
    (2 min) Workplace communications around the world were drastically altered by Covid-19 and the resulting work-from-home orders and rise of remote work. We analyze aggregated, anonymized metadata from over 360 billion emails within over 4,000 organizations worldwide to examine changes in network community structures over 24 months. We find that, in 2020, organizations around the world became more siloed than in 2019, evidenced by increased modularity. This shift was concurrent with decreased stability, indicating that organizational siloes had less stable membership. We provide initial insights into the meaning and implications of these network changes -- which we term dynamic silos -- for new models of work.
    Attention network forecasts time-to-failure in laboratory shear experiments. (arXiv:1912.06087v3 [physics.geo-ph] UPDATED)
    (2 min) Rocks under stress deform by creep mechanisms that include formation and slip on small-scale internal cracks. Intragranular cracks and slip along grain contacts release energy as elastic waves termed acoustic emissions (AE). AEs are thought to contain predictive information that can be used for fault failure forecasting. Here we present a method using unsupervised classification and an attention network to forecast labquakes using AE waveform features. Our data were generated in a laboratory setting using a biaxial shearing device with granular fault gouge intended to mimic the conditions of tectonic faults. Here we analyzed the temporal evolution of AEs generated throughout several hundred laboratory earthquake cycles. We used a Conscience Self-Organizing Map (CSOM) to perform topologically ordered vector quantization based on waveform properties. The resulting map was used to interactively cluster AEs. We examined the clusters over time to identify those with predictive ability. Finally, we used a variety of LSTM and attention-based networks to test the predictive power of the AE clusters. By tracking cumulative waveform features over the seismic cycle, the network is able to forecast the time-to-failure (TTF) of lab earthquakes. Our results show that analyzing the data to isolate predictive signals and using a more sophisticated network architecture are key to robustly forecasting labquakes. In the future, this method could be applied to tectonic faults to monitor earthquakes and augment current early warning systems.
    Fast decentralized non-convex finite-sum optimization with recursive variance reduction. (arXiv:2008.07428v6 [math.OC] UPDATED)
    (3 min) This paper considers decentralized minimization of $N:=nm$ smooth non-convex cost functions equally divided over a directed network of $n$ nodes. Specifically, we describe a stochastic first-order gradient method, called GT-SARAH, that employs a SARAH-type variance reduction technique and gradient tracking (GT) to address the stochastic and decentralized nature of the problem. We show that GT-SARAH, with appropriate algorithmic parameters, finds an $\epsilon$-accurate first-order stationary point with $O\big(\max\big\{N^{\frac{1}{2}},n(1-\lambda)^{-2},n^{\frac{2}{3}}m^{\frac{1}{3}}(1-\lambda)^{-1}\big\}L\epsilon^{-2}\big)$ gradient complexity, where ${(1-\lambda)\in(0,1]}$ is the spectral gap of the network weight matrix and $L$ is the smoothness parameter of the cost functions. This gradient complexity outperforms that of the existing decentralized stochastic gradient methods. In particular, in a big-data regime such that ${n = O(N^{\frac{1}{2}}(1-\lambda)^{3})}$, this gradient complexity further reduces to ${O(N^{\frac{1}{2}}L\epsilon^{-2})}$, independent of the network topology, and matches that of the centralized near-optimal variance-reduced methods. Moreover, in this regime GT-SARAH achieves a non-asymptotic linear speedup, in that the total number of gradient computations at each node is reduced by a factor of $1/n$ compared to the centralized near-optimal algorithms that perform all gradient computations at a single node. To the best of our knowledge, GT-SARAH is the first algorithm that achieves this property. In addition, we show that appropriate choices of local minibatch size balance the trade-offs between the gradient and communication complexity of GT-SARAH. Over an infinite time horizon, we establish that all nodes in GT-SARAH asymptotically achieve consensus and converge to a first-order stationary point in the almost sure and mean-squared sense.
    Time-domain Speech Enhancement with Generative Adversarial Learning. (arXiv:2103.16149v2 [cs.SD] UPDATED)
    (2 min) Speech enhancement aims to obtain speech signals with high intelligibility and quality from noisy speech. Recent work has demonstrated the excellent performance of time-domain deep learning methods, such as Conv-TasNet. However, these methods can be degraded by the arbitrary scales of the waveform induced by the scale-invariant signal-to-noise ratio (SI-SNR) loss. This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN), which is an extension of the generative adversarial network (GAN) in the time domain with metric evaluation to mitigate the scaling problem and provide training stability, thus achieving performance improvement. In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN, and explain why it is better than the Wasserstein GAN. The experiments conducted demonstrate the effectiveness of our proposed method and illustrate the advantage of Metric GAN.
    Visual Representation Learning for Preference-Aware Path Planning. (arXiv:2109.08968v1 [cs.RO])
    (2 min) Autonomous mobile robots deployed in outdoor environments must reason about different types of terrain for both safety (e.g., prefer dirt over mud) and deployer preferences (e.g., prefer dirt path over flower beds). Most existing solutions to this preference-aware path planning problem use semantic segmentation to classify terrain types from camera images, and then ascribe costs to each type. Unfortunately, there are three key limitations of such approaches -- they 1) require pre-enumeration of the discrete terrain types, 2) are unable to handle hybrid terrain types (e.g., grassy dirt), and 3) require expensive labelled data to train visual semantic segmentation. We introduce Visual Representation Learning for Preference-Aware Path Planning (VRL-PAP), an alternative approach that overcomes all three limitations: VRL-PAP leverages unlabeled human demonstrations of navigation to autonomously generate triplets for learning visual representations of terrain that are viewpoint invariant and encode terrain types in a continuous representation space. The learned representations are then used along with the same unlabeled human navigation demonstrations to learn a mapping from the representation space to terrain costs. At run time, VRL-PAP maps from images to representations and then representations to costs to perform preference-aware path planning. We present empirical results from challenging outdoor settings that demonstrate VRL-PAP 1) is successfully able to pick paths that reflect demonstrated preferences, 2) is comparable in execution to geometric navigation with a highly detailed manually annotated map (without requiring such annotations), 3) is able to generalize to novel terrain types with minimal additional unlabeled demonstrations.
    Efficient Pairwise Neuroimage Analysis using the Soft Jaccard Index and 3D Keypoint Sets. (arXiv:2103.06966v3 [eess.IV] UPDATED)
    (2 min) We propose a novel pairwise distance measure between image keypoint sets, for the purpose of large-scale medical image indexing. Our measure generalizes the Jaccard index to account for soft set equivalence (SSE) between keypoint elements, via an adaptive kernel framework modeling uncertainty in keypoint appearance and geometry. A new kernel is proposed to quantify the variability of keypoint geometry in location and scale. Our distance measure may be estimated between $O(N^2)$ image pairs in $O(N~\log~N)$ operations via keypoint indexing. Experiments report the first results for the task of predicting family relationships from medical images, using 1010 T1-weighted MRI brain volumes of 434 families including monozygotic and dizygotic twins, siblings and half-siblings sharing 100%-25% of their polymorphic genes. Soft set equivalence and the keypoint geometry kernel improve upon standard hard set equivalence (HSE) and appearance kernels alone in predicting family relationships. Monozygotic twin identification is near 100%, and three subjects with uncertain genotyping are automatically paired with their self-reported families, the first reported practical application of image-based family identification. Our distance measure can also be used to predict group categories; for example, sex is predicted with an AUC of 0.97. Software is provided for efficient fine-grained curation of large, generic image datasets.
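    The "soft set equivalence" generalization of the Jaccard index can be pictured as replacing exact element matches with a kernel similarity in [0, 1]. The sketch below is a toy version of that idea with a plain Gaussian kernel, not the paper's keypoint-specific appearance and geometry kernels.

    ```python
    import numpy as np

    def soft_jaccard(A, B, sigma=1.0):
        """Soft Jaccard index between point sets A (n, d) and B (m, d).
        Each element's membership in the soft intersection is the best kernel
        similarity it achieves with any element of the other set; as sigma -> 0
        this approaches the hard Jaccard index for exact matches."""
        A, B = np.atleast_2d(A), np.atleast_2d(B)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        sim = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian similarity kernel
        soft_inter = sim.max(axis=1).sum() if len(A) <= len(B) else sim.max(axis=0).sum()
        union = len(A) + len(B) - soft_inter
        return float(soft_inter / union)

    A = np.array([[0.0], [1.0], [2.0]])
    B = np.array([[0.1], [1.0], [5.0]])
    print(soft_jaccard(A, B, sigma=0.5))   # near matches count almost fully
    ```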
    Deep Reinforcement Learning with Function Properties in Mean Reversion Strategies. (arXiv:2101.03418v3 [q-fin.MF] UPDATED)
    (2 min) Over the past decades, researchers have been pushing the limits of Deep Reinforcement Learning (DRL). Although DRL has attracted substantial interest from practitioners, many are blocked by having to search through a plethora of available methodologies that are seemingly alike, while others are still building RL agents from scratch based on classical theories. To address the aforementioned gaps in adopting the latest DRL methods, I am particularly interested in testing whether any of the recent technology developed by the leaders in the field can be readily applied to a class of optimal trading problems. Unsurprisingly, many prominent breakthroughs in DRL are investigated and tested on strategic games: from AlphaGo to AlphaStar and, at about the same time, OpenAI Five. Thus, in this writing, I want to show precisely how to use a DRL library that was initially built for games in a fundamental trading problem: mean reversion. And by introducing a framework that incorporates economically-motivated function properties, I also demonstrate, through the library, a highly performant and convergent DRL solution to decision-making financial problems in general.
    Analyzing and Mitigating JPEG Compression Defects in Deep Learning. (arXiv:2011.08932v2 [cs.CV] UPDATED)
    (2 min) With the proliferation of deep learning methods, many computer vision problems which were considered academic are now viable in the consumer setting. One drawback of consumer applications is lossy compression, which is necessary from an engineering standpoint to efficiently and cheaply store and transmit user images. Despite this, there has been little study of the effect of compression on deep neural networks and benchmark datasets are often losslessly compressed or compressed at high quality. Here we present a unified study of the effects of JPEG compression on a range of common tasks and datasets. We show that there is a significant penalty on common performance metrics for high compression. We test several methods for mitigating this penalty, including a novel method based on artifact correction which requires no labels to train.
    Increasing the Confidence of Deep Neural Networks by Coverage Analysis. (arXiv:2101.12100v2 [cs.LG] UPDATED)
    (2 min) The great performance of machine learning algorithms and deep neural networks in several perception and control tasks is pushing the industry to adopt such technologies in safety-critical applications, as autonomous robots and self-driving vehicles. At present, however, several issues need to be solved to make deep learning methods more trustworthy, predictable, safe, and secure against adversarial attacks. Although several methods have been proposed to improve the trustworthiness of deep neural networks, most of them are tailored for specific classes of adversarial examples, hence failing to detect other corner cases or unsafe inputs that heavily deviate from the training samples. This paper presents a lightweight monitoring architecture based on coverage paradigms to enhance the model robustness against different unsafe inputs. In particular, four coverage analysis methods are proposed and tested in the architecture for evaluating multiple detection logics. Experimental results show that the proposed approach is effective in detecting both powerful adversarial examples and out-of-distribution inputs, introducing limited extra-execution time and memory requirements.
    CARLA Real Traffic Scenarios -- novel training ground and benchmark for autonomous driving. (arXiv:2012.11329v2 [cs.RO] UPDATED)
    (2 min) This work introduces interactive traffic scenarios in the CARLA simulator, which are based on real-world traffic. We concentrate on tactical tasks lasting several seconds, which are especially challenging for current control methods. The CARLA Real Traffic Scenarios (CRTS) is intended to be a training and testing ground for autonomous driving systems. To this end, we open-source the code under a permissive license and present a set of baseline policies. CRTS combines the realism of traffic scenarios and the flexibility of simulation. We use it to train agents using a reinforcement learning algorithm. We show how to obtain competitive policies and evaluate experimentally how observation types and reward schemes affect the training process and the resulting agent's behavior.
    Importance Reweighting for Biquality Learning. (arXiv:2010.09621v5 [cs.LG] UPDATED)
    (2 min) The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of "supervision deficiencies", namely: poor quality, non-adaptability, and insufficient quantity of labels. Regarding quality, label noise can be of different types, including completely-at-random, at-random or even not-at-random. All these kinds of label noise are addressed separately in the literature, leading to highly specialized approaches. This paper proposes an original, encompassing view of Weakly Supervised Learning, which results in the design of generic approaches capable of dealing with any kind of label noise. For this purpose, an alternative setting called "Biquality data" is used. It assumes that a small trusted dataset of correctly labeled examples is available, in addition to an untrusted dataset of noisy examples. In this paper, we propose a new reweighting scheme capable of identifying non-corrupted examples in the untrusted dataset. This allows one to learn classifiers using both datasets. Extensive experiments that simulate several types of label noise and that vary the quality and quantity of untrusted examples demonstrate that the proposed approach outperforms baselines and state-of-the-art approaches.
    Hydroelectric Generation Forecasting with Long Short Term Memory (LSTM) Based Deep Learning Model for Turkey. (arXiv:2109.09013v1 [cs.LG])
    (0 min) Hydroelectricity is one of the renewable energy sources that has been used in Turkey for many years. The output of hydraulic power plants based on water reservoirs varies with several parameters, so estimating hydraulic production is important for planning electricity generation. In this article, Turkey's monthly hydroelectricity production is estimated with a long short-term memory (LSTM) network-based deep learning model. The designed model uses many years of hydraulic production time series and future production plans. Using real production data and different LSTM deep learning models, their performance in forecasting the next year's monthly hydraulic electricity generation is examined. The results show that combining multi-year time series of real production data with the deep learning model is successful for long-term prediction. In terms of MAPE, the best accuracy is obtained by the 100-layer LSTM model trained on 120 months (10 years) of hydroelectric generation data, with a MAPE of 0.1311 (13.1%) for the annual total and 1.09% as the monthly average. In terms of RMSE, the best results are obtained by the 100-layer LSTM model trained on 144 months (12 years) of data, with an annual RMSE of 29,689 and 2,474.08 in the monthly distribution. Based on these results, time series covering at least 120 months of production are recommended for building an acceptable LSTM hydropower forecasting model.
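    Since the comparison in this abstract hinges on RMSE and MAPE, the two formulas are worth stating explicitly. The snippet below computes them for an arbitrary forecast; the numbers are hypothetical and the definitions are the generic ones, not tied to the study's data.

    ```python
    import numpy as np

    def rmse(actual, forecast):
        """Root mean squared error."""
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return float(np.sqrt(np.mean((actual - forecast) ** 2)))

    def mape(actual, forecast):
        """Mean absolute percentage error, in percent."""
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

    # Hypothetical monthly hydro generation (GWh) vs. a model's forecast.
    actual = [5200, 4800, 6100, 7000]
    forecast = [5000, 5000, 6000, 6800]
    print(rmse(actual, forecast), mape(actual, forecast))
    ```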
    Proteome-informed machine learning studies of cocaine addiction. (arXiv:2109.08718v1 [q-bio.MN])
    (0 min) Cocaine addiction accounts for a large portion of substance use disorders and threatens millions of lives worldwide. There is an urgent need to come up with efficient anti-cocaine addiction drugs. Unfortunately, no medications have been approved by the Food and Drug Administration (FDA), despite the extensive effort in the past few decades. The main challenge is the intricate molecular mechanisms of cocaine addiction, involving synergistic interactions among proteins upstream and downstream of dopamine transporter (DAT) functions impacted by cocaine. However, traditional in vivo or in vitro experiments cannot address the roles of so many proteins, highlighting the need for innovative strategies in the field. We propose a proteome-informed machine learning/deep learning (ML/DL) platform to discover nearly optimal anti-cocaine addiction lead compounds. We construct and analyze proteomic protein-protein interaction (PPI) networks for cocaine dependence to identify 141 involved drug targets and represent over 60,000 associated drug candidates or experimental drugs in the latent space using an autoencoder (AE) model trained on over 104 million molecules. We build 32 ML models for cross-target analysis of these drug candidates for side effects and repurposing potential. We further screen the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of these candidates. Our platform reveals that essentially all of the existing drug candidates, including dozens of experimental drugs, fail to pass our cross-target and ADMET screenings. Nonetheless, we have identified two nearly optimal leads for further optimization.
    On the Noise Stability and Robustness of Adversarially Trained Networks on NVM Crossbars. (arXiv:2109.09060v1 [cs.LG])
    (0 min) Applications based on Deep Neural Networks (DNNs) have grown exponentially in the past decade. To match their increasing computational needs, several Non-Volatile Memory (NVM) crossbar-based accelerators have been proposed. Apart from improved energy efficiency and performance, these approximate hardware also possess intrinsic robustness for defense against Adversarial Attacks, which is an important security concern for DNNs. Prior works have focused on quantifying this intrinsic robustness for vanilla networks, that is DNNs trained on unperturbed inputs. However, adversarial training of DNNs is the benchmark technique for robustness, and sole reliance on intrinsic robustness of the hardware may not be sufficient. In this work, we explore the design of robust DNNs through the amalgamation of adversarial training and the intrinsic robustness offered by NVM crossbar-based analog hardware. First, we study the noise stability of such networks on unperturbed inputs and observe that internal activations of adversarially trained networks have lower Signal-to-Noise Ratio (SNR), and are more sensitive to noise than vanilla networks. As a result, they suffer significantly higher performance degradation due to the non-ideal computations: on average, a 2x accuracy drop. On the other hand, for adversarial images generated using Projected-Gradient-Descent (PGD) White-Box attacks, ResNet-10/20 adversarially trained on CIFAR-10/100 display a 5-10% gain in robust accuracy due to the underlying NVM crossbar when the attack epsilon ($\epsilon_{attack}$, the degree of input perturbations) is greater than the epsilon of the adversarial training ($\epsilon_{train}$). Our results indicate that implementing adversarially trained networks on analog hardware requires careful calibration between hardware non-idealities and $\epsilon_{train}$ to achieve optimum robustness and performance.
    Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts. (arXiv:2109.08857v1 [cs.NE])
    (0 min) Evolutionary algorithms have been used in the digital art scene since the 1970s. A popular application of genetic algorithms is to optimize the procedural placement of vector graphic primitives to resemble a given painting. In recent years, deep learning-based approaches have also been proposed to generate procedural drawings, which can be optimized using gradient descent. In this work, we revisit the use of evolutionary algorithms for computational creativity. We find that modern evolution strategies (ES) algorithms, when tasked with the placement of shapes, offer large improvements in both quality and efficiency compared to traditional genetic algorithms, and are even comparable to gradient-based methods. We demonstrate that ES is also well suited to optimizing the placement of shapes to fit the CLIP model, and can produce diverse, distinct geometric abstractions that are aligned with human interpretation of language. Videos and demo: https://es-clip.github.io/
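    A compact way to see what a modern ES update looks like is the perturb-and-reweight loop below, a simplified natural-evolution-strategies style sketch applied to a generic fitness over "shape parameters". It is an illustration of the family of methods, not the paper's exact algorithm, and the toy objective is made up.

    ```python
    import numpy as np

    def es_step(theta, fitness, pop_size=50, sigma=0.1, lr=0.02, rng=None):
        """One evolution-strategies update: sample perturbations of theta,
        evaluate fitness, and move theta along the fitness-weighted direction."""
        rng = rng or np.random.default_rng()
        eps = rng.normal(size=(pop_size, theta.size))              # candidate perturbations
        scores = np.array([fitness(theta + sigma * e) for e in eps])
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize fitness
        grad_est = (scores[:, None] * eps).mean(axis=0) / sigma    # estimated ascent direction
        return theta + lr * grad_est

    # Toy objective: place 2 points (4 parameters) close to a target layout.
    target = np.array([0.2, 0.8, 0.7, 0.3])
    fitness = lambda th: -np.sum((th - target) ** 2)
    theta = np.zeros(4)
    for _ in range(500):
        theta = es_step(theta, fitness, rng=np.random.default_rng())
    print(np.round(theta, 2), fitness(theta) > fitness(np.zeros(4)))   # fitness improves markedly
    ```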
    PluGeN: Multi-Label Conditional Generation From Pre-Trained Models. (arXiv:2109.09011v1 [cs.LG])
    (0 min) Modern generative models achieve excellent quality in a variety of tasks including image or text generation and chemical molecule modeling. However, existing methods often lack the essential ability to generate examples with requested properties, such as the age of the person in the photo or the weight of the generated molecule. Incorporating such additional conditioning factors would require rebuilding the entire architecture and optimizing the parameters from scratch. Moreover, it is difficult to disentangle selected attributes so that edits affect only one attribute while leaving the others unchanged. To overcome these limitations we propose PluGeN (Plugin Generative Network), a simple yet effective generative technique that can be used as a plugin to pre-trained generative models. The idea behind our approach is to transform the entangled latent representation using a flow-based module into a multi-dimensional space where the values of each attribute are modeled as an independent one-dimensional distribution. In consequence, PluGeN can generate new samples with desired attributes as well as manipulate labeled attributes of existing examples. Due to the disentangling of the latent representation, we are even able to generate samples with rare or unseen combinations of attributes in the dataset, such as a young person with gray hair, men with make-up, or women with beards. We combined PluGeN with GAN and VAE models and applied it to conditional generation and manipulation of images and chemical molecule modeling. Experiments demonstrate that PluGeN preserves the quality of backbone models while adding the ability to control the values of labeled attributes.
    Change of human mobility during COVID-19: A United States case study. (arXiv:2109.09022v1 [cs.LG])
    (0 min) With the onset of COVID-19 and the resulting shelter in place guidelines combined with remote working practices, human mobility in 2020 has been dramatically impacted. Existing studies typically examine whether mobility in specific localities increases or decreases at specific points in time and relate these changes to certain pandemic and policy events. In this paper, we study mobility change in the US through a five-step process using mobility footprint data. (Step 1) Propose the delta Time Spent in Public Places (Delta-TSPP) as a measure to quantify daily changes in mobility for each US county from 2019-2020. (Step 2) Conduct Principal Component Analysis (PCA) to reduce the Delta-TSPP time series of each county to lower-dimensional latent components of change in mobility. (Step 3) Conduct clustering analysis to find counties that exhibit similar latent components. (Step 4) Investigate local and global spatial autocorrelation for each component. (Step 5) Conduct correlation analysis to investigate how various population characteristics and behavior correlate with mobility patterns. Results show that by describing each county as a linear combination of the three latent components, we can explain 59% of the variation in mobility trends across all US counties. Specifically, change in mobility in 2020 for US counties can be explained as a combination of three latent components: 1) long-term reduction in mobility, 2) no change in mobility, and 3) short-term reduction in mobility. We observe significant correlations between the three latent components of mobility change and various population characteristics, including political leaning, population, COVID-19 cases and deaths, and unemployment. We find that our analysis provides a comprehensive understanding of mobility change in response to the COVID-19 pandemic.
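    Step 2 of the pipeline, reducing each county's Delta-TSPP time series to a few latent components with PCA, has a simple generic form. The sketch below runs scikit-learn's PCA on synthetic stand-in data purely to illustrate the shape of that computation; the sizes and random series are hypothetical.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic stand-in: 3000 "counties", each with a 366-day mobility-change series.
    rng = np.random.default_rng(0)
    counties = rng.normal(size=(3000, 366))

    pca = PCA(n_components=3)
    loadings = pca.fit_transform(counties)       # each county as 3 latent coordinates
    print(pca.explained_variance_ratio_.sum())   # variance explained by the 3 components
    print(pca.components_.shape)                 # (3, 366): the latent change-in-mobility patterns
    ```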
    JEM++: Improved Techniques for Training JEM. (arXiv:2109.09032v1 [cs.LG])
    (0 min) Joint Energy-based Model (JEM) is a recently proposed hybrid model that retains strong discriminative power of modern CNN classifiers, while generating samples rivaling the quality of GAN-based approaches. In this paper, we propose a variety of new training procedures and architecture features to improve JEM's accuracy, training stability, and speed altogether. 1) We propose a proximal SGLD to generate samples in the proximity of samples from the previous step, which improves the stability. 2) We further treat the approximate maximum likelihood learning of EBM as a multi-step differential game, and extend the YOPO framework to cut out redundant calculations during backpropagation, which accelerates the training substantially. 3) Rather than initializing SGLD chain from random noise, we introduce a new informative initialization that samples from a distribution estimated from training data. 4) This informative initialization allows us to enable batch normalization in JEM, which further releases the power of modern CNN architectures for hybrid modeling. Code: https://github.com/sndnyang/JEMPP
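    The "proximal SGLD" idea in item 1 can be pictured as plain stochastic gradient Langevin dynamics in which each new sample is kept close to the previous one; the clipping used below is my own reading of that constraint, and the analytic toy energy is hypothetical, so this is a sketch rather than the JEM++ training code.

    ```python
    import numpy as np

    def proximal_sgld(grad_energy, x0, steps=2000, step_size=0.01, prox_radius=0.1, rng=None):
        """SGLD-style sampling with a proximity constraint: each Langevin update is
        clipped so the new sample stays within prox_radius of the previous one."""
        rng = rng or np.random.default_rng(0)
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            noise = rng.normal(size=x.shape)
            delta = -0.5 * step_size * grad_energy(x) + np.sqrt(step_size) * noise
            norm = np.linalg.norm(delta)
            if norm > prox_radius:              # proximal clipping of the move
                delta *= prox_radius / norm
            x = x + delta
        return x

    # Toy energy E(x) = ||x||^2 / 2, so samples drift toward the low-energy region near the origin.
    sample = proximal_sgld(lambda x: x, x0=np.ones(2) * 3.0)
    print(sample)
    ```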
    Hybrid Data Augmentation and Deep Attention-based Dilated Convolutional-Recurrent Neural Networks for Speech Emotion Recognition. (arXiv:2109.09026v1 [cs.SD])
    (0 min) Speech emotion recognition (SER) has been one of the significant tasks in Human-Computer Interaction (HCI) applications. However, it is hard to choose the optimal features and deal with imbalanced labeled data. In this article, we investigate hybrid data augmentation (HDA) methods to generate and balance data based on traditional and generative adversarial network (GAN) methods. To evaluate the effectiveness of HDA methods, a deep learning framework, namely ADCRNN, is designed by integrating deep dilated convolutional-recurrent neural networks with an attention mechanism. Besides, we choose 3D log Mel-spectrogram (MelSpec) features as the inputs for the deep learning framework. Furthermore, we reconfigure a loss function by combining a softmax loss and a center loss to classify the emotions. For validating our proposed methods, we use the EmoDB dataset that consists of several emotions with imbalanced samples. Experimental results show that the proposed methods achieve better accuracy than the state-of-the-art methods on EmoDB, with 87.12% and 88.47% for the traditional and GAN-based methods, respectively.
    An empirical study on using CNNs for fast radio signal prediction. (arXiv:2006.09245v3 [cs.LG] UPDATED)
    (0 min) Accurate radio frequency power prediction in a geographic region is a computationally expensive part of finding the optimal transmitter location using a ray tracing software. We empirically analyze the viability of deep learning models to speed up this process. Specifically, deep learning methods including CNNs and UNET are typically used for segmentation, and can also be employed in power prediction tasks. We consider a dataset that consists of radio frequency power values for five different regions with four different frame dimensions. We compare deep learning-based prediction models including RadioUNET and four different variations of the UNET model for the power prediction task. More complex UNET variations improve the model on higher resolution frames such as 256x256. However, using the same models on lower resolutions results in overfitting and simpler models perform better. Our detailed numerical analysis shows that the deep learning models are effective in power prediction and they are able to generalize well to the new regions.
    Ensemble Learning using Error Correcting Output Codes: New Classification Error Bounds. (arXiv:2109.08967v1 [cs.LG])
    (0 min) New bounds on classification error rates for the error-correcting output code (ECOC) approach in machine learning are presented. These bounds have exponential decay complexity with respect to codeword length and theoretically validate the effectiveness of the ECOC approach. Bounds are derived for two different models: the first under the assumption that all base classifiers are independent and the second under the assumption that all base classifiers are mutually correlated up to first-order. Moreover, we perform ECOC classification on six datasets and compare their error rates with our bounds to experimentally validate our work and show the effect of correlation on classification accuracy.
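    The ECOC scheme being bounded works by assigning each class a binary codeword, training one base classifier per codeword bit, and decoding a test point to the nearest codeword. A minimal decoding sketch follows; the code matrix is a made-up example (minimum Hamming distance 4) and is independent of the paper's bounds.

    ```python
    import numpy as np

    # Hypothetical 4-class code matrix: one row per class, one column per base classifier.
    CODE = np.array([
        [0, 0, 1, 1, 0, 1],
        [0, 1, 0, 1, 1, 0],
        [1, 0, 0, 0, 1, 1],
        [1, 1, 1, 0, 0, 0],
    ])

    def ecoc_decode(bit_predictions, code=CODE):
        """Return the class whose codeword has the smallest Hamming distance
        to the vector of base-classifier outputs."""
        bits = np.asarray(bit_predictions)
        distances = (code != bits).sum(axis=1)
        return int(np.argmin(distances))

    # One base classifier errs (one bit flipped), yet the correct class is still recovered.
    noisy = np.array([0, 1, 1, 1, 0, 1])   # class 0's codeword with one bit error
    print(ecoc_decode(noisy))              # -> 0
    ```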
    Anti-Neuron Watermarking: Protecting Personal Data Against Unauthorized Neural Model Training. (arXiv:2109.09023v1 [cs.CR])
    (0 min) In this paper, we raise an emerging personal data protection problem where user personal data (e.g., images) could be inappropriately exploited to train deep neural network models without authorization. To solve this problem, we revisit traditional watermarking in advanced machine learning settings. By embedding a watermarking signature into user images using a specialized linear color transformation, neural models will be imprinted with such a signature if the training data include watermarked images. Then, a third-party verifier can verify potential unauthorized usage by inferring the watermark signature from neural models. We further explore the desired properties of the watermarking and signature space for convincing verification. Through extensive experiments, we show empirically that linear color transformation is effective in protecting users' personal images in various realistic settings. To the best of our knowledge, this is the first work to protect users' personal data from unauthorized usage in neural network training.
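    The signature mechanism described here, a specialized linear color transformation applied to a user's images, can be sketched generically as mixing the RGB channels with a secret matrix kept close to identity. The snippet below is an illustrative toy; the key matrix and its strength are entirely hypothetical, not the paper's construction.

    ```python
    import numpy as np

    def watermark(image, key_matrix):
        """Apply a secret linear color transformation to an HxWx3 image in [0, 1].
        key_matrix is a 3x3 mixing matrix kept near identity so the change is subtle."""
        flat = image.reshape(-1, 3)
        marked = flat @ key_matrix.T
        return np.clip(marked, 0.0, 1.0).reshape(image.shape)

    rng = np.random.default_rng(0)
    img = rng.random((4, 4, 3))                          # stand-in user image
    key = np.eye(3) + 0.02 * rng.normal(size=(3, 3))     # hypothetical per-user signature
    marked = watermark(img, key)
    print(np.abs(marked - img).max())                    # the perturbation stays small
    ```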
    Experimental Evaluation of Computational Complexity for Different Neural Network Equalizers in Optical Communications. (arXiv:2109.08711v1 [eess.SP])
    (0 min) Addressing the neural network-based optical channel equalizers, we quantify the trade-off between their performance and complexity by carrying out the comparative analysis of several neural network architectures, presenting the results for TWC and SSMF set-ups.
    Inductive Conformal Recommender System. (arXiv:2109.08949v1 [cs.LG])
    (0 min) Traditional recommendation algorithms develop techniques that can help people choose desirable items. However, in many real-world applications, along with a set of recommendations, it is also essential to quantify each recommendation's (un)certainty. The conformal recommender system uses the experience of a user to output a set of recommendations, each associated with a precise confidence value. Given a significance level $\varepsilon$, it provides a bound $\varepsilon$ on the probability of making a wrong recommendation. The conformal framework uses a key concept called a nonconformity measure, which measures the strangeness of an item with respect to other items. One of the significant design challenges of any conformal recommendation framework is integrating the nonconformity measure with the recommendation algorithm. In this paper, we introduce an inductive variant of a conformal recommender system. We propose and analyze different nonconformity measures in the inductive setting. We also provide theoretical proofs of the error bound and the time complexity. Extensive empirical analysis on ten benchmark datasets demonstrates that the inductive variant substantially improves computation time while preserving accuracy.
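    A minimal picture of the inductive conformal step, turning a nonconformity score into a confidence statement at level $\varepsilon$ using a held-out calibration set, might look like the following. The nonconformity measure used here (absolute rating error) and the synthetic scores are placeholders, not the measures proposed in the paper.

    ```python
    import numpy as np

    def conformal_pvalue(calibration_scores, new_score):
        """Inductive conformal p-value: fraction of calibration nonconformity scores
        at least as large as the new example's score, counting the new example itself."""
        cal = np.asarray(calibration_scores)
        return (np.sum(cal >= new_score) + 1) / (len(cal) + 1)

    # Placeholder nonconformity: |predicted rating - true rating| on a calibration set.
    cal_scores = np.abs(np.random.default_rng(0).normal(0, 0.5, size=500))
    new_score = 0.3
    p = conformal_pvalue(cal_scores, new_score)
    epsilon = 0.1
    print(p, "recommend" if p > epsilon else "abstain")   # wrong-recommendation rate bounded by epsilon
    ```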
    Hindsight Foresight Relabeling for Meta-Reinforcement Learning. (arXiv:2109.09031v1 [cs.LG])
    (0 min) Meta-reinforcement learning (meta-RL) algorithms allow for agents to learn new behaviors from small amounts of experience, mitigating the sample inefficiency problem in RL. However, while meta-RL agents can adapt quickly to new tasks at test time after experiencing only a few trajectories, the meta-training process is still sample-inefficient. Prior works have found that in the multi-task RL setting, relabeling past transitions and thus sharing experience among tasks can improve sample efficiency and asymptotic performance. We apply this idea to the meta-RL setting and devise a new relabeling method called Hindsight Foresight Relabeling (HFR). We construct a relabeling distribution using the combination of "hindsight", which is used to relabel trajectories using reward functions from the training task distribution, and "foresight", which takes the relabeled trajectories and computes the utility of each trajectory for each task. HFR is easy to implement and readily compatible with existing meta-RL algorithms. We find that HFR improves performance when compared to other relabeling methods on a variety of meta-RL tasks.
    Regularize! Don't Mix: Multi-Agent Reinforcement Learning without Explicit Centralized Structures. (arXiv:2109.09038v1 [cs.LG])
    (0 min) We propose Multi-Agent Regularized Q-learning (MARQ), which uses regularization for Multi-Agent Reinforcement Learning rather than learning explicit cooperative structures. Many MARL approaches leverage centralized structures in order to exploit global state information or to remove communication constraints when the agents act in a decentralized manner. Instead of learning redundant structures that are removed during agent execution, we propose to leverage shared experiences of the agents to regularize the individual policies in order to promote structured exploration. We examine several different approaches to how MARQ can either explicitly or implicitly regularize our policies in a multi-agent setting. MARQ aims to address these limitations in the MARL context by applying regularization constraints which can correct bias in off-policy out-of-distribution agent experiences and promote diverse exploration. Our algorithm is evaluated on several benchmark multi-agent environments and we show that MARQ consistently outperforms several baselines and state-of-the-art algorithms, learning in fewer steps and converging to higher returns.
    Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition. (arXiv:2109.08744v1 [eess.AS])
    (0 min) In this paper, we propose a dual-encoder ASR architecture for joint modeling of close-talk (CT) and far-talk (FT) speech, in order to combine the advantages of CT and FT devices for better accuracy. The key idea is to add an encoder selection network to choose the optimal input source (CT or FT) and the corresponding encoder. We use a single-channel encoder for CT speech and a multi-channel encoder with Spatial Filtering neural beamforming for FT speech, which are jointly trained with the encoder selection. We validate our approach on both attention-based and RNN Transducer end-to-end ASR systems. The experiments are done with conversational speech from a medical use case, which is recorded simultaneously with a CT device and a microphone array. Our results show that the proposed dual-encoder architecture obtains up to 9% relative WER reduction when using both CT and FT input, compared to the best single-encoder system trained and tested in matched condition.
    Back-translation for Large-Scale Multilingual Machine Translation. (arXiv:2109.08712v1 [cs.CL])
    (0 min) This paper describes our approach to the shared task on large-scale multilingual machine translation at the sixth Conference on Machine Translation (WMT-21). This work aims to build a single multilingual translation system with the hypothesis that a universal cross-language representation leads to better multilingual translation performance. We extend the exploration of different back-translation methods from bilingual translation to multilingual translation. Better performance is obtained with the constrained sampling method, which differs from the finding for bilingual translation. Besides, we also explore the effect of vocabularies and the amount of synthetic data. Surprisingly, smaller vocabularies perform better, and the extensive monolingual English data offers only a modest improvement. We submitted our system to both small tasks and achieved second place.
    SpeechNAS: Towards Better Trade-off between Latency and Accuracy for Large-Scale Speaker Verification. (arXiv:2109.08839v1 [cs.SD])
    (0 min) Recently, the x-vector has been a successful and popular approach for speaker verification, which employs a time delay neural network (TDNN) and statistics pooling to extract a speaker-characterizing embedding from variable-length utterances. Improvement upon the x-vector has been an active research area, and numerous neural networks have been elaborately designed based on the x-vector, e.g., extended TDNN (E-TDNN), factorized TDNN (F-TDNN), and densely connected TDNN (D-TDNN). In this work, we try to identify the optimal architectures from a TDNN-based search space employing neural architecture search (NAS), named SpeechNAS. Leveraging recent advances in speaker recognition, such as high-order statistics pooling, the multi-branch mechanism, D-TDNN and the angular additive margin softmax (AAM) loss with a minimum hyper-spherical energy (MHE), SpeechNAS automatically discovers five network architectures, from SpeechNAS-1 to SpeechNAS-5, of various numbers of parameters and GFLOPs on the large-scale text-independent speaker recognition dataset VoxCeleb1. Our derived best neural network achieves an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1, which surpasses previous TDNN-based state-of-the-art approaches by a large margin. Code and trained weights are at https://github.com/wentaozhu/speechnas.git
    G-CoS: GNN-Accelerator Co-Search Towards Both Better Accuracy and Efficiency. (arXiv:2109.08983v1 [cs.AR])
    (0 min) Graph Neural Networks (GNNs) have emerged as the state-of-the-art (SOTA) method for graph-based learning tasks. However, it remains prohibitively challenging to run GNN inference over large graph datasets, limiting their application to large-scale real-world tasks. While end-to-end joint optimization of GNNs and their accelerators is promising for boosting GNNs' inference efficiency and expediting the design process, it is still underexplored due to the vast and distinct design spaces of GNNs and their accelerators. In this work, we propose G-CoS, a GNN and accelerator co-search framework that can automatically search for matched GNN structures and accelerators to maximize both task accuracy and acceleration efficiency. Specifically, G-CoS integrates two major enabling components: (1) a generic GNN accelerator search space which is applicable to various GNN structures and (2) a one-shot GNN and accelerator co-search algorithm that enables simultaneous and efficient search for optimal GNN structures and their matched accelerators. To the best of our knowledge, G-CoS is the first co-search framework for GNNs and their accelerators. Extensive experiments and ablation studies show that the GNNs and accelerators generated by G-CoS consistently outperform SOTA GNNs and GNN accelerators in terms of both task accuracy and hardware efficiency, while only requiring a few hours for the end-to-end generation of the best matched GNNs and their accelerators.
    Robust Automated Framework for COVID-19 Disease Identification from a Multicenter Dataset of Chest CT Scans. (arXiv:2109.09241v1 [eess.IV])
    (0 min) The objective of this study is to develop a robust deep learning-based framework to distinguish COVID-19, Community-Acquired Pneumonia (CAP), and Normal cases based on chest CT scans acquired in different imaging centers using various protocols, and radiation doses. We showed that while our proposed model is trained on a relatively small dataset acquired from only one imaging center using a specific scanning protocol, the model performs well on heterogeneous test sets obtained by multiple scanners using different technical parameters. We also showed that the model can be updated via an unsupervised approach to cope with the data shift between the train and test sets and enhance the robustness of the model upon receiving a new external dataset from a different center. We adopted an ensemble architecture to aggregate the predictions from multiple versions of the model. For initial training and development purposes, an in-house dataset of 171 COVID-19, 60 CAP, and 76 Normal cases was used, which contained volumetric CT scans acquired from one imaging center using a constant standard radiation dose scanning protocol. To evaluate the model, we collected four different test sets retrospectively to investigate the effects of the shifts in the data characteristics on the model's performance. Among the test cases, there were CT scans with similar characteristics as the train set as well as noisy low-dose and ultra-low dose CT scans. In addition, some test CT scans were obtained from patients with a history of cardiovascular diseases or surgeries. The entire test dataset used in this study contained 51 COVID-19, 28 CAP, and 51 Normal cases. Experimental results indicate that our proposed framework performs well on all test sets achieving total accuracy of 96.15% (95%CI: [91.25-98.74]), COVID-19 sensitivity of 96.08% (95%CI: [86.54-99.5]), CAP sensitivity of 92.86% (95%CI: [76.50-99.19]).
    Manifold-preserved GANs. (arXiv:2109.08955v1 [cs.CV])
    (0 min) Generative Adversarial Networks (GANs) have been widely adopted in various fields. However, existing GANs are generally unable to preserve the manifold of the data space, mainly due to the simple representation of the discriminator for the real/generated data. To address this open challenge, this paper proposes Manifold-preserved GANs (MaF-GANs), which generalize Wasserstein GANs to a high-dimensional form. Specifically, to improve the representation of data, the discriminator in MaF-GANs is designed to map data into a high-dimensional manifold. Furthermore, to stabilize the training of MaF-GANs, an operation with a precise and universal solution for enforcing K-Lipschitz continuity, called Topological Consistency, is proposed. The effectiveness of the proposed method is justified by both theoretical analysis and empirical results. When adopting DCGAN as the backbone on CelebA (256*256), the proposed method achieved 12.43 FID, which outperforms state-of-the-art models like Realness GAN (23.51 FID) by a large margin. Code will be made publicly available.
    AutoInit: Analytic Signal-Preserving Weight Initialization for Neural Networks. (arXiv:2109.08958v1 [cs.LG])
    (0 min) Neural networks require careful weight initialization to prevent signals from exploding or vanishing. Existing initialization schemes solve this problem in specific cases by assuming that the network has a certain activation function or topology. It is difficult to derive such weight initialization strategies, and modern architectures therefore often use these same initialization schemes even though their assumptions do not hold. This paper introduces AutoInit, a weight initialization algorithm that automatically adapts to different neural network architectures. By analytically tracking the mean and variance of signals as they propagate through the network, AutoInit is able to appropriately scale the weights at each layer to avoid exploding or vanishing signals. Experiments demonstrate that AutoInit improves performance of various convolutional and residual networks across a range of activation function, dropout, weight decay, learning rate, and normalizer settings. Further, in neural architecture search and activation function meta-learning, AutoInit automatically calculates specialized weight initialization strategies for thousands of unique architectures and hundreds of unique activation functions, and improves performance in vision, language, tabular, multi-task, and transfer learning scenarios. AutoInit thus serves as an automatic configuration tool that makes design of new neural network architectures more robust. The AutoInit package provides a wrapper around existing TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit.
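    AutoInit itself tracks signal statistics analytically; a rough empirical approximation of the same goal (rescale each layer's weights until its outputs have roughly unit variance on a sample batch, in the spirit of LSUV-style initialization) can be sketched as follows. This is an assumption-laden illustration, not the AutoInit package:

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def empirical_signal_preserving_init(model, sample_batch, tol=0.05, max_iter=10):
        activations = {}

        def make_hook(name):
            def hook(module, inp, out):
                activations[name] = out.detach()
            return hook

        # Visit layers in definition order and rescale each one's weights so that
        # its outputs on the sample batch have (approximately) unit variance.
        for name, module in model.named_modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                handle = module.register_forward_hook(make_hook(name))
                for _ in range(max_iter):
                    model(sample_batch)
                    std = activations[name].std().item()
                    if abs(std - 1.0) < tol:
                        break
                    module.weight.data /= max(std, 1e-8)   # push output variance toward 1
                handle.remove()

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    empirical_signal_preserving_init(model, torch.randn(256, 32))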
    Small Lesion Segmentation in Brain MRIs with Subpixel Embedding. (arXiv:2109.08791v1 [eess.IV])
    (0 min) We present a method to segment MRI scans of the human brain into ischemic stroke lesion and normal tissues. We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network. Our embedding network learns features that can resolve detailed structures in the brain without the need for high-resolution training images, which are often unavailable and expensive to acquire. Alternatively, the encoder-decoder learns global structures by means of striding and max pooling. Our embedding network complements the encoder-decoder architecture by guiding the decoder with fine-grained details lost to spatial downsampling during the encoder stage. Unlike previous works, our decoder outputs at 2 times the input resolution, where a single pixel in the input resolution is predicted by four neighboring subpixels in our output. To obtain the output at the original scale, we propose a learnable downsampler (as opposed to hand-crafted ones e.g. bilinear) that combines subpixel predictions. Our approach improves the baseline architecture by approximately 11.7% and achieves the state of the art on the ATLAS public benchmark dataset with a smaller memory footprint and faster runtime than the best competing method. Our source code has been made available at: https://github.com/alexklwong/subpixel-embedding-segmentation.
    BERT-Beta: A Proactive Probabilistic Approach to Text Moderation. (arXiv:2109.08805v1 [cs.CL])
    (0 min) Text moderation for user generated content, which helps to promote healthy interaction among users, has been widely studied and many machine learning models have been proposed. In this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. Specifically, we propose a new concept {\it text toxicity propensity} to characterize the extent to which a text tends to attract toxic comments. Beta regression is then introduced to do the probabilistic modeling, which is demonstrated to function well in comprehensive experiments. We also propose an explanation method to communicate the model decision clearly. Both propensity scoring and interpretation benefit text moderation in a novel manner. Finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.
    Removing Noise from Extracellular Neural Recordings Using Fully Convolutional Denoising Autoencoders. (arXiv:2109.08945v1 [q-bio.NC])
    (0 min) Extracellular recordings are severely contaminated by a considerable number of noise sources, rendering denoising an extremely challenging task that must be tackled for efficient spike sorting. To this end, we propose an end-to-end deep learning approach to the problem, utilizing a Fully Convolutional Denoising Autoencoder, which learns to produce a clean neuronal activity signal from a noisy multichannel input. The experimental results on simulated data show that our proposed method can significantly improve the quality of noise-corrupted neural signals, outperforming widely-used wavelet denoising techniques.
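    A minimal 1-D fully convolutional denoising autoencoder of this kind can be written in a few lines of PyTorch; the channel counts, kernel sizes, and the four-channel input are illustrative choices, not the authors' configuration:

    import torch
    import torch.nn as nn

    class FCDenoisingAE(nn.Module):
        def __init__(self, in_channels=4):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(in_channels, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3), nn.ReLU())
            self.decoder = nn.Sequential(
                nn.ConvTranspose1d(32, 16, kernel_size=8, stride=2, padding=3), nn.ReLU(),
                nn.ConvTranspose1d(16, in_channels, kernel_size=8, stride=2, padding=3))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = FCDenoisingAE()
    noisy = torch.randn(8, 4, 1024)                        # batch of noisy multichannel windows
    clean_target = torch.randn(8, 4, 1024)                 # would be the clean simulated signal
    loss = nn.functional.mse_loss(model(noisy), clean_target)
    loss.backward()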
    Geometric Interpretation of Running Nystr\"{o}m-Based Kernel Machines and Error Analysis. (arXiv:2002.08937v2 [cs.LG] UPDATED)
    (0 min) Recently, the Nystr\"{o}m method has proved its prominence empirically and theoretically in speeding up the training of kernel machines while retaining satisfactory performance and accuracy. So far, several different approaches have been proposed to exploit the Nystr\"{o}m method in scaling up kernel machines. However, there is no comparative study of these approaches, and they were individually analyzed for specific types of kernel machines. Therefore, it remains unclear which approach's philosophy is more promising when extended to other kernel machines. In this work, motivated by the column inclusion property of Gram matrices, we develop a new approach with a clear geometric interpretation for running Nystr\"{o}m-based kernel machines. We show that the other two well-studied approaches can be equivalently transformed into our proposed one. Consequently, analysis established for the proposed approach also works for these two. Particularly, our proposed approach makes it possible to develop approximation errors in a general setting. Besides, our analysis also manifests the relations among the aforementioned two approaches and another naive one. First, the analytical forms of the corresponding approximate solutions differ in only one term. Second, the naive approach can be implemented efficiently by sharing the same training procedure with the others. These analytical results lead to the conjecture that the naive approach can provide more accurate approximate solutions than the other two sophisticated approaches. Since our analysis also offers ways of computing the accuracy of these approximate solutions, we run experiments with classification tasks to confirm our conjecture.
    Text Detoxification using Large Pre-trained Neural Models. (arXiv:2109.08914v1 [cs.CL])
    (0 min) We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.
    KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks. (arXiv:2107.01739v2 [cs.LG] UPDATED)
    (0 min) Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability. We quantify the tradeoffs between memory and communication cost and evaluate KAISA on large models, including ResNet-50, Mask R-CNN, U-Net, and BERT, on up to 128 NVIDIA A100 GPUs. Compared to the original optimizers, KAISA converges 18.1-36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers. KAISA is open source and available at https://github.com/gpauloski/kfac_pytorch.
    KNN Learning Techniques for Proportional Myocontrol in Prosthetics. (arXiv:2109.08917v1 [eess.SP])
    (0 min) This work has been conducted in the context of pattern-recognition-based control for electromyographic prostheses. It presents a k-nearest neighbour (kNN) classification technique for gesture recognition, extended by a proportionality scheme. The methods proposed are practically implemented and validated. Datasets are captured by means of a state-of-the-art 8-channel electromyography (EMG) armband positioned on the forearm. Based on this data, the influence of kNN's parameters is analyzed in pilot experiments. Moreover, the effect of proportionality scaling and rest thresholding schemes is investigated. A randomized, double-blind user study is conducted to compare the implemented method with the state-of-research algorithm Ridge Regression with Random Fourier Features (RR-RFF) for different levels of gesture exertion. The results from these experiments show a statistically significant improvement in favour of the kNN-based algorithm.
    Temporal Knowledge Graph Completion using Box Embeddings. (arXiv:2109.08970v1 [cs.LG])
    (0 min) Knowledge graph completion is the task of inferring missing facts based on existing data in a knowledge graph. Temporal knowledge graph completion (TKGC) is an extension of this task to temporal knowledge graphs, where each fact is additionally associated with a time stamp. Current approaches for TKGC primarily build on existing embedding models which are developed for (static) knowledge graph completion, and extend these models to incorporate time, where the idea is to learn latent representations for entities, relations, and timestamps and then use the learned representations to predict missing facts at various time steps. In this paper, we propose BoxTE, a box embedding model for TKGC, building on the static knowledge graph embedding model BoxE. We show that BoxTE is fully expressive, and possesses strong inductive capacity in the temporal setting. We then empirically evaluate our model and show that it achieves state-of-the-art results on several TKGC benchmarks.
    AI Accelerator Survey and Trends. (arXiv:2109.08957v1 [cs.AR])
    (0 min) Over the past several years, new machine learning accelerators have been announced and released every month for a variety of applications from speech recognition, video object detection, and assisted driving to many data center applications. This paper updates the survey of AI accelerators and processors from the past two years. This paper collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. This year, we also compile a list of benchmarking performance results and compute the computational efficiency with respect to peak performance.
    Probabilistic Bearing Fault Diagnosis Using Gaussian Process with Tailored Feature Extraction. (arXiv:2109.09189v1 [cs.LG])
    (0 min) Rolling bearings are subject to various faults due to their long-term operation under harsh environments, which can lead to unexpected breakdowns of machinery systems and cause severe accidents. Deep learning methods have recently gained growing interest and are extensively applied in data-driven bearing fault diagnosis. However, current deep learning methods perform bearing fault diagnosis in the form of deterministic classification, which overlooks the uncertainties that inevitably exist in actual practice. To tackle this issue, in this research we develop a probabilistic fault diagnosis framework that can account for the uncertainty effect in prediction, which bears practical significance. This framework fully leverages the probabilistic nature of the Gaussian process classifier (GPC). To facilitate the establishment of a high-fidelity GPC, the tailored feature extraction with dimensionality reduction can be optimally determined through cross validation-based grid search over a prespecified method pool consisting of various kernel principal component analysis (KPCA) methods and stacked autoencoders. This strategy ensures that the complex nonlinear relations between the features and faults are adequately characterized. Furthermore, the sensor fusion concept is adopted to enhance the diagnosis performance. Compared with traditional deep learning methods, this proposed framework usually requires less labeled data and less effort for parameter tuning. Systematic case studies using the publicly accessible experimental rolling bearing dataset are carried out to validate this new framework. Various factors influencing fault diagnosis performance are also thoroughly investigated.
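    In scikit-learn terms, the core of such a pipeline (KPCA-based feature extraction feeding a Gaussian process classifier, with the KPCA kernel chosen by cross-validated grid search) might look like the sketch below; the synthetic features and the reduced method pool are stand-ins, and the stacked-autoencoder option and sensor fusion are omitted:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import KernelPCA
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))                 # stand-in vibration features
    y = rng.integers(0, 3, size=200)               # three hypothetical fault classes

    pipe = Pipeline([
        ("kpca", KernelPCA(n_components=10)),      # tailored feature extraction / dimensionality reduction
        ("gpc", GaussianProcessClassifier()),      # probabilistic classifier with class probabilities
    ])
    grid = GridSearchCV(pipe, {"kpca__kernel": ["rbf", "poly", "sigmoid"]}, cv=3)
    grid.fit(X, y)
    print(grid.best_params_)
    print(grid.predict_proba(X[:3]))               # per-class probabilities quantify prediction uncertainty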
    Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy. (arXiv:2109.08780v1 [cs.LG])
    (0 min) Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an architecture-independent approach for leveraging the syntactic hierarchies of source code to incorporate entire file-level context into a fixed-length window. Using concrete syntax trees of each source file we extract syntactic hierarchies and integrate them into the context window by selectively removing from view more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in the Python programming language, achieving a new state-of-the-art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, and method body completion/code summarization conditioned on file-level context.
    Towards Zero-Label Language Learning. (arXiv:2109.09193v1 [cs.CL])
    (0 min) This paper explores zero-label learning in Natural Language Processing (NLP), whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. At the core of our framework is a novel approach for better leveraging the powerful pretrained language models. Specifically, inspired by the recent success of few-shot inference on GPT-3, we present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations. Our method enables zero-label learning as we train task-specific models solely on the synthetic data, yet we achieve better or comparable results from strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, our approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
    Results on the algebraic matroid of the determinantal variety. (arXiv:2002.05082v6 [math.AG] UPDATED)
    (0 min) We make progress towards characterizing the algebraic matroid of the determinantal variety. We present a family of base sets of the matroid and conjecture these are all the base sets. This conjecture is reduced to a purely combinatorial statement, which is proved for special cases. Our results rely on the combinatorial notion of relaxed supports of linkage matching fields that we introduce, our interpretation of the problem of completing a matrix of bounded rank from a subset of its entries as a linear section problem on the Grassmannian, and a connection that we draw with a class of local coordinates on the Grassmannian described by Sturmfels $\&$ Zelevinsky.
    Towards Representation Learning for Atmospheric Dynamics. (arXiv:2109.09076v1 [cs.LG])
    (0 min) The prediction of future climate scenarios under anthropogenic forcing is critical to understand climate change and to assess the impact of potentially counter-acting technologies. Machine learning and hybrid techniques for this prediction rely on informative metrics that are sensitive to pertinent but often subtle influences. For atmospheric dynamics, a critical part of the climate system, the "eyeball metric", i.e. a visual inspection by an expert, is currently still the gold standard. However, it cannot be used as a metric in machine learning systems where an algorithmic description is required. Motivated by the success of intermediate neural network activations as the basis for learned metrics, e.g. in computer vision, we present a novel, self-supervised representation learning approach specifically designed for atmospheric dynamics. Our approach, called AtmoDist, trains a neural network on a simple, auxiliary task: predicting the temporal distance between elements of a shuffled sequence of atmospheric fields (e.g. the components of the wind field from a reanalysis or simulation). The task forces the network to learn important intrinsic aspects of the data as activations in its layers, from which a discriminative metric can hence be obtained. We demonstrate this by using AtmoDist to define a metric for GAN-based super resolution of vorticity and divergence. Our upscaled data closely matches the true statistics of a high resolution reference and significantly outperforms the state-of-the-art based on mean squared error. Since AtmoDist is unsupervised, only requires a temporal sequence of fields, and uses a simple auxiliary task, it can be used in a wide range of applications that aim to understand and mitigate climate change.
    Underwater Image Enhancement Using Convolutional Neural Network. (arXiv:2109.08916v1 [eess.IV])
    (0 min) This work proposes a method for underwater image enhancement using the principle of histogram equalization. Since underwater images have a strong globally dominant colour, their colourfulness and contrast are often degraded. Before applying the histogram equalization technique, the image is converted from a colour image to a grayscale image for further operations. Histogram equalization is a technique for adjusting image intensities to enhance contrast. The colours of the image are retained using a convolutional neural network model which is trained on datasets of underwater images to give better results.
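    The histogram-equalization step is a one-liner in OpenCV; the sketch below shows the grayscale conversion and equalization described above, plus a common luminance-channel variant that keeps colour. The colour-restoring CNN from the abstract is not reproduced here, and the file path is hypothetical:

    import cv2

    def equalize_grayscale(path="underwater.jpg"):
        img = cv2.imread(path)                             # BGR colour image
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # colour -> grayscale
        return cv2.equalizeHist(gray)                      # spread intensities to boost contrast

    def equalize_luminance(path="underwater.jpg"):
        # common colour-preserving variant: equalize only the luma channel
        img = cv2.imread(path)
        ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
        ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
        return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)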
    Dual Behavior Regularized Reinforcement Learning. (arXiv:2109.09037v1 [cs.LG])
    (0 min) Reinforcement learning has been shown to perform a range of complex tasks through interaction with an environment or by leveraging collected experience. However, many of these approaches presume optimal or near-optimal experiences or the presence of a consistent environment. In this work we propose a dual, advantage-based behavior policy based on counterfactual regret minimization. We demonstrate the flexibility of this approach and how it can be adapted to online contexts where the environment is available to collect experiences, as well as a variety of other contexts. We demonstrate that this new algorithm can outperform several strong baseline models in different contexts based on a range of continuous environments. Additional ablations provide insights into how our dual behavior regularized reinforcement learning approach is designed compared with other plausible modifications and demonstrate its ability to generalize.
    A survey on deep learning approaches for breast cancer diagnosis. (arXiv:2109.08853v1 [eess.IV])
    (0 min) Deep learning has introduced several learning-based methods to recognize breast tumours and presents high applicability in breast cancer diagnostics. It has presented itself as a practical component of Computer-Aided Diagnostic (CAD) systems to further assist radiologists in diagnostics for different modalities. A deep learning network trained on images provided by hospitals or public databases can perform classification, detection, and segmentation of lesion types. Significant progress has been made in recognizing tumours on 2D images, but recognizing 3D images remains a frontier so far. The interconnection of deep learning networks between different fields of study helps propel discoveries towards more efficient, accurate, and robust networks. In this review paper, the following topics will be explored: (i) theory and application of deep learning, (ii) progress of 2D, 2.5D, and 3D CNN approaches in breast tumour recognition from a performance metric perspective, and (iii) challenges faced in CNN approaches.
    Coordinate Descent for MCP/SCAD Penalized Least Squares Converges Linearly. (arXiv:2109.08850v1 [stat.ML])
    (0 min) Recovering sparse signals from observed data is an important topic in signal/imaging processing, statistics and machine learning. Nonconvex penalized least squares has attracted a lot of attention since it enjoys nice statistical properties. Computationally, coordinate descent (CD) is a workhorse for minimizing the nonconvex penalized least squares criterion due to its simplicity and scalability. In this work, we prove a linear convergence rate for CD when solving MCP/SCAD penalized least squares problems.
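    For reference, the coordinate update that such an analysis studies has a simple closed form (firm thresholding). The sketch below implements cyclic CD for MCP-penalized least squares under the usual assumption that the columns of X are standardized so that x_j'x_j = n; it illustrates the algorithm being analyzed, not the paper's proof:

    import numpy as np

    def firm_threshold(z, lam, gamma):
        # closed-form single-coordinate MCP solution (requires gamma > 1)
        if abs(z) <= gamma * lam:
            return np.sign(z) * max(abs(z) - lam, 0.0) / (1.0 - 1.0 / gamma)
        return z

    def cd_mcp(X, y, lam=0.1, gamma=3.0, n_sweeps=200):
        n, p = X.shape
        beta = np.zeros(p)
        residual = y.astype(float).copy()                  # residual = y - X @ beta, with beta = 0
        for _ in range(n_sweeps):
            for j in range(p):
                z = X[:, j] @ residual / n + beta[j]
                new_bj = firm_threshold(z, lam, gamma)
                residual += X[:, j] * (beta[j] - new_bj)   # incremental residual update
                beta[j] = new_bj
        return beta

    # Toy check on a sparse signal.
    rng = np.random.default_rng(0)
    n, p = 100, 20
    X = rng.normal(size=(n, p))
    X = (X - X.mean(0)) / X.std(0)                         # standardize so x_j'x_j = n
    beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
    y = X @ beta_true + 0.1 * rng.normal(size=n)
    print(np.round(cd_mcp(X, y), 2))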
    Towards Resilient Artificial Intelligence: Survey and Research Issues. (arXiv:2109.08904v1 [cs.LG])
    (0 min) Artificial intelligence (AI) systems are becoming critical components of today's IT landscapes. Their resilience against attacks and other environmental influences needs to be ensured just like for other IT assets. Considering the particular nature of AI, and machine learning (ML) in particular, this paper provides an overview of the emerging field of resilient AI and presents research issues the authors identify as potential future work.
    Model-Based Approach for Measuring the Fairness in ASR. (arXiv:2109.09061v1 [stat.ML])
    (0 min) The issue of fairness arises when automatic speech recognition (ASR) systems do not perform equally well for all subgroups of the population. In any fairness measurement study for ASR, the open questions of how to control the nuisance factors, how to handle unobserved heterogeneity across speakers, and how to trace the source of any word error rate (WER) gap among different subgroups are especially important - if not appropriately accounted for, incorrect conclusions will be drawn. In this paper, we introduce mixed-effects Poisson regression to better measure and interpret any WER difference among subgroups of interest. In particular, the presented method can effectively address the three problems raised above and is very flexible to use in practical disparity analyses. We demonstrate the validity of the proposed model-based approach on both synthetic and real-world speech data.
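    A simplified, fixed-effects version of this kind of analysis can be run with a Poisson GLM in statsmodels: model per-utterance error counts with log(reference word count) as an offset, so the subgroup coefficient is an adjusted log WER ratio. The data below are synthetic and the speaker-level random effects of the paper's mixed-effects model are deliberately omitted, so this is only an illustrative stand-in:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "subgroup": rng.choice(["A", "B"], size=n),            # demographic subgroup of interest
        "style": rng.choice(["read", "spontaneous"], size=n),  # nuisance factor
        "n_words": rng.integers(5, 40, size=n),                 # reference word count per utterance
    })
    df["n_errors"] = rng.poisson(np.where(df["subgroup"] == "A", 0.08, 0.12) * df["n_words"])

    model = smf.glm("n_errors ~ subgroup + style", data=df,
                    family=sm.families.Poisson(),
                    offset=np.log(df["n_words"])).fit()
    print(np.exp(model.params["subgroup[T.B]"]))                # estimated WER ratio of B relative to A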
    Self-supervised learning methods and applications in medical imaging analysis: A survey. (arXiv:2109.08685v1 [eess.IV])
    (0 min) The limited availability of high-quality annotated medical imaging datasets is a major problem that hinders machine learning applications in medical imaging analysis and impedes its advancement. Self-supervised learning is a recent training paradigm that enables learning robust representations without the need for human annotation, which can be considered an effective solution to the scarcity of annotated medical data. This article reviews the state-of-the-art research directions in self-supervised learning approaches for image data, with a focus on their applications in medical imaging analysis. The article covers a set of the most recent self-supervised learning methods from the computer vision field as they are applicable to medical imaging analysis and categorizes them as predictive, generative and contrastive approaches. Moreover, the article covers 40 of the most recent studies in the field of self-supervised learning in medical imaging analysis, aiming at shedding light on recent innovation in the field. Ultimately, the article concludes with possible future research directions in the field.
    Probabilistic Inference of Simulation Parameters via Parallel Differentiable Simulation. (arXiv:2109.08815v1 [cs.RO])
    (0 min) To accurately reproduce measurements from the real world, simulators need to have an adequate model of the physical system and require the parameters of the model be identified. We address the latter problem of estimating parameters through a Bayesian inference approach that approximates a posterior distribution over simulation parameters given real sensor measurements. By extending the commonly used Gaussian likelihood model for trajectories via the multiple-shooting formulation, our chosen particle-based inference algorithm Stein Variational Gradient Descent is able to identify highly nonlinear, underactuated systems. We leverage GPU code generation and differentiable simulation to evaluate the likelihood and its gradient for many particles in parallel. Our algorithm infers non-parametric distributions over simulation parameters more accurately than comparable baselines and handles constraints over parameters efficiently through gradient-based optimization. We evaluate estimation performance on several physical experiments. On an underactuated mechanism where a 7-DOF robot arm excites an object with an unknown mass configuration, we demonstrate how our inference technique can identify symmetries between the parameters and provide highly accurate predictions. Project website: https://uscresl.github.io/prob-diff-sim
    Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks. (arXiv:2109.08844v1 [stat.ML])
    (0 min) We study the problem of estimating an unknown function from noisy data using shallow (single-hidden layer) ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the space of functions of second-order bounded variation in the Radon domain. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. We also show that this is a "mixed variation" function space that contains classical multivariate function spaces including certain Sobolev spaces and certain spectral Barron spaces. Finally, we use these results to quantify a gap between neural networks and linear methods (which include kernel methods). This paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.
    A framework for benchmarking uncertainty in deep regression. (arXiv:2109.09048v1 [cs.LG])
    (0 min) We propose a framework for the assessment of uncertainty quantification in deep regression. The framework is based on regression problems where the regression function is a linear combination of nonlinear functions. Basically, any level of complexity can be realized through the choice of the nonlinear functions and the dimensionality of their domain. Results of an uncertainty quantification for deep regression are compared against those obtained by a statistical reference method. The reference method utilizes knowledge of the underlying nonlinear functions and is based on a Bayesian linear regression using a reference prior. Reliability of uncertainty quantification is assessed in terms of coverage probabilities, and accuracy through the size of calculated uncertainties. We illustrate the proposed framework by applying it to current approaches for uncertainty quantification in deep regression. The flexibility, together with the availability of a reference solution, makes the framework suitable for defining benchmark sets for uncertainty quantification.
    Effective and Scalable Clustering on Massive Attributed Graphs. (arXiv:2102.03826v2 [cs.SI] UPDATED)
    (3 min) Given a graph G where each node is associated with a set of attributes, and a parameter k specifying the number of output clusters, k-attributed graph clustering (k-AGC) groups nodes in G into k disjoint clusters, such that nodes within the same cluster share similar topological and attribute characteristics, while those in different clusters are dissimilar. This problem is challenging on massive graphs, e.g., with millions of nodes and billions of edges. For such graphs, existing solutions either incur prohibitively high costs, or produce clustering results with compromised quality. In this paper, we propose ACMin, an effective approach to k-AGC that yields high-quality clusters with cost linear to the size of the input graph G. The main contributions of ACMin are twofold: (i) a novel formulation of the k-AGC problem based on an attributed multi-hop conductance quality measure custom-made for this problem setting, which effectively captures cluster coherence in terms of both topological proximities and attribute similarities, and (ii) a linear-time optimization solver that obtains high-quality clusters iteratively, based on efficient matrix operations such as orthogonal iterations, an alternative optimization approach, as well as an initialization technique that significantly speeds up the convergence of ACMin in practice. Extensive experiments, comparing 11 competitors on 6 real datasets, demonstrate that ACMin consistently outperforms all competitors in terms of result quality measured against ground-truth labels, while being up to orders of magnitude faster. In particular, on the Microsoft Academic Knowledge Graph dataset with 265.2 million edges and 1.1 billion attribute values, ACMin outputs high-quality results for 5-AGC within 1.68 hours using a single CPU core, while none of the 11 competitors finish within 3 days.
    Learning to be Fair: A Consequentialist Approach to Equitable Decision-Making. (arXiv:2109.08792v1 [cs.LG])
    (2 min) In the dominant paradigm for designing equitable machine learning systems, one works to ensure that model predictions satisfy various fairness criteria, such as parity in error rates across race, gender, and other legally protected traits. That approach, however, typically divorces predictions from the downstream outcomes they ultimately affect, and, as a result, can induce unexpected harms. Here we present an alternative framework for fairness that directly anticipates the consequences of actions. Stakeholders first specify preferences over the possible outcomes of an algorithmically informed decision-making process. For example, lenders may prefer extending credit to those most likely to repay a loan, while also preferring similar lending rates across neighborhoods. One then searches the space of decision policies to maximize the specified utility. We develop and describe a method for efficiently learning these optimal policies from data for a large family of expressive utility functions, facilitating a more holistic approach to equitable decision-making.
    Intra-Inter Subject Self-supervised Learning for Multivariate Cardiac Signals. (arXiv:2109.08908v1 [cs.LG])
    (2 min) Learning information-rich and generalizable representations effectively from unlabeled multivariate cardiac signals to identify abnormal heart rhythms (cardiac arrhythmias) is valuable in real-world clinical settings but often challenging due to its complex temporal dynamics. Cardiac arrhythmias can vary significantly in temporal patterns even for the same patient ($i.e.$, intra subject difference). Meanwhile, the same type of cardiac arrhythmia can show different temporal patterns among different patients due to different cardiac structures ($i.e.$, inter subject difference). In this paper, we address the challenges by proposing an Intra-inter Subject self-supervised Learning (ISL) model that is customized for multivariate cardiac signals. Our proposed ISL model integrates medical knowledge into self-supervision to effectively learn from intra-inter subject differences. In intra subject self-supervision, the ISL model first extracts heartbeat-level features from each subject using a channel-wise attentional CNN-RNN encoder. Then a stationarity test module is employed to capture the temporal dependencies between heartbeats. In inter subject self-supervision, we design a set of data augmentations according to the clinical characteristics of cardiac signals and perform contrastive learning among subjects to learn distinctive representations for various types of patients. Extensive experiments on three real-world datasets were conducted. In a semi-supervised transfer learning scenario, our pre-trained ISL model yields about a 10% improvement over supervised training when only 1% of labeled data is available, suggesting strong generalizability and robustness of the model.
    Analyzing the Habitable Zones of Circumbinary Planets Using Machine Learning. (arXiv:2109.08735v1 [astro-ph.EP])
    (2 min) Exoplanet detection in the past decade by efforts including NASA's Kepler and TESS missions has discovered many worlds that differ substantially from planets in our own Solar System, including more than 150 exoplanets orbiting binary or multi-star systems. This not only broadens our understanding of the diversity of exoplanets, but also promotes our study of exoplanets in the complex binary systems and provides motivation to explore their habitability. In this study, we investigate the Habitable Zones of circumbinary planets based on planetary trajectory and dynamically informed habitable zones. Our results indicate that the mass ratio and orbital eccentricity of binary stars are important factors affecting the orbital stability and habitability of planetary systems. Moreover, planetary trajectory and dynamically informed habitable zones divide planetary habitability into three categories: habitable, part-habitable and uninhabitable. Therefore, we train a machine learning model to quickly and efficiently classify these planetary systems.
    Exploring the Robustness of Distributional Reinforcement Learning against Noisy State Observations. (arXiv:2109.08776v1 [cs.LG])
    (2 min) In real scenarios, the state observations that an agent observes may contain measurement errors or adversarial noise, misleading the agent into taking suboptimal actions or even collapsing during training. In this paper, we study the training robustness of distributional Reinforcement Learning~(RL), a class of state-of-the-art methods that estimate the whole distribution, as opposed to only the expectation, of the total return. Firstly, we propose the State-Noisy Markov Decision Process~(SN-MDP) in the tabular case to incorporate both random and adversarial state observation noise, in which the contraction of both expectation-based and distributional Bellman operators is derived. Beyond the SN-MDP with function approximation, we theoretically characterize the bounded gradient norm of the histogram-based distributional loss, accounting for the better training robustness of distributional RL. We also provide stricter convergence conditions for Temporal-Difference~(TD) learning under more flexible state noise, as well as a sensitivity analysis leveraging the influence function. Finally, extensive experiments on a suite of games show that distributional RL enjoys better training robustness compared with its expectation-based counterpart across various state observation noises.
    A Robust and Efficient Multi-Scale Seasonal-Trend Decomposition. (arXiv:2109.08800v1 [stat.AP])
    (2 min) Many real-world time series exhibit multiple seasonalities with different lengths. The removal of seasonal components is crucial in numerous applications of time series, including forecasting and anomaly detection. However, many seasonal-trend decomposition algorithms suffer from high computational cost and require a large amount of data when multiple seasonal components exist, especially when the periodic length is long. In this paper, we propose a general and efficient multi-scale seasonal-trend decomposition algorithm for time series with multiple seasonalities. We first down-sample the original time series onto a lower resolution, and then convert it to a time series with a single seasonality. Thus, existing seasonal-trend decomposition algorithms can be applied directly to obtain rough estimates of the trend and the seasonal component corresponding to the longer periodic length. By considering the relationship between different resolutions, we formulate the recovery of the different components at the high resolution as an optimization problem, which is solved efficiently by our alternating direction method of multipliers (ADMM) based algorithm. Our experimental results demonstrate accurate decomposition results with significantly improved efficiency.
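    The down-sampling idea can be illustrated with plain STL from statsmodels: decompose the long (weekly) seasonality on a down-sampled daily series, remove its up-sampled contribution, then decompose the remaining short (daily) seasonality at full resolution. The synthetic series and the simple repeat-based up-sampling are illustrative; the paper's ADMM-based joint recovery of the components is not reproduced:

    import numpy as np
    from statsmodels.tsa.seasonal import STL

    rng = np.random.default_rng(0)
    t = np.arange(24 * 7 * 8)                                  # 8 weeks of hourly data
    y = (0.01 * t + 2 * np.sin(2 * np.pi * t / 24)             # daily seasonality
         + 1.5 * np.sin(2 * np.pi * t / 168)                   # weekly seasonality
         + 0.3 * rng.normal(size=t.size))

    # Step 1: down-sample to daily resolution and decompose the long (weekly) seasonality.
    daily = y.reshape(-1, 24).mean(axis=1)
    stl_week = STL(daily, period=7).fit()

    # Step 2: remove the up-sampled coarse components, then decompose the daily seasonality.
    coarse = np.repeat(stl_week.trend + stl_week.seasonal, 24)
    stl_day = STL(y - coarse, period=24).fit()

    trend = np.repeat(stl_week.trend, 24) + stl_day.trend
    weekly_seasonal = np.repeat(stl_week.seasonal, 24)
    daily_seasonal = stl_day.seasonal
    residual = y - trend - weekly_seasonal - daily_seasonal
    print(residual.std())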
    XIRL: Cross-embodiment Inverse Reinforcement Learning. (arXiv:2106.03911v2 [cs.RO] UPDATED)
    (2 min) We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences. Prior to our work, producing rewards from self-supervised embeddings typically required alignment with a reference trajectory, which may be difficult to acquire under stark embodiment differences. We show empirically that if the embeddings are aware of task progress, simply taking the negative distance between the current state and goal state in the learned embedding space is useful as a reward for training policies with reinforcement learning. We find our learned reward function not only works for embodiments seen during training, but also generalizes to entirely new embodiments. Additionally, when transferring real-world human demonstrations to a simulated robot, we find that XIRL is more sample efficient than current best methods. Qualitative results, code, and datasets are available at https://x-irl.github.io
    Hierarchical Phenotyping and Graph Modeling of Spatial Architecture in Lymphoid Neoplasms. (arXiv:2106.16174v2 [q-bio.QM] UPDATED)
    (2 min) The cells and their spatial patterns in the tumor microenvironment (TME) play a key role in tumor evolution, and yet the latter remains an understudied topic in computational pathology. This study, to the best of our knowledge, is among the first to hybridize local and global graph methods to profile orchestration and interaction of cellular components. To address the challenge in hematolymphoid cancers, where the cell classes in TME may be unclear, we first implemented cell-level unsupervised learning and identified two new cell subtypes. Local cell graphs or supercells were built for each image by considering the individual cell's geospatial location and classes. Then, we applied supercell level clustering and identified two new cell communities. In the end, we built global graphs to abstract spatial interaction patterns and extract features for disease diagnosis. We evaluate the proposed algorithm on H&E slides of 60 hematolymphoid neoplasms and further compared it with three cell level graph-based algorithms, including the global cell graph, cluster cell graph, and FLocK. The proposed algorithm achieved a mean diagnosis accuracy of 0.703 with the repeated 5-fold cross-validation scheme. In conclusion, our algorithm shows superior performance over the existing methods and can be potentially applied to other cancer types.
    An Empirical Evaluation of the t-SNE Algorithm for Data Visualization in Structural Engineering. (arXiv:2109.08795v1 [cs.LG])
    (2 min) A fundamental task in machine learning involves visualizing high-dimensional data sets that arise in high-impact application domains. When considering the context of large imbalanced data, this problem becomes much more challenging. In this paper, the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm is used to reduce the dimensions of an earthquake engineering related data set for visualization purposes. Since imbalanced data sets greatly affect the accuracy of classifiers, we employ Synthetic Minority Oversampling Technique (SMOTE) to tackle the imbalanced nature of such data set. We present the result obtained from t-SNE and SMOTE and compare it to the basic approaches with various aspects. Considering four options and six classification algorithms, we show that using t-SNE on the imbalanced data and SMOTE on the training data set, neural network classifiers have promising results without sacrificing accuracy. Hence, we can transform the studied scientific data into a two-dimensional (2D) space, enabling the visualization of the classifier and the resulting decision surface using a 2D plot.
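    A compact version of that pipeline (t-SNE for a 2-D embedding of the imbalanced data, SMOTE applied only to the training split, then a small neural network classifier) can be expressed with scikit-learn and imbalanced-learn; the synthetic data below is a stand-in for the earthquake engineering data set:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.manifold import TSNE
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, n_features=30,
                               weights=[0.95, 0.05], random_state=0)   # imbalanced stand-in data

    X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)       # 2-D embedding for visualization

    # SMOTE only on the training split, so synthetic samples never leak into the test set.
    X_train, X_test, y_train, y_test = train_test_split(X_2d, y, stratify=y, random_state=0)
    X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=0).fit(X_train_bal, y_train_bal)
    print("test accuracy:", clf.score(X_test, y_test))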
    Relating Neural Text Degeneration to Exposure Bias. (arXiv:2109.08705v1 [cs.CL])
    (2 min) This work focuses on relating two mysteries in neural-based text generation: exposure bias, and text degeneration. Despite the long time since exposure bias was first raised and the numerous studies on its remedy, to our knowledge, its impact on text generation has not yet been verified. Text degeneration is a problem that the widely-used pre-trained language model GPT-2 was recently found to suffer from (Holtzman et al., 2020). Motivated by the unknown causation of text degeneration, in this paper we attempt to relate these two mysteries. Specifically, we first qualitatively and quantitatively identify mistakes made before text degeneration occurs. Then we investigate the significance of the mistakes by inspecting the hidden states in GPT-2. Our results show that text degeneration is likely to be partly caused by exposure bias. We also study the self-reinforcing mechanism of text degeneration, explaining why the mistakes amplify. In sum, our study provides a more concrete foundation for further investigation of exposure bias and text degeneration problems.
    Training Provably Robust Models by Polyhedral Envelope Regularization. (arXiv:1912.04792v3 [cs.LG] UPDATED)
    (2 min) Training certifiable neural networks enables one to obtain models with robustness guarantees against adversarial attacks. In this work, we introduce a framework to bound the adversary-free region in the neighborhood of the input data by a polyhedral envelope, which yields finer-grained certified robustness. We further introduce polyhedral envelope regularization (PER) to encourage larger polyhedral envelopes and thus improve the provable robustness of the models. We demonstrate the flexibility and effectiveness of our framework on standard benchmarks; it applies to networks of different architectures and general activation functions. Compared with the state-of-the-art methods, PER has very little computational overhead and better robustness guarantees without over-regularizing the model.
    Segmentation of Brain MRI using an Altruistic Harris Hawks' Optimization algorithm. (arXiv:2109.08688v1 [eess.IV])
    (2 min) Segmentation is an essential requirement in medicine when digital images are used in illness diagnosis, especially in posterior tasks such as analysis and disease identification. An efficient segmentation of brain Magnetic Resonance Images (MRIs) is of prime concern to radiologists due to their poor illumination and other conditions related to the acquisition of the images. Thresholding is a popular method for segmentation that uses the histogram of an image to label different homogeneous groups of pixels into different classes. However, the computational cost increases exponentially with the number of thresholds. In this paper, we perform multi-level thresholding using an evolutionary metaheuristic. It is an improved version of the Harris Hawks Optimization (HHO) algorithm that combines chaotic initialization and the concept of altruism. Further, for fitness assignment, we use a hybrid objective function where, along with the cross-entropy minimization, we apply a new entropy function, and assign weights to the two objective functions to form a new hybrid approach. The HHO was originally designed to solve numerical optimization problems. Earlier statistical results and comparisons have demonstrated that the HHO provides very promising results compared with well-established metaheuristic techniques. In this article, altruism has been incorporated into the HHO algorithm to enhance its exploitation capabilities. We evaluate the proposed method on 10 benchmark images from the WBA database of the Harvard Medical School and 8 benchmark images from the Brainweb dataset using standard evaluation metrics.
    Capacitance Resistance Model and Recurrent Neural Network for Well Connectivity Estimation : A Comparison Study. (arXiv:2109.08779v1 [cs.LG])
    (2 min) In this report, two commonly used data-driven models for predicting well production under a waterflood setting: the capacitance resistance model (CRM) and recurrent neural networks (RNN) are compared. Both models are completely data-driven and are intended to learn the reservoir behavior during a water flood from historical data. This report serves as a technical guide to the python-based implementation of the CRM model available from the associated GitHub repository.
    Toward Efficient Federated Learning in Multi-Channeled Mobile Edge Network with Layerd Gradient Compression. (arXiv:2109.08819v1 [cs.LG])
    (2 min) A fundamental issue for federated learning (FL) is how to achieve optimal model performance under highly dynamic communication environments. This issue can be alleviated by the fact that modern edge devices can usually connect to the edge FL server via multiple communication channels (e.g., 4G, LTE and 5G). However, having an edge device send copies of local models to the FL server along multiple channels is redundant, time-consuming, and wastes resources (e.g., bandwidth, battery life and monetary cost). In this paper, motivated by the layered coding techniques in video streaming, we propose a novel FL framework called layered gradient compression (LGC). Specifically, in LGC, local gradients from a device are coded into several layers and each layer is sent to the FL server along a different channel. The FL server aggregates the received layers of local gradients from devices to update the global model, and sends the result back to the devices. We prove the convergence of LGC, and formally define the problem of resource-efficient federated learning with LGC. We then propose a learning-based algorithm for each device to dynamically adjust its local computation (i.e., the number of local stochastic gradient descent steps) and communication decisions (i.e., the compression level of different layers and the layer-to-channel mapping) in each iteration. Results from extensive experiments show that using our algorithm, LGC significantly reduces the training time and improves resource utilization, while achieving a similar accuracy, compared with well-known FL mechanisms.
    Modeling Dynamic User Interests: A Neural Matrix Factorization Approach. (arXiv:2102.06602v2 [cs.LG] UPDATED)
    (2 min) In recent years, there has been significant interest in understanding users' online content consumption patterns. But, the unstructured, high-dimensional, and dynamic nature of such data makes extracting valuable insights challenging. Here we propose a model that combines the simplicity of matrix factorization with the flexibility of neural networks to efficiently extract nonlinear patterns from massive text data collections relevant to consumers' online consumption patterns. Our model decomposes a user's content consumption journey into nonlinear user and content factors that are used to model their dynamic interests. This natural decomposition allows us to summarize each user's content consumption journey with a dynamic probabilistic weighting over a set of underlying content attributes. The model is fast to estimate, easy to interpret and can harness external data sources as an empirical prior. These advantages make our method well suited to the challenges posed by modern datasets. We use our model to understand the dynamic news consumption interests of Boston Globe readers over five years. Thorough qualitative studies, including a crowdsourced evaluation, highlight our model's ability to accurately identify nuanced and coherent consumption patterns. These results are supported by our model's superior and robust predictive performance over several competitive baseline methods.
    A Deep-Learning Based Optimization Approach to Address Stop-Skipping Strategy in Urban Rail Transit Lines. (arXiv:2109.08786v1 [cs.LG])
    (2 min) Different passenger demand rates in transit stations underscore the importance of adopting operational strategies to provide a demand-responsive service. Aiming at improving passengers' travel time, the present study introduces an advanced data-driven optimization approach to determine the optimal stop-skip pattern in urban rail transit lines. In detail, first, using the time-series smart card data for an entire month, we employ a Long Short-Term Memory (LSTM) deep learning model to predict the station-level demand rates for the peak hour. This prediction is based on the four preceding hours and is especially important given that the true demand rates of the peak hour are posterior information that can be obtained only after the peak-hour operation is finished. Moreover, utilizing a real-time prediction instead of assuming fixed demand rates allows us to account for unexpected real-time changes which can be detrimental to the subsequent analyses. Then, we integrate the output of the LSTM model as an input to an optimization model with the objective of minimizing patrons' total travel time. Considering the exponential nature of the problem, we propose an Ant Colony Optimization technique to solve the problem in a reasonable amount of time. Finally, the performance of the proposed models and the solution algorithm is assessed using real case data. The results suggest that the proposed approach can enhance the performance of the service by improving both passengers' in-vehicle time and their waiting time.
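    A minimal sketch of the forecasting step, under assumed shapes (per-station demand for the four preceding hours in, peak-hour demand out); the layer sizes and station count are illustrative, not the paper's configuration.

    ```python
    import torch
    import torch.nn as nn

    class PeakDemandLSTM(nn.Module):
        def __init__(self, n_stations, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(input_size=n_stations, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_stations)

        def forward(self, x):              # x: (batch, 4 preceding hours, n_stations)
            _, (h, _) = self.lstm(x)
            return self.head(h[-1])        # predicted peak-hour demand per station

    model = PeakDemandLSTM(n_stations=20)
    preceding_hours = torch.rand(8, 4, 20)          # toy batch of demand histories
    peak_hour_pred = model(preceding_hours)         # (8, 20), then fed to the optimizer
    ```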
    Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Kernel Renormalization. (arXiv:2012.04030v2 [cs.LG] UPDATED)
    (3 min) The success of deep learning in many real-world tasks has triggered an intense effort to understand the power and limitations of deep learning in the training and generalization of complex tasks, so far with limited progress. In this work, we study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs) in which the input-output function of an individual unit is linear. Despite the linearity of the units, learning in DLNNs is nonlinear, hence studying its properties reveals some of the features of nonlinear Deep Neural Networks (DNNs). Importantly, we solve exactly the network properties following supervised learning using an equilibrium Gibbs distribution in the weight space. To do this, we introduce the Back-Propagating Kernel Renormalization (BPKR), which allows for the incremental integration of the network weights starting from the network output layer and progressing backward until the first layer's weights are integrated out. This procedure allows us to evaluate important network properties, such as its generalization error, the role of network width and depth, the impact of the size of the training set, and the effects of weight regularization and learning stochasticity. BPKR does not assume specific statistics of the input or the task's output. Furthermore, by performing partial integration of the layers, the BPKR allows us to compute the properties of the neural representations across the different hidden layers. We have proposed an extension of the BPKR to nonlinear DNNs with ReLU. Surprisingly, our numerical simulations reveal that despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks in a wide regime of parameters. Our work is the first exact statistical mechanical study of learning in a family of DNNs, and the first successful theory of learning through successive integration of DoFs in the learned weight space.
    Differentially Private Quantiles. (arXiv:2102.08244v3 [cs.LG] UPDATED)
    (2 min) Quantiles are often used for summarizing and understanding data. If that data is sensitive, it may be necessary to compute quantiles in a way that is differentially private, providing theoretical guarantees that the result does not reveal private information. However, when multiple quantiles are needed, existing differentially private algorithms fare poorly: they either compute quantiles individually, splitting the privacy budget, or summarize the entire distribution, wasting effort. In either case the result is reduced accuracy. In this work we propose an instance of the exponential mechanism that simultaneously estimates exactly $m$ quantiles from $n$ data points while guaranteeing differential privacy. The utility function is carefully structured to allow for an efficient implementation that returns estimates of all $m$ quantiles in time $O(mn\log(n) + m^2n)$. Experiments show that our method significantly outperforms the current state of the art on both real and synthetic data while remaining efficient enough to be practical.
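    For intuition, a minimal sketch of the exponential mechanism for a single quantile; the paper's contribution is a joint mechanism that releases all $m$ quantiles at once with a carefully structured utility function, which this sketch does not implement. The data bounds and parameter values are illustrative.

    ```python
    import numpy as np

    def dp_quantile(data, q, epsilon, lower, upper, rng=None):
        """Release an epsilon-DP estimate of the q-th quantile of data in [lower, upper]."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.clip(np.sort(np.asarray(data, dtype=float)), lower, upper)
        x = np.concatenate(([lower], x, [upper]))          # n + 2 boundary points
        n = len(data)
        target = q * n                                     # ideal rank of the quantile
        ranks = np.arange(n + 1)                           # rank of each interval's left end
        utility = -np.abs(ranks - target)                  # rank error (sensitivity 1)
        widths = np.diff(x)
        # Exponential mechanism: sample an interval, weight by width and utility.
        logp = (epsilon / 2.0) * utility + np.log(np.maximum(widths, 1e-12))
        logp -= logp.max()
        probs = np.exp(logp)
        probs /= probs.sum()
        i = rng.choice(n + 1, p=probs)
        return rng.uniform(x[i], x[i + 1])                 # uniform within the chosen interval

    # Example: a private median of 1,000 points.
    print(dp_quantile(np.random.normal(size=1000), q=0.5, epsilon=1.0, lower=-10, upper=10))
    ```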
    Interest-oriented Universal User Representation via Contrastive Learning. (arXiv:2109.08865v1 [cs.LG])
    (2 min) User representation is essential for providing high-quality commercial services in industry. Universal user representation has received many interests recently, with which we can be free from the cumbersome work of training a specific model for each downstream application. In this paper, we attempt to improve universal user representation from two points of views. First, a contrastive self-supervised learning paradigm is presented to guide the representation model training. It provides a unified framework that allows for long-term or short-term interest representation learning in a data-driven manner. Moreover, a novel multi-interest extraction module is presented. The module introduces an interest dictionary to capture principal interests of the given user, and then generate his/her interest-oriented representations via behavior aggregation. Experimental results demonstrate the effectiveness and applicability of the learned user representations.
    Dynamic and Systematic Survey of Deep Learning Approaches for Driving Behavior Analysis. (arXiv:2109.08996v1 [cs.LG])
    (2 min) Improper driving results in fatalities, damage, increased energy consumption, and vehicle depreciation. Analyzing driving behaviour could help optimize driving and avoid these issues. By identifying types of driving and mapping them to their consequences, we can build a model to prevent them. To this end, we create a dynamic survey paper that reviews and presents driving-behaviour research for future researchers. By analyzing 58 articles, we attempt to classify standard methods and provide a framework through which future articles can be examined, studied in different dashboards, and kept up to date with trends.

2021-09-20

  • cs.CL updates on arXiv.org

    Conversational Multi-Hop Reasoning with Neural Commonsense Knowledge and Symbolic Logic Rules. (arXiv:2109.08544v1 [cs.AI])
    (2 min) One of the challenges faced by conversational agents is their inability to identify unstated presumptions of their users' commands, a task trivial for humans due to their common sense. In this paper, we propose a zero-shot commonsense reasoning system for conversational agents in an attempt to achieve this. Our reasoner uncovers unstated presumptions from user commands satisfying a general template of if-(state), then-(action), because-(goal). Our reasoner uses a state-of-the-art transformer-based generative commonsense knowledge base (KB) as its source of background knowledge for reasoning. We propose a novel and iterative knowledge query mechanism to extract multi-hop reasoning chains from the neural KB which uses symbolic logic rules to significantly reduce the search space. Like any KB gathered to date, our commonsense KB is prone to missing knowledge. Therefore, we propose to conversationally elicit the missing knowledge from human users with our novel dynamic question generation strategy, which generates and presents contextualized queries to human users. We evaluate the model in a user study with human users, achieving a 35% higher success rate compared to SOTA.
    Adversarial Scrubbing of Demographic Information for Text Classification. (arXiv:2109.08613v1 [cs.CL])
    (2 min) Contextual representations learned by language models can often encode undesirable attributes, like demographic associations of the users, while being trained for an unrelated target task. We aim to scrub such undesirable attributes and learn fair representations while maintaining performance on the target task. In this paper, we present an adversarial learning framework "Adversarial Scrubber" (ADS), to debias contextual representations. We perform theoretical analysis to show that our framework converges without leaking demographic information under certain conditions. We extend previous evaluation techniques by evaluating debiasing performance using Minimum Description Length (MDL) probing. Experimental evaluations on 8 datasets show that ADS generates representations with minimal information about demographic attributes while being maximally informative about the target task.
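    A generic adversarial-debiasing sketch using a gradient-reversal layer; the ADS objectives and the conditions of its theoretical analysis are more specific than this illustration, and all layer sizes and names here are assumptions.

    ```python
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad):
            return -ctx.lam * grad, None          # flip the adversary's gradient

    class Scrubber(nn.Module):
        def __init__(self, d_in=768, d_hid=256, n_task=2, n_demo=2):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
            self.task_head = nn.Linear(d_hid, n_task)
            self.demo_head = nn.Linear(d_hid, n_demo)     # the adversary

        def forward(self, x, lam=1.0):
            z = self.encoder(x)
            return self.task_head(z), self.demo_head(GradReverse.apply(z, lam))

    model = Scrubber()
    x = torch.randn(16, 768)                              # contextual representations
    task_logits, demo_logits = model(x)
    loss = nn.functional.cross_entropy(task_logits, torch.randint(0, 2, (16,))) \
         + nn.functional.cross_entropy(demo_logits, torch.randint(0, 2, (16,)))
    loss.backward()   # the reversed gradient pushes z to be uninformative about demographics
    ```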
    A Role-Selected Sharing Network for Joint Machine-Human Chatting Handoff and Service Satisfaction Analysis. (arXiv:2109.08412v1 [cs.CL])
    (0 min) Chatbots are increasingly thriving in different domains; however, because of unexpected discourse complexity and training-data sparseness, potential distrust of chatbots remains a vital concern. Recently, Machine-Human Chatting Handoff (MHCH), predicting chatbot failure and enabling human-algorithm collaboration to enhance chatbot quality, has attracted increasing attention from industry and academia. In this study, we propose a novel model, Role-Selected Sharing Network (RSSN), which integrates both dialogue satisfaction estimation and handoff prediction in one multi-task learning framework. Unlike prior efforts in dialog mining, by utilizing local user satisfaction as a bridge, the global satisfaction detector and handoff predictor can effectively exchange critical information. Specifically, we decouple the relation and interaction between the two tasks by the role information after the shared encoder. Extensive experiments on two public datasets demonstrate the effectiveness of our model.
    Task-adaptive Pre-training of Language Models with Word Embedding Regularization. (arXiv:2109.08354v1 [cs.CL])
    (2 min) Pre-trained language models (PTLMs) acquire domain-independent linguistic knowledge through pre-training with massive textual resources. Additional pre-training is effective in adapting PTLMs to domains that are not well covered by the pre-training corpora. Here, we focus on the static word embeddings of PTLMs for domain adaptation to teach PTLMs domain-specific meanings of words. We propose a novel fine-tuning process: task-adaptive pre-training with word embedding regularization (TAPTER). TAPTER runs additional pre-training by making the static word embeddings of a PTLM close to the word embeddings obtained in the target domain with fastText. TAPTER requires no additional corpus except for the training data of the downstream task. We confirmed that TAPTER improves the performance of the standard fine-tuning and the task-adaptive pre-training on BioASQ (question answering in the biomedical domain) and on SQuAD (the Wikipedia domain) when their pre-training corpora were not dominated by in-domain data.
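    A minimal sketch of the regularization idea, assuming the domain fastText vectors have already been projected to the PTLM's embedding dimension and aligned to its vocabulary; the loss name, coefficient, and placeholder tensors are illustrative.

    ```python
    import torch

    def tapter_loss(mlm_loss, ptlm_embedding, fasttext_target, lam=0.1):
        """Additional-pretraining loss with word-embedding regularization.

        ptlm_embedding:  nn.Embedding weight of the PTLM, shape (V, d), trainable.
        fasttext_target: tensor of shape (V, d), fixed domain fastText vectors.
        """
        reg = torch.mean((ptlm_embedding - fasttext_target) ** 2)
        return mlm_loss + lam * reg

    # Toy usage with random tensors standing in for a real model's embeddings.
    V, d = 30522, 768
    emb = torch.nn.Embedding(V, d)
    ft = torch.randn(V, d)                       # pretend these are domain fastText vectors
    dummy_mlm_loss = torch.tensor(2.3, requires_grad=True)
    loss = tapter_loss(dummy_mlm_loss, emb.weight, ft)
    loss.backward()                              # gradients flow into the embedding table
    ```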
    Exploring Multitask Learning for Low-Resource Abstractive Summarization. (arXiv:2109.08565v1 [cs.CL])
    (2 min) This paper explores the effect of using multitask learning for abstractive summarization in the context of small training corpora. In particular, we incorporate four different tasks (extractive summarization, language modeling, concept detection, and paraphrase detection) both individually and in combination, with the goal of enhancing the target task of abstractive summarization via multitask learning. We show that for many task combinations, a model trained in a multitask setting outperforms a model trained only for abstractive summarization, with no additional summarization data introduced. Additionally, we do a comprehensive search and find that certain tasks (e.g. paraphrase detection) consistently benefit abstractive summarization, not only when combined with other tasks but also when using different architectures and training corpora.
    Simple Entity-Centric Questions Challenge Dense Retrievers. (arXiv:2109.08535v1 [cs.CL])
    (2 min) Open-domain question answering has exploded in popularity recently due to the success of dense retrieval models, which have surpassed sparse models using only a few supervised training examples. However, in this paper, we demonstrate current dense models are not yet the holy grail of retrieval. We first construct EntityQuestions, a set of simple, entity-rich questions based on facts from Wikidata (e.g., "Where was Arve Furset born?"), and observe that dense retrievers drastically underperform sparse methods. We investigate this issue and uncover that dense retrievers can only generalize to common entities unless the question pattern is explicitly observed during training. We discuss two simple solutions towards addressing this critical problem. First, we demonstrate that data augmentation is unable to fix the generalization problem. Second, we argue a more robust passage encoder helps facilitate better question adaptation using specialized question encoders. We hope our work can shed light on the challenges in creating a robust, universal dense retriever that works well across different input distributions.
    ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora. (arXiv:2012.15674v4 [cs.CL] UPDATED)
    (2 min) Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.
    RnG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering. (arXiv:2109.08678v1 [cs.CL])
    (2 min) Existing KBQA approaches, despite achieving strong performance on i.i.d. test data, often struggle in generalizing to questions involving unseen KB schema items. Prior ranking-based approaches have shown some success in generalization, but suffer from the coverage issue. We present RnG-KBQA, a Rank-and-Generate approach for KBQA, which remedies the coverage issue with a generation model while preserving a strong generalization capability. Our approach first uses a contrastive ranker to rank a set of candidate logical forms obtained by searching over the knowledge graph. It then introduces a tailored generation model conditioned on the question and the top-ranked candidates to compose the final logical form. We achieve new state-of-the-art results on GrailQA and WebQSP datasets. In particular, our method surpasses the prior state-of-the-art by a large margin on the GrailQA leaderboard. In addition, RnG-KBQA outperforms all prior approaches on the popular WebQSP benchmark, even including the ones that use the oracle entity linking. The experimental results demonstrate the effectiveness of the interplay between ranking and generation, which leads to the superior performance of our proposed approach across all settings with especially strong improvements in zero-shot generalization.
    Is Everything in Order? A Simple Way to Order Sentences. (arXiv:2104.07064v2 [cs.CL] UPDATED)
    (2 min) The task of organizing a shuffled set of sentences into a coherent text has been used to evaluate a machine's understanding of causal and temporal relations. We formulate the sentence ordering task as a conditional text-to-marker generation problem. We present Reorder-BART (Re-BART) that leverages a pre-trained Transformer-based model to identify a coherent order for a given set of shuffled sentences. The model takes a set of shuffled sentences with sentence-specific markers as input and generates a sequence of position markers of the sentences in the ordered text. Re-BART achieves the state-of-the-art performance across 7 datasets in Perfect Match Ratio (PMR) and Kendall's tau ($\tau$). We perform evaluations in a zero-shot setting, showcasing that our model is able to generalize well across other datasets. We additionally perform several experiments to understand the functioning and limitations of our framework.
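    A minimal sketch of the text-to-marker formulation with hypothetical marker tokens; the exact markers and decoding setup used by Re-BART may differ.

    ```python
    def to_marker_input(shuffled_sentences):
        """Prefix each shuffled sentence with a sentence-specific marker."""
        return " ".join(f"<S{i}> {s}" for i, s in enumerate(shuffled_sentences))

    def from_marker_output(marker_sequence, shuffled_sentences):
        """Recover the ordered text from a generated sequence of position markers."""
        order = [int(tok[2:-1]) for tok in marker_sequence.split()]
        return [shuffled_sentences[i] for i in order]

    shuffled = ["He sat down.", "Tom entered the room.", "Then he opened his book."]
    print(to_marker_input(shuffled))
    # A seq2seq model trained on such pairs would be expected to generate e.g. "<S1> <S0> <S2>":
    print(from_marker_output("<S1> <S0> <S2>", shuffled))
    ```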
    Primer: Searching for Efficient Transformers for Language Modeling. (arXiv:2109.08668v1 [cs.LG])
    (2 min) Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
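    A minimal sketch of the two modifications named in the abstract (squared ReLU activations and a depthwise convolution applied after each Q, K and V projection); the kernel size and causal left-padding are assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def squared_relu(x):
        # Replaces ReLU in the feed-forward block.
        return F.relu(x) ** 2

    class DepthwiseCausalConv(nn.Module):
        """Per-channel 1D convolution applied after a Q, K or V projection."""
        def __init__(self, d_model, kernel_size=3):
            super().__init__()
            self.kernel_size = kernel_size
            self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

        def forward(self, x):                       # x: (batch, seq_len, d_model)
            x = x.transpose(1, 2)                   # (batch, d_model, seq_len)
            x = F.pad(x, (self.kernel_size - 1, 0)) # left-pad so the conv stays causal
            return self.conv(x).transpose(1, 2)

    # Toy usage on a projected query tensor and a feed-forward activation.
    q = torch.randn(2, 16, 512)
    q = DepthwiseCausalConv(512)(q)
    h = squared_relu(torch.randn(2, 16, 2048))
    ```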
    WebQA: Multihop and Multimodal QA. (arXiv:2109.00590v2 [cs.CL] UPDATED)
    (2 min) Web search is fundamentally multimodal and multihop. Often, even before asking a question we choose to go directly to image search to find our answers. Further, rarely do we find an answer from a single source but aggregate information and reason through implications. Despite the frequency of this everyday occurrence, at present, there is no unified question answering benchmark that requires a single model to answer long-form natural language questions from text and open-ended visual sources -- akin to a human's experience. We propose to bridge this gap between the natural language and computer vision communities with WebQA. We show that A. our multihop text queries are difficult for a large-scale transformer model, and B. existing multi-modal transformers and visual representations do not perform well on open-domain visual queries. Our challenge for the community is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
    Distilling Linguistic Context for Language Model Compression. (arXiv:2109.08359v1 [cs.CL])
    (2 min) A computationally expensive and memory intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such a vast language model in resource-scarce environments, transfers the knowledge on individual word representations learned without restrictions. In this paper, inspired by the recent observations that language representations are relatively positioned and have more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers the contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for the language models, our contextual distillation does not have any restrictions on architectural changes between teacher and student. We validate the effectiveness of our method on challenging benchmarks of language understanding tasks, not only in architectures of various sizes, but also in combination with DynaBERT, the recently proposed adaptive size pruning method.
    A Multimodal Sentiment Dataset for Video Recommendation. (arXiv:2109.08333v1 [cs.CL])
    (2 min) Recently, multimodal sentiment analysis has seen remarkable advances and many datasets have been proposed for its development. In general, current multimodal sentiment analysis datasets usually follow the traditional system of sentiment/emotion, such as positive, negative and so on. However, when applied in the scenario of video recommendation, the traditional sentiment/emotion system is hard to leverage to represent different contents of videos from the perspective of visual senses and language understanding. Based on this, we propose a multimodal sentiment analysis dataset, named baiDu Video Sentiment dataset (DuVideoSenti), and introduce a new sentiment system which is designed to describe the sentimental style of a video in the recommendation scenario. Specifically, DuVideoSenti consists of 5,630 videos displayed on Baidu, each manually annotated with a sentimental style label that describes the user's real feeling about the video. Furthermore, we propose UNIMO as our baseline for DuVideoSenti. Experimental results show that DuVideoSenti brings new challenges to multimodal sentiment analysis, and could be used as a new benchmark for evaluating approaches designed for video understanding and multimodal fusion. We also expect our proposed DuVideoSenti could further improve the development of multimodal sentiment analysis and its application to video recommendations.
    Capturing Global Informativeness in Open Domain Keyphrase Extraction. (arXiv:2004.13639v2 [cs.CL] UPDATED)
    (2 min) Open-domain KeyPhrase Extraction (KPE) aims to extract keyphrases from documents without domain or quality restrictions, e.g., web pages with variant domains and qualities. Recently, neural methods have shown promising results in many KPE tasks due to their powerful capacity for modeling contextual semantics of the given documents. However, we empirically show that most neural KPE methods prefer to extract keyphrases with good phraseness, such as short and entity-style n-grams, instead of globally informative keyphrases from open-domain documents. This paper presents JointKPE, an open-domain KPE architecture built on pre-trained language models, which can capture both local phraseness and global informativeness when extracting keyphrases. JointKPE learns to rank keyphrases by estimating their informativeness in the entire document and is jointly trained on the keyphrase chunking task to guarantee the phraseness of keyphrase candidates. Experiments on two large KPE datasets with diverse domains, OpenKP and KP20k, demonstrate the effectiveness of JointKPE on different pre-trained variants in open-domain scenarios. Further analyses reveal the significant advantages of JointKPE in predicting long and non-entity keyphrases, which are challenging for previous neural KPE methods. Our code is publicly available at https://github.com/thunlp/BERT-KPE.
    To what extent do human explanations of model behavior align with actual model behavior?. (arXiv:2012.13354v2 [cs.CL] UPDATED)
    (2 min) Given the increasingly prominent role NLP models (will) play in our lives, it is important for human expectations of model behavior to align with actual model behavior. Using Natural Language Inference (NLI) as a case study, we investigate the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. More specifically, we define three alignment metrics that quantify how well natural language explanations align with model sensitivity to input words, as measured by integrated gradients. Then, we evaluate eight different models (the base and large versions of BERT, RoBERTa and ELECTRA, as well as an RNN and bag-of-words model), and find that the BERT-base model has the highest alignment with human-generated explanations, for all alignment metrics. Focusing in on transformers, we find that the base versions tend to have higher alignment with human-generated explanations than their larger counterparts, suggesting that increasing the number of model parameters leads, in some cases, to worse alignment with human explanations. Finally, we find that a model's alignment with human explanations is not predicted by the model's accuracy, suggesting that accuracy and alignment are complementary ways to evaluate models.
    Neural Unification for Logic Reasoning over Natural Language. (arXiv:2109.08460v1 [cs.CL])
    (2 min) Automated Theorem Proving (ATP) deals with the development of computer programs able to show that some conjectures (queries) are a logical consequence of a set of axioms (facts and rules). There exist several successful ATPs where conjectures and axioms are formally provided (e.g. formalised as First Order Logic formulas). Recent approaches, such as (Clark et al., 2020), have proposed transformer-based architectures for deriving conjectures given axioms expressed in natural language (English). The conjecture is verified through a binary text classifier, where the transformer model is trained to predict the truth value of a conjecture given the axioms. The RuleTaker approach of (Clark et al., 2020) achieves appealing results both in terms of accuracy and in the ability to generalize, showing that when the model is trained with deep enough queries (at least 3 inference steps), the transformers are able to correctly answer the majority of queries (97.6%) that require up to 5 inference steps. In this work we propose a new architecture, namely the Neural Unifier, and an associated training procedure, which achieves state-of-the-art results in terms of generalisation, showing that by mimicking a well-known inference procedure, backward chaining, it is possible to answer deep queries even when the model is trained only on shallow ones. The approach is demonstrated in experiments using a diverse set of benchmark data.
    Ethics Sheets for AI Tasks. (arXiv:2107.01183v3 [cs.AI] UPDATED)
    (2 min) Recent innovations such as Datasheets for Datasets and Model Cards for Model Reporting have made useful contributions to furthering ethical research. Yet, several high-profile events, such as the mass testing of emotion recognition systems on vulnerable sub-populations, have highlighted how technology will often lead to more adverse outcomes for those that are already marginalized. In this paper, I will make a case for thinking about ethical considerations not just at the level of individual models and datasets, but also at the level of AI tasks. I will present a new form of such an effort, Ethics Sheets for AI Tasks, dedicated to fleshing out the assumptions and ethical considerations hidden in how a task is commonly framed and in the choices we make regarding the data, method, and evaluation. Finally, I will provide an example ethics sheet for automatic emotion recognition. Ethics sheets are a mechanism to document ethical considerations before building datasets and systems. Such pre-production activities (e.g., ethics analyses) and associated artifacts (e.g., accessible documentation) are crucial for responsible AI: for communicating risks to all stakeholders, to help decision and policy making, and for developing more effective post-production documents such as Data Sheets and Model Cards.
    Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. (arXiv:2007.15779v6 [cs.CL] UPDATED)
    (2 min) Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
    Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions. (arXiv:2106.04484v2 [cs.CV] UPDATED)
    (2 min) Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a more careful look reveals that they often do not understand the rich signal they are being fed with. To understand and better measure the generalization capabilities of VQA systems, we look at their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases which reveal that current VQA systems are still brittle. Finally, we connect between robustness and generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.
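    One plausible formalization of such a consistency score, in the spirit of RAD but not necessarily the paper's exact definition: among original questions the model answers correctly, the fraction whose augmented counterparts are also answered correctly.

    ```python
    def rad_score(pairs):
        """pairs: iterable of (orig_correct: bool, aug_correct: bool) per example."""
        correct_orig = [(o, a) for o, a in pairs if o]
        if not correct_orig:
            return 0.0
        return sum(a for _, a in correct_orig) / len(correct_orig)

    # Toy example: three originals answered correctly, two of their augmentations survive.
    print(rad_score([(True, True), (True, False), (False, True), (True, True)]))  # 0.666...
    ```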
    New Students on Sesame Street: What Order-Aware Matrix Embeddings Can Learn from BERT. (arXiv:2109.08449v1 [cs.CL])
    (2 min) Large-scale pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive in low-resource or large-scale applications. While common approaches reduce the size of PreLMs via same-architecture distillation or pruning, we explore distilling PreLMs into more efficient order-aware embedding models. Our results on the GLUE benchmark show that embedding-centric students, which have learned from BERT, yield scores comparable to DistilBERT on QQP and RTE, often match or exceed the scores of ELMo, and only fall behind on detecting linguistic acceptability.
    To be Closer: Learning to Link up Aspects with Opinions. (arXiv:2109.08382v1 [cs.CL])
    (2 min) Dependency parse trees are helpful for discovering the opinion words in aspect-based sentiment analysis (ABSA). However, the trees obtained from off-the-shelf dependency parsers are static, and could be sub-optimal in ABSA. This is because the syntactic trees are not designed for capturing the interactions between opinion words and aspect words. In this work, we aim to shorten the distance between aspects and corresponding opinion words by learning an aspect-centric tree structure. The aspect and opinion words are expected to be closer along such a tree structure compared to the standard dependency parse tree. The learning process allows the tree structure to adaptively correlate the aspect and opinion words, enabling us to better identify the polarity in the ABSA task. We conduct experiments on five aspect-based sentiment datasets, and the proposed model significantly outperforms recent strong baselines. Furthermore, our thorough analysis demonstrates that the average distance between aspect and opinion words is shortened by at least 19% on the standard SemEval Restaurant14 dataset.
    Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers. (arXiv:2109.08406v1 [cs.CL])
    (2 min) Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity of representations within fine-tuned RoBERTa and ALBERT models, with strong similarity within clusters of earlier and later layers, but not between them. The similarity of later layer representations implies that later layers only marginally contribute to task performance, and we verify in experiments that the top few layers of fine-tuned Transformers can be discarded without hurting performance, even with no further tuning.
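    A minimal sketch of linear CKA between the representations of two layers on the same n examples, following the standard linear-CKA formula; the paper may use a different CKA variant or estimator.

    ```python
    import numpy as np

    def linear_cka(X, Y):
        """Linear CKA between representations X (n x d1) and Y (n x d2)."""
        X = X - X.mean(axis=0, keepdims=True)
        Y = Y - Y.mean(axis=0, keepdims=True)
        hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
        normx = np.linalg.norm(X.T @ X, "fro")
        normy = np.linalg.norm(Y.T @ Y, "fro")
        return hsic / (normx * normy)

    # Sanity check: a layer compared with a rotated copy of itself gives CKA close to 1.
    X = np.random.randn(100, 64)
    R, _ = np.linalg.qr(np.random.randn(64, 64))
    print(linear_cka(X, X @ R))
    ```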
    GoG: Relation-aware Graph-over-Graph Network for Visual Dialog. (arXiv:2109.08475v1 [cs.CL])
    (2 min) Visual dialog, which aims to hold a meaningful conversation with humans about a given image, is a challenging task that requires models to reason about the complex dependencies among visual content, dialog history, and current questions. Graph neural networks have recently been applied to model the implicit relations between objects in an image or dialog. However, they neglect the importance of 1) coreference relations among dialog history and dependency relations between words for the question representation; and 2) the representation of the image based on the fully represented question. Therefore, we propose a novel relation-aware graph-over-graph network (GoG) for visual dialog. Specifically, GoG consists of three sequential graphs: 1) H-Graph, which aims to capture coreference relations among dialog history; 2) History-aware Q-Graph, which aims to fully understand the question through capturing dependency relations between words based on coreference resolution on the dialog history; and 3) Question-aware I-Graph, which aims to capture the relations between objects in an image based on the fully represented question. As an additional feature representation module, we add GoG to the existing visual dialogue model. Experimental results show that our model outperforms the strong baseline in both generative and discriminative settings by a significant margin.
    CKMorph: A Comprehensive Morphological Analyzer for Central Kurdish. (arXiv:2109.08615v1 [cs.CL])
    (2 min) A morphological analyzer, which is a significant component of many natural language processing applications especially for morphologically rich languages, divides an input word into all its composing morphemes and identifies their morphological roles. In this paper, we introduce a comprehensive morphological analyzer for Central Kurdish (CK), a low-resourced language with a rich morphology. Building upon the limited existing literature, we first assembled and systematically categorized a comprehensive collection of the morphological and morphophonological rules of the language. Additionally, we collected and manually labeled a generative lexicon containing nearly 10,000 verb, noun and adjective stems, named entities, and other types of word stems. We used these rule sets and resources to implement CKMorph Analyzer based on finite-state transducers. In order to provide a benchmark for future research, we collected, manually labeled, and publicly shared test sets for evaluating accuracy and coverage of the analyzer. CKMorph was able to correctly analyze 95.9% of the accuracy test set, containing 1,000 CK words morphologically analyzed according to the context. Moreover, CKMorph gave at least one analysis for 95.5% of 4.22M CK tokens of the coverage test set. The demonstration of the application and resources including CK verb database and test sets are openly accessible at https://github.com/CKMorph.
    Hierarchical Learning for Generation with Long Source Sequences. (arXiv:2104.07545v2 [cs.CL] UPDATED)
    (2 min) One of the challenges for current sequence to sequence (seq2seq) models is processing long sequences, such as those in summarization and document level machine translation tasks. These tasks require the model to reason at the token level as well as the sentence and paragraph level. We design and study a new Hierarchical Attention Transformer-based architecture (HAT) that outperforms standard Transformers on several sequence to sequence tasks. Furthermore, our model achieves state-of-the-art ROUGE scores on four summarization tasks, including PubMed, arXiv, CNN/DM, SAMSum, and AMI. Our model outperforms document-level machine translation baseline on the WMT20 English to German translation task. We investigate what the hierarchical layers learn by visualizing the hierarchical encoder-decoder attention. Finally, we study hierarchical learning on encoder-only pre-training and analyze its performance on classification tasks.
    Boosting Transformers for Job Expression Extraction and Classification in a Low-Resource Setting. (arXiv:2109.08597v1 [cs.CL])
    (2 min) In this paper, we explore possible improvements of transformer models in a low-resource setting. In particular, we present our approaches to tackle the first two of three subtasks of the MEDDOPROF competition, i.e., the extraction and classification of job expressions in Spanish clinical texts. As neither language nor domain experts, we experiment with the multilingual XLM-R transformer model and tackle these low-resource information extraction tasks as sequence-labeling problems. We explore domain- and language-adaptive pretraining, transfer learning and strategic datasplits to boost the transformer model. Our results show strong improvements using these methods by up to 5.3 F1 points compared to a fine-tuned XLM-R model. Our best models achieve 83.2 and 79.3 F1 for the first two tasks, respectively.
    Mitigating Data Scarceness through Data Synthesis, Augmentation and Curriculum for Abstractive Summarization. (arXiv:2109.08569v1 [cs.CL])
    (2 min) This paper explores three simple data manipulation techniques (synthesis, augmentation, curriculum) for improving abstractive summarization models without the need for any additional data. We introduce a method of data synthesis with paraphrasing, a data augmentation technique with sample mixing, and curriculum learning with two new difficulty metrics based on specificity and abstractiveness. We conduct experiments to show that these three techniques can help improve abstractive summarization across two summarization models and two different small datasets. Furthermore, we show that these techniques can improve performance when applied in isolation and when combined.
    Towards Handling Unconstrained User Preferences in Dialogue. (arXiv:2109.08650v1 [cs.CL])
    (2 min) A user input to a schema-driven dialogue information navigation system, such as venue search, is typically constrained by the underlying database which restricts the user to specify a predefined set of preferences, or slots, corresponding to the database fields. We envision a more natural information navigation dialogue interface where a user has flexibility to specify unconstrained preferences that may not match a predefined schema. We propose to use information retrieval from unstructured knowledge to identify entities relevant to a user request. We update the Cambridge restaurants database with unstructured knowledge snippets (reviews and information from the web) for each of the restaurants and annotate a set of query-snippet pairs with a relevance label. We use the annotated dataset to train and evaluate snippet relevance classifiers, as a proxy to evaluating recommendation accuracy. We show that with a pretrained transformer model as an encoder, an unsupervised/supervised classifier achieves a weighted F1 of .661/.856.
    Slot Filling for Biomedical Information Extraction. (arXiv:2109.08564v1 [cs.CL])
    (2 min) Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity and relation type specific training data is a major bottleneck in the above sub-tasks. In this work we present a slot filling approach to the task of biomedical IE, effectively replacing the need for entity and relation-specific training data, allowing us to deal with zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end for standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks, in the absence of relevant training data. Our code, models and pretrained data are available at https://github.com/healx/biomed-slot-filling.
    Does Commonsense help in detecting Sarcasm?. (arXiv:2109.08588v1 [cs.CL])
    (2 min) Sarcasm detection is important for several NLP tasks such as sentiment identification in product reviews, user feedback, and online forums. It is a challenging task requiring a deep understanding of language, context, and world knowledge. In this paper, we investigate whether incorporating commonsense knowledge helps in sarcasm detection. For this, we incorporate commonsense knowledge into the prediction process using a graph convolution network with pre-trained language model embeddings as input. Our experiments with three sarcasm detection datasets indicate that the approach does not outperform the baseline model. We perform an exhaustive set of experiments to analyze where commonsense support adds value and where it hurts classification. Our implementation is publicly available at: https://github.com/brcsomnath/commonsense-sarcasm.
    Ethics Sheet for Automatic Emotion Recognition and Sentiment Analysis. (arXiv:2109.08256v1 [cs.CL])
    (2 min) The importance and pervasiveness of emotions in our lives makes affective computing a tremendously important and vibrant line of work. Systems for automatic emotion recognition (AER) and sentiment analysis can be facilitators of enormous progress (e.g., in improving public health and commerce) but also enablers of great harm (e.g., for suppressing dissidents and manipulating voters). Thus, it is imperative that the affective computing community actively engage with the ethical ramifications of their creations. In this paper, I have synthesized and organized information from AI Ethics and Emotion Recognition literature to present fifty ethical considerations relevant to AER. Notably, the sheet fleshes out assumptions hidden in how AER is commonly framed, and in the choices often made regarding the data, method, and evaluation. Special attention is paid to the implications of AER on privacy and social groups. The objective of the sheet is to facilitate and encourage more thoughtfulness on why to automate, how to automate, and how to judge success well before the building of AER systems. Additionally, the sheet acts as a useful introductory document on emotion recognition (complementing survey articles).
    SentiPrompt: Sentiment Knowledge Enhanced Prompt-Tuning for Aspect-Based Sentiment Analysis. (arXiv:2109.08306v1 [cs.CL])
    (2 min) Aspect-based sentiment analysis (ABSA) is an emerging fine-grained sentiment analysis task that aims to extract aspects, classify corresponding sentiment polarities and find opinions as the causes of sentiment. The latest research tends to solve the ABSA task in a unified way with end-to-end frameworks. Yet, these frameworks get fine-tuned from downstream tasks without any task-adaptive modification. Specifically, they do not use task-related knowledge well or explicitly model relations between aspect and opinion terms, hindering them from better performance. In this paper, we propose SentiPrompt to use sentiment knowledge enhanced prompts to tune the language model in the unified framework. We inject sentiment knowledge regarding aspects, opinions, and polarities into prompt and explicitly model term relations via constructing consistency and polarity judgment templates from the ground truth triplets. Experimental results demonstrate that our approach can outperform strong baselines on Triplet Extraction, Pair Extraction, and Aspect Term Extraction with Sentiment Classification by a notable margin.
    Self-training with Few-shot Rationalization: Teacher Explanations Aid Student in Few-shot NLU. (arXiv:2109.08259v1 [cs.CL])
    (2 min) While pre-trained language models have obtained state-of-the-art performance for several natural language understanding tasks, they are quite opaque in terms of their decision-making process. While some recent works focus on rationalizing neural predictions by highlighting salient concepts in the text as justifications or rationales, they rely on thousands of labeled training examples for both task labels as well as annotated rationales for every instance. Such extensive large-scale annotations are infeasible to obtain for many tasks. To this end, we develop a multi-task teacher-student framework based on self-training language models with limited task-specific labels and rationales, and judicious sample selection to learn from informative pseudo-labeled examples. We study several characteristics of what constitutes a good rationale and demonstrate that the neural model performance can be significantly improved by making it aware of its rationalized predictions, particularly in low-resource settings. Extensive experiments in several benchmark datasets demonstrate the effectiveness of our approach.
    Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition. (arXiv:2109.00627v3 [cs.CL] UPDATED)
    (2 min) Contextual knowledge is important for real-world automatic speech recognition (ASR) applications. In this paper, a novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut between the tree and the final ASR output distribution to facilitate recognising biasing words during decoding. Systems were trained and evaluated on the Librispeech corpus where biasing words were extracted at the scales of an utterance, a chapter, or a book to simulate different application scenarios. Experimental results showed that TCPGen consistently improved word error rates (WERs) compared to the baselines, and in particular, achieved significant WER reductions on the biasing words. TCPGen is highly efficient: it can handle 5,000 biasing words and distractors and only add a small overhead to memory use and computation cost.
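    A minimal sketch of the symbolic side of the idea: a prefix tree over biasing words that yields the valid continuations of a decoded prefix. For readability this version operates on characters, whereas the actual system works on subword units and couples the tree to the ASR output distribution through a neural shortcut.

    ```python
    def build_prefix_tree(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node["<eow>"] = {}                     # marks a complete biasing word
        return root

    def valid_continuations(tree, prefix):
        node = tree
        for ch in prefix:
            if ch not in node:
                return set()                       # prefix has fallen off the tree
            node = node[ch]
        return set(node.keys())

    tree = build_prefix_tree(["turin", "turing", "turing's"])
    print(valid_continuations(tree, "turi"))       # {'n'}
    print(valid_continuations(tree, "turing"))     # {'<eow>', "'"}
    ```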
    Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature. (arXiv:2106.13375v2 [cs.IR] UPDATED)
    (3 min) Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and many other vertical domains is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.
    Regularized Training of Nearest Neighbor Language Models. (arXiv:2109.08249v1 [cs.CL])
    (2 min) Including memory banks in a natural language processing architecture increases model capacity by equipping it with additional data at inference time. In this paper, we build upon $k$NN-LM (Khandelwal et al., 2020), which uses a pre-trained language model together with an exhaustive $k$NN search through the training data (memory bank) to achieve state-of-the-art results. We investigate whether we can improve the $k$NN-LM performance by instead training an LM with the knowledge that we will be using a $k$NN post-hoc. We achieved significant improvement using our method on language modeling tasks on WIKI-2 and WIKI-103. The main phenomenon that we encounter is that adding a simple L2 regularization on the activations (not weights) of the model, a transformer, improves the post-hoc $k$NN classification performance. We explore some possible reasons for this improvement. In particular, we find that the added L2 regularization seems to improve the performance for high-frequency words without deteriorating the performance for low-frequency ones.
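    A minimal sketch of the regularized objective, assuming `hidden` holds the final-layer activations later used as $k$NN keys; the coefficient and the exact placement of the penalty are illustrative.

    ```python
    import torch
    import torch.nn.functional as F

    def lm_loss_with_activation_l2(logits, targets, hidden, lam=1e-3):
        """logits: (batch, seq, vocab); targets: (batch, seq); hidden: (batch, seq, d)."""
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        act_l2 = hidden.pow(2).sum(dim=-1).mean()     # penalize activations, not weights
        return ce + lam * act_l2

    # Toy usage with random tensors standing in for a transformer LM's outputs.
    logits = torch.randn(2, 8, 1000, requires_grad=True)
    hidden = torch.randn(2, 8, 512, requires_grad=True)
    targets = torch.randint(0, 1000, (2, 8))
    lm_loss_with_activation_l2(logits, targets, hidden).backward()
    ```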
    A Bag of Tricks for Dialogue Summarization. (arXiv:2109.08232v1 [cs.CL])
    (2 min) Dialogue summarization comes with its own peculiar challenges as opposed to news or scientific articles summarization. In this work, we explore four different challenges of the task: handling and differentiating parts of the dialogue belonging to multiple speakers, negation understanding, reasoning about the situation, and informal language understanding. Using a pretrained sequence-to-sequence language model, we explore speaker name substitution, negation scope highlighting, multi-task learning with relevant tasks, and pretraining on in-domain data. Our experiments show that our proposed techniques indeed improve summarization performance, outperforming strong baselines.
    Balancing out Bias: Achieving Fairness Through Training Reweighting. (arXiv:2109.08253v1 [cs.CL])
    (2 min) Bias in natural language processing arises primarily from models learning characteristics of the author such as gender and race when modelling tasks such as sentiment and syntactic parsing. This problem manifests as disparities in error rates across author demographics, typically disadvantaging minority groups. Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables. Moreover, evaluation of bias has been inconsistent in previous work, in terms of dataset balance and evaluation methods. This paper introduces a very simple but highly effective method for countering bias using instance reweighting, based on the frequency of both task labels and author demographics. We extend the method in the form of a gated model which incorporates the author demographic as an input, and show that while it is highly vulnerable to input data bias, it provides debiased predictions through demographic input perturbation, and outperforms all other bias mitigation techniques when combined with instance reweighting.
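    A minimal sketch of instance reweighting based on the joint frequency of task label and author demographic; the paper's exact weighting scheme may differ (this version simply makes every (label, demographic) cell contribute equally in expectation).

    ```python
    from collections import Counter

    def instance_weights(labels, demographics):
        """One weight per training instance, inversely proportional to joint frequency."""
        joint = Counter(zip(labels, demographics))
        n = len(labels)
        return [n / (len(joint) * joint[(y, d)]) for y, d in zip(labels, demographics)]

    labels       = ["pos", "pos", "neg", "neg", "neg", "pos"]
    demographics = ["A",   "A",   "A",   "B",   "B",   "B"]
    print(instance_weights(labels, demographics))
    # Rare combinations (e.g. pos/B) get larger weights; these can be passed to a
    # weighted cross-entropy loss during training.
    ```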
    Understanding Politics via Contextualized Discourse Processing. (arXiv:2012.15784v2 [cs.CL] UPDATED)
    (2 min) Politicians often have underlying agendas when reacting to events. Arguments in contexts of various events reflect a fairly consistent set of agendas for a given entity. In spite of recent advances in Pretrained Language Models (PLMs), those text representations are not designed to capture such nuanced patterns. In this paper, we propose a Compositional Reader model consisting of encoder and composer modules, that attempts to capture and leverage such information to generate more effective representations for entities, issues, and events. These representations are contextualized by tweets, press releases, issues, news articles, and participating entities. Our model can process several documents at once and generate composed representations for multiple entities over several issues or events. Via qualitative and quantitative empirical analysis, we show that these representations are meaningful and effective.
    CodeQA: A Question Answering Dataset for Source Code Comprehension. (arXiv:2109.08365v1 [cs.CL])
    (2 min) We propose CodeQA, a free-form question answering dataset for the purpose of source code comprehension: given a code snippet and a question, a textual answer is required to be generated. CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs. To obtain natural and faithful questions and answers, we implement syntactic rules and semantic analysis to transform code comments into question-answer pairs. We present the construction process and conduct systematic analysis of our dataset. Experimental results achieved by several neural baselines on our dataset are shown and discussed. While research on question-answering and machine reading comprehension develops rapidly, little prior work has drawn attention to code question answering. This new dataset can serve as a useful research benchmark for source code comprehension.
    Grounding Natural Language Instructions: Can Large Language Models Capture Spatial Information?. (arXiv:2109.08634v1 [cs.CL])
    (2 min) Models designed for intelligent process automation are required to be capable of grounding user interface elements. This task of interface element grounding is centred on linking instructions in natural language to their target referents. Even though BERT and similar pre-trained language models have excelled in several NLP tasks, their use has not been widely explored for the UI grounding domain. This work concentrates on testing and probing the grounding abilities of three different transformer-based models: BERT, RoBERTa and LayoutLM. Our primary focus is on these models' spatial reasoning skills, given their importance in this domain. We observe that LayoutLM has a promising advantage for applications in this domain, even though it was created for a different original purpose (representing scanned documents): the learned spatial features appear to be transferable to the UI grounding setting, especially as they demonstrate the ability to discriminate between target directions in natural language instructions.
    Self-supervised Document Clustering Based on BERT with Data Augment. (arXiv:2011.08523v3 [cs.CL] UPDATED)
    (2 min) Contrastive learning is a promising approach to unsupervised learning, as it inherits the advantages of well-studied deep models without a dedicated and complex model design. In this paper, based on bidirectional encoder representations from transformers, we propose self-supervised contrastive learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art unsupervised clustering approaches for short texts and those for long texts in terms of several clustering evaluation measures. FCL achieves performance close to supervised learning, and FCL with UDA further improves the performance for short texts.
    Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles. (arXiv:2109.00199v2 [cs.IR] UPDATED)
    (2 min) We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) authors tend to note the salient aspects of an article's contribution in its title; and (ii) the salient terms follow pattern regularities that can be expressed as a set of rules. Only lexico-syntactic patterns that were easily recognizable, occurred frequently, and positionally indicated a scientific entity type were selected. The rules were developed on a collection of 50,237 CL titles covering all articles in the ACL Anthology. In total, 19,799 research problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and 21,687 methods were extracted at an average precision of 75%.
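    The abstract does not list the actual rules; as a hedged illustration only, a single lexico-syntactic title pattern of the kind described might be coded as follows (the rule, the entity labels, and the "method for problem" reading are assumptions, not the paper's rule set):

        import re

        # Illustrative rule: in "X for Y" titles, the span before "for" often names
        # a method/solution and the span after it a research problem.
        PATTERN = re.compile(r"^(?P<method>.+?)\s+for\s+(?P<problem>.+?)\.?$")

        m = PATTERN.match("Graph Convolution Transformer for 3D Pose Estimation")
        if m:
            print("method: ", m.group("method"))
            print("problem:", m.group("problem"))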
    reproducing "ner and pos when nothing is capitalized". (arXiv:2109.08396v1 [cs.CL])
    (2 min) Capitalization is an important feature in many NLP tasks such as Named Entity Recognition (NER) or Part of Speech Tagging (POS). We attempt to reproduce the results of a paper that shows how to mitigate a significant performance drop when casing is mismatched between training and testing data. In particular, we show that lowercasing 50% of the dataset provides the best performance, matching the claims of the original paper. We also show that we obtained slightly lower performance in almost all experiments we tried to reproduce, suggesting that there might be some hidden factors impacting our performance. Lastly, we make all of our work available in a public GitHub repository.
    The futility of STILTs for the classification of lexical borrowings in Spanish. (arXiv:2109.08607v1 [cs.CL])
    (2 min) The first edition of the IberLEF 2021 shared task on automatic detection of borrowings (ADoBo) focused on detecting lexical borrowings that appeared in the Spanish press and that have recently been imported into the Spanish language. In this work, we tested supplementary training on intermediate labeled-data tasks (STILTs) from part of speech (POS), named entity recognition (NER), code-switching, and language identification approaches to the classification of borrowings at the token level using existing pre-trained transformer-based language models. Our extensive experimental results suggest that STILTs do not provide any improvement over direct fine-tuning of multilingual models. However, multilingual models trained on small subsets of languages perform noticeably better than multilingual BERT, but not as well as multilingual RoBERTa, for the given dataset.
    Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation. (arXiv:2109.08478v1 [cs.CL])
    (2 min) Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with visual scene step by step according to the order of the dialogue and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.
    Efficient Measuring of Readability to Improve Documents Accessibility for Arabic Language Learners. (arXiv:2109.08648v1 [cs.CL])
    (2 min) This paper presents an approach based on supervised machine learning methods to build a classifier that can identify text complexity in order to present Arabic language learners with texts suitable to their levels. The approach is based on machine learning classification methods to discriminate between the different levels of difficulty in reading and understanding a text. Several models were trained on a large corpus mined from online Arabic websites and manually annotated. The level identification problem is formulated as a classification task. The model uses both Count and TF-IDF representations and applies five machine learning algorithms: Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest, using unigram and bigram features. Experimental results showed that n-gram features can be indicative of the reading level of a text and can substantially improve performance, and that SVM and Multinomial Naive Bayes are the most accurate in predicting the complexity level. The best results were achieved using TF-IDF vectors built from a combination of word-based unigrams and bigrams, with an overall accuracy of 87.14% over four classes of complexity.
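    As a rough sketch of the kind of pipeline described (hedged: the corpus, labels, and hyperparameters below are placeholders, and the paper's exact preprocessing is not specified), the TF-IDF unigram+bigram features and two of the five classifiers could be wired up with scikit-learn like this:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Placeholder texts and complexity levels; the real corpus is mined from
        # online Arabic websites and manually annotated with four levels.
        texts = ["نص سهل للمبتدئين", "نص متوسط الصعوبة", "نص متقدم ومعقد", "نص سهل آخر"]
        levels = ["beginner", "intermediate", "advanced", "beginner"]

        for Clf in (LinearSVC, MultinomialNB):
            # word unigrams + bigrams, TF-IDF weighted
            model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Clf())
            model.fit(texts, levels)
            print(Clf.__name__, model.predict(["نص جديد"]))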
    Hierarchical Control of Situated Agents through Natural Language. (arXiv:2109.08214v1 [cs.CL])
    (2 min) When humans conceive how to perform a particular task, they do so hierarchically: splitting higher-level tasks into smaller sub-tasks. However, in the literature on natural language (NL) command of situated agents, most works have treated the procedures to be executed as flat sequences of simple actions, or any hierarchies of procedures have been shallow at best. In this paper, we propose a formalism of procedures as programs, a powerful yet intuitive method of representing hierarchical procedural knowledge for agent command and control. We further propose a modeling paradigm of hierarchical modular networks, which consist of a planner and reactors that convert NL intents to predictions of executable programs and probe the environment for information necessary to complete the program execution. We instantiate this framework on the IQA and ALFRED datasets for NL instruction following. Our model outperforms reactive baselines by a large margin on both datasets. We also demonstrate that our framework is more data-efficient, and that it allows for fast iterative development.
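    As a hedged illustration of the "procedures as programs" idea (the procedure names and primitive actions below are invented for illustration and are not from the paper), a high-level intent can expand into nested sub-procedures rather than a flat action sequence:

        # Named procedures whose steps may themselves be named sub-procedures.
        PROCEDURES = {
            "make_coffee": ["go_to(kitchen)", "boil_water", "pour(water, cup)", "add(coffee, cup)"],
            "boil_water": ["fill(kettle)", "turn_on(kettle)", "wait_until(boiled)"],
        }

        def expand(step):
            """Recursively expand named sub-procedures into primitive actions."""
            if step in PROCEDURES:
                return [action for sub in PROCEDURES[step] for action in expand(sub)]
            return [step]

        print(expand("make_coffee"))  # flat executable action sequence for the agent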
    Language Models as a Knowledge Source for Cognitive Agents. (arXiv:2109.08270v1 [cs.AI])
    (2 min) Language models (LMs) are sentence-completion engines trained on massive corpora. LMs have emerged as a significant breakthrough in natural-language processing, providing capabilities that go far beyond sentence completion including question answering, summarization, and natural-language inference. While many of these capabilities have potential application to cognitive systems, exploiting language models as a source of task knowledge, especially for task learning, offers significant, near-term benefits. We introduce language models and the various tasks to which they have been applied and then review methods of knowledge extraction from language models. The resulting analysis outlines both the challenges and opportunities for using language models as a new knowledge source for cognitive systems. It also identifies possible ways to improve knowledge extraction from language models using the capabilities provided by cognitive systems. Central to success will be the ability of a cognitive agent to itself learn an abstract model of the knowledge implicit in the LM as well as methods to extract high-quality knowledge effectively and efficiently. To illustrate, we introduce a hypothetical robot agent and describe how language models could extend its task knowledge and improve its performance and the kinds of knowledge and methods the agent can use to exploit the knowledge within a language model.
    Hierarchy-Aware T5 with Path-Adaptive Mask Mechanism for Hierarchical Text Classification. (arXiv:2109.08585v1 [cs.CL])
    (2 min) Hierarchical Text Classification (HTC), which aims to predict text labels organized in a hierarchical space, is an important but under-investigated task in natural language processing. Existing methods usually encode the entire hierarchical structure and fail to construct a robust label-dependent model, making it hard to predict sparse lower-level labels accurately and leading to low Macro-F1. In this paper, we propose a novel PAMM-HiA-T5 model for HTC: a hierarchy-aware T5 model with a path-adaptive mask mechanism that not only builds the knowledge of upper-level labels into lower-level ones but also introduces path dependency information into label prediction. Specifically, we generate a multi-level sequential label structure to exploit hierarchical dependency across different levels with Breadth-First Search (BFS) and the T5 model. To further improve label dependency prediction within each path, we then propose an original path-adaptive mask mechanism (PAMM) to identify the label's path information, eliminating sources of noise from other paths. Comprehensive experiments on three benchmark datasets show that our novel PAMM-HiA-T5 model greatly outperforms all state-of-the-art HTC approaches, especially in Macro-F1. The ablation studies show that the improvements mainly come from our innovative approach instead of T5.
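    A minimal sketch of the BFS flattening step as described (hedged: the hierarchy, the delimiter, and the exact target format fed to T5 are assumptions, not the paper's implementation):

        from collections import deque

        def bfs_label_sequence(hierarchy, gold_labels, root="ROOT"):
            """Traverse the label hierarchy breadth-first and keep the document's
            gold labels in level order, yielding a sequential generation target."""
            order, queue = [], deque([root])
            while queue:
                node = queue.popleft()
                if node in gold_labels:
                    order.append(node)
                queue.extend(hierarchy.get(node, []))
            return " / ".join(order)  # e.g. used as the T5 target sequence

        hierarchy = {"ROOT": ["Science", "Sports"], "Science": ["Physics", "Biology"]}
        print(bfs_label_sequence(hierarchy, {"Physics", "Science"}))  # Science / Physics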
    Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications. (arXiv:2109.08627v1 [cs.CL])
    (2 min) Sentence-level quality estimation (QE) of machine translation is traditionally formulated as a regression task, and the performance of QE models is typically measured by Pearson correlation with human labels. Recent QE models have achieved previously-unseen levels of correlation with human judgments, but they rely on large multilingual contextualized language models that are computationally expensive, making them infeasible for real-world applications. In this work, we evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting. We observe that a full model parameterization is required to achieve SoTA results in a regression task. However, we argue that the level of expressiveness of a model in a continuous range is unnecessary given the downstream applications of QE, and show that reframing QE as a classification problem and evaluating QE models using classification metrics would better reflect their actual performance in real-world applications.
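    The reframing itself can be as simple as bucketing the continuous quality scores into discrete bands; as a hedged sketch (the thresholds and number of classes below are illustrative, not the paper's):

        import numpy as np

        def to_quality_class(scores, thresholds=(0.3, 0.6, 0.8)):
            """Map continuous QE scores in [0, 1] to class ids 0..len(thresholds)."""
            return np.digitize(scores, thresholds)

        scores = np.array([0.12, 0.55, 0.70, 0.95])
        print(to_quality_class(scores))  # [0 1 2 3] -> e.g. bad / ok / good / excellent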
    Fast-Slow Transformer for Visually Grounding Speech. (arXiv:2109.08186v1 [eess.AS])
    (2 min) We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
    Numerical reasoning in machine reading comprehension tasks: are we there yet?. (arXiv:2109.08207v1 [cs.CL])
    (2 min) Numerical reasoning based machine reading comprehension is a task that involves reading comprehension along with using arithmetic operations such as addition, subtraction, sorting, and counting. The DROP benchmark (Dua et al., 2019) is a recent dataset that has inspired the design of NLP models aimed at solving this task. The current standings of these models in the DROP leaderboard, over standard metrics, suggest that the models have achieved near-human performance. However, does this mean that these models have learned to reason? In this paper, we present a controlled study on some of the top-performing model architectures for the task of numerical reasoning. Our observations suggest that the standard metrics are incapable of measuring progress towards such tasks.
  • cs.CV updates on arXiv.org

    Effective Tensor Completion via Element-wise Weighted Low-rank Tensor Train with Overlapping Ket Augmentation. (arXiv:2109.05736v2 [cs.CV] UPDATED)
    (0 min) In recent years, there has been an increasing number of applications of tensor completion based on the tensor train (TT) format because of its efficiency and effectiveness in dealing with higher-order tensor data. However, existing tensor completion methods using TT decomposition have two obvious drawbacks. One is that they only consider mode weights according to the degree of mode balance, even though some elements are recovered better in an unbalanced mode. The other is that serious blocking artifacts appear when the missing element rate is relatively large. To remedy these two issues, in this work, we propose a novel tensor completion approach via an element-wise weighting technique. Accordingly, a novel formulation for tensor completion and an effective optimization algorithm, called tensor completion by parallel weighted matrix factorization via tensor train (TWMac-TT), are proposed. In addition, we specifically consider the recovery quality of edge elements from adjacent blocks. Different from traditional reshaping and ket augmentation, we utilize a new tensor augmentation technique called overlapping ket augmentation, which can further avoid blocking artifacts. We then conduct extensive performance evaluations on synthetic data and several real image data sets. Our experimental results demonstrate that the proposed algorithm TWMac-TT outperforms several other competing tensor completion methods.
    Standardised convolutional filtering for radiomics. (arXiv:2006.05470v4 [eess.IV] UPDATED)
    (0 min) The Image Biomarker Standardisation Initiative (IBSI) aims to improve reproducibility of radiomics studies by standardising the computational process of extracting image biomarkers (features) from images. We have previously established reference values for 169 commonly used features, created a standard radiomics image processing scheme, and developed reporting guidelines for radiomic studies. However, several aspects are not standardised. Here we present a preliminary version of a reference manual on the use of convolutional image filters in radiomics. Filters, such as wavelets or Laplacian of Gaussian filters, play an important part in emphasising specific image characteristics such as edges and blobs. Features derived from filter response maps have been found to be poorly reproducible. This reference manual forms the basis of ongoing work on standardising convolutional filters in radiomics, and will be updated as this work progresses.
    MSN: Multi-Style Network for Trajectory Prediction. (arXiv:2107.00932v2 [cs.CV] UPDATED)
    (2 min) It is essential to predict the future trajectories of various agents in complex scenes. Whether it is agents' internal personality factors, the interactive behavior of the neighborhood, or the influence of surroundings, all of these will have an impact on their future plans. This means that even for the same physical type of agents, there are huge differences in their behavior styles. We concentrate on the problem of modeling agents' multi-style characteristics when predicting their trajectories. We propose the Multi-Style Network (MSN) to address this problem by dividing agents' behaviors into several hidden behavior categories adaptively, and then training each category's prediction network jointly, thus giving agents multiple styles of predictions simultaneously. Experiments show that MSN outperforms current state-of-the-art methods with a 10%-20% performance improvement on two widely used datasets, and presents better multi-style characteristics in predictions.
    Minimal Adversarial Examples for Deep Learning on 3D Point Clouds. (arXiv:2008.12066v4 [cs.CV] UPDATED)
    (2 min) With recent developments of convolutional neural networks, deep learning for 3D point clouds has shown significant progress in various 3D scene understanding tasks, e.g., object recognition, semantic segmentation. In a safety-critical environment, however, it is not well understood how vulnerable such deep learning models are to adversarial examples. In this work, we explore adversarial attacks for point cloud-based neural networks. We propose a unified formulation for adversarial point cloud generation that can generalise two different attack strategies. Our method generates adversarial examples by attacking the classification ability of point cloud-based networks while considering the perceptibility of the examples and ensuring a minimal level of point manipulation. Experimental results show that our method achieves state-of-the-art performance, with attack success rates above 89% and 90% on synthetic and real-world data respectively, while manipulating only about 4% of the total points.
    Rescaling Egocentric Vision. (arXiv:2006.13256v4 [cs.CV] UPDATED)
    (2 min) This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments). This collection enables new challenges such as action detection and evaluating the "test of time" - i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task and provide baselines and evaluation metrics.
    LEAN: graph-based pruning for convolutional neural networks by extracting longest chains. (arXiv:2011.06923v2 [cs.LG] UPDATED)
    (2 min) Convolutional neural networks (CNNs) have proven to be highly successful at a range of image-to-image tasks. However, CNNs can be prohibitively computationally expensive for real-time use, which can limit their applicability in practice. Model pruning can improve computational efficiency by sparsifying trained networks. Common methods for pruning CNNs determine what convolutional filters to remove by ranking filters on an individual basis. However, filters are not independent, as CNNs consist of chains of convolutions, which can result in sub-optimal filter selection. We propose a novel pruning method, LongEst-chAiN (LEAN) pruning, which takes the interdependency between the convolution operations into account. We propose to prune CNNs by using graph-based algorithms to select relevant chains of convolutions. A CNN is interpreted as a graph, with the operator norm of each operator as the distance metric for the edges. LEAN pruning iteratively extracts the highest-value path from the graph to keep. In our experiments, we test LEAN pruning for several image-to-image tasks, including the well-known CamVid dataset, and a real-world dynamic X-ray CT dataset. When pruning CNNs with LEAN, we achieve a higher accuracy than pruning filters individually, and different pruned substructures emerge.
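    A hedged toy sketch of the core idea of iteratively extracting the highest-value chain from a weighted DAG (the graph, weights, and edge-removal step are illustrative stand-ins; the actual LEAN procedure and graph construction may differ):

        import networkx as nx

        def extract_chains(g, num_chains):
            """Repeatedly take the path with the largest cumulative edge weight
            (operator norms) and remove its edges, keeping the extracted chains."""
            kept, g = [], g.copy()
            for _ in range(num_chains):
                if g.number_of_edges() == 0:
                    break
                path = nx.dag_longest_path(g, weight="weight")
                kept.append(path)
                g.remove_edges_from(list(zip(path, path[1:])))
            return kept

        g = nx.DiGraph()
        g.add_weighted_edges_from([("in", "c1", 2.0), ("c1", "c2", 0.5),
                                   ("in", "c3", 1.5), ("c3", "c2", 1.8),
                                   ("c2", "out", 1.0)])
        print(extract_chains(g, num_chains=2))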
    Cross Modification Attention Based Deliberation Model for Image Captioning. (arXiv:2109.08411v1 [cs.CV])
    (0 min) The conventional encoder-decoder framework for image captioning generally adopts a single-pass decoding process, which predicts the target descriptive sentence word by word in temporal order. Despite the great success of this framework, it still suffers from two serious disadvantages. Firstly, it is unable to correct the mistakes in the predicted words, which may mislead the subsequent prediction and result in an error accumulation problem. Secondly, such a framework can only leverage the already generated words but not the possible future words, and thus lacks the ability to plan globally over linguistic information. To overcome these limitations, we explore a universal two-pass decoding framework, where a single-pass decoding based model serving as the Drafting Model first generates a draft caption according to an input image, and a Deliberation Model then performs the polishing process to refine the draft caption to a better image description. Furthermore, inspired by the complementarity between different modalities, we propose a novel Cross Modification Attention (CMA) module to enhance the semantic expression of the image features and filter out error information from the draft captions. We integrate CMA with the decoder of our Deliberation Model and name it the Cross Modification Attention based Deliberation Model (CMA-DM). We train our proposed framework by jointly optimizing all trainable components from scratch with a trade-off coefficient. Experiments on the MS COCO dataset demonstrate that our approach obtains significant improvements over single-pass decoding baselines and achieves competitive performance compared with other state-of-the-art two-pass decoding based methods.
    LUAI Challenge 2021 on Learning to Understand Aerial Images. (arXiv:2108.13246v2 [cs.CV] UPDATED)
    (2 min) This report summarizes the results of Learning to Understand Aerial Images (LUAI) 2021 challenge held on ICCV 2021, which focuses on object detection and semantic segmentation in aerial images. Using DOTA-v2.0 and GID-15 datasets, this challenge proposes three tasks for oriented object detection, horizontal object detection, and semantic segmentation of common categories in aerial images. This challenge received a total of 146 registrations on the three tasks. Through the challenge, we hope to draw attention from a wide range of communities and call for more efforts on the problems of learning to understand aerial images.
    Semi-Supervised Learning for Sparsely-Labeled Sequential Data: Application to Healthcare Video Processing. (arXiv:2011.14101v4 [cs.CV] UPDATED)
    (0 min) Labeled data is a critical resource for training and evaluating machine learning models. However, many real-life datasets are only partially labeled. We propose a semi-supervised machine learning training strategy to improve event detection performance on sequential data, such as video recordings, when only sparse labels are available, such as event start times without their corresponding end times. Our method uses noisy guesses of the events' end times to train event detection models. Depending on how conservative these guesses are, mislabeled false positives may be introduced into the training set (i.e., negative sequences mislabeled as positives). We further propose a mathematical model for estimating how many inaccurate labels a model is exposed to, based on how noisy the end time guesses are. Finally, we show that neural networks can improve their detection performance by leveraging more training data with less conservative approximations despite the higher proportion of incorrect labels. We adapt sequential versions of MNIST and CIFAR-10 to empirically evaluate our method, and find that our risk-tolerant strategy outperforms conservative estimates by 12 points of mean average precision for MNIST, and 3.5 points for CIFAR. Then, we leverage the proposed training strategy to tackle a real-life application: processing continuous video recordings of epilepsy patients to improve seizure detection, and show that our method outperforms baseline labeling methods by 10 points of average precision.
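    As a hedged sketch of the labeling strategy described (the frame counts, start times, and assumed duration below are illustrative; the paper's actual end-time guesses are derived differently), turning sparse start-time annotations into frame-level training labels could look like this:

        def frame_labels(num_frames, start_frames, assumed_duration):
            """Mark frames inside each guessed event window as positive. Longer
            assumed durations yield more positives but also more mislabeled
            false positives when the guess overshoots the true event end."""
            labels = [0] * num_frames
            for start in start_frames:
                for t in range(start, min(start + assumed_duration, num_frames)):
                    labels[t] = 1
            return labels

        print(frame_labels(20, start_frames=[3, 12], assumed_duration=4))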
    In-N-Out: Towards Good Initialization for Inpainting and Outpainting. (arXiv:2106.13953v3 [cs.CV] UPDATED)
    (2 min) In computer vision, recovering spatial information by filling in masked regions, e.g., inpainting, has been widely investigated for its usability and wide applicability to various other applications: image inpainting, image extrapolation, and environment map estimation. Most of them are studied separately depending on the applications. Our focus, however, is on accommodating the opposite task, e.g., image outpainting, which would benefit the target applications, e.g., image inpainting. Our self-supervision method, In-N-Out, is summarized as a training approach that leverages the knowledge of the opposite task into the target model. We empirically show that In-N-Out -- which explores the complementary information -- has a clear advantage over traditional pipelines where only task-specific learning takes place during training. In experiments, we compare our method to the traditional procedure and analyze the effectiveness of our method on different applications: image inpainting, image extrapolation, and environment map estimation. For these tasks, we demonstrate that adding In-N-Out self-supervision to the training procedure consistently improves the performance of recent works. Also, we show that our approach achieves better results than an existing training approach for outpainting.
    Hepatocellular Carcinoma Segmentation from Digital Subtraction Angiography Videos using Learnable Temporal Difference. (arXiv:2107.04306v3 [eess.IV] UPDATED)
    (2 min) Automatic segmentation of hepatocellular carcinoma (HCC) in Digital Subtraction Angiography (DSA) videos can assist radiologists in efficient diagnosis of HCC and accurate evaluation of tumors in clinical practice. Few studies have investigated HCC segmentation from DSA videos. The task is highly challenging due to motion artifacts in filming, ambiguous boundaries of tumor regions, and high similarity in imaging to other anatomical tissues. In this paper, we raise the problem of HCC segmentation in DSA videos, and build our own DSA dataset. We also propose a novel segmentation network called DSA-LTDNet, including a segmentation sub-network, a temporal difference learning (TDL) module and a liver region segmentation (LRS) sub-network for providing additional guidance. DSA-LTDNet is designed to proactively learn latent motion information from DSA videos and boost segmentation performance. All experiments are conducted on our self-collected dataset. Experimental results show that DSA-LTDNet increases the DICE score by nearly 4% compared to the U-Net baseline.
    Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation. (arXiv:2109.08478v1 [cs.CL])
    (0 min) Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with visual scene step by step according to the order of the dialogue and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.
    External Knowledge Augmented Text Visual Question Answering. (arXiv:2108.09717v2 [cs.CV] UPDATED)
    (2 min) The open-ended question answering task of Text-VQA requires reading and reasoning about local, often previously unseen, scene-text content of an image to generate answers. In this work, we propose the generalized use of external knowledge to augment our understanding of the said scene-text. We design a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision language understanding tasks. Through empirical evidence and qualitative results, we demonstrate how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We generate results comparable to the state-of-the-art on two publicly available datasets, under the constraints of similar upstream OCR systems and training data.
    ProCAN: Progressive Growing Channel Attentive Non-Local Network for Lung Nodule Classification. (arXiv:2010.15417v3 [eess.IV] UPDATED)
    (2 min) Lung cancer classification in screening computed tomography (CT) scans is one of the most crucial tasks for early detection of this disease. Many lives can be saved if we are able to accurately classify malignant/cancerous lung nodules. Consequently, several deep learning based models have been proposed recently to classify lung nodules as malignant or benign. Nevertheless, the large variation in the size and heterogeneous appearance of the nodules makes this task an extremely challenging one. We propose a new Progressive Growing Channel Attentive Non-Local (ProCAN) network for lung nodule classification. The proposed method addresses this challenge from three different aspects. First, we enrich the Non-Local network by adding channel-wise attention capability to it. Second, we apply Curriculum Learning principles, whereby we first train our model on easy examples before hard ones. Third, as the classification task gets harder during the Curriculum learning, our model is progressively grown to increase its capability of handling the task at hand. We examined our proposed method on two different public datasets and compared its performance with state-of-the-art methods in the literature. The results show that the ProCAN model outperforms state-of-the-art methods and achieves an AUC of 98.05% and an accuracy of 95.28% on the LIDC-IDRI dataset. Moreover, we conducted extensive ablation studies to analyze the contribution and effects of each new component of our proposed method.
    Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests. (arXiv:2108.02356v2 [cs.CV] UPDATED)
    (2 min) Although deep neural networks (DNNs) enable great progress in video abnormal event detection (VAD), existing solutions typically suffer from two issues: (1) The localization of video events cannot be both precise and comprehensive. (2) The semantics and temporal context are under-explored. To tackle those issues, we are motivated by the prevalent cloze test in education and propose a novel approach named Visual Cloze Completion (VCC), which conducts VAD by learning to complete "visual cloze tests" (VCTs). Specifically, VCC first localizes each video event and encloses it into a spatio-temporal cube (STC). To achieve both precise and comprehensive localization, appearance and motion are used as complementary cues to mark the object region associated with each event. For each marked region, a normalized patch sequence is extracted from current and adjacent frames and stacked into a STC. With each patch and the patch sequence of a STC compared to a visual "word" and "sentence" respectively, we deliberately erase a certain "word" (patch) to yield a VCT. Then, the VCT is completed by training DNNs to infer the erased patch and its optical flow via video semantics. Meanwhile, VCC fully exploits temporal context by alternatively erasing each patch in temporal context and creating multiple VCTs. Furthermore, we propose localization-level, event-level, model-level and decision-level solutions to enhance VCC, which can further exploit VCC's potential and produce a significant performance gain. Extensive experiments demonstrate that VCC achieves state-of-the-art VAD performance. Our codes and results are open at https://github.com/yuguangnudt/VEC_VAD/tree/VCC.
    Sign-MAML: Efficient Model-Agnostic Meta-Learning by SignSGD. (arXiv:2109.07497v1 [cs.LG] CROSS LISTED)
    (2 min) We propose a new computationally-efficient first-order algorithm for Model-Agnostic Meta-Learning (MAML). The key enabling technique is to interpret MAML as a bilevel optimization (BLO) problem and leverage the sign-based SGD (signSGD) as a lower-level optimizer of BLO. We show that MAML, through the lens of signSGD-oriented BLO, naturally yields an alternating optimization scheme that just requires first-order gradients of a learned meta-model. We term the resulting MAML algorithm Sign-MAML. Compared to the conventional first-order MAML (FO-MAML) algorithm, Sign-MAML is theoretically-grounded as it does not impose any assumption on the absence of second-order derivatives during meta training. In practice, we show that Sign-MAML outperforms FO-MAML in various few-shot image classification tasks, and compared to MAML, it achieves a much more graceful tradeoff between classification accuracy and computation efficiency.
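    The signSGD inner update itself is simple; as a minimal PyTorch sketch (the model, loss, and learning rate are placeholders, and this is only the lower-level step, not the full Sign-MAML alternation):

        import torch

        def signsgd_step(params, lr=0.01):
            """Move each parameter a fixed step in the direction of the sign of
            its gradient, rather than by the gradient magnitude."""
            with torch.no_grad():
                for p in params:
                    if p.grad is not None:
                        p -= lr * p.grad.sign()

        model = torch.nn.Linear(4, 2)
        loss = model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()
        signsgd_step(model.parameters(), lr=0.01)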
    Estimating Example Difficulty Using Variance of Gradients. (arXiv:2008.11600v3 [cs.CV] UPDATED)
    (2 min) In machine learning, a question of great interest is understanding what examples are challenging for a model to classify. Identifying atypical examples ensures the safe deployment of models, isolates samples that require further human inspection and provides interpretability into model behavior. In this work, we propose Variance of Gradients (VoG) as a valuable and efficient metric to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. We show that data points with high VoG scores are far more difficult for the model to learn and over-index on corrupted or memorized examples. Further, restricting the evaluation to the test set instances with the lowest VoG improves the model's generalization performance. Finally, we show that VoG is a valuable and efficient ranking for out-of-distribution detection.
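    As a hedged sketch of a VoG-style score (the gradient target, the stand-in checkpoint models, and the normalisation below are simplified assumptions; the paper's exact formulation may differ):

        import torch

        def vog_score(checkpoints, x, y, loss_fn=torch.nn.functional.cross_entropy):
            """Collect the input gradient of one example at several checkpoints and
            return the mean per-dimension variance across checkpoints."""
            grads = []
            for model in checkpoints:
                x_req = x.clone().requires_grad_(True)
                loss = loss_fn(model(x_req), y)
                grads.append(torch.autograd.grad(loss, x_req)[0])
            return torch.stack(grads).var(dim=0).mean().item()

        checkpoints = [torch.nn.Linear(10, 3) for _ in range(5)]  # stand-ins for saved models
        x, y = torch.randn(1, 10), torch.tensor([1])
        print(vog_score(checkpoints, x, y))  # higher variance -> harder example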
    Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. (arXiv:2108.04308v2 [cs.CV] UPDATED)
    (3 min) Data is a crucial component of machine learning. The field is reliant on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been on computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision's propensity to shape machine learning research and impact human life, we seek to understand disciplinary practices around dataset documentation - how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examine what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks. Through both a structured and thematic content analysis, we document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision datasets authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identify sit in opposition with social computing practices. We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.
    Is this Harmful? Learning to Predict Harmfulness Ratings from Video. (arXiv:2106.08323v2 [cs.CV] UPDATED)
    (2 min) Automatically identifying harmful content in video is an important task with a wide range of applications. However, due to the difficulty of collecting high-quality labels as well as demanding computational requirements, the task has not yet had a fully general approach. Typically, only small subsets of the problem are considered, such as identifying violent content. In cases where the general problem is tackled, approximations and simplifications are made to deal with the lack of labels and computational complexity. In this work, we identify and tackle some of the main obstacles. First, we create an open dataset of 3589 video clips from film trailers, annotated by professionals in the field. Second, we perform an analysis of our constructed dataset, investigating among other things the relation between clip and trailer level annotations. Lastly, we train audiovisual models on our dataset and conduct an in-depth study on our modeling choices. We find that results greatly improve by combining the visual and audio modality and that pre-training on large-scale video recognition datasets as well as class balanced sampling further improves performance. Further details of our dataset are available at this webpage: https://vidharm.github.io/.
    A review of deep learning methods for MRI reconstruction. (arXiv:2109.08618v1 [eess.IV])
    (2 min) Following the success of deep learning in a wide range of applications, neural network-based machine-learning techniques have received significant interest for accelerating magnetic resonance imaging (MRI) acquisition and reconstruction strategies. A number of ideas inspired by deep learning techniques for computer vision and image processing have been successfully applied to nonlinear image reconstruction in the spirit of compressed sensing for accelerated MRI. Given the rapidly growing nature of the field, it is imperative to consolidate and summarize the large number of deep learning methods that have been reported in the literature, to obtain a better understanding of the field in general. This article provides an overview of the recent developments in neural-network based approaches that have been proposed specifically for improving parallel imaging. A general background and introduction to parallel MRI is also given from a classical view of k-space based reconstruction methods. Image domain based techniques that introduce improved regularizers are covered along with k-space based methods which focus on better interpolation strategies using neural networks. While the field is rapidly evolving with thousands of papers published each year, in this review, we attempt to cover broad categories of methods that have shown good performance on publicly available data sets. Limitations and open problems are also discussed and recent efforts for producing open data sets and benchmarks for the community are examined.
    Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation. (arXiv:2108.08202v2 [eess.IV] UPDATED)
    (2 min) Internet video delivery has undergone a tremendous explosion of growth over the past few years. However, the quality of a video delivery system greatly depends on Internet bandwidth. Deep Neural Networks (DNNs) have recently been utilized to improve the quality of video delivery. These methods divide a video into chunks, and stream LR video chunks and corresponding content-aware models to the client. The client runs the inference of models to super-resolve the LR chunks. Consequently, a large number of models are streamed in order to deliver a video. In this paper, we first carefully study the relation between models of different chunks, then we tactfully design a joint training framework along with the Content-aware Feature Modulation (CaFM) layer to compress these models for neural video delivery. With our method, each video chunk only requires less than 1% of the original parameters to be streamed, achieving even better SR performance. We conduct extensive experiments across various SR backbones, video time length, and scaling factors to demonstrate the advantages of our method. Besides, our method can also be viewed as a new approach to video coding. Our primary experiments achieve better video quality compared with the commercial H.264 and H.265 standards under the same storage cost, showing the great potential of the proposed method. Code is available at: https://github.com/Neural-video-delivery/CaFM-Pytorch-ICCV2021
    WebQA: Multihop and Multimodal QA. (arXiv:2109.00590v2 [cs.CL] UPDATED)
    (2 min) Web search is fundamentally multimodal and multihop. Often, even before asking a question we choose to go directly to image search to find our answers. Further, rarely do we find an answer from a single source but aggregate information and reason through implications. Despite the frequency of this everyday occurrence, at present, there is no unified question answering benchmark that requires a single model to answer long-form natural language questions from text and open-ended visual sources -- akin to a human's experience. We propose to bridge this gap between the natural language and computer vision communities with WebQA. We show that A. our multihop text queries are difficult for a large-scale transformer model, and B. existing multi-modal transformers and visual representations do not perform well on open-domain visual queries. Our challenge for the community is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
    Autonomous Vision-based UAV Landing with Collision Avoidance using Deep Learning. (arXiv:2109.08628v1 [cs.LG])
    (2 min) There is a risk of collision when multiple UAVs land simultaneously without communication on the same platform. This work accomplishes vision-based autonomous landing and uses a deep-learning-based method to realize collision avoidance during the landing process.
    Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms. (arXiv:2107.05637v2 [cs.CV] UPDATED)
    (2 min) Self-Attention has become prevalent in computer vision models. Inspired by fully connected Conditional Random Fields (CRFs), we decompose it into local and context terms. They correspond to the unary and binary terms in CRF and are implemented by attention mechanisms with projection matrices. We observe that the unary terms only make small contributions to the outputs, and meanwhile standard CNNs that rely solely on the unary terms achieve great performances on a variety of tasks. Therefore, we propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating it with convolutions, and utilizes a fusion module to dynamically couple the unary and binary operations. In our experiments, we replace the self-attention modules with LESA. The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation. The code is made publicly available.
    Monitoring Indoor Activity of Daily Living Using Thermal Imaging: A Case Study. (arXiv:2109.08672v1 [cs.CV])
    (2 min) Monitoring the indoor activities of daily living (ADLs) of a person is neither an easy nor an accurate process. It depends on sensor type, power supply stability, and connectivity stability, not to mention artifacts introduced by the person himself. Multiple challenges have to be overcome in this field, such as monitoring the precise spatial location of the person and estimating vital signs like an individual's average temperature. Privacy is another aspect of the problem that must be treated with care. Identifying the person's posture without a camera is another challenge; posture identification assists in fall detection. Thermal imaging could be a suitable solution for most of these challenges: it allows monitoring of both the person's average temperature and spatial location while maintaining privacy. In this research, we propose an IoT system for monitoring indoor ADLs using a thermal sensor array (TSA). Three classes of ADLs are introduced: daily activity, sleeping activity, and no activity. Estimating a person's average temperature using TSAs is introduced in this paper as well. Results show that the three activity classes can be identified, as well as the person's average temperature during day and night. The person's spatial location can also be determined while maintaining his/her privacy.
    Semantic Snapping for Guided Multi-View Visualization Design. (arXiv:2109.08384v1 [cs.HC])
    (2 min) Visual information displays are typically composed of multiple visualizations that are used to facilitate an understanding of the underlying data. A common example are dashboards, which are frequently used in domains such as finance, process monitoring and business intelligence. However, users may not be aware of existing guidelines and lack expert design knowledge when composing such multi-view visualizations. In this paper, we present semantic snapping, an approach to help non-expert users design effective multi-view visualizations from sets of pre-existing views. When a particular view is placed on a canvas, it is "aligned" with the remaining views -- not with respect to its geometric layout, but based on aspects of the visual encoding itself, such as how data dimensions are mapped to channels. Our method uses an on-the-fly procedure to detect and suggest resolutions for conflicting, misleading, or ambiguous designs, as well as to provide suggestions for alternative presentations. With this approach, users can be guided to avoid common pitfalls encountered when composing visualizations. Our provided examples and case studies demonstrate the usefulness and validity of our approach.
    Post-OCR Paragraph Recognition by Graph Convolutional Networks. (arXiv:2101.12741v5 [cs.CV] UPDATED)
    (2 min) We propose a new approach for paragraph recognition in document images by spatial graph convolutional networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a beta-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With only pure layout input features, the GCN model size is 3-4 orders of magnitude smaller compared to R-CNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.
    GKNet: grasp keypoint network for grasp candidates detection. (arXiv:2106.08497v2 [cs.RO] UPDATED)
    (2 min) Contemporary grasp detection approaches employ deep learning to achieve robustness to sensor and object model uncertainty. The two dominant approaches design either grasp-quality scoring or anchor-based grasp recognition networks. This paper presents a different approach to grasp detection by treating it as keypoint detection. The deep network detects each grasp candidate as a pair of keypoints, convertible to the grasp representation g = {x, y, w, θ}^T, rather than a triplet or quartet of corner points. Decreasing the detection difficulty by grouping keypoints into pairs boosts performance. To further promote dependencies between keypoints, the general non-local module is incorporated into the proposed learning framework. A final filtering strategy based on discrete and continuous orientation prediction removes false correspondences and further improves grasp detection performance. GKNet, the approach presented here, achieves the best balance of accuracy and speed on the Cornell and the abridged Jacquard dataset (96.9% and 98.39% at 41.67 and 23.26 fps). Follow-up experiments on a manipulator evaluate GKNet using 4 types of grasping experiments reflecting different nuisance sources: static grasping, dynamic grasping, grasping at varied camera angles, and bin picking. GKNet outperforms reference baselines in static and dynamic grasping experiments while showing robustness to varied camera viewpoints and bin picking experiments. The results confirm the hypothesis that grasp keypoints are an effective output representation for deep grasp networks that provide robustness to expected nuisance factors.
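    Converting a detected keypoint pair into the stated grasp representation g = {x, y, w, θ}^T is straightforward geometry; as a hedged sketch using the midpoint, distance, and segment orientation (this follows the stated representation, not GKNet's actual post-processing code):

        import math

        def keypoints_to_grasp(p1, p2):
            """Map a keypoint pair to (x, y, w, theta): grasp centre, width, angle."""
            (x1, y1), (x2, y2) = p1, p2
            x, y = (x1 + x2) / 2.0, (y1 + y2) / 2.0
            w = math.hypot(x2 - x1, y2 - y1)
            theta = math.atan2(y2 - y1, x2 - x1)
            return x, y, w, theta

        print(keypoints_to_grasp((100, 40), (140, 70)))  # (120.0, 55.0, 50.0, ~0.64 rad)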
    Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions. (arXiv:2106.04484v2 [cs.CV] UPDATED)
    (2 min) Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a more careful look reveals that they often do not understand the rich signal they are being fed with. To understand and better measure the generalization capabilities of VQA systems, we look at their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases which reveal that current VQA systems are still brittle. Finally, we connect robustness and generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.
    PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering. (arXiv:2109.08379v1 [cs.CV])
    (2 min) Generating portrait images by controlling the motions of existing faces is an important task of great consequence to social media industries. For easy use and intuitive control, semantically meaningful and fully disentangled parameters should be used as modifications. However, many existing techniques do not provide such fine-grained controls or use indirect editing methods, i.e., mimicking the motions of other individuals. In this paper, a Portrait Image Neural Renderer (PIRenderer) is proposed to control the face motions with the parameters of three-dimensional morphable face models (3DMMs). The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications. Experiments on both direct and indirect editing tasks demonstrate the superiority of this model. Meanwhile, we further extend this model to tackle the audio-driven facial reenactment task by extracting sequential motions from audio inputs. We show that our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream. Our source code is available at https://github.com/RenYurui/PIRender.
    Pointly-supervised 3D Scene Parsing with Viewpoint Bottleneck. (arXiv:2109.08553v1 [cs.CV])
    (2 min) Semantic understanding of 3D point clouds is important for various robotics applications. Given that point-wise semantic annotation is expensive, in this paper, we address the challenge of learning models with extremely sparse labels. The core problem is how to leverage numerous unlabeled points. To this end, we propose a self-supervised 3D representation learning framework named viewpoint bottleneck. It optimizes a mutual-information based objective, which is applied on point clouds under different viewpoints. A principled analysis shows that viewpoint bottleneck leads to an elegant surrogate loss function that is suitable for large-scale point cloud data. Compared with prior methods based upon contrastive learning, viewpoint bottleneck operates on the feature dimension instead of the sample dimension. This paradigm shift has several advantages: It is easy to implement and tune, does not need negative samples and performs better on our target downstream task. We evaluate our method on the public benchmark ScanNet, under the pointly-supervised setting. We achieve the best quantitative results among comparable solutions. Meanwhile, we provide an extensive qualitative inspection on various challenging scenes. They demonstrate that our models can produce fairly good scene parsing results for robotics applications. Our code, data and models will be made public.
    Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images. (arXiv:2106.00116v3 [cs.LG] UPDATED)
    (3 min) Transfer learning aims to exploit pre-trained models for more efficient follow-up training on a wide range of downstream tasks and datasets, enabling successful training also on small data. Recently, strong improvement was shown for transfer learning and model generalization when increasing model, data and compute budget scale in the pre-training. To compare the effect of scale on both intra- and inter-domain full and few-shot transfer, in this study we combine for the first time large openly available medical X-Ray chest imaging datasets to reach a dataset scale comparable to ImageNet-1k. We then conduct pre-training and transfer to different natural or medical targets while varying network size and source data scale and domain, being either large natural (ImageNet-1k/21k) or large medical chest X-Ray datasets. We observe strong improvement due to larger pre-training scale for intra-domain natural-natural and medical-medical transfer. For inter-domain natural-medical transfer, we find improvements due to larger pre-training scale on larger X-Ray targets in the full-shot regime, while for smaller targets and for the few-shot regime the improvement is not visible. Remarkably, large networks pre-trained on the very large natural ImageNet-21k are as good or better than networks pre-trained on the largest available medical X-Ray data when performing transfer to large X-Ray targets. We conclude that high quality models for inter-domain transfer can also be obtained by substantially increasing the scale of model and generic natural source data, removing the necessity for large domain-specific medical source data in the pre-training. Code is available at: https://github.com/SLAMPAI/large-scale-pretraining-transfer
    GraFormer: Graph Convolution Transformer for 3D Pose Estimation. (arXiv:2109.08364v1 [cs.CV])
    (2 min) Exploiting relations among 2D joints plays a crucial role yet remains under-developed in 2D-to-3D pose estimation. To alleviate this issue, we propose GraFormer, a novel transformer architecture combined with graph convolution for 3D pose estimation. The proposed GraFormer comprises two repeatedly stacked core modules, GraAttention and the ChebGConv block. GraAttention enables all 2D joints to interact in a global receptive field without weakening the graph structure information of joints, which introduces vital features for later modules. Unlike vanilla graph convolutions that only model the apparent relationship of joints, the ChebGConv block enables 2D joints to interact in the high-order sphere, which formulates their hidden implicit relations. We empirically show the superiority of GraFormer through extensive experiments across popular benchmarks. Specifically, GraFormer outperforms the state of the art on the Human3.6M dataset while using only 18% of the parameters. The code is available at https://github.com/Graformer/GraFormer.
    GoG: Relation-aware Graph-over-Graph Network for Visual Dialog. (arXiv:2109.08475v1 [cs.CL])
    (2 min) Visual dialog, which aims to hold a meaningful conversation with humans about a given image, is a challenging task that requires models to reason about the complex dependencies among visual content, dialog history, and current questions. Graph neural networks have recently been applied to model the implicit relations between objects in an image or dialog. However, they neglect the importance of 1) coreference relations among dialog history and dependency relations between words for the question representation; and 2) the representation of the image based on the fully represented question. Therefore, we propose a novel relation-aware graph-over-graph network (GoG) for visual dialog. Specifically, GoG consists of three sequential graphs: 1) H-Graph, which aims to capture coreference relations among dialog history; 2) History-aware Q-Graph, which aims to fully understand the question by capturing dependency relations between words based on coreference resolution over the dialog history; and 3) Question-aware I-Graph, which aims to capture the relations between objects in an image based on the fully represented question. As an additional feature representation module, we add GoG to the existing visual dialog model. Experimental results show that our model outperforms the strong baseline in both generative and discriminative settings by a significant margin.
    Multi-Level Visual Similarity Based Personalized Tourist Attraction Recommendation Using Geo-Tagged Photos. (arXiv:2109.08275v1 [cs.MM])
    (2 min) Geo-tagged photo based tourist attraction recommendation can discover users' travel preferences from their taken photos, so as to recommend suitable tourist attractions to them. However, existing visual content based methods cannot fully exploit the user and tourist attraction information of photos to extract visual features, and do not differentiate the significances of different photos. In this paper, we propose multi-level visual similarity based personalized tourist attraction recommendation using geo-tagged photos (MEAL). MEAL utilizes the visual contents of photos and interaction behavior data to obtain the final embeddings of users and tourist attractions, which are then used to predict the visit probabilities. Specifically, by crossing the user and tourist attraction information of photos, we define four visual similarity levels and introduce a corresponding quintuplet loss to embed the visual contents of photos. In addition, to capture the significances of different photos, we exploit the self-attention mechanism to obtain the visual representations of users and tourist attractions. We conducted experiments on a dataset crawled from Flickr, and the experimental results proved the advantage of this method.
    ActionCLIP: A New Paradigm for Video Action Recognition. (arXiv:2109.08472v1 [cs.CV])
    (2 min) The canonical approach to video action recognition requires a neural model to perform a classic and standard 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, limiting their transferability to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameter requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then it reformulates the action recognition task to act more like the pre-training problem via prompt engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on general action recognition, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
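    To illustrate the video-text matching view of zero-shot recognition, here is a minimal sketch: label texts are embedded (e.g. via a prompt template), the video is embedded, and the prediction is the most similar label. The encoders, the prompt template, and the temperature are assumptions, not the paper's actual components.

```python
import torch
import torch.nn.functional as F

def zero_shot_action_scores(video_emb, label_embs, temperature=0.07):
    """Score action labels for one video by video-text similarity (sketch).

    video_emb: (D,) embedding from some video encoder.
    label_embs: (C, D) embeddings of prompted label texts,
        e.g. "a video of a person doing <label>" (prompt is illustrative).
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(label_embs, dim=-1)
    logits = (t @ v) / temperature      # cosine similarity per class
    return logits.softmax(dim=-1)       # probability over the class labels

probs = zero_shot_action_scores(torch.randn(512), torch.randn(400, 512))
print(probs.argmax().item())            # index of the best-matching action label
```

    Because classification reduces to nearest-text matching, new labels can be added at inference time by embedding their prompted text, which is what makes the zero-shot transfer described above possible.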
    Including Keyword Position in Image-based Models for Act Segmentation of Historical Registers. (arXiv:2109.08477v1 [cs.CV])
    (2 min) The segmentation of complex images into semantic regions has seen growing interest in recent years with the advent of Deep Learning. Until recently, most existing methods for Historical Document Analysis focused on the visual appearance of documents, ignoring the rich information that textual content can offer. However, the segmentation of complex documents into semantic regions is sometimes impossible relying only on visual features, and recent models embed both visual and textual information. In this paper, we focus on the use of both visual and textual information for segmenting historical registers into structured and meaningful units such as acts. An act is a text recording containing valuable knowledge such as demographic information (baptism, marriage or death) or royal decisions (donation or pardon). We propose a simple pipeline to enrich document images with the position of text lines containing key-phrases and show that running a standard image-based layout analysis system on these images can lead to significant gains. Our experiments show that the detection of acts increases from 38% to 74% mAP when adding textual information, in real use-case conditions where text line positions and content are extracted with an automatic recognition system.
    LoGG3D-Net: Locally Guided Global Descriptor Learning for 3D Place Recognition. (arXiv:2109.08336v1 [cs.CV])
    (2 min) Retrieval-based place recognition is an efficient and effective solution for enabling re-localization within a pre-built map or global data association for Simultaneous Localization and Mapping (SLAM). The accuracy of such an approach is heavily dependent on the quality of the extracted scene-level representation. While end-to-end solutions, which learn a global descriptor from input point clouds, have demonstrated promising results, such approaches are limited in their ability to enforce desirable properties at the local feature level. In this paper, we demonstrate that the inclusion of an additional training signal (local consistency loss) can guide the network towards learning local features that are consistent across revisits, leading to more repeatable global descriptors and an overall improvement in place recognition performance. We formulate our approach in an end-to-end trainable architecture called LoGG3D-Net. Experiments on two large-scale public benchmarks (KITTI and MulRan) show that our method achieves mean $F1_{max}$ scores of $0.939$ and $0.968$ on KITTI and MulRan, respectively, while operating in near real-time.
    TSS: Transformation-Specific Smoothing for Robustness Certification. (arXiv:2002.12398v4 [cs.LG] UPDATED)
    (3 min) As machine learning (ML) systems become pervasive, safeguarding their security is critical. However, recently it has been demonstrated that motivated adversaries are able to mislead ML systems by perturbing test data using semantic transformations. While there exists a rich body of research providing provable robustness guarantees for ML models against $\ell_p$ norm bounded adversarial perturbations, guarantees against semantic perturbations remain largely underexplored. In this paper, we provide TSS -- a unified framework for certifying ML robustness against general adversarial semantic transformations. First, depending on the properties of each transformation, we divide common transformations into two categories, namely resolvable (e.g., Gaussian blur) and differentially resolvable (e.g., rotation) transformations. For the former, we propose transformation-specific randomized smoothing strategies and obtain strong robustness certification. The latter category covers transformations that involve interpolation errors, and we propose a novel approach based on stratified sampling to certify the robustness. Our framework TSS leverages these certification strategies and combines with consistency-enhanced training to provide rigorous certification of robustness. We conduct extensive experiments on over ten types of challenging semantic transformations and show that TSS significantly outperforms the state of the art. Moreover, to the best of our knowledge, TSS is the first approach that achieves nontrivial certified robustness on the large-scale ImageNet dataset. For instance, our framework achieves 30.4% certified robust accuracy against rotation attack (within $\pm 30^\circ$) on ImageNet. Moreover, to consider a broader range of transformations, we show TSS is also robust against adaptive attacks and unforeseen image corruptions such as CIFAR-10-C and ImageNet-C.
    LOF: Structure-Aware Line Tracking based on Optical Flow. (arXiv:2109.08466v1 [cs.CV])
    (2 min) Lines provide significantly richer geometric structural information about the environment than points, so they are widely used in recent Visual Odometry (VO) works. Since line-based VO uses line tracking results for localization and mapping, line tracking is a crucial component of VO. Although state-of-the-art line tracking methods have made great progress, they are still heavily dependent on line detection or the predicted line segments. In order to relieve these dependencies and track line segments completely, accurately, and robustly with higher computational efficiency, we propose a structure-aware Line tracking algorithm based entirely on Optical Flow (LOF). Firstly, we propose a gradient-based strategy to sample pixels on lines that are suitable for line optical flow calculation. Then, in order to align each line by fully using the structural relationship between the points sampled on it, while effectively removing the influence of sampled points occluded by other objects, we propose a two-step structure-aware line segment alignment method. Furthermore, we propose a line refinement method to refine the orientation, position, and endpoints of the aligned line segments. Extensive experimental results demonstrate that the proposed LOF outperforms state-of-the-art methods in line tracking accuracy, robustness, and efficiency, and also improves the location accuracy and robustness of line-based VO systems.
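    As a rough illustration of the underlying idea of tracking a line by tracking points sampled on it with optical flow, a minimal OpenCV sketch follows. LOF's gradient-based sampling, two-step structure-aware alignment, and refinement are not reproduced; this simply samples points uniformly, tracks them with pyramidal Lucas-Kanade flow, and refits a line.

```python
import cv2
import numpy as np

def track_line_with_optical_flow(prev_gray, next_gray, p0, p1, n_samples=16):
    """Track one line segment across two grayscale frames (illustrative sketch).

    prev_gray, next_gray: consecutive grayscale (uint8) frames.
    p0, p1: endpoints (x, y) of the segment in the previous frame.
    """
    ts = np.linspace(0.0, 1.0, n_samples, dtype=np.float32)
    pts = (1 - ts)[:, None] * np.float32(p0) + ts[:, None] * np.float32(p1)
    pts = pts.reshape(-1, 1, 2)                      # shape expected by OpenCV

    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = nxt[status.ravel() == 1].reshape(-1, 2)   # keep successfully tracked points
    if len(good) < 2:
        return None                                  # tracking failed for this segment

    # refit a 2D line (direction + point) to the surviving tracked points
    vx, vy, x0, y0 = cv2.fitLine(good, cv2.DIST_L2, 0, 0.01, 0.01).ravel()
    return (vx, vy, x0, y0)
```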
    Messing Up 3D Virtual Environments: Transferable Adversarial 3D Objects. (arXiv:2109.08465v1 [cs.CV])
    (2 min) In the last few years, the scientific community has shown a remarkable and increasing interest towards 3D Virtual Environments, training and testing Machine Learning-based models in realistic virtual worlds. On one hand, these environments could also become a means to study the weaknesses of Machine Learning algorithms, or to simulate training settings that allow Machine Learning models to gain robustness to 3D adversarial attacks. On the other hand, their growing popularity might also attract those that aim at creating adversarial conditions to invalidate the benchmarking process, especially in the case of public environments that allow contributions from a large community of people. Most of the existing Adversarial Machine Learning approaches are focused on static images, and little work has been done in studying how to deal with 3D environments and how a 3D object should be altered to fool a classifier that observes it. In this paper, we study how to craft adversarial 3D objects by altering their textures, using a tool chain composed of easily accessible elements. We show that it is possible, and indeed simple, to create adversarial objects using off-the-shelf limited surrogate renderers that can compute gradients with respect to the parameters of the rendering process, and, to a certain extent, to transfer the attacks to more advanced 3D engines. We propose a saliency-based attack that intersects the two classes of renderers in order to focus the alteration on those texture elements that are estimated to be effective in the target engine, evaluating its impact on popular neural classifiers.
    Bayesian Triplet Loss: Uncertainty Quantification in Image Retrieval. (arXiv:2011.12663v3 [cs.CV] UPDATED)
    (2 min) Uncertainty quantification in image retrieval is crucial for downstream decisions, yet it remains a challenging and largely unexplored problem. Current methods for estimating uncertainties are poorly calibrated, computationally expensive, or based on heuristics. We present a new method that views image embeddings as stochastic features rather than deterministic features. Our two main contributions are (1) a likelihood that matches the triplet constraint and that evaluates the probability of an anchor being closer to a positive than a negative; and (2) a prior over the feature space that justifies the conventional l2 normalization. To ensure computational efficiency, we derive a variational approximation of the posterior, called the Bayesian triplet loss, that produces state-of-the-art uncertainty estimates and matches the predictive performance of current state-of-the-art methods.
    Realistic PointGoal Navigation via Auxiliary Losses and Information Bottleneck. (arXiv:2109.08677v1 [cs.CV])
    (2 min) We propose a novel architecture and training paradigm for training realistic PointGoal Navigation -- navigating to a target coordinate in an unseen environment under actuation and sensor noise without access to ground-truth localization. Specifically, we find that the primary challenge under this setting is learning localization -- when stripped of idealized localization, agents fail to stop precisely at the goal despite reliably making progress towards it. To address this we introduce a set of auxiliary losses to help the agent learn localization. Further, we explore the idea of treating the precise location of the agent as privileged information -- it is unavailable during test time, however, it is available during training time in simulation. We grant the agent restricted access to ground-truth localization readings during training via an information bottleneck. Under this setting, the agent incurs a penalty for using this privileged information, encouraging the agent to only leverage this information when it is crucial to learning. This enables the agent to first learn navigation and then learn localization instead of conflating these two objectives in training. We evaluate our proposed method both in a semi-idealized (noiseless simulation without Compass+GPS) and realistic (addition of noisy simulation) settings. Specifically, our method outperforms existing baselines on the semi-idealized setting by 18\%/21\% SPL/Success and by 15\%/20\% SPL in the realistic setting. Our improved Success and SPL metrics indicate our agent's improved ability to accurately self-localize while maintaining a strong navigation policy. Our implementation can be found at https://github.com/NicoGrande/habitat-pointnav-via-ib.
    What we see and What we don't see: Imputing Occluded Crowd Structures from Robot Sensing. (arXiv:2109.08494v1 [cs.RO])
    (2 min) We consider the navigation of mobile robots in crowded environments, for which onboard sensing of the crowd is typically limited by occlusions. We address the problem of inferring the human occupancy in the space around the robot, in blind spots, beyond the range of its sensing capabilities. This problem remains largely unexplored in spite of its important impact on the efficiency and safety of robot crowd navigation, which requires estimating and predicting the crowd state around the robot. In this work, we propose the first solution for sampling predictions of possible human presence, based on the states of the small set of people sensed around the robot as well as previous observations of the crowd activity.
    Towards agricultural autonomy: crop row detection under varying field conditions using deep learning. (arXiv:2109.08247v1 [cs.CV])
    (2 min) This paper presents a novel metric to evaluate the robustness of deep learning based semantic segmentation approaches for crop row detection under different field conditions encountered by a field robot. A dataset with ten main categories encountered under various field conditions was used for testing. The effect of these conditions on the angular accuracy of crop row detection was compared. A deep convolutional encoder-decoder network is implemented to predict crop row masks from RGB input images. The predicted mask is then sent to a post-processing algorithm to extract the crop rows. When evaluated with the novel metric, the deep learning model was found to be robust against shadows and growth stages of the crop, while performance was reduced under direct sunlight, increasing weed density, tramlines, and discontinuities in crop rows.
    CardiSort: a convolutional neural network for cross vendor automated sorting of cardiac MR images. (arXiv:2109.08479v1 [eess.IV])
    (2 min) Objectives: To develop an image-based automatic deep learning method to classify cardiac MR images by sequence type and imaging plane for improved clinical post-processing efficiency. Methods: Multi-vendor cardiac MRI studies were retrospectively collected from 4 centres and 3 vendors. A two-head convolutional neural network ('CardiSort') was trained to classify 35 sequences by imaging sequence (n=17) and plane (n=10). Single vendor training (SVT) on single centre images (n=234 patients) and multi-vendor training (MVT) with multicentre images (n = 479 patients, 3 centres) was performed. Model accuracy was compared to manual ground truth labels by an expert radiologist on a hold-out test set for both SVT and MVT. External validation of MVT (MVTexternal) was performed on data from 3 previously unseen magnet systems from 2 vendors (n=80 patients). Results: High sequence and plane accuracies were observed for SVT (85.2% and 93.2% respectively), and MVT (96.5% and 98.1% respectively) on the hold-out test set. MVTexternal yielded sequence accuracy of 92.7% and plane accuracy of 93.0%. There was high accuracy for common sequences and conventional cardiac planes. Poor accuracy was observed for underrepresented classes and sequences where there was greater variability in acquisition parameters across centres, such as perfusion imaging. Conclusions: A deep learning network was developed on multivendor data to classify MRI studies into component sequences and planes, with external validation. With refinement, it has potential to improve workflow by enabling automated sequence selection, an important first step in completely automated post-processing pipelines.
    Focus on Impact: Indoor Exploration with Intrinsic Motivation. (arXiv:2109.08521v1 [cs.RO])
    (2 min) Exploration of indoor environments has recently attracted significant interest, thanks also to the introduction of deep neural agents built in a hierarchical fashion and trained with Deep Reinforcement Learning (DRL) on simulated environments. Current state-of-the-art methods employ a dense extrinsic reward that requires complete a priori knowledge of the layout of the training environment to learn an effective exploration policy. However, such information is expensive to gather in terms of time and resources. In this work, we propose to train the model with a purely intrinsic reward signal to guide exploration, which is based on the impact of the robot's actions on the environment. So far, impact-based rewards have been employed for simple tasks and in procedurally generated synthetic environments with countable states. Since the number of states observable by the agent in realistic indoor environments is non-countable, we include a neural-based density model and replace the traditional count-based regularization with an estimated pseudo-count of previously visited states. The proposed exploration approach outperforms DRL-based competitors relying on intrinsic rewards and surpasses agents trained with a dense extrinsic reward computed from the environment layouts. We also show that a robot equipped with the proposed approach seamlessly adapts to point-goal navigation and real-world deployment.
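    A minimal sketch of an impact-style intrinsic reward discounted by a pseudo-count follows. The neural density model, the exact impact measure, and the normalization used in the paper are not described in the abstract, so the concrete form below is an assumption purely for intuition.

```python
import numpy as np

def impact_reward(obs_prev, obs_curr, pseudo_count):
    """Intrinsic reward sketch: impact of an action, scaled down for familiar states.

    obs_prev, obs_curr: feature vectors of consecutive observations.
    pseudo_count: density-model estimate of how often the current state was visited
        (the paper derives this from a learned neural density model).
    """
    impact = np.linalg.norm(obs_curr - obs_prev)    # how much the action changed the observation
    return impact / np.sqrt(pseudo_count + 1.0)     # discount states that were seen often

r = impact_reward(np.zeros(128), np.full(128, 0.1), pseudo_count=4.0)
```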
    Diverse Generation from a Single Video Made Possible. (arXiv:2109.08591v1 [cs.CV])
    (2 min) Most advanced video generation and manipulation methods train on a large collection of videos. As such, they are restricted to the types of video dynamics they train on. To overcome this limitation, GANs trained on a single video were recently proposed. While these provide more flexibility to a wide variety of video dynamics, they require days to train on a single tiny input video, rendering them impractical. In this paper we present a fast and practical method for video generation and manipulation from a single natural video, which generates diverse high-quality video outputs within seconds (for benchmark videos). Our method can be further applied to Full-HD video clips within minutes. Our approach is inspired by a recent advanced patch-nearest-neighbor based approach [Granot et al. 2021], which was shown to significantly outperform single-image GANs, both in run-time and in visual quality. Here we generalize this approach from images to videos, by casting classical space-time patch-based methods as a new generative video model. We adapt the generative image patch nearest neighbor approach to efficiently cope with the huge number of space-time patches in a single video. Our method generates more realistic and higher quality results than single-video GANs (confirmed by quantitative and qualitative evaluations). Moreover, it is disproportionally faster (runtime reduced from several days to seconds). Other than diverse video generation, we demonstrate several other challenging video applications, including spatio-temporal video retargeting, video structural analogies and conditional video-inpainting.
    VideoDG: Generalizing Temporal Relations in Videos to Novel Domains. (arXiv:1912.03716v2 [cs.CV] UPDATED)
    (2 min) This paper introduces video domain generalization where most video classification networks degenerate due to the lack of exposure to the target domains of divergent distributions. We observe that the global temporal features are less generalizable, due to the temporal domain shift that videos from other unseen domains may have an unexpected absence or misalignment of the temporal relations. This finding has motivated us to solve video domain generalization by effectively learning the local-relation features of different timescales that are more generalizable, and exploiting them along with the global-relation features to maintain the discriminability. This paper presents the VideoDG framework with two technical contributions. The first is a new deep architecture named the Adversarial Pyramid Network, which improves the generalizability of video features by capturing the local-relation, global-relation, and cross-relation features progressively. On the basis of pyramid features, the second contribution is a new and robust approach of adversarial data augmentation that can bridge different video domains by improving the diversity and quality of augmented data. We construct three video domain generalization benchmarks in which domains are divided according to different datasets, different consequences of actions, or different camera views, respectively. VideoDG consistently outperforms the combinations of previous video classification models and existing domain generalization methods on all benchmarks.
    Self-Supervised Neural Architecture Search for Imbalanced Datasets. (arXiv:2109.08580v1 [cs.CV])
    (2 min) Neural Architecture Search (NAS) provides state-of-the-art results when trained on well-curated datasets with annotated labels. However, annotating data or even having a balanced number of samples can be a luxury for practitioners from different scientific fields, e.g., in the medical domain. To that end, we propose a NAS-based framework with three contributions: (a) we focus on the self-supervised scenario, i.e., where no labels are required to determine the architecture; (b) we assume the datasets are imbalanced; and (c) we design each component to be able to run on a resource-constrained setup, i.e., on a single GPU (e.g., Google Colab). Our components build on top of recent developments in self-supervised learning~\citep{zbontar2021barlow} and self-supervised NAS~\citep{kaplan2020self}, and extend them to the case of imbalanced datasets. We conduct experiments on an (artificially) imbalanced version of CIFAR-10 and demonstrate that our proposed method outperforms standard neural networks, while using $27\times$ fewer parameters. To validate our assumption on a naturally imbalanced dataset, we also conduct experiments on ChestMNIST and COVID-19 X-ray. The results demonstrate how the proposed method can be used on imbalanced datasets, while it can be fully run on a single GPU. Code is available \href{https://github.com/TimofeevAlex/ssnas_imbalanced}{here}.
    Expression Snippet Transformer for Robust Video-based Facial Expression Recognition. (arXiv:2109.08409v1 [cs.CV])
    (2 min) The recent success of the Transformer has provided a new direction to various visual understanding tasks, including video-based facial expression recognition (FER). By modeling visual relations effectively, the Transformer has shown its power for describing complicated patterns. However, the Transformer still struggles to notice subtle facial expression movements, because the expression movements in many videos can be too small to extract meaningful spatial-temporal relations and achieve robust performance. To this end, we propose to decompose each video into a series of expression snippets, each of which contains a small number of facial movements, and attempt to augment the Transformer's ability to model intra-snippet and inter-snippet visual relations, respectively, obtaining the Expression snippet Transformer (EST). In particular, for intra-snippet modeling, we devise an attention-augmented snippet feature extractor (AA-SFE) to enhance the encoding of subtle facial movements of each snippet by gradually attending to more salient information. In addition, for inter-snippet modeling, we introduce a shuffled snippet order prediction (SSOP) head and a corresponding loss to improve the modeling of subtle motion changes across subsequent snippets by training the Transformer to identify shuffled snippet orders. Extensive experiments on four challenging datasets (i.e., BU-3DFE, MMI, AFEW, and DFEW) demonstrate that our EST is superior to other CNN-based methods, obtaining state-of-the-art performance.
    KATANA: Simple Post-Training Robustness Using Test Time Augmentations. (arXiv:2109.08191v1 [cs.CV])
    (2 min) Although Deep Neural Networks (DNNs) achieve excellent performance on many real-world tasks, they are highly vulnerable to adversarial attacks. A leading defense against such attacks is adversarial training, a technique in which a DNN is trained to be robust to adversarial attacks by introducing adversarial noise to its input. This procedure is effective but must be done during the training phase. In this work, we propose a new simple and easy-to-use technique, KATANA, for robustifying an existing pretrained DNN without modifying its weights. For every image, we generate N randomized Test Time Augmentations (TTAs) by applying diverse color, blur, noise, and geometric transforms. Next, we utilize the DNN's logits output to train a simple random forest classifier to predict the real class label. Our strategy achieves state-of-the-art adversarial robustness on diverse attacks with minimal compromise on the natural images' classification. We test KATANA also against two adaptive white-box attacks and it shows excellent results when combined with adversarial training. Code is available in https://github.com/giladcohen/KATANA.
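    The overall recipe described above (augment the input N times, stack the pretrained model's logits as a feature vector, train a random forest on top) can be sketched in a few lines. The specific color/blur/noise/geometric transforms, their parameters, and N are not given in the abstract, so the choices below are assumptions.

```python
import numpy as np
import torch
import torchvision.transforms as T
from sklearn.ensemble import RandomForestClassifier

# Illustrative augmentation set (not the paper's exact transforms or parameters).
tta = T.Compose([
    T.RandomResizedCrop(32, scale=(0.8, 1.0)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.GaussianBlur(kernel_size=3),
])

def tta_logit_features(model, image, n_aug=16):
    """Stack a frozen model's logits over N test-time augmentations of one image.

    image: a 3xHxW float tensor (tensor-based transforms require a recent torchvision).
    """
    model.eval()
    with torch.no_grad():
        logits = [model(tta(image).unsqueeze(0)).squeeze(0) for _ in range(n_aug)]
    return torch.cat(logits).numpy()   # one flat feature vector per image

# Then fit a simple random forest on the TTA-logit features of training images:
# X = np.stack([tta_logit_features(model, img) for img in train_images])
# rf = RandomForestClassifier(n_estimators=200).fit(X, train_labels)
# pred = rf.predict(tta_logit_features(model, test_image)[None, :])
```

    The appeal of this setup is that the pretrained network's weights are never touched; only the lightweight classifier on top of the augmented logits is trained.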
    A computationally efficient framework for vector representation of persistence diagrams. (arXiv:2109.08239v1 [cs.CV])
    (2 min) In Topological Data Analysis, a common way of quantifying the shape of data is to use a persistence diagram (PD). PDs are multisets of points in $\mathbb{R}^2$ computed using tools of algebraic topology. However, this multi-set structure limits the utility of PDs in applications. Therefore, in recent years efforts have been directed towards extracting informative and efficient summaries from PDs to broaden the scope of their use for machine learning tasks. We propose a computationally efficient framework to convert a PD into a vector in $\mathbb{R}^n$, called a vectorized persistence block (VPB). We show that our representation possesses many of the desired properties of vector-based summaries such as stability with respect to input noise, low computational cost and flexibility. Through simulation studies, we demonstrate the effectiveness of VPBs in terms of performance and computational cost within various learning tasks, namely clustering, classification and change point detection.
    Are we ready for beyond-application high-volume data? The Reeds robot perception benchmark dataset. (arXiv:2109.08250v1 [cs.CV])
    (2 min) This paper presents a dataset, called Reeds, for research on robot perception algorithms. The dataset aims to provide demanding benchmark opportunities for algorithms, rather than providing an environment for testing application-specific solutions. A boat was selected as the logging platform in order to provide highly dynamic kinematics. The sensor package includes six high-performance vision sensors, two long-range lidars, radar, as well as GNSS and an IMU. The spatiotemporal resolution of the sensors was maximized in order to provide large variations and flexibility in the data, offering evaluation at a large number of different resolution presets based on the resolutions found in other datasets. Reeds also provides the means for a fair and reproducible comparison of algorithms, by running all evaluations on a common server backend. As the dataset contains massive-scale data, the evaluation principle also serves as a way to avoid moving data unnecessarily. It was also found that naive evaluation of algorithms, where each evaluation is computed sequentially, was not practical, as the fetch and decode task for each frame would not scale well. Instead, each frame is only decoded once and then fed to all algorithms in parallel, including GPU-based algorithms.
    Stereo Video Reconstruction Without Explicit Depth Maps for Endoscopic Surgery. (arXiv:2109.08227v1 [eess.IV])
    (2 min) We introduce the task of stereo video reconstruction or, equivalently, 2D-to-3D video conversion for minimally invasive surgical video. We design and implement a series of end-to-end U-Net-based solutions for this task by varying the input (single frame vs. multiple consecutive frames), loss function (MSE, MAE, or perceptual losses), and network architecture. We evaluate these solutions by surveying ten experts - surgeons who routinely perform endoscopic surgery. We run two separate reader studies: one evaluating individual frames and the other evaluating fully reconstructed 3D video played on a VR headset. In the first reader study, a variant of the U-Net that takes as input multiple consecutive video frames and outputs the missing view performs best. We draw two conclusions from this outcome. First, motion information coming from multiple past frames is crucial in recreating stereo vision. Second, the proposed U-Net variant can indeed exploit such motion information for solving this task. The result from the second study further confirms the effectiveness of the proposed U-Net variant. The surgeons reported that they could successfully perceive depth from the reconstructed 3D video clips. They also expressed a clear preference for the reconstructed 3D video over the original 2D video. These two reader studies strongly support the usefulness of the proposed task of stereo reconstruction for minimally invasive surgical video and indicate that deep learning is a promising approach to this task. Finally, we identify two automatic metrics, LPIPS and DISTS, that are strongly correlated with expert judgement and that could serve as proxies for the latter in future studies.
    Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction. (arXiv:2109.08371v1 [cs.CV])
    (2 min) Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive a scene, which is significant and imperative in many vision tasks. Most existing methods only consider visual cues, while neglecting the accompanying audio information, which can provide complementary information for scene understanding. In fact, there exists a strong relation between auditory and visual cues, and humans generally perceive the surrounding scene by simultaneously sensing these cues. Motivated by this, a bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map by assisting the vision modality. The proposed method consists of three parts: 1) audio-visual encoding, 2) audio-visual location, and 3) multi-cues aggregation. Firstly, a refined SoundNet architecture is adopted to encode the audio modality and obtain corresponding features, and a modified 3D ResNet-50 architecture is employed to learn visual features, containing both spatial location and temporal motion information. Secondly, an audio-visual location part is devised to locate the sound source in the visual scene by learning the correspondence between audio and visual information. Thirdly, a multi-cues aggregation part is devised to adaptively aggregate audio-visual information and a center-bias prior to generate the final visual attention map. Extensive experiments are conducted on six challenging audio-visual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, showing significant superiority over state-of-the-art visual attention models.
    TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network. (arXiv:2109.08174v1 [cs.CV])
    (2 min) Recently, face super-resolution (FSR) methods either feed the whole face image into convolutional neural networks (CNNs) or utilize extra facial priors (e.g., facial parsing maps, facial landmarks) to focus on facial structure, thereby maintaining the consistency of the facial structure while restoring facial details. However, the limited receptive fields of CNNs and inaccurate facial priors will reduce the naturalness and fidelity of the reconstructed face. In this paper, we propose a novel paradigm based on the self-attention mechanism (i.e., the core of Transformer) to fully explore the representation capacity of the facial structure feature. Specifically, we design a Transformer-CNN aggregation network (TANet) consisting of two paths, in which one path uses CNNs responsible for restoring fine-grained facial details while the other utilizes a resource-friendly Transformer to capture global information by exploiting long-distance visual relation modeling. By aggregating the features from the above two paths, the consistency of global facial structure and fidelity of local facial detail restoration are strengthened simultaneously. Experimental results of face reconstruction and recognition verify that the proposed method can significantly outperform the state-of-the-art methods.
    Transformer-Unet: Raw Image Processing with Unet. (arXiv:2109.08417v1 [eess.IV])
    (2 min) Medical image segmentation has drawn massive attention as it is important in biomedical image analysis. Good segmentation results can assist doctors with their judgement and further improve patients' experience. Among the many available pipelines in medical image analysis, Unet is one of the most popular neural networks as it keeps raw features by adding concatenation between encoder and decoder, which keeps it widely used in industry. In the meantime, as a popular model that dominates natural language processing tasks, the transformer has recently been introduced to computer vision and has shown promising results in object detection, image classification and semantic segmentation. Therefore, the combination of the transformer and Unet is expected to be more effective than either method working individually. In this article, we propose Transformer-Unet by adding transformer modules that operate on raw images instead of feature maps in Unet, and we test our network on the CT82 dataset for pancreas segmentation. We form an end-to-end network and obtain segmentation results better than many previous Unet-based algorithms in our experiments. We describe our network and report our experimental results in this paper.
    Adaptive Hierarchical Dual Consistency for Semi-Supervised Left Atrium Segmentation on Cross-Domain Data. (arXiv:2109.08311v1 [eess.IV])
    (2 min) Semi-supervised learning provides great significance in left atrium (LA) segmentation model learning with insufficient labelled data. Generalising semi-supervised learning to cross-domain data is of high importance to further improve model robustness. However, the widely existing distribution difference and sample mismatch between different data domains hinder the generalisation of semi-supervised learning. In this study, we alleviate these problems by proposing an Adaptive Hierarchical Dual Consistency (AHDC) for the semi-supervised LA segmentation on cross-domain data. The AHDC mainly consists of a Bidirectional Adversarial Inference module (BAI) and a Hierarchical Dual Consistency learning module (HDC). The BAI overcomes the difference of distributions and the sample mismatch between two different domains. It mainly learns two mapping networks adversarially to obtain two matched domains through mutual adaptation. The HDC investigates a hierarchical dual learning paradigm for cross-domain semi-supervised segmentation based on the obtained matched domains. It mainly builds two dual-modelling networks for mining the complementary information in both intra-domain and inter-domain. For the intra-domain learning, a consistency constraint is applied to the dual-modelling targets to exploit the complementary modelling information. For the inter-domain learning, a consistency constraint is applied to the LAs modelled by two dual-modelling networks to exploit the complementary knowledge among different data domains. We demonstrated the performance of our proposed AHDC on four 3D late gadolinium enhancement cardiac MR (LGE-CMR) datasets from different centres and a 3D CT dataset. Compared to other state-of-the-art methods, our proposed AHDC achieved higher segmentation accuracy, which indicated its capability in the cross-domain semi-supervised LA segmentation.
    Neural Network Based Lidar Gesture Recognition for Realtime Robot Teleoperation. (arXiv:2109.08263v1 [eess.IV])
    (2 min) We propose a novel low-complexity lidar gesture recognition system for mobile robot control robust to gesture variation. Our system uses a modular approach, consisting of a pose estimation module and a gesture classifier. Pose estimates are predicted from lidar scans using a Convolutional Neural Network trained using an existing stereo-based pose estimation system. Gesture classification is accomplished using a Long Short-Term Memory network and uses a sequence of estimated body poses as input to predict a gesture. Breaking down the pipeline into two modules reduces the dimensionality of the input, which could be lidar scans, stereo imagery, or any other modality from which body keypoints can be extracted, making our system lightweight and suitable for mobile robot control with limited computing power. The use of lidar contributes to the robustness of the system, allowing it to operate in most outdoor conditions, to be independent of lighting conditions, and for input to be detected 360 degrees around the robot. The lidar-based pose estimator and gesture classifier use data augmentation and automated labeling techniques, requiring a minimal amount of data collection and avoiding the need for manual labeling. We report experimental results for each module of our system and demonstrate its effectiveness by testing it in a real-world robot teleoperation setting.
    A Divide-and-Merge Point Cloud Clustering Algorithm for LiDAR Panoptic Segmentation. (arXiv:2109.08224v1 [cs.CV])
    (2 min) Clustering objects from the LiDAR point cloud is an important research problem with many applications such as autonomous driving. To meet the real-time requirement, existing research proposed to apply the connected-component-labeling (CCL) technique on LiDAR spherical range image with a heuristic condition to check if two neighbor points are connected. However, LiDAR range image is different from a binary image which has a deterministic condition to tell if two pixels belong to the same component. The heuristic condition used on the LiDAR range image only works empirically, which suggests the LiDAR clustering algorithm should be robust to potential failures of the empirical heuristic condition. To overcome this challenge, this paper proposes a divide-and-merge LiDAR clustering algorithm. This algorithm firstly conducts clustering in each evenly divided local region, then merges the local clustered small components by voting on edge point pairs. Assuming there are $N$ LiDAR points of objects in total with $m$ divided local regions, the time complexity of the proposed algorithm is $O(N)+O(m^2)$. A smaller $m$ means the voting will involve more neighbor points, but the time complexity will become larger. So the $m$ controls the trade-off between the time complexity and the clustering accuracy. A proper $m$ helps the proposed algorithm work in real-time as well as maintain good performance. We evaluate the divide-and-merge clustering algorithm on the SemanticKITTI panoptic segmentation benchmark by cascading it with a state-of-the-art semantic segmentation model. The final performance evaluated through the leaderboard achieves the best among all published methods. The proposed algorithm is implemented with C++ and wrapped as a python function. It can be easily used with the modern deep learning framework in python.
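    A simplified CPU sketch of the divide-and-merge idea follows, substituting an azimuth split for the evenly divided local regions and DBSCAN for the range-image CCL used in the paper; the voting rule is reduced to a single distance test on cross-region point pairs. It is meant only to illustrate the two-stage structure, not the paper's actual implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def divide_and_merge(points, m=8, eps=0.5):
    """Divide points into m azimuth regions, cluster locally, then merge across borders."""
    azimuth = np.arctan2(points[:, 1], points[:, 0])
    region = np.minimum(((azimuth + np.pi) / (2 * np.pi) * m).astype(int), m - 1)

    labels = -np.ones(len(points), dtype=int)
    next_id = 0
    for r in range(m):                                  # divide: cluster each region locally
        idx = np.where(region == r)[0]
        if len(idx) == 0:
            continue
        local = DBSCAN(eps=eps, min_samples=3).fit_predict(points[idx])
        keep = local >= 0
        labels[idx[keep]] = local[keep] + next_id
        next_id += int(local.max()) + 1 if keep.any() else 0

    # merge: union clusters that have a point pair closer than eps across a region border
    parent = list(range(next_id))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for r in range(m):
        a = np.where((region == r) & (labels >= 0))[0]
        b = np.where((region == (r + 1) % m) & (labels >= 0))[0]
        for i in a:
            d = np.linalg.norm(points[b] - points[i], axis=1)
            for j in b[d < eps]:
                ra, rb = find(labels[i]), find(labels[j])
                if ra != rb:
                    parent[rb] = ra
    return np.array([find(l) if l >= 0 else -1 for l in labels])
```

    The trade-off noted in the abstract shows up directly here: a smaller m means fewer, larger regions (less merging work per border but more local clustering cost), while a larger m shifts work into the merge step.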
    Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. (arXiv:2109.08203v1 [cs.CV])
    (2 min) In this paper I investigate the effect of random seed selection on accuracy when using popular deep learning architectures for computer vision. I scan a large number of seeds (up to $10^4$) on CIFAR 10, and I also scan fewer seeds on ImageNet using pre-trained models to investigate large-scale datasets. The conclusions are that even if the variance is not very large, it is surprisingly easy to find an outlier that performs much better or much worse than the average.
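    The scanning protocol itself is simple; a minimal sketch follows, assuming a user-supplied training-and-evaluation routine (the routine name and the seed range are placeholders, not the paper's code).

```python
import torch

def scan_seeds(train_and_evaluate, seeds=range(100)):
    """Fix the seed, run the same training/evaluation, and record the score per seed."""
    scores = {}
    for seed in seeds:
        torch.manual_seed(seed)          # controls weight init, dropout, data shuffling, etc.
        scores[seed] = train_and_evaluate()
    return scores

# scores = scan_seeds(my_cifar10_run, seeds=range(10_000))
# print(max(scores.values()) - min(scores.values()))   # best-vs-worst seed gap
```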
    Mass Segmentation in Automated 3-D Breast Ultrasound Using Dual-Path U-net. (arXiv:2109.08330v1 [eess.IV])
    (2 min) Automated 3-D breast ultrasound (ABUS) is a recently introduced system for breast screening that has been proposed as a supplementary modality to mammography for breast cancer detection. While ABUS performs better in dense breasts, reading ABUS images is exhausting and time-consuming. So, a computer-aided detection system is necessary for the interpretation of these images. Mass segmentation plays a vital role in computer-aided detection systems and affects the overall performance. Mass segmentation is a challenging task because of the large variety in the size, shape, and texture of masses. Moreover, an imbalanced dataset makes segmentation harder. A novel mass segmentation approach based on deep learning is introduced in this paper. The deep network used in this study for image segmentation is inspired by U-net, which has been used broadly for dense segmentation in recent years. The system's performance was determined using a dataset of 50 masses including 38 malignant and 12 benign lesions. The proposed segmentation method attained a mean Dice of 0.82, which outperformed a two-stage supervised edge-based method with a mean Dice of 0.74 and an adaptive region growing method with a mean Dice of 0.65.
    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI. (arXiv:2109.08238v1 [cs.CV])
    (2 min) We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4 - 3.7x larger than other building-scale datasets such as MP3D and Gibson. When compared to existing photorealistic 3D datasets such as Replica, MP3D, Gibson, and ScanNet, images rendered from HM3D have 20 - 85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34 - 91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impacts the performance of embodied AI agents trained using it. In fact, we find that HM3D is `pareto optimal' in the following sense -- agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.
  • cs.IR updates on arXiv.org

    Integrating Flowsheet Data in OMOP Common Data Model for Clinical Research. (arXiv:2109.08235v1 [cs.IR])
    (2 min) Flowsheet data presents unique challenges and opportunities for integration into standardized Common Data Models (CDMs) such as the Observational Medical Outcomes Partnership (OMOP) CDM from the Observational Health Data Sciences and Informatics (OHDSI) program. These data are a potentially rich source of detailed, curated health outcomes data such as pain scores, vital signs, lines, drains and airways (LDA), and other measurements that can be invaluable in building a robust model of the patient health journey during an inpatient stay. We present two approaches to the integration of flowsheet measures into the OMOP CDM. One approach was computationally straightforward but of potentially limited research utility. The second approach was far more computationally and labor intensive and involved mapping to standardized terms in controlled clinical vocabularies such as Logical Observation Identifiers Names and Codes (LOINC), resulting in a research data set of higher utility for population health studies.
    Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature. (arXiv:2106.13375v2 [cs.IR] UPDATED)
    (3 min) Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and many other vertical domains is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.
    Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles. (arXiv:2109.00199v2 [cs.IR] UPDATED)
    (2 min) We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) noting salient aspects of an article's contribution in its title; and (ii) pattern regularities capturing the salient terms that could be expressed in a set of rules. Only those lexico-syntactic patterns were selected that were easily recognizable, occurred frequently, and positionally indicated a scientific entity type. The rules were developed on a collection of 50,237 CL titles covering all articles in the ACL Anthology. In total, 19,799 research problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and 21,687 methods were extracted at an average precision of 75%.
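    For intuition, one hypothetical lexico-syntactic pattern of this kind, expressed as a regular expression, is sketched below. It is an illustration of the pattern-rule idea, not one of the paper's actual rules, and the entity names it assigns are assumptions.

```python
import re

# Hypothetical pattern: titles of the form "<solution> for <research problem>"
PATTERN = re.compile(r"^(?P<solution>.+?)\s+for\s+(?P<problem>.+?)(?::|$)")

def extract_entities(title):
    m = PATTERN.match(title)
    if not m:
        return {}
    return {"solution": m.group("solution").strip(),
            "research_problem": m.group("problem").strip()}

print(extract_entities("Dense Passage Retrieval for Open-Domain Question Answering"))
# {'solution': 'Dense Passage Retrieval', 'research_problem': 'Open-Domain Question Answering'}
```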
    Semantic Models for the First-stage Retrieval: A Comprehensive Review. (arXiv:2103.04831v4 [cs.IR] UPDATED)
    (2 min) Multi-stage ranking pipelines have been a practical solution in modern search systems, where the first-stage retrieval returns a subset of candidate documents and the later stages attempt to re-rank those candidates. Unlike the re-ranking stages, which have gone through rapid technique shifts over the past decades, first-stage retrieval has long been dominated by classical term-based models. Unfortunately, these models suffer from the vocabulary mismatch problem, which may block the re-ranking stages from relevant documents at the very beginning. Therefore, it has been a long-term desire to build semantic models for first-stage retrieval that can achieve high recall efficiently. Recently, we have witnessed explosive growth of research interest in first-stage semantic retrieval models. We believe it is the right time to survey the current status, learn from existing methods, and gain some insights for future development. In this paper, we describe the current landscape of first-stage retrieval models under a unified framework to clarify the connection between classical term-based retrieval methods, early semantic retrieval methods and neural semantic retrieval methods. Moreover, we identify some open challenges and envision some future directions, with the hope of inspiring more research on these important yet less investigated topics.
    Slot Filling for Biomedical Information Extraction. (arXiv:2109.08564v1 [cs.CL])
    (2 min) Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity- and relation-type-specific training data is a major bottleneck in the above sub-tasks. In this work we present a slot filling approach to the task of biomedical IE, effectively replacing the need for entity- and relation-specific training data and allowing us to deal with zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end for standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks in the absence of relevant training data. Our code, models and pretrained data are available at https://github.com/healx/biomed-slot-filling.
    Dr. Top-k: Delegate-Centric Top-k on GPUs. (arXiv:2109.08219v1 [cs.IR])
    (2 min) Recent top-$k$ computation efforts explore the possibility of revising various sorting algorithms to answer top-$k$ queries on GPUs. These endeavors, unfortunately, perform significantly more work than needed. This paper introduces Dr. Top-k, a Delegate-centric top-$k$ system on GPUs that can reduce the top-$k$ workloads significantly. Particularly, it contains three major contributions: First, we introduce a comprehensive design of the delegate-centric concept, including maximum delegate, delegate-based filtering, and $\beta$ delegate mechanisms to help reduce the workload for top-$k$ up to more than 99%. Second, due to the difficulty and importance of deriving a proper subrange size, we perform a rigorous theoretical analysis, coupled with thorough experimental validations to identify the desirable subrange size. Third, we introduce four key system optimizations to enable fast multi-GPU top-$k$ computation. Taken together, this work constantly outperforms the state-of-the-art.
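    A CPU-side numpy sketch of the maximum-delegate filtering idea follows (the GPU kernels, $\beta$ delegates, subrange-size analysis, and multi-GPU optimizations are not modeled). The key observation is that the global top-$k$ elements can only live in subranges whose maximum delegate ranks among the top-$k$ delegates, so everything else can be discarded before running an ordinary top-$k$ on the much smaller remainder.

```python
import numpy as np

def delegate_topk(values, k, num_subranges=1024):
    """Exact top-k via maximum-delegate filtering (illustrative sketch, k <= num_subranges)."""
    values = np.asarray(values)
    pad = (-len(values)) % num_subranges
    padded = np.concatenate([values, np.full(pad, -np.inf)])  # pad so subranges are equal-sized
    chunks = padded.reshape(num_subranges, -1)

    delegates = chunks.max(axis=1)                  # one delegate (maximum) per subrange
    survivors = np.argsort(delegates)[-k:]          # subranges that may still hold top-k elements
    candidates = chunks[survivors].ravel()          # the heavily reduced workload
    return np.sort(candidates)[-k:][::-1]           # ordinary top-k on the remainder

x = np.random.randn(1_000_000)
assert np.allclose(delegate_topk(x, 10), np.sort(x)[-10:][::-1])
```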
    Simple Entity-Centric Questions Challenge Dense Retrievers. (arXiv:2109.08535v1 [cs.CL])
    (2 min) Open-domain question answering has exploded in popularity recently due to the success of dense retrieval models, which have surpassed sparse models using only a few supervised training examples. However, in this paper, we demonstrate current dense models are not yet the holy grail of retrieval. We first construct EntityQuestions, a set of simple, entity-rich questions based on facts from Wikidata (e.g., "Where was Arve Furset born?"), and observe that dense retrievers drastically underperform sparse methods. We investigate this issue and uncover that dense retrievers can only generalize to common entities unless the question pattern is explicitly observed during training. We discuss two simple solutions towards addressing this critical problem. First, we demonstrate that data augmentation is unable to fix the generalization problem. Second, we argue a more robust passage encoder helps facilitate better question adaptation using specialized question encoders. We hope our work can shed light on the challenges in creating a robust, universal dense retriever that works well across different input distributions.
    Context-aware Retail Product Recommendation with Regularized Gradient Boosting. (arXiv:2109.08561v1 [cs.LG])
    (2 min) In the FARFETCH Fashion Recommendation challenge, the participants needed to predict the order in which various products would be shown to a user in a recommendation impression. The data was provided in two phases - a validation phase and a test phase. The validation phase had a labelled training set that contained a binary column indicating whether a product has been clicked or not. The dataset comprises over 5,000,000 recommendation events, 450,000 products and 230,000 unique users. It represents real, unbiased, but anonymised, interactions of actual users of the FARFETCH platform. The final evaluation was done according to the performance in the second phase. A total of 167 participants participated in the challenge, and we secured the 6th rank during the final evaluation with an MRR of 0.4658 on the test set. We have designed a unique context-aware system that takes the similarity of a product to the user context into account to rank products more effectively. Post evaluation, we have been able to fine-tune our approach with an MRR of 0.4784 on the test set, which would have placed us at the 3rd position.
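    For reference, the MRR metric used for the leaderboard can be sketched as follows; the challenge's exact evaluation protocol and data fields are assumptions here, only the metric itself is illustrated.

```python
def mean_reciprocal_rank(ranked_lists, clicked_items):
    """MRR over impressions: average of 1 / (rank of the clicked product).

    ranked_lists: for each impression, product IDs in predicted order.
    clicked_items: the product actually clicked in each impression.
    """
    total = 0.0
    for ranking, clicked in zip(ranked_lists, clicked_items):
        if clicked in ranking:
            total += 1.0 / (ranking.index(clicked) + 1)   # rank is 1-based
    return total / len(ranked_lists)

print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], ["b", "x"]))  # 0.75
```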
    Event Flow -- How Events Shaped the Flow of the News, 1950-1995. (arXiv:2109.08589v1 [cs.CY])
    (2 min) This article relies on information-theoretic measures to examine how events impacted the news for the period 1950-1995. Moreover, we present a method for event characterization in (unstructured) textual sources, offering a taxonomy of events based on the different ways they impacted the flow of news information. The results give us a better understanding of the relationship between events and their impact on news sources with varying ideological backgrounds.
    Fast-Slow Transformer for Visually Grounding Speech. (arXiv:2109.08186v1 [eess.AS])
    (2 min) We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
    Semantic Preserving Bijective Mappings of Mathematical Formulae between Document Preparation Systems and Computer Algebra Systems. (arXiv:2109.08655v1 [cs.IR])
    (2 min) Document preparation systems like LaTeX offer the ability to render mathematical expressions as one would write these on paper. Using LaTeX, LaTeXML, and tools generated for use in the National Institute of Standards and Technology (NIST) Digital Library of Mathematical Functions, semantically enhanced mathematical LaTeX markup (semantic LaTeX) is achieved by using a semantic macro set. Computer algebra systems (CAS) such as Maple and Mathematica use alternative markup to represent mathematical expressions. By taking advantage of Youssef's Part-of-Math tagger and CAS internal representations, we develop algorithms to translate mathematical expressions represented in semantic LaTeX to corresponding CAS representations and vice versa. We have also developed tools for translating the entire Wolfram Encoding Continued Fraction Knowledge and University of Antwerp Continued Fractions for Special Functions datasets, for use in the NIST Digital Repository of Mathematical Formulae. The overall goal of these efforts is to provide semantically enriched standard conforming MathML representations to the public for formulae in digital mathematics libraries. These representations include presentation MathML, content MathML, generic LaTeX, semantic LaTeX, and now CAS representations as well.
  • cs.LG updates on arXiv.org

    Discriminative Similarity for Data Clustering. (arXiv:2109.08675v1 [cs.LG])
    (2 min) Similarity-based clustering methods separate data into clusters according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose Clustering by Discriminative Similarity (CDS), a novel method which learns discriminative similarity for data clustering. CDS learns an unsupervised similarity-based classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learnt classifiers associated with the data partitions. By generalization analysis via Rademacher complexity, the generalization error bound for the unsupervised similarity-based classifier is expressed as the sum of discriminative similarity between the data from different classes. It is proved that the derived discriminative similarity can also be induced by the integrated squared error bound for kernel density classification. In order to evaluate the performance of the proposed discriminative similarity, we propose a new clustering method using a kernel as the similarity function, CDS via unsupervised kernel classification (CDSK), with its effectiveness demonstrated by experimental results.
    Accurate, Interpretable, and Fast Animation: An Iterative, Sparse, and Nonconvex Approach. (arXiv:2109.08356v1 [cs.LG])
    (2 min) Digital human animation relies on high-quality 3D models of the human face: rigs. A face rig must be accurate and, at the same time, fast to compute. One of the most common rigging models is the blendshape model. We propose a novel algorithm for solving the nonconvex inverse rig problem in facial animation. Our approach is model-based, but in contrast with previous model-based approaches, we use a quadratic instead of a linear approximation to the higher-order rig model. This increases the accuracy of the solution by 8 percent on average and, as confirmed by the empirical results, increases the sparsity of the resulting parameter vector -- an important feature for interpretability by animation artists. The proposed solution is based on a Levenberg-Marquardt (LM) algorithm, applied to a nonconvex constrained problem with sparsity regularization. In order to reduce the complexity of the iterates, a Majorization-Minimization (MM) paradigm is further invoked, which leads to an easy-to-solve problem that is separable in the parameters at each algorithm iteration. The algorithm is evaluated on a number of animation datasets, proprietary and open-source, and the results indicate the superiority of our method compared to the standard approach based on the linear rig approximation. Although our algorithm targets this specific problem, it may have additional signal processing applications.
    On In-network learning. A Comparative Study with Federated and Split Learning. (arXiv:2104.14929v2 [stat.ML] UPDATED)
    (2 min) In this paper, we consider a problem in which distributively extracted features are used for performing inference in wireless networks. We elaborate on our proposed architecture, which we herein refer to as "in-network learning", provide a suitable loss function, and discuss its optimization using neural networks. We compare its performance with both Federated and Split Learning, and show that this architecture offers both better accuracy and bandwidth savings.
    Learning Sparse Graph with Minimax Concave Penalty under Gaussian Markov Random Fields. (arXiv:2109.08666v1 [eess.SP])
    (2 min) This paper presents a convex-analytic framework to learn sparse graphs from data. While our problem formulation is inspired by an extension of the graphical lasso using the so-called combinatorial graph Laplacian framework, a key difference is the use of a nonconvex alternative to the $\ell_1$ norm to attain graphs with better interpretability. Specifically, we use the weakly-convex minimax concave penalty (the difference between the $\ell_1$ norm and the Huber function) which is known to yield sparse solutions with lower estimation bias than $\ell_1$ for regression problems. In our framework, the graph Laplacian is replaced in the optimization by a linear transform of the vector corresponding to its upper triangular part. Via a reformulation relying on Moreau's decomposition, we show that overall convexity is guaranteed by introducing a quadratic function to our cost function. The problem can be solved efficiently by the primal-dual splitting method, of which the admissible conditions for provable convergence are presented. Numerical examples show that the proposed method significantly outperforms the existing graph learning methods with reasonable CPU time.
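    As stated above, the penalty is the difference between the $\ell_1$ norm and the Huber function. A minimal sketch of that definition in scalar form, assuming the standard Huber function $H_\gamma$ with parameter $\gamma > 0$ (the paper's exact parameterization and scaling may differ), applied entrywise to the edge-weight variables:

        $$
        H_\gamma(x) = \begin{cases} \dfrac{x^2}{2\gamma}, & |x| \le \gamma, \\[4pt] |x| - \dfrac{\gamma}{2}, & |x| > \gamma, \end{cases}
        \qquad
        \rho_\gamma(x) = |x| - H_\gamma(x) = \begin{cases} |x| - \dfrac{x^2}{2\gamma}, & |x| \le \gamma, \\[4pt] \dfrac{\gamma}{2}, & |x| > \gamma. \end{cases}
        $$

    The penalty grows like $\ell_1$ near the origin but flattens beyond $\gamma$, which is why it incurs less estimation bias on large edge weights than $\ell_1$ while still promoting sparsity.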
    ProCAN: Progressive Growing Channel Attentive Non-Local Network for Lung Nodule Classification. (arXiv:2010.15417v3 [eess.IV] UPDATED)
    (2 min) Lung cancer classification in screening computed tomography (CT) scans is one of the most crucial tasks for early detection of this disease. Many lives can be saved if we are able to accurately classify malignant/cancerous lung nodules. Consequently, several deep learning based models have been proposed recently to classify lung nodules as malignant or benign. Nevertheless, the large variation in the size and heterogeneous appearance of the nodules makes this task an extremely challenging one. We propose a new Progressive Growing Channel Attentive Non-Local (ProCAN) network for lung nodule classification. The proposed method addresses this challenge from three different aspects. First, we enrich the Non-Local network by adding channel-wise attention capability to it. Second, we apply Curriculum Learning principles, whereby we first train our model on easy examples before hard ones. Third, as the classification task gets harder during the Curriculum learning, our model is progressively grown to increase its capability of handling the task at hand. We examined our proposed method on two different public datasets and compared its performance with state-of-the-art methods in the literature. The results show that the ProCAN model outperforms state-of-the-art methods and achieves an AUC of 98.05% and an accuracy of 95.28% on the LIDC-IDRI dataset. Moreover, we conducted extensive ablation studies to analyze the contribution and effects of each new component of our proposed method.
    Distributionally Robust Policy Learning via Adversarial Environment Generation. (arXiv:2107.06353v2 [cs.RO] UPDATED)
    (2 min) Our goal is to train control policies that generalize well to unseen environments. Inspired by the Distributionally Robust Optimization (DRO) framework, we propose DRAGEN - Distributionally Robust policy learning via Adversarial Generation of ENvironments - for iteratively improving robustness of policies to realistic distribution shifts by generating adversarial environments. The key idea is to learn a generative model for environments whose latent variables capture cost-predictive and realistic variations in environments. We perform DRO with respect to a Wasserstein ball around the empirical distribution of environments by generating realistic adversarial environments via gradient ascent on the latent space. We demonstrate strong Out-of-Distribution (OoD) generalization in simulation for (i) swinging up a pendulum with onboard vision and (ii) grasping realistic 2D/3D objects. Grasping experiments on hardware demonstrate better sim2real performance compared to domain randomization.
    Non-parametric Semi-Supervised Learning in Many-body Hilbert Space with Rescaled Logarithmic Fidelity. (arXiv:2107.00195v2 [quant-ph] UPDATED)
    (2 min) In quantum and quantum-inspired machine learning, the very first step is to embed the data in a quantum space known as the Hilbert space. Developing a quantum kernel function (QKF), which defines the distances among samples in the Hilbert space, is one of the fundamental topics for machine learning. In this work, we propose the rescaled logarithmic fidelity (RLF) and non-parametric semi-supervised learning in the quantum space, which we name RLF-NSSL. The rescaling takes advantage of the non-linearity of the kernel to tune the mutual distances of samples in the Hilbert space, and meanwhile avoids the exponentially-small fidelities between quantum many-qubit states. Being non-parametric excludes possible effects from variational parameters and clearly demonstrates the advantages of the space itself. We compare RLF-NSSL with several well-known non-parametric algorithms including naive Bayes classifiers, k-nearest neighbors, and spectral clustering. Our method exhibits better accuracy, particularly for the unsupervised case with no labeled samples and the few-shot cases with small numbers of labeled samples. With visualizations by t-distributed stochastic neighbor embedding, our results imply that machine learning in the Hilbert space complies with the principles of maximal coding rate reduction, where the low-dimensional data exhibit within-class compressibility, between-class discrimination, and overall diversity. Our proposals can be applied to other quantum and quantum-inspired machine learning methods, including those using parametric models such as tensor networks, quantum circuits, and quantum neural networks.
    BDNNSurv: Bayesian deep neural networks for survival analysis using pseudo values. (arXiv:2101.03170v2 [stat.ME] UPDATED)
    (2 min) There has been increasing interest in modeling survival data using deep learning methods in medical research. In this paper, we propose a Bayesian hierarchical deep neural network model for modeling and prediction of survival data. Compared with previously studied methods, the new proposal can provide not only a point estimate of the survival probability but also a quantification of the corresponding uncertainty, which can be of crucial importance in predictive modeling and subsequent decision making. The favorable statistical properties of the point and uncertainty estimates were demonstrated by simulation studies and real data analysis. Python code implementing the proposed approach is provided.
    Online certification of preference-based fairness for personalized recommender systems. (arXiv:2104.14527v2 [cs.LG] UPDATED)
    (2 min) Recommender systems are facing scrutiny because of their growing impact on the opportunities we have access to. Current audits for fairness are limited to coarse-grained parity assessments at the level of sensitive groups. We propose to audit for envy-freeness, a more granular criterion aligned with individual preferences: every user should prefer their recommendations to those of other users. Since auditing for envy requires estimating the preferences of users beyond their existing recommendations, we cast the audit as a new pure exploration problem in multi-armed bandits. We propose a sample-efficient algorithm with theoretical guarantees that it does not deteriorate user experience. We also study the trade-offs achieved on real-world recommendation datasets.
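    The envy-freeness criterion itself is easy to state in code. Below is a toy check over a known utility matrix (hypothetical names and values); it sidesteps the paper's actual contribution, which is estimating these utilities through bandit exploration without degrading the user experience.

        import numpy as np

        def envious_pairs(utility: np.ndarray, recs: list) -> list:
            """Return pairs (u, v) where user u would prefer user v's recommendations
            to their own, i.e. u envies v. utility[u, i] is user u's utility for item i."""
            def value(u, items):
                return float(utility[u, items].sum())
            n = len(recs)
            return [(u, v) for u in range(n) for v in range(n)
                    if u != v and value(u, recs[v]) > value(u, recs[u])]

        # Toy example: user 0 is recommended item 0 but would prefer user 1's slate (item 1).
        util = np.array([[0.5, 1.0, 0.0],
                         [0.2, 1.0, 0.3]])
        print(envious_pairs(util, recs=[[0], [1]]))  # [(0, 1)]

    An audit passes (the system is envy-free) when this list is empty for every pair of users.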
    Supervised Contrastive Replay: Revisiting the Nearest Class Mean Classifier in Online Class-Incremental Continual Learning. (arXiv:2103.13885v3 [cs.LG] UPDATED)
    (2 min) Online class-incremental continual learning (CL) studies the problem of learning new classes continually from an online non-stationary data stream, intending to adapt to new data while mitigating catastrophic forgetting. While memory replay has shown promising results, the recency bias in online learning caused by the commonly used Softmax classifier remains an unsolved challenge. Although the Nearest-Class-Mean (NCM) classifier is significantly undervalued in the CL community, we demonstrate that it is a simple yet effective substitute for the Softmax classifier. It addresses the recency bias and avoids structural changes in the fully-connected layer for new classes. Moreover, we observe considerable and consistent performance gains when replacing the Softmax classifier with the NCM classifier for several state-of-the-art replay methods. To leverage the NCM classifier more effectively, data embeddings belonging to the same class should be clustered and well-separated from those with a different class label. To this end, we contribute Supervised Contrastive Replay (SCR), which explicitly encourages samples from the same class to cluster tightly in embedding space while pushing those of different classes further apart during replay-based training. Overall, we observe that our proposed SCR substantially reduces catastrophic forgetting and outperforms state-of-the-art CL methods by a significant margin on a variety of datasets.
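    The NCM classifier mentioned above is simple enough to sketch directly. Below is a minimal PyTorch version over fixed embeddings (names and the Euclidean metric are assumptions); the replay buffer and the supervised contrastive loss that SCR adds on top are not shown.

        import torch

        class NearestClassMean:
            """Classify an embedding by the nearest class-mean prototype in feature space."""
            def __init__(self):
                self.means = {}  # class label -> mean embedding

            def fit(self, embeddings: torch.Tensor, labels: torch.Tensor):
                for c in labels.unique():
                    self.means[int(c)] = embeddings[labels == c].mean(dim=0)

            def predict(self, embeddings: torch.Tensor) -> torch.Tensor:
                classes = sorted(self.means)
                protos = torch.stack([self.means[c] for c in classes])  # (C, D)
                dists = torch.cdist(embeddings, protos)                 # (N, C) Euclidean distances
                return torch.tensor(classes)[dists.argmin(dim=1)]

        # Toy usage with random two-class embeddings.
        feats, labels = torch.randn(100, 16), torch.randint(0, 2, (100,))
        ncm = NearestClassMean()
        ncm.fit(feats, labels)
        print(ncm.predict(feats[:5]))

    Because prediction only compares distances to per-class means, adding a new class never changes a network head, which is the structural advantage over a Softmax layer noted above.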
    Deep Spiking Neural Networks with Resonate-and-Fire Neurons. (arXiv:2109.08234v1 [cs.NE])
    (2 min) In this work, we explore a new Spiking Neural Network (SNN) formulation with Resonate-and-Fire (RAF) neurons (Izhikevich, 2001) trained with gradient descent via back-propagation. The RAF-SNN, while more biologically plausible, achieves performance comparable to or higher than conventional models in the Machine Learning literature across different network configurations, using similar or fewer parameters. Strikingly, the RAF-SNN proves robust against noise induced at testing/training time, under both static and dynamic conditions. Against CNN on MNIST, we show 25% higher absolute accuracy with N(0, 0.2) induced noise at testing time. Against LSTM on N-MNIST, we show 70% higher absolute accuracy with 20% induced noise at training time.
    Classification of multivariate weakly-labelled time-series with attention. (arXiv:2102.08245v3 [cs.LG] UPDATED)
    (2 min) This research identifies a gap in weakly-labelled multivariate time-series classification (TSC), where state-of-the-art TSC models do not perform well. Weakly labelled time-series are time-series containing noise and significant redundancies. In response to this gap, this paper proposes an approach that exploits the contextual relevance of subsequences with respect to previous subsequences to improve classification accuracy. To achieve this, state-of-the-art attention algorithms are evaluated in combination with the top CNN models for TSC (FCN and ResNet), in a CNN-LSTM architecture. Attention is a popular strategy for context extraction with exceptional performance in modern sequence-to-sequence tasks. This paper shows how attention algorithms can be used for improved weakly labelled TSC by evaluating models on a multivariate EEG time-series dataset obtained using commercial Emotiv headsets from participants performing various activities while driving. These time-series are segmented into subsequences and labelled to allow supervised TSC.
    WebQA: Multihop and Multimodal QA. (arXiv:2109.00590v2 [cs.CL] UPDATED)
    (2 min) Web search is fundamentally multimodal and multihop. Often, even before asking a question we choose to go directly to image search to find our answers. Further, rarely do we find an answer from a single source but aggregate information and reason through implications. Despite the frequency of this everyday occurrence, at present, there is no unified question answering benchmark that requires a single model to answer long-form natural language questions from text and open-ended visual sources -- akin to a human's experience. We propose to bridge this gap between the natural language and computer vision communities with WebQA. We show that A. our multihop text queries are difficult for a large-scale transformer model, and B. existing multi-modal transformers and visual representations do not perform well on open-domain visual queries. Our challenge for the community is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
    ILoSA: Interactive Learning of Stiffness and Attractors. (arXiv:2103.03099v2 [cs.RO] UPDATED)
    (2 min) Teaching robots how to apply forces according to our preferences is still an open challenge that has to be tackled from multiple engineering perspectives. This paper studies how to learn variable impedance policies where both the Cartesian stiffness and the attractor can be learned from human demonstrations and corrections with a user-friendly interface. The presented framework, named ILoSA, uses Gaussian Processes for policy learning, identifying regions of uncertainty and allowing interactive corrections, stiffness modulation and active disturbance rejection. The experimental evaluation of the framework is carried out on a Franka-Emika Panda in four separate cases with unique force interaction properties: 1) pulling a plug wherein a sudden force discontinuity occurs upon successful removal of the plug, 2) pushing a box where a sustained force is required to keep the robot in motion, 3) wiping a whiteboard in which the force is applied perpendicular to the direction of movement, and 4) inserting a plug to verify the usability for precision-critical tasks in an experimental validation performed with non-expert users.
    Autonomous Vision-based UAV Landing with Collision Avoidance using Deep Learning. (arXiv:2109.08628v1 [cs.LG])
    (2 min) There is a risk of collision when multiple UAVs land simultaneously without communication on the same platform. This work accomplishes vision-based autonomous landing and uses a deep-learning-based method to realize collision avoidance during the landing process.
    AI-assisted super-resolution cosmological simulations II: Halo substructures, velocities and higher order statistics. (arXiv:2105.01016v2 [astro-ph.CO] UPDATED)
    (2 min) In this work, we expand and test the capabilities of our recently developed super-resolution (SR) model to generate high-resolution (HR) realizations of the full phase-space matter distribution, including both displacement and velocity, from computationally cheap low-resolution (LR) cosmological N-body simulations. The SR model enhances the simulation resolution by generating 512 times more tracer particles, extending into the deeply non-linear regime where complex structure formation processes take place. We validate the SR model by deploying the model in 10 test simulations of box size 100 Mpc/h, and examine the matter power spectra, bispectra and 2D power spectra in redshift space. We find the generated SR field matches the true HR result at percent level down to scales of k ~ 10 h/Mpc. We also identify and inspect dark matter halos and their substructures. Our SR model generates visually authentic small-scale structures that cannot be resolved by the LR input and are in good statistical agreement with the real HR results. The SR model performs satisfactorily on the halo occupation distribution, halo correlations in both real and redshift space, and the pairwise velocity distribution, matching the HR results with comparable scatter, thus demonstrating its potential in making mock halo catalogs. The SR technique can be a powerful and promising tool for modelling small-scale galaxy formation physics in large cosmological volumes.
    Online learning of windmill time series using Long Short-term Cognitive Networks. (arXiv:2107.00425v2 [cs.LG] UPDATED)
    (2 min) Forecasting windmill time series is often the basis of other processes such as anomaly detection, health monitoring, or maintenance scheduling. The amount of data generated on windmill farms makes online learning the most viable strategy to follow. Such settings require retraining the model each time a new batch of data is available. However, updating the model with new information is often very expensive when using traditional Recurrent Neural Networks (RNNs). In this paper, we use Long Short-term Cognitive Networks (LSTCNs) to forecast windmill time series in online settings. These recently introduced neural systems consist of chained Short-term Cognitive Network blocks, each processing a temporal data chunk. The learning algorithm of these blocks is based on a very fast, deterministic learning rule that makes LSTCNs suitable for online learning tasks. Numerical simulations using a case study with four windmills showed that our approach reported the lowest forecasting errors compared with a simple RNN, a Long Short-term Memory, a Gated Recurrent Unit, and a Hidden Markov Model. What is perhaps more important is that the LSTCN approach is significantly faster than these state-of-the-art models.
    Insta-RS: Instance-wise Randomized Smoothing for Improved Robustness and Accuracy. (arXiv:2103.04436v4 [cs.LG] UPDATED)
    (2 min) Randomized smoothing (RS) is an effective and scalable technique for constructing neural network classifiers that are certifiably robust to adversarial perturbations. Most RS works focus on training a good base model that boosts the certified robustness of the smoothed model. However, existing RS techniques treat every data point the same, i.e., the variance of the Gaussian noise used to form the smoothed model is preset and universal for all training and test data. This preset and universal Gaussian noise variance is suboptimal since different data points have different margins and the local properties of the base model vary across the input examples. In this paper, we examine the impact of customized handling of examples and propose Instance-wise Randomized Smoothing (Insta-RS) -- a multiple-start search algorithm that assigns customized Gaussian variances to test examples. We also design Insta-RS Train -- a novel two-stage training algorithm that adaptively adjusts and customizes the noise level of each training example for training a base model that boosts the certified robustness of the instance-wise Gaussian smoothed model. Through extensive experiments on CIFAR-10 and ImageNet, we show that our method significantly enhances the average certified radius (ACR) as well as the clean data accuracy compared to existing state-of-the-art provably robust classifiers.
    Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information. (arXiv:1912.05625v4 [q-bio.BM] UPDATED)
    (2 min) Bridging the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained with a substantial amount of unlabeled data, and the representations were transferred to various downstream tasks. Most pre-training methods solely rely on language modeling and often exhibit limited performance. In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same-family prediction. PLUS can be used to pre-train various model architectures. In this work, we use PLUS to pre-train a bidirectional recurrent neural network and refer to the resulting model as PLUS-RNN. Our experiment results demonstrate that PLUS-RNN outperforms other models of similar size solely pre-trained with the language modeling in six out of seven widely used protein biology tasks. Furthermore, we present the results from our qualitative interpretation analyses to illustrate the strengths of PLUS-RNN. PLUS provides a novel way to exploit evolutionary relationships among unlabeled proteins and is broadly applicable across a variety of protein biology tasks. We expect that the gap between the numbers of unlabeled and labeled proteins will continue to grow exponentially, and the proposed pre-training method will play a larger role.
    SLAW: Scaled Loss Approximate Weighting for Efficient Multi-Task Learning. (arXiv:2109.08218v1 [cs.LG])
    (2 min) Multi-task learning (MTL) is a subfield of machine learning with important applications, but the multi-objective nature of optimization in MTL leads to difficulties in balancing training between tasks. The best MTL optimization methods require individually computing the gradient of each task's loss function, which impedes scalability to a large number of tasks. In this paper, we propose Scaled Loss Approximate Weighting (SLAW), a method for multi-task optimization that matches the performance of the best existing methods while being much more efficient. SLAW balances learning between tasks by estimating the magnitudes of each task's gradient without performing any extra backward passes. We provide theoretical and empirical justification for SLAW's estimation of gradient magnitudes. Experimental results on non-linear regression, multi-task computer vision, and virtual screening for drug discovery demonstrate that SLAW is significantly more efficient than strong baselines without sacrificing performance and applicable to a diverse range of domains.
    Distilling Linguistic Context for Language Model Compression. (arXiv:2109.08359v1 [cs.CL])
    (2 min) A computationally expensive and memory intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such a vast language model in resource-scarce environments, transfers the knowledge on individual word representations learned without restrictions. In this paper, inspired by the recent observations that language representations are relatively positioned and have more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers the contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for the language models, our contextual distillation does not have any restrictions on architectural changes between teacher and student. We validate the effectiveness of our method on challenging benchmarks of language understanding tasks, not only in architectures of various sizes, but also in combination with DynaBERT, the recently proposed adaptive size pruning method.
    CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research. (arXiv:2109.08267v1 [cs.PL])
    (2 min) Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can serve as a common benchmark for comparing techniques, and as a platform to accelerate progress in the field. We introduce CompilerGym, a set of environments for real world compiler optimization tasks, and a toolkit for exposing new optimization tasks to compiler researchers. CompilerGym enables anyone to experiment on production compiler optimization problems through an easy-to-use package, regardless of their experience with compilers. We build upon the popular OpenAI Gym interface enabling researchers to interact with compilers using Python and a familiar API. We describe the CompilerGym architecture and implementation, characterize the optimization spaces and computational efficiencies of three included compiler environments, and provide extensive empirical evaluations. Compared to prior works, CompilerGym offers larger datasets and optimization spaces, is 27x more computationally efficient, is fault-tolerant, and capable of detecting reproducibility bugs in the underlying compilers. In making it easy for anyone to experiment with compilers - irrespective of their background - we aim to accelerate progress in the AI and compiler research domains.
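    The Gym-style interaction pattern described above looks roughly like the following; the environment and reward-space identifiers are assumptions modeled on the CompilerGym documentation and may differ between releases, so treat this as a sketch rather than canonical usage.

        # Random-walk sketch of the Gym-style API (pip install compiler_gym).
        import compiler_gym

        env = compiler_gym.make(                  # creates an OpenAI-Gym-compatible environment
            "llvm-v0",                            # assumed LLVM phase-ordering environment name
            reward_space="IrInstructionCountOz",  # assumed reward: IR instruction count vs -Oz
        )
        env.reset()
        episode_reward = 0.0
        for _ in range(20):                       # apply 20 random optimization passes
            action = env.action_space.sample()
            observation, reward, done, info = env.step(action)
            episode_reward += reward or 0.0
            if done:
                break
        print(f"cumulative reward: {episode_reward:.3f}")
        env.close()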
    Double Descent Optimization Pattern and Aliasing: Caveats of Noisy Labels. (arXiv:2106.02100v2 [cs.LG] UPDATED)
    (2 min) Optimization plays a key role in the training of deep neural networks. Deciding when to stop training can have a substantial impact on the performance of the network during inference. Under certain conditions, the generalization error can display a double descent pattern during training: the learning curve is non-monotonic and seemingly diverges before converging again after additional epochs. This optimization pattern can cause early stopping procedures to stop training before the second convergence and consequently select a suboptimal set of parameters for the network, with worse performance during inference. In this work, in addition to confirming that double descent occurs with small datasets and noisy labels as evidenced by others, we show that noisy labels must be present both in the training and generalization sets to observe a double descent pattern. We also show that the learning rate has an influence on double descent, and study how different optimizers and optimizer parameters influence the appearance of double descent. Finally, we show that increasing the learning rate can create an aliasing effect that masks the double descent pattern without suppressing it. We study this phenomenon through extensive experiments on variants of CIFAR-10 and show that they translate to a real world application: the forecast of seizure events in epileptic patients from continuous electroencephalographic recordings.
    Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. (arXiv:2007.15779v6 [cs.CL] UPDATED)
    (2 min) Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
    K-means for Evolving Data Streams. (arXiv:2012.03628v2 [stat.ML] UPDATED)
    (2 min) The amount of data produced worldwide is increasing beyond measure, so a high volume of unsupervised data must be processed continuously. One of the main unsupervised data analysis tasks is clustering. In streaming data scenarios, the data is composed of an increasing sequence of batches of samples, where the concept drift phenomenon may occur. In this paper, we formally define the Streaming $K$-means (S$K$M) problem, which implies a restart of the error function when a concept drift occurs. We propose a surrogate error function that does not rely on concept drift detection. We prove that the surrogate is a good approximation of the S$K$M error. Hence, we suggest an algorithm which minimizes this alternative error each time a new batch arrives. We also present initialization techniques for streaming data scenarios. Besides providing theoretical results, experiments demonstrate an improvement of the converged error for the non-trivial initialization methods.
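    For context, the streaming setting (data arriving as a sequence of batches, with drift) can be illustrated with a generic batch-wise baseline using scikit-learn's MiniBatchKMeans and partial_fit; this is not the authors' S$K$M algorithm or surrogate error, and the drift below is deliberately crude.

        import numpy as np
        from sklearn.cluster import MiniBatchKMeans

        rng = np.random.default_rng(0)
        model = MiniBatchKMeans(n_clusters=3, random_state=0)

        # Stream of batches; halfway through, all cluster centers shift (a crude concept drift).
        for t in range(20):
            shift = 0.0 if t < 10 else 5.0
            batch = np.concatenate([rng.normal(loc=c + shift, scale=0.5, size=(100, 2))
                                    for c in (0.0, 3.0, 6.0)])
            model.partial_fit(batch)          # incremental update on the newly arrived batch
            sse = -model.score(batch)         # within-batch squared error of the current centers
            print(f"batch {t:2d}  error {sse:9.1f}")

    The per-batch error spikes at the drift point and then recovers only gradually, illustrating the kind of non-stationarity the surrogate error above is meant to handle without an explicit drift detector.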
    From Known to Unknown: Knowledge-guided Transformer for Time-Series Sales Forecasting in Alibaba. (arXiv:2109.08381v1 [cs.LG])
    (2 min) Time series forecasting (TSF) is fundamentally required in many real-world applications, such as electricity consumption planning and sales forecasting. In e-commerce, accurate time-series sales forecasting (TSSF) can significantly increase economic benefits. TSSF in e-commerce aims to predict the future sales of millions of products. The trend and seasonality of products vary widely, and promotion activity heavily influences sales. Besides these difficulties, some future knowledge is available in advance, in addition to the historical statistics. Such future knowledge may reflect the influence of future promotion activity on current sales and help achieve better accuracy. However, most existing TSF methods only predict the future based on historical information. In this work, we make up for this omission of future knowledge. Beyond introducing future knowledge for prediction, we propose Aliformer, based on the bidirectional Transformer, which can utilize historical information, current factors, and future knowledge to predict future sales. Specifically, we design a knowledge-guided self-attention layer that uses the consistency of the known knowledge to guide the transmission of timing information. A future-emphasized training strategy is also proposed to make the model focus more on the utilization of future knowledge. Extensive experiments on four public benchmark datasets and one proposed large-scale industrial dataset from Tmall demonstrate that Aliformer can perform much better than state-of-the-art TSF methods. Aliformer has been deployed for goods selection on Tmall Industry Tablework, and the dataset will be released upon approval.
    Assessments of model-form uncertainty using Gaussian stochastic weight averaging for fluid-flow regression. (arXiv:2109.08248v1 [physics.flu-dyn])
    (2 min) We use Gaussian stochastic weight averaging (SWAG) to assess the model-form uncertainty associated with neural-network-based function approximation relevant to fluid flows. SWAG approximates a posterior Gaussian distribution of each weight, given training data, and a constant learning rate. Having access to this distribution, it is able to create multiple models with various combinations of sampled weights, which can be used to obtain ensemble predictions. The average of such an ensemble can be regarded as the `mean estimation', whereas its standard deviation can be used to construct `confidence intervals', which enable us to perform uncertainty quantification (UQ) with regard to the training process of neural networks. We utilize representative neural-network-based function approximation tasks for the following cases: (i) a two-dimensional circular-cylinder wake; (ii) the DayMET dataset (maximum daily temperature in North America); (iii) a three-dimensional square-cylinder wake; and (iv) urban flow, to assess the generalizability of the present idea for a wide range of complex datasets. SWAG-based UQ can be applied regardless of the network architecture, and therefore, we demonstrate the applicability of the method for two types of neural networks: (i) global field reconstruction from sparse sensors by combining convolutional neural network (CNN) and multi-layer perceptron (MLP); and (ii) far-field state estimation from sectional data with two-dimensional CNN. We find that SWAG can obtain physically-interpretable confidence-interval estimates from the perspective of model-form uncertainty. This capability supports its use for a wide range of problems in science and engineering.
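    As a rough illustration of the idea (diagonal SWAG only; the paper's fluid-flow networks and the low-rank covariance term of full SWAG are not reproduced, and all hyperparameters are made up), the sketch below collects first and second moments of the weights over late SGD iterates, then samples weight vectors to form an ensemble whose mean and standard deviation play the role of the `mean estimation' and `confidence intervals' above.

        import torch
        import torch.nn as nn

        def flatten_params(model):
            return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

        def load_flat_params(model, flat):
            i = 0
            for p in model.parameters():
                n = p.numel()
                p.data.copy_(flat[i:i + n].view_as(p))
                i += n

        model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        x = torch.linspace(-1, 1, 64).unsqueeze(1)
        y = torch.sin(3 * x) + 0.05 * torch.randn_like(x)

        moment1 = moment2 = None
        n_snapshots = 0
        for epoch in range(300):
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
            if epoch >= 200:                       # collect weight moments late in training
                w = flatten_params(model)
                moment1 = w if moment1 is None else moment1 + w
                moment2 = w ** 2 if moment2 is None else moment2 + w ** 2
                n_snapshots += 1

        mean = moment1 / n_snapshots
        var = (moment2 / n_snapshots - mean ** 2).clamp_min(1e-8)

        preds = []
        with torch.no_grad():
            for _ in range(20):                    # ensemble: sample weights ~ N(mean, diag var)
                load_flat_params(model, mean + var.sqrt() * torch.randn_like(mean))
                preds.append(model(x))
        preds = torch.stack(preds)
        print(preds.mean(0)[:3].squeeze())         # 'mean estimation' at the first few inputs
        print(preds.std(0)[:3].squeeze())          # width of the corresponding confidence band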
    SaCoFa: Semantics-aware Control-flow Anonymization for Process Mining. (arXiv:2109.08501v1 [cs.DB])
    (2 min) Privacy-preserving process mining enables the analysis of business processes using event logs, while giving guarantees on the protection of sensitive information on process stakeholders. To this end, existing approaches add noise to the results of queries that extract properties of an event log, such as the frequency distribution of trace variants, for analysis. Noise insertion neglects the semantics of the process, though, and may generate traces not present in the original log. This is problematic. It lowers the utility of the published data and makes noise easily identifiable, as some traces will violate well-known semantic constraints. In this paper, we therefore argue for privacy preservation that incorporates a process semantics. For common trace-variant queries, we show how, based on the exponential mechanism, semantic constraints are incorporated to ensure differential privacy of the query result. Experiments demonstrate that our semantics-aware anonymization yields event logs of significantly higher utility than existing approaches.
    Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback. (arXiv:2109.08548v1 [cs.DC])
    (2 min) Scheduling decisions in parallel queuing systems arise as a fundamental problem, underlying the dimensioning and operation of many computing and communication systems, such as job routing in data center clusters, multipath communication, and Big Data systems. In essence, the scheduler maps each arriving job to one of the possibly heterogeneous servers while aiming at an optimization goal such as load balancing, low average delay or low loss rate. One main difficulty in finding optimal scheduling decisions here is that the scheduler only partially observes the impact of its decisions, e.g., through the delayed acknowledgements of the served jobs. In this paper, we provide a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements. We present a simulation model for this PO system to find a near-optimal scheduling policy in real-time using a scalable Monte Carlo tree search algorithm. We numerically show that the resulting policy outperforms other limited information scheduling strategies such as variants of Join-the-Most-Observations and has comparable performance to full information strategies like Join-the-Shortest-Queue, Join-the-Shortest-Queue(d) and Shortest-Expected-Delay. Finally, we show how our approach can optimise the real-time parallel processing by using network data provided by Kaggle.
    Measuring Fairness under Unawareness via Quantification. (arXiv:2109.08549v1 [cs.CY])
    (2 min) Models trained by means of supervised learning are increasingly deployed in high-stakes domains, and, when their predictions inform decisions about people, they inevitably impact (positively or negatively) on their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and ensure that sensitive demographic attributes, such as race or sex, do not result in unfair treatment for members of specific groups. For doing this, awareness of demographic attributes on the part of those evaluating model impacts is fundamental. Unfortunately, the collection of these attributes is often in conflict with industry practices and legislation on data minimization and privacy. For this reason, it may be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We identify five important factors that complicate the estimation of fairness under unawareness and formalize them into five different experimental protocols under which we assess the effectiveness of different estimators of group fairness. We also consider the problem of potential model misuse to infer sensitive attributes at an individual level, and demonstrate that quantification approaches are suitable for decoupling the (desirable) objective of measuring group fairness from the (undesirable) objective of inferring sensitive attributes of individuals.
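    Two textbook quantification baselines, Classify & Count and Adjusted Classify & Count, illustrate what "directly providing group-level prevalence estimates" means; the estimators and protocols evaluated in the paper are more elaborate, and the numbers below are made up.

        import numpy as np

        def classify_and_count(pred):
            """Naive prevalence estimate: fraction of instances predicted positive."""
            return float(np.mean(pred))

        def adjusted_classify_and_count(pred, tpr, fpr):
            """Correct the naive estimate with the classifier's TPR/FPR (estimated on held-out
            data): prevalence = (cc - fpr) / (tpr - fpr), clipped to [0, 1]."""
            cc = classify_and_count(pred)
            return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))

        # Toy numbers: 40% predicted positive, with TPR = 0.8 and FPR = 0.1 on validation data.
        preds = np.array([1] * 40 + [0] * 60)
        print(adjusted_classify_and_count(preds, tpr=0.8, fpr=0.1))  # ~0.429

    Estimating the prevalence of, say, a sensitive group in this aggregate way is what allows group fairness to be measured without attaching the sensitive attribute to any individual record.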
    LEAN: graph-based pruning for convolutional neural networks by extracting longest chains. (arXiv:2011.06923v2 [cs.LG] UPDATED)
    (2 min) Convolutional neural networks (CNNs) have proven to be highly successful at a range of image-to-image tasks. CNNs can be prohibitively computationally expensive for real time use, which can limit their applicability in practice. Model pruning can improve computational efficiency by sparsifying trained networks. Common methods for pruning CNNs determine what convolutional filters to remove by ranking filters on an individual basis. However, filters are not independent, as CNNs consist of chains of convolutions, which can result in sub-optimal filter selection. We propose a novel pruning method, LongEst-chAiN (LEAN) pruning, which takes the interdependency between the convolution operations into account. We propose to prune CNNs by using graph-based algorithms to select relevant chains of convolutions. A CNN is interpreted as a graph, with the operator norm of each operator as distance metric for the edges. LEAN pruning iteratively extracts the highest value path from the graph to keep. In our experiments, we test LEAN pruning for several image-to-image tasks, including the well-known CamVid dataset, and a real-world dynamic X-ray CT dataset. When pruning CNNs with LEAN, we achieve a higher accuracy than pruning filters individually, and different pruned substructures emerge.
    Hard to Forget: Poisoning Attacks on Certified Machine Unlearning. (arXiv:2109.08266v1 [cs.LG])
    (2 min) The right to erasure requires removal of a user's information from data held by organizations, with rigorous interpretations extending to downstream products such as learned models. Retraining from scratch with the particular user's data omitted fully removes its influence on the resulting model, but comes with a high computational cost. Machine "unlearning" mitigates the cost incurred by full retraining: instead, models are updated incrementally, possibly only requiring retraining when approximation errors accumulate. Rapid progress has been made towards privacy guarantees on the indistinguishability of unlearned and retrained models, but current formalisms do not place practical bounds on computation. In this paper we demonstrate how an attacker can exploit this oversight, highlighting a novel attack surface introduced by machine unlearning. We consider an attacker aiming to increase the computational cost of data removal. We derive and empirically investigate a poisoning attack on certified machine unlearning where strategically designed training data triggers complete retraining when removed.
    Soft Actor-Critic With Integer Actions. (arXiv:2109.08512v1 [cs.LG])
    (2 min) Reinforcement learning is well-studied under discrete actions. Integer actions setting is popular in the industry yet still challenging due to its high dimensionality. To this end, we study reinforcement learning under integer actions by incorporating the Soft Actor-Critic (SAC) algorithm with an integer reparameterization. Our key observation for integer actions is that their discrete structure can be simplified using their comparability property. Hence, the proposed integer reparameterization does not need one-hot encoding and is of low dimensionality. Experiments show that the proposed SAC under integer actions is as good as the continuous action version on robot control tasks and outperforms Proximal Policy Optimization on power distribution systems control tasks.
    Dynamics-Aware Quality-Diversity for Efficient Learning of Skill Repertoires. (arXiv:2109.08522v1 [cs.LG])
    (2 min) Quality-Diversity (QD) algorithms are powerful exploration algorithms that allow robots to discover large repertoires of diverse and high-performing skills. However, QD algorithms are sample inefficient and require millions of evaluations. In this paper, we propose Dynamics-Aware Quality-Diversity (DA-QD), a framework to improve the sample efficiency of QD algorithms through the use of dynamics models. We also show how DA-QD can then be used for continual acquisition of new skill repertoires. To do so, we incrementally train a deep dynamics model from experience obtained when performing skill discovery using QD. We can then perform QD exploration in imagination with an imagined skill repertoire. We evaluate our approach on three robotic experiments. First, our experiments show DA-QD is 20 times more sample efficient than existing QD approaches for skill discovery. Second, we demonstrate learning an entirely new skill repertoire in imagination to perform zero-shot learning. Finally, we show how DA-QD is useful and effective for solving a long horizon navigation task and for damage adaptation in the real world. Videos and source code are available at: https://sites.google.com/view/da-qd.
    A Logic-based Multi-agent System for Ethical Monitoring and Evaluation of Dialogues. (arXiv:2109.08294v1 [cs.AI])
    (2 min) Dialogue Systems are tools designed for various practical purposes concerning human-machine interaction. These systems should be built on ethical foundations because their behavior may heavily influence a user (think especially about children). The primary objective of this paper is to present the architecture and prototype implementation of a Multi Agent System (MAS) designed for ethical monitoring and evaluation of a dialogue system. A prototype application is developed and presented for monitoring and evaluating the ethical behavior of chatting agents (human or artificial) at an online customer service chat point, with respect to their institution's or company's codes of ethics and conduct. Future work and open issues with this research are discussed.
    Decision Tree Learning with Spatial Modal Logics. (arXiv:2109.08325v1 [cs.LG])
    (0 min) Symbolic learning represents the most straightforward approach to interpretable modeling, but its applications have been hampered by a single structural design choice: the adoption of propositional logic as the underlying language. Recently, more-than-propositional symbolic learning methods have started to appear, in particular for time-dependent data. These methods exploit the expressive power of modal temporal logics in powerful learning algorithms, such as temporal decision trees, whose classification capabilities are comparable with the best non-symbolic ones, while producing models with explicit knowledge representation. With the intent of following the same approach in the case of spatial data, in this paper we: i) present a theory of spatial decision tree learning; ii) describe a prototypical implementation of a spatial decision tree learning algorithm based on, and strictly extending, the classical C4.5 algorithm; and iii) perform a series of experiments in which we compare the predictive power of spatial decision trees with that of classical propositional decision trees in several versions, for a multi-class image classification problem, on publicly available datasets. Our results are encouraging, showing clear improvements in the performances from the propositional to the spatial models, which in turn show higher levels of interpretability.
    An open GPS trajectory dataset and benchmark for travel mode detection. (arXiv:2109.08527v1 [cs.LG])
    (0 min) Travel mode detection has been a hot topic in the field of GPS trajectory-related processing. Previous researchers have developed many mathematical methods to improve the accuracy of detection. Almost all of these methods require a ground truth dataset for training, and many studies collect their own GPS trajectory datasets in customized ways. Currently, there is no open GPS dataset marked with travel mode. Such a dataset would not only save a lot of effort in model development, but also help compare the performance of models. In this study, we propose an open GPS trajectory dataset marked with travel mode, together with a benchmark for travel mode detection. The dataset was collected by 7 independent volunteers in Japan and covers a complete month. The travel modes range from walking to railway. Some routes are traveled repeatedly in different time slots to capture different road and travel conditions. We also provide a case study distinguishing walking and bike trips in a massive GPS trajectory dataset.
    Structured Ensembles: an Approach to Reduce the Memory Footprint of Ensemble Methods. (arXiv:2105.02551v2 [cs.LG] UPDATED)
    (0 min) In this paper, we propose a novel ensembling technique for deep neural networks, which is able to drastically reduce the required memory compared to alternative approaches. In particular, we propose to extract multiple sub-networks from a single, untrained neural network by solving an end-to-end optimization task combining differentiable scaling over the original architecture, with multiple regularization terms favouring the diversity of the ensemble. Since our proposal aims to detect and extract sub-structures, we call it Structured Ensemble. On a large experimental evaluation, we show that our method can achieve higher or comparable accuracy to competing methods while requiring significantly less storage. In addition, we evaluate our ensembles in terms of predictive calibration and uncertainty, showing they compare favourably with the state-of-the-art. Finally, we draw a link with the continual learning literature, and we propose a modification of our framework to handle continuous streams of tasks with a sub-linear memory cost. We compare with a number of alternative strategies to mitigate catastrophic forgetting, highlighting advantages in terms of average accuracy and memory.
    Enforcing fairness in private federated learning via the modified method of differential multipliers. (arXiv:2109.08604v1 [cs.LG])
    (0 min) Federated learning with differential privacy, or private federated learning, provides a strategy to train machine learning models while respecting users' privacy. However, differential privacy can disproportionately degrade the performance of the models on under-represented groups, as these parts of the distribution are difficult to learn in the presence of noise. Existing approaches for enforcing fairness in machine learning models have considered the centralized setting, in which the algorithm has access to the users' data. This paper introduces an algorithm to enforce group fairness in private federated learning, where users' data does not leave their devices. First, the paper extends the modified method of differential multipliers to empirical risk minimization with fairness constraints, thus providing an algorithm to enforce fairness in the central setting. Then, this algorithm is extended to the private federated learning setting. The proposed algorithm, FPFL, is tested on a federated version of the Adult dataset and an "unfair" version of the FEMNIST dataset. The experiments on these datasets show how private federated learning accentuates unfairness in the trained models, and how FPFL is able to mitigate such unfairness.
    Rescaling Egocentric Vision. (arXiv:2006.13256v4 [cs.CV] UPDATED)
    (0 min) This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments). This collection enables new challenges such as action detection and evaluating the "test of time" - i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task and provide baselines and evaluation metrics.
    Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images. (arXiv:2106.00116v3 [cs.LG] UPDATED)
    (0 min) Transfer learning aims to exploit pre-trained models for more efficient follow-up training on wide range of downstream tasks and datasets, enabling successful training also on small data. Recently, strong improvement was shown for transfer learning and model generalization when increasing model, data and compute budget scale in the pre-training. To compare effect of scale both in intra- and inter-domain full and few-shot transfer, in this study we combine for the first time large openly available medical X-Ray chest imaging datasets to reach a dataset scale comparable to ImageNet-1k. We then conduct pre-training and transfer to different natural or medical targets while varying network size and source data scale and domain, being either large natural (ImageNet-1k/21k) or large medical chest X-Ray datasets. We observe strong improvement due to larger pre-training scale for intra-domain natural-natural and medical-medical transfer. For inter-domain natural-medical transfer, we find improvements due to larger pre-training scale on larger X-Ray targets in full shot regime, while for smaller targets and for few-shot regime the improvement is not visible. Remarkably, large networks pre-trained on very large natural ImageNet-21k are as good or better than networks pre-trained on largest available medical X-Ray data when performing transfer to large X-Ray targets. We conclude that high quality models for inter-domain transfer can be also obtained by substantially increasing scale of model and generic natural source data, removing necessity for large domain-specific medical source data in the pre-training. Code is available at: \url{https://github.com/SLAMPAI/large-scale-pretraining-transfer}
    Multi-Level Visual Similarity Based Personalized Tourist Attraction Recommendation Using Geo-Tagged Photos. (arXiv:2109.08275v1 [cs.MM])
    (0 min) Geo-tagged photo based tourist attraction recommendation can discover users' travel preferences from the photos they take, so as to recommend suitable tourist attractions to them. However, existing visual content based methods cannot fully exploit the user and tourist attraction information of photos to extract visual features, and do not differentiate the significance of different photos. In this paper, we propose multi-level visual similarity based personalized tourist attraction recommendation using geo-tagged photos (MEAL). MEAL utilizes the visual contents of photos and interaction behavior data to obtain the final embeddings of users and tourist attractions, which are then used to predict the visit probabilities. Specifically, by crossing the user and tourist attraction information of photos, we define four visual similarity levels and introduce a corresponding quintuplet loss to embed the visual contents of photos. In addition, to capture the significance of different photos, we exploit the self-attention mechanism to obtain the visual representations of users and tourist attractions. We conducted experiments on a dataset crawled from Flickr, and the experimental results demonstrated the advantage of this method.
    Primer: Searching for Efficient Transformers for Language Modeling. (arXiv:2109.08668v1 [cs.LG])
    (0 min) Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
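    As a rough illustration of the two modifications named above, the following PyTorch sketch squares the ReLU activation in the feed-forward block and inserts a depthwise convolution over the sequence dimension after each of the Q, K and V projections of a plain self-attention layer. It is a minimal sketch (no masking, normalization or dropout, arbitrary layer sizes), not the released Primer code.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PrimerStyleAttention(nn.Module):
            """Minimal sketch of the two Primer modifications on top of standard
            multi-head self-attention: (1) a depthwise convolution over the sequence
            dimension after each of the Q, K and V projections, and (2) squared-ReLU
            activations in the feed-forward block."""
            def __init__(self, d_model=256, n_heads=4, kernel_size=3):
                super().__init__()
                self.n_heads, self.d_head = n_heads, d_model // n_heads
                self.qkv = nn.Linear(d_model, 3 * d_model)
                # Depthwise (groups == channels) convolution along the time axis.
                self.dconv = nn.ModuleList(
                    nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size - 1, groups=d_model)
                    for _ in range(3)
                )
                self.out = nn.Linear(d_model, d_model)
                self.ff1 = nn.Linear(d_model, 4 * d_model)
                self.ff2 = nn.Linear(4 * d_model, d_model)

            def forward(self, x):                      # x: (batch, seq, d_model)
                b, t, d = x.shape
                q, k, v = self.qkv(x).chunk(3, dim=-1)
                # Depthwise conv after each projection; trim so the length stays t (causal).
                q, k, v = (conv(z.transpose(1, 2))[..., :t].transpose(1, 2)
                           for conv, z in zip(self.dconv, (q, k, v)))
                def split(z):                          # (b, t, d) -> (b, heads, t, d_head)
                    return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                q, k, v = split(q), split(k), split(v)
                att = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
                y = (att.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, d)
                y = x + self.out(y)
                # Squared-ReLU feed-forward block.
                h = F.relu(self.ff1(y)) ** 2
                return y + self.ff2(h)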
    Causality Learning: A New Perspective for Interpretable Machine Learning. (arXiv:2006.16789v2 [cs.LG] UPDATED)
    (2 min) Recent years have witnessed the rapid growth of machine learning in a wide range of fields such as image recognition, text classification, credit scoring prediction, and recommendation systems. In spite of their strong performance in different sectors, researchers remain concerned about the mechanisms underlying machine learning (ML) techniques, which are inherently black-box and become more complex as they pursue higher accuracy. Therefore, interpreting machine learning models is currently a mainstream topic in the research community. However, traditional interpretable machine learning focuses on association rather than causality. This paper provides an overview of causal analysis with the fundamental background and key concepts, and then summarizes the most recent causal approaches for interpretable machine learning. Evaluation techniques for assessing method quality and open problems in causal interpretability are also discussed.
    Decentralized Global Connectivity Maintenance for Multi-Robot Navigation: A Reinforcement Learning Approach. (arXiv:2109.08536v1 [cs.RO])
    (2 min) Maintaining connectivity during multi-robot navigation is a challenging problem in multi-robot applications. This work investigates how to navigate a multi-robot team in unknown environments while maintaining connectivity. We propose a reinforcement learning (RL) approach to develop a decentralized policy, which is shared among multiple robots. Given range sensor measurements and the positions of other robots, the policy aims to generate control commands for navigation and preserve the global connectivity of the robot team. We incorporate connectivity concerns into the RL framework as constraints and introduce behavior cloning to reduce the exploration complexity of policy optimization. The policy is optimized with all transition data collected by multiple robots in random simulated scenarios. We validate the effectiveness of the proposed approach by comparing different combinations of connectivity constraints and behavior cloning. We also show that our policy can generalize to unseen scenarios, both in simulation and in experiments with holonomic robots.
    Reinforcement Learning on Encrypted Data. (arXiv:2109.08236v1 [cs.LG])
    (0 min) The growing number of applications of Reinforcement Learning (RL) in real-world domains has led to the development of privacy-preserving techniques due to the inherently sensitive nature of data. Most existing works focus on differential privacy, in which information is revealed in the clear to an agent whose learned model should be robust against information leakage to malicious third parties. Motivated by use cases in which only encrypted data might be shared, such as information from sensitive sites, in this work we consider scenarios in which the inputs themselves are sensitive and cannot be revealed. We develop a simple extension to the MDP framework which provides for the encryption of states. We present a preliminary, experimental study of how a DQN agent trained on encrypted states performs in environments with discrete and continuous state spaces. Our results highlight that the agent is still capable of learning in small state spaces even in the presence of non-deterministic encryption, but performance collapses in more complex environments.
    Dynamic Spatiotemporal Graph Convolutional Neural Networks for Traffic Data Imputation with Complex Missing Patterns. (arXiv:2109.08357v1 [cs.LG])
    (0 min) Missing data is an inevitable and ubiquitous problem for traffic data collection in intelligent transportation systems. Despite extensive research regarding traffic data imputation, there still exist two limitations to be addressed: first, existing approaches fail to capture the complex spatiotemporal dependencies in traffic data, especially the dynamic spatial dependencies evolving with time; second, prior studies mainly focus on randomly missing patterns while other more complex missing scenarios are less discussed. To fill these research gaps, we propose a novel deep learning framework called Dynamic Spatiotemporal Graph Convolutional Neural Networks (DSTGCN) to impute missing traffic data. The model combines the recurrent architecture with graph-based convolutions to model the spatiotemporal dependencies. Moreover, we introduce a graph structure estimation technique to model the dynamic spatial dependencies from real-time traffic information and road network structure. Extensive experiments based on two public traffic speed datasets are conducted to compare our proposed model with state-of-the-art deep learning approaches in four types of missing patterns. The results show that our proposed model outperforms existing deep learning models in all kinds of missing scenarios and the graph structure estimation technique contributes to the model performance. We further compare our proposed model with a tensor factorization model and find distinct behaviors across different model families under different training schemes and data availability.
    Online Learning of Network Bottlenecks via Minimax Paths. (arXiv:2109.08467v1 [cs.LG])
    (2 min) In this paper, we study bottleneck identification in networks via extracting minimax paths. Many real-world networks have stochastic weights for which full knowledge is not available in advance. Therefore, we model this task as a combinatorial semi-bandit problem to which we apply a combinatorial version of Thompson Sampling and establish an upper bound on the corresponding Bayesian regret. Due to the computational intractability of the problem, we then devise an alternative problem formulation which approximates the original objective. Finally, we experimentally evaluate the performance of Thompson Sampling with the approximate formulation on real-world directed and undirected networks.
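    The following is a minimal sketch of the combinatorial Thompson Sampling loop described above, assuming independent Gaussian edge-weight models and semi-bandit feedback on the traversed edges; the minimax (bottleneck) path is found with a max-cost variant of Dijkstra's algorithm. All names and the simple running-mean posterior are illustrative, not the paper's exact posterior or approximate objective.
        import heapq
        import numpy as np

        def minimax_path(n, weights, edges, src, dst):
            """Bottleneck-shortest path: minimize the maximum edge weight along the path
            (a Dijkstra variant where the path cost is a max instead of a sum)."""
            adj = [[] for _ in range(n)]
            for idx, (u, v) in enumerate(edges):
                adj[u].append((v, idx)); adj[v].append((u, idx))
            best = [np.inf] * n; best[src] = 0.0
            prev = {src: None}
            heap = [(0.0, src)]
            while heap:
                cost, u = heapq.heappop(heap)
                if u == dst:
                    break
                if cost > best[u]:
                    continue
                for v, idx in adj[u]:
                    c = max(cost, weights[idx])
                    if c < best[v]:
                        best[v] = c; prev[v] = (u, idx)
                        heapq.heappush(heap, (c, v))
            path, node = [], dst
            while prev.get(node) is not None:
                node, idx = prev[node]
                path.append(idx)
            return path                      # list of traversed edge indices

        def thompson_bottleneck(n, edges, src, dst, true_means, rounds=500, obs_std=0.1):
            """Combinatorial Thompson Sampling with semi-bandit feedback on the traversed edges."""
            rng = np.random.default_rng(0)
            mu, counts = np.zeros(len(edges)), np.ones(len(edges))
            for _ in range(rounds):
                sampled = rng.normal(mu, 1.0 / np.sqrt(counts))   # posterior sample per edge
                path = minimax_path(n, sampled, edges, src, dst)  # act greedily on the sample
                obs = rng.normal(true_means[path], obs_std)       # observe the traversed edges
                counts[path] += 1
                mu[path] += (obs - mu[path]) / counts[path]       # running-mean posterior update
            return mu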
    Stereo Video Reconstruction Without Explicit Depth Maps for Endoscopic Surgery. (arXiv:2109.08227v1 [eess.IV])
    (2 min) We introduce the task of stereo video reconstruction or, equivalently, 2D-to-3D video conversion for minimally invasive surgical video. We design and implement a series of end-to-end U-Net-based solutions for this task by varying the input (single frame vs. multiple consecutive frames), loss function (MSE, MAE, or perceptual losses), and network architecture. We evaluate these solutions by surveying ten experts - surgeons who routinely perform endoscopic surgery. We run two separate reader studies: one evaluating individual frames and the other evaluating fully reconstructed 3D video played on a VR headset. In the first reader study, a variant of the U-Net that takes as input multiple consecutive video frames and outputs the missing view performs best. We draw two conclusions from this outcome. First, motion information coming from multiple past frames is crucial in recreating stereo vision. Second, the proposed U-Net variant can indeed exploit such motion information for solving this task. The result from the second study further confirms the effectiveness of the proposed U-Net variant. The surgeons reported that they could successfully perceive depth from the reconstructed 3D video clips. They also expressed a clear preference for the reconstructed 3D video over the original 2D video. These two reader studies strongly support the usefulness of the proposed task of stereo reconstruction for minimally invasive surgical video and indicate that deep learning is a promising approach to this task. Finally, we identify two automatic metrics, LPIPS and DISTS, that are strongly correlated with expert judgement and that could serve as proxies for the latter in future studies.
    TSS: Transformation-Specific Smoothing for Robustness Certification. (arXiv:2002.12398v4 [cs.LG] UPDATED)
    (3 min) As machine learning (ML) systems become pervasive, safeguarding their security is critical. However, recently it has been demonstrated that motivated adversaries are able to mislead ML systems by perturbing test data using semantic transformations. While there exists a rich body of research providing provable robustness guarantees for ML models against $\ell_p$ norm bounded adversarial perturbations, guarantees against semantic perturbations remain largely underexplored. In this paper, we provide TSS -- a unified framework for certifying ML robustness against general adversarial semantic transformations. First, depending on the properties of each transformation, we divide common transformations into two categories, namely resolvable (e.g., Gaussian blur) and differentially resolvable (e.g., rotation) transformations. For the former, we propose transformation-specific randomized smoothing strategies and obtain strong robustness certification. The latter category covers transformations that involve interpolation errors, and we propose a novel approach based on stratified sampling to certify the robustness. Our framework TSS leverages these certification strategies and combines with consistency-enhanced training to provide rigorous certification of robustness. We conduct extensive experiments on over ten types of challenging semantic transformations and show that TSS significantly outperforms the state of the art. Moreover, to the best of our knowledge, TSS is the first approach that achieves nontrivial certified robustness on the large-scale ImageNet dataset. For instance, our framework achieves 30.4% certified robust accuracy against rotation attack (within $\pm 30^\circ$) on ImageNet. Moreover, to consider a broader range of transformations, we show TSS is also robust against adaptive attacks and unforeseen image corruptions such as CIFAR-10-C and ImageNet-C.
    NetTraj: A Network-based Vehicle Trajectory Prediction Model with Directional Representation and Spatiotemporal Attention Mechanisms. (arXiv:2106.11175v2 [cs.LG] UPDATED)
    (0 min) Trajectory prediction of vehicles in city-scale road networks is of great importance to various location-based applications such as vehicle navigation, traffic management, and location-based recommendations. Existing methods typically represent a trajectory as a sequence of grid cells, road segments or intention sets. None of them is ideal, as the cell-based representation ignores the road network structures and the other two are less efficient in analyzing city-scale road networks. Moreover, previous models barely leverage spatial dependencies or only consider them at the grid cell level, ignoring the non-Euclidean spatial structure shaped by irregular road networks. To address these problems, we propose a network-based vehicle trajectory prediction model named NetTraj, which represents each trajectory as a sequence of intersections and associated movement directions, and then feeds them into an LSTM encoder-decoder network for future trajectory generation. Furthermore, we introduce a local graph attention mechanism to capture network-level spatial dependencies of trajectories, and a temporal attention mechanism with a sliding context window to capture both short- and long-term temporal dependencies in trajectory data. Extensive experiments based on two real-world large-scale taxi trajectory datasets show that NetTraj outperforms the existing state-of-the-art methods for vehicle trajectory prediction, validating the effectiveness of the proposed trajectory representation method and spatiotemporal attention mechanisms.
    Achieving Model Fairness in Vertical Federated Learning. (arXiv:2109.08344v1 [cs.LG])
    (0 min) Vertical federated learning (VFL), which enables multiple enterprises possessing non-overlapped features to strengthen their machine learning models without disclosing their private data and model parameters, has received increasing attention lately. Similar to other machine learning algorithms, VFL suffers from fairness issues, i.e., the learned model may be unfairly discriminatory over the group with sensitive attributes. To tackle this problem, we propose a fair VFL framework in this work. First, we systematically formulate the problem of training fair models in VFL, where the learning task is modeled as a constrained optimization problem. To solve it in a federated manner, we consider its equivalent dual form and develop an asynchronous gradient coordinate-descent ascent algorithm, where each data party performs multiple parallelized local updates per communication round to effectively reduce the number of communication rounds. We prove that the algorithm finds a $\delta$-stationary point of the dual objective in $\mathcal{O}(\delta^{-4})$ communication rounds under mild conditions. Finally, extensive experiments on three benchmark datasets demonstrate the superior performance of our method in training fair models.
    Knowledge is reward: Learning optimal exploration by predictive reward cashing. (arXiv:2109.08518v1 [stat.ML])
    (0 min) There is a strong link between the general concept of intelligence and the ability to collect and use information. The theory of Bayes-adaptive exploration offers an attractive optimality framework for training machines to perform complex information gathering tasks. However, the computational complexity of the resulting optimal control problem has limited the diffusion of the theory to mainstream deep AI research. In this paper we exploit the inherent mathematical structure of Bayes-adaptive problems in order to dramatically simplify the problem by making the reward structure denser while simultaneously decoupling the learning of exploitation and exploration policies. The key to this simplification comes from the novel concept of cross-value (i.e. the value of being in an environment while acting optimally according to another), which we use to quantify the value of currently available information. This results in a new denser reward structure that "cashes in" all future rewards that can be predicted from the current information state. In a set of experiments we show that the approach makes it possible to learn challenging information gathering tasks without the use of shaping and heuristic bonuses in situations where the standard RL algorithms fail.
    Post-OCR Paragraph Recognition by Graph Convolutional Networks. (arXiv:2101.12741v5 [cs.CV] UPDATED)
    (0 min) We propose a new approach for paragraph recognition in document images by spatial graph convolutional networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a beta-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With only pure layout input features, the GCN model size is 3-4 orders of magnitude smaller compared to R-CNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.
    A Linearly Convergent Algorithm for Distributed Principal Component Analysis. (arXiv:2101.01300v3 [cs.LG] UPDATED)
    (0 min) Principal Component Analysis (PCA) is the workhorse tool for dimensionality reduction in this era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. Furthermore, the ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. This paper focuses on the dual objective of PCA, namely, dimensionality reduction and decorrelation of features, but in a distributed setting. This requires estimating the eigenvectors of the data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors, when data is distributed across a network of machines. Although a few distributed solutions to the PCA problem have been proposed recently, convergence guarantees and/or communications overhead of these solutions remain a concern. With an eye towards communications efficiency, this paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger's Algorithm (DSA) that estimates the eigenvectors of the data covariance matrix when data is distributed across an undirected and arbitrarily connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.
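    Sanger's rule updates a k x d weight matrix W by ΔW = η (Yᵀ X − tril(Yᵀ Y) W), which drives the rows of W towards the top-k eigenvectors of the data covariance. Below is a toy Python sketch of a distributed variant in which each node mixes its weights with its neighbours' (Metropolis weights) and then takes a local Sanger step; the mixing scheme, step size and all names are illustrative and are not the paper's exact DSA update or its convergence analysis.
        import numpy as np

        def distributed_sanger(local_data, adjacency, k=3, lr=1e-2, iters=500):
            """Toy sketch: each node runs Sanger's rule (the generalized Hebbian algorithm)
            on its local samples and averages its weight matrix with its neighbours'
            through a doubly stochastic mixing matrix."""
            m = len(local_data)                      # number of machines
            d = local_data[0].shape[1]
            rng = np.random.default_rng(0)
            W = [rng.standard_normal((k, d)) * 0.01 for _ in range(m)]
            # Metropolis mixing weights from the undirected adjacency matrix.
            deg = adjacency.sum(axis=1)
            mix = np.zeros((m, m))
            for i in range(m):
                for j in range(m):
                    if adjacency[i, j]:
                        mix[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
                mix[i, i] = 1.0 - mix[i].sum()
            for _ in range(iters):
                new_W = []
                for i in range(m):
                    X = local_data[i]                # (n_i, d) local samples, assumed centred
                    Y = X @ W[i].T                   # (n_i, k) projections
                    # Sanger update: Y^T X - lower_triangular(Y^T Y) W, averaged over samples.
                    grad = (Y.T @ X - np.tril(Y.T @ Y) @ W[i]) / X.shape[0]
                    # Consensus step: mix neighbours' weights, then take a local gradient step.
                    mixed = sum(mix[i, j] * W[j] for j in range(m))
                    new_W.append(mixed + lr * grad)
                W = new_W
            return W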
    Long Short-term Cognitive Networks. (arXiv:2106.16233v2 [cs.LG] UPDATED)
    (0 min) In this paper, we present a recurrent neural system named Long Short-term Cognitive Networks (LSTCNs) as a generalization of the Short-term Cognitive Network (STCN) model. Such a generalization is motivated by the difficulty of forecasting very long time series efficiently. The LSTCN model can be defined as a collection of STCN blocks, each processing a specific time patch of the (multivariate) time series being modeled. In this neural ensemble, each block passes information to the subsequent one in the form of weight matrices representing the prior knowledge. As a second contribution, we propose a deterministic learning algorithm to compute the learnable weights while preserving the prior knowledge resulting from previous learning processes. As a third contribution, we introduce a feature influence score as a proxy to explain the forecasting process in multivariate time series. The simulations using three case studies show that our neural system reports small forecasting errors while being significantly faster than state-of-the-art recurrent models.
    Correlation detection in trees for partial graph alignment. (arXiv:2107.07623v2 [cs.DS] UPDATED)
    (0 min) Motivated by alignment of correlated sparse random graphs, we study a hypothesis testing problem of deciding whether two random trees are correlated or not. Based on this tree detection problem, we propose BPAlign, a message-passing -- belief propagation -- algorithm for graph alignment, which we prove to succeed in polynomial time at partial alignment whenever tree detection is feasible. As a result, our analysis of tree detection reveals new ranges of parameters for which partial alignment of sparse random graphs is feasible in polynomial time. We conjecture that the connection between partial graph alignment and tree detection runs deeper, and that the parameter range where tree detection is impossible, which we partially characterize, corresponds to a region where partial graph alignment is hard (not polytime feasible).
    Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification. (arXiv:2109.00582v2 [stat.ML] UPDATED)
    (0 min) Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels in an ad hoc way to improve the accuracy of multi-class classification, there is no principled approach to guide label combination by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion of outcome "information" conditional on outcome prediction, to guide practitioners on how to combine ambiguous outcome labels. ITCA indicates a balance in the trade-off between prediction accuracy (how well predicted labels agree with actual labels) and prediction resolution (how many labels are predictable). To find the optimal label combination indicated by ITCA, we develop two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: to improve prediction accuracy and to identify ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification.
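    The greedy search could be sketched as repeated pairwise label merging guided by a held-out criterion, as below. Note that the criterion used here (cross-validated accuracy scaled by the log of the number of remaining classes) is only a stand-in that captures the accuracy/resolution trade-off; it is not the paper's ITCA definition, and all names and the integer-label assumption are illustrative.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        def greedy_label_combination(X, y, max_merges=None):
            """Toy sketch: repeatedly merge the pair of (integer) labels that most improves
            a criterion trading off accuracy against label resolution."""
            labels = sorted(set(y))
            mapping = {c: c for c in labels}          # original label -> combined label
            def criterion(y_mapped):
                k = len(set(y_mapped))
                if k < 2:
                    return -np.inf
                acc = cross_val_score(LogisticRegression(max_iter=1000), X, y_mapped, cv=3).mean()
                return acc * np.log2(k)               # stand-in for the ITCA criterion
            y_cur = np.array([mapping[c] for c in y])
            best = criterion(y_cur)
            merges = max_merges or len(labels) - 2
            for _ in range(merges):
                classes = sorted(set(y_cur))
                candidates = []
                for i in range(len(classes)):
                    for j in range(i + 1, len(classes)):
                        y_try = np.where(y_cur == classes[j], classes[i], y_cur)
                        candidates.append((criterion(y_try), classes[i], classes[j]))
                score, a, b = max(candidates)
                if score <= best:                     # stop when no merge improves the criterion
                    break
                best = score
                y_cur = np.where(y_cur == b, a, y_cur)
                mapping = {c: (a if mapping[c] == b else mapping[c]) for c in mapping}
            return mapping, best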
    One-Round Active Learning. (arXiv:2104.11843v2 [cs.LG] UPDATED)
    (0 min) In this work, we initiate the study of one-round active learning, which aims to select a subset of unlabeled data points that achieve the highest model performance after being labeled with only the information from initially labeled data points. The challenge of directly applying existing data selection criteria to the one-round setting is that they are not indicative of model performance when available labeled data is limited. We address the challenge by explicitly modeling the dependence of model performance on the dataset. Specifically, we propose DULO, a data-driven framework for one-round active learning, wherein we learn a model to predict the model performance for a given dataset and then leverage this model to guide the selection of unlabeled data. Our results demonstrate that DULO leads to the state-of-the-art performance on various active learning benchmarks in the one-round setting.
    In-Network Learning: Distributed Training and Inference in Networks. (arXiv:2107.03433v2 [cs.LG] UPDATED)
    (0 min) It is widely perceived that bringing the success of modern machine learning techniques to mobile devices and wireless networks has the potential to enable important new services. This, however, poses significant challenges, essentially because both data and processing power are highly distributed in a wireless network. In this paper, we develop a learning algorithm and an architecture that make use of multiple data streams and processing units, not only during the training phase but also during the inference phase. In particular, the analysis reveals how inference propagates and fuses across a network. We study the design criterion of our proposed method and its bandwidth requirements. We also discuss implementation aspects using neural networks in typical wireless radio access, and provide experiments that illustrate benefits over state-of-the-art techniques.
    Graph Learning for Cognitive Digital Twins in Manufacturing Systems. (arXiv:2109.08632v1 [cs.LG])
    (2 min) Future manufacturing requires complex systems that connect simulation platforms and virtualization with physical data from industrial processes. Digital twins incorporate a physical twin, a digital twin, and the connection between the two. Benefits of using digital twins, especially in manufacturing, are abundant as they can increase efficiency across an entire manufacturing life-cycle. The digital twin concept has become increasingly sophisticated and capable over time, enabled by advances in many technologies. In this paper, we detail the cognitive digital twin as the next stage in the advancement of the digital twin that will help realize the vision of Industry 4.0. Cognitive digital twins will allow enterprises to creatively, effectively, and efficiently exploit implicit knowledge drawn from the experience of existing manufacturing systems. They also enable more autonomous decisions and control, while improving the performance across the enterprise (at scale). This paper presents graph learning as one potential pathway towards enabling cognitive functionalities in manufacturing digital twins. A novel approach that utilizes graph learning to realize cognitive digital twins in the product design stage of manufacturing is presented.
    A review of deep learning methods for MRI reconstruction. (arXiv:2109.08618v1 [eess.IV])
    (2 min) Following the success of deep learning in a wide range of applications, neural network-based machine-learning techniques have received significant interest for accelerating magnetic resonance imaging (MRI) acquisition and reconstruction strategies. A number of ideas inspired by deep learning techniques for computer vision and image processing have been successfully applied to nonlinear image reconstruction in the spirit of compressed sensing for accelerated MRI. Given the rapidly growing nature of the field, it is imperative to consolidate and summarize the large number of deep learning methods that have been reported in the literature, to obtain a better understanding of the field in general. This article provides an overview of the recent developments in neural-network based approaches that have been proposed specifically for improving parallel imaging. A general background and introduction to parallel MRI is also given from a classical view of k-space based reconstruction methods. Image domain based techniques that introduce improved regularizers are covered along with k-space based methods which focus on better interpolation strategies using neural networks. While the field is rapidly evolving with thousands of papers published each year, in this review, we attempt to cover broad categories of methods that have shown good performance on publicly available data sets. Limitations and open problems are also discussed and recent efforts for producing open data sets and benchmarks for the community are examined.
    New Students on Sesame Street: What Order-Aware Matrix Embeddings Can Learn from BERT. (arXiv:2109.08449v1 [cs.CL])
    (2 min) Large-scale pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive in low-resource or large-scale applications. While common approaches reduce the size of PreLMs via same-architecture distillation or pruning, we explore distilling PreLMs into more efficient order-aware embedding models. Our results on the GLUE benchmark show that embedding-centric students, which have learned from BERT, yield scores comparable to DistilBERT on QQP and RTE, often match or exceed the scores of ELMo, and only fall behind on detecting linguistic acceptability.
    Policy Choice and Best Arm Identification: Comments on "Adaptive Treatment Assignment in Experiments for Policy Choice". (arXiv:2109.08229v1 [econ.EM])
    (2 min) The purpose of this paper is to connect the "policy choice" problem, proposed in Kasy and Sautmann (2021), to the frontiers of the bandit literature in machine learning. We discuss how the policy choice problem can be framed so that it is identical to what is called the "best arm identification" (BAI) problem. By connecting the literature, we identify that the asymptotic optimality of policy choice algorithms tackled in Kasy and Sautmann (2021) is a long-standing open question in the literature. Unfortunately, this connection highlights several major issues with the main theorem. In particular, we show that Theorem 1 in Kasy and Sautmann (2021) is false. We find that the proofs of statements (1) and (2) of Theorem 1 are incorrect; the statements themselves may be true, but fixing the proofs is non-trivial. Statement (3) and its proof, on the other hand, are false, which we show by utilizing existing theoretical results in the bandit literature. As this question is critically important, garnering much interest in the last decade within the bandit community, we provide a review of recent developments in the BAI literature. We hope this serves to highlight the relevance to economic problems and stimulate methodological and theoretical developments in the econometric community.
    RAPID-RL: A Reconfigurable Architecture with Preemptive-Exits for Efficient Deep-Reinforcement Learning. (arXiv:2109.08231v1 [cs.LG])
    (2 min) Present-day Deep Reinforcement Learning (RL) systems show great promise towards building intelligent agents surpassing human-level performance. However, the computational complexity associated with the underlying deep neural networks (DNNs) leads to power-hungry implementations. This makes deep RL systems unsuitable for deployment on resource-constrained edge devices. To address this challenge, we propose a reconfigurable architecture with preemptive exits for efficient deep RL (RAPID-RL). RAPID-RL enables conditional activation of DNN layers based on the difficulty level of inputs. This allows the compute effort to be adjusted dynamically during inference while maintaining competitive performance. We achieve this by augmenting a deep Q-network (DQN) with side-branches capable of generating intermediate predictions along with an associated confidence score. We also propose a novel training methodology for learning the actions and branch confidence scores in a dynamic RL setting. Our experiments evaluate the proposed framework for Atari 2600 gaming tasks and a realistic Drone navigation task on an open-source drone simulator (PEDRA). We show that RAPID-RL incurs 0.34x (0.25x) the number of operations (OPS) while maintaining performance above 0.88x (0.91x) on Atari (Drone navigation) tasks, compared to a baseline-DQN without any side-branches. The reduction in OPS leads to fast and efficient inference, proving to be highly beneficial for the resource-constrained edge where making quick decisions with minimal compute is essential.
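    A minimal PyTorch sketch of the preemptive-exit idea is shown below: a small Q-network with side branches, each producing Q-values plus a confidence score, where inference stops at the first branch whose confidence clears a threshold. The confidence head, threshold and all layer sizes are assumptions for illustration, not the RAPID-RL training recipe or architecture.
        import torch
        import torch.nn as nn

        class BranchyDQN(nn.Module):
            """Q-network with preemptive exits: each side branch emits Q-values and a
            confidence score; inference exits at the first sufficiently confident branch."""
            def __init__(self, obs_dim, n_actions, hidden=128, threshold=0.8):
                super().__init__()
                self.threshold = threshold
                self.block1 = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
                self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
                self.block3 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
                # One (Q-head, confidence-head) pair per exit point.
                self.q_heads = nn.ModuleList(nn.Linear(hidden, n_actions) for _ in range(3))
                self.conf_heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(3))

            def forward(self, x):
                """Training pass: return Q-values and confidences from every exit."""
                feats, h = [], x
                for block in (self.block1, self.block2, self.block3):
                    h = block(h)
                    feats.append(h)
                qs = [head(f) for head, f in zip(self.q_heads, feats)]
                confs = [torch.sigmoid(head(f)) for head, f in zip(self.conf_heads, feats)]
                return qs, confs

            @torch.no_grad()
            def act(self, x):
                """Inference on a single observation: exit at the first confident branch."""
                h = x
                for i, block in enumerate((self.block1, self.block2, self.block3)):
                    h = block(h)
                    conf = torch.sigmoid(self.conf_heads[i](h))
                    if i == 2 or conf.item() >= self.threshold:
                        return int(self.q_heads[i](h).argmax(dim=-1).item())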
    Strategic Ranking. (arXiv:2109.08240v1 [cs.GT])
    (2 min) Strategic classification studies the design of a classifier robust to the manipulation of input by strategic individuals. However, the existing literature does not consider the effect of competition among individuals as induced by the algorithm design. Motivated by constrained allocation settings such as college admissions, we introduce strategic ranking, in which the (designed) individual reward depends on an applicant's post-effort rank in a measurement of interest. Our results illustrate how competition among applicants affects the resulting equilibria and model insights. We analyze how various ranking reward designs trade off applicant, school, and societal utility and in particular how ranking design can counter inequities arising from disparate access to resources to improve one's measured score: We find that randomization in the ranking reward design can mitigate two measures of disparate impact, welfare gap and access, whereas non-randomization may induce a high level of competition that systematically excludes a disadvantaged group.
    Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration. (arXiv:2109.08603v1 [cs.LG])
    (2 min) Curiosity-based reward schemes can present powerful exploration mechanisms which facilitate the discovery of solutions for complex, sparse or long-horizon tasks. However, as the agent learns to reach previously unexplored spaces and the objective adapts to reward new areas, many behaviours emerge only to disappear due to being overwritten by the constantly shifting objective. We argue that merely using curiosity for fast environment exploration or as a bonus reward for a specific task does not harness the full potential of this technique and misses useful skills. Instead, we propose to shift the focus towards retaining the behaviours which emerge during curiosity-based learning. We posit that these self-discovered behaviours serve as valuable skills in an agent's repertoire to solve related tasks. Our experiments demonstrate the continuous shift in behaviour throughout training and the benefits of a simple policy snapshot method to reuse discovered behaviour for transfer tasks.
    Recurrence-Aware Long-Term Cognitive Network for Explainable Pattern Classification. (arXiv:2107.03423v2 [cs.LG] UPDATED)
    (0 min) Machine learning solutions for pattern classification problems are nowadays widely deployed in society and industry. However, the lack of transparency and accountability of most accurate models often hinders their safe use. Thus, there is a clear need for developing explainable artificial intelligence mechanisms. There exist model-agnostic methods that summarize feature contributions, but their interpretability is limited to predictions made by black-box models. An open challenge is to develop models that have intrinsic interpretability and produce their own explanations, even for classes of models that are traditionally considered black boxes like (recurrent) neural networks. In this paper, we propose a Long-Term Cognitive Network for interpretable pattern classification of structured data. Our method brings its own mechanism for providing explanations by quantifying the relevance of each feature in the decision process. For supporting the interpretability without affecting the performance, the model incorporates more flexibility through a quasi-nonlinear reasoning rule that allows controlling nonlinearity. Besides, we propose a recurrence-aware decision model that evades the issues posed by unique fixed points while introducing a deterministic learning method to compute the tunable parameters. The simulations show that our interpretable model obtains competitive results when compared to the state-of-the-art white and black-box models.
    Estimating Example Difficulty Using Variance of Gradients. (arXiv:2008.11600v3 [cs.CV] UPDATED)
    (0 min) In machine learning, a question of great interest is understanding what examples are challenging for a model to classify. Identifying atypical examples ensures the safe deployment of models, isolates samples that require further human inspection and provides interpretability into model behavior. In this work, we propose Variance of Gradients (VoG) as a valuable and efficient metric to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. We show that data points with high VoG scores are far more difficult for the model to learn and over-index on corrupted or memorized examples. Further, restricting the evaluation to the test set instances with the lowest VoG improves the model's generalization performance. Finally, we show that VoG is a valuable and efficient ranking for out-of-distribution detection.
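    The VoG idea can be sketched as follows: at several training checkpoints, take the gradient of the true-class logit with respect to the input, then score the example by the variance of those gradients across checkpoints. The PyTorch sketch below uses simplified normalisation choices and hypothetical names; it is not the authors' exact implementation.
        import torch

        def variance_of_gradients(checkpoints, x, y):
            """VoG-style score for one example: gradient of the true-class logit w.r.t. the
            input at several checkpoints, then the per-pixel variance across checkpoints,
            averaged into a single scalar (higher ~ harder / more atypical)."""
            grads = []
            for model in checkpoints:                  # e.g. snapshots saved during training
                model.eval()
                inp = x.clone().detach().requires_grad_(True)
                logit = model(inp.unsqueeze(0))[0, y]  # assumes model returns (1, n_classes) logits
                grad, = torch.autograd.grad(logit, inp)
                grads.append(grad)
            g = torch.stack(grads)                     # (n_checkpoints, *input_shape)
            return g.var(dim=0, unbiased=False).mean().item()

        # Usage sketch: rank a dataset by difficulty for human-in-the-loop auditing.
        # scores = [variance_of_gradients(snapshots, x_i, y_i) for x_i, y_i in dataset]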
    Subtle Inverse Crimes: Naïvely training machine learning algorithms could lead to overly-optimistic results. (arXiv:2109.08237v1 [cs.LG])
    (2 min) While open databases are an important resource in the Deep Learning (DL) era, they are sometimes used "off-label": data published for one task are used for training algorithms for a different one. This work aims to highlight that in some cases, this common practice may lead to biased, overly-optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data preprocessing pipelines. We describe two preprocessing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for Magnetic Resonance Imaging (MRI) reconstruction: Compressed Sensing (CS), Dictionary Learning (DictL), and DL. In this large-scale study we performed extensive computations. Our results demonstrate that the CS, DictL and DL algorithms yield systematically biased results when naïvely trained on seemingly-appropriate data: the Normalized Root Mean Square Error (NRMSE) improves consistently with the preprocessing extent, showing an artificial increase of 25%-48% in some cases. Since this phenomenon is generally unknown, biased results are sometimes published as state-of-the-art; we refer to that as subtle inverse crimes. This work hence raises a red flag regarding naïve off-label usage of Big Data and reveals the vulnerability of modern inverse problem solvers to the resulting bias.
    Learning Enhanced Optimisation for Routing Problems. (arXiv:2109.08345v1 [cs.AI])
    (2 min) Deep learning approaches have shown promising results in solving routing problems. However, there is still a substantial gap in solution quality between machine learning and operations research algorithms. Recently, another line of research has been introduced that fuses the strengths of machine learning and operational research algorithms. In particular, search perturbation operators have been used to improve the solution. Nevertheless, using the perturbation may not guarantee a quality solution. This paper presents "Learning to Guide Local Search" (L2GLS), a learning-based approach for routing problems that uses a penalty term and reinforcement learning to adaptively adjust search efforts. L2GLS combines local search (LS) operators' strengths with penalty terms to escape local optima. Routing problems have many practical applications, often presenting larger instances that are still challenging for many existing algorithms introduced in the learning to optimise field. We show that L2GLS achieves new state-of-the-art results on larger TSP and CVRP over other machine learning methods.
    Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint. (arXiv:2101.02137v5 [cs.LG] UPDATED)
    (2 min) We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning (RL) context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme. The first algorithm is a straightforward combination of importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance-reduced gradient (SVRG) algorithm, incorporates variance reduction in the update iteration. For both algorithms, we derive non-asymptotic bounds that establish convergence to an approximate stationary point. From these results, we infer that the first algorithm converges at a rate that is comparable to the well-known REINFORCE algorithm in an off-policy RL context, while the second algorithm exhibits an improved rate of convergence.
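    The core smoothed functional (SF) idea is to estimate the gradient of a noisy scalar objective by perturbing the parameters with Gaussian noise and forming a finite-difference estimate, e.g. the two-sided estimator ĝ = (1/(2δ)) (J(θ + δu) − J(θ − δu)) u with u ~ N(0, I). Below is a minimal Python sketch with hypothetical names; the specific off-policy objective, variance reduction (SVRG-style) and step-size schedules of the paper are not reproduced.
        import numpy as np

        def sf_gradient_estimate(J, theta, delta=0.05, n_perturbations=16, rng=None):
            """Two-sided smoothed-functional gradient estimate of a scalar objective J,
            e.g. an importance-sampling off-policy value estimate of the policy parameters."""
            rng = rng or np.random.default_rng(0)
            grad = np.zeros_like(theta)
            for _ in range(n_perturbations):
                u = rng.standard_normal(theta.shape)
                grad += (J(theta + delta * u) - J(theta - delta * u)) / (2.0 * delta) * u
            return grad / n_perturbations

        # Usage sketch inside a policy-gradient loop (ascent on the estimated value):
        # theta += step_size * sf_gradient_estimate(off_policy_value_estimate, theta)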
    Conversational Multi-Hop Reasoning with Neural Commonsense Knowledge and Symbolic Logic Rules. (arXiv:2109.08544v1 [cs.AI])
    (2 min) One of the challenges faced by conversational agents is their inability to identify unstated presumptions of their users' commands, a task trivial for humans due to their common sense. In this paper, we propose a zero-shot commonsense reasoning system for conversational agents in an attempt to achieve this. Our reasoner uncovers unstated presumptions from user commands satisfying a general template of if-(state), then-(action), because-(goal). Our reasoner uses a state-of-the-art transformer-based generative commonsense knowledge base (KB) as its source of background knowledge for reasoning. We propose a novel and iterative knowledge query mechanism to extract multi-hop reasoning chains from the neural KB which uses symbolic logic rules to significantly reduce the search space. Like any KB gathered to date, our commonsense KB is prone to missing knowledge. Therefore, we propose to conversationally elicit the missing knowledge from human users with our novel dynamic question generation strategy, which generates and presents contextualized queries to human users. We evaluate the model in a user study with human users, achieving a 35% higher success rate compared to the SOTA.
    Membership Leakage in Label-Only Exposures. (arXiv:2007.15528v3 [cs.LG] UPDATED)
    (2 min) Machine learning (ML) has been widely adopted in various privacy-critical applications, e.g., face recognition and medical image analysis. However, recent research has shown that ML models are vulnerable to attacks against their training data. Membership inference is one major attack in this domain: Given a data sample and model, an adversary aims to determine whether the sample is part of the model's training set. Existing membership inference attacks leverage the confidence scores returned by the model as their inputs (score-based attacks). However, these attacks can be easily mitigated if the model only exposes the predicted label, i.e., the final model decision. In this paper, we propose decision-based membership inference attacks and demonstrate that label-only exposures are also vulnerable to membership leakage. In particular, we develop two types of decision-based attacks, namely the transfer attack and the boundary attack. Empirical evaluation shows that our decision-based attacks can achieve remarkable performance, and even outperform the previous score-based attacks in some cases. We further present new insights on the success of membership inference based on quantitative and qualitative analysis, i.e., member samples of a model are more distant from the model's decision boundary than non-member samples. Finally, we evaluate multiple defense mechanisms against our decision-based attacks and show that our two types of attacks can bypass most of these defenses.
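    The intuition that members sit further from the decision boundary suggests a simple label-only membership signal: estimate how much random perturbation a sample tolerates before its predicted label flips, and threshold that distance. The Python sketch below illustrates this with noise-based probing and hypothetical names; it is a simplified stand-in, not the paper's boundary-attack procedure or its threshold calibration.
        import numpy as np

        def boundary_distance(predict_label, x, y, max_radius=2.0, steps=20, trials=25, rng=None):
            """Estimate the perturbation radius at which the hard label flips for most
            random perturbations; predict_label(x) returns only the predicted label."""
            rng = rng or np.random.default_rng(0)
            for radius in np.linspace(0.0, max_radius, steps):
                flips = 0
                for _ in range(trials):
                    noisy = x + radius * rng.standard_normal(x.shape)
                    if predict_label(noisy) != y:
                        flips += 1
                if flips > trials // 2:        # the label flips for most perturbations here
                    return radius
            return max_radius

        def is_member(predict_label, x, y, threshold):
            # Larger tolerated radius is taken as evidence of membership; the threshold
            # would typically be calibrated on shadow models.
            return boundary_distance(predict_label, x, y) >= threshold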
    Slot Filling for Biomedical Information Extraction. (arXiv:2109.08564v1 [cs.CL])
    (2 min) Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity and relation type specific training data is a major bottleneck in the above sub-tasks. In this work we present a slot filling approach to the task of biomedical IE, effectively replacing the need for entity and relation-specific training data, allowing us to deal with zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end for standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks in the absence of relevant training data. Our code, models and pretrained data are available at https://github.com/healx/biomed-slot-filling.
    Towards agricultural autonomy: crop row detection under varying field conditions using deep learning. (arXiv:2109.08247v1 [cs.CV])
    (2 min) This paper presents a novel metric to evaluate the robustness of deep learning based semantic segmentation approaches for crop row detection under different field conditions encountered by a field robot. A dataset with ten main categories encountered under various field conditions was used for testing. The effect of these conditions on the angular accuracy of crop row detection was compared. A deep convolutional encoder-decoder network is implemented to predict crop row masks using RGB input images. The predicted mask is then sent to a post-processing algorithm to extract the crop rows. The deep learning model was found to be robust against shadows and growth stages of the crop while the performance was reduced under direct sunlight, increasing weed density, tramlines and discontinuities in crop rows when evaluated with the novel metric.
    A Fairness Analysis on Private Aggregation of Teacher Ensembles. (arXiv:2109.08630v1 [cs.LG])
    (2 min) The Private Aggregation of Teacher Ensembles (PATE) is an important private machine learning framework. It combines multiple learning models used as teachers for a student model that learns to predict an output chosen by noisy voting among the teachers. The resulting model satisfies differential privacy and has been shown effective in learning high-quality private models in semisupervised settings or when one wishes to protect the data labels. This paper asks whether this privacy-preserving framework introduces or exacerbates bias and unfairness and shows that PATE can introduce accuracy disparity among individuals and groups of individuals. The paper analyzes which algorithmic and data properties are responsible for the disproportionate impacts, why these aspects are affecting different groups disproportionately, and proposes guidelines to mitigate these effects. The proposed approach is evaluated on several datasets and settings.
    Self-Supervised Neural Architecture Search for Imbalanced Datasets. (arXiv:2109.08580v1 [cs.CV])
    (2 min) Neural Architecture Search (NAS) provides state-of-the-art results when trained on well-curated datasets with annotated labels. However, annotating data or even having a balanced number of samples can be a luxury for practitioners from different scientific fields, e.g., in the medical domain. To that end, we propose a NAS-based framework that makes three contributions: (a) we focus on the self-supervised scenario, i.e., where no labels are required to determine the architecture, (b) we assume the datasets are imbalanced, and (c) we design each component to be able to run on a resource-constrained setup, i.e., on a single GPU (e.g. Google Colab). Our components build on top of recent developments in self-supervised learning [zbontar2021barlow] and self-supervised NAS [kaplan2020self] and extend them to the case of imbalanced datasets. We conduct experiments on an (artificially) imbalanced version of CIFAR-10 and demonstrate that our proposed method outperforms standard neural networks, while using $27\times$ fewer parameters. To validate our assumption on a naturally imbalanced dataset, we also conduct experiments on ChestMNIST and COVID-19 X-ray. The results demonstrate how the proposed method can be used on imbalanced datasets, while it can be fully run on a single GPU. Code is available at https://github.com/TimofeevAlex/ssnas_imbalanced.
    Coordinated Random Access for Industrial IoT With Correlated Traffic By Reinforcement-Learning. (arXiv:2109.08389v1 [cs.NI])
    (2 min) We propose a coordinated random access scheme for industrial internet-of-things (IIoT) scenarios, with machine-type devices (MTDs) generating sporadic correlated traffic. This occurs, e.g., when external events trigger data generation at multiple MTDs simultaneously. Time is divided into frames, each split into slots and each MTD randomly selects one slot for (re)transmission, with probability density functions (PDFs) specific of both the MTD and the number of the current retransmission. PDFs are locally optimized to minimize the probability of packet collision. The optimization problem is modeled as a repeated Markov game with incomplete information, and the linear reward-inaction algorithm is used at each MTD, which provably converges to a deterministic (suboptimal) slot assignment. We compare our solution with both the slotted ALOHA and the min-max pairwise correlation random access schemes, showing that our approach achieves a higher network throughput with moderate traffic intensity.
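    The linear reward-inaction update behind this scheme moves probability mass towards the chosen slot only when the transmission is rewarded: p ← p + λ r (e_a − p), so nothing changes on a collision (r = 0). The Python simulation below is a toy sketch with a single PDF per device and independent traffic; the retransmission-specific PDFs, correlated traffic model and comparison baselines of the paper are omitted, and all names are illustrative.
        import numpy as np

        def simulate_lri_slot_selection(n_devices=10, n_slots=12, frames=5000, lam=0.05, seed=0):
            """Toy linear reward-inaction (L_RI) simulation: each device keeps a probability
            vector over slots, draws a slot per frame, and updates only on success."""
            rng = np.random.default_rng(seed)
            probs = np.full((n_devices, n_slots), 1.0 / n_slots)
            for _ in range(frames):
                choices = np.array([rng.choice(n_slots, p=probs[i]) for i in range(n_devices)])
                slot_counts = np.bincount(choices, minlength=n_slots)
                for i, a in enumerate(choices):
                    if slot_counts[a] == 1:                 # success: reward 1 -> update
                        e = np.zeros(n_slots); e[a] = 1.0
                        probs[i] += lam * (e - probs[i])    # reward-inaction: no change on collision
            return probs                                    # converges towards a deterministic slot assignment

        # With n_slots >= n_devices, devices typically end up concentrated on distinct slots.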
    Context-aware Retail Product Recommendation with Regularized Gradient Boosting. (arXiv:2109.08561v1 [cs.LG])
    (2 min) In the FARFETCH Fashion Recommendation challenge, the participants needed to predict the order in which various products would be shown to a user in a recommendation impression. The data was provided in two phases - a validation phase and a test phase. The validation phase had a labelled training set that contained a binary column indicating whether a product has been clicked or not. The dataset comprises over 5,000,000 recommendation events, 450,000 products and 230,000 unique users. It represents real, unbiased, but anonymised, interactions of actual users of the FARFETCH platform. The final evaluation was done according to the performance in the second phase. A total of 167 participants participated in the challenge, and we secured the 6th rank during the final evaluation with an MRR of 0.4658 on the test set. We have designed a unique context-aware system that takes the similarity of a product to the user context into account to rank products more effectively. Post evaluation, we have been able to fine-tune our approach with an MRR of 0.4784 on the test set, which would have placed us at the 3rd position.
    Using hardware performance counters to speed up autotuning convergence on GPUs. (arXiv:2102.05297v2 [cs.DC] UPDATED)
    (0 min) Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly-performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in processed data or migration to different hardware. In this paper, we introduce a novel method for searching tuning spaces. The method takes advantage of collecting hardware performance counters (also known as profiling counters) during empirical tuning. Those counters are used to navigate the searching process towards faster implementations. The method requires the tuning space to be sampled on any GPU. It builds a problem-specific model, which can be used during autotuning on various, even previously unseen inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compared our method to state of the art and show that our method is superior in terms of the number of searching steps and typically outperforms other searches in terms of convergence time.
    Planning and Learning Using Adaptive Entropy Tree Search. (arXiv:2102.06808v2 [cs.AI] UPDATED)
    (0 min) We present the Adaptive Entropy Tree Search (ANTS) algorithm, a planning method based on the Principle of Maximum Entropy. Importantly, we design ANTS so that it is a practical component of a planning-learning loop, outperforming state-of-the-art methods on the Atari benchmark. The key algorithmic novelty is entropy parameterization, which mitigates sensitivity to the temperature parameter - a bottleneck of the prior maximum entropy planning methods. To confirm our design choices, we perform a comprehensive suite of ablations in isolation from learning. Moreover, we theoretically show that ANTS enjoys exponential convergence in the softmax bandit setting.
    TS-MULE: Local Interpretable Model-Agnostic Explanations for Time Series Forecast Models. (arXiv:2109.08438v1 [cs.LG])
    (0 min) Time series forecasting is a demanding task ranging from weather to failure forecasting with black-box models achieving state-of-the-art performances. However, understanding and debugging are not guaranteed. We propose TS-MULE, a local surrogate model explanation method specialized for time series extending the LIME approach. Our extended LIME works with various ways to segment and perturb the time series data. In our extension, we present six sampling segmentation approaches for time series to improve the quality of surrogate attributions and demonstrate their performances on three deep learning model architectures and three common multivariate time series datasets.
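    One of the simplest segmentation strategies in this spirit is equal-width windows; the LIME-style sketch below masks whole segments (replacing them with the series mean), queries the forecast model, and fits a weighted linear surrogate on the binary segment masks so that its coefficients act as segment attributions. It is a minimal univariate sketch with assumed names and a simple proximity kernel, not the TS-MULE implementation, which offers several further segmentation approaches.
        import numpy as np
        from sklearn.linear_model import Ridge

        def uniform_segments(length, n_segments):
            """Equal-width windows over the time axis (one of several possible segmentations)."""
            edges = np.linspace(0, length, n_segments + 1, dtype=int)
            return list(zip(edges[:-1], edges[1:]))

        def ts_lime_explain(predict, series, n_segments=10, n_samples=500, rng=None):
            """LIME-style surrogate explanation for a univariate series: perturb by masking
            segments, then fit a weighted linear model on the binary masks."""
            rng = rng or np.random.default_rng(0)
            segments = uniform_segments(len(series), n_segments)
            fill = series.mean()
            masks = rng.integers(0, 2, size=(n_samples, n_segments))
            preds, weights = [], []
            for mask in masks:
                perturbed = series.copy()
                for keep, (a, b) in zip(mask, segments):
                    if not keep:
                        perturbed[a:b] = fill           # masked segments replaced by the mean
                preds.append(predict(perturbed))
                # Proximity kernel: perturbations closer to the original get more weight.
                weights.append(np.exp(-(n_segments - mask.sum()) ** 2 / (2.0 * n_segments)))
            surrogate = Ridge(alpha=1.0)
            surrogate.fit(masks, np.array(preds), sample_weight=np.array(weights))
            return list(zip(segments, surrogate.coef_)) # per-segment attributions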
    Integrating Deep Reinforcement and Supervised Learning to Expedite Indoor Mapping. (arXiv:2109.08490v1 [cs.LG])
    (0 min) The challenge of mapping indoor environments is addressed. Typical heuristic algorithms for solving the motion planning problem are frontier-based methods, that are especially effective when the environment is completely unknown. However, in cases where prior statistical data on the environment's architectonic features is available, such algorithms can be far from optimal. Furthermore, their calculation time may increase substantially as more areas are exposed. In this paper we propose two means by which to overcome these shortcomings. One is the use of deep reinforcement learning to train the motion planner. The second is the inclusion of a pre-trained generative deep neural network, acting as a map predictor. Each one helps to improve the decision making through use of the learned structural statistics of the environment, and both, being realized as neural networks, ensure a constant calculation time. We show that combining the two methods can shorten the mapping time, compared to frontier-based motion planning, by up to 75%.
    Boosting Transformers for Job Expression Extraction and Classification in a Low-Resource Setting. (arXiv:2109.08597v1 [cs.CL])
    (2 min) In this paper, we explore possible improvements of transformer models in a low-resource setting. In particular, we present our approaches to tackle the first two of three subtasks of the MEDDOPROF competition, i.e., the extraction and classification of job expressions in Spanish clinical texts. As neither language nor domain experts, we experiment with the multilingual XLM-R transformer model and tackle these low-resource information extraction tasks as sequence-labeling problems. We explore domain- and language-adaptive pretraining, transfer learning and strategic datasplits to boost the transformer model. Our results show strong improvements using these methods by up to 5.3 F1 points compared to a fine-tuned XLM-R model. Our best models achieve 83.2 and 79.3 F1 for the first two tasks, respectively.
    VideoDG: Generalizing Temporal Relations in Videos to Novel Domains. (arXiv:1912.03716v2 [cs.CV] UPDATED)
    (2 min) This paper introduces video domain generalization, where most video classification networks degenerate due to the lack of exposure to target domains of divergent distributions. We observe that global temporal features are less generalizable because of temporal domain shift: videos from unseen domains may exhibit an unexpected absence or misalignment of temporal relations. This finding has motivated us to solve video domain generalization by effectively learning the local-relation features of different timescales, which are more generalizable, and exploiting them along with the global-relation features to maintain discriminability. This paper presents the VideoDG framework with two technical contributions. The first is a new deep architecture named the Adversarial Pyramid Network, which improves the generalizability of video features by capturing the local-relation, global-relation, and cross-relation features progressively. On the basis of pyramid features, the second contribution is a new and robust approach to adversarial data augmentation that can bridge different video domains by improving the diversity and quality of augmented data. We construct three video domain generalization benchmarks in which domains are divided according to different datasets, different consequences of actions, or different camera views, respectively. VideoDG consistently outperforms the combinations of previous video classification models and existing domain generalization methods on all benchmarks.
    An Interpretable Framework for Drug-Target Interaction with Gated Cross Attention. (arXiv:2109.08360v1 [cs.LG])
    (0 min) In silico prediction of drug-target interactions (DTI) is significant for drug discovery because it can largely reduce timelines and costs in the drug development process. Specifically, deep learning-based DTI approaches have shown promising results in terms of accuracy and low cost for the prediction. However, they pay little attention to the interpretability of their prediction results and to feature-level interactions between a drug and a target. In this study, we propose a novel interpretable framework that can provide reasonable cues for the interaction sites. To this end, we elaborately design a gated cross-attention mechanism that cross-attends drug and target features by constructing explicit interactions between these features. The gating function in the method enables neural models to focus on salient regions over the entire sequences of drugs and proteins, and the byproduct of the function, the attention map, could serve as an interpretable factor. The experimental results show the efficacy of the proposed method on two DTI datasets. Additionally, we show that gated cross-attention can react sensitively to mutations, and this result could provide insights into the identification of novel drugs targeting mutant proteins.
    Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions. (arXiv:2106.04484v2 [cs.CV] UPDATED)
    (0 min) Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a more careful look reveals that they often do not understand the rich signal they are being fed with. To understand and better measure the generalization capabilities of VQA systems, we look at their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases which reveal that current VQA systems are still brittle. Finally, we connect between robustness and generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.
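    One plausible instantiation of the RAD idea, read directly from the abstract (consistency of predictions between original and counterfactually augmented examples): among examples the model answers correctly in their original form, measure the fraction it still answers correctly after the focused augmentation. The paper's exact formula may differ from this sketch.

```python
import numpy as np

def rad_score(correct_on_original, correct_on_augmented):
    """One plausible reading of RAD: among examples the model gets right in
    their original form, the fraction it also gets right after the focused
    augmentation. Both inputs are boolean arrays aligned by example."""
    orig = np.asarray(correct_on_original, dtype=bool)
    aug = np.asarray(correct_on_augmented, dtype=bool)
    if orig.sum() == 0:
        return float("nan")
    return float((orig & aug).sum() / orig.sum())

print(rad_score([True, True, False, True], [True, False, False, True]))  # ~0.67
```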
    Realistic PointGoal Navigation via Auxiliary Losses and Information Bottleneck. (arXiv:2109.08677v1 [cs.CV])
    (2 min) We propose a novel architecture and training paradigm for training realistic PointGoal Navigation -- navigating to a target coordinate in an unseen environment under actuation and sensor noise without access to ground-truth localization. Specifically, we find that the primary challenge under this setting is learning localization -- when stripped of idealized localization, agents fail to stop precisely at the goal despite reliably making progress towards it. To address this we introduce a set of auxiliary losses to help the agent learn localization. Further, we explore the idea of treating the precise location of the agent as privileged information -- it is unavailable during test time, however, it is available during training time in simulation. We grant the agent restricted access to ground-truth localization readings during training via an information bottleneck. Under this setting, the agent incurs a penalty for using this privileged information, encouraging the agent to only leverage this information when it is crucial to learning. This enables the agent to first learn navigation and then learn localization instead of conflating these two objectives in training. We evaluate our proposed method in both a semi-idealized setting (noiseless simulation without Compass+GPS) and a realistic setting (noisy simulation). Specifically, our method outperforms existing baselines on the semi-idealized setting by 18%/21% SPL/Success and by 15%/20% SPL/Success in the realistic setting. Our improved Success and SPL metrics indicate our agent's improved ability to accurately self-localize while maintaining a strong navigation policy. Our implementation can be found at https://github.com/NicoGrande/habitat-pointnav-via-ib.
    Natlog: a Lightweight Logic Programming Language with a Neuro-symbolic Touch. (arXiv:2109.08291v1 [cs.PL])
    (2 min) We introduce Natlog, a lightweight Logic Programming language that shares Prolog's unification-driven execution model but has a simplified syntax and semantics. Our proof-of-concept Natlog implementation is tightly embedded in the Python-based deep-learning ecosystem, with a focus on content-driven indexing of ground term datasets. By overriding our symbolic indexing algorithm, the same function can be delegated to a neural network that serves ground facts to Natlog's resolution engine. Our open-source implementation is available as a Python package at https://pypi.org/project/natlog/ .
    Textual Echo Cancellation. (arXiv:2008.06006v4 [eess.AS] UPDATED)
    (0 min) In this paper, we propose Textual Echo Cancellation (TEC) - a framework for cancelling the text-to-speech (TTS) playback echo from overlapping speech recordings. Such a system can largely improve speech recognition performance and user experience for intelligent devices such as smart speakers, as the user can talk to the device while the device is still playing the TTS signal responding to the previous query. We implement this system by using a novel sequence-to-sequence model with multi-source attention that takes both the microphone mixture signal and source text of the TTS playback as inputs, and predicts the enhanced audio. Experiments show that the textual information of the TTS playback is critical to enhancement performance. Besides, the text sequence is much smaller in size compared with the raw acoustic signal of the TTS playback, and can be immediately transmitted to the device or ASR server even before the playback is synthesized. Therefore, our proposed approach effectively reduces Internet communication and latency compared with alternative approaches such as acoustic echo cancellation (AEC).
    Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation. (arXiv:2109.08331v1 [cs.LG])
    (0 min) In recommender systems (RecSys) and real-time bidding (RTB) for online advertisements, we often try to optimize sequential decision making using bandit and reinforcement learning (RL) techniques. In these applications, offline reinforcement learning (offline RL) and off-policy evaluation (OPE) are beneficial because they enable safe policy optimization using only logged data, without any risky online interaction. In this position paper, we explore the potential of using simulation to accelerate practical research of offline RL and OPE, particularly in RecSys and RTB. Specifically, we discuss how simulation can help us conduct empirical research of offline RL and OPE. We take the position that we should effectively use simulations in the empirical research of offline RL and OPE. To refute the counterclaim that experiments using only real-world data are preferable, we first point out the underlying risks and reproducibility issues in real-world experiments. Then, we describe how these issues can be addressed by using simulations. Moreover, we show how to incorporate the benefits of both real-world and simulation-based experiments to defend our position. Finally, we also present an open challenge to further facilitate practical research of offline RL and OPE in RecSys and RTB, with respect to public simulation platforms. As a possible solution, we show our ongoing open source project and its potential use case. We believe that building and utilizing simulation-based evaluation platforms for offline RL and OPE will be of great interest and relevance to the RecSys and RTB community.
    A Point-Cloud Deep Learning Framework for Prediction of Fluid Flow Fields on Irregular Geometries. (arXiv:2010.09469v2 [cs.LG] UPDATED)
    (0 min) We present a novel deep learning framework for flow field predictions in irregular domains when the solution is a function of the geometry of either the domain or objects inside the domain. Grid vertices in a computational fluid dynamics (CFD) domain are viewed as point clouds and used as inputs to a neural network based on the PointNet architecture, which learns an end-to-end mapping between spatial positions and CFD quantities. Using our approach, (i) the network inherits desirable features of unstructured meshes (e.g., fine and coarse point spacing near the object surface and in the far field, respectively), which minimizes network training cost; (ii) object geometry is accurately represented through vertices located on object boundaries, which maintains boundary smoothness and allows the network to detect small changes between geometries; and (iii) no data interpolation is utilized for creating training data; thus accuracy of the CFD data is preserved. None of these features are achievable by extant methods based on projecting scattered CFD data into Cartesian grids and then using regular convolutional neural networks. Incompressible laminar steady flow past a cylinder with various shapes for its cross section is considered. The mass and momentum of predicted fields are conserved. We test the generalizability of our network by predicting the flow around multiple objects as well as an airfoil, even though only single objects and no airfoils are observed during training. The network predicts the flow fields hundreds of times faster than our conventional CFD solver, while maintaining excellent to reasonable accuracy.
    Comfetch: Federated Learning of Large Networks on Memory-Constrained Clients via Sketching. (arXiv:2109.08346v1 [cs.LG])
    (0 min) A popular application of federated learning is using many clients to train a deep neural network, the parameters of which are maintained on a central server. While recent efforts have focused on reducing communication complexity, existing algorithms assume that each participating client is able to download the current and full set of parameters, which may not be a practical assumption depending on the memory constraints of clients such as mobile devices. In this work, we propose a novel algorithm Comfetch, which allows clients to train large networks using compressed versions of the global architecture via Count Sketch, thereby reducing communication and local memory costs. We provide a theoretical convergence guarantee and experimentally demonstrate that it is possible to learn large networks, such as a deep convolutional network and an LSTM, through federated agents training on their sketched counterparts. The resulting global models exhibit competitive test accuracy when compared against the state-of-the-art FetchSGD and the classical FedAvg, both of which require clients to download the full architecture.
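    To make the compression idea concrete, here is a minimal Count Sketch of a long parameter vector, the data structure the abstract names. The federated training loop, the comparison against FetchSGD/FedAvg, and Comfetch's specific way of training against the sketched weights are not shown; the sizes are illustrative.

```python
import numpy as np

class CountSketch:
    """Minimal Count Sketch: compress a long parameter vector into a small
    (rows x width) table; unsketching takes the median of the signed bucket
    values across rows."""
    def __init__(self, dim, rows=5, width=256, seed=0):
        rng = np.random.default_rng(seed)
        self.buckets = rng.integers(0, width, size=(rows, dim))
        self.signs = rng.choice([-1.0, 1.0], size=(rows, dim))
        self.table = np.zeros((rows, width))

    def compress(self, vector):
        self.table[:] = 0.0
        for r in range(self.table.shape[0]):
            np.add.at(self.table[r], self.buckets[r], self.signs[r] * vector)

    def decompress(self):
        rows, _ = self.buckets.shape
        est = np.stack([self.signs[r] * self.table[r, self.buckets[r]]
                        for r in range(rows)])
        return np.median(est, axis=0)

params = np.zeros(10_000)
params[:20] = np.random.randn(20) * 5           # a few large weights
sketch = CountSketch(dim=params.size)
sketch.compress(params)
approx = sketch.decompress()
print(np.abs(approx[:20] - params[:20]).max())  # heavy coordinates recovered well
```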
    A Cautionary Tale of Decorrelating Theory Uncertainties. (arXiv:2109.08159v1 [hep-ph])
    (0 min) A variety of techniques have been proposed to train machine learning classifiers that are independent of a given feature. While this can be an essential technique for enabling background estimation, it may also be useful for reducing uncertainties. We carefully examine theory uncertainties, which typically do not have a statistical origin. We will provide explicit examples of two-point (fragmentation modeling) and continuous (higher-order corrections) uncertainties where decorrelating significantly reduces the apparent uncertainty while the actual uncertainty is much larger. These results suggest that caution should be taken when using decorrelation for these types of uncertainties as long as we do not have a complete decomposition into statistically meaningful components.
    Automatic prior selection for meta Bayesian optimization with a case study on tuning deep neural network optimizers. (arXiv:2109.08215v1 [cs.LG])
    (0 min) The performance of deep neural networks can be highly sensitive to the choice of a variety of meta-parameters, such as optimizer parameters and model hyperparameters. Tuning these well, however, often requires extensive and costly experimentation. Bayesian optimization (BO) is a principled approach to solve such expensive hyperparameter tuning problems efficiently. Key to the performance of BO is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized. In this work, we consider the scenario where we have data from similar functions that allows us to specify a tighter distribution a priori. Specifically, we focus on the common but potentially costly task of tuning optimizer parameters for training neural networks. Building on the meta BO method from Wang et al. (2018), we develop practical improvements that (a) boost its performance by leveraging tuning results on multiple tasks without requiring observations for the same meta-parameter points across all tasks, and (b) retain its regret bound for a special case of our method. As a result, we provide a coherent BO solution for iterative optimization of continuous optimizer parameters. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
    Dropout's Dream Land: Generalization from Learned Simulators to Reality. (arXiv:2109.08342v1 [cs.LG])
    (2 min) A World Model is a generative model used to simulate an environment. World Models have proven capable of learning spatial and temporal representations of Reinforcement Learning environments. In some cases, a World Model offers an agent the opportunity to learn entirely inside of its own dream environment. In this work we explore improving the generalization capabilities from dream environments to real environments (Dream2Real). We present a general approach to improve a controller's ability to transfer from a neural network dream environment to reality at little additional cost. These improvements are gained by drawing on inspiration from Domain Randomization, where the basic idea is to randomize as much of a simulator as possible without fundamentally changing the task at hand. Generally, Domain Randomization assumes access to a pre-built simulator with configurable parameters, but oftentimes this is not available. By training the World Model using dropout, the dream environment is capable of creating a nearly infinite number of different dream environments. Previous use cases of dropout either do not use dropout at inference time or average the predictions generated by multiple sampled masks (Monte-Carlo Dropout). Dropout's Dream Land leverages each unique mask to create a diverse set of dream environments. Our experimental results show that Dropout's Dream Land is an effective technique to bridge the reality gap between dream environments and reality. Furthermore, we perform an extensive set of ablation studies.
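    A toy sketch of the core idea as the abstract states it: sample one dropout mask per dream rollout, so that each episode is effectively a slightly different learned simulator. The two-matrix "world model" below is a stand-in, not the paper's architecture, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy learned "world model": one hidden layer predicting the next state
# from the current state and a scalar action.
W1 = rng.normal(size=(5, 64)) * 0.1
W2 = rng.normal(size=(64, 4)) * 0.1

def dream_step(state, action, mask, keep_prob=0.9):
    """One transition of the dream environment. `mask` is sampled once per
    episode, so each rollout behaves like a slightly different simulator,
    which is the domain-randomization analogy drawn in the abstract."""
    x = np.concatenate([state, [action]])
    hidden = np.tanh(x @ W1)
    hidden = hidden * mask / keep_prob        # inverted dropout with a frozen mask
    return hidden @ W2

def dream_rollout(policy, horizon=50, keep_prob=0.9):
    mask = (rng.random(64) < keep_prob).astype(float)   # one mask per dream
    state = rng.normal(size=4)
    trajectory = []
    for _ in range(horizon):
        state = dream_step(state, policy(state), mask, keep_prob)
        trajectory.append(state)
    return np.array(trajectory)

print(dream_rollout(policy=lambda s: float(np.tanh(s.sum()))).shape)  # (50, 4)
```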
    Interpretable Local Tree Surrogate Policies. (arXiv:2109.08180v1 [cs.LG])
    (2 min) High-dimensional policies, such as those represented by neural networks, cannot be reasonably interpreted by humans. This lack of interpretability reduces the trust users have in policy behavior, limiting their use to low-impact tasks such as video games. Unfortunately, many methods rely on neural network representations for effective learning. In this work, we propose a method to build predictable policy trees as surrogates for policies such as neural networks. The policy trees are easily human interpretable and provide quantitative predictions of future behavior. We demonstrate the performance of this approach on several simulated tasks.
    Sparse Factorization of Large Square Matrices. (arXiv:2109.08184v1 [cs.LG])
    (2 min) Square matrices appear in many machine learning problems and models. Optimization over a large square matrix is expensive in memory and in time. Therefore an economic approximation is needed. Conventional approximation approaches factorize the square matrix into a number of matrices of much lower rank. However, the low-rank constraint is a performance bottleneck if the approximated matrix is intrinsically high-rank or close to full rank. In this paper, we propose to approximate a large square matrix with a product of sparse full-rank matrices. In the approximation, our method needs only $N(\log N)^2$ non-zero numbers for an $N\times N$ full matrix. We present both non-parametric and parametric ways to find the factorization. In the former, we learn the factorizing matrices directly, and in the latter, we train neural networks to map input data to the non-zero matrix entries. The sparse factorization method is tested for a variety of synthetic and real-world square matrices. The experimental results demonstrate that our method gives a better approximation when the approximated matrix is sparse and high-rank. Based on this finding, we use our parametric method as a scalable attention architecture that performs strongly in learning tasks for long sequential data and outperforms the Transformer and several of its variants.
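    A minimal sketch of the non-parametric variant under simplifying assumptions: the matrix is approximated by a product of sparse factors with fixed random sparsity patterns, fitted by plain gradient descent. The paper chains roughly log N sparse full-rank factors to reach the N(log N)^2 budget; only two factors are used here to keep the example short, so this is an illustration of the constraint, not the paper's construction.

```python
import numpy as np

def sparse_factorize(M, density=0.2, steps=2000, lr=0.01, seed=0):
    """Fit M ~= A @ B where A and B are constrained to fixed random sparsity
    patterns; gradients are masked so the patterns never change."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    mask_a = rng.random((n, n)) < density
    mask_b = rng.random((n, n)) < density
    A = rng.normal(scale=0.1, size=(n, n)) * mask_a
    B = rng.normal(scale=0.1, size=(n, n)) * mask_b
    for _ in range(steps):
        R = M - A @ B                        # residual
        grad_a = (-2.0 * R @ B.T) * mask_a   # gradients restricted to the pattern
        grad_b = (-2.0 * A.T @ R) * mask_b
        A -= lr * grad_a
        B -= lr * grad_b
    return A, B

n = 64
M = np.random.default_rng(1).normal(size=(n, n))
A, B = sparse_factorize(M)
print("relative error:", np.linalg.norm(M - A @ B) / np.linalg.norm(M))
```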
    Adaptive Hierarchical Dual Consistency for Semi-Supervised Left Atrium Segmentation on Cross-Domain Data. (arXiv:2109.08311v1 [eess.IV])
    (2 min) Semi-supervised learning is of great significance for learning left atrium (LA) segmentation models when labelled data are insufficient. Generalising semi-supervised learning to cross-domain data is of high importance to further improve model robustness. However, the widely existing distribution difference and sample mismatch between different data domains hinder the generalisation of semi-supervised learning. In this study, we alleviate these problems by proposing an Adaptive Hierarchical Dual Consistency (AHDC) framework for semi-supervised LA segmentation on cross-domain data. The AHDC mainly consists of a Bidirectional Adversarial Inference module (BAI) and a Hierarchical Dual Consistency learning module (HDC). The BAI overcomes the difference of distributions and the sample mismatch between two different domains. It mainly learns two mapping networks adversarially to obtain two matched domains through mutual adaptation. The HDC investigates a hierarchical dual learning paradigm for cross-domain semi-supervised segmentation based on the obtained matched domains. It mainly builds two dual-modelling networks for mining the complementary information in both intra-domain and inter-domain settings. For the intra-domain learning, a consistency constraint is applied to the dual-modelling targets to exploit the complementary modelling information. For the inter-domain learning, a consistency constraint is applied to the LAs modelled by the two dual-modelling networks to exploit the complementary knowledge among different data domains. We demonstrated the performance of our proposed AHDC on four 3D late gadolinium enhancement cardiac MR (LGE-CMR) datasets from different centres and a 3D CT dataset. Compared to other state-of-the-art methods, our proposed AHDC achieved higher segmentation accuracy, which indicates its capability in cross-domain semi-supervised LA segmentation.
    KATANA: Simple Post-Training Robustness Using Test Time Augmentations. (arXiv:2109.08191v1 [cs.CV])
    (2 min) Although Deep Neural Networks (DNNs) achieve excellent performance on many real-world tasks, they are highly vulnerable to adversarial attacks. A leading defense against such attacks is adversarial training, a technique in which a DNN is trained to be robust to adversarial attacks by introducing adversarial noise to its input. This procedure is effective but must be done during the training phase. In this work, we propose a new simple and easy-to-use technique, KATANA, for robustifying an existing pretrained DNN without modifying its weights. For every image, we generate N randomized Test Time Augmentations (TTAs) by applying diverse color, blur, noise, and geometric transforms. Next, we utilize the DNN's logits output to train a simple random forest classifier to predict the real class label. Our strategy achieves state-of-the-art adversarial robustness on diverse attacks with minimal compromise on the classification of natural images. We also test KATANA against two adaptive white-box attacks, and it shows excellent results when combined with adversarial training. Code is available at https://github.com/giladcohen/KATANA.
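    A rough sketch of the pipeline the abstract outlines: generate N randomized test-time augmentations per image, collect the frozen classifier's logits, and fit a random forest on the concatenated logits. The "classifier" below is a stand-in function rather than a pretrained DNN, and the augmentations are simple noise-and-shift placeholders for the paper's color/blur/noise/geometric transforms.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def augment(image, n_tta=8):
    """Cheap stand-ins for the paper's TTAs: a small circular shift plus noise."""
    out = []
    for _ in range(n_tta):
        shifted = np.roll(image, rng.integers(-2, 3), axis=0)
        out.append(shifted + rng.normal(0, 0.05, size=image.shape))
    return out

def tta_logit_features(image, logits_fn, n_tta=8):
    """Concatenate the frozen classifier's logits over all TTAs into one
    feature vector for the downstream random forest."""
    return np.concatenate([logits_fn(aug) for aug in augment(image, n_tta)])

# Stand-in for a pretrained DNN's logits (3 classes) on an 8x8 "image".
def fake_logits(image):
    return np.array([image[:4].sum(), image[4:].sum(), image.std()])

images = [rng.normal(size=(8, 8)) + label for label in rng.integers(0, 3, 200)]
labels = [int(round(img.mean())) for img in images]  # toy ground truth
X = np.stack([tta_logit_features(img, fake_logits) for img in images])
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print("train accuracy:", forest.score(X, labels))
```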
    Beyond Average Performance -- exploring regions of deviating performance for black box classification models. (arXiv:2109.08216v1 [cs.LG])
    (2 min) Machine learning models are becoming increasingly popular in different types of settings. This is mainly caused by their ability to achieve a level of predictive performance that is hard to match by human experts in this new era of big data. With this usage growth comes an increase of the requirements for accountability and understanding of the models' predictions. However, the degree of sophistication of the most successful models (e.g. ensembles, deep learning) is becoming a large obstacle to this endeavour as these models are essentially black boxes. In this paper we describe two general approaches that can be used to provide interpretable descriptions of the expected performance of any black box classification model. These approaches are of high practical relevance as they provide means to uncover and describe in an interpretable way situations where the models are expected to have a performance that deviates significantly from their average behaviour. This may be of critical relevance for applications where costly decisions are driven by the predictions of the models, as it can be used to warn end users against the usage of the models in some specific cases.
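    The abstract does not spell out the approach, so the sketch below is only one way to realize the general idea, not the paper's method: fit a shallow decision tree to a black-box classifier's per-example errors, so that the tree's rules describe interpretable regions where performance deviates from the average. All data and names are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = ((X[:, 0] + X[:, 1] > 0) ^ (X[:, 2] > 1.5)).astype(int)  # harder where x2 > 1.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
errors = (black_box.predict(X_te) != y_te).astype(float)  # per-example error

# A shallow tree over the inputs summarizes where the error rate deviates.
region_tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50, random_state=0)
region_tree.fit(X_te, errors)
print(export_text(region_tree, feature_names=["x0", "x1", "x2", "x3"]))
```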
    Improving Regression Uncertainty Estimation Under Statistical Change. (arXiv:2109.08213v1 [cs.LG])
    (2 min) While deep neural networks are highly performant and successful in a wide range of real-world problems, estimating their predictive uncertainty remains a challenging task. To address this challenge, we propose and implement a loss function for regression uncertainty estimation based on the Bayesian Validation Metric (BVM) framework while using ensemble learning. A series of experiments on in-distribution data show that the proposed method is competitive with existing state-of-the-art methods. In addition, experiments on out-of-distribution data show that the proposed method is robust to statistical change and exhibits superior predictive capability.
    AdaLoss: A computationally-efficient and provably convergent adaptive gradient method. (arXiv:2109.08282v1 [stat.ML])
    (2 min) We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods. We prove that this schedule enjoys linear convergence in linear regression. Moreover, we provide a linear convergence guarantee over the non-convex regime, in the context of two-layer over-parameterized neural networks. If the width of the first hidden layer in the two-layer network is sufficiently large (polynomially), then AdaLoss converges robustly to the global minimum in polynomial time. We numerically verify the theoretical results and extend the scope of the numerical experiments by considering applications in LSTM models for text classification and policy gradients for control problems.
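    A simplified reading of the schedule's core idea, applied to linear regression: accumulate loss information and shrink the step size accordingly. The exact AdaLoss recursion and constants in the paper may differ; this is only a sketch with made-up defaults.

```python
import numpy as np

def adaloss_linear_regression(X, y, alpha=1.0, b0=1.0, steps=500):
    """Gradient descent where the step size shrinks with the accumulated
    loss (a simplified reading of the abstract, not the paper's exact rule)."""
    w = np.zeros(X.shape[1])
    b_sq = b0 ** 2
    for _ in range(steps):
        residual = X @ w - y
        loss = 0.5 * np.mean(residual ** 2)
        grad = X.T @ residual / len(y)
        b_sq += loss                          # accumulate loss information
        w -= (alpha / np.sqrt(b_sq)) * grad   # loss-adaptive step size
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=200)
print(adaloss_linear_regression(X, y).round(2))
```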
    Micro-architectural Analysis of a Learned Index. (arXiv:2109.08495v1 [cs.DS])
    (2 min) Since the publication of The Case for Learned Index Structures in 2018, there has been a rise in research that focuses on learned indexes for different domains and with different functionalities. While the effectiveness of learned indexes as an alternative to traditional index structures such as B+Trees has already been demonstrated by several studies, previous work tends to focus on higher-level performance metrics such as throughput and index size. In this paper, our goal is to dig deeper and investigate how learned indexes behave at a micro-architectural level compared to traditional indexes. More specifically, we focus on the previously proposed learned index structure ALEX, a tree-based in-memory index structure that consists of a hierarchy of machine-learned models. Unlike the original proposal for learned indexes, ALEX is designed from the ground up to allow updates and inserts. Therefore, it enables more dynamic workloads using learned indexes. In this work, we perform a micro-architectural analysis of ALEX and compare its behavior to tree-based index structures that are not based on learned models, i.e., ART and B+Tree. Our results show that ALEX is bound by memory stalls, mainly stalls due to data misses from the last-level cache. Compared to ART and B+Tree, ALEX exhibits fewer stalls and a lower cycles-per-instruction value across different workloads. On the other hand, the number of instructions required to handle out-of-bound inserts in ALEX can increase the instructions needed per request significantly (10X) for write-heavy workloads. However, the micro-architectural behavior shows that this increase in the instruction footprint exhibits high instruction-level parallelism and, therefore, does not negatively impact the overall execution time.
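    For readers unfamiliar with learned indexes, here is a toy one (a single linear model plus an error-bounded local search), which conveys the basic mechanism being profiled. It is deliberately much simpler than ALEX, which uses a hierarchy of models and supports inserts; all names are illustrative.

```python
import numpy as np

class ToyLearnedIndex:
    """Not ALEX: one linear model predicts the position of a key in a sorted
    array, and lookup falls back to a bounded search around the prediction."""
    def __init__(self, keys):
        self.keys = np.sort(keys)
        positions = np.arange(len(self.keys))
        self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
        predicted = np.clip(self.slope * self.keys + self.intercept,
                            0, len(self.keys) - 1)
        self.max_err = int(np.ceil(np.abs(predicted - positions).max()))

    def lookup(self, key):
        guess = int(np.clip(np.rint(self.slope * key + self.intercept),
                            0, len(self.keys) - 1))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        pos = lo + np.searchsorted(self.keys[lo:hi], key)
        return pos if pos < len(self.keys) and self.keys[pos] == key else -1

keys = np.cumsum(np.random.default_rng(0).integers(1, 10, size=100_000))
index = ToyLearnedIndex(keys)
print(index.lookup(keys[12345]) == 12345, index.lookup(keys[-1] + 1) == -1)
```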

2021-09-17

  • cs.CL updates on arXiv.org

    Sister Help: Data Augmentation for Frame-Semantic Role Labeling. (arXiv:2109.07725v1 [cs.CL])
    (0 min) While FrameNet is widely regarded as a rich resource of semantics in natural language processing, a major criticism concerns its lack of coverage and the relative paucity of its labeled data compared to other commonly used lexical resources such as PropBank and VerbNet. This paper reports on a pilot study to address these gaps. We propose a data augmentation approach, which uses existing frame-specific annotation to automatically annotate other lexical units of the same frame which are unannotated. Our rule-based approach defines the notion of a sister lexical unit and generates frame-specific augmented data for training. We present experiments on frame-semantic role labeling which demonstrate the importance of this data augmentation: we obtain a large improvement to prior results on frame identification and argument identification for FrameNet, utilizing both full-text and lexicographic annotations under FrameNet. Our findings on data augmentation highlight the value of automatic resource creation for improved models in frame-semantic parsing.
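    A minimal sketch of the augmentation rule as described: swap the annotated target lexical unit for a "sister" unit from the same frame, keeping the rest of the annotation. The frame lexicon below is a made-up stand-in for FrameNet data, and role spans are assumed to carry over unchanged.

```python
# Hypothetical frame lexicon: lexical units grouped by the frame they evoke.
FRAME_LUS = {
    "Commerce_buy": ["buy", "purchase", "acquire"],
}

def sister_augment(tokens, target_idx, frame):
    """Yield augmented sentences by swapping the annotated target lexical
    unit for each of its sister units in the same frame."""
    original = tokens[target_idx]
    for sister in FRAME_LUS[frame]:
        if sister == original:
            continue
        new_tokens = list(tokens)
        new_tokens[target_idx] = sister
        yield new_tokens

sentence = ["She", "decided", "to", "buy", "the", "old", "house"]
for augmented in sister_augment(sentence, target_idx=3, frame="Commerce_buy"):
    print(" ".join(augmented))
```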
    Alquist 4.0: Towards Social Intelligence Using Generative Models and Dialogue Personalization. (arXiv:2109.07968v1 [cs.CL])
    (0 min) The open-domain dialogue system Alquist has the goal of conducting a coherent and engaging conversation, which can be considered one of the benchmarks of social intelligence. The fourth version of the system, developed within the Alexa Prize Socialbot Grand Challenge 4, brings two main innovations. The first addresses coherence, and the second addresses the engagingness of the conversation. For the innovations regarding coherence, we propose a novel hybrid approach combining hand-designed responses and a generative model. The proposed approach utilizes hand-designed dialogues, out-of-domain detection, and a neural response generator. Hand-designed dialogues walk the user through high-quality conversational flows. The out-of-domain detection recognizes that the user diverges from the predefined flow and prevents the system from producing a scripted response that might not make sense for unexpected user input. Finally, the neural response generator generates a response based on the context of the dialogue that correctly reacts to the unexpected user input and returns the dialogue to the boundaries of hand-designed dialogues. The innovations for engagement that we propose are mostly inspired by the famous exploration-exploitation dilemma. To conduct an engaging conversation with dialogue partners, one has to learn their preferences and interests -- exploration. Moreover, to engage the partner, we have to utilize the knowledge we have already learned -- exploitation. In this work, we present the principles and inner workings of the individual components of the open-domain dialogue system Alquist developed within the Alexa Prize Socialbot Grand Challenge 4 and the experiments we have conducted to evaluate them.
    An Ontology-Based Information Extraction System for Residential Land Use Suitability Analysis. (arXiv:2109.07672v1 [cs.AI])
    (0 min) We propose an Ontology-Based Information Extraction (OBIE) system to automate the extraction of the criteria and values applied in Land Use Suitability Analysis (LUSA) from bylaw and regulation documents related to the geographic area of interest. The results obtained by our proposed LUSA OBIE system (land use suitability criteria and their values) are presented as an ontology populated with instances of the extracted criteria and property values. This latter output ontology is incorporated into a Multi-Criteria Decision Making (MCDM) model applied for constructing suitability maps for different kinds of land uses. The resulting maps may be the final desired product or can be incorporated into the cellular automata urban modeling and simulation for predicting future urban growth. A case study has been conducted where the output from LUSA OBIE is applied to help produce a suitability map for the City of Regina, Saskatchewan, to assist in the identification of suitable areas for residential development. A set of Saskatchewan bylaw and regulation documents were downloaded and input to the LUSA OBIE system. We accessed the extracted information using both the populated LUSA ontology and the set of annotated documents. In this regard, the LUSA OBIE system was effective in producing a final suitability map.
    MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset. (arXiv:2010.04480v2 [cs.CL] UPDATED)
    (0 min) We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains eleven language pairs, with human labels for up to 10,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, as well as titles of the articles where the sentences were extracted from, and the neural MT models used to translate the text.
    A Survey of Online Hate Speech through the Causal Lens. (arXiv:2109.08120v1 [cs.CL])
    (0 min) The societal issue of digital hostility has previously attracted a lot of attention. The topic has amassed an ample body of literature, yet remains as prominent and challenging as ever due to its subjective nature. We posit that a better understanding of this problem will require the use of causal inference frameworks. This survey summarises the relevant research that revolves around estimations of causal effects related to online hate speech. Initially, we argue why re-establishing the exploration of hate speech in causal terms is of the essence. Following that, we give an overview of the leading studies classified with respect to the direction of their outcomes, as well as an outline of all related research, and a summary of open research problems that can influence future work on the topic.
    Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank and the BoAT Annotation Tool. (arXiv:2002.10416v2 [cs.CL] UPDATED)
    (2 min) In this paper, we introduce the resources that we developed for Turkish dependency parsing, which include a novel manually annotated treebank (BOUN Treebank), along with the guidelines we adopted, and a new annotation tool (BoAT). The manual annotation process we employed was shaped and implemented by a team of four linguists and five Natural Language Processing (NLP) specialists. Decisions regarding the annotation of the BOUN Treebank were made in line with the Universal Dependencies (UD) framework as well as our recent efforts for unifying the Turkish UD treebanks through manual re-annotation. To the best of our knowledge, BOUN Treebank is the largest Turkish treebank. It contains a total of 9,761 sentences from various topics including biographical texts, national newspapers, instructional texts, popular culture articles, and essays. In addition, we report the parsing results of a state-of-the-art dependency parser obtained over the BOUN Treebank as well as two other treebanks in Turkish. Our results demonstrate that the unification of the Turkish annotation scheme and the introduction of a more comprehensive treebank lead to improved performance with regard to dependency parsing.
    Enabling Zero-shot Multilingual Spoken Language Translation with Language-Specific Encoders and Decoders. (arXiv:2011.01097v2 [cs.CL] UPDATED)
    (2 min) Current end-to-end approaches to Spoken Language Translation (SLT) rely on limited training resources, especially for multilingual settings. On the other hand, Multilingual Neural Machine Translation (MultiNMT) approaches rely on higher-quality and more massive data sets. Our proposed method extends a MultiNMT architecture based on language-specific encoders-decoders to the task of Multilingual SLT (MultiSLT). Our method entirely eliminates the dependency on MultiSLT data and is able to translate while training only on ASR and MultiNMT data. Our experiments on four different languages show that coupling the speech encoder to the MultiNMT architecture produces similar quality translations compared to a bilingual baseline ($\pm 0.2$ BLEU) while effectively allowing for zero-shot MultiSLT. Additionally, we propose using an Adapter module for coupling the speech inputs. This Adapter module produces consistent improvements up to +6 BLEU points on the proposed architecture and +1 BLEU point on the end-to-end baseline.
    Do Language Models Know the Way to Rome?. (arXiv:2109.07971v1 [cs.CL])
    (2 min) The global geometry of language models is important for a range of applications, but language model probes tend to evaluate rather local relations, for which ground truths are easily obtained. In this paper we exploit the fact that in geography, ground truths are available beyond local relations. In a series of experiments, we evaluate the extent to which language model representations of city and country names are isomorphic to real-world geography, e.g., if you tell a language model where Paris and Berlin are, does it know the way to Rome? We find that language models generally encode limited geographic information, but with larger models performing the best, suggesting that geographic knowledge can be induced from higher-order co-occurrence statistics.
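    A hedged sketch of the kind of probe the abstract suggests: compare pairwise distances between city-name representations with real great-circle distances. The embeddings below are random placeholders standing in for whatever language-model representations one would actually extract; only the probing procedure is illustrated.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# (lat, lon) for a few cities; the embeddings further down are placeholders.
cities = {
    "Paris": (48.86, 2.35), "Berlin": (52.52, 13.40), "Rome": (41.90, 12.50),
    "Madrid": (40.42, -3.70), "Warsaw": (52.23, 21.01), "Athens": (37.98, 23.73),
}

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (*p, *q))
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

names = list(cities)
geo = np.array([haversine_km(cities[a], cities[b])
                for i, a in enumerate(names) for b in names[i + 1:]])

embeddings = np.random.default_rng(0).normal(size=(len(names), 768))  # placeholder
emb = pdist(embeddings, metric="cosine")  # same pair ordering as `geo`

rho, _ = spearmanr(emb, geo)
print(f"Spearman correlation between embedding and geographic distances: {rho:.2f}")
```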
    The NiuTrans System for the WMT21 Efficiency Task. (arXiv:2109.08003v1 [cs.CL])
    (2 min) This paper describes the NiuTrans system for the WMT21 translation efficiency task (this http URL). Following last year's work, we explore various techniques to improve efficiency while maintaining translation quality. We investigate the combinations of lightweight Transformer architectures and knowledge distillation strategies. Also, we improve the translation efficiency with graph optimization, low precision, dynamic batching, and parallel pre/post-processing. Our system can translate 247,000 words per second on an NVIDIA A100, being 3$\times$ faster than last year's system. Our system is the fastest and has the lowest memory consumption on the GPU-throughput track. The code, model, and pipeline will be available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).
    Self-Supervised Contrastive Learning with Adversarial Perturbations for Robust Pretrained Language Models. (arXiv:2107.07610v2 [cs.CL] UPDATED)
    (2 min) This paper improves the robustness of the pretrained language model, BERT, against word substitution-based adversarial attacks by leveraging self-supervised contrastive learning with adversarial perturbations. One advantage of our method compared to previous works is that it is capable of improving model robustness without using any labels. Additionally, we create an adversarial attack for word-level adversarial training on BERT. The attack is efficient, allowing adversarial training for BERT on adversarial examples generated on the fly during training. Experimental results show that our method improves the robustness of BERT against four different word substitution-based adversarial attacks. Additionally, combining our method with adversarial training gives higher robustness than adversarial training alone. Furthermore, to understand why our method can improve the model robustness against adversarial attacks, we study vector representations of clean examples and their corresponding adversarial examples before and after applying our method. As our method improves model robustness with unlabeled raw data, it opens up the possibility of using large text datasets to train robust language models.
    Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN. (arXiv:2107.01366v2 [cs.CL] UPDATED)
    (2 min) Despite their practical success, modern seq2seq architectures are unable to generalize systematically on several SCAN tasks. Hence, it is not clear if SCAN-style compositional generalization is useful in realistic NLP tasks. In this work, we study the benefit that such compositionality brings about to several machine translation tasks. We present several focused modifications of Transformer that greatly improve generalization capabilities on SCAN and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance in low-resource settings and on a newly introduced distribution-shifted English-French translation task. Overall, we find that improvements of a SCAN-capable model do not directly transfer to the resource-rich MT setup. In contrast, in the low-resource setup, general modifications lead to an improvement of up to 13.1% BLEU score w.r.t. a vanilla Transformer. Similarly, an improvement of 14% in an accuracy-based metric is achieved in the introduced compositional English-French translation task. This provides experimental evidence that the compositional generalization assessed in SCAN is particularly useful in resource-starved and domain-shifted scenarios.
    FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. (arXiv:2106.05707v2 [cs.CL] UPDATED)
    (2 min) Fact verification has attracted a lot of attention in the machine learning and natural language processing communities, as it is one of the key methods for detecting misinformation. Existing large-scale benchmarks for this task have focused mostly on textual sources, i.e. unstructured information, and thus ignored the wealth of information available in structured formats, such as tables. In this paper we introduce a novel dataset and benchmark, Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict. Furthermore, we detail our efforts to track and minimize the biases that are present in the dataset and could be exploited by models, e.g. being able to predict the label without using evidence. Finally, we develop a baseline for verifying claims against text and tables which predicts both the correct evidence and verdict for 18% of the claims.
    Cross-utterance Reranking Models with BERT and Graph Convolutional Networks for Conversational Speech Recognition. (arXiv:2106.06922v5 [cs.CL] UPDATED)
    (2 min) How to effectively incorporate cross-utterance information cues into a neural language model (LM) has emerged as one of the intriguing issues for automatic speech recognition (ASR). Existing research efforts on improving contextualization of an LM typically regard previous utterances as a sequence of additional input and may fail to capture complex global structural dependencies among these utterances. In view of this, we in this paper seek to represent the historical context information of an utterance as graph-structured data so as to distill cross-utterance, global word interaction relationships. To this end, we apply a graph convolutional network (GCN) on the resulting graph to obtain the corresponding GCN embeddings of historical words. GCNs have recently found versatile applications in social-network analysis, text summarization, and other areas, due mainly to their ability to effectively capture rich relational information among elements. However, GCNs remain largely underexplored in the context of ASR, especially for dealing with conversational speech. In addition, we frame ASR N-best reranking as a prediction problem, leveraging bidirectional encoder representations from transformers (BERT) as the vehicle to not only seize the local intrinsic word regularity patterns inherent in a candidate hypothesis but also incorporate the cross-utterance, historical word interaction cues distilled by the GCN for promoting performance. Extensive experiments conducted on the AMI benchmark dataset seem to confirm the pragmatic utility of our methods, in relation to some current top-of-the-line methods.
    Establishing Interlingua in Multilingual Language Models. (arXiv:2109.01207v2 [cs.CL] UPDATED)
    (2 min) Large multilingual language models show remarkable zero-shot cross-lingual transfer performance on a range of tasks. Follow-up works hypothesized that these models internally project representations of different languages into a shared interlingual space. However, they produced contradictory results. In this paper, we correct the famous prior work claiming that "BERT is not an Interlingua" and show that with the proper choice of sentence representation different languages actually do converge to a shared space in such language models. Furthermore, we demonstrate that this convergence pattern is robust across four measures of correlation similarity and six mBERT-like models. We then extend our analysis to 28 diverse languages and find that the interlingual space exhibits a particular structure similar to the linguistic relatedness of languages. We also highlight a few outlier languages that seem to fail to converge to the shared space. The code for replicating our results is available at the following URL: https://github.com/maksym-del/interlingua.
    Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. (arXiv:2109.00412v2 [cs.CL] UPDATED)
    (2 min) In multimodal sentiment analysis (MSA), the performance of a model highly depends on the quality of synthesized embeddings. These embeddings are generated from the upstream process called multimodal fusion, which aims to extract and combine the input unimodal raw data to produce a richer multimodal representation. Previous work either back-propagates the task loss or manipulates the geometric property of feature spaces to produce favorable fusion results, which neglects the preservation of critical task-related information that flows from input to the fusion results. In this work, we propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (inter-modality) and between multimodal fusion result and unimodal input in order to maintain task-related information through multimodal fusion. The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task. To address the intractable issue of MI bounds, we further formulate a set of computationally simple parametric and non-parametric methods to approximate their truth value. Experimental results on the two widely used datasets demonstrate the efficacy of our approach. The implementation of this work is publicly available at https://github.com/declare-lab/Multimodal-Infomax.
    HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop. (arXiv:1907.11184v1 [cs.CL] CROSS LISTED)
    (2 min) While the role of humans is increasingly recognized in the machine learning community, the representation of and interaction with models in current human-in-the-loop machine learning (HITL-ML) approaches are too low-level and far removed from humans' conceptual models. We demonstrate HEIDL, a prototype HITL-ML system that exposes the machine-learned model through high-level, explainable linguistic expressions formed of predicates representing the semantic structure of text. In HEIDL, the human's role is elevated from simply evaluating model predictions to interpreting and even updating the model logic directly by enabling interaction with rule predicates themselves. Raising the currency of interaction to such semantic levels calls for new interaction paradigms between humans and machines that result in improved productivity for the text analytics model development process. Moreover, by involving humans in the process, the human-machine co-created models generalize better to unseen data as domain experts are able to instill their expertise by extrapolating from what has been learned by automated algorithms from few labelled data.
    Locating Language-Specific Information in Contextualized Embeddings. (arXiv:2109.08040v1 [cs.CL])
    (2 min) Multilingual pretrained language models (MPLMs) exhibit multilinguality and are well suited for transfer across languages. Most MPLMs are trained in an unsupervised fashion and the relationship between their objective and multilinguality is unclear. More specifically, the question arises whether MPLM representations are language-agnostic or simply interleave well with learned task prediction heads. In this work, we locate language-specific information in MPLMs and identify its dimensionality and the layers where this information occurs. We show that language-specific information is scattered across many dimensions, which can be projected into a linear subspace. Our study contributes to a better understanding of MPLM representations, going beyond treating them as unanalyzable blobs of information.
    Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification. (arXiv:2109.07604v1 [cs.CL])
    (2 min) Traditional hand-crafted linguistically-informed features have often been used for distinguishing between translated and original non-translated texts. By contrast, to date, neural architectures without manual feature engineering have been less explored for this task. In this work, we (i) compare the traditional feature-engineering-based approach to the feature-learning-based one and (ii) analyse the neural architectures in order to investigate how well the hand-crafted features explain the variance in the neural models' predictions. We use pre-trained neural word embeddings, as well as several end-to-end neural architectures in both monolingual and multilingual settings and compare them to feature-engineering-based SVM classifiers. We show that (i) neural architectures outperform other approaches by more than 20 accuracy points, with the BERT-based model performing the best in both the monolingual and multilingual settings; (ii) while many individual hand-crafted translationese features correlate with neural model predictions, feature importance analysis shows that the most important features for neural and classical architectures differ; and (iii) our multilingual experiments provide empirical evidence for translationese universals across languages.
    Automatic Error Type Annotation for Arabic. (arXiv:2109.08068v1 [cs.CL])
    (2 min) We present ARETA, an automatic error type annotation system for Modern Standard Arabic. We design ARETA to address Arabic's morphological richness and orthographic ambiguity. We base our error taxonomy on the Arabic Learner Corpus (ALC) Error Tagset with some modifications. ARETA achieves a performance of 85.8% (micro average F1 score) on a manually annotated blind test portion of ALC. We also demonstrate ARETA's usability by applying it to a number of submissions from the QALB 2014 shared task for Arabic grammatical error correction. The resulting analyses give helpful insights on the strengths and weaknesses of different submissions, which is more useful than the opaque M2 scoring metrics used in the shared task. ARETA employs a large Arabic morphological analyzer, but is completely unsupervised otherwise. We make ARETA publicly available.
    Regex Queries over Incomplete Knowledge Bases. (arXiv:2005.00480v2 [cs.CL] UPDATED)
    (2 min) We propose the novel task of answering regular expression queries (containing disjunction ($\vee$) and Kleene plus ($+$) operators) over incomplete KBs. The answer set of these queries potentially has a large number of entities, hence previous works for single-hop queries in KBC that model a query as a point in high-dimensional space are not as effective. In response, we develop RotatE-Box -- a novel combination of RotatE and box embeddings. It can model more relational inference patterns compared to existing embedding based models. Furthermore, we define baseline approaches for embedding based KBC models to handle regex operators. We demonstrate performance of RotatE-Box on two new regex-query datasets introduced in this paper, including one where the queries are harvested based on actual user query logs. We find that our final RotatE-Box model significantly outperforms models based on just RotatE and just box embeddings.
    Divided We Rule: Influencer Polarization on Twitter During Political Crises in India. (arXiv:2105.08361v2 [cs.SI] UPDATED)
    (2 min) Influencers are key to the nature and networks of information propagation on social media. Influencers are particularly important in political discourse through their engagement with issues, and may derive their legitimacy either solely or in large part through online operation, or have an offline sphere of expertise such as entertainers, journalists etc. To quantify influencers' political engagement and polarity, we use Google's Universal Sentence Encoder (USE) to encode the tweets of 6k influencers and 26k Indian politicians during political crises in India. We then obtain aggregate vector representations of the influencers based on their tweet embeddings, which alongside retweet graphs help compute their stance and polarity with respect to these political issues. We find that influencers engage with the topics in a partisan manner, with polarized influencers being rewarded with increased retweeting and following. Moreover, we observe that specific groups of influencers are consistently polarized across all events. We conclude by discussing how our study provides insights into the political schisms of present-day India, but also offers a means to study the role of influencers in exacerbating political polarization in other contexts.
    Incremental Learning for End-to-End Automatic Speech Recognition. (arXiv:2005.04288v3 [eess.AS] UPDATED)
    (2 min) In this paper, we propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR) which enables an ASR system to perform well on new tasks while maintaining the performance on its originally learned ones. To mitigate catastrophic forgetting during incremental learning, we design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions. Our method works without access to the training data of original tasks, which addresses the cases where the previous data is no longer available or joint training is costly. Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting. Furthermore, in two practical scenarios, compared to the target-reference joint training method, the performance drop of our method is 0.02% Character Error Rate (CER), which is 97% smaller than the drops of the baseline methods.
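    To illustrate the response-based half of the distillation objective (the explainability-based term is specific to the paper and omitted), here is a generic sketch: cross-entropy on the new task plus a KL term that keeps the new model's softened outputs close to the frozen original model's. All weights and temperatures are illustrative defaults, not the paper's settings.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def incremental_loss(new_logits, old_logits, labels, temperature=2.0, kd_weight=0.5):
    """Cross-entropy on the new task plus response-based distillation that
    keeps the new model close to the frozen old model's output distribution."""
    probs = softmax(new_logits)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    kd = np.mean(np.sum(p_old * (np.log(p_old + 1e-12) - np.log(p_new + 1e-12)),
                        axis=-1))
    return (1 - kd_weight) * ce + kd_weight * (temperature ** 2) * kd

rng = np.random.default_rng(0)
new_logits = rng.normal(size=(4, 10))
old_logits = new_logits + 0.1 * rng.normal(size=(4, 10))
print(incremental_loss(new_logits, old_logits, labels=np.array([1, 3, 5, 7])))
```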
    I Wish I Would Have Loved This One, But I Didn't -- A Multilingual Dataset for Counterfactual Detection in Product Reviews. (arXiv:2104.06893v2 [cs.CL] UPDATED)
    (2 min) Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese languages. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced due to cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged to learn accurate CFD models. Applying machine translation on English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.
    TWEAC: Transformer with Extendable QA Agent Classifiers. (arXiv:2104.07081v2 [cs.CL] UPDATED)
    (2 min) Question answering systems should help users to access knowledge on a broad range of topics and to answer a wide array of different questions. Most systems fall short of this expectation as they are only specialized in one particular setting, e.g., answering factual questions with Wikipedia data. To overcome this limitation, we propose composing multiple QA agents within a meta-QA system. We argue that there exists a wide range of specialized QA agents in the literature. Thus, we address the central research question of how to effectively and efficiently identify suitable QA agents for any given question. We study both supervised and unsupervised approaches to address this challenge, showing that TWEAC -- Transformer with Extendable Agent Classifiers -- achieves the best performance overall with 94% accuracy. We provide extensive insights on the scalability of TWEAC, demonstrating that it scales robustly to over 100 QA agents with each providing just 1000 examples of questions they can answer. Our code and data are available at: https://github.com/UKPLab/TWEAC-qa-agent-selection
    Zero-Shot Open Information Extraction using Question Generation and Reading Comprehension. (arXiv:2109.08079v1 [cs.IR])
    (2 min) Typically, Open Information Extraction (OpenIE) focuses on extracting triples, representing a subject, a relation, and the object of the relation. However, most of the existing techniques are based on a predefined set of relations in each domain, which limits their applicability to newer domains where these relations may be unknown, such as financial documents. This paper presents a zero-shot open information extraction technique that extracts the entities (value) and their descriptions (key) from a sentence, using an off-the-shelf machine reading comprehension (MRC) model. The input questions to this model are created using a novel noun phrase generation method. This method takes the context of the sentence into account and can create a wide variety of questions, making our technique domain independent. Given the questions and the sentence, our technique uses the MRC model to extract entities (value). The noun phrase corresponding to the question with the highest confidence is taken as the description (key). This paper also introduces the EDGAR10-Q dataset, which is based on publicly available financial documents from corporations listed with the US Securities and Exchange Commission (SEC). The dataset consists of paragraphs, tagged values (entities), and their keys (descriptions), and is one of the largest entity extraction datasets. This dataset will be a valuable addition to the research community, especially in the financial domain. Finally, the paper demonstrates the efficacy of the proposed technique on the EDGAR10-Q and ADE corpus drug dosage datasets, where it obtained 86.84% and 97% accuracy, respectively.
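    A minimal sketch of the extract-by-asking idea described above, using an off-the-shelf extractive QA checkpoint from the HuggingFace Hub; the noun-phrase-derived questions and the confidence threshold are illustrative stand-ins for the paper's automatic question generation, not its exact method:

        from transformers import pipeline

        # Any SQuAD-style extractive QA checkpoint can serve as the MRC model.
        qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

        sentence = ("Revenue for the quarter was $12.3 million, "
                    "an increase of 4.2% over the prior year.")

        # Hypothetical noun-phrase-derived questions; the paper builds these
        # automatically from the noun phrases of the sentence itself.
        questions = ["What was the revenue for the quarter?",
                     "What was the increase over the prior year?"]

        for q in questions:
            pred = qa(question=q, context=sentence)
            if pred["score"] > 0.3:              # illustrative confidence threshold
                print(q, "->", pred["answer"])   # noun phrase = key, answer span = value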
    Directed Acyclic Graph Network for Conversational Emotion Recognition. (arXiv:2105.12907v2 [cs.CL] UPDATED)
    (2 min) The modeling of conversational context plays a vital role in emotion recognition from conversation (ERC). In this paper, we put forward a novel idea of encoding the utterances with a directed acyclic graph (DAG) to better model the intrinsic structure within a conversation, and design a directed acyclic neural network, namely DAG-ERC, to implement this idea. In an attempt to combine the strengths of conventional graph-based neural models and recurrence-based neural models, DAG-ERC provides a more intuitive way to model the information flow between long-distance conversation background and nearby context. Extensive experiments are conducted on four ERC benchmarks with state-of-the-art models employed as baselines for comparison. The empirical results demonstrate the superiority of this new model and confirm the motivation of the directed acyclic graph architecture for ERC.
    Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. (arXiv:2104.08678v2 [cs.CL] UPDATED)
    (2 min) Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive which limits the scale of the collected data. In this work, we are the first to use synthetic adversarial data generation to make question answering models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA dataset by 3.7F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
    Discrete representations in neural models of spoken language. (arXiv:2105.05582v2 [cs.CL] UPDATED)
    (2 min) The distributed and continuous representations used by neural networks are at odds with representations employed in linguistics, which are typically symbolic. Vector quantization has been proposed as a way to induce discrete neural representations that are closer in nature to their linguistic counterparts. However, it is not clear which metrics are the best-suited to analyze such discrete representations. We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language. We compare the results they show when applied to two different models, while systematically studying the effect of the placement and size of the discretization layer. We find that different evaluation regimes can give inconsistent results. While we can attribute them to the properties of the different metrics in most cases, one point of concern remains: the use of minimal pairs of phoneme triples as stimuli disadvantages larger discrete unit inventories, unlike metrics applied to complete utterances. Furthermore, while in general vector quantization induces representations that correlate with units posited in linguistics, the strength of this correlation is only moderate.
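    As background on the discretization layer the abstract refers to, a minimal vector-quantization sketch with a straight-through gradient; this is a generic VQ-VAE-style layer, not the specific weakly supervised models compared in the paper:

        import torch
        import torch.nn as nn

        class VectorQuantizer(nn.Module):
            """Map continuous frame vectors to the nearest entry of a discrete codebook."""
            def __init__(self, num_codes=256, dim=128):
                super().__init__()
                self.codebook = nn.Embedding(num_codes, dim)

            def forward(self, z):                                   # z: (batch, time, dim)
                flat = z.reshape(-1, z.size(-1))
                dists = torch.cdist(flat, self.codebook.weight)     # (batch*time, num_codes)
                codes = dists.argmin(dim=-1).reshape(z.shape[:-1])  # discrete unit IDs
                quantized = self.codebook(codes)
                # Straight-through estimator: gradients flow to z as if quantization were identity.
                quantized = z + (quantized - z).detach()
                return quantized, codes

        vq = VectorQuantizer()
        frames = torch.randn(2, 50, 128)        # e.g. outputs of a speech encoder
        quantized, codes = vq(frames)
        print(codes.shape)                      # (2, 50) discrete units, the object the metrics analyze

    The codebook size and the placement of such a layer in the network are exactly the design choices the paper varies systematically.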
    MeLT: Message-Level Transformer with Masked Document Representations as Pre-Training for Stance Detection. (arXiv:2109.08113v1 [cs.CL])
    (2 min) Much of natural language processing is focused on leveraging large capacity language models, typically trained over single messages with a task of predicting one or more tokens. However, modeling human language at higher levels of context (i.e., sequences of messages) is under-explored. In stance detection and other social media tasks where the goal is to predict an attribute of a message, we have contextual data that is loosely semantically connected by authorship. Here, we introduce Message-Level Transformer (MeLT) -- a hierarchical message-encoder pre-trained over Twitter and applied to the task of stance prediction. We focus on stance prediction as a task benefiting from knowing the context of the message (i.e., the sequence of previous messages). The model is trained using a variant of masked language modeling, where instead of predicting tokens, it seeks to generate an entire masked (aggregated) message vector via a reconstruction loss. We find that applying this pre-trained masked message-level transformer to the downstream task of stance detection achieves F1 performance of 67%.
    Revisiting Tri-training of Dependency Parsers. (arXiv:2109.08122v1 [cs.CL])
    (2 min) We compare two orthogonal semi-supervised learning techniques, namely tri-training and pretrained word embeddings, in the task of dependency parsing. We explore language-specific FastText and ELMo embeddings and multilingual BERT embeddings. We focus on a low resource scenario as semi-supervised learning can be expected to have the most impact here. Based on treebank size and available ELMo models, we select Hungarian, Uyghur (a zero-shot language for mBERT) and Vietnamese. Furthermore, we include English in a simulated low-resource setting. We find that pretrained word embeddings make more effective use of unlabelled data than tri-training but that the two approaches can be successfully combined.
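    A schematic of the classic tri-training loop the paper revisits, shown with generic scikit-learn classifiers on synthetic data rather than the dependency parsers and treebanks used in the paper, and omitting the error-rate checks of the full algorithm:

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.utils import resample

        X, y = make_classification(n_samples=600, n_features=20, random_state=0)
        X_lab, y_lab, X_unlab = X[:100], y[:100], X[100:]

        # Three learners, each bootstrapped on the small labelled set.
        models = [LogisticRegression(max_iter=1000).fit(*resample(X_lab, y_lab, random_state=i))
                  for i in range(3)]

        for _ in range(3):                          # a few tri-training rounds
            preds = [m.predict(X_unlab) for m in models]
            for i in range(3):
                j, k = [x for x in range(3) if x != i]
                agree = preds[j] == preds[k]        # the other two models agree
                if agree.any():
                    # Retrain model i on gold labels plus the agreed-upon pseudo-labels.
                    X_new = np.vstack([X_lab, X_unlab[agree]])
                    y_new = np.concatenate([y_lab, preds[j][agree]])
                    models[i] = LogisticRegression(max_iter=1000).fit(X_new, y_new)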
    MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education. (arXiv:2106.07340v3 [cs.CL] UPDATED)
    (2 min) Since the introduction of the original BERT (i.e., BASE BERT), researchers have developed various customized BERT models with improved performance for specific domains and tasks by exploiting the benefits of transfer learning. Due to the nature of mathematical texts, which often use domain specific vocabulary along with equations and math symbols, we posit that the development of a new BERT model for mathematics would be useful for many mathematical downstream tasks. In this resource paper, we introduce our multi-institutional effort (i.e., two learning platforms and three academic institutions in the US) toward this need: MathBERT, a model created by pre-training the BASE BERT model on a large mathematical corpus ranging from pre-kindergarten (pre-k), to high-school, to college graduate level mathematical content. In addition, we select three general NLP tasks that are often used in mathematics education: prediction of knowledge component, auto-grading open-ended Q&A, and knowledge tracing, to demonstrate the superiority of MathBERT over BASE BERT. Our experiments show that MathBERT outperforms prior best methods by 1.2-22% and BASE BERT by 2-8% on these tasks. In addition, we build a mathematics specific vocabulary 'mathVocab' to train with MathBERT. We discover that MathBERT pre-trained with 'mathVocab' outperforms MathBERT trained with the BASE BERT vocabulary (i.e., 'origVocab'). MathBERT is currently being adopted at the participating learning platforms: Stride, Inc., a commercial educational resource provider, and ASSISTments.org, a free online educational platform. We release MathBERT for public usage at: https://github.com/tbs17/MathBERT.
    Does Summary Evaluation Survive Translation to Other Languages?. (arXiv:2109.08129v1 [cs.CL])
    (2 min) The creation of a large summarization quality dataset is a considerable, expensive, time-consuming effort, requiring careful planning and setup. It includes producing human-written and machine-generated summaries and evaluation of the summaries by humans, preferably by linguistic experts, and by automatic evaluation tools. If such an effort is made in one language, it would be beneficial to be able to use it in other languages. To investigate how much we can trust the translation of such a dataset without repeating human annotations in another language, we translated an existing English summarization dataset, SummEval, into four different languages and analyzed the scores from the automatic evaluation metrics in the translated languages, as well as their correlation with human annotations in the source language. Our results reveal that although translation changes the absolute value of automatic scores, the scores keep the same rank order and approximately the same correlations with human annotations.
    The Concept of Semantic Value in Social Network Analysis: an Application to Comparative Mythology. (arXiv:2109.08023v1 [cs.SI])
    (2 min) Human sciences have traditionally relied on human reasoning and intelligence to infer knowledge from a wide range of sources, such as oral and written narrations, reports, and traditions. Here we develop an extension of classical social network analysis approaches to incorporate the concept of meaning in each actor, as a means to quantify and infer further knowledge from the original source of the network. This extension is based on a new affinity function, the semantic affinity, that establishes fuzzy-like relationships between the different actors in the network, using combinations of affinity functions. We also propose a new heuristic algorithm based on the shortest capacity problem to compute this affinity function. We use these concepts of meaning and semantic affinity to analyze and compare the gods and heroes from three different classical mythologies: Greek, Celtic and Nordic. We study the relationships within each individual mythology and those of the common structure formed when we fuse the three of them. We show a strong connection between the Celtic and Nordic gods and that Greek mythology puts more emphasis on heroic characters than on deities. Our approach provides a technique to highlight and quantify important relationships in the original domain of the network not deducible from its structural properties.
    Efficient Attribute Injection for Pretrained Language Models. (arXiv:2109.07953v1 [cs.CL])
    (2 min) Metadata attributes (e.g., user and product IDs from reviews) can be incorporated as additional inputs to neural-based NLP models, by modifying the architecture of the models, in order to improve their performance. Recent models however rely on pretrained language models (PLMs), where previously used techniques for attribute injection are either nontrivial or ineffective. In this paper, we propose a lightweight and memory-efficient method to inject attributes to PLMs. We extend adapters, i.e. tiny plug-in feed-forward modules, to include attributes both independently of or jointly with the text. To limit the increase of parameters especially when the attribute vocabulary is large, we use low-rank approximations and hypercomplex multiplications, significantly decreasing the total parameters. We also introduce training mechanisms to handle domains in which attributes can be multi-labeled or sparse. Extensive experiments and analyses on eight datasets from different domains show that our method outperforms previous attribute injection methods and achieves state-of-the-art performance on various datasets.
    TruthfulQA: Measuring How Models Mimic Human Falsehoods. (arXiv:2109.07958v1 [cs.CL])
    (2 min) We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. For example, the 6B-parameter GPT-J model was 17% less truthful than its 125M-parameter counterpart. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.
    KnowMAN: Weakly Supervised Multinomial Adversarial Networks. (arXiv:2109.07994v1 [cs.LG])
    (2 min) The absence of labeled data for training neural models is often addressed by leveraging knowledge about the specific task, resulting in heuristic but noisy labels. The knowledge is captured in labeling functions, which detect certain regularities or patterns in the training samples and annotate corresponding labels for training. This process of weakly supervised training may result in an over-reliance on the signals captured by the labeling functions and hinder models from exploiting other signals or generalizing well. We propose KnowMAN, an adversarial scheme that makes it possible to control the influence of signals associated with specific labeling functions. KnowMAN forces the network to learn representations that are invariant to those signals and to pick up other signals that are more generally associated with an output label. KnowMAN strongly improves results compared to direct weakly supervised learning with a pre-trained transformer language model and a feature-based baseline.
    Detecting Propaganda Techniques in Memes. (arXiv:2109.08013v1 [cs.CV])
    (2 min) Propaganda can be defined as a form of communication that aims to influence the opinions or the actions of people towards a specific goal; this is achieved by means of well-defined rhetorical and psychological devices. Propaganda, in the form we know it today, can be dated back to the beginning of the 17th century. However, with the advent of the Internet and social media, it has started to spread on a much larger scale than before, thus becoming a major societal and political issue. Nowadays, a large fraction of propaganda in social media is multimodal, mixing textual with visual content. With this in mind, here we propose a new multi-label multimodal task: detecting the type of propaganda techniques used in memes. We further create and release a new corpus of 950 memes, carefully annotated with 22 propaganda techniques, which can appear in the text, in the image, or in both. Our analysis of the corpus shows that understanding both modalities together is essential for detecting these techniques. This is further confirmed in our experiments with several state-of-the-art multimodal models.
    Improving Unsupervised Question Answering via Summarization-Informed Question Generation. (arXiv:2109.07954v1 [cs.CL])
    (2 min) Question Generation (QG) is the task of generating a plausible question for a given passage-answer pair. Template-based QG uses linguistically-informed heuristics to transform declarative sentences into interrogatives, whereas supervised QG uses existing Question Answering (QA) datasets to train a system to generate a question given a passage and an answer. A disadvantage of the heuristic approach is that the generated questions are heavily tied to their declarative counterparts. A disadvantage of the supervised approach is that the resulting systems are heavily tied to the domain/language of the QA dataset used as training data. In order to overcome these shortcomings, we propose an unsupervised QG method which uses questions generated heuristically from summaries as a source of training data for a QG system. We make use of freely available news summary data, transforming declarative summary sentences into appropriate questions using heuristics informed by dependency parsing, named entity recognition and semantic role labeling. The resulting questions are then combined with the original news articles to train an end-to-end neural QG model. We extrinsically evaluate our approach using unsupervised QA: our QG model is used to generate synthetic QA pairs for training a QA model. Experimental results show that, trained with only 20k English Wikipedia-based synthetic QA pairs, the QA model substantially outperforms previous unsupervised models on three in-domain datasets (SQuAD1.1, Natural Questions, TriviaQA) and three out-of-domain datasets (NewsQA, BioASQ, DuoRC), demonstrating the transferability of the approach.
    Let the CAT out of the bag: Contrastive Attributed explanations for Text. (arXiv:2109.07983v1 [cs.CL])
    (2 min) Contrastive explanations for understanding the behavior of black box models have gained a lot of attention recently as they provide the potential for recourse. In this paper, we propose a method, Contrastive Attributed explanations for Text (CAT), which provides contrastive explanations for natural language text data with a novel twist: we build and exploit attribute classifiers, leading to more semantically meaningful explanations. To ensure that our contrastive generated text has the fewest possible edits with respect to the original text, while also being fluent and close to a human-generated contrastive, we resort to a minimal perturbation approach regularized using a BERT language model and attribute classifiers trained on available attributes. We show through qualitative examples and a user study that our method not only conveys more insight because of these attributes, but also leads to better quality (contrastive) text. Moreover, we show quantitatively that our method is more efficient than other state-of-the-art methods, while also scoring higher on benchmark metrics such as flip rate, (normalized) Levenshtein distance, fluency and content preservation.
    Surveying the Research on Fake News in Social Media: a Tale of Networks and Language. (arXiv:2109.07909v1 [cs.CY])
    (3 min) The history of journalism and news diffusion is tightly coupled with the effort to dispel hoaxes, misinformation, propaganda, unverified rumours, poor reporting, and messages containing hate and divisions. With the explosive growth of online social media and billions of individuals engaged with consuming, creating, and sharing news, this ancient problem has surfaced with a renewed intensity threatening our democracies, public health, and news outlets' credibility. This has triggered many researchers to develop new methods for studying, understanding, detecting, and preventing fake-news diffusion; as a consequence, thousands of scientific papers have been published in a relatively short period, leaving researchers from different disciplines struggling to identify open problems and the most relevant trends. The aim of this survey is threefold: first, we want to provide the researchers interested in this multidisciplinary and challenging area with a network-based analysis of the existing literature to assist them with a visual exploration of papers that can be of interest; second, we present a selection of the main results achieved so far, adopting the network as a unifying framework to represent and make sense of data, to model diffusion processes, and to evaluate different debunking strategies. Finally, we present an outline of the most relevant research trends focusing on the moving target of fake news, bot, and troll identification by means of data mining and text technologies; although scholars working on computational linguistics and on networks traditionally belong to different scientific communities, we expect that forthcoming computational approaches to prevent fake news from polluting social media will have to be developed using hybrid and up-to-date methodologies.
    The NiuTrans System for WNGT 2020 Efficiency Task. (arXiv:2109.08008v1 [cs.CL])
    (2 min) This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models (Wang et al., 2019; Li et al., 2019) using NiuTensor (https://github.com/NiuTrans/NiuTensor), a flexible toolkit for NLP tasks. We explored the combination of deep encoder and shallow decoder in Transformer models via model compression and knowledge distillation. The neural machine translation decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency, e.g., our fastest system can translate more than 40,000 tokens per second with an RTX 2080 Ti while maintaining 42.9 BLEU on newstest2018. The code, models, and docker images are available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).
    Phrase Retrieval Learns Passage Retrieval, Too. (arXiv:2109.08133v1 [cs.CL])
    (2 min) Dense retrieval methods have shown great promise over sparse retrieval methods in a range of NLP problems. Among them, dense phrase retrieval, the most fine-grained retrieval unit, is appealing because phrases can be directly used as the output for question answering and slot filling tasks. In this work, we follow the intuition that retrieving phrases naturally entails retrieving larger text blocks and study whether phrase retrieval can serve as the basis for coarse-level retrieval including passages and documents. We first observe that a dense phrase-retrieval system, without any retraining, already achieves better passage retrieval accuracy (+3-5% in top-5 accuracy) compared to passage retrievers, which also helps achieve superior end-to-end QA performance with fewer passages. Then, we provide an interpretation for why phrase-level supervision helps learn better fine-grained entailment compared to passage-level supervision, and also show that phrase retrieval can be improved to achieve competitive performance in document-retrieval tasks such as entity linking and knowledge-grounded dialogue. Finally, we demonstrate how phrase filtering and vector quantization can reduce the size of our index by 4-10x, making dense phrase retrieval a practical and versatile solution in multi-granularity retrieval.
    The Language Model Understood the Prompt was Ambiguous: Probing Syntactic Uncertainty Through Generation. (arXiv:2109.07848v1 [cs.CL])
    (2 min) Temporary syntactic ambiguities arise when the beginning of a sentence is compatible with multiple syntactic analyses. We inspect to what extent neural language models (LMs) exhibit uncertainty over such analyses when processing temporarily ambiguous inputs, and how that uncertainty is modulated by disambiguating cues. We probe the LM's expectations by generating from it: we use stochastic decoding to derive a set of sentence completions, and estimate the probability that the LM assigns to each interpretation based on the distribution of parses across completions. Unlike scoring-based methods for targeted syntactic evaluation, this technique makes it possible to explore completions that are not hypothesized in advance by the researcher. We apply this method to study the behavior of two LMs (GPT2 and an LSTM) on three types of temporary ambiguity, using materials from human sentence processing experiments. We find that LMs can track multiple analyses simultaneously; the degree of uncertainty varies across constructions and contexts. As a response to disambiguating cues, the LMs often select the correct interpretation, but occasional errors point to potential areas of improvement.
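    A minimal sketch of the generation-based probing procedure, assuming GPT-2 via HuggingFace Transformers; the prompt is illustrative, and the paper additionally parses each completion to estimate the probability of each interpretation rather than just printing samples:

        import torch
        from transformers import GPT2LMHeadModel, GPT2Tokenizer

        tok = GPT2Tokenizer.from_pretrained("gpt2")
        model = GPT2LMHeadModel.from_pretrained("gpt2")

        prompt = "While the men hunted the deer"      # temporarily ambiguous prefix
        ids = tok(prompt, return_tensors="pt").input_ids

        torch.manual_seed(0)
        out = model.generate(ids, do_sample=True, top_p=0.9, max_new_tokens=12,
                             num_return_sequences=20, pad_token_id=tok.eos_token_id)
        completions = [tok.decode(o[ids.size(1):], skip_special_tokens=True) for o in out]

        # Each completion would next be parsed; the share of completions consistent
        # with each analysis estimates the LM's uncertainty over the prefix.
        for c in completions[:5]:
            print(prompt + c)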
    Improving Neural Machine Translation by Bidirectional Training. (arXiv:2109.07780v1 [cs.CL])
    (2 min) We present a simple and effective pretraining strategy -- bidirectional training (BiT) -- for neural machine translation. Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally. To achieve bidirectional updating, we simply reconstruct the training samples from "src→tgt" to "src+tgt→tgt+src" without any complicated model modifications. Notably, our approach does not increase any parameters or training steps, requiring only the parallel data. Experimental results show that BiT significantly improves SOTA neural machine translation performance across 15 translation tasks on 8 language pairs (data sizes ranging from 160K to 38M). Encouragingly, our proposed model can complement existing data manipulation strategies, i.e., back translation, data distillation, and data diversification. Extensive analyses show that our approach functions as a novel bilingual code-switcher, obtaining better bilingual alignment.
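    The data reconstruction step is the whole trick, so a small sketch of how a parallel corpus might be rewritten for the bidirectional pretraining stage may help; the concrete pairs and the absence of any direction tag are assumptions for illustration, not the paper's exact data format:

        def make_bidirectional(pairs):
            """Turn (src, tgt) pairs into the doubled 'src+tgt -> tgt+src' pretraining set."""
            samples = []
            for src, tgt in pairs:
                samples.append((src, tgt))   # original direction
                samples.append((tgt, src))   # reversed direction
            return samples

        parallel = [("ich liebe katzen", "i love cats"),
                    ("das ist gut", "that is good")]

        bidirectional = make_bidirectional(parallel)
        # Pretrain on `bidirectional` for the early stage, then fine-tune on `parallel`.
        print(bidirectional)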
    Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models. (arXiv:2109.07765v1 [cs.CL])
    (2 min) We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawl of 3,000 Spanish domains executed in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been employed to train domain-specific language models and to produce word embeddings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license in Zenodo (https://zenodo.org/record/4561971).
    MOVER: Mask, Over-generate and Rank for Hyperbole Generation. (arXiv:2109.07726v1 [cs.CL])
    (2 min) Despite being a common figure of speech, hyperbole is under-researched with only a few studies addressing its identification task. In this paper, we introduce a new task of hyperbole generation to transfer a literal sentence into its hyperbolic paraphrase. To tackle the lack of available hyperbolic sentences, we construct HYPO-XL, the first large-scale hyperbole corpus containing 17,862 hyperbolic sentences in a non-trivial way. Based on our corpus, we propose an unsupervised method for hyperbole generation with no need for parallel literal-hyperbole pairs. During training, we fine-tune BART to infill masked hyperbolic spans of sentences from HYPO-XL. During inference, we mask part of an input literal sentence and over-generate multiple possible hyperbolic versions. Then a BERT-based ranker selects the best candidate by hyperbolicity and paraphrase quality. Human evaluation results show that our model is capable of generating hyperbolic paraphrase sentences and outperforms several baseline systems.
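    A minimal sketch of the mask-and-over-generate step with BART; the off-the-shelf facebook/bart-large checkpoint stands in here for the model fine-tuned on HYPO-XL, so the outputs will not actually be hyperbolic, and the BERT-based ranking step is only indicated in a comment:

        from transformers import BartForConditionalGeneration, BartTokenizer

        tok = BartTokenizer.from_pretrained("facebook/bart-large")
        model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

        literal = "The queue at the bakery was <mask> long."   # masked literal sentence
        ids = tok(literal, return_tensors="pt").input_ids
        out = model.generate(ids, do_sample=True, top_p=0.95,
                             num_return_sequences=5, max_length=24)
        candidates = tok.batch_decode(out, skip_special_tokens=True)

        # In MOVER, a BERT-based ranker would now score the candidates for
        # hyperbolicity and paraphrase quality; here we simply print them.
        for c in candidates:
            print(c)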
    Don't Search for a Search Method -- Simple Heuristics Suffice for Adversarial Text Attacks. (arXiv:2109.07926v1 [cs.CL])
    (2 min) Recently more attention has been given to adversarial attacks on neural networks for natural language processing (NLP). A central research topic has been the investigation of search algorithms and search constraints, accompanied by benchmark algorithms and tasks. We implement an algorithm inspired by zeroth order optimization-based attacks and compare with the benchmark results in the TextAttack framework. Surprisingly, we find that optimization-based methods do not yield any improvement in a constrained setup and slightly benefit from approximate gradient information only in unconstrained setups where search spaces are larger. In contrast, simple heuristics exploiting nearest neighbors without querying the target function yield substantial success rates in constrained setups, and nearly full success rate in unconstrained setups, at an order of magnitude fewer queries. We conclude from these results that current TextAttack benchmark tasks are too easy and constraints are too strict, preventing meaningful research on black-box adversarial text attacks.
    RetrievalSum: A Retrieval Enhanced Framework for Abstractive Summarization. (arXiv:2109.07943v1 [cs.CL])
    (2 min) Existing summarization systems mostly generate summaries purely relying on the content of the source document. However, even for humans, we usually need some references or exemplars to help us fully understand the source document and write summaries in a particular format. But how to find high-quality exemplars and incorporate them into summarization systems is still challenging and worth exploring. In this paper, we propose RetrievalSum, a novel retrieval enhanced abstractive summarization framework consisting of a dense Retriever and a Summarizer. At first, several closely related exemplars are retrieved as supplementary input to help the generation model understand the text more comprehensively. Furthermore, retrieved exemplars can also play a role in guiding the model to capture the writing style of a specific corpus. We validate our method on a wide range of summarization datasets across multiple domains and two backbone models: BERT and BART. Results show that our framework obtains significant improvements of 1.38-4.66 ROUGE-1 points over the powerful pre-trained models, and achieves a new state of the art on BillSum. Human evaluation demonstrates that our retrieval enhanced model can better capture the domain-specific writing style.
    Humanly Certifying Superhuman Classifiers. (arXiv:2109.07867v1 [cs.LG])
    (2 min) Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research. Today, this challenge is especially relevant given the emergence of systems which appear to increasingly outperform human beings. In some cases, this "superhuman" performance is readily demonstrated; for example by defeating legendary human players in traditional two player games. On the other hand, it can be challenging to evaluate classification models that potentially surpass human performance. Indeed, human annotations are often treated as a ground truth, which implicitly assumes the superiority of the human over any models trained on human annotations. In reality, human annotators can make mistakes and be subjective. Evaluating the performance with respect to a genuine oracle may be more objective and reliable, even when querying the oracle is expensive or impossible. In this paper, we first raise the challenge of evaluating the performance of both humans and models with respect to an oracle which is unobserved. We develop a theory for estimating the accuracy compared to the oracle, using only imperfect human annotations for reference. Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting, which we believe will assist in understanding the stage of current research on classification. We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks, for which an oracle does not exist, and show that under our assumptions a number of models from recent years are with high probability superhuman.
    Context-aware Entity Typing in Knowledge Graphs. (arXiv:2109.07990v1 [cs.CL])
    (2 min) Knowledge graph entity typing aims to infer entities' missing types in knowledge graphs, which is an important but under-explored issue. This paper proposes a novel method for this task by utilizing entities' contextual information. Specifically, we design two inference mechanisms: i) N2T: independently use each neighbor of an entity to infer its type; ii) Agg2T: aggregate the neighbors of an entity to infer its type. These mechanisms produce multiple inference results, and an exponentially weighted pooling method is used to generate the final inference result. Furthermore, we propose a novel loss function to alleviate the false-negative problem during training. Experiments on two real-world KGs demonstrate the effectiveness of our method. The source code and data of this paper can be obtained from https://github.com/CCIIPLab/CET.
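    A small sketch of how exponentially weighted pooling over per-neighbor type scores might look; the softmax-over-confidence weighting below is a generic choice for illustration, not necessarily the paper's exact formulation:

        import torch

        def exp_weighted_pool(scores, temperature=1.0):
            """Combine per-neighbor type scores; more confident neighbors get larger weights."""
            # scores: (num_neighbors, num_types), one row per N2T/Agg2T inference result
            weights = torch.softmax(scores.max(dim=-1).values / temperature, dim=0)
            return (weights.unsqueeze(-1) * scores).sum(dim=0)   # (num_types,)

        neighbor_scores = torch.randn(5, 100)   # 5 neighbors, 100 candidate types
        print(exp_weighted_pool(neighbor_scores).shape)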
    Transferable Persona-Grounded Dialogues via Grounded Minimal Edits. (arXiv:2109.07713v1 [cs.CL])
    (2 min) Grounded dialogue models generate responses that are grounded on certain concepts. Limited by the distribution of grounded dialogue data, models trained on such data face the transferability challenges in terms of the data distribution and the type of grounded concepts. To address the challenges, we propose the grounded minimal editing framework, which minimally edits existing responses to be grounded on the given concept. Focusing on personas, we propose Grounded Minimal Editor (GME), which learns to edit by disentangling and recombining persona-related and persona-agnostic parts of the response. To evaluate persona-grounded minimal editing, we present the PersonaMinEdit dataset, and experimental results show that GME outperforms competitive baselines by a large margin. To evaluate the transferability, we experiment on the test set of BlendedSkillTalk and show that GME can edit dialogue models' responses to largely improve their persona consistency while preserving the use of knowledge and empathy.
    Scaling Laws for Neural Machine Translation. (arXiv:2109.07740v1 [cs.LG])
    (2 min) We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we show that the total number of parameters alone is not sufficient for such purposes. (ii) We observe different power law exponents when scaling the decoder vs scaling the encoder, and provide recommendations for optimal allocation of encoder/decoder capacity based on this observation. (iii) We also report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (either via machine generated or human translated text). We observe that natural text on the target side enjoys scaling, which manifests as successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets which were originally translated from target language to source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from source language to target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study.
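    For intuition, a generic bivariate power law of the kind the abstract describes can be written as follows; this is a plausible form under the stated assumptions, and the paper's exact parameterization may differ:

        L(N_e, N_d) = \alpha \left(\frac{\bar{N}_e}{N_e}\right)^{p_e} \left(\frac{\bar{N}_d}{N_d}\right)^{p_d} + L_\infty

    where N_e and N_d are the encoder and decoder parameter counts, \bar{N}_e and \bar{N}_d are fixed reference sizes, and L_\infty is an irreducible loss term; fitting separate exponents p_e and p_d is what lets such a law capture the observation that scaling the encoder and scaling the decoder affect the loss differently.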
    Does External Knowledge Help Explainable Natural Language Inference? Automatic Evaluation vs. Human Ratings. (arXiv:2109.07833v1 [cs.CL])
    (2 min) Natural language inference (NLI) requires models to learn and apply commonsense knowledge. These reasoning abilities are particularly important for explainable NLI systems that generate a natural language explanation in addition to their label prediction. The integration of external knowledge has been shown to improve NLI systems; here we investigate whether it can also improve their explanation capabilities. For this, we investigate different sources of external knowledge and evaluate the performance of our models on in-domain data as well as on special transfer datasets that are designed to assess fine-grained reasoning capabilities. We find that different sources of knowledge have a different effect on reasoning abilities; for example, implicit knowledge stored in language models can hinder reasoning on numbers and negations. Finally, we conduct the largest and most fine-grained explainable NLI crowdsourcing study to date. It reveals that even large differences in automatic performance scores are not reflected in human ratings of label, explanation, commonsense, or grammar correctness.
    Benchmarking Commonsense Knowledge Base Population with an Effective Evaluation Dataset. (arXiv:2109.07679v1 [cs.CL])
    (2 min) Reasoning over commonsense knowledge bases (CSKB) whose elements are in the form of free text is an important yet hard task in NLP. While CSKB completion only fills the missing links within the domain of the CSKB, CSKB population is alternatively proposed with the goal of reasoning over unseen assertions from external resources. In this task, CSKBs are grounded to a large-scale eventuality (activity, state, and event) graph to discriminate whether novel triples from the eventuality graph are plausible or not. However, existing evaluations on the population task are either not accurate (automatic evaluation with randomly sampled negative examples) or of small scale (human annotation). In this paper, we benchmark the CSKB population task with a new large-scale dataset by first aligning four popular CSKBs, and then presenting a high-quality human-annotated evaluation set to probe neural models' commonsense reasoning ability. We also propose a novel inductive commonsense reasoning model that reasons over graphs. Experimental results show that generalizing commonsense reasoning to unseen assertions is inherently a hard task. Models achieving high accuracy during training perform poorly on the evaluation set, with a large gap to human performance. We will make the data publicly available for future contributions. Codes and data are available at https://github.com/HKUST-KnowComp/CSKB-Population.
    Translation Transformers Rediscover Inherent Data Domains. (arXiv:2109.07864v1 [cs.CL])
    (2 min) Many works have proposed methods to improve the performance of Neural Machine Translation (NMT) models in a domain/multi-domain adaptation scenario. However, an understanding of how NMT baselines represent text domain information internally is still lacking. Here we analyze the sentence representations learned by NMT Transformers and show that these explicitly include the information on text domains, even after only seeing the input sentences without domain labels. Furthermore, we show that this internal information is enough to cluster sentences by their underlying domains without supervision. We show that NMT models produce clusters better aligned to the actual domains compared to pre-trained language models (LMs). Notably, when computed at the document level, NMT cluster-to-domain correspondence nears 100%. We use these findings in an approach to NMT domain adaptation using automatically extracted domains. Whereas previous work relied on external LMs for text clustering, we propose re-using the NMT model as a source of unsupervised clusters. We perform an extensive experimental study comparing two approaches across two data scenarios, three language pairs, and both sentence-level and document-level clustering, showing equal or significantly superior performance compared to LMs.
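    A minimal sketch of the unsupervised clustering step described above, using mean-pooled encoder states of a generic HuggingFace translation model and k-means; the specific checkpoint, the mean pooling, and the tiny example sentences are assumptions for illustration, not the paper's setup:

        import torch
        from sklearn.cluster import KMeans
        from transformers import AutoModel, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
        model = AutoModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")

        sentences = ["Der Patient erhielt 5 mg des Wirkstoffs.",
                     "Das Gericht wies die Klage ab.",
                     "Die Leber metabolisiert das Medikament.",
                     "Der Vertrag wurde fristlos gekündigt."]

        with torch.no_grad():
            reps = []
            for s in sentences:
                ids = tok(s, return_tensors="pt")
                enc = model.get_encoder()(**ids).last_hidden_state   # (1, len, dim)
                reps.append(enc.mean(dim=1).squeeze(0))               # mean-pooled sentence vector
            reps = torch.stack(reps).numpy()

        print(KMeans(n_clusters=2, n_init=10).fit_predict(reps))      # sentence -> domain cluster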
    CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning. (arXiv:2109.07589v1 [cs.CL])
    (2 min) Named Entity Recognition (NER) in Few-Shot setting is imperative for entity tagging in low resource domains. Existing approaches only learn class-specific semantic features and intermediate representations from source domains. This affects generalizability to unseen target domains, resulting in suboptimal performances. To this end, we present CONTaiNER, a novel contrastive learning technique that optimizes the inter-token distribution distance for Few-Shot NER. Instead of optimizing class-specific attributes, CONTaiNER optimizes a generalized objective of differentiating between token categories based on their Gaussian-distributed embeddings. This effectively alleviates overfitting issues originating from training domains. Our experiments in several traditional test domains (OntoNotes, CoNLL'03, WNUT '17, GUM) and a new large scale Few-Shot NER dataset (Few-NERD) demonstrate that on average, CONTaiNER outperforms previous methods by 3%-13% absolute F1 points while showing consistent performance trends, even in challenging scenarios where previous approaches could not achieve appreciable performance.
    On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation. (arXiv:2109.07591v1 [cs.CL])
    (2 min) Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected data training and then fine tuning. Data selection improves target domain generalization by training further on pretraining data identified by relying on a small sample of target domain data. This work examines the benefit of data selection for language modeling and machine translation. Our experiments assess the complementarity of selection with fine tuning and result in practical recommendations: (i) selected data must be similar to the fine-tuning domain but not so much as to erode the complementary effect of fine-tuning; (ii) there is a trade-off between selecting little data for fast but limited progress or much data for slow but long lasting progress; (iii) data selection can be applied early during pretraining, with performance gains comparable to long pretraining session; (iv) data selection from domain classifiers is often more effective than the popular contrastive data selection method.
    Dialogue State Tracking with a Language Model using Schema-Driven Prompting. (arXiv:2109.07506v1 [cs.CL])
    (2 min) Task-oriented conversational systems often use dialogue state tracking to represent the user's intentions, which involves filling in values of pre-defined slots. Many approaches have been proposed, often using task-specific architectures with special-purpose classifiers. Recently, good results have been obtained using more general architectures based on pretrained language models. Here, we introduce a new variation of the language modeling approach that uses schema-driven prompting to provide task-aware history encoding that is used for both categorical and non-categorical slots. We further improve performance by augmenting the prompting with schema descriptions, a naturally occurring source of in-domain knowledge. Our purely generative system achieves state-of-the-art performance on MultiWOZ 2.2 and achieves competitive performance on two other benchmarks: MultiWOZ 2.1 and M2M. The data and code will be available at https://github.com/chiahsuan156/DST-as-Prompting.
    Reframing Instructional Prompts to GPTk's Language. (arXiv:2109.07830v1 [cs.CL])
    (2 min) How can model designers turn task instructions into effective prompts for language models? Backed by extensive empirical analysis on GPT3, we observe important features for successful instructional prompts, and propose several reframing techniques for model designers to create such prompts. For example, a complex task can be decomposed into multiple simpler tasks. We experiment on 12 NLP tasks across 6 diverse categories (question generation, classification, etc.). Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity over existing few-shot baselines. The performance gains are particularly important on large language models, such as GPT3, where tuning models or prompts on large datasets is not feasible. Furthermore, we observe that such gains are not limited to GPT3; the reframed tasks remain superior over raw instructions across different model architectures, underscoring the cross-model generality of these guidelines. We hope these empirically driven techniques will pave the way for more effective ways to prompt LMs in the future.
    Jointly Modeling Aspect and Polarity for Aspect-based Sentiment Analysis in Persian Reviews. (arXiv:2109.07680v1 [cs.CL])
    (2 min) Identification of users' opinions from natural language text has become an exciting field of research due to its growing applications in the real world. The research field is known as sentiment analysis and classification, where aspect category detection (ACD) and aspect category polarity (ACP) are two important sub-tasks of aspect-based sentiment analysis. The goal of ACD is to specify which aspect of the entity comes up in an opinion, while ACP aims to specify the polarity of each aspect category from the ACD task. Previous work mostly proposes separate solutions for these two sub-tasks. This paper focuses on the ACD and ACP sub-tasks to solve both problems simultaneously. The proposed method carries out multi-label classification; four different deep models are employed and comparatively evaluated to examine their performance. A dataset of Persian reviews was collected from the CinemaTicket website, including 2,200 samples from 14 categories. The developed models were evaluated using the collected dataset in terms of example-based and label-based metrics. The results indicate that the CNN and GRU models are more applicable and perform better than the LSTM and Bi-LSTM models.
    Language Models are Few-shot Multilingual Learners. (arXiv:2109.07684v1 [cs.CL])
    (2 min) General-purpose language models have demonstrated impressive capabilities, performing on par with state-of-the-art approaches on a range of downstream natural language processing (NLP) tasks and benchmarks when inferring instructions from very few examples. Here, we evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages without any parameter updates. We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones. Finally, we find the in-context few-shot cross-lingual prediction results of language models are significantly better than random prediction, and they are competitive compared to the existing state-of-the-art cross-lingual models.
    Constructing Emotion Consensus and Utilizing Unpaired Data for Empathetic Dialogue Generation. (arXiv:2109.07779v1 [cs.CL])
    (2 min) Research on dialogue empathy aims to endow an agent with the capacity to accurately understand and properly respond to emotions. Existing models for empathetic dialogue generation focus on the emotion flow in one direction, that is, from the context to the response. We argue that conducting an empathetic conversation is a bidirectional process, where empathy occurs when the emotions of two interlocutors converge on the same point, i.e., reaching an emotion consensus. We also find that the available empathetic dialogue corpora are extremely limited, which further restricts model performance. To address the above issues, we propose a dual-generative model, Dual-Emp, to simultaneously construct the emotion consensus and utilize external unpaired data. Specifically, our model integrates a forward dialogue model, a backward dialogue model, and a discrete latent variable representing the emotion consensus into a unified architecture. Then, to alleviate the constraint of paired data, we extract unpaired emotional data from open-domain conversations and employ Dual-Emp to produce pseudo paired empathetic samples, which is more efficient and lower-cost than human annotation. Automatic and human evaluations demonstrate that our method outperforms competitive baselines in producing coherent and empathetic responses.
    Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets. (arXiv:2109.07494v1 [cs.CL])
    (2 min) For interpreting the behavior of a probabilistic model, it is useful to measure a model's calibration--the extent to which it produces reliable confidence scores. We address the open problem of calibration for tagging models with sparse tagsets, and recommend strategies to measure and reduce calibration error (CE) in such models. We show that several post-hoc recalibration techniques all reduce calibration error across the marginal distribution for two existing sequence taggers. Moreover, we propose tag frequency grouping (TFG) as a way to measure calibration error in different frequency bands. Further, recalibrating each group separately promotes a more equitable reduction of calibration error across the tag frequency spectrum.
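    For reference, a minimal sketch of how calibration error is typically measured with confidence binning; this is the generic expected calibration error, not the paper's tag-frequency-grouped variant:

        import numpy as np

        def expected_calibration_error(confidences, correct, n_bins=10):
            """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
            bins = np.linspace(0.0, 1.0, n_bins + 1)
            ece = 0.0
            for lo, hi in zip(bins[:-1], bins[1:]):
                mask = (confidences > lo) & (confidences <= hi)
                if mask.any():
                    gap = abs(correct[mask].mean() - confidences[mask].mean())
                    ece += mask.mean() * gap   # weight by the fraction of predictions in the bin
            return ece

        conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])   # predicted tag confidences
        hit = np.array([1, 1, 0, 1, 0])               # whether the predicted tag was correct
        print(expected_calibration_error(conf, hit))

    Tag frequency grouping, as proposed in the paper, computes such an error separately for tags in different frequency bands instead of over the whole marginal distribution.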
    Cross-Register Projection for Headline Part of Speech Tagging. (arXiv:2109.07483v1 [cs.CL])
    (2 min) Part of speech (POS) tagging is a familiar NLP task. State of the art taggers routinely achieve token-level accuracies of over 97% on news body text, evidence that the problem is well understood. However, the register of English news headlines, "headlinese", is very different from the register of long-form text, causing POS tagging models to underperform on headlines. In this work, we automatically annotate news headlines with POS tags by projecting predicted tags from corresponding sentences in news bodies. We train a multi-domain POS tagger on both long-form and headline text and show that joint training on both registers improves over training on just one or naively concatenating training sets. We evaluate on a newly-annotated corpus of over 5,248 English news headlines from the Google sentence compression corpus, and show that our model yields a 23% relative error reduction per token and 19% per headline. In addition, we demonstrate that better headline POS tags can improve the performance of a syntax-based open information extraction system. We make POSH, the POS-tagged Headline corpus, available to encourage research in improved NLP models for news headlines.
    Text as Causal Mediators: Research Design for Causal Estimates of Differential Treatment of Social Groups via Language Aspects. (arXiv:2109.07542v1 [cs.CL])
    (2 min) Using observed language to understand interpersonal interactions is important in high-stakes decision making. We propose a causal research design for observational (non-experimental) data to estimate the natural direct and indirect effects of social group signals (e.g. race or gender) on speakers' responses, with separate aspects of language as causal mediators. We illustrate the promises and challenges of this framework via a theoretical case study of the effect of an advocate's gender on interruptions from justices during U.S. Supreme Court oral arguments. We also discuss the challenges of conceptualizing and operationalizing causal variables such as gender and language, which comprise many components, and we articulate technical open challenges such as temporal dependence between language mediators in conversational settings.
    An influencer-based approach to understanding radical right viral tweets. (arXiv:2109.07588v1 [cs.SI])
    (2 min) Radical right influencers routinely use social media to spread highly divisive, disruptive and anti-democratic messages. Assessing and countering the challenge that such content poses is crucial for ensuring that online spaces remain open, safe and accessible. Previous work has paid little attention to understanding factors associated with radical right content that goes viral. We investigate this issue with a new dataset ROT which provides insight into the content, engagement and followership of a set of 35 radical right influencers. It includes over 50,000 original entries and over 40 million retweets, quotes, replies and mentions. We use a multilevel model to measure engagement with tweets, which are nested in each influencer. We show that it is crucial to account for the influencer-level structure, and find evidence of the importance of both influencer- and content-level factors, including the number of followers each influencer has, the type of content (original posts, quotes and replies), the length and toxicity of content, and whether influencers request retweets. We make ROT available for other researchers to use.
    Tied & Reduced RNN-T Decoder. (arXiv:2109.07513v1 [cs.CL])
    (2 min) Previous works on the Recurrent Neural Network-Transducer (RNN-T) models have shown that, under some conditions, it is possible to simplify its prediction network with little or no loss in recognition accuracy (arXiv:2003.07705 [eess.AS], [2], arXiv:2012.06749 [cs.CL]). This is done by limiting the context size of previous labels and/or using a simpler architecture for its layers instead of LSTMs. The benefits of such changes include reduction in model size, faster inference and power savings, which are all useful for on-device applications. In this work, we study ways to make the RNN-T decoder (prediction network + joint network) smaller and faster without degradation in recognition performance. Our prediction network performs a simple weighted averaging of the input embeddings, and shares its embedding matrix weights with the joint network's output layer (a.k.a. weight tying, commonly used in language modeling arXiv:1611.01462 [cs.LG]). This simple design, when used in conjunction with additional Edit-based Minimum Bayes Risk (EMBR) training, reduces the RNN-T Decoder from 23M parameters to just 2M, without affecting word-error rate (WER).
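    A minimal sketch of the two ideas named above, a prediction network that only averages the embeddings of the last few labels and weight tying between that embedding matrix and the joint network's output layer; shapes, context size, and module layout are assumptions, not the paper's exact architecture:

        import torch
        import torch.nn as nn

        class TinyPredictionNetwork(nn.Module):
            """Weighted average of the embeddings of the previous labels (no LSTM)."""
            def __init__(self, vocab_size, dim, context=2):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, dim)
                self.weights = nn.Parameter(torch.ones(context) / context)

            def forward(self, prev_labels):            # (batch, context) label IDs
                emb = self.embed(prev_labels)           # (batch, context, dim)
                return (self.weights.unsqueeze(-1) * emb).sum(dim=1)

        vocab, dim = 1000, 256
        pred_net = TinyPredictionNetwork(vocab, dim)
        joint_out = nn.Linear(dim, vocab, bias=False)
        joint_out.weight = pred_net.embed.weight        # weight tying with the output layer

        g = pred_net(torch.randint(0, vocab, (4, 2)))   # prediction-network output
        print(joint_out(g).shape)                       # (4, vocab); the real joint also adds encoder features

    Tying the (vocab x dim) embedding matrix to the output projection is what removes most of the decoder's parameters in this sketch, which mirrors the size reduction the abstract reports.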
    "It doesn't look good for a date": Transforming Critiques into Preferences for Conversational Recommendation Systems. (arXiv:2109.07576v1 [cs.CL])
    (2 min) Conversations aimed at determining good recommendations are iterative in nature. People often express their preferences in terms of a critique of the current recommendation (e.g., "It doesn't look good for a date"), requiring some degree of common sense for a preference to be inferred. In this work, we present a method for transforming a user critique into a positive preference (e.g., "I prefer more romantic") in order to retrieve reviews pertaining to potentially better recommendations (e.g., "Perfect for a romantic dinner"). We leverage a large neural language model (LM) in a few-shot setting to perform critique-to-preference transformation, and we test two methods for retrieving recommendations: one that matches embeddings, and another that fine-tunes an LM for the task. We instantiate this approach in the restaurant domain and evaluate it using a new dataset of restaurant critiques. In an ablation study, we show that utilizing critique-to-preference transformation improves recommendations, and that there are at least three general cases that explain this improved performance.
    Comparing Euclidean and Hyperbolic Embeddings on the WordNet Nouns Hypernymy Graph. (arXiv:2109.07488v1 [cs.CL])
    (2 min) Nickel and Kiela (2017) present a new method for embedding tree nodes in the Poincare ball, and suggest that these hyperbolic embeddings are far more effective than Euclidean embeddings at embedding nodes in large, hierarchically structured graphs like the WordNet nouns hypernymy tree. This is especially true in low dimensions (Nickel and Kiela, 2017, Table 1). In this work, we seek to reproduce their experiments on embedding and reconstructing the WordNet nouns hypernymy graph. Counter to what they report, we find that Euclidean embeddings are able to represent this tree at least as well as Poincare embeddings, when allowed at least 50 dimensions. We note that this does not diminish the significance of their work given the impressive performance of hyperbolic embeddings in very low-dimensional settings. However, given the wide influence of their work, our aim here is to present an updated and more accurate comparison between the Euclidean and hyperbolic embeddings.
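    For readers reproducing the comparison, the standard distance function on the Poincare ball used in the hyperbolic setting, next to the ordinary Euclidean distance used by the baseline (the small example vectors are arbitrary):

        import numpy as np

        def poincare_distance(u, v, eps=1e-9):
            """Distance between two points strictly inside the unit Poincare ball."""
            sq = np.sum((u - v) ** 2)
            denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2)) + eps
            return np.arccosh(1 + 2 * sq / denom)

        u = np.array([0.1, 0.2])
        v = np.array([-0.3, 0.05])
        print(poincare_distance(u, v), np.linalg.norm(u - v))   # hyperbolic vs Euclidean distance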
    Transductive Learning for Unsupervised Text Style Transfer. (arXiv:2109.07812v1 [cs.CL])
    (2 min) Unsupervised style transfer models are mainly based on an inductive learning approach, which represents the style as embeddings, decoder parameters, or discriminator parameters and directly applies these general rules to the test cases. However, the lack of a parallel corpus limits these inductive learning methods on this task. As a result, they are prone to severely inconsistent style expressions, such as `the salad is rude`. To tackle this problem, we propose a novel transductive learning approach in this paper, based on a retrieval-based context-aware style representation. Specifically, an attentional encoder-decoder with a retriever framework is utilized. It brings the top-K relevant sentences in the target style into the transfer process. In this way, we can learn a context-aware style embedding to alleviate the above inconsistency problem. In this paper, both sparse (BM25) and dense retrieval functions (MIPS) are used, and two objective functions are designed to facilitate joint learning. Experimental results show that our method outperforms several strong baselines. The proposed transductive learning approach is general and effective for unsupervised style transfer, and we plan to apply it to the other two typical methods in future work.
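    A toy sketch (ours, with random stand-in embeddings) of the dense retrieval step: given an encoded input sentence, fetch the top-K most similar target-style sentences by inner product (MIPS); a BM25 scorer can be swapped in for the sparse variant.

        import numpy as np

        def top_k_retrieve(query_vec, corpus_vecs, k=5):
            """Rank target-style sentences by inner product and return the
            indices of the top-K (MIPS-style retrieval)."""
            scores = corpus_vecs @ query_vec
            return np.argsort(-scores)[:k]

        # Random vectors stand in for sentence embeddings of a target-style corpus.
        rng = np.random.default_rng(0)
        corpus_vecs = rng.normal(size=(1000, 128))
        query_vec = rng.normal(size=128)
        print(top_k_retrieve(query_vec, corpus_vecs, k=5))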
    MFE-NER: Multi-feature Fusion Embedding for Chinese Named Entity Recognition. (arXiv:2109.07877v1 [cs.CL])
    (2 min) Pre-trained language models have led Named Entity Recognition (NER) into a new era, but additional knowledge is needed to improve their performance on specific problems. In Chinese NER, character substitution is a complicated linguistic phenomenon. Some Chinese characters are quite similar because they share the same components or have similar pronunciations. People replace characters in a named entity with similar characters, generating a new collocation that still refers to the same object. This has become even more common in the Internet age and is often used to avoid Internet censorship or just for fun. Such character substitution is difficult for pre-trained language models because the new collocations occur only rarely, so it frequently causes entities to be missed or misrecognized in the NER task. In this paper, we propose a new method, Multi-Feature Fusion Embedding for Chinese Named Entity Recognition (MFE-NER), to strengthen the language pattern of Chinese and handle the character substitution problem in Chinese Named Entity Recognition. MFE fuses semantic, glyph, and phonetic features together. In the glyph domain, we disassemble Chinese characters into components to denote structural features, so that characters with similar structures can have close representations in the embedding space. Meanwhile, an improved phonetic system is also proposed in our work, making it reasonable to calculate phonetic similarity among Chinese characters. Experiments demonstrate that our method improves the overall performance of Chinese NER and performs especially well in informal language environments.
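    A minimal sketch (ours, with hypothetical component and phonetic vocabularies) of the fusion idea: per character, concatenate semantic, glyph, and phonetic embeddings before passing the sequence to a NER tagger.

        import torch
        import torch.nn as nn

        class MultiFeatureEmbedding(nn.Module):
            """Illustrative fusion of semantic, glyph and phonetic features per
            character; vocabularies and dimensions are placeholders."""
            def __init__(self, char_vocab, glyph_vocab, phon_vocab,
                         d_sem=128, d_gly=32, d_pho=32):
                super().__init__()
                self.sem = nn.Embedding(char_vocab, d_sem)
                self.gly = nn.Embedding(glyph_vocab, d_gly)
                self.pho = nn.Embedding(phon_vocab, d_pho)

            def forward(self, char_ids, glyph_ids, phon_ids):
                # Each input: (batch, seq_len) ids; output: (batch, seq_len, D).
                return torch.cat([self.sem(char_ids),
                                  self.gly(glyph_ids),
                                  self.pho(phon_ids)], dim=-1)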
  • cs.CV updates on arXiv.org

    A Machine Learning Framework for Automatic Prediction of Human Semen Motility. (arXiv:2109.08049v1 [cs.LG])
    (2 min) In the field of reproductive health, a vital aspect for the detection of male fertility issues is the analysis of human semen quality. Two factors of importance are the morphology and motility of the sperm cells. While the former describes defects in different parts of a spermatozoon, the latter measures the efficient movement of cells. For many non-human species, so-called Computer-Aided Sperm Analysis systems work well for assessing these characteristics from microscopic video recordings but struggle with human sperm samples which generally show higher degrees of debris and dead spermatozoa, as well as lower overall sperm motility. Here, machine learning methods that harness large amounts of training data to extract salient features could support physicians with the detection of fertility issues or in vitro fertilisation procedures. In this work, the overall motility of given sperm samples is predicted with the help of a machine learning framework integrating unsupervised methods for feature extraction with downstream regression models. The models evaluated herein improve on the state-of-the-art for video-based sperm-motility prediction.
    SPIN Road Mapper: Extracting Roads from Aerial Images via Spatial and Interaction Space Graph Reasoning for Autonomous Driving. (arXiv:2109.07701v1 [cs.CV])
    (0 min) Road extraction is an essential step in building autonomous navigation systems. Detecting road segments is challenging as they are of varying widths, bifurcate throughout the image, and are often occluded by terrain, cloud cover, or other weather conditions. Using convolutional neural networks (ConvNets) alone for this problem is not effective, as they are inefficient at capturing the distant dependencies between road segments in the image that are essential for extracting road connectivity. To this end, we propose a Spatial and Interaction Space Graph Reasoning (SPIN) module which, when plugged into a ConvNet, performs reasoning over graphs constructed on spatial and interaction spaces projected from the feature maps. Reasoning over spatial space extracts dependencies between different spatial regions and other contextual information. Reasoning over a projected interaction space helps in appropriate delineation of roads from other topographies present in the image. Thus, SPIN extracts long-range dependencies between road segments and effectively delineates roads from other semantics. We also introduce a SPIN pyramid which performs SPIN graph reasoning across multiple scales to extract multi-scale features. We propose a network based on stacked hourglass modules and the SPIN pyramid for road segmentation which achieves better performance compared to existing methods. Moreover, our method is computationally efficient and significantly boosts the convergence speed during training, making it feasible to apply to large-scale high-resolution aerial images. Code available at: https://github.com/wgcban/SPIN_RoadMapper.git.
    Semi-supervised Meta-learning with Disentanglement for Domain-generalised Medical Image Segmentation. (arXiv:2106.13292v2 [cs.CV] UPDATED)
    (0 min) Generalising deep models to new data from new centres (termed here domains) remains a challenge. This is largely attributed to shifts in data statistics (domain shifts) between source and unseen domains. Recently, gradient-based meta-learning approaches where the training data are split into meta-train and meta-test sets to simulate and handle the domain shifts during training have shown improved generalisation performance. However, the current fully supervised meta-learning approaches are not scalable for medical image segmentation, where large effort is required to create pixel-wise annotations. Meanwhile, in a low data regime, the simulated domain shifts may not approximate the true domain shifts well across source and unseen domains. To address this problem, we propose a novel semi-supervised meta-learning framework with disentanglement. We explicitly model the representations related to domain shifts. Disentangling the representations and combining them to reconstruct the input image allows unlabeled data to be used to better approximate the true domain shifts for meta-learning. Hence, the model can achieve better generalisation performance, especially when there is a limited amount of labeled data. Experiments show that the proposed method is robust on different segmentation tasks and achieves state-of-the-art generalisation performance on two public benchmarks.
    Monocular, One-stage, Regression of Multiple 3D People. (arXiv:2008.12272v4 [cs.CV] UPDATED)
    (0 min) This paper focuses on the regression of multiple 3D people from a single RGB image. Existing approaches predominantly follow a multi-stage pipeline that first detects people in bounding boxes and then independently regresses their 3D body meshes. In contrast, we propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP). The approach is conceptually simple, bounding box-free, and able to learn a per-pixel representation in an end-to-end manner. Our method simultaneously predicts a Body Center heatmap and a Mesh Parameter map, which can jointly describe the 3D body mesh on the pixel level. Through a body-center-guided sampling process, the body mesh parameters of all people in the image are easily extracted from the Mesh Parameter map. Equipped with such a fine-grained representation, our one-stage framework is free of the complex multi-stage process and more robust to occlusion. Compared with state-of-the-art methods, ROMP achieves superior performance on the challenging multi-person benchmarks, including 3DPW and CMU Panoptic. Experiments on crowded/occluded datasets demonstrate the robustness under various types of occlusion. The released code is the first real-time implementation of monocular multi-person 3D mesh regression.
    Label Assignment Distillation for Object Detection. (arXiv:2109.07843v1 [cs.CV])
    (0 min) Knowledge distillation methods have proved promising for improving the performance of neural networks, and they require no additional computational expense at inference time. To boost object detection accuracy, a great number of knowledge distillation methods designed specifically for object detection have been proposed. However, most of these methods focus only on feature-level distillation and label-level distillation, leaving the label assignment step, a unique and paramount procedure for object detection, by the wayside. In this work, we come up with a simple but effective knowledge distillation approach focusing on label assignment in object detection, in which the positive and negative samples of the student network are selected in accordance with the predictions of the teacher network. Our method shows encouraging results on the MSCOCO2017 benchmark, and can not only be applied to both one-stage and two-stage detectors but also be utilized orthogonally with other knowledge distillation methods.
    A Medical Pre-Diagnosis System for Histopathological Image of Breast Cancer. (arXiv:2109.07878v1 [cs.CV])
    (0 min) This paper constructs a novel intelligent medical diagnosis system, which can realize automatic communication and breast cancer pathological image recognition. This system contains two main parts: a pre-trained chatbot called M-Chatbot and an improved version of EfficientNetV2-S named EfficientNetV2-SA, in which the activation function in the top layers is replaced by ACON-C. Using an information retrieval mechanism, M-Chatbot instructs patients to send breast pathology images to the EfficientNetV2-SA network, and the classifier trained by transfer learning then returns the diagnosis results. We verify the performance of our chatbot and the classifier on extrinsic metrics and the BreaKHis dataset, respectively. The task completion rate of M-Chatbot reached 63.33%. On the BreaKHis dataset, the EfficientNetV2-SA network achieved a highest accuracy of 84.71%. These experimental results illustrate that the proposed model improves the accuracy of image recognition and that our new intelligent medical diagnosis system is successful and efficient in providing automatic diagnosis of breast cancer.
    Semi-Supervised Visual Representation Learning for Fashion Compatibility. (arXiv:2109.08052v1 [cs.CV])
    (0 min) We consider the problem of complementary fashion prediction. Existing approaches focus on learning an embedding space where fashion items from different categories that are visually compatible are closer to each other. However, creating such labeled outfits is labor-intensive, and it is not feasible to generate all possible outfit combinations, especially with large fashion catalogs. In this work, we propose a semi-supervised learning approach where we leverage a large unlabeled fashion corpus to create pseudo-positive and pseudo-negative outfits on the fly during training. For each labeled outfit in a training batch, we obtain a pseudo-outfit by matching each item in the labeled outfit with unlabeled items. Additionally, we introduce consistency regularization to ensure that the representations of the original images and their transformations are consistent, implicitly incorporating colour and other important attributes through self-supervision. We conduct extensive experiments on Polyvore, Polyvore-D and our newly created large-scale Fashion Outfits datasets, and show that our approach with only a fraction of labeled examples performs on par with completely supervised methods.
    ROS-X-Habitat: Bridging the ROS Ecosystem with Embodied AI. (arXiv:2109.07703v1 [cs.RO])
    (0 min) We introduce ROS-X-Habitat, a software interface that bridges the AI Habitat platform for embodied reinforcement learning agents with other robotics resources via ROS. This interface not only offers standardized communication protocols between embodied agents and simulators, but also enables physics-based simulation. With this interface, roboticists are able to train their own Habitat RL agents in another simulation environment or to develop their own robotic algorithms inside Habitat Sim. Through in silico experiments, we demonstrate that ROS-X-Habitat has minimal impact on the navigation performance and simulation speed of Habitat agents; that a standard set of ROS mapping, planning and navigation tools can run in the Habitat simulator, and that a Habitat agent can run in the standard ROS simulator Gazebo.
    Generating Dataset For Large-scale 3D Facial Emotion Recognition. (arXiv:2109.08043v1 [cs.CV])
    (0 min) The tremendous development in deep learning has led facial expression recognition (FER) to receive much attention in the past few years. Although 3D FER has an inherent edge over its 2D counterpart, work on 2D images has dominated the field. The main reason for the slow development of 3D FER is the unavailability of large training and large test datasets. Recognition accuracies have already saturated on existing 3D emotion recognition datasets due to their small gallery sizes. Unlike 2D photographs, 3D facial scans are not easy to collect, causing a bottleneck in the development of deep 3D FER networks and datasets. In this work, we propose a method for generating a large dataset of 3D faces with labeled emotions. We also develop a deep convolutional neural network (CNN) for 3D FER trained on 624,000 3D facial scans. The test data comprises 208,000 3D facial scans.
    Automated risk classification of colon biopsies based on semantic segmentation of histopathology images. (arXiv:2109.07892v1 [eess.IV])
    (0 min) Artificial Intelligence (AI) can potentially support histopathologists in the diagnosis of a broad spectrum of cancer types. In colorectal cancer (CRC), AI can alleviate the laborious task of characterization and reporting on resected biopsies, including polyps, the numbers of which are increasing as a result of CRC population screening programs, ongoing in many countries all around the globe. Here, we present an approach to address two major challenges in automated assessment of CRC histopathology whole-slide images. First, we present an AI-based method to segment multiple tissue compartments in the H&E-stained whole-slide image, which provides a different, more perceptible picture of tissue morphology and composition. We test and compare a panel of state-of-the-art loss functions available for segmentation models, and provide indications about their use in histopathology image segmentation, based on the analysis of a) a multi-centric cohort of CRC cases from five medical centers in the Netherlands and Germany, and b) two publicly available datasets on segmentation in CRC. Second, we use the best performing AI model as the basis for a computer-aided diagnosis system (CAD) that classifies colon biopsies into four main categories that are relevant pathologically. We report the performance of this system on an independent cohort of more than 1,000 patients. The results show the potential of such an AI-based system to assist pathologists in diagnosis of CRC in the context of population screening. We have made the segmentation model available for research use on https://grand-challenge.org/algorithms/colon-tissue-segmentation/.
    Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer. (arXiv:2109.08024v1 [cs.CV])
    (0 min) We propose a semi-supervised network for wide-angle portraits correction. Wide-angle images often suffer from skew and distortion caused by perspective projection, which is especially noticeable in face regions. Previous deep learning based approaches require ground-truth correction flow maps for training guidance; however, such labels are expensive and can only be obtained manually. In this work, we propose a semi-supervised scheme, which can consume unlabeled data in addition to labeled data for improvement. Specifically, our semi-supervised scheme takes advantage of a consistency mechanism, with several novel components such as direction and range consistency (DRC) and regression consistency (RC). Furthermore, our network, named Multi-Scale Swin-Unet (MS-Unet), is built upon the multi-scale Swin transformer block (MSTB), which can learn both local-scale and long-range semantic information effectively. In addition, we introduce a high-quality unlabeled dataset with rich scenarios for training. Extensive experiments demonstrate that the proposed method is superior to state-of-the-art methods and other representative baselines.
    R3LIVE: A Robust, Real-time, RGB-colored, LiDAR-Inertial-Visual tightly-coupled state Estimation and mapping package. (arXiv:2109.07982v1 [cs.RO])
    (0 min) In this letter, we propose a novel LiDAR-Inertial-Visual sensor fusion framework termed R3LIVE, which takes advantage of measurements from LiDAR, inertial, and visual sensors to achieve robust and accurate state estimation. R3LIVE consists of two subsystems: a LiDAR-inertial odometry (LIO) subsystem and a visual-inertial odometry (VIO) subsystem. The LIO subsystem (FAST-LIO) takes advantage of the measurements from the LiDAR and inertial sensors and builds the geometric structure of the global map (i.e., the positions of 3D points). The VIO subsystem utilizes the data from the visual-inertial sensors and renders the map's texture (i.e., the color of 3D points). More specifically, the VIO subsystem fuses the visual data directly and effectively by minimizing the frame-to-map photometric error. R3LIVE is developed based on our previous work R2LIVE, with careful architecture design and implementation. Experimental results show that the resulting system is more robust and achieves higher accuracy in state estimation than current counterparts (see our attached video). R3LIVE is a versatile and well-engineered system suited to various possible applications: it can not only serve as a SLAM system for real-time robotic applications, but can also reconstruct dense, precise, RGB-colored 3D maps for applications like surveying and mapping. Moreover, to make R3LIVE more extensible, we develop a series of offline utilities for reconstructing and texturing meshes, which further minimizes the gap between R3LIVE and various 3D applications such as simulators and video games (see our demo video). To share our findings and make contributions to the community, we open source R3LIVE on GitHub, including all of our code, software utilities, and the mechanical design of our device.
    The Fishyscapes Benchmark: Measuring Blind Spots in Semantic Segmentation. (arXiv:1904.03215v4 [cs.CV] UPDATED)
    (0 min) Deep learning has enabled impressive progress in the accuracy of semantic segmentation. Yet, the ability to estimate uncertainty and detect failure is key for safety-critical applications like autonomous driving. Existing uncertainty estimates have mostly been evaluated on simple tasks, and it is unclear whether these methods generalize to more complex scenarios. We present Fishyscapes, the first public benchmark for uncertainty estimation in a real-world task of semantic segmentation for urban driving. It evaluates pixel-wise uncertainty estimates towards the detection of anomalous objects in front of the vehicle. We adapt state-of-the-art methods to recent semantic segmentation models and compare approaches based on softmax confidence, Bayesian learning, and embedding density. Our results show that anomaly detection is far from solved even for ordinary situations, while our benchmark allows measuring advancements beyond the state-of-the-art.
    Adversarial Semantic Hallucination for Domain Generalized Semantic Segmentation. (arXiv:2106.04144v3 [cs.CV] UPDATED)
    (0 min) Convolutional neural networks may perform poorly when the test and train data are from different domains. While this problem can be mitigated by using the target domain data to align the source and target domain feature representations, the target domain data may be unavailable due to privacy concerns. Consequently, there is a need for methods that generalize well without access to target domain data during training. In this work, we propose an adversarial hallucination approach, which combines a class-wise hallucination module and a semantic segmentation module. Since the segmentation performance varies across different classes, we design a semantic-conditioned style hallucination layer to adaptively stylize each class. The classwise stylization parameters are generated from the semantic knowledge in the segmentation probability maps of the source domain image. Both modules compete adversarially, with the hallucination module generating increasingly 'difficult' style images to challenge the segmentation module. In response, the segmentation module improves its performance as it is trained with generated samples at an appropriate class-wise difficulty level. Experiments on state of the art domain adaptation work demonstrate the efficacy of our proposed method when no target domain data are available for training.
    Disentangling Transfer and Interference in Multi-Domain Learning. (arXiv:2107.05445v3 [cs.CV] UPDATED)
    (0 min) Humans are incredibly good at transferring knowledge from one domain to another, enabling rapid learning of new tasks. Likewise, transfer learning has enabled enormous success in many computer vision problems using pretraining. However, the benefits of transfer in multi-domain learning, where a network learns multiple tasks defined by different datasets, have not been adequately studied. Learning multiple domains could be beneficial, or these domains could interfere with each other given limited network capacity. In this work, we decipher the conditions where interference and knowledge transfer occur in multi-domain learning. We propose new metrics disentangling interference and transfer and set up experimental protocols. We further examine the roles of network capacity, task grouping, and dynamic loss weighting in reducing interference and facilitating transfer. We demonstrate our findings on the CIFAR-100, MiniPlaces, and Tiny-ImageNet datasets.
    Raising context awareness in motion forecasting. (arXiv:2109.08048v1 [cs.CV])
    (0 min) Learning-based trajectory prediction models have encountered great success, with the promise of leveraging contextual information in addition to motion history. Yet, we find that state-of-the-art forecasting methods tend to overly rely on the agent's dynamics, failing to exploit the semantic cues provided at its input. To alleviate this issue, we introduce CAB, a motion forecasting model equipped with a training procedure designed to promote the use of semantic contextual information. We also introduce two novel metrics -- dispersion and convergence-to-range -- to measure the temporal consistency of successive forecasts, which we found missing in standard metrics. Our method is evaluated on the widely adopted nuScenes Prediction benchmark.
    Quality-aware Cine Cardiac MRI Reconstruction and Analysis from Undersampled k-space Data. (arXiv:2109.07955v1 [eess.IV])
    (0 min) Cine cardiac MRI is routinely acquired for the assessment of cardiac health, but the imaging process is slow and typically requires several breath-holds to acquire sufficient k-space profiles to ensure good image quality. Several undersampling-based reconstruction techniques have been proposed during the last decades to speed up cine cardiac MRI acquisition. However, the undersampling factor is commonly fixed to conservative values before acquisition to ensure diagnostic image quality, potentially leading to unnecessarily long scan times. In this paper, we propose an end-to-end quality-aware cine short-axis cardiac MRI framework that combines image acquisition and reconstruction with downstream tasks such as segmentation, volume curve analysis and estimation of cardiac functional parameters. The goal is to reduce scan time by acquiring only a fraction of k-space data to enable the reconstruction of images that can pass quality control checks and produce reliable estimates of cardiac functional parameters. The framework consists of a deep learning model for the reconstruction of 2D+t cardiac cine MRI images from undersampled data, an image quality-control step to detect good quality reconstructions, followed by a deep learning model for bi-ventricular segmentation, a quality-control step to detect good quality segmentations and automated calculation of cardiac functional parameters. To demonstrate the feasibility of the proposed approach, we perform simulations using a cohort of selected participants from the UK Biobank (n=270), 200 healthy subjects and 70 patients with cardiomyopathies. Our results show that we can produce quality-controlled images in a scan time reduced from 12 to 4 seconds per slice, enabling reliable estimates of cardiac functional parameters such as ejection fraction within 5% mean absolute error.
    Humanly Certifying Superhuman Classifiers. (arXiv:2109.07867v1 [cs.LG])
    (0 min) Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research. Today, this challenge is especially relevant given the emergence of systems which appear to increasingly outperform human beings. In some cases, this "superhuman" performance is readily demonstrated; for example by defeating legendary human players in traditional two player games. On the other hand, it can be challenging to evaluate classification models that potentially surpass human performance. Indeed, human annotations are often treated as a ground truth, which implicitly assumes the superiority of the human over any models trained on human annotations. In reality, human annotators can make mistakes and be subjective. Evaluating the performance with respect to a genuine oracle may be more objective and reliable, even when querying the oracle is expensive or impossible. In this paper, we first raise the challenge of evaluating the performance of both humans and models with respect to an oracle which is unobserved. We develop a theory for estimating the accuracy compared to the oracle, using only imperfect human annotations for reference. Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting, which we believe will assist in understanding the stage of current research on classification. We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks, for which an oracle does not exist, and show that under our assumptions a number of models from recent years are with high probability superhuman.
    Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views. (arXiv:2109.07945v1 [cs.CV])
    (0 min) We present a system for automatically converting 2D mask object predictions and raw LiDAR point clouds into full 3D bounding boxes of objects. Because the LiDAR point clouds are partial, directly fitting bounding boxes to the point clouds is meaningless. Instead, we suggest that obtaining good results requires sharing information between all objects in the dataset jointly, over multiple frames. We then make three improvements to the baseline. First, we address ambiguities in predicting the object rotations via direct optimization in this space while still backpropagating rotation prediction through the model. Second, we explicitly model outliers and task the network with learning their typical patterns, thus better discounting them. Third, we enforce temporal consistency when video data is available. With these contributions, our method significantly outperforms previous work despite the fact that those methods use significantly more complex pipelines, 3D models and additional human-annotated external sources of prior information.
    Harnessing Perceptual Adversarial Patches for Crowd Counting. (arXiv:2109.07986v1 [cs.CV])
    (0 min) Crowd counting, which is significantly important for estimating the number of people in safety-critical scenes, has been shown to be vulnerable to adversarial examples in the physical world (e.g., adversarial patches). Though harmful, adversarial examples are also valuable for assessing and better understanding model robustness. However, existing adversarial example generation methods in crowd counting scenarios lack strong transferability among different black-box models. Motivated by the fact that transferability is positively correlated to the model-invariant characteristics, this paper proposes the Perceptual Adversarial Patch (PAP) generation framework to learn the shared perceptual features between models by exploiting both the model scale perception and position perception. Specifically, PAP exploits differentiable interpolation and density attention to help learn the invariance between models during training, leading to better transferability. In addition, we surprisingly found that our adversarial patches could also be utilized to benefit the performance of vanilla models for alleviating several challenges including cross datasets and complex backgrounds. Extensive experiments under both digital and physical world scenarios demonstrate the effectiveness of our PAP.
    Mask-Guided Feature Extraction and Augmentation for Ultra-Fine-Grained Visual Categorization. (arXiv:2109.07755v1 [cs.CV])
    (0 min) While fine-grained visual categorization (FGVC) problems have been extensively studied in recent years, ultra-fine-grained visual categorization (Ultra-FGVC) problems remain understudied. FGVC aims at classifying objects from the same species (very similar categories), while Ultra-FGVC targets the more challenging problem of classifying images at an ultra-fine granularity, where even human experts may fail to identify the visual differences. The challenges of Ultra-FGVC mainly come from two aspects: first, Ultra-FGVC often suffers from overfitting due to the lack of training samples; second, the inter-class variance among images is much smaller than in normal FGVC tasks, which makes it difficult to learn discriminative features for each class. To address these challenges, a mask-guided feature extraction and feature augmentation method is proposed in this paper to extract discriminative and informative regions of images, which are then used to augment the original feature map. The advantage of the proposed method is that the feature detection and extraction model only requires a small number of target region samples with bounding boxes for training, after which it can automatically locate the target area for a large number of images in the dataset with high detection accuracy. Experimental results on two public datasets, compared against ten state-of-the-art benchmark methods, consistently demonstrate the effectiveness of the proposed method both visually and quantitatively.
    Learnable Multi-level Frequency Decomposition and Hierarchical Attention Mechanism for Generalized Face Presentation Attack Detection. (arXiv:2109.07950v1 [cs.CV])
    (0 min) With the increased deployment of face recognition systems in our daily lives, face presentation attack detection (PAD) is attracting a lot of attention and playing a key role in securing face recognition systems. Despite the great performance achieved by hand-crafted and deep learning based methods in intra-dataset evaluations, performance drops when dealing with unseen scenarios. In this work, we propose a dual-stream convolutional neural network (CNN) framework. One stream adopts four learnable frequency filters to learn features in the frequency domain, which are less influenced by variations in sensors and illumination. The other stream leverages the RGB images to complement the features of the frequency domain. Moreover, we propose a hierarchical attention module to integrate the information from the two streams at different stages, considering the nature of deep features in different layers of the CNN. The proposed method is evaluated in intra-dataset and cross-dataset setups, and the results demonstrate that our proposed approach enhances generalizability in most experimental setups in comparison to the state-of-the-art, including methods designed explicitly for the domain adaptation/shift problem. We validate the design of our proposed PAD solution in a step-wise ablation study that covers our proposed learnable frequency decomposition, our hierarchical attention module design, and the loss function used. Training code and pre-trained models are publicly released.
    Compact Binary Fingerprint for Image Copy Re-Ranking. (arXiv:2109.07802v1 [cs.CV])
    (0 min) Image copy detection is a challenging and appealing topic in computer vision and signal processing. Recent advances in multimedia have made the distribution of images across the globe easy and fast, which leads to further issues such as forgery and the need for image copy retrieval. Local keypoint descriptors such as SIFT are used to represent images, and images are matched and retrieved based on the matching of those descriptors. Features are quantized so that searching and matching remain feasible for large databases, at the cost of some loss in accuracy. In this paper, we propose a binary feature obtained by quantizing SIFT descriptors into binary codes, and the rank list is re-examined to remove false positives. Experiments on a challenging dataset show gains in both accuracy and time.
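    A small sketch (ours; the thresholding rule is a placeholder, not the paper's quantizer) of the general idea: turn SIFT descriptors into binary fingerprints and compare them with Hamming distance, which keeps matching cheap for large databases.

        import numpy as np

        def binarize_sift(desc):
            """Quantize a 128-d SIFT descriptor into a binary fingerprint by
            thresholding each dimension at the descriptor's median."""
            return (desc > np.median(desc)).astype(np.uint8)

        def hamming(a, b):
            return int(np.count_nonzero(a != b))

        d1 = np.random.rand(128).astype(np.float32)
        d2 = np.random.rand(128).astype(np.float32)
        print(hamming(binarize_sift(d1), binarize_sift(d2)))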
    Towards Out-of-Distribution Detection with Divergence Guarantee in Deep Generative Models. (arXiv:2002.03328v4 [cs.LG] UPDATED)
    (0 min) Recent research has revealed that deep generative models, including flow-based models and variational autoencoders, may assign higher likelihood to out-of-distribution (OOD) data than to in-distribution (ID) data, even though such OOD data cannot be sampled from the model. This counterintuitive phenomenon has not been satisfactorily explained. In this paper, we prove theorems to investigate the divergences in flow-based models and give two explanations of the above phenomenon, from divergence and geometric perspectives, respectively. Based on our analysis, we propose two group anomaly detection methods. Furthermore, we decompose the KL divergence and propose a point-wise anomaly detection method. We have conducted extensive experiments on prevalent benchmarks to evaluate our methods. For group anomaly detection (GAD), our method can achieve near 100% AUROC on all problems and is robust against data manipulations. In contrast, the state-of-the-art (SOTA) GAD method performs no better than random guessing on challenging problems and can be attacked by data manipulation in almost all cases. For point-wise anomaly detection (PAD), our method is comparable to the SOTA PAD method on one category of problems and outperforms the baseline significantly on another category of problems.
    Resolution based Feature Distillation for Cross Resolution Person Re-Identification. (arXiv:2109.07871v1 [cs.CV])
    (0 min) Person re-identification (re-id) aims to retrieve images of the same identity across different camera views. Resolution mismatch occurs due to varying distances between the person of interest and the cameras, which significantly degrades re-id performance in real-world scenarios. Most existing approaches treat re-id as a low-resolution problem in which a low-resolution query image is searched in a gallery of high-resolution images. Several approaches apply image super-resolution techniques to produce high-resolution images but ignore the multiple resolutions of gallery images, which is a more realistic scenario. In this paper, we introduce channel correlations to improve the learning of features from the degraded data. In addition, to overcome the problem of multiple resolutions we propose a Resolution based Feature Distillation (RFD) approach. Such an approach learns resolution-invariant features by filtering the resolution-related features from the final feature vectors that are used to compute the distance matrix. We tested the proposed approach on two synthetically created datasets and on one original multi-resolution dataset with real degradation. Our approach improves the performance when multiple resolutions occur in the gallery and has comparable results in the case of single resolution (low-resolution re-id).
    M2RNet: Multi-modal and Multi-scale Refined Network for RGB-D Salient Object Detection. (arXiv:2109.07922v1 [cs.CV])
    (0 min) Salient object detection is a fundamental topic in computer vision. Previous methods based on RGB-D often suffer from the incompatibility of multi-modal feature fusion and the insufficiency of multi-scale feature aggregation. To tackle these two dilemmas, we propose a novel multi-modal and multi-scale refined network (M2RNet). Three essential components are presented in this network. The nested dual attention module (NDAM) explicitly exploits the combined features of RGB and depth flows. The adjacent interactive aggregation module (AIAM) gradually integrates the neighbor features of high, middle and low levels. The joint hybrid optimization loss (JHOL) makes the predictions have a prominent outline. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches.
    Eformer: Edge Enhancement based Transformer for Medical Image Denoising. (arXiv:2109.08044v1 [eess.IV])
    (0 min) In this work, we present Eformer, an edge-enhancement-based transformer, a novel architecture that builds an encoder-decoder network using transformer blocks for medical image denoising. Non-overlapping window-based self-attention is used in the transformer block, which reduces computational requirements. This work further incorporates learnable Sobel-Feldman operators to enhance edges in the image and proposes an effective way to concatenate them in the intermediate layers of our architecture. The experimental analysis is conducted by comparing deterministic learning and residual learning for the task of medical image denoising. To demonstrate the effectiveness of our approach, our model is evaluated on the AAPM-Mayo Clinic Low-Dose CT Grand Challenge Dataset and achieves state-of-the-art performance, i.e., 43.487 PSNR, 0.0067 RMSE, and 0.9861 SSIM. We believe that our work will encourage more research in transformer-based architectures for medical image denoising using residual learning.
    Towards Non-Line-of-Sight Photography. (arXiv:2109.07783v1 [eess.IV])
    (0 min) Non-line-of-sight (NLOS) imaging is based on capturing the multi-bounce indirect reflections from hidden objects. Active NLOS imaging systems rely on capturing the time of flight of light through the scene, and have shown great promise for the accurate and robust reconstruction of hidden scenes without the need for specialized scene setups and prior assumptions. Although existing methods can reconstruct 3D geometries of the hidden scene with excellent depth resolution, accurately recovering object textures and appearance at high lateral resolution remains a challenging problem. In this work, we propose a new problem formulation, called NLOS photography, to specifically address this deficiency. Rather than performing an intermediate estimate of the 3D scene geometry, our method follows a data-driven approach and directly reconstructs 2D images of an NLOS scene that closely resemble pictures taken with a conventional camera from the location of the relay wall. This formulation largely simplifies the challenging reconstruction problem by bypassing the explicit modeling of 3D geometry, and enables the learning of a deep model with a relatively small training dataset. The results are NLOS reconstructions of unprecedented lateral resolution and image quality.
    Overview of Tencent Multi-modal Ads Video Understanding Challenge. (arXiv:2109.07951v1 [cs.CV])
    (0 min) Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming to comprehensively understand ads videos. Our challenge includes two tasks: video structuring in the temporal dimension and multi-modal video classification. It asks the participants to accurately predict both the scene boundaries and the multi-label categories of each scene based on a fine-grained and ads-related category hierarchy. Therefore, our task has four distinguishing features from previous ones: ads domain, multi-modal information, temporal segmentation, and multi-label classification. It will advance the foundation of ads video understanding and have a significant impact on many ads applications like video recommendation. This paper presents an overview of our challenge, including the background of ads videos, an elaborate description of task and dataset, evaluation protocol, and our proposed baseline. By ablating the key components of our baseline, we would like to reveal the main challenges of this task and provide useful guidance for future research of this area. In this paper, we give an extended version of our challenge overview. The dataset will be publicly available at https://algo.qq.com/.
    Neural Étendue Expander for Ultra-Wide-Angle High-Fidelity Holographic Display. (arXiv:2109.08123v1 [eess.IV])
    (0 min) Holographic displays can generate light fields by dynamically modulating the wavefront of a coherent beam of light using a spatial light modulator, promising rich virtual and augmented reality applications. However, the limited spatial resolution of existing dynamic spatial light modulators imposes a tight bound on the diffraction angle. As a result, today's holographic displays possess low étendue, which is the product of the display area and the maximum solid angle of diffracted light. The low étendue forces a sacrifice of either the field of view (FOV) or the display size. In this work, we lift this limitation by presenting neural étendue expanders. This new breed of optical elements, which is learned from a natural image dataset, enables higher diffraction angles for ultra-wide FOV while maintaining both a compact form factor and the fidelity of displayed contents to human viewers. With neural étendue expanders, we achieve 64× étendue expansion of natural images with reconstruction quality (measured in PSNR) over 29 dB on simulated retinal-resolution images. As a result, the proposed approach with expansion factor 64× enables high-fidelity ultra-wide-angle holographic projection of natural images using an 8K-pixel SLM, resulting in an 18.5 mm eyebox size and 2.18 steradians FOV, covering 85% of the human stereo FOV.
    Aesthetics and neural network image representations. (arXiv:2109.08103v1 [cs.CV])
    (0 min) We analyze the spaces of images encoded by generative networks of the BigGAN architecture. We find that generic multiplicative perturbations away from the photo-realistic point often lead to images which appear as "artistic renditions" of the corresponding objects. This demonstrates an emergence of aesthetic properties directly from the structure of the photo-realistic environment coupled with its neural network parametrization. Moreover, modifying a deep semantic part of the neural network encoding leads to the appearance of symbolic visual representations.
    Field Convolutions for Surface CNNs. (arXiv:2104.03916v2 [cs.CV] UPDATED)
    (0 min) We present a novel surface convolution operator acting on vector fields that is based on a simple observation: instead of combining neighboring features with respect to a single coordinate parameterization defined at a given point, we have every neighbor describe the position of the point within its own coordinate frame. This formulation combines intrinsic spatial convolution with parallel transport in a scattering operation while placing no constraints on the filters themselves, providing a definition of convolution that commutes with the action of isometries, has increased descriptive potential, and is robust to noise and other nuisance factors. The result is a rich notion of convolution which we call field convolution, well-suited for CNNs on surfaces. Field convolutions are flexible, straight-forward to incorporate into surface learning frameworks, and their highly discriminating nature has cascading effects throughout the learning pipeline. Using simple networks constructed from residual field convolution blocks, we achieve state-of-the-art results on standard benchmarks in fundamental geometry processing tasks, such as shape classification, segmentation, correspondence, and sparse matching.
    Detection Accuracy for Evaluating Compositional Explanations of Units. (arXiv:2109.07804v1 [cs.LG])
    (0 min) The recent success of deep learning models in solving complex problems and in different domains has increased interest in understanding what they learn. Therefore, different approaches have been employed to explain these models, one of which uses human-understandable concepts as explanations. Two examples of methods that use this approach are Network Dissection and Compositional explanations. The former explains units using atomic concepts, while the latter makes explanations more expressive, replacing atomic concepts with logical forms. While logical forms are intuitively more informative than atomic concepts, it is not clear how to quantify this improvement, and their evaluation is often based on the same metric that is optimized during the search process and on hyper-parameters that must be tuned. In this paper, we propose Detection Accuracy as an evaluation metric, which measures how consistently units detect their assigned explanations. We show that this metric (1) evaluates explanations of different lengths effectively, (2) can be used as a stopping criterion for the compositional explanation search, eliminating the explanation length hyper-parameter, and (3) exposes new specialized units whose length-1 explanations are the perceptual abstractions of their longer explanations.
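    As a rough sketch (ours; the paper's exact formulation may differ), such a detection-style metric can be computed by comparing, over a probe set, whether the unit fires with whether its assigned explanation is present:

        import numpy as np

        def detection_accuracy(unit_active, concept_present):
            """Fraction of probe samples on which the unit's firing state agrees
            with the presence of its assigned explanation (both binary)."""
            unit_active = np.asarray(unit_active, dtype=bool)
            concept_present = np.asarray(concept_present, dtype=bool)
            return float(np.mean(unit_active == concept_present))

        print(detection_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75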
    ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations. (arXiv:2109.07991v1 [cs.RO])
    (0 min) Multisensory object-centric perception, reasoning, and interaction have been a key research topic in recent years. However, the progress in these directions is limited by the small set of objects available -- synthetic objects are not realistic enough and are mostly centered around geometry, while real object datasets such as YCB are often practically challenging and unstable to acquire due to international shipping, inventory, and financial cost. We present ObjectFolder, a dataset of 100 virtualized objects that addresses both challenges with two key innovations. First, ObjectFolder encodes the visual, auditory, and tactile sensory data for all objects, enabling a number of multisensory object recognition tasks, beyond existing datasets that focus purely on object geometry. Second, ObjectFolder employs a uniform, object-centric, and implicit representation for each object's visual textures, acoustic simulations, and tactile readings, making the dataset flexible to use and easy to share. We demonstrate the usefulness of our dataset as a testbed for multisensory perception and control by evaluating it on a variety of benchmark tasks, including instance recognition, cross-sensory retrieval, 3D reconstruction, and robotic grasping.
    Context-aware Padding for Semantic Segmentation. (arXiv:2109.07854v1 [cs.CV])
    (2 min) Zero padding is widely used in convolutional neural networks to prevent the size of feature maps diminishing too fast. However, it has been claimed to disturb the statistics at the border. As an alternative, we propose a context-aware (CA) padding approach to extend the image. We reformulate the padding problem as an image extrapolation problem and illustrate the effects on the semantic segmentation task. Using context-aware padding, the ResNet-based segmentation model achieves higher mean Intersection-Over-Union than the traditional zero padding on the Cityscapes and the dataset of DeepGlobe satellite imaging challenge. Furthermore, our padding does not bring noticeable overhead during training and testing.
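    A toy contrast (ours) between zero padding and a padding that fills the border from the image content, which is the spirit of the context-aware approach; the paper learns the extrapolated content rather than simply replicating the edge.

        import numpy as np

        x = np.arange(9, dtype=float).reshape(3, 3)

        zero_padded = np.pad(x, 1, mode="constant", constant_values=0)
        # Edge replication is a crude stand-in for learned image extrapolation.
        edge_padded = np.pad(x, 1, mode="edge")

        print(zero_padded)
        print(edge_padded)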
    Separating Boundary Points via Structural Regularization for Very Compact Clusters. (arXiv:2106.05430v3 [cs.CV] UPDATED)
    (2 min) Clustering algorithms have improved significantly along with deep neural networks, which provide effective representations of data. Existing methods are built upon a deep autoencoder and a self-training process that leverages the distribution of cluster assignments of samples. However, as the fundamental objective of the autoencoder is focused on efficient data reconstruction, the learnt space may be sub-optimal for clustering. Moreover, these methods require highly effective codes (i.e., representations) of the data; otherwise, the initial cluster centers often cause stability issues during self-training. Many state-of-the-art clustering algorithms use convolution operations to extract efficient codes, but their applications are limited to image data. In this regard, we propose an end-to-end deep clustering algorithm, Very Compact Clusters (VCC). VCC takes advantage of the distributions of local relationships of samples near the boundaries of clusters, so that they can be properly separated and pulled toward cluster centers to form compact clusters. Experimental results on various datasets illustrate that our proposed approach achieves competitive clustering performance against most state-of-the-art clustering methods for both image and non-image data, and its results can easily be inspected qualitatively in the learnt low-dimensional space.
    Real Time Monocular Vehicle Velocity Estimation using Synthetic Data. (arXiv:2109.07957v1 [cs.CV])
    (2 min) Vision is one of the primary sensing modalities in autonomous driving. In this paper we look at the problem of estimating the velocity of road vehicles from a camera mounted on a moving car. Contrary to prior methods that train end-to-end deep networks to estimate the vehicles' velocity from the video pixels, we propose a two-step approach where first an off-the-shelf tracker is used to extract vehicle bounding boxes and then a small neural network is used to regress the vehicle velocity from the tracked bounding boxes. Surprisingly, we find that this still achieves state-of-the-art estimation performance, with the significant benefit of separating perception from dynamics estimation via a clean, interpretable and verifiable interface which allows us to distill the statistics that are crucial for velocity estimation. We show that the latter can be used to easily generate synthetic training data in the space of bounding boxes and use this to further improve the performance of our method.
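    A minimal sketch (ours, with placeholder box parameterization and sizes) of the second stage: a small network regressing velocity from a short history of tracked bounding boxes.

        import torch
        import torch.nn as nn

        class BoxVelocityRegressor(nn.Module):
            """Regress a velocity estimate from the last T tracked boxes,
            each parameterized as (cx, cy, w, h). Illustrative only."""
            def __init__(self, history=5, out_dim=1):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(history * 4, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, out_dim),
                )

            def forward(self, boxes):
                # boxes: (batch, history, 4) -> velocity: (batch, out_dim)
                return self.net(boxes.flatten(1))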
    An End-to-End Transformer Model for 3D Object Detection. (arXiv:2109.08141v1 [cs.CV])
    (2 min) We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. Nevertheless, 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. Through extensive experiments, we show 3DETR outperforms the well-established and highly optimized VoteNet baselines on the challenging ScanNetV2 dataset by 9.5%. Furthermore, we show 3DETR is applicable to 3D tasks beyond detection, and can serve as a building block for future research.
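    A short sketch (ours; 3DETR's exact frequencies and dimensions may differ) of Fourier positional embeddings for 3D point coordinates, i.e., projecting xyz through a set of frequencies and taking sines and cosines:

        import numpy as np

        def fourier_pos_embed(xyz, num_freqs=32, scale=1.0, seed=0):
            """Map (N, 3) coordinates to (N, 2*num_freqs) Fourier features
            [sin(2*pi*x@B), cos(2*pi*x@B)] with a random frequency matrix B."""
            rng = np.random.default_rng(seed)
            B = rng.normal(scale=scale, size=(3, num_freqs))
            proj = 2.0 * np.pi * xyz @ B
            return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

        pts = np.random.rand(5, 3)
        print(fourier_pos_embed(pts).shape)  # (5, 64)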
    How Computer Science Can Aid Forest Restoration. (arXiv:2109.07898v1 [cs.CY])
    (2 min) The world faces two interlinked crises: climate change and loss of biodiversity. Forest restoration on degraded lands and surplus croplands can play a significant role both in sequestering carbon and re-establishing bio-diversity. There is a considerable body of research and practice that addresses forest restoration. However, there has been little work by computer scientists to bring powerful computational techniques to bear on this important area of work, perhaps due to a lack of awareness. In an attempt to bridge this gap, we present our vision of how techniques from computer science, broadly speaking, can aid current practice in forest restoration.
    The pitfalls of using open data to develop deep learning solutions for COVID-19 detection in chest X-rays. (arXiv:2109.08020v1 [cs.CV])
    (2 min) Since the emergence of COVID-19, deep learning models have been developed to identify COVID-19 from chest X-rays. With little to no direct access to hospital data, the AI community relies heavily on public data comprising numerous data sources. Model performance results have been exceptional when training and testing on open-source data, surpassing the reported capabilities of AI in pneumonia detection prior to the COVID-19 outbreak. In this study, impactful models are trained on a widely used open-source dataset and tested on an external test set and a hospital dataset, for the task of classifying chest X-rays into one of three classes: COVID-19, non-COVID pneumonia and no pneumonia. Classification performance of the models investigated is evaluated through ROC curves, confusion matrices and standard classification metrics. Explainability modules are implemented to explore the image features most important to classification. Data analysis and model evaluations show that the popular open-source dataset COVIDx is not representative of the real clinical problem and that results from testing on it are inflated. Dependence on open-source data can leave models vulnerable to bias and confounding variables, requiring careful analysis to develop clinically useful and viable AI tools for COVID-19 detection in chest X-rays.
    Dual-Awareness Attention for Few-Shot Object Detection. (arXiv:2102.12152v3 [cs.CV] UPDATED)
    (2 min) While recent progress has significantly boosted few-shot classification (FSC) performance, few-shot object detection (FSOD) remains challenging for modern learning systems. Existing FSOD systems follow FSC approaches, ignoring critical issues such as spatial variability and uncertain representations, and consequently result in low performance. Observing this, we propose a novel Dual-Awareness Attention (DAnA) mechanism that enables networks to adaptively interpret the given support images. DAnA transforms support images into query-position-aware (QPA) features, guiding detection networks precisely by assigning customized support information to each local region of the query. In addition, the proposed DAnA component is flexible and adaptable to multiple existing object detection frameworks. By adopting DAnA, conventional object detection networks, Faster R-CNN and RetinaNet, which are not designed explicitly for few-shot learning, reach state-of-the-art performance in FSOD tasks. In comparison with previous methods, our model significantly increases the performance by 47% (+6.9 AP), showing remarkable ability under various evaluation settings.
    Measuring the Biases and Effectiveness of Content-Style Disentanglement. (arXiv:2008.12378v4 [cs.CV] UPDATED)
    (2 min) A recent spate of state-of-the-art semi- and unsupervised solutions disentangle and encode image "content" into a spatial tensor and image appearance or "style" into a vector, to achieve good performance in spatially equivariant tasks (e.g. image-to-image translation). To achieve this, they employ different model designs, learning objectives, and data biases. While considerable effort has been made to measure disentanglement in vector representations, and assess its impact on task performance, such analysis for (spatial) content-style disentanglement is lacking. In this paper, we conduct an empirical study to investigate the role of different biases in content-style disentanglement settings and unveil the relationship between the degree of disentanglement and task performance. In particular, we consider the setting where we: (i) identify key design choices and learning constraints for three popular content-style disentanglement models; (ii) relax or remove such constraints in an ablation fashion; and (iii) use two metrics to measure the degree of disentanglement and assess its effect on each task's performance. Our experiments reveal that there is a "sweet spot" between disentanglement, task performance and, surprisingly, content interpretability, suggesting that blindly forcing higher disentanglement can hurt model performance and the semantic interpretability of the content factors. Our findings, as well as the task-independent metrics used, can be used to guide the design and selection of new models for tasks where content-style representations are useful.
    Multimodal Trajectory Prediction Conditioned on Lane-Graph Traversals. (arXiv:2106.15004v2 [cs.CV] UPDATED)
    (2 min) Accurately predicting the future motion of surrounding vehicles requires reasoning about the inherent uncertainty in driving behavior. This uncertainty can be loosely decoupled into lateral (e.g., keeping lane, turning) and longitudinal (e.g., accelerating, braking). We present a novel method that combines learned discrete policy rollouts with a focused decoder on subsets of the lane graph. The policy rollouts explore different goals given current observations, ensuring that the model captures lateral variability. Longitudinal variability is captured by our latent variable model decoder that is conditioned on various subsets of the lane graph. Our model achieves state-of-the-art performance on the nuScenes motion prediction dataset, and qualitatively demonstrates excellent scene compliance. Detailed ablations highlight the importance of the policy rollouts and the decoder architecture.
    Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering. (arXiv:2109.08029v1 [cs.CV])
    (2 min) Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective in a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that our unimodal model is complementary to multimodal models in both OK-VQA and VQA 2.0, and yields the best result to date in OK-VQA among systems not using external knowledge graphs, and comparable to systems that do use them. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
    DeFlow: Learning Complex Image Degradations from Unpaired Data with Conditional Flows. (arXiv:2101.05796v2 [cs.CV] UPDATED)
    (2 min) The difficulty of obtaining paired data remains a major bottleneck for learning image restoration and enhancement models for real-world applications. Current strategies aim to synthesize realistic training data by modeling noise and degradations that appear in real-world settings. We propose DeFlow, a method for learning stochastic image degradations from unpaired data. Our approach is based on a novel unpaired learning formulation for conditional normalizing flows. We model the degradation process in the latent space of a shared flow encoder-decoder network. This allows us to learn the conditional distribution of a noisy image given the clean input by solely minimizing the negative log-likelihood of the marginal distributions. We validate our DeFlow formulation on the task of joint image restoration and super-resolution. The models trained with the synthetic data generated by DeFlow outperform previous learnable approaches on three recent datasets. Code and trained models are available at: https://github.com/volflow/DeFlow
    Pedestrian Trajectory Prediction with Convolutional Neural Networks. (arXiv:2010.05796v2 [cs.CV] UPDATED)
    (2 min) Predicting the future trajectories of pedestrians is a challenging problem that has a range of applications, from crowd surveillance to autonomous driving. In the literature, methods to approach pedestrian trajectory prediction have evolved, transitioning from physics-based models to data-driven models based on recurrent neural networks. In this work, we propose a new approach to pedestrian trajectory prediction, with the introduction of a novel 2D convolutional model. This new model outperforms recurrent models, and it achieves state-of-the-art results on the ETH and TrajNet datasets. We also present an effective system to represent pedestrian positions and powerful data augmentation techniques, such as the addition of Gaussian noise and the use of random rotations, which can be applied to any model. As an additional exploratory analysis, we present experimental results on the inclusion of occupancy methods to model social information, which empirically show that these methods are ineffective in capturing social interaction.
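    A minimal sketch of the augmentation described in the abstract above (random rotations plus Gaussian noise applied to pedestrian positions); the noise scale, rotation range, and function name are illustrative choices rather than the paper's settings.

```python
import numpy as np

def augment_trajectory(traj, noise_std=0.05, rng=None):
    """Apply a random planar rotation and additive Gaussian noise to a trajectory.

    traj: (T, 2) array of (x, y) pedestrian positions.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return traj @ rot.T + rng.normal(scale=noise_std, size=traj.shape)

traj = np.cumsum(np.random.randn(12, 2) * 0.1, axis=0)  # toy 12-step trajectory
print(augment_trajectory(traj).shape)                    # (12, 2)
```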
    SketchHairSalon: Deep Sketch-based Hair Image Synthesis. (arXiv:2109.07874v1 [cs.CV])
    (2 min) Recent deep generative models allow real-time generation of hair images from sketch inputs. Existing solutions often require a user-provided binary mask to specify a target hair shape. This not only costs users extra labor but also fails to capture complicated hair boundaries. Those solutions usually encode hair structures via orientation maps, which, however, are not very effective at encoding complex structures. We observe that colored hair sketches already implicitly define target hair shapes as well as hair appearance and are more flexible for depicting hair structures than orientation maps. Based on these observations, we present SketchHairSalon, a two-stage framework for generating realistic hair images directly from freehand sketches depicting desired hair structure and appearance. At the first stage, we train a network to predict a hair matte from an input hair sketch, with an optional set of non-hair strokes. At the second stage, another network is trained to synthesize the structure and appearance of hair images from the input sketch and the generated matte. To make the networks in the two stages aware of the long-term dependency of strokes, we apply self-attention modules to them. To train these networks, we present a new dataset containing thousands of annotated hair sketch-image pairs and corresponding hair mattes. Two efficient methods for sketch completion are proposed to automatically complete repetitive braided parts and hair strokes, respectively, thus reducing the workload of users. Based on the trained networks and the two sketch completion strategies, we build an intuitive interface to allow even novice users to design visually pleasing hair images exhibiting various hair structures and appearance via freehand sketches. The qualitative and quantitative evaluations show the advantages of the proposed system over the existing or alternative solutions.
    Neural Architecture Search in operational context: a remote sensing case-study. (arXiv:2109.08028v1 [cs.CV])
    (2 min) Deep learning has in recent years become a cornerstone tool fueling key innovations in industry, such as autonomous driving. To attain good performance, the neural network architecture used for a given application must be chosen with care. These architectures are often handcrafted and therefore prone to human biases and sub-optimal selection. Neural Architecture Search (NAS) is a framework introduced to mitigate such risks by jointly optimizing the network architecture and its weights. Despite its novelty, it has been applied to complex tasks with significant results, e.g., semantic image segmentation. In this technical paper, we aim to evaluate its ability to tackle a challenging operational task: semantic segmentation of objects of interest in satellite imagery. Designing a NAS framework is not trivial and depends strongly on hardware constraints. We therefore motivate our NAS approach selection and provide corresponding implementation details. We also present novel ideas to carry out other such use-case studies.
    The Pursuit of Knowledge: Discovering and Localizing Novel Categories using Dual Memory. (arXiv:2105.01652v3 [cs.CV] UPDATED)
    (0 min) We tackle object category discovery, which is the problem of discovering and localizing novel objects in a large unlabeled dataset. While existing methods show results on datasets with less cluttered scenes and fewer object instances per image, we present our results on the challenging COCO dataset. Moreover, we argue that, rather than discovering new categories from scratch, discovery algorithms can benefit from identifying what is already known and focusing their attention on the unknown. We propose a method that exploits prior knowledge about certain object types to discover new categories by leveraging two memory modules, namely Working and Semantic memory. We show the performance of our detector on the COCO minival dataset to demonstrate its in-the-wild capabilities.
    A Survey on Temporal Sentence Grounding in Videos. (arXiv:2109.08039v1 [cs.CV])
    (0 min) Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural language, without restrictions from predefined action categories. Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between the two modalities (i.e., text and video). In this survey, we give a comprehensive overview of TSGV, which i) summarizes the taxonomy of existing methods, ii) provides a detailed description of the evaluation protocols (i.e., datasets and metrics) used in TSGV, and iii) discusses in depth potential problems of current benchmarking designs and research directions for further investigation. To the best of our knowledge, this is the first systematic survey on temporal sentence grounding. More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, end-to-end methods, reinforcement learning-based methods, and weakly supervised methods. Then we present the benchmark datasets and evaluation metrics to assess current research progress. Finally, we discuss some limitations in TSGV by pointing out potential problems left unresolved in the current evaluation protocols, which may push forward more cutting-edge research in TSGV. Besides, we also share our insights on several promising directions, including three typical tasks with new and practical settings based on TSGV.
    Marginal MAP Estimation for Inverse RL under Occlusion with Observer Noise. (arXiv:2109.07788v1 [cs.RO])
    (2 min) We consider the problem of learning the behavioral preferences of an expert engaged in a task from noisy and partially-observable demonstrations. This is motivated by real-world applications such as a line robot learning from observing a human worker, where some observations are occluded by environmental objects that cannot be removed. Furthermore, robotic perception tends to be imperfect and noisy. Previous techniques for inverse reinforcement learning (IRL) take the approach of either omitting the missing portions or inferring them as part of expectation-maximization, which tends to be slow and prone to local optima. We present a new method that generalizes the well-known Bayesian maximum-a-posteriori (MAP) IRL method by marginalizing the occluded portions of the trajectory. This is additionally extended with an observation model to account for perception noise. We show that the marginal MAP (MMAP) approach significantly improves on the previous IRL technique under occlusion in both formative evaluations on a toy problem and in a summative evaluation on an onion sorting line task by a robot.
    Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments. (arXiv:2109.07855v1 [cs.CV])
    (0 min) Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment. Trying to simulate this learning process in machines is a challenging task, also due to the inherent difficulty in creating conditions for designing continuously evolving dynamics that are typical of the real world. Many existing research works involve training and testing of virtual agents on datasets of static images or short videos, considering sequences of distinct learning tasks. However, in order to devise continual learning algorithms that operate in more realistic conditions, it is fundamental to gain access to rich, fully customizable and controlled experimental playgrounds. Focussing on the specific case of vision, we thus propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance. Scenes are composed of objects that move along variable routes with different and fully customizable timings, and randomness can also be included in their evolution. A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives. These general principles are concretely implemented by exploiting a recently published 3D virtual environment. The user can generate scenes without needing strong skills in computer graphics, since all the generation facilities are exposed through a simple high-level Python interface. We publicly share the proposed generator.
    Rethinking Coarse-to-Fine Approach in Single Image Deblurring. (arXiv:2108.05054v2 [cs.CV] UPDATED)
    (2 min) Coarse-to-fine strategies have been extensively used for the architecture design of single image deblurring networks. Conventional methods typically stack sub-networks with multi-scale input images and gradually improve sharpness of images from the bottom sub-network to the top sub-network, yielding inevitably high computational costs. Toward a fast and accurate deblurring network design, we revisit the coarse-to-fine strategy and present a multi-input multi-output U-net (MIMO-UNet). The MIMO-UNet has three distinct features. First, the single encoder of the MIMO-UNet takes multi-scale input images to ease the difficulty of training. Second, the single decoder of the MIMO-UNet outputs multiple deblurred images with different scales to mimic multi-cascaded U-nets using a single U-shaped network. Last, asymmetric feature fusion is introduced to merge multi-scale features in an efficient manner. Extensive experiments on the GoPro and RealBlur datasets demonstrate that the proposed network outperforms the state-of-the-art methods in terms of both accuracy and computational complexity. Source code is available for research purposes at https://github.com/chosj95/MIMO-UNet.
    Rotation Averaging in a Split Second: A Primal-Dual Method and a Closed-Form for Cycle Graphs. (arXiv:2109.08046v1 [cs.CV])
    (0 min) A cornerstone of geometric reconstruction, rotation averaging seeks the set of absolute rotations that optimally explains a set of measured relative orientations between them. In spite of being an integral part of bundle adjustment and structure-from-motion, averaging rotations is both a non-convex and high-dimensional optimization problem. In this paper, we address it from a maximum likelihood estimation standpoint and make a twofold contribution. Firstly, we set forth a novel initialization-free primal-dual method which we show empirically to converge to the global optimum. Further, we derive what is, to our knowledge, the first optimal closed-form solution for rotation averaging in cycle graphs and contextualize this result within spectral graph theory. Our proposed methods achieve significant gains in both precision and performance.
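    The closed-form result above is the paper's own, but the cycle-graph setting it addresses can be illustrated with a toy consistency check: composing the measured relative rotations around a cycle should give the identity, and the residual angle quantifies the noise that averaging has to distribute. The sketch below only measures that residual; it is not the paper's solver, and the noise level is arbitrary.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)
# Toy relative rotations R_{i,i+1} around a 4-node cycle.
relatives = [R.from_rotvec(rng.normal(scale=0.3, size=3)) for _ in range(4)]

# Compose the relative rotations around the cycle; for perfectly consistent
# measurements the result is the identity rotation.
cycle = relatives[0]
for rel in relatives[1:]:
    cycle = rel * cycle
residual_deg = np.degrees(np.linalg.norm(cycle.as_rotvec()))
print(f"cycle inconsistency: {residual_deg:.2f} degrees")
```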
    Point Cloud Pre-training by Mixing and Disentangling. (arXiv:2109.00452v2 [cs.CV] UPDATED)
    (0 min) The annotation for large-scale point clouds is still time-consuming and unavailable for many real-world tasks. Point cloud pre-training is one potential solution for obtaining a scalable model for fast adaptation. Therefore, in this paper, we investigate a new self-supervised learning approach, called Mixing and Disentangling (MD), for point cloud pre-training. As the name implies, we explore how to separate the original point cloud from the mixed point cloud, and leverage this challenging task as a pretext optimization objective for model training. Considering the limited training data in the original dataset, which is much smaller than the prevailing ImageNet, the mixing process can efficiently generate more high-quality samples. We build one baseline network to verify our intuition, which simply contains two modules, encoder and decoder. Given a mixed point cloud, the encoder is first pre-trained to extract the semantic embedding. Then an instance-adaptive decoder is harnessed to disentangle the point clouds according to the embedding. Albeit simple, the encoder is inherently able to capture the point cloud keypoints after training and can be quickly adapted to downstream tasks including classification and segmentation by the pre-training and fine-tuning paradigm. Extensive experiments on two datasets show that the encoder pre-trained with our MD approach significantly surpasses the encoder trained from scratch and converges quickly. In ablation studies, we further study the effect of each component and discuss the advantages of the proposed self-supervised learning strategy. We hope this self-supervised learning attempt on point clouds can pave the way for reducing deep models' dependence on large-scale labeled data and for saving substantial annotation costs in the future.
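    A rough sketch of the mixing step behind the MD pretext task described above: sub-sample two point clouds, concatenate them, and keep per-point source labels that a disentangling decoder could be trained to recover. The sampling ratio and function name are assumptions, not the paper's exact procedure.

```python
import numpy as np

def mix_point_clouds(pc_a, pc_b, keep_ratio=0.5, rng=None):
    """Mix two point clouds and return the mixed cloud plus per-point source labels.

    pc_a, pc_b: (N, 3) arrays of xyz coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(min(len(pc_a), len(pc_b)) * keep_ratio)
    part_a = pc_a[rng.choice(len(pc_a), n, replace=False)]
    part_b = pc_b[rng.choice(len(pc_b), n, replace=False)]
    mixed = np.concatenate([part_a, part_b], axis=0)
    labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
    perm = rng.permutation(len(mixed))
    return mixed[perm], labels[perm]

mixed, labels = mix_point_clouds(np.random.rand(1024, 3), np.random.rand(1024, 3))
print(mixed.shape, labels.shape)  # (1024, 3) (1024,)
```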
    DisUnknown: Distilling Unknown Factors for Disentanglement Learning. (arXiv:2109.08090v1 [cs.LG])
    (2 min) Disentangling data into interpretable and independent factors is critical for controllable generation tasks. With the availability of labeled data, supervision can help enforce the separation of specific factors as expected. However, it is often expensive or even impossible to label every single factor to achieve fully-supervised disentanglement. In this paper, we adopt a general setting where all factors that are hard to label or identify are encapsulated as a single unknown factor. Under this setting, we propose a flexible weakly-supervised multi-factor disentanglement framework DisUnknown, which Distills Unknown factors for enabling multi-conditional generation regarding both labeled and unknown factors. Specifically, a two-stage training approach is adopted to first disentangle the unknown factor with an effective and robust training method, and then train the final generator with the proper disentanglement of all labeled factors utilizing the unknown distillation. To demonstrate the generalization capacity and scalability of our method, we evaluate it on multiple benchmark datasets qualitatively and quantitatively and further apply it to various real-world applications on complicated datasets.
    Invertable Frowns: Video-to-Video Facial Emotion Translation. (arXiv:2109.08061v1 [cs.CV])
    (2 min) We present Wav2Lip-Emotion, a video-to-video translation architecture that modifies facial expressions of emotion in videos of speakers. Previous work modifies emotion in images, uses a single image to produce a video with animated emotion, or puppets facial expressions in videos with landmarks from a reference video. However, many use cases such as modifying an actor's performance in post-production, coaching individuals to be more animated speakers, or touching up emotion in a teleconference require a video-to-video translation approach. We explore a method to maintain speakers' lip movements, identity, and pose while translating their expressed emotion. Our approach extends an existing multi-modal lip synchronization architecture to modify the speaker's emotion using L1 reconstruction and pre-trained emotion objectives. We also propose a novel automated emotion evaluation approach and corroborate it with a user study. Both find that we succeed in modifying emotion while maintaining lip synchronization. Visual quality is somewhat diminished, with a trade-off across model variants between the degree of emotion modification and visual quality. Nevertheless, we demonstrate (1) that facial expressions of emotion can be modified with nothing other than L1 reconstruction and pre-trained emotion objectives and (2) that our automated emotion evaluation approach aligns with human judgements.
    UCP-Net: Unstructured Contour Points for Instance Segmentation. (arXiv:2109.07592v1 [cs.CV])
    (2 min) The goal of interactive segmentation is to assist users in producing segmentation masks as fast and as accurately as possible. Interactions have to be simple and intuitive and the number of interactions required to produce a satisfactory segmentation mask should be as low as possible. In this paper, we propose a novel approach to interactive segmentation based on unconstrained contour clicks for initial segmentation and segmentation refinement. Our method is class-agnostic and produces accurate segmentation masks (IoU > 85%) for a lower number of user interactions than state-of-the-art methods on popular segmentation datasets (COCO MVal, SBD and Berkeley).
    A Framework for Multisensory Foresight for Embodied Agents. (arXiv:2109.07561v1 [cs.RO])
    (2 min) Predicting future sensory states is crucial for learning agents such as robots, drones, and autonomous vehicles. In this paper, we couple multiple sensory modalities with exploratory actions and propose a predictive neural network architecture to address this problem. Most existing approaches rely on large, manually annotated datasets, or only use visual data as a single modality. In contrast, the unsupervised method presented here uses multi-modal perceptions for predicting future visual frames. As a result, the proposed model is more comprehensive and can better capture the spatio-temporal dynamics of the environment, leading to more accurate visual frame prediction. The other novelty of our framework is the use of sub-networks dedicated to anticipating future haptic, audio, and tactile signals. The framework was tested and validated with a dataset containing 4 sensory modalities (vision, haptic, audio, and tactile) on a humanoid robot performing 9 behaviors multiple times on a large set of objects. While the visual information is the dominant modality, utilizing the additional non-visual modalities improves the accuracy of predictions.
    Dense Pruning of Pointwise Convolutions in the Frequency Domain. (arXiv:2109.07707v1 [cs.CV])
    (2 min) Depthwise separable convolutions and frequency-domain convolutions are two recent ideas for building efficient convolutional neural networks. They are seemingly incompatible: the vast majority of operations in depthwise separable CNNs are in pointwise convolutional layers, but pointwise layers use 1x1 kernels, which do not benefit from frequency transformation. This paper unifies these two ideas by transforming the activations, not the kernels. Our key insights are that 1) pointwise convolutions commute with frequency transformation and thus can be computed in the frequency domain without modification, 2) each channel within a given layer has a different level of sensitivity to frequency domain pruning, and 3) each channel's sensitivity to frequency pruning is approximately monotonic with respect to frequency. We leverage this knowledge by proposing a new technique which wraps each pointwise layer in a discrete cosine transform (DCT) which is truncated to selectively prune coefficients above a given threshold as per the needs of each channel. To learn which frequencies should be pruned from which channels, we introduce a novel learned parameter which specifies each channel's pruning threshold. We add a new regularization term which incentivizes the model to decrease the number of retained frequencies while still maintaining task accuracy. Unlike weight pruning techniques which rely on sparse operators, our contiguous frequency band pruning results in fully dense computation. We apply our technique to MobileNetV2 and in the process reduce computation time by 22% and incur <1% accuracy degradation.
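    To make the frequency-domain idea above concrete, here is a toy version of a pointwise (1x1) convolution applied to DCT-transformed activations with a fixed number of low frequencies kept per input channel. The paper learns per-channel pruning thresholds and adds a regularizer; the keep counts, shapes, and function name below are illustrative assumptions only.

```python
import numpy as np
from scipy.fft import dctn, idctn

def pointwise_conv_in_dct_domain(x, weight, keep):
    """Toy pointwise convolution on DCT-transformed activations with per-channel truncation.

    x:      (C_in, H, W) activations
    weight: (C_out, C_in) 1x1 convolution weights
    keep:   per-input-channel number of low frequencies to retain (fixed here;
            the paper learns a threshold per channel).
    """
    coeffs = dctn(x, axes=(1, 2), norm="ortho")        # spatial DCT per channel
    for c, k in enumerate(keep):                       # zero coefficients outside the kept square
        mask = np.zeros_like(coeffs[c])
        mask[:k, :k] = 1.0
        coeffs[c] *= mask
    # A 1x1 conv is a linear map over channels only, so it commutes with the
    # spatial DCT and can be applied directly to the (truncated) coefficients.
    mixed = np.einsum("oc,chw->ohw", weight, coeffs)
    return idctn(mixed, axes=(1, 2), norm="ortho")

x = np.random.randn(4, 16, 16)
w = np.random.randn(8, 4)
print(pointwise_conv_in_dct_domain(x, w, keep=[16, 8, 8, 4]).shape)  # (8, 16, 16)
```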
    Voxel-wise Cross-Volume Representation Learning for 3D Neuron Reconstruction. (arXiv:2108.06522v2 [eess.IV] UPDATED)
    (2 min) Automatic 3D neuron reconstruction is critical for analysing the morphology and functionality of neurons in brain circuit activities. However, the performance of existing tracing algorithms is limited by low image quality. Recently, a series of deep learning based segmentation methods have been proposed to improve the quality of raw 3D optical image stacks by removing noise and restoring neuronal structures from low-contrast background. Due to the variety of neuron morphology and the lack of large neuron datasets, most current neuron segmentation models rely on introducing complex and specially-designed submodules to a base architecture with the aim of encoding better feature representations. Though successful, this puts an extra burden on computation during inference. Therefore, rather than modifying the base network, we shift our focus to the dataset itself. The encoder-decoder backbone used in most neuron segmentation models attends only to intra-volume voxel points to learn structural features of neurons but neglects the shared intrinsic semantic features of voxels belonging to the same category among different volumes, which are also important for expressive representation learning. Hence, to better utilise the scarce dataset, we propose to explicitly exploit such intrinsic features of voxels through a novel voxel-level cross-volume representation learning paradigm on the basis of an encoder-decoder segmentation model. Our method introduces no extra cost during inference. Evaluated on 42 3D neuron images from the BigNeuron project, our proposed method is demonstrated to improve the learning ability of the original segmentation model and to further enhance the reconstruction performance.
    EchoCP: An Echocardiography Dataset in Contrast Transthoracic Echocardiography for Patent Foramen Ovale Diagnosis. (arXiv:2105.08267v2 [eess.IV] UPDATED)
    (2 min) Patent foramen ovale (PFO) is a potential separation between the septum primum and septum secundum located in the anterosuperior portion of the atrial septum. PFO is one of the main factors causing cryptogenic stroke, which is the fifth leading cause of death in the United States. For PFO diagnosis, contrast transthoracic echocardiography (cTTE) is preferred as a more robust method compared with others. However, current PFO diagnosis through cTTE is extremely slow as it is performed manually by sonographers on echocardiography videos. Currently there is no publicly available dataset for this important topic in the community. In this paper, we present EchoCP, the first echocardiography dataset in cTTE targeting PFO diagnosis. EchoCP consists of 30 patients with both rest and Valsalva maneuver videos covering various PFO grades. We further establish an automated baseline method for PFO diagnosis based on the state-of-the-art cardiac chamber segmentation technique, which achieves an average Dice score of 0.89, but only 0.60/0.67 mean accuracies for PFO diagnosis, leaving large room for improvement. We hope that the challenging EchoCP dataset can stimulate further research and lead to innovative and generic solutions that would have an impact in multiple domains. Our dataset is released.
    FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition. (arXiv:2109.07916v1 [eess.AS])
    (0 min) Using mel-spectrograms over conventional MFCC features, we assess the abilities of convolutional neural networks to accurately recognize and classify emotions from speech data. We introduce FSER, a speech emotion recognition model trained on four valid speech databases, achieving a high classification accuracy of 95.05% over 8 different emotion classes: anger, anxiety, calm, disgust, happiness, neutral, sadness, surprise. On each benchmark dataset, FSER outperforms the best models introduced so far, achieving state-of-the-art performance. We show that FSER stays reliable, independently of the language, sex identity, and any other external factor. Additionally, we describe how FSER could potentially be used to improve mental and emotional health care and how our analysis and findings serve as guidelines and benchmarks for further work in the same direction.
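    A minimal sketch of the input representation the abstract above favors (log-mel spectrograms rather than MFCCs), using librosa; the file path, sample rate, and FFT parameters are placeholders, not the paper's configuration.

```python
import numpy as np
import librosa

# Load an utterance and compute a log-mel spectrogram to feed a CNN as a 2D input.
y, sr = librosa.load("utterance.wav", sr=16000)            # placeholder audio path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)                                        # (64, num_frames)
```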
    Explainability Requires Interactivity. (arXiv:2109.07869v1 [cs.LG])
    (2 min) When explaining the decisions of deep neural networks, simple stories are tempting but dangerous. Especially in computer vision, the most popular explanation approaches give a false sense of comprehension to their users and provide an overly simplistic picture. We introduce an interactive framework to understand the highly complex decision boundaries of modern vision models. It allows the user to exhaustively inspect, probe, and test a network's decisions. Across a range of case studies, we compare the power of our interactive approach to static explanation methods, showing how these can lead a user astray, with potentially severe consequences.
    Urdu text in natural scene images: a new dataset and preliminary text detection. (arXiv:2109.08060v1 [cs.CV])
    (2 min) Text detection in natural scene images for content analysis is an interesting task. The research community has seen some great developments for English/Mandarin text detection. However, Urdu text extraction in natural scene images is a task not well addressed. In this work, firstly, a new dataset is introduced for Urdu text in natural scene images. The dataset comprises 500 standalone images acquired from real scenes. Secondly, the channel-enhanced Maximally Stable Extremal Region (MSER) method is applied to extract Urdu text regions as candidates in an image. A two-stage filtering mechanism is applied to eliminate non-candidate regions. In the first stage, text and noise are classified based on their geometric properties. In the second stage, a support vector machine classifier is trained to discard non-text candidate regions. After this, text candidate regions are linked using centroid-based vertical and horizontal distances. Text lines are further analyzed by a different classifier based on HOG features to remove non-text regions. Extensive experimentation is performed on the locally developed dataset to evaluate the performance. The experimental results show good performance on test set images. The dataset will be made available for research use. To the best of our knowledge, the work is the first of its kind for the Urdu language, provides a good dataset for free research use, and serves as a performance baseline on the task of Urdu text extraction.
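    A minimal sketch of the candidate-extraction step described above, using OpenCV's MSER detector; the paper adds channel enhancement and two stages of filtering on top, and the image path here is a placeholder.

```python
import cv2

img = cv2.imread("scene.jpg")                       # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)          # candidate (possibly text) regions
print(f"{len(regions)} candidate regions to filter")
```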
    Detecting Propaganda Techniques in Memes. (arXiv:2109.08013v1 [cs.CV])
    (2 min) Propaganda can be defined as a form of communication that aims to influence the opinions or the actions of people towards a specific goal; this is achieved by means of well-defined rhetorical and psychological devices. Propaganda, in the form we know it today, can be dated back to the beginning of the 17th century. However, it is with the advent of the Internet and social media that it has started to spread on a much larger scale than before, thus becoming a major societal and political issue. Nowadays, a large fraction of propaganda in social media is multimodal, mixing textual with visual content. With this in mind, here we propose a new multi-label multimodal task: detecting the type of propaganda techniques used in memes. We further create and release a new corpus of 950 memes, carefully annotated with 22 propaganda techniques, which can appear in the text, in the image, or in both. Our analysis of the corpus shows that understanding both modalities together is essential for detecting these techniques. This is further confirmed in our experiments with several state-of-the-art multimodal models.
    Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning. (arXiv:2109.07799v1 [cs.CV])
    (2 min) Automatic transcription of scene understanding in images and videos is a step towards artificial general intelligence. Image captioning is the task of describing meaningful information in an image using computer vision techniques. Automated image captioning techniques utilize an encoder-decoder architecture, where the encoder extracts features from an image and the decoder generates a transcript. In this work, we investigate two unexplored ideas for image captioning using transformers: first, enforcing the use of objects' relevance to their surrounding environment; second, learning an explicit association between labels and language constructs. We propose a label-attention Transformer with geometrically coherent objects (LATGeO). The proposed technique acquires a proposal of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using a label-attention module. Object coherence is defined using the localized ratio of the geometrical properties of the proposals. The label-attention module associates the extracted object classes to the available dictionary using self-attention layers. The experimentation results show that objects' relevance in surroundings and the binding of their visual features with geometrically localized ratios and associated labels help in defining meaningful captions. The proposed framework is tested on the MSCOCO dataset, and a thorough evaluation showing overall better quantitative scores demonstrates its superiority.
    Heterogeneous Relational Complement for Vehicle Re-identification. (arXiv:2109.07894v1 [cs.CV])
    (2 min) The crucial problem in vehicle re-identification is to find the same vehicle identity when reviewing this object from cross-view cameras, which sets a higher demand for learning viewpoint-invariant representations. In this paper, we propose to solve this problem from two aspects: constructing robust feature representations and proposing camera-sensitive evaluations. We first propose a novel Heterogeneous Relational Complement Network (HRCN) by incorporating region-specific features and cross-level features as complements for the original high-level output. Considering the distributional differences and semantic misalignment, we propose graph-based relation modules to embed these heterogeneous features into one unified high-dimensional space. On the other hand, considering the deficiencies of cross-camera evaluations in existing measures (i.e., CMC and AP), we then propose a Cross-camera Generalization Measure (CGM) to improve the evaluations by introducing position-sensitivity and cross-camera generalization penalties. We further construct a new benchmark of existing models with our proposed CGM and experimental results reveal that our proposed HRCN model achieves new state-of-the-art in VeRi-776, VehicleID, and VERI-Wild.
    MHFC: Multi-Head Feature Collaboration for Few-Shot Learning. (arXiv:2109.07785v1 [cs.CV])
    (2 min) Few-shot learning (FSL) aims to address the data-scarcity problem. A standard FSL framework is composed of two components: (1) Pre-train. Employ the base data to generate a CNN-based feature extraction model (FEM). (2) Meta-test. Apply the trained FEM to acquire the novel data's features and recognize them. FSL relies heavily on the design of the FEM. However, various FEMs have distinct emphases. For example, several may focus more attention on the contour information, whereas others may lay particular emphasis on the texture information. The single-head feature is only a one-sided representation of the sample. Besides the negative influence of cross-domain shift (e.g., the trained FEM cannot adapt to the novel classes flawlessly), the distribution of novel data may have a certain degree of deviation compared with the ground truth distribution, which is dubbed the distribution-shift problem (DSP). To address the DSP, we propose the Multi-Head Feature Collaboration (MHFC) algorithm, which attempts to project the multi-head features (e.g., multiple features extracted from a variety of FEMs) to a unified space and fuse them to capture more discriminative information. First, we introduce a subspace learning method to transform the multi-head features to aligned low-dimensional representations. It corrects the DSP by learning features with more powerful discrimination and overcomes the problem of inconsistent measurement scales from different head features. Then, we design an attention block to update combination weights for each head feature automatically. It comprehensively considers the contribution of various perspectives and further improves the discrimination of features. We evaluate the proposed method on five benchmark datasets (including cross-domain experiments) and achieve significant improvements of 2.1%-7.8% compared with the state of the art.
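    A rough, simplified sketch of the multi-head fusion idea above: project each extractor's features into a shared low-dimensional space and combine them with normalized weights. The PCA projection and the variance-based weighting are stand-ins for the paper's subspace learning and attention block, so treat this as an assumption-laden illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_multi_head_features(head_features, dim=32):
    """Project each head's features to `dim` and combine with softmax weights.

    head_features: list of (N, D_k) arrays, one per feature extraction model.
    """
    projected = [PCA(n_components=dim).fit_transform(f) for f in head_features]
    scores = np.array([p.var() for p in projected])     # crude per-head score (assumption)
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax combination weights
    return sum(w * p for w, p in zip(weights, projected))

heads = [np.random.randn(100, 256), np.random.randn(100, 512)]
print(fuse_multi_head_features(heads).shape)            # (100, 32)
```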
    A Multi-Task Cross-Task Learning Architecture for Ad-hoc Uncertainty Estimation in 3D Cardiac MRI Image Segmentation. (arXiv:2109.07702v1 [eess.IV])
    (2 min) Medical image segmentation has benefited significantly from deep learning architectures. Furthermore, semi-supervised learning (SSL) has recently been a growing trend for improving a model's overall performance by leveraging abundant unlabeled data. Moreover, learning multiple tasks within the same model further improves model generalizability. To generate smoother and more accurate segmentation masks from 3D cardiac MR images, we present a Multi-task Cross-task learning consistency approach to enforce the correlation between the pixel-level (segmentation) and the geometric-level (distance map) tasks. Our extensive experimentation with varied quantities of labeled data in the training sets demonstrates the effectiveness of our model for the segmentation of the left atrial cavity from Gadolinium-enhanced magnetic resonance (GE-MR) images. With the incorporation of uncertainty estimates to detect failures in the segmentation masks generated by CNNs, our study further showcases the potential of our model to flag low-quality segmentation from a given model.
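    One plausible form of the geometric-level (distance map) target paired with the pixel-level segmentation task above is a signed distance map computed from the binary mask, sketched below; the paper's exact definition may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """Signed Euclidean distance map: positive inside the mask, negative outside."""
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(1 - mask)
    return inside - outside

mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:44, 20:44] = 1
print(signed_distance_map(mask).max())  # largest distance to the boundary, in pixels
```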
    A Comparative Study of Machine Learning Methods for Predicting the Evolution of Brain Connectivity from a Baseline Timepoint. (arXiv:2109.07739v1 [cs.LG])
    (2 min) Predicting the evolution of the brain network, also called connectome, by foreseeing changes in the connectivity weights linking pairs of anatomical regions makes it possible to spot connectivity-related neurological disorders in earlier stages and detect the development of potential connectomic anomalies. Remarkably, such a challenging prediction problem remains least explored in the predictive connectomics literature. It is a known fact that machine learning (ML) methods have proven their predictive abilities in a wide variety of computer vision problems. However, ML techniques specifically tailored for the prediction of brain connectivity evolution trajectory from a single timepoint are almost absent. To fill this gap, we organized a Kaggle competition where 20 competing teams designed advanced machine learning pipelines for predicting the brain connectivity evolution from a single timepoint. The competing teams developed their ML pipelines with a combination of data pre-processing, dimensionality reduction, and learning methods. Utilizing an inclusive evaluation approach, we ranked the methods based on two complementary evaluation metrics (mean absolute error (MAE) and Pearson Correlation Coefficient (PCC)) and their performances using different training and testing data perturbation strategies (single random split and cross-validation). The final rank was calculated using the rank product for each competing team across all evaluation measures and validation strategies. In support of open science, the developed 20 ML pipelines along with the connectomic dataset are made available on GitHub. The outcomes of this competition are anticipated to lead to the further development of predictive models that can foresee the evolution of brain connectivity over time, as well as other types of networks (e.g., genetic networks).
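    The final ranking rule described above (rank product across evaluation measures and validation strategies) can be sketched in a few lines; the scores below are toy values, not competition results.

```python
import numpy as np
from scipy.stats import rankdata

# Toy scores for four teams under two measures (lower MAE better, higher PCC better).
mae = np.array([0.12, 0.10, 0.15, 0.11])
pcc = np.array([0.70, 0.75, 0.62, 0.73])

ranks = np.vstack([rankdata(mae),        # rank 1 = smallest MAE
                   rankdata(-pcc)])      # rank 1 = largest PCC
rank_product = np.prod(ranks, axis=0) ** (1.0 / len(ranks))  # geometric mean of ranks
print(np.argsort(rank_product))          # team indices, best first
```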
    METEOR: A Massive Dense & Heterogeneous Behavior Dataset for Autonomous Driving. (arXiv:2109.07648v1 [cs.CV])
    (2 min) We present a new and complex traffic dataset, METEOR, which captures traffic patterns in unstructured scenarios in India. METEOR consists of more than 1000 one-minute video clips, over 2 million annotated frames with ego-vehicle trajectories, and more than 13 million bounding boxes for surrounding vehicles or traffic agents. METEOR is a unique dataset in terms of capturing the heterogeneity of microscopic and macroscopic traffic characteristics. Furthermore, we provide annotations for rare and interesting driving behaviors such as cut-ins, yielding, overtaking, overspeeding, zigzagging, sudden lane changing, running traffic signals, driving in the wrong lanes, taking wrong turns, lack of right-of-way rules at intersections, etc. We also present diverse traffic scenarios corresponding to rainy weather, nighttime driving, driving in rural areas with unmarked roads, and high-density traffic scenarios. We use our novel dataset to evaluate the performance of object detection and behavior prediction algorithms. We show that state-of-the-art object detectors fail in these challenging conditions and also propose a new benchmark test: action-behavior prediction with a baseline mAP score of 70.74.
    Predicting 3D shapes, masks, and properties of materials, liquids, and objects inside transparent containers, using the TransProteus CGI dataset. (arXiv:2109.07577v1 [cs.CV])
    (2 min) We present TransProteus, a dataset, and methods for predicting the 3D structure, masks, and properties of materials, liquids, and objects inside transparent vessels from a single image without prior knowledge of the image source and camera parameters. Manipulating materials in transparent containers is essential in many fields and depends heavily on vision. This work supplies a new procedurally generated dataset consisting of 50k images of liquids and solid objects inside transparent containers. The image annotations include 3D models, material properties (color/transparency/roughness...), and segmentation masks for the vessel and its content. The synthetic (CGI) part of the dataset was procedurally generated using 13k different objects, 500 different environments (HDRI), and 1450 material textures (PBR) combined with simulated liquids and procedurally generated vessels. In addition, we supply 104 real-world images of objects inside transparent vessels with depth maps of both the vessel and its content. We propose a camera agnostic method that predicts 3D models from an image as an XYZ map. This allows the trained net to predict the 3D model as a map with XYZ coordinates per pixel without prior knowledge of the image source. To calculate the training loss, we use the distance between pairs of points inside the 3D model instead of the absolute XYZ coordinates. This makes the loss function translation invariant. We use this to predict 3D models of vessels and their content from a single image. Finally, we demonstrate a net that uses a single image to predict the material properties of the vessel content and surface.
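    The translation-invariant training loss described above compares distances between pairs of predicted points with the same pairs in the ground truth; a minimal sketch follows, with the pair count and function name as assumptions.

```python
import numpy as np

def pairwise_distance_loss(pred_xyz, gt_xyz, num_pairs=1024, rng=None):
    """Mean absolute error between pairwise distances in predicted and GT XYZ maps.

    pred_xyz, gt_xyz: (N, 3) arrays of per-pixel 3D coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(0, len(pred_xyz), num_pairs)
    j = rng.integers(0, len(pred_xyz), num_pairs)
    d_pred = np.linalg.norm(pred_xyz[i] - pred_xyz[j], axis=-1)
    d_gt = np.linalg.norm(gt_xyz[i] - gt_xyz[j], axis=-1)
    return np.mean(np.abs(d_pred - d_gt))

pred = np.random.rand(4096, 3)
print(pairwise_distance_loss(pred, pred + 0.5))  # ~0: a pure translation costs nothing
```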
    Learning the Regularization in DCE-MR Image Reconstruction for Functional Imaging of Kidneys. (arXiv:2109.07548v1 [cs.LG])
    (2 min) Kidney DCE-MRI aims at both qualitative assessment of kidney anatomy and quantitative assessment of kidney function by estimating the tracer kinetic (TK) model parameters. Accurate estimation of TK model parameters requires an accurate measurement of the arterial input function (AIF) with high temporal resolution. Accelerated imaging is used to achieve high temporal resolution, which yields under-sampling artifacts in the reconstructed images. Compressed sensing (CS) methods offer a variety of reconstruction options. Most commonly, sparsity of temporal differences is encouraged for regularization to reduce artifacts. Increasing regularization in CS methods removes the ambient artifacts but also over-smooths the signal temporally, which reduces the parameter estimation accuracy. In this work, we propose a deep neural network trained on a single image to reduce MRI under-sampling artifacts without reducing the accuracy of functional imaging markers. Instead of regularizing with a penalty term in optimization, we promote regularization by generating images from a lower dimensional representation. In this manuscript we motivate and explain the lower dimensional input design. We compare our approach to CS reconstructions with multiple regularization weights. The proposed approach results in kidney biomarkers that are highly correlated with the ground truth markers estimated using the CS reconstruction which was optimized for functional analysis. At the same time, the proposed approach reduces the artifacts in the reconstructed images.
    Few-Shot Object Detection by Attending to Per-Sample-Prototype. (arXiv:2109.07734v1 [cs.CV])
    (2 min) Few-shot object detection aims to detect instances of specific categories in a query image with only a handful of support samples. Although this takes less effort than obtaining enough annotated images for supervised object detection, it results in a far inferior performance compared to the conventional object detection methods. In this paper, we propose a meta-learning-based approach that considers the unique characteristics of each support sample. Rather than simply averaging the information of the support samples to generate a single prototype per category, our method can better utilize the information of each support sample by treating each support sample as an individual prototype. Specifically, we introduce two types of attention mechanisms for aggregating the query and support feature maps. The first is to refine the information of few-shot samples by extracting shared information between the support samples through attention. Second, each support sample is used as a class code to leverage the information by comparing similarities between each support feature and query features. Our proposed method is complementary to the previous methods, making it easy to plug and play for further improvement. We have evaluated our method on PASCAL VOC and COCO benchmarks, and the results verify the effectiveness of our method. In particular, the advantages of our method are maximized when there is more diversity among support data.
    DeepMTS: Deep Multi-task Learning for Survival Prediction in Patients with Advanced Nasopharyngeal Carcinoma using Pretreatment PET/CT. (arXiv:2109.07711v1 [eess.IV])
    (2 min) Nasopharyngeal Carcinoma (NPC) is a worldwide malignant epithelial cancer. Survival prediction is a major concern for NPC patients, as it provides early prognostic information that is needed to guide treatments. Recently, deep learning, which leverages Deep Neural Networks (DNNs) to learn deep representations of image patterns, has been introduced to survival prediction in various cancers including NPC. It has been reported that image-derived end-to-end deep survival models have the potential to outperform clinical prognostic indicators and traditional radiomics-based survival models in prognostic performance. However, deep survival models, especially 3D models, require large image training data to avoid overfitting. Unfortunately, medical image data is usually scarce, especially for Positron Emission Tomography/Computed Tomography (PET/CT) due to the high cost of PET/CT scanning. Compared to Magnetic Resonance Imaging (MRI) or Computed Tomography (CT) providing only anatomical information of tumors, PET/CT that provides both anatomical (from CT) and metabolic (from PET) information is promising to achieve more accurate survival prediction. However, we have not identified any 3D end-to-end deep survival model that applies to small PET/CT data of NPC patients. In this study, we introduced the concept of multi-task learning into deep survival models to address the overfitting problem resulting from small data. Tumor segmentation was incorporated as an auxiliary task to enhance the model's efficiency of learning from scarce PET/CT data. Based on this idea, we proposed a 3D end-to-end Deep Multi-Task Survival model (DeepMTS) for joint survival prediction and tumor segmentation. Our DeepMTS can jointly learn survival prediction and tumor segmentation using PET/CT data of only 170 patients with advanced NPC.
    End-to-End Partially Observable Visual Navigation in a Diverse Environment. (arXiv:2109.07752v1 [cs.RO])
    (2 min) How can a robot navigate successfully in a rich and diverse environment, indoors or outdoors, along an office corridor or a trail in the park, on the flat ground, the staircase, or the elevator, etc.? To this end, this work aims at three challenges: (i) complex visual observations, (ii) partial observability of local sensing, and (iii) multimodal navigation behaviors that depend on both the local environment and the high-level goal. We propose a novel neural network (NN) architecture to represent a local controller and leverage the flexibility of the end-to-end approach to learn a powerful policy. To tackle complex visual observations, we extract multiscale spatial information through convolution layers. To deal with partial observability, we encode rich history information in LSTM-like modules. Importantly, we integrate the two into a single unified architecture that exploits convolutional memory cells to track the observation history at multiple spatial scales, which can capture the complex spatiotemporal dependencies between observations and controls. We additionally condition the network on the high-level goal in order to generate different navigation behavior modes. Specifically, we propose to use independent memory cells for different modes to prevent mode collapse in the learned policy. We implemented the NN controller on the SPOT robot and evaluate it on three challenging tasks with partial observations: adversarial pedestrian avoidance, blind-spot obstacle avoidance, and elevator riding. Our model significantly outperforms CNNs, conventional LSTMs, or the ablated versions of our model. A demo video will be publicly available, showing our SPOT robot traversing many different locations on our university campus.
    Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers. (arXiv:2109.07531v1 [cs.CV])
    (2 min) We propose to leverage Transformer architectures for non-autoregressive human motion prediction. Our approach decodes elements in parallel from a query sequence, instead of conditioning on previous predictions as in state-of-the-art RNN-based approaches. In this way, our approach is less computationally intensive and potentially avoids error accumulation in long-term elements of the sequence. In that context, our contributions are fourfold: (i) we frame human motion prediction as a sequence-to-sequence problem and propose a non-autoregressive Transformer to infer the sequences of poses in parallel; (ii) we propose to decode sequences of 3D poses from a query sequence generated in advance with elements from the input sequence; (iii) we propose to perform skeleton-based activity classification from the encoder memory, in the hope that identifying the activity can improve predictions; (iv) we show that despite its simplicity, our approach achieves competitive results on two public datasets, although surprisingly more so for short-term predictions than for long-term ones.
    Dense Semantic Contrast for Self-Supervised Visual Representation Learning. (arXiv:2109.07756v1 [cs.CV])
    (2 min) Self-supervised representation learning for visual pre-training has achieved remarkable success with sample (instance or pixel) discrimination and instance-level semantics discovery, whereas there still exists a non-negligible gap between pre-trained models and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representations; in other words, the pixels from the same object must belong to a shared semantic category, which is lacking in the previous methods. In this work, we present Dense Semantic Contrast (DSC) for modeling semantic category decision boundaries at a dense level to meet the requirement of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specifically, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. For inter-image relations, we enforce pixel representations from the same semantic class to be more similar than representations from different classes in one mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.
    Dynamic Fusion Network for RGBT Tracking. (arXiv:2109.07662v1 [cs.CV])
    (2 min) Since visible and infrared images each have their own advantages and disadvantages, RGBT tracking has attracted increasing attention. The key points of RGBT tracking lie in feature extraction and feature fusion of visible and infrared images. Current RGBT tracking methods mostly pay attention to both individual features (features extracted from images of a single camera) and common features (features extracted and fused from an RGB camera and a thermal camera), while paying less attention to the different and dynamic contributions of individual features and common features for different sequences of registered image pairs. This paper proposes a novel RGBT tracking method, called Dynamic Fusion Network (DFNet), which adopts a two-stream structure, in which two non-shared convolution kernels are employed in each layer to extract individual features. Besides, DFNet has shared convolution kernels for each layer to extract common features. Non-shared convolution kernels and shared convolution kernels are adaptively weighted and summed according to different image pairs, so that DFNet can deal with different contributions for different sequences. DFNet is fast, running at 28.658 FPS. The experimental results show that while DFNet increases the Mult-Adds by only 0.02% over the non-shared-convolution-kernel-based fusion method, Precision Rate (PR) and Success Rate (SR) reach 88.1% and 71.9%, respectively.
    Partner-Assisted Learning for Few-Shot Image Classification. (arXiv:2109.07607v1 [cs.CV])
    (2 min) Few-shot learning has been studied to mimic human visual capabilities and learn effective models without the need for exhaustive human annotation. Even though the idea of meta-learning for adaptation has dominated few-shot learning methods, how to train a feature extractor remains a challenge. In this paper, we focus on the design of a training strategy to obtain an elemental representation such that the prototype of each novel class can be estimated from a few labeled samples. We propose a two-stage training scheme, Partner-Assisted Learning (PAL), which first trains a partner encoder to model pair-wise similarities and extract features serving as soft anchors, and then trains a main encoder by aligning its outputs with the soft anchors while attempting to maximize classification performance. Two alignment constraints, at the logit level and the feature level, are designed individually. For each few-shot task, we perform prototype classification. Our method consistently outperforms the state-of-the-art method on four benchmarks. Detailed ablation studies of PAL are provided to justify the selection of each component involved in training.
    OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication. (arXiv:2109.07644v1 [cs.CV])
    (2 min) Employing Vehicle-to-Vehicle communication to enhance perception performance in self-driving technology has attracted considerable attention recently; however, the absence of a suitable open dataset for benchmarking algorithms has made it difficult to develop and assess cooperative perception technologies. To this end, we present the first large-scale open simulated dataset for Vehicle-to-Vehicle perception. It contains over 70 interesting scenes, 111,464 frames, and 232,913 annotated 3D vehicle bounding boxes, collected from 8 towns in CARLA and a digital town of Culver City, Los Angeles. We then construct a comprehensive benchmark with a total of 16 implemented models to evaluate several information fusion strategies (i.e. early, late, and intermediate fusion) with state-of-the-art LiDAR detection algorithms. Moreover, we propose a new Attentive Intermediate Fusion pipeline to aggregate information from multiple connected vehicles. Our experiments show that the proposed pipeline can be easily integrated with existing 3D LiDAR detectors and achieve outstanding performance even with large compression rates. To encourage more researchers to investigate Vehicle-to-Vehicle perception, we will release the dataset, benchmark methods, and all related codes in https://mobility-lab.seas.ucla.edu/opv2v/.
    Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs. (arXiv:2109.07710v1 [cs.LG])
    (2 min) Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies, achieving high accuracy on computer vision workloads such as image classification and object detection. However, training these models, which involve large numbers of parameters, is both time-consuming and energy-hungry. In this regard, several prior works have advocated for sparsity to speed up DL training and, even more so, the inference phase. This work begins with the observation that during training, sparsity in the forward and backward passes is correlated. In that context, we investigate two types of sparsity (input and output type) inherent in gradient descent-based optimization algorithms and propose a hardware micro-architecture to leverage them. Our experiments use five state-of-the-art CNN models on the ImageNet dataset and show backpropagation speedups in the range of 1.69$\times$ to 5.43$\times$ compared to the dense baseline execution. By exploiting sparsity in both the forward and backward passes, speedup improvements range from 1.68$\times$ to 3.30$\times$ over the sparsity-agnostic baseline execution. Our work also achieves a significant reduction in training iteration time over several previously proposed dense as well as sparse accelerator-based platforms, in addition to achieving order-of-magnitude energy efficiency improvements over GPU-based execution.
    A Pathology Deep Learning System Capable of Triage of Melanoma Specimens Utilizing Dermatopathologist Consensus as Ground Truth. (arXiv:2109.07554v1 [eess.IV])
    (2 min) Although melanoma occurs more rarely than several other skin cancers, patients' long-term survival rate is extremely low if the diagnosis is missed. Diagnosis is complicated by a high discordance rate among pathologists when distinguishing between melanoma and benign melanocytic lesions. A tool that allows pathology labs to sort and prioritize melanoma cases in their workflow could improve turnaround time by prioritizing challenging cases and routing them directly to the appropriate subspecialist. We present a pathology deep learning system (PDLS) that performs hierarchical classification of digitized whole slide image (WSI) specimens into six classes defined by their morphological characteristics, including classification of "Melanocytic Suspect" specimens likely representing melanoma or severe dysplastic nevi. We trained the system on 7,685 images from a single lab (the reference lab), including the largest set of triple-concordant melanocytic specimens compiled to date, and tested the system on 5,099 images from two distinct validation labs. We achieved Area Under the ROC Curve (AUC) values of 0.93 for classifying Melanocytic Suspect specimens on the reference lab, 0.95 on the first validation lab, and 0.82 on the second validation lab. We demonstrate that the PDLS is capable of automatically sorting and triaging skin specimens with high sensitivity to Melanocytic Suspect cases and that a pathologist would need only between 30% and 60% of the caseload to address all melanoma specimens.
    RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching. (arXiv:2109.07547v1 [cs.CV])
    (2 min) We introduce RAFT-Stereo, a new deep architecture for rectified stereo based on the optical flow network RAFT. We introduce multi-level convolutional GRUs, which more efficiently propagate information across the image. A modified version of RAFT-Stereo can perform accurate real-time inference. RAFT-Stereo ranks first on the Middlebury leaderboard, outperforming the next best method on 1px error by 29%, and outperforms all published work on the ETH3D two-view stereo benchmark. Code is available at https://github.com/princeton-vl/RAFT-Stereo.
    Federated Contrastive Learning for Decentralized Unlabeled Medical Images. (arXiv:2109.07504v1 [cs.LG])
    (2 min) A label-efficient paradigm in computer vision is based on self-supervised contrastive pre-training on unlabeled data followed by fine-tuning with a small number of labels. Making practical use of a federated computing environment in the clinical domain and learning on medical images poses specific challenges. In this work, we propose FedMoCo, a robust federated contrastive learning (FCL) framework, which makes efficient use of decentralized unlabeled medical data. FedMoCo has two novel modules: metadata transfer, an inter-node statistical data augmentation module, and self-adaptive aggregation, an aggregation module based on representational similarity analysis. To the best of our knowledge, this is the first FCL work on medical images. Our experiments show that FedMoCo can consistently outperform FedAvg, a seminal federated learning framework, in extracting meaningful representations for downstream tasks. We further show that FedMoCo can substantially reduce the amount of labeled data required in a downstream task, such as COVID-19 detection, to achieve a reasonable performance.
    Hybrid ICP. (arXiv:2109.07559v1 [cs.CV])
    (2 min) ICP algorithms typically involve a fixed choice of data association method and a fixed choice of error metric. In this paper, we propose Hybrid ICP, a novel and flexible ICP variant which dynamically optimises both the data association method and error metric based on the live image of an object and the current ICP estimate. We show that when used for object pose estimation, Hybrid ICP is more accurate and more robust to noise than other commonly used ICP variants. We also consider the setting where ICP is applied sequentially with a moving camera, and we study the trade-off between the accuracy of each ICP estimate and the number of ICP estimates available within a fixed amount of time.
    Learning to Aggregate and Refine Noisy Labels for Visual Sentiment Analysis. (arXiv:2109.07509v1 [cs.CV])
    (2 min) Visual sentiment analysis has received increasing attention in recent years. However, the quality of existing datasets is a concern because the sentiment labels are crowd-sourced, subjective, and prone to mistakes. This poses a severe threat to data-driven models, including deep neural networks, which would generalize poorly on test cases if trained to over-fit samples with noisy sentiment labels. Inspired by recent progress on learning with noisy labels, we propose a robust learning method for visual sentiment analysis. Our method relies on an external memory to aggregate and filter noisy labels during training and can thus prevent the model from overfitting the noisy cases. The memory is composed of prototypes with corresponding labels, both of which can be updated online. We establish a benchmark for visual sentiment analysis with label noise using publicly available datasets. The experimental results under the proposed benchmark settings comprehensively show the effectiveness of our method.
  • cs.IR updates on arXiv.org

    A Qualitative Evaluation of User Preference for Link-based vs. Text-based Recommendations of Wikipedia Articles. (arXiv:2109.07791v1 [cs.IR])
    (2 min) Literature recommendation systems (LRS) assist readers in the discovery of relevant content from the overwhelming amount of literature available. Despite the widespread adoption of LRS, there is a lack of research on the user-perceived recommendation characteristics for fundamentally different approaches to content-based literature recommendation. To complement existing quantitative studies on literature recommendation, we present qualitative study results that report on users' perceptions for two contrasting recommendation classes: (1) link-based recommendation represented by the Co-Citation Proximity (CPA) approach, and (2) text-based recommendation represented by Lucene's MoreLikeThis (MLT) algorithm. The empirical data analyzed in our study with twenty users and a diverse set of 40 Wikipedia articles indicate a noticeable difference between text- and link-based recommendation generation approaches along several key dimensions. The text-based MLT method receives higher satisfaction ratings in terms of user-perceived similarity of recommended articles. In contrast, the CPA approach receives higher satisfaction scores in terms of diversity and serendipity of recommendations. We conclude that users of literature recommendation systems can benefit most from hybrid approaches that combine both link- and text-based approaches, where the user's information needs and preferences should control the weighting for the approaches used. The optimal weighting of multiple approaches used in a hybrid recommendation system is highly dependent on a user's shifting needs.
    Zero-Shot Open Information Extraction using Question Generation and Reading Comprehension. (arXiv:2109.08079v1 [cs.IR])
    (2 min) Typically, Open Information Extraction (OpenIE) focuses on extracting triples, representing a subject, a relation, and the object of the relation. However, most existing techniques are based on a predefined set of relations in each domain, which limits their applicability to newer domains where these relations may be unknown, such as financial documents. This paper presents a zero-shot open information extraction technique that extracts the entities (values) and their descriptions (keys) from a sentence using an off-the-shelf machine reading comprehension (MRC) model. The input questions to this model are created using a novel noun phrase generation method. This method takes the context of the sentence into account and can create a wide variety of questions, making our technique domain-independent. Given the questions and the sentence, our technique uses the MRC model to extract the entities (values). The noun phrase corresponding to the question with the highest confidence is taken as the description (key). This paper also introduces the EDGAR10-Q dataset, which is based on publicly available financial documents from corporations listed with the US Securities and Exchange Commission (SEC). The dataset consists of paragraphs, tagged values (entities), and their keys (descriptions), and is one of the largest entity extraction datasets. It will be a valuable addition to the research community, especially in the financial domain. Finally, the paper demonstrates the efficacy of the proposed technique on the EDGAR10-Q and ADE corpus drug dosage datasets, where it obtained 86.84% and 97% accuracy, respectively.
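    A rough sketch of how such a pipeline could be assembled from off-the-shelf components, with spaCy noun chunks as candidate keys and a HuggingFace extractive-QA model standing in for the MRC model. The question template, model choice, and confidence threshold are illustrative assumptions, not the paper's exact method.

        # pip install spacy transformers; python -m spacy download en_core_web_sm
        import spacy
        from transformers import pipeline

        nlp = spacy.load("en_core_web_sm")
        mrc = pipeline("question-answering", model="deepset/roberta-base-squad2")

        def zero_shot_key_values(sentence, min_score=0.1):
            candidates = []
            for np_ in nlp(sentence).noun_chunks:            # candidate descriptions (keys)
                question = f"What is the {np_.text}?"        # assumed question template
                pred = mrc(question=question, context=sentence)
                if pred["score"] >= min_score:
                    candidates.append((np_.text, pred["answer"], pred["score"]))
            # keep only the highest-confidence noun phrase per extracted value
            best = {}
            for key, value, score in candidates:
                if value not in best or score > best[value][1]:
                    best[value] = (key, score)
            return [(key, value) for value, (key, _) in best.items()]

        print(zero_shot_key_values("Revenue for the quarter increased 12% to $4.3 billion."))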
    Evaluating Music Recommendations with Binary Feedback for Multiple Stakeholders. (arXiv:2109.07692v1 [cs.IR])
    (2 min) High-quality user feedback data is essential to training and evaluating a successful music recommendation system, particularly one that has to balance the needs of multiple stakeholders. Most existing music datasets suffer from noisy feedback and self-selection biases inherent in the data collected by music platforms. Using the Piki Music dataset of 500k ratings collected over a two-year period, we evaluate the performance of classic recommendation algorithms for three important stakeholders: consumers, well-known artists, and lesser-known artists. We show that a matrix factorization algorithm trained on both likes and dislikes performs significantly better than one trained only on likes for all three stakeholders.
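    A minimal sketch of matrix factorization trained on signed binary feedback (+1 like, -1 dislike) with a logistic loss; the loss, dimensions, and hyperparameters are assumptions for illustration, not the paper's training setup.

        import numpy as np

        rng = np.random.default_rng(0)

        def train_mf(interactions, n_users, n_items, k=16, lr=0.05, reg=0.01, epochs=20):
            """Factorize signed feedback: interactions are (user, item, y) with y in {+1, -1}."""
            U = 0.1 * rng.standard_normal((n_users, k))
            V = 0.1 * rng.standard_normal((n_items, k))
            for _ in range(epochs):
                for u, i, y in interactions:
                    s = U[u] @ V[i]
                    g = -y / (1.0 + np.exp(y * s))            # gradient of -log sigmoid(y * s)
                    U[u], V[i] = (U[u] - lr * (g * V[i] + reg * U[u]),
                                  V[i] - lr * (g * U[u] + reg * V[i]))
            return U, V

        data = [(0, 0, 1), (0, 1, -1), (1, 0, 1), (1, 2, -1)]  # toy likes/dislikes
        U, V = train_mf(data, n_users=2, n_items=3)
        print(U @ V.T)                                          # predicted preference scores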
    Phrase Retrieval Learns Passage Retrieval, Too. (arXiv:2109.08133v1 [cs.CL])
    (2 min) Dense retrieval methods have shown great promise over sparse retrieval methods in a range of NLP problems. Among them, dense phrase retrieval, which uses the most fine-grained retrieval unit, is appealing because phrases can be directly used as the output for question answering and slot filling tasks. In this work, we follow the intuition that retrieving phrases naturally entails retrieving larger text blocks and study whether phrase retrieval can serve as the basis for coarse-level retrieval, including passages and documents. We first observe that a dense phrase-retrieval system, without any retraining, already achieves better passage retrieval accuracy (+3-5% in top-5 accuracy) than passage retrievers, which also helps achieve superior end-to-end QA performance with fewer passages. Then, we provide an interpretation for why phrase-level supervision helps learn better fine-grained entailment compared to passage-level supervision, and also show that phrase retrieval can be improved to achieve competitive performance in document-retrieval tasks such as entity linking and knowledge-grounded dialogue. Finally, we demonstrate how phrase filtering and vector quantization can reduce the size of our index by 4-10x, making dense phrase retrieval a practical and versatile solution in multi-granularity retrieval.
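    A tiny sketch of the intuition that retrieving phrases entails retrieving passages: score each passage by its best-scoring retrieved phrase. The layout of the phrase hits is assumed for illustration.

        def passage_scores_from_phrases(phrase_hits):
            """phrase_hits: (passage_id, phrase, score) tuples from a phrase retriever.
            Each passage is scored by its best phrase, so no retraining is needed."""
            best = {}
            for passage_id, _, score in phrase_hits:
                best[passage_id] = max(score, best.get(passage_id, float("-inf")))
            return sorted(best.items(), key=lambda kv: -kv[1])

        hits = [("p1", "in 1687", 12.3), ("p2", "Isaac Newton", 10.9), ("p1", "Principia", 9.8)]
        print(passage_scores_from_phrases(hits))   # [('p1', 12.3), ('p2', 10.9)]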
    ST-PIL: Spatial-Temporal Periodic Interest Learning for Next Point-of-Interest Recommendation. (arXiv:2104.02262v2 [cs.IR] UPDATED)
    (2 min) Point-of-Interest (POI) recommendation is an important task in location-based social networks. It facilitates relation modeling between users and locations. Recently, researchers have recommended POIs based on long- and short-term interests with some success. However, these methods fail to capture periodic interest well. People tend to visit similar places at similar times or in similar areas. Existing models try to capture this kind of periodicity through a user's mobility status or time slot, which limits how well periodic interest is modeled. To this end, we propose to learn spatial-temporal periodic interest. Specifically, in the long-term module, we learn temporal periodic interest at daily granularity, then utilize intra-level attention to form long-term interest. In the short-term module, we construct various short-term sequences to acquire spatial-temporal periodic interest at hourly, areal, and hourly-areal granularities, respectively. Finally, we apply inter-level attention to automatically integrate multiple interests. Experiments on two real-world datasets demonstrate the state-of-the-art performance of our method.
    On-the-Fly Ensemble Pruning in Evolving Data Streams. (arXiv:2109.07611v1 [cs.LG])
    (2 min) Ensemble pruning is the process of selecting a subset of component classifiers from an ensemble which performs at least as well as the original ensemble while reducing storage and computational costs. Ensemble pruning in data streams is a largely unexplored area of research. It requires analysis of ensemble components as they are running on the stream, and differentiation of useful classifiers from redundant ones. We present CCRP, an on-the-fly ensemble pruning method for multi-class data stream classification empowered by an imbalance-aware fusion of class-wise component rankings. CCRP aims for the resulting pruned ensemble to contain the best-performing classifier for each target class and hence reduces the effects of class imbalance. The conducted experiments on real-world and synthetic data streams demonstrate that different types of ensembles that integrate CCRP as their pruning scheme consistently yield on par or superior performance with 20% to 90% less average memory consumption. Lastly, we validate the proposed pruning scheme by comparing our approach against pruning schemes based on ensemble weights and basic rank fusion methods.
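    An illustrative sketch of the per-class selection idea: keep the best-performing member for every class (so minority classes stay covered), and fill any extra slots by an imbalance-weighted overall score. CCRP's actual rank-fusion rule is more involved; the weighting below is an assumption.

        import numpy as np

        def ccrp_style_prune(class_recalls, class_freqs, extra=1):
            """class_recalls: (n_members, n_classes) per-class scores of ensemble members.
            class_freqs: observed class frequencies on the stream.
            Returns indices of the members to keep."""
            recalls = np.asarray(class_recalls, dtype=float)
            weights = 1.0 / (np.asarray(class_freqs, dtype=float) + 1e-9)
            weights /= weights.sum()                           # rarer classes weigh more
            keep = {int(np.argmax(recalls[:, c])) for c in range(recalls.shape[1])}
            for idx in map(int, np.argsort(-(recalls @ weights))):
                if extra <= 0:
                    break
                if idx not in keep:
                    keep.add(idx)
                    extra -= 1
            return sorted(keep)

        recalls = [[0.90, 0.20, 0.50],
                   [0.60, 0.80, 0.40],
                   [0.70, 0.30, 0.90],
                   [0.65, 0.60, 0.55]]
        print(ccrp_style_prune(recalls, class_freqs=[0.7, 0.2, 0.1]))   # [0, 1, 2, 3]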
    Quoka Atlas of Scholarly Knowledge Production: An Interactive Sensemaking Tool for Exploring the Outputs of Research Institutions. (arXiv:2109.07689v1 [cs.DL])
    (2 min) The vast amount of research produced at institutions world-wide is extremely diverse, and coarse-grained quantitative measures of impact often obscure the individual contributions of these institutions to specific research fields and topics. We show that by applying an information retrieval model to index research articles which are faceted by institution and time, we can develop tools to rank institutions given a keyword query. We present an interactive atlas, Quoka, designed to enable a user to explore these rankings contextually by geography and over time. Through a set of use cases we demonstrate that the atlas can be used to perform sensemaking tasks to learn and collect information about the relationships between institutions and scholarly knowledge production.
    FOMO: Topics versus documents in legal eDiscovery. (arXiv:2109.08059v1 [cs.IR])
    (2 min) In the United States, the parties to a lawsuit are required to search through their electronically stored information to find documents that are relevant to the specific case and produce them to their opposing party. Negotiations over the scope of these searches often reflect a fear that something will be missed (Fear of Missing Out: FOMO). A Recall level of 80%, for example, means that 20% of the relevant documents will be left unproduced. This paper makes the argument that eDiscovery is the process of identifying responsive information, not identifying documents. Documents are the carriers of the information; they are not the direct targets of the process. A given document may contain one or more topics or factoids, and a factoid may appear in more than one document. The coupon collector's problem, Heaps' law, and other analyses provide ways to model the problem of finding information from among documents. In eDiscovery, however, the parties do not know how many factoids there might be in a collection or their probabilities. This paper describes a simple model that estimates the confidence that a fact will be omitted from the produced set (the identified set) while being contained in the missed set. Two data sets are then analyzed: a small set involving microaggressions and a larger set involving classification of web pages. Both show that it is possible to discover at least one example of each available topic within a relatively small number of documents, meaning that further effort will not return additional novel information. The smaller data set is also used to investigate whether the non-random order of searching for responsive documents commonly used in eDiscovery (called continuous active learning) affects the distribution of topics; it does not.
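    The omission-confidence idea can be illustrated with a simple independence model (the paper's estimator is richer): if a factoid appears in each relevant document independently with probability p and n relevant documents are reviewed, the chance it is never seen is (1 - p)^n.

        def omission_confidence(p_factoid_per_doc, n_docs_reviewed):
            """Probability that a factoid is absent from all n reviewed documents under a
            simple independence assumption (illustrative only)."""
            return (1.0 - p_factoid_per_doc) ** n_docs_reviewed

        # A factoid that occurs in 5% of relevant documents is almost surely seen
        # after a few hundred reviewed documents:
        for n in (20, 100, 300):
            print(n, round(omission_confidence(0.05, n), 4))    # ~0.3585, ~0.0059, ~0.0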
    Popularity Bias Is Not Always Evil: Disentangling Benign and Harmful Bias for Recommendation. (arXiv:2109.07946v1 [cs.IR])
    (2 min) Recommender systems usually suffer from severe popularity bias: the collected interaction data usually exhibits a quite imbalanced or even long-tailed distribution over items. Such a skewed distribution may result from users' conformity to the group, which deviates from users' true preferences. Existing efforts for tackling this issue mainly focus on completely eliminating popularity bias. However, we argue that not all popularity bias is evil. Popularity bias results not only from conformity but also from item quality, which is usually ignored by existing methods. Some items exhibit higher popularity because they have intrinsically better properties. Blindly removing popularity bias would lose this important signal and further deteriorate model performance. To sufficiently exploit this information for recommendation, it is essential to disentangle the benign popularity bias caused by item quality from the harmful popularity bias caused by conformity. Although important, this is quite challenging, as we lack an explicit signal to differentiate the two factors of popularity bias. In this paper, we propose to leverage temporal information, as the two factors exhibit quite different patterns over time: item quality, revealing an item's inherent property, is stable and static, while conformity, which depends on an item's recent clicks, is highly time-sensitive. Correspondingly, we propose a novel Time-aware DisEntangled framework (TIDE), where a click is generated from three components: the static item quality, the dynamic conformity effect, and the user-item matching score returned by any recommendation model. Lastly, we conduct interventional inference such that the recommendation can benefit from the benign popularity bias while circumventing the harmful one. Extensive experiments on three real-world datasets demonstrate the effectiveness of TIDE.
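    A hedged sketch of the three-component click score (static item quality, time-sensitive conformity, user-item matching); the additive form and the exponential decay used for conformity are assumptions for illustration, not TIDE's exact formulation.

        import numpy as np

        def tide_style_score(match_score, item_quality, recent_click_times, t_now, tau=24.0):
            """Illustrative decomposition of a click score: quality is static, conformity
            depends on recent clicks (exponentially decayed, tau in hours), and the
            matching score comes from any base recommendation model."""
            decay = np.exp(-(t_now - np.asarray(recent_click_times)) / tau)
            conformity = float(decay.sum())
            return item_quality + conformity + match_score

        print(tide_style_score(match_score=1.2, item_quality=0.8,
                               recent_click_times=[90.0, 95.0, 99.0], t_now=100.0))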
  • cs.LG updates on arXiv.org

    Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. (arXiv:2104.08678v2 [cs.CL] UPDATED)
    (2 min) Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive, which limits the scale of the collected data. In this work, we are the first to use synthetic adversarial data generation to make question answering models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, and then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA dataset by 3.7 F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
    Verifying Quantized Neural Networks using SMT-Based Model Checking. (arXiv:2106.05997v2 [cs.LG] UPDATED)
    (3 min) Artificial Neural Networks (ANNs) are being deployed for an increasing number of safety-critical applications, including autonomous cars and medical diagnosis. However, concerns about their reliability have been raised due to their black-box nature and apparent fragility to adversarial attacks. These concerns are amplified when ANNs are deployed on restricted systems, which limit the precision of mathematical operations and thus introduce additional quantization errors. Here, we develop and evaluate a novel symbolic verification framework using software model checking (SMC) and satisfiability modulo theories (SMT) to check for vulnerabilities in ANNs. More specifically, we propose several ANN-related optimizations for SMC, including invariant inference via interval analysis, slicing, expression simplifications, and discretization of non-linear activation functions. With this verification framework, we can provide formal guarantees on the safe behavior of ANNs implemented both in floating- and fixed-point arithmetic. In this regard, our verification approach was able to verify and produce adversarial examples for $52$ test cases spanning image classification and general machine learning applications. Furthermore, for small- to medium-sized ANNs, our approach completes most of its verification runs in minutes. Moreover, in contrast to most state-of-the-art methods, our approach is not restricted to specific choices regarding activation functions and non-quantized representations. Our experiments show that our approach can analyze larger ANN implementations and substantially reduce the verification time compared to state-of-the-art techniques that use SMT solving.
    OMPQ: Orthogonal Mixed Precision Quantization. (arXiv:2109.07865v1 [cs.LG])
    (2 min) To bridge the ever-increasing gap between deep neural networks' complexity and hardware capability, network quantization has attracted more and more research attention. The latest trend of mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization. However, this also results in a difficult integer programming formulation, and forces most existing approaches to use an extremely time-consuming search process even with various relaxations. Instead of solving the original integer program, we propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming but also easy to optimize with linear programming. This approach reduces the search time and the required amount of data by orders of magnitude, with little compromise on quantization accuracy. Specifically, on post-training quantization, we achieve 71.27% Top-1 accuracy on MobileNetV2, which only takes 9 seconds for searching and 1.4 GPU hours for finetuning on ImageNet. Our code is available at https://github.com/MAC-AutoML/OMPQ.
    Scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks. (arXiv:2003.08414v3 [cs.LG] UPDATED)
    (2 min) Graph convolutional networks (GCNs) have shown promising results in processing graph data by extracting structure-aware features. This gave rise to extensive work in geometric deep learning, focusing on designing network architectures that ensure neuron activations conform to regularity patterns within the input graph. However, in most cases the graph structure is only accounted for by considering the similarity of activations between adjacent nodes, which limits the capabilities of such methods to discriminate between nodes in a graph. Here, we propose to augment conventional GCNs with geometric scattering transforms and residual convolutions. The former enables band-pass filtering of graph signals, thus alleviating the so-called oversmoothing often encountered in GCNs, while the latter is introduced to clear the resulting features of high-frequency noise. We establish the advantages of the presented Scattering GCN with both theoretical results establishing the complementary benefits of scattering and GCN features, as well as experimental results showing the benefits of our method compared to leading graph neural networks for semi-supervised node classification, including the recently proposed GAT network that typically alleviates oversmoothing using graph attention mechanisms.
    SAFRAN: An interpretable, rule-based link prediction method outperforming embedding models. (arXiv:2109.08002v1 [cs.AI])
    (2 min) Neural embedding-based machine learning models have shown promise for predicting novel links in knowledge graphs. Unfortunately, their practical utility is diminished by their lack of interpretability. Recently, the fully interpretable, rule-based algorithm AnyBURL yielded highly competitive results on many general-purpose link prediction benchmarks. However, current approaches for aggregating predictions made by multiple rules are affected by redundancies. We improve upon AnyBURL by introducing the SAFRAN rule application framework, which uses a novel aggregation approach called Non-redundant Noisy-OR that detects and clusters redundant rules prior to aggregation. SAFRAN yields new state-of-the-art results for fully interpretable link prediction on the established general-purpose benchmarks FB15K-237, WN18RR and YAGO3-10. Furthermore, it exceeds the results of multiple established embedding-based algorithms on FB15K-237 and WN18RR and narrows the gap between rule-based and embedding-based algorithms on YAGO3-10.
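    A small sketch of the aggregation step: rules judged redundant are grouped, only the strongest rule per group contributes, and the group scores are combined with noisy-OR. How SAFRAN detects and clusters redundant rules is not shown here; the clusters below are given by hand.

        def noisy_or(confidences):
            """Standard noisy-OR aggregation of rule confidences treated as independent."""
            p_none = 1.0
            for c in confidences:
                p_none *= (1.0 - c)
            return 1.0 - p_none

        def non_redundant_noisy_or(rule_confidences, clusters):
            """Take the strongest rule within each redundancy cluster, then noisy-OR the
            cluster scores (illustrative sketch of the aggregation idea)."""
            per_cluster = [max(rule_confidences[r] for r in cluster) for cluster in clusters]
            return noisy_or(per_cluster)

        confs = {"r1": 0.80, "r2": 0.78, "r3": 0.40}
        print(noisy_or(confs.values()))                                # ~0.97 (r1, r2 double-counted)
        print(non_redundant_noisy_or(confs, [["r1", "r2"], ["r3"]]))   # ~0.88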
    Learning Security Classifiers with Verified Global Robustness Properties. (arXiv:2105.11363v2 [cs.CR] UPDATED)
    (2 min) Many recent works have proposed methods to train classifiers with local robustness properties, which can provably eliminate classes of evasion attacks for most inputs, but not all inputs. Since data distribution shift is very common in security applications, e.g., often observed for malware detection, local robustness cannot guarantee that the property holds for unseen inputs at the time of deploying the classifier. Therefore, it is more desirable to enforce global robustness properties that hold for all inputs, which is strictly stronger than local robustness. In this paper, we present a framework and tools for training classifiers that satisfy global robustness properties. We define new notions of global robustness that are more suitable for security classifiers. We design a novel booster-fixer training framework to enforce global robustness properties. We structure our classifier as an ensemble of logic rules and design a new verifier to verify the properties. In our training algorithm, the booster increases the classifier's capacity, and the fixer enforces verified global robustness properties following counterexample guided inductive synthesis. We show that we can train classifiers to satisfy different global robustness properties for three security datasets, and even multiple properties at the same time, with modest impact on the classifier's performance. For example, we train a Twitter spam account classifier to satisfy five global robustness properties, with 5.4% decrease in true positive rate, and 0.1% increase in false positive rate, compared to a baseline XGBoost model that doesn't satisfy any property.
    NuSPAN: A Proximal Average Network for Nonuniform Sparse Model -- Application to Seismic Reflectivity Inversion. (arXiv:2105.00003v2 [physics.geo-ph] UPDATED)
    (2 min) We solve the problem of sparse signal deconvolution in the context of seismic reflectivity inversion, which pertains to high-resolution recovery of the subsurface reflection coefficients. Our formulation employs a nonuniform, non-convex synthesis sparse model comprising a combination of convex and non-convex regularizers, which results in accurate approximations of the l0 pseudo-norm. The resulting iterative algorithm requires the proximal average strategy. When unfolded, the iterations give rise to a learnable proximal average network architecture that can be optimized in a data-driven fashion. We demonstrate the efficacy of the proposed approach through numerical experiments on synthetic 1-D seismic traces and 2-D wedge models in comparison with the benchmark techniques. We also present validations considering the simulated Marmousi2 model as well as real 3-D seismic volume data acquired from the Penobscot 3D survey off the coast of Nova Scotia, Canada.
    Disentangling Transfer and Interference in Multi-Domain Learning. (arXiv:2107.05445v3 [cs.CV] UPDATED)
    (2 min) Humans are incredibly good at transferring knowledge from one domain to another, enabling rapid learning of new tasks. Likewise, transfer learning has enabled enormous success in many computer vision problems using pretraining. However, the benefits of transfer in multi-domain learning, where a network learns multiple tasks defined by different datasets, have not been adequately studied. Learning multiple domains could be beneficial, or these domains could interfere with each other given limited network capacity. In this work, we decipher the conditions where interference and knowledge transfer occur in multi-domain learning. We propose new metrics disentangling interference and transfer and set up experimental protocols. We further examine the roles of network capacity, task grouping, and dynamic loss weighting in reducing interference and facilitating transfer. We demonstrate our findings on the CIFAR-100, MiniPlaces, and Tiny-ImageNet datasets.
    Urdu text in natural scene images: a new dataset and preliminary text detection. (arXiv:2109.08060v1 [cs.CV])
    (2 min) Text detection in natural scene images for content analysis is an interesting task. The research community has seen some great developments for English/Mandarin text detection. However, Urdu text extraction in natural scene images is a task that has not been well addressed. In this work, firstly, a new dataset is introduced for Urdu text in natural scene images. The dataset comprises 500 standalone images acquired from real scenes. Secondly, the channel-enhanced Maximally Stable Extremal Region (MSER) method is applied to extract Urdu text regions as candidates in an image. A two-stage filtering mechanism is applied to eliminate non-candidate regions. In the first stage, text and noise are classified based on their geometric properties. In the second stage, a support vector machine classifier is trained to discard non-text candidate regions. After this, text candidate regions are linked using centroid-based vertical and horizontal distances. Text lines are further analyzed by a different classifier based on HOG features to remove non-text regions. Extensive experimentation is performed on the locally developed dataset to evaluate the performance. The experimental results show good performance on the test set images. The dataset will be made available for research use. To the best of our knowledge, this work is the first of its kind for the Urdu language; it provides a good dataset for free research use and serves as a baseline for the task of Urdu text extraction.
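    A sketch of the first (candidate-extraction) stage with OpenCV's MSER plus a crude geometric filter; the thresholds and the image path are placeholders, and the channel enhancement, SVM stage, and text-line grouping described above are omitted.

        import cv2

        def text_candidate_boxes(image_bgr, min_area=50, max_aspect=8.0):
            """Return bounding boxes of MSER regions that pass a simple geometric filter."""
            gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
            mser = cv2.MSER_create()
            regions, _ = mser.detectRegions(gray)
            boxes = []
            for pts in regions:
                x, y, w, h = cv2.boundingRect(pts.reshape(-1, 1, 2))
                aspect = max(w, h) / max(1, min(w, h))
                if w * h >= min_area and aspect <= max_aspect:   # first-stage geometric filter
                    boxes.append((x, y, w, h))
            return boxes

        img = cv2.imread("scene.jpg")                            # hypothetical input image
        if img is not None:
            print(len(text_candidate_boxes(img)), "candidate regions")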
    Zero-Shot Open Information Extraction using Question Generation and Reading Comprehension. (arXiv:2109.08079v1 [cs.IR])
    (2 min) Typically, Open Information Extraction (OpenIE) focuses on extracting triples, representing a subject, a relation, and the object of the relation. However, most existing techniques are based on a predefined set of relations in each domain, which limits their applicability to newer domains where these relations may be unknown, such as financial documents. This paper presents a zero-shot open information extraction technique that extracts the entities (values) and their descriptions (keys) from a sentence using an off-the-shelf machine reading comprehension (MRC) model. The input questions to this model are created using a novel noun phrase generation method. This method takes the context of the sentence into account and can create a wide variety of questions, making our technique domain-independent. Given the questions and the sentence, our technique uses the MRC model to extract the entities (values). The noun phrase corresponding to the question with the highest confidence is taken as the description (key). This paper also introduces the EDGAR10-Q dataset, which is based on publicly available financial documents from corporations listed with the US Securities and Exchange Commission (SEC). The dataset consists of paragraphs, tagged values (entities), and their keys (descriptions), and is one of the largest entity extraction datasets. It will be a valuable addition to the research community, especially in the financial domain. Finally, the paper demonstrates the efficacy of the proposed technique on the EDGAR10-Q and ADE corpus drug dosage datasets, where it obtained 86.84% and 97% accuracy, respectively.
    Discrete representations in neural models of spoken language. (arXiv:2105.05582v2 [cs.CL] UPDATED)
    (2 min) The distributed and continuous representations used by neural networks are at odds with representations employed in linguistics, which are typically symbolic. Vector quantization has been proposed as a way to induce discrete neural representations that are closer in nature to their linguistic counterparts. However, it is not clear which metrics are best suited to analyzing such discrete representations. We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language. We compare the results they show when applied to two different models, while systematically studying the effect of the placement and size of the discretization layer. We find that different evaluation regimes can give inconsistent results. While we can attribute them to the properties of the different metrics in most cases, one point of concern remains: the use of minimal pairs of phoneme triples as stimuli disadvantages larger discrete unit inventories, unlike metrics applied to complete utterances. Furthermore, while in general vector quantization induces representations that correlate with units posited in linguistics, the strength of this correlation is only moderate.
    A Machine Learning Framework for Automatic Prediction of Human Semen Motility. (arXiv:2109.08049v1 [cs.LG])
    (2 min) In the field of reproductive health, a vital aspect for the detection of male fertility issues is the analysis of human semen quality. Two factors of importance are the morphology and motility of the sperm cells. While the former describes defects in different parts of a spermatozoon, the latter measures the efficient movement of cells. For many non-human species, so-called Computer-Aided Sperm Analysis systems work well for assessing these characteristics from microscopic video recordings but struggle with human sperm samples which generally show higher degrees of debris and dead spermatozoa, as well as lower overall sperm motility. Here, machine learning methods that harness large amounts of training data to extract salient features could support physicians with the detection of fertility issues or in vitro fertilisation procedures. In this work, the overall motility of given sperm samples is predicted with the help of a machine learning framework integrating unsupervised methods for feature extraction with downstream regression models. The models evaluated herein improve on the state-of-the-art for video-based sperm-motility prediction.
    AMI-FML: A Privacy-Preserving Federated Machine Learning Framework for AMI. (arXiv:2109.05666v1 [cs.LG] CROSS LISTED)
    (2 min) Machine learning (ML) based smart meter data analytics is very promising for energy management and demand-response applications in the advanced metering infrastructure (AMI). A key challenge in developing distributed ML applications for AMI is to preserve user privacy while allowing active end-user participation. This paper addresses this challenge and proposes a privacy-preserving federated learning framework for ML applications in the AMI. We consider each smart meter as a federated edge device hosting an ML application that exchanges information with a central aggregator or a data concentrator, periodically. Instead of transferring the raw data sensed by the smart meters, the ML model weights are transferred to the aggregator to preserve privacy. The aggregator processes these parameters to devise a robust ML model that can be substituted at each edge device. We also discuss strategies to enhance privacy and improve communication efficiency while sharing the ML model parameters, suited for relatively slow network connections in the AMI. We demonstrate the proposed framework on a federated ML (FML) use case that improves short-term load forecasting (STLF). We use a long short-term memory (LSTM) recurrent neural network (RNN) model for STLF. In our architecture, we assume that there is an aggregator connected to a group of smart meters. The aggregator uses the learned model gradients received from the federated smart meters to generate an aggregate, robust RNN model which improves the forecasting accuracy for individual and aggregated STLF. Our results indicate that with FML, forecasting accuracy is increased while preserving the data privacy of the end-users.
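    A minimal PyTorch sketch of the core exchange: each meter trains a small LSTM forecaster locally, and the aggregator averages the received model parameters (FedAvg-style) instead of collecting raw readings. The model size, the plain averaging rule, and the omission of the privacy and communication-efficiency mechanisms are simplifying assumptions.

        import copy
        import torch
        import torch.nn as nn

        class STLFModel(nn.Module):
            """Tiny LSTM forecaster standing in for the per-meter STLF model."""
            def __init__(self, hidden=32):
                super().__init__()
                self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):                     # x: (B, T, 1) past load readings
                out, _ = self.lstm(x)
                return self.head(out[:, -1])          # next-step load forecast

        def federated_average(local_models):
            """Aggregator step: average the parameters sent by the smart meters."""
            global_state = copy.deepcopy(local_models[0].state_dict())
            for key in global_state:
                global_state[key] = torch.stack(
                    [m.state_dict()[key].float() for m in local_models]).mean(dim=0)
            merged = STLFModel()
            merged.load_state_dict(global_state)
            return merged

        meters = [STLFModel() for _ in range(3)]      # pretend each was trained locally
        global_model = federated_average(meters)
        print(global_model(torch.randn(4, 24, 1)).shape)   # torch.Size([4, 1])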
    Associative Memories via Predictive Coding. (arXiv:2109.08063v1 [cs.LG])
    (2 min) Associative memories in the brain receive and store patterns of activity registered by the sensory neurons, and are able to retrieve them when necessary. Due to their importance in human intelligence, computational models of associative memories have been developed for several decades now. They include autoassociative memories, which allow for storing data points and retrieving a stored data point $s$ when provided with a noisy or partial variant of $s$, and heteroassociative memories, able to store and recall multi-modal data. In this paper, we present a novel neural model for realizing associative memories, based on a hierarchical generative network that receives external stimuli via sensory neurons. This model is trained using predictive coding, an error-based learning algorithm inspired by information processing in the cortex. To test the capabilities of this model, we perform multiple retrieval experiments from both corrupted and incomplete data points. In an extensive comparison, we show that this new model outperforms in retrieval accuracy and robustness popular associative memory models, such as autoencoders trained via backpropagation, and modern Hopfield networks. In particular, in completing partial data points, our model achieves remarkable results on natural image datasets, such as ImageNet, with a surprisingly high accuracy, even when only a tiny fraction of pixels of the original images is presented. Furthermore, we show that this method is able to handle multi-modal data, retrieving images from descriptions, and vice versa. We conclude by discussing the possible impact of this work in the neuroscience community, by showing that our model provides a plausible framework to study learning and retrieval of memories in the brain, as it closely mimics the behavior of the hippocampus as a memory index and generative model.
    Reverse Differentiation via Predictive Coding. (arXiv:2103.04689v2 [cs.LG] UPDATED)
    (2 min) Deep learning has redefined the field of artificial intelligence (AI) thanks to the rise of artificial neural networks, which are architectures inspired by their neurological counterpart in the brain. Through the years, this dualism between AI and neuroscience has brought immense benefits to both fields, allowing neural networks to be used in dozens of applications. These networks use an efficient implementation of reverse differentiation, called backpropagation (BP). This algorithm, however, is often criticized for its biological implausibility (e.g., lack of local update rules for the parameters). Therefore, biologically plausible learning methods that rely on predictive coding (PC), a framework for describing information processing in the brain, are increasingly studied. Recent works prove that these methods can approximate BP up to a certain margin on multilayer perceptrons (MLPs), and asymptotically on any other complex model, and that zero-divergence inference learning (Z-IL), a variant of PC, is able to exactly implement BP on MLPs. However, the recent literature shows also that there is no biologically plausible method yet that can exactly replicate the weight update of BP on complex models. To fill this gap, in this paper, we generalize (PC and) Z-IL by directly defining them on computational graphs, and show that it can perform exact reverse differentiation. What results is the first biologically plausible algorithm that is equivalent to BP in the way of updating parameters on any neural network, providing a bridge between the interdisciplinary research of neuroscience and deep learning.
    Efficient Scaling of Dynamic Graph Neural Networks. (arXiv:2109.07893v1 [cs.DC])
    (2 min) We present distributed algorithms for training dynamic Graph Neural Networks (GNN) on large-scale graphs spanning multi-node, multi-GPU systems. To the best of our knowledge, this is the first scaling study on dynamic GNNs. We devise mechanisms for reducing the GPU memory usage and identify two execution time bottlenecks: CPU-GPU data transfer and communication volume. Exploiting properties of dynamic graphs, we design a graph difference-based strategy to significantly reduce the transfer time. We develop a simple but effective data distribution technique under which the communication volume remains fixed and linear in the input size, for any number of GPUs. Our experiments using billion-size graphs on a system of 128 GPUs show that: (i) the distribution scheme achieves up to 30x speedup on 128 GPUs; (ii) the graph-difference technique reduces the transfer time by a factor of up to 4.1x and the overall execution time by up to 40%.
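    A toy sketch of the graph-difference strategy: between consecutive snapshots, only the added and removed edges are shipped to the GPU rather than the full edge list. The plain edge-list representation here is a simplification of the system's packed GPU tensors.

        import numpy as np

        def edge_delta(prev_edges, curr_edges):
            """Return (added, removed) edges between two snapshots as (k, 2) int arrays."""
            prev, curr = set(map(tuple, prev_edges)), set(map(tuple, curr_edges))
            added = np.array(sorted(curr - prev), dtype=np.int64).reshape(-1, 2)
            removed = np.array(sorted(prev - curr), dtype=np.int64).reshape(-1, 2)
            return added, removed

        snap_t0 = [(0, 1), (1, 2), (2, 3)]
        snap_t1 = [(0, 1), (2, 3), (3, 4), (4, 0)]
        added, removed = edge_delta(snap_t0, snap_t1)
        print("transfer only", len(added) + len(removed), "edges instead of", len(snap_t1))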
    Enabling Visual Action Planning for Object Manipulation through Latent Space Roadmap. (arXiv:2103.02554v2 [cs.RO] UPDATED)
    (2 min) We present a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces, focusing on manipulation of deformable objects. We propose a Latent Space Roadmap (LSR) for task planning which is a graph-based structure globally capturing the system dynamics in a low-dimensional latent space. Our framework consists of three parts: (1) a Mapping Module (MM) that maps observations given in the form of images into a structured latent space extracting the respective states as well as generates observations from the latent states, (2) the LSR which builds and connects clusters containing similar states in order to find the latent plans between start and goal states extracted by MM, and (3) the Action Proposal Module that complements the latent plan found by the LSR with the corresponding actions. We present a thorough investigation of our framework on simulated box stacking and rope/box manipulation tasks, and a folding task executed on a real robot.
    Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN. (arXiv:2107.01366v2 [cs.CL] UPDATED)
    (2 min) Despite their practical success, modern seq2seq architectures are unable to generalize systematically on several SCAN tasks. Hence, it is not clear if SCAN-style compositional generalization is useful in realistic NLP tasks. In this work, we study the benefit that such compositionality brings about to several machine translation tasks. We present several focused modifications of Transformer that greatly improve generalization capabilities on SCAN and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance in low-resource settings and on a newly introduced distribution-shifted English-French translation task. Overall, we find that improvements of a SCAN-capable model do not directly transfer to the resource-rich MT setup. In contrast, in the low-resource setup, general modifications lead to an improvement of up to 13.1% BLEU score w.r.t. a vanilla Transformer. Similarly, an improvement of 14% in an accuracy-based metric is achieved in the introduced compositional English-French translation task. This provides experimental evidence that the compositional generalization assessed in SCAN is particularly useful in resource-starved and domain-shifted scenarios.
    Disaggregated Interventions to Reduce Inequality. (arXiv:2107.00593v2 [cs.LG] UPDATED)
    (2 min) A significant body of research in the data sciences considers unfair discrimination against social categories such as race or gender that could occur or be amplified as a result of algorithmic decisions. Simultaneously, real-world disparities continue to exist, even before algorithmic decisions are made. In this work, we draw on insights from the social sciences brought into the realm of causal modeling and constrained optimization, and develop a novel algorithmic framework for tackling pre-existing real-world disparities. The purpose of our framework, which we call the "impact remediation framework," is to measure real-world disparities and discover the optimal intervention policies that could help improve equity or access to opportunity for those who are underserved with respect to an outcome of interest. We develop a disaggregated approach to tackling pre-existing disparities that relaxes the typical set of assumptions required for the use of social categories in structural causal models. Our approach flexibly incorporates counterfactuals and is compatible with various ontological assumptions about the nature of social categories. We demonstrate impact remediation with a hypothetical case study and compare our disaggregated approach to an existing state-of-the-art approach, comparing its structure and resulting policy recommendations. In contrast to most work on optimal policy learning, we explore disparity reduction itself as an objective, explicitly focusing the power of algorithms on reducing inequality.
    On the inductive biases of deep domain adaptation. (arXiv:2109.07920v1 [cs.LG])
    (2 min) Domain alignment is currently the most prevalent solution to unsupervised domain-adaptation tasks and is often presented as a minimizer of theoretical upper bounds on risk in the target domain. However, further work has revealed severe inadequacies between theory and practice: we consolidate this analysis and confirm that imposing domain invariance on features is neither necessary nor sufficient to obtain low target risk. We instead argue that successful deep domain adaptation relies largely on hidden inductive biases found in common practice, such as model pre-training or the design of the encoder architecture. We perform various ablation experiments on popular benchmarks and our own synthetic transfers to illustrate their role in prototypical situations. To conclude our analysis, we propose to meta-learn parametric inductive biases to solve specific transfers and show their superior performance over handcrafted heuristics.
    SplitFed: When Federated Learning Meets Split Learning. (arXiv:2004.12088v3 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) and split learning (SL) are two popular distributed machine learning approaches. Both follow a model-to-data scenario; clients train and test machine learning models without sharing raw data. SL provides better model privacy than FL because the machine learning model architecture is split between clients and the server. Moreover, the split model makes SL a better option for resource-constrained environments. However, SL performs slower than FL due to the relay-based training across multiple clients. In this regard, this paper presents a novel approach, named splitfed learning (SFL), that amalgamates the two approaches, eliminating their inherent drawbacks, along with a refined architectural configuration incorporating differential privacy and PixelDP to enhance data privacy and model robustness. Our analysis and empirical results demonstrate that (pure) SFL provides similar test accuracy and communication efficiency as SL while significantly decreasing the computation time per global epoch relative to SL for multiple clients. Furthermore, as in SL, its communication efficiency over FL improves with the number of clients. The performance of SFL with privacy and robustness measures is further evaluated under extended experimental settings.
    DeFlow: Learning Complex Image Degradations from Unpaired Data with Conditional Flows. (arXiv:2101.05796v2 [cs.CV] UPDATED)
    (2 min) The difficulty of obtaining paired data remains a major bottleneck for learning image restoration and enhancement models for real-world applications. Current strategies aim to synthesize realistic training data by modeling noise and degradations that appear in real-world settings. We propose DeFlow, a method for learning stochastic image degradations from unpaired data. Our approach is based on a novel unpaired learning formulation for conditional normalizing flows. We model the degradation process in the latent space of a shared flow encoder-decoder network. This allows us to learn the conditional distribution of a noisy image given the clean input by solely minimizing the negative log-likelihood of the marginal distributions. We validate our DeFlow formulation on the task of joint image restoration and super-resolution. The models trained with the synthetic data generated by DeFlow outperform previous learnable approaches on three recent datasets. Code and trained models are available at: https://github.com/volflow/DeFlow
    Field Study in Deploying Restless Multi-Armed Bandits: Assisting Non-Profits in Improving Maternal and Child Health. (arXiv:2109.08075v1 [cs.LG])
    (2 min) The widespread availability of cell phones has enabled non-profits to deliver critical health information to their beneficiaries in a timely manner. This paper describes our work to assist non-profits that employ automated messaging programs to deliver timely preventive care information to beneficiaries (new and expecting mothers) during pregnancy and after delivery. Unfortunately, a key challenge in such information delivery programs is that a significant fraction of beneficiaries drop out of the program. Yet, non-profits often have limited health-worker resources (time) to place crucial service calls for live interaction with beneficiaries to prevent such engagement drops. To assist non-profits in optimizing this limited resource, we developed a Restless Multi-Armed Bandits (RMABs) system. One key technical contribution in this system is a novel clustering method of offline historical data to infer unknown RMAB parameters. Our second major contribution is evaluation of our RMAB system in collaboration with an NGO, via a real-world service quality improvement study. The study compared strategies for optimizing service calls to 23,003 participants over a period of 7 weeks to reduce engagement drops. We show that the RMAB group provides statistically significant improvement over other comparison groups, reducing engagement drops by ~30%. To the best of our knowledge, this is the first study demonstrating the utility of RMABs in real-world public health settings. We are transitioning our RMAB system to the NGO for real-world use.
    Detecting Propaganda Techniques in Memes. (arXiv:2109.08013v1 [cs.CV])
    (2 min) Propaganda can be defined as a form of communication that aims to influence the opinions or the actions of people towards a specific goal; this is achieved by means of well-defined rhetorical and psychological devices. Propaganda, in the form we know it today, can be dated back to the beginning of the 17th century. However, it is with the advent of the Internet and social media that it has started to spread on a much larger scale than before, thus becoming a major societal and political issue. Nowadays, a large fraction of propaganda in social media is multimodal, mixing textual with visual content. With this in mind, here we propose a new multi-label multimodal task: detecting the type of propaganda techniques used in memes. We further create and release a new corpus of 950 memes, carefully annotated with 22 propaganda techniques, which can appear in the text, in the image, or in both. Our analysis of the corpus shows that understanding both modalities together is essential for detecting these techniques. This is further confirmed in our experiments with several state-of-the-art multimodal models.
    Learning Representations by Humans, for Humans. (arXiv:1905.12686v4 [cs.LG] UPDATED)
    (2 min) When machine predictors can achieve higher performance than the human decision-makers they support, improving the performance of human decision-makers is often conflated with improving machine accuracy. Here we propose a framework to directly support human decision-making, in which the role of machines is to reframe problems rather than to prescribe actions through prediction. Inspired by the success of representation learning in improving performance of machine predictors, our framework learns human-facing representations optimized for human performance. This "Mind Composed with Machine" framework incorporates a human decision-making model directly into the representation learning paradigm and is trained with a novel human-in-the-loop training procedure. We empirically demonstrate the successful application of the framework to various tasks and representational forms.
    Building an Ensemble of Classifiers via Randomized Models of Ensemble Members. (arXiv:2109.07861v1 [cs.LG])
    (2 min) Many dynamic ensemble selection (DES) methods are known in the literature. A method previously developed by the authors consists in building a randomized classifier that is treated as a model of the base classifier. The model is equivalent to the base classifier in a certain probabilistic sense. Next, the probability of correct classification of the randomized classifier is taken as the competence of the evaluated classifier. In this paper, a novel randomized model of the base classifier is developed. In the proposed method, the random operation of the model results from a random selection of the learning set from the family of learning sets of a fixed size. The paper presents the mathematical foundations of this approach and shows how, for a practical application when learning and validation sets are given, one can determine the measure of competence and build a multiple classifier (MC) system with the DES scheme. The DES scheme with the proposed model of competence was experimentally evaluated on a collection of 67 benchmark datasets and compared in terms of eight quality criteria with two ensemble classifiers which use the previously proposed concepts of the randomized model. The proposed approach achieved the lowest ranks for almost all investigated quality criteria.
    Sample-Efficient L0-L2 Constrained Structure Learning of Sparse Ising Models. (arXiv:2012.01744v5 [stat.ML] UPDATED)
    (0 min) We consider the problem of learning the underlying graph of a sparse Ising model with $p$ nodes from $n$ i.i.d. samples. The most recent and best performing approaches combine an empirical loss (the logistic regression loss or the interaction screening loss) with a regularizer (an L1 penalty or an L1 constraint). This results in a convex problem that can be solved separately for each node of the graph. In this work, we leverage the cardinality-constraining L0 norm, which is known to properly induce sparsity, and further combine it with an L2 norm to better model the non-zero coefficients. We show that our proposed estimators achieve an improved sample complexity, both (a) theoretically, by reaching new state-of-the-art upper bounds for recovery guarantees, and (b) empirically, by showing sharper phase transitions between poor and full recovery for graph topologies studied in the literature, when compared to L1-based state-of-the-art methods.
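    A minimal sketch of the per-node convex baseline mentioned in the abstract (L1-regularized logistic regression for neighborhood selection), not the proposed L0-L2 estimator. The data below are purely random placeholder spins, so the recovered neighborhoods should be (nearly) empty; all names and parameter values are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    p, n = 8, 2000
    # Toy +/-1 samples; an illustrative stand-in for true Ising samples.
    X = rng.choice([-1.0, 1.0], size=(n, p))

    def neighborhood(X, node, C=0.1):
        """Regress one node's spins on all other spins with an L1 penalty."""
        y = (X[:, node] > 0).astype(int)
        others = np.delete(X, node, axis=1)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(others, y)
        coefs = clf.coef_.ravel()
        idx = np.delete(np.arange(X.shape[1]), node)
        return {int(j): float(w) for j, w in zip(idx, coefs) if abs(w) > 1e-3}

    for node in range(p):
        print(node, neighborhood(X, node))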
    On the Cryptographic Hardness of Learning Single Periodic Neurons. (arXiv:2106.10744v2 [cs.LG] UPDATED)
    (0 min) We show a simple reduction which demonstrates the cryptographic hardness of learning a single periodic neuron over isotropic Gaussian distributions in the presence of noise. More precisely, our reduction shows that any polynomial-time algorithm (not necessarily gradient-based) for learning such functions under small noise implies a polynomial-time quantum algorithm for solving worst-case lattice problems, whose hardness forms the foundation of lattice-based cryptography. Our core hard family of functions, which are well-approximated by one-layer neural networks, take the general form of a univariate periodic function applied to an affine projection of the data. These functions have appeared in previous seminal works which demonstrate their hardness against gradient-based algorithms (Shamir'18) and Statistical Query (SQ) algorithms (Song et al.'17). We show that if (polynomially) small noise is added to the labels, the intractability of learning these functions applies to all polynomial-time algorithms, beyond gradient-based and SQ algorithms, under the aforementioned cryptographic assumptions. Moreover, we demonstrate the necessity of noise in the hardness result by designing a polynomial-time algorithm for learning certain families of such functions under exponentially small adversarial noise. Our proposed algorithm is not a gradient-based or an SQ algorithm, but is rather based on the celebrated Lenstra-Lenstra-Lov\'asz (LLL) lattice basis reduction algorithm. Furthermore, in the absence of noise, this algorithm can be directly applied to solve CLWE detection (Bruna et al.'21) and phase retrieval with an optimal sample complexity of $d+1$ samples. In the former case, this improves upon the quadratic-in-$d$ sample complexity required in (Bruna et al.'21).
    Adversarial Attacks Against Deep Reinforcement Learning Framework in Internet of Vehicles. (arXiv:2108.00833v2 [cs.LG] UPDATED)
    (0 min) Machine learning (ML) has made incredible impacts and transformations in a wide range of vehicular applications. As the use of ML in Internet of Vehicles (IoV) continues to advance, adversarial threats and their impact have become an important subject of research worth exploring. In this paper, we focus on Sybil-based adversarial threats against a deep reinforcement learning (DRL)-assisted IoV framework and, more specifically, DRL-based dynamic service placement in IoV. We carry out an experimental study with real vehicle trajectories to analyze the impact on service delay and resource congestion under different attack scenarios for the DRL-based dynamic service placement application. We further investigate the impact of the proportion of Sybil-attacked vehicles in the network. The results demonstrate that the performance is significantly affected by Sybil-based data poisoning attacks when compared to an adversary-free, healthy network scenario.
    TruthfulQA: Measuring How Models Mimic Human Falsehoods. (arXiv:2109.07958v1 [cs.CL])
    (0 min) We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. For example, the 6B-parameter GPT-J model was 17% less truthful than its 125M-parameter counterpart. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.
    Semi-Supervised Visual Representation Learning for Fashion Compatibility. (arXiv:2109.08052v1 [cs.CV])
    (0 min) We consider the problem of complementary fashion prediction. Existing approaches focus on learning an embedding space where fashion items from different categories that are visually compatible are closer to each other. However, creating such labeled outfits is labor-intensive, and it is not feasible to generate all possible outfit combinations, especially with large fashion catalogs. In this work, we propose a semi-supervised learning approach where we leverage a large unlabeled fashion corpus to create pseudo-positive and pseudo-negative outfits on the fly during training. For each labeled outfit in a training batch, we obtain a pseudo-outfit by matching each item in the labeled outfit with unlabeled items. Additionally, we introduce consistency regularization to ensure that representations of the original images and their transformations are consistent to implicitly incorporate colour and other important attributes through self-supervision. We conduct extensive experiments on Polyvore, Polyvore-D and our newly created large-scale Fashion Outfits datasets, and show that our approach with only a fraction of labeled examples performs on-par with completely supervised methods.
    MOFSimplify: Machine Learning Models with Extracted Stability Data of Three Thousand Metal-Organic Frameworks. (arXiv:2109.08098v1 [cond-mat.mtrl-sci])
    (0 min) We report a workflow and the output of a natural language processing (NLP)-based procedure to mine the extant metal-organic framework (MOF) literature describing structurally characterized MOFs and their solvent removal and thermal stabilities. We obtain over 2,000 solvent removal stability measures from text mining and 3,000 thermal decomposition temperatures from thermogravimetric analysis data. We assess the validity of our NLP methods and the accuracy of our extracted data by comparing to a hand-labeled subset. Machine learning (ML, i.e. artificial neural network) models trained on this data using graph- and pore-geometry-based representations enable prediction of stability on new MOFs with quantified uncertainty. Our web interface, MOFSimplify, provides users access to our curated data and enables them to harness that data for predictions on new MOFs. MOFSimplify also encourages community feedback on existing data and on ML model predictions for community-based active learning for improved MOF stability models.
    Scalable Deep Reinforcement Learning for Routing and Spectrum Access in Physical Layer. (arXiv:2012.11783v2 [eess.SP] UPDATED)
    (0 min) This paper proposes a novel scalable reinforcement learning approach for simultaneous routing and spectrum access in wireless ad-hoc networks. In most previous works on reinforcement learning for network optimization, the network topology is assumed to be fixed, and a different agent is trained for each transmission node -- this limits scalability and generalizability. Further, routing and spectrum access are typically treated as separate tasks. Moreover, the optimization objective is usually a cumulative metric along the route, e.g., number of hops or delay. In this paper, we account for the physical-layer signal-to-interference-plus-noise ratio (SINR) in a wireless network and further show that a bottleneck objective such as the minimum SINR along the route can also be optimized effectively using reinforcement learning. Specifically, we propose a scalable approach in which a single agent is associated with each flow and makes routing and spectrum access decisions as it moves along the frontier nodes. The agent is trained according to the physical-layer characteristics of the environment using a novel rewarding scheme based on the Monte Carlo estimation of the future bottleneck SINR. It learns to avoid interference by intelligently making joint routing and spectrum allocation decisions based on the geographical location information of the neighbouring nodes.
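    A tiny sketch of the bottleneck objective discussed above: a route's quality is the minimum link SINR along it, rather than a sum of per-hop costs. The routes, signal/interference values, and noise floor below are hypothetical placeholders, not the paper's RL environment.

    import numpy as np

    def link_sinr(signal_mw, interference_mw, noise_mw=1e-9):
        # SINR of a single hop: signal over (interference + noise).
        return signal_mw / (interference_mw + noise_mw)

    # Each candidate route is a list of (signal, interference) pairs per hop.
    routes = {
        "route_A": [(2e-6, 5e-7), (1e-6, 2e-7), (3e-6, 1e-6)],
        "route_B": [(5e-6, 4e-6), (4e-6, 1e-7)],
    }

    def bottleneck_sinr(route):
        # The bottleneck objective: the worst hop determines the route score.
        return min(link_sinr(s, i) for s, i in route)

    for name, hops in routes.items():
        print(name, "bottleneck SINR:", bottleneck_sinr(hops))
    print("selected:", max(routes, key=lambda r: bottleneck_sinr(routes[r])))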
    Planning as Inference in Epidemiological Models. (arXiv:2003.13221v3 [q-bio.PE] UPDATED)
    (0 min) In this work we demonstrate how to automate parts of the infectious disease-control policy-making process via performing inference in existing epidemiological models. The kinds of inference tasks undertaken include computing the posterior distribution over simulation-model parameters that are controllable via direct policy-making choices and that give rise to acceptable disease progression outcomes. Among other things, we illustrate the use of a probabilistic programming language that automates inference in existing simulators. Neither the full capabilities of this tool for automating inference nor its utility for planning is widely disseminated at the current time. Timely gains in understanding how such simulation-based models and inference-automation tools can be applied in support of policymaking could lead to less economically damaging policy prescriptions, particularly during the current COVID-19 pandemic.
    Fermion Sampling Made More Efficient. (arXiv:2109.07358v1 [quant-ph] CROSS LISTED)
    (0 min) Fermion sampling generates the probability distribution of a many-body Slater-determinant wavefunction, which is termed a "determinantal point process" in statistical analysis. Owing to its inherently embedded Pauli exclusion principle, its applications reach beyond simulating fermionic quantum many-body physics to constructing machine learning models for diversified datasets. Here we propose a fermion sampling algorithm, which has a polynomial time-complexity -- quadratic in the fermion number and linear in the system size. This algorithm is about 100% more efficient in computation time than the best known algorithms. In sampling the corresponding marginal distribution, our algorithm has a more drastic improvement, achieving a scaling advantage. We demonstrate its power on several test applications, including sampling fermions in a many-body system and a machine learning task of text summarization, and confirm its improved computation efficiency over other methods by counting floating-point operations.
    ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations. (arXiv:2109.07991v1 [cs.RO])
    (0 min) Multisensory object-centric perception, reasoning, and interaction have been a key research topic in recent years. However, the progress in these directions is limited by the small set of objects available -- synthetic objects are not realistic enough and are mostly centered around geometry, while real object datasets such as YCB are often practically challenging and unstable to acquire due to international shipping, inventory, and financial cost. We present ObjectFolder, a dataset of 100 virtualized objects that addresses both challenges with two key innovations. First, ObjectFolder encodes the visual, auditory, and tactile sensory data for all objects, enabling a number of multisensory object recognition tasks, beyond existing datasets that focus purely on object geometry. Second, ObjectFolder employs a uniform, object-centric, and implicit representation for each object's visual textures, acoustic simulations, and tactile readings, making the dataset flexible to use and easy to share. We demonstrate the usefulness of our dataset as a testbed for multisensory perception and control by evaluating it on a variety of benchmark tasks, including instance recognition, cross-sensory retrieval, 3D reconstruction, and robotic grasping.
    Probability-driven scoring functions in combining linear classifiers. (arXiv:2109.07815v1 [cs.LG])
    (0 min) Although linear classifiers are one of the oldest methods in machine learning, they are still very popular in the machine learning community. This is due to their low computational complexity and robustness to overfitting. Consequently, linear classifiers are often used as base classifiers of multiple ensemble classification systems. This research is aimed at building a new fusion method dedicated to ensembles of linear classifiers. The fusion scheme uses both measurement space and geometrical space. Namely, we propose a probability-driven scoring function whose shape depends on the orientation of the decision hyperplanes generated by the base classifiers. The proposed fusion method is compared with the reference method using multiple benchmark datasets taken from the KEEL repository. The comparison is done using multiple quality criteria. The statistical analysis of the obtained results is also performed. The experimental study shows that, under certain conditions, some improvement may be obtained.
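    A generic sketch of fusing linear base classifiers through their geometry: each classifier's signed distance to its decision hyperplane is mapped through a logistic function and the scores are averaged. The paper's orientation-dependent scoring function is not reproduced here; the dataset, base learners, and normalization are illustrative choices.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import Perceptron

    X, y = make_classification(n_samples=600, n_features=10, random_state=0)
    rng = np.random.default_rng(0)

    base = []
    for seed in range(5):
        idx = rng.choice(len(X), size=300, replace=False)  # random subsets of the data
        base.append(Perceptron(random_state=seed).fit(X[idx], y[idx]))

    def fused_predict(X):
        scores = []
        for clf in base:
            margin = clf.decision_function(X)             # unnormalized signed distance
            margin = margin / np.linalg.norm(clf.coef_)    # normalize by hyperplane norm
            scores.append(1.0 / (1.0 + np.exp(-margin)))   # logistic score in [0, 1]
        return (np.mean(scores, axis=0) > 0.5).astype(int)

    print("fused accuracy:", (fused_predict(X) == y).mean())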
    DisUnknown: Distilling Unknown Factors for Disentanglement Learning. (arXiv:2109.08090v1 [cs.LG])
    (0 min) Disentangling data into interpretable and independent factors is critical for controllable generation tasks. With the availability of labeled data, supervision can help enforce the separation of specific factors as expected. However, it is often expensive or even impossible to label every single factor to achieve fully-supervised disentanglement. In this paper, we adopt a general setting where all factors that are hard to label or identify are encapsulated as a single unknown factor. Under this setting, we propose a flexible weakly-supervised multi-factor disentanglement framework DisUnknown, which Distills Unknown factors for enabling multi-conditional generation regarding both labeled and unknown factors. Specifically, a two-stage training approach is adopted to first disentangle the unknown factor with an effective and robust training method, and then train the final generator with the proper disentanglement of all labeled factors utilizing the unknown distillation. To demonstrate the generalization capacity and scalability of our method, we evaluate it on multiple benchmark datasets qualitatively and quantitatively and further apply it to various real-world applications on complicated datasets.
    HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop. (arXiv:1907.11184v1 [cs.CL] CROSS LISTED)
    (0 min) While the role of humans is increasingly recognized in the machine learning community, representation of and interaction with models in current human-in-the-loop machine learning (HITL-ML) approaches are too low-level and far removed from humans' conceptual models. We demonstrate HEIDL, a prototype HITL-ML system that exposes the machine-learned model through high-level, explainable linguistic expressions formed of predicates representing semantic structure of text. In HEIDL, the human's role is elevated from simply evaluating model predictions to interpreting and even updating the model logic directly by enabling interaction with rule predicates themselves. Raising the currency of interaction to such semantic levels calls for new interaction paradigms between humans and machines that result in improved productivity for the text analytics model development process. Moreover, by involving humans in the process, the human-machine co-created models generalize better to unseen data as domain experts are able to instill their expertise by extrapolating from what has been learned by automated algorithms from a few labelled examples.
    Data driven algorithms for limited labeled data learning. (arXiv:2103.10547v3 [cs.LG] UPDATED)
    (0 min) We consider a novel data driven approach for designing learning algorithms that can effectively learn with only a small number of labeled examples. This is crucial for modern machine learning applications where labels are scarce or expensive to obtain. We focus on graph-based techniques, where the unlabeled examples are connected in a graph under the implicit assumption that similar nodes likely have similar labels. Over the past decades, several elegant graph-based semi-supervised and active learning algorithms have been proposed for inferring the labels of the unlabeled examples given the graph and a few labeled examples. However, the problem of how to create the graph (which impacts the practical usefulness of these methods significantly) has been relegated to domain-specific art and heuristics, and no general principles have been proposed. In this work we present a novel data driven approach for learning the graph and provide strong formal guarantees in both the distributional and online learning formalizations. We show how to leverage problem instances coming from an underlying problem domain to learn the graph hyperparameters from commonly used parametric families of graphs that perform well on new instances coming from the same domain. We obtain low regret and efficient algorithms in the online setting, and generalization guarantees in the distributional setting. We also show how to combine several very different similarity metrics and learn multiple hyperparameters, providing general techniques to apply to large classes of problems. We expect some of the tools and techniques we develop along the way to be of interest beyond semi-supervised and active learning, for data driven algorithms for combinatorial problems more generally.
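    A simple sketch of why the graph hyperparameter matters and how it can be tuned from data: graph-based semi-supervised learning (label spreading with an RBF graph) is run for several bandwidths and the one with the best accuracy on held-out labels is kept. This naive grid search stands in for the paper's regret-bounded procedure; the dataset, grid, and split sizes are illustrative.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.semi_supervised import LabelSpreading

    X, y = load_digits(return_X_y=True)
    rng = np.random.default_rng(0)

    y_train = np.full_like(y, -1)                                 # -1 marks unlabeled points
    labeled = rng.choice(len(y), size=100, replace=False)         # a few labeled examples
    y_train[labeled] = y[labeled]
    rest = np.setdiff1d(np.arange(len(y)), labeled)
    holdout = rng.choice(rest, size=200, replace=False)           # held-out labels for tuning

    best_gamma, best_acc = None, -1.0
    for gamma in [1e-4, 1e-3, 1e-2, 1e-1]:
        model = LabelSpreading(kernel="rbf", gamma=gamma).fit(X, y_train)
        acc = (model.transduction_[holdout] == y[holdout]).mean()
        if acc > best_acc:
            best_gamma, best_acc = gamma, acc
    print("selected gamma:", best_gamma, "holdout accuracy:", best_acc)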
    Humanly Certifying Superhuman Classifiers. (arXiv:2109.07867v1 [cs.LG])
    (0 min) Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research. Today, this challenge is especially relevant given the emergence of systems which appear to increasingly outperform human beings. In some cases, this "superhuman" performance is readily demonstrated; for example by defeating legendary human players in traditional two player games. On the other hand, it can be challenging to evaluate classification models that potentially surpass human performance. Indeed, human annotations are often treated as a ground truth, which implicitly assumes the superiority of the human over any models trained on human annotations. In reality, human annotators can make mistakes and be subjective. Evaluating the performance with respect to a genuine oracle may be more objective and reliable, even when querying the oracle is expensive or impossible. In this paper, we first raise the challenge of evaluating the performance of both humans and models with respect to an oracle which is unobserved. We develop a theory for estimating the accuracy compared to the oracle, using only imperfect human annotations for reference. Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting, which we believe will assist in understanding the stage of current research on classification. We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks, for which an oracle does not exist, and show that under our assumptions a number of models from recent years are with high probability superhuman.
    The pitfalls of using open data to develop deep learning solutions for COVID-19 detection in chest X-rays. (arXiv:2109.08020v1 [cs.CV])
    (0 min) Since the emergence of COVID-19, deep learning models have been developed to identify COVID-19 from chest X-rays. With little to no direct access to hospital data, the AI community relies heavily on public data comprising numerous data sources. Model performance results have been exceptional when training and testing on open-source data, surpassing the reported capabilities of AI in pneumonia-detection prior to the COVID-19 outbreak. In this study, impactful models are trained on a widely used open-source dataset and tested on an external test set and a hospital dataset, for the task of classifying chest X-rays into one of three classes: COVID-19, non-COVID pneumonia and no-pneumonia. Classification performance of the models investigated is evaluated through ROC curves, confusion matrices and standard classification metrics. Explainability modules are implemented to explore the image features most important to classification. Data analysis and model evaluations show that the popular open-source dataset COVIDx is not representative of the real clinical problem and that results from testing on it are inflated. Dependence on open-source data can leave models vulnerable to bias and confounding variables, requiring careful analysis to develop clinically useful/viable AI tools for COVID-19 detection in chest X-rays.
    Data-Driven Theory-guided Learning of Partial Differential Equations using SimultaNeous Basis Function Approximation and Parameter Estimation (SNAPE). (arXiv:2109.07471v1 [cs.LG])
    (0 min) The measured spatiotemporal response of various physical processes is utilized to infer the governing partial differential equations (PDEs). We propose SimultaNeous Basis Function Approximation and Parameter Estimation (SNAPE), a technique for parameter estimation of PDEs that is robust against high levels of noise (nearly 100%), achieved by simultaneously fitting basis functions to the measured response and estimating the parameters of both ordinary and partial differential equations. The domain knowledge of the general multidimensional process is used as a constraint in the formulation of the optimization framework. SNAPE not only demonstrates its applicability on various complex dynamic systems that encompass wide scientific domains, including the Schr\"odinger equation, the chaotic Duffing oscillator, and the Navier-Stokes equations, but also estimates an analytical approximation to the process response. The method systematically combines the knowledge of well-established scientific theories and the concepts of data science to infer the properties of the process from the observed data.
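    A toy illustration of the general idea (not the SNAPE method itself): fit a smooth basis-function approximation to a noisy measured response and then estimate a differential-equation parameter from the fit. The process here is the simple ODE dx/dt = -a*x, and a smoothing spline plays the role of the basis approximation; the noise level and smoothing factor are arbitrary.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(0)
    a_true = 1.7
    t = np.linspace(0.0, 3.0, 200)
    x_noisy = np.exp(-a_true * t) + 0.05 * rng.normal(size=t.size)

    spline = UnivariateSpline(t, x_noisy, k=4, s=0.5)   # smooth basis-function fit
    x_hat = spline(t)
    dx_hat = spline.derivative()(t)                      # derivative of the fitted response

    # Least-squares estimate of a from dx/dt ~= -a * x.
    a_est = -np.sum(dx_hat * x_hat) / np.sum(x_hat ** 2)
    print("true a:", a_true, " estimated a:", round(float(a_est), 3))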
    Telehealthcare and Covid-19: A Noninvasive & Low Cost Invasive, Scalable and Multimodal Real-Time Smartphone Application for Early Diagnosis of SARS-CoV-2 Infection. (arXiv:2109.07846v1 [cs.LG])
    (0 min) The global coronavirus pandemic overwhelmed many health care systems; lockdowns were enforced and working from home was encouraged to control the spread of the virus and prevent hospitals from being overrun with patients. This prompted a sharp, widespread increase in the use of telehealth to provide low-risk care for patients. Nevertheless, continuous mutation into new variants and the widespread unavailability of test kits, especially in developing countries, pose a challenge to controlling potential future waves of infection. In this paper, we propose a novel smartphone application-based platform for early diagnosis of patients possibly infected with Covid-19. The application provides three modes of diagnosis from possible symptoms, cough sound, and specific blood biomarkers. When a user chooses a particular setting and provides the necessary information, it sends the data to a trained machine learning (ML) model deployed on a remote server over the internet. The ML algorithm then predicts the possibility of contracting Covid-19 and sends the feedback to the user. The entire procedure takes place in real-time. Our machine learning models can identify Covid-19 patients with an accuracy of 100%, 95.65%, and 77.59% from blood parameters, cough sound, and symptoms, respectively. Moreover, the ML sensitivity for blood and sound is 100%, which indicates correct identification of Covid-positive patients. This is significant in limiting the spread of the virus. The multimodality offers multiplex diagnostic methods to better classify possible infectees and, together with the instantaneous nature of our technique, demonstrates the power of telehealthcare as an easy, widespread, low-cost and scalable diagnostic solution for future pandemics.
    Transferable Persona-Grounded Dialogues via Grounded Minimal Edits. (arXiv:2109.07713v1 [cs.CL])
    (0 min) Grounded dialogue models generate responses that are grounded on certain concepts. Limited by the distribution of grounded dialogue data, models trained on such data face the transferability challenges in terms of the data distribution and the type of grounded concepts. To address the challenges, we propose the grounded minimal editing framework, which minimally edits existing responses to be grounded on the given concept. Focusing on personas, we propose Grounded Minimal Editor (GME), which learns to edit by disentangling and recombining persona-related and persona-agnostic parts of the response. To evaluate persona-grounded minimal editing, we present the PersonaMinEdit dataset, and experimental results show that GME outperforms competitive baselines by a large margin. To evaluate the transferability, we experiment on the test set of BlendedSkillTalk and show that GME can edit dialogue models' responses to largely improve their persona consistency while preserving the use of knowledge and empathy.
    An Online Algorithm for Maximum-Likelihood Quantum State Tomography. (arXiv:2012.15498v2 [cs.LG] UPDATED)
    (0 min) We propose, to the best of our knowledge, the first online algorithm to compute the maximum-likelihood estimate in quantum state tomography. Suppose the quantum state to be estimated corresponds to a $D$-by-$D$ density matrix. The per-iteration computational complexity of the algorithm is $O ( D ^ 3 )$, independent of the data size. The expected optimization error of the algorithm is $O(\sqrt{ ( 1 / T ) D \log D })$, where $T$ denotes the number of iterations. The algorithm can be viewed as a quantum extension of Soft-Bayes, a recent algorithm for online portfolio selection (Orseau et al. Soft-Bayes: Prod for mixtures of experts with log-loss. Int. Conf. Algorithmic Learning Theory. 2017).
    DuRIN: A Deep-unfolded Sparse Seismic Reflectivity Inversion Network. (arXiv:2104.04704v2 [physics.geo-ph] UPDATED)
    (0 min) We consider the reflection seismology problem of recovering the locations of interfaces and the amplitudes of reflection coefficients from seismic data, which are vital for estimating the subsurface structure. The reflectivity inversion problem is typically solved using greedy algorithms and iterative techniques. The sparse Bayesian learning framework and, more recently, deep learning techniques have shown the potential of data-driven approaches to solve the problem. In this paper, we propose a weighted minimax-concave penalty-regularized reflectivity inversion formulation and solve it through a model-based neural network. The network is referred to as the deep-unfolded reflectivity inversion network (DuRIN). We demonstrate the efficacy of the proposed approach over the benchmark techniques by testing on synthetic 1-D seismic traces and 2-D wedge models and validation with the simulated 2-D Marmousi2 model and real data from the Penobscot 3D survey off the coast of Nova Scotia, Canada.
    Lifelong Graph Learning. (arXiv:2009.00647v2 [cs.LG] UPDATED)
    (0 min) Graph neural networks (GNNs) are powerful models for many graph-structured tasks. Existing models often assume that the complete structure of a graph is available during training; however, in practice, graph-structured data is usually formed in a streaming fashion, so learning a graph continuously is often necessary. In this paper, we aim to bridge GNN to lifelong learning by converting a graph problem to a regular learning problem, so that GNN is able to inherit the lifelong learning techniques developed for convolutional neural networks (CNNs). To this end, we propose a new graph topology based on feature cross-correlation, called the feature graph. It takes features as new nodes and turns nodes into independent graphs. This successfully converts the original problem of node classification to graph classification, in which the increasing nodes are turned into independent training samples. In the experiments, we demonstrate the efficiency and effectiveness of feature graph networks (FGN) by continuously learning a sequence of classical graph datasets. We also show that FGN achieves superior performance in human action recognition with distributed streaming signals for wearable devices.
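    A minimal sketch of the feature-graph construction described above: features become nodes and edges come from feature cross-correlation. The random data, correlation threshold, and shapes below are arbitrary illustrative choices, not the paper's configuration.

    import numpy as np

    rng = np.random.default_rng(0)
    node_features = rng.normal(size=(500, 16))        # 500 graph nodes, 16 features each

    corr = np.corrcoef(node_features, rowvar=False)   # 16 x 16 feature cross-correlations
    adj = (np.abs(corr) > 0.05).astype(float)         # threshold into a feature adjacency
    np.fill_diagonal(adj, 0.0)                        # no self-loops

    print("feature-graph adjacency shape:", adj.shape)
    print("number of feature-feature edges:", int(adj.sum() / 2))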
    Adversarial Attacks against Deep Learning Based Power Control in Wireless Communications. (arXiv:2109.08139v1 [eess.SP])
    (0 min) We consider adversarial machine learning based attacks on power allocation where the base station (BS) allocates its transmit power to multiple orthogonal subcarriers by using a deep neural network (DNN) to serve multiple user equipments (UEs). The DNN that corresponds to a regression model is trained with channel gains as the input and allocated transmit powers as the output. While the BS allocates the transmit power to the UEs to maximize rates for all UEs, there is an adversary that aims to minimize these rates. The adversary may be an external transmitter that aims to manipulate the inputs to the DNN by interfering with the pilot signals that are transmitted to measure the channel gain. Alternatively, the adversary may be a rogue UE that transmits fabricated channel estimates to the BS. In both cases, the adversary carefully crafts adversarial perturbations to manipulate the inputs to the DNN of the BS subject to an upper bound on the strengths of these perturbations. We consider the attacks targeted on a single UE or all UEs. We compare these attacks with a benchmark, where the adversary scales down the input to the DNN. We show that adversarial attacks are much more effective than the benchmark attack in terms of reducing the rate of communications. We also show that adversarial attacks are robust to the uncertainty at the adversary including the erroneous knowledge of channel gains and the potential errors in exercising the attacks exactly as specified.
    I Wish I Would Have Loved This One, But I Didn't -- A Multilingual Dataset for Counterfactual Detection in Product Reviews. (arXiv:2104.06893v2 [cs.CL] UPDATED)
    (0 min) Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese languages. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced due to cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged to learn accurate CFD models. Applying machine translation on English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.
    Incentives in Two-sided Matching Markets with Prediction-enhanced Preference-formation. (arXiv:2109.07835v1 [cs.LG])
    (0 min) Two-sided matching markets have long existed to pair agents in the absence of regulated exchanges. A common example is school choice, where a matching mechanism uses student and school preferences to assign students to schools. In such settings, forming preferences is both difficult and critical. Prior work has suggested various prediction mechanisms that help agents make decisions about their preferences. Although often deployed together, these matching and prediction mechanisms are almost always analyzed separately. The present work shows that at the intersection of the two lies a previously unexplored type of strategic behavior: agents returning to the market (e.g., schools) can attack future predictions by interacting short-term non-optimally with their matches. Here, we first introduce this type of strategic behavior, which we call an `adversarial interaction attack'. Next, we construct a formal economic model that captures the feedback loop between prediction mechanisms designed to assist agents and the matching mechanism used to pair them. This economic model allows us to analyze adversarial interaction attacks. Finally, using school choice as an example, we build a simulation to show that, as the trust in and accuracy of predictions increases, schools gain progressively more by initiating an adversarial interaction attack. We also show that this attack increases inequality in the student population.
    Incremental Learning for End-to-End Automatic Speech Recognition. (arXiv:2005.04288v3 [eess.AS] UPDATED)
    (0 min) In this paper, we propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR) which enables an ASR system to perform well on new tasks while maintaining the performance on its originally learned ones. To mitigate catastrophic forgetting during incremental learning, we design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions. Our method works without access to the training data of original tasks, which addresses the cases where the previous data is no longer available or joint training is costly. Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting. Furthermore, in two practical scenarios, compared to the target-reference joint training method, the performance drop of our method is 0.02% Character Error Rate (CER), which is 97% smaller than the drops of the baseline methods.
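    A minimal sketch of the response-based distillation component mentioned above: a KL term keeps the new model's output distribution close to the frozen original model's while the new task's cross-entropy is minimized. The explainability-based distillation is not reproduced, the models are toy linear classifiers rather than ASR networks, and the weighting and temperature are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    old_model = nn.Linear(20, 10)          # frozen model from the original task
    new_model = nn.Linear(20, 10)          # model being trained on the new task
    for p in old_model.parameters():
        p.requires_grad_(False)

    x_new = torch.randn(32, 20)
    y_new = torch.randint(0, 10, (32,))

    def incremental_loss(x, y, alpha=0.5, temperature=2.0):
        logits = new_model(x)
        task_loss = F.cross_entropy(logits, y)
        kd_loss = F.kl_div(
            F.log_softmax(logits / temperature, dim=-1),
            F.softmax(old_model(x) / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        return (1 - alpha) * task_loss + alpha * kd_loss

    print("combined loss:", incremental_loss(x_new, y_new).item())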
    Membership Inference Attacks Against Recommender Systems. (arXiv:2109.08045v1 [cs.CR])
    (0 min) Recently, recommender systems have achieved promising performances and become one of the most widely used web applications. However, recommender systems are often trained on highly sensitive user data, and thus potential data leakage from recommender systems may lead to severe privacy problems. In this paper, we make the first attempt at quantifying the privacy leakage of recommender systems through the lens of membership inference. Compared with traditional membership inference against machine learning classifiers, our attack differs in two main ways. First, our attack operates at the user level rather than the data-sample level. Second, the adversary can only observe the ordered recommended items from a recommender system instead of prediction results in the form of posterior probabilities. To address the above challenges, we propose a novel method by representing users from relevant items. Moreover, a shadow recommender is established to derive the labeled training data for training the attack model. Extensive experimental results show that our attack framework achieves a strong performance. In addition, we design a defense mechanism to effectively mitigate the membership inference threat of recommender systems.
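    A simplified sketch of the user-level attack idea described above: represent each user by aggregating embeddings of the items recommended to them, then train a binary classifier to separate members from non-members. Everything here is a synthetic placeholder (random item embeddings, an artificial member/non-member shift); the shadow-recommender pipeline that would produce real attack labels is not reproduced.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_items, dim = 1000, 32
    item_embeddings = rng.normal(size=(n_items, dim))   # placeholder item vectors

    def user_vector(recommended_item_ids):
        # Represent a user by the mean embedding of items recommended to them.
        return item_embeddings[recommended_item_ids].mean(axis=0)

    # Synthetic "users": members get a small systematic shift so the toy attack
    # has something to learn; real signals would come from the shadow recommender.
    members = [user_vector(rng.choice(n_items, 20)) + 0.3 for _ in range(200)]
    non_members = [user_vector(rng.choice(n_items, 20)) for _ in range(200)]

    X = np.vstack(members + non_members)
    y = np.array([1] * 200 + [0] * 200)

    attack = LogisticRegression(max_iter=1000).fit(X, y)
    print("attack training accuracy:", attack.score(X, y))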
    OpenFed: An Open-Source Security and Privacy Guaranteed Federated Learning Framework. (arXiv:2109.07852v1 [cs.CR])
    (0 min) The broad application of artificial intelligence techniques ranging from self-driving vehicles to advanced medical diagnostics affords many benefits. Federated learning is a new breed of artificial intelligence, offering techniques to help bridge the gap between personal data protection and utilization for research and commercial deployment, especially in the use-cases where security and privacy are the key concerns. Here, we present OpenFed, an open-source software framework to simultaneously address the demands for data protection and utilization. In practice, OpenFed enables state-of-the-art model development in low-trust environments despite limited local data availability, which lays the groundwork for sustainable collaborative model development and commercial deployment by alleviating concerns of asset protection. In addition, OpenFed also provides an end-to-end toolkit to facilitate federated learning algorithm development as well as several benchmarks for fair performance comparison under diverse computing paradigms and configurations.
    Comparison and Unification of Three Regularization Methods in Batch Reinforcement Learning. (arXiv:2109.08134v1 [cs.LG])
    (0 min) In batch reinforcement learning, there can be poorly explored state-action pairs resulting in poorly learned, inaccurate models and poorly performing associated policies. Various regularization methods can mitigate the problem of learning overly-complex models in Markov decision processes (MDPs), however they operate in technically and intuitively distinct ways and lack a common form in which to compare them. This paper unifies three regularization methods in a common framework -- a weighted average transition matrix. Considering regularization methods in this common form illuminates how the MDP structure and the state-action pair distribution of the batch data set influence the relative performance of regularization methods. We confirm intuitions generated from the common framework by empirical evaluation across a range of MDPs and data collection policies.
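    A small numeric sketch of the common form named above: a poorly estimated transition matrix is regularized as a weighted average of the empirical estimate and a prior transition matrix. The uniform prior and the weight below are illustrative choices, not the paper's specific regularizers.

    import numpy as np

    counts = np.array([[8.0, 2.0, 0.0],    # observed transition counts for one action
                       [1.0, 0.0, 0.0],    # a sparsely visited state
                       [0.0, 0.0, 5.0]])

    empirical = counts / counts.sum(axis=1, keepdims=True)
    uniform_prior = np.full_like(empirical, 1.0 / counts.shape[1])

    w = 0.2  # how strongly to pull toward the prior
    regularized = (1 - w) * empirical + w * uniform_prior

    print(np.round(regularized, 3))
    print("rows still sum to 1:", np.allclose(regularized.sum(axis=1), 1.0))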
    KnowMAN: Weakly Supervised Multinomial Adversarial Networks. (arXiv:2109.07994v1 [cs.LG])
    (0 min) The absence of labeled data for training neural models is often addressed by leveraging knowledge about the specific task, resulting in heuristic but noisy labels. The knowledge is captured in labeling functions, which detect certain regularities or patterns in the training samples and annotate corresponding labels for training. This process of weakly supervised training may result in an over-reliance on the signals captured by the labeling functions and hinder models from exploiting other signals or generalizing well. We propose KnowMAN, an adversarial scheme that enables control over the influence of signals associated with specific labeling functions. KnowMAN forces the network to learn representations that are invariant to those signals and to pick up other signals that are more generally associated with an output label. KnowMAN strongly improves results compared to direct weakly supervised learning with a pre-trained transformer language model and a feature-based baseline.
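    A sketch of one standard building block for learning signal-invariant representations: a gradient reversal layer that trains a discriminator to predict which labeling function produced a sample while pushing the encoder to remove that signal. KnowMAN's exact multinomial adversarial setup may differ; the tiny networks, dimensions, and random data below are placeholders for illustration.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient flowing back into the encoder.
            return -ctx.lam * grad_output, None

    torch.manual_seed(0)
    encoder = nn.Sequential(nn.Linear(50, 16), nn.ReLU())
    label_head = nn.Linear(16, 2)          # predicts the task label
    lf_head = nn.Linear(16, 5)             # predicts which labeling function fired

    x = torch.randn(8, 50)
    y = torch.randint(0, 2, (8,))
    lf_id = torch.randint(0, 5, (8,))

    z = encoder(x)
    loss = nn.functional.cross_entropy(label_head(z), y) \
         + nn.functional.cross_entropy(lf_head(GradReverse.apply(z, 1.0)), lf_id)
    loss.backward()
    print("combined loss:", loss.item())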
    Auditing Fairness and Imputation Impact in Predictive Analytics for Higher Education. (arXiv:2109.07908v1 [cs.CY])
    (0 min) Nowadays, colleges and universities use predictive analytics in a variety of ways to increase student success rates. Despite the potential of predictive analytics, there exist two major barriers to its adoption in higher education: (a) the lack of democratization in deployment, and (b) the potential to exacerbate inequalities. Education researchers and policymakers encounter numerous challenges in deploying predictive modeling in practice. These challenges arise in different steps of modeling, including data preparation, model development, and evaluation. Nevertheless, each of these steps can introduce additional bias to the system if not appropriately performed. Most large-scale and nationally representative education data sets suffer from a significant number of incomplete responses from the research participants. Missing values are a frequent latent cause behind many data analysis challenges. While many education-related studies addressed the challenges of missing data, little is known about the impact of handling missing values on the fairness of predictive outcomes in practice. In this paper, we set out to first assess the disparities in predictive modeling outcomes for college-student success, then investigate the impact of imputation techniques on the model performance and fairness using a comprehensive set of common metrics. The comprehensive analysis of a real large-scale education dataset reveals key insights on the modeling disparity and how different imputation techniques fundamentally compare to one another in terms of their impact on the fairness of the student-success predictive outcome.
    Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power?. (arXiv:2109.08131v1 [cs.HC])
    (0 min) Research in machine learning (ML) has primarily argued that models trained on incomplete or biased datasets can lead to discriminatory outputs. In this commentary, we propose moving the research focus beyond bias-oriented framings by adopting a power-aware perspective to "study up" ML datasets. This means accounting for historical inequities, labor conditions, and epistemological standpoints inscribed in data. We draw on HCI and CSCW work to support our argument, critically analyze previous research, and point at two co-existing lines of work within our community -- one bias-oriented, the other power-aware. This way, we highlight the need for dialogue and cooperation in three areas: data quality, data work, and data documentation. In the first area, we argue that reducing societal problems to "bias" misses the context-based nature of data. In the second one, we highlight the corporate forces and market imperatives involved in the labor of data workers that subsequently shape ML datasets. Finally, we propose expanding current transparency-oriented efforts in dataset documentation to reflect the social contexts of data design and production.
    Adaptive Control of Quadratic Costs in Linear Stochastic Differential Equations. (arXiv:2109.07630v1 [eess.SY])
    (0 min) We study a canonical problem in adaptive control; design and analysis of policies for minimizing quadratic costs in unknown continuous-time linear dynamical systems. We address important challenges including accuracy of learning the unknown parameters of the underlying stochastic differential equation, as well as full analyses of performance degradation due to sub-optimal actions (i.e., regret). Then, an easy-to-implement algorithm for balancing exploration versus exploitation is proposed, followed by theoretical guarantees showing a square-root of time regret bound. Further, we present tight results for assuring system stability and for specifying fundamental limits for regret. To establish the presented results, multiple novel technical frameworks are developed, which can be of independent interests.
    A Quadratic Time Locally Optimal Algorithm for NP-hard Equal Cardinality Partition Optimization. (arXiv:2109.07882v1 [cs.DS])
    (0 min) We study the optimization version of the equal cardinality set partition problem (where the absolute difference between the equal-sized partitions' sums is minimized). While this problem is NP-hard and requires exponential complexity to solve in general, we have formulated a weaker version of this NP-hard problem, where the goal is to find a locally optimal solution. The local optimality considered in our work is under any swap between the opposing partitions' element pairs. To this end, we designed an algorithm which can produce such a locally optimal solution in $O(N^2)$ time and $O(N)$ space. Our approach does not require positive or integer inputs and works equally well under arbitrary input precisions. Thus, it is widely applicable in different problem scenarios.
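    A plain-Python sketch of the swap-based local-optimality notion described above: start from an arbitrary equal-cardinality split and keep applying any single swap between the two sides that strictly reduces the absolute difference of their sums, until no such swap exists. This illustrates the solution concept only; it is not the paper's specific O(N^2)-time, O(N)-space algorithm, and the input numbers are arbitrary.

    def swap_local_search(values):
        half = len(values) // 2
        a, b = list(values[:half]), list(values[half:])
        current = abs(sum(a) - sum(b))
        improved = True
        while improved:
            improved = False
            for i in range(len(a)):
                for j in range(len(b)):
                    # Difference of sums if a[i] and b[j] were swapped.
                    new_diff = abs((sum(a) - a[i] + b[j]) - (sum(b) - b[j] + a[i]))
                    if new_diff < current:
                        a[i], b[j] = b[j], a[i]   # accept the improving swap
                        current = new_diff
                        improved = True
        return a, b, current

    nums = [7.3, 1.2, 9.9, 4.4, 3.3, 8.1, 2.2, 6.0]
    left, right, gap = swap_local_search(nums)
    print(left, right, "difference:", gap)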
    DeepMTS: Deep Multi-task Learning for Survival Prediction in Patients with Advanced Nasopharyngeal Carcinoma using Pretreatment PET/CT. (arXiv:2109.07711v1 [eess.IV])
    (0 min) Nasopharyngeal Carcinoma (NPC) is a worldwide malignant epithelial cancer. Survival prediction is a major concern for NPC patients, as it provides early prognostic information that is needed to guide treatments. Recently, deep learning, which leverages Deep Neural Networks (DNNs) to learn deep representations of image patterns, has been introduced to survival prediction in various cancers, including NPC. It has been reported that image-derived end-to-end deep survival models have the potential to outperform clinical prognostic indicators and traditional radiomics-based survival models in prognostic performance. However, deep survival models, especially 3D models, require large image training data to avoid overfitting. Unfortunately, medical image data is usually scarce, especially for Positron Emission Tomography/Computed Tomography (PET/CT) due to the high cost of PET/CT scanning. Compared to Magnetic Resonance Imaging (MRI) or Computed Tomography (CT) providing only anatomical information of tumors, PET/CT that provides both anatomical (from CT) and metabolic (from PET) information is promising to achieve more accurate survival prediction. However, we have not identified any 3D end-to-end deep survival model that applies to small PET/CT data of NPC patients. In this study, we introduced the concept of multi-task learning into deep survival models to address the overfitting problem resulting from small data. Tumor segmentation was incorporated as an auxiliary task to enhance the model's efficiency of learning from scarce PET/CT data. Based on this idea, we proposed a 3D end-to-end Deep Multi-Task Survival model (DeepMTS) for joint survival prediction and tumor segmentation. Our DeepMTS can jointly learn survival prediction and tumor segmentation using PET/CT data of only 170 patients with advanced NPC.
    Conservative Data Sharing for Multi-Task Offline Reinforcement Learning. (arXiv:2109.08128v1 [cs.LG])
    (0 min) Offline reinforcement learning (RL) algorithms have shown promising results in domains where abundant pre-collected data is available. However, prior methods focus on solving individual problems from scratch with an offline dataset without considering how an offline RL agent can acquire multiple skills. We argue that a natural use case of offline RL is in settings where we can pool large amounts of data collected in various scenarios for solving different tasks, and utilize all of this data to learn behaviors for all the tasks more effectively rather than training each one in isolation. However, sharing data across all tasks in multi-task offline RL performs surprisingly poorly in practice. Through thorough empirical analysis, we find that sharing data can actually exacerbate the distributional shift between the learned policy and the dataset, which in turn can lead to divergence of the learned policy and poor performance. To address this challenge, we develop a simple technique for data-sharing in multi-task offline RL that routes data based on the improvement over the task-specific data. We call this approach conservative data sharing (CDS), and it can be applied with multiple single-task offline RL methods. On a range of challenging multi-task locomotion, navigation, and vision-based robotic manipulation problems, CDS achieves the best or comparable performance compared to prior offline multi-task RL methods and previous data sharing approaches.
    Efficient Differentiable Simulation of Articulated Bodies. (arXiv:2109.07719v1 [cs.LG])
    (0 min) We present a method for efficient differentiable simulation of articulated bodies. This enables integration of articulated body dynamics into deep learning frameworks, and gradient-based optimization of neural networks that operate on articulated bodies. We derive the gradients of the forward dynamics using spatial algebra and the adjoint method. Our approach is an order of magnitude faster than autodiff tools. By only saving the initial states throughout the simulation process, our method reduces memory requirements by two orders of magnitude. We demonstrate the utility of efficient differentiable dynamics for articulated bodies in a variety of applications. We show that reinforcement learning with articulated systems can be accelerated using gradients provided by our method. In applications to control and inverse problems, gradient-based optimization enabled by our work accelerates convergence by more than an order of magnitude.
    Frame by frame completion probability of an NFL pass. (arXiv:2109.08051v1 [stat.ML])
    (0 min) American football is an increasingly popular sport, with a growing audience in many countries in the world. The most watched American football league in the world is the United States' National Football League (NFL), where every offensive play can be either a run or a pass, and in this work we focus on passes. Many factors can affect the probability of pass completion, such as receiver separation from the nearest defender, distance from receiver to passer, offense formation, among many others. When predicting the completion probability of a pass, it is essential to know who the target of the pass is. By using distance measures between players and the ball, it is possible to calculate empirical probabilities and predict very accurately who the target will be. The big question is: how likely is it for a pass to be completed in an NFL match while the ball is in the air? We developed a machine learning algorithm to answer this based on several predictors. Using data from the 2018 NFL season, we obtained conditional and marginal predictions for pass completion probability based on a random forest model. This is based on a two-stage procedure: first, we calculate the probability of each offensive player being the pass target, then, conditional on the target, we predict completion probability based on the random forest model. Finally, the general completion probability can be calculated using the law of total probability. We present animations for selected plays and show the pass completion probability evolution.
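    A tiny numeric sketch of the law-of-total-probability step described above: the marginal completion probability is the sum over candidate receivers of P(target = receiver) * P(complete | target = receiver). In the paper these terms come from the distance-based target model and the random forest; the receiver names and numbers below are made up purely for illustration.

    target_prob = {"WR1": 0.55, "WR2": 0.25, "TE": 0.15, "RB": 0.05}
    completion_given_target = {"WR1": 0.70, "WR2": 0.45, "TE": 0.60, "RB": 0.85}

    marginal_completion = sum(
        target_prob[r] * completion_given_target[r] for r in target_prob
    )
    print("marginal completion probability:", round(marginal_completion, 3))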
    Weighted Graph-Based Signal Temporal Logic Inference Using Neural Networks. (arXiv:2109.08078v1 [cs.AI])
    (0 min) Extracting spatial-temporal knowledge from data is useful in many applications. It is important that the obtained knowledge is human-interpretable and amenable to formal analysis. In this paper, we propose a method that trains neural networks to learn spatial-temporal properties in the form of weighted graph-based signal temporal logic (wGSTL) formulas. For learning wGSTL formulas, we introduce a flexible wGSTL formula structure in which the user's preference can be applied in the inferred wGSTL formulas. In the proposed framework, each neuron of the neural networks corresponds to a subformula in a flexible wGSTL formula structure. We initially train a neural network to learn the wGSTL operators and then train a second neural network to learn the parameters in a flexible wGSTL formula structure. We use a COVID-19 dataset and a rain prediction dataset to evaluate the performance of the proposed framework and algorithms. We compare the performance of the proposed framework with three baseline classification methods including K-nearest neighbors, decision trees, and artificial neural networks. The classification accuracy obtained by the proposed framework is comparable with the baseline classification methods.
    Predicting students' performance in online courses using multiple data sources. (arXiv:2109.07903v1 [cs.CY])
    (0 min) Data-driven decision making is serving and transforming education. We approached the problem of predicting students' performance by using multiple data sources which came from online courses, including one we created. Experimental results show preliminary conclusions towards which data are to be considered for the task.
    Bayesian graph convolutional neural networks via tempered MCMC. (arXiv:2104.08438v2 [cs.LG] UPDATED)
    (0 min) Deep learning models, such as convolutional neural networks, have long been applied to image and multi-media tasks, particularly those with structured data. More recently, there has been more attention to unstructured data that can be represented via graphs. These types of data are often found in health and medicine, social networks, and research data repositories. Graph convolutional neural networks have recently gained attention in the field of deep learning that takes advantage of graph-based data representation with automatic feature extraction via convolutions. Given the popularity of these methods in a wide range of applications, robust uncertainty quantification is vital. This remains a challenge for large models and unstructured datasets. Bayesian inference provides a principled approach to uncertainty quantification of model parameters for deep learning models. Although Bayesian inference has been used extensively elsewhere, its application to deep learning remains limited due to the computational requirements of the Markov Chain Monte Carlo (MCMC) methods. Recent advances in parallel computing and advanced proposal schemes in MCMC sampling methods have opened the path for Bayesian deep learning. In this paper, we present Bayesian graph convolutional neural networks that employ tempered MCMC sampling with a Langevin-gradient proposal distribution implemented via parallel computing. Our results show that the proposed method can provide accuracy similar to advanced optimisers while providing uncertainty quantification for key benchmark problems.
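    A minimal sketch of the Langevin-gradient proposal mechanism on a toy one-dimensional target (a standard normal), i.e. Metropolis-adjusted Langevin sampling: the proposal drifts along the gradient of the log-density and is corrected with a Metropolis-Hastings step. The tempering, parallel chains, and graph-convolutional model from the paper are not reproduced; the step size and chain length are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_p(x):           # log-density of the toy target (standard normal, up to a constant)
        return -0.5 * x ** 2

    def grad_log_p(x):      # gradient of the log-density
        return -x

    def mala(n_steps=5000, eps=0.5):
        x, samples = 0.0, []
        for _ in range(n_steps):
            # Langevin-gradient proposal: drift along the gradient, then add noise.
            mean_fwd = x + 0.5 * eps ** 2 * grad_log_p(x)
            prop = mean_fwd + eps * rng.normal()
            mean_rev = prop + 0.5 * eps ** 2 * grad_log_p(prop)
            # Metropolis-Hastings correction for the asymmetric proposal.
            log_q_fwd = -0.5 * ((prop - mean_fwd) / eps) ** 2
            log_q_rev = -0.5 * ((x - mean_rev) / eps) ** 2
            log_alpha = log_p(prop) + log_q_rev - log_p(x) - log_q_fwd
            if np.log(rng.uniform()) < log_alpha:
                x = prop
            samples.append(x)
        return np.array(samples)

    s = mala()
    print("sample mean ~ 0:", s.mean(), " sample variance ~ 1:", s.var())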
    Beyond 5G RIS mmWave Systems: Where Communication and Localization Meet. (arXiv:2109.07729v1 [eess.SP])
    (0 min) Upcoming beyond fifth generation (5G) communications systems aim at further enhancing key performance indicators and fully supporting brand new use cases by embracing emerging techniques, e.g., reconfigurable intelligent surface (RIS), integrated communication, localization, and sensing, and mmWave/THz communications. The wireless intelligence empowered by state-of-the-art artificial intelligence techniques has been widely considered at the transceivers, and now the paradigm is deemed to be shifted to the smart control of radio propagation environment by virtue of RISs. In this article, we argue that to harness the full potential of RISs, localization and communication must be tightly coupled. This is in sharp contrast to 5G and earlier generations, where localization was a minor additional service. To support this, we first introduce the fundamentals of RIS mmWave channel modeling, followed by RIS channel state information acquisition and link establishment. Then, we deal with the connection between localization and communications, from a separate and joint perspective.
    A Column Streaming-Based Convolution Engine and Mapping Algorithm for CNN-based Edge AI accelerators. (arXiv:2109.07601v1 [cs.LG])
    (0 min) Edge AI accelerators have been emerging as a solution for applications close to end users in areas such as unmanned aerial vehicles (UAVs), image recognition sensors, wearable devices, robotics, and remote sensing satellites. These applications require meeting not only performance targets but also strict area and power constraints, due to their portability and limited power sources. As a result, this paper proposes a column streaming-based convolution engine that includes column sets of processing elements designed for flexibility in applicability to different CNN algorithms in edge AI accelerators. Compared to a commercialized CNN accelerator, the key results reveal that the column streaming-based convolution engine requires similar execution cycles for processing a 227 x 227 feature map while avoiding zero-padding penalties.
    PDBench: Evaluating Computational Methods for Protein Sequence Design. (arXiv:2109.07925v1 [q-bio.BM])
    (0 min) Proteins perform critical processes in all living systems: converting solar energy into chemical energy, replicating DNA, forming the basis of highly performant materials, sensing, and much more. While an incredible range of functionality has been sampled in nature, it accounts for a tiny fraction of the possible protein universe. If we could tap into this pool of unexplored protein structures, we could search for novel proteins with useful properties that we could apply to tackle the environmental and medical challenges facing humanity. This is the purpose of protein design. Sequence design is an important aspect of protein design, and many successful methods to do this have been developed. Recently, deep-learning methods that frame it as a classification problem have emerged as a powerful approach. Beyond their reported improvement in performance, their primary advantage over physics-based methods is that the computational burden is shifted from the user to the developers, thereby increasing accessibility to the design method. Despite this trend, the tools for assessment and comparison of such models remain quite generic. The goal of this paper is to both address the timely problem of evaluation and to shine a spotlight, within the Machine Learning community, on specific assessment criteria that will accelerate impact. We present a carefully curated benchmark set of proteins and propose a number of standard tests to assess the performance of deep learning based methods. Our robust benchmark provides biological insight into the behaviour of design methods, which is essential for evaluating their performance and utility. We compare five existing models with two novel models for sequence prediction. Finally, we test the designs produced by these models with AlphaFold2, a state-of-the-art structure-prediction algorithm, to determine if they are likely to fold into the intended 3D shapes.
    A Multi-Task Cross-Task Learning Architecture for Ad-hoc Uncertainty Estimation in 3D Cardiac MRI Image Segmentation. (arXiv:2109.07702v1 [eess.IV])
    (0 min) Medical image segmentation has benefitted significantly from deep learning architectures. Furthermore, semi-supervised learning (SSL) has recently been a growing trend for improving a model's overall performance by leveraging abundant unlabeled data. Moreover, learning multiple tasks within the same model further improves model generalizability. To generate smoother and more accurate segmentation masks from 3D cardiac MR images, we present a Multi-task Cross-task learning consistency approach to enforce the correlation between the pixel-level (segmentation) and the geometric-level (distance map) tasks. Our extensive experimentation with varied quantities of labeled data in the training sets demonstrates the effectiveness of our model for the segmentation of the left atrial cavity from Gadolinium-enhanced magnetic resonance (GE-MR) images. With the incorporation of uncertainty estimates to detect failures in the segmentation masks generated by CNNs, our study further showcases the potential of our model to flag low-quality segmentations from a given model.
    Cross-domain Activity Recognition via Substructural Optimal Transport. (arXiv:2102.03353v3 [eess.SP] UPDATED)
    (0 min) It is expensive and time-consuming to collect sufficient labeled data for human activity recognition (HAR). Domain adaptation is a promising approach for cross-domain activity recognition. Existing methods mainly focus on adapting cross-domain representations via domain-level, class-level, or sample-level distribution matching. However, they might fail to capture the fine-grained locality information in activity data. Domain- and class-level matching is too coarse and may result in under-adaptation, while sample-level matching may be seriously affected by noise and eventually cause over-adaptation. In this paper, we propose substructure-level matching for domain adaptation (SSDA) to better utilize the locality information of activity data for accurate and efficient knowledge transfer. Based on SSDA, we propose an optimal transport-based implementation, Substructural Optimal Transport (SOT), for cross-domain HAR. We obtain the substructures of activities via clustering methods and seek the coupling of the weighted substructures between different domains. We conduct comprehensive experiments on four public activity recognition datasets (i.e. UCI-DSADS, UCI-HAR, USC-HAD, PAMAP2), demonstrating that SOT significantly outperforms other state-of-the-art methods w.r.t. classification accuracy (9%+ improvement). In addition, our method is 5x faster than traditional OT-based DA methods with the same hyper-parameters.
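    To make the substructure idea concrete, the toy sketch below clusters each domain into substructures and couples the weighted centroids with entropic (Sinkhorn) optimal transport; the clustering choice, cost, and regularisation are illustrative assumptions, not the paper's exact SOT algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def substructure_coupling(Xs, Xt, k=5, reg=0.1, n_iter=200):
    """Toy sketch of substructure-level OT: cluster each domain into k
    substructures, then couple the weighted centroids with entropic OT."""
    ks = KMeans(n_clusters=k).fit(Xs)
    kt = KMeans(n_clusters=k).fit(Xt)
    # Substructure weights = fraction of samples in each cluster.
    a = np.bincount(ks.labels_, minlength=k) / len(Xs)
    b = np.bincount(kt.labels_, minlength=k) / len(Xt)
    # Squared-Euclidean cost between substructure centroids.
    C = ((ks.cluster_centers_[:, None, :] - kt.cluster_centers_[None, :, :]) ** 2).sum(-1)

    K = np.exp(-C / reg)
    u = np.ones(k)
    for _ in range(n_iter):                  # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]       # coupling matrix between substructures
```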
    Accurately Modeling Biased Random Walks on Weighted Graphs Using $\textit{Node2vec+}$. (arXiv:2109.08031v1 [cs.SI])
    (0 min) Node embedding is a powerful approach for representing the structural role of each node in a graph. $\textit{Node2vec}$ is a widely used method for node embedding that works by exploring the local neighborhoods via biased random walks on the graph. However, $\textit{node2vec}$ does not consider edge weights when computing walk biases. This intrinsic limitation prevents $\textit{node2vec}$ from leveraging all the information in weighted graphs and, in turn, limits its application to many real-world networks that are weighted and dense. Here, we naturally extend $\textit{node2vec}$ to $\textit{node2vec+}$ in a way that accounts for edge weights when calculating walk biases, but which reduces to $\textit{node2vec}$ in the cases of unweighted graphs or unbiased walks. We empirically show that $\textit{node2vec+}$ is more robust to additive noise than $\textit{node2vec}$ in weighted graphs using two synthetic datasets. We also demonstrate that $\textit{node2vec+}$ significantly outperforms $\textit{node2vec}$ on a commonly benchmarked multi-label dataset (Wikipedia). Furthermore, we test $\textit{node2vec+}$ against GCN and GraphSAGE using various challenging gene classification tasks on two protein-protein interaction networks. Despite some clear advantages of GCN and GraphSAGE, they show comparable performance with $\textit{node2vec+}$. Finally, $\textit{node2vec+}$ can be used as a general approach for generating biased random walks, benefiting all existing methods built on top of $\textit{node2vec}$. $\textit{Node2vec+}$ is implemented as part of $\texttt{PecanPy}$, which is available at https://github.com/krishnanlab/PecanPy .
    Learning to Aggregate and Refine Noisy Labels for Visual Sentiment Analysis. (arXiv:2109.07509v1 [cs.CV])
    (0 min) Visual sentiment analysis has received increasing attention in recent years. However, the quality of the dataset is a concern because the sentiment labels are crowd-sourced, subjective, and prone to mistakes. This poses a severe threat to data-driven models, including deep neural networks, which would generalize poorly on test cases if trained to over-fit samples with noisy sentiment labels. Inspired by the recent progress on learning with noisy labels, we propose a robust learning method to perform robust visual sentiment analysis. Our method relies on an external memory to aggregate and filter noisy labels during training and thus can prevent the model from overfitting the noisy cases. The memory is composed of the prototypes with corresponding labels, both of which can be updated online. We establish a benchmark for visual sentiment analysis with label noise using publicly available datasets. The experimental results on the proposed benchmark settings comprehensively show the effectiveness of our method.
    WildWood: a new Random Forest algorithm. (arXiv:2109.08010v1 [cs.LG])
    (0 min) We introduce WildWood (WW), a new ensemble algorithm for supervised learning of Random Forest (RF) type. While standard RF algorithms use bootstrap out-of-bag samples to compute out-of-bag scores, WW uses these samples to produce improved predictions given by an aggregation of the predictions of all possible subtrees of each fully grown tree in the forest. This is achieved by aggregation with exponential weights computed over out-of-bag samples, that are computed exactly and very efficiently thanks to an algorithm called context tree weighting. This improvement, combined with a histogram strategy to accelerate split finding, makes WW fast and competitive compared with other well-established ensemble methods, such as standard RF and extreme gradient boosting algorithms.
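    A minimal sketch of aggregation with exponential weights over out-of-bag losses is shown below; it enumerates a handful of experts explicitly, whereas WildWood aggregates over all subtrees of each tree exactly via context tree weighting, so this is only an illustration of the weighting rule.

```python
import numpy as np

def exponential_weights_aggregate(predictions, oob_losses, eta=1.0):
    """Aggregate expert predictions with exponential weights: each expert (here
    standing in for a subtree) gets weight proportional to exp(-eta * its
    out-of-bag loss). Illustrative only, not WildWood's exact computation."""
    w = np.exp(-eta * (oob_losses - oob_losses.min()))   # shift for numerical stability
    w /= w.sum()
    return w @ predictions                                # weighted average prediction

preds = np.array([[0.2, 0.8], [0.6, 0.4], [0.4, 0.6]])    # 3 experts, 2-class probabilities
losses = np.array([0.30, 0.90, 0.50])                     # their out-of-bag losses
print(exponential_weights_aggregate(preds, losses))
```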
    Capacity and quantum geometry of parametrized quantum circuits. (arXiv:2102.01659v2 [quant-ph] UPDATED)
    (0 min) To harness the potential of noisy intermediate-scale quantum devices, it is paramount to find the best type of circuits to run hybrid quantum-classical algorithms. Key candidates are parametrized quantum circuits that can be effectively implemented on current devices. Here, we evaluate the capacity and trainability of these circuits using the geometric structure of the parameter space via the effective quantum dimension, which reveals the expressive power of circuits in general as well as of particular initialization strategies. We assess the expressive power of various popular circuit types and find striking differences depending on the type of entangling gates used. Particular circuits are characterized by scaling laws in their expressiveness. We identify a transition in the quantum geometry of the parameter space, which leads to a decay of the quantum natural gradient for deep circuits. For shallow circuits, the quantum natural gradient can be orders of magnitude larger in value compared to the regular gradient; however, both of them can suffer from vanishing gradients. By tuning a fixed set of circuit parameters to randomized ones, we find a region where the circuit is expressive, but does not suffer from barren plateaus, hinting at a good way to initialize circuits. We show an algorithm that prunes redundant parameters of a circuit without affecting its effective dimension. Our results enhance the understanding of parametrized quantum circuits and can be immediately applied to improve variational quantum algorithms.
    Sign-MAML: Efficient Model-Agnostic Meta-Learning by SignSGD. (arXiv:2109.07497v1 [cs.LG])
    (0 min) We propose a new computationally-efficient first-order algorithm for Model-Agnostic Meta-Learning (MAML). The key enabling technique is to interpret MAML as a bilevel optimization (BLO) problem and leverage the sign-based SGD (signSGD) as a lower-level optimizer of BLO. We show that MAML, through the lens of signSGD-oriented BLO, naturally yields an alternating optimization scheme that just requires first-order gradients of a learned meta-model. We term the resulting MAML algorithm Sign-MAML. Compared to the conventional first-order MAML (FO-MAML) algorithm, Sign-MAML is theoretically-grounded as it does not impose any assumption on the absence of second-order derivatives during meta training. In practice, we show that Sign-MAML outperforms FO-MAML in various few-shot image classification tasks, and compared to MAML, it achieves a much more graceful tradeoff between classification accuracy and computation efficiency.
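    A minimal sketch of the signSGD inner-loop adaptation described above; the function names, hyperparameters, and single-task structure are illustrative assumptions, not the authors' code.

```python
import copy
import torch

def signsgd_adapt(meta_model, loss_fn, support_x, support_y, inner_lr=0.01, steps=1):
    """First-order inner-loop adaptation with signSGD: each parameter moves by a
    fixed step in the direction of the sign of its gradient."""
    task_model = copy.deepcopy(meta_model)      # adapt a copy, keep the meta-model intact
    for _ in range(steps):
        loss = loss_fn(task_model(support_x), support_y)
        grads = torch.autograd.grad(loss, task_model.parameters())
        with torch.no_grad():
            for p, g in zip(task_model.parameters(), grads):
                p -= inner_lr * g.sign()        # signSGD update
    return task_model  # evaluate on the query set to form the outer (meta) loss
```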
    Towards Out-of-Distribution Detection with Divergence Guarantee in Deep Generative Models. (arXiv:2002.03328v4 [cs.LG] UPDATED)
    (0 min) Recent research has revealed that deep generative models, including flow-based models and variational autoencoders, may assign higher likelihood to out-of-distribution (OOD) data than in-distribution (ID) data. However, we cannot sample OOD data from the model. This counterintuitive phenomenon has not been satisfactorily explained. In this paper, we prove theorems to investigate the divergences in flow-based models and give two explanations of the above phenomenon from divergence and geometric perspectives, respectively. Based on our analysis, we propose two group anomaly detection methods. Furthermore, we decompose the KL divergence and propose a point-wise anomaly detection method. We have conducted extensive experiments on prevalent benchmarks to evaluate our methods. For group anomaly detection (GAD), our method can achieve near 100\% AUROC on all problems and has robustness against data manipulations. On the contrary, the state-of-the-art (SOTA) GAD method performs no better than random guessing for challenging problems and can be attacked by data manipulation in almost all cases. For point-wise anomaly detection (PAD), our method is comparable to the SOTA PAD method on one category of problems and outperforms the baseline significantly on another category of problems.
    Behavior of Keyword Spotting Networks Under Noisy Conditions. (arXiv:2109.07930v1 [eess.AS])
    (0 min) Keyword spotting (KWS) is becoming a ubiquitous need with the advancement in artificial intelligence and smart devices. Recent work in this field has focused on several different architectures to achieve good results on datasets with low to moderate noise. However, the performance of these models deteriorates under high noise conditions, as shown by our experiments. In our paper, we present an extensive comparison between state-of-the-art KWS networks under various noisy conditions. We also suggest adaptive batch normalization as a technique to improve the performance of the networks when the noise files are unknown during the training phase. The results of such high noise characterization enable future work in developing models that perform better under the aforementioned conditions.
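    One common reading of adaptive batch normalization is to re-estimate the BatchNorm running statistics on data from the new noise condition while keeping all learned weights frozen. The sketch below assumes that interpretation and a PyTorch model; it is not necessarily the paper's exact recipe.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_batchnorm_stats(model, noisy_loader):
    """Re-estimate BatchNorm running statistics on data from the new (noisy)
    condition while keeping all learned weights frozen."""
    model.eval()                       # freeze everything by default
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()    # forget the clean-training statistics
            m.train()                  # let forward passes update running mean/var
    for x, _ in noisy_loader:
        model(x)                       # forward passes only; weights never change
    model.eval()
    return model
```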
    SPIN Road Mapper: Extracting Roads from Aerial Images via Spatial and Interaction Space Graph Reasoning for Autonomous Driving. (arXiv:2109.07701v1 [cs.CV])
    (0 min) Road extraction is an essential step in building autonomous navigation systems. Detecting road segments is challenging as they are of varying widths, bifurcated throughout the image, and are often occluded by terrain, cloud, or other weather conditions. Using just convolutional neural networks (ConvNets) for this problem is not effective, as they are inefficient at capturing the distant dependencies between road segments in the image that are essential for extracting road connectivity. To this end, we propose a Spatial and Interaction Space Graph Reasoning (SPIN) module which, when plugged into a ConvNet, performs reasoning over graphs constructed on spatial and interaction spaces projected from the feature maps. Reasoning over spatial space extracts dependencies between different spatial regions and other contextual information. Reasoning over a projected interaction space helps in appropriate delineation of roads from other topographies present in the image. Thus, SPIN extracts long-range dependencies between road segments and effectively delineates roads from other semantics. We also introduce a SPIN pyramid which performs SPIN graph reasoning across multiple scales to extract multi-scale features. We propose a network based on stacked hourglass modules and SPIN pyramid for road segmentation which achieves better performance compared to existing methods. Moreover, our method is computationally efficient and significantly boosts the convergence speed during training, making it feasible to apply to large-scale high-resolution aerial images. Code available at: https://github.com/wgcban/SPIN_RoadMapper.git.
    Interpretable Additive Recurrent Neural Networks For Multivariate Clinical Time Series. (arXiv:2109.07602v1 [cs.LG])
    (0 min) Time series models with recurrent neural networks (RNNs) can have high accuracy but are unfortunately difficult to interpret as a result of feature-interactions, temporal-interactions, and non-linear transformations. Interpretability is important in domains like healthcare, where models must provide insight into the relationships they have learned in order to validate and trust their predictions. We want accurate time series models where users can understand the contribution of individual input features. We present the Interpretable-RNN (I-RNN) that balances model complexity and accuracy by forcing the relationship between variables in the model to be additive. Interactions are restricted between hidden states of the RNN and additively combined at the final step. I-RNN specifically captures the unique characteristics of clinical time series, which are unevenly sampled in time, asynchronously acquired, and have missing data. Importantly, the hidden state activations represent feature coefficients that correlate with the prediction target and can be visualized as risk curves that capture the global relationship between individual input features and the outcome. We evaluate the I-RNN model on the Physionet 2012 Challenge dataset to predict in-hospital mortality, and on a real-world clinical decision support task: predicting hemodynamic interventions in the intensive care unit. I-RNN provides explanations in the form of global and local feature importances comparable to highly intelligible models like decision trees trained on hand-engineered features while significantly outperforming them. I-RNN remains intelligible while providing accuracy comparable to state-of-the-art decay-based and interpolation-based recurrent time series models. The experimental results on real-world clinical datasets refute the myth that there is a tradeoff between accuracy and interpretability.
    Neural-network acceleration of projection-based model-order-reduction for finite plasticity: Application to RVEs. (arXiv:2109.07747v1 [cs.LG])
    (0 min) Compared to conventional projection-based model-order-reduction, its neural-network acceleration has the advantage that the online simulations are equation-free, meaning that no system of equations needs to be solved iteratively. Consequently, no stiffness matrix needs to be constructed and the stress update needs to be computed only once per increment. In this contribution, a recurrent neural network is developed to accelerate a projection-based model-order-reduction of the elastoplastic mechanical behaviour of an RVE. In contrast to a neural network that merely emulates the relation between the macroscopic deformation (path) and the macroscopic stress, the neural network acceleration of projection-based model-order-reduction preserves all microstructural information, at the price of computing this information once per increment.
    BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales. (arXiv:2109.07623v1 [cs.SD])
    (0 min) Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not explicitly integrate principles from music theory but rely on a complex learning model trained with a large amount of chorale data. We propose instead a new harmonization model, called BacHMMachine, which employs a "theory-driven" framework guided by music composition principles, along with a "data-driven" model for learning compositional features within this framework. As its name suggests, BacHMMachine uses a novel Hidden Markov Model based on key and chord transitions, providing a probabilistic framework for learning key modulations and chordal progressions from a given melodic line. This allows for the generation of creative, yet musically coherent chorale harmonizations; integrating compositional principles allows for a much simpler model that results in vast decreases in computational burden and greater interpretability compared to state-of-the-art algorithmic harmonization methods, at no penalty to quality of harmonization or musicality. We demonstrate this improvement via comprehensive experiments and Turing tests comparing BacHMMachine to existing methods.
    Super-resolution data assimilation. (arXiv:2109.08017v1 [physics.ao-ph])
    (0 min) Increasing the resolution of a model can improve the performance of a data assimilation system: first because model fields are in better agreement with high-resolution observations, then because the corrections are better sustained and, with ensemble data assimilation, because the forecast error covariances are improved. However, resolution increase is associated with a cubical increase of the computational costs. Here we test an approach inspired by image super-resolution techniques, called "Super-resolution data assimilation" (SRDA). Starting from a low-resolution forecast, a neural network (NN) emulates a high-resolution field that is then used to assimilate high-resolution observations. We apply the SRDA to a quasi-geostrophic model representing simplified surface ocean dynamics, with a model resolution up to four times lower than the reference high resolution, and we use the Ensemble Kalman Filter data assimilation method. We show that SRDA outperforms the low-resolution data assimilation approach and a SRDA version with cubic spline interpolation instead of NN. The NN's ability to anticipate the systematic differences between low and high resolution model dynamics explains the enhanced performance, for example by correcting the difference of propagation speed of eddies. Increasing the computational cost by 55\% above the LR data assimilation system (using a 25-member ensemble), the SRDA reduces the errors by 40\%, making the performance very close to the HR system (16\% larger, compared to 92\% larger for the LR EnKF). The reliability of the ensemble system is not degraded by SRDA.
    Machine-Learned HASDM Model with Uncertainty Quantification. (arXiv:2109.07651v1 [cs.LG])
    (0 min) The first thermospheric neutral mass density model with robust and reliable uncertainty estimates is developed based on the SET HASDM density database. This database, created by Space Environment Technologies (SET), contains 20 years of outputs from the U.S. Space Force's High Accuracy Satellite Drag Model (HASDM), which represents the state-of-the-art for density and drag modeling. We utilize principal component analysis (PCA) for dimensionality reduction, creating the coefficients upon which nonlinear machine-learned (ML) regression models are trained. These models use three unique loss functions: mean square error (MSE), negative logarithm of predictive density (NLPD), and continuous ranked probability score (CRPS). Three input sets are also tested, showing improved performance when introducing time histories for geomagnetic indices. These models leverage Monte Carlo (MC) dropout to provide uncertainty estimates, and the use of the NLPD loss function results in well-calibrated uncertainty estimates without sacrificing model accuracy (<10% mean absolute error). By comparing the best HASDM-ML model to the HASDM database along satellite orbits, we found that the model provides robust and reliable uncertainties in the density space over all space weather conditions. A storm-time comparison shows that HASDM-ML also supplies meaningful uncertainty measurements during extreme events.
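    A generic sketch of the Monte Carlo dropout uncertainty estimate mentioned above, assuming a PyTorch model containing dropout layers; this is not the HASDM-ML code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=100):
    """Monte Carlo dropout: keep dropout active at inference time and treat the
    spread of repeated stochastic forward passes as an uncertainty estimate."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                                 # keep dropout sampling on
    samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)    # predictive mean and spread
```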
    Short Quantum Circuits in Reinforcement Learning Policies for the Vehicle Routing Problem. (arXiv:2109.07498v1 [quant-ph])
    (0 min) Quantum computing and machine learning have potential for symbiosis. However, in addition to the hardware limitations from current devices, there are still basic issues that must be addressed before quantum circuits can be usefully incorporated into current machine learning tasks. We report a new strategy for such an integration in the context of attention models used for reinforcement learning. Agents that implement attention mechanisms have successfully been applied to certain cases of combinatorial routing problems by first encoding nodes on a graph and then sequentially decoding nodes until a route is selected. We demonstrate that simple quantum circuits can be used in place of classical attention head layers while maintaining performance. Our method modifies the networks used in [1] by replacing key and query vectors for every node with quantum states that are entangled before being measured. The resulting hybrid classical-quantum agent is tested in the context of vehicle routing problems where its performance is competitive with the original classical approach. We regard our model as a prototype that can be scaled up and as an avenue for further study on the role of quantum computing in reinforcement learning.
    Predicting the outcome of team movements -- Player time series analysis using fuzzy and deep methods for representation learning. (arXiv:2109.07570v1 [cs.LG])
    (0 min) We extract and use player position time-series data, tagged along with the action types, to build a competent model for representing teams' tactical behavioral patterns, and use this representation to predict the outcome of arbitrary movements. We provide a framework for the useful encoding of short tactics and space occupations in a more extended sequence of movements or tactical plans. We investigate game segments during a single match in which the team in possession of the ball regularly attempts to reach a position where they can take a shot at goal. A carefully designed and efficient kernel is employed using a triangular fuzzy membership function to create multiple time series for players' potential of presence at different court regions. Unsupervised representation learning is then applied to the derived multivariate time series using a triplet loss and deep neural networks with exponentially dilated causal convolutions. This work's key contribution lies in its approach to modeling how short scenes contribute to other, longer ones and how players occupy and create new spaces on the court. We discuss the effectiveness of the proposed approach for prediction and recognition tasks on the professional basketball SportVU dataset for the 2015-16 half-season. The proposed system demonstrates decent performance even with relatively small data.
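    A minimal sketch of a triangular fuzzy membership function for a player's "potential of presence" at court regions; the radius and distance inputs are illustrative assumptions, and the paper's exact kernel parameterisation may differ.

```python
import numpy as np

def triangular_membership(d, radius):
    """Triangular fuzzy membership: 1 at the region centre, decaying linearly
    to 0 at `radius`. `d` is the player's distance to a court-region centre."""
    return np.clip(1.0 - np.abs(d) / radius, 0.0, 1.0)

# Example: a player's potential of presence for three court regions.
distances = np.array([0.5, 2.0, 4.5])                 # metres to each region centre
print(triangular_membership(distances, radius=3.0))   # ~ [0.83, 0.33, 0.0]
```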
    An End-to-End Transformer Model for 3D Object Detection. (arXiv:2109.08141v1 [cs.CV])
    (0 min) We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. Nevertheless, 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. Through extensive experiments, we show 3DETR outperforms the well-established and highly optimized VoteNet baselines on the challenging ScanNetV2 dataset by 9.5%. Furthermore, we show 3DETR is applicable to 3D tasks beyond detection, and can serve as a building block for future research.
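    One common construction of Fourier positional embeddings for point coordinates is random Fourier features, sketched below; the number of bands and frequency scale are illustrative assumptions, and 3DETR's exact variant may differ in detail.

```python
import torch

def fourier_positional_embedding(xyz, num_bands=32, sigma=1.0, generator=None):
    """Random Fourier features for 3D point coordinates: project points onto
    random frequencies and take sin/cos of the projections."""
    B = sigma * torch.randn(3, num_bands, generator=generator)    # random frequency matrix
    proj = 2 * torch.pi * xyz @ B                                 # (N, num_bands)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (N, 2 * num_bands)

points = torch.rand(1024, 3)                 # a toy point cloud
emb = fourier_positional_embedding(points)   # per-point embedding fed to the Transformer
```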
    Quality-aware Cine Cardiac MRI Reconstruction and Analysis from Undersampled k-space Data. (arXiv:2109.07955v1 [eess.IV])
    (0 min) Cine cardiac MRI is routinely acquired for the assessment of cardiac health, but the imaging process is slow and typically requires several breath-holds to acquire sufficient k-space profiles to ensure good image quality. Several undersampling-based reconstruction techniques have been proposed during the last decades to speed up cine cardiac MRI acquisition. However, the undersampling factor is commonly fixed to conservative values before acquisition to ensure diagnostic image quality, potentially leading to unnecessarily long scan times. In this paper, we propose an end-to-end quality-aware cine short-axis cardiac MRI framework that combines image acquisition and reconstruction with downstream tasks such as segmentation, volume curve analysis and estimation of cardiac functional parameters. The goal is to reduce scan time by acquiring only a fraction of k-space data to enable the reconstruction of images that can pass quality control checks and produce reliable estimates of cardiac functional parameters. The framework consists of a deep learning model for the reconstruction of 2D+t cardiac cine MRI images from undersampled data, an image quality-control step to detect good quality reconstructions, followed by a deep learning model for bi-ventricular segmentation, a quality-control step to detect good quality segmentations and automated calculation of cardiac functional parameters. To demonstrate the feasibility of the proposed approach, we perform simulations using a cohort of selected participants from the UK Biobank (n=270), 200 healthy subjects and 70 patients with cardiomyopathies. Our results show that we can produce quality-controlled images in a scan time reduced from 12 to 4 seconds per slice, enabling reliable estimates of cardiac functional parameters such as ejection fraction within 5% mean absolute error.
    Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns. (arXiv:2109.07723v1 [cs.LG])
    (0 min) Recent studies demonstrated the vulnerability of control policies learned through deep reinforcement learning against adversarial attacks, raising concerns about the application of such models to risk-sensitive tasks such as autonomous driving. Threat models for these demonstrations are limited to (1) targeted attacks through real-time manipulation of the agent's observation, and (2) untargeted attacks through manipulation of the physical environment. The former assumes full access to the agent's states/observations at all times, while the latter has no control over attack outcomes. This paper investigates the feasibility of targeted attacks through visually learned patterns placed on physical objects in the environment, a threat model that combines the practicality and effectiveness of the existing ones. Through analysis, we demonstrate that a pre-trained policy can be hijacked within a time window, e.g., performing an unintended self-parking, when an adversarial object is present. To enable the attack, we adopt an assumption that the dynamics of both the environment and the agent can be learned by the attacker. Lastly, we empirically show the effectiveness of the proposed attack on different driving scenarios, perform a location robustness test, and study the tradeoff between the attack strength and its effectiveness.
    Class-Similarity Based Label Smoothing for Confidence Calibration. (arXiv:2006.14028v2 [cs.LG] UPDATED)
    (0 min) Generating confidence calibrated outputs is of utmost importance for the applications of deep neural networks in safety-critical decision-making systems. The output of a neural network is a probability distribution where the scores are estimated confidences of the input belonging to the corresponding classes, and hence they represent a complete estimate of the output likelihood relative to all classes. In this paper, we propose a novel form of label smoothing to improve confidence calibration. Since different classes are of different intrinsic similarities, more similar classes should result in closer probability values in the final output. This motivates the development of a new smooth label where the label values are based on similarities with the reference class. We adopt different similarity measurements, including those that capture feature-based similarities or semantic similarity. We demonstrate through extensive experiments, on various datasets and network architectures, that our approach consistently outperforms state-of-the-art calibration techniques including uniform label smoothing.
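    A minimal sketch of similarity-based label smoothing: the uniform smoothing mass is replaced by a distribution proportional to each class's similarity to the true class. The class-embedding source, temperature, and smoothing weight below are illustrative assumptions, not the paper's exact choices.

```python
import torch

def class_similarity_smooth_labels(targets, class_embeddings, alpha=0.1, tau=1.0):
    """Build soft labels where the smoothing mass alpha is shared among the
    other classes in proportion to their similarity to the true class.
    `class_embeddings` could be, e.g., mean features or word vectors per class."""
    emb = torch.nn.functional.normalize(class_embeddings, dim=1)
    sim = emb @ emb.t() / tau                        # (C, C) cosine similarities
    sim.fill_diagonal_(float('-inf'))                # exclude the true class itself
    smooth = torch.softmax(sim, dim=1)               # how to share the smoothing mass
    num_classes = class_embeddings.size(0)
    labels = torch.zeros(targets.size(0), num_classes)
    labels[torch.arange(targets.size(0)), targets] = 1.0 - alpha
    labels += alpha * smooth[targets]                # similarity-weighted off-target mass
    return labels                                    # rows sum to 1; use with a soft-label CE loss
```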
    Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via Generative Models. (arXiv:2108.12445v2 [cs.LG] UPDATED)
    (0 min) The commonly used latent space embedding techniques, such as Principal Component Analysis, Factor Analysis, and manifold learning techniques, are typically used for learning effective representations of homogeneous data. However, they do not readily extend to heterogeneous data that are a combination of numerical and categorical variables, e.g., arising from linked GPS and text data. In this paper, we are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion. The learned generative model provides latent unified representations that capture the factors common to the multiple dimensions of the data, and thus enable fusing multimodal data for various machine learning tasks. Following a Bayesian approach, we propose a general framework that combines disparate data types through the natural parameterization of the exponential family of distributions. To scale the model inference to millions of instances with thousands of features, we use the Laplace-Bernstein approximation for posterior computations involving nonlinear link functions. The proposed algorithm is presented in detail for the commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features. Experiments on two high-dimensional and heterogeneous datasets (NYC Taxi and MovieLens-10M) demonstrate the scalability and competitive performance of the proposed algorithm on different machine learning tasks such as anomaly detection, data imputation, and recommender systems.
    Regex Queries over Incomplete Knowledge Bases. (arXiv:2005.00480v2 [cs.CL] UPDATED)
    (0 min) We propose the novel task of answering regular expression queries (containing disjunction ($\vee$) and Kleene plus ($+$) operators) over incomplete KBs. The answer set of these queries potentially has a large number of entities, hence previous works for single-hop queries in KBC that model a query as a point in high-dimensional space are not as effective. In response, we develop RotatE-Box -- a novel combination of RotatE and box embeddings. It can model more relational inference patterns compared to existing embedding based models. Furthermore, we define baseline approaches for embedding based KBC models to handle regex operators. We demonstrate performance of RotatE-Box on two new regex-query datasets introduced in this paper, including one where the queries are harvested based on actual user query logs. We find that our final RotatE-Box model significantly outperforms models based on just RotatE and just box embeddings.
    Explainability Requires Interactivity. (arXiv:2109.07869v1 [cs.LG])
    (0 min) When explaining the decisions of deep neural networks, simple stories are tempting but dangerous. Especially in computer vision, the most popular explanation approaches give a false sense of comprehension to their users and provide an overly simplistic picture. We introduce an interactive framework to understand the highly complex decision boundaries of modern vision models. It allows the user to exhaustively inspect, probe, and test a network's decisions. Across a range of case studies, we compare the power of our interactive approach to static explanation methods, showing how these can lead a user astray, with potentially severe consequences.
    Computationally-Efficient Climate Predictions using Multi-Fidelity Surrogate Modelling. (arXiv:2109.07468v1 [physics.ao-ph])
    (0 min) Accurately modelling the Earth's climate has widespread applications ranging from forecasting local weather to understanding global climate change. Low-fidelity simulations of climate phenomena are readily available, but high-fidelity simulations are expensive to obtain. We therefore investigate the potential of Gaussian process-based multi-fidelity surrogate modelling as a way to produce high-fidelity climate predictions at low cost. Specifically, our model combines the predictions of a low-fidelity Global Climate Model (GCM) and those of a high-fidelity Regional Climate Model (RCM) to produce high-fidelity temperature predictions for a mountainous region on the coastline of Peru. We are able to produce high-fidelity temperature predictions at significantly lower computational cost compared to the high-fidelity model alone: our predictions have an average error of $15.62^\circ\text{C}^2$ yet our approach only evaluates the high-fidelity model on 6% of the region of interest.
    Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs. (arXiv:2109.07710v1 [cs.LG])
    (0 min) Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies, achieving high accuracy on computer vision workloads such as image classification and object detection. However, training these models involving large parameters is both time-consuming and energy-hogging. In this regard, several prior works have advocated for sparsity to speed up DL training and, even more so, the inference phase. This work begins with the observation that, during training, the sparsities of the forward and backward passes are correlated. In that context, we investigate two types of sparsity (input and output type) inherent in gradient descent-based optimization algorithms and propose a hardware micro-architecture to leverage the same. Our experimental results use five state-of-the-art CNN models on the Imagenet dataset, and show backpropagation speedups in the range of 1.69$\times$ to 5.43$\times$, compared to the dense baseline execution. By exploiting sparsity in both the forward and backward passes, speedup improvements range from 1.68$\times$ to 3.30$\times$ over the sparsity-agnostic baseline execution. Our work also achieves significant reduction in training iteration time over several previously proposed dense as well as sparse accelerator based platforms, in addition to achieving order-of-magnitude energy-efficiency improvements over GPU based execution.
    Scaling Laws for Neural Machine Translation. (arXiv:2109.07740v1 [cs.LG])
    (0 min) We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we show that the total number of parameters alone is not sufficient for such purposes. (ii) We observe different power law exponents when scaling the decoder vs scaling the encoder, and provide recommendations for optimal allocation of encoder/decoder capacity based on this observation. (iii) We also report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (either via machine generated or human translated text). We observe that natural text on the target side enjoys scaling, which manifests as successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets which were originally translated from target language to source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from source language to target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study.
    An Overview and Case Study of the Clinical AI Model Development Life Cycle for Healthcare Systems. (arXiv:2003.07678v3 [cs.CY] CROSS LISTED)
    (0 min) Healthcare is one of the most promising areas for machine learning models to make a positive impact. However, successful adoption of AI-based systems in healthcare depends on engaging and educating stakeholders from diverse backgrounds about the development process of AI models. We present a broadly accessible overview of the development life cycle of clinical AI models that is general enough to be adapted to most machine learning projects, and then give an in-depth case study of the development process of a deep learning based system to detect aortic aneurysms in Computed Tomography (CT) exams. We hope other healthcare institutions and clinical practitioners find the insights we share about the development process useful in informing their own model development efforts and to increase the likelihood of successful deployment and integration of AI in healthcare.
    Spectral estimation from simulations via sketching. (arXiv:2007.11026v2 [stat.ML] UPDATED)
    (0 min) Sketching is a stochastic dimension reduction method that preserves geometric structures of data and has applications in high-dimensional regression, low rank approximation and graph sparsification. In this work, we show that sketching can be used to compress simulation data and still accurately estimate time autocorrelation and power spectral density. For a given compression ratio, the accuracy is much higher than using previously known methods. In addition to providing theoretical guarantees, we apply sketching to a molecular dynamics simulation of methanol and find that the estimate of spectral density is 90% accurate using only 10% of the data.
    On-the-Fly Ensemble Pruning in Evolving Data Streams. (arXiv:2109.07611v1 [cs.LG])
    (0 min) Ensemble pruning is the process of selecting a subset of component classifiers from an ensemble which performs at least as well as the original ensemble while reducing storage and computational costs. Ensemble pruning in data streams is a largely unexplored area of research. It requires analysis of ensemble components as they are running on the stream, and differentiation of useful classifiers from redundant ones. We present CCRP, an on-the-fly ensemble pruning method for multi-class data stream classification empowered by an imbalance-aware fusion of class-wise component rankings. CCRP aims to ensure that the resulting pruned ensemble contains the best-performing classifier for each target class and hence reduces the effects of class imbalance. The conducted experiments on real-world and synthetic data streams demonstrate that different types of ensembles that integrate CCRP as their pruning scheme consistently yield on par or superior performance with 20% to 90% less average memory consumption. Lastly, we validate the proposed pruning scheme by comparing our approach against pruning schemes based on ensemble weights and basic rank fusion methods.
    A literature survey on student feedback assessment tools and their usage in sentiment analysis. (arXiv:2109.07904v1 [cs.CY])
    (0 min) Online learning is becoming increasingly popular, whether for convenience, to accommodate work hours, or simply to have the freedom to study from anywhere. Especially during the Covid-19 pandemic, it has become the only viable option for learning. The effectiveness of teaching various hard-core programming courses with a mix of theoretical content is determined by the student interaction and responses. In contrast to a digital lecture through Zoom or Teams, a lecturer may rapidly acquire such responses from students' facial expressions, behavior, and attitude in a physical session, even if the listener is largely idle and non-interactive. However, student assessment in virtual learning is a challenging task. Despite the challenges, different technologies are progressively being integrated into teaching environments to boost student engagement and motivation. In this paper, we evaluate the effectiveness of various in-class feedback assessment methods such as Kahoot!, Mentimeter, Padlet, and polling to assist a lecturer in obtaining real-time feedback from students throughout a session and adapting the teaching style accordingly. Furthermore, some of the topics covered by student suggestions include tutor suggestions, enhancing teaching style, course content, and other subjects. Any input gives the instructor valuable insight into how to improve the student's learning experience; however, manually going through all of the qualitative comments and extracting the ideas is tedious. Thus, in this paper, we propose a sentiment analysis model for extracting the explicit suggestions from the students' qualitative feedback comments.
    Optimal Probing with Statistical Guarantees for Network Monitoring at Scale. (arXiv:2109.07743v1 [cs.LG])
    (0 min) Cloud networks are difficult to monitor because they grow rapidly and the budgets for monitoring them are limited. We propose a framework for estimating network metrics, such as latency and packet loss, with guarantees on estimation errors for a fixed monitoring budget. Our proposed algorithms, which are based on A- and E-optimal experimental designs in statistics, produce a distribution of probes across network paths, which we then monitor. Unfortunately, these designs are too computationally costly to use at production scale. We propose their scalable and near-optimal approximations based on the Frank-Wolfe algorithm. We validate our approaches in simulation on real network topologies, and also using a production probing system in a real cloud network. We show major gains in reducing the probing budget compared to both production and academic baselines, while maintaining low estimation errors, even with very low probing budgets.
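    To illustrate the experimental-design idea, the sketch below runs plain Frank-Wolfe on a toy A-optimal objective (the trace of the inverse information matrix) over a simplex of probe weights; the feature matrix, ridge term, and step schedule are illustrative assumptions and omit the scalability considerations of the paper.

```python
import numpy as np

def frank_wolfe_a_optimal(X, n_iter=200, ridge=1e-6):
    """Frank-Wolfe sketch for an A-optimal probing distribution: choose weights w
    over candidate paths (rows of X) to minimise trace((sum_i w_i x_i x_i^T)^{-1})."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                              # start from the uniform design
    for t in range(n_iter):
        M = X.T @ (w[:, None] * X) + ridge * np.eye(d)   # information matrix
        Minv2 = np.linalg.inv(M) @ np.linalg.inv(M)
        # Gradient of trace(M^{-1}) w.r.t. w_i is -x_i^T M^{-2} x_i.
        grad = -np.einsum('ij,jk,ik->i', X, Minv2, X)
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0                         # linear-minimisation oracle: a simplex vertex
        gamma = 2.0 / (t + 2.0)                          # standard Frank-Wolfe step-size schedule
        w = (1 - gamma) * w + gamma * s
    return w                                             # fraction of probes per path
```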
    Personalized Federated Learning for Heterogeneous Clients with Clustered Knowledge Transfer. (arXiv:2109.08119v1 [cs.LG])
    (0 min) Personalized federated learning (FL) aims to train model(s) that can perform well for individual clients that are highly data and system heterogeneous. Most work in personalized FL, however, assumes the same model architecture at all clients and increases the communication cost by sending/receiving models. This may not be feasible for realistic scenarios of FL. In practice, clients have highly heterogeneous system-capabilities and limited communication resources. In our work, we propose a personalized FL framework, PerFed-CKT, where clients can use heterogeneous model architectures and do not directly communicate their model parameters. PerFed-CKT uses clustered co-distillation, where clients use logits to transfer their knowledge to other clients that have similar data-distributions. We theoretically show the convergence and generalization properties of PerFed-CKT and empirically show that PerFed-CKT achieves high test accuracy with several orders of magnitude lower communication cost compared to the state-of-the-art personalized FL schemes.
    Predicting Users' Value Changes by the Friends' Influence from Social Media Usage. (arXiv:2109.08021v1 [cs.SI])
    (0 min) Basic human values represent a set of values such as security, independence, success, kindness, and pleasure, which we deem important to our lives. Each of us holds different values with different degrees of significance. Existing studies show that the values of a person can be identified from their social network usage. However, the value priority of a person may change over time due to different factors such as life experiences, influence, social structure and technology. Existing studies do not analyze how users' values change under social influence, i.e., group persuasion, from social media usage. In our research, first, we predict users' value scores from the influence of friends in their social media usage. We propose a Bounded Confidence Model (BCM) based value dynamics model, built from 275 different ego networks on Facebook, that predicts how social influence may persuade a person to change their values over time. Then, to improve predictions, we use a particle swarm optimization based hyperparameter tuning technique. We observe that these optimized hyperparameters produce accurate future value scores. We also run our approach with different machine learning based methods and find that support vector regression (SVR) outperforms other regression models. Using SVR with the best hyperparameters of the BCM model, we obtain the lowest Mean Squared Error (MSE) of 0.00347.
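    A generic sketch of one Bounded Confidence Model (BCM) update on an ego network; the confidence bound and convergence rate below are illustrative, whereas in the paper such hyperparameters are tuned with particle swarm optimization.

```python
import numpy as np

def bcm_update(values, adjacency, epsilon=0.2, mu=0.3):
    """One Bounded Confidence Model step: a user moves their value score toward
    the average of friends whose scores are within a confidence bound epsilon."""
    new_values = values.copy()
    for i in range(len(values)):
        friends = np.where(adjacency[i] > 0)[0]
        close = friends[np.abs(values[friends] - values[i]) <= epsilon]
        if len(close) > 0:
            new_values[i] += mu * (values[close].mean() - values[i])
    return new_values

# Toy ego network: user 0 connected to users 1-3.
A = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])
v = np.array([0.50, 0.60, 0.55, 0.95])       # value scores in [0, 1]
print(bcm_update(v, A))                      # user 0 is pulled toward users 1 and 2, not 3
```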
    End-to-End Partially Observable Visual Navigation in a Diverse Environment. (arXiv:2109.07752v1 [cs.RO])
    (0 min) How can a robot navigate successfully in a rich and diverse environment, indoors or outdoors, along an office corridor or a trail in the park, on flat ground, a staircase, or an elevator, etc.? To this end, this work aims at three challenges: (i) complex visual observations, (ii) partial observability of local sensing, and (iii) multimodal navigation behaviors that depend on both the local environment and the high-level goal. We propose a novel neural network (NN) architecture to represent a local controller and leverage the flexibility of the end-to-end approach to learn a powerful policy. To tackle complex visual observations, we extract multiscale spatial information through convolution layers. To deal with partial observability, we encode rich history information in LSTM-like modules. Importantly, we integrate the two into a single unified architecture that exploits convolutional memory cells to track the observation history at multiple spatial scales, which can capture the complex spatiotemporal dependencies between observations and controls. We additionally condition the network on the high-level goal in order to generate different navigation behavior modes. Specifically, we propose to use independent memory cells for different modes to prevent mode collapse in the learned policy. We implemented the NN controller on the SPOT robot and evaluated it on three challenging tasks with partial observations: adversarial pedestrian avoidance, blind-spot obstacle avoidance, and elevator riding. Our model significantly outperforms CNNs, conventional LSTMs, and ablated versions of our model. A demo video will be publicly available, showing our SPOT robot traversing many different locations on our university campus.
    Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views. (arXiv:2109.07945v1 [cs.CV])
    (0 min) We present a system for automatic converting of 2D mask object predictions and raw LiDAR point clouds into full 3D bounding boxes of objects. Because the LiDAR point clouds are partial, directly fitting bounding boxes to the point clouds is meaningless. Instead, we suggest that obtaining good results requires sharing information between \emph{all} objects in the dataset jointly, over multiple frames. We then make three improvements to the baseline. First, we address ambiguities in predicting the object rotations via direct optimization in this space while still backpropagating rotation prediction through the model. Second, we explicitly model outliers and task the network with learning their typical patterns, thus better discounting them. Third, we enforce temporal consistency when video data is available. With these contributions, our method significantly outperforms previous work despite the fact that those methods use significantly more complex pipelines, 3D models and additional human-annotated external sources of prior information.
    Differentiable Physics: A Position Piece. (arXiv:2109.07573v1 [cs.LG])
    (0 min) Differentiable physics provides a new approach for modeling and understanding the physical systems by pairing the new technology of differentiable programming with classical numerical methods for physical simulation. We survey the rapidly growing literature of differentiable physics techniques and highlight methods for parameter estimation, learning representations, solving differential equations, and developing what we call scientific foundation models using data and inductive priors. We argue that differentiable physics offers a new paradigm for modeling physical phenomena by combining classical analytic solutions with numerical methodology using the bridge of differentiable programming.
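    As a small, self-contained illustration of the differentiable-physics idea (not an example from the paper), the snippet below rolls out a damped spring with simple Euler steps and differentiates the final position with respect to the stiffness via automatic differentiation.

```python
import torch

# Unknown physical parameter we want gradients for.
k = torch.tensor(2.0, requires_grad=True)     # spring stiffness
x, v = torch.tensor(1.0), torch.tensor(0.0)   # initial position and velocity
dt, damping = 0.01, 0.1

for _ in range(500):                          # differentiable simulation loop
    a = -k * x - damping * v                  # F = -k*x - c*v (unit mass)
    v = v + dt * a
    x = x + dt * v

target = torch.tensor(0.3)                    # e.g. an observed final position
loss = (x - target) ** 2
loss.backward()                               # gradient flows through the whole rollout
print(k.grad)                                 # use with any gradient-based optimiser
```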
    Detection Accuracy for Evaluating Compositional Explanations of Units. (arXiv:2109.07804v1 [cs.LG])
    (0 min) The recent success of deep learning models in solving complex problems and in different domains has increased interest in understanding what they learn. Therefore, different approaches have been employed to explain these models, one of which uses human-understandable concepts as explanations. Two examples of methods that use this approach are Network Dissection and Compositional explanations. The former explains units using atomic concepts, while the latter makes explanations more expressive, replacing atomic concepts with logical forms. While intuitively, logical forms are more informative than atomic concepts, it is not clear how to quantify this improvement, and their evaluation is often based on the same metric that is optimized during the search-process and on the usage of hyper-parameters to be tuned. In this paper, we propose to use as evaluation metric the Detection Accuracy, which measures units' consistency of detection of their assigned explanations. We show that this metric (1) evaluates explanations of different lengths effectively, (2) can be used as a stopping criterion for the compositional explanation search, eliminating the explanation length hyper-parameter, and (3) exposes new specialized units whose length 1 explanations are the perceptual abstractions of their longer explanations.
    Subspace Learning for Personalized Federated Optimization. (arXiv:2109.07628v1 [cs.LG])
    (0 min) As data is generated and stored almost everywhere, learning a model from a data-decentralized setting is a task of interest for many AI-driven service providers. Although federated learning has settled in as the main solution in such situations, there is still room for improvement in terms of personalization. Training federated learning systems usually focuses on optimizing a global model that is identically deployed to all client devices. However, a single global model is not sufficient to personalize performance for each client, as local data is assumed to be non-identically distributed across clients. We propose a method to address this situation through the lens of ensemble learning based on the construction of a low-loss subspace continuum that generates a high-accuracy ensemble of two endpoints (i.e. global model and local model). We demonstrate that our method achieves consistent gains both in personalized and unseen client evaluation settings through extensive experiments on several standard benchmark datasets.
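    A minimal sketch of evaluating a point on the segment between the global and local model parameters; the paper additionally trains the connection so the whole segment stays low-loss, and the model names and ensembling choice below are illustrative.

```python
import copy
import torch

def interpolate_models(global_model, local_model, lam):
    """Evaluate a point on the segment between the global and local parameters,
    theta(lam) = (1 - lam) * theta_global + lam * theta_local."""
    mixed = copy.deepcopy(global_model)
    with torch.no_grad():
        for p_mix, p_g, p_l in zip(mixed.parameters(),
                                   global_model.parameters(),
                                   local_model.parameters()):
            p_mix.copy_((1.0 - lam) * p_g + lam * p_l)
    return mixed

# Ensemble the two endpoints (lam = 0 and lam = 1), or any interior point, e.g.:
# preds = 0.5 * (global_model(x) + local_model(x))
```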
    Learning the Regularization in DCE-MR Image Reconstruction for Functional Imaging of Kidneys. (arXiv:2109.07548v1 [cs.LG])
    (0 min) Kidney DCE-MRI aims at both qualitative assessment of kidney anatomy and quantitative assessment of kidney function by estimating the tracer kinetic (TK) model parameters. Accurate estimation of TK model parameters requires an accurate measurement of the arterial input function (AIF) with high temporal resolution. Accelerated imaging is used to achieve high temporal resolution, which yields under-sampling artifacts in the reconstructed images. Compressed sensing (CS) methods offer a variety of reconstruction options. Most commonly, sparsity of temporal differences is encouraged for regularization to reduce artifacts. Increasing regularization in CS methods removes the ambient artifacts but also over-smooths the signal temporally which reduces the parameter estimation accuracy. In this work, we propose a single image trained deep neural network to reduce MRI under-sampling artifacts without reducing the accuracy of functional imaging markers. Instead of regularizing with a penalty term in optimization, we promote regularization by generating images from a lower dimensional representation. In this manuscript we motivate and explain the lower dimensional input design. We compare our approach to CS reconstructions with multiple regularization weights. Proposed approach results in kidney biomarkers that are highly correlated with the ground truth markers estimated using the CS reconstruction which was optimized for functional analysis. At the same time, the proposed approach reduces the artifacts in the reconstructed images.
    Multi-Task Learning with Sequence-Conditioned Transporter Networks. (arXiv:2109.07578v1 [cs.LG])
    (0 min) Enabling robots to solve multiple manipulation tasks has a wide range of industrial applications. While learning-based approaches enjoy flexibility and generalizability, scaling these approaches to solve such compositional tasks remains a challenge. In this work, we aim to solve multi-task learning through the lens of sequence-conditioning and weighted sampling. First, we propose a new benchmark suite specifically aimed at compositional tasks, MultiRavens, which allows defining custom task combinations through task modules that are inspired by industrial tasks and exemplify the difficulties in vision-based learning and planning methods. Second, we propose a vision-based end-to-end system architecture, Sequence-Conditioned Transporter Networks, which augments Goal-Conditioned Transporter Networks with sequence-conditioning and weighted sampling and can efficiently learn to solve multi-task long horizon problems. Our analysis suggests that not only does the new framework significantly improve pick-and-place performance on the 10 novel multi-task benchmark problems, but that multi-task learning with weighted sampling can also vastly improve learning and agent performance on individual tasks.
    RaWaNet: Enriching Graph Neural Network Input via Random Walks on Graphs. (arXiv:2109.07555v1 [stat.ML])
    (0 min) In recent years, graph neural networks (GNNs) have gained increasing popularity and have shown very promising results for data that are represented by graphs. The majority of GNN architectures are designed based on developing new convolutional and/or pooling layers that better extract the hidden and deeper representations of the graphs to be used for different prediction tasks. The inputs to these layers are mainly the three default descriptors of a graph, node features $(X)$, adjacency matrix $(A)$, and edge features $(W)$ (if available). To provide a more enriched input to the network, we propose a random walk data processing of the graphs based on three selected lengths, namely (regular) walks of lengths 1 and 2 and a fractional walk of length $\gamma \in (0,1)$, in order to capture the different local and global dynamics on the graphs. We also calculate the stationary distribution of each random walk, which is then used as a scaling factor for the initial node features ($X$). This way, for each graph, the network receives multiple adjacency matrices along with their individual weighting for the node features. We test our method on various molecular datasets by passing the processed node features to the network in order to perform several classification and regression tasks. Interestingly, our method, which does not use edge features that are heavily exploited in molecular graph learning, lets a shallow network outperform well-known deep GNNs.
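    As a rough illustration of this preprocessing, the Python/NumPy sketch below builds length-1, length-2, and fractional walks from an adjacency matrix and scales the node features by the stationary distribution; the exact normalization and fractional-walk definition in the paper may differ, so every choice here is an assumption.
        import numpy as np

        def random_walk_inputs(A, X, gamma=0.5):
            # Sketch of RaWaNet-style preprocessing; assumes an undirected graph,
            # a row-normalized transition matrix, and a fractional walk defined as
            # a matrix power of that transition matrix (illustrative choices).
            deg = A.sum(axis=1)
            P = A / np.maximum(deg[:, None], 1e-12)    # simple random walk transition matrix
            walk1 = P                                   # length-1 walk
            walk2 = P @ P                               # length-2 walk
            w, V = np.linalg.eig(P)                     # fractional walk via eigendecomposition
            walk_frac = np.real(V @ np.diag(np.power(w.astype(complex), gamma)) @ np.linalg.inv(V))
            pi = deg / deg.sum()                        # stationary distribution (degree-proportional)
            X_scaled = pi[:, None] * X                  # scale initial node features
            return [walk1, walk2, walk_frac], X_scaled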
    Modern Cybersecurity Solution using Supervised Machine Learning. (arXiv:2109.07593v1 [cs.CR])
    (0 min) Cybersecurity is essential, and attacks are rapidly growing and getting more challenging to detect. Traditional firewalls and intrusion detection systems, even though widely used and recommended, fail to detect new attacks, zero-day attacks, and traffic patterns that do not match any configured rules. Therefore, Machine Learning (ML) can be an efficient and cost-effective solution in cybersecurity. We used Netflow datasets to extract features after applying data analysis. Then, a selection process was applied to compare these features with one another. Our experiments focus on how efficiently machine learning algorithms can detect Bot traffic, Malware traffic, and background traffic. We obtained a precision of 0.903 on a dataset that has 6.5% Bot flows, 1.57% Normal flows, 0.18% Command&Control (C&C) flows, and 91.7% background flows, out of 2,753,884 total flows. The results show low false-negative rates with few false-positive detections.
    Directed degree corrected mixed membership model and estimating community memberships in directed networks. (arXiv:2109.07826v1 [stat.ML])
    (0 min) This paper considers the problem of modeling and estimating community memberships of nodes in a directed network where every row (column) node is associated with a vector determining its membership in each row (column) community. To model such a directed network, we propose the directed degree corrected mixed membership (DiDCMM) model by considering degree heterogeneity. DiDCMM is identifiable under popular conditions for mixed membership networks when considering degree heterogeneity. Based on the cone structure inherent in the normalized version of the left singular vectors and the simplex structure inherent in the right singular vectors of the population adjacency matrix, we build an efficient algorithm called DiMSC to infer the community membership vectors for both row nodes and column nodes. By taking advantage of an equivalence algorithm that returns the same estimates as DiMSC, together with recent developments on row-wise singular vector deviation, we show that the proposed algorithm is asymptotically consistent under mild conditions by providing error bounds for the inferred membership vectors of each row node and each column node under DiDCMM. The theory is supplemented by a simulation study.
    Pedestrian Trajectory Prediction with Convolutional Neural Networks. (arXiv:2010.05796v2 [cs.CV] UPDATED)
    (0 min) Predicting the future trajectories of pedestrians is a challenging problem that has a range of applications, from crowd surveillance to autonomous driving. In the literature, methods to approach pedestrian trajectory prediction have evolved, transitioning from physics-based models to data-driven models based on recurrent neural networks. In this work, we propose a new approach to pedestrian trajectory prediction, with the introduction of a novel 2D convolutional model. This new model outperforms recurrent models, and it achieves state-of-the-art results on the ETH and TrajNet datasets. We also present an effective system to represent pedestrian positions and powerful data augmentation techniques, such as the addition of Gaussian noise and the use of random rotations, which can be applied to any model. As an additional exploratory analysis, we present experimental results on the inclusion of occupancy methods to model social information, which empirically show that these methods are ineffective in capturing social interaction.
    Field Convolutions for Surface CNNs. (arXiv:2104.03916v2 [cs.CV] UPDATED)
    (0 min) We present a novel surface convolution operator acting on vector fields that is based on a simple observation: instead of combining neighboring features with respect to a single coordinate parameterization defined at a given point, we have every neighbor describe the position of the point within its own coordinate frame. This formulation combines intrinsic spatial convolution with parallel transport in a scattering operation while placing no constraints on the filters themselves, providing a definition of convolution that commutes with the action of isometries, has increased descriptive potential, and is robust to noise and other nuisance factors. The result is a rich notion of convolution which we call field convolution, well-suited for CNNs on surfaces. Field convolutions are flexible, straight-forward to incorporate into surface learning frameworks, and their highly discriminating nature has cascading effects throughout the learning pipeline. Using simple networks constructed from residual field convolution blocks, we achieve state-of-the-art results on standard benchmarks in fundamental geometry processing tasks, such as shape classification, segmentation, correspondence, and sparse matching.
    Adversarially Regularized Policy Learning Guided by Trajectory Optimization. (arXiv:2109.07627v1 [cs.RO])
    (0 min) Recent advancement in combining trajectory optimization with function approximation (especially neural networks) shows promise in learning complex control policies for diverse tasks in robot systems. Despite their great flexibility, the large neural networks for parameterizing control policies impose significant challenges. The learned neural control policies are often overcomplex and non-smooth, which can easily cause unexpected or diverging robot motions. Therefore, they often yield poor generalization performance in practice. To address this issue, we propose adVErsarially Regularized pOlicy learNIng guided by trajeCtory optimizAtion (VERONICA) for learning smooth control policies. Specifically, our proposed approach controls the smoothness (local Lipschitz continuity) of the neural control policies by stabilizing the output control with respect to the worst-case perturbation to the input state. Our experiments on robot manipulation show that our proposed approach not only improves the sample efficiency of neural policy learning but also enhances the robustness of the policy against various types of disturbances, including sensor noise, environmental uncertainty, and model mismatch.
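    The core regularizer can be pictured as penalizing how much the policy output changes under a worst-case input perturbation. A minimal PyTorch-style sketch follows; the perturbation radius, the PGD-style inner loop, and all names are illustrative assumptions rather than the authors' implementation.
        import torch

        def adversarial_smoothness_penalty(policy, states, eps=0.05, steps=3, step_size=0.02):
            # Find an approximate worst-case perturbation delta (||delta||_inf <= eps)
            # that maximizes the change in the policy output, then penalize that change.
            base = policy(states).detach()
            delta = torch.zeros_like(states, requires_grad=True)
            for _ in range(steps):
                change = ((policy(states + delta) - base) ** 2).sum()
                grad, = torch.autograd.grad(change, delta)
                delta = (delta + step_size * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
            return ((policy(states + delta) - base) ** 2).mean()

        # illustrative total loss: imitation of trajectory-optimization targets plus the smoothness penalty
        # loss = ((policy(states) - expert_actions) ** 2).mean() + lam * adversarial_smoothness_penalty(policy, states)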
    Machine learning with quantum field theories. (arXiv:2109.07730v1 [cs.LG])
    (0 min) The precise equivalence between discretized Euclidean field theories and a certain class of probabilistic graphical models, namely the mathematical framework of Markov random fields, opens up the opportunity to investigate machine learning from the perspective of quantum field theory. In this contribution we will demonstrate, through the Hammersley-Clifford theorem, that the $\phi^{4}$ scalar field theory on a square lattice satisfies the local Markov property and can therefore be recast as a Markov random field. We will then derive from the $\phi^{4}$ theory machine learning algorithms and neural networks which can be viewed as generalizations of conventional neural network architectures. Finally, we will conclude by presenting applications based on the minimization of an asymmetric distance between the probability distribution of the $\phi^{4}$ machine learning algorithms and target probability distributions.
    Soft Confusion Matrix Classifier for Stream Classification. (arXiv:2109.07857v1 [cs.LG])
    (0 min) In this paper, the issue of tailoring the soft confusion matrix (SCM) based classifier to deal with the stream learning task is addressed. The main goal of the work is to develop a wrapping classifier that enables incremental learning for classifiers that are unable to learn incrementally. The goal is achieved by making two improvements to the previously developed SCM classifier. The first one is aimed at reducing the computational cost of the SCM classifier. To do so, the definition of the fuzzy neighborhood of an object is changed. The second one is aimed at dealing effectively with concept drift. This is done by employing an ADWIN-driven concept drift detector that is not only used to detect the drift but also to control the size of the neighborhood. The obtained experimental results show that the proposed approach significantly outperforms the reference methods.
    A Comparative Study of Machine Learning Methods for Predicting the Evolution of Brain Connectivity from a Baseline Timepoint. (arXiv:2109.07739v1 [cs.LG])
    (0 min) Predicting the evolution of the brain network, also called connectome, by foreseeing changes in the connectivity weights linking pairs of anatomical regions makes it possible to spot connectivity-related neurological disorders in earlier stages and detect the development of potential connectomic anomalies. Remarkably, such a challenging prediction problem remains least explored in the predictive connectomics literature. It is a known fact that machine learning (ML) methods have proven their predictive abilities in a wide variety of computer vision problems. However, ML techniques specifically tailored for the prediction of brain connectivity evolution trajectory from a single timepoint are almost absent. To fill this gap, we organized a Kaggle competition where 20 competing teams designed advanced machine learning pipelines for predicting the brain connectivity evolution from a single timepoint. The competing teams developed their ML pipelines with a combination of data pre-processing, dimensionality reduction, and learning methods. Utilizing an inclusive evaluation approach, we ranked the methods based on two complementary evaluation metrics (mean absolute error (MAE) and Pearson Correlation Coefficient (PCC)) and their performances using different training and testing data perturbation strategies (single random split and cross-validation). The final rank was calculated using the rank product for each competing team across all evaluation measures and validation strategies. In support of open science, the developed 20 ML pipelines along with the connectomic dataset are made available on GitHub. The outcomes of this competition are anticipated to lead to the further development of predictive models that can foresee the evolution of brain connectivity over time, as well as other types of networks (e.g., genetic networks).
    Estimation of Warfarin Dosage with Reinforcement Learning. (arXiv:2109.07564v1 [cs.LG])
    (0 min) In this paper, we attempt to use reinforcement learning to model the proper dosage of Warfarin for patients. The paper first examines two baselines: a fixed model of 35 mg/week dosages and a linear model that relies on patient data. We implemented a LinUCB bandit that improved performance as measured by regret and percent incorrect. On top of the LinUCB bandit, we experimented with online supervised learning and reward reshaping to boost performance. Our results clearly beat the baselines and show the promise of using multi-armed bandits and artificial intelligence to aid physicians in deciding proper dosages.
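    For readers unfamiliar with the bandit used here, a minimal sketch of a standard disjoint LinUCB (one ridge-regression model per dosage arm) is given below; the number of arms, feature dimension, and reward scheme are placeholders rather than the paper's setup.
        import numpy as np

        class LinUCB:
            # Disjoint LinUCB: one ridge-regression model per arm (e.g., per dosage bucket).
            def __init__(self, n_arms, dim, alpha=1.0):
                self.alpha = alpha
                self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
                self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y per arm

            def choose(self, x):
                scores = []
                for A, b in zip(self.A, self.b):
                    A_inv = np.linalg.inv(A)
                    theta = A_inv @ b
                    scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))  # UCB score
                return int(np.argmax(scores))

            def update(self, arm, x, reward):
                self.A[arm] += np.outer(x, x)
                self.b[arm] += reward * x

        # e.g., arms = {low, medium, high} weekly dosage; reward 0 for the correct bucket, -1 otherwise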
    The Neural Metric Factorization for Computational Drug Repositioning. (arXiv:2109.07690v1 [cs.LG])
    (2 min) Computational drug repositioning aims to discover new therapeutic indications for marketed drugs and has the advantages of low cost, short development cycle, and high controllability compared to traditional drug development. The matrix factorization model has become a mainstream cornerstone technique for computational drug repositioning due to its ease of implementation and excellent scalability. However, the matrix factorization model uses the inner product operation to represent the association between drugs and diseases, which is lacking in expressive ability. Moreover, the degree of similarity between drugs or diseases cannot be reflected in their respective latent factor vectors, which does not satisfy the common sense of drug discovery. Therefore, a neural metric factorization (NMF) model for computational drug repositioning is proposed in this work. We consider the latent factor vectors of drugs and diseases as points in a high-dimensional coordinate system and propose a generalized Euclidean distance to represent the association between drugs and diseases, compensating for the shortcomings of the inner product operation. Furthermore, by embedding information from multiple drug and disease metrics into the encoding space of the latent factor vector, the latent factor vectors of similar drugs or diseases are made closer. Finally, we conduct extensive experiments on two real datasets to demonstrate the effectiveness of the above improvements and the superiority of the NMF model.
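    The contrast between the two association scores can be sketched in a few lines; the per-dimension weighting below is only one plausible reading of "generalized Euclidean distance", so treat it as an assumption.
        import numpy as np

        def inner_product_score(p_drug, q_disease):
            # classical matrix-factorization association score
            return p_drug @ q_disease

        def generalized_euclidean_score(p_drug, q_disease, w=None):
            # Distance-based association: a smaller distance means a stronger association.
            # `w` is an illustrative per-dimension weight; the paper's exact form may differ.
            w = np.ones_like(p_drug) if w is None else w
            dist = np.sqrt(np.sum(w * (p_drug - q_disease) ** 2))
            return -dist  # negate so that a higher score still means a stronger association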
    How to Simplify Search: Classification-wise Pareto Evolution for One-shot Neural Architecture Search. (arXiv:2109.07582v1 [cs.LG])
    (2 min) In the deployment of deep neural models, how to effectively and automatically find feasible deep models under diverse design objectives is fundamental. Most existing neural architecture search (NAS) methods utilize surrogates to predict the detailed performance (e.g., accuracy and model size) of a candidate architecture during the search, which however is complicated and inefficient. In contrast, we aim to learn an efficient Pareto classifier to simplify the search process of NAS by transforming the complex multi-objective NAS task into a simple Pareto-dominance classification task. To this end, we propose a classification-wise Pareto evolution approach for one-shot NAS, where an online classifier is trained to predict the dominance relationship between the candidate and constructed reference architectures, instead of using surrogates to fit the objective functions. The main contribution of this study is to change supernet adaptation into a Pareto classifier. Besides, we design two adaptive schemes to select the reference set of architectures for constructing the classification boundary and to regulate the rate of positive samples over negative ones, respectively. We compare the proposed evolution approach with state-of-the-art approaches on widely-used benchmark datasets, and experimental results indicate that the proposed approach outperforms other approaches and has found a number of neural architectures with different model sizes ranging from 2M to 6M under diverse objectives and constraints.
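    The labels such an online classifier learns come from the usual Pareto-dominance relation; a minimal sketch of that check (with objectives to be minimized, e.g., validation error and model size) is shown below with made-up numbers.
        def dominates(obj_a, obj_b):
            # True if A Pareto-dominates B: no worse on every objective (all minimized)
            # and strictly better on at least one.
            no_worse = all(a <= b for a, b in zip(obj_a, obj_b))
            strictly_better = any(a < b for a, b in zip(obj_a, obj_b))
            return no_worse and strictly_better

        # objectives = (validation error, #params in millions), illustrative values
        print(dominates((0.24, 3.1), (0.26, 4.0)))  # True: better on both
        print(dominates((0.24, 5.0), (0.26, 4.0)))  # False: a trade-off, neither dominates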
    Reframing Instructional Prompts to GPTk's Language. (arXiv:2109.07830v1 [cs.CL])
    (2 min) How can model designers turn task instructions into effective prompts for language models? Backed by extensive empirical analysis on GPT3, we observe important features for successful instructional prompts, and propose several reframing techniques for model designers to create such prompts. For example, a complex task can be decomposed into multiple simpler tasks. We experiment over 12 NLP tasks across 6 diverse categories (question generation, classification, etc.). Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity over existing few-shot baselines. The performance gains are particularly important on large language models, such as GPT3, where tuning models or prompts on large datasets is not feasible. Furthermore, we observe that such gains are not limited to GPT3; the reframed tasks remain superior over raw instructions across different model architectures, underscoring the cross-model generality of these guidelines. We hope these empirically driven techniques will pave the way for more effective ways to prompt LMs in the future.
    Federated Submodel Averaging. (arXiv:2109.07704v1 [cs.LG])
    (2 min) We study practical data characteristics underlying federated learning, where non-i.i.d. data from clients have sparse features, and a certain client's local data normally involves only a small part of the full model, called a submodel. Due to data sparsity, the classical federated averaging (FedAvg) algorithm or its variants will be severely slowed down, because when updating the global model, each client's zero update of the full model excluding its submodel is inaccurately aggregated. Therefore, we propose federated submodel averaging (FedSubAvg), which ensures that the expectation of the global update of each model parameter is equal to the average of the local updates of the clients who involve it. We theoretically prove the convergence rate of FedSubAvg by deriving an upper bound under a new metric called the element-wise gradient norm. In particular, this new metric can characterize the convergence of federated optimization over sparse data, while the conventional metric of squared gradient norm used in FedAvg and its variants cannot. We extensively evaluate FedSubAvg over both public and industrial datasets. The evaluation results demonstrate that FedSubAvg significantly outperforms FedAvg and its variants.
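    The aggregation rule described here can be sketched directly: each parameter's global update is the average of the local updates of only those clients whose submodel involves it. The dense arrays and boolean masks below are an illustrative simplification of the sparse setting, not the paper's implementation.
        import numpy as np

        def fed_submodel_average(local_updates, involvement_masks):
            # local_updates: list of arrays shaped like the full model (zeros outside each submodel)
            # involvement_masks: list of boolean arrays marking the parameters each client involves
            update_sum = np.zeros_like(local_updates[0])
            count = np.zeros_like(local_updates[0])
            for upd, mask in zip(local_updates, involvement_masks):
                update_sum += np.where(mask, upd, 0.0)
                count += mask.astype(float)
            return update_sum / np.maximum(count, 1.0)  # average over involved clients only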
    Secure Your Ride: Real-time Matching Success Rate Prediction for Passenger-Driver Pairs. (arXiv:2109.07571v1 [cs.LG])
    (2 min) In recent years, online ride-hailing platforms have become an indispensable part of urban transportation. After a passenger is matched up with a driver by the platform, both the passenger and the driver have the freedom to simply accept or cancel a ride with one click. Hence, accurately predicting whether a passenger-driver pair is a good match turns out to be crucial for ride-hailing platforms to devise instant order assignments. However, since the users of ride-hailing platforms consist of two parties, decision-making needs to simultaneously account for the dynamics from both the driver and the passenger sides. This makes it more challenging than traditional online advertising tasks. Moreover, the amount of available data is severely imbalanced across different cities, creating difficulties for training an accurate model for smaller cities with scarce data. Though a sophisticated neural network architecture can help improve the prediction accuracy under data scarcity, the overly complex design will impede the model's capacity to deliver timely predictions in a production environment. In this paper, to accurately predict the matching success rate (MSR) of a passenger-driver pair, we propose the Multi-View model (MV), which comprehensively learns the interactions among the dynamic features of the passenger, driver, trip order, as well as context. Regarding the data imbalance problem, we further design the Knowledge Distillation framework (KD) to supplement the model's predictive power for smaller cities using knowledge from cities with denser data, and also to generate a simple model to support efficient deployment. Finally, we conduct extensive experiments on real-world datasets from several different cities, which demonstrate the superiority of our solution.
    Non-smooth Bayesian Optimization in Tuning Problems. (arXiv:2109.07563v1 [cs.LG])
    (2 min) Building surrogate models is one common approach when we attempt to learn unknown black-box functions. Bayesian optimization provides a framework which allows us to build surrogate models based on sequential samples drawn from the function and find the optimum. Tuning algorithmic parameters to optimize the performance of large, complicated "black-box" application codes is an important specific application, which aims at finding the optima of black-box functions. Within the Bayesian optimization framework, the Gaussian process model produces smooth or continuous sample paths. However, the black-box function in the tuning problem is often non-smooth. This difficult tuning problem is worsened by the fact that we usually have limited sequential samples from the black-box function. Motivated by these issues encountered in tuning, we propose a novel additive Gaussian process model called the clustered Gaussian process (cGP), where the additive components are induced by clustering. By using this surrogate model, we aim to capture the non-smoothness of the black-box function. In addition to an algorithm for constructing this model, we also apply the model to several artificial and real applications to evaluate it. In the examples we studied, the performance can be improved by as much as 90% among repetitive experiments.
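    To make the motivation concrete, the sketch below builds a simplified piecewise surrogate with scikit-learn: cluster the sampled inputs and fit one Gaussian process per cluster. The actual cGP model is additive and more refined; this only illustrates why clustering helps with non-smoothness.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import Matern

        class ClusteredGPSurrogate:
            # Simplified piecewise surrogate: one GP per input cluster,
            # predictions come from the GP of the assigned cluster.
            def __init__(self, n_clusters=3):
                self.n_clusters = n_clusters

            def fit(self, X, y):
                self.km = KMeans(n_clusters=self.n_clusters, n_init=10).fit(X)
                self.gps = []
                for k in range(self.n_clusters):
                    idx = self.km.labels_ == k
                    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
                    self.gps.append(gp.fit(X[idx], y[idx]))
                return self

            def predict(self, X):
                labels = self.km.predict(X)
                mean, std = np.empty(len(X)), np.empty(len(X))
                for k, gp in enumerate(self.gps):
                    sel = labels == k
                    if sel.any():
                        mean[sel], std[sel] = gp.predict(X[sel], return_std=True)
                return mean, std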
    Learning logic programs through divide, constrain, and conquer. (arXiv:2109.07818v1 [cs.AI])
    (2 min) We introduce an inductive logic programming approach that combines classical divide-and-conquer search with modern constraint-driven search. Our anytime approach can learn optimal, recursive, and large programs and supports predicate invention. Our experiments on three domains (classification, inductive general game playing, and program synthesis) show that our approach can increase predictive accuracies and reduce learning times.
    Generalized XGBoost Method. (arXiv:2109.07473v1 [cs.LG])
    (2 min) The XGBoost method has many advantages and is especially suitable for statistical analysis of big data, but its loss function is limited to convex functions. In many specific applications, a nonconvex loss function would be preferable. In this paper, we propose a generalized XGBoost method, which requires a weaker condition on the loss function and accommodates more general loss functions, including convex loss functions and some non-convex loss functions. Furthermore, this generalized XGBoost method is extended to multivariate loss functions to form a more generalized XGBoost method. This method is a multivariate regularized tree boosting method, which can model multiple parameters in most of the frequently-used parametric probability distributions to be fitted by predictor variables. The related algorithms and some examples in non-life insurance pricing are also given.
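    The standard XGBoost interface already accepts a user-supplied objective via its gradient and second derivative, which is one way a non-convex loss can be plugged in. A hedged sketch with a Cauchy (non-convex) loss follows; clipping the Hessian to stay positive is a common practical workaround, not something taken from the paper.
        import numpy as np
        import xgboost as xgb

        def cauchy_objective(preds, dtrain, c=1.0):
            # Cauchy loss log(1 + r^2/c^2) is non-convex in the residual r = pred - label.
            # XGBoost only needs the first and second derivatives per example.
            r = preds - dtrain.get_label()
            grad = 2.0 * r / (c ** 2 + r ** 2)
            hess = 2.0 * (c ** 2 - r ** 2) / (c ** 2 + r ** 2) ** 2
            return grad, np.maximum(hess, 1e-6)  # keep the Hessian positive for stable splits

        # illustrative usage with random data
        X, y = np.random.randn(200, 5), np.random.randn(200)
        dtrain = xgb.DMatrix(X, label=y)
        booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain, num_boost_round=50, obj=cauchy_objective)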
    Network representation learning systematic review: ancestors and current development state. (arXiv:2109.07583v1 [cs.LG])
    (2 min) Real-world information networks are increasingly occurring across various disciplines, including online social networks and citation networks. These network data are generally characterized by sparseness, nonlinearity and heterogeneity, bringing different challenges to the network analytics task of capturing inherent properties from network data. Artificial intelligence and machine learning have recently been leveraged as powerful systems to learn insights from network data and deal with these challenges. As part of machine learning techniques, graph embedding approaches were originally conceived for graphs constructed from feature-represented datasets, like image datasets, in which links between nodes are explicitly defined. These traditional approaches cannot cope with network data challenges. As a new learning paradigm, network representation learning has been proposed to map a real-world information network into a low-dimensional space while preserving inherent properties of the network. In this paper, we present a systematic, comprehensive survey of network representation learning, also known as network embedding, from birth to the current development state. Through this survey, we provide a comprehensive view of the reasons behind the emergence of network embedding and of the types of settings and models used in the network embedding pipeline. We introduce a brief history of representation learning and of word representation learning, the ancestors of network embedding. We also provide formal definitions of basic concepts required to understand network representation learning, followed by a description of the network embedding pipeline. The most commonly used downstream tasks for evaluating embeddings, their evaluation metrics, and popular datasets are highlighted. Finally, we present the open-source libraries for network embedding.
    Self-supervised Contrastive Learning for EEG-based Sleep Staging. (arXiv:2109.07839v1 [cs.LG])
    (2 min) EEG signals are usually simple to obtain but expensive to label. Although supervised learning has been widely used in the field of EEG signal analysis, its generalization performance is limited by the amount of annotated data. Self-supervised learning (SSL), as a popular learning paradigm in computer vision (CV) and natural language processing (NLP), can employ unlabeled data to make up for the data shortage of supervised learning. In this paper, we propose a self-supervised contrastive learning method of EEG signals for sleep stage classification. During the training process, we set up a pretext task for the network in order to match the right transformation pairs generated from EEG signals. In this way, the network improves the representation ability by learning the general features of EEG signals. The robustness of the network also gets improved in dealing with diverse data, that is, extracting constant features from changing data. In detail, the network's performance depends on the choice of transformations and the amount of unlabeled data used in the training process of self-supervised learning. Empirical evaluations on the Sleep-edf dataset demonstrate the competitive performance of our method on sleep staging (88.16% accuracy and 81.96% F1 score) and verify the effectiveness of SSL strategy for EEG signal analysis in limited labeled data regimes. All codes are provided publicly online.
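    The pretext task of matching transformation pairs is typically trained with a contrastive (NT-Xent-style) loss over two augmented views of each EEG epoch. The PyTorch sketch below shows such a loss; the specific augmentations and the encoder are assumptions, not the paper's exact recipe.
        import torch
        import torch.nn.functional as F

        def nt_xent(z1, z2, temperature=0.5):
            # Contrastive loss over two augmented views of the same EEG epochs (batch-aligned pairs).
            z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d)
            sim = z @ z.t() / temperature                           # cosine similarities
            n = z1.size(0)
            mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
            sim = sim.masked_fill(mask, float('-inf'))              # drop self-similarity
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
            return F.cross_entropy(sim, targets)

        # z1, z2 could be encoder outputs for, e.g., time-shifted and amplitude-scaled views of the raw EEG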
    CounterNet: End-to-End Training of Counterfactual Aware Predictions. (arXiv:2109.07557v1 [cs.LG])
    (2 min) This work presents CounterNet, a novel end-to-end learning framework which integrates the predictive model training and counterfactual (CF) explanation generation into a single end-to-end pipeline. Counterfactual explanations attempt to find the smallest modification to the feature values of an instance that changes the prediction of the ML model to a predefined output. Prior CF explanation techniques rely on solving separate time-intensive optimization problems for every single input instance to find CF examples, and also suffer from the misalignment of objectives between model predictions and explanations, which leads to significant shortcomings in the quality of CF explanations. CounterNet, on the other hand, integrates both prediction and explanation in the same framework, which enables the optimization of the CF example generation only once together with the predictive model. We propose a novel variant of back-propagation which can help in effectively training CounterNet's network. Finally, we conduct extensive experiments on multiple real-world datasets. Our results show that CounterNet generates high-quality predictions, and corresponding CF examples (with high validity) for any new input instance significantly faster than existing state-of-the-art baselines.
    Towards Zero-shot Cross-lingual Image Retrieval and Tagging. (arXiv:2109.07622v1 [cs.LG])
    (2 min) There has been a recent spike in interest in multi-modal Language and Vision problems. On the language side, most of these models primarily focus on English since most multi-modal datasets are monolingual. We try to bridge this gap with a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side. We present a simple yet practical approach for building a cross-lingual image retrieval model which trains on a monolingual training dataset but can be used in a zero-shot cross-lingual fashion during inference. We also introduce a new objective function which tightens the text embedding clusters by pushing dissimilar texts away from each other. For evaluation, we introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages that we collected using a crowdsourcing platform. We use this as the test set for zero-shot model performance across languages. We also demonstrate how a cross-lingual model can be used for downstream tasks like multi-lingual image tagging in a zero shot manner. XTD10 dataset is made publicly available here: https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10.
    ROS-X-Habitat: Bridging the ROS Ecosystem with Embodied AI. (arXiv:2109.07703v1 [cs.RO])
    (2 min) We introduce ROS-X-Habitat, a software interface that bridges the AI Habitat platform for embodied reinforcement learning agents with other robotics resources via ROS. This interface not only offers standardized communication protocols between embodied agents and simulators, but also enables physics-based simulation. With this interface, roboticists are able to train their own Habitat RL agents in another simulation environment or to develop their own robotic algorithms inside Habitat Sim. Through in silico experiments, we demonstrate that ROS-X-Habitat has minimal impact on the navigation performance and simulation speed of Habitat agents; that a standard set of ROS mapping, planning and navigation tools can run in the Habitat simulator, and that a Habitat agent can run in the standard ROS simulator Gazebo.
    Discovering Useful Compact Sets of Sequential Rules in a Long Sequence. (arXiv:2109.07519v1 [cs.LG])
    (2 min) We are interested in understanding the underlying generation process for long sequences of symbolic events. To do so, we propose COSSU, an algorithm to mine small and meaningful sets of sequential rules. The rules are selected using an MDL-inspired criterion that favors compactness and relies on a novel rule-based encoding scheme for sequences. Our evaluation shows that COSSU can successfully retrieve relevant sets of closed sequential rules from a long sequence. Such rules constitute an interpretable model that exhibits competitive accuracy for the tasks of next-element prediction and classification.
    Federated Contrastive Learning for Decentralized Unlabeled Medical Images. (arXiv:2109.07504v1 [cs.LG])
    (2 min) A label-efficient paradigm in computer vision is based on self-supervised contrastive pre-training on unlabeled data followed by fine-tuning with a small number of labels. Making practical use of a federated computing environment in the clinical domain and learning on medical images poses specific challenges. In this work, we propose FedMoCo, a robust federated contrastive learning (FCL) framework, which makes efficient use of decentralized unlabeled medical data. FedMoCo has two novel modules: metadata transfer, an inter-node statistical data augmentation module, and self-adaptive aggregation, an aggregation module based on representational similarity analysis. To the best of our knowledge, this is the first FCL work on medical images. Our experiments show that FedMoCo can consistently outperform FedAvg, a seminal federated learning framework, in extracting meaningful representations for downstream tasks. We further show that FedMoCo can substantially reduce the amount of labeled data required in a downstream task, such as COVID-19 detection, to achieve a reasonable performance.
    On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation. (arXiv:2109.07591v1 [cs.CL])
    (2 min) Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected data training and then fine tuning. Data selection improves target domain generalization by training further on pretraining data identified by relying on a small sample of target domain data. This work examines the benefit of data selection for language modeling and machine translation. Our experiments assess the complementarity of selection with fine tuning and result in practical recommendations: (i) selected data must be similar to the fine-tuning domain but not so much as to erode the complementary effect of fine-tuning; (ii) there is a trade-off between selecting little data for fast but limited progress or much data for slow but long lasting progress; (iii) data selection can be applied early during pretraining, with performance gains comparable to long pretraining session; (iv) data selection from domain classifiers is often more effective than the popular contrastive data selection method.
    Comparing Euclidean and Hyperbolic Embeddings on the WordNet Nouns Hypernymy Graph. (arXiv:2109.07488v1 [cs.CL])
    (2 min) Nickel and Kiela (2017) present a new method for embedding tree nodes in the Poincare ball, and suggest that these hyperbolic embeddings are far more effective than Euclidean embeddings at embedding nodes in large, hierarchically structured graphs like the WordNet nouns hypernymy tree. This is especially true in low dimensions (Nickel and Kiela, 2017, Table 1). In this work, we seek to reproduce their experiments on embedding and reconstructing the WordNet nouns hypernymy graph. Counter to what they report, we find that Euclidean embeddings are able to represent this tree at least as well as Poincare embeddings, when allowed at least 50 dimensions. We note that this does not diminish the significance of their work given the impressive performance of hyperbolic embeddings in very low-dimensional settings. However, given the wide influence of their work, our aim here is to present an updated and more accurate comparison between the Euclidean and hyperbolic embeddings.
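    For reference, the reconstruction comparison boils down to ranking neighbours by the two distances below; the Poincare-ball distance is the standard closed form used by Nickel and Kiela, here sketched in plain NumPy.
        import numpy as np

        def poincare_distance(u, v, eps=1e-9):
            # Distance in the Poincare ball; u and v must have norm < 1.
            sq_diff = np.sum((u - v) ** 2)
            denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
            return np.arccosh(1.0 + 2.0 * sq_diff / max(denom, eps))

        def euclidean_distance(u, v):
            return np.linalg.norm(u - v)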
    Tied & Reduced RNN-T Decoder. (arXiv:2109.07513v1 [cs.CL])
    (2 min) Previous work on Recurrent Neural Network-Transducer (RNN-T) models has shown that, under some conditions, it is possible to simplify the prediction network with little or no loss in recognition accuracy (arXiv:2003.07705 [eess.AS], [2], arXiv:2012.06749 [cs.CL]). This is done by limiting the context size of previous labels and/or using a simpler architecture for its layers instead of LSTMs. The benefits of such changes include reduction in model size, faster inference and power savings, which are all useful for on-device applications. In this work, we study ways to make the RNN-T decoder (prediction network + joint network) smaller and faster without degradation in recognition performance. Our prediction network performs a simple weighted averaging of the input embeddings, and shares its embedding matrix weights with the joint network's output layer (a.k.a. weight tying, commonly used in language modeling arXiv:1611.01462 [cs.LG]). This simple design, when used in conjunction with additional Edit-based Minimum Bayes Risk (EMBR) training, reduces the RNN-T decoder from 23M parameters to just 2M, without affecting word-error rate (WER).
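    The two ideas named here (averaging the embeddings of the previous labels, and tying that embedding matrix to the joint network's output projection) can be sketched in PyTorch as follows; the dimensions, context size, and module layout are illustrative assumptions rather than the authors' exact architecture.
        import torch
        import torch.nn as nn

        class TiedReducedPredictionNetwork(nn.Module):
            # Prediction network = learned weighted average of the embeddings of the previous labels;
            # the embedding matrix is shared (tied) with the output projection used by the joint network.
            def __init__(self, vocab_size, dim, context=2):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, dim)
                self.pos_weights = nn.Parameter(torch.ones(context) / context)  # averaging weights
                self.out = nn.Linear(dim, vocab_size, bias=False)  # applied later by the joint network
                self.out.weight = self.embed.weight                # weight tying

            def forward(self, prev_labels):                 # prev_labels: (batch, context)
                emb = self.embed(prev_labels)               # (batch, context, dim)
                return (self.pos_weights.unsqueeze(-1) * emb).sum(dim=1)  # (batch, dim)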
    Score Matching Model for Unbounded Data Score. (arXiv:2106.05527v3 [cs.LG] UPDATED)
    (2 min) Recent advances in diffusion models incorporate the Stochastic Differential Equation (SDE), which brings state-of-the-art performance on image generation tasks. This paper improves such diffusion models by analyzing the model at the zero diffusion time. In real datasets, the score function diverges as the diffusion time ($t$) decreases to zero, and this observation leads to the argument that score estimation fails at $t=0$ with any neural network structure. Subsequently, we introduce the Unbounded Diffusion Model (UDM), which resolves the score diverging problem with an easily applicable modification to any diffusion model. Additionally, we introduce a new SDE that overcomes the theoretical and practical limitations of the Variance Exploding SDE. On top of that, the introduced Soft Truncation method improves the sample quality by mitigating the loss scale issue that happens at $t=0$. We further provide a theoretical result for the proposed method to uncover the mechanism behind diffusion models.
    DNN-Based Decentralized Hybrid Beamforming for Cell-Free Massive MIMO. (arXiv:2106.16194v2 [eess.SP] UPDATED)
    (2 min) Cell-free massive MIMO (CF-mMIMO) systems represent a promising approach to increase the spectral efficiency of wireless communication systems. However, near-optimal hybrid beamforming solutions require a large amount of signaling exchange between access points (APs) and the network controller (NC). In this letter, we propose two unsupervised deep neural networks (DNN) architectures, fully and partially distributed, that can perform decentralized coordinated hybrid beamforming with zero or limited communication overhead between APs and NC, while achieving near-optimal sum-rate with a reduced computational complexity compared to conventional near-optimal solutions.

2021-09-16

  • cs.CL updates on arXiv.org

    Does language help generalization in vision models? (arXiv:2104.08313v3 [cs.AI] UPDATED)
    (2 min) Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness. In each setting, multimodal training produced no additional generalization capability compared to standard supervised visual training. We conclude that work is still required for semantic grounding to help improve vision models.
    Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids' Representations. (arXiv:2109.05958v2 [cs.CL] UPDATED)
    (2 min) Most of the recent works on probing representations have focused on BERT, with the presumption that the findings might be similar to the other models. In this work, we extend the probing studies to two other models in the family, namely ELECTRA and XLNet, showing that variations in the pre-training objectives or architectural choices can result in different behaviors in encoding linguistic information in the representations. Most notably, we observe that ELECTRA tends to encode linguistic knowledge in the deeper layers, whereas XLNet instead concentrates that in the earlier layers. Also, the former model undergoes a slight change during fine-tuning, whereas the latter experiences significant adjustments. Moreover, we show that drawing conclusions based on the weight mixing evaluation strategy -- which is widely used in the context of layer-wise probing -- can be misleading given the norm disparity of the representations across different layers. Instead, we adopt an alternative information-theoretic probing with minimum description length, which has recently been proven to provide more reliable and informative results.
    Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU. (arXiv:2109.07364v1 [cs.CL])
    (2 min) Incremental processing allows interactive systems to respond based on partial inputs, which is a desirable property e.g. in dialogue agents. The currently popular Transformer architecture inherently processes sequences as a whole, abstracting away the notion of time. Recent work attempts to apply Transformers incrementally via restart-incrementality by repeatedly feeding, to an unchanged model, increasingly longer input prefixes to produce partial outputs. However, this approach is computationally costly and does not scale efficiently for long sequences. In parallel, we witness efforts to make Transformers more efficient, e.g. the Linear Transformer (LT) with a recurrence mechanism. In this work, we examine the feasibility of LT for incremental NLU in English. Our results show that the recurrent LT model has better incremental performance and faster inference speed compared to the standard Transformer and LT with restart-incrementality, at the cost of part of the non-incremental (full sequence) quality. We show that the performance drop can be mitigated by training the model to wait for right context before committing to an output and that training with input prefixes is beneficial for delivering correct partial outputs.
    Constraint based Knowledge Base Distillation in End-to-End Task Oriented Dialogs. (arXiv:2109.07396v1 [cs.CL])
    (2 min) End-to-end task-oriented dialogue systems generate responses based on dialog history and an accompanying knowledge base (KB). Inferring those KB entities that are most relevant for an utterance is crucial for response generation. Existing state-of-the-art approaches scale to large KBs by softly filtering out irrelevant KB information. In this paper, we propose a novel filtering technique that consists of (1) a pairwise similarity based filter that identifies relevant information by respecting the n-ary structure in a KB record, and (2) an auxiliary loss that helps in separating contextually unrelated KB information. We also propose a new metric -- multiset entity F1 -- which fixes a correctness issue in the existing entity F1 metric. Experimental results on three publicly available task-oriented dialog datasets show that our proposed approach outperforms existing state-of-the-art models.
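    A plain multiset precision/recall/F1 over gold and predicted KB entities can be written with a Counter, so repeated mentions count as many times as they occur instead of being collapsed into a set; the paper's exact definition may differ in detail, so this is only a sketch.
        from collections import Counter

        def multiset_entity_f1(predicted_entities, gold_entities):
            # F1 over multisets of entities: duplicates count as many times as they appear.
            pred, gold = Counter(predicted_entities), Counter(gold_entities)
            overlap = sum((pred & gold).values())            # multiset intersection size
            precision = overlap / max(sum(pred.values()), 1)
            recall = overlap / max(sum(gold.values()), 1)
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        # multiset_entity_f1(["8 pm", "pizza_hut", "pizza_hut"], ["pizza_hut", "8 pm"])  # differs from set-based F1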
    Transformer-based Lexically Constrained Headline Generation. (arXiv:2109.07080v1 [cs.CL])
    (2 min) This paper explores a variant of automatic headline generation methods, where a generated headline is required to include a given phrase such as a company or a product name. Previous methods using Transformer-based models generate a headline including a given phrase by providing the encoder with additional information corresponding to the given phrase. However, these methods cannot always include the phrase in the generated headline. Inspired by previous RNN-based methods generating token sequences in backward and forward directions from the given phrase, we propose a simple Transformer-based method that guarantees to include the given phrase in the high-quality generated headline. We also consider a new headline generation strategy that takes advantage of the controllable generation order of Transformer. Our experiments with the Japanese News Corpus demonstrate that our methods, which are guaranteed to include the phrase in the generated headline, achieve ROUGE scores comparable to previous Transformer-based methods. We also show that our generation strategy performs better than previous strategies.
    Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training. (arXiv:2109.07306v1 [cs.CL])
    (2 min) Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm VoCap to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
    Counterfactual Interventions Reveal the Causal Effect of Relative Clause Representations on Agreement Prediction. (arXiv:2105.06965v3 [cs.CL] UPDATED)
    (2 min) When language models process syntactically complex sentences, do they use their representations of syntax in a manner that is consistent with the grammar of the language? We propose AlterRep, an intervention-based method to address this question. For any linguistic feature of a given sentence, AlterRep generates counterfactual representations by altering how the feature is encoded, while leaving intact all other aspects of the original representation. By measuring the change in a model's word prediction behavior when these counterfactual representations are substituted for the original ones, we can draw conclusions about the causal effect of the linguistic feature in question on the model's behavior. We apply this method to study how BERT models of different sizes process relative clauses (RCs). We find that BERT variants use RC boundary information during word prediction in a manner that is consistent with the rules of English grammar; this RC boundary information generalizes to a considerable extent across different RC types, suggesting that BERT represents RCs as an abstract linguistic category.
    Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion. (arXiv:2106.03821v2 [cs.SD] UPDATED)
    (2 min) It is now well established from a variety of studies that there is a significant benefit from combining video and audio data in detecting active speakers. However, either of the modalities can potentially mislead audiovisual fusion by inducing unreliable or deceptive information. This paper formulates active speaker detection as a multi-objective learning problem to leverage the best of each modality using a novel self-attention, uncertainty-based multimodal fusion scheme. The results obtained show that the proposed multi-objective learning architecture outperforms traditional approaches in improving both mAP and AUC scores. We further demonstrate that our fusion strategy surpasses, in active speaker detection, other modality fusion methods reported in various disciplines. We finally show that the proposed method significantly improves the state-of-the-art on the AVA-ActiveSpeaker dataset.
    Language Grounding with 3D Objects. (arXiv:2107.12514v2 [cs.CL] UPDATED)
    (2 min) Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless." The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while substantial effort and progress has been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE) requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, it is still the case that these image-based models are weaker at understanding the 3D nature of objects -- properties which play a key role in manipulation. We find that adding view estimation to language grounding models improves accuracy on both SNARE and when identifying objects referred to in language on a robot platform, but note that a large gap remains between these models and human performance.
    From Alignment to Assignment: Frustratingly Simple Unsupervised Entity Alignment. (arXiv:2109.02363v2 [cs.CL] UPDATED)
    (2 min) Cross-lingual entity alignment (EA) aims to find the equivalent entities between cross-lingual KGs, which is a crucial step for integrating KGs. Recently, many GNN-based EA methods have been proposed and show decent performance improvements on several public datasets. Meanwhile, existing GNN-based EA methods inevitably inherit poor interpretability and low efficiency from neural networks. Motivated by the isomorphic assumption of GNN-based methods, we successfully transform the cross-lingual EA problem into an assignment problem. Based on this finding, we propose a frustratingly Simple but Effective Unsupervised entity alignment method (SEU) without neural networks. Extensive experiments show that our proposed unsupervised method even beats advanced supervised methods across all public datasets and has high efficiency, interpretability, and stability.
    Towards Continual Entity Learning in Language Models for Conversational Agents. (arXiv:2108.00082v2 [cs.CL] UPDATED)
    (2 min) Neural language models (LM) trained on diverse corpora are known to work well on previously seen entities, however, updating these models with dynamically changing entities such as place names, song titles and shopping items requires re-training from scratch and collecting full sentences containing these entities. We aim to address this issue, by introducing entity-aware language models (EALM), where we integrate entity models trained on catalogues of entities into the pre-trained LMs. Our combined language model adaptively adds information from the entity models into the pre-trained LM depending on the sentence context. Our entity models can be updated independently of the pre-trained LM, enabling us to influence the distribution of entities output by the final LM, without any further training of the pre-trained LM. We show significant perplexity improvements on task-oriented dialogue datasets, especially on long-tailed utterances, with an ability to continually adapt to new entities (to an extent).
    Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing. (arXiv:2108.04990v2 [cs.CL] UPDATED)
    (2 min) Interpretability methods like Integrated Gradient and LIME are popular choices for explaining natural language model predictions with relative word importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stake areas like medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text. Via a small portion of word-level swaps, these adversarial perturbations aim to make the resulting text semantically and spatially similar to its seed input (therefore sharing similar interpretations). Simultaneously, the generated examples achieve the same prediction label as the seed yet are given a substantially different explanation by the interpretation methods. Our experiments generate fragile interpretations to attack two SOTA interpretation methods, across three popular Transformer models and on two different NLP datasets. We observe that the rank order correlation drops by over 20% when less than 10% of words are perturbed on average. Further, rank-order correlation keeps decreasing as more words get perturbed. Furthermore, we demonstrate that candidates generated from our method have good quality metrics.
    Topic Transferable Table Question Answering. (arXiv:2109.07377v1 [cs.CL])
    (2 min) Weakly-supervised table question-answering (TableQA) models have achieved state-of-the-art performance by using a pre-trained BERT transformer to jointly encode a question and a table and produce a structured query for the question. However, in practical settings, TableQA systems are deployed over table corpora having topic and word distributions quite distinct from BERT's pretraining corpus. In this work we simulate the practical topic shift scenario by designing novel challenge benchmarks WikiSQL-TS and WikiTQ-TS, consisting of train-dev-test splits in five distinct topic groups, based on the popular WikiSQL and WikiTableQuestions datasets. We empirically show that, despite pre-training on large open-domain text, performance of models degrades significantly when they are evaluated on unseen topics. In response, we propose T3QA (Topic Transferable Table Question Answering), a pragmatic adaptation framework for TableQA comprising: (1) topic-specific vocabulary injection into BERT, (2) a novel natural language question generation pipeline based on a text-to-text transformer generator (such as T5 or GPT2), focused on generating topic-specific training data, and (3) a logical form reranker. We show that T3QA provides a reasonably good baseline for our topic shift benchmarks. We believe our topic split benchmarks will lead to robust TableQA solutions that are better suited for practical deployment.
    RankNAS: Efficient Neural Architecture Search by Pairwise Ranking. (arXiv:2109.07383v1 [cs.CL])
    (2 min) This paper addresses the efficiency challenge of Neural Architecture Search (NAS) by formulating the task as a ranking problem. Previous methods require numerous training examples to estimate the accurate performance of architectures, although the actual goal is to find the distinction between "good" and "bad" candidates. Here we do not resort to performance predictors. Instead, we propose a performance ranking method (RankNAS) via pairwise ranking. It enables efficient architecture search using much fewer training examples. Moreover, we develop an architecture selection method to prune the search space and concentrate on more promising candidates. Extensive experiments on machine translation and language modeling tasks show that RankNAS can design high-performance architectures while being orders of magnitude faster than state-of-the-art NAS systems.
    Introducing an Abusive Language Classification Framework for Telegram to Investigate the German Hater Community. (arXiv:2109.07346v1 [cs.CL])
    (2 min) Since traditional social media platforms ban more and more actors that distribute hate speech or other forms of abusive language (deplatforming), these actors migrate to alternative platforms that do not moderate the users' content. One known platform that is relevant for the German hater community is Telegram, for which only limited research efforts have been made so far. The goal of this study is to develop a broad framework that consists of (i) an abusive language classification model for German Telegram messages and (ii) a classification model for the hatefulness of Telegram channels. For the first part, we employ existing abusive language datasets containing posts from other platforms to build our classification models. For the channel classification model, we develop a method that combines channel-specific content information coming from a topic model with a social graph to predict the hatefulness of channels. Furthermore, we complement these two approaches for hate speech detection with insightful results on the evolution of the hater community on Telegram in Germany. Moreover, we propose methods for scalable network analyses of social media platforms to the hate speech research community. As an additional output of the study, we release an annotated abusive language dataset containing 1,149 annotated Telegram messages.
    See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization. (arXiv:2105.09601v2 [cs.LG] UPDATED)
    (2 min) In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summaries as the ground truth, and therefore perform poorly on lengthy videos and long ground-truth summaries. Additionally, there exists no benchmark dataset for generalizing this task to videos of varying lengths. In this paper, we introduce AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations at well-known academic conferences such as NDSS, ICML, and NeurIPS. We use the abstracts of the corresponding research papers as the reference summaries, which ensures adequate quality and uniformity of the ground truth. We then propose FLORAL, a factorized multi-modal Transformer based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task. FLORAL utilizes an increasing number of self-attentions to capture multimodality and performs significantly better than traditional encoder-decoder based networks. Extensive experiments illustrate that FLORAL achieves significant improvement over the baselines in both qualitative and quantitative evaluations on the existing How2 dataset for short videos and the newly introduced AVIATE dataset for videos with diverse duration, beating the best baseline on the two datasets by $1.39$ and $2.74$ ROUGE-L points respectively.
    Span Fine-tuning for Pre-trained Language Models. (arXiv:2108.12848v2 [cs.CL] UPDATED)
    (2 min) Pre-trained language models (PrLMs) have to carefully manage input units when training on very large texts with vocabularies consisting of millions of words. Previous works have shown that incorporating span-level information over consecutive words in pre-training could further improve the performance of PrLMs. However, given that span-level clues are introduced and fixed in pre-training, previous methods are time-consuming and lack flexibility. To alleviate this inconvenience, this paper presents a novel span fine-tuning method for PrLMs, which allows the span setting to be adaptively determined by specific downstream tasks during the fine-tuning phase. In detail, any sentence processed by the PrLM is segmented into multiple spans according to a pre-sampled dictionary. The segmentation information is then passed through a hierarchical CNN module together with the representation outputs of the PrLM to ultimately generate a span-enhanced representation. Experiments on the GLUE benchmark show that the proposed span fine-tuning method significantly enhances the PrLM and, at the same time, offers more flexibility in an efficient way.
    Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters. (arXiv:2105.06232v2 [cs.CL] UPDATED)
    (2 min) To diversify and enrich generated dialogue responses, knowledge-grounded dialogue has been investigated in recent years. Existing methods tackle the knowledge-grounding challenge by retrieving relevant sentences over a large corpus and augmenting the dialogues with explicit extra information. Despite their success, however, existing works have drawbacks in inference efficiency. This paper proposes KnowExpert, an end-to-end framework that bypasses the explicit retrieval process, injects knowledge into pre-trained language models with lightweight adapters, and adapts to the knowledge-grounded dialogue task. To the best of our knowledge, this is the first attempt to tackle this challenge without retrieval in an open-domain chit-chat scenario. The experimental results show that KnowExpert performs comparably with some retrieval-based baselines while being time-efficient at inference, demonstrating the potential of our proposed direction.
    $\infty$-former: Infinite Memory Transformer. (arXiv:2109.00301v2 [cs.CL] UPDATED)
    (2 min) Transformers struggle when attending to long contexts, since the amount of computation grows with the context length, and therefore they cannot model long-term memories effectively. Several variations have been proposed to alleviate this problem, but they all have a finite memory capacity, being forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length. Thus, it is able to model arbitrarily long contexts and maintain "sticky memories" while keeping a fixed computation budget. Experiments on a synthetic sorting task demonstrate the ability of the $\infty$-former to retain information from long sequences. We also perform experiments on language modeling, by training a model from scratch and by fine-tuning a pre-trained language model, which show benefits of unbounded long-term memories.
    Enhancing Clinical Information Extraction with Transferred Contextual Embeddings. (arXiv:2109.07243v1 [cs.CL])
    (2 min) The Bidirectional Encoder Representations from Transformers (BERT) model has achieved state-of-the-art performance for many natural language processing (NLP) tasks. Yet, limited research has studied its effectiveness when the target domain is shifted from the pre-training corpora, for example, for biomedical or clinical NLP applications. In this paper, we applied it to a widely studied hospital information extraction (IE) task and analyzed its performance under the transfer learning setting. Our application set a new state-of-the-art result by a clear margin, compared with a range of existing IE models. Specifically, on this nursing handover data set, the macro-average F1 score from our model was 0.438, whilst the previous best deep learning models had 0.416. In conclusion, we showed that BERT-based pre-training models can be transferred to health-related documents under mild conditions and with a proper fine-tuning process.
    Matching with Transformers in MELT. (arXiv:2109.07401v1 [cs.CL])
    (2 min) One of the strongest signals for automated matching of ontologies and knowledge graphs is the textual descriptions of the concepts. The methods that are typically applied (such as character- or token-based comparisons) are relatively simple and therefore do not capture the actual meaning of the texts. With the rise of transformer-based language models, text comparison based on meaning (rather than lexical features) is possible. In this paper, we model the ontology matching task as a classification problem and present approaches based on transformer models. We further provide an easy-to-use implementation in the MELT framework, which is suited for ontology and knowledge graph matching. We show that a transformer-based filter helps to choose the correct correspondences given a high-recall alignment and already achieves a good result with simple alignment post-processing methods.
    Unsupervised Keyphrase Extraction by Jointly Modeling Local and Global Context. (arXiv:2109.07293v1 [cs.CL])
    (2 min) Embedding-based methods are widely used for unsupervised keyphrase extraction (UKE) tasks. Generally, these methods simply calculate similarities between phrase embeddings and the document embedding, which is insufficient to capture the different contexts needed for a more effective UKE model. In this paper, we propose a novel method for UKE where local and global contexts are jointly modeled. From a global view, we calculate the similarity between a certain phrase and the whole document in the vector space, as traditional embedding-based models do. In terms of the local view, we first build a graph structure over the document where phrases are regarded as vertices and the edges are similarities between vertices. Then, we propose a new centrality computation method to capture local salient information based on the graph structure. Finally, we combine the global and local context modeling for ranking. We evaluate our models on three public benchmarks (Inspec, DUC 2001, SemEval 2010) and compare with existing state-of-the-art models. The results show that our model outperforms most models while generalizing better to input documents of different domains and lengths. An additional ablation study shows that both the local and global information are crucial for unsupervised keyphrase extraction.
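    A minimal sketch of the global-plus-local scoring idea, assuming phrase and document embeddings are already available; the weighted degree centrality and the linear combination rule below are simplifications for illustration, not the paper's exact centrality computation.

        import numpy as np

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

        def rank_phrases(phrase_vecs, doc_vec, alpha=0.5):
            # Global view: similarity of each candidate phrase to the whole document.
            global_score = np.array([cosine(v, doc_vec) for v in phrase_vecs])
            # Local view: phrases are vertices, edges weighted by pairwise similarity.
            n = len(phrase_vecs)
            sim = np.array([[cosine(phrase_vecs[i], phrase_vecs[j]) for j in range(n)]
                            for i in range(n)])
            np.fill_diagonal(sim, 0.0)
            local_score = sim.sum(axis=1) / max(n - 1, 1)   # weighted degree centrality
            return alpha * global_score + (1 - alpha) * local_score

        # toy 4-dimensional embeddings for three candidate phrases and the document
        phrases = [np.array([1., 0., 0., 1.]), np.array([0., 1., 1., 0.]), np.array([1., 1., 0., 0.])]
        doc = np.array([1., 0.5, 0., 1.])
        print(np.argsort(-rank_phrases(phrases, doc)))   # indices of top-ranked keyphrases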
    PTR: Prompt Tuning with Rules for Text Classification. (arXiv:2105.11259v3 [cs.CL] UPDATED)
    (2 min) Fine-tuned pre-trained language models (PLMs) have achieved strong performance on almost all NLP tasks. By using additional prompts to fine-tune PLMs, we can further stimulate the rich knowledge distributed in PLMs to better serve downstream tasks. Prompt tuning has achieved promising results on some few-class classification tasks such as sentiment classification and natural language inference. However, manually designing many language prompts is cumbersome and error-prone, and for auto-generated prompts it is expensive and time-consuming to verify their effectiveness in non-few-shot scenarios. Hence, it is still challenging for prompt tuning to address many-class classification tasks. To this end, we propose prompt tuning with rules (PTR) for many-class text classification and apply logic rules to construct prompts from several sub-prompts. In this way, PTR is able to encode prior knowledge about each class into prompt tuning. We conduct experiments on relation classification, a typical and complicated many-class classification task, and the results show that PTR can significantly and consistently outperform existing state-of-the-art baselines. This indicates that PTR is a promising approach for exploiting both human prior knowledge and PLMs for complicated classification tasks.
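    To make the sub-prompt composition concrete, here is a toy sketch of how a logic rule over entity types and a relation might be rendered into a single prompt with several [MASK] slots; the template wording and the example rule are invented for illustration and are not the paper's templates.

        # Illustrative composition of sub-prompts for relation classification.
        SUBJ_PROMPT = "the [MASK] {subj}"          # entity-type sub-prompt for the subject
        OBJ_PROMPT  = "the [MASK] {obj}"           # entity-type sub-prompt for the object
        REL_PROMPT  = "{subj} [MASK] {obj}"        # relational sub-prompt between them

        def build_prompt(sentence, subj, obj):
            # A rule such as  person(x) AND organization(y) AND works_for(x, y)
            # is rendered by conjoining the corresponding sub-prompts.
            sub_prompts = [
                SUBJ_PROMPT.format(subj=subj),
                REL_PROMPT.format(subj=subj, obj=obj),
                OBJ_PROMPT.format(obj=obj),
            ]
            return sentence + " " + " , ".join(sub_prompts) + " ."

        print(build_prompt("Tim Cook announced the new iPhone.", "Tim Cook", "Apple"))

    Each [MASK] is then filled by a masked language model, and the label words predicted for all masks jointly determine the relation, which is how the rule constrains the many-class decision.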
    BERT is Robust! A Case Against Synonym-Based Adversarial Examples in Text Classification. (arXiv:2109.07403v1 [cs.CL])
    (2 min) Deep Neural Networks have taken Natural Language Processing by storm. While this led to incredible improvements across many tasks, it also initiated a new research field, questioning the robustness of these neural networks by attacking them. In this paper, we investigate four word substitution-based attacks on BERT. We combine a human evaluation of individual word substitutions and a probabilistic analysis to show that between 96% and 99% of the analyzed attacks do not preserve semantics, indicating that their success is mainly based on feeding poor data to the model. To further confirm that, we introduce an efficient data augmentation procedure and show that many adversarial examples can be prevented by including data similar to the attacks during training. An additional post-processing step reduces the success rates of state-of-the-art attacks below 5%. Finally, by looking at more reasonable thresholds on constraints for word substitutions, we conclude that BERT is a lot more robust than research on attacks suggests.
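    A hedged sketch of the kind of substitution-based data augmentation described above: training sentences are perturbed with the same sort of synonym swaps that the attacks rely on. The synonym table and substitution rate are placeholders; in practice candidates would come from the attacks' own substitution source.

        import random

        # Toy synonym table standing in for the attack's substitution candidates.
        SYNONYMS = {
            "good": ["fine", "decent"],
            "movie": ["film", "picture"],
            "terrible": ["awful", "dreadful"],
        }

        def augment(sentence, rate=0.3, seed=0):
            # Replace a fraction of words with synonyms to mimic attack-time noise.
            rng = random.Random(seed)
            out = []
            for tok in sentence.split():
                key = tok.lower()
                if key in SYNONYMS and rng.random() < rate:
                    out.append(rng.choice(SYNONYMS[key]))
                else:
                    out.append(tok)
            return " ".join(out)

        train_sentence = "a good movie with a terrible ending"
        augmented = [augment(train_sentence, seed=s) for s in range(3)]
        print(augmented)   # add these to the training set alongside the original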
    A Three Step Training Approach with Data Augmentation for Morphological Inflection. (arXiv:2109.07006v1 [cs.CL])
    (2 min) We present the BME submission for the SIGMORPHON 2021 Task 0 Part 1, Generalization Across Typologically Diverse Languages shared task. We use an LSTM encoder-decoder model with three-step training: the model is first trained on all languages, then fine-tuned on each language family, and finally fine-tuned on individual languages. We use a different type of data augmentation technique in the first two steps. Our system outperformed the only other submission. Although it remains worse than the Transformer baseline released by the organizers, our model is simpler and our data augmentation techniques are easily applicable to new languages. We perform ablation studies and show that the augmentation techniques and the three training steps often help but sometimes have a negative effect.
    On the Limits of Minimal Pairs in Contrastive Evaluation. (arXiv:2109.07465v1 [cs.CL])
    (2 min) Minimal sentence pairs are frequently used to analyze the behavior of language models. It is often assumed that model behavior on contrastive pairs is predictive of model behavior at large. We argue that two conditions are necessary for this assumption to hold: First, a tested hypothesis should be well-motivated, since experiments show that contrastive evaluation can lead to false positives. Secondly, test data should be chosen such as to minimize distributional discrepancy between evaluation time and deployment time. For a good approximation of deployment-time decoding, we recommend that minimal pairs are created based on machine-generated text, as opposed to human-written references. We present a contrastive evaluation suite for English-German MT that implements this recommendation.
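    The sketch below illustrates contrastive scoring on a single minimal pair, using a general-purpose language model as a stand-in for the system under test (the paper targets MT models scoring translation pairs): it simply checks which variant receives the higher log-probability.

        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tok = GPT2TokenizerFast.from_pretrained("gpt2")
        lm = GPT2LMHeadModel.from_pretrained("gpt2")

        def log_prob(sentence):
            ids = tok(sentence, return_tensors="pt").input_ids
            with torch.no_grad():
                out = lm(ids, labels=ids)
            # loss is the mean negative log-likelihood per predicted token; undo the averaging
            return -out.loss.item() * (ids.size(1) - 1)

        correct  = "The keys to the cabinet are on the table."
        contrast = "The keys to the cabinet is on the table."
        print("model prefers correct variant:", log_prob(correct) > log_prob(contrast))

    The paper's caveat applies directly to this setup: a preference on such pairs only transfers to deployment if the tested hypothesis is well-motivated and the pairs resemble what the model actually generates.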
    Can Language Models be Biomedical Knowledge Bases?. (arXiv:2109.07154v1 [cs.CL])
    (2 min) Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract that knowledge, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, there has been little attention to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which is comprised of 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed probing methods can achieve up to 18.51% Acc@5 on retrieving biomedical knowledge. Although this seems promising given the task difficulty, our detailed analyses reveal that most predictions are highly correlated with prompt templates without any subjects, hence producing similar results on each relation and hindering their capabilities to be used as domain-specific KBs. We hope that BioLAMA can serve as a challenging benchmark for biomedical factual probing.
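    A minimal sketch of the probing setup, using a general-domain masked LM and a hand-written template purely for illustration; BioLAMA itself probes biomedical LMs with curated prompts and computes Acc@k over its 49K triples.

        from transformers import pipeline

        # General-domain masked LM used only to show the mechanics of factual probing.
        probe = pipeline("fill-mask", model="bert-base-uncased")

        triple = {"subject": "aspirin", "relation": "may treat", "object": "headache"}
        prompt = f"{triple['subject']} {triple['relation']} [MASK]."

        predictions = [p["token_str"].strip() for p in probe(prompt, top_k=5)]
        acc_at_5 = int(triple["object"] in predictions)
        print(predictions, "Acc@5 =", acc_at_5)

    The paper's warning is worth repeating here: if the predictions barely change when the subject is removed from the prompt, high Acc@k reflects the template rather than stored knowledge.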
    Adversarial Regularization as Stackelberg Game: An Unrolled Optimization Approach. (arXiv:2104.04886v2 [cs.LG] UPDATED)
    (2 min) Adversarial regularization has been shown to improve the generalization performance of deep learning models in various natural language processing tasks. Existing works usually formulate the method as a zero-sum game, which is solved by alternating gradient descent/ascent algorithms. Such a formulation treats the adversarial and the defending players equally, which is undesirable because only the defending player contributes to the generalization performance. To address this issue, we propose Stackelberg Adversarial Regularization (SALT), which formulates adversarial regularization as a Stackelberg game. This formulation induces a competition between a leader and a follower, where the follower generates perturbations, and the leader trains the model subject to the perturbations. Different from conventional approaches, in SALT, the leader is in an advantageous position. When the leader moves, it recognizes the strategy of the follower and takes the anticipated follower's outcomes into consideration. Such a leader's advantage enables us to improve the model fitting to the unperturbed data. The leader's strategic information is captured by the Stackelberg gradient, which is obtained using an unrolling algorithm. Our experimental results on a set of machine translation and natural language understanding tasks show that SALT outperforms existing adversarial regularization baselines across all tasks. Our code is publicly available.
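    The following sketch shows how an unrolled, differentiable follower step yields a Stackelberg gradient for the leader. It perturbs the raw inputs of a toy classifier rather than the embeddings of an NLP model, and the step sizes and weights are placeholder values, so it illustrates the mechanism rather than the SALT recipe.

        import torch
        import torch.nn as nn

        # Toy classifier standing in for the task model; dimensions are illustrative.
        model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
        loss_fn = nn.CrossEntropyLoss()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)

        def salt_loss(x, y, step_size=0.05, reg_weight=1.0):
            # Follower: one gradient-ascent step on an input perturbation.
            # create_graph=True keeps this step differentiable, so the leader's
            # gradient (the Stackelberg gradient) flows through the follower's move.
            delta = torch.zeros_like(x, requires_grad=True)
            inner = loss_fn(model(x + delta), y)
            grad, = torch.autograd.grad(inner, delta, create_graph=True)
            delta_star = step_size * grad / (grad.norm() + 1e-12)
            # Leader: fit the clean data while anticipating the follower's perturbation.
            return loss_fn(model(x), y) + reg_weight * loss_fn(model(x + delta_star), y)

        x = torch.randn(32, 8)
        y = torch.randint(0, 2, (32,))
        for _ in range(10):
            loss = salt_loss(x, y)
            opt.zero_grad(); loss.backward(); opt.step()

    The asymmetry described in the abstract is encoded by the detach-free unrolled step: the leader's update sees how its own parameters shape the follower's best response, which a plain alternating min-max update would ignore.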
    Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing. (arXiv:2104.07275v3 [cs.CL] UPDATED)
    (2 min) An effective recipe for building seq2seq, non-autoregressive, task-oriented parsers that map utterances to semantic frames proceeds in three steps: encoding an utterance $x$, predicting a frame's length $|y|$, and decoding a $|y|$-sized frame with utterance and ontology tokens. Though empirically strong, these models are typically bottlenecked by length prediction, as even small inaccuracies change the syntactic and semantic characteristics of the resulting frames. In our work, we propose span pointer networks, non-autoregressive parsers which shift the decoding task from text generation to span prediction; that is, when imputing utterance spans into frame slots, our model produces endpoints (e.g., [i, j]) as opposed to text (e.g., "6pm"). This natural quantization of the output space reduces the variability of gold frames, therefore improving length prediction and, ultimately, exact match. Furthermore, length prediction is now responsible for frame syntax and the decoder is responsible for frame semantics, resulting in a coarse-to-fine model. We evaluate our approach on several task-oriented semantic parsing datasets. Notably, we bridge the quality gap between non-autoregressive and autoregressive parsers, achieving 87 EM on TOPv2 (Chen et al. 2020). Furthermore, due to our more consistent gold frames, we show strong improvements in model generalization in both cross-domain and cross-lingual transfer in low-resource settings. Finally, due to our diminished output vocabulary, we observe a 70% reduction in latency and an 83% reduction in memory at beam size 5 compared to prior non-autoregressive parsers.
    Infusing Multi-Source Knowledge with Heterogeneous Graph Neural Network for Emotional Conversation Generation. (arXiv:2012.04882v2 [cs.CL] UPDATED)
    (2 min) The success of emotional conversation systems depends on sufficient perception and appropriate expression of emotions. In a real-world conversation, we first instinctively perceive emotions from multi-source information, including the emotion flow of the dialogue history, facial expressions, and the personalities of speakers, and then express suitable emotions according to our personalities; however, these multiple types of information are insufficiently exploited in emotional conversation research. To address this issue, we propose a heterogeneous graph-based model for emotional conversation generation. Specifically, we design a Heterogeneous Graph-Based Encoder to represent the conversation content (i.e., the dialogue history, its emotion flow, facial expressions, and speakers' personalities) with a heterogeneous graph neural network, and then predict suitable emotions for feedback. After that, we employ an Emotion-Personality-Aware Decoder to generate a response that is not only relevant to the conversation context but also carries appropriate emotions, by taking the encoded graph representations, the predicted emotions from the encoder, and the personality of the current speaker as inputs. Experimental results show that our model can effectively perceive emotions from multi-source knowledge and generate a satisfactory response, significantly outperforming previous state-of-the-art models.
    Robust Open-Vocabulary Translation from Visual Text Representations. (arXiv:2104.08211v2 [cs.CL] UPDATED)
    (2 min) Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an 'open vocabulary.' This approach relies on consistent and correct underlying unicode sequences, and makes models susceptible to degradation from common types of noise and variation. Motivated by the robustness of human language processing, we propose the use of visual text representations, which dispense with a finite set of text embeddings in favor of continuous vocabularies created by processing visually rendered text with sliding windows. We show that models using visual text representations approach or match performance of traditional text models on small and larger datasets. More importantly, models with visual embeddings demonstrate significant robustness to varied types of noise, achieving e.g., 25.9 BLEU on a character permuted German-English task where subword models degrade to 1.9.
    CLIPScore: A Reference-free Evaluation Metric for Image Captioning. (arXiv:2104.08718v2 [cs.CV] UPDATED)
    (2 min) Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
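    A sketch of reference-free scoring with CLIP via Hugging Face transformers: embed the image and the candidate caption and rescale their cosine similarity. The rescaling constant and the blank test image are assumptions made only to keep the example self-contained and runnable.

        import torch
        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        def clip_score(image, caption, w=2.5):
            # w rescales cosine similarities into a more readable range; treat the
            # exact value here as an assumption rather than the paper's released constant.
            inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
            with torch.no_grad():
                img = model.get_image_features(pixel_values=inputs["pixel_values"])
                txt = model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"])
            cos = torch.cosine_similarity(img, txt).item()
            return w * max(cos, 0.0)

        # A blank image just to make the snippet run end-to-end.
        print(clip_score(Image.new("RGB", (224, 224)), "a photo of a dog"))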
    Miðeind's WMT 2021 submission. (arXiv:2109.07343v1 [cs.CL])
    (2 min) We present Miðeind's submission for the English$\to$Icelandic and Icelandic$\to$English subsets of the 2021 WMT news translation task. Transformer-base models are trained for translation on parallel data to generate backtranslations iteratively. A pretrained mBART-25 model is then adapted for translation using parallel data as well as the last backtranslation iteration. This adapted pretrained model is then used to re-generate backtranslations, and the training of the adapted model is continued.
    The ELITR ECA Corpus. (arXiv:2109.07351v1 [cs.CL])
    (2 min) We present the ELITR ECA corpus, a multilingual corpus derived from publications of the European Court of Auditors. We use automatic translation together with Bleualign to identify parallel sentence pairs in all 506 translation directions. The result is a corpus comprising 264k document pairs and 41.9M sentence pairs.
    LegaLMFiT: Efficient Short Legal Text Classification with LSTM Language Model Pre-Training. (arXiv:2109.00993v3 [cs.CL] UPDATED)
    (2 min) Large Transformer-based language models such as BERT have led to broad performance improvements on many NLP tasks. Domain-specific variants of these models have demonstrated excellent performance on a variety of specialised tasks. In legal NLP, BERT-based models have led to new state-of-the-art results on multiple tasks. The exploration of these models has demonstrated the importance of capturing the specificity of the legal language and its vocabulary. However, such approaches suffer from high computational costs, leading to a higher ecological impact and lower accessibility. Our findings, focusing on English language legal text, show that lightweight LSTM-based Language Models are able to capture enough information from a small legal text pretraining corpus and achieve excellent performance on short legal text classification tasks. This is achieved with a significantly reduced computational overhead compared to BERT-based models. However, our method also shows degraded performance on a more complex task, multi-label classification of longer documents, highlighting the limitations of this lightweight approach.
    Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better Language Models for Code Understanding. (arXiv:2109.07452v1 [cs.CL])
    (2 min) Code understanding is an increasingly important application of Artificial Intelligence. A fundamental aspect of understanding code is understanding text about code, e.g., documentation and forum discussions. Pre-trained language models (e.g., BERT) are a popular approach for various NLP tasks, and there are now a variety of benchmarks, such as GLUE, to help improve the development of such models for natural language understanding. However, little is known about how well such models work on textual artifacts about code, and we are unaware of any systematic set of downstream tasks for such an evaluation. In this paper, we derive a set of benchmarks (BLANCA - Benchmarks for LANguage models on Coding Artifacts) that assess code understanding based on tasks such as predicting the best answer to a question in a forum post, finding related forum posts, or predicting classes related in a hierarchy from class documentation. We evaluate the performance of current state-of-the-art language models on these tasks and show that there is a significant improvement on each task from fine-tuning. We also show that multi-task training over BLANCA tasks helps build better language models for code understanding.
    Challenges in Detoxifying Language Models. (arXiv:2109.07445v1 [cs.CL])
    (2 min) Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions -- highlighting further the nuances involved in careful evaluation of LM toxicity.
    ARCH: Efficient Adversarial Regularized Training with Caching. (arXiv:2109.07048v1 [cs.CL])
    (0 min) Adversarial regularization can improve model generalization in many natural language processing tasks. However, conventional approaches are computationally expensive since they need to generate a perturbation for each sample in each epoch. We propose a new adversarial regularization method ARCH (adversarial regularization with caching), where perturbations are generated and cached once every several epochs. As caching all the perturbations imposes memory usage concerns, we adopt a K-nearest neighbors-based strategy to tackle this issue. The strategy only requires caching a small amount of perturbations, without introducing additional training time. We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks. We observe that ARCH significantly eases the computational burden (saves up to 70% of computational time in comparison with conventional approaches). More surprisingly, by reducing the variance of stochastic gradients, ARCH produces a notably better (in most of the tasks) or comparable model generalization. Our code is publicly available.
    Explainable Identification of Dementia from Transcripts using Transformer Networks. (arXiv:2109.06980v1 [cs.CL])
    (0 min) Alzheimer's disease (AD) is the main cause of dementia; it is accompanied by loss of memory and may lead to severe consequences in people's everyday life if not diagnosed in time. Very few works have exploited transformer-based networks, and despite the high accuracy achieved, little work has been done in terms of model interpretability. In addition, although Mini-Mental State Exam (MMSE) scores are inextricably linked with the identification of dementia, prior work treats dementia identification and MMSE score prediction as two separate tasks. To address these limitations, we employ several transformer-based models, with BERT achieving the highest accuracy of 85.56%. Concurrently, we propose an interpretable method to detect AD patients based on siamese networks, reaching an accuracy of up to 81.18%. Next, we introduce two multi-task learning models, where the main task is the identification of dementia (binary classification), while the auxiliary task is the identification of the severity of dementia (multiclass classification). Our model obtains an accuracy of 84.99% on the detection of AD patients in the multi-task learning setting. Finally, we present new methods to identify the linguistic patterns used by AD patients and non-AD ones, including text statistics, vocabulary uniqueness, word usage, correlations via a detailed linguistic analysis, and explainability techniques (LIME). Findings indicate significant differences in language between AD and non-AD patients.
    Written Justifications are Key to Aggregate Crowdsourced Forecasts. (arXiv:2109.07017v1 [cs.CL])
    (0 min) This paper demonstrates that aggregating crowdsourced forecasts benefits from modeling the written justifications provided by forecasters. Our experiments show that the majority and weighted vote baselines are competitive, and that the written justifications are beneficial to call a question throughout its life except in the last quarter. We also conduct an error analysis shedding light into the characteristics that make a justification unreliable.
    UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation. (arXiv:2109.07368v1 [cs.CL])
    (0 min) This paper presents a unified end-to-end framework for both streaming and non-streaming speech translation. While the training recipes for non-streaming speech translation are mature, the recipes for streaming speech translation are yet to be built. In this work, we focus on developing a unified model (UniST) which supports streaming and non-streaming ST from the perspective of fundamental components, including the training objective, attention mechanism, and decoding policy. Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST, and a better-learned trade-off between BLEU score and latency metrics for streaming ST, compared with end-to-end baselines and cascaded models. We will make our code and evaluation tools publicly available.
    Low-Resource Named Entity Recognition Based on Multi-hop Dependency Trigger. (arXiv:2109.07118v1 [cs.CL])
    (0 min) This paper presents a simple and effective approach to low-resource named entity recognition (NER) based on multi-hop dependency triggers. Dependency triggers refer to salient nodes, relative to an entity, in the dependency graph of a context sentence. Our main observation is that there often exist triggers which play an important role in recognizing the location and type of an entity in a sentence. Previous research has relied on manual labelling of triggers; our main contribution is to propose using a syntactic parser to annotate triggers automatically. Experiments on two English datasets (CoNLL 2003 and BC5CDR) show that the proposed method is comparable to the previous trigger-based NER model.
    Embedding Convolutions for Short Text Extreme Classification with Millions of Labels. (arXiv:2109.07319v1 [cs.CL])
    (0 min) Automatic annotation of short-text data to a large number of target labels, referred to as Short Text Extreme Classification, has recently found numerous applications in prediction of related searches and product recommendation tasks. The conventional usage of Convolutional Neural Network (CNN) to capture n-grams in text-classification relies heavily on uniformity in word-ordering and the presence of long input sequences to convolve over. However, this is missing in short and unstructured text sequences encountered in search and recommendation. In order to tackle this, we propose an orthogonal approach by recasting the convolution operation to capture coupled semantics along the embedding dimensions, and develop a word-order agnostic embedding enhancement module to deal with the lack of structure in such queries. Benefitting from the computational efficiency of the convolution operation, Embedding Convolutions, when applied on the enriched word embeddings, result in a light-weight and yet powerful encoder (InceptionXML) that is robust to the inherent lack of structure in short-text extreme classification. Towards scaling our model to problems with millions of labels, we also propose InceptionXML+, which addresses the shortcomings of the dynamic hard-negative mining framework in the recently proposed LightXML by improving the alignment between the label-shortlister and extreme classifier. On popular benchmark datasets, we empirically demonstrate that the proposed method outperforms state-of-the-art deep extreme classifiers such as Astec by an average of 5% and 8% on the P@k and propensity-scored PSP@k metrics respectively.
    AnnIE: An Annotation Platform for Constructing Complete Open Information Extraction Benchmark. (arXiv:2109.07464v1 [cs.CL])
    (0 min) Open Information Extraction (OIE) is the task of extracting facts from sentences in the form of relations and their corresponding arguments in a schema-free manner. The intrinsic performance of OIE systems is difficult to measure due to the incompleteness of existing OIE benchmarks: the ground-truth extractions do not group all acceptable surface realizations of the same fact that can be extracted from a sentence. To measure the performance of OIE systems more realistically, it is necessary to manually annotate complete facts (i.e., clusters of all acceptable surface realizations of the same fact) from input sentences. We propose AnnIE: an interactive annotation platform that facilitates such challenging annotation tasks and supports the creation of complete fact-oriented OIE evaluation benchmarks. AnnIE is modular and flexible in order to support different use case scenarios (i.e., benchmarks covering different types of facts). We use AnnIE to build two complete OIE benchmarks: one with verb-mediated facts and another with facts encompassing named entities. Finally, we evaluate several OIE systems on our complete benchmarks created with AnnIE. Our results suggest that existing incomplete benchmarks are overly lenient, and that OIE systems are not as robust as previously reported. We publicly release AnnIE under a non-restrictive license.
    Negative Statements Considered Useful. (arXiv:2001.04425v5 [cs.IR] UPDATED)
    (0 min) Knowledge bases (KBs) about notable entities and their properties are an important asset in applications such as search, question answering and dialogue. All popular KBs capture virtually only positive statements, and abstain from taking any stance on statements not stored in the KB. This paper makes the case for explicitly stating salient statements that do not hold. Negative statements are useful to overcome limitations of question answering systems that are mainly geared for positive questions; they can also contribute to informative summaries of entities. Due to the abundance of such invalid statements, any effort to compile them needs to address ranking by saliency. We present a statistical inference method for compiling and ranking negative statements, based on expectations from positive statements of related entities in peer groups. Experimental results, with a variety of datasets, show that the method can effectively discover notable negative statements, and extrinsic studies underline their usefulness for entity summarization. Datasets and code are released as resources for further research.
    Attribute Alignment: Controlling Text Generation from Pre-trained Language Models. (arXiv:2103.11070v2 [cs.CL] UPDATED)
    (0 min) Large language models benefit from training with a large amount of unlabeled text, which gives them increasingly fluent and diverse generation capabilities. However, using these models for text generation that takes into account target attributes, such as sentiment polarity or specific topics, remains a challenge. We propose a simple and flexible method for controlling text generation by aligning disentangled attribute representations. In contrast to recent efforts on training a discriminator to perturb the token level distribution for an attribute, we use the same data to learn an alignment function to guide the pre-trained, non-controlled language model to generate texts with the target attribute without changing the original language model parameters. We evaluate our method on sentiment- and topic-controlled generation, and show large performance gains over previous methods while retaining fluency and diversity.
    The Unreasonable Effectiveness of the Baseline: Discussing SVMs in Legal Text Classification. (arXiv:2109.07234v1 [cs.CL])
    (0 min) We aim to highlight an interesting trend to contribute to the ongoing debate around advances within legal Natural Language Processing. Recently, the focus for most legal text classification tasks has shifted towards large pre-trained deep learning models such as BERT. In this paper, we show that a more traditional approach based on Support Vector Machine classifiers reaches competitive performance with deep learning models. We also highlight that error reduction obtained by using specialised BERT-based models over baselines is noticeably smaller in the legal domain when compared to general language tasks. We discuss some hypotheses for these results to support future discussions.
    Sequence Length is a Domain: Length-based Overfitting in Transformer Models. (arXiv:2109.07276v1 [cs.CL])
    (0 min) Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2-regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results which suggest that the issue might also be in the mismatch between the length distributions of the training and validation data combined with the aforementioned tendency of the neural networks to overfit to the training data. We demonstrate on a simple string editing task and a machine translation task that the Transformer model performance drops significantly when facing sequences of length diverging from the length distribution in the training data. Additionally, we show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
    Dialog speech sentiment classification for imbalanced datasets. (arXiv:2109.07228v1 [cs.CL])
    (0 min) Speech is the most common way humans express their feelings, and sentiment analysis is the use of tools such as natural language processing and computational algorithms to identify the polarity of these feelings. Even though this field has seen tremendous advancements in the last two decades, the task of effectively detecting underrepresented sentiments in different kinds of datasets is still challenging. In this paper, we use single- and bi-modal analysis of short dialog utterances and gain insights into the main factors that aid sentiment detection, particularly in the underrepresented classes, in datasets with and without an inherent sentiment component. Furthermore, we propose an architecture which uses a learning rate scheduler and different monitoring criteria and provides state-of-the-art results for the SWITCHBOARD imbalanced sentiment dataset.
    How Much do Lyrics Matter? Analysing Lyrical Simplicity Preferences for Individuals At Risk of Depression. (arXiv:2109.07227v1 [cs.CL])
    (2 min) Music affects and in some cases reflects one's emotional state. Key to this influence is lyrics and their meaning in conjunction with the acoustic properties of the track. Recent work has focused on analysing these acoustic properties and showing that individuals prone to depression primarily consume low valence and low energy music. However, no studies yet have explored lyrical content preferences in relation to online music consumption of such individuals. In the current study, we examine lyrical simplicity, measured as the Compressibility and Absolute Information Content of the text, associated with preferences of individuals at risk for depression. Using the six-month listening history of 541 Last.fm users, we compare lyrical simplicity trends for users grouped as being at risk (At-Risk) of depression from those that are not (No-Risk). Our findings reveal that At-Risk individuals prefer songs with greater information content (lower Compressibility) on average, especially for songs characterised as Sad. Furthermore, we found that At-Risk individuals also have greater variability of Absolute Information Content across their listening history. We discuss the results in light of existing socio-psychological lab-based research on music habits associated with depression and their relevance to naturally occurring online music listening behaviour.
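    As an illustration of the simplicity measure, a compression-based proxy can be computed in a few lines; the exact definitions of Compressibility and Absolute Information Content in the study may differ from this sketch.

        import zlib

        def compressibility(text: str) -> float:
            # Fraction of the raw byte length removed by compression:
            # a rough proxy for lyrical simplicity.
            raw = text.encode("utf-8")
            return 1.0 - len(zlib.compress(raw, 9)) / len(raw)

        repetitive = "la la la la la la la la la la la la"
        dense      = "Grief folds the evening into unfamiliar rooms of light"
        print(compressibility(repetitive), compressibility(dense))
        # Higher compressibility = simpler, more repetitive lyrics; lower
        # compressibility corresponds to higher information content.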
    Cross-lingual Transfer of Monolingual Models. (arXiv:2109.07348v1 [cs.CL])
    (2 min) Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform the native English model independently of the source language. After probing the English linguistic knowledge encoded in the representations before and after transfer, we find that semantic information is retained from the source language, while syntactic information is learned during transfer. Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.
    Data-to-text Generation by Splicing Together Nearest Neighbors. (arXiv:2101.08248v3 [cs.CL] UPDATED)
    (0 min) We propose to tackle data-to-text generation tasks by directly splicing together retrieved segments of text from "neighbor" source-target pairs. Unlike recent work that conditions on retrieved neighbors but generates text token-by-token, left-to-right, we learn a policy that directly manipulates segments of neighbor text, by inserting or replacing them in partially constructed generations. Standard techniques for training such a policy require an oracle derivation for each generation, and we prove that finding the shortest such derivation can be reduced to parsing under a particular weighted context-free grammar. We find that policies learned in this way perform on par with strong baselines in terms of automatic and human evaluation, but allow for more interpretable and controllable generation.
    Decision-Focused Summarization. (arXiv:2109.06896v1 [cs.CL])
    (0 min) Relevance in summarization is typically defined based on textual information alone, without incorporating insights about a particular decision. As a result, to support risk analysis of pancreatic cancer, summaries of medical notes may include irrelevant information such as a knee injury. We propose a novel problem, decision-focused summarization, where the goal is to summarize relevant information for a decision. We leverage a predictive model that makes the decision based on the full text to provide valuable insights on how a decision can be inferred from text. To build a summary, we then select representative sentences that lead to similar model decisions as using the full text while accounting for textual non-redundancy. To evaluate our method (DecSum), we build a testbed where the task is to summarize the first ten reviews of a restaurant in support of predicting its future rating on Yelp. DecSum substantially outperforms text-only summarization methods and model-based explanation methods in decision faithfulness and representativeness. We further demonstrate that DecSum is the only method that enables humans to outperform random chance in predicting which restaurant will be better rated in the future.
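    A simplified sketch of decision-focused selection: greedily add sentences whose combined prediction stays close to the full-text prediction, with a crude redundancy penalty. The toy rating model and the overlap proxy are stand-ins for the paper's components.

        import numpy as np

        def decsum(sentences, predict, budget=3, lam=0.1):
            # Greedy selection: keep the summary's predicted decision close to the
            # decision made from the full text, while discouraging redundancy.
            full = predict(" ".join(sentences))
            chosen = []
            while len(chosen) < budget and len(chosen) < len(sentences):
                best, best_cost = None, np.inf
                for s in sentences:
                    if s in chosen:
                        continue
                    faithfulness = abs(predict(" ".join(chosen + [s])) - full)
                    redundancy = sum(s in c or c in s for c in chosen)  # crude overlap proxy
                    cost = faithfulness + lam * redundancy
                    if cost < best_cost:
                        best, best_cost = s, cost
                chosen.append(best)
            return chosen

        # A toy "decision model": predicted rating grows with positive words.
        def toy_predict(text):
            return 3.0 + 0.5 * text.count("great") - 0.5 * text.count("slow")

        reviews = ["Service was slow.", "The pasta was great.", "Great wine list.", "Parking nearby."]
        print(decsum(reviews, toy_predict, budget=2))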
    Semantics of European poetry is shaped by conservative forces: The relationship between poetic meter and meaning in accentual-syllabic verse. (arXiv:2109.07148v1 [cs.CL])
    (0 min) Recent advances in cultural analytics and large-scale computational studies of art, literature and film often show that long-term change in the features of artistic works happens gradually. These findings suggest that conservative forces that shape creative domains might be underestimated. To this end, we provide the first large-scale formal evidence of the persistent association between poetic meter and semantics in 18-19th European literatures, using Czech, German and Russian collections with additional data from English poetry and early modern Dutch songs. Our study traces this association through a series of clustering experiments using the abstracted semantic features of 150,000 poems. With the aid of topic modeling we infer semantic features for individual poems. Texts were also lexically simplified across collections to increase generalizability and decrease the sparseness of word frequency distributions. Topics alone enable recognition of the meters in each observed language, as may be seen from highly robust clustering of same-meter samples (median Adjusted Rand Index between 0.48 and 1). In addition, this study shows that the strength of the association between form and meaning tends to decrease over time. This may reflect a shift in aesthetic conventions between the 18th and 19th centuries as individual innovation was increasingly favored in literature. Despite this decline, it remains possible to recognize semantics of the meters from past or future, which suggests the continuity of semantic traditions while also revealing the historical variability of conditions across languages. This paper argues that distinct metrical forms, which are often copied in a language over centuries, also maintain long-term semantic inertia in poetry. Our findings, thus, highlight the role of the formal features of cultural items in influencing the pace and shape of cultural evolution.
    Incorporating Residual and Normalization Layers into Analysis of Masked Language Models. (arXiv:2109.07152v1 [cs.CL])
    (0 min) Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only composed of the multi-head attention; other components can also contribute to Transformers' progressive performance. In this study, we extended the scope of the analysis of Transformers from solely the attention patterns to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect the performance. The codes of our experiments are publicly available.
    WikiGUM: Exhaustive Entity Linking for Wikification in 12 Genres. (arXiv:2109.07449v1 [cs.CL])
    (0 min) Previous work on Entity Linking has focused on resources targeting non-nested proper named entity mentions, often in data from Wikipedia, i.e. Wikification. In this paper, we present and evaluate WikiGUM, a fully wikified dataset, covering all mentions of named entities, including their non-named and pronominal mentions, as well as mentions nested within other mentions. The dataset covers a broad range of 12 written and spoken genres, most of which have not been included in Entity Linking efforts to date, leading to poor performance by a pretrained SOTA system in our evaluation. The availability of a variety of other annotations for the same data also enables further research on entities in context.
    Is "moby dick" a Whale or a Bird? Named Entities and Terminology in Speech Translation. (arXiv:2109.07439v1 [cs.CL])
    (0 min) Automatic translation systems are known to struggle with rare words. Among these, named entities (NEs) and domain-specific terms are crucial, since errors in their translation can lead to severe meaning distortions. Despite their importance, previous speech translation (ST) studies have neglected them, also due to the dearth of publicly available resources tailored to their specific evaluation. To fill this gap, we i) present the first systematic analysis of the behavior of state-of-the-art ST systems in translating NEs and terminology, and ii) release NEuRoparl-ST, a novel benchmark built from European Parliament speeches annotated with NEs and terminology. Our experiments on the three language directions covered by our benchmark (en->es/fr/it) show that ST systems correctly translate 75-80% of terms and 65-70% of NEs, with very low performance (37-40%) on person names.
    When Does Translation Require Context? A Data-driven, Multilingual Exploration. (arXiv:2109.07446v1 [cs.CL])
    (0 min) Although proper handling of discourse phenomena significantly contributes to the quality of machine translation (MT), common translation quality metrics do not adequately capture them. Recent works in context-aware MT attempt to target a small set of these phenomena during evaluation. In this paper, we propose a new metric, P-CXMI, which allows us to identify translations that require context systematically and confirm the difficulty of previously studied phenomena as well as uncover new ones that have not been addressed in previous work. We then develop the Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers for these phenomena in 14 different language pairs, which we use to evaluate context-aware MT. We find that state-of-the-art context-aware MT models find marginal improvements over context-agnostic models on our benchmark, which suggests current models do not handle these ambiguities effectively. We release code and data to invite the MT research community to increase efforts on context-aware translation on discourse phenomena and languages that are currently overlooked.
    Sentiment Analysis in Poems in Misurata Sub-dialect -- A Sentiment Detection in an Arabic Sub-dialect. (arXiv:2109.07203v1 [cs.CL])
    (2 min) Over the recent decades, there has been a significant increase and development of resources for Arabic natural language processing. This includes the task of exploring Arabic Language Sentiment Analysis (ALSA) from Arabic utterances in both Modern Standard Arabic (MSA) and different Arabic dialects. This study focuses on detecting sentiment in poems written in the Misurata Arabic sub-dialect spoken in Misurata, Libya. The tools used to detect sentiment in the dataset are Sklearn and the Mazajak sentiment tool. Logistic Regression, Random Forest, Naive Bayes (NB), and Support Vector Machine (SVM) classifiers are used with Sklearn, while a Convolutional Neural Network (CNN) is implemented with Mazajak. The results show that the traditional classifiers achieve higher accuracy than Mazajak, which is built on an algorithm that includes deep learning techniques. More research is suggested to analyze Arabic sub-dialect poetry in order to investigate the aspects that contribute to sentiments in these multi-line texts, for example, the use of figurative language such as metaphors.
    Disentangling Generative Factors in Natural Language with Discrete Variational Autoencoders. (arXiv:2109.07169v1 [cs.CL])
    (2 min) The ability of learning disentangled representations represents a major step for interpretable NLP systems as it allows latent linguistic features to be controlled. Most approaches to disentanglement rely on continuous variables, both for images and text. We argue that despite being suitable for image datasets, continuous variables may not be ideal to model features of textual data, due to the fact that most generative factors in text are discrete. We propose a Variational Autoencoder based method which models language features as discrete variables and encourages independence between variables for learning disentangled representations. The proposed model outperforms continuous and discrete baselines on several qualitative and quantitative benchmarks for disentanglement as well as on a text style transfer downstream application.
    Transformer-based Language Models for Factoid Question Answering at BioASQ9b. (arXiv:2109.07185v1 [cs.CL])
    (2 min) In this work, we describe our experiments and participating systems in the BioASQ Task 9b Phase B challenge of biomedical question answering. We have focused on finding the ideal answers and investigated multi-task fine-tuning and gradual unfreezing techniques on transformer-based language models. For factoid questions, our ALBERT-based systems ranked first in test batch 1 and fourth in test batch 2. Our DistilBERT systems outperformed the ALBERT variants in test batches 4 and 5 despite having 81% fewer parameters than ALBERT. However, we observed that gradual unfreezing had no significant impact on the model's accuracy compared to standard fine-tuning.
    Scope resolution of predicted negation cues: A two-step neural network-based approach. (arXiv:2109.07264v1 [cs.CL])
    (2 min) Neural network-based methods are the state of the art in negation scope resolution. However, they often rely on the unrealistic assumption that cue information is completely accurate. Even if this assumption holds, there remains a dependency on engineered features from state-of-the-art machine learning methods. The current study adopted a two-step negation resolving approach to assess whether a Bidirectional Long Short-Term Memory-based method can be used for cue detection as well, and how inaccurate cue predictions would affect scope resolution performance. Results suggest that this method is not suitable for negation detection. Scope resolution performance is most robust against inaccurate information for models with a recurrent layer only, compared to extensions with a Conditional Random Fields layer or a post-processing algorithm. We advocate for more research into the application of deep learning to negation detection and the effect of imperfect information on scope resolution.
    Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering. (arXiv:2109.07009v1 [cs.CL])
    (0 min) In this paper we propose a novel approach towards improving the efficiency of Question Answering (QA) systems by filtering out questions that will not be answered by them. This is based on an interesting new finding: the answer confidence scores of state-of-the-art QA systems can be approximated well by models solely using the input question text. This enables preemptive filtering of questions that are not answered by the system due to their answer confidence scores being lower than the system threshold. Specifically, we learn Transformer-based question models by distilling Transformer-based answering models. Our experiments on three popular QA datasets and one industrial QA benchmark demonstrate the ability of our question models to approximate the Precision/Recall curves of the target QA system well. These question models, when used as filters, can effectively trade off lower computation cost of QA systems for lower Recall, e.g., reducing computation by ~60%, while only losing ~3-4% of Recall.
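    To show the filtering logic, the sketch below trains a cheap question-only regressor on answer-confidence scores and uses a threshold to decide whether to invoke the QA system. A TF-IDF ridge model stands in for the distilled Transformer question model, and all data are invented.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import Ridge

        # Questions paired with the confidence scores the full QA system assigned to them.
        questions = ["who wrote hamlet", "capital of france", "asdf qwerty zzz", "when was rome founded"]
        answer_confidence = np.array([0.92, 0.95, 0.05, 0.88])

        vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
        X = vec.fit_transform(questions)
        question_model = Ridge(alpha=1.0).fit(X, answer_confidence)

        def should_answer(question, threshold=0.5):
            # Skip the expensive QA pipeline when the predicted confidence is too low.
            return float(question_model.predict(vec.transform([question]))[0]) >= threshold

        print(should_answer("capital of italy"), should_answer("zzzz qqqq"))

    The trade-off from the abstract lives in the threshold: raising it saves more computation but discards more answerable questions, lowering Recall.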
    Self-Training with Differentiable Teacher. (arXiv:2109.07049v1 [cs.CL])
    (0 min) Self-training achieves enormous success in various semi-supervised and weakly-supervised learning tasks. The method can be interpreted as a teacher-student framework, where the teacher generates pseudo-labels, and the student makes predictions. The two models are updated alternately. However, such a straightforward alternating update rule leads to training instability, because a small change in the teacher may result in a significant change in the student. To address this issue, we propose a differentiable self-training method that treats the teacher-student pair as a Stackelberg game. In this game, a leader is always in a more advantageous position than a follower. In self-training, the student contributes to the prediction performance, and the teacher controls the training process by generating pseudo-labels. Therefore, we treat the student as the leader and the teacher as the follower. The leader procures its advantage by acknowledging the follower's strategy, which involves differentiable pseudo-labels and differentiable sample weights. Consequently, the leader-follower interaction can be effectively captured via the Stackelberg gradient, obtained by differentiating the follower's strategy. Experimental results on semi- and weakly-supervised classification and named entity recognition tasks show that our model outperforms existing approaches by large margins.
    What Vision-Language Models `See' when they See Scenes. (arXiv:2109.07301v1 [cs.CL])
    (0 min) Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of both types with images. We compare 3 state-of-the-art models, VisualBERT, LXMERT and CLIP. We find that (i) V&L models are susceptible to stylistic biases acquired during pretraining; (ii) only CLIP performs consistently well on both object- and scene-level descriptions. A follow-up ablation study shows that CLIP uses object-level information in the visual modality to align with scene-level textual descriptions.
    A Relation-Oriented Clustering Method for Open Relation Extraction. (arXiv:2109.07205v1 [cs.CL])
    (2 min) The clustering-based unsupervised relation discovery method has gradually become one of the important methods of open relation extraction (OpenRE). However, high-dimensional vectors can encode complex linguistic information, so the derived clusters often do not explicitly align with the relational semantic classes. In this work, we propose a relation-oriented clustering model and use it to identify the novel relations in the unlabeled data. Specifically, to enable the model to learn to cluster relational data, our method leverages the readily available labeled data of pre-defined relations to learn a relation-oriented representation. We minimize the distance between instances with the same relation by gathering them towards their corresponding relation centroids, forming a cluster structure so that the learned representation is cluster-friendly. To reduce the clustering bias on predefined classes, we optimize the model by minimizing a joint objective on both labeled and unlabeled data. Experimental results show that our method reduces the error rate by 29.2% and 15.7% on two datasets, respectively, compared with current SOTA methods.
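    A rough PyTorch sketch of the centroid-gathering objective on the labeled portion of the data; the mean-squared distance and the mean-embedding centroids are simplifying assumptions, and the joint objective over labeled and unlabeled data is not shown:

        import torch
        import torch.nn.functional as F

        def compute_centroids(reps, labels, num_relations):
            # Mean embedding per pre-defined relation, used as gathering targets.
            sums = torch.zeros(num_relations, reps.size(1), device=reps.device)
            sums.index_add_(0, labels, reps)
            counts = torch.bincount(labels, minlength=num_relations).clamp(min=1)
            return sums / counts.unsqueeze(1)

        def relation_centroid_loss(reps, labels, centroids):
            # Pull each instance embedding toward the centroid of its relation
            # so that the learned representation becomes cluster-friendly.
            return F.mse_loss(reps, centroids[labels])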
    Multiagent Multimodal Categorization for Symbol Emergence: Emergent Communication via Interpersonal Cross-modal Inference. (arXiv:2109.07194v1 [cs.AI])
    (2 min) This paper describes a computational model of multiagent multimodal categorization that realizes emergent communication. We clarify whether the computational model can reproduce the following functions in a symbol emergence system, comprising two agents with different sensory modalities playing a naming game. (1) Function for forming a shared lexical system that comprises perceptual categories and corresponding signs, formed by agents through individual learning and semiotic communication between agents. (2) Function to improve the categorization accuracy in an agent via semiotic communication with another agent, even when some sensory modalities of each agent are missing. (3) Function that an agent infers unobserved sensory information based on a sign sampled from another agent in the same manner as cross-modal inference. We propose an interpersonal multimodal Dirichlet mixture (Inter-MDM), which is derived by dividing an integrative probabilistic generative model, which is obtained by integrating two Dirichlet mixtures (DMs). The Markov chain Monte Carlo algorithm realizes emergent communication. The experimental results demonstrated that Inter-MDM enables agents to form multimodal categories and appropriately share signs between agents. It is shown that emergent communication improves categorization accuracy, even when some sensory modalities are missing. Inter-MDM enables an agent to predict unobserved information based on a shared sign.
    Adversarial Mixing Policy for Relaxing Locally Linear Constraints in Mixup. (arXiv:2109.07177v1 [cs.CL])
    (2 min) Mixup is a recent regularizer for current deep classification networks. By training a neural network on convex combinations of pairs of examples and their labels, it imposes locally linear constraints on the model's input space. However, such strict linear constraints often lead to under-fitting, which degrades the effects of regularization. Notably, this issue becomes more serious when resources are extremely limited. To address these issues, we propose the Adversarial Mixing Policy (AMP), organized in a min-max-rand formulation, to relax the locally linear constraints in Mixup. Specifically, AMP adds a small adversarial perturbation to the mixing coefficients rather than the examples. Thus, slight non-linearity is injected in-between the synthetic examples and synthetic labels. By training on these data, the deep networks are further regularized, and thus achieve a lower predictive error rate. Experiments on five text classification benchmarks and five backbone models have empirically shown that our methods reduce the error rate over Mixup variants by a significant margin (up to 31.3%), especially in low-resource conditions (up to 17.5%).
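    A minimal sketch of perturbing the mixing coefficient rather than the examples, assuming a PyTorch classifier that operates on dense (embedded) inputs; the single sign-gradient step and the eps value are illustrative simplifications of the min-max-rand formulation:

        import torch
        import torch.nn.functional as F

        def amp_mixup_loss(model, x1, y1, x2, y2, alpha=0.2, eps=0.05):
            # Sample a mixup coefficient and make it differentiable.
            lam = torch.distributions.Beta(alpha, alpha).sample().to(x1.device)
            lam.requires_grad_(True)

            # Standard mixup loss; label weights are detached so the gradient
            # w.r.t. lam flows only through the mixed input.
            logits = model(lam * x1 + (1 - lam) * x2)
            loss = lam.detach() * F.cross_entropy(logits, y1) \
                 + (1 - lam.detach()) * F.cross_entropy(logits, y2)

            # One ascent step on the coefficient gives a slightly harder mixing ratio.
            grad, = torch.autograd.grad(loss, lam)
            lam_adv = (lam + eps * grad.sign()).clamp(0.0, 1.0).detach()

            # Re-mix the inputs with the perturbed coefficient while keeping the
            # label mixture at the original value, injecting slight non-linearity
            # between the synthetic examples and their synthetic labels.
            logits_adv = model(lam_adv * x1 + (1 - lam_adv) * x2)
            return lam.detach() * F.cross_entropy(logits_adv, y1) \
                 + (1 - lam.detach()) * F.cross_entropy(logits_adv, y2)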
    Learning Mathematical Properties of Integers. (arXiv:2109.07230v1 [cs.CL])
    (2 min) Embedding words in high-dimensional vector spaces has proven valuable in many natural language applications. In this work, we investigate whether similarly-trained embeddings of integers can capture concepts that are useful for mathematical applications. We probe the integer embeddings for mathematical knowledge, apply them to a set of numerical reasoning tasks, and show that by learning the representations from mathematical sequence data, we can substantially improve over number embeddings learned from English text corpora.
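    A toy sketch of the idea using gensim's skip-gram on a few hand-built integer sequences; the sequences and hyperparameters are illustrative only, and the paper trains on far larger mathematical sequence corpora:

        from gensim.models import Word2Vec

        sequences = [
            [str(n) for n in range(0, 200, 2)],                      # even numbers
            [str(n * n) for n in range(1, 100)],                     # squares
            [str(p) for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)],  # primes
        ]
        model = Word2Vec(sentences=sequences, vector_size=64, window=5,
                         min_count=1, sg=1, epochs=50)

        # Probe the embedding space: neighbours of "4" should lean towards other
        # numbers that co-occur with it in the same kinds of sequences.
        print(model.wv.most_similar("4", topn=5))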
    SWEAT: Scoring Polarization of Topics across Different Corpora. (arXiv:2109.07231v1 [cs.CL])
    (2 min) Understanding differences of viewpoints across corpora is a fundamental task for computational social sciences. In this paper, we propose the Sliced Word Embedding Association Test (SWEAT), a novel statistical measure to compute the relative polarization of a topical wordset across two distributional representations. To this end, SWEAT uses two additional wordsets, deemed to have opposite valence, to represent two different poles. We validate our approach and illustrate a case study to show the usefulness of the introduced measure.
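    A rough numpy sketch of a SWEAT-style comparison, assuming two corpus-specific embeddings given as plain word-to-vector dicts; the difference-of-associations form mirrors WEAT-like tests, while the paper's exact statistic and significance testing are omitted:

        import numpy as np

        def _cosine(u, V):
            return (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + 1e-12)

        def association(emb, word, pole_a, pole_b):
            # Mean similarity of `word` to pole A minus pole B within one embedding.
            A = np.stack([emb[w] for w in pole_a if w in emb])
            B = np.stack([emb[w] for w in pole_b if w in emb])
            return _cosine(emb[word], A).mean() - _cosine(emb[word], B).mean()

        def sweat_like_score(topic, pole_a, pole_b, emb1, emb2):
            # Relative polarization: how differently the two corpora associate the
            # topical wordset with the two opposite-valence pole wordsets.
            s1 = sum(association(emb1, w, pole_a, pole_b) for w in topic if w in emb1)
            s2 = sum(association(emb2, w, pole_a, pole_b) for w in topic if w in emb2)
            return s1 - s2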
    On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning. (arXiv:2109.06935v1 [cs.CL])
    (2 min) Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks -- POS tagging and natural language inference -- which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on 'unlearning' language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model's limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.
    Fast Extraction of Word Embedding from Q-contexts. (arXiv:2109.07084v1 [cs.CL])
    (2 min) The notion of word embedding plays a fundamental role in natural language processing (NLP). However, pre-training word embeddings for a very large-scale vocabulary is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) which are typical in the whole corpus (and their mutual information with words), one can construct high-quality word embeddings with negligible errors. Mutual information between contexts and words can be encoded canonically as a sampling state, thus Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective WEQ method, which is capable of extracting word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11-13 times faster than well-established methods. By comparing with well-known methods such as matrix factorization, word2vec, GloVe and fastText, we demonstrate that our method achieves comparable performance on a variety of downstream NLP tasks, while maintaining run-time and resource advantages over all these baselines.
    Towards Document-Level Paraphrase Generation with Sentence Rewriting and Reordering. (arXiv:2109.07095v1 [cs.CL])
    (2 min) Paraphrase generation is an important task in natural language processing. Previous works focus on sentence-level paraphrase generation, while ignoring document-level paraphrase generation, which is a more challenging and valuable task. In this paper, we explore the task of document-level paraphrase generation for the first time and focus on inter-sentence diversity by considering sentence rewriting and reordering. We propose CoRPG (Coherence Relationship guided Paraphrase Generation), which leverages a graph GRU to encode the coherence relationship graph and obtain a coherence-aware representation for each sentence; these representations can then be used to rearrange the multiple (possibly modified) input sentences. We create a pseudo document-level paraphrase dataset for training CoRPG. Automatic evaluation results show CoRPG outperforms several strong baseline models on BERTScore and diversity scores. Human evaluation also shows our model can generate document paraphrases with more diversity and better semantic preservation.
    The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders. (arXiv:2109.06939v1 [cs.CL])
    (0 min) Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency, while the question remains whether it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, and these heads interfere with one another as each task fine-tunes them for its own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.
    Efficient Explanations from Empirical Explainers. (arXiv:2103.15429v2 [cs.LG] UPDATED)
    (2 min) Amid a discussion about Green AI in which we see explainability neglected, we explore the possibility of efficiently approximating computationally expensive explainers. To this end, we propose feature attribution modelling with Empirical Explainers. Empirical Explainers learn from data to predict the attribution maps of expensive explainers. We train and test Empirical Explainers in the language domain and find that they model their expensive counterparts surprisingly well, at a fraction of the cost. They could thus mitigate the computational burden of neural explanations significantly, in applications that tolerate an approximation error.
    Regressive Ensemble for Machine Translation Quality Evaluation. (arXiv:2109.07242v1 [cs.CL])
    (2 min) This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop. In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that an ensemble approach is well-applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble's performance.
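    A minimal scikit-learn sketch of a regressive metric ensemble; the feature layout (one column per underlying metric), the choice of regressor, and the metric_scores_* / mqm_scores_* arrays are illustrative stand-ins rather than the paper's exact setup:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from scipy.stats import pearsonr

        # Rows are translated segments, columns are individual metric scores
        # (e.g. BLEU, chrF, BERTScore, ...); targets are human MQM judgements.
        X_train = np.asarray(metric_scores_train)
        y_train = np.asarray(mqm_scores_train)

        ensemble = GradientBoostingRegressor().fit(X_train, y_train)
        pred = ensemble.predict(np.asarray(metric_scores_test))

        # Evaluate the ensemble by its correlation with held-out human scores.
        print("Pearson r:", pearsonr(pred, np.asarray(mqm_scores_test))[0])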
    NOPE: A Corpus of Naturally-Occurring Presuppositions in English. (arXiv:2109.06987v1 [cs.CL])
    (2 min) Understanding language requires grasping not only the overtly stated content, but also making inferences about things that were left unsaid. These inferences include presuppositions, a phenomenon by which a listener learns about new information through reasoning about what a speaker takes as given. Presuppositions require complex understanding of the lexical and syntactic properties that trigger them as well as the broader conversational context. In this work, we introduce the Naturally-Occurring Presuppositions in English (NOPE) Corpus to investigate the context-sensitivity of 10 different types of presupposition triggers and to evaluate machine learning models' ability to predict human inferences. We find that most of the triggers we investigate exhibit moderate variability. We further find that transformer-based models draw correct inferences in simple cases involving presuppositions, but they fail to capture the minority of exceptional cases in which human judgments reveal complex interactions between context and triggers.
    Improving Text Auto-Completion with Next Phrase Prediction. (arXiv:2109.07067v1 [cs.CL])
    (2 min) Language models such as GPT-2 have performed well on constructing syntactically sound sentences for the text auto-completion task. However, such models often require considerable training effort to adapt to specific writing domains (e.g., medical). In this paper, we propose an intermediate training strategy to enhance pre-trained language models' performance in the text auto-completion task and quickly adapt them to specific domains. Our strategy includes a novel self-supervised training objective called Next Phrase Prediction (NPP), which encourages a language model to complete a partial query with enriched phrases, eventually improving the model's text auto-completion performance. Preliminary experiments have shown that our approach is able to outperform the baselines in auto-completion for the email and academic writing domains.
    Learning to Match Job Candidates Using Multilingual Bi-Encoder BERT. (arXiv:2109.07157v1 [cs.CL])
    (2 min) In this talk, we will show how we used Randstad's history of candidate placements to generate a labeled CV-vacancy pairs dataset. Afterwards we fine-tune a multilingual BERT with a bi-encoder structure over this dataset, by adding a cosine similarity log loss layer. We will explain how using the mentioned structure helps us overcome most of the challenges described above, and how it enables us to build a maintainable and scalable pipeline to match CVs and vacancies. In addition, we show how we gain a better semantic understanding, and learn to bridge the vocabulary gap. Finally, we highlight how multilingual transformers help us handle the cross-language barrier and might reduce discrimination.
    What Does The User Want? Information Gain for Hierarchical Dialogue Policy Optimisation. (arXiv:2109.07129v1 [cs.LG])
    (2 min) The dialogue management component of a task-oriented dialogue system is typically optimised via reinforcement learning (RL). Optimisation via RL is highly susceptible to sample inefficiency and instability. The hierarchical approach called Feudal Dialogue Management takes a step towards more efficient learning by decomposing the action space. However, it still suffers from instability due to the reward only being provided at the end of the dialogue. We propose using an intrinsic reward based on information gain to address this issue. Our proposed reward favours actions that resolve uncertainty or query the user whenever necessary. It enables the policy to learn how to retrieve the users' needs efficiently, which is an integral aspect of every task-oriented conversation. Our algorithm, which we call FeudalGain, achieves state-of-the-art results in most environments of the PyDial framework, outperforming much more complex approaches. We confirm the sample efficiency and stability of our algorithm through experiments in simulation and a human trial.
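    A small numpy sketch of an information-gain intrinsic reward as the reduction in entropy of the system's belief over the user's goal; how this reward is combined with the feudal sub-policies in FeudalGain is not shown, and the formulation here is a simplification:

        import numpy as np

        def entropy(p):
            p = np.asarray(p, dtype=float)
            p = p / p.sum()
            return float(-(p * np.log(p + 1e-12)).sum())

        def information_gain_reward(belief_before, belief_after):
            # Positive when the action reduced uncertainty about the user's goal,
            # e.g. after a well-placed request or confirmation.
            return entropy(belief_before) - entropy(belief_after)

        # Example: a clarifying question sharpens the belief distribution.
        print(information_gain_reward([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]))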
    Beyond Glass-Box Features: Uncertainty Quantification Enhanced Quality Estimation for Neural Machine Translation. (arXiv:2109.07141v1 [cs.CL])
    (2 min) Quality Estimation (QE) plays an essential role in applications of Machine Translation (MT). Traditionally, a QE system accepts the original source text and translation from a black-box MT system as input. Recently, a few studies indicate that, as a by-product of translation, QE benefits from information about the MT system that produced the translations, namely its model and training data; this is called "glass-box QE". In this paper, we extend the definition of "glass-box QE" generally to uncertainty quantification with both "black-box" and "glass-box" approaches and design several features deduced from them to blaze a new trail in improving QE performance. We propose a framework to fuse the feature engineering of uncertainty quantification into a pre-trained cross-lingual language model to predict the translation quality. Experiment results show that our method achieves state-of-the-art performance on the datasets of the WMT 2020 QE shared task.
    On the Universality of Deep Contextual Language Models. (arXiv:2109.07140v1 [cs.CL])
    (2 min) Deep Contextual Language Models (LMs) like ELMO, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning. Furthermore, multilingual versions of such models like XLM-R and mBERT have given promising results in zero-shot cross-lingual transfer, potentially enabling NLP applications in many under-served and under-resourced languages. Due to this initial success, pre-trained models are being used as `Universal Language Models' as the starting point across diverse tasks, domains, and languages. This work explores the notion of `Universality' by identifying seven dimensions across which a universal model should be able to scale, that is, perform equally well or reasonably well, to be useful across diverse settings. We outline the current theoretical and empirical results that support model performance across these dimensions, along with extensions that may help address some of their current limitations. Through this survey, we lay the foundation for understanding the capabilities and limitations of massive contextual language models and help discern research gaps and directions for future work to make these LMs inclusive and fair to diverse applications, users, and linguistic phenomena.
    Searching for More Efficient Dynamic Programs. (arXiv:2109.06966v1 [cs.CL])
    (2 min) Computational models of human language often involve combinatorial problems. For instance, a probabilistic parser may marginalize over exponentially many trees to make predictions. Algorithms for such problems often employ dynamic programming and are not always unique. Finding one with optimal asymptotic runtime can be unintuitive, time-consuming, and error-prone. Our work aims to automate this laborious process. Given an initial correct declarative program, we search for a sequence of semantics-preserving transformations to improve its running time as much as possible. To this end, we describe a set of program transformations, a simple metric for assessing the efficiency of a transformed program, and a heuristic search procedure to improve this metric. We show that in practice, automated search -- like the mental search performed by human programmers -- can find substantial improvements to the initial program. Empirically, we show that many common speed-ups described in the NLP literature could have been discovered automatically by our system.
    End-to-End Learning of Flowchart Grounded Task-Oriented Dialogs. (arXiv:2109.07263v1 [cs.CL])
    (0 min) We propose a novel problem within end-to-end learning of task-oriented dialogs (TOD), in which the dialog system mimics a troubleshooting agent who helps a user by diagnosing their problem (e.g., car not starting). Such dialogs are grounded in domain-specific flowcharts, which the agent is supposed to follow during the conversation. Our task exposes novel technical challenges for neural TOD, such as grounding an utterance to the flowchart without explicit annotation, referring to additional manual pages when the user asks a clarification question, and the ability to follow unseen flowcharts at test time. We release a dataset (FloDial) consisting of 2,738 dialogs grounded on 12 different troubleshooting flowcharts. We also design a neural model, FloNet, which uses a retrieval-augmented generation architecture to train the dialog agent. Our experiments find that FloNet can do zero-shot transfer to unseen flowcharts, and sets a strong baseline for future research.
    EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation. (arXiv:2109.07222v1 [cs.CL])
    (2 min) Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices. In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA), since the computational cost of FFN is 2-3 times larger than MHA. Hence, to compact BERT, we are devoted to designing an efficient FFN as opposed to previous works that pay attention to MHA. Since FFN comprises a multilayer perceptron (MLP) that is essential in BERT optimization, we further design a thorough search space towards an advanced MLP and perform a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate searching and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show our searched EfficientBERT is 6.9x smaller and 4.4x faster than BERT-base, and has competitive performance on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average score on the GLUE test set, 0.7 higher than MobileBERT-tiny, and achieves an 85.3/74.5 F1 score on the SQuAD v1.1/v2.0 dev sets, 3.2/2.7 higher than TinyBERT4 even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.
    Attention Is Indeed All You Need: Semantically Attention-Guided Decoding for Data-to-Text NLG. (arXiv:2109.07043v1 [cs.CL])
    (2 min) Ever since neural models were adopted in data-to-text language generation, they have invariably been reliant on extrinsic components to improve their semantic accuracy, because the models normally do not exhibit the ability to generate text that reliably mentions all of the information provided in the input. In this paper, we propose a novel decoding method that extracts interpretable information from encoder-decoder models' cross-attention, and uses it to infer which attributes are mentioned in the generated text, which is subsequently used to rescore beam hypotheses. Using this decoding method with T5 and BART, we show on three datasets its ability to dramatically reduce semantic errors in the generated outputs, while maintaining their state-of-the-art quality.
    Can Edge Probing Tasks Reveal Linguistic Knowledge in QA Models?. (arXiv:2109.07102v1 [cs.CL])
    (0 min) There have been many efforts to try to understand what grammatical knowledge (e.g., the ability to understand the part of speech of a token) is encoded in large pre-trained language models (LMs). This is done through 'Edge Probing' (EP) tests: simple ML models that predict the grammatical properties of a span (whether it has a particular part of speech) using only the LM's token representations. However, most NLP applications use fine-tuned LMs. Here, we ask: if an LM is fine-tuned, does the encoding of linguistic information in it change, as measured by EP tests? Conducting experiments on multiple question-answering (QA) datasets, we answer that question negatively: the EP test results do not change significantly when the fine-tuned QA model performs well, or in adversarial situations where the model is forced to learn wrong correlations. However, a critical analysis of the EP task datasets reveals that EP models may rely on spurious correlations to make predictions. This indicates that even if fine-tuning changes the encoding of such knowledge, the EP tests might fail to measure it.
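    A stripped-down sketch of an edge-probing-style test: freeze the LM, extract token (span) representations, and fit a small classifier to predict a grammatical label. The linear probe and the pre-extracted arrays are simplifying assumptions standing in for the original EP suite's span-pooling MLP probes:

        from sklearn.linear_model import LogisticRegression

        # token_reps_*: (num_tokens, hidden_size) frozen LM representations,
        # pos_labels_*: integer POS tags -- both assumed to be pre-extracted.
        probe = LogisticRegression(max_iter=1000).fit(token_reps_train, pos_labels_train)
        print("probe accuracy:", probe.score(token_reps_test, pos_labels_test))

        # Running the same probe on representations from the base LM and from a
        # fine-tuned QA model indicates whether fine-tuning changed how much
        # grammatical information the representations expose.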
    When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute. (arXiv:2102.12459v3 [cs.CL] UPDATED)
    (0 min) Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.
    A Conditional Generative Matching Model for Multi-lingual Reply Suggestion. (arXiv:2109.07046v1 [cs.CL])
    (2 min) We study the problem of a multilingual automated reply suggestion (RS) model serving many languages simultaneously. Multilingual models are often challenged by model capacity and severe data distribution skew across languages. While prior works largely focus on monolingual models, we propose Conditional Generative Matching models (CGM), optimized within a Variational Autoencoder framework to address challenges arising from multi-lingual RS. CGM does so with expressive message conditional priors, mixture densities to enhance multi-lingual data representation, latent alignment for language discrimination, and effective variational optimization techniques for training multi-lingual RS. The enhancements result in performance that exceeds competitive baselines in relevance (ROUGE score) by more than 10% on average, and 16% for low-resource languages. CGM also shows remarkable improvements in diversity (80%), illustrating its expressiveness in representing multi-lingual data.
    Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech. (arXiv:2109.06952v1 [cs.CL])
    (2 min) Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve similar adaptation gains compared to model fine-tuning, while only updating a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
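    A minimal PyTorch sketch of a residual bottleneck adapter added after an encoder layer; the layer sizes and placement are illustrative, and only the adapter parameters would be updated during per-speaker adaptation while the base encoder stays frozen:

        import torch
        import torch.nn as nn

        class ResidualAdapter(nn.Module):
            # Small bottleneck applied to a frozen encoder layer's output; its
            # parameter count is a tiny fraction of the base model's.
            def __init__(self, d_model=512, bottleneck=32):
                super().__init__()
                self.norm = nn.LayerNorm(d_model)
                self.down = nn.Linear(d_model, bottleneck)
                self.up = nn.Linear(bottleneck, d_model)

            def forward(self, x):
                return x + self.up(torch.relu(self.down(self.norm(x))))

        adapter = ResidualAdapter()
        x = torch.randn(8, 120, 512)   # (batch, frames, features)
        out = adapter(x)               # same shape, residually adapted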
    fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit. (arXiv:2109.06912v1 [eess.AS])
    (2 min) This paper presents fairseq S^2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis, a suite of automatic metrics is included. Apart from the features added specifically for this extension, fairseq S^2 also benefits from the scalability offered by fairseq and can be easily integrated with other state-of-the-art systems provided in this framework. The code, documentation, and pre-trained models are available at https://github.com/pytorch/fairseq/tree/master/examples/speech_synthesis.
    Frequency Effects on Syntactic Rule Learning in Transformers. (arXiv:2109.07020v1 [cs.CL])
    (2 min) Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules. We investigate this question using the case study of BERT's performance on English subject-verb agreement. Unlike prior work, we train multiple instances of BERT from scratch, allowing us to perform a series of controlled interventions at pre-training time. We show that BERT often generalizes well to subject-verb pairs that never occurred in training, suggesting a degree of rule-governed behavior. We also find, however, that performance is heavily influenced by word frequency, with experiments showing that both the absolute frequency of a verb form, as well as the frequency relative to the alternate inflection, are causally implicated in the predictions BERT makes at inference time. Closer analysis of these frequency effects reveals that BERT's behavior is consistent with a system that correctly applies the SVA rule in general but struggles to overcome strong training priors and to estimate agreement features (singular vs. plural) on infrequent lexical items.
    Automatically Exposing Problems with Neural Dialog Models. (arXiv:2109.06950v1 [cs.CL])
    (2 min) Neural dialog models are known to suffer from problems such as generating unsafe and inconsistent responses. Even though these problems are crucial and prevalent, they are mostly manually identified by model designers through interactions. Recently, some research instructs crowdworkers to goad the bots into triggering such problems. However, humans leverage superficial clues such as hate speech, while leaving systematic problems undiscovered. In this paper, we propose two methods, including reinforcement learning, to automatically trigger a dialog model into generating problematic responses. We show the effect of our methods in exposing safety and contradiction issues with state-of-the-art dialog models.
    SupCL-Seq: Supervised Contrastive Learning for Downstream Optimized Sequence Representations. (arXiv:2109.07424v1 [cs.CL])
    (2 min) While contrastive learning has proven to be an effective training strategy in computer vision, Natural Language Processing (NLP) has only recently adopted it as a self-supervised alternative to Masked Language Modeling (MLM) for improving sequence representations. This paper introduces SupCL-Seq, which extends supervised contrastive learning from computer vision to the optimization of sequence representations in NLP. By altering the dropout mask probability in standard Transformer architectures, we generate augmented views for every representation (anchor). A supervised contrastive loss is then utilized to maximize the system's capability of pulling together similar samples (e.g., anchors and their altered views) and pushing apart the samples belonging to other classes. Despite its simplicity, SupCL-Seq leads to large gains in many sequence classification tasks on the GLUE benchmark compared to a standard BERT-base, including 6% absolute improvement on CoLA, 5.4% on MRPC, 4.7% on RTE and 2.6% on STSB. We also show consistent gains over self-supervised contrastively learned representations, especially in non-semantic tasks. Finally, we show that these gains are not solely due to augmentation, but rather to a downstream optimized sequence representation. Code: https://github.com/hooman650/SupCL-Seq
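    A generic SupCon-style loss over dropout-augmented views, as a sketch of this kind of training objective in PyTorch; the exact view construction, temperature, and loss details of SupCL-Seq may differ, and the views would be obtained by two forward passes with dropout active:

        import torch
        import torch.nn.functional as F

        def supervised_contrastive_loss(features, labels, temperature=0.1):
            # features: (2N, d) -- two dropout-augmented views per example, stacked;
            # labels: (2N,) -- the class labels repeated accordingly.
            features = F.normalize(features, dim=1)
            sim = features @ features.t() / temperature

            n = features.size(0)
            self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
            pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

            # Log-probability of every other sample, averaged over the positives.
            sim = sim.masked_fill(self_mask, float('-inf'))
            log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
            log_prob = log_prob.masked_fill(self_mask, 0.0)  # avoid -inf * 0
            loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
            return loss.mean()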
    Discriminative and Generative Transformer-based Models For Situation Entity Classification. (arXiv:2109.07434v1 [cs.CL])
    (2 min) We re-examine the situation entity (SE) classification task with varying amounts of available training data. We exploit a Transformer-based variational autoencoder to encode sentences into a lower dimensional latent space, which is used to generate the text and learn a SE classifier. Test set and cross-genre evaluations show that when training data is plentiful, the proposed model can improve over the previous discriminative state-of-the-art models. Our approach performs disproportionately better with smaller amounts of training data, but when faced with extremely small sets (4 instances per label), generative RNN methods outperform transformers. Our work provides guidance for future efforts on SE and semantic prediction tasks, and low-label training regimes.
    Optimal Neural Program Synthesis from Multimodal Specifications. (arXiv:2010.01678v2 [cs.CL] UPDATED)
    (2 min) Multimodal program synthesis, which leverages different types of user input to synthesize a desired program, is an attractive way to scale program synthesis to challenging settings; however, it requires integrating noisy signals from the user, like natural language, with hard constraints on the program's behavior. This paper proposes an optimal neural synthesis approach where the goal is to find a program that satisfies user-provided constraints while also maximizing the program's score with respect to a neural model. Specifically, we focus on multimodal synthesis tasks in which the user intent is expressed using a combination of natural language (NL) and input-output examples. At the core of our method is a top-down recurrent neural model that places distributions over abstract syntax trees conditioned on the NL input. This model not only allows for efficient search over the space of syntactically valid programs, but it allows us to leverage automated program analysis techniques for pruning the search space based on infeasibility of partial programs with respect to the user's constraints. The experimental results on a multimodal synthesis dataset (StructuredRegex) show that our method substantially outperforms prior state-of-the-art techniques in terms of accuracy and efficiency, and finds model-optimal programs more frequently.
    Efficient Domain Adaptation of Language Models via Adaptive Tokenization. (arXiv:2109.07460v1 [cs.CL])
    (2 min) Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on tasks involving data from domains different from that on which they were pretrained can lead to suboptimal performance. Recent work has explored approaches to adapt pretrained language models to new domains by incorporating additional pretraining using domain-specific corpora and task data. We propose an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers. We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora. In datasets from four disparate domains, we find adaptive tokenization on a pretrained RoBERTa model provides >97% of the performance benefits of domain-specific pretraining. Our approach produces smaller models and requires less training and inference time than other approaches using tokenizer augmentation. While adaptive tokenization incurs a 6% increase in model parameters in our experimentation, due to the introduction of 10k new domain-specific tokens, our approach, using 64 vCPUs, is 72x faster than further pretraining the language model on domain-specific corpora on 8 TPUs.
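    A minimal sketch of the tokenizer-adaptation step with HuggingFace Transformers; the frequency-ratio heuristic below is a crude stand-in for the paper's divergence-based token selection, and domain_texts / base_texts are assumed corpora you would supply:

        from collections import Counter
        from transformers import AutoTokenizer, AutoModelForMaskedLM

        def select_domain_tokens(domain_texts, base_texts, k=10_000):
            # Crude proxy for the divergence criterion: terms far more frequent in
            # the domain corpus than in the base corpus become candidate tokens.
            dom = Counter(w for t in domain_texts for w in t.split())
            base = Counter(w for t in base_texts for w in t.split())
            ranked = sorted(dom, key=lambda w: dom[w] / (base.get(w, 0) + 1), reverse=True)
            return ranked[:k]

        tokenizer = AutoTokenizer.from_pretrained("roberta-base")
        model = AutoModelForMaskedLM.from_pretrained("roberta-base")

        new_tokens = select_domain_tokens(domain_texts, base_texts)  # assumed corpora
        tokenizer.add_tokens(new_tokens)
        model.resize_token_embeddings(len(tokenizer))  # new rows initialised randomly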
    Assisting the Human Fact-Checkers: Detecting All Previously Fact-Checked Claims in a Document. (arXiv:2109.07410v1 [cs.CL])
    (2 min) Given the recent proliferation of false claims online, there has been a lot of manual fact-checking effort. As this is very time-consuming, human fact-checkers can benefit from tools that can support them and make them more efficient. Here, we focus on building a system that could provide such support. Given an input document, it aims to detect all sentences that contain a claim that can be verified by some previously fact-checked claims (from a given database). The output is a re-ranked list of the document sentences, so that those that can be verified are ranked as high as possible, together with corresponding evidence. Unlike previous work, which has looked into claim retrieval, here we take a document-level perspective. We create a new manually annotated dataset for the task, and we propose suitable evaluation measures. We further experiment with a learning-to-rank approach, achieving sizable performance gains over several strong baselines. Our analysis demonstrates the importance of modeling text similarity and stance, while also taking into account the veracity of the retrieved previously fact-checked claims. We believe that this research would be of interest to fact-checkers, journalists, media, and regulatory authorities.
    Comparing Text Representations: A Theory-Driven Approach. (arXiv:2109.07458v1 [cs.CL])
    (2 min) Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of text datasets and present a method to evaluate the compatibility between representations and tasks. Even though many tasks can be easily solved with simple bag-of-words (BOW) representations, BOW does poorly on hard natural language inference tasks. For one such task we find that BOW cannot distinguish between real and randomized labelings, while pre-trained MLM representations show 72x greater distinction between real and random labelings than BOW. This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters. The method provides a fresh perspective on the patterns in a dataset and the alignment of those patterns with specific labels.
    Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative. (arXiv:2109.07437v1 [cs.LG])
    (2 min) Pre-training, where models are trained on an auxiliary objective with abundant data before being fine-tuned on data from the downstream task, is now the dominant paradigm in NLP. In general, the pre-training step relies on little to no direct knowledge of the task on which the model will be fine-tuned, even when the end-task is known in advance. Our work challenges this status-quo of end-task agnostic pre-training. First, on three different low-resource NLP tasks from two domains, we demonstrate that multi-tasking the end-task and auxiliary objectives results in significantly better downstream task performance than the widely-used task-agnostic continued pre-training paradigm of Gururangan et al. (2020). We next introduce an online meta-learning algorithm that learns a set of multi-task weights to better balance among our multiple auxiliary objectives, achieving further improvements on end task performance and data efficiency.
  • cs.CV updates on arXiv.org

    MLFW: A Database for Face Recognition on Masked Faces. (arXiv:2109.05804v2 [cs.CV] UPDATED)
    (2 min) As more and more people begin to wear masks due to the current COVID-19 pandemic, existing face recognition systems may encounter severe performance degradation when recognizing masked faces. To figure out the impact of masks on face recognition models, we build a simple but effective tool to generate masked faces from unmasked faces automatically, and construct a new database called Masked LFW (MLFW) based on the Cross-Age LFW (CALFW) database. The mask on the masked face generated by our method has good visual consistency with the original face. Moreover, we collect various mask templates, covering most of the common styles that appear in daily life, to achieve diverse generation effects. Considering realistic scenarios, we design three kinds of combinations of face pairs. The recognition accuracy of SOTA models declines 5%-16% on the MLFW database compared with the accuracy on the original images. The MLFW database can be viewed and downloaded at this http URL.
    Learning to Stylize Novel Views. (arXiv:2105.13509v2 [cs.CV] UPDATED)
    (2 min) We tackle a 3D scene stylization problem - generating stylized images of a scene from arbitrary novel views given a set of images of the same scene and a reference image of the desired style as inputs. Directly combining novel view synthesis and stylization approaches leads to results that are blurry or not consistent across different views. We propose a point cloud-based method for consistent 3D scene stylization. First, we construct the point cloud by back-projecting the image features to the 3D space. Second, we develop point cloud aggregation modules to gather the style information of the 3D scene, and then modulate the features in the point cloud with a linear transformation matrix. Finally, we project the transformed features to 2D space to obtain the novel views. Experimental results on two diverse datasets of real-world scenes validate that our method generates consistent stylized novel view synthesis results compared with alternative approaches.
    Deep Learning based Micro-expression Recognition: A Survey. (arXiv:2107.02823v2 [cs.CV] UPDATED)
    (2 min) Micro-expressions (MEs) are involuntary facial movements revealing people's hidden feelings in high-stake situations and have practical importance in medical treatment, national security, interrogations and many human-computer interaction systems. Early methods for micro-expression recognition (MER) were mainly based on traditional appearance and geometry features. Recently, with the success of deep learning (DL) in various fields, neural networks have received increasing interest in MER. Different from macro-expressions, MEs are spontaneous, subtle, and rapid facial movements, which makes data collection difficult and results in small-scale datasets. DL-based MER is therefore challenging due to these ME characteristics. To date, various DL approaches have been proposed to solve the ME issues and improve MER performance. In this survey, we provide a comprehensive review of deep MER, including datasets, the deep MER pipeline, and the benchmarking of the most influential methods. This survey defines a new taxonomy for the field, encompassing all aspects of MER based on DL. For each aspect, the basic approaches and advanced developments are summarized and discussed. In addition, we summarize the remaining challenges and potential directions for the design of robust deep MER systems. To the best of our knowledge, this is the first survey of deep MER methods, and it can serve as a reference point for future MER research.
    Harnessing Unrecognizable Faces for Improving Face Recognition. (arXiv:2106.04112v2 [cs.CV] UPDATED)
    (2 min) The common implementation of face recognition systems as a cascade of a detection stage and a recognition or verification stage can cause problems beyond failures of the detector. When the detector succeeds, it can detect faces that cannot be recognized, no matter how capable the recognition system. Recognizability, a latent variable, should therefore be factored into the design and implementation of face recognition systems. We propose a measure of recognizability of a face image that leverages a key empirical observation: an embedding of face images, implemented by a deep neural network trained using mostly recognizable identities, induces a partition of the hypersphere whereby unrecognizable identities cluster together. This occurs regardless of the phenomenon that causes a face to be unrecognizable, be it optical or motion blur, partial occlusion, spatial quantization, or poor illumination. Therefore, we use the distance from such an "unrecognizable identity" as a measure of recognizability, and incorporate it in the design of the over-all system. We show that accounting for recognizability reduces the error rate of single-image face recognition by 58% at FAR=1e-5 on the IJB-C Covariate Verification benchmark, and reduces verification error rate by 24% at FAR=1e-5 in set-based recognition on the IJB-C benchmark.
    CUDA-GHR: Controllable Unsupervised Domain Adaptation for Gaze and Head Redirection. (arXiv:2106.10852v2 [cs.CV] UPDATED)
    (2 min) The robustness of gaze and head pose estimation models is highly dependent on the amount of labeled data. Recently, generative modeling has shown excellent results in generating photo-realistic images, which can alleviate the need for labeled data. However, adopting such generative models to new domains while maintaining their ability to provide fine-grained control over different image attributes, e.g., gaze and head pose directions, has been a challenging problem. This paper proposes CUDA-GHR, an unsupervised domain adaptation framework that enables fine-grained control over gaze and head pose directions while preserving the appearance-related factors of the person. Our framework simultaneously learns to adapt to new domains and disentangle image attributes such as appearance, gaze direction, and head orientation by utilizing a label-rich source domain and an unlabeled target domain. Extensive experiments on the benchmarking datasets show that the proposed method can outperform state-of-the-art techniques on both quantitative and qualitative evaluations. Furthermore, we show that the generated image-label pairs in the target domain effectively transfer knowledge and boost the downstream tasks' performance.
    A Tutorial on Learning Disentangled Representations in the Imaging Domain. (arXiv:2108.12043v2 [cs.CV] UPDATED)
    (0 min) Disentangled representation learning has been proposed as an approach to learning general representations. This can be done in the absence of, or with limited, annotations. A good general representation can be readily fine-tuned for new target tasks using modest amounts of data, or even be used directly in unseen domains achieving remarkable performance in the corresponding task. This alleviation of the data and annotation requirements offers tantalising prospects for tractable and affordable applications in computer vision and healthcare. Finally, disentangled representations can offer model explainability and can help us understand the underlying causal relations of the factors of variation, increasing their suitability for real-world deployment. In this tutorial paper, we will offer an overview of the disentangled representation learning, its building blocks and criteria, and discuss applications in computer vision and medical imaging. We conclude our tutorial by presenting the identified opportunities for the integration of recent machine learning advances into disentanglement, as well as the remaining challenges.
    Instance-weighted Central Similarity for Multi-label Image Retrieval. (arXiv:2108.05274v4 [cs.CV] UPDATED)
    (0 min) Deep hashing has been widely applied to large-scale image retrieval by encoding high-dimensional data points into binary codes for efficient retrieval. Compared with pairwise/triplet similarity based hash learning, central similarity based hashing can more efficiently capture the global data distribution. For multi-label image retrieval, however, previous methods only use multiple hash centers with equal weights to generate one centroid as the learning target, which ignores the relationship between the weights of hash centers and the proportion of instance regions in the image. To address the above issue, we propose a two-step alternative optimization approach, Instance-weighted Central Similarity (ICS), to automatically learn the center weight corresponding to a hash code. Firstly, we apply the maximum entropy regularizer to prevent one hash center from dominating the loss function, and compute the center weights via projection gradient descent. Secondly, we update neural network parameters by standard back-propagation with fixed center weights. More importantly, the learned center weights can well reflect the proportion of foreground instances in the image. Our method achieves the state-of-the-art performance on the image retrieval benchmarks, and especially improves the mAP by 1.6%-6.4% on the MS COCO dataset.
    LOKI: Long Term and Key Intentions for Trajectory Prediction. (arXiv:2108.08236v2 [cs.CV] UPDATED)
    (0 min) Recent advances in trajectory prediction have shown that explicit reasoning about agents' intent is important to accurately forecast their motion. However, the current research activities are not directly applicable to intelligent and safety critical systems. This is mainly because very few public datasets are available, and they only consider pedestrian-specific intents for a short temporal horizon from a restricted egocentric view. To this end, we propose LOKI (LOng term and Key Intentions), a novel large-scale dataset that is designed to tackle joint trajectory and intention prediction for heterogeneous traffic agents (pedestrians and vehicles) in an autonomous driving setting. The LOKI dataset is created to discover several factors that may affect intention, including i) agent's own will, ii) social interactions, iii) environmental constraints, and iv) contextual information. We also propose a model that jointly performs trajectory and intention prediction, showing that recurrently reasoning about intention can assist with trajectory prediction. We show our method outperforms state-of-the-art trajectory prediction methods by up to 27% and also provide a baseline for frame-wise intention estimation.
    Language Grounding with 3D Objects. (arXiv:2107.12514v2 [cs.CL] UPDATED)
    (0 min) Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless." The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while substantial effort and progress has been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE) requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, it is still the case that these image-based models are weaker at understanding the 3D nature of objects -- properties which play a key role in manipulation. We find that adding view estimation to language grounding models improves accuracy on both SNARE and when identifying objects referred to in language on a robot platform, but note that a large gap remains between these models and human performance.
    Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation. (arXiv:2107.06011v2 [cs.CV] UPDATED)
    (0 min) In the context of visual navigation, the capacity to map a novel environment is necessary for an agent to exploit its observation history in the considered place and efficiently reach known goals. This ability can be associated with spatial reasoning, where an agent is able to perceive spatial relationships and regularities, and discover object characteristics. In classical Reinforcement Learning (RL) setups, this capacity is learned from reward alone. We introduce supplementary supervision in the form of auxiliary tasks designed to favor the emergence of spatial perception capabilities in agents trained for a goal-reaching downstream objective. We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings. Our method significantly improves the performance of different baseline agents, that either build an explicit or implicit representation of the environment, even matching the performance of incomparable oracle agents taking ground-truth maps as input.
    New Encoder Learning for Captioning Heavy Rain Images via Semantic Visual Feature Matching. (arXiv:2105.13753v3 [cs.CV] UPDATED)
    (0 min) Image captioning generates text that describes scenes from input images. It has been developed for high quality images taken in clear weather. However, in bad weather conditions, such as heavy rain, snow, and dense fog, the poor visibility owing to rain streaks, rain accumulation, and snowflakes causes a serious degradation of image quality. This hinders the extraction of useful visual features and results in deteriorated image captioning performance. To address practical issues, this study introduces a new encoder for captioning heavy rain images. The central idea is to transform output features extracted from heavy rain input images into semantic visual features associated with words and sentence context. To achieve this, a target encoder is initially trained in an encoder-decoder framework to associate visual features with semantic words. Subsequently, the objects in a heavy rain image are rendered visible by using an initial reconstruction subnetwork (IRS) based on a heavy rain model. The IRS is then combined with another semantic visual feature matching subnetwork (SVFMS) to match the output features of the IRS with the semantic visual features of the pretrained target encoder. The proposed encoder is based on the joint learning of the IRS and SVFMS. It is trained in an end-to-end manner, and then connected to the pretrained decoder for image captioning. It is experimentally demonstrated that the proposed encoder can generate semantic visual features associated with words even from heavy rain images, thereby increasing the accuracy of the generated captions.
    CoAtNet: Marrying Convolution and Attention for All Data Sizes. (arXiv:2106.04803v2 [cs.CV] UPDATED)
    (0 min) Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths of both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.
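    A rough sketch of the vertical conv-then-attention stacking described above, using a plain depthwise-separable convolution stage followed by standard multi-head self-attention over flattened spatial tokens; CoAtNet's relative attention and exact stage layout are not reproduced here, and all sizes are placeholders.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Depthwise-separable convolution stage (a stand-in for MBConv-style blocks)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # depthwise
        self.pw = nn.Conv2d(dim, dim, 1)                          # pointwise
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):                       # x: (B, C, H, W)
        return x + self.pw(self.act(self.norm(self.dw(x))))

class AttnStage(nn.Module):
    """Self-attention stage over flattened spatial tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        n = self.norm(t)
        t = t + self.attn(n, n, n)[0]           # residual self-attention
        return t.transpose(1, 2).reshape(b, c, h, w)

hybrid = nn.Sequential(ConvStage(64), ConvStage(64), AttnStage(64))
out = hybrid(torch.randn(2, 64, 16, 16))        # conv stages first, attention on top
```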
    CenterPoly: real-time instance segmentation using bounding polygons. (arXiv:2108.08923v2 [cs.CV] UPDATED)
    (2 min) We present a novel method, called CenterPoly, for real-time instance segmentation using bounding polygons. We apply it to detect road users in dense urban environments, making it suitable for applications in intelligent transportation systems like automated vehicles. CenterPoly detects objects by their center keypoint while predicting a fixed number of polygon vertices for each object, thus performing detection and segmentation in parallel. Most of the network parameters are shared by the network heads, making it fast and lightweight enough to run at real-time speed. To properly convert mask ground-truth to polygon ground-truth, we designed a vertex selection strategy to facilitate the learning of the polygons. Additionally, to better segment overlapping objects in dense urban scenes, we also train a relative depth branch to determine which instances are closer and which are further, using available weak annotations. We propose several models with different backbones to show the possible speed / accuracy trade-offs. The models were trained and evaluated on Cityscapes, KITTI and IDD, and the results, reported on their public benchmarks, are state-of-the-art at real-time speeds. Code is available at https://github.com/hu64/CenterPoly
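    For illustration, a hedged sketch of the parallel prediction heads such a center-based polygon detector could place on a shared backbone feature map: a per-class center heatmap, a fixed number of polygon-vertex offsets per location, and a relative-depth map for ordering overlapping instances. Channel sizes and the vertex count are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CenterPolyHeads(nn.Module):
    """Parallel heads over a shared feature map (illustrative sizes only)."""
    def __init__(self, in_ch=64, num_classes=8, num_vertices=16):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(in_ch, out_ch, 1))
        self.heatmap = head(num_classes)          # center keypoints per class
        self.polygon = head(2 * num_vertices)     # (dx, dy) offsets per vertex, per center
        self.depth = head(1)                      # relative depth for overlap ordering

    def forward(self, feat):                      # feat: (B, in_ch, H, W)
        return {"heatmap": torch.sigmoid(self.heatmap(feat)),
                "polygon": self.polygon(feat),
                "depth": self.depth(feat)}

outs = CenterPolyHeads()(torch.randn(1, 64, 128, 128))
```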
    FFAVOD: Feature Fusion Architecture for Video Object Detection. (arXiv:2109.07298v1 [cs.CV])
    (0 min) A significant amount of redundancy exists between consecutive frames of a video. Object detectors typically produce detections for one image at a time, without any capabilities for taking advantage of this redundancy. Meanwhile, many applications for object detection work with videos, including intelligent transportation systems, advanced driver assistance systems and video surveillance. Our work aims at taking advantage of the similarity between video frames to produce better detections. We propose FFAVOD, standing for feature fusion architecture for video object detection. We first introduce a novel video object detection architecture that allows a network to share feature maps between nearby frames. Second, we propose a feature fusion module that learns to merge feature maps to enhance them. We show that using the proposed architecture and the fusion module can improve the performance of three base object detectors on two object detection benchmarks containing sequences of moving road users. Additionally, to further increase performance, we propose an improvement to the SpotNet attention module. Using our architecture on the improved SpotNet detector, we obtain the state-of-the-art performance on the UA-DETRAC public benchmark as well as on the UAVDT dataset. Code is available at https://github.com/hu64/FFAVOD.
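    A small sketch of one way to realize such a fusion module: score each nearby frame's feature map with a 1x1 convolution and take a softmax-weighted average across frames. This is an assumed, simplified instantiation, not necessarily the module proposed in the paper.

```python
import torch
import torch.nn as nn

class FrameFusion(nn.Module):
    """Merge feature maps from several nearby frames into one enhanced map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # per-frame, per-pixel score

    def forward(self, feats):                  # feats: (B, T, C, H, W)
        b, t, c, h, w = feats.shape
        scores = self.score(feats.reshape(b * t, c, h, w)).reshape(b, t, 1, h, w)
        weights = torch.softmax(scores, dim=1)                # normalize across frames
        return (weights * feats).sum(dim=1)                   # fused (B, C, H, W) map

fused = FrameFusion(64)(torch.randn(2, 5, 64, 32, 32))
```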
    CalCROP21: A Georeferenced multi-spectral dataset of Satellite Imagery and Crop Labels. (arXiv:2107.12499v2 [cs.CV] UPDATED)
    (2 min) Mapping and monitoring crops is a key step towards sustainable intensification of agriculture and addressing global food security. A dataset like ImageNet that revolutionized computer vision applications can accelerate development of novel crop mapping techniques. Currently, the United States Department of Agriculture (USDA) annually releases the Cropland Data Layer (CDL) which contains crop labels at 30m resolution for the entire United States of America. While CDL is state of the art and is widely used for a number of agricultural applications, it has a number of limitations (e.g., pixelated errors, labels carried over from previous errors and absence of input imagery along with class labels). In this work, we create a new semantic segmentation benchmark dataset, which we call CalCROP21, for the diverse crops in the Central Valley region of California at 10m spatial resolution using a Google Earth Engine based robust image processing pipeline and a novel attention based spatio-temporal semantic segmentation algorithm STATT. STATT uses re-sampled (interpolated) CDL labels for training, but is able to generate a better prediction than CDL by leveraging spatial and temporal patterns in Sentinel2 multi-spectral image series to effectively capture phenologic differences amongst crops and uses attention to reduce the impact of clouds and other atmospheric disturbances. We also present a comprehensive evaluation to show that STATT has significantly better results when compared to the resampled CDL labels. We have released the dataset and the processing pipeline code for generating the benchmark dataset.
    Improving 3D Object Detection with Channel-wise Transformer. (arXiv:2108.10723v2 [cs.CV] UPDATED)
    (2 min) Though 3D object detection from point clouds has achieved rapid progress in recent years, the lack of flexible and high-performance proposal refinement remains a great hurdle for existing state-of-the-art two-stage detectors. Previous works on refining 3D proposals have relied on human-designed components such as keypoints sampling, set abstraction and multi-scale feature fusion to produce powerful 3D object representations. Such methods, however, have limited ability to capture rich contextual dependencies among points. In this paper, we leverage the high-quality region proposal network and a Channel-wise Transformer architecture to constitute our two-stage 3D object detection framework (CT3D) with minimal hand-crafted design. The proposed CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation for the point features within each proposal. Specifically, CT3D uses the proposal's keypoints for spatial contextual modelling and learns attention propagation in the encoding module, mapping the proposal to point embeddings. Next, a new channel-wise decoding module enriches the query-key interaction via channel-wise re-weighting to effectively merge multi-level contexts, which contributes to more accurate object predictions. Extensive experiments demonstrate that our CT3D method has superior performance and excellent scalability. Remarkably, CT3D achieves an AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark, outperforming state-of-the-art 3D detectors.
    RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting. (arXiv:2107.02104v2 [cs.CV] UPDATED)
    (2 min) Chest radiographs are one of the most common diagnostic modalities in clinical routine. They can be acquired cheaply, require minimal equipment, and can be read by any radiologist. However, the number of chest radiographs obtained on a daily basis can easily overwhelm the available clinical capacities. We propose RATCHET: RAdiological Text Captioning for Human Examined Thoraces. RATCHET is a CNN-RNN-based medical transformer that is trained end-to-end. It is capable of extracting image features from chest radiographs, and generates medically accurate text reports that fit seamlessly into clinical workflows. The model is evaluated for its natural language generation ability using common metrics from the NLP literature, as well as for its medical accuracy through a surrogate report classification task. The model is available for download at: this http URL
    ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. (arXiv:2108.02938v2 [cs.CV] UPDATED)
    (2 min) Denoising diffusion probabilistic models (DDPM) have shown remarkable performance in unconditional image generation. However, due to the stochasticity of the generative process in DDPM, it is challenging to generate images with the desired semantics. In this work, we propose Iterative Latent Variable Refinement (ILVR), a method to guide the generative process in DDPM to generate high-quality images based on a given reference image. Here, the refinement of the generative process in DDPM enables a single DDPM to sample images from various sets directed by the reference image. The proposed ILVR method generates high-quality images while controlling the generation. The controllability of our method allows adaptation of a single DDPM without any additional learning in various image generation tasks, such as generation from various downsampling factors, multi-domain image translation, paint-to-image, and editing with scribbles.
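    The refinement can be summarized in a single guided step: after each unconditional denoising update, the low-frequency content of the sample is replaced with that of the reference noised to the same level, where the low-pass filter is a downsample-then-upsample operator. A minimal sketch under those assumptions; `ddpm_step` and `q_sample` are placeholders for a pretrained model's reverse step and forward noising, and the bicubic resizing merely stands in for the paper's resizing kernel.

```python
import torch
import torch.nn.functional as F

def low_pass(x, factor):
    """Downsample by `factor`, then upsample back: a simple low-pass filter."""
    small = F.interpolate(x, scale_factor=1.0 / factor, mode="bicubic", align_corners=False)
    return F.interpolate(small, size=x.shape[-2:], mode="bicubic", align_corners=False)

def ilvr_step(x_t, t, y_ref, ddpm_step, q_sample, factor=4):
    """One guided step: sample x_{t-1} unconditionally, then match low frequencies.

    ddpm_step(x_t, t)   -> x_{t-1}': ordinary DDPM reverse step (placeholder).
    q_sample(y_ref, t-1) -> y_{t-1}: reference noised to level t-1 (placeholder).
    """
    x_prev = ddpm_step(x_t, t)                   # unconditional proposal
    y_prev = q_sample(y_ref, t - 1)              # noised reference at the same level
    return x_prev - low_pass(x_prev, factor) + low_pass(y_prev, factor)
```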
    Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion. (arXiv:2106.03821v2 [cs.SD] UPDATED)
    (0 min) It is now well established from a variety of studies that there is a significant benefit from combining video and audio data in detecting active speakers. However, either of the modalities can potentially mislead audiovisual fusion by inducing unreliable or deceptive information. This paper formulates active speaker detection as a multi-objective learning problem to leverage the best of each modality using a novel self-attention, uncertainty-based multimodal fusion scheme. Results obtained show that the proposed multi-objective learning architecture outperforms traditional approaches in improving both mAP and AUC scores. We further demonstrate that our fusion strategy surpasses, in active speaker detection, other modality fusion methods reported in various disciplines. We finally show that the proposed method significantly improves the state-of-the-art on the AVA-ActiveSpeaker dataset.
    Multi-Target Adversarial Frameworks for Domain Adaptation in Semantic Segmentation. (arXiv:2108.06962v2 [cs.CV] UPDATED)
    (2 min) In this work, we address the task of unsupervised domain adaptation (UDA) for semantic segmentation in the presence of multiple target domains: The objective is to train a single model that can handle all these domains at test time. Such a multi-target adaptation is crucial for a variety of scenarios that real-world autonomous systems must handle. It is a challenging setup since one faces not only the domain gap between the labeled source set and the unlabeled target set, but also the distribution shifts existing within the latter among the different target domains. To this end, we introduce two adversarial frameworks: (i) multi-discriminator, which explicitly aligns each target domain to its counterparts, and (ii) multi-target knowledge transfer, which learns a target-agnostic model thanks to a multi-teacher/single-student distillation mechanism. The evaluation is done on four newly-proposed multi-target benchmarks for UDA in semantic segmentation. In all tested scenarios, our approaches consistently outperform baselines, setting competitive standards for the novel task.
    DeepChange: A Long-Term Person Re-Identification Benchmark. (arXiv:2105.14685v3 [cs.CV] UPDATED)
    (2 min) Existing person re-identification (Re-ID) works mostly consider a short-term search problem assuming unchanged clothes and personal appearance. However, in the real world, people often dress differently across locations, time, dates, seasons, weather, and events. As a result, the existing methods are unsuitable for long-term person Re-ID with clothes change involved. Whilst there are several recent long-term Re-ID attempts, a large realistic dataset with clothes change is lacking and indispensable for enabling extensive study, as already experienced in the short-term Re-ID setting. In this work, we contribute a large, realistic long-term person identification benchmark. It consists of 178K bounding boxes from 1.1K person identities, collected and constructed over 12 months. Unique characteristics of this dataset include: (1) Natural/native personal appearance (e.g., clothes and hair style) variations: The clothes-change and dressing styles are all highly diverse, with the reappearing gap in time ranging from minutes, hours, and days to weeks, months, seasons, and years. (2) Diverse walks of life: Persons across a wide range of ages and professions appear in different weather conditions (e.g., sunny, cloudy, windy, rainy, snowy, extremely cold) and events (e.g., working, leisure, daily activities). (3) Rich camera setups: The raw videos were recorded by 17 outdoor security cameras with various resolutions operating in a real-world surveillance system for a wide and dense block. (4) Largest scale: It covers the largest number of cameras (17), identities (1,121), and bounding boxes (178,407), as compared to alternative datasets. Our dataset and benchmark codes are available at https://github.com/PengBoXiangShang/deepchange.
    A Novel Unified Stereo Stimuli based Binocular Eye-Tracking System for Accurate 3D Gaze Estimation. (arXiv:2104.12167v2 [cs.CV] UPDATED)
    (2 min) In addition to the high cost and complex setup, the main reason for the limitation of the three-dimensional (3D) display is the problem of accurately estimating the user's current point-of-gaze (PoG) in a 3D space. In this paper, we present a novel noncontact technique for the PoG estimation in a stereoscopic environment, which integrates a 3D stereoscopic display system and an eye-tracking system. The 3D stereoscopic display system can provide users with a friendly and immersive high-definition viewing experience without wearing any equipment. To accurately locate the user's 3D PoG in the field of view, we build a regression-based 3D eye-tracking model with the eye movement data and stereo stimulus videos as input. Besides, to train an optimal regression model, we also design and annotate a dataset that contains 30 users' eye-tracking data corresponding to two designed stereo test scenes. Innovatively, this dataset introduces feature vectors between eye region landmarks for the gaze vector estimation and a combined feature set for the gaze depth estimation. Moreover, five traditional regression models are trained and evaluated based on this dataset. Experimental results show that the average errors of the 3D PoG are about 0.90 cm on the X-axis, 0.83 cm on the Y-axis, and 1.48 cm / 0.12 m along the Z-axis for scene-depth ranges of 75 cm / 8 m, respectively.
    Describing and Localizing Multiple Changes with Transformers. (arXiv:2103.14146v2 [cs.CV] UPDATED)
    (2 min) Change captioning tasks aim to detect changes in image pairs observed before and after a scene change and generate a natural language description of the changes. Existing change captioning studies have mainly focused on a single change. However, detecting and describing multiple changed parts in image pairs is essential for enhancing adaptability to complex scenarios. We solve the above issues from three aspects: (i) We propose a simulation-based multi-change captioning dataset; (ii) We benchmark existing state-of-the-art methods of single change captioning on multi-change captioning; (iii) We further propose Multi-Change Captioning transformers (MCCFormers) that identify change regions by densely correlating different regions in image pairs and dynamically determine the related change regions with words in sentences. The proposed method obtained the highest scores on four conventional change captioning evaluation metrics for multi-change captioning. Additionally, our proposed method can separate attention maps for each change and performs well with respect to change localization. Moreover, the proposed framework outperformed the previous state-of-the-art methods on an existing change captioning benchmark, CLEVR-Change, by a large margin (+6.1 on BLEU-4 and +9.7 on CIDEr scores), indicating its general ability in change captioning tasks.
    MOTR: End-to-End Multiple-Object Tracking with TRansformer. (arXiv:2105.03247v2 [cs.CV] UPDATED)
    (2 min) The key challenge in the multiple-object tracking task is temporal modeling of the object under track. Existing tracking-by-detection methods adopt simple heuristics, such as spatial or appearance similarity. Such methods, in spite of their commonality, are overly simple and lack the ability to learn temporal variations from data in an end-to-end manner. In this paper, we present MOTR, a fully end-to-end multiple-object tracking framework. It learns to model the long-range temporal variation of the objects. It performs temporal association implicitly and avoids previous explicit heuristics. Built upon DETR, MOTR introduces the concept of "track query". Each track query models the entire track of an object. It is transferred and updated frame-by-frame to perform iterative predictions in a seamless manner. Tracklet-aware label assignment is proposed for one-to-one assignment between track queries and object tracks. A temporal aggregation network together with a collective average loss is further proposed to enhance the long-range temporal relation. Experimental results show that MOTR achieves competitive performance and can serve as a strong Transformer-based baseline for future research. Code is available at https://github.com/megvii-model/MOTR.
    Self-Supervised Training Enhances Online Continual Learning. (arXiv:2103.14010v3 [cs.CV] UPDATED)
    (2 min) In continual learning, a system must incrementally learn from a non-stationary data stream without catastrophic forgetting. Recently, multiple methods have been devised for incrementally learning classes on large-scale image classification tasks, such as ImageNet. State-of-the-art continual learning methods use an initial supervised pre-training phase, in which the first 10% - 50% of the classes in a dataset are used to learn representations in an offline manner before continual learning of new classes begins. We hypothesize that self-supervised pre-training could yield features that generalize better than supervised learning, especially when the number of samples used for pre-training is small. We test this hypothesis using the self-supervised MoCo-V2, Barlow Twins, and SwAV algorithms. On ImageNet, we find that these methods outperform supervised pre-training considerably for online continual learning, and the gains are larger when fewer samples are available. Our findings are consistent across three online continual learning algorithms. Our best system achieves a 14.95% relative increase in top-1 accuracy on class incremental ImageNet over the prior state of the art for online continual learning.
    VideoGPT: Video Generation using VQ-VAE and Transformers. (arXiv:2104.10157v2 [cs.CV] UPDATED)
    (2 min) We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html
    SA-GAN: Structure-Aware GAN for Organ-Preserving Synthetic CT Generation. (arXiv:2105.07044v3 [eess.IV] UPDATED)
    (2 min) In medical image synthesis, model training can be challenging due to inconsistencies between images of different modalities, even for the same patient, typically caused by internal status/tissue changes as different modalities are usually obtained at different times. This paper proposes a novel deep learning method, Structure-aware Generative Adversarial Network (SA-GAN), that preserves the shapes and locations of inconsistent structures when generating medical images. SA-GAN is employed to generate synthetic computed tomography (synCT) images from magnetic resonance imaging (MRI) with two parallel streams: the global stream translates the input from the MRI to the CT domain while the local stream automatically segments the inconsistent organs, maintains their locations and shapes in MRI, and translates the organ intensities to CT. Through extensive experiments on a pelvic dataset, we demonstrate that SA-GAN provides clinically acceptable accuracy on both synCTs and organ segmentation and supports MR-only treatment planning in disease sites with internal organ status changes.
    Self-Improving Semantic Perception for Indoor Localisation. (arXiv:2105.01595v2 [cs.RO] UPDATED)
    (2 min) We propose a novel robotic system that can improve its perception during deployment. Contrary to the established approach of learning semantics from large datasets and deploying fixed models, we propose a framework in which semantic models are continuously updated on the robot to adapt to the deployment environments. By combining continual learning with self-supervision, our robotic system learns online during deployment without external supervision. We conduct real-world experiments with robots localising in 3D floorplans. Our experiments show how the robot's semantic perception improves during deployment and how this translates into improved localisation, even across drastically different environments. We further study the risk of catastrophic forgetting that such a continuous learning setting poses. We find memory replay an effective measure to reduce forgetting and show how the robotic system can improve even when switching between different environments. On average, our system improves by 60% in segmentation and 10% in localisation accuracy compared to deployment of a fixed model, and it maintains this improvement while adapting to further environments.
    AGRNet: Adaptive Graph Representation Learning and Reasoning for Face Parsing. (arXiv:2101.07034v2 [cs.CV] UPDATED)
    (2 min) Face parsing infers a pixel-wise label for each facial component, which has drawn much attention recently. Previous methods have shown their success in face parsing, which, however, overlook the correlation among facial components. As a matter of fact, the component-wise relationship is a critical clue in discriminating ambiguous pixels in the facial area. To address this issue, we propose adaptive graph representation learning and reasoning over facial components, aiming to learn representative vertices that describe each component, exploit the component-wise relationship and thereby produce accurate parsing results against ambiguity. In particular, we devise an adaptive and differentiable graph abstraction method to represent the components on a graph via pixel-to-vertex projection under the initial condition of a predicted parsing map, where pixel features within a certain facial region are aggregated onto a vertex. Further, we explicitly incorporate the image edge as a prior in the model, which helps to discriminate edge and non-edge pixels during the projection, thus leading to refined parsing results along the edges. Then, our model learns and reasons over the relations among components by propagating information across vertices on the graph. Finally, the refined vertex features are projected back to pixel grids for the prediction of the final parsing map. To train our model, we propose a discriminative loss to penalize small distances between vertices in the feature space, which leads to distinct vertices with strong semantics. Experimental results show the superior performance of the proposed model on multiple face parsing datasets, along with validation on the human parsing task to demonstrate the generalizability of our model.
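    A compact sketch of the pixel-to-vertex projection idea: pixel features are aggregated onto one vertex per facial component, with the predicted parsing map serving as soft assignment weights. Shapes and the normalization below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def project_pixels_to_vertices(feats, parsing):
    """Aggregate pixel features onto one vertex per facial component.

    feats:   (B, C, H, W) pixel features.
    parsing: (B, K, H, W) predicted parsing map (soft assignment over K components).
    Returns  (B, K, C) vertex features, one per component.
    """
    b, c, h, w = feats.shape
    k = parsing.shape[1]
    p = parsing.reshape(b, k, h * w)                       # (B, K, N) assignment weights
    f = feats.reshape(b, c, h * w).transpose(1, 2)         # (B, N, C) pixel features
    vertex = torch.bmm(p, f)                               # weighted sums, (B, K, C)
    return vertex / (p.sum(dim=-1, keepdim=True) + 1e-6)   # normalize by total weight

v = project_pixels_to_vertices(torch.randn(2, 64, 32, 32),
                               torch.rand(2, 11, 32, 32))
```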
    Automatic Detection of Aedes aegypti Breeding Grounds Based on Deep Networks with Spatio-Temporal Consistency. (arXiv:2007.14863v2 [cs.CV] UPDATED)
    (2 min) Every year, the Aedes aegypti mosquito infects millions of people with diseases such as dengue, zika, chikungunya, and urban yellow fever. The main way to combat these diseases is to prevent mosquito reproduction by searching for and eliminating potential mosquito breeding grounds. In this work, we introduce a comprehensive dataset of aerial videos, acquired with an unmanned aerial vehicle, containing possible mosquito breeding sites. All frames of the video dataset were manually annotated with bounding boxes identifying all objects of interest. This dataset was employed to develop an automatic detection system for such objects based on deep convolutional networks. We propose to exploit the temporal information contained in the videos by incorporating, into the object detection pipeline, a spatio-temporal consistency module that can register the detected objects, minimizing most false-positive and false-negative occurrences. Using ResNet-50-FPN as a backbone, we achieve F1-scores of 0.65 and 0.77 on the object-level detection of 'tires' and 'water tanks', respectively, illustrating the system's capability to properly locate potential mosquito breeding objects.
    Fast and Accurate: Video Enhancement using Sparse Depth. (arXiv:2103.08764v2 [cs.CV] UPDATED)
    (2 min) This paper presents a general framework to build fast and accurate algorithms for video enhancement tasks such as super-resolution, deblurring, and denoising. Essential to our framework is the realization that the accuracy, rather than the density, of pixel flows is what is required for high-quality video enhancement. Most prior works take the opposite approach: they estimate dense (per-pixel), but generally less robust, flows, mostly using computationally costly algorithms. Instead, we propose a lightweight flow estimation algorithm; it fuses the sparse point cloud data and the (even sparser and less reliable) IMU data available in modern autonomous agents to estimate the flow information. Building on top of the flow estimation, we demonstrate a general framework that integrates the flows in a plug-and-play fashion with different task-specific layers. Algorithms built in our framework achieve 1.78x - 187.41x speedup while providing a 0.42 dB - 6.70 dB quality improvement over competing methods.
    Direct and Sparse Deformable Tracking. (arXiv:2109.07370v1 [cs.CV])
    (2 min) Deformable Monocular SLAM algorithms recover the localization of a camera in an unknown deformable environment. Current approaches use template-based deformable tracking to recover the camera pose and the deformation of the map. These template-based methods use an underlying global deformation model. In this paper, we introduce a novel deformable camera tracking method with a local deformation model for each point. Each map point is defined as a single textured surfel that moves independently of the other map points. Thanks to a direct photometric error cost function, we can track the position and orientation of each surfel without an explicit global deformation model. In our experiments, we validate the proposed system and observe that our local deformation model estimates the targeted deformations of the map more accurately and robustly, in both laboratory-controlled experiments and in-body scenarios undergoing non-isometric deformations, with changing topology or discontinuities.
    Direct Estimation of Appearance Models for Segmentation. (arXiv:2102.11121v3 [cs.CV] UPDATED)
    (2 min) Image segmentation algorithms often depend on appearance models that characterize the distribution of pixel values in different image regions. We describe a new approach for estimating appearance models directly from an image, without explicit consideration of the pixels that make up each region. Our approach is based on novel algebraic expressions that relate local image statistics to the appearance of spatially coherent regions. We describe two algorithms that can use the aforementioned algebraic expressions to estimate appearance models directly from an image. The first algorithm solves a system of linear and quadratic equations using a least squares formulation. The second algorithm is a spectral method based on an eigenvector computation. We present experimental results that demonstrate the proposed methods work well in practice and lead to effective image segmentation algorithms.
    Polarimetric Spatio-Temporal Light Transport Probing. (arXiv:2105.11609v2 [cs.CV] UPDATED)
    (2 min) Light emitted from a source into a scene can undergo complex interactions with scene surfaces of different material types before being reflected. During this transport, every surface reflection is encoded in the properties of the photons that reach the detector, including time, direction, intensity, wavelength and polarization. Conventional imaging systems capture intensity by integrating over all other dimensions of the light, hiding this rich scene information. Existing methods are capable of untangling these measurements into their spatial and temporal dimensions, fueling geometric scene understanding tasks. However, examining material properties jointly with geometric properties is an open challenge that could enable unprecedented capabilities beyond geometric scene understanding, allowing for material-dependent scene understanding and imaging through complex transport. In this work, we close this gap, and propose a computational light transport imaging method that captures the spatially- and temporally-resolved complete polarimetric response of a scene. Our method hinges on a 7D tensor theory of light transport. We discover low-rank structure in the polarimetric tensor dimension and propose a data-driven rotating ellipsometry method that learns to exploit redundancy of polarimetric structure. We instantiate our theory with two prototypes: spatio-polarimetric imaging and coaxial temporal-polarimetric imaging. This allows us, for the first time, to decompose scene light transport into temporal, spatial, and complete polarimetric dimensions that unveil scene properties hidden to conventional methods. We validate the applicability of our method on diverse tasks, including shape reconstruction with subsurface scattering, seeing through scattering media, untangling multi-bounce light transport, breaking metamerism, and decomposition of crystals.
    Making Contrastive Learning Robust to Shortcuts. (arXiv:2012.09962v4 [cs.LG] UPDATED)
    (2 min) Contrastive learning is one of the fastest growing research areas in machine learning due to its ability to learn useful representations without labeled data. However, contrastive learning is susceptible to shortcuts - i.e., it may learn shortcut features irrelevant to the task of interest, and discard relevant information. Past work has addressed this limitation via handcrafted data augmentations that eliminate the shortcut. But, manually crafted augmentations do not work across all datasets and tasks. Further, data augmentations fail in addressing shortcuts in multi-attribute classification when one attribute acts as a shortcut around other attributes. In this paper, we analyze the objective function of contrastive learning and formally prove that it is vulnerable to shortcuts. We then present reconstructive contrastive learning (RCL), a framework for learning unsupervised representations that are robust to shortcuts. The key idea is to force the learned representation to reconstruct the input, which naturally counters potential shortcuts. Extensive experiments verify that RCL is highly robust to shortcuts and outperforms state-of-the-art contrastive learning methods on a variety of datasets and tasks.
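    A minimal sketch of the key idea, assuming a standard InfoNCE contrastive term plus a mean-squared reconstruction term; `encoder`, `decoder`, and the weighting `lam` are placeholders, and the paper's exact loss may be composed differently.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE between two augmented views (B, D) of the same batch."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)

def rcl_loss(encoder, decoder, x1, x2, lam=1.0):
    """Contrastive term + reconstruction term that counters shortcut features."""
    z1, z2 = encoder(x1), encoder(x2)
    contrastive = info_nce(z1, z2)
    recon = F.mse_loss(decoder(z1), x1) + F.mse_loss(decoder(z2), x2)
    return contrastive + lam * recon
```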
    Land Cover Mapping in Limited Labels Scenario: A Survey. (arXiv:2103.02429v2 [cs.LG] UPDATED)
    (2 min) Land cover mapping is essential for monitoring global environmental change and managing natural resources. Unfortunately, traditional classification models are plagued by limited training data available in existing land cover products and data heterogeneity over space and time. In this survey, we provide a structured and comprehensive overview of challenges in land cover mapping and machine learning methods used to address these problems. We also discuss the gaps and opportunities that exist for advancing research in this promising direction.
    S3LAM: Structured Scene SLAM. (arXiv:2109.07339v1 [cs.RO])
    (2 min) We propose a new general SLAM system that uses the semantic segmentation of objects and structures in the scene. Semantic information is relevant as it contains high-level information which may make SLAM more accurate and robust. Our contribution is threefold: i) A new SLAM system based on ORB-SLAM2 that creates a semantic map made of clusters of points corresponding to object instances and structures in the scene. ii) A modification of the classical Bundle Adjustment formulation to constrain each cluster using geometrical priors, which improves both camera localization and reconstruction and enables a better understanding of the scene. iii) A new Bundle Adjustment formulation at the level of clusters to improve the convergence of classical Bundle Adjustment. We evaluate our approach on several sequences from a public dataset and show that, with respect to ORB-SLAM2, it improves camera pose estimation.
    Combo Loss: Handling Input and Output Imbalance in Multi-Organ Segmentation. (arXiv:1805.02798v6 [cs.CV] UPDATED)
    (3 min) Simultaneous segmentation of multiple organs from different medical imaging modalities is a crucial task as it can be utilized for computer-aided diagnosis, computer-assisted surgery, and therapy planning. Thanks to the recent advances in deep learning, several deep neural networks for medical image segmentation have been introduced successfully for this purpose. In this paper, we focus on learning a deep multi-organ segmentation network that labels voxels. In particular, we examine the critical choice of a loss function in order to handle the notorious imbalance problem that plagues both the input and output of a learning model. The input imbalance refers to the class-imbalance in the input training samples (i.e., small foreground objects embedded in an abundance of background voxels, as well as organs of varying sizes). The output imbalance refers to the imbalance between the false positives and false negatives of the inference model. In order to tackle both types of imbalance during training and inference, we introduce a new curriculum-learning-based loss function. Specifically, we leverage the Dice similarity coefficient to deter model parameters from being held at bad local minima and at the same time gradually learn better model parameters by penalizing false positives/negatives using a cross-entropy term. We evaluated the proposed loss function on three datasets: whole-body positron emission tomography (PET) scans with 5 target organs, magnetic resonance imaging (MRI) prostate scans, and ultrasound echocardiography images with a single target organ, i.e., the left ventricle. We show that a simple network architecture with the proposed integrative loss function can outperform state-of-the-art methods, and that the results of the competing methods can be improved when our proposed loss is used.
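    One way to write such a combined loss for the binary case, sketched under the assumption of a Dice term for the input (class) imbalance plus a cross-entropy term whose beta parameter trades false-negative against false-positive penalties; the exact weighting and curriculum schedule used in the paper may differ.

```python
import torch

def combo_loss(probs, target, alpha=0.5, beta=0.7, eps=1e-7):
    """Weighted cross-entropy + Dice for binary segmentation.

    probs, target: (B, ...) foreground probabilities and {0, 1} ground truth.
    beta > 0.5 penalizes false negatives more than false positives (and vice versa);
    alpha balances the cross-entropy and Dice terms.
    """
    probs = probs.clamp(eps, 1.0 - eps)
    # asymmetric cross-entropy: beta weights the foreground (false-negative) term
    ce = -(beta * target * torch.log(probs)
           + (1.0 - beta) * (1.0 - target) * torch.log(1.0 - probs)).mean()
    inter = (probs * target).sum()
    dice = (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    return alpha * ce + (1.0 - alpha) * (1.0 - dice)
```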
    An Efficient Quantitative Approach for Optimizing Convolutional Neural Networks. (arXiv:2009.05236v4 [cs.CV] UPDATED)
    (2 min) With the increasing popularity of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in various domains, such as image classification and object detection, and have achieved stunning success in terms of their high accuracy over traditional statistical methods. To exploit the potential of CNN models, a huge amount of research and industry effort has been devoted to optimizing CNNs. Among these endeavors, CNN architecture design has attracted tremendous attention because of its great potential for improving model accuracy or reducing model complexity. However, existing work either introduces repeated training overhead in the search process or lacks an interpretable metric to guide the design. To clear these hurdles, we propose 3D-Receptive Field (3DRF), an explainable and easy-to-compute metric, to estimate the quality of a CNN architecture and guide the search process of designs. To validate the effectiveness of 3DRF, we build a static optimizer to improve CNN architectures at both the stage level and the kernel level. Our optimizer not only provides a clear and reproducible procedure but also mitigates unnecessary training efforts in the architecture search process. Extensive experiments and studies show that the models generated by our optimizer can achieve up to 5.47% accuracy improvement and up to 65.38% parameter reduction, compared with state-of-the-art CNN structures like MobileNet and ResNet.
    Likelihood-Based Diverse Sampling for Trajectory Forecasting. (arXiv:2011.15084v2 [cs.CV] UPDATED)
    (2 min) Forecasting complex vehicle and pedestrian multi-modal distributions requires powerful probabilistic approaches. Normalizing flows (NF) have recently emerged as an attractive tool to model such distributions. However, a key drawback is that independent samples drawn from a flow model often do not adequately capture all the modes in the underlying distribution. We propose Likelihood-Based Diverse Sampling (LDS), a method for improving the quality and the diversity of trajectory samples from a pre-trained flow model. Rather than producing individual samples, LDS produces a set of trajectories in one shot. Given a pre-trained forecasting flow model, we train LDS using gradients from the model, to optimize an objective function that rewards high likelihood for individual trajectories in the predicted set, together with high spatial separation among trajectories. LDS outperforms state-of-the-art post-hoc neural diverse forecasting methods for various pre-trained flow models as well as conditional variational autoencoder (CVAE) models. Crucially, it can also be used for transductive trajectory forecasting, where the diverse forecasts are trained on-the-fly on unlabeled test examples. LDS is easy to implement, and we show that it offers a simple plug-in improvement over baselines on two challenging benchmarks. Code is at: https://github.com/JasonMa2016/LDS
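    In outline, the objective rewards per-trajectory likelihood under the frozen flow while encouraging spatial separation within the predicted set; a hedged sketch, with `log_prob_fn` standing in for the pre-trained flow's log-likelihood and `gamma` an assumed trade-off weight (the paper's exact separation term may differ).

```python
import torch

def lds_objective(trajectories, log_prob_fn, gamma=1.0):
    """Score a set of K trajectories: likelihoods high, mutual distances large.

    trajectories: (K, T, 2) a one-shot set of predicted trajectories.
    log_prob_fn:  callable returning the frozen flow's log-likelihood per trajectory.
    Returns a scalar to *maximize* (negate it for gradient descent on the sampler).
    """
    likelihood = log_prob_fn(trajectories).sum()                      # (K,) -> scalar
    diffs = trajectories.unsqueeze(0) - trajectories.unsqueeze(1)     # (K, K, T, 2)
    separation = diffs.norm(dim=-1).mean()                            # avg pairwise distance
    return likelihood + gamma * separation
```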
    DAPnet: A Double Self-attention Convolutional Network for Point Cloud Semantic Labeling. (arXiv:2004.08596v2 [cs.CV] UPDATED)
    (3 min) Airborne Laser Scanning (ALS) point clouds have complex structures, and their 3D semantic labeling has been a challenging task. It has three problems: (1) the difficulty of classifying point clouds around boundaries of objects from different classes, (2) the diversity of shapes within the same class, and (3) the scale differences between classes. In this study, we propose a novel double self-attention convolutional network called DAPnet. The double self-attention includes the point attention module (PAM) and the group attention module (GAM). For problem (1), the PAM can effectively assign different weights based on the relevance of point clouds in adjacent areas. Meanwhile, for problem (2), the GAM enhances the correlation between groups, i.e., grouped features within the same classes. To solve problem (3), we adopt a multiscale radius to construct the groups and concatenate extracted hierarchical features with the output of the corresponding upsampling process. On the ISPRS 3D Semantic Labeling Contest dataset, the DAPnet outperforms the benchmark by 85.2% with an overall accuracy of 90.7%. By conducting ablation comparisons, we find that the PAM improves the model more effectively than the GAM. The incorporation of the double self-attention module yields an average 7% improvement in per-class accuracy. In addition, the DAPnet consumes a similar training time to networks without the attention modules for model convergence. The DAPnet can assign different weights to features based on the relevance between point clouds and their neighbors, which effectively improves classification performance. The source code is available at: https://github.com/RayleighChen/point-attention.
    Uncertainty-Aware Body Composition Analysis with Deep Regression Ensembles on UK Biobank MRI. (arXiv:2101.06963v3 [eess.IV] UPDATED)
    (2 min) Along with rich health-related metadata, medical images have been acquired for over 40,000 male and female UK Biobank participants, aged 44-82, since 2014. Phenotypes derived from these images, such as measurements of body composition from MRI, can reveal new links between genetics, cardiovascular disease, and metabolic conditions. In this work, six measurements of body composition and adipose tissues were automatically estimated by image-based, deep regression with ResNet50 neural networks from neck-to-knee body MRI. Despite the potential for high speed and accuracy, these networks produce no output segmentations that could indicate the reliability of individual measurements. The presented experiments therefore examine uncertainty quantification with mean-variance regression and ensembling to estimate individual measurement errors and thereby identify potential outliers, anomalies, and other failure cases automatically. In 10-fold cross-validation on data of about 8,500 subjects, mean-variance regression and ensembling showed complementary benefits, reducing the mean absolute error across all predictions by 12%. Both improved the calibration of uncertainties and their ability to identify high prediction errors. With intra-class correlation coefficients (ICC) above 0.97, all targets except the liver fat content yielded relative measurement errors below 5%. Testing on another 1,000 subjects showed consistent performance, and the method was finally deployed for inference to 30,000 subjects with missing reference values. The results indicate that deep regression ensembles could ultimately provide automated, uncertainty-aware measurements of body composition for more than 120,000 UK Biobank neck-to-knee body MRI that are to be acquired within the coming years.
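    The two ingredients can be sketched compactly: a heteroscedastic (mean-variance) regression loss in which the network predicts both a mean and a log-variance, and an ensemble combination that adds the averaged predicted variance to the disagreement between members. This follows the standard deep-ensemble recipe and only approximates the paper's setup.

```python
import torch

def gaussian_nll(mean, log_var, target):
    """Heteroscedastic regression loss: the network predicts mean and log-variance."""
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

def ensemble_predict(means, log_vars):
    """Combine M ensemble members into one mean and a total predictive variance.

    means, log_vars: (M, B) stacked per-member predictions for a batch.
    """
    mean = means.mean(dim=0)
    aleatoric = log_vars.exp().mean(dim=0)        # average predicted noise variance
    epistemic = means.var(dim=0, unbiased=False)  # disagreement between members
    return mean, aleatoric + epistemic
```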
    Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering. (arXiv:2109.07448v1 [cs.CV])
    (2 min) In this paper, we aim at synthesizing a free-viewpoint video of an arbitrary human performance using sparse multi-view cameras. Recently, several works have addressed this problem by learning person-specific neural radiance fields (NeRF) to capture the appearance of a particular human. In parallel, some work proposed to use pixel-aligned features to generalize radiance fields to arbitrary new scenes and objects. Adopting such generalization approaches to humans, however, is highly challenging due to the heavy occlusions and dynamic articulations of body parts. To tackle this, we propose Neural Human Performer, a novel approach that learns generalizable neural radiance fields based on a parametric human body model for robust performance capture. Specifically, we first introduce a temporal transformer that aggregates tracked visual features based on the skeletal body motion over time. Moreover, a multi-view transformer is proposed to perform cross-attention between the temporally-fused features and the pixel-aligned features at each time step to integrate observations on the fly from multiple views. Experiments on the ZJU-MoCap and AIST datasets show that our method significantly outperforms recent generalizable NeRF methods on unseen identities and poses. The video results and code are available at https://youngjoongunc.github.io/nhp.
    VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction. (arXiv:2104.05932v2 [cs.CV] UPDATED)
    (2 min) 3D object detection and dense depth estimation are among the most vital tasks in autonomous driving. Multiple sensor modalities can jointly contribute to better robot perception, and to that end, we introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks. It takes as inputs a LiDAR point-cloud and a single RGB image during inference and produces object pose predictions as well as a densely reconstructed depth map. The LiDAR point-cloud is converted into a set of voxels, and its features are extracted using 3D convolution layers, from which we regress object pose parameters. Corresponding RGB image features are extracted using another 2D convolutional neural network. We further use these combined features to predict a dense depth map. While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions. We also introduce a loss function, the edge-preserving smooth loss, and show that it results in better depth estimation compared to the edge-aware smooth loss function frequently used in depth prediction works.
    Incremental Multi-Target Domain Adaptation for Object Detection with Efficient Domain Transfer. (arXiv:2104.06476v3 [cs.CV] UPDATED)
    (3 min) Recent advances in unsupervised domain adaptation have significantly improved the recognition accuracy of CNNs by alleviating the domain shift between (labeled) source and (unlabeled) target data distributions. While the problem of single-target domain adaptation (STDA) for object detection has recently received much attention, multi-target domain adaptation (MTDA) remains largely unexplored, despite its practical relevance in several real-world applications, such as multi-camera video surveillance. Compared to the STDA problem that may involve large domain shifts between complex source and target distributions, MTDA faces additional challenges, most notably the computational requirements and catastrophic forgetting of previously-learned targets, which can depend on the order of target adaptations. STDA for detection can be applied to MTDA by adapting one model per target, or one common model with a mixture of data from target domains. However, these approaches are either costly or inaccurate. The only state-of-the-art MTDA method specialized for detection learns targets incrementally, one target at a time, and mitigates the loss of knowledge by using a duplicated detection model for knowledge distillation, which is computationally expensive and does not scale well to many domains. In this paper, we introduce an efficient approach for incremental learning that generalizes well to multiple target domains. Our MTDA approach is more suitable for real-world applications since it allows updating the detection model incrementally, without storing data from previously-learned target domains, nor retraining when a new target domain becomes available. Our proposed method, MTDA-DTM, achieved the highest level of detection accuracy compared against state-of-the-art approaches on several MTDA detection benchmarks and Wildtrack, a benchmark for multi-camera pedestrian detection.
    CLIPScore: A Reference-free Evaluation Metric for Image Captioning. (arXiv:2104.08718v2 [cs.CV] UPDATED)
    (2 min) Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
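    At its core the metric is a rescaled, clipped cosine similarity between CLIP image and caption embeddings; a sketch from precomputed embeddings, with the rescaling weight w = 2.5 taken here as an assumption.

```python
import torch
import torch.nn.functional as F

def clip_score(image_emb, text_emb, w=2.5):
    """Reference-free caption score from precomputed CLIP embeddings.

    image_emb: (D,) CLIP image embedding; text_emb: (D,) CLIP caption embedding.
    w rescales the cosine similarity into a more readable range (assumed 2.5 here).
    """
    cos = F.cosine_similarity(image_emb.unsqueeze(0), text_emb.unsqueeze(0)).item()
    return w * max(cos, 0.0)

score = clip_score(torch.randn(512), torch.randn(512))
```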
    Semi-supervised Contrastive Learning for Label-efficient Medical Image Segmentation. (arXiv:2109.07407v1 [cs.CV])
    (2 min) The success of deep learning methods in medical image segmentation tasks heavily depends on a large amount of labeled data to supervise the training. On the other hand, the annotation of biomedical images requires domain knowledge and can be laborious. Recently, contrastive learning has demonstrated great potential in learning latent representations of images even without any labels. Existing works have explored its application to biomedical image segmentation where only a small portion of data is labeled, through a pre-training phase based on self-supervised contrastive learning without using any labels, followed by a supervised fine-tuning phase on the labeled portion of data only. In this paper, we establish that by including the limited label information in the pre-training phase, it is possible to boost the performance of contrastive learning. We propose a supervised local contrastive loss that leverages limited pixel-wise annotation to force pixels with the same label to gather around in the embedding space. Such a loss requires pixel-wise computation, which can be expensive for large images, and we further propose two strategies, downsampling and block division, to address the issue. We evaluate our methods on two public biomedical image datasets of different modalities. With different amounts of labeled data, our methods consistently outperform the state-of-the-art contrast-based methods and other semi-supervised learning techniques.
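    A sketch of a supervised pixel-level contrastive term of this kind for a single image, where same-label pixels act as positives and the feature map is strided (downsampled) to keep the pairwise computation affordable; the sampling details and the block-division strategy are omitted, and this is only an assumed instantiation of the idea.

```python
import torch
import torch.nn.functional as F

def supervised_pixel_contrastive(feats, labels, stride=4, temperature=0.1):
    """Pull same-label pixel embeddings together (single image, strided for cost).

    feats:  (C, H, W) pixel embeddings;  labels: (H, W) integer annotations.
    """
    f = feats[:, ::stride, ::stride].reshape(feats.shape[0], -1).t()  # (N, C)
    y = labels[::stride, ::stride].reshape(-1)                        # (N,)
    f = F.normalize(f, dim=-1)
    self_mask = torch.eye(len(y), dtype=torch.bool, device=f.device)
    logits = (f @ f.t() / temperature).masked_fill(self_mask, -1e9)   # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask             # same-label pairs
    n_pos = pos.sum(dim=1)
    anchors = n_pos > 0                                               # anchors with >= 1 positive
    loss = -(pos.float() * log_prob).sum(dim=1)[anchors] / n_pos[anchors]
    return loss.mean()
```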
    Does language help generalization in vision models?. (arXiv:2104.08313v3 [cs.AI] UPDATED)
    (2 min) Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness. In each setting, multimodal training produced no additional generalization capability compared to standard supervised visual training. We conclude that work is still required for semantic grounding to help improve vision models.
    A Unified Framework for Biphasic Facial Age Translation with Noisy-Semantic Guided Generative Adversarial Networks. (arXiv:2109.07373v1 [cs.CV])
    (2 min) Biphasic facial age translation aims at predicting the appearance of the input face at any age. Facial age translation has received considerable research attention in the last decade due to its practical value in cross-age face recognition and various entertainment applications. However, most existing methods model age changes between holistic images, regardless of the human face structure and the age-changing patterns of individual facial components. Consequently, the lack of semantic supervision causes infidelity of the generated faces in detail. To this end, we propose a unified framework for biphasic facial age translation with noisy-semantic guided generative adversarial networks. Structurally, we project the class-aware noisy semantic layouts to soft latent maps for the following injection operation on the individual facial parts. In particular, we introduce two sub-networks, ProjectionNet and ConstraintNet. ProjectionNet introduces the low-level structural semantic information with a noise map and produces the soft latent maps. ConstraintNet disentangles the high-level spatial features to constrain the soft latent maps, which injects more age-related context into the soft latent maps. Specifically, an attention mechanism is employed in ConstraintNet for feature disentanglement. Meanwhile, in order to elicit the strongest mapping ability of the network, we embed two types of learning strategies in the training procedure, supervised self-driven generation and unsupervised condition-driven cycle-consistent generation. As a result, extensive experiments conducted on the MORPH and CACD datasets demonstrate the prominent ability of our proposed method, which achieves state-of-the-art performance.
    Contact-Aware Retargeting of Skinned Motion. (arXiv:2109.07431v1 [cs.CV])
    (2 min) This paper introduces a motion retargeting method that preserves self-contacts and prevents interpenetration. Self-contacts, such as when hands touch each other or the torso or the head, are important attributes of human body language and dynamics, yet existing methods do not model or preserve these contacts. Likewise, interpenetration, such as a hand passing into the torso, is a typical artifact of motion estimation methods. The input to our method is a human motion sequence and a target skeleton and character geometry. The method identifies self-contacts and ground contacts in the input motion, and optimizes the motion to apply to the output skeleton, while preserving these contacts and reducing interpenetration. We introduce a novel geometry-conditioned recurrent network with an encoder-space optimization strategy that achieves efficient retargeting while satisfying contact constraints. In experiments, our results quantitatively outperform previous methods, and we conduct a user study where our retargeted motions are rated as higher-quality than those produced by recent works. We also show our method generalizes to motion estimated from human videos, where we improve over previous works that produce noticeable interpenetration.
    Deep Bregman Divergence for Contrastive Learning of Visual Representations. (arXiv:2109.07455v1 [cs.CV])
    (2 min) Deep Bregman divergence measures the divergence of data points using neural networks, going beyond Euclidean distance and capturing divergence over distributions. In this paper, we propose deep Bregman divergences for contrastive learning of visual representations, and we aim to enhance the contrastive loss used in self-supervised learning by training additional networks based on functional Bregman divergence. In contrast to conventional contrastive learning methods, which are solely based on divergences between single points, our framework can capture the divergence between distributions, which improves the quality of the learned representation. By combining the conventional contrastive loss with the proposed divergence loss, our method outperforms the baseline and most previous methods for self-supervised and semi-supervised learning on multiple classification and object detection tasks and datasets. The source code of the method and of all the experiments is available in the supplementary material.
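    The underlying quantity is the Bregman divergence generated by a learned function phi: D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>. A hedged sketch with a plain MLP as phi; a strictly convex phi is needed for a proper divergence, and the paper's functional Bregman formulation over distributions is richer than this point-wise version.

```python
import torch
import torch.nn as nn

# a learned generator function phi: R^128 -> R (only approximately convex here)
phi = nn.Sequential(nn.Linear(128, 256), nn.Softplus(), nn.Linear(256, 1))

def bregman_divergence(p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q> for batches (B, 128).

    Note: q is detached to take the gradient of phi at q; in a full training loop,
    propagating gradients back to the encoder that produced q needs extra care.
    """
    q = q.detach().requires_grad_(True)
    grad_q = torch.autograd.grad(phi(q).sum(), q, create_graph=True)[0]   # (B, 128)
    return (phi(p) - phi(q) - ((p - q) * grad_q).sum(dim=1, keepdim=True)).squeeze(1)

d = bregman_divergence(torch.randn(4, 128), torch.randn(4, 128))
```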
    A Wide-area, Low-latency, and Power-efficient 6-DoF Pose Tracking System for Rigid Objects. (arXiv:2109.07428v1 [cs.RO])
    (2 min) Position sensitive detectors (PSDs) offer the possibility of tracking a single active marker's two (or three) degrees of freedom (DoF) position with high accuracy, while having a fast response time with high update frequency and low latency, all using a very simple signal processing circuit. However, they are not particularly suitable for 6-DoF object pose tracking systems due to the lack of orientation measurement, limited tracking range, and sensitivity to environmental variation. We propose a novel 6-DoF pose tracking system for rigid objects that requires only a single active marker. The proposed system uses a stereo-based PSD pair and multiple Inertial Measurement Units (IMUs). This is done based on a practical approach to identify and control the power of Infrared-Light Emitting Diode (IR-LED) active markers, with the aim of increasing the tracking workspace and reducing power consumption. Our proposed tracking system is validated with three different workspace sizes and for static and dynamic positional accuracy using a robotic arm manipulator with three different dynamic motion patterns. The results show that the static position root-mean-square (RMS) error is 0.6 mm. The dynamic position RMS error is 0.7-0.9 mm. The orientation RMS error is between 0.04 and 0.9 degrees under the varied dynamic motions. Overall, our proposed tracking system is capable of tracking a rigid object pose with sub-millimeter accuracy at the mid range of the workspace and sub-degree accuracy for the whole workspace under a lab setting.
    Learning Dynamical Human-Joint Affinity for 3D Pose Estimation in Videos. (arXiv:2109.07353v1 [cs.CV])
    (2 min) Graph Convolution Networks (GCNs) have been successfully used for 3D human pose estimation in videos. However, they are often built on a fixed human-joint affinity defined by the human skeleton, which may reduce the adaptation capacity of the GCN to tackle complex spatio-temporal pose variations in videos. To alleviate this problem, we propose a novel Dynamical Graph Network (DG-Net), which can dynamically identify human-joint affinity and estimate 3D pose by adaptively learning spatial/temporal joint relations from videos. Different from traditional graph convolution, we introduce Dynamical Spatial/Temporal Graph convolution (DSG/DTG) to discover the spatial/temporal human-joint affinity for each video exemplar, depending on the spatial distance/temporal movement similarity between human joints in the video. Hence, they can effectively understand which joints are spatially closer and/or have consistent motion, reducing depth ambiguity and/or motion uncertainty when lifting 2D pose to 3D pose. We conduct extensive experiments on three popular benchmarks, i.e., Human3.6M, HumanEva-I, and MPI-INF-3DHP, where DG-Net outperforms a number of recent SOTA approaches with fewer input frames and a smaller model size.
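    The following minimal sketch illustrates the general idea of a data-dependent joint affinity, computed per sample from pairwise feature similarity and blended with a fixed skeleton prior, rather than the exact DSG/DTG modules of the paper; the class name, blending weights, and toy dimensions are assumptions for illustration only.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphConv(nn.Module):
    # Graph convolution whose adjacency is predicted per sample from pairwise
    # joint-feature similarity instead of being fixed by the skeleton alone.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, skeleton_adj):
        # x: (batch, joints, in_dim); skeleton_adj: (joints, joints) fixed prior
        sim = torch.einsum("bjd,bkd->bjk", x, x) / x.size(-1) ** 0.5
        dyn_adj = F.softmax(sim, dim=-1)            # data-dependent affinity
        adj = 0.5 * dyn_adj + 0.5 * skeleton_adj    # blend with the skeleton prior
        return F.relu(self.proj(adj @ x))           # aggregate neighbors, then project

# Toy usage: 17 joints with 64-dim features per joint.
x = torch.randn(2, 17, 64)
skeleton_adj = torch.eye(17)                        # stand-in for the true skeleton graph
out = DynamicGraphConv(64, 128)(x, skeleton_adj)    # (2, 17, 128)
```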
    PointManifoldCut: Point-wise Augmentation in the Manifold for Point Clouds. (arXiv:2109.07324v1 [cs.CV])
    (2 min) Augmentation can benefit point cloud learning due to the limited availability of large-scale public datasets. This paper proposes a mix-up augmentation approach, PointManifoldCut, which replaces points in the neural network's embedding space rather than in Euclidean coordinate space. This approach exploits the fact that point representations at higher levels of the network have already been trained to embed their neighborhood relations, so mixing these representations does not entangle a point with an incorrect label. This regularizes the parameter space as other augmentation methods do, but without the need to worry about the proper label of the replaced points. The experiments show that our proposed approach provides competitive performance on point cloud classification and segmentation when combined with state-of-the-art vanilla point cloud networks. The results show a consistent performance boost compared to other state-of-the-art point cloud augmentation methods, such as PointMixup and PointCutMix. The code of this paper is available at: https://github.com/fun0515/PointManifoldCut.
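    A minimal sketch of the core manifold mix-up idea (the official implementation is at the linked repository): swap a random subset of embedded point features between two samples and mix the labels in proportion to how many points were replaced. The function name, swap ratio, and toy shapes below are illustrative assumptions.
```python
import torch

def manifold_point_mixup(feats_a, feats_b, label_a, label_b, ratio=0.3):
    """Swap a random subset of embedded point features between two samples.

    feats_a, feats_b: (num_points, feat_dim) intermediate network embeddings.
    label_a, label_b: one-hot (or soft) class labels.
    Returns mixed features for sample A and a label mixed in proportion
    to how many points were replaced.
    """
    num_points = feats_a.size(0)
    num_swap = int(ratio * num_points)
    idx = torch.randperm(num_points)[:num_swap]
    mixed = feats_a.clone()
    mixed[idx] = feats_b[idx]                 # cut/paste points in the embedded manifold
    lam = 1.0 - num_swap / num_points
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

# Toy usage: 1024 points, 128-dim embeddings, 40 classes.
a, b = torch.randn(1024, 128), torch.randn(1024, 128)
ya = torch.nn.functional.one_hot(torch.tensor(3), 40).float()
yb = torch.nn.functional.one_hot(torch.tensor(7), 40).float()
mixed_feats, mixed_label = manifold_point_mixup(a, b, ya, yb)
```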
    MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection. (arXiv:2109.07311v1 [cs.CV])
    (2 min) The rapid progress in the ease of creating and spreading ultra-realistic media over social platforms calls for an urgent need to develop a generalizable deepfake detection technique. It has been observed that current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos. Inspired by this observation, in this paper, we present a novel approach, termed MD-CSDNetwork, for combining features in the spatial and frequency domains to mine a shared discriminative representation for classifying deepfakes. MD-CSDNetwork is a novel cross-stitched network with two parallel branches carrying the spatial and frequency information, respectively. We hypothesize that these multi-domain input data streams can be considered as related supervisory signals. The supervision from both branches ensures better performance and generalization. Further, cross-stitch connections are inserted between the two branches to automatically learn an optimal combination of domain-specific and shared representations. Extensive experiments are conducted on the popular FaceForensics++ benchmark dataset for forgery classification. We report improvements over all the manipulation types in the FaceForensics++ dataset and comparable results with state-of-the-art methods for cross-database evaluation on the Celeb-DF dataset and the Deepfake Detection Dataset.
    RGB-D Saliency Detection via Cascaded Mutual Information Minimization. (arXiv:2109.07246v1 [cs.CV])
    (2 min) Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning. In this paper, we introduce a novel multi-stage cascaded learning framework via mutual information minimization to explicitly model the multi-modal information between the RGB image and depth data. Specifically, we first map the features of each modality to a lower dimensional feature vector, and adopt mutual information minimization as a regularizer to reduce the redundancy between appearance features from RGB and geometric features from depth. We then perform multi-stage cascaded learning to impose the mutual information minimization constraint at every stage of the network. Extensive experiments on benchmark RGB-D saliency datasets illustrate the effectiveness of our framework. Further, to foster the development of this field, we contribute the largest dataset (7x larger than NJU2K), which contains 15,625 image pairs with high quality polygon-/scribble-/object-/instance-/rank-level annotations. Based on these rich labels, we additionally construct four new benchmarks with strong baselines and observe some interesting phenomena, which can motivate future model design. Source code and dataset are available at https://github.com/JingZhang617/cascaded_rgbd_sod.
    3D Annotation Of Arbitrary Objects In The Wild. (arXiv:2109.07165v1 [cs.CV])
    (2 min) Recent years have produced a variety of learning-based methods in the context of computer vision and robotics. Most of the recently proposed methods are based on deep learning, which requires very large amounts of data compared to traditional methods. The performance of deep learning methods is largely dependent on the data distribution they were trained on, and it is important to use data from the robot's actual operating domain during training. Therefore, it is not possible to rely on pre-built, generic datasets when deploying robots in real environments, creating a need for efficient data collection and annotation under the specific conditions in which the robots will operate. The challenge is then: how do we reduce the cost of obtaining such datasets to a point where we can easily deploy our robots in new conditions and environments, and support new sensors? As an answer to this question, we propose a data annotation pipeline based on SLAM, 3D reconstruction, and 3D-to-2D geometry. The pipeline allows creating 3D and 2D bounding boxes, along with per-pixel annotations of arbitrary objects, without needing accurate 3D models of the objects prior to data collection and annotation. Our results showcase almost 90% Intersection-over-Union (IoU) agreement on both semantic segmentation and 2D bounding box detection across a variety of objects and scenes, while speeding up the annotation process by several orders of magnitude compared to traditional manual annotation.
    Anchor DETR: Query Design for Transformer-Based Detector. (arXiv:2109.07107v1 [cs.CV])
    (2 min) In this paper, we propose a novel query design for transformer-based detectors. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding does not have an explicit physical meaning, and we cannot explain where it will focus. It is difficult to optimize because the prediction slot of each object query does not have a specific mode; in other words, each object query does not focus on a specific region. To solve these problems, in our query design, object queries are based on anchor points, which are widely used in CNN-based detectors, so that each object query focuses on the objects near its anchor point. Moreover, our query design can predict multiple objects at one position to address the "one region, multiple objects" difficulty. In addition, we design an attention variant that reduces the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, achieves better performance and runs faster than DETR with 10x fewer training epochs. For example, it achieves 44.2 AP at 16 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature and training for 50 epochs. Extensive experiments on the MSCOCO benchmark prove the effectiveness of the proposed methods. Code is available at https://github.com/megvii-model/AnchorDETR.
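    The snippet below is a rough, hedged sketch of the anchor-point query idea only: a small MLP maps normalized 2D anchor coordinates to query embeddings, and a few learned "pattern" embeddings per anchor allow multiple predictions at one position. It is not the released Anchor DETR code; the class name, hidden size, and number of patterns are assumptions.
```python
import torch
import torch.nn as nn

class AnchorQueryGenerator(nn.Module):
    # Turns a grid of 2D anchor points into decoder queries. Each anchor gets
    # several "pattern" embeddings so one position can predict multiple objects.
    def __init__(self, hidden_dim=256, num_patterns=3):
        super().__init__()
        self.pos_mlp = nn.Sequential(
            nn.Linear(2, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.patterns = nn.Embedding(num_patterns, hidden_dim)

    def forward(self, anchors):
        # anchors: (num_anchors, 2) normalized (x, y) points in [0, 1]
        pos = self.pos_mlp(anchors)                              # (A, d)
        q = pos[:, None, :] + self.patterns.weight[None, :, :]   # (A, P, d)
        return q.flatten(0, 1)                                   # (A * P, d) queries

# Toy usage: a 10x10 grid of anchor points.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 10), torch.linspace(0, 1, 10), indexing="ij")
anchors = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=-1)   # (100, 2)
queries = AnchorQueryGenerator()(anchors)                         # (300, 256)
```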
    Progressive Hard-case Mining across Pyramid Levels in Object Detection. (arXiv:2109.07217v1 [cs.CV])
    (2 min) In object detection, multi-level prediction (e.g., FPN, YOLO) and resampling skills (e.g., focal loss, ATSS) have drastically improved one-stage detector performance. However, how to improve performance by optimizing the feature pyramid level-by-level remains unexplored. We find that, during training, the ratio of positive over negative samples varies across pyramid levels (level imbalance), which is not addressed by current one-stage detectors. To mitigate the influence of level imbalance, we propose a Unified Multi-level Optimization Paradigm (UMOP) consisting of two components: 1) an independent classification loss supervising each pyramid level with individual resampling considerations; 2) a progressive hard-case mining loss defining all losses across the pyramid levels without extra level-wise settings. With UMOP as a plug-and-play scheme, modern one-stage detectors can attain a ~1.5 AP improvement with fewer training iterations and no additional computation overhead. Our best model achieves 55.1 AP on COCO test-dev. Code is available at https://github.com/zimoqingfeng/UMOP.
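    As a hedged illustration of the first component, the sketch below applies an independent focal-style classification loss to each pyramid level, with a per-level focusing parameter that could be scheduled to mine hard cases progressively; it is not the exact UMOP loss, and all names and values are placeholders.
```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma):
    # Binary focal loss on one pyramid level's classification logits.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = targets * p + (1 - targets) * (1 - p)        # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()

def per_level_classification_loss(level_logits, level_targets, gammas):
    # Supervise each pyramid level independently; a per-level (or scheduled)
    # focusing parameter lets harder levels emphasize hard examples more.
    losses = [
        focal_loss(lg, tg, g) for lg, tg, g in zip(level_logits, level_targets, gammas)
    ]
    return torch.stack(losses).mean()

# Toy usage: 3 pyramid levels with different numbers of anchors.
logits = [torch.randn(1000), torch.randn(250), torch.randn(60)]
targets = [torch.randint(0, 2, (n,)).float() for n in (1000, 250, 60)]
loss = per_level_classification_loss(logits, targets, gammas=[2.0, 2.0, 1.5])
```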
    FCA: Learning a 3D Full-coverage Vehicle Camouflage for Multi-view Physical Adversarial Attack. (arXiv:2109.07193v1 [cs.CV])
    (2 min) Physical adversarial attacks in object detection have attracted increasing attention. However, most previous works focus on hiding the objects from the detector by generating an individual adversarial patch, which only covers the planar part of the vehicle's surface and fails to attack the detector in physical scenarios for multi-view, long-distance and partially occluded objects. To bridge the gap between digital attacks and physical attacks, we exploit the full 3D vehicle surface to propose a robust Full-coverage Camouflage Attack (FCA) to fool detectors. Specifically, we first try rendering the non-planar camouflage texture over the full vehicle surface. To mimic the real-world environment conditions, we then introduce a transformation function to transfer the rendered camouflaged vehicle into a photo-realistic scenario. Finally, we design an efficient loss function to optimize the camouflage texture. Experiments show that the full-coverage camouflage attack can not only outperform state-of-the-art methods under various test cases but also generalize to different environments, vehicles, and object detectors.
    Patch-based medical image segmentation using Quantum Tensor Networks. (arXiv:2109.07138v1 [cs.CV])
    (2 min) Tensor networks are efficient factorisations of high dimensional tensors into a network of lower order tensors. They have been most commonly used to model entanglement in quantum many-body systems and, more recently, are witnessing increased applications in supervised machine learning. In this work, we formulate image segmentation in a supervised setting with tensor networks. The key idea is to first lift the pixels in image patches to an exponentially high dimensional feature space and then use a linear decision hyper-plane to classify the input pixels into foreground and background classes. The high dimensional linear model itself is approximated using the matrix product state (MPS) tensor network. The MPS is weight-shared between the non-overlapping image patches, resulting in our strided tensor network model. The performance of the proposed model is evaluated on three 2D and one 3D biomedical imaging datasets and compared with relevant baseline methods. In the 2D experiments, the tensor network model yields competitive performance compared to the baseline methods while being more resource efficient.
    MISSFormer: An Effective Medical Image Segmentation Transformer. (arXiv:2109.07162v1 [cs.CV])
    (2 min) CNN-based methods have achieved impressive results in medical image segmentation, but they fail to capture long-range dependencies due to the inherent locality of the convolution operation. Transformer-based methods have recently become popular in vision tasks because of their capacity to model long-range dependencies, and they achieve promising performance. However, they lack the ability to model local context; although some works have attempted to embed convolutional layers to overcome this problem and achieved some improvement, this makes the features inconsistent and fails to leverage the natural multi-scale features of hierarchical transformers, which limits model performance. In this paper, taking medical image segmentation as an example, we present MISSFormer, an effective and powerful Medical Image Segmentation tranSFormer. MISSFormer is a hierarchical encoder-decoder network with two appealing designs: 1) a feed-forward network redesigned with the proposed Enhanced Transformer Block, which aligns features adaptively and enhances long-range dependencies and local context; 2) an Enhanced Transformer Context Bridge, a context bridge built with the enhanced transformer block to model the long-range dependencies and local context of the multi-scale features generated by our hierarchical transformer encoder. Driven by these two designs, MISSFormer shows a strong capacity to capture more valuable dependencies and context in medical image segmentation. Experiments on multi-organ and cardiac segmentation tasks demonstrate the superiority, effectiveness, and robustness of MISSFormer: the experimental results of MISSFormer trained from scratch even outperform state-of-the-art methods pretrained on ImageNet, and the core designs can be generalized to other visual segmentation tasks. The code will be released on GitHub.
    Integrating Sensing and Communication in Cellular Networks via NR Sidelink. (arXiv:2109.07253v1 [cs.CV])
    (2 min) RF-sensing, the analysis and interpretation of movement or environment-induced patterns in received electromagnetic signals, has been actively investigated for more than a decade. Since electromagnetic signals, through cellular communication systems, are omnipresent, RF sensing has the potential to become a universal sensing mechanism with applications in smart home, retail, localization, gesture recognition, intrusion detection, etc. Specifically, existing cellular network installations might be dual-used for both communication and sensing. Such a convergence of communication and sensing is envisioned for future communication networks. We propose the use of NR-sidelink direct device-to-device communication to achieve device-initiated, flexible sensing capabilities in beyond-5G cellular communication systems. In this article, we specifically investigate a common issue related to sidelink-based RF-sensing, which is its angle and rotation dependence. In particular, we discuss transformations of mmWave point-cloud data which achieve rotational invariance, as well as distributed processing based on such rotation-invariant inputs, at angle- and distance-diverse devices. To process the distributed data, we propose a graph-based encoder to capture spatio-temporal features of the data and propose four approaches for multi-angle learning. The approaches are compared on a newly recorded and openly available dataset comprising 15 subjects performing 21 gestures, recorded from 8 angles.
    Distract Your Attention: Multi-head Cross Attention Network for Facial Expression Recognition. (arXiv:2109.07270v1 [cs.CV])
    (2 min) We present a novel facial expression recognition network, called Distract your Attention Network (DAN). Our method is based on two key observations. Firstly, multiple classes share inherently similar underlying facial appearance, and their differences could be subtle. Secondly, facial expressions exhibit themselves through multiple facial regions simultaneously, and the recognition requires a holistic approach by encoding high-order interactions among local features. To address these issues, we propose our DAN with three key components: Feature Clustering Network (FCN), Multi-head cross Attention Network (MAN), and Attention Fusion Network (AFN). The FCN extracts robust features by adopting a large-margin learning objective to maximize class separability. In addition, the MAN instantiates a number of attention heads to simultaneously attend to multiple facial areas and build attention maps on these regions. Further, the AFN distracts these attentions to multiple locations before fusing the attention maps to a comprehensive one. Extensive experiments on three public datasets (including AffectNet, RAF-DB, and SFEW 2.0) verified that the proposed method consistently achieves state-of-the-art facial expression recognition performance. Code will be made available at https://github.com/yaoing/DAN.
    Navigation-Oriented Scene Understanding for Robotic Autonomy: Learning to Segment Driveability in Egocentric Images. (arXiv:2109.07245v1 [cs.RO])
    (2 min) This work tackles scene understanding for outdoor robotic navigation, solely relying on images captured by an on-board camera. Conventional visual scene understanding interprets the environment based on specific descriptive categories. However, such a representation is not directly interpretable for decision-making and constrains robot operation to a specific domain. Thus, we propose to segment egocentric images directly in terms of how a robot can navigate in them, and tailor the learning problem to an autonomous navigation task. Building around an image segmentation network, we present a generic and scalable affordance-based definition consisting of 3 driveability levels which can be applied to arbitrary scenes. By encoding these levels with soft ordinal labels, we incorporate inter-class distances during learning which improves segmentation compared to standard one-hot labelling. In addition, we propose a navigation-oriented pixel-wise loss weighting method which assigns higher importance to safety-critical areas. We evaluate our approach on large-scale public image segmentation datasets spanning off-road and urban scenes. In a zero-shot cross-dataset generalization experiment, we show that our affordance learning scheme can be applied across a diverse mix of datasets and improves driveability estimation in unseen environments compared to general-purpose, single-dataset segmentation.
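    A minimal sketch of soft ordinal label encoding, assuming a simple distance-based form (the paper's exact encoding may differ): probability mass decays with the distance between driveability levels, so confusing adjacent levels is penalized less than confusing distant ones.
```python
import numpy as np

def soft_ordinal_label(true_level, num_levels=3, temperature=1.0):
    """Encode an ordinal class as a soft distribution instead of a one-hot vector.

    Probability mass decays with the absolute distance between levels, so the
    loss implicitly encodes inter-class distances during training.
    """
    levels = np.arange(num_levels)
    logits = -np.abs(levels - true_level) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Example with 3 levels (e.g., not driveable / possibly driveable / driveable):
print(soft_ordinal_label(0))   # most mass on level 0, some on level 1, little on level 2
print(soft_ordinal_label(1))   # symmetric mass around the middle level
```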
    DeFungi: Direct Mycological Examination of Microscopic Fungi Images. (arXiv:2109.07322v1 [eess.IV])
    (3 min) Traditionally, diagnosis and treatment of fungal infections in humans depend heavily on face-to-face consultations or examinations made by specialized laboratory scientists known as mycologists. In many cases, such as the recent mucormycosis spread during the COVID-19 pandemic, an initial treatment can be safely suggested to the patient during the earliest stage of the mycological diagnostic process by performing a direct examination of biopsies or samples through a microscope. Computer-aided diagnosis systems using deep learning models have been trained and used for the late mycological diagnostic stages. However, there are no reference works in the literature for the early stages. A mycological laboratory in Colombia donated the images used for the development of this research work. They were manually labelled into five classes and curated with the assistance of a subject matter expert. The images were later cropped and patched with automated code routines to produce the final dataset. This paper presents experimental results classifying five fungi types using two different deep learning approaches and three different convolutional neural network models: VGG16, Inception V3, and ResNet50. The first approach benchmarks the classification performance for models trained from scratch, while the second approach benchmarks the classification performance using models pre-trained on the ImageNet dataset. Using k-fold cross-validation testing on the 5-class dataset, the best performing model trained from scratch was Inception V3, reporting 73.2% accuracy. The best performing model using transfer learning was VGG16, reporting 85.04% accuracy. The statistics provided by the two approaches create an initial point of reference to encourage future research works to improve classification performance. Furthermore, the dataset built is published on Kaggle and GitHub to foster future research.
    DSOR: A Scalable Statistical Filter for Removing Falling Snow from LiDAR Point Clouds in Severe Winter Weather. (arXiv:2109.07078v1 [cs.CV])
    (2 min) For autonomous vehicles to viably replace human drivers, they must contend with inclement weather. Falling rain and snow introduce noise in LiDAR returns, resulting in both false positive and false negative object detections. In this article we introduce the Winter Adverse Driving dataSet (WADS), collected in the snow belt region of Michigan's Upper Peninsula. WADS is the first multi-modal dataset featuring dense, point-wise labeled, sequential LiDAR scans collected in severe winter weather; weather that would cause an experienced driver to alter their driving behavior. We have labelled and will make available over 7 GB, or 3.6 billion, labelled LiDAR points out of over 26 TB of total LiDAR and camera data collected. We also present the Dynamic Statistical Outlier Removal (DSOR) filter, a statistical PCL-based filter capable of removing snow with a higher recall than the state-of-the-art snow de-noising filter while being 28% faster. Further, the DSOR filter is shown to have a lower time complexity compared to the state of the art, resulting in improved scalability. Our labeled dataset and DSOR filter will be made available at https://bitbucket.org/autonomymtu/dsor_filter
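    The following sketch illustrates the general idea behind a dynamic statistical outlier filter, not the released PCL implementation: a point is kept if its mean k-nearest-neighbor distance stays below a threshold that grows with range, since LiDAR returns are naturally sparser far from the sensor while snow returns are sparse and isolated everywhere. Parameter names and values are illustrative assumptions.
```python
import numpy as np
from scipy.spatial import cKDTree

def dsor_like_filter(points, k=5, std_mult=0.1, range_mult=0.05):
    """Keep points whose mean k-NN distance is below a range-dependent threshold.

    points: (N, 3) LiDAR points in the sensor frame.
    Snow returns tend to be sparse and isolated, so they have large neighbor
    distances; scaling the threshold with range avoids discarding legitimate
    far-away points, which are naturally sparser.
    """
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)        # first neighbor is the point itself
    mean_knn = dists[:, 1:].mean(axis=1)
    global_thresh = mean_knn.mean() + std_mult * mean_knn.std()
    ranges = np.linalg.norm(points, axis=1)
    dynamic_thresh = global_thresh * (1.0 + range_mult * ranges)
    return points[mean_knn < dynamic_thresh]

# Toy usage on random points standing in for a LiDAR scan.
scan = np.random.randn(10000, 3) * 20.0
filtered = dsor_like_filter(scan)
```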
    Hybrid Local-Global Transformer for Image Dehazing. (arXiv:2109.07100v1 [cs.CV])
    (2 min) Recently, the Vision Transformer (ViT) has shown impressive performance on high-level and low-level vision tasks. In this paper, we propose a new ViT architecture, named Hybrid Local-Global Vision Transformer (HyLoG-ViT), for single image dehazing. The HyLoG-ViT block consists of two paths, the local ViT path and the global ViT path, which are used to capture local and global dependencies. The hybrid features are fused via convolution layers. As a result, the HyLoG-ViT reduces the computational complexity and introduces locality in the networks. Then, the HyLoG-ViT blocks are incorporated within our dehazing networks, which jointly learn the intrinsic image decomposition and image dehazing. Specifically, the network consists of one shared encoder and three decoders for reflectance prediction, shading prediction, and haze-free image generation. The tasks of reflectance and shading prediction can produce meaningful intermediate features that can serve as complementary features for haze-free image generation. To effectively aggregate the complementary features, we propose a complementary features selection module (CFSM) to select the useful ones for image dehazing. Extensive experiments on homogeneous, non-homogeneous, and nighttime dehazing tasks reveal that our proposed Transformer-based dehazing network can achieve comparable or even better performance than CNN-based dehazing models.
    Resolution-robust Large Mask Inpainting with Fourier Convolutions. (arXiv:2109.07161v1 [cs.CV])
    (2 min) Modern image inpainting systems, despite the significant progress, often struggle with large missing areas, complex geometric structures, and high-resolution images. We find that one of the main reasons for that is the lack of an effective receptive field in both the inpainting network and the loss function. To alleviate this issue, we propose a new method called large mask inpainting (LaMa). LaMa is based on i) a new inpainting network architecture that uses fast Fourier convolutions, which have the image-wide receptive field; ii) a high receptive field perceptual loss; and iii) large training masks, which unlocks the potential of the first two components. Our inpainting network improves the state-of-the-art across a range of datasets and achieves excellent performance even in challenging scenarios, e.g. completion of periodic structures. Our model generalizes surprisingly well to resolutions that are higher than those seen at train time, and achieves this at lower parameter and compute costs than the competitive baselines. The code is available at https://github.com/saic-mdal/lama.
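    As a rough sketch of what a Fourier-convolution style block can look like (not the LaMa implementation, which lives in the linked repository), the module below moves a feature map into the frequency domain with a real 2D FFT, mixes real and imaginary parts with a pointwise convolution (which is global in the spatial domain), and transforms back; the class name and layer choices are assumptions.
```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    # Minimal Fourier-convolution style block: real 2D FFT, a 1x1 convolution over
    # stacked real/imaginary channels, then inverse FFT back to the spatial domain.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
            nn.BatchNorm2d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")                # complex (b, c, h, w//2 + 1)
        freq = torch.cat([freq.real, freq.imag], dim=1)        # (b, 2c, h, w//2 + 1)
        freq = self.conv(freq)
        real, imag = freq.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        return torch.fft.irfft2(freq, s=(h, w), norm="ortho")  # back to (b, c, h, w)

# Toy usage on a random feature map.
out = SpectralTransform(16)(torch.randn(2, 16, 64, 64))
```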
    New Perspective on Progressive GANs Distillationfor One-class Novelty Detection. (arXiv:2109.07295v1 [cs.CV])
    (2 min) One-class novelty detection is conducted to identify anomalous instances whose distributions differ from the expected normal instances. In this paper, the Generative Adversarial Network based on the Encoder-Decoder-Encoder scheme (EDE-GAN) achieves state-of-the-art performance. Two factors serve this purpose: 1) the EDE-GAN calculates the distance between two latent vectors as the anomaly score, unlike previous methods that utilize the reconstruction error between images; 2) the model obtains its best results when the batch size is set to 1. To illustrate their superiority, we design a new GAN architecture and compare performances according to different batch sizes. Moreover, our experimental results imply that constraints on the latent space are highly beneficial during model training. In an attempt to learn compact and fast models, we present a new technique, Progressive Knowledge Distillation with GANs (P-KDGAN), which connects two standard GANs through a designed distillation loss. Two-step progressive learning continuously augments the performance of student GANs, with improved results over the single-step approach. Our experimental results on the CIFAR-10, MNIST, and FMNIST datasets illustrate that P-KDGAN improves the performance of the student GAN by 2.44%, 1.77%, and 1.73% when compressing the computation at ratios of 24.45:1, 311.11:1, and 700:1, respectively.
    PnP-DETR: Towards Efficient Visual Analysis with Transformers. (arXiv:2109.07036v1 [cs.CV])
    (2 min) Recently, DETR (Carion et al., 2020) pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result. Though effective, translating the full feature map can be costly due to redundant computation on some areas, such as the background. In this work, we encapsulate the idea of reducing spatial redundancy into a novel poll and pool (PnP) sampling module, with which we build an end-to-end PnP-DETR architecture that adaptively allocates its computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The transformer models information interaction within the fine-coarse feature space and translates the features into the detection result. Moreover, the PnP-augmented model can instantly achieve various desired trade-offs between performance and computation with a single model by varying the sampled feature length, without requiring the training of multiple models as existing methods do. Thus it offers greater flexibility for deployment in diverse scenarios with varying computation constraints. We further validate the generalizability of the PnP module on panoptic segmentation and the recent transformer-based image recognition model ViT (Dosovitskiy et al., 2020), and show consistent efficiency gains. We believe our method makes a step toward efficient visual analysis with transformers, wherein spatial redundancy is commonly observed. Code will be available at https://github.com/twangnh/pnp-detr.
    F-CAM: Full Resolution CAM via Guided Parametric Upscaling. (arXiv:2109.07069v1 [cs.CV])
    (2 min) Class Activation Mapping (CAM) methods have recently gained much attention for weakly-supervised object localization (WSOL) tasks, allowing for CNN visualization and interpretation without training on fully annotated image datasets. CAM methods are typically integrated within off-the-shelf CNN backbones, such as ResNet50. Due to convolution and downsampling/pooling operations, these backbones yield low resolution CAMs with a down-scaling factor of up to 32, making accurate localization more difficult. Interpolation is required to restore full-size CAMs, but it does not consider the statistical properties of the objects, leading to activations with inconsistent boundaries and inaccurate localizations. As an alternative, we introduce a generic method for parametric upscaling of CAMs that allows constructing accurate full resolution CAMs (F-CAMs). In particular, we propose a trainable decoding architecture that can be connected to any CNN classifier to produce more accurate CAMs. Given an original (low resolution) CAM, foreground and background pixels are randomly sampled for fine-tuning the decoder. Additional priors such as image statistics and size constraints are also considered to expand and refine object boundaries. Extensive experiments using three CNN backbones and six WSOL baselines on the CUB-200-2011 and OpenImages datasets indicate that our F-CAM method yields a significant improvement in CAM localization accuracy. F-CAM performance is competitive with state-of-the-art WSOL methods, yet it requires fewer computational resources during inference.
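    A hedged sketch of the sampling step described above: pick confident foreground and background pixels from a (low-resolution) CAM to serve as sparse pseudo-labels for fine-tuning a decoder. This is not the authors' full F-CAM method; the quantile thresholds, sample counts, and function name are illustrative assumptions.
```python
import torch

def sample_cam_pseudo_labels(cam, num_per_class=100, fg_q=0.9, bg_q=0.1):
    """Sample pseudo-labelled pixels from a (possibly low-resolution) CAM.

    cam: (H, W) activation map. Pixels above the fg_q quantile are treated as
    confident foreground, below the bg_q quantile as confident background; a
    decoder can then be fine-tuned on these sparse labels.
    Returns (coords, labels) with coords as (row, col) indices.
    """
    flat = cam.flatten()
    fg_thr, bg_thr = torch.quantile(flat, fg_q), torch.quantile(flat, bg_q)
    fg_idx = torch.nonzero(flat >= fg_thr).squeeze(1)
    bg_idx = torch.nonzero(flat <= bg_thr).squeeze(1)
    fg_idx = fg_idx[torch.randperm(fg_idx.numel())[:num_per_class]]
    bg_idx = bg_idx[torch.randperm(bg_idx.numel())[:num_per_class]]
    idx = torch.cat([fg_idx, bg_idx])
    labels = torch.cat([torch.ones_like(fg_idx), torch.zeros_like(bg_idx)])
    rows = torch.div(idx, cam.size(1), rounding_mode="floor")
    coords = torch.stack([rows, idx % cam.size(1)], dim=1)
    return coords, labels

# Toy usage on a random 7x7 CAM (typical low-resolution CNN output).
coords, labels = sample_cam_pseudo_labels(torch.rand(7, 7), num_per_class=5)
```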
    ZFlow: Gated Appearance Flow-based Virtual Try-on with 3D Priors. (arXiv:2109.07001v1 [cs.CV])
    (2 min) Image-based virtual try-on involves synthesizing perceptually convincing images of a model wearing a particular garment and has garnered significant research interest due to its immense practical applicability. Recent methods involve a two-stage process: i) warping of the garment to align with the model ii) texture fusion of the warped garment and target model to generate the try-on output. Issues arise due to the non-rigid nature of garments and the lack of geometric information about the model or the garment. It often results in improper rendering of granular details. We propose ZFlow, an end-to-end framework, which seeks to alleviate these concerns regarding geometric and textural integrity (such as pose, depth-ordering, skin and neckline reproduction) through a combination of gated aggregation of hierarchical flow estimates termed Gated Appearance Flow, and dense structural priors at various stages of the network. ZFlow achieves state-of-the-art results as observed qualitatively, and on quantitative benchmarks of image quality (PSNR, SSIM, and FID). The paper presents extensive comparisons with other existing solutions including a detailed user study and ablation studies to gauge the effect of each of our contributions on multiple datasets.
    Image Synthesis via Semantic Composition. (arXiv:2109.07053v1 [cs.CV])
    (2 min) In this paper, we present a novel approach to synthesize realistic images based on their semantic layouts. It hypothesizes that objects with similar appearance share similar representations. Our method establishes dependencies between regions according to their appearance correlation, yielding both spatially variant and associated representations. Conditioning on these features, we propose a dynamic weighted network constructed by spatially conditional computation (with both convolution and normalization). More than preserving semantic distinctions, the dynamic network strengthens semantic relevance, benefiting global structure and detail synthesis. We demonstrate that our method gives compelling generation performance both qualitatively and quantitatively with extensive experiments on benchmarks.
    Combining GEDI and Sentinel-2 for wall-to-wall mapping of tall and short crops. (arXiv:2109.06972v1 [eess.IV])
    (3 min) High resolution crop type maps are an important tool for improving food security, and remote sensing is increasingly used to create such maps in regions that possess ground truth labels for model training. However, these labels are absent in many regions, and models trained in other regions on typical satellite features, such as those from optical sensors, often exhibit low performance when transferred. Here we explore the use of NASA's Global Ecosystem Dynamics Investigation (GEDI) spaceborne lidar instrument, combined with Sentinel-2 optical data, for crop type mapping. Using data from three major cropped regions (in China, France, and the United States) we first demonstrate that GEDI energy profiles are capable of reliably distinguishing maize, a crop typically above 2m in height, from crops like rice and soybean that are shorter. We further show that these GEDI profiles provide much more invariant features across geographies compared to spectral and phenological features detected by passive optical sensors. GEDI is able to distinguish maize from other crops within each region with accuracies higher than 84%, and able to transfer across regions with accuracies higher than 82% compared to 64% for transfer of optical features. Finally, we show that GEDI profiles can be used to generate training labels for models based on optical imagery from Sentinel-2, thereby enabling the creation of 10m wall-to-wall maps of tall versus short crops in label-scarce regions. As maize is the second most widely grown crop in the world and often the only tall crop grown within a landscape, we conclude that GEDI offers great promise for improving global crop type maps.
    Multi-modal Wound Classification using Wound Image and Location by Deep Neural Network. (arXiv:2109.06969v1 [cs.CV])
    (2 min) Wound classification is an essential step of wound diagnosis. An efficient classifier can assist wound specialists in classifying wound types with less financial and time costs and help them decide on an optimal treatment procedure. This study developed a deep neural network-based multi-modal classifier using wound images and their corresponding locations to categorize wound images into multiple classes, including diabetic, pressure, surgical, and venous ulcers. A body map is also developed to prepare the location data, which can help wound specialists tag wound locations more efficiently. Three datasets containing images and their corresponding location information are designed with the help of wound specialists. The multi-modal network is developed by concatenating the image-based and location-based classifier's outputs with some other modifications. The maximum accuracy on mixed-class classifications (containing background and normal skin) varies from 77.33% to 100% on different experiments. The maximum accuracy on wound-class classifications (containing only diabetic, pressure, surgical, and venous) varies from 72.95% to 98.08% on different experiments. The proposed multi-modal network also shows a significant improvement in results over previous works in the literature.
    Hardware-aware Real-time Myocardial Segmentation Quality Control in Contrast Echocardiography. (arXiv:2109.06909v1 [eess.IV])
    (2 min) Automatic myocardial segmentation of contrast echocardiography has shown great potential in the quantification of myocardial perfusion parameters. Segmentation quality control is an important step to ensure the accuracy of segmentation results for quality research as well as its clinical application. Usually, segmentation quality control happens after the data acquisition. At the data acquisition time, the operator could not know the quality of the segmentation results. On-the-fly segmentation quality control could help the operator to adjust the ultrasound probe or retake data if the quality is unsatisfactory, which can greatly reduce the effort of time-consuming manual correction. However, it is infeasible to deploy state-of-the-art DNN-based models because the segmentation module and quality control module must fit in the limited hardware resources on the ultrasound machine while satisfying strict latency constraints. In this paper, we propose a hardware-aware neural architecture search framework for automatic myocardial segmentation and quality control of contrast echocardiography. We explicitly incorporate the hardware latency as a regularization term into the loss function during training. The proposed method searches for the best neural network architecture for the segmentation module and the quality prediction module under strict latency constraints.
    Uncertainty Quantification in Medical Image Segmentation with Multi-decoder U-Net. (arXiv:2109.07045v1 [eess.IV])
    (2 min) Accurate medical image segmentation is crucial for diagnosis and analysis. However, models without calibrated uncertainty estimates might lead to errors in downstream analysis and exhibit low levels of robustness. Estimating the uncertainty in the measurement is vital to making definite, informed conclusions. In particular, it is difficult for both models and radiologists to make accurate predictions in ambiguous areas and around object boundaries, and it is even harder to reach a consensus across multiple annotations. In this work, we study the uncertainty in these areas, which carries significant information about anatomical structure and is as important as segmentation performance. We approach medical image segmentation uncertainty quantification by measuring segmentation performance against multiple annotations in a supervised learning manner, and propose a U-Net based architecture with multiple decoders, where the image representation is encoded with a shared encoder and the segmentation corresponding to each annotation is estimated with a separate decoder. In addition, a cross-loss function is proposed to bridge the gap between the different branches. The proposed architecture is trained in an end-to-end manner and is able to improve predictive uncertainty estimates. The model achieves comparable performance, with fewer parameters, to the integrated training model that ranked runner-up in the MICCAI-QUBIQ 2020 challenge.
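    The toy sketch below illustrates the shared-encoder, multi-decoder idea in its simplest form, with the spread of the decoder outputs used as a per-pixel uncertainty map; it is a simplified stand-in, not the authors' U-Net architecture or cross-loss, and all layer sizes are arbitrary.
```python
import torch
import torch.nn as nn

class MultiDecoderSegNet(nn.Module):
    # One shared encoder, one decoder per annotator; the spread of the decoder
    # predictions gives a per-pixel uncertainty map.
    def __init__(self, num_decoders=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))
            for _ in range(num_decoders)
        )

    def forward(self, x):
        feats = self.encoder(x)
        probs = torch.stack([torch.sigmoid(d(feats)) for d in self.decoders])  # (D, B, 1, H, W)
        return probs.mean(0), probs.var(0)   # mean segmentation, per-pixel uncertainty

# Toy usage; in training, each decoder would be supervised by one annotator's mask.
model = MultiDecoderSegNet()
mean_seg, uncertainty = model(torch.randn(2, 1, 64, 64))
```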
    A trainable monogenic ConvNet layer robust in front of large contrast changes in image classification. (arXiv:2109.06926v1 [cs.CV])
    (2 min) Convolutional Neural Networks (ConvNets) at present achieve remarkable performance in image classification tasks. However, current ConvNets cannot guarantee the capabilities of the mammalian visual system, such as invariance to contrast and illumination changes. Some ideas to overcome the illumination and contrast variations usually have to be tuned manually and tend to fail when tested with other types of data degradation. In this context, we present a new bio-inspired entry layer, M6, which detects low-level geometric features (lines, edges, and orientations) similar to the patterns detected by the V1 visual cortex. This new trainable layer is capable of coping with image classification even under large contrast variations. The explanation for this behavior is the monogenic signal geometry, which represents each pixel value in a 3D space using quaternions, a fact that confers a degree of explainability to the networks. We compare M6 with a conventional convolutional layer (C) and a deterministic quaternion local phase layer (Q9). The experimental setup is designed to evaluate the robustness of our M6-enriched ConvNet model and includes three architectures, four datasets, and three types of contrast degradation (including non-uniform haze degradations). The numerical results reveal that the models with M6 are the most robust to any kind of contrast variation. This amounts to a significant enhancement over the C models, which usually have reasonably good performance only when the same training and test degradation are used, except in the case of maximum degradation. Moreover, the Structural Similarity Index Measure (SSIM) is used to analyze and explain the robustness effect of the M6 feature maps under all kinds of contrast degradations.
    Seeking an Optimal Approach for Computer-Aided Pulmonary Embolism Detection. (arXiv:2109.07029v1 [eess.IV])
    (2 min) Pulmonary embolism (PE) represents a thrombus ("blood clot"), usually originating from a lower extremity vein, that travels to the blood vessels in the lung, causing vascular obstruction and, in some patients, death. This disorder is commonly diagnosed using CT pulmonary angiography (CTPA). Deep learning holds great promise for the computer-aided CTPA diagnosis (CAD) of PE. However, numerous competing methods for a given task exist in the deep learning literature, causing great confusion regarding the development of a CAD PE system. To address this confusion, we present a comprehensive analysis of competing deep learning methods applicable to PE diagnosis using CTPA at both the image and exam levels. At the image level, we compare convolutional neural networks (CNNs) with vision transformers, and contrast self-supervised learning (SSL) with supervised learning, followed by an evaluation of transfer learning compared with training from scratch. At the exam level, we focus on comparing conventional classification (CC) with multiple instance learning (MIL). Our extensive experiments consistently show: (1) transfer learning consistently boosts performance despite differences between natural images and CT scans, (2) transfer learning with SSL surpasses its supervised counterparts; (3) CNNs outperform vision transformers, which otherwise show satisfactory performance; and (4) CC is, surprisingly, superior to MIL. Compared with the state of the art, our optimal approach provides an AUC gain of 0.2% and 1.05% at the image and exam levels, respectively.
    Multi-Scale Aligned Distillation for Low-Resolution Detection. (arXiv:2109.06875v1 [cs.CV])
    (2 min) In instance-level detection tasks (e.g., object detection), reducing input resolution is an easy option to improve runtime efficiency. However, this option traditionally hurts detection performance substantially. This paper focuses on boosting the performance of low-resolution models by distilling knowledge from a high- or multi-resolution model. We first identify the challenge of applying knowledge distillation (KD) to teacher and student networks that act on different input resolutions. To tackle it, we explore the idea of spatially aligning feature maps between models of varying input resolutions by shifting feature pyramid positions and introduce aligned multi-scale training to train a multi-scale teacher that can distill its knowledge to a low-resolution student. Further, we propose crossing feature-level fusion to dynamically fuse the teacher's multi-resolution features to guide the student better. On several instance-level detection tasks and datasets, the low-resolution models trained via our approach perform competitively with high-resolution models trained via conventional multi-scale training, while outperforming the latter's low-resolution models by 2.1% to 3.6% in terms of mAP. Our code is made publicly available at https://github.com/dvlab-research/MSAD.
  • cs.IR updates on arXiv.org

    Counterfactual Explainable Recommendation. (arXiv:2108.10539v2 [cs.IR] UPDATED)
    (2 min) By providing explanations for users and system designers to facilitate better understanding and decision making, explainable recommendation has been an important research problem. In this paper, we propose Counterfactual Explainable Recommendation (CountER), which takes the insights of counterfactual reasoning from causal inference for explainable recommendation. CountER is able to formulate the complexity and the strength of explanations, and it adopts a counterfactual learning framework to seek simple (low complexity) and effective (high strength) explanations for the model decision. Technically, for each item recommended to each user, CountER formulates a joint optimization problem to generate minimal changes on the item aspects so as to create a counterfactual item, such that the recommendation decision on the counterfactual item is reversed. These altered aspects constitute the explanation of why the original item is recommended. The counterfactual explanation helps both the users for better understanding and the system designers for better model debugging. Another contribution of the work is the evaluation of explainable recommendation, which has been a challenging task. Fortunately, counterfactual explanations are very suitable for standard quantitative evaluation. To measure the explanation quality, we design two types of evaluation metrics, one from the user's perspective (i.e., why the user likes the item), and the other from the model's perspective (i.e., why the item is recommended by the model). We apply our counterfactual learning algorithm on a black-box recommender system and evaluate the generated explanations on five real-world datasets. Results show that our model generates more accurate and effective explanations than state-of-the-art explainable recommendation models.
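    A simplified, hedged sketch of the counterfactual search described above: optimize a small perturbation of an item's aspect vector so that a (here assumed differentiable) scorer no longer recommends the item, while an L1/L2 penalty keeps the change simple; the changed aspects then act as the explanation. The loss form, margin, and all names are illustrative assumptions rather than the CountER formulation.
```python
import torch

def counterfactual_aspect_explanation(score_fn, user, item_aspects,
                                       margin=0.0, lam=0.1, steps=200, lr=0.05):
    """Find a minimal change to item aspects that reverses the recommendation.

    score_fn(user, aspects) -> scalar recommendation score. The loss pushes the
    perturbed score below a margin while an L1/L2 penalty keeps the change
    simple (few aspects, small magnitude); the non-zero entries of `delta`
    form the explanation.
    """
    delta = torch.zeros_like(item_aspects, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        score = score_fn(user, item_aspects + delta)
        strength = torch.relu(score - margin)                 # drive the score down
        complexity = delta.abs().sum() + delta.pow(2).sum()   # keep the change minimal
        loss = strength + lam * complexity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()

# Toy usage with a linear scorer over 5 aspects (e.g., price, quality, ...).
w = torch.randn(5)
score_fn = lambda user, aspects: (user * w * aspects).sum()
delta = counterfactual_aspect_explanation(score_fn, torch.rand(5), torch.rand(5))
```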
    Hierarchical Attentive Knowledge Graph Embedding for Personalized Recommendation. (arXiv:1910.08288v4 [cs.IR] UPDATED)
    (2 min) Knowledge graphs (KGs) have proven to be effective for high-quality recommendation, where the connectivities between users and items provide rich and complementary information to user-item interactions. Most existing methods, however, are insufficient to exploit the KGs for capturing user preferences, as they either represent the user-item connectivities via paths with limited expressiveness or implicitly model them by propagating information over the entire KG with inevitable noise. In this paper, we design a novel hierarchical attentive knowledge graph embedding (HAKG) framework to exploit the KGs for effective recommendation. Specifically, HAKG first extracts the expressive subgraphs that link user-item pairs to characterize their connectivities, which accommodate both the semantics and topology of KGs. The subgraphs are then encoded via a hierarchical attentive subgraph encoding to generate effective subgraph embeddings for enhanced user preference prediction. Extensive experiments show the superiority of HAKG against state-of-the-art recommendation methods, as well as its potential in alleviating the data sparsity issue.
    AliMe MKG: A Multi-modal Knowledge Graph for Live-streaming E-commerce. (arXiv:2109.07411v1 [cs.IR])
    (2 min) Live streaming is becoming an increasingly popular trend of sales in E-commerce. The core of live-streaming sales is to encourage customers to purchase in an online broadcasting room. To enable customers to better understand a product without jumping out, we propose AliMe MKG, a multi-modal knowledge graph that aims at providing a cognitive profile for products, through which customers are able to seek information about and understand a product. Based on the MKG, we build an online live assistant that highlights product search, product exhibition and question answering, allowing customers to skim over the item list, view item details, and ask item-related questions. Our system has been launched online in the Taobao app, and currently serves hundreds of thousands of customers per day.
    Assisting the Human Fact-Checkers: Detecting All Previously Fact-Checked Claims in a Document. (arXiv:2109.07410v1 [cs.CL])
    (2 min) Given the recent proliferation of false claims online, there has been a lot of manual fact-checking effort. As this is very time-consuming, human fact-checkers can benefit from tools that can support them and make them more efficient. Here, we focus on building a system that could provide such support. Given an input document, it aims to detect all sentences that contain a claim that can be verified by some previously fact-checked claims (from a given database). The output is a re-ranked list of the document sentences, so that those that can be verified are ranked as high as possible, together with corresponding evidence. Unlike previous work, which has looked into claim retrieval, here we take a document-level perspective. We create a new manually annotated dataset for the task, and we propose suitable evaluation measures. We further experiment with a learning-to-rank approach, achieving sizable performance gains over several strong baselines. Our analysis demonstrates the importance of modeling text similarity and stance, while also taking into account the veracity of the retrieved previously fact-checked claims. We believe that this research would be of interest to fact-checkers, journalists, media, and regulatory authorities.
    Negative Statements Considered Useful. (arXiv:2001.04425v5 [cs.IR] UPDATED)
    (2 min) Knowledge bases (KBs) about notable entities and their properties are an important asset in applications such as search, question answering and dialogue. All popular KBs capture virtually only positive statements, and abstain from taking any stance on statements not stored in the KB. This paper makes the case for explicitly stating salient statements that do not hold. Negative statements are useful to overcome limitations of question answering systems that are mainly geared for positive questions; they can also contribute to informative summaries of entities. Due to the abundance of such invalid statements, any effort to compile them needs to address ranking by saliency. We present a statistical inference method for compiling and ranking negative statements, based on expectations from positive statements of related entities in peer groups. Experimental results, with a variety of datasets, show that the method can effectively discover notable negative statements, and extrinsic studies underline their usefulness for entity summarization. Datasets and code are released as resources for further research.
    Recovering individual emotional states from sparse ratings using collaborative filtering. (arXiv:2109.06906v1 [cs.IR])
    (2 min) A fundamental challenge in emotion research is measuring feeling states with high granularity and temporal precision without disrupting the emotion generation process. Here we introduce and validate a new approach in which responses are sparsely sampled and the missing data are recovered using a computational technique known as collaborative filtering (CF). This approach leverages structured covariation across individual experiences and is available in Neighbors, an open-source Python toolbox. We validate our approach across three different experimental contexts by recovering dense individual ratings using only a small subset of the original data. In dataset 1, participants (n=316) separately rated 112 emotional images on 6 different discrete emotions. In dataset 2, participants (n=203) watched 8 short emotionally engaging autobiographical stories while simultaneously providing moment-by-moment ratings of the intensity of their affective experience. In dataset 3, participants (n=60) with distinct social preferences made 76 decisions about how much money to return in a hidden multiplier trust game. Across all experimental contexts, CF was able to accurately recover missing data and importantly outperformed mean imputation, particularly in contexts with greater individual variability. This approach will enable new avenues for affective science research by allowing researchers to acquire high dimensional ratings from emotional experiences with minimal disruption to the emotion-generation process.
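    As a minimal illustration of the collaborative-filtering idea (not the Neighbors toolbox implementation), the sketch below fills each participant's missing ratings with the mean rating of their k most similar participants, with similarity computed on mean-centered ratings; the value of k and the similarity choice are assumptions.
```python
import numpy as np

def knn_impute_ratings(ratings, k=10):
    """Fill missing entries of a (participants x items) rating matrix.

    Each missing rating is predicted as the average rating of the k most
    similar participants who rated that item.
    """
    filled = ratings.copy()
    # Pairwise similarity on mean-centered ratings, treating missing values as zero.
    centered = np.where(np.isnan(ratings), 0.0,
                        ratings - np.nanmean(ratings, axis=1, keepdims=True))
    norms = np.linalg.norm(centered, axis=1, keepdims=True) + 1e-8
    sim = (centered / norms) @ (centered / norms).T
    for i in range(ratings.shape[0]):
        neighbors = np.argsort(-sim[i])
        neighbors = neighbors[neighbors != i][:k]
        for j in np.where(np.isnan(ratings[i]))[0]:
            vals = ratings[neighbors, j]
            vals = vals[~np.isnan(vals)]
            if vals.size:
                filled[i, j] = vals.mean()
    return filled

# Toy usage: 50 participants, 20 items, roughly 40% of ratings observed.
rng = np.random.default_rng(0)
full = rng.integers(1, 8, size=(50, 20)).astype(float)
sparse = np.where(rng.random(full.shape) < 0.4, full, np.nan)
recovered = knn_impute_ratings(sparse)
```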
    FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining. (arXiv:2109.07323v1 [cs.IR])
    (2 min) Tables store rich numerical data, but numerical reasoning over tables is still a challenge. In this paper, we find that the spreadsheet formula, which performs calculations on numerical values in tables, is naturally a strong supervision of numerical reasoning. More importantly, large amounts of spreadsheets with expert-made formulae are available on the web and can be obtained easily. FORTAP is the first method for numerical-reasoning-aware table pretraining by leveraging large corpus of spreadsheet formulae. We design two formula pretraining tasks to explicitly guide FORTAP to learn numerical reference and calculation in semi-structured tables. FORTAP achieves state-of-the-art results on two representative downstream tasks, cell type classification and formula prediction, showing great potential of numerical-reasoning-aware pretraining.
    Co-Embedding: Discovering Communities on Bipartite Graphs through Projection. (arXiv:2109.07135v1 [cs.IR])
    (2 min) Many datasets take the form of a bipartite graph where two types of nodes are connected by relationships, like the movies watched by a user or the tags associated with a file. The partitioning of the bipartite graph could be used to speed up recommender systems, or to reduce an information retrieval system's index size, by identifying groups of items with similar properties. This type of graph is often processed by algorithms using the Vector Space Model representation, where a binary vector represents an item with 0s and 1s. The main problem with this representation is that relatedness between dimensions, such as synonymy between words, is not considered. This article proposes a co-clustering algorithm using item projection, allowing the measurement of feature similarity. We evaluated our algorithm on a cluster retrieval task. Over various datasets, our algorithm produced well-balanced clusters with coherent items, leading to high retrieval scores on this task.
    Powered Hawkes-Dirichlet Process: Challenging Textual Clustering using a Flexible Temporal Prior. (arXiv:2109.07170v1 [cs.LG])
    (2 min) The textual content of a document and its publication date are intertwined. For example, the publication of a news article on a topic is influenced by previous publications on similar issues, according to underlying temporal dynamics. However, it can be challenging to retrieve meaningful information when textual information conveys little information or when temporal dynamics are hard to unveil. Furthermore, the textual content of a document is not always linked to its temporal dynamics. We develop a flexible method to create clusters of textual documents according to both their content and publication time, the Powered Dirichlet-Hawkes process (PDHP). We show PDHP yields significantly better results than state-of-the-art models when temporal information or textual content is weakly informative. The PDHP also alleviates the hypothesis that textual content and temporal dynamics are always perfectly correlated. PDHP allows retrieving textual clusters, temporal clusters, or a mixture of both with high accuracy even when content and dynamics are not perfectly correlated. We demonstrate that PDHP generalizes previous work, such as the Dirichlet-Hawkes process (DHP) and the Uniform process (UP). Finally, we illustrate the changes induced by PDHP over DHP and UP in a real-world application using Reddit data.
    Matching with Transformers in MELT. (arXiv:2109.07401v1 [cs.CL])
    (2 min) One of the strongest signals for automated matching of ontologies and knowledge graphs are the textual descriptions of the concepts. The methods that are typically applied (such as character- or token-based comparisons) are relatively simple, and therefore do not capture the actual meaning of the texts. With the rise of transformer-based language models, text comparison based on meaning (rather than lexical features) is possible. In this paper, we model the ontology matching task as a classification problem and present approaches based on transformer models. We further provide an easy-to-use implementation in the MELT framework which is suited for ontology and knowledge graph matching. We show that a transformer-based filter helps to choose the correct correspondences given a high-recall alignment and already achieves a good result with simple alignment post-processing methods.
    Learning to Match Job Candidates Using Multilingual Bi-Encoder BERT. (arXiv:2109.07157v1 [cs.CL])
    (2 min) In this talk, we will show how we used Randstad's history of candidate placements to generate a labeled CV-vacancy pairs dataset. Afterwards, we fine-tune a multilingual BERT with a bi-encoder structure over this dataset, by adding a cosine similarity log loss layer. We will explain how using the mentioned structure helps us overcome most of the challenges described above, and how it enables us to build a maintainable and scalable pipeline to match CVs and vacancies. In addition, we show how we gain a better semantic understanding, and learn to bridge the vocabulary gap. Finally, we highlight how multilingual transformers help us handle the cross-language barrier and might reduce discrimination.
  • cs.LG updates on arXiv.org

    Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation. (arXiv:2107.06011v2 [cs.CV] UPDATED)
    (2 min) In the context of visual navigation, the capacity to map a novel environment is necessary for an agent to exploit its observation history in the considered place and efficiently reach known goals. This ability can be associated with spatial reasoning, where an agent is able to perceive spatial relationships and regularities, and discover object characteristics. In classical Reinforcement Learning (RL) setups, this capacity is learned from reward alone. We introduce supplementary supervision in the form of auxiliary tasks designed to favor the emergence of spatial perception capabilities in agents trained for a goal-reaching downstream objective. We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings. Our method significantly improves the performance of different baseline agents, that either build an explicit or implicit representation of the environment, even matching the performance of incomparable oracle agents taking ground-truth maps as input.
    Language Grounding with 3D Objects. (arXiv:2107.12514v2 [cs.CL] UPDATED)
    (2 min) Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless." The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while substantial effort and progress has been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE) requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, it is still the case that these image-based models are weaker at understanding the 3D nature of objects -- properties which play a key role in manipulation. We find that adding view estimation to language grounding models improves accuracy on both SNARE and when identifying objects referred to in language on a robot platform, but note that a large gap remains between these models and human performance.
    Augmenting Decision Making via Interactive What-If Analysis. (arXiv:2109.06160v2 [cs.DB] UPDATED)
    (2 min) The fundamental goal of business data analysis is to improve business decisions using data. Business users such as sales, marketing, product, or operations managers often make decisions to achieve key performance indicator (KPI) goals such as increasing customer retention, decreasing cost, and increasing sales. To discover the relationship between data attributes hypothesized to be drivers and those corresponding to KPIs of interest, business users currently need to perform lengthy exploratory analyses, considering multitudes of combinations and scenarios, slicing, dicing, and transforming the data accordingly. For example, analyzing customer retention across quarters of the year or suggesting optimal media channels across strata of customers. However, the increasing complexity of datasets combined with the cognitive limitations of humans makes it challenging to carry over multiple hypotheses, even for simple datasets. Therefore mentally performing such analyses is hard. Existing commercial tools either provide partial solutions whose effectiveness remains unclear or fail to cater to business users. Here we argue for four functionalities that we believe are necessary to enable business users to interactively learn and reason about the relationships (functions) between sets of data attributes, facilitating data-driven decision making. We implement these functionalities in SystemD, an interactive visual analysis system enabling business users to experiment with the data by asking what-if questions. We evaluate the system through three business use cases: marketing mix modeling analysis, customer retention analysis, and deal closing analysis, and report on feedback from multiple business users. Overall, business users find SystemD intuitive and useful for quick testing and validation of their hypotheses around interested KPI as well as in making effective and fast data-driven decisions.
    Predicting Road Flooding Risk with Machine Learning Approaches Using Crowdsourced Reports and Fine-grained Traffic Data. (arXiv:2108.13265v2 [physics.soc-ph] UPDATED)
    (3 min) The objective of this study is to predict road flooding risks based on topographic, hydrologic, and temporal precipitation features using machine learning models. Predictive flood monitoring of road network flooding status plays an essential role in community hazard mitigation, preparedness, and response activities. Existing studies related to the estimation of road inundations either lack observed road inundation data for model validations or focus mainly on road inundation exposure assessment based on flood maps. This study addresses this limitation by using crowdsourced and fine-grained traffic data as an indicator of road inundation, and topographic, hydrologic, and temporal precipitation features as predictor variables. Two tree-based machine learning models (random forest and AdaBoost) were then trained and tested for predicting road inundations in the contexts of 2017 Hurricane Harvey and 2019 Tropical Storm Imelda in Harris County, Texas. The findings from Hurricane Harvey indicate that precipitation is the most important feature for predicting road inundation susceptibility, and that topographic features are more essential than hydrologic features for predicting road inundations in both storm cases. The random forest and AdaBoost models had relatively high AUC scores (0.860 and 0.810 for Harvey, and 0.790 and 0.720 for Imelda, respectively), with the random forest model performing better in both cases. The random forest model showed stable performance for Harvey, while varying significantly for Imelda. This study advances the emerging field of smart flood resilience in terms of predictive flood risk mapping at the road level. For example, such models could help impacted communities and emergency management agencies develop better preparedness and response strategies with improved situational awareness of road inundation likelihood as an extreme weather event unfolds.
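    A minimal sketch of the modelling setup described above, with synthetic features standing in for the topographic, hydrologic, and precipitation predictors and for the traffic-derived inundation labels; the real study's feature engineering and evaluation protocol are more involved.

    ```python
    # Tree ensembles on stand-in flood predictors with AUC evaluation (synthetic data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 6))  # e.g. elevation, slope, distance to channel, rainfall ...
    y = (X[:, 5] + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # 1 = road inundated

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    for name, clf in [("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
                      ("AdaBoost", AdaBoostClassifier(n_estimators=200, random_state=0))]:
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        print(name, round(auc, 3))

    # Feature importances indicate which predictors (e.g. precipitation) drive susceptibility.
    print(clf.feature_importances_)
    ```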
    Spline-PINN: Approaching PDEs without Data using Fast, Physics-Informed Hermite-Spline CNNs. (arXiv:2109.07143v1 [cs.LG])
    (2 min) Partial Differential Equations (PDEs) are notoriously difficult to solve. In general, closed-form solutions are not available and numerical approximation schemes are computationally expensive. In this paper, we propose to approach the solution of PDEs based on a novel technique that combines the advantages of two recently emerging machine learning-based approaches. First, physics-informed neural networks (PINNs) learn continuous solutions of PDEs and can be trained with little to no ground truth data. However, PINNs do not generalize well to unseen domains. Second, convolutional neural networks provide fast inference and generalize well but either require large amounts of training data or a physics-constrained loss based on finite differences that can lead to inaccuracies and discretization artifacts. We leverage the advantages of both of these approaches by using Hermite spline kernels in order to continuously interpolate a grid-based state representation that can be handled by a CNN. This allows for training without any precomputed training data using a physics-informed loss function only and provides fast, continuous solutions that generalize to unseen domains. We demonstrate the potential of our method on the examples of the incompressible Navier-Stokes equation and the damped wave equation. Our models are able to learn several intriguing phenomena such as Karman vortex streets, the Magnus effect, the Doppler effect, interference patterns and wave reflections. Our quantitative assessment and an interactive real-time demo show that we are narrowing the gap in accuracy of unsupervised ML-based methods to industrial CFD solvers while being orders of magnitude faster.
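    For readers unfamiliar with physics-informed losses, the sketch below trains a plain coordinate-MLP on the residual of a 1D damped wave equation with no precomputed solution data, which is the generic ingredient the abstract builds on. It is not the paper's Hermite-spline CNN; the equation coefficients and the omission of boundary terms are simplifications.

    ```python
    # Generic physics-informed residual loss for the 1D damped wave equation,
    # u_tt + k*u_t - c^2 * u_xx = 0, trained without any precomputed solution data.
    # Plain coordinate-MLP PINN, not the paper's Hermite-spline CNN.
    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    k, c = 0.1, 1.0  # damping coefficient and wave speed (placeholders)

    def grad(out, inp):
        return torch.autograd.grad(out, inp, grad_outputs=torch.ones_like(out), create_graph=True)[0]

    for step in range(200):
        xt = torch.rand(256, 2, requires_grad=True)  # collocation points (x, t) in [0, 1]^2
        u = net(xt)
        du = grad(u, xt)
        u_x, u_t = du[:, 0:1], du[:, 1:2]
        u_xx = grad(u_x, xt)[:, 0:1]
        u_tt = grad(u_t, xt)[:, 1:2]
        residual = u_tt + k * u_t - c**2 * u_xx
        loss = (residual ** 2).mean()  # boundary/initial-condition terms omitted in this sketch
        opt.zero_grad(); loss.backward(); opt.step()
    ```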
    See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization. (arXiv:2105.09601v2 [cs.LG] UPDATED)
    (2 min) In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summary as the ground-truth, therefore, perform poorly on lengthy videos and long ground-truth summary. Additionally, there exists no benchmark dataset to generalize this task on videos of varying lengths. In this paper, we introduce AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc. We use the abstract of corresponding research papers as the reference summaries, which ensure adequate quality and uniformity of the ground-truth. We then propose FLORAL, a factorized multi-modal Transformer based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task. FLORAL utilizes an increasing number of self-attentions to capture multimodality and performs significantly better than traditional encoder-decoder based networks. Extensive experiments illustrate that FLORAL achieves significant improvement over the baselines in both qualitative and quantitative evaluations on the existing How2 dataset for short videos and newly introduced AVIATE dataset for videos with diverse duration, beating the best baseline on the two datasets by $1.39$ and $2.74$ ROUGE-L points respectively.
    Exploiting Heterogeneity in Robust Federated Best-Arm Identification. (arXiv:2109.05700v2 [cs.LG] UPDATED)
    (2 min) We study a federated variant of the best-arm identification problem in stochastic multi-armed bandits: a set of clients, each of whom can sample only a subset of the arms, collaborate via a server to identify the best arm (i.e., the arm with the highest mean reward) with prescribed confidence. For this problem, we propose Fed-SEL, a simple communication-efficient algorithm that builds on successive elimination techniques and involves local sampling steps at the clients. To study the performance of Fed-SEL, we introduce a notion of arm-heterogeneity that captures the level of dissimilarity between distributions of arms corresponding to different clients. Interestingly, our analysis reveals the benefits of arm-heterogeneity in reducing both the sample- and communication-complexity of Fed-SEL. As a special case of our analysis, we show that for certain heterogeneous problem instances, Fed-SEL outputs the best-arm after just one round of communication. Our findings have the following key implication: unlike federated supervised learning where recent work has shown that statistical heterogeneity can lead to poor performance, one can provably reap the benefits of both local computation and heterogeneity for federated best-arm identification. As our final contribution, we develop variants of Fed-SEL, both for federated and peer-to-peer settings, that are robust to the presence of Byzantine clients, and hence suitable for deployment in harsh, adversarial environments.
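    For context, the sketch below implements classical successive elimination for best-arm identification, the single-client building block that Fed-SEL extends with local sampling, communication rounds, and Byzantine robustness; the confidence-radius constant is a generic choice, not the paper's.

    ```python
    # Classical successive elimination for best-arm identification (single client).
    import math
    import random

    def successive_elimination(pull, n_arms, delta=0.05, max_rounds=10_000):
        """pull(a) returns a stochastic reward in [0, 1] for arm a."""
        active = list(range(n_arms))
        means = [0.0] * n_arms
        counts = [0] * n_arms
        for t in range(1, max_rounds + 1):
            for a in active:  # sample every surviving arm once per round
                r = pull(a)
                counts[a] += 1
                means[a] += (r - means[a]) / counts[a]
            radius = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
            best = max(active, key=lambda a: means[a])
            # eliminate arms whose upper confidence bound falls below the best arm's lower bound
            active = [a for a in active if means[a] + radius >= means[best] - radius]
            if len(active) == 1:
                return active[0]
        return max(active, key=lambda a: means[a])

    arms = [0.3, 0.5, 0.7]  # true mean rewards (unknown to the algorithm)
    print(successive_elimination(lambda a: random.random() < arms[a], n_arms=3))
    ```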
    Counterfactual Explainable Recommendation. (arXiv:2108.10539v2 [cs.IR] UPDATED)
    (2 min) By providing explanations for users and system designers to facilitate better understanding and decision making, explainable recommendation has been an important research problem. In this paper, we propose Counterfactual Explainable Recommendation (CountER), which takes the insights of counterfactual reasoning from causal inference for explainable recommendation. CountER is able to formulate the complexity and the strength of explanations, and it adopts a counterfactual learning framework to seek simple (low complexity) and effective (high strength) explanations for the model decision. Technically, for each item recommended to each user, CountER formulates a joint optimization problem to generate minimal changes on the item aspects so as to create a counterfactual item, such that the recommendation decision on the counterfactual item is reversed. These altered aspects constitute the explanation of why the original item is recommended. The counterfactual explanation helps both the users for better understanding and the system designers for better model debugging. Another contribution of the work is the evaluation of explainable recommendation, which has been a challenging task. Fortunately, counterfactual explanations are very suitable for standard quantitative evaluation. To measure the explanation quality, we design two types of evaluation metrics, one from user's perspective (i.e. why the user likes the item), and the other from model's perspective (i.e. why the item is recommended by the model). We apply our counterfactual learning algorithm on a black-box recommender system and evaluate the generated explanations on five real-world datasets. Results show that our model generates more accurate and effective explanations than state-of-the-art explainable recommendation models.
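    The hedged sketch below illustrates the flavour of the counterfactual search described above: gradient descent on a perturbation of an item's aspect vector that reverses a stand-in, differentiable recommendation decision while penalising the size of the change. The scoring model, threshold, and regularisation weights are invented for illustration and do not reproduce CountER's exact formulation.

    ```python
    # Simplified counterfactual explanation for a recommendation: find a small change
    # `delta` to an item's aspect vector that pushes its score below a rejection threshold.
    import torch

    torch.manual_seed(0)
    n_aspects = 8
    user_pref = torch.rand(n_aspects)     # stand-in user preference weights
    item_aspects = torch.rand(n_aspects)  # stand-in item aspect qualities

    def score(aspects):                   # stand-in differentiable recommender score
        return torch.dot(user_pref, aspects)

    threshold = score(item_aspects).item() - 0.3  # counterfactual goal: drop the score by 0.3
    delta = torch.zeros(n_aspects, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=0.05)

    for _ in range(300):
        opt.zero_grad()
        s = score(item_aspects + delta)
        # hinge term: decision should be reversed; L1/L2 terms: explanation stays simple and weak
        loss = torch.relu(s - threshold) + 0.1 * delta.abs().sum() + 0.1 * (delta ** 2).sum()
        loss.backward()
        opt.step()

    changed = delta.detach().abs() > 1e-2
    print("aspects whose change would reverse the recommendation:", changed.nonzero().flatten().tolist())
    ```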
    pix2rule: End-to-end Neuro-symbolic Rule Learning. (arXiv:2106.07487v2 [cs.LG] UPDATED)
    (2 min) Humans have the ability to seamlessly combine low-level visual input with high-level symbolic reasoning often in the form of recognising objects, learning relations between them and applying rules. Neuro-symbolic systems aim to bring a unifying approach to connectionist and logic-based principles for visual processing and abstract reasoning respectively. This paper presents a complete neuro-symbolic method for processing images into objects, learning relations and logical rules in an end-to-end fashion. The main contribution is a differentiable layer in a deep learning architecture from which symbolic relations and rules can be extracted by pruning and thresholding. We evaluate our model using two datasets: subgraph isomorphism task for symbolic rule learning and an image classification domain with compound relations for learning objects, relations and rules. We demonstrate that our model scales beyond state-of-the-art symbolic learners and outperforms deep relational neural network architectures.
    CoAtNet: Marrying Convolution and Attention for All Data Sizes. (arXiv:2106.04803v2 [cs.CV] UPDATED)
    (2 min) Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.
    A survey of Monte Carlo methods for noisy and costly densities with application to reinforcement learning. (arXiv:2108.00490v2 [cs.LG] UPDATED)
    (2 min) This survey gives an overview of Monte Carlo methodologies using surrogate models, for dealing with densities which are intractable, costly, and/or noisy. This type of problem can be found in numerous real-world scenarios, including stochastic optimization and reinforcement learning, where each evaluation of a density function may incur some computationally-expensive or even physical (real-world activity) cost, likely to give different results each time. The surrogate model does not incur this cost, but there are important trade-offs and considerations involved in the choice and design of such methodologies. We classify the different methodologies into three main classes and describe specific instances of algorithms under a unified notation. A modular scheme which encompasses the considered methods is also presented. A range of application scenarios is discussed, with special attention to the likelihood-free setting and reinforcement learning. Several numerical comparisons are also provided.
    Adversarial for Good? How the Adversarial ML Community's Values Impede Socially Beneficial Uses of Attacks. (arXiv:2107.10302v2 [cs.CR] UPDATED)
    (2 min) Attacks from adversarial machine learning (ML) have the potential to be used "for good": they can be used to run counter to the existing power structures within ML, creating breathing space for those who would otherwise be the targets of surveillance and control. But most research on adversarial ML has not engaged in developing tools for resistance against ML systems. Why? In this paper, we review the broader impact statements that adversarial ML researchers wrote as part of their NeurIPS 2020 papers and assess the assumptions that authors have about the goals of their work. We also collect information about how authors view their work's impact more generally. We find that most adversarial ML researchers at NeurIPS hold two fundamental assumptions that will make it difficult for them to consider socially beneficial uses of attacks: (1) it is desirable to make systems robust, independent of context, and (2) attackers of systems are normatively bad and defenders of systems are normatively good. That is, despite their expressed and supposed neutrality, most adversarial ML researchers believe that the goal of their work is to secure systems, making it difficult to conceptualize and build tools for disrupting the status quo.
    A Tutorial on Learning Disentangled Representations in the Imaging Domain. (arXiv:2108.12043v2 [cs.CV] UPDATED)
    (2 min) Disentangled representation learning has been proposed as an approach to learning general representations. This can be done in the absence of, or with limited, annotations. A good general representation can be readily fine-tuned for new target tasks using modest amounts of data, or even be used directly in unseen domains achieving remarkable performance in the corresponding task. This alleviation of the data and annotation requirements offers tantalising prospects for tractable and affordable applications in computer vision and healthcare. Finally, disentangled representations can offer model explainability and can help us understand the underlying causal relations of the factors of variation, increasing their suitability for real-world deployment. In this tutorial paper, we will offer an overview of the disentangled representation learning, its building blocks and criteria, and discuss applications in computer vision and medical imaging. We conclude our tutorial by presenting the identified opportunities for the integration of recent machine learning advances into disentanglement, as well as the remaining challenges.
    Self-Supervised Learning from Automatically Separated Sound Scenes. (arXiv:2105.02132v2 [cs.SD] UPDATED)
    (2 min) Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
    Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification. (arXiv:2109.05542v1 [cs.CV] CROSS LISTED)
    (2 min) Person re-identification (re-ID) has gained more and more attention due to its widespread applications in intelligent video surveillance. Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models, and annotating data is expensive work in real-world scenarios. In addition, due to domain gaps between different datasets, the performance is dramatically decreased when re-ID models pre-trained on label-rich datasets (source domain) are directly applied to other unlabeled datasets (target domain). In this paper, we attempt to remedy these problems from two aspects, namely data and methodology. Firstly, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them, which frees humans from heavy data collection and annotation. Based on them, we build two synthetic person re-ID datasets with different scales, the "GSPR" and "mini-GSPR" datasets. Secondly, we propose a synthesis-based multi-domain collaborative refinement (SMCR) network, which contains a synthetic pretraining module and two collaborative-refinement modules to sufficiently learn valuable knowledge from multiple domains. Extensive experiments show that our proposed framework obtains significant performance improvements over the state-of-the-art methods on multiple unsupervised domain adaptation tasks of person re-ID.
    Learning to Reach, Swim, Walk and Fly in One Trial: Data-Driven Control with Scarce Data and Side Information. (arXiv:2106.10533v2 [eess.SY] UPDATED)
    (2 min) We develop a learning-based control algorithm for unknown dynamical systems under very severe data limitations. Specifically, the algorithm has access to streaming data only from a single and ongoing trial. Despite the scarcity of data, we show -- through a series of examples -- that the algorithm can provide performance comparable to reinforcement learning algorithms trained over millions of environment interactions. It accomplishes such performance by effectively leveraging various forms of side information on the dynamics to reduce the sample complexity. Such side information typically comes from elementary laws of physics and qualitative properties of the system. More precisely, the algorithm approximately solves an optimal control problem encoding the system's desired behavior. To this end, it constructs and refines a differential inclusion that contains the unknown vector field of the dynamics. The differential inclusion, used in an interval Taylor-based method, enables to over-approximate the set of states the system may reach. Theoretically, we establish a bound on the suboptimality of the approximate solution with respect to the case of known dynamics. We show that the longer the trial or the more side information is available, the tighter the bound. Empirically, experiments in a high-fidelity F-16 aircraft simulator and MuJoCo's environments such as the Reacher, Swimmer, and Cheetah illustrate the algorithm's effectiveness.
    Compensation Learning. (arXiv:2107.11921v2 [cs.LG] UPDATED)
    (2 min) Weighting strategies prevail in machine learning. For example, a common approach in robust machine learning is to assign lower weights to samples that are likely to be noisy or particularly hard. This study examines another, largely overlooked strategy, namely compensating, which has also been widely used in machine learning. Learning with compensation is called compensation learning, and a systematic taxonomy is constructed for it in this study. In our taxonomy, compensation learning is divided on the basis of the compensation targets, directions, inference manners, and granularity levels. Many existing learning algorithms, including some classical ones, can be seen as special cases of compensation learning or as partially leveraging compensation. Furthermore, a family of new learning algorithms can be obtained by plugging compensation learning into existing learning algorithms. Specifically, two concrete new learning algorithms are proposed for robust machine learning. Extensive experiments on text sentiment analysis and image classification verify the effectiveness of the two new algorithms. Compensation learning can also be used in various learning scenarios, such as imbalanced learning, clustering, and regression.
    LOKI: Long Term and Key Intentions for Trajectory Prediction. (arXiv:2108.08236v2 [cs.CV] UPDATED)
    (2 min) Recent advances in trajectory prediction have shown that explicit reasoning about agents' intent is important to accurately forecast their motion. However, the current research activities are not directly applicable to intelligent and safety-critical systems. This is mainly because very few public datasets are available, and they only consider pedestrian-specific intents for a short temporal horizon from a restricted egocentric view. To this end, we propose LOKI (LOng term and Key Intentions), a novel large-scale dataset that is designed to tackle joint trajectory and intention prediction for heterogeneous traffic agents (pedestrians and vehicles) in an autonomous driving setting. The LOKI dataset is created to discover several factors that may affect intention, including i) the agent's own will, ii) social interactions, iii) environmental constraints, and iv) contextual information. We also propose a model that jointly performs trajectory and intention prediction, showing that recurrently reasoning about intention can assist with trajectory prediction. We show our method outperforms state-of-the-art trajectory prediction methods by up to $27\%$ and also provide a baseline for frame-wise intention estimation.
    Post-Selections in AI and How to Avoid Them. (arXiv:2106.13233v2 [cs.LG] UPDATED)
    (2 min) Neural-network-based Artificial Intelligence (AI) has reported experiments at ever-increasing scale. However, this paper examines a rarely reported stage in such experiments, called Post-Selection, to alert the reader to several possible protocol flaws that may result in misleading results. All AI methods fall into two broad schools, connectionist and symbolic. Post-Selection falls into two kinds, Post-Selection Using Validation Sets (PSUVS) and Post-Selection Using Test Sets (PSUTS). Each kind has two types of post-selectors, machines and humans. The connectionist school has received criticism for its "black box" nature and now for Post-Selection; but the seemingly "clean" symbolic school seems more brittle because of its human PSUTS. This paper first presents a controversial view: all static "big data" are non-scalable. We then analyze why error-backprop from randomly initialized weights suffers from severe local minima, why PSUVS lacks cross-validation, why PSUTS violates well-established protocols, and why every paper involved should transparently report the Post-Selection stage. To avoid future pitfalls in AI competitions, this paper proposes a new AI metric, called developmental error, computed for all networks trained under Three Learning Conditions: (1) an incremental learning architecture (due to a "big data" flaw), (2) a training experience, and (3) a limited amount of computational resources. Developmental Networks avoid Post-Selections because they automatically discover context-rules on the fly by generating emergent Turing machines (not black boxes) that are optimal in the sense of maximum likelihood across the lifetime, conditioned on the Three Learning Conditions.
    Prescriptive Process Monitoring for Cost-Aware Cycle Time Reduction. (arXiv:2105.07111v2 [cs.LG] UPDATED)
    (2 min) Reducing cycle time is a recurrent concern in the field of business process management. Depending on the process, various interventions may be triggered to reduce the cycle time of a case, for example, using a faster shipping service in an order-to-delivery process or giving a phone call to a customer to obtain missing information rather than waiting passively. Each of these interventions comes with a cost. This paper tackles the problem of determining if and when to trigger a time-reducing intervention in a way that maximizes the total net gain. The paper proposes a prescriptive process monitoring method that uses orthogonal random forest models to estimate the causal effect of triggering a time-reducing intervention for each ongoing case of a process. Based on this causal effect estimate, the method triggers interventions according to a user-defined policy. The method is evaluated on two real-life logs.
    Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models. (arXiv:2104.05158v5 [cs.DC] UPDATED)
    (3 min) Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row, column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates (v) leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.
    Periodic Freight Demand Estimation for Large-scale Tactical Planning. (arXiv:2105.09136v2 [cs.LG] UPDATED)
    (2 min) Freight carriers rely on tactical planning to design their service network to satisfy demand in a cost-effective way. For computational tractability, deterministic and cyclic Service Network Design (SND) formulations are used to solve large-scale problems. A central input is the periodic demand, that is, the demand expected to repeat in every period in the planning horizon. In practice, demand is predicted by a time series forecasting model and the periodic demand is the average of those forecasts. This is, however, only one of many possible mappings. The problem consisting in selecting this mapping has hitherto been overlooked in the literature. We propose to use the structure of the downstream decision-making problem to select a good mapping. For this purpose, we introduce a multilevel mathematical programming formulation that explicitly links the time series forecasts to the SND problem of interest. The solution is a periodic demand estimate that minimizes costs over the tactical planning horizon. We report results in an extensive empirical study of a large-scale application from the Canadian National Railway Company. They clearly show the importance of the periodic demand estimation problem. Indeed, the planning costs exhibit an important variation over different periodic demand estimates and using an estimate different from the mean forecast can lead to substantial cost reductions. Moreover, the costs associated with the period demand estimates based on forecasts were comparable to, or even better than those obtained using the mean of actual demand.
    Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis. (arXiv:2106.02588v2 [cs.LG] UPDATED)
    (2 min) The representation of functions by artificial neural networks depends on a large number of parameters in a non-linear fashion. Suitable parameters of these are found by minimizing a 'loss functional', typically by stochastic gradient descent (SGD) or an advanced SGD-based algorithm. In a continuous time model for SGD with noise that follows the 'machine learning scaling', we show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.
    Data-to-text Generation by Splicing Together Nearest Neighbors. (arXiv:2101.08248v3 [cs.CL] UPDATED)
    (2 min) We propose to tackle data-to-text generation tasks by directly splicing together retrieved segments of text from "neighbor" source-target pairs. Unlike recent work that conditions on retrieved neighbors but generates text token-by-token, left-to-right, we learn a policy that directly manipulates segments of neighbor text, by inserting or replacing them in partially constructed generations. Standard techniques for training such a policy require an oracle derivation for each generation, and we prove that finding the shortest such derivation can be reduced to parsing under a particular weighted context-free grammar. We find that policies learned in this way perform on par with strong baselines in terms of automatic and human evaluation, but allow for more interpretable and controllable generation.
    Efficient Explanations from Empirical Explainers. (arXiv:2103.15429v2 [cs.LG] UPDATED)
    (2 min) Amid a discussion about Green AI in which we see explainability neglected, we explore the possibility to efficiently approximate computationally expensive explainers. To this end, we propose feature attribution modelling with Empirical Explainers. Empirical Explainers learn from data to predict the attribution maps of expensive explainers. We train and test Empirical Explainers in the language domain and find that they model their expensive counterparts surprisingly well, at a fraction of the cost. They could thus mitigate the computational burden of neural explanations significantly, in applications that tolerate an approximation error.
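    A minimal sketch of the Empirical Explainer idea under simplifying assumptions: the "expensive" explainer is plain gradient-times-input on a toy classifier, and a small student network is trained to regress its attribution maps so that approximate attributions later cost one forward pass. The paper's target models and explainers are substantially larger.

    ```python
    # A cheap student model learns to predict attribution maps produced for a target model.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d = 16
    target = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, 2))

    def expensive_attributions(x):
        x = x.clone().requires_grad_(True)
        target(x).max(dim=-1).values.sum().backward()
        return (x.grad * x).detach()  # gradient x input attribution map

    student = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d))
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    for step in range(500):  # learn to imitate the explainer from data
        x = torch.randn(64, d)
        loss = nn.functional.mse_loss(student(x), expensive_attributions(x))
        opt.zero_grad(); loss.backward(); opt.step()

    # At inference time the student yields approximate attributions in a single forward pass.
    print(student(torch.randn(1, d)))
    ```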
    D3p -- A Python Package for Differentially-Private Probabilistic Programming. (arXiv:2103.11648v2 [cs.LG] UPDATED)
    (2 min) We present d3p, a software package designed to help field runtime-efficient, widely applicable Bayesian inference under differential privacy guarantees. d3p achieves general applicability to a wide range of probabilistic modelling problems by implementing the differentially private variational inference algorithm, allowing users to fit any parametric probabilistic model with a differentiable density function. d3p adopts the probabilistic programming paradigm as a powerful way for the user to flexibly define such models. We demonstrate the use of our software on a hierarchical logistic regression example, showing the expressiveness of the modelling approach as well as the ease of running the parameter inference. We also perform an empirical evaluation of the runtime of the private inference on a complex model and find a $\sim$10-fold speed-up compared to an implementation using TensorFlow Privacy.
    Making Contrastive Learning Robust to Shortcuts. (arXiv:2012.09962v4 [cs.LG] UPDATED)
    (2 min) Contrastive learning is one of the fastest growing research areas in machine learning due to its ability to learn useful representations without labeled data. However, contrastive learning is susceptible to shortcuts - i.e., it may learn shortcut features irrelevant to the task of interest, and discard relevant information. Past work has addressed this limitation via handcrafted data augmentations that eliminate the shortcut. But, manually crafted augmentations do not work across all datasets and tasks. Further, data augmentations fail in addressing shortcuts in multi-attribute classification when one attribute acts as a shortcut around other attributes. In this paper, we analyze the objective function of contrastive learning and formally prove that it is vulnerable to shortcuts. We then present reconstructive contrastive learning (RCL), a framework for learning unsupervised representations that are robust to shortcuts. The key idea is to force the learned representation to reconstruct the input, which naturally counters potential shortcuts. Extensive experiments verify that RCL is highly robust to shortcuts and outperforms state-of-the-art contrastive learning methods on a variety of datasets and tasks.
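    The sketch below shows, in simplified form, the kind of objective the abstract describes: an InfoNCE-style contrastive term between two augmented views plus a reconstruction term that forces the representation to retain input information. Architectures, augmentations, and loss weights are placeholders, not the authors' configuration.

    ```python
    # Contrastive objective augmented with input reconstruction, in the spirit of RCL.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d, z = 32, 16
    encoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, z))
    decoder = nn.Sequential(nn.Linear(z, 64), nn.ReLU(), nn.Linear(64, d))
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    def info_nce(z1, z2, tau=0.1):
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / tau  # positives lie on the diagonal
        return F.cross_entropy(logits, torch.arange(z1.size(0)))

    for step in range(200):
        x = torch.randn(128, d)
        v1, v2 = x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)  # two augmented views
        h1, h2 = encoder(v1), encoder(v2)
        loss = info_nce(h1, h2) + 0.5 * F.mse_loss(decoder(h1), x)  # contrastive + reconstruction
        opt.zero_grad(); loss.backward(); opt.step()
    ```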
    Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Reinforcement Learning. (arXiv:2106.00589v2 [cs.LG] UPDATED)
    (2 min) We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility. Optimizing a long-term metric is challenging because the learning signal (whether the recommendations achieved their desired goals) is delayed and confounded by other user interactions with the system. Targeting immediately measurable proxies such as clicks can lead to suboptimal recommendations due to misalignment with the long-term metric. We develop a new reinforcement learning algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced drift in user behavior across sessions. SHPI is a straightforward modification of episodic RL algorithms for session-based recommendation, that additionally gives an appropriate termination bonus in each session. Empirical results on four recommendation tasks show that SHPI can outperform state-of-the-art recommendation techniques like matrix factorization with offline proxy signals, bandits with myopic online proxies, and RL baselines with limited amounts of user interaction.
    VIRDOCD: a VIRtual DOCtor to Predict Dengue Fatality. (arXiv:2104.14282v2 [q-bio.QM] UPDATED)
    (3 min) Clinicians make routine diagnoses by scrutinizing patients' medical signs and symptoms, a skill popularly referred to as the "Clinical Eye". This skill evolves through trial-and-error and improves with time. The success of the therapeutic regime relies largely on the accuracy of interpretation of such signs and symptoms, from which a clinician assesses the severity of the illness. The present study is an attempt to propose a complementary medical front by mathematically modeling the "Clinical Eye" of a VIRtual DOCtor, using Statistical and Machine Intelligence tools (SMI), to analyze patients infected in a dengue epidemic (100 case studies with 11 weighted sign-symptoms). The SMI in VIRDOCD reads medical data and translates these into a vector comprising Multiple Linear Regression (MLR) coefficients to predict infection severity grades of dengue patients that clone the clinician's experience-based assessment. With risk managed through ANOVA, the dengue severity grade prediction accuracy from VIRDOCD is found to be higher (ca 75%) than that of conventional clinical practice (ca 71.4%, mean accuracy profile assessed by a team of 10 senior consultants). Free of human errors and capable of deciphering even minute differences from almost identical symptoms (to the Clinical Eye), VIRDOCD is uniquely individualized in its decision-making ability. The algorithm has been validated against Random Forest classification (RF, ca 63%), another regression-based classifier similar to MLR that can be trained through supervised learning. We find that MLR-based VIRDOCD is superior to RF in predicting the grade of dengue morbidity. VIRDOCD can be further extended to analyze other epidemic infections, such as COVID-19.
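    As a toy illustration of the MLR component described above, the sketch below regresses a discrete severity grade on weighted sign-symptom scores and rounds the prediction back to a grade; the synthetic symptom intensities, weights, and 0-3 grading scale are assumptions for illustration only.

    ```python
    # Toy sketch of the MLR idea: regress a severity grade on weighted sign-symptom scores.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n_patients, n_symptoms = 100, 11
    X = rng.integers(0, 4, size=(n_patients, n_symptoms))       # symptom intensities 0-3 (synthetic)
    true_w = rng.uniform(0.1, 1.0, size=n_symptoms)             # clinician-style symptom weights (synthetic)
    grade = np.clip(np.round(X @ true_w / true_w.sum()), 0, 3)  # severity grade 0-3

    mlr = LinearRegression().fit(X, grade)
    pred = np.clip(np.round(mlr.predict(X)), 0, 3)              # round back to a discrete grade
    print("training accuracy:", (pred == grade).mean())
    print("learned coefficients:", np.round(mlr.coef_, 2))
    ```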
    From Sampling to Optimization on Discrete Domains with Applications to Determinant Maximization. (arXiv:2102.05347v3 [cs.LG] UPDATED)
    (2 min) We show a connection between sampling and optimization on discrete domains. For a family of distributions $\mu$ defined on size $k$ subsets of a ground set of elements that is closed under external fields, we show that rapid mixing of natural local random walks implies the existence of simple approximation algorithms to find $\max \mu(\cdot)$. More precisely we show that if (multi-step) down-up random walks have spectral gap at least inverse polynomially large in $k$, then (multi-step) local search can find $\max \mu(\cdot)$ within a factor of $k^{O(k)}$. As the main application of our result, we show a simple nearly-optimal $k^{O(k)}$-factor approximation algorithm for MAP inference on nonsymmetric DPPs. This is the first nontrivial multiplicative approximation for finding the largest size $k$ principal minor of a square (not-necessarily-symmetric) matrix $L$ with $L+L^\intercal\succeq 0$. We establish the connection between sampling and optimization by showing that an exchange inequality, a concept rooted in discrete convex analysis, can be derived from fast mixing of local random walks. We further connect exchange inequalities with composable core-sets for optimization, generalizing recent results on composable core-sets for DPP maximization to arbitrary distributions that satisfy either the strongly Rayleigh property or that have a log-concave generating polynomial.
    VideoGPT: Video Generation using VQ-VAE and Transformers. (arXiv:2104.10157v2 [cs.CV] UPDATED)
    (2 min) We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural videos from UCF-101 and the Tumblr GIF dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html
    Adversarial Regularization as Stackelberg Game: An Unrolled Optimization Approach. (arXiv:2104.04886v2 [cs.LG] UPDATED)
    (2 min) Adversarial regularization has been shown to improve the generalization performance of deep learning models in various natural language processing tasks. Existing works usually formulate the method as a zero-sum game, which is solved by alternating gradient descent/ascent algorithms. Such a formulation treats the adversarial and the defending players equally, which is undesirable because only the defending player contributes to the generalization performance. To address this issue, we propose Stackelberg Adversarial Regularization (SALT), which formulates adversarial regularization as a Stackelberg game. This formulation induces a competition between a leader and a follower, where the follower generates perturbations, and the leader trains the model subject to the perturbations. Different from conventional approaches, in SALT, the leader is in an advantageous position. When the leader moves, it recognizes the strategy of the follower and takes the anticipated follower's outcomes into consideration. Such a leader's advantage enables us to improve the model fitting to the unperturbed data. The leader's strategic information is captured by the Stackelberg gradient, which is obtained using an unrolling algorithm. Our experimental results on a set of machine translation and natural language understanding tasks show that SALT outperforms existing adversarial regularization baselines across all tasks. Our code is publicly available.
    When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute. (arXiv:2102.12459v3 [cs.CL] UPDATED)
    (2 min) Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.
    Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. (arXiv:2105.01650v2 [stat.ML] UPDATED)
    (2 min) Stochastic gradient descent (SGD) is one of the most popular algorithms in modern machine learning. The noise encountered in these applications is different from that in many theoretical analyses of stochastic gradient algorithms. In this article, we discuss some of the common properties of energy landscapes and stochastic noise encountered in machine learning problems, and how they affect SGD-based optimization. In particular, we show that the learning rate in SGD with machine learning noise can be chosen to be small, but uniformly positive for all times if the energy landscape resembles that of overparametrized deep learning problems. If the objective function satisfies a Lojasiewicz inequality, SGD converges to the global minimum exponentially fast, and even for functions which may have local minima, we establish almost sure convergence to the global minimum at an exponential rate from any finite energy initialization. The assumptions that we make in this result concern the behavior where the objective function is either small or large and the nature of the gradient noise, but the energy landscape is fairly unconstrained on the domain where the objective function takes values in an intermediate regime.
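    For orientation, the display below states the generic minibatch SGD update and the property usually meant by noise of "machine learning type": the gradient noise scales with the objective and therefore vanishes at interpolating minima. This is a common reading of the setting, not necessarily the papers' exact assumptions.

    ```latex
    % Generic minibatch SGD update and the "ML-type" noise property; assumptions may differ from the papers'.
    \[
      \theta_{k+1} = \theta_k - \eta\,\bigl(\nabla f(\theta_k) + \xi_k\bigr),
      \qquad
      f(\theta) = \frac{1}{N}\sum_{i=1}^{N} f_i(\theta),
      \qquad
      \xi_k = \nabla f_{i_k}(\theta_k) - \nabla f(\theta_k),
    \]
    \[
      \text{with ML-type scaling }\;
      \mathbb{E}\bigl[\|\xi_k\|^2 \mid \theta_k\bigr] \;\lesssim\; f(\theta_k),
      \;\text{ so the noise vanishes wherever every sample loss } f_i \text{ vanishes.}
    \]
    ```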
    Holmes: An Efficient and Lightweight Semantic Based Anomalous Email Detector. (arXiv:2104.08044v9 [cs.CR] UPDATED)
    (3 min) Email threats are a serious issue for enterprise security, covering various malicious scenarios such as phishing, fraud, blackmail, and malvertising. Traditional anti-spam gateways commonly maintain a greylist to filter out unexpected emails based on suspicious vocabulary appearing in the mail subject and content. However, the signature-based approach cannot effectively discover novel and unknown suspicious emails that exploit currently trending topics, such as COVID-19 and the US election. To address the problem, in this paper, we present Holmes, an efficient and lightweight semantic-based engine for anomalous email detection. Holmes converts each email event log into a sentence through word embedding and then extracts interesting items among them by novelty detection. Based on our observations, we claim that, in an enterprise environment, there is a stable relation between senders and receivers, but suspicious emails are commonly from unusual sources, which can be detected through rareness selection. We evaluate the performance of Holmes in a real-world enterprise environment, which sends and receives around 5,000 emails each day. As a result, Holmes can achieve a high detection rate (outputting around 200 suspicious emails per day) and maintain a low false alarm rate for anomaly detection.
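    The toy sketch below illustrates only the "rareness selection" observation from the abstract: emails from sender-receiver pairs that are rare relative to the enterprise's history are surfaced for review. The counts, threshold, and addresses are invented, and the word-embedding plus novelty-detection stage of Holmes is omitted.

    ```python
    # Flag emails whose (sender, receiver) pair is rare relative to the enterprise's history.
    from collections import Counter

    history = [("alice@corp.com", "bob@corp.com")] * 50 + \
              [("carol@corp.com", "bob@corp.com")] * 30 + \
              [("mallory@evil.example", "bob@corp.com")]  # one unusual source

    pair_counts = Counter(history)
    total = sum(pair_counts.values())

    def rareness(sender, receiver):
        """Smaller is rarer; unseen pairs get the lowest possible frequency."""
        return pair_counts.get((sender, receiver), 0) / total

    incoming = [("alice@corp.com", "bob@corp.com"), ("mallory@evil.example", "bob@corp.com")]
    suspicious = [pair for pair in incoming if rareness(*pair) < 0.02]
    print(suspicious)  # the rare sender-receiver pair is surfaced for review
    ```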
    An Efficient Quantitative Approach for Optimizing Convolutional Neural Networks. (arXiv:2009.05236v4 [cs.CV] UPDATED)
    (2 min) With the increasing popularity of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in various domains, such as image classification and object detection, achieving striking accuracy gains over traditional statistical methods. To exploit the potential of CNN models, a huge amount of research and industry effort has been devoted to optimizing CNNs. Among these endeavors, CNN architecture design has attracted tremendous attention because of its great potential for improving model accuracy or reducing model complexity. However, existing work either introduces repeated training overhead in the search process or lacks an interpretable metric to guide the design. To clear these hurdles, we propose 3D-Receptive Field (3DRF), an explainable and easy-to-compute metric, to estimate the quality of a CNN architecture and guide the design search process. To validate the effectiveness of 3DRF, we build a static optimizer to improve the CNN architectures at both the stage level and the kernel level. Our optimizer not only provides a clear and reproducible procedure but also mitigates unnecessary training efforts in the architecture search process. Extensive experiments and studies show that the models generated by our optimizer can achieve up to 5.47% accuracy improvement and up to 65.38% parameter reduction, compared with state-of-the-art CNN structures like MobileNet and ResNet.
    Sequential prediction under log-loss and misspecification. (arXiv:2102.00050v3 [cs.LG] UPDATED)
    (2 min) We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, the learner sequentially predicts the (distribution of the) next letter in the sequence and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption that the data-generating distribution belongs to the hypothesis class as well. Here we present results in the more general misspecified case. Due to special properties of the log-loss, the same problem arises in the context of competitive optimality in density estimation and model selection. For the $d$-dimensional Gaussian location hypothesis class, we show that cumulative regrets in the well-specified and misspecified cases asymptotically coincide. In other words, we provide an $o(1)$ characterization of the distribution-free (or PAC) regret in this case -- the first such result as far as we know. We recall that the worst-case (or individual-sequence) regret in this case is larger by an additive constant ${d\over 2} + o(1)$. Surprisingly, neither the traditional Bayesian estimators nor Shtarkov's normalized maximum likelihood achieve the PAC regret, and our estimator requires special "robustification" against heavy-tailed data. In addition, we show two general results for misspecified regret: the existence and uniqueness of the optimal estimator, and the bound sandwiching the misspecified regret between well-specified regrets with (asymptotically) close hypothesis classes.
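    For readers new to the setting, the display below is the standard definition of cumulative regret under log-loss against the best constant predictor in a hypothesis class $\mathcal{H}$; the notation is generic and may differ from the paper's.

    ```latex
    % Standard cumulative log-loss regret against the best distribution in the class H.
    \[
      \mathrm{Reg}_n \;=\; \sum_{t=1}^{n} \log \frac{1}{q_t\bigl(x_t \mid x^{t-1}\bigr)}
      \;-\; \inf_{p \in \mathcal{H}} \sum_{t=1}^{n} \log \frac{1}{p(x_t)},
    \]
    % where q_t is the learner's predicted distribution for the next letter given the past x^{t-1},
    % and the baseline is the best constant predictor p in H chosen in hindsight.
    ```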
    Supervised machine learning techniques for data matching based on similarity metrics. (arXiv:2007.04001v2 [cs.LG] UPDATED)
    (2 min) Businesses, governmental bodies and NGOs have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame. Clean and consistent data is therefore crucial. Data matching is the field that tries to identify instances in data that refer to the same real-world entity. In this study, machine learning techniques are combined with string similarity functions and applied to the field of data matching. A dataset of invoices from a variety of businesses and organizations was preprocessed with a grouping scheme to reduce pair dimensionality, and a set of similarity functions was used to quantify similarity between invoice pairs. The resulting invoice pair dataset was then used to train and validate a neural network and a boosted decision tree. The performance was compared with a solution from FISCAL Technologies as a benchmark against currently available deduplication solutions. Both the neural network and boosted decision tree showed equal or better performance.
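    A minimal sketch of the pipeline described above: record pairs are turned into string-similarity feature vectors and a boosted tree model is trained to predict matches. It uses difflib and scikit-learn as stand-ins for the study's similarity functions and models; the invoice fields and toy pairs are invented.

    ```python
    # Record pairs -> string similarity features -> supervised match classifier (toy data).
    from difflib import SequenceMatcher
    from sklearn.ensemble import GradientBoostingClassifier

    def features(a: dict, b: dict) -> list:
        """One similarity score per compared field of an invoice pair."""
        return [SequenceMatcher(None, a[k], b[k]).ratio() for k in ("supplier", "number", "amount")]

    pairs = [
        ({"supplier": "Acme Ltd", "number": "INV-001", "amount": "100.00"},
         {"supplier": "ACME Limited", "number": "INV-001", "amount": "100.00"}, 1),  # duplicate
        ({"supplier": "Acme Ltd", "number": "INV-001", "amount": "100.00"},
         {"supplier": "Globex", "number": "PO-773", "amount": "59.90"}, 0),          # not a duplicate
    ] * 20  # repeated toy data so the model has something to fit

    X = [features(a, b) for a, b, _ in pairs]
    y = [label for _, _, label in pairs]
    clf = GradientBoostingClassifier().fit(X, y)
    print(clf.predict_proba([features(pairs[0][0], pairs[0][1])]))
    ```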
    Likelihood-Based Diverse Sampling for Trajectory Forecasting. (arXiv:2011.15084v2 [cs.CV] UPDATED)
    (2 min) Forecasting complex vehicle and pedestrian multi-modal distributions requires powerful probabilistic approaches. Normalizing flows (NF) have recently emerged as an attractive tool to model such distributions. However, a key drawback is that independent samples drawn from a flow model often do not adequately capture all the modes in the underlying distribution. We propose Likelihood-Based Diverse Sampling (LDS), a method for improving the quality and the diversity of trajectory samples from a pre-trained flow model. Rather than producing individual samples, LDS produces a set of trajectories in one shot. Given a pre-trained forecasting flow model, we train LDS using gradients from the model, to optimize an objective function that rewards high likelihood for individual trajectories in the predicted set, together with high spatial separation among trajectories. LDS outperforms state-of-art post-hoc neural diverse forecasting methods for various pre-trained flow models as well as conditional variational autoencoder (CVAE) models. Crucially, it can also be used for transductive trajectory forecasting, where the diverse forecasts are trained on-the-fly on unlabeled test examples. LDS is easy to implement, and we show that it offers a simple plug-in improvement over baselines on two challenging benchmarks. Code is at: https://github.com/JasonMa2016/LDS
    Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting and Regularization. (arXiv:2011.09148v4 [stat.ML] UPDATED)
    (3 min) Deep neural networks generalize well despite being exceedingly overparameterized and being trained without explicit regularization. This curious phenomenon has inspired extensive research activity in establishing its statistical principles: Under what conditions is it observed? How do these depend on the data and on the training algorithm? When does regularization benefit generalization? While such questions remain wide open for deep neural nets, recent works have attempted gaining insights by studying simpler, often linear, models. Our paper contributes to this growing line of work by examining binary linear classification under a generative Gaussian mixture model. Motivated by recent results on the implicit bias of gradient descent, we study both max-margin SVM classifiers (corresponding to logistic loss) and min-norm interpolating classifiers (corresponding to least-squares loss). First, we leverage an idea introduced in [V. Muthukumar et al., arXiv:2005.08054, (2020)] to relate the SVM solution to the min-norm interpolating solution. Second, we derive novel non-asymptotic bounds on the classification error of the latter. Combining the two, we present novel sufficient conditions on the covariance spectrum and on the signal-to-noise ratio (SNR) under which interpolating estimators achieve asymptotically optimal performance as overparameterization increases. Interestingly, our results extend to a noisy model with constant probability noise flips. Contrary to previously studied discriminative data models, our results emphasize the crucial role of the SNR and its interplay with the data covariance. Finally, via a combination of analytical arguments and numerical demonstrations we identify conditions under which the interpolating estimator performs better than corresponding regularized estimates.
    VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction. (arXiv:2104.05932v2 [cs.CV] UPDATED)
    (2 min) 3D object detection and dense depth estimation are among the most vital tasks in autonomous driving. Multiple sensor modalities can jointly contribute to better robot perception, and to that end, we introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks. It takes a LiDAR point-cloud and a single RGB image as inputs during inference and produces object pose predictions as well as a densely reconstructed depth map. The LiDAR point-cloud is converted into a set of voxels, and its features are extracted using 3D convolution layers, from which we regress object pose parameters. Corresponding RGB image features are extracted using another 2D convolutional neural network. We further use these combined features to predict a dense depth map. While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions. We also introduce a loss function, edge-preserving smooth loss, and show that this results in better depth estimation compared to the edge-aware smooth loss function, frequently used in depth prediction works.
    DA-HGT: Domain Adaptive Heterogeneous Graph Transformer. (arXiv:2012.05688v2 [cs.LG] UPDATED)
    (2 min) Domain adaptation using graph networks learns label-discriminative and network-invariant node embeddings by sharing graph parameters. Most existing works focus on domain adaptation of homogeneous networks. The few works that study heterogeneous cases only consider shared node types but ignore private node types in individual networks. However, given source and target heterogeneous networks generally contain both shared and private node types, and the private types bring an extra challenge for graph domain adaptation. In this paper, we investigate Heterogeneous Information Networks (HINs) with partially shared node types and propose a novel Domain Adaptive Heterogeneous Graph Transformer (DA-HGT) to handle the domain shift between them. DA-HGT can not only align the distribution of identical-type nodes and edges in two HINs but also make full use of different-type nodes and edges to improve the performance of knowledge transfer. Extensive experiments on several datasets demonstrate that DA-HGT can outperform state-of-the-art methods in various domain adaptation tasks across heterogeneous networks.
    Queueing Network Controls via Deep Reinforcement Learning. (arXiv:2008.01644v7 [math.OC] UPDATED)
    (2 min) Novel advanced policy gradient (APG) methods, such as Trust Region Policy Optimization and Proximal Policy Optimization (PPO), have become the dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for notoriously difficult queueing network control problems is a Markov decision problem (MDP) that has three features: infinite state space, unbounded costs, and long-run average cost objective. We extend the theoretical framework of these APG methods for such MDP problems. The resulting PPO algorithm is tested on a parallel-server system and large-size multiclass queueing networks. The algorithm consistently generates control policies that outperform state-of-the-art heuristics in the literature under a variety of load conditions from light to heavy traffic. These policies are demonstrated to be near-optimal when the optimal policy can be computed. A key to the success of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function via sampling. First, we use a discounted relative value function as an approximation of the relative value function. Second, we propose regenerative simulation to estimate the discounted relative value function. Finally, we incorporate the approximating martingale-process method into the regenerative estimator.
    Land Cover Mapping in Limited Labels Scenario: A Survey. (arXiv:2103.02429v2 [cs.LG] UPDATED)
    (2 min) Land cover mapping is essential for monitoring global environmental change and managing natural resources. Unfortunately, traditional classification models are plagued by limited training data available in existing land cover products and data heterogeneity over space and time. In this survey, we provide a structured and comprehensive overview of challenges in land cover mapping and machine learning methods used to address these problems. We also discuss the gaps and opportunities that exist for advancing research in this promising direction.
    $(f,\Gamma)$-Divergences: Interpolating between $f$-Divergences and Integral Probability Metrics. (arXiv:2011.05953v3 [stat.ML] UPDATED)
    (2 min) We develop a rigorous and general framework for constructing information-theoretic divergences that subsume both $f$-divergences and integral probability metrics (IPMs), such as the $1$-Wasserstein distance. We prove under which assumptions these divergences, hereafter referred to as $(f,\Gamma)$-divergences, provide a notion of `distance' between probability measures and show that they can be expressed as a two-stage mass-redistribution/mass-transport process. The $(f,\Gamma)$-divergences inherit features from IPMs, such as the ability to compare distributions which are not absolutely continuous, as well as from $f$-divergences, namely the strict concavity of their variational representations and the ability to control heavy-tailed distributions for particular choices of $f$. When combined, these features establish a divergence with improved properties for estimation, statistical learning, and uncertainty quantification applications. Using statistical learning as an example, we demonstrate their advantage in training generative adversarial networks (GANs) for heavy-tailed, not-absolutely continuous sample distributions. We also show improved performance and stability over gradient-penalized Wasserstein GAN in image generation.
    Provenance Graph Kernel. (arXiv:2010.10343v2 [cs.LG] UPDATED)
    (2 min) Provenance is a record that describes how entities, activities, and agents have influenced a piece of data; it is commonly represented as graphs with relevant labels on both their nodes and edges. With the growing adoption of provenance in a wide range of application domains, users are increasingly confronted with an abundance of graph data, which may prove challenging to process. Graph kernels, on the other hand, have been successfully used to efficiently analyse graphs. In this paper, we introduce a novel graph kernel called the provenance kernel, which is inspired by and tailored for provenance data. It decomposes a provenance graph into tree-patterns rooted at a given node and considers the labels of edges and nodes up to a certain distance from the root. We employ provenance kernels to classify provenance graphs from three application domains. Our evaluation shows that they perform well in terms of classification accuracy and yield competitive results when compared against existing graph kernel methods and the provenance network analytics method, while being more efficient in computing time. Moreover, the provenance types used by provenance kernels also help improve the explainability of predictive models built on them.
    An Accurate, Scalable and Verifiable Protocol for Federated Differentially Private Averaging. (arXiv:2006.07218v2 [cs.CR] UPDATED)
    (2 min) Learning from data owned by several parties, as in federated learning, raises challenges regarding the privacy guarantees provided to participants and the correctness of the computation in the presence of malicious parties. We tackle these challenges in the context of distributed averaging, an essential building block of federated learning algorithms. Our first contribution is a scalable protocol in which participants exchange correlated Gaussian noise along the edges of a network graph, complemented by independent noise added by each party. We analyze the differential privacy guarantees of our protocol and the impact of the graph topology under colluding malicious parties, showing that we can nearly match the utility of the trusted curator model even when each honest party communicates with only a logarithmic number of other parties chosen at random. This is in contrast with protocols in the local model of privacy (with lower utility) or based on secure aggregation (where all pairs of users need to exchange messages). Our second contribution enables users to prove the correctness of their computations without compromising the efficiency and privacy guarantees of the protocol. Our verification protocol relies on standard cryptographic primitives like commitment schemes and zero knowledge proofs.
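    The pairwise-canceling noise idea can be illustrated with a few lines of NumPy; this is a toy sketch under my own assumptions (a fixed ring graph, unit-scale edge noise), not the paper's protocol or its privacy analysis. Each edge's shared noise term is added by one endpoint and subtracted by the other, so it cancels in the global average while masking individual contributions.

        import numpy as np

        rng = np.random.default_rng(1)
        n = 6
        values = rng.uniform(0, 1, size=n)                           # each party's private value
        edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]     # toy communication graph (ring)

        noisy = values.copy()
        for i, j in edges:
            eta = rng.normal(scale=1.0)   # correlated noise shared by the pair (i, j)
            noisy[i] += eta               # one endpoint adds it ...
            noisy[j] -= eta               # ... the other subtracts it, so it cancels in the sum
        noisy += rng.normal(scale=0.01, size=n)                      # small independent noise per party

        print(values.mean(), noisy.mean())  # the averages nearly coincide: the edge noise cancels exactly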
    Adversarial Weighting for Domain Adaptation in Regression. (arXiv:2006.08251v4 [cs.LG] UPDATED)
    (2 min) We present a novel instance-based approach to handle regression tasks in the context of supervised domain adaptation under an assumption of covariate shift. The approach developed in this paper is based on the assumption that the task on the target domain can be efficiently learned by adequately reweighting the source instances during the training phase. We introduce a novel formulation of the optimization objective for domain adaptation which relies on a discrepancy distance characterizing the difference between domains according to a specific task and a class of hypotheses. To solve this problem, we develop an adversarial network algorithm which learns both the source weighting scheme and the task in one feed-forward gradient descent. We provide numerical evidence of the relevance of the method on public data sets for regression domain adaptation through reproducible experiments.
    Hierarchical Attentive Knowledge Graph Embedding for Personalized Recommendation. (arXiv:1910.08288v4 [cs.IR] UPDATED)
    (2 min) Knowledge graphs (KGs) have proven to be effective for high-quality recommendation, where the connectivities between users and items provide rich and complementary information to user-item interactions. Most existing methods, however, are insufficient to exploit the KGs for capturing user preferences, as they either represent the user-item connectivities via paths with limited expressiveness or implicitly model them by propagating information over the entire KG with inevitable noise. In this paper, we design a novel hierarchical attentive knowledge graph embedding (HAKG) framework to exploit the KGs for effective recommendation. Specifically, HAKG first extracts the expressive subgraphs that link user-item pairs to characterize their connectivities, which accommodate both the semantics and topology of KGs. The subgraphs are then encoded via a hierarchical attentive subgraph encoding to generate effective subgraph embeddings for enhanced user preference prediction. Extensive experiments show the superiority of HAKG against state-of-the-art recommendation methods, as well as its potential in alleviating the data sparsity issue.
    On a Combination of Alternating Minimization and Nesterov's Momentum. (arXiv:1906.03622v5 [math.OC] UPDATED)
    (2 min) Alternating minimization (AM) procedures are practically efficient in many applications for solving convex and non-convex optimization problems. On the other hand, Nesterov's accelerated gradient is a theoretically optimal first-order method for convex optimization. In this paper we combine AM and Nesterov's acceleration to propose an accelerated alternating minimization algorithm. We prove a $1/k^2$ convergence rate in terms of the objective for convex problems and $1/k$ in terms of the squared gradient norm for non-convex problems, where $k$ is the iteration counter. Our method does not require knowledge of either the convexity of the problem or function parameters such as the Lipschitz constant of the gradient, i.e. it is adaptive to convexity and smoothness and is uniformly optimal for smooth convex and non-convex problems. Further, we develop its primal-dual modification for strongly convex problems with linear constraints and prove the same $1/k^2$ rate for the primal objective residual and constraint feasibility.
    Disparate Vulnerability to Membership Inference Attacks. (arXiv:1906.00389v3 [cs.LG] UPDATED)
    (2 min) A membership inference attack (MIA) against a machine-learning model enables an attacker to determine whether a given data record was part of the model's training data or not. In this paper, we provide an in-depth study of the phenomenon of disparate vulnerability against MIAs: unequal success rate of MIAs against different population subgroups. We first establish necessary and sufficient conditions for MIAs to be prevented, both on average and for population subgroups, using a notion of distributional generalization. Second, we derive connections of disparate vulnerability to algorithmic fairness and to differential privacy. We show that fairness can only prevent disparate vulnerability against limited classes of adversaries. Differential privacy bounds disparate vulnerability but can significantly reduce the accuracy of the model. We show that estimating disparate vulnerability to MIAs by na\"ively applying existing attacks can lead to overestimation. We then establish which attacks are suitable for estimating disparate vulnerability, and provide a statistical framework for doing so reliably. We conduct experiments on synthetic and real-world data finding statistically significant evidence of disparate vulnerability in realistic settings.
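    As a minimal illustration of how disparate vulnerability can be measured, the sketch below computes a membership inference attack's success rate separately for two synthetic subgroups; the data, the subgroup split, and the 60%/80% per-group attack accuracies are invented purely to show the computation, not results from the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1000
        is_member = rng.integers(0, 2, size=n)            # ground-truth membership labels
        subgroup = rng.integers(0, 2, size=n)             # population subgroup of each record
        acc = np.where(subgroup == 0, 0.6, 0.8)           # assume the attacker is stronger on subgroup 1
        attack_guess = np.where(rng.random(n) < acc, is_member, 1 - is_member)

        for g in np.unique(subgroup):
            mask = subgroup == g
            rate = (attack_guess[mask] == is_member[mask]).mean()
            print(f"subgroup {g}: attack success rate {rate:.3f}")   # unequal rates = disparate vulnerability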
    Deep Bregman Divergence for Contrastive Learning of Visual Representations. (arXiv:2109.07455v1 [cs.CV])
    (2 min) Deep Bregman divergence measures the divergence of data points using neural networks, going beyond Euclidean distance and capturing divergence over distributions. In this paper, we propose deep Bregman divergences for contrastive learning of visual representations, and we aim to enhance the contrastive loss used in self-supervised learning by training additional networks based on functional Bregman divergence. In contrast to conventional contrastive learning methods, which are solely based on divergences between single points, our framework can capture the divergence between distributions, which improves the quality of the learned representation. By combining the conventional contrastive loss with the proposed divergence loss, our method outperforms the baseline and most previous methods for self-supervised and semi-supervised learning on multiple classification and object detection tasks and datasets. The source code of the method and of all the experiments is available in the supplementary material.
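    For readers unfamiliar with Bregman divergences, the pointwise form is D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q> for a convex generator phi; the sketch below checks two classical special cases numerically. The paper's deep/functional version replaces phi with a learned network, which is not reproduced here.

        import numpy as np

        def bregman(p, q, phi, grad_phi):
            # Pointwise Bregman divergence generated by the convex function phi.
            return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

        p = np.array([0.2, 0.3, 0.5])
        q = np.array([0.4, 0.4, 0.2])

        # phi(x) = ||x||^2 recovers the squared Euclidean distance
        phi_sq, grad_sq = lambda x: np.dot(x, x), lambda x: 2 * x
        print(bregman(p, q, phi_sq, grad_sq), np.sum((p - q) ** 2))

        # phi(x) = sum x log x recovers the KL divergence (for probability vectors)
        phi_ent, grad_ent = lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1
        print(bregman(p, q, phi_ent, grad_ent), np.sum(p * np.log(p / q)))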
    Challenges in Detoxifying Language Models. (arXiv:2109.07445v1 [cs.CL])
    (2 min) Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions -- highlighting further the nuances involved in careful evaluation of LM toxicity.
    Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators. (arXiv:2109.07419v1 [cs.AR])
    (2 min) To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms for a target problem on new hardware. Previous works have addressed these challenges individually. To address these challenges as a whole, in this work, we present a HW-SW co-design ecosystem for spatial accelerators called Union within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their mappings on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers which can easily be extended. The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies which examine offloading different tensor operations (CONV/GEMM/Tensor Contraction) on diverse accelerator architectures using different mapping schemes.
    Neural network optimal feedback control with enhanced closed loop stability. (arXiv:2109.07466v1 [math.OC])
    (2 min) Recent research has shown that supervised learning can be an effective tool for designing optimal feedback controllers for high-dimensional nonlinear dynamic systems. But the behavior of these neural network (NN) controllers is still not well understood. In this paper we use numerical simulations to demonstrate that typical test accuracy metrics do not effectively capture the ability of an NN controller to stabilize a system. In particular, some NNs with high test accuracy can fail to stabilize the dynamics. To address this we propose two NN architectures which locally approximate a linear quadratic regulator (LQR). Numerical simulations confirm our intuition that the proposed architectures reliably produce stabilizing feedback controllers without sacrificing performance. In addition, we introduce a preliminary theoretical result describing some stability properties of such NN-controlled systems.
    Multi View Spatial-Temporal Model for Travel Time Estimation. (arXiv:2109.07402v1 [cs.LG])
    (2 min) Taxi arrival time prediction is an essential part of building intelligent transportation systems. Traditional arrival time estimation methods mainly rely on traffic map feature extraction, which cannot model complex situations and nonlinear spatial and temporal relationships. Therefore, we propose a Multi-View Spatial-Temporal Model (MVSTM) to capture spatial-temporal dependencies and trajectory information. Specifically, we use graph2vec to model the spatial view, a dual-channel temporal module to model the trajectory view, and structural embedding to model the traffic semantics. Experiments on large-scale taxi trajectory data show that our approach is more effective than existing methods. The source code can be obtained from https://github.com/775269512/SIGSPATIAL-2021-GISCUP-4th-Solution.
    Safety Verification and Robustness Analysis of Neural Networks via Quadratic Constraints and Semidefinite Programming. (arXiv:1903.01287v3 [math.OC] UPDATED)
    (2 min) Certifying the safety or robustness of neural networks against input uncertainties and adversarial attacks is an emerging challenge in the area of safe machine learning and control. To provide such a guarantee, one must be able to bound the output of neural networks when their input changes within a bounded set. In this paper, we propose a semidefinite programming (SDP) framework to address this problem for feed-forward neural networks with general activation functions and input uncertainty sets. Our main idea is to abstract various properties of activation functions (e.g., monotonicity, bounded slope, bounded values, and repetition across layers) with the formalism of quadratic constraints. We then analyze the safety properties of the abstracted network via the S-procedure and semidefinite programming. Our framework spans the trade-off between conservatism and computational efficiency and applies to problems beyond safety verification. We evaluate the performance of our approach via numerical problem instances of various sizes.
    On ADMM in Deep Learning: Convergence and Saturation-Avoidance. (arXiv:1902.02060v3 [cs.LG] UPDATED)
    (3 min) In this paper, we develop an alternating direction method of multipliers (ADMM) for deep neural network training with sigmoid-type activation functions (called the sigmoid-ADMM pair), mainly motivated by the gradient-free nature of ADMM in avoiding the saturation of sigmoid-type activations and the advantages of deep neural networks with sigmoid-type activations (called deep sigmoid nets) over their rectified linear unit (ReLU) counterparts (called deep ReLU nets) in terms of approximation. In particular, we prove that the approximation capability of deep sigmoid nets is not worse than that of deep ReLU nets by showing that the ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters, but not vice versa. We also establish the global convergence of the proposed ADMM for the nonlinearly constrained formulation of the deep sigmoid nets training from arbitrary initial points to a Karush-Kuhn-Tucker (KKT) point at a rate of order ${\cal O}(1/k)$. Besides sigmoid activation, such a convergence theorem holds for a general class of smooth activations. Compared with the widely used stochastic gradient descent (SGD) algorithm for the deep ReLU nets training (called the ReLU-SGD pair), the proposed sigmoid-ADMM pair is practically stable with respect to the algorithmic hyperparameters including the learning rate, initialization schemes and the pre-processing of the input data. Moreover, we find that to approximate and learn simple but important functions the proposed sigmoid-ADMM pair numerically outperforms the ReLU-SGD pair.
    Assisting the Human Fact-Checkers: Detecting All Previously Fact-Checked Claims in a Document. (arXiv:2109.07410v1 [cs.CL])
    (2 min) Given the recent proliferation of false claims online, there has been a lot of manual fact-checking effort. As this is very time-consuming, human fact-checkers can benefit from tools that can support them and make them more efficient. Here, we focus on building a system that could provide such support. Given an input document, it aims to detect all sentences that contain a claim that can be verified by some previously fact-checked claims (from a given database). The output is a re-ranked list of the document sentences, so that those that can be verified are ranked as high as possible, together with corresponding evidence. Unlike previous work, which has looked into claim retrieval, here we take a document-level perspective. We create a new manually annotated dataset for the task, and we propose suitable evaluation measures. We further experiment with a learning-to-rank approach, achieving sizable performance gains over several strong baselines. Our analysis demonstrates the importance of modeling text similarity and stance, while also taking into account the veracity of the retrieved previously fact-checked claims. We believe that this research would be of interest to fact-checkers, journalists, media, and regulatory authorities.
    Disentangling Generative Factors of Physical Fields Using Variational Autoencoders. (arXiv:2109.07399v1 [physics.comp-ph])
    (2 min) The ability to extract generative parameters from high-dimensional fields of data in an unsupervised manner is a highly desirable yet unrealized goal in computational physics. This work explores the use of variational autoencoders (VAEs) for non-linear dimension reduction with the aim of disentangling the low-dimensional latent variables to identify independent physical parameters that generated the data. A disentangled decomposition is interpretable and can be transferred to a variety of tasks including generative modeling, design optimization, and probabilistic reduced order modelling. A major emphasis of this work is to characterize disentanglement using VAEs while minimally modifying the classic VAE loss function (i.e. the ELBO) to maintain high reconstruction accuracy. Disentanglement is shown to be highly sensitive to rotations of the latent space, hyperparameters, random initializations and the learning schedule. The loss landscape is characterized by over-regularized local minima which surrounds desirable solutions. We illustrate comparisons between disentangled and entangled representations by juxtaposing learned latent distributions and the 'true' generative factors in a model porous flow problem. Implementing hierarchical priors (HP) is shown to better facilitate the learning of disentangled representations over the classic VAE. The choice of the prior distribution is shown to have a dramatic effect on disentanglement. In particular, the regularization loss is unaffected by latent rotation when training with rotationally-invariant priors, and thus learning non-rotationally-invariant priors aids greatly in capturing the properties of generative factors, improving disentanglement. Some issues inherent to training VAEs, such as the convergence to over-regularized local minima are illustrated and investigated, and potential techniques for mitigation are presented.
    Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative. (arXiv:2109.07437v1 [cs.LG])
    (2 min) Pre-training, where models are trained on an auxiliary objective with abundant data before being fine-tuned on data from the downstream task, is now the dominant paradigm in NLP. In general, the pre-training step relies on little to no direct knowledge of the task on which the model will be fine-tuned, even when the end-task is known in advance. Our work challenges this status-quo of end-task agnostic pre-training. First, on three different low-resource NLP tasks from two domains, we demonstrate that multi-tasking the end-task and auxiliary objectives results in significantly better downstream task performance than the widely-used task-agnostic continued pre-training paradigm of Gururangan et al. (2020). We next introduce an online meta-learning algorithm that learns a set of multi-task weights to better balance among our multiple auxiliary objectives, achieving further improvements on end task performance and data efficiency.
    SupCL-Seq: Supervised Contrastive Learning for Downstream Optimized Sequence Representations. (arXiv:2109.07424v1 [cs.CL])
    (2 min) While contrastive learning is proven to be an effective training strategy in computer vision, Natural Language Processing (NLP) is only recently adopting it as a self-supervised alternative to Masked Language Modeling (MLM) for improving sequence representations. This paper introduces SupCL-Seq, which extends supervised contrastive learning from computer vision to the optimization of sequence representations in NLP. By altering the dropout mask probability in standard Transformer architectures, for every representation (anchor), we generate augmented altered views. A supervised contrastive loss is then utilized to maximize the system's capability of pulling together similar samples (e.g., anchors and their altered views) and pushing apart the samples belonging to the other classes. Despite its simplicity, SupCL-Seq leads to large gains in many sequence classification tasks on the GLUE benchmark compared to a standard BERT-base, including 6% absolute improvement on CoLA, 5.4% on MRPC, 4.7% on RTE and 2.6% on STSB. We also show consistent gains over self-supervised contrastively learned representations, especially in non-semantic tasks. Finally, we show that these gains are not solely due to augmentation, but rather to a downstream optimized sequence representation. Code: https://github.com/hooman650/SupCL-Seq
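    The supervised contrastive ingredient can be sketched in a few lines of NumPy; this follows the standard SupCon form (pull together same-label anchors, push apart other classes) and is only my stand-in for the loss SupCL-Seq builds on: the dropout-based view generation and the Transformer encoder are omitted.

        import numpy as np

        def sup_con_loss(z, labels, tau=0.1):
            # z: (N, d) embeddings; labels: (N,) class ids
            z = z / np.linalg.norm(z, axis=1, keepdims=True)
            sim = z @ z.T / tau
            np.fill_diagonal(sim, -np.inf)                      # never contrast a sample with itself
            log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
            total, count = 0.0, 0
            for i in range(len(z)):
                pos = np.where((labels == labels[i]) & (np.arange(len(z)) != i))[0]
                if len(pos) == 0:
                    continue
                total += -log_prob[i, pos].mean()               # average over the anchor's positives
                count += 1
            return total / max(count, 1)

        rng = np.random.default_rng(0)
        emb = rng.normal(size=(8, 16))
        lab = np.array([0, 0, 1, 1, 2, 2, 0, 1])
        print(sup_con_loss(emb, lab))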
    Comparing Text Representations: A Theory-Driven Approach. (arXiv:2109.07458v1 [cs.CL])
    (2 min) Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of text datasets and present a method to evaluate the compatibility between representations and tasks. Even though many tasks can be easily solved with simple bag-of-words (BOW) representations, BOW does poorly on hard natural language inference tasks. For one such task we find that BOW cannot distinguish between real and randomized labelings, while pre-trained MLM representations show 72x greater distinction between real and random labelings than BOW. This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters. The method provides a fresh perspective on the patterns in a dataset and the alignment of those patterns with specific labels.
    CAMul: Calibrated and Accurate Multi-view Time-Series Forecasting. (arXiv:2109.07438v1 [cs.LG])
    (2 min) Probabilistic time-series forecasting enables reliable decision making across many domains. Most forecasting problems have diverse sources of data containing multiple modalities and structures. Leveraging information as well as uncertainty from these data sources for well-calibrated and accurate forecasts is an important and challenging problem. Most previous work on multi-modal learning and forecasting simply aggregates intermediate representations from each data view by simple methods of summation or concatenation and does not explicitly model uncertainty for each data-view. We propose a general probabilistic multi-view forecasting framework, CAMul, that can learn representations and uncertainty from diverse data sources. It integrates the knowledge and uncertainty from each data view in a dynamic context-specific manner, assigning more importance to useful views to model a well-calibrated forecast distribution. We use CAMul for multiple domains with varied sources and modalities and show that CAMul outperforms other state-of-the-art probabilistic forecasting models by over 25% in accuracy and calibration.
    Self-learn to Explain Siamese Networks Robustly. (arXiv:2109.07371v1 [cs.LG])
    (2 min) Learning to compare two objects is essential in applications, such as digital forensics, face recognition, and brain network analysis, especially when labeled data is scarce and imbalanced. As these applications make high-stake decisions and involve societal values like fairness and transparency, it is critical to explain the learned models. We aim to study post-hoc explanations of Siamese networks (SN) widely used in learning to compare. We characterize the instability of gradient-based explanations due to the additional compared object in SN, in contrast to architectures with a single input instance. We propose an optimization framework that derives global invariance from unlabeled data using self-learning to promote the stability of local explanations tailored for specific query-reference pairs. The optimization problems can be solved using gradient descent-ascent (GDA) for constrained optimization, or SGD for KL-divergence regularized unconstrained optimization, with convergence proofs, especially when the objective functions are nonconvex due to the Siamese architecture. Quantitative results and case studies on tabular and graph data from neuroscience and chemical engineering show that the framework respects the self-learned invariance while robustly optimizing the faithfulness and simplicity of the explanation. We further demonstrate the convergence of GDA experimentally.
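    For readers unfamiliar with gradient descent-ascent, the toy loop below runs GDA on a regularized saddle problem, min over x and max over y of x*y + 0.5*mu*x^2 - 0.5*mu*y^2; it only illustrates the alternating descent/ascent update pattern and makes no attempt to reproduce the paper's Siamese-explanation objective or its convergence analysis.

        # Toy GDA on f(x, y) = x*y + 0.5*mu*x**2 - 0.5*mu*y**2 (minimize over x, maximize over y)
        mu, lr = 0.5, 0.05
        x, y = 2.0, -1.5
        for _ in range(500):
            grad_x = y + mu * x        # df/dx
            grad_y = x - mu * y        # df/dy
            x -= lr * grad_x           # descent step on x
            y += lr * grad_y           # ascent step on y
        print(x, y)                    # approaches the saddle point (0, 0)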
    When Does Translation Require Context? A Data-driven, Multilingual Exploration. (arXiv:2109.07446v1 [cs.CL])
    (2 min) Although proper handling of discourse phenomena significantly contributes to the quality of machine translation (MT), common translation quality metrics do not adequately capture them. Recent works in context-aware MT attempt to target a small set of these phenomena during evaluation. In this paper, we propose a new metric, P-CXMI, which allows us to identify translations that require context systematically and confirm the difficulty of previously studied phenomena as well as uncover new ones that have not been addressed in previous work. We then develop the Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers for these phenomena in 14 different language pairs, which we use to evaluate context-aware MT. We find that state-of-the-art context-aware MT models achieve only marginal improvements over context-agnostic models on our benchmark, which suggests current models do not handle these ambiguities effectively. We release code and data to invite the MT research community to increase efforts on context-aware translation on discourse phenomena and languages that are currently overlooked.
    Matching with Transformers in MELT. (arXiv:2109.07401v1 [cs.CL])
    (2 min) One of the strongest signals for automated matching of ontologies and knowledge graphs is the textual descriptions of the concepts. The methods that are typically applied (such as character- or token-based comparisons) are relatively simple, and therefore do not capture the actual meaning of the texts. With the rise of transformer-based language models, text comparison based on meaning (rather than lexical features) is possible. In this paper, we model the ontology matching task as a classification problem and present approaches based on transformer models. We further provide an easy-to-use implementation in the MELT framework, which is suited for ontology and knowledge graph matching. We show that a transformer-based filter helps to choose the correct correspondences given a high-recall alignment and already achieves a good result with simple alignment post-processing methods.
    How to use KL-divergence to construct conjugate priors, with well-defined non-informative limits, for the multivariate Gaussian. (arXiv:2109.07384v1 [stat.ML])
    (2 min) The Wishart distribution is the standard conjugate prior for the precision of the multivariate Gaussian likelihood, when the mean is known -- while the normal-Wishart can be used when the mean is also unknown. It is however not so obvious how to assign values to the hyperparameters of these distributions. In particular, when forming non-informative limits of these distributions, the shape (or degrees of freedom) parameter of the Wishart must be handled with care. The intuitive solution of directly interpreting the shape as a pseudocount and letting it go to zero, as proposed by some authors, violates the restrictions on the shape parameter. We show how to use the scaled KL-divergence between multivariate Gaussians as an energy function to construct Wishart and normal-Wishart conjugate priors. When used as informative priors, the salient feature of these distributions is the mode, while the KL scaling factor serves as the pseudocount. The scale factor can be taken down to the limit at zero, to form non-informative priors that do not violate the restrictions on the Wishart shape parameter. This limit is non-informative in the sense that the posterior mode is identical to the maximum likelihood estimate of the Gaussian likelihood parameters.
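    For reference, the closed-form KL divergence between two multivariate Gaussians N(mu0, S0) and N(mu1, S1), which is the quantity being scaled and used as an energy function here, is KL = 0.5 * ( tr(S1^{-1} S0) + (mu1 - mu0)^T S1^{-1} (mu1 - mu0) - k + ln(det S1 / det S0) ). The short NumPy function below evaluates it; the example means and covariances are arbitrary.

        import numpy as np

        def kl_mvn(mu0, S0, mu1, S1):
            # KL( N(mu0, S0) || N(mu1, S1) ) for k-dimensional Gaussians
            k = len(mu0)
            S1_inv = np.linalg.inv(S1)
            diff = mu1 - mu0
            return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                          + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

        mu0, S0 = np.zeros(2), np.eye(2)
        mu1, S1 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
        print(kl_mvn(mu0, S0, mu1, S1))   # non-negative; zero iff the two Gaussians coincide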
    Can one hear the shape of a neural network?: Snooping the GPU via Magnetic Side Channel. (arXiv:2109.07395v1 [cs.CR])
    (2 min) Neural network applications have become popular in both enterprise and personal settings. Network solutions are tuned meticulously for each task, and designs that can robustly resolve queries end up in high demand. As the commercial value of accurate and performant machine learning models increases, so too does the demand to protect neural architectures as confidential investments. We explore the vulnerability of neural networks deployed as black boxes across accelerated hardware through electromagnetic side channels. We examine the magnetic flux emanating from a graphics processing unit's power cable, as acquired by a cheap $3 induction sensor, and find that this signal betrays the detailed topology and hyperparameters of a black-box neural network model. The attack acquires the magnetic signal for one query with unknown input values, but known input dimensions. The network reconstruction is possible due to the modular layer sequence in which deep neural networks are evaluated. We find that each layer component's evaluation produces an identifiable magnetic signal signature, from which layer topology, width, function type, and sequence order can be inferred using a suitably trained classifier and a joint consistency optimization based on integer programming. We study the extent to which network specifications can be recovered, and consider metrics for comparing network similarity. We demonstrate the potential accuracy of this side channel attack in recovering the details for a broad range of network architectures, including random designs. We consider applications that may exploit this novel side channel exposure, such as adversarial transfer attacks. In response, we discuss countermeasures to protect against our method and other similar snooping techniques.
    DCUR: Data Curriculum for Teaching via Samples with Reinforcement Learning. (arXiv:2109.07380v1 [cs.LG])
    (2 min) Deep reinforcement learning (RL) has shown great empirical successes, but suffers from brittleness and sample inefficiency. A potential remedy is to use a previously-trained policy as a source of supervision. In this work, we refer to these policies as teachers and study how to transfer their expertise to new student policies by focusing on data usage. We propose a framework, Data CUrriculum for Reinforcement learning (DCUR), which first trains teachers using online deep RL, and stores the logged environment interaction history. Then, students learn by running either offline RL or by using teacher data in combination with a small amount of self-generated data. DCUR's central idea involves defining a class of data curricula which, as a function of training time, limits the student to sampling from a fixed subset of the full teacher data. We test teachers and students using state-of-the-art deep RL algorithms across a variety of data curricula. Results suggest that the choice of data curricula significantly impacts student learning, and that it is beneficial to limit the data during early training stages while gradually letting the data availability grow over time. We identify when the student can learn offline and match teacher performance without relying on specialized offline RL algorithms. Furthermore, we show that collecting a small fraction of online data provides complementary benefits with the data curriculum. Supplementary material is available at https://tinyurl.com/teach-dcur.
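    The central curriculum idea, restricting the student to a time-growing subset of the teacher's logged data, can be sketched in a few lines; the linear growth schedule, the function name and the start fraction below are illustrative assumptions, not the paper's exact curricula.

        import random

        def curriculum_sample(teacher_buffer, step, total_steps, batch_size=4, start_frac=0.1):
            # Sample from the first fraction of the buffer; the fraction grows linearly to 1.0.
            frac = min(1.0, start_frac + (1.0 - start_frac) * step / total_steps)
            limit = max(batch_size, int(frac * len(teacher_buffer)))
            return random.sample(teacher_buffer[:limit], batch_size)

        teacher_buffer = list(range(1000))    # stand-in for logged environment transitions
        print(curriculum_sample(teacher_buffer, step=0, total_steps=100))     # early training: narrow slice
        print(curriculum_sample(teacher_buffer, step=100, total_steps=100))   # late training: full buffer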
    Constraint based Knowledge Base Distillation in End-to-End Task Oriented Dialogs. (arXiv:2109.07396v1 [cs.CL])
    (2 min) End-to-end task-oriented dialogue systems generate responses based on dialog history and an accompanying knowledge base (KB). Inferring those KB entities that are most relevant for an utterance is crucial for response generation. Existing state-of-the-art approaches scale to large KBs by softly filtering over irrelevant KB information. In this paper, we propose a novel filtering technique that consists of (1) a pairwise similarity based filter that identifies relevant information by respecting the n-ary structure in a KB record, and (2) an auxiliary loss that helps in separating contextually unrelated KB information. We also propose a new metric, multiset entity F1, which fixes a correctness issue in the existing entity F1 metric. Experimental results on three publicly available task-oriented dialog datasets show that our proposed approach outperforms existing state-of-the-art models.
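    One natural reading of a multiset entity F1 (my interpretation, since the abstract does not spell out the definition) is to count entities with multiplicity rather than deduplicating them into sets, e.g. via Counter intersection:

        from collections import Counter

        def multiset_f1(predicted, gold):
            # Count entity matches with multiplicity instead of collapsing to sets.
            pred_c, gold_c = Counter(predicted), Counter(gold)
            overlap = sum((pred_c & gold_c).values())
            if not predicted or not gold or overlap == 0:
                return 0.0
            precision = overlap / sum(pred_c.values())
            recall = overlap / sum(gold_c.values())
            return 2 * precision * recall / (precision + recall)

        print(multiset_f1(["rome", "rome", "paris"], ["rome", "paris", "paris"]))  # about 0.667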
    Cross-lingual Transfer of Monolingual Models. (arXiv:2109.07348v1 [cs.CL])
    (2 min) Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform the native English model independently of the source language. After probing the English linguistic knowledge encoded in the representations before and after transfer, we find that semantic information is retained from the source language, while syntactic information is learned during transfer. Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.
    Comparing decision mining approaches with regard to the meaningfulness of their results. (arXiv:2109.07335v1 [cs.LG])
    (2 min) Decisions and the underlying rules are indispensable for driving process execution during runtime, i.e., for routing process instances at alternative branches based on the values of process data. Decision rules can comprise unary data conditions, e.g., age > 40, binary data conditions where the relation between two or more variables is relevant, e.g. temperature1 < temperature2, and more complex conditions that refer to, for example, parts of a medical image. Decision discovery aims at automatically deriving decision rules from process event logs. Existing approaches focus on the discovery of unary, or in some instances binary, data conditions. The discovered decision rules are usually evaluated using accuracy, but not with regard to their semantics and meaningfulness, although this is crucial for validation and the subsequent implementation/adaptation of the decision rules. Hence, this paper compares three decision mining approaches, i.e., two existing ones and one newly described approach, with respect to the meaningfulness of their results. For comparison, we use one synthetic data set for a realistic manufacturing case and the two real-world BPIC 2017/2020 logs. The discovered rules are discussed with regard to their semantics and meaningfulness.
    The potential of self-supervised networks for random noise suppression in seismic data. (arXiv:2109.07344v1 [physics.geo-ph])
    (2 min) Noise suppression is an essential step in any seismic processing workflow. A portion of this noise, particularly in land datasets, presents itself as random noise. In recent years, neural networks have been successfully used to denoise seismic data in a supervised fashion. However, supervised learning always comes with the often unachievable requirement of having noisy-clean data pairs for training. Using blind-spot networks, we redefine the denoising task as a self-supervised procedure where the network uses the surrounding noisy samples to estimate the noise-free value of a central sample. Based on the assumption that noise is statistically independent between samples, the network struggles to predict the noise component of the sample due to its randomness, whilst the signal component is accurately predicted due to its spatio-temporal coherency. Illustrated on synthetic examples, the blind-spot network is shown to be an efficient denoiser of seismic data contaminated by random noise with minimal damage to the signal, therefore providing improvements in both the image domain and down-the-line tasks, such as inversion. To conclude the study, the suggested approach is applied to field data and the results are compared with two commonly used random denoising techniques: FX-deconvolution and Curvelet transform. By demonstrating that blind-spot networks are an efficient suppressor of random noise, we believe this is just the beginning of utilising self-supervised learning in seismic applications.
    Modular Neural Ordinary Differential Equations. (arXiv:2109.07359v1 [cs.LG])
    (2 min) The laws of physics have been written in the language of differential equations for centuries. Neural Ordinary Differential Equations (NODEs) are a new machine learning architecture which allows these differential equations to be learned from a dataset. These have been applied to classical dynamics simulations in the form of Lagrangian Neural Networks (LNNs) and Second Order Neural Differential Equations (SONODEs). However, they either cannot represent the most general equations of motion or lack interpretability. In this paper, we propose Modular Neural ODEs, where each force component is learned with separate modules. We show how physical priors can be easily incorporated into these models. Through a number of experiments, we demonstrate that these models result in better performance, are more interpretable, and add flexibility due to their modularity.
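    The modular composition can be illustrated with a toy damped oscillator: each force term is a separate module (here hand-written stand-ins for the learned networks), the total force is their sum, and the trajectory is integrated with a plain Euler step. This is only a sketch of the idea, not the paper's architecture.

        def spring_force(x, v, k=1.0):     # one module: restoring force
            return -k * x

        def drag_force(x, v, c=0.1):       # another module: damping force
            return -c * v

        modules = [spring_force, drag_force]

        def integrate(x0, v0, dt=0.01, steps=1000):
            x, v = x0, v0
            for _ in range(steps):
                a = sum(f(x, v) for f in modules)   # modular force composition
                v += a * dt
                x += v * dt
            return x, v

        print(integrate(1.0, 0.0))   # a damped oscillation: the state slowly decays toward the origin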
    Embedding Convolutions for Short Text Extreme Classification with Millions of Labels. (arXiv:2109.07319v1 [cs.CL])
    (2 min) Automatic annotation of short-text data to a large number of target labels, referred to as Short Text Extreme Classification, has recently found numerous applications in prediction of related searches and product recommendation tasks. The conventional usage of Convolutional Neural Network (CNN) to capture n-grams in text-classification relies heavily on uniformity in word-ordering and the presence of long input sequences to convolve over. However, this is missing in short and unstructured text sequences encountered in search and recommendation. In order to tackle this, we propose an orthogonal approach by recasting the convolution operation to capture coupled semantics along the embedding dimensions, and develop a word-order agnostic embedding enhancement module to deal with the lack of structure in such queries. Benefitting from the computational efficiency of the convolution operation, Embedding Convolutions, when applied on the enriched word embeddings, result in a light-weight and yet powerful encoder (InceptionXML) that is robust to the inherent lack of structure in short-text extreme classification. Towards scaling our model to problems with millions of labels, we also propose InceptionXML+, which addresses the shortcomings of the dynamic hard-negative mining framework in the recently proposed LightXML by improving the alignment between the label-shortlister and extreme classifier. On popular benchmark datasets, we empirically demonstrate that the proposed method outperforms state-of-the-art deep extreme classifiers such as Astec by an average of 5% and 8% on the P@k and propensity-scored PSP@k metrics respectively.
    Distribution-free Contextual Dynamic Pricing. (arXiv:2109.07340v1 [stat.ML])
    (2 min) Contextual dynamic pricing aims to set personalized prices based on sequential interactions with customers. At each time period, a customer who is interested in purchasing a product comes to the platform. The customer's valuation for the product is a linear function of contexts, including product and customer features, plus some random market noise. The seller does not observe the customer's true valuation, but instead needs to learn the valuation by leveraging contextual information and historical binary purchase feedbacks. Existing models typically assume full or partial knowledge of the random noise distribution. In this paper, we consider contextual dynamic pricing with unknown random noise in the valuation model. Our distribution-free pricing policy learns both the contextual function and the market noise simultaneously. A key ingredient of our method is a novel perturbed linear bandit framework, where a modified linear upper confidence bound algorithm is proposed to balance the exploration of market noise and the exploitation of the current knowledge for better pricing. We establish the regret upper bound and a matching lower bound of our policy in the perturbed linear bandit framework and prove a sub-linear regret bound in the considered pricing problem. Finally, we demonstrate the superior performance of our policy on simulations and a real-life auto-loan dataset.
    DeFungi: Direct Mycological Examination of Microscopic Fungi Images. (arXiv:2109.07322v1 [eess.IV])
    (3 min) Traditionally, diagnosis and treatment of fungal infections in humans depend heavily on face-to-face consultations or examinations made by specialized laboratory scientists known as mycologists. In many cases, such as the recent mucormycosis spread in the COVID-19 pandemic, an initial treatment can be safely suggested to the patient during the earliest stage of the mycological diagnostic process by performing a direct examination of biopsies or samples through a microscope. Computer-aided diagnosis systems using deep learning models have been trained and used for the late mycological diagnostic stages. However, there are no reference works in the literature for the early stages. A mycological laboratory in Colombia donated the images used for the development of this research work. They were manually labelled into five classes and curated with the assistance of a subject matter expert. The images were later cropped and patched with automated code routines to produce the final dataset. This paper presents experimental results classifying five fungi types using two different deep learning approaches and three different convolutional neural network models, VGG16, Inception V3, and ResNet50. The first approach benchmarks the classification performance for the models trained from scratch, while the second approach benchmarks the classification performance using pre-trained models based on the ImageNet dataset. Using k-fold cross-validation testing on the 5-class dataset, the best performing model trained from scratch was Inception V3, reporting 73.2% accuracy. Also, the best performing model using transfer learning was VGG16, reporting 85.04% accuracy. The statistics provided by the two approaches create an initial point of reference to encourage future research works to improve classification performance. Furthermore, the dataset built is published on Kaggle and GitHub to foster future research.
    PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching. (arXiv:2109.07321v1 [cs.DB])
    (2 min) Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch, which makes use of a deep learning mechanism to calibrate and filter human matching decisions according to the quality of a match; these decisions are then combined with algorithmic matching to generate better match results. We provide empirical evidence, based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
    FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining. (arXiv:2109.07323v1 [cs.IR])
    (2 min) Tables store rich numerical data, but numerical reasoning over tables is still a challenge. In this paper, we find that the spreadsheet formula, which performs calculations on numerical values in tables, naturally provides strong supervision for numerical reasoning. More importantly, large amounts of spreadsheets with expert-made formulae are available on the web and can be obtained easily. FORTAP is the first method for numerical-reasoning-aware table pretraining by leveraging a large corpus of spreadsheet formulae. We design two formula pretraining tasks to explicitly guide FORTAP to learn numerical reference and calculation in semi-structured tables. FORTAP achieves state-of-the-art results on two representative downstream tasks, cell type classification and formula prediction, showing the great potential of numerical-reasoning-aware pretraining.
    Modelling Major Disease Outbreaks in the 21st Century: A Causal Approach. (arXiv:2109.07266v1 [cs.LG])
    (2 min) Epidemiologists aiming to model the dynamics of global events face a significant challenge in identifying the factors linked with anomalies such as disease outbreaks. In this paper, we present a novel method for identifying the most important development sectors sensitive to disease outbreaks by using global development indicators as markers. We use statistical methods to assess the causative linkages between these indicators and disease outbreaks, as well as to find the most often ranked indicators. We used data imputation techniques in addition to statistical analysis to convert raw real-world data sets into meaningful data for causal inference. The application of various algorithms for the detection of causal linkages between the indicators is the subject of this research. Despite the fact that disparities in governmental policies between countries account for differences in causal linkages, several indicators emerge as important determinants sensitive to disease outbreaks over the world in the 21st Century.
    Photon detection probability prediction using one-dimensional generative neural network. (arXiv:2109.07277v1 [physics.ins-det])
    (2 min) Photon detection is important for liquid argon detectors for direct dark matter searches or neutrino property measurements. Precise simulation of photon transport is widely used to understand the probability of photon detection in liquid argon detectors. Traditional photon transport simulation, which tracks every photon using the Geant4 simulation toolkit, is a major computational challenge for kilo-tonne-scale liquid argon detectors and GeV-level energy depositions. In this work, we propose a one-dimensional generative model which efficiently generates features using an OuterProduct-layer. This model bypasses photon transport simulation and predicts the number of photons detected by particular photon detectors at the same level of detail as the Geant4 simulation. The application to simulating photon detection systems in kilo-tonne-scale liquid argon detectors demonstrates that this novel generative model is able to reproduce the Geant4 simulation with good accuracy while being 20 to 50 times faster. This generative model can be used to quickly predict photon detection probability in huge liquid argon detectors like ProtoDUNE or DUNE.
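    My loose reading of the OuterProduct-layer is that it forms pairwise products of the input features before a per-detector prediction head; the sketch below shows that feature construction only, with made-up input features, an untrained random head, and no claim to match the paper's actual architecture.

        import numpy as np

        def outer_product_features(x):
            # Second-order features: all pairwise products of the 1-D input vector.
            return np.outer(x, x).ravel()

        rng = np.random.default_rng(0)
        x = np.array([0.3, -1.2, 0.7, 2.0])           # stand-in input features for one energy deposition
        feats = outer_product_features(x)             # 16 pairwise-product features
        W = 0.1 * rng.normal(size=(5, feats.size))    # untrained linear head for 5 hypothetical detectors
        print(np.maximum(W @ feats, 0.0))             # non-negative stand-in "photon count" predictions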
    DROMO: Distributionally Robust Offline Model-based Policy Optimization. (arXiv:2109.07275v1 [cs.LG])
    (2 min) We consider the problem of offline reinforcement learning with model-based control, whose goal is to learn a dynamics model from the experience replay and obtain a pessimism-oriented agent under the learned model. Current model-based constraints include an explicit uncertainty penalty and implicit conservative regularization that pushes the Q-values of out-of-distribution state-action pairs down and those of in-distribution pairs up. While the uncertainty estimation on which the former relies can be loosely calibrated for complex dynamics, the latter performs slightly better. To extend the basic idea of regularization without uncertainty quantification, we propose distributionally robust offline model-based policy optimization (DROMO), which leverages the ideas in distributionally robust optimization to penalize a broader range of out-of-distribution state-action pairs beyond the standard empirical out-of-distribution Q-value minimization. We theoretically show that our method optimizes a lower bound on the ground-truth policy evaluation, and it can be incorporated into any existing policy gradient algorithm. We also analyze the theoretical properties of DROMO's linear and non-linear instantiations.
    HeMI: Multi-view Embedding in Heterogeneous Graphs. (arXiv:2109.07008v1 [cs.LG])
    (2 min) Many real-world graphs involve different types of nodes and relations between nodes, being heterogeneous by nature. The representation learning of heterogeneous graphs (HGs) embeds the rich structure and semantics of such graphs into a low-dimensional space and facilitates various data mining tasks, such as node classification, node clustering, and link prediction. In this paper, we propose a self-supervised method that learns HG representations by relying on knowledge exchange and discovery among different HG structural semantics (meta-paths). Specifically, by maximizing the mutual information of meta-path representations, we promote meta-path information fusion and consensus, and ensure that globally shared semantics are encoded. Through extensive experiments on node classification, node clustering, and link prediction tasks, we show that the proposed self-supervision outperforms competing methods by 1% and up to 10% across all tasks.
    End-to-End Learning of Flowchart Grounded Task-Oriented Dialogs. (arXiv:2109.07263v1 [cs.CL])
    (2 min) We propose a novel problem within end-to-end learning of task-oriented dialogs (TOD), in which the dialog system mimics a troubleshooting agent who helps a user by diagnosing their problem (e.g., car not starting). Such dialogs are grounded in domain-specific flowcharts, which the agent is supposed to follow during the conversation. Our task exposes novel technical challenges for neural TOD, such as grounding an utterance to the flowchart without explicit annotation, referring to additional manual pages when the user asks a clarification question, and the ability to follow unseen flowcharts at test time. We release a dataset (FloDial) consisting of 2,738 dialogs grounded on 12 different troubleshooting flowcharts. We also design a neural model, FloNet, which uses a retrieval-augmented generation architecture to train the dialog agent. Our experiments find that FloNet can do zero-shot transfer to unseen flowcharts, and sets a strong baseline for future research.
    Parallel Constraint-Driven Inductive Logic Programming. (arXiv:2109.07132v1 [cs.AI])
    (2 min) Multi-core machines are ubiquitous. However, most inductive logic programming (ILP) approaches use only a single core, which severely limits their scalability. To address this limitation, we introduce parallel techniques based on constraint-driven ILP where the goal is to accumulate constraints to restrict the hypothesis space. Our experiments on two domains (program synthesis and inductive general game playing) show that (i) parallelisation can substantially reduce learning times, and (ii) worker communication (i.e. sharing constraints) is important for good performance.
    EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation. (arXiv:2109.07222v1 [cs.CL])
    (2 min) Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices. In this paper, we make the critical observation that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA), since the computational cost of FFN is 2$\sim$3 times larger than that of MHA. Hence, to compact BERT, we focus on designing an efficient FFN, as opposed to previous works that pay attention to MHA. Since FFN comprises a multilayer perceptron (MLP) that is essential in BERT optimization, we further design a thorough search space towards an advanced MLP and perform a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate searching and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show our searched EfficientBERT is 6.9$\times$ smaller and 4.4$\times$ faster than BERT$\rm_{BASE}$, and has competitive performances on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average score on the GLUE test set, 0.7 higher than MobileBERT$\rm_{TINY}$, and achieves an 85.3/74.5 F1 score on the SQuAD v1.1/v2.0 dev sets, 3.2/2.7 higher than TinyBERT$_4$ even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.
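    The FFN-versus-MHA cost claim is easy to sanity-check with a back-of-the-envelope count for a BERT-base-like layer (hidden size 768, FFN size 3072). The snippet below counts per-layer multiply-accumulates only, ignores softmax, layer norm, and biases, and the exact ratio depends on sequence length, so treat it as a rough illustration rather than the paper's accounting.

        # Rough per-layer multiply-accumulate counts for one Transformer encoder layer.
        hidden, ffn, seq_len = 768, 3072, 128

        # Multi-head attention: Q/K/V projections + output projection,
        # plus the attention-score and context matmuls (which scale with seq_len).
        mha_proj = 4 * seq_len * hidden * hidden
        mha_attn = 2 * seq_len * seq_len * hidden
        mha_total = mha_proj + mha_attn

        # Feed-forward network: two dense layers, hidden -> ffn -> hidden.
        ffn_total = 2 * seq_len * hidden * ffn

        print(f"MHA MACs: {mha_total:,}")
        print(f"FFN MACs: {ffn_total:,}")
        print(f"FFN / MHA ratio: {ffn_total / mha_total:.2f}")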
    Avengers Ensemble! Improving Transferability of Authorship Obfuscation. (arXiv:2109.07028v1 [cs.LG])
    (2 min) Stylometric approaches have been shown to be quite effective for real-world authorship attribution. To mitigate the privacy threat posed by authorship attribution, researchers have proposed automated authorship obfuscation approaches that aim to conceal the stylometric artefacts that give away the identity of an anonymous document's author. Recent work has focused on authorship obfuscation approaches that rely on black-box access to an attribution classifier to evade attribution while preserving semantics. However, to be useful under a realistic threat model, it is important that these obfuscation approaches work well even when the adversary's attribution classifier is different from the one used internally by the obfuscator. Unfortunately, existing authorship obfuscation approaches do not transfer well to unseen attribution classifiers. In this paper, we propose an ensemble-based approach for transferable authorship obfuscation. Our experiments show that if an obfuscator can evade an ensemble attribution classifier, which is based on multiple base attribution classifiers, it is more likely to transfer to different attribution classifiers. Our analysis shows that ensemble-based authorship obfuscation achieves better transferability because it combines the knowledge from each of the base attribution classifiers by essentially averaging their decision boundaries.
    NBcoded: network attack classifiers based on Encoder and Naive Bayes model for resource limited devices. (arXiv:2109.07273v1 [cs.LG])
    (2 min) In recent years, cybersecurity has gained high relevance, making the detection of attacks or intrusions a key task. In fact, a small breach in a system, application, or network can cause huge damage to companies. However, when attack detection meets the Artificial Intelligence paradigm, it can be addressed using high-quality classifiers, which often come with high resource demands in terms of computation or memory usage. This becomes a major issue when attack classifiers need to run on resource-limited devices or without overloading the performance of those devices, as happens, for example, in IoT devices or industrial systems. To overcome this issue, this work proposes NBcoded, a novel lightweight attack classification tool. NBcoded works as a pipeline that combines an encoder's ability to remove noisy data properties with the low resource and time consumption of the Naive Bayes classifier. This work compares three different NBcoded implementations based on three different Naive Bayes likelihood distribution assumptions (Gaussian, Complement, and Bernoulli). Then, the best NBcoded is compared with state-of-the-art classifiers such as the Multilayer Perceptron and Random Forest. Our implementation proves to be the best model in terms of reducing training time and disk usage, even though it is outperformed by the other two in terms of accuracy and F1-score (by roughly 2%).
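    The pipeline idea (encode, then feed a Naive Bayes classifier, comparing the three likelihood assumptions) can be sketched with scikit-learn. In the sketch below the learned encoder is replaced by a TruncatedSVD plus min-max scaling stand-in so the snippet is self-contained, and the synthetic data is not network-traffic data; none of this reproduces NBcoded itself.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.decomposition import TruncatedSVD
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB, ComplementNB, BernoulliNB
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import MinMaxScaler

        X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                                   random_state=0)

        # Stand-in for the learned encoder: compress to a low-dimensional code and
        # rescale to [0, 1] so ComplementNB/BernoulliNB receive non-negative inputs.
        encoder = make_pipeline(TruncatedSVD(n_components=8, random_state=0),
                                MinMaxScaler(clip=True))

        for name, clf in [("Gaussian", GaussianNB()),
                          ("Complement", ComplementNB()),
                          ("Bernoulli", BernoulliNB())]:
            model = make_pipeline(encoder, clf)
            print(name, cross_val_score(model, X, y, cv=5).mean().round(3))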
    Evolutionary Reinforcement Learning Dynamics with Irreducible Environmental Uncertainty. (arXiv:2109.07259v1 [nlin.AO])
    (2 min) In this work we derive and present evolutionary reinforcement learning dynamics in which the agents are irreducibly uncertain about the current state of the environment. We evaluate the dynamics across different classes of partially observable agent-environment systems and find that irreducible environmental uncertainty can lead to better learning outcomes faster, stabilize the learning process, and overcome social dilemmas. However, as expected, we also find that partial observability may cause worse learning outcomes, for example in the form of a catastrophic limit cycle. Compared to fully observant agents, learning with irreducible environmental uncertainty often requires more exploration and less weight on future rewards to obtain the best learning outcomes. Furthermore, we find a range of dynamical effects induced by partial observability, e.g., a critical slowing down of the learning processes between reward regimes and the separation of the learning dynamics into fast and slow directions. The presented dynamics are a practical tool for researchers in biology, social science and machine learning to systematically investigate the evolutionary effects of environmental uncertainty.
    Federated Learning of Molecular Properties in a Heterogeneous Setting. (arXiv:2109.07258v1 [cs.LG])
    (2 min) Chemistry research has both high material and computational costs to conduct experiments. Institutions thus consider chemical data to be valuable and there have been few efforts to construct large public datasets for machine learning. Another challenge is that different institutions are interested in different classes of molecules, creating heterogeneous data that cannot be easily joined by conventional distributed training. In this work, we introduce federated heterogeneous molecular learning to address these challenges. Federated learning allows end-users to build a global model collaboratively while preserving the training data distributed over isolated clients. Due to the lack of related research, we first simulate a federated heterogeneous benchmark called FedChem. FedChem is constructed by jointly performing scaffold splitting and Latent Dirichlet Allocation on existing datasets. Our results on FedChem show that significant learning challenges arise when working with heterogeneous molecules. We then propose a method to alleviate the problem, namely Federated Learning by Instance reweighTing (FLIT). FLIT can align the local training across heterogeneous clients by improving the performance for uncertain samples. Comprehensive experiments conducted on our new benchmark FedChem validate the advantages of this method over other federated learning schemes. FedChem should enable a new type of collaboration for improving AI in chemistry that mitigates concerns about valuable chemical data.
    Quantitative reconstruction of defects in multi-layered bonded composites using fully convolutional network-based ultrasonic inversion. (arXiv:2109.07284v1 [cond-mat.mtrl-sci])
    (2 min) Ultrasonic methods have great potential applications to detect and characterize defects in multi-layered bonded composites. However, it remains challenging to quantitatively reconstruct defects, such as disbonds and kissing bonds, that influence the integrity of adhesive bonds and seriously reduce the strength of assemblies. In this work, an ultrasonic method based on the supervised fully convolutional network (FCN) is proposed to quantitatively reconstruct defects hidden in multi-layered bonded composites. In the training process of this method, an FCN establishes a non-linear mapping from measured ultrasonic data to the corresponding velocity models of multi-layered bonded composites. In the predicting process, the trained network obtained from the training process is used to directly reconstruct the velocity models from the new measured ultrasonic data of adhesively bonded composites. The presented FCN-based inversion method can automatically extract useful features in multi-layered composites. Although this method is computationally expensive in the training process, the prediction itself in the online phase takes only seconds. The numerical results show that the FCN-based ultrasonic inversion method is capable of accurately reconstructing ultrasonic velocity models of high-contrast defects, which has great potential for online detection of adhesively bonded composites.
    Self-Training with Differentiable Teacher. (arXiv:2109.07049v1 [cs.CL])
    (2 min) Self-training achieves enormous success in various semi-supervised and weakly-supervised learning tasks. The method can be interpreted as a teacher-student framework, where the teacher generates pseudo-labels, and the student makes predictions. The two models are updated alternately. However, such a straightforward alternating update rule leads to training instability. This is because a small change in the teacher may result in a significant change in the student. To address this issue, we propose a differentiable self-training method that treats the teacher-student pair as a Stackelberg game. In this game, a leader is always in a more advantageous position than a follower. In self-training, the student contributes to the prediction performance, and the teacher controls the training process by generating pseudo-labels. Therefore, we treat the student as the leader and the teacher as the follower. The leader procures its advantage by acknowledging the follower's strategy, which involves differentiable pseudo-labels and differentiable sample weights. Consequently, the leader-follower interaction can be effectively captured via the Stackelberg gradient, obtained by differentiating the follower's strategy. Experimental results on semi- and weakly-supervised classification and named entity recognition tasks show that our model outperforms existing approaches by large margins.
    Embedding Node Structural Role Identity Using Stress Majorization. (arXiv:2109.07023v1 [cs.SI])
    (2 min) Nodes in networks may have one or more functions that determine their role in the system. As opposed to local proximity, which captures the local context of nodes, the role identity captures the functional "role" that nodes play in a network, such as being the center of a group, or the bridge between two groups. This means that nodes far apart in a network can have similar structural role identities. Several recent works have explored methods for embedding the roles of nodes in networks. However, these methods all rely on either approximating or indirect modeling of structural equivalence. In this paper, we present a novel and flexible framework using stress majorization, to transform the high-dimensional role identities in networks directly (without approximation or indirect modeling) to a low-dimensional embedding space. Our method is also flexible, in that it does not rely on specific structural similarity definitions. We evaluated our method on the tasks of node classification, clustering, and visualization, using three real-world and five synthetic networks. Our experiments show that our framework achieves better results than existing methods in learning node role representations.
    Attention Is Indeed All You Need: Semantically Attention-Guided Decoding for Data-to-Text NLG. (arXiv:2109.07043v1 [cs.CL])
    (2 min) Ever since neural models were adopted in data-to-text language generation, they have invariably been reliant on extrinsic components to improve their semantic accuracy, because the models normally do not exhibit the ability to generate text that reliably mentions all of the information provided in the input. In this paper, we propose a novel decoding method that extracts interpretable information from encoder-decoder models' cross-attention, and uses it to infer which attributes are mentioned in the generated text, which is subsequently used to rescore beam hypotheses. Using this decoding method with T5 and BART, we show on three datasets its ability to dramatically reduce semantic errors in the generated outputs, while maintaining their state-of-the-art quality.
    Adversarial Mixing Policy for Relaxing Locally Linear Constraints in Mixup. (arXiv:2109.07177v1 [cs.CL])
    (2 min) Mixup is a recent regularizer for current deep classification networks. Through training a neural network on convex combinations of pairs of examples and their labels, it imposes locally linear constraints on the model's input space. However, such strict linear constraints often lead to under-fitting which degrades the effects of regularization. Notably, this issue becomes more serious when resources are extremely limited. To address these issues, we propose the Adversarial Mixing Policy (AMP), organized in a min-max-rand formulation, to relax the locally linear constraints in Mixup. Specifically, AMP adds a small adversarial perturbation to the mixing coefficients rather than the examples. Thus, slight non-linearity is injected in-between the synthetic examples and synthetic labels. By training on these data, the deep networks are further regularized, and thus achieve a lower predictive error rate. Experiments on five text classification benchmarks and five backbone models have empirically shown that our methods reduce the error rate over Mixup variants by a significant margin (up to 31.3%), especially in low-resource conditions (up to 17.5%).
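    The core trick (perturbing the mixing coefficient adversarially rather than the inputs) can be sketched in PyTorch with a single signed-gradient step on the mixup coefficient; the epsilon, the Beta prior, and the tiny model below are illustrative choices, not the paper's settings.

        import torch
        import torch.nn.functional as F

        def amp_mixup_loss(model, x, y, num_classes, alpha=0.2, eps=0.05):
            # Mixup in which the mixing coefficients receive a small adversarial nudge.
            idx = torch.randperm(x.size(0))
            x2, y2 = x[idx], y[idx]
            lam = torch.distributions.Beta(alpha, alpha).sample((x.size(0),))
            lam = lam.clamp(0.0, 1.0).requires_grad_(True)

            def mixed_loss(l):
                l_x = l.view(-1, *([1] * (x.dim() - 1)))   # broadcast over feature dims
                x_mix = l_x * x + (1 - l_x) * x2
                y_mix = (l.unsqueeze(1) * F.one_hot(y, num_classes).float()
                         + (1 - l).unsqueeze(1) * F.one_hot(y2, num_classes).float())
                return -(y_mix * F.log_softmax(model(x_mix), dim=1)).sum(1).mean()

            # Inner maximization: one signed-gradient step on the mixing coefficients.
            grad, = torch.autograd.grad(mixed_loss(lam), lam)
            lam_adv = (lam + eps * grad.sign()).clamp(0.0, 1.0).detach()
            return mixed_loss(lam_adv)                     # outer minimization

        model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32, 10))
        x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
        loss = amp_mixup_loss(model, x, y, num_classes=10)
        loss.backward()
        print(float(loss))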
    Back to Basics: Deep Reinforcement Learning in Traffic Signal Control. (arXiv:2109.07180v1 [cs.LG])
    (2 min) In this paper we revisit some of the fundamental premises for a reinforcement learning (RL) approach to self-learning traffic lights. We propose RLight, a combination of choices that offers robust performance and good generalization to unseen traffic flows. In particular, our main contributions are threefold: our lightweight and cluster-aware state representation leads to improved performance; we reformulate the MDP such that it skips redundant timesteps of yellow light, speeding up learning by 30%; and we investigate the action space and provide insight into the difference in performance between acyclic and cyclic phase transitions. Additionally, we provide insights into the generalisation of the methods to unseen traffic. Evaluations using the real-world Hangzhou traffic dataset show that RLight outperforms state-of-the-art rule-based and deep reinforcement learning algorithms, demonstrating the potential of RL-based methods to improve urban traffic flows.
    What Does The User Want? Information Gain for Hierarchical Dialogue Policy Optimisation. (arXiv:2109.07129v1 [cs.LG])
    (2 min) The dialogue management component of a task-oriented dialogue system is typically optimised via reinforcement learning (RL). Optimisation via RL is highly susceptible to sample inefficiency and instability. The hierarchical approach called Feudal Dialogue Management takes a step towards more efficient learning by decomposing the action space. However, it still suffers from instability due to the reward only being provided at the end of the dialogue. We propose the usage of an intrinsic reward based on information gain to address this issue. Our proposed reward favours actions that resolve uncertainty or query the user whenever necessary. It enables the policy to learn how to retrieve the users' needs efficiently, which is an integral aspect of every task-oriented conversation. Our algorithm, which we call FeudalGain, achieves state-of-the-art results in most environments of the PyDial framework, outperforming much more complex approaches. We confirm the sample efficiency and stability of our algorithm through experiments in simulation and a human trial.
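    A toy numerical illustration of an information-gain reward of this kind: the entropy of the belief over a single user-goal slot before versus after a clarifying question. The belief vectors are invented and this is not the FeudalGain algorithm itself.

        import numpy as np

        def entropy(p):
            p = np.asarray(p, dtype=float)
            p = p[p > 0]
            return float(-(p * np.log(p)).sum())

        # Belief over the value of one user-goal slot (e.g. food type),
        # before and after the system asks a clarifying question.
        belief_before = [0.25, 0.25, 0.25, 0.25]
        belief_after = [0.70, 0.10, 0.10, 0.10]

        info_gain = entropy(belief_before) - entropy(belief_after)
        print(f"intrinsic reward (information gain): {info_gain:.3f} nats")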
    WaveCorr: Correlation-savvy Deep Reinforcement Learning for Portfolio Management. (arXiv:2109.07005v1 [q-fin.PM])
    (2 min) The problem of portfolio management represents an important and challenging class of dynamic decision making problems, where rebalancing decisions need to be made over time with the consideration of many factors such as investors' preferences, trading environments, and market conditions. In this paper, we present a new portfolio policy network architecture for deep reinforcement learning (DRL) that can exploit cross-asset dependency information more effectively and achieve better performance than state-of-the-art architectures. In particular, we introduce a new property, referred to as asset permutation invariance, for portfolio policy networks that exploit multi-asset time series data, and design the first portfolio policy network, named WaveCorr, that preserves this invariance property when treating asset correlation information. At the core of our design is an innovative permutation invariant correlation processing layer. An extensive set of experiments are conducted using data from both Canadian (TSX) and American stock markets (S&P 500), and WaveCorr consistently outperforms other architectures with an impressive 3%-25% absolute improvement in terms of average annual return, and up to more than 200% relative improvement in average Sharpe ratio. We also measured an improvement of a factor of up to 5 in the stability of performance under random choices of initial asset ordering and weights. The stability of the network has been found to be particularly valuable by our industrial partner.
    WIP: Medical Incident Prediction Through Analysis of Electronic Medical Records Using Machine Learning: Fall Prediction. (arXiv:2109.07106v1 [cs.LG])
    (2 min) This paper reports our preliminary work on medical incident prediction in general, and fall risk prediction in particular, using machine learning. Data for the machine learning are generated only from a particular subset of the electronic medical records (EMR) at Osaka Medical and Pharmaceutical University Hospital. After conducting three experiments, namely (1) machine learning algorithm comparison, (2) handling of class imbalance, and (3) investigation of the contribution of explanatory variables to fall incident prediction, we find the investigation of explanatory variables to be the most effective.
    Powered Hawkes-Dirichlet Process: Challenging Textual Clustering using a Flexible Temporal Prior. (arXiv:2109.07170v1 [cs.LG])
    (2 min) The textual content of a document and its publication date are intertwined. For example, the publication of a news article on a topic is influenced by previous publications on similar issues, according to underlying temporal dynamics. However, it can be challenging to retrieve meaningful information when textual information conveys little information or when temporal dynamics are hard to unveil. Furthermore, the textual content of a document is not always linked to its temporal dynamics. We develop a flexible method to create clusters of textual documents according to both their content and publication time, the Powered Dirichlet-Hawkes process (PDHP). We show PDHP yields significantly better results than state-of-the-art models when temporal information or textual content is weakly informative. PDHP also relaxes the hypothesis that textual content and temporal dynamics are always perfectly correlated; when they are not, it can still retrieve textual clusters, temporal clusters, or a mixture of both with high accuracy. We demonstrate that PDHP generalizes previous work, such as the Dirichlet-Hawkes process (DHP) and the Uniform process (UP). Finally, we illustrate the changes induced by PDHP over DHP and UP in a real-world application using Reddit data.
    F-CAM: Full Resolution CAM via Guided Parametric Upscaling. (arXiv:2109.07069v1 [cs.CV])
    (2 min) Class Activation Mapping (CAM) methods have recently gained much attention for weakly-supervised object localization (WSOL) tasks, allowing for CNN visualization and interpretation without training on fully annotated image datasets. CAM methods are typically integrated within off-the-shelf CNN backbones, such as ResNet50. Due to convolution and downsampling/pooling operations, these backbones yield low resolution CAMs with a down-scaling factor of up to 32, making accurate localization more difficult. Interpolation is required to restore full-size CAMs, but it does not consider the statistical properties of the objects, leading to activations with inconsistent boundaries and inaccurate localizations. As an alternative, we introduce a generic method for parametric upscaling of CAMs that allows constructing accurate full resolution CAMs (F-CAMs). In particular, we propose a trainable decoding architecture that can be connected to any CNN classifier to produce more accurate CAMs. Given an original (low resolution) CAM, foreground and background pixels are randomly sampled for fine-tuning the decoder. Additional priors such as image statistics and size constraints are also considered to expand and refine object boundaries. Extensive experiments using three CNN backbones and six WSOL baselines on the CUB-200-2011 and OpenImages datasets indicate that our F-CAM method yields a significant improvement in CAM localization accuracy. F-CAM performance is competitive with state-of-the-art WSOL methods, yet it requires fewer computational resources during inference.
    Deploying clinical machine learning? Consider the following.... (arXiv:2109.06919v1 [cs.LG])
    (2 min) Despite the intense attention and investment into clinical machine learning (CML) research, relatively few applications convert to clinical practice. While research is important in advancing the state-of-the-art, translation is equally important in bringing these technologies into a position to ultimately impact patient care and live up to extensive expectations surrounding AI in healthcare. To better characterize a holistic perspective among researchers and practitioners, we survey several participants with experience in developing CML for clinical deployment about their learned experiences. We collate these insights and identify several main categories of barriers and pitfalls in order to better design and develop clinical machine learning applications.
    Fusion with Hierarchical Graphs for Multimodal Emotion Recognition. (arXiv:2109.07149v1 [cs.MM])
    (2 min) Automatic emotion recognition (AER) based on enriched multimodal inputs, including text, speech, and visual clues, is crucial in the development of emotionally intelligent machines. Although complex modality relationships have been proven effective for AER, they are still largely underexplored because previous works predominantly relied on various fusion mechanisms with simply concatenated features to learn multimodal representations for emotion classification. This paper proposes a novel hierarchical fusion graph convolutional network (HFGCN) model that learns more informative multimodal representations by considering the modality dependencies during the feature fusion procedure. Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation. We verified the interpretable capabilities of the proposed method by projecting the emotional states to a 2D valence-arousal (VA) subspace. Extensive experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets, IEMOCAP and MELD.
    Building Accurate Simple Models with Multihop. (arXiv:2109.06961v1 [cs.LG])
    (2 min) Knowledge transfer from a complex high performing model to a simpler and potentially low performing one in order to enhance its performance has been of great interest over the last few years as it finds applications in important problems such as explainable artificial intelligence, model compression, robust model building and learning from small data. Known approaches to this problem (viz. Knowledge Distillation, Model compression, ProfWeight, etc.) typically transfer information directly (i.e. in a single/one hop) from the complex model to the chosen simple model through schemes that modify the target or reweight training examples on which the simple model is trained. In this paper, we propose a meta-approach where we transfer information from the complex model to the simple model by dynamically selecting and/or constructing a sequence of intermediate models of decreasing complexity that are less intricate than the original complex model. Our approach can transfer information between consecutive models in the sequence using any of the previously mentioned approaches as well as work in 1-hop fashion, thus generalizing these approaches. In the experiments on real data, we observe that we get consistent gains for different choices of models over 1-hop, which on average is more than 2% and reaches up to 8% in a particular case. We also empirically analyze conditions under which the multi-hop approach is likely to be beneficial over the traditional 1-hop approach, and report other interesting insights. To the best of our knowledge, this is the first work that proposes such a multi-hop approach to perform knowledge transfer given a single high performing complex model, making it in our opinion, an important methodological contribution.
    Learning and Decision-Making with Data: Optimal Formulations and Phase Transitions. (arXiv:2109.06911v1 [stat.ML])
    (2 min) We study the problem of designing optimal learning and decision-making formulations when only historical data is available. Prior work typically commits to a particular class of data-driven formulation and subsequently tries to establish out-of-sample performance guarantees. We take here the opposite approach. We first define a sensible yardstick with which to measure the quality of any data-driven formulation and subsequently seek to find an optimal such formulation. Informally, any data-driven formulation can be seen to balance a measure of proximity of the estimated cost to the actual cost while guaranteeing a level of out-of-sample performance. Given an acceptable level of out-of-sample performance, we construct explicitly a data-driven formulation that is uniformly closer to the true cost than any other formulation enjoying the same out-of-sample performance. We show the existence of three distinct out-of-sample performance regimes (a superexponential regime, an exponential regime and a subexponential regime) between which the nature of the optimal data-driven formulation experiences a phase transition. The optimal data-driven formulations can be interpreted as a classically robust formulation in the superexponential regime, an entropic distributionally robust formulation in the exponential regime and finally a variance penalized formulation in the subexponential regime. This final observation unveils a surprising connection between these three, at first glance seemingly unrelated, data-driven formulations which until now remained hidden.
    Internet of Behavior (IoB) and Explainable AI Systems for Influencing IoT Behavior. (arXiv:2109.07239v1 [cs.DC])
    (2 min) Pandemics and natural disasters over the years have changed the behavior of people, which has had a tremendous impact on all life aspects. With the technologies available in each era, governments, organizations, and companies have used these technologies to track, control, and influence the behavior of individuals for a benefit. Nowadays, the use of the Internet of Things (IoT), cloud computing, and artificial intelligence (AI) have made it easier to track and change the behavior of users through changing IoT behavior. This article introduces and discusses the concept of the Internet of Behavior (IoB) and its integration with Explainable AI (XAI) techniques to provide a trustworthy and transparent experience in the process of changing IoT behavior and ultimately improving users' behavior. A system based on IoB and XAI is proposed in a use case scenario of electrical power consumption that aims to influence user consumption behavior in order to reduce power consumption and cost. The scenario results showed a decrease of 522.2 kW of active power when compared to original consumption over a 200-hour period. It also showed a total power cost saving of 95.04 Euro for the same period. Moreover, given the positive correlation between them, decreasing the global active power will also reduce the power intensity.
    Optimal Cycling of a Heterogeneous Battery Bank via Reinforcement Learning. (arXiv:2109.07137v1 [cs.LG])
    (2 min) We consider the problem of optimal charging/discharging of a bank of heterogeneous battery units, driven by stochastic electricity generation and demand processes. The batteries in the battery bank may differ with respect to their capacities, ramp constraints, losses, as well as cycling costs. The goal is to minimize the degradation costs associated with battery cycling in the long run; this is posed formally as a Markov decision process. We propose a linear function approximation based Q-learning algorithm for learning the optimal solution, using a specially designed class of kernel functions that approximate the structure of the value functions associated with the MDP. The proposed algorithm is validated via an extensive case study.
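    A compact sketch of what Q-learning with linear function approximation looks like on a toy single-battery storage problem; the polynomial feature map, the dynamics, and the cost model below are stand-ins, not the paper's kernel design or case study.

        import numpy as np

        rng = np.random.default_rng(0)
        actions = np.array([-0.1, 0.0, 0.1])            # discharge / idle / charge

        def features(s, a):
            # Simple polynomial features of the state of charge s and the action a.
            return np.array([1.0, s, s * s, a, a * a, s * a])

        w = np.zeros(6)                                  # linear Q(s, a) = w . phi(s, a)
        alpha, gamma, eps = 0.02, 0.95, 0.1
        s = 0.5
        for t in range(10000):
            if rng.random() < eps:                       # epsilon-greedy action choice
                a = rng.choice(actions)
            else:
                a = actions[np.argmax([features(s, b) @ w for b in actions])]
            net_demand = rng.normal(0.0, 0.05)           # stochastic demand minus generation
            s_next = float(np.clip(s + a - net_demand, 0.0, 1.0))
            cost = abs(a) * 0.5 + max(net_demand - a, 0.0)   # cycling cost + unmet demand
            q_next = max(features(s_next, b) @ w for b in actions)
            td_error = -cost + gamma * q_next - features(s, a) @ w
            w += alpha * td_error * features(s, a)       # semi-gradient Q-learning update
            s = s_next
        print("learned weights:", np.round(w, 3))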
    Integrating Sensing and Communication in Cellular Networks via NR Sidelink. (arXiv:2109.07253v1 [cs.CV])
    (2 min) RF-sensing, the analysis and interpretation of movement or environment-induced patterns in received electromagnetic signals, has been actively investigated for more than a decade. Since electromagnetic signals, through cellular communication systems, are omnipresent, RF sensing has the potential to become a universal sensing mechanism with applications in smart home, retail, localization, gesture recognition, intrusion detection, etc. Specifically, existing cellular network installations might be dual-used for both communication and sensing. Such convergence of communications and sensing is envisioned for future communication networks. We propose the use of NR-sidelink direct device-to-device communication to achieve device-initiated, flexible sensing capabilities in beyond-5G cellular communication systems. In this article, we specifically investigate a common issue related to sidelink-based RF-sensing, which is its angle and rotation dependence. In particular, we discuss transformations of mmWave point-cloud data which achieve rotational invariance, as well as distributed processing based on such rotational invariant inputs, at angle and distance diverse devices. To process the distributed data, we propose a graph based encoder to capture spatio-temporal features of the data and propose four approaches for multi-angle learning. The approaches are compared on a newly recorded and openly available dataset comprising 15 subjects, performing 21 gestures which are recorded from 8 angles.
    Risk Measurement, Risk Entropy, and Autonomous Driving Risk Modeling. (arXiv:2109.07211v1 [q-fin.RM])
    (2 min) Big data from autonomous vehicles has long been used for perception, prediction, planning, and control of driving. Naturally, it is increasingly asked why this big data is not also used for risk management and actuarial modeling. This article examines the emerging technical difficulties, new ideas, and methods of risk modeling under autonomous driving scenarios. Compared with the traditional risk model, the novel model is more consistent with the real road traffic and driving safety performance. More importantly, it provides technical feasibility for realizing risk assessment and car insurance pricing under a computer simulation environment.
    Automatic Symmetry Discovery with Lie Algebra Convolutional Network. (arXiv:2109.07103v1 [cs.LG])
    (2 min) Existing equivariant neural networks for continuous groups require discretization or group representations. All these approaches require detailed knowledge of the group parametrization and cannot learn entirely new symmetries. We propose to work with the Lie algebra (infinitesimal generators) instead of the Lie group. Our model, the Lie algebra convolutional network (L-conv), can learn potential symmetries and does not require discretization of the group. We show that L-conv can serve as a building block to construct any group equivariant architecture. We discuss how CNNs and Graph Convolutional Networks are related to and can be expressed as L-conv with appropriate groups. We also derive the MSE loss for a single L-conv layer and find a deep relation with Lagrangians used in physics, with some of the physics aiding in defining generalization and symmetries in the loss landscape. Conversely, L-conv could be used to propose more general equivariant ansätze for scientific machine learning.
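    A toy NumPy sketch of the underlying Lie-algebra idea, an update of the form x + sum_i eps_i L_i x with learnable infinitesimal generators L_i; the dimensions and random initialisation are illustrative, and this is not the full L-conv layer of the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        d, n_generators, batch = 16, 3, 8

        # Learnable infinitesimal generators L_i (d x d) and coefficients eps_i.
        L = rng.normal(scale=0.1, size=(n_generators, d, d))
        eps = rng.normal(scale=0.1, size=(n_generators,))

        def lie_layer(x, L, eps):
            # x: (batch, d). Adds a learned infinitesimal transformation, i.e. a
            # first-order approximation of acting with exp(sum_i eps_i L_i).
            delta = np.einsum('i,ijk,bk->bj', eps, L, x)
            return x + delta

        x = rng.normal(size=(batch, d))
        print(lie_layer(x, L, eps).shape)    # (8, 16)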
    Universal Adversarial Attack on Deep Learning Based Prognostics. (arXiv:2109.07142v1 [cs.LG])
    (2 min) Deep learning-based time series models are being extensively utilized in engineering and manufacturing industries for process control and optimization, asset monitoring, diagnostic and predictive maintenance. These models have shown great improvement in the prediction of the remaining useful life (RUL) of industrial equipment but suffer from inherent vulnerability to adversarial attacks. These attacks can be easily exploited and can lead to catastrophic failure of critical industrial equipment. In general, different adversarial perturbations are computed for each instance of the input data. This is, however, difficult for the attacker to achieve in real time due to higher computational requirements and the lack of uninterrupted access to the input data. Hence, we present the concept of universal adversarial perturbation, a special imperceptible noise to fool regression based RUL prediction models. Attackers can easily utilize universal adversarial perturbations for real-time attack since continuous access to input data and repetitive computation of adversarial perturbations are not a prerequisite for the same. We evaluate the effect of universal adversarial attacks using the NASA turbofan engine dataset. We show that addition of universal adversarial perturbation to any instance of the input data increases error in the output predicted by the model. To the best of our knowledge, we are the first to study the effect of the universal adversarial perturbation on time series regression models. We further demonstrate the effect of varying the strength of perturbations on RUL prediction models and find that model accuracy decreases as the perturbation strength of the universal adversarial attack increases. We also showcase that universal adversarial perturbation can be transferred across different models.
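    A rough PyTorch sketch of building one shared ("universal") additive perturbation by accumulating signed gradients over many input windows of a generic regressor; the step size, clipping bound, toy model, and random tensors standing in for the turbofan data are all assumptions.

        import torch

        def universal_perturbation(model, loader, eps=0.1, step=0.01, epochs=3):
            # One shared additive perturbation that raises the regression error on
            # every window it is added to (FGSM-style accumulation across batches).
            delta = None
            for _ in range(epochs):
                for x, y in loader:
                    if delta is None:
                        delta = torch.zeros_like(x[0])
                    d = delta.clone().requires_grad_(True)
                    loss = torch.nn.functional.mse_loss(model(x + d), y)
                    loss.backward()
                    delta = (delta + step * d.grad.sign()).clamp(-eps, eps)
            return delta

        # Toy RUL-style regressor on 30-step windows of 14 sensors (illustrative only).
        model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(30 * 14, 1))
        xs, ys = torch.randn(64, 30, 14), torch.rand(64, 1) * 100
        loader = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(xs, ys), batch_size=16)
        delta = universal_perturbation(model, loader)
        clean = torch.nn.functional.mse_loss(model(xs), ys)
        attacked = torch.nn.functional.mse_loss(model(xs + delta), ys)
        print(float(clean), float(attacked))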
    Learning Mathematical Properties of Integers. (arXiv:2109.07230v1 [cs.CL])
    (2 min) Embedding words in high-dimensional vector spaces has proven valuable in many natural language applications. In this work, we investigate whether similarly-trained embeddings of integers can capture concepts that are useful for mathematical applications. We probe the integer embeddings for mathematical knowledge, apply them to a set of numerical reasoning tasks, and show that by learning the representations from mathematical sequence data, we can substantially improve over number embeddings learned from English text corpora.
    A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence. (arXiv:2109.06932v1 [cs.CR])
    (2 min) The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled and subsequently leveraged into actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.
    Non-linear Independent Dual System (NIDS) for Discretization-independent Surrogate Modeling over Complex Geometries. (arXiv:2109.07018v1 [physics.comp-ph])
    (2 min) Numerical solution of partial differential equations (PDEs) requires expensive simulations, limiting their application in design optimization routines, model-based control, or solution of large-scale inverse problems. Existing Convolutional Neural Network-based frameworks for surrogate modeling require lossy pixelization and data-preprocessing, which is not suitable for realistic engineering applications. Therefore, we propose the non-linear independent dual system (NIDS), a deep learning surrogate model for discretization-independent, continuous representation of PDE solutions, which can be used for prediction over domains with complex, variable geometries and mesh topologies. NIDS leverages implicit neural representations to develop a non-linear mapping from problem parameters and spatial coordinates to state predictions by combining evaluations of a case-wise parameter network and a point-wise spatial network in a linear output layer. The input features of the spatial network include physical coordinates augmented by a minimum distance function evaluation to implicitly encode the problem geometry. The form of the overall output layer induces a dual system, where each term in the map is non-linear and independent. Further, we propose a minimum distance function-driven weighted sum of NIDS models using a shared parameter network to enforce boundary conditions by construction under certain restrictions. The framework is applied to predict solutions around complex, parametrically-defined geometries on non-parametrically-defined meshes, with solutions obtained many orders of magnitude faster than with full-order models. Test cases include a vehicle aerodynamics problem with complex geometry and data scarcity, enabled by a training method in which more cases are gradually added as training progresses.
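    A bare-bones PyTorch sketch of the dual-system output structure described above: a case-wise parameter network and a point-wise spatial network whose outputs are combined in a linear (inner-product) output layer. The network sizes, widths, and the minimum-distance feature are placeholders, not the configuration used in the paper.

        import torch
        import torch.nn as nn

        class DualSystemSurrogate(nn.Module):
            # u(x; mu) ~ <param_net(mu), spatial_net([x, d(x)])>: a linear output layer
            # combining a case-wise branch and a point-wise branch (illustrative).
            def __init__(self, n_params=4, n_coords=2, width=64, n_latent=32):
                super().__init__()
                self.param_net = nn.Sequential(nn.Linear(n_params, width), nn.Tanh(),
                                               nn.Linear(width, n_latent))
                # +1 input for a minimum-distance-to-geometry feature.
                self.spatial_net = nn.Sequential(nn.Linear(n_coords + 1, width), nn.Tanh(),
                                                 nn.Linear(width, n_latent))

            def forward(self, mu, coords, min_dist):
                a = self.param_net(mu)                          # (batch, n_latent)
                b = self.spatial_net(torch.cat([coords, min_dist], dim=-1))
                return (a * b).sum(dim=-1, keepdim=True)        # (batch, 1) field value

        model = DualSystemSurrogate()
        mu = torch.randn(10, 4)          # case parameters (e.g. inflow conditions)
        coords = torch.rand(10, 2)       # query points
        min_dist = torch.rand(10, 1)     # distance-to-geometry feature
        print(model(mu, coords, min_dist).shape)   # torch.Size([10, 1])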
    Balancing detectability and performance of attacks on the control channel of Markov Decision Processes. (arXiv:2109.07171v1 [eess.SY])
    (2 min) We investigate the problem of designing optimal stealthy poisoning attacks on the control channel of Markov decision processes (MDPs). This research is motivated by the recent interest of the research community in adversarial and poisoning attacks applied to MDPs and reinforcement learning (RL) methods. The policies resulting from these methods have been shown to be vulnerable to attacks perturbing the observations of the decision-maker. In such an attack, drawing inspiration from adversarial examples used in supervised learning, the amplitude of the adversarial perturbation is limited according to some norm, with the hope that this constraint will make the attack imperceptible. However, such constraints do not grant any level of undetectability and do not take into account the dynamic nature of the underlying Markov process. In this paper, we propose a new attack formulation, based on information-theoretical quantities, that considers the objective of minimizing the detectability of the attack as well as the performance of the controlled process. We analyze the trade-off between the efficiency of the attack and its detectability. We conclude with examples and numerical simulations illustrating this trade-off.
    Improving Robustness and Efficiency in Active Learning with Contrastive Loss. (arXiv:2109.06873v1 [cs.LG])
    (2 min) This paper introduces supervised contrastive active learning (SCAL) by leveraging the contrastive loss for active learning in a supervised setting. We propose efficient query strategies in active learning to select unbiased and informative data samples of diverse feature representations. We demonstrate our proposed method reduces sampling bias, achieves state-of-the-art accuracy and model calibration in an active learning setup, with the query computation 11x faster than CoreSet and 26x faster than Bayesian active learning by disagreement. Our method yields well-calibrated models even with imbalanced datasets. We also evaluate robustness to dataset shift and out-of-distribution data in an active learning setup and demonstrate that our proposed SCAL method outperforms high-performing compute-intensive methods by a considerable margin (on average 8.9% higher AUROC for out-of-distribution detection and 7.2% lower ECE under dataset shift).
    Explainable Identification of Dementia from Transcripts using Transformer Networks. (arXiv:2109.06980v1 [cs.CL])
    (2 min) Alzheimer's disease (AD) is the main cause of dementia, which is accompanied by loss of memory and may lead to severe consequences in people's everyday lives if not diagnosed in time. Very few works have exploited transformer-based networks, and despite the high accuracy achieved, little work has been done in terms of model interpretability. In addition, although Mini-Mental State Exam (MMSE) scores are inextricably linked with the identification of dementia, existing works treat dementia identification and MMSE score prediction as two separate tasks. In order to address these limitations, we employ several transformer-based models, with BERT achieving the highest accuracy of 85.56%. Concurrently, we propose an interpretable method to detect AD patients based on siamese networks, reaching accuracy of up to 81.18%. Next, we introduce two multi-task learning models, where the main task refers to the identification of dementia (binary classification), while the auxiliary one corresponds to the identification of the severity of dementia (multiclass classification). Our model obtains accuracy equal to 84.99% on the detection of AD patients in the multi-task learning setting. Finally, we present some new methods to identify the linguistic patterns used by AD patients and non-AD ones, including text statistics, vocabulary uniqueness, word usage, correlations via a detailed linguistic analysis, and explainability techniques (LIME). Findings indicate significant differences in language between AD and non-AD patients.
    Targeted Cross-Validation. (arXiv:2109.06949v1 [stat.ML])
    (2 min) In many applications, we have access to the complete dataset but are only interested in the prediction of a particular region of predictor variables. A standard approach is to find the globally best modeling method from a set of candidate methods. However, it is perhaps rare in reality that one candidate method is uniformly better than the others. A natural approach for this scenario is to apply a weighted $L_2$ loss in performance assessment to reflect the region-specific interest. We propose a targeted cross-validation (TCV) to select models or procedures based on a general weighted $L_2$ loss. We show that the TCV is consistent in selecting the best performing candidate under the weighted $L_2$ loss. Experimental studies are used to demonstrate the use of TCV and its potential advantage over the global CV or the approach of using only local data for modeling a local region. Previous investigations on CV have relied on the condition that when the sample size is large enough, the ranking of two candidates stays the same. However, in many applications with the setup of changing data-generating processes or highly adaptive modeling methods, the relative performance of the methods is not static as the sample size varies. Even with a fixed data-generating process, it is possible that the ranking of two methods switches infinitely many times. In this work, we broaden the concept of the selection consistency by allowing the best candidate to switch as the sample size varies, and then establish the consistency of the TCV. This flexible framework can be applied to high-dimensional and complex machine learning scenarios where the relative performances of modeling procedures are dynamic.
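    A small sketch of cross-validation under a region-weighted squared loss, comparing two candidate regressors; the weight function (emphasising x > 0.7), the models, and the synthetic data are arbitrary illustrations of the idea rather than the paper's TCV procedure.

        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import KFold
        from sklearn.neighbors import KNeighborsRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 1, size=(400, 1))
        y = np.sin(6 * X[:, 0]) + rng.normal(0, 0.2, size=400)

        def region_weight(x):
            # Up-weight the region of predictor space we actually care about.
            return np.where(x[:, 0] > 0.7, 5.0, 1.0)

        def weighted_cv_error(model, X, y, n_splits=5):
            errs = []
            for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
                pred = model.fit(X[tr], y[tr]).predict(X[te])
                errs.append(np.average((pred - y[te]) ** 2, weights=region_weight(X[te])))
            return float(np.mean(errs))

        for name, m in [("linear", LinearRegression()),
                        ("5-NN", KNeighborsRegressor(n_neighbors=5))]:
            print(name, round(weighted_cv_error(m, X, y), 4))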
    Graph Embedding via Diffusion-Wavelets-Based Node Feature Distribution Characterization. (arXiv:2109.07016v1 [cs.LG])
    (2 min) Recent years have seen a rise in the development of representational learning methods for graph data. Most of these methods, however, focus on node-level representation learning at various scales (e.g., microscopic, mesoscopic, and macroscopic node embedding). In comparison, methods for representation learning on whole graphs are currently relatively sparse. In this paper, we propose a novel unsupervised whole graph embedding method. Our method uses spectral graph wavelets to capture topological similarities on each k-hop sub-graph between nodes and uses them to learn embeddings for the whole graph. We evaluate our method against 12 well-known baselines on 4 real-world datasets and show that our method achieves the best performance across all experiments, outperforming the current state-of-the-art by a considerable margin.
    Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With Eligibility Trace Under Reward, Policy, and Advantage Feedback. (arXiv:2109.07054v1 [cs.LG])
    (2 min) Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to various feedback schemes human trainers are likely to provide. This work analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three different types of feedback: policy feedback, reward feedback, and advantage feedback. For these three feedback types, we find that COACH can behave sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types. We compare our COACH variant with two other reinforcement-learning algorithms: Q-learning and TAMER.
    Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Streaming Data. (arXiv:2109.07117v1 [cs.LG])
    (2 min) Motivated by the high-frequency data streams continuously generated, real-time learning is becoming increasingly important. These data streams should be processed sequentially with the property that the stream may change over time. In this streaming setting, we propose techniques for minimizing a convex objective through unbiased estimates of its gradients, commonly referred to as stochastic approximation problems. Our methods rely on stochastic approximation algorithms due to their computational advantage, as they only use the previous iterate as a parameter estimate. Our analysis includes iterate averaging, which guarantees optimal statistical efficiency under classical conditions. Our non-asymptotic analysis shows accelerated convergence by selecting the learning rate according to the expected data streams. We show that the averaged estimate converges optimally and robustly for any data stream rate. In addition, noise reduction can be achieved by processing the data in a specific pattern, which is advantageous for large-scale machine learning. These theoretical results are illustrated for various data streams, showing the effectiveness of the proposed algorithms.
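    A small NumPy sketch of streaming stochastic approximation with Polyak-Ruppert iterate averaging on a least-squares objective; the step-size schedule and the synthetic stream are illustrative choices, not the rates analysed in the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        d = 5
        w_true = rng.normal(size=d)

        w = np.zeros(d)        # current SGD iterate
        w_bar = np.zeros(d)    # Polyak-Ruppert running average of the iterates
        for t in range(1, 100001):
            # One new observation arrives from the stream.
            x = rng.normal(size=d)
            y = x @ w_true + rng.normal(0, 0.1)
            grad = (x @ w - y) * x               # gradient of 0.5 * (x.w - y)^2
            w -= (0.5 * t ** -0.6) * grad        # slowly decaying step size
            w_bar += (w - w_bar) / t             # running average

        print("last-iterate error:", np.linalg.norm(w - w_true))
        print("averaged error    :", np.linalg.norm(w_bar - w_true))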
    Reconstruction on Trees and Low-Degree Polynomials. (arXiv:2109.06915v1 [math.PR])
    (2 min) The study of Markov processes and broadcasting on trees has deep connections to a variety of areas including statistical physics, phylogenetic reconstruction, MCMC algorithms, and community detection in random graphs. Notably, the celebrated Belief Propagation (BP) algorithm achieves Bayes-optimal performance for the reconstruction problem of predicting the value of the Markov process at the root of the tree from its values at the leaves. Recently, the analysis of low-degree polynomials has emerged as a valuable tool for predicting computational-to-statistical gaps. In this work, we investigate the performance of low-degree polynomials for the reconstruction problem on trees. Perhaps surprisingly, we show that there are simple tree models with $N$ leaves where (1) nontrivial reconstruction of the root value is possible with a simple polynomial time algorithm and with robustness to noise, but not with any polynomial of degree $N^{c}$ for $c > 0$ a constant, and (2) when the tree is unknown and given multiple samples with correlated root assignments, nontrivial reconstruction of the root value is possible with a simple, noise-robust, and computationally efficient SQ (Statistical Query) algorithm but not with any polynomial of degree $N^c$. These results clarify some of the limitations of low-degree polynomials vs. polynomial time algorithms for Bayesian estimation problems. They also complement recent work of Moitra, Mossel, and Sandon who studied the circuit complexity of Belief Propagation. We pose related open questions about low-degree polynomials and the Kesten-Stigum threshold.
    Testing Self-Organized Criticality Across the Main Sequence using Stellar Flares from TESS. (arXiv:2109.07011v1 [astro-ph.SR])
    (2 min) Stars produce explosive flares, which are believed to be powered by the release of energy stored in coronal magnetic field configurations. It has been shown that solar flares exhibit energy distributions typical of self-organized critical systems. This study applies a novel flare detection technique to data obtained by NASA's TESS mission and identifies $\sim10^6$ flaring events on $\sim10^5$ stars across spectral types. Our results suggest that magnetic reconnection events that maintain the topology of the magnetic field in a self-organized critical state are ubiquitous among stellar coronae.
    Scalable Average Consensus with Compressed Communications. (arXiv:2109.06996v1 [math.OC])
    (2 min) We propose a new decentralized average consensus algorithm with compressed communication that scales linearly with the network size n. We prove that the proposed method converges to the average of the initial values held locally by the agents of a network when agents are allowed to communicate with compressed messages. The proposed algorithm works for a broad class of compression operators (possibly biased), where agents interact over arbitrary static, undirected, and connected networks. We further present numerical experiments that confirm our theoretical results and illustrate the scalability and communication efficiency of our algorithm.
    Combining GEDI and Sentinel-2 for wall-to-wall mapping of tall and short crops. (arXiv:2109.06972v1 [eess.IV])
    (3 min) High resolution crop type maps are an important tool for improving food security, and remote sensing is increasingly used to create such maps in regions that possess ground truth labels for model training. However, these labels are absent in many regions, and models trained in other regions on typical satellite features, such as those from optical sensors, often exhibit low performance when transferred. Here we explore the use of NASA's Global Ecosystem Dynamics Investigation (GEDI) spaceborne lidar instrument, combined with Sentinel-2 optical data, for crop type mapping. Using data from three major cropped regions (in China, France, and the United States) we first demonstrate that GEDI energy profiles are capable of reliably distinguishing maize, a crop typically above 2m in height, from crops like rice and soybean that are shorter. We further show that these GEDI profiles provide much more invariant features across geographies compared to spectral and phenological features detected by passive optical sensors. GEDI is able to distinguish maize from other crops within each region with accuracies higher than 84%, and able to transfer across regions with accuracies higher than 82% compared to 64% for transfer of optical features. Finally, we show that GEDI profiles can be used to generate training labels for models based on optical imagery from Sentinel-2, thereby enabling the creation of 10m wall-to-wall maps of tall versus short crops in label-scarce regions. As maize is the second most widely grown crop in the world and often the only tall crop grown within a landscape, we conclude that GEDI offers great promise for improving global crop type maps.
    Behavior of k-NN as an Instance-Based Explanation Method. (arXiv:2109.06999v1 [cs.LG])
    (2 min) Adoption of DL models in critical areas has led to an escalating demand for sound explanation methods. Instance-based explanation methods are a popular type that return selective instances from the training set to explain the predictions for a test sample. One way to connect these explanations with prediction is to ask the following counterfactual question: how does the loss and prediction for a test sample change when explanations are removed from the training set? Our paper answers this question for k-NNs, which are natural contenders for an instance-based explanation method. We first demonstrate empirically that the representation space induced by the last layer of a neural network is the best in which to perform k-NN. Using this layer, we conduct our experiments and compare them to influence functions (IFs) (Koh and Liang, 2017), which try to answer a similar question. Our evaluations do indicate change in loss and predictions when explanations are removed, but we do not find a trend between $k$ and loss or prediction change. We find notable differences in the stability of predictions and loss between MNIST and CIFAR-10. Surprisingly, we do not observe much difference in the behavior of k-NNs vs. IFs on this question. We attribute this to training set subsampling for IFs.
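    A short sketch of the k-NN-as-explanation setup on penultimate-layer features; here the trained network's representation is faked with a fixed random projection plus ReLU so the snippet stays self-contained, whereas the paper extracts features from an actual trained model's last layer.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(0)

        def last_layer_features(x, W):
            # Stand-in for penultimate-layer activations of a trained network.
            return np.maximum(x @ W, 0.0)

        W = rng.normal(size=(784, 64))
        x_train = rng.normal(size=(1000, 784))
        x_test = rng.normal(size=(5, 784))

        train_feats = last_layer_features(x_train, W)
        test_feats = last_layer_features(x_test, W)

        # For each test sample, its k nearest training points in representation
        # space serve as the instance-based explanation.
        nn_index = NearestNeighbors(n_neighbors=5).fit(train_feats)
        dist, idx = nn_index.kneighbors(test_feats)
        print(idx)    # indices of explanatory training instances per test sample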

2021-09-15

  • cs.CL updates on arXiv.org

    Sequential Modelling with Applications to Music Recommendation, Fact-Checking, and Speed Reading. (arXiv:2109.06736v1 [cs.IR])
    (2 min) Sequential modelling entails making sense of sequential data, which naturally occurs in a wide array of domains. One example is systems that interact with users, log user actions and behaviour, and make recommendations of items of potential interest to users on the basis of their previous interactions. In such cases, the sequential order of user interactions is often indicative of what the user is interested in next. Similarly, for systems that automatically infer the semantics of text, capturing the sequential order of words in a sentence is essential, as even a slight re-ordering could significantly alter its original meaning. This thesis makes methodological contributions and new investigations of sequential modelling for the specific application areas of systems that recommend music tracks to listeners and systems that process text semantics in order to automatically fact-check claims, or "speed read" text for efficient further classification. (Rest of abstract omitted due to arXiv abstract limit)
    Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation. (arXiv:2104.08771v2 [cs.CL] UPDATED)
    (0 min) We study the power of cross-attention in the Transformer architecture within the context of transfer learning for machine translation, and extend the findings of studies into cross-attention when training from scratch. We conduct a series of experiments through fine-tuning a translation model on data where either the source or target language has changed. These experiments reveal that fine-tuning only the cross-attention parameters is nearly as effective as fine-tuning all parameters (i.e., the entire translation model). We provide insights into why this is the case and observe that limiting fine-tuning in this manner yields cross-lingually aligned embeddings. The implications of this finding for researchers and practitioners include a mitigation of catastrophic forgetting, the potential for zero-shot translation, and the ability to extend machine translation models to several new language pairs with reduced parameter storage overhead.
    BenchIE: Open Information Extraction Evaluation Based on Facts, Not Tokens. (arXiv:2109.06850v1 [cs.CL])
    (0 min) Intrinsic evaluations of OIE systems are carried out either manually -- with human evaluators judging the correctness of extractions -- or automatically, on standardized benchmarks. The latter, while much more cost-effective, is less reliable, primarily because of the incompleteness of the existing OIE benchmarks: the ground truth extractions do not include all acceptable variants of the same fact, leading to unreliable assessment of models' performance. Moreover, the existing OIE benchmarks are available for English only. In this work, we introduce BenchIE: a benchmark and evaluation framework for comprehensive evaluation of OIE systems for English, Chinese and German. In contrast to existing OIE benchmarks, BenchIE takes into account informational equivalence of extractions: our gold standard consists of fact synsets, clusters in which we exhaustively list all surface forms of the same fact. We benchmark several state-of-the-art OIE systems using BenchIE and demonstrate that these systems are significantly less effective than indicated by existing OIE benchmarks. We make BenchIE (data and evaluation code) publicly available.
    Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition. (arXiv:2109.06870v1 [cs.CL])
    (0 min) This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
    Expert Knowledge-Guided Length-Variant Hierarchical Label Generation for Proposal Classification. (arXiv:2109.06661v1 [cs.LG])
    (2 min) To advance the development of science and technology, research proposals are submitted to open-court competitive programs developed by government agencies (e.g., NSF). Proposal classification is one of the most important tasks to achieve effective and fair review assignments. Proposal classification aims to classify a proposal into a length-variant sequence of labels. In this paper, we formulate the proposal classification problem as a hierarchical multi-label classification task. Although there are certain prior studies, proposal classification exhibits unique features: 1) the classification result of a proposal is in a hierarchical discipline structure with different levels of granularity; 2) proposals contain multiple types of documents; 3) domain experts can empirically provide partial labels that can be leveraged to improve task performance. In this paper, we focus on developing a new deep proposal classification framework to jointly model the three features. In particular, to sequentially generate labels, we leverage previously-generated labels to predict the label of the next level; to integrate partial labels from experts, we use the embedding of these empirical partial labels to initialize the state of neural networks. Our model can automatically identify the appropriate length of the label sequence and stop predicting further labels. Finally, we present extensive results to demonstrate that our method can jointly model partial labels, textual information, and semantic dependencies in label sequences, and, thus, achieve strong performance.
    Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding. (arXiv:2104.08455v2 [cs.CL] UPDATED)
    (0 min) Dialogue systems powered by large pre-trained language models (LM) exhibit an innate ability to deliver fluent and natural-looking responses. Despite their impressive generation performance, these models can often generate factually incorrect statements impeding their widespread adoption. In this paper, we focus on the task of improving the faithfulness -- and thus reduce hallucination -- of Neural Dialogue Systems to known facts supplied by a Knowledge Graph (KG). We propose Neural Path Hunter which follows a generate-then-refine strategy whereby a generated response is amended using the k-hop subgraph of a KG. Neural Path Hunter leverages a separate token-level fact critic to identify plausible sources of hallucination followed by a refinement stage consisting of a chain of two neural LM's that retrieves correct entities by crafting a query signal that is propagated over the k-hop subgraph. Our proposed model can easily be applied to any generated dialogue response without retraining the model. We empirically validate our proposed approach on the OpenDialKG dataset against a suite of metrics and report a 20.35% relative improvement in the faithfulness of dialogue responses based on FeQA (Durmus et al., 2020).
    Talking Space: inference from spatial linguistic meanings. (arXiv:2109.06554v1 [cs.CL])
    (2 min) This paper concerns the intersection of natural language and the physical space around us in which we live, that we observe and/or imagine things within. Many important features of language have spatial connotations, for example, many prepositions (like in, next to, after, on, etc.) are fundamentally spatial. Space is also a key factor of the meanings of many words/phrases/sentences/text, and space is a key, if not the key, context for referencing (e.g. pointing) and embodiment. We propose a mechanism for how space and linguistic structure can be made to interact in a matching compositional fashion. Examples include Cartesian space, subway stations, chesspieces on a chess-board, and Penrose's staircase. The starting point for our construction is the DisCoCat model of compositional natural language meaning, which we relax to accommodate physical space. We address the issue of having multiple agents/objects in a space, including the case that each agent has different capabilities with respect to that space, e.g., the specific moves each chesspiece can make, or the different velocities one may be able to reach. Once our model is in place, we show how inferences drawing from the structure of physical space can be made. We also show how a linguistic model of space can interact with other such models related to our senses and/or embodiment, such as the conceptual spaces of colour, taste and smell, resulting in a rich compositional model of meaning that is close to human experience and embodiment in the world.
    Efficient Sampling of Dependency Structures. (arXiv:2109.06521v1 [cs.CL])
    (2 min) Probabilistic distributions over spanning trees in directed graphs are a fundamental model of dependency structure in natural language processing, most prominently of syntactic dependency trees. In NLP, dependency trees often have an additional root constraint: only one edge may emanate from the root. However, no sampling algorithm has been presented in the literature to account for this additional constraint. In this paper, we adapt two spanning tree sampling algorithms to faithfully sample dependency trees from a graph subject to the root constraint. Wilson (1996)'s sampling algorithm has a running time of $\mathcal{O}(H)$ where $H$ is the mean hitting time of the graph. Colbourn (1996)'s sampling algorithm has a running time of $\mathcal{O}(N^3)$, which is often greater than the mean hitting time of a directed graph. Additionally, we build upon Colbourn's algorithm and present a novel extension that can sample $K$ trees without replacement in $\mathcal{O}(K N^3 + K^2 N)$ time. To the best of our knowledge, no algorithm has been given for sampling spanning trees without replacement from a directed graph.
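    The root constraint above (exactly one edge may leave the artificial ROOT node) can be made concrete with a brute-force sampler over weighted dependency trees of a tiny sentence. This is only an illustration of the target distribution, not the adapted Wilson/Colbourn algorithms from the paper; edge weights are assumed non-negative.

        import itertools
        import random
        import numpy as np

        def sample_dependency_tree(W, rng=random):
            """Brute-force sampler over single-root spanning arborescences.
            W[h, m] is the weight of attaching modifier m to head h; node 0 is ROOT.
            Only feasible for very short sentences."""
            n = W.shape[0] - 1                               # number of words
            trees, weights = [], []
            for heads in itertools.product(range(n + 1), repeat=n):
                heads = (None,) + heads                      # heads[m] = head of word m
                if any(h == m for m, h in enumerate(heads) if m):      # no self-loops
                    continue
                if sum(h == 0 for h in heads[1:]) != 1:                # root constraint
                    continue
                ok = True                                    # every word must reach ROOT
                for m in range(1, n + 1):
                    seen, h = set(), m
                    while h != 0:
                        if h in seen:
                            ok = False
                            break
                        seen.add(h)
                        h = heads[h]
                    if not ok:
                        break
                if not ok:
                    continue
                trees.append(heads[1:])
                weights.append(np.prod([W[h, m] for m, h in enumerate(heads) if m]))
            probs = np.array(weights) / sum(weights)
            return trees[rng.choices(range(len(trees)), weights=probs)[0]]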
    Different Strokes for Different Folks: Investigating Appropriate Further Pre-training Approaches for Diverse Dialogue Tasks. (arXiv:2109.06524v1 [cs.CL])
    (0 min) Loading models pre-trained on large-scale general-domain corpora and fine-tuning them on specific downstream tasks is gradually becoming a paradigm in Natural Language Processing. Previous investigations prove that introducing a further pre-training phase between pre-training and fine-tuning phases to adapt the model on the domain-specific unlabeled data can bring positive effects. However, most of these further pre-training works just keep running the conventional pre-training task, e.g., masked language modeling, which can be regarded as domain adaptation to bridge the data distribution gap. After observing diverse downstream tasks, we suggest that different tasks may also need a further pre-training phase with appropriate training tasks to bridge the task formulation gap. To investigate this, we carry out a study for improving multiple task-oriented dialogue downstream tasks through designing various tasks at the further pre-training phase. The experiments show that different downstream tasks prefer different further pre-training tasks, that the two are intrinsically correlated, and that most further pre-training tasks significantly improve certain target tasks rather than all of them. Our investigation indicates that it is important and effective to design appropriate further pre-training tasks that model the specific information benefiting downstream tasks. In addition, we present multiple constructive empirical conclusions for enhancing task-oriented dialogues.
    MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model. (arXiv:2109.06605v1 [cs.CL])
    (2 min) Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
    Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework. (arXiv:2109.05897v2 [cs.CL] UPDATED)
    (2 min) Answering questions asked from instructional corpora such as E-manuals, recipe books, etc., has been far less studied than open-domain factoid context-based question answering. This can be primarily attributed to the absence of standard benchmark datasets. In this paper we meticulously create a large amount of data connected with E-manuals and develop a suitable algorithm to exploit it. We collect E-Manual Corpus, a huge corpus of 307,957 E-manuals and pretrain RoBERTa on this large corpus. We create various benchmark QA datasets which include question answer pairs curated by experts based upon two E-manuals, real user questions from a Community Question Answering Forum pertaining to E-manuals, etc. We introduce EMQAP (E-Manual Question Answering Pipeline) that answers questions pertaining to electronics devices. Built upon the pretrained RoBERTa, it harbors a supervised multi-task learning framework which efficiently performs the dual tasks of identifying the section in the E-manual where the answer can be found and the exact answer span within that section. For E-Manual annotated question-answer pairs, we show an improvement of about 40% in ROUGE-L F1 scores over the most competitive baseline. We perform a detailed ablation study and establish the versatility of EMQAP across different circumstances. The code and datasets are shared at https://github.com/abhi1nandy2/EMNLP-2021-Findings, and the corresponding project website is https://sites.google.com/view/emanualqa/home.
    Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by injecting Character-level Noise. (arXiv:2109.06772v1 [cs.CL])
    (0 min) Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should be facilitated by their similarity, but current approaches that operate in the embedding space do not take surface similarity into account. In this work, we present a simple yet effective strategy to improve cross-lingual transfer between closely related varieties by augmenting the data of the high-resource parent language with character-level noise to make the model more robust towards spelling variations. Our strategy shows consistent improvements over several languages and tasks: Zero-shot transfer of POS tagging and topic identification between language varieties from the Germanic, Uralic, and Romance language genera. Our work provides evidence for the usefulness of simple surface-level noise in improving transfer between language varieties.
    DEGREE: A Data-Efficient Generative Event Extraction Model. (arXiv:2108.12724v2 [cs.CL] UPDATED)
    (2 min) Event extraction (EE) aims to identify structured events, including event triggers and their corresponding arguments, from unstructured text. Most of the existing works rely on a large number of labeled instances to train models, while labeled data can be expensive to obtain. In this work, we present a data-efficient event extraction method by formulating event extraction as a natural language generation problem. The formulation allows us to inject knowledge of label semantics, event structure, and output dependencies into the model. Given a passage and an event type, our model learns to summarize this passage into a templated sentence in a predefined structure. The template is event-type-specific, manually created, and contains event trigger and argument information. Lastly, a rule-based algorithm is used to derive the trigger and argument predictions from the generated sentence. Our method inherently enjoys the following benefits: (1) The pretraining of the generative language models helps incorporate the semantics of the labels for generative EE. (2) The autoregressive generation process and our end-to-end design for extracting triggers and arguments force the model to capture the dependencies among the output triggers and their arguments. (3) The predefined templates form concrete yet flexible rules that hint to the models what patterns are valid for each event type, reducing the models' burden to learn structures from the data. Empirical results show that our model achieves superior performance over strong baselines on EE tasks in the low data regime and achieves competitive results to the current state-of-the-art when more data becomes available.
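    The template idea above can be pictured with a hypothetical event-type template and a rule-based parse of the generated sentence; the template wording, placeholder names, and example sentence below are invented for illustration and are not taken from the paper.

        import re

        # Hypothetical template for an "Attack" event; the real templates differ.
        TEMPLATE = "somebody attacked somebody using something"

        def parse_generated(generated):
            """Rule-based extraction of arguments from a filled-in template sentence."""
            m = re.match(r"(?P<attacker>.+?) attacked (?P<target>.+?) using (?P<instrument>.+)",
                         generated)
            return m.groupdict() if m else {}

        # Suppose the fine-tuned generator turned a passage into this templated sentence:
        print(parse_generated("the rebels attacked the convoy using rockets"))
        # {'attacker': 'the rebels', 'target': 'the convoy', 'instrument': 'rockets'}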
    Challenging Instances are Worth Learning: Generating Valuable Negative Samples for Response Selection Training. (arXiv:2109.06538v1 [cs.CL])
    (2 min) A retrieval-based chatbot selects the appropriate response from candidates according to the context, which heavily depends on a response selection module. A response selection module is generally a scoring model that evaluates candidates and is usually trained on the annotated positive response and sampled negative responses. Sampling negative responses carries two risks: a) the sampled negative instances, especially those from random sampling methods, are mostly irrelevant to the dialogue context and too easy to fit at the training stage, yielding a model that is weak in real scenarios; b) the so-called negative instances may actually be positive, which is known as the fake negative problem. To address these issues, we employ pre-trained language models, such as DialoGPT, to construct more challenging negative instances and enhance model robustness. Specifically, we provide garbled context to the pre-trained model to generate responses and filter out the fake negative ones. In this way, our negative instances are fluent, context-related, and more challenging for the model to learn from, while not being actually positive. Extensive experiments show that our method brings significant and stable improvements in dialogue response selection capacity.
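    A minimal sketch of the generation step described above, using the public DialoGPT checkpoint from Hugging Face; the shuffling scheme stands in for "garbling" the context, and the subsequent fake-negative filtering is omitted. The decoding parameters are assumptions, not the paper's settings.

        import random
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
        lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

        def hard_negative(context_turns, seed=0):
            """Generate a fluent but (by construction) off-context response by
            conditioning the generator on a shuffled version of the dialogue context."""
            turns = list(context_turns)
            random.Random(seed).shuffle(turns)
            garbled = tok.eos_token.join(turns) + tok.eos_token
            ids = tok(garbled, return_tensors="pt").input_ids
            out = lm.generate(ids, max_length=ids.shape[1] + 40,
                              do_sample=True, top_p=0.9,
                              pad_token_id=tok.eos_token_id)
            return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)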
    Connecting Attributions and QA Model Behavior on Realistic Counterfactuals. (arXiv:2104.04515v2 [cs.CL] UPDATED)
    (2 min) When a model attribution technique highlights a particular part of the input, a user might understand this highlight as making a statement about counterfactuals (Miller, 2019): if that part of the input were to change, the model's prediction might change as well. This paper investigates how well different attribution techniques align with this assumption on realistic counterfactuals in the case of reading comprehension (RC). RC is a particularly challenging test case, as token-level attributions that have been extensively studied in other NLP tasks such as sentiment analysis are less suitable to represent the reasoning that RC models perform. We construct counterfactual sets for three different RC settings, and through heuristics that can connect attribution methods' outputs to high-level model behavior, we can evaluate how useful different attribution methods and even different formats are for understanding counterfactuals. We find that pairwise attributions are better suited to RC than token-level attributions across these different RC settings, with our best performance coming from a modification that we propose to an existing pairwise attribution method.
    AligNART: Non-autoregressive Neural Machine Translation by Jointly Learning to Estimate Alignment and Translate. (arXiv:2109.06481v1 [cs.CL])
    (2 min) Non-autoregressive neural machine translation (NART) models suffer from the multi-modality problem which causes translation inconsistency such as token repetition. Most recent approaches have attempted to solve this problem by implicitly modeling dependencies between outputs. In this paper, we introduce AligNART, which leverages full alignment information to explicitly reduce the modality of the target distribution. AligNART divides the machine translation task into $(i)$ alignment estimation and $(ii)$ translation with aligned decoder inputs, guiding the decoder to focus on simplified one-to-one translation. To alleviate the alignment estimation problem, we further propose a novel alignment decomposition method. Our experiments show that AligNART outperforms previous non-iterative NART models that focus on explicit modality reduction on WMT14 En$\leftrightarrow$De and WMT16 Ro$\rightarrow$En. Furthermore, AligNART achieves BLEU scores comparable to those of the state-of-the-art connectionist temporal classification based models on WMT14 En$\leftrightarrow$De. We also observe that AligNART effectively addresses the token repetition problem even without sequence-level knowledge distillation.
    Efficient Inference for Multilingual Neural Machine Translation. (arXiv:2109.06679v1 [cs.CL])
    (2 min) Multilingual NMT has become an attractive solution for MT deployment in production. But matching bilingual quality comes at the cost of larger and slower models. In this work, we consider several ways to make multilingual NMT faster at inference without degrading its quality. We experiment with several "light decoder" architectures in two 20-language multi-parallel settings: small-scale on TED Talks and large-scale on ParaCrawl. Our experiments demonstrate that combining a shallow decoder with vocabulary filtering leads to inference that is more than twice as fast, with no loss in translation quality. We validate our findings with BLEU and chrF (on 380 language pairs), robustness evaluation and human evaluation.
    Scalable Font Reconstruction with Dual Latent Manifolds. (arXiv:2109.06627v1 [cs.CV])
    (2 min) We propose a deep generative model that performs typography analysis and font reconstruction by learning disentangled manifolds of both font style and character shape. Our approach enables us to massively scale up the number of character types we can effectively model compared to previous methods. Specifically, we infer separate latent variables representing character and font via a pair of inference networks which take as input sets of glyphs that either all share a character type, or belong to the same font. This design allows our model to generalize to characters that were not observed during training time, an important task in light of the relative sparsity of most fonts. We also put forward a new loss, adapted from prior work that measures likelihood using an adaptive distribution in a projected space, resulting in more natural images without requiring a discriminator. We evaluate on the task of font reconstruction over various datasets representing character types of many languages, and compare favorably to modern style transfer systems according to both automatic and manually-evaluated metrics.
    Mitigating Language-Dependent Ethnic Bias in BERT. (arXiv:2109.05704v2 [cs.CL] UPDATED)
    (2 min) BERT and other large-scale language models (LMs) contain gender and racial bias. They also exhibit other dimensions of social bias, most of which have not been studied in depth, and some of which vary depending on the language. In this paper, we study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT for English, German, Spanish, Korean, Turkish, and Chinese. To observe and quantify ethnic bias, we develop a novel metric called Categorical Bias score. Then we propose two methods for mitigation; first using a multilingual model, and second using contextual word alignment of two monolingual models. We compare our proposed methods with monolingual BERT and show that these methods effectively alleviate the ethnic bias. Which of the two methods works better depends on the amount of NLP resources available for that language. We additionally experiment with Arabic and Greek to verify that our proposed methods work for a wider variety of languages.
    End-to-End Natural Language Understanding Pipeline for Bangla Conversational Agents. (arXiv:2107.05541v4 [cs.CL] UPDATED)
    (2 min) Chatbots are intelligent software built to be used as a replacement for human interaction. However, existing studies typically do not provide enough support for low-resource languages like Bangla. Moreover, due to the increasing popularity of social media, we can also see the rise of interactions written in Bangla transliteration (mostly using the English alphabet) among native Bangla speakers. In this paper, we propose a novel approach to build a Bangla chatbot aimed to be used as a business assistant that can consistently communicate with high confidence in Bangla and in Bangla transliterated into English. Since annotated data was not available for this purpose, we had to work on the whole machine learning life cycle (data preparation, machine learning modeling, and model deployment) using Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks. While working with the skewed annotated dataset, we try out different setups and pipelines to evaluate which works best and provide possible reasoning behind the observed results. Finally, we present a pipeline for intent classification and entity extraction which achieves reasonable performance (accuracy: 83.02%, precision: 80.82%, recall: 83.02%, F1-score: 80%).
    The Low-Dimensional Linear Geometry of Contextualized Word Representations. (arXiv:2105.07109v2 [cs.CL] UPDATED)
    (2 min) Black-box probing models can reliably extract linguistic features like tense, number, and syntactic role from pretrained word representations. However, the manner in which these features are encoded in representations remains poorly understood. We present a systematic study of the linear geometry of contextualized word representations in ELMO and BERT. We show that a variety of linguistic features (including structured dependency relationships) are encoded in low-dimensional subspaces. We then refine this geometric picture, showing that there are hierarchical relations between the subspaces encoding general linguistic categories and more specific ones, and that low-dimensional feature encodings are distributed rather than aligned to individual neurons. Finally, we demonstrate that these linear subspaces are causally related to model behavior, and can be used to perform fine-grained manipulation of BERT's output distribution.
    Categorical Semantics of Reversible Pattern-Matching. (arXiv:2109.05837v2 [cs.LO] UPDATED)
    (2 min) This paper is concerned with categorical structures for reversible computation. In particular, we focus on a typed, functional reversible language based on Theseus. We discuss how join inverse rig categories do not in general capture pattern-matching, the core construct Theseus uses to enforce reversibility. We then derive a categorical structure to add to join inverse rig categories in order to capture pattern-matching. We show how such a structure makes an adequate model for reversible pattern-matching.
    Types of Out-of-Distribution Texts and How to Detect Them. (arXiv:2109.06827v1 [cs.CL])
    (0 min) Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of OOD examples and how to best detect them. We categorize these examples by whether they exhibit a background shift or a semantic shift, and find that the two major approaches to OOD detection, model calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings, while performing worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, highlighting a weak spot for current methods. Since no single method works well across all settings, our results call for an explicit definition of OOD examples when evaluating different detection methods.
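    A minimal sketch of the two detector families contrasted above, assuming a fine-tuned classifier and a causal language model are already available (both are passed in; the thresholds one would apply to these scores are left out): a calibration-style score (maximum softmax probability) and a density-style score (average log-likelihood under the LM).

        import torch
        import torch.nn.functional as F

        def msp_score(classifier_logits):
            """Calibration-style OOD score: higher max softmax probability
            suggests the input is in-distribution."""
            return F.softmax(classifier_logits, dim=-1).max(dim=-1).values

        def lm_density_score(lm, tok, text):
            """Density-style OOD score: average token log-likelihood under a
            causal language model (higher = more in-distribution)."""
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                out = lm(ids, labels=ids)   # Hugging Face causal LMs return mean NLL as .loss
            return -out.loss.item()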
    Coreference-Aware Dialogue Summarization. (arXiv:2106.08556v2 [cs.CL] UPDATED)
    (0 min) Summarizing conversations via neural approaches has been gaining research traction lately, yet it is still challenging to obtain practical solutions. Examples of such challenges include unstructured information exchange in dialogues, informal interactions between speakers, and dynamic role changes of speakers as the dialogue evolves. Many of such challenges result in complex coreference links. Therefore, in this work, we investigate different approaches to explicitly incorporate coreference information in neural abstractive dialogue summarization models to tackle the aforementioned challenges. Experimental results show that the proposed approaches achieve state-of-the-art performance, implying it is useful to utilize coreference information in dialogue summarization. Evaluation results on factual correctness suggest such coreference-aware models are better at tracing the information flow among interlocutors and associating accurate status/actions with the corresponding interlocutors and person mentions.
    Few-Shot Intent Detection via Contrastive Pre-Training and Fine-Tuning. (arXiv:2109.06349v1 [cs.CL])
    (0 min) In this work, we focus on a more challenging few-shot intent detection scenario where many intents are fine-grained and semantically similar. We present a simple yet effective few-shot intent detection schema via contrastive pre-training and fine-tuning. Specifically, we first conduct self-supervised contrastive pre-training on collected intent datasets, which implicitly learns to discriminate semantically similar utterances without using any labels. We then perform few-shot intent detection together with supervised contrastive learning, which explicitly pulls utterances from the same intent closer and pushes utterances across different intents farther. Experimental results show that our proposed method achieves state-of-the-art performance on three challenging intent detection datasets under 5-shot and 10-shot settings.
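    A minimal sketch of a supervised contrastive loss of the kind described above (pull utterances of the same intent together, push different intents apart); the exact loss, temperature, and batching in the paper may differ.

        import torch
        import torch.nn.functional as F

        def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
            """embeddings: (N, d) utterance encodings; labels: (N,) intent ids (tensors)."""
            z = F.normalize(embeddings, dim=-1)
            sim = z @ z.t() / temperature                            # (N, N) similarities
            n = z.size(0)
            self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
            pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

            # log-softmax over all other examples for each anchor
            sim = sim.masked_fill(self_mask, float("-inf"))
            log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

            # average log-probability of positives per anchor (anchors without positives skipped)
            pos_counts = pos_mask.sum(dim=1)
            valid = pos_counts > 0
            loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid] / pos_counts[valid]
            return loss.mean()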
    Robust Retrieval Augmented Generation for Zero-shot Slot Filling. (arXiv:2108.13934v2 [cs.CL] UPDATED)
    (0 min) Automatically inducing high quality knowledge graphs from a given collection of documents still remains a challenging problem in AI. One way to make headway for this problem is through advancements in a related task known as slot filling. In this task, given an entity query in form of [Entity, Slot, ?], a system is asked to fill the slot by generating or extracting the missing value exploiting evidence extracted from relevant passage(s) in the given document collection. The recent works in the field try to solve this task in an end-to-end fashion using retrieval-based language models. In this paper, we present a novel approach to zero-shot slot filling that extends dense passage retrieval with hard negatives and robust training procedures for retrieval augmented generation models. Our model reports large improvements on both T-REx and zsRE slot filling datasets, improving both passage retrieval and slot value generation, and ranking at the top-1 position in the KILT leaderboard. Moreover, we demonstrate the robustness of our system showing its domain adaptation capability on a new variant of the TACRED dataset for slot filling, through a combination of zero/few-shot learning. We release the source code and pre-trained models.
    An MRC Framework for Semantic Role Labeling. (arXiv:2109.06660v1 [cs.CL])
    (0 min) Semantic Role Labeling (SRL) aims at recognizing the predicate-argument structure of a sentence and can be decomposed into two subtasks: predicate disambiguation and argument labeling. Prior work deals with these two tasks independently, which ignores the semantic connection between the two tasks. In this paper, we propose to use the machine reading comprehension (MRC) framework to bridge this gap. We formalize predicate disambiguation as multiple-choice machine reading comprehension, where the descriptions of candidate senses of a given predicate are used as options to select the correct sense. The chosen predicate sense is then used to determine the semantic roles for that predicate, and these semantic roles are used to construct the query for another MRC model for argument labeling. In this way, we are able to leverage both the predicate semantics and the semantic role semantics for argument labeling. We also propose to select a subset of all the possible semantic roles for computational efficiency. Experiments show that the proposed framework achieves state-of-the-art results on both span and dependency benchmarks.
    'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP. (arXiv:2109.06598v1 [cs.CL])
    (2 min) A key part of the NLP ethics movement is responsible use of data, but exactly what that means or how it can be best achieved remain unclear. This position paper discusses the core legal and ethical principles for collection and sharing of textual data, and the tensions between them. We propose a potential checklist for responsible data (re-)use that could both standardise the peer review of conference submissions, as well as enable a more in-depth view of published research across the community. Our proposal aims to contribute to the development of a consistent standard for data (re-)use, embraced across NLP conferences.
    Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color. (arXiv:2109.06129v2 [cs.CV] UPDATED)
    (2 min) Pretrained language models have been shown to encode relational information, such as the relations between entities or concepts in knowledge-bases -- (Paris, Capital, France). However, simple relations of this type can often be recovered heuristically and the extent to which models implicitly reflect topological structure that is grounded in the world, such as perceptual structure, is unknown. To explore this question, we conduct a thorough case study on color. Namely, we employ a dataset of monolexemic color terms and color chips represented in CIELAB, a color space with a perceptually meaningful distance metric. Using two methods of evaluating the structural alignment of colors in this space with text-derived color term representations, we find significant correspondence. Analyzing the differences in alignment across the color spectrum, we find that warmer colors are, on average, better aligned to the perceptual color space than cooler ones, suggesting an intriguing connection to findings from recent work on efficient communication in color naming. Further analysis suggests that differences in alignment are, in part, mediated by collocationality and differences in syntactic usage, posing questions as to the relationship between color perception and usage and context.
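    One simple way to quantify the structural alignment mentioned above (the paper's two exact evaluation methods are not specified in the abstract) is representational similarity analysis: correlate pairwise CIELAB distances between color chips with pairwise distances between the corresponding text-derived color-term embeddings.

        import numpy as np
        from scipy.spatial.distance import pdist
        from scipy.stats import spearmanr

        def structural_alignment(cielab_coords, term_embeddings):
            """cielab_coords: (n_colors, 3) L*a*b* values; term_embeddings: (n_colors, d)
            text-derived vectors for the matching monolexemic color terms."""
            d_perceptual = pdist(cielab_coords)            # CIELAB Euclidean distance
            d_textual = pdist(term_embeddings, metric="cosine")
            rho, p = spearmanr(d_perceptual, d_textual)
            return rho, p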
    Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation. (arXiv:2102.05766v2 [cs.CL] UPDATED)
    (2 min) Recently, representation learning for text and speech has successfully improved many language related tasks. However, all existing methods suffer from two limitations: (a) they only learn from one input modality, while a unified representation for both speech and text is needed by tasks such as end-to-end speech translation, and as a result, (b) they cannot exploit various large-scale text and speech data and their performance is limited by the scarcity of parallel speech translation data. To address these problems, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and even pure speech and text data. Within this cross-modal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that by fine-tuning from FAT-MLM, our proposed speech translation models substantially improve translation quality by up to +5.9 BLEU.
    To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation. (arXiv:2107.10821v2 [cs.CL] UPDATED)
    (2 min) Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on -- to the best of our knowledge -- the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.
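    A minimal sketch of the pairwise-ranking evaluation described above: for every pair of systems, check whether the metric's ordering agrees with the human ranking (how ties are handled in the paper is not stated, so they are simply skipped here).

        from itertools import combinations

        def pairwise_accuracy(metric_scores, human_scores):
            """metric_scores, human_scores: dicts mapping system name -> score.
            Returns the fraction of system pairs on which the metric agrees with humans."""
            agree, total = 0, 0
            for a, b in combinations(metric_scores, 2):
                m = metric_scores[a] - metric_scores[b]
                h = human_scores[a] - human_scores[b]
                if h == 0:                  # skip ties in human judgement
                    continue
                total += 1
                agree += (m > 0) == (h > 0)
            return agree / total if total else float("nan")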
    CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation. (arXiv:2109.05729v2 [cs.CL] UPDATED)
    (2 min) In this paper, we take advantage of previous pre-trained models (PTMs) and propose a novel Chinese Pre-trained Unbalanced Transformer (CPT). Different from previous Chinese PTMs, CPT is designed for both natural language understanding (NLU) and natural language generation (NLG) tasks. CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder. Two specific decoders with a shared encoder are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) tasks, respectively. With the partially shared architecture and multi-task pre-training, CPT can (1) learn specific knowledge of both NLU and NLG tasks with the two decoders and (2) be fine-tuned flexibly in ways that fully exploit the potential of the model. Moreover, the unbalanced Transformer saves computational and storage cost, which makes CPT competitive and greatly accelerates the inference of text generation. Experimental results on a wide range of Chinese NLU and NLG tasks show the effectiveness of CPT.
    Exploring Personality and Online Social Engagement: An Investigation of MBTI Users on Twitter. (arXiv:2109.06402v1 [cs.CL])
    (0 min) Text-based personality prediction by computational models is an emerging field with the potential to significantly improve on key weaknesses of survey-based personality assessment. We investigate 3848 profiles from Twitter with self-labeled Myers-Briggs personality traits (MBTI) - a framework closely related to the Five Factor Model of personality - to better understand how text-based digital traces from social engagement online can be used to predict user personality traits. We leverage BERT, a state-of-the-art NLP architecture based on deep learning, to analyze various sources of text that hold most predictive power for our task. We find that biographies, statuses, and liked tweets contain significant predictive power for all dimensions of the MBTI system. We discuss our findings and their implications for the validity of the MBTI and the lexical hypothesis, a foundational theory underlying the Five Factor Model that links language use and behavior. Our results hold optimistic implications for personality psychologists, computational linguists, and other social scientists aiming to predict personality from observational text data and explore the links between language and core behavioral traits.
    Controllable Dialogue Generation with Disentangled Multi-grained Style Specification and Attribute Consistency Reward. (arXiv:2109.06717v1 [cs.CL])
    (0 min) Controllable text generation is an appealing but challenging task, which allows users to specify particular attributes of the generated outputs. In this paper, we propose a controllable dialogue generation model to steer response generation under multi-attribute constraints. Specifically, we define and categorize the commonly used control attributes into global and local ones, which possess different granularities of effects on response generation. Then, we significantly extend the conventional seq2seq framework by introducing a novel two-stage decoder, which first uses a multi-grained style specification layer to impose the stylistic constraints and determine word-level control states of responses based on the attributes, and then employs a response generation layer to generate final responses maintaining both semantic relevancy to the contexts and fidelity to the attributes. Furthermore, we train our model with an attribute consistency reward to promote response control with explicit supervision signals. Extensive experiments and in-depth analyses on two datasets indicate that our model can significantly outperform competitive baselines in terms of response quality, content diversity and controllability.
    Learning Bill Similarity with Annotated and Augmented Corpora of Bills. (arXiv:2109.06527v1 [cs.CL])
    (0 min) Bill writing is a critical element of representative democracy. However, it is often overlooked that most legislative bills are derived, or even directly copied, from other bills. Despite the significance of bill-to-bill linkages for understanding the legislative process, existing approaches fail to address semantic similarities across bills, let alone reordering or paraphrasing which are prevalent in legal document writing. In this paper, we overcome these limitations by proposing a 5-class classification task that closely reflects the nature of the bill generation process. In doing so, we construct a human-labeled dataset of 4,721 bill-to-bill relationships at the subsection-level and release this annotated dataset to the research community. To augment the dataset, we generate synthetic data with varying degrees of similarity, mimicking the complex bill writing process. We use BERT variants and apply multi-stage training, sequentially fine-tuning our models with synthetic and human-labeled datasets. We find that the predictive performance significantly improves when training with both human-labeled and synthetic data. Finally, we apply our trained model to infer section- and bill-level similarities. Our analysis shows that the proposed methodology successfully captures the similarities across legal documents at various levels of aggregation.
    The Emergence of the Shape Bias Results from Communicative Efficiency. (arXiv:2109.06232v1 [cs.CL])
    (2 min) By the age of two, children tend to assume that new word categories are based on objects' shape, rather than their color or texture; this assumption is called the shape bias. They are thought to learn this bias by observing that their caregiver's language is biased towards shape-based categories. This presents a chicken-and-egg problem: if the shape bias must be present in the language in order for children to learn it, how did it arise in language in the first place? In this paper, we propose that communicative efficiency explains both how the shape bias emerged and why it persists across generations. We model this process with neural emergent language agents that learn to communicate about raw pixelated images. First, we show that the shape bias emerges as a result of efficient communication strategies employed by agents. Second, we show that pressure brought on by communicative need is also necessary for it to persist across generations; simply having a shape bias in an agent's input language is insufficient. These results suggest that, over and above the operation of other learning strategies, the shape bias in human learners may emerge and be sustained by communicative pressures.
    Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation. (arXiv:2109.06604v1 [cs.CL])
    (2 min) Recently, $k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, it heavily relies on high-quality in-domain parallel corpora, limiting its capability on unsupervised domain adaptation, where in-domain parallel corpora are scarce or nonexistent. In this paper, we propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval. To this end, we first introduce an autoencoder task based on the target language, and then insert lightweight adapters into the original NMT model to map the token-level representation of this task to the ideal representation of translation task. Experiments on multi-domain datasets demonstrate that our proposed approach significantly improves the translation accuracy with target-side monolingual data, while achieving comparable performance with back-translation.
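    A minimal sketch of the token-level $k$NN datastore and interpolation that $k$NN-MT relies on; how the keys are produced in this paper (via the adapter-mapped autoencoder representations built from target-side monolingual data) is only hinted at above, so here keys are simply assumed to be decoder hidden states, and the temperature and interpolation weight are illustrative.

        import numpy as np

        class KNNDatastore:
            """Token-level datastore: key = decoder hidden state, value = next target token."""
            def __init__(self, keys, values):
                self.keys = np.asarray(keys)         # (num_entries, d)
                self.values = np.asarray(values)     # (num_entries,)

            def knn_probs(self, query, vocab_size, k=8, temperature=10.0):
                d2 = ((self.keys - query) ** 2).sum(axis=1)
                idx = np.argpartition(d2, k)[:k]
                logits = -d2[idx] / temperature
                weights = np.exp(logits - logits.max())
                weights /= weights.sum()
                p = np.zeros(vocab_size)
                for w, tok in zip(weights, self.values[idx]):
                    p[tok] += w
                return p

        def interpolate(p_nmt, p_knn, lam=0.5):
            """Final next-token distribution mixes the NMT model and the datastore."""
            return (1 - lam) * p_nmt + lam * p_knn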
    Logic-level Evidence Retrieval and Graph-based Verification Network for Table-based Fact Verification. (arXiv:2109.06480v1 [cs.AI])
    (2 min) Table-based fact verification task aims to verify whether the given statement is supported by the given semi-structured table. Symbolic reasoning with logical operations plays a crucial role in this task. Existing methods leverage programs that contain rich logical information to enhance the verification process. However, due to the lack of fully supervised signals in the program generation process, spurious programs can be derived and employed, which leads to the inability of the model to catch helpful logical operations. To address the aforementioned problems, in this work, we formulate the table-based fact verification task as an evidence retrieval and reasoning framework, proposing the Logic-level Evidence Retrieval and Graph-based Verification network (LERGV). Specifically, we first retrieve logic-level program-like evidence from the given table and statement as supplementary evidence for the table. After that, we construct a logic-level graph to capture the logical relations between entities and functions in the retrieved evidence, and design a graph-based verification network to perform logic-level graph-based reasoning based on the constructed graph to classify the final entailment relation. Experimental results on the large-scale benchmark TABFACT show the effectiveness of the proposed approach.
    KFCNet: Knowledge Filtering and Contrastive Learning Network for Generative Commonsense Reasoning. (arXiv:2109.06704v1 [cs.CL])
    (0 min) Pre-trained language models have led to substantial gains over a broad range of natural language processing (NLP) tasks, but have been shown to have limitations for natural language generation tasks with high-quality requirements on the output, such as commonsense generation and ad keyword generation. In this work, we present a novel Knowledge Filtering and Contrastive learning Network (KFCNet) which references external knowledge and achieves better generation performance. Specifically, we propose a BERT-based filter model to remove low-quality candidates, and apply contrastive learning separately to each of the encoder and decoder, within a general encoder--decoder architecture. The encoder contrastive module helps to capture global target semantics during encoding, and the decoder contrastive module enhances the utility of retrieved prototypes while learning general features. Extensive experiments on the CommonGen benchmark show that our model outperforms the previous state of the art by a large margin: +6.6 points (42.5 vs. 35.9) for BLEU-4, +3.7 points (33.3 vs. 29.6) for SPICE, and +1.3 points (18.3 vs. 17.0) for CIDEr. We further verify the effectiveness of the proposed contrastive module on ad keyword generation, and show that our model has potential commercial value.
    Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition. (arXiv:2109.06684v1 [cs.CL])
    (0 min) Non-autoregressive (NAR) transformer models have been studied intensively in automatic speech recognition (ASR), and a substantial number of NAR transformer models use the causal mask to limit token dependencies. However, the causal mask is designed for the left-to-right decoding process of the non-parallel autoregressive (AR) transformer, which is inappropriate for the parallel NAR transformer since it ignores the right-to-left contexts. Some models are proposed to utilize right-to-left contexts with an extra decoder, but these methods increase the model complexity. To tackle the above problems, we propose a new non-autoregressive transformer with a unified bidirectional decoder (NAT-UBD), which can simultaneously utilize left-to-right and right-to-left contexts. However, direct use of bidirectional contexts will cause information leakage, which means the decoder output can be affected by the character information from the input of the same position. To avoid information leakage, we propose a novel attention mask and modify vanilla queries, keys, and values matrices for NAT-UBD. Experimental results verify that NAT-UBD can achieve character error rates (CERs) of 5.0%/5.5% on the Aishell1 dev/test sets, outperforming all previous NAR transformer models. Moreover, NAT-UBD can run 49.8x faster than the AR transformer baseline when decoding in a single step.
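    The leakage problem above can be pictured with a simple mask: if a bidirectional decoder attends to its own position, the target character at that position leaks into its own prediction. Blanking the diagonal (each query sees every position except its own) is one way to visualize the fix, though the paper's actual mask and query/key/value modifications are more involved.

        import numpy as np

        def no_self_mask(seq_len):
            """Bidirectional attention mask with the diagonal removed: position i may
            attend to every other position but not to its own input embedding."""
            mask = np.ones((seq_len, seq_len), dtype=bool)
            np.fill_diagonal(mask, False)
            return mask                     # True = attention allowed

        print(no_self_mask(4).astype(int))
        # [[0 1 1 1]
        #  [1 0 1 1]
        #  [1 1 0 1]
        #  [1 1 1 0]]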
    Uncovering Implicit Gender Bias in Narratives through Commonsense Inference. (arXiv:2109.06437v1 [cs.CL])
    (2 min) Pre-trained language models learn socially harmful biases from their training corpora, and may repeat these biases when used for generation. We study gender biases associated with the protagonist in model-generated stories. Such biases may be expressed either explicitly ("women can't park") or implicitly (e.g. an unsolicited male character guides her into a parking space). We focus on implicit biases, and use a commonsense reasoning engine to uncover them. Specifically, we infer and analyze the protagonist's motivations, attributes, mental states, and implications on others. Our findings regarding implicit biases are in line with prior work that studied explicit biases, for example showing that female characters' portrayal is centered around appearance, while male figures' portrayal focuses on intellect.
    UNIQORN: Unified Question Answering over RDF Knowledge Graphs and Natural Language Text. (arXiv:2108.08614v2 [cs.IR] UPDATED)
    (2 min) Question answering over knowledge graphs and other RDF data has been greatly advanced, with a number of good systems providing crisp answers for natural language questions or telegraphic queries. Some of these systems incorporate textual sources as additional evidence for the answering process, but cannot compute answers that are present in text alone. Conversely, systems from the IR and NLP communities have addressed QA over text, but barely utilize semantic data and knowledge. This paper presents the first QA system that can seamlessly operate over RDF datasets and text corpora, or both together, in a unified framework. Our method, called UNIQORN, builds a context graph on the fly, by retrieving question-relevant triples from the RDF data and/or the text corpus, where the latter case is handled by automatic information extraction. The resulting graph is typically rich but highly noisy. UNIQORN copes with this input by advanced graph algorithms for Group Steiner Trees, that identify the best answer candidates in the context graph. Experimental results on several benchmarks of complex questions with multiple entities and relations show that UNIQORN, an unsupervised method with only five parameters, produces results comparable to the state-of-the-art on KGs, text corpora, and heterogeneous sources. The graph-based methodology provides user-interpretable evidence for the complete answering process.
    Identifying Untrustworthy Samples: Data Filtering for Open-domain Dialogues with Bayesian Optimization. (arXiv:2109.06471v1 [cs.CL])
    (2 min) Being able to reply with a related, fluent, and informative response is an indispensable requirement for building high-quality conversational agents. In order to generate better responses, some approaches have been proposed, such as feeding extra information by collecting large-scale datasets with human annotations, designing neural conversational models (NCMs) with complex architecture and loss functions, or filtering out untrustworthy samples based on a dialogue attribute, e.g., Relatedness or Genericness. In this paper, we follow the third research branch and present a data filtering method for open-domain dialogues, which identifies untrustworthy samples from training data with a quality measure that linearly combines seven dialogue attributes. The attribute weights are obtained via Bayesian Optimization (BayesOpt) that aims to optimize an objective function for dialogue generation iteratively on the validation set. Then we score training samples with the quality measure, sort them in descending order, and filter out those at the bottom. Furthermore, to accelerate the "filter-train-evaluate" iterations involved in BayesOpt on large-scale datasets, we propose a training framework that integrates maximum likelihood estimation (MLE) and a negative training method (NEG). The training method updates the parameters of a trained NCM on two small sets with newly maintained and removed samples, respectively. Specifically, MLE is applied to maximize the log-likelihood of newly maintained samples, while NEG is used to minimize the log-likelihood of newly removed ones. Experimental results on two datasets show that our method can effectively identify untrustworthy samples, and NCMs trained on the filtered datasets achieve better performance.
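    A minimal sketch of the scoring-and-filtering step described above: a quality measure that linearly combines per-sample dialogue attributes with weights (which the paper tunes via Bayesian optimization, not shown here), followed by dropping the lowest-scoring fraction. The attribute set and keep ratio are illustrative assumptions.

        import numpy as np

        def filter_training_pairs(attributes, weights, keep_ratio=0.9):
            """attributes: (num_samples, num_attrs) per-sample scores such as
            relatedness or (negated) genericness; weights: (num_attrs,) attribute
            weights, e.g. obtained from Bayesian optimization on a validation objective."""
            quality = attributes @ weights
            order = np.argsort(-quality)                  # descending quality
            keep = order[: int(len(order) * keep_ratio)]
            return np.sort(keep)                          # indices of retained samples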
    Sparse Fuzzy Attention for Structural Sentiment Analysis. (arXiv:2109.06719v1 [cs.CL])
    (0 min) Attention scorers have achieved success in parsing tasks like semantic and syntactic dependency parsing. However, in tasks modeled as parsing, like structural sentiment analysis, "dependency edges" are very sparse, which hinders parser performance. Thus we propose a sparse and fuzzy attention scorer with pooling layers which improves parser performance and sets the new state-of-the-art on structural sentiment analysis. We further explore the parsing modeling on structural sentiment analysis with second-order parsing and introduce a novel sparse second-order edge building procedure that leads to significant improvement in parsing performance.
    Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding. (arXiv:2109.06400v1 [cs.CV])
    (0 min) A key solution to temporal sentence grounding (TSG) exists in how to learn effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single-step process. However, such single-step attention is insufficient in practice, since complicated relations between inter- and intra-modality are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for TSG task, which iteratively interacts inter- and intra-modal features within multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism in a parallel manner. To further calibrate the misaligned attention caused by each reasoning step, we also devise a calibration module following each attention module to refine the alignment knowledge. With such iterative alignment scheme, our IA-Net can robustly capture the fine-grained relations between vision and language domains step-by-step for progressively reasoning the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model performs better than the state-of-the-arts.
    A Novel Global Feature-Oriented Relational Triple Extraction Model based on Table Filling. (arXiv:2109.06705v1 [cs.CL])
    (0 min) Table-filling-based relational triple extraction methods are attracting growing research interest due to their promising performance and their ability to extract triples from complex sentences. However, these methods fall short of their full potential because most of them focus only on local features and ignore the global associations among relations and among token pairs, which increases the chance of overlooking important information during triple extraction. To overcome this deficiency, we propose a global feature-oriented triple extraction model that makes full use of these two kinds of global associations. Specifically, we first generate a table feature for each relation. Two kinds of global associations are then mined from the generated table features, and the mined associations are integrated back into each relation's table feature. This "generate-mine-integrate" process is performed multiple times so that each table feature is refined step by step. Finally, each relation's table is filled based on its refined table feature, and all triples linked to that relation are extracted from the filled table. We evaluate the proposed model on three benchmark datasets. Experimental results show that our model is effective and achieves state-of-the-art results on all of these datasets. The source code of our work is available at: https://github.com/neukg/GRTE.
    Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder. (arXiv:2109.06536v1 [cs.CL])
    (2 min) Recent work has proposed several efficient approaches for generating gradient-based adversarial perturbations on embeddings and has shown that a model's performance and robustness can be improved when it is trained with these contaminated embeddings. However, little attention has been paid to helping the model learn from these adversarial samples more efficiently. In this work, we focus on enhancing the model's ability to defend against gradient-based adversarial attacks during training and propose two novel adversarial training approaches: (1) CARL narrows the distance between an original sample and its adversarial sample in the representation space while enlarging their distance from differently labeled samples. (2) RAR forces the model to reconstruct the original sample from its adversarial representation. Experiments show that the two proposed approaches outperform strong baselines on various text classification datasets. Further analysis shows that with our approaches, the semantic representation of the input sentence is not significantly affected by adversarial perturbations, and the model's performance drops less under adversarial attack; that is, our approaches effectively improve the robustness of the model. Moreover, RAR can also be used to generate text-form adversarial samples.
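    As a hedged sketch of the kind of contrastive objective described for CARL (the margin, distance choices, and names below are illustrative assumptions, not the paper's exact formulation), one can pull an example's clean and adversarial representations together while pushing apart representations of differently labeled examples:

```python
import torch
import torch.nn.functional as F

def contrastive_adversarial_loss(clean_repr, adv_repr, labels, margin=1.0):
    """Illustrative CARL-style loss: pull each clean/adversarial pair together,
    push representations of differently labeled examples at least `margin` apart.
    (Sketch only; the paper's exact formulation may differ.)"""
    # Distance between each example and its own adversarial counterpart.
    pos = F.pairwise_distance(clean_repr, adv_repr).mean()

    # Pairwise distances between clean representations of different examples.
    dists = torch.cdist(clean_repr, clean_repr)                       # [B, B]
    diff_label = (labels.unsqueeze(0) != labels.unsqueeze(1)).float()
    neg = (F.relu(margin - dists) * diff_label).sum() / diff_label.sum().clamp(min=1.0)

    return pos + neg

# Toy usage with random representations standing in for encoder outputs.
reprs = torch.randn(8, 128, requires_grad=True)
adv = reprs + 0.01 * torch.randn(8, 128)
labels = torch.randint(0, 2, (8,))
loss = contrastive_adversarial_loss(reprs, adv, labels)
loss.backward()
```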
    Hunspell for Sorani Kurdish Spell Checking and Morphological Analysis. (arXiv:2109.06374v1 [cs.CL])
    (2 min) Spell checking and morphological analysis are two fundamental tasks in text and natural language processing and are addressed in the early stages of the development of language technology. Despite previous efforts, there has been no open-source progress toward creating such tools for Sorani Kurdish, also known as Central Kurdish, a less-resourced language. In this paper, we present our efforts in annotating a lexicon with morphosyntactic tags and extracting morphological rules of Sorani Kurdish to build a morphological analyzer, a stemmer, and a spell-checking system using Hunspell. This implementation can be used by researchers for further developments in the field and can be integrated into text editors under a publicly available license.
    Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos. (arXiv:2109.06398v1 [cs.CV])
    (2 min) We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework that localizes the target segment with pre-defined segment proposals. Although they achieve decent performance, the proposals are handcrafted and redundant. Recently, the bottom-up framework has attracted increasing attention due to its superior efficiency: it directly predicts, for each frame, the probability of being a boundary. However, bottom-up models are inferior to their top-down counterparts because they fail to exploit segment-level interactions. In this paper, we propose an Adaptive Proposal Generation Network (APGN) that retains segment-level interactions while improving efficiency. Specifically, we first perform foreground-background classification on the video and regress on the foreground frames to adaptively generate proposals. In this way, the handcrafted proposal design is discarded and redundant proposals are reduced. A proposal consolidation module is then developed to enhance the semantics of the generated proposals. Finally, we locate the target moments with these generated proposals following the top-down framework. Extensive experiments on three challenging benchmarks show that the proposed APGN significantly outperforms previous state-of-the-art methods.
    Semantic Answer Type Prediction using BERT: IAI at the ISWC SMART Task 2020. (arXiv:2109.06714v1 [cs.CL])
    (0 min) This paper summarizes our participation in the SMART Task of the ISWC 2020 Challenge. A particular question we are interested in answering is how well neural methods, and specifically transformer models, such as BERT, perform on the answer type prediction task compared to traditional approaches. Our main finding is that coarse-grained answer types can be identified effectively with standard text classification methods, with over 95% accuracy, and BERT can bring only marginal improvements. For fine-grained type detection, on the other hand, BERT clearly outperforms previous retrieval-based approaches.
    A system for information extraction from scientific texts in Russian. (arXiv:2109.06703v1 [cs.CL])
    (0 min) In this paper, we present a system for information extraction from scientific texts in the Russian language. The system performs several tasks in an end-to-end manner: term recognition, extraction of relations between terms, and term linking with entities from the knowledge base. These tasks are extremely important for information retrieval, recommendation systems, and classification. The advantage of the implemented methods is that the system does not require a large amount of labeled data, which saves time and effort for data labeling and therefore can be applied in low- and mid-resource settings. The source code is publicly available and can be used for different research purposes.
    Document Graph for Neural Machine Translation. (arXiv:2012.03477v3 [cs.CL] UPDATED)
    (2 min) Previous work has shown that contextual information can improve the performance of neural machine translation (NMT). However, most existing document-level NMT methods only consider a few previous sentences. How to use the whole document as global context remains a challenge. To address this issue, we hypothesize that a document can be represented as a graph that connects relevant contexts regardless of their distance. We employ several types of relations, including adjacency, syntactic dependency, lexical consistency, and coreference, to construct the document graph. We then incorporate both source and target graphs into the conventional Transformer architecture with graph convolutional networks. Experiments on various NMT benchmarks, including IWSLT English-French, Chinese-English, WMT English-German, and OpenSubtitles English-Russian, demonstrate that using document graphs can significantly improve translation quality. Extensive analysis verifies that the document graph is beneficial for capturing discourse phenomena.
    LM-Critic: Language Models for Unsupervised Grammatical Error Correction. (arXiv:2109.06822v1 [cs.CL])
    (0 min) Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical / grammatical sentence pairs, but manually annotating such pairs can be expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. In this work, we show how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. We apply this LM-Critic and BIFI along with a large set of unlabeled sentences to bootstrap realistic ungrammatical / grammatical pairs for training a corrector. We evaluate our approach on GEC datasets across multiple domains (CoNLL-2014, BEA-2019, GMEG-wiki and GMEG-yahoo) and show that it outperforms existing methods in both the unsupervised setting (+7.7 F0.5) and the supervised setting (+0.5 F0.5).
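    The core LM-Critic judgment is easy to approximate with an off-the-shelf LM; the sketch below is an assumption-laden illustration using GPT-2 and simple word-drop perturbations (the paper uses a more careful perturbation set), marking a sentence as grammatical if it scores higher than all of its local perturbations.

```python
# Rough illustration of the LM-Critic idea with GPT-2; the word-dropping
# perturbations and the scoring heuristic are simplifications.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(sentence: str) -> float:
    """Approximate total log-likelihood of a sentence under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item() * ids.size(1)

def local_perturbations(sentence: str):
    """Yield neighbors of the sentence; here simply dropping one word at a time."""
    words = sentence.split()
    for i in range(len(words)):
        yield " ".join(words[:i] + words[i + 1:])

def lm_critic(sentence: str) -> bool:
    """Judge a sentence grammatical iff it outscores all of its local perturbations."""
    base = log_prob(sentence)
    return all(log_prob(p) < base for p in local_perturbations(sentence) if p)

print(lm_critic("She goes to school every day."))
```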
    The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation. (arXiv:2109.06835v1 [cs.CL])
    (0 min) Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker judgments improve when they are shown model-generated output alongside human-generated references, which enables the workers to better calibrate their ratings. Finally, interviews with the English teachers provide deeper insights into the challenges of the evaluation process, particularly when rating model-generated text.
    A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space. (arXiv:2109.06324v1 [cs.CL])
    (0 min) In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family training data as a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by looking at the effect of morphological segmentation on English-Inuktitut alignment, in addition to examining the effect of word order agreement on isomorphism for 66 zero-shot language pairs from a different corpus. We make the data and code for our experiments publicly available.
    Old BERT, New Tricks: Artificial Language Learning for Pre-Trained Language Models. (arXiv:2109.06333v1 [cs.CL])
    (0 min) We extend the artificial language learning experimental paradigm from psycholinguistics and apply it to pre-trained language models -- specifically, BERT (Devlin et al., 2019). We treat the model as a subject in an artificial language learning experimental setting: in order to learn the relation between two linguistic properties A and B, we introduce a set of new, non-existent, linguistic items, give the model information about their variation along property A, then measure to what extent the model learns property B for these items as a result of training. We show this method at work for degree modifiers (expressions like "slightly", "very", "rather", "extremely") and test the hypothesis that the degree expressed by modifiers (low, medium or high degree) is related to their sensitivity to sentence polarity (whether they show preference for affirmative or negative sentences or neither). Our experimental results are compatible with existing linguistic observations that relate degree semantics to polarity-sensitivity, including the main one: low degree semantics leads to positive polarity sensitivity (that is, to preference towards affirmative contexts). The method can be used in linguistics to elaborate on hypotheses and interpret experimental results, as well as for more insightful evaluation of linguistic representations in language models.
    Task-adaptive Pre-training and Self-training are Complementary for Natural Language Understanding. (arXiv:2109.06466v1 [cs.CL])
    (2 min) Task-adaptive pre-training (TAPT) and self-training (ST) have emerged as the major semi-supervised approaches to improve natural language understanding (NLU) tasks with massive amounts of unlabeled data. However, it is unclear whether they learn similar representations or whether they can be effectively combined. In this paper, we show that TAPT and ST can be complementary when combined through a simple protocol that follows the TAPT -> Fine-tuning -> Self-training (TFS) order. Experimental results show that the TFS protocol can effectively exploit unlabeled data to achieve strong combined gains consistently across six datasets covering sentiment classification, paraphrase identification, natural language inference, named entity recognition, and dialogue slot classification. We investigate various semi-supervised settings and consistently show that the gains from TAPT and ST are strongly additive when following the TFS procedure. We hope that TFS can serve as an important semi-supervised baseline for future NLP studies.
    Commonsense-Focused Dialogues for Response Generation: An Empirical Study. (arXiv:2109.06427v1 [cs.CL])
    (2 min) Smooth and effective communication requires the ability to perform latent or explicit commonsense inference. Prior commonsense reasoning benchmarks (such as SocialIQA and CommonsenseQA) mainly focus on the discriminative task of choosing the right answer from a set of candidates, and do not involve interactive language generation as in dialogue. Moreover, existing dialogue datasets do not explicitly focus on exhibiting commonsense as a facet. In this paper, we present an empirical study of commonsense in dialogue response generation. We first auto-extract commonsensical dialogues from existing dialogue datasets by leveraging ConceptNet, a commonsense knowledge graph. Furthermore, building on social contexts/situations in SocialIQA, we collect a new dialogue dataset with 25K dialogues aimed at exhibiting social commonsense in an interactive setting. We evaluate response generation models trained using these datasets and find that models trained on both extracted and our collected data produce responses that consistently exhibit more commonsense than baselines. Finally we propose an approach for automatic evaluation of commonsense that relies on features derived from ConceptNet and pre-trained language and dialog models, and show reasonable correlation with human evaluation of responses' commonsense quality. We are releasing a subset of our collected data, Commonsense-Dialogues, containing about 11K dialogs.
    Quantum Mathematics in Artificial Intelligence. (arXiv:2101.04255v4 [cs.AI] UPDATED)
    (3 min) In the decade since 2010, successes in artificial intelligence have been at the forefront of computer science and technology, and vector space models have solidified a position at the forefront of artificial intelligence. At the same time, quantum computers have become much more powerful, and announcements of major advances are frequently in the news. The mathematical techniques underlying both these areas have more in common than is sometimes realized. Vector spaces took a position at the axiomatic heart of quantum mechanics in the 1930s, and this adoption was a key motivation for the derivation of logic and probability from the linear geometry of vector spaces. Quantum interactions between particles are modelled using the tensor product, which is also used to express objects and operations in artificial neural networks. This paper describes some of these common mathematical areas, including examples of how they are used in artificial intelligence (AI), particularly in automated reasoning and natural language processing (NLP). Techniques discussed include vector spaces, scalar products, subspaces and implication, orthogonal projection and negation, dual vectors, density matrices, positive operators, and tensor products. Application areas include information retrieval, categorization and implication, modelling word-senses and disambiguation, inference in knowledge bases, and semantic composition. Some of these approaches can potentially be implemented on quantum hardware. Many of the practical steps in this implementation are in early stages, and some are already realized. Explaining some of the common mathematical tools can help researchers in both AI and quantum computing further exploit these overlaps, recognizing and exploring new directions along the way.
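    One of the techniques listed above, negation via orthogonal projection, is simple to reproduce with plain linear algebra; the sketch below (with toy vectors, not data from the paper) removes the component of a word vector that lies along an unwanted sense direction.

```python
# Orthogonal projection as negation: "suit NOT lawsuit" keeps the part of the
# "suit" vector orthogonal to the "lawsuit" direction. Vectors are toy values.
import numpy as np

def negate(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Project a onto the subspace orthogonal to b (i.e., a NOT b)."""
    b_unit = b / np.linalg.norm(b)
    return a - np.dot(a, b_unit) * b_unit

suit = np.array([0.9, 0.4, 0.1])       # toy embedding mixing "clothing" and "legal" senses
lawsuit = np.array([0.1, 0.9, 0.0])    # toy embedding of the legal sense

clothing_only = negate(suit, lawsuit)
print(np.dot(clothing_only, lawsuit))  # ~0: the legal component has been removed
```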
    YES SIR! Optimizing Semantic Space of Negatives with Self-Involvement Ranker. (arXiv:2109.06436v1 [cs.IR])
    (2 min) Pre-trained models such as BERT have proved to be effective tools for dealing with Information Retrieval (IR) problems. Due to their inspiring performance, they have been widely used to tackle real-world IR problems such as document ranking. Recently, researchers have found that selecting "hard" rather than "random" negative samples is beneficial for fine-tuning pre-trained models on ranking tasks. However, it remains unclear how to leverage hard negative samples in a principled way. To address this issue, we propose a fine-tuning strategy for document ranking, namely the Self-Involvement Ranker (SIR), which dynamically selects hard negative samples to construct a high-quality semantic space for training a high-quality ranking model. Specifically, SIR consists of sequential compressors implemented with pre-trained models: the front compressor selects hard negative samples for the rear compressor. Moreover, SIR leverages a supervisory signal to adaptively adjust the semantic space of negative samples. The supervisory signal in the rear compressor is computed from conditional probabilities and can thus control the sample dynamics and further enhance model performance. SIR is a lightweight and general framework for pre-trained models that simplifies the ranking process in industry practice. We test the proposed solution on MS MARCO in the document ranking setting, and the results show that SIR significantly improves the ranking performance of various pre-trained models. Moreover, our method became the new SOTA model anonymously on the MS MARCO document ranking leaderboard in May 2021.
    L2R2: Leveraging Ranking for Abductive Reasoning. (arXiv:2005.11223v2 [cs.IR] UPDATED)
    (2 min) The abductive natural language inference task ($\alpha$NLI) was proposed to evaluate the abductive reasoning ability of a learning system. In the $\alpha$NLI task, two observations are given and the most plausible hypothesis must be picked out from a set of candidates. Existing methods simply formulate it as a classification problem and therefore train with a cross-entropy log-loss objective. However, discriminating true from false does not measure the plausibility of a hypothesis: all hypotheses have some chance of occurring, only with different probabilities. To fill this gap, we switch to a ranking perspective that sorts the hypotheses in order of their plausibility. With this new perspective, a novel $L2R^2$ approach is proposed under the learning-to-rank framework. First, training samples are reorganized into a ranking form, where the two observations and their hypotheses are treated as the query and a set of candidate documents, respectively. Then, an ESIM model or a pre-trained language model, e.g., BERT or RoBERTa, is used as the scoring function. Finally, the loss function for the ranking task can be either pair-wise or list-wise. The experimental results on the ART dataset reach the state of the art on the public leaderboard.
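    A minimal sketch of the pair-wise variant of such a learning-to-rank objective (the margin, the toy scores, and the gold signal are assumptions, not the paper's configuration) is shown below: hypotheses scored by any encoder are compared in pairs so that more plausible ones receive higher scores.

```python
import torch
import torch.nn as nn

# Illustrative pair-wise ranking loss for aNLI-style hypothesis ranking.
# `scores` would come from an ESIM/BERT/RoBERTa scorer over (observations, hypothesis).
margin_loss = nn.MarginRankingLoss(margin=0.5)

def pairwise_ranking_loss(scores: torch.Tensor, plausibility: torch.Tensor) -> torch.Tensor:
    """scores: [num_hypotheses], plausibility: [num_hypotheses] gold ranking signal."""
    total, count = scores.new_zeros(()), 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if plausibility[i] > plausibility[j]:
                # hypothesis i should outscore hypothesis j by at least the margin
                total = total + margin_loss(scores[i:i+1], scores[j:j+1], torch.ones(1))
                count += 1
    return total / max(count, 1)

scores = torch.tensor([0.2, 1.3, 0.7], requires_grad=True)
gold = torch.tensor([0.0, 1.0, 0.5])   # toy plausibility labels
loss = pairwise_ranking_loss(scores, gold)
loss.backward()
```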
    Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration. (arXiv:2109.06304v1 [cs.CL])
    (2 min) Phrase representations derived from BERT often do not exhibit complex phrasal compositionality, as the model relies instead on lexical similarity to determine semantic relatedness. In this paper, we propose a contrastive fine-tuning objective that enables BERT to produce more powerful phrase embeddings. Our approach (Phrase-BERT) relies on a dataset of diverse phrasal paraphrases, which is automatically generated using a paraphrase generation model, as well as a large-scale dataset of phrases in context mined from the Books3 corpus. Phrase-BERT outperforms baselines across a variety of phrase-level similarity tasks, while also demonstrating increased lexical diversity between nearest neighbors in the vector space. Finally, as a case study, we show that Phrase-BERT embeddings can be easily integrated with a simple autoencoder to build a phrase-based neural topic model that interprets topics as mixtures of words and phrases by performing a nearest neighbor search in the embedding space. Crowdsourced evaluations demonstrate that this phrase-based topic model produces more coherent and meaningful topics than baseline word and phrase-level topic models, further validating the utility of Phrase-BERT.
    Adaptive Information Seeking for Open-Domain Question Answering. (arXiv:2109.06747v1 [cs.CL])
    (2 min) Information seeking is an essential step for open-domain question answering to efficiently gather evidence from a large corpus. Recently, iterative approaches have been proven to be effective for complex questions, by recursively retrieving new evidence at each step. However, almost all existing iterative approaches use predefined strategies, either applying the same retrieval function multiple times or fixing the order of different retrieval functions, which cannot fulfill the diverse requirements of various questions. In this paper, we propose a novel adaptive information-seeking strategy for open-domain question answering, namely AISO. Specifically, the whole retrieval and answer process is modeled as a partially observed Markov decision process, where three types of retrieval operations (e.g., BM25, DPR, and hyperlink) and one answer operation are defined as actions. According to the learned policy, AISO could adaptively select a proper retrieval action to seek the missing evidence at each step, based on the collected evidence and the reformulated query, or directly output the answer when the evidence set is sufficient for the question. Experiments on SQuAD Open and HotpotQA fullwiki, which serve as single-hop and multi-hop open-domain QA benchmarks, show that AISO outperforms all baseline methods with predefined strategies in terms of both retrieval and answer evaluations.
    Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction. (arXiv:2109.06798v1 [cs.CL])
    (2 min) Zero-shot cross-lingual information extraction (IE) describes the construction of an IE model for some target language, given existing annotations exclusively in some other language, typically English. While the advance of pretrained multilingual encoders suggests an easy optimism of "train on English, run on any language", we find through a thorough exploration and extension of techniques that a combination of approaches, both new and old, leads to better performance than any one cross-lingual strategy in particular. We explore techniques including data projection and self-training, and how different pretrained encoders impact them. We use English-to-Arabic IE as our initial example, demonstrating strong performance in this setting for event extraction, named entity recognition, part-of-speech tagging, and dependency parsing. We then apply data projection and self-training to three tasks across eight target languages. Because no single set of techniques performs the best across all tasks, we encourage practitioners to explore various configurations of the techniques described in this work when seeking to improve on zero-shot training.
    Legal Transformer Models May Not Always Help. (arXiv:2109.06862v1 [cs.CL])
    (2 min) Deep learning-based natural language processing methods, especially transformers, have achieved impressive performance in the last few years. Applying these state-of-the-art NLP methods to legal activities to automate or simplify routine work is of great value. This work investigates the value of domain-adaptive pre-training and language adapters in legal NLP tasks. By comparing the performance of language models with domain-adaptive pre-training on different tasks and different dataset splits, we show that domain-adaptive pre-training helps only with low-resource downstream tasks and is thus far from being a panacea. We also benchmark the performance of adapters in a typical legal NLP task and show that they can yield performance similar to full model tuning at much smaller training cost. As an additional result, we release LegalRoBERTa, a RoBERTa model further pre-trained on legal corpora.
    What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey. (arXiv:2109.06808v1 [cs.CR])
    (2 min) Cybersecurity researchers have contributed to the automated extraction of cyber threat intelligence (CTI) from textual sources, such as threat reports and online articles, where cyberattack strategies, procedures, and tools are described. The goal of this article is to help cybersecurity researchers understand the current techniques for CTI extraction from text through a survey of relevant studies in the literature. We systematically collect studies related to "CTI extraction from text" and categorize the purposes of CTI extraction. We propose a CTI extraction pipeline abstracted from these studies and identify the data sources, techniques, and CTI sharing formats used in the context of the proposed pipeline. Our work finds ten types of extraction purposes, such as extraction of indicators of compromise, TTPs (tactics, techniques, and procedures of attack), and cybersecurity keywords, as well as seven types of textual sources; textual data obtained from hacker forums, threat reports, social media posts, and online news articles is used by almost 90% of the studies. Natural language processing, along with both supervised and unsupervised machine learning techniques such as named entity recognition, topic modelling, dependency parsing, supervised classification, and clustering, is used for CTI extraction. We also observe technical challenges in these studies related to obtaining clean, labelled data that would allow replication, validation, and further extension of the work. As the surveyed studies focus on CTI information extraction from text, we advocate building on this work to support cybersecurity practitioners in proactive decision making, such as threat prioritization and automated threat modelling that draws on knowledge from past cybersecurity incidents.
    Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training. (arXiv:2104.07500v2 [cs.CL] UPDATED)
    (2 min) Language grounding aims at linking the symbolic representation of language (e.g., words) to the rich perceptual knowledge of the outside world. The general approach is to embed both textual and visual information into a common space (the grounded space) confined by an explicit relationship between the two modalities. We argue that this approach sacrifices the abstract knowledge obtained from linguistic co-occurrence statistics in the process of acquiring perceptual information. The focus of this paper is to solve this issue by implicitly grounding the word embeddings. Rather than learning two mappings into a joint space, our approach integrates modalities by determining a reversible grounded mapping between the textual and the grounded space by means of multi-task learning. Evaluations on intrinsic and extrinsic tasks show that our embeddings are highly beneficial for both abstract and concrete words. They are strongly correlated with human judgments and outperform previous works on a wide range of benchmarks. Our grounded embeddings are publicly available.
    conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers. (arXiv:2109.06501v1 [cs.CL])
    (2 min) In this paper we focus on constructing useful embeddings of the textual information in vacancies and resumes, which we aim to incorporate as features into job-to-job-seeker matching models alongside other features. We describe our task, in which noisy data from parsed resumes, the heterogeneous nature of the different data sources, and cross-linguality and multilinguality present domain-specific challenges. We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT) model, which we call conSultantBERT, using a large-scale, real-world, high-quality dataset of over 270,000 resume-vacancy pairs labeled by our staffing consultants. We show that our fine-tuned model significantly outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted feature vectors and BERT embeddings. In addition, we find that our model successfully matches cross-lingual and multilingual textual content.
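    For readers wanting to try something similar, a minimal sentence-transformers fine-tuning loop on labeled text pairs could look like the following; the model name, toy pairs, and labels are assumptions for illustration, and the paper's data and training setup are proprietary and more elaborate.

```python
# Hedged sketch of Siamese SBERT fine-tuning on resume-vacancy style pairs.
# Model name, example pairs, and similarity labels are illustrative placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train_examples = [
    InputExample(texts=["Senior Java developer, Spring, 5y experience",
                        "We seek a backend engineer with Java and Spring"], label=0.9),
    InputExample(texts=["Pastry chef with hotel experience",
                        "Hiring a frontend React developer"], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One short epoch just to illustrate the API; real training uses far more data.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```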
    Improving and Simplifying Pattern Exploiting Training. (arXiv:2103.11955v2 [cs.CL] UPDATED)
    (2 min) Recently, pre-trained language models (LMs) have achieved strong performance when fine-tuned on difficult benchmarks like SuperGLUE. However, performance can suffer when there are very few labeled examples available for fine-tuning. Pattern Exploiting Training (PET) is a recent approach that leverages patterns for few-shot learning. However, PET uses task-specific unlabeled data. In this paper, we focus on few-shot learning without any unlabeled data and introduce ADAPET, which modifies PET's objective to provide denser supervision during fine-tuning. As a result, ADAPET outperforms PET on SuperGLUE without any task-specific unlabeled data. Our code can be found at https://github.com/rrmenon10/ADAPET.
    Exploring Hyper-Parameter Optimization for Neural Machine Translation on GPU Architectures. (arXiv:1805.02094v2 [cs.CL] UPDATED)
    (2 min) Neural machine translation (NMT) has overtaken statistical approaches thanks to deep neural networks, enabled by the plethora and programmability of commodity heterogeneous computing architectures such as FPGAs and GPUs and by the massive training corpora generated by news outlets, government agencies, and social media. Training a neural network for translation entails tuning hyper-parameters to find the configuration that yields the best performance. Unfortunately, the hyper-parameters for machine translation include discrete categories as well as continuous options, which makes for a combinatorially explosive problem. This research explores optimizing hyper-parameters when training deep learning neural networks for machine translation. Specifically, our work investigates training a language model with Marian NMT. Results compare NMT under various hyper-parameter settings across a variety of modern GPU architecture generations in single-node and multi-node settings, revealing which hyper-parameters matter most for performance, such as words processed per second, convergence rates, and translation accuracy, and providing insights on how best to achieve high-performing NMT systems.
    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning. (arXiv:2109.06860v1 [cs.CL])
    (2 min) Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models on non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than their performance on Western regions. We analyze the reasons behind this disparity and find that the performance gap is larger on QA pairs that: 1) concern culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
    Gradient Imitation Reinforcement Learning for Low Resource Relation Extraction. (arXiv:2109.06415v1 [cs.CL])
    (2 min) Low-resource relation extraction (LRE) aims to extract relational facts from limited labeled corpora when human annotation is scarce. Existing works either use a self-training scheme to generate pseudo labels, which suffers from gradual drift, or use a meta-learning scheme, which does not solicit feedback explicitly. To alleviate the selection bias caused by the lack of feedback loops in existing LRE learning paradigms, we develop a Gradient Imitation Reinforcement Learning method that encourages pseudo-labeled data to imitate the gradient descent direction on labeled data, and bootstraps its optimization capability through trial and error. We also propose a framework called GradLRE, which handles two major scenarios in low-resource relation extraction: besides the scenario where unlabeled data is sufficient, GradLRE handles the situation where no unlabeled data is available by exploiting a contextualized augmentation method to generate data. Experimental results on two public datasets demonstrate the effectiveness of GradLRE on low-resource relation extraction compared with baselines.
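    The gradient-imitation idea lends itself to a compact sketch: compute the gradient on a pseudo-labeled batch and on a labeled batch, and reward the pseudo-labeled batch by the cosine similarity of the two gradient directions. Everything below (the toy model, loss, and batches) is a placeholder illustration, not the GradLRE implementation.

```python
import torch

def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_imitation_reward(model, loss_fn, labeled_batch, pseudo_batch):
    """Cosine similarity between labeled-data and pseudo-labeled-data gradients.
    Higher reward means the pseudo labels push the model in a similar direction."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_labeled = flat_grad(loss_fn(model, labeled_batch), params)
    g_pseudo = flat_grad(loss_fn(model, pseudo_batch), params)
    return torch.nn.functional.cosine_similarity(g_labeled, g_pseudo, dim=0)

# Toy usage with a linear model and squared-error loss.
model = torch.nn.Linear(4, 1)
loss_fn = lambda m, batch: ((m(batch[0]) - batch[1]) ** 2).mean()
labeled = (torch.randn(8, 4), torch.randn(8, 1))
pseudo = (torch.randn(8, 4), torch.randn(8, 1))
print(gradient_imitation_reward(model, loss_fn, labeled, pseudo).item())
```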
    ePiC: Employing Proverbs in Context as a Benchmark for Abstract Language Understanding. (arXiv:2109.06838v1 [cs.CL])
    (2 min) While large language models have shown exciting progress on several NLP benchmarks, evaluating their ability for complex analogical reasoning remains under-explored. Here, we introduce a high-quality crowdsourced dataset of narratives for employing proverbs in context as a benchmark for abstract language understanding. The dataset provides fine-grained annotation of aligned spans between proverbs and narratives, and contains minimal lexical overlaps between narratives and proverbs, ensuring that models need to go beyond surface-level reasoning to succeed. We explore three tasks: (1) proverb recommendation and alignment prediction, (2) narrative generation for a given proverb and topic, and (3) identifying narratives with similar motifs. Our experiments show that neural language models struggle in our tasks compared to humans, and the tasks pose multiple learning challenges.
    Uncertainty-Aware Machine Translation Evaluation. (arXiv:2109.06352v1 [cs.CL])
    (2 min) Several neural-based metrics have been recently proposed to evaluate machine translation quality. However, all of them resort to point estimates, which provide limited information at segment level. This is made worse as they are trained on noisy, biased and scarce human judgements, often resulting in unreliable quality predictions. In this paper, we introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality. We combine the COMET framework with two uncertainty estimation methods, Monte Carlo dropout and deep ensembles, to obtain quality scores along with confidence intervals. We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task, augmented with MQM annotations. We experiment with varying numbers of references and further discuss the usefulness of uncertainty-aware quality estimation (without references) to flag possibly critical translation mistakes.
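    Monte Carlo dropout, one of the two uncertainty estimators mentioned, is straightforward to apply to any regression-style quality model. Below is a generic hedged sketch (the tiny network and random features are placeholders, not COMET) that keeps dropout active at inference time and reports a mean score with a spread.

```python
import torch
import torch.nn as nn

# Generic MC-dropout sketch: run the model several times with dropout enabled
# and summarize the resulting score distribution. The network is a stand-in,
# not the COMET architecture.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=30):
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)], dim=0)
    mean, std = samples.mean(dim=0), samples.std(dim=0)
    return mean, std                   # std can be turned into a confidence interval

features = torch.randn(1, 16)          # placeholder segment-level features
mean, std = mc_dropout_predict(model, features)
print(f"quality ~ {mean.item():.3f} +/- {1.96 * std.item():.3f}")
```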
    A Temporal Variational Model for Story Generation. (arXiv:2109.06807v1 [cs.CL])
    (2 min) Recent language models can generate interesting and grammatically correct text in story generation but often lack plot development and long-term coherence. This paper experiments with a latent vector planning approach based on a TD-VAE (Temporal Difference Variational Autoencoder), using the model for conditioning and reranking for text generation. The results demonstrate strong performance in automatic cloze and swapping evaluations. The human judgments show stories generated with TD-VAE reranking improve on a GPT-2 medium baseline and show comparable performance to a hierarchical LSTM reranking model. Conditioning on the latent vectors proves disappointing and deteriorates performance in human evaluation because it reduces the diversity of generation, and the models don't learn to progress the narrative. This highlights an important difference between technical task performance (e.g. cloze) and generating interesting stories.
    Rationales for Sequential Predictions. (arXiv:2109.06387v1 [cs.CL])
    (2 min) Sequence models are a critical component of modern NLP systems, but their predictions are difficult to explain. We consider model explanations through rationales, subsets of context that can explain individual model predictions. We find sequential rationales by solving a combinatorial optimization: the best rationale is the smallest subset of input tokens that would predict the same output as the full sequence. Enumerating all subsets is intractable, so we propose an efficient greedy algorithm to approximate this objective. The algorithm, called greedy rationalization, applies to any model. For this approach to be effective, the model should form compatible conditional distributions when making predictions on incomplete subsets of the context. This condition can be enforced with a short fine-tuning step. We study greedy rationalization on language modeling and machine translation. Compared to existing baselines, greedy rationalization is best at optimizing the combinatorial objective and provides the most faithful rationales. On a new dataset of annotated sequential rationales, greedy rationales are the most similar to human rationales.
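    The greedy procedure described above can be expressed very compactly; the sketch below is a model-agnostic illustration (the `predict` and `prob_of` callables are an assumed interface, not a specific library API) that grows the rationale one token at a time until the model's prediction on the subset matches its prediction on the full context.

```python
def greedy_rationalize(tokens, predict, prob_of):
    """Greedy rationalization sketch.

    predict(indices)     -> the model's prediction given only those token indices.
    prob_of(indices, y)  -> the model's score for prediction y given those indices.
    Returns the smallest greedily-grown index set whose prediction matches the
    prediction made from the full context.
    """
    target = predict(range(len(tokens)))
    rationale = []
    while predict(rationale) != target:
        remaining = [i for i in range(len(tokens)) if i not in rationale]
        # Add the token that most increases the score of the full-context prediction.
        best = max(remaining, key=lambda i: prob_of(rationale + [i], target))
        rationale.append(best)
    return sorted(rationale)

# Toy usage: the "model" predicts the majority sign of the selected numbers.
nums = [3, -1, -4, 1, -5, -9]
predict = lambda idx: 1 if sum(nums[i] for i in idx) >= 0 else -1
prob_of = lambda idx, y: y * sum(nums[i] for i in idx)   # crude monotone score
print(greedy_rationalize(nums, predict, prob_of))        # e.g. [5]
```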
    Summarize-then-Answer: Generating Concise Explanations for Multi-hop Reading Comprehension. (arXiv:2109.06853v1 [cs.CL])
    (2 min) How can we generate concise explanations for multi-hop Reading Comprehension (RC)? The current strategy of identifying supporting sentences can be seen as an extractive, question-focused summarization of the input text. However, these extractive explanations are not necessarily concise, i.e., not minimally sufficient for answering a question. Instead, we advocate an abstractive approach, where we propose to generate a question-focused, abstractive summary of the input paragraphs and then feed it to an RC system. Given a limited amount of human-annotated abstractive explanations, we train the abstractive explainer in a semi-supervised manner: we start from the supervised model and then train it further through trial and error, maximizing a conciseness-promoting reward function. Our experiments demonstrate that the proposed abstractive explainer can generate more compact explanations than an extractive explainer with limited supervision (only 2k instances) while maintaining sufficiency.
    Evaluating Multiway Multilingual NMT in the Turkic Languages. (arXiv:2109.06262v1 [cs.CL])
    (2 min) Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people who speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on a downstream task for a single pair also results in a large performance boost in both low- and high-resource scenarios. Our analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public.
    Post-OCR Document Correction with large Ensembles of Character Sequence Models. (arXiv:2109.06264v1 [cs.CL])
    (2 min) In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document in character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each one of the members of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
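    The windowing-and-voting strategy is independent of the underlying sequence model; the sketch below (window size, stride, and the toy character-substitution "corrector" are assumptions) shows how overlapping character windows can be corrected separately and recombined by per-position majority vote.

```python
from collections import Counter

def correct_document(text, correct_window, window=20, stride=10):
    """Correct overlapping character windows independently and merge them by
    per-position majority vote. `correct_window` stands in for the trained
    character sequence-to-sequence model (here it must preserve length)."""
    votes = [Counter() for _ in range(len(text))]
    for start in range(0, len(text), stride):
        chunk = text[start:start + window]
        fixed = correct_window(chunk)
        for offset, ch in enumerate(fixed[:window]):
            votes[start + offset][ch] += 1
    return "".join(v.most_common(1)[0][0] if v else text[i] for i, v in enumerate(votes))

# Toy usage with a "model" that fixes a common OCR confusion (0 -> o).
noisy = "the quick br0wn f0x jumps 0ver the lazy d0g"
fix = lambda s: s.replace("0", "o")   # placeholder for the real seq2seq corrector
print(correct_document(noisy, fix))
```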
    Cross-document Event Identity via Dense Annotation. (arXiv:2109.06417v1 [cs.CL])
    (2 min) In this paper, we study the identity of textual events from different documents. While the complex nature of event identity is previously studied (Hovy et al., 2013), the case of events across documents is unclear. Prior work on cross-document event coreference has two main drawbacks. First, they restrict the annotations to a limited set of event types. Second, they insufficiently tackle the concept of event identity. Such annotation setup reduces the pool of event mentions and prevents one from considering the possibility of quasi-identity relations. We propose a dense annotation approach for cross-document event coreference, comprising a rich source of event mentions and a dense annotation effort between related document pairs. To this end, we design a new annotation workflow with careful quality control and an easy-to-use annotation interface. In addition to the links, we further collect overlapping event contexts, including time, location, and participants, to shed some light on the relation between identity decisions and context. We present an open-access dataset for cross-document event coreference, CDEC-WN, collected from English Wikinews and open-source our annotation toolkit to encourage further research on cross-document tasks.
    MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks. (arXiv:2109.06275v1 [cs.AI])
    (2 min) An ideal integration of autonomous agents in a human world implies that they are able to collaborate on human terms. In particular, theory of mind plays an important role in maintaining common ground during human collaboration and communication. To enable theory of mind modeling in situated interactions, we introduce a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners' beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication. As a first step towards our goal of developing embodied AI agents able to infer belief states of collaborative partners in situ, we build and present results on computational models for several theory of mind tasks.
    Exploring Prompt-based Few-shot Learning for Grounded Dialog Generation. (arXiv:2109.06513v1 [cs.CL])
    (0 min) Dialog grounding enables conversational models to make full use of external information to establish multiple desired qualities, such as knowledgeable, engaging and empathetic. However, naturally grounded dialog corpora are usually not directly available, which puts forward requirements for the few-shot learning ability of conversational models. Motivated by recent advances in pre-trained language models and prompt-based learning, in this paper we explore prompt-based few-shot learning for grounded dialog generation (GDG). We first formulate the prompt construction for GDG tasks, based on which we then conduct comprehensive empirical analysis on two common types of prompting methods: template-based prompting and soft-prompting. We demonstrate the potential of prompt-based methods in few-shot learning for GDG and provide directions of improvement for future work.
    Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers. (arXiv:2109.06488v1 [cs.CL])
    (0 min) Automated movie genre classification has emerged as an active and essential area of research. Short movie trailers provide useful insights about a movie, as their video content carries both cognitive- and affective-level features. Previous approaches focused on either cognitive or affective content analysis. In this paper, we propose a novel multi-modal movie genre classification framework based on situation, dialogue, and metadata that takes both cognition- and affect-based features into consideration. It is a pre-feature-fusion framework that combines: situation-based features from regular snapshots of a trailer, whose nouns and verbs provide a useful affect-based mapping to the corresponding genres; dialogue (speech) based features from the audio; and metadata, which together provide the relevant information for cognitive and affective video analysis. We also develop the English Movie Trailer Dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres: Action, Romance, Comedy, Horror, and Science Fiction, and perform cross-validation on the standard LMTD-9 dataset to validate the proposed framework. The results demonstrate that the proposed methodology performs well, as indicated by the F1 scores, precision, recall, and areas under the precision-recall curves.
    Tribrid: Stance Classification with Neural Inconsistency Detection. (arXiv:2109.06508v1 [cs.CL])
    (0 min) We study the problem of automatic stance classification on social media with neural architectures such as BERT. Although these architectures deliver impressive results, their performance is not yet comparable to that of humans, and they may produce errors that have a significant impact on the downstream task (e.g., fact-checking). To improve performance, we present a new neural architecture in which the input also includes automatically generated negated perspectives over a given claim. The model is jointly trained to make multiple predictions simultaneously, which can be used either to improve the classification of the original perspective or to filter out doubtful predictions. In the first case, we propose a weakly supervised method for combining the predictions into a final one. In the second case, we show that using the confidence scores to remove doubtful predictions allows our method to achieve human-like performance over the retained information, which is still a sizable part of the original input.
    Netmarble AI Center's WMT21 Automatic Post-Editing Shared Task Submission. (arXiv:2109.06515v1 [cs.CL])
    (0 min) This paper describes Netmarble's submission to the WMT21 Automatic Post-Editing (APE) Shared Task for the English-German language pair. First, we propose a curriculum training strategy across training stages. Facebook FAIR's WMT19 news translation model was chosen to leverage a large and powerful pre-trained neural network. We then post-train the translation model with different levels of data at each training stage, gradually adding extra information so that the system learns to solve multiple tasks. We also show a way to utilize large volumes of additional data for APE tasks. For further improvement, we apply a multi-task learning strategy with the Dynamic Weight Average during the fine-tuning stage. To fine-tune on the limited APE corpus, we add related subtasks to learn a unified representation. Finally, for better performance, we leverage external translations as augmented machine translation (MT) during post-training and fine-tuning. As the experimental results show, our APE system significantly improves the provided MT results by -2.848 TER and +3.74 BLEU on the development dataset. It also demonstrates its effectiveness on the test dataset, achieving higher quality there than on the development dataset.
    Evaluating Transferability of BERT Models on Uralic Languages. (arXiv:2109.06327v1 [cs.CL])
    (0 min) Transformer-based language models such as BERT have outperformed previous models on a large number of English benchmarks, but their evaluation is often limited to English or a small number of well-resourced languages. In this work, we evaluate monolingual, multilingual, and randomly initialized language models from the BERT family on a variety of Uralic languages including Estonian, Finnish, Hungarian, Erzya, Moksha, Karelian, Livvi, Komi Permyak, Komi Zyrian, Northern Sámi, and Skolt Sámi. When monolingual models are available (currently only et, fi, hu), these perform better on their native language, but in general they transfer worse than multilingual models or models of genetically unrelated languages that share the same character set. Remarkably, straightforward transfer of high-resource models, even without special efforts toward hyperparameter optimization, yields what appear to be state-of-the-art POS and NER tools for the minority Uralic languages where there is sufficient data for fine-tuning.
    STraTA: Self-Training with Task Augmentation for Better Few-shot Learning. (arXiv:2109.06270v1 [cs.CL])
    (0 min) Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective.
    Learning Constraints and Descriptive Segmentation for Subevent Detection. (arXiv:2109.06316v1 [cs.CL])
    (0 min) Event mentions in text correspond to real-world events of varying degrees of granularity. The task of subevent detection aims to resolve this granularity issue, recognizing the membership of multi-granular events in event complexes. Since knowing the span of descriptive contexts of event complexes helps infer the membership of events, we propose the task of event-based text segmentation (EventSeg) as an auxiliary task to improve the learning for subevent detection. To bridge the two tasks together, we propose an approach to learning and enforcing constraints that capture dependencies between subevent detection and EventSeg prediction, as well as guiding the model to make globally consistent inference. Specifically, we adopt Rectifier Networks for constraint learning and then convert the learned constraints to a regularization term in the loss function of the neural model. Experimental results show that the proposed method outperforms baseline methods by 2.3% and 2.5% on benchmark datasets for subevent detection, HiEve and IC, respectively, while achieving a decent performance on EventSeg prediction.
    Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation. (arXiv:2109.06253v1 [cs.CL])
    (2 min) Neural Machine Translation (NMT) is known to suffer from a beam-search problem: after a certain point, increasing the beam size causes an overall drop in translation quality. This effect is especially pronounced for long sentences. While much work has analyzed this phenomenon, primarily for autoregressive NMT models, there is still no consensus on its underlying cause. In this work, we analyze errors that cause major quality degradation with large beams in NMT and Automatic Speech Recognition (ASR). We show that a factor that strongly contributes to the quality degradation with large beams is dataset length bias: NMT datasets are strongly biased towards short sentences. To mitigate this issue, we propose a new data augmentation technique, Multi-Sentence Resampling (MSR). This technique extends the training examples by concatenating several sentences from the original dataset into one long training example. We demonstrate that MSR significantly reduces degradation with growing beam size and improves final translation quality on the IWSLT'15 En-Vi, IWSLT'17 En-Fr, and WMT'14 En-De datasets.
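    MSR itself is a very small augmentation routine; the sketch below (the sampling choices are assumed reasonable defaults, not the paper's exact settings) concatenates a random number of parallel sentence pairs into longer training examples.

```python
import random

def multi_sentence_resample(pairs, num_examples, max_concat=4, seed=0):
    """Multi-Sentence Resampling sketch: build longer training examples by
    concatenating 1..max_concat randomly chosen parallel sentence pairs.
    Concatenation counts and sampling are illustrative defaults."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_examples):
        k = rng.randint(1, max_concat)
        chosen = [rng.choice(pairs) for _ in range(k)]
        src = " ".join(s for s, _ in chosen)
        tgt = " ".join(t for _, t in chosen)
        augmented.append((src, tgt))
    return augmented

corpus = [("ich bin hier", "i am here"), ("wo bist du", "where are you")]
print(multi_sentence_resample(corpus, num_examples=3))
```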
    Graph Algorithms for Multiparallel Word Alignment. (arXiv:2109.06283v1 [cs.CL])
    (2 min) With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in $F_1$ of up to 28% over the baseline bilingual word aligner in different datasets.
    Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation. (arXiv:2109.06379v1 [cs.CL])
    (2 min) Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives and calls for different properties of the generated text. This complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on specific intuitions. In this paper, we propose a unifying perspective based on the nature of information change in NLG tasks, including compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). Information alignment between input, context, and output text plays a common central role in characterizing the generation. With automatic alignment prediction models, we develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data. Experiments show the uniformly designed metrics achieve stronger or comparable correlations with human judgement compared to state-of-the-art metrics in each of several diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog.
    Mitigating Catastrophic Forgetting in Scheduled Sampling with Elastic Weight Consolidation in Neural Machine Translation. (arXiv:2109.06308v1 [cs.CL])
    (2 min) Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e., a discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and often empirically successful approach which addresses this issue by incorporating model-generated prefixes into the training process. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that it ameliorates exposure bias by increasing model reliance on the input sequence. We also observe that as a side-effect, it worsens performance when the model-generated prefix is correct, a form of catastrophic forgetting. We propose using Elastic Weight Consolidation as a trade-off between mitigating exposure bias and retaining output quality. Experiments on two IWSLT'14 translation tasks demonstrate that our approach alleviates catastrophic forgetting and significantly improves BLEU compared to standard scheduled sampling.
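    A hedged sketch of the Elastic Weight Consolidation penalty that can be added to a scheduled-sampling loss, illustrating the trade-off described above. The anchoring scheme (which checkpoint supplies `anchor_params` and the Fisher diagonal) and the weight `lam` are illustrative assumptions, not the paper's exact setup.

```python
import torch

def ewc_penalty(model, anchor_params, fisher, lam=1.0):
    """Quadratic EWC penalty: lam/2 * sum_i F_i * (theta_i - theta*_i)^2.
    `anchor_params` and `fisher` map parameter names to tensors computed
    from a previously trained (e.g., MLE) model."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return 0.5 * lam * loss

# Hypothetical usage inside a training step:
# total_loss = scheduled_sampling_loss + ewc_penalty(model, anchor_params, fisher, lam=0.1)
```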
    KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation. (arXiv:2109.06243v1 [cs.CL])
    (2 min) The development of over-parameterized pre-trained language models has made a significant contribution toward the success of natural language processing. While over-parameterization of these models is the key to their generalization power, it makes them unsuitable for deployment on low-capacity devices. We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition. We use this decomposition for compression of the embedding layer, all linear mappings in the multi-head attention, and the feed-forward network modules in the Transformer layer. We perform intermediate-layer knowledge distillation using the uncompressed model as the teacher to improve the performance of the compressed model. We present KroneckerBERT, a compressed version of the BERT_BASE model obtained using this framework. We evaluate the performance of KroneckerBERT on well-known NLP benchmarks and show that for a high compression factor of 19 (5% of the size of the BERT_BASE model), our KroneckerBERT outperforms state-of-the-art compression methods on the GLUE benchmark. Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to the state-of-the-art compression methods on SQuAD.
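    A minimal sketch of a Kronecker-factored linear layer, the basic building block behind this kind of compression: a large weight matrix is approximated by the Kronecker product of two much smaller factors. The factor shapes, initialization, and the plain matmul realisation below are illustrative assumptions; the paper's exact factorisation and distillation setup are not reproduced.

```python
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Approximates a (out1*out2) x (in1*in2) weight matrix by kron(A, B),
    storing only out1*in1 + out2*in2 parameters instead of the full matrix."""
    def __init__(self, in1, in2, out1, out2):
        super().__init__()
        self.A = nn.Parameter(torch.randn(out1, in1) * 0.02)
        self.B = nn.Parameter(torch.randn(out2, in2) * 0.02)

    def forward(self, x):                                  # x: (batch, in1*in2)
        b = x.shape[0]
        X = x.view(b, self.A.shape[1], self.B.shape[1])    # (b, in1, in2)
        # Applying kron(A, B) is equivalent (up to a fixed coordinate ordering)
        # to the two small matmuls A @ X @ B^T.
        Y = self.A @ X @ self.B.t()                        # (b, out1, out2)
        return Y.reshape(b, -1)

layer = KroneckerLinear(in1=32, in2=24, out1=32, out2=24)  # stands in for a 768x768 mapping
print(layer(torch.randn(4, 32 * 24)).shape)                # torch.Size([4, 768])
```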
  • cs.CV updates on arXiv.org

    Spatial-Separated Curve Rendering Network for Efficient and High-Resolution Image Harmonization. (arXiv:2109.05750v2 [cs.CV] UPDATED)
    (0 min) Image harmonization aims to modify the color of the composited region with respect to the specific background. Previous works model this task as pixel-wise image-to-image translation using UNet family structures. However, the model size and computational cost limit the applicability of their models to edge devices and higher-resolution images. To this end, we propose a novel spatial-separated curve rendering network (S$^2$CRNet) for efficient and high-resolution image harmonization for the first time. In S$^2$CRNet, we first extract spatial-separated embeddings from the thumbnails of the masked foreground and background individually. Then, we design a curve rendering module (CRM), which learns and combines the spatial-specific knowledge using linear layers to generate the parameters of the pixel-wise curve mapping in the foreground region. Finally, we directly render the original high-resolution images using the learned color curve. In addition, we extend the proposed framework with Cascaded-CRM and Semantic-CRM for cascaded refinement and semantic guidance, respectively. Experiments show that the proposed method reduces parameters by more than 90% compared with previous methods while still achieving state-of-the-art performance on both the synthesized iHarmony4 and the real-world DIH test set. Moreover, our method works smoothly on higher-resolution images in real time, more than 10$\times$ faster than existing methods. The code and pre-trained models will be released at https://github.com/stefanLeong/S2CRNet.
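    A hedged sketch of the "predict a color curve at low resolution, render at full resolution" idea described above. Here the curve is a per-channel monotone piecewise-linear mapping whose segment slopes would be predicted from the thumbnail; this parameterisation and the segment count are assumptions for illustration, not the paper's exact curve model.

```python
import torch

def apply_piecewise_linear_curve(img, slopes):
    """img: (B, 3, H, W) in [0, 1]; slopes: (B, 3, K) non-negative segment slopes
    (hypothetically predicted by a small network from the thumbnail)."""
    B, C, K = slopes.shape
    slopes = slopes / slopes.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # curve maps [0,1]->[0,1]
    x = img.unsqueeze(-1) * K                                           # (B, 3, H, W, 1)
    seg = torch.arange(K, device=img.device).view(1, 1, 1, 1, K)
    frac = (x - seg).clamp(0.0, 1.0)                                    # coverage of each segment
    return (frac * slopes.view(B, C, 1, 1, K)).sum(-1)

hi_res = torch.rand(1, 3, 512, 512)
slopes = torch.rand(1, 3, 8)      # placeholder for thumbnail-predicted curve parameters
print(apply_piecewise_linear_curve(hi_res, slopes).shape)
```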
    Bayesian Deep Neural Networks for Supervised Learning Single-View Depth. (arXiv:2104.14202v2 [cs.CV] UPDATED)
    (2 min) Uncertainty quantification is a key aspect of robotic perception, as overconfident estimates or pure point estimates can lead to collisions and damage to the environment and the robot. In this paper, we evaluate scalable approaches to uncertainty quantification in single-view supervised depth learning, specifically MC dropout and deep ensembles. For MC dropout in particular, we explore in depth the effect of dropout at different levels of the architecture. We demonstrate that adding dropout in the encoder offers better results than adding it in the decoder, the latter being the usual approach in the literature for similar problems. We also show the use of depth uncertainty in pseudo-RGBD ICP and demonstrate its potential for improving accuracy in such a task.
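    A minimal sketch of MC-dropout uncertainty for single-view depth: run several stochastic forward passes with dropout kept active at test time, and use the per-pixel mean as the depth estimate and the variance as its uncertainty. The tiny network below is a stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(p=0.2),                          # dropout placed in the encoder, per the finding above
    nn.Conv2d(16, 1, 3, padding=1),
)

def mc_dropout_depth(net, image, T=20):
    net.train()                                   # keep dropout active at inference
    with torch.no_grad():
        preds = torch.stack([net(image) for _ in range(T)], dim=0)
    return preds.mean(0), preds.var(0)            # depth estimate, per-pixel uncertainty

depth, uncertainty = mc_dropout_depth(net, torch.rand(1, 3, 64, 64))
print(depth.shape, uncertainty.shape)
```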
    Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color. (arXiv:2109.06129v2 [cs.CV] UPDATED)
    (2 min) Pretrained language models have been shown to encode relational information, such as the relations between entities or concepts in knowledge-bases -- (Paris, Capital, France). However, simple relations of this type can often be recovered heuristically, and the extent to which models implicitly reflect topological structure that is grounded in the world, such as perceptual structure, is unknown. To explore this question, we conduct a thorough case study on color. Namely, we employ a dataset of monolexemic color terms and color chips represented in CIELAB, a color space with a perceptually meaningful distance metric. Using two methods of evaluating the structural alignment of colors in this space with text-derived color term representations, we find significant correspondence. Analyzing the differences in alignment across the color spectrum, we find that warmer colors are, on average, better aligned to the perceptual color space than cooler ones, suggesting an intriguing connection to findings from recent work on efficient communication in color naming. Further analysis suggests that differences in alignment are, in part, mediated by collocationality and differences in syntactic usage, raising questions about the relationship between color perception, usage, and context.
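    A hedged sketch of one common way to measure this kind of "structural alignment": correlate the pairwise distances in CIELAB with the pairwise distances between the corresponding text-derived representations (a representational-similarity-style analysis). The random arrays below are placeholders; the paper's exact alignment measures may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
cielab = rng.uniform([0, -80, -80], [100, 80, 80], size=(20, 3))  # 20 color chips in CIELAB (placeholder)
text_emb = rng.normal(size=(20, 300))                             # matching color-term vectors (placeholder)

# Compare the two condensed pairwise-distance matrices with a rank correlation.
rho, p = spearmanr(pdist(cielab), pdist(text_emb, metric="cosine"))
print(f"alignment (Spearman rho) = {rho:.3f}, p = {p:.3g}")
```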
    Point-set Distances for Learning Representations of 3D Point Clouds. (arXiv:2102.04014v2 [cs.CV] UPDATED)
    (0 min) Learning an effective representation of 3D point clouds requires a good metric to measure the discrepancy between two 3D point sets, which is non-trivial due to their irregularity. Most of the previous works resort to using the Chamfer discrepancy or Earth Mover's distance, but those metrics are either ineffective in measuring the differences between point clouds or computationally expensive. In this paper, we conduct a systematic study with extensive experiments on distance metrics for 3D point clouds. From this study, we propose to use sliced Wasserstein distance and its variants for learning representations of 3D point clouds. In addition, we introduce a new algorithm to estimate sliced Wasserstein distance that guarantees that the estimated value is close enough to the true one. Experiments show that the sliced Wasserstein distance and its variants allow the neural network to learn a more efficient representation compared to the Chamfer discrepancy. We demonstrate the efficiency of the sliced Wasserstein metric and its variants on several tasks in 3D computer vision including training a point cloud autoencoder, generative modeling, transfer learning, and point cloud registration.
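    A minimal sketch of the sliced Wasserstein distance between two 3D point clouds: project both sets onto random directions, sort the 1D projections, and average the resulting 1D transport costs. Equal point counts and the number of projections are simplifying assumptions for illustration.

```python
import torch

def sliced_wasserstein(x, y, n_projections=128, p=2):
    """x, y: (N, 3) point clouds with the same number of points."""
    dirs = torch.randn(n_projections, x.shape[1])
    dirs = dirs / dirs.norm(dim=1, keepdim=True)          # random unit directions
    px = torch.sort(x @ dirs.t(), dim=0).values           # sorted 1D projections, (N, n_projections)
    py = torch.sort(y @ dirs.t(), dim=0).values
    return ((px - py).abs() ** p).mean() ** (1.0 / p)

x, y = torch.randn(1024, 3), torch.randn(1024, 3)
print(sliced_wasserstein(x, y))
```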
    Domain Adaptation by Maximizing Population Correlation with Neural Architecture Search. (arXiv:2109.06652v1 [cs.CV])
    (0 min) In Domain Adaptation (DA), where the feature distributions of the source and target domains are different, various distance-based methods have been proposed to minimize the discrepancy between the source and target domains to handle the domain shift. In this paper, we propose a new similarity function, called Population Correlation (PC), to measure the domain discrepancy for DA. Based on the PC function, we propose a new method called Domain Adaptation by Maximizing Population Correlation (DAMPC) to learn a domain-invariant feature representation for DA. Moreover, most existing DA methods use hand-crafted bottleneck networks, which may limit the capacity and flexibility of the corresponding model. Therefore, we further propose a method called DAMPC with Neural Architecture Search (DAMPC-NAS) to search for the optimal network architecture for DAMPC. Experiments on several benchmark datasets, including Office-31, Office-Home, and VisDA-2017, show that the proposed DAMPC-NAS method achieves better results than state-of-the-art DA methods.
    Detection and Localization of Multiple Image Splicing Using MobileNet V1. (arXiv:2108.09674v2 [cs.CV] UPDATED)
    (0 min) In modern society, digital images have become a prominent source of information and medium of communication. They can, however, be easily altered using freely available image editing software. Two or more images can be combined to generate a new image that transmits information across social media platforms to influence people in society. This information may have both positive and negative consequences. Hence, there is a need for a technique that detects and locates multiple image splicing forgeries in an image. This research work proposes multiple image splicing forgery detection using Mask R-CNN with MobileNet V1 as the backbone. It also calculates the percentage score of the forged regions of multiple spliced images. A comparative analysis of the proposed work with variants of ResNet is performed. The proposed model is trained and tested using our MISD (Multiple Image Splicing Dataset), and it is observed that the proposed model outperforms the ResNet variants (ResNet 51, 101, and 151).
    Learning without Seeing nor Knowing: Towards Open Zero-Shot Learning. (arXiv:2103.12437v2 [cs.CV] UPDATED)
    (0 min) In Generalized Zero-Shot Learning (GZSL), unseen categories (for which no visual data are available at training time) can be predicted by leveraging their class embeddings (e.g., a list of attributes describing them) together with a complementary pool of seen classes (paired with both visual data and class embeddings). Although GZSL is arguably challenging, we posit that knowing the class embeddings in advance, especially for unseen categories, is an actual limit on the applicability of GZSL to real-world scenarios. To relax this assumption, we propose Open Zero-Shot Learning (OZSL) to extend GZSL towards the open-world setting. We formalize OZSL as the problem of recognizing seen and unseen classes (as in GZSL) while also rejecting instances from unknown categories, for which neither visual data nor class embeddings are provided. We formalize the OZSL problem by introducing evaluation protocols, error metrics, and benchmark datasets. We also suggest tackling the OZSL problem through unknown feature generation (instead of only unseen feature generation as done in GZSL). We achieve this by optimizing a generative process to sample unknown class embeddings as complementary to the seen and the unseen ones. We intend these results to serve as the ground for future research, extending the standard closed-world zero-shot learning (GZSL) with the novel open-world counterpart (OZSL).
    Towards Understanding the Generative Capability of Adversarially Robust Classifiers. (arXiv:2108.09093v2 [cs.LG] UPDATED)
    (2 min) Recently, some works found an interesting phenomenon that adversarially robust classifiers can generate good images comparable to generative models. We investigate this phenomenon from an energy perspective and provide a novel explanation. We reformulate adversarial example generation, adversarial training, and image generation in terms of an energy function. We find that adversarial training contributes to obtaining an energy function that is flat and has low energy around the real data, which is the key for generative capability. Based on our new understanding, we further propose a better adversarial training method, Joint Energy Adversarial Training (JEAT), which can generate high-quality images and achieve new state-of-the-art robustness under a wide range of attacks. The Inception Score of the images (CIFAR-10) generated by JEAT is 8.80, much better than original robust classifiers (7.50).
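    A hedged sketch of the generic procedure commonly used to synthesize images with an adversarially robust classifier, which illustrates the phenomenon discussed above: start from noise and take signed-gradient steps that increase the logit of a target class. The stand-in linear classifier, step size, and step count are assumptions; this is not the JEAT training procedure itself.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in for a robust classifier

def generate_from_classifier(classifier, target_class, steps=100, lr=0.05):
    x = torch.rand(1, 3, 32, 32, requires_grad=True)
    for _ in range(steps):
        logit = classifier(x)[0, target_class]
        grad, = torch.autograd.grad(logit, x)
        with torch.no_grad():
            x += lr * grad.sign()                 # ascend the target-class logit
            x.clamp_(0.0, 1.0)                    # keep a valid image range
    return x.detach()

img = generate_from_classifier(classifier, target_class=3)
print(img.shape)
```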
    Visual News: Benchmark and Challenges in News Image Captioning. (arXiv:2010.03743v3 [cs.CV] UPDATED)
    (2 min) We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method utilizes much fewer parameters while achieving slightly better prediction results than competing methods. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.
    Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. (arXiv:2012.09855v3 [cs.CV] UPDATED)
    (2 min) We introduce the problem of perpetual view generation - long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which quickly degenerate when presented with large camera motions. Methods for video generation also have limited ability to produce long sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates both geometry and image synthesis in an iterative `\emph{render}, \emph{refine} and \emph{repeat}' framework, allowing for long-range generation that covers large distances over hundreds of frames. Our approach can be trained from a set of monocular video sequences. We propose a dataset of aerial footage of coastal scenes, and compare our method with recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories compared to existing methods. Project page at https://infinite-nature.github.io/.
    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning. (arXiv:2109.06860v1 [cs.CL])
    (2 min) Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions including East Asia, South Asia, and Africa is significantly lower than that for Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
    A Deep-Learning Approach For Direct Whole-Heart Mesh Reconstruction. (arXiv:2102.07899v2 [eess.IV] UPDATED)
    (2 min) Automated construction of surface geometries of cardiac structures from volumetric medical images is important for a number of clinical applications. While deep-learning-based approaches have demonstrated promising reconstruction precision, these approaches have mostly focused on voxel-wise segmentation followed by surface reconstruction and post-processing techniques. However, such approaches suffer from a number of limitations including disconnected regions or incorrect surface topology due to erroneous segmentation and stair-case artifacts due to limited segmentation resolution. We propose a novel deep-learning-based approach that directly predicts whole heart surface meshes from volumetric CT and MR image data. Our approach leverages a graph convolutional neural network to predict deformation on mesh vertices from a pre-defined mesh template to reconstruct multiple anatomical structures in a 3D image volume. Our method demonstrated promising performance of generating whole heart reconstructions with as good or better accuracy than prior deep-learning-based methods on both CT and MR data. Furthermore, by deforming a template mesh, our method can generate whole heart geometries with better anatomical consistency and produce high-resolution geometries from lower resolution input image data. Our method was also able to produce temporally consistent surface mesh predictions for heart motion from CT or MR cine sequences, and therefore can potentially be applied for efficiently constructing 4D whole heart dynamics. Our code and pre-trained networks are available at https://github.com/fkong7/MeshDeformNet
    A Unified Objective for Novel Class Discovery. (arXiv:2108.08536v3 [cs.CV] UPDATED)
    (2 min) In this paper, we study the problem of Novel Class Discovery (NCD). NCD aims at inferring novel object categories in an unlabeled set by leveraging from prior knowledge of a labeled set containing different, but related classes. Existing approaches tackle this problem by considering multiple objective functions, usually involving specialized loss terms for the labeled and the unlabeled samples respectively, and often requiring auxiliary regularization terms. In this paper, we depart from this traditional scheme and introduce a UNified Objective function (UNO) for discovering novel classes, with the explicit purpose of favoring synergy between supervised and unsupervised learning. Using a multi-view self-labeling strategy, we generate pseudo-labels that can be treated homogeneously with ground truth labels. This leads to a single classification objective operating on both known and unknown classes. Despite its simplicity, UNO outperforms the state of the art by a significant margin on several benchmarks (~+10% on CIFAR-100 and +8% on ImageNet). The project page is available at: https://ncd-uno.github.io.
    Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition. (arXiv:2103.09154v2 [cs.CV] UPDATED)
    (2 min) Emotional expressions are the behaviors that communicate our emotional state or attitude to others. They are expressed through verbal and non-verbal communication. Complex human behavior can be understood by studying physical features from multiple modalities; mainly facial, vocal and physical gestures. Recently, spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis. In this paper, we propose a new deep learning-based approach for audio-visual emotion recognition. Our approach leverages recent advances in deep learning like knowledge distillation and high-performing deep architectures. The deep feature representations of the audio and visual modalities are fused based on a model-level fusion strategy. A recurrent neural network is then used to capture the temporal dynamics. Our proposed approach substantially outperforms state-of-the-art approaches in predicting valence on the RECOLA dataset. Moreover, our proposed visual facial expression feature extraction network outperforms state-of-the-art results on the AffectNet and Google Facial Expression Comparison datasets.
    Fast and Accurate Road Crack Detection Based on Adaptive Cost-Sensitive Loss Function. (arXiv:2106.15510v2 [cs.CV] UPDATED)
    (2 min) Numerous detection problems in computer vision, including road crack detection, suffer from an extreme foreground-background imbalance. Fortunately, modification of the loss function appears to solve this puzzle once and for all. In this paper, we propose a pixel-based adaptive weighted cross-entropy loss in conjunction with Jaccard distance to facilitate high-quality pixel-level road crack detection. Our work demonstrates the influence of loss functions on detection outcomes and sheds light on the sophisticated consecutive improvements in the realm of crack detection. Specifically, to verify the effectiveness of the proposed loss, we conduct extensive experiments on four public databases, i.e., CrackForest, AigleRN, Crack360, and BJN260. Compared with the vanilla weighted cross-entropy, the proposed loss significantly speeds up the training process while retaining the test accuracy.
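    A hedged sketch of combining a weighted cross-entropy term with a soft Jaccard (IoU) term for crack segmentation. The adaptive per-pixel weighting of the paper is not reproduced; here the rare positive class is simply up-weighted by the background/foreground ratio, and the mixing weight is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def weighted_bce_jaccard(logits, target, jaccard_weight=0.5, eps=1e-6):
    """logits, target: (B, 1, H, W); target is a binary crack mask."""
    pos_frac = target.mean().clamp(min=eps)
    pos_weight = (1.0 - pos_frac) / pos_frac                      # up-weight the rare foreground
    bce = F.binary_cross_entropy_with_logits(logits, target, pos_weight=pos_weight)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    union = prob.sum() + target.sum() - inter
    jaccard = 1.0 - (inter + eps) / (union + eps)                 # soft IoU loss
    return (1.0 - jaccard_weight) * bce + jaccard_weight * jaccard

logits = torch.randn(2, 1, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.95).float()                  # sparse foreground, as in crack maps
print(weighted_bce_jaccard(logits, mask))
```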
    Scalable Scene Flow from Point Clouds in the Real World. (arXiv:2103.01306v4 [cs.CV] UPDATED)
    (2 min) Autonomous vehicles operate in highly dynamic environments necessitating an accurate assessment of which aspects of a scene are moving and where they are moving to. A popular approach to 3D motion estimation, termed scene flow, is to employ 3D point cloud data from consecutive LiDAR scans, although such approaches have been limited by the small size of real-world, annotated LiDAR data. In this work, we introduce a new large-scale dataset for scene flow estimation derived from corresponding tracked 3D objects, which is $\sim$1,000$\times$ larger than previous real-world datasets in terms of the number of annotated frames. We demonstrate how previous works were bounded based on the amount of real LiDAR data available, suggesting that larger datasets are required to achieve state-of-the-art predictive performance. Furthermore, we show how previous heuristics for operating on point clouds such as down-sampling heavily degrade performance, motivating a new class of models that are tractable on the full point cloud. To address this issue, we introduce the FastFlow3D architecture which provides real time inference on the full point cloud. Additionally, we design human-interpretable metrics that better capture real world aspects by accounting for ego-motion and providing breakdowns per object type. We hope that this dataset may provide new opportunities for developing real world scene flow systems.
    NURBS-Diff: A differentiable programming module for NURBS. (arXiv:2104.14547v2 [cs.LG] UPDATED)
    (2 min) Boundary representations (B-reps) using Non-Uniform Rational B-splines (NURBS) are the de facto standard used in CAD, but their utility in deep learning-based approaches is not well researched. We propose a differentiable NURBS module to integrate the NURBS representation of CAD models with deep learning methods. We mathematically define the derivatives of the NURBS curves or surfaces with respect to the input parameters. These derivatives are used to define an approximate Jacobian that can be used to perform the "backward" evaluation used while training deep learning models. We have implemented our NURBS module using GPU-accelerated algorithms and integrated it with PyTorch, a popular deep learning framework. We demonstrate the efficacy of our NURBS module in performing CAD operations such as curve or surface fitting and surface offsetting. Further, we show its utility in deep learning for unsupervised point cloud reconstruction. These examples show that our module performs better for certain deep learning frameworks and can be directly integrated with any deep-learning framework requiring NURBS.
    Attention-augmented Spatio-Temporal Segmentation for Land Cover Mapping. (arXiv:2105.02963v2 [cs.CV] UPDATED)
    (2 min) The availability of massive earth-observing satellite data provides huge opportunities for land use and land cover mapping. However, such mapping efforts are challenging due to the existence of various land cover classes, noisy data, and the lack of proper labels. Also, each land cover class typically has its own unique temporal pattern and can be identified only during certain periods. In this article, we introduce a novel architecture that incorporates the UNet structure with Bidirectional LSTM and an Attention mechanism to jointly exploit the spatial and temporal nature of satellite data and to better identify the unique temporal patterns of each land cover class. We evaluate this method for mapping crops in multiple regions over the world. We compare our method with other state-of-the-art methods both quantitatively and qualitatively on two real-world datasets which involve multiple land cover classes. We also visualise the attention weights to study their effectiveness in mitigating noise and identifying discriminative time periods.
    IMU-Assisted Learning of Single-View Rolling Shutter Correction. (arXiv:2011.03106v2 [cs.CV] UPDATED)
    (2 min) Rolling shutter distortion is highly undesirable for photography and computer vision algorithms (e.g., visual SLAM) because pixels can be potentially captured at different times and poses. In this paper, we propose a deep neural network to predict depth and row-wise pose from a single image for rolling shutter correction. Our contribution in this work is to incorporate inertial measurement unit (IMU) data into the pose refinement process, which, compared to the state-of-the-art, greatly enhances the pose prediction. The improved accuracy and robustness make it possible for numerous vision algorithms to use imagery captured by rolling shutter cameras and produce highly accurate results. We also extend a dataset to have real rolling shutter images, IMU data, depth maps, camera poses, and corresponding global shutter images for rolling shutter correction training. We demonstrate the efficacy of the proposed method by evaluating the performance of Direct Sparse Odometry (DSO) algorithm on rolling shutter imagery corrected using the proposed approach. Results show marked improvements of the DSO algorithm over using uncorrected imagery, validating the proposed approach.
    Sparse PointPillars: Maintaining and Exploiting Input Sparsity to Improve Runtime on Embedded Systems. (arXiv:2106.06882v2 [cs.CV] UPDATED)
    (2 min) Bird's Eye View (BEV) is a popular representation for processing 3D point clouds, and by its nature is fundamentally sparse. Motivated by the computational limitations of mobile robot platforms, we take a fast, high-performance BEV 3D object detector - PointPillars - and modify its backbone to maintain and exploit this input sparsity, leading to decreased runtimes. We present results on KITTI, a canonical 3D detection dataset, and Matterport-Chair, a novel Matterport3D-derived chair detection dataset from scenes in real furnished homes. We evaluate runtime characteristics using a desktop GPU, an embedded ML accelerator, and a robot CPU, demonstrating that our method results in significant runtime decreases (2x or more) for embedded systems with only a modest decrease in detection quality. Our work represents a new approach for practitioners to optimize models for embedded systems by maintaining and exploiting input sparsity throughout their entire pipeline to reduce runtime and resource usage while preserving detection performance. All models, weights, experimental configurations, and datasets used are publicly available.
    Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization. (arXiv:2008.08170v4 [math.OC] UPDATED)
    (2 min) In the paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method to solve stochastic mini-optimization problems. We prove that the Acc-ZOM method achieves a lower query complexity of $\tilde{O}(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$ where $d$ denotes the parameter dimension. In particular, the Acc-ZOM does not need large batches that are required in the existing zeroth-order stochastic algorithms. At the same time, we propose an accelerated zeroth-order momentum descent ascent (Acc-ZOMDA) method for black-box minimax-optimization. We prove that the Acc-ZOMDA method reaches the best known query complexity of $\tilde{O}((d_1+d_2)\kappa_y^{3}\epsilon^{-3})$ without large batches for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote dimensions of optimization parameters and $\kappa_y$ is condition number. Moreover, we propose an accelerated first-order momentum descent ascent (Acc-MDA) method for solving white-box minimax problems, and prove that it achieves a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ given batch size $b=\kappa_y^{4}$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(\kappa_y^{1/2})$. Extensive experimental results on the black-box adversarial attack to deep neural networks (DNNs) and poisoning attack demonstrate the efficiency of our algorithms.
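    A hedged sketch of the black-box setting this line of work operates in: a basic zeroth-order gradient estimator built from function-value queries along random Gaussian directions, fed into a momentum update. This generic estimator and the toy objective are illustrative assumptions; it is not the Acc-ZOM/Acc-ZOMDA algorithm itself.

```python
import numpy as np

def zo_gradient(f, x, n_dirs=20, mu=1e-3, rng=np.random.default_rng(0)):
    """Two-point zeroth-order gradient estimate averaged over random directions."""
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.size)
        g += (f(x + mu * u) - f(x)) / mu * u
    return g / n_dirs

def zo_momentum_descent(f, x0, steps=200, lr=0.05, beta=0.9):
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        m = beta * m + (1 - beta) * zo_gradient(f, x)   # momentum on the ZO estimate
        x -= lr * m
    return x

f = lambda x: np.sum((x - 1.0) ** 2)                    # toy smooth objective
print(zo_momentum_descent(f, np.zeros(5)).round(2))     # approaches the minimiser at 1
```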
    D2C-SR: A Divergence to Convergence Approach for Real-World Image Super-Resolution. (arXiv:2103.14373v2 [cs.CV] UPDATED)
    (2 min) In this paper, we present D2C-SR, a novel framework for the task of real-world image super-resolution. As super-resolution is an ill-posed problem, the key challenge is that there can be multiple predictions for a given low-resolution input. Most classical deep learning based approaches ignore this fundamental fact and lack explicit modeling of the underlying high-frequency distribution, which leads to blurred results. Recently, some GAN-based methods and methods that learn a super-resolution space can generate simulated textures, but they do not guarantee the accuracy of those textures and show low quantitative performance. Rethinking both, we learn the distribution of underlying high-frequency details in a discrete form and propose a two-stage pipeline: a divergence stage followed by a convergence stage. At the divergence stage, we propose a tree-structured deep network as our divergence backbone. A divergence loss is proposed to encourage the generated results from the tree-based network to diverge into possible high-frequency representations, which is our way of discretely modeling the underlying high-frequency distribution. At the convergence stage, we assign spatial weights to fuse these divergent predictions to obtain the final output with more accurate details. Our approach provides a convenient end-to-end manner for inference. We conduct evaluations on several real-world benchmarks, including a newly proposed D2CRealSR dataset with an x8 scaling factor. Our experiments demonstrate that D2C-SR achieves better accuracy and visual improvements against state-of-the-art methods, with significantly fewer parameters.
    Unified Representation of Geometric Primitives for Graph-SLAM Optimization Using Decomposed Quadrics. (arXiv:2108.08957v2 [cs.RO] UPDATED)
    (2 min) In Simultaneous Localization And Mapping (SLAM) problems, high-level landmarks have the potential to build compact and informative maps compared to traditional point-based landmarks. In this work, we focus on the parameterization of frequently used geometric primitives including points, lines, planes, ellipsoids, cylinders, and cones. We first present a unified representation based on quadrics, leading to a consistent and concise formulation. Then we further study a decomposed model of quadrics that discloses the symmetric and degenerated properties of a primitive. Based on the decomposition, we develop geometrically meaningful quadrics factors in the settings of a graph-SLAM problem. Then in simulation experiments, it is shown that the decomposed formulation has better efficiency and robustness to observation noises than baseline parameterizations. Finally, in real-world experiments, the proposed back-end framework is demonstrated to be capable of building compact and regularized maps.
    Recognizing License Plates in Real-Time. (arXiv:1906.04376v3 [cs.CV] UPDATED)
    (3 min) License plate detection and recognition (LPDR) is of growing importance for enabling intelligent transportation and ensuring the security and safety of cities. However, LPDR faces big challenges in practical environments. License plates can have extremely diverse sizes, fonts, and colors, and the plate images are usually of poor quality caused by skewed capturing angles, uneven lighting, occlusion, and blurring. Applications such as surveillance often require fast processing. To enable real-time and accurate license plate recognition, in this work, we propose a set of techniques: 1) a contour reconstruction method along with edge detection to quickly detect candidate plates; 2) a simple zero-one-alternation scheme to effectively remove the fake top and bottom borders around plates to facilitate more accurate segmentation of characters on plates; 3) a set of techniques to augment the training data, incorporate SIFT features into the CNN network, and exploit transfer learning to obtain the initial parameters for more effective training; and 4) a two-phase verification procedure to determine the correct plate at low cost, with statistical filtering in the plate detection stage to quickly remove unwanted candidates and use of the accurate CR results after the CR process to perform further plate verification without additional processing. We implement a complete LPDR system based on our algorithms. The experimental results demonstrate that our system can accurately recognize license plates in real time. Additionally, it works robustly under various levels of illumination and noise, and in the presence of car movement. Compared to peer schemes, our system is not only among the most accurate ones but is also the fastest, and can be easily applied to other scenarios.
    ImUnity: a generalizable VAE-GAN solution for multicenter MR image harmonization. (arXiv:2109.06756v1 [eess.IV])
    (2 min) ImUnity is an original deep-learning model designed for efficient and flexible MR image harmonization. A VAE-GAN network, coupled with a confusion module and an optional biological preservation module, uses multiple 2D-slices taken from different anatomical locations in each subject of the training database, as well as image contrast transformations for its self-supervised training. It eventually generates 'corrected' MR images that can be used for various multi-center population studies. Using 3 open source databases (ABIDE, OASIS and SRPBS), which contain MR images from multiple acquisition scanner types or vendors and a large range of subjects ages, we show that ImUnity: (1) outperforms state-of-the-art methods in terms of quality of images generated using traveling subjects; (2) removes sites or scanner biases while improving patients classification; (3) harmonizes data coming from new sites or scanners without the need for an additional fine-tuning and (4) allows the selection of multiple MR reconstructed images according to the desired applications. Tested here on T1-weighted images, ImUnity could be used to harmonize other types of medical images.
    MIMIR: Deep Regression for Automated Analysis of UK Biobank Body MRI. (arXiv:2106.11731v2 [eess.IV] UPDATED)
    (2 min) UK Biobank (UKB) conducts large-scale examinations of more than half a million volunteers, collecting health-related information on genetics, lifestyle, blood biochemistry, and more. Medical imaging of 100,000 subjects, with 70,000 follow-up sessions, enables measurements of organs, muscle, and body composition. With up to 170,000 mounting MR images, various methodologies are accordingly engaged in large-scale image analysis. This work presents an experimental inference engine that can automatically predict a comprehensive profile of subject metadata from UKB neck-to-knee body MRI. It was evaluated in cross-validation for baseline characteristics such as age, height, weight, and sex, but also measurements of body composition, organ volumes, and abstract properties like grip strength, pulse rate, and type 2 diabetic status. It predicted subsequently released test data covering twelve body composition metrics with a 3% median error. The proposed system can automatically analyze one thousand subjects within ten minutes, providing individual confidence intervals. The underlying methodology utilizes convolutional neural networks for image-based mean-variance regression on two-dimensional representations of the MRI data. This work aims to make the proposed system available for free to researchers, who can use it to obtain fast and fully-automated estimates of 72 different measurements immediately upon release of new UKB image data.
    Transductive Zero-Shot Learning by Decoupled Feature Generation. (arXiv:2102.03266v3 [cs.CV] UPDATED)
    (2 min) In this paper, we address zero-shot learning (ZSL), the problem of recognizing categories for which no labeled visual data are available during training. We focus on the transductive setting, in which unlabelled visual data from unseen classes is available. State-of-the-art paradigms in ZSL typically exploit generative adversarial networks to synthesize visual features from semantic attributes. We posit that the main limitation of these approaches is to adopt a single model to face two problems: 1) generating realistic visual features, and 2) translating semantic attributes into visual cues. Differently, we propose to decouple such tasks, solving them separately. In particular, we train an unconditional generator to solely capture the complexity of the distribution of visual data and we subsequently pair it with a conditional generator devoted to enrich the prior knowledge of the data distribution with the semantic content of the class embeddings. We present a detailed ablation study to dissect the effect of our proposed decoupling approach, while demonstrating its superiority over the related state-of-the-art.
    Non-line-of-Sight Imaging via Neural Transient Fields. (arXiv:2101.00373v3 [eess.IV] UPDATED)
    (2 min) We present a neural modeling framework for Non-Line-of-Sight (NLOS) imaging. Previous solutions have sought to explicitly recover the 3D geometry (e.g., as point clouds) or voxel density (e.g., within a pre-defined volume) of the hidden scene. In contrast, inspired by the recent Neural Radiance Field (NeRF) approach, we use a multi-layer perceptron (MLP) to represent the neural transient field or NeTF. However, NeTF measures the transient over spherical wavefronts rather than the radiance along lines. We therefore formulate a spherical volume NeTF reconstruction pipeline, applicable to both confocal and non-confocal setups. Compared with NeRF, NeTF samples a much sparser set of viewpoints (scanning spots) and the sampling is highly uneven. We thus introduce a Monte Carlo technique to improve the robustness in the reconstruction. Comprehensive experiments on synthetic and real datasets demonstrate NeTF provides higher quality reconstruction and preserves fine details largely missing in the state-of-the-art.
    Temporal-Spatial Feature Pyramid for Video Saliency Detection. (arXiv:2105.04213v2 [cs.CV] UPDATED)
    (2 min) Multi-level features are important for saliency detection. Better combination and use of multi-level features with time information can greatly improve the accuracy of the video saliency model. In order to fully combine multi-level features and make it serve the video saliency model, we propose a 3D fully convolutional encoder-decoder architecture for video saliency detection, which combines scale, space and time information for video saliency modeling. The encoder extracts multi-scale temporal-spatial features from the input continuous video frames, and then constructs temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. The decoder performs hierarchical decoding of temporal-spatial features from different scales, and finally produces a saliency map from the integration of multiple video frames. Our model is simple yet effective, and can run in real time. We perform abundant experiments, and the results indicate that the well-designed structure can improve the precision of video saliency detection significantly. Experimental results on three purely visual video saliency benchmarks and six audio-video saliency benchmarks demonstrate that our method outperforms the existing state-of-the-art methods.
    Investigation of condominium building collapse in Surfside, Florida: a video feature tracking approach. (arXiv:2109.06629v1 [cs.CV])
    (2 min) On June 24, 2021, a 12-story condominium building (Champlain Towers South) in Surfside, Florida partially collapsed, resulting in one of the deadliest building collapses in United States history, with 98 people confirmed dead. We analyze this collapse event using a video clip that is publicly available from social media. We apply computer vision algorithms to corroborate new information from the video clip that may not be readily interpreted by human eyes. By comparing differential features across video frames, our method can quantify the falling structural components, intuitively showing the directions and magnitudes of their movements. We demonstrate the potential of this video processing methodology in investigations of catastrophic structural failures and hope our results will serve as the basis for further investigations of this and other structural collapse events.
    Meta-repository of screening mammography classifiers. (arXiv:2108.04800v2 [cs.LG] UPDATED)
    (2 min) Artificial intelligence (AI) is showing promise in improving clinical diagnosis. In breast cancer screening, several recent studies show that AI has the potential to improve radiologists' accuracy, subsequently helping in early cancer diagnosis and reducing unnecessary workup. As the number of proposed models and their complexity grows, it is becoming increasingly difficult to re-implement them in order to reproduce the results and to compare different approaches. To enable reproducibility of research in this application area and to enable comparison between different methods, we release a meta-repository containing deep learning models for classification of screening mammograms. This meta-repository creates a framework that enables the evaluation of machine learning models on any private or public screening mammography data set. At its inception, our meta-repository contains five state-of-the-art models with open-source implementations and cross-platform compatibility. We compare their performance on six international data sets: two New York University breast cancer screening data sets, DDSM, INbreast, OPTIMAM and Chinese Mammography Database. Our framework has a flexible design that can be generalized to other medical image analysis tasks. The meta-repository is available at https://www.github.com/nyukat/mammography_metarepository.
    An Empirical Study and Analysis on Open-Set Semi-Supervised Learning. (arXiv:2101.08237v2 [cs.CV] UPDATED)
    (2 min) Pseudo-labeling (PL) and Data Augmentation-based Consistency Training (DACT) are two approaches widely used in Semi-Supervised Learning (SSL) methods. These methods exhibit great power in many machine learning tasks by utilizing unlabeled data for efficient training. But in a more realistic setting (termed open-set SSL), where the unlabeled dataset contains out-of-distribution (OOD) samples, the traditional SSL methods suffer severe performance degradation. Recent approaches mitigate the negative influence of OOD samples by filtering them out from the unlabeled data. However, it is not clear whether directly removing the OOD samples is the best choice. Furthermore, why PL and DACT perform differently in open-set SSL remains a mystery. In this paper, we thoroughly analyze various SSL methods (PL and DACT) on open-set SSL and discuss the pros and cons of these two approaches separately. Based on our analysis, we propose Style Disturbance to improve traditional SSL methods on open-set SSL and experimentally show that our approach can achieve state-of-the-art results on various datasets by utilizing OOD samples properly. We believe our study can bring new insights for SSL research.
    P-WAE: Generalized Patch-Wasserstein Autoencoder for Anomaly Screening. (arXiv:2108.03815v2 [cs.CV] UPDATED)
    (2 min) Anomaly detection plays a pivotal role in numerous real-world scenarios, such as industrial automation and manufacturing intelligence. Recently, variational inference-based anomaly analysis has attracted researchers' and developers' attention. It aims to model the defect-free distribution so that anomalies can be classified as out-of-distribution samples. Nevertheless, there are two disturbing factors that we need to prioritize: (i) the simplistic prior latent distribution induces limited expressive capability; (ii) the strong probability-distance notion results in collapsed features. In this paper, we propose a novel Patch-wise Wasserstein AutoEncoder (P-WAE) architecture to alleviate those challenges. In particular, a patch-wise variational inference model coupled with solving the jigsaw puzzle is designed, which is a simple yet effective way to increase the expressiveness of the latent manifold. This makes it possible to use the model on high-dimensional practical data. In addition, we leverage a weaker measure, the sliced-Wasserstein distance, to achieve an equilibrium between reconstruction fidelity and generalized representations. Comprehensive experiments, conducted on the MVTec AD dataset, demonstrate the superior performance of our proposed method.
    High-Resolution Image Harmonization via Collaborative Dual Transformations. (arXiv:2109.06671v1 [cs.CV])
    (2 min) Given a composite image, image harmonization aims to adjust the foreground to make it compatible with the background. High-resolution image harmonization is in high demand, but still remains unexplored. Conventional image harmonization methods learn global RGB-to-RGB transformation which could effortlessly scale to high resolution, but ignore diverse local context. Recent deep learning methods learn the dense pixel-to-pixel transformation which could generate harmonious outputs, but are highly constrained in low resolution. In this work, we propose a high-resolution image harmonization network with Collaborative Dual Transformation (CDTNet) to combine pixel-to-pixel transformation and RGB-to-RGB transformation coherently in an end-to-end framework. Our CDTNet consists of a low-resolution generator for pixel-to-pixel transformation, a color mapping module for RGB-to-RGB transformation, and a refinement module to take advantage of both. Extensive experiments on high-resolution image harmonization dataset demonstrate that our CDTNet strikes a good balance between efficiency and effectiveness.
    Luminance Attentive Networks for HDR Image and Panorama Reconstruction. (arXiv:2109.06688v1 [cs.CV])
    (2 min) It is very challenging to reconstruct a high dynamic range (HDR) image from a low dynamic range (LDR) image, as it is an ill-posed problem. This paper proposes a luminance attentive network named LANet for HDR reconstruction from a single LDR image. Our method is based on two fundamental observations: (1) HDR images stored in relative luminance are scale-invariant, which means the HDR images will hold the same information when multiplied by any positive real number. Based on this observation, we propose a novel normalization method called "HDR calibration" for HDR images stored in relative luminance, calibrating HDR images into a similar luminance scale according to the LDR images. (2) The main difference between HDR images and LDR images is in under-/over-exposed areas, especially the highlighted ones. Following this observation, we propose a luminance attention module with a two-stream structure for LANet to pay more attention to the under-/over-exposed areas. In addition, we propose an extended network called panoLANet for HDR panorama reconstruction from an LDR panorama and build a dualnet structure for panoLANet to solve the distortion problem caused by the equirectangular panorama. Extensive experiments show that our proposed approach LANet can reconstruct visually convincing HDR images and demonstrate its superiority over state-of-the-art approaches in terms of all metrics in inverse tone mapping. The image-based lighting application with our proposed panoLANet also demonstrates that our method can simulate natural scene lighting using only an LDR panorama. Our source code is available at https://github.com/LWT3437/LANet.
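    A hedged sketch of the "HDR calibration" idea in observation (1): because relative-luminance HDR images are scale-invariant, each HDR image can be rescaled so its luminance agrees with the (linearised) LDR over the well-exposed region. Matching the ratio of mean values over non-clipped pixels and the gamma assumption below are illustrative choices, not the paper's exact formula.

```python
import numpy as np

def calibrate_hdr(hdr, ldr, gamma=2.2, eps=1e-6):
    """hdr: float array (H, W, 3) in relative luminance; ldr: float array in [0, 1]."""
    ldr_lin = ldr ** gamma                               # rough inverse of display gamma (assumption)
    mask = (ldr > 0.05) & (ldr < 0.95)                   # keep only well-exposed pixels
    scale = (ldr_lin[mask].mean() + eps) / (hdr[mask].mean() + eps)
    return hdr * scale                                   # HDR rescaled to the LDR's luminance scale

hdr = np.random.rand(64, 64, 3) * 10.0
ldr = np.clip(hdr / 10.0, 0, 1) ** (1 / 2.2)
print(calibrate_hdr(hdr, ldr).mean())
```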
    Automatic hippocampal surface generation via 3D U-net and active shape modeling with hybrid particle swarm optimization. (arXiv:2109.06817v1 [cs.NE])
    (2 min) In this paper, we proposed and validated a fully automatic pipeline for hippocampal surface generation via 3D U-net coupled with active shape modeling (ASM). Principally, the proposed pipeline consisted of three steps. In the beginning, for each magnetic resonance image, a 3D U-net was employed to obtain the automatic hippocampus segmentation at each hemisphere. Secondly, ASM was performed on a group of pre-obtained template surfaces to generate mean shape and shape variation parameters through principal component analysis. Ultimately, hybrid particle swarm optimization was utilized to search for the optimal shape variation parameters that best match the segmentation. The hippocampal surface was then generated from the mean shape and the shape variation parameters. The proposed pipeline was observed to provide hippocampal surfaces at both hemispheres with high accuracy, correct anatomical topology, and sufficient smoothness.
    Scale-covariant and scale-invariant Gaussian derivative networks. (arXiv:2011.14759v9 [cs.CV] UPDATED)
    (2 min) This paper presents a hybrid approach between scale-space theory and deep learning, where a deep learning architecture is constructed by coupling parameterized scale-space operations in cascade. By sharing the learnt parameters between multiple scale channels, and by using the transformation properties of the scale-space primitives under scaling transformations, the resulting network becomes provably scale covariant. By in addition performing max pooling over the multiple scale channels, a resulting network architecture for image classification also becomes provably scale invariant. We investigate the performance of such networks on the MNISTLargeScale dataset, which contains rescaled images from original MNIST over a factor of 4 concerning training data and over a factor of 16 concerning testing data. It is demonstrated that the resulting approach allows for scale generalization, enabling good performance for classifying patterns at scales not present in the training data.
    MotionHint: Self-Supervised Monocular Visual Odometrywith Motion Constraints. (arXiv:2109.06768v1 [cs.CV])
    (2 min) We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO) that takes motion constraints into account. A key aspect of our approach is to use an appropriate motion model that can help existing self-supervised monocular VO (SSM-VO) algorithms overcome issues related to the local minima within their self-supervised loss functions. The motion model is expressed with a neural network named PPnet. It is trained to coarsely predict the next pose of the camera and the uncertainty of this prediction. Our self-supervised approach combines the original loss and the motion loss, which is the weighted difference between the prediction and the generated ego-motion. Taking two existing SSM-VO systems as our baselines, we evaluate our MotionHint algorithm on the standard KITTI and EuRoC benchmarks. Experimental results show that our MotionHint algorithm can be easily applied to existing open-source state-of-the-art SSM-VO systems to greatly improve performance on the KITTI dataset, reducing the resulting ATE by up to 28.73%. For the EuRoC dataset, our method can extract the motion model, but due to the poor performance of the baseline methods, MotionHint cannot significantly improve their results.
    Universal Graph Transformer Self-Attention Networks. (arXiv:1909.11855v11 [cs.LG] UPDATED)
    (2 min) The transformer has been extensively used in research domains such as computer vision, image processing, and natural language processing. The transformer, however, has not been actively used in graph neural networks. To this end, we introduce a transformer-based advanced GNN model, named UGformer, to learn graph representations. In particular, given an input graph, we present two UGformer variants. The first variant is to leverage the transformer on a set of sampled neighbors for each node, while the second is to leverage the transformer directly on the input graph. Experimental results demonstrate that our UGformer achieves state-of-the-art accuracies on well-known benchmark datasets for graph classification and inductive text classification. The code is available on Github: \url{https://github.com/daiquocnguyen/Graph-Transformer}.
    A Deep Learning Approach for Masking Fetal Gender in Ultrasound Images. (arXiv:2109.06790v1 [cs.CV])
    (2 min) Ultrasound (US) imaging is highly effective with regard to both cost and versatility in real-time diagnosis; however, determination of fetal gender by US scan in the early stages of pregnancy is also a cause of sex-selective abortion. This work proposes a deep learning object detection approach to accurately mask fetal gender in US images in order to increase the accessibility of the technology. We demonstrate how the YOLOv5L architecture exhibits superior performance relative to other object detection models on this task. Our model achieves 45.8% AP[0.5:0.95], a 92% F1-score, and a false-positive rate of 0.006 per image on our test set. Furthermore, we introduce a bounding box delay rule based on frame-to-frame structural similarity to reduce the false negative rate by 85%, further improving masking reliability.
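    A hedged sketch of a frame-to-frame "bounding box delay" rule of the kind described above: if the detector misses in the current frame but the frame is structurally similar to the previous one (high SSIM), keep masking the previously detected box for a limited number of frames. The SSIM threshold, delay length, and box representation are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def delayed_boxes(frames, detections, ssim_thresh=0.9, max_delay=5):
    """frames: list of grayscale arrays; detections: list of box-or-None per frame."""
    kept, last_box, age = [], None, 0
    for i, (frame, det) in enumerate(zip(frames, detections)):
        if det is not None:
            last_box, age = det, 0                       # fresh detection resets the delay
        elif last_box is not None and i > 0:
            similar = ssim(frames[i - 1], frame,
                           data_range=frame.max() - frame.min()) > ssim_thresh
            age += 1
            if not similar or age > max_delay:           # scene changed or delay exhausted
                last_box = None
        kept.append(last_box)
    return kept

frames = [np.random.rand(64, 64) for _ in range(4)]
detections = [(10, 10, 30, 30), None, None, None]
print(delayed_boxes(frames, detections))
```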
    Learnable Discrete Wavelet Pooling (LDW-Pooling) For Convolutional Networks. (arXiv:2109.06638v1 [cs.CV])
    (2 min) Pooling is a simple but essential layer in modern deep CNN architectures for feature aggregation and extraction. Typical CNN design focuses on the conv layers and activation functions, while leaving the pooling layers with fewer options. We introduce Learnable Discrete Wavelet Pooling (LDW-Pooling), which can be applied universally to replace standard pooling operations and better extract features with improved accuracy and efficiency. Motivated by wavelet theory, we adopt the low-pass (L) and high-pass (H) filters horizontally and vertically for pooling on a 2D feature map. Feature signals are decomposed into four (LL, LH, HL, HH) subbands to better retain features and avoid information loss. The wavelet transform ensures that features after pooling can be fully preserved and recovered. We next adopt an energy-based attention learning to fine-select crucial and representative features. LDW-Pooling is effective and efficient when compared with other state-of-the-art pooling techniques such as WaveletPooling and LiftPooling. Extensive experimental validation shows that LDW-Pooling can be applied to a wide range of standard CNN architectures and consistently outperforms standard (max, mean, mixed, and stochastic) pooling operations.
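    The subband decomposition at the heart of this kind of pooling can be sketched with fixed (non-learned) 2x2 Haar filters (Python/PyTorch below; LDW-Pooling learns the filter pair and adds attention over the subbands, so this only illustrates the decomposition step):

        import torch
        import torch.nn.functional as F

        def haar_wavelet_pool(x):
            # Decompose each channel into LL, LH, HL, HH subbands with fixed 2x2 Haar
            # filters at stride 2 (LDW-Pooling instead learns the filter pair and
            # re-weights the subbands with attention; this only shows the decomposition).
            b, c, h, w = x.shape
            low = torch.tensor([1.0, 1.0]) / 2 ** 0.5     # low-pass (L) filter
            high = torch.tensor([1.0, -1.0]) / 2 ** 0.5   # high-pass (H) filter
            kernels = torch.stack([
                torch.outer(low, low), torch.outer(low, high),
                torch.outer(high, low), torch.outer(high, high),
            ]).unsqueeze(1)                               # (4, 1, 2, 2): LL, LH, HL, HH
            kernels = kernels.repeat(c, 1, 1, 1).to(x)    # one set of filters per channel
            y = F.conv2d(x, kernels, stride=2, groups=c)  # (b, 4*c, h/2, w/2)
            ll, lh, hl, hh = y.view(b, c, 4, h // 2, w // 2).unbind(dim=2)
            return ll, lh, hl, hh                         # all four subbands are retained

        # usage: ll, lh, hl, hh = haar_wavelet_pool(torch.randn(8, 64, 32, 32))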
    One-Class Meta-Learning: Towards Generalizable Few-Shot Open-Set Classification. (arXiv:2109.06859v1 [cs.CV])
    (2 min) Real-world classification tasks are frequently required to work in an open-set setting. This is especially challenging for few-shot learning problems due to the small sample size for each known category, which prevents existing open-set methods from working effectively; however, most multiclass few-shot methods are limited to closed-set scenarios. In this work, we address the problem of few-shot open-set classification by first proposing methods for few-shot one-class classification and then extending them to few-shot multiclass open-set classification. We introduce two independent few-shot one-class classification methods: Meta Binary Cross-Entropy (Meta-BCE), which learns a separate feature representation for one-class classification, and One-Class Meta-Learning (OCML), which learns to generate one-class classifiers given standard multiclass feature representation. Both methods can augment any existing few-shot learning method without requiring retraining to work in a few-shot multiclass open-set setting without degrading its closed-set performance. We demonstrate the benefits and drawbacks of both methods in different problem settings and evaluate them on three standard benchmark datasets, miniImageNet, tieredImageNet, and Caltech-UCSD-Birds-200-2011, where they surpass the state-of-the-art methods in the few-shot multiclass open-set and few-shot one-class tasks.
    Multi-modal Representation Learning for Video Advertisement Content Structuring. (arXiv:2109.06637v1 [cs.CV])
    (2 min) Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style. Different from real-life videos, video advertisements contain sufficient and useful multi-modal content like captions and speech, which provides crucial video semantics and can enhance the structuring process. In this paper, we propose a multi-modal encoder to learn multi-modal representations from video advertisements by interacting between video-audio and text. Based on the multi-modal representation, we then apply a Boundary-Matching Network to generate temporal proposals. To make the proposals more accurate, we refine the generated proposals by scene-guided alignment and re-ranking. Finally, we incorporate proposal-located embeddings into the introduced multi-modal encoder to capture temporal relationships between local features of each proposal and global features of the whole video for classification. Experimental results show that our method achieves significant improvement over several baselines and ranks 1st on the Multi-modal Ads Video Understanding task in the ACM Multimedia 2021 Grand Challenge. An ablation study further shows that leveraging multi-modal content like captions and speech in video advertisements significantly improves performance.
    Identifying partial mouse brain microscopy images from Allen reference atlas using a contrastively learned semantic space. (arXiv:2109.06662v1 [cs.CV])
    (2 min) Precise identification of mouse brain microscopy images is a crucial first step when anatomical structures in the mouse brain are to be registered to a reference atlas. Practitioners usually rely on manual comparison of images or tools that assume the presence of complete images. This work explores Siamese Networks as the method for finding corresponding 2D reference atlas plates for given partial 2D mouse brain images. Siamese networks are a class of convolutional neural networks (CNNs) that use weight-shared paths to obtain low dimensional embeddings of pairs of input images. The correspondence between the partial mouse brain image and reference atlas plate is determined based on the distance between low dimensional embeddings of brain slices and atlas plates that are obtained from Siamese networks using contrastive learning. Experiments showed that Siamese CNNs can precisely identify brain slices using the Allen mouse brain atlas when training and testing images come from the same source. They achieved TOP-1 and TOP-5 accuracy of 25% and 100%, respectively, taking only 7.2 seconds to identify 29 images.
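    A minimal sketch of the contrastive training signal for such a Siamese setup (Python/PyTorch; the margin value and the nearest-plate assignment at query time are illustrative assumptions): matching slice/atlas-plate pairs are pulled together in the embedding space, non-matching pairs are pushed at least a margin apart, and a partial slice is then assigned to its nearest atlas plate.

        import torch
        import torch.nn.functional as F

        def contrastive_loss(emb_a, emb_b, same_plate, margin=1.0):
            # same_plate is 1.0 for matching slice/plate pairs and 0.0 otherwise
            dist = F.pairwise_distance(emb_a, emb_b)
            pos = same_plate * dist.pow(2)                      # pull matching pairs together
            neg = (1 - same_plate) * F.relu(margin - dist).pow(2)  # push others apart
            return 0.5 * (pos + neg).mean()

        # at query time, a partial brain slice is assigned to the atlas plate whose
        # embedding lies closest in the learned space:
        # plate_id = torch.cdist(slice_emb, plate_embs).argmin(dim=1)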
    An Insect-Inspired Randomly, Weighted Neural Network with Random Fourier Features For Neuro-Symbolic Relational Learning. (arXiv:2109.06663v1 [cs.CV])
    (3 min) Insects, such as fruit flies and honey bees, can solve simple associative learning tasks and learn abstract concepts such as "sameness" and "difference", which is viewed as a higher-order cognitive function and typically thought to depend on top-down neocortical processing. Empirical research with fruit flies strongly supports that a randomized representational architecture is used in olfactory processing in insect brains. Based on these results, we propose a Randomly Weighted Feature Network (RWFN) that incorporates randomly drawn, untrained weights in an encoder that uses an adapted linear model as a decoder. The randomized projections between input neurons and higher-order processing centers in the insect brain are mimicked in RWFN by a single-hidden-layer neural network that specially structures latent representations in the hidden layer using random Fourier features, which better represent complex relationships between inputs through kernel approximation. Because of this special representation, RWFNs can effectively learn the degree of relationship among inputs by training only a linear decoder model. We compare the performance of RWFNs to LTNs for Semantic Image Interpretation (SII) tasks that have been used as a representative example of how LTNs utilize reasoning over first-order logic to surpass the performance of solely data-driven methods. We demonstrate that compared to LTNs, RWFNs can achieve better or similar performance for both object classification and detection of the part-of relations between objects in SII tasks while using far fewer learnable parameters (1:62 ratio) and a faster learning process (1:2 ratio of running speed). Furthermore, we show that because the randomized weights do not depend on the data, several decoders can share a single randomized encoder, giving RWFNs a unique economy of spatial scale for simultaneous classification tasks.
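    The core recipe, a fixed randomly weighted encoder built from random Fourier features plus a trained linear decoder, can be sketched as follows (Python/NumPy; the feature count, the RBF bandwidth gamma and the closed-form ridge-regression decoder are assumptions of this sketch, not the paper's training procedure):

        import numpy as np

        class RandomFourierFeatureNet:
            # Fixed randomly weighted encoder (random Fourier features approximating an
            # RBF kernel) followed by the only trained component, a linear decoder.
            def __init__(self, in_dim, num_features=256, gamma=1.0, seed=0):
                rng = np.random.default_rng(seed)
                self.W = rng.normal(scale=np.sqrt(2 * gamma), size=(in_dim, num_features))
                self.b = rng.uniform(0, 2 * np.pi, size=num_features)
                self.beta = None                      # linear decoder weights (trained)

            def encode(self, X):
                # z(x) = sqrt(2/D) * cos(xW + b), a kernel-approximation feature map
                return np.sqrt(2.0 / self.W.shape[1]) * np.cos(X @ self.W + self.b)

            def fit(self, X, y, reg=1e-3):
                Z = self.encode(X)
                # only the linear decoder is learned (closed-form ridge regression here)
                self.beta = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ y)
                return self

            def predict(self, X):
                return self.encode(X) @ self.beta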
    LRWR: Large-Scale Benchmark for Lip Reading in Russian language. (arXiv:2109.06692v1 [cs.CV])
    (2 min) Lipreading, also known as visual speech recognition, aims to identify speech content from videos by analyzing the visual deformations of the lips and nearby areas. One of the significant obstacles for research in this field is the lack of proper datasets for a wide variety of languages: so far, these methods have been focused only on English or Chinese. In this paper, we introduce a naturally distributed large-scale benchmark for lipreading in the Russian language, named LRWR, which contains 235 classes and 135 speakers. We provide a detailed description of the dataset collection pipeline and dataset statistics. We also present a comprehensive comparison of the current popular lipreading methods on LRWR and conduct a detailed analysis of their performance. The results demonstrate the differences between the benchmarked languages and provide several promising directions for fine-tuning lipreading models. Thanks to our findings, we also achieved new state-of-the-art results on the LRW benchmark.
    CropDefender: deep watermark which is more convenient to train and more robust against cropping. (arXiv:2109.06651v1 [cs.CV])
    (2 min) Digital image watermarking, which is a technique for invisibly embedding information into an image, is used in fields such as property rights protection. In recent years, some research has proposed the use of neural networks to add watermarks to natural images. We take StegaStamp as an example for our research. Whether facing traditional image editing methods, such as brightness, contrast, and saturation adjustment, or style changes like 1-bit conversion and GANs, StegaStamp has robustness far beyond traditional watermarking techniques, but it still has two drawbacks: it is vulnerable to cropping and is hard to train. We found that the cause of the vulnerability to cropping is not the loss of information at the image edges, but the shift of the watermark's position. By explicitly introducing cropping perturbations into the training, the cropping resistance is significantly improved. For the problem of difficult training, we introduce instance normalization to solve the vanishing gradient, set the losses' weights as learnable parameters to reduce the number of hyperparameters, and use a sigmoid to restrict the pixel values of the generated image.
    Scalable Font Reconstruction with Dual Latent Manifolds. (arXiv:2109.06627v1 [cs.CV])
    (2 min) We propose a deep generative model that performs typography analysis and font reconstruction by learning disentangled manifolds of both font style and character shape. Our approach enables us to massively scale up the number of character types we can effectively model compared to previous methods. Specifically, we infer separate latent variables representing character and font via a pair of inference networks which take as input sets of glyphs that either all share a character type, or belong to the same font. This design allows our model to generalize to characters that were not observed during training time, an important task in light of the relative sparsity of most fonts. We also put forward a new loss, adapted from prior work that measures likelihood using an adaptive distribution in a projected space, resulting in more natural images without requiring a discriminator. We evaluate on the task of font reconstruction over various datasets representing character types of many languages, and compare favorably to modern style transfer systems according to both automatic and manually-evaluated metrics.
    COVID-Net MLSys: Designing COVID-Net for the Clinical Workflow. (arXiv:2109.06421v1 [eess.IV])
    (2 min) As the COVID-19 pandemic continues to devastate globally, one promising field of research is machine learning-driven computer vision to streamline various parts of the COVID-19 clinical workflow. These machine learning methods are typically stand-alone models designed without consideration for the integration necessary for real-world application workflows. In this study, we take a machine learning and systems (MLSys) perspective to design a system for COVID-19 patient screening with the clinical workflow in mind. The COVID-Net system is comprised of the continuously evolving COVIDx dataset, COVID-Net deep neural network for COVID-19 patient detection, and COVID-Net S deep neural networks for disease severity scoring for COVID-19 positive patient cases. The deep neural networks within the COVID-Net system possess state-of-the-art performance, and are designed to be integrated within a user interface (UI) for clinical decision support with automatic report generation to assist clinicians in their treatment decisions.
    Tesla-Rapture: A Lightweight Gesture Recognition System from mmWave Radar Point Clouds. (arXiv:2109.06448v1 [cs.CV])
    (2 min) We present Tesla-Rapture, a gesture recognition interface for point clouds generated by mmWave radars. State-of-the-art gesture recognition models are either too resource-consuming or not sufficiently accurate for integration into real-life scenarios using wearable or constrained equipment such as IoT devices (e.g. Raspberry Pi), XR hardware (e.g. HoloLens), or smartphones. To tackle this issue, we developed Tesla, a Message Passing Neural Network (MPNN) graph convolution approach for mmWave radar point clouds. The model outperforms the state of the art on two datasets in terms of accuracy while reducing the computational complexity and, hence, the execution time. In particular, the approach is able to predict a gesture almost 8 times faster than the most accurate competitor. Our performance evaluation in different scenarios (environments, angles, distances) shows that Tesla generalizes well and improves the accuracy by up to 20% in challenging scenarios like a through-wall setting and sensing at extreme angles. Utilizing Tesla, we develop Tesla-Rapture, a real-time implementation using a mmWave radar on a Raspberry Pi 4, and evaluate its accuracy and time complexity. We also publish the source code, the trained models, and the implementation of the model for embedded devices.
    Sampling Network Guided Cross-Entropy Method for Unsupervised Point Cloud Registration. (arXiv:2109.06619v1 [cs.CV])
    (2 min) In this paper, by modeling the point cloud registration task as a Markov decision process, we propose an end-to-end deep model embedded with the cross-entropy method (CEM) for unsupervised 3D registration. Our model consists of a sampling network module and a differentiable CEM module. In our sampling network module, given a pair of point clouds, the sampling network learns a prior sampling distribution over the transformation space. The learned sampling distribution can be used as a "good" initialization of the differentiable CEM module. In our differentiable CEM module, we first propose a maximum consensus criterion based alignment metric as the reward function for the point cloud registration task. Based on the reward function, for each state, we then construct a fused score function to evaluate the sampled transformations, where we weight the current and future rewards of the transformations. Particularly, the future rewards of the sampled transforms are obtained by performing the iterative closest point (ICP) algorithm on the transformed state. By selecting the top-k transformations with the highest scores, we iteratively update the sampling distribution. Furthermore, in order to make the CEM differentiable, we use the sparsemax function to replace the hard top-$k$ selection. Finally, we formulate a Geman-McClure estimator based loss to train our end-to-end registration model. Extensive experimental results demonstrate the good registration performance of our method on benchmark datasets.
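    The differentiable replacement for hard top-k selection mentioned above is sparsemax (Martins & Astudillo, 2016); a generic implementation is sketched below (Python/PyTorch, operating over the last dimension). This is the standard sparsemax projection rather than the paper's full CEM module: it maps scores onto the probability simplex while assigning exactly zero weight to low-scoring candidate transformations.

        import torch

        def sparsemax(scores):
            # sparsemax over the last dimension: a sparse, differentiable
            # alternative to softmax / hard top-k selection
            z, _ = torch.sort(scores, descending=True, dim=-1)
            cumsum = z.cumsum(dim=-1)
            k = torch.arange(1, scores.size(-1) + 1, device=scores.device, dtype=scores.dtype)
            support = 1 + k * z > cumsum                    # candidates kept in the support
            k_z = support.sum(dim=-1, keepdim=True).clamp(min=1)
            tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z    # threshold of the simplex projection
            return torch.clamp(scores - tau, min=0)

        # usage: weights = sparsemax(transformation_scores)  # zero weight for weak candidates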
    A Semantic Indexing Structure for Image Retrieval. (arXiv:2109.06583v1 [cs.CV])
    (2 min) In large-scale image retrieval, many indexing methods have been proposed to narrow down the search scope of retrieval. The features extracted from images are usually of high dimensionality or unfixed size due to the existence of key points. Most existing index structures suffer from the curse of dimensionality, unfixed feature sizes, and/or the loss of semantic similarity. In this paper, a new classification-based indexing structure, called Semantic Indexing Structure (SIS), is proposed, in which we utilize semantic categories rather than clustering centers to create database partitions, such that the proposed SIS index can be combined with feature extractors without restrictions on dimensionality. Besides, it is observed that the size of each semantic partition is positively correlated with the semantic distribution of the database. Along this line, we found that when the partition number is normalized to five, the proposed algorithm performs very well in all tests. Compared with state-of-the-art models, SIS achieves outstanding performance.
    High-Fidelity GAN Inversion for Image Attribute Editing. (arXiv:2109.06590v1 [cs.CV])
    (2 min) We present a novel high-fidelity generative adversarial network (GAN) inversion framework that enables attribute editing with image-specific details well-preserved (e.g., background, appearance and illumination). We first formulate GAN inversion as a lossy data compression problem and carefully discuss the Rate-Distortion-Edit trade-off. Due to this trade-off, previous works fail to achieve high-fidelity reconstruction while keeping compelling editing ability with a low bit-rate latent code only. In this work, we propose a distortion consultation approach that employs the distortion map as a reference for reconstruction. In the distortion consultation inversion (DCI), the distortion map is first projected to a high-rate latent map, which then complements the basic low-rate latent code with (lost) details via consultation fusion. To achieve high-fidelity editing, we propose an adaptive distortion alignment (ADA) module with a self-supervised training scheme. Extensive experiments in the face and car domains show a clear improvement in terms of both inversion and editing quality.
    Dynamic Attentive Graph Learning for Image Restoration. (arXiv:2109.06620v1 [cs.CV])
    (2 min) Non-local self-similarity in natural images has been verified to be an effective prior for image restoration. However, most existing deep non-local methods assign a fixed number of neighbors for each query item, neglecting the dynamics of non-local correlations. Moreover, the non-local correlations are usually based on pixels, prone to be biased due to image degradation. To rectify these weaknesses, in this paper, we propose a dynamic attentive graph learning model (DAGL) to explore the dynamic non-local property on patch level for image restoration. Specifically, we propose an improved graph model to perform patch-wise graph convolution with a dynamic and adaptive number of neighbors for each node. In this way, image content can adaptively balance over-smooth and over-sharp artifacts through the number of its connected neighbors, and the patch-wise non-local correlations can enhance the message passing process. Experimental results on various image restoration tasks: synthetic image denoising, real image denoising, image demosaicing, and compression artifact reduction show that our DAGL can produce state-of-the-art results with superior accuracy and visual quality. The source code is available at https://github.com/jianzhangcs/DAGL.
    POPCORN: Progressive Pseudo-labeling with Consistency Regularization and Neighboring. (arXiv:2109.06361v1 [cs.CV])
    (2 min) Semi-supervised learning (SSL) uses unlabeled data to compensate for the scarcity of annotated images and the lack of method generalization to unseen domains, two usual problems in medical segmentation tasks. In this work, we propose POPCORN, a novel method combining consistency regularization and pseudo-labeling designed for image segmentation. The proposed framework uses high-level regularization to constrain our segmentation model to use similar latent features for images with similar segmentations. POPCORN estimates a proximity graph to select data from the easiest samples to the more difficult ones, in order to ensure accurate pseudo-labeling and to limit confirmation bias. Applied to multiple sclerosis lesion segmentation, our method demonstrates competitive results compared to other state-of-the-art SSL strategies.
    Camera-Tracklet-Aware Contrastive Learning for Unsupervised Vehicle Re-Identification. (arXiv:2109.06401v1 [cs.CV])
    (2 min) Recently, vehicle re-identification methods based on deep learning have achieved remarkable results. However, this achievement requires large-scale and well-annotated datasets. In constructing such datasets, assigning globally consistent identities (IDs) to vehicles captured by a large number of cameras is labour-intensive, because it requires considering their subtle appearance differences and viewpoint variations. In this paper, we propose camera-tracklet-aware contrastive learning (CTACL), which uses multi-camera tracklet information without vehicle identity labels. The proposed CTACL divides an unlabelled domain, i.e., the entire set of vehicle images, into multiple camera-level subdomains and conducts contrastive learning within and beyond the subdomains. The positive and negative samples for contrastive learning are defined using the tracklet IDs of each camera. Additionally, domain adaptation across camera networks is introduced to improve the generalisation performance of the learnt representations and alleviate the performance degradation resulting from the domain gap between the subdomains. We demonstrate the effectiveness of our approach on video-based and image-based vehicle Re-ID datasets. Experimental results show that the proposed method outperforms the recent state-of-the-art unsupervised vehicle Re-ID methods. The source code for this paper is publicly available at `https://github.com/andreYoo/CTAM-CTACL-VVReID.git'.
    Open-World Active Learning with Stacking Ensemble for Self-Driving Cars. (arXiv:2109.06628v1 [cs.CV])
    (2 min) The environments in which autonomous cars act are high-risk, dynamic, and full of uncertainty, demanding a continuous update of their sensory information and knowledge bases. The frequency of encountering unknown objects is too high to rely on classical Artificial Intelligence (AI) classification models, which usually assume a closed world. This classification problem is better addressed with an open-world AI approach. We propose an algorithm not only to identify all the known entities that may appear in front of the car, but also to detect and learn the classes of those unknown objects that rarely stand on a highway (e.g., a box lost from a truck). Our approach relies on the DOC algorithm of Lei Shu et al. as well as on the Query-by-Committee algorithm.
    Multi-Scale Input Strategies for Medulloblastoma Tumor Classification using Deep Transfer Learning. (arXiv:2109.06547v1 [eess.IV])
    (2 min) Medulloblastoma (MB) is a primary central nervous system tumor and the most common malignant brain cancer among children. Neuropathologists perform microscopic inspection of histopathological tissue slides under a microscope to assess the severity of the tumor. This is a time-consuming task and often affected by observer variability. Recently, pre-trained convolutional neural networks (CNNs) have shown promising results for MB subtype classification. Typically, high-resolution images are divided into smaller tiles for classification, but the size of the tiles has not been systematically evaluated. We study the impact of tile size and input strategy and classify the two major histopathological subtypes, Classic and Desmoplastic/Nodular. To this end, we use recently proposed EfficientNets and evaluate tiles of increasing size combined with various downsampling scales. Our results demonstrate that using large input tiles followed by intermediate downsampling and patch cropping significantly improves MB classification performance. Our top-performing method achieves an AUC-ROC value of 90.90\% compared to 84.53\% for the previous approach with smaller input tiles.
    Deep Convolutional Generative Modeling for Artificial Microstructure Development of Aluminum-Silicon Alloy. (arXiv:2109.06635v1 [eess.IV])
    (2 min) Machine learning, a sub-domain of artificial intelligence, is finding various applications in the manufacturing and materials science sectors. In the present study, deep generative modeling, an unsupervised machine learning technique, has been adapted to construct artificial microstructures of an aluminium-silicon alloy. Deep generative adversarial networks have been used to develop artificial microstructures from the given microstructure image dataset. The results obtained show that the developed models learnt to replicate the lining near certain features of the microstructure images.
    Dense Deep Unfolding Network with 3D-CNN Prior for Snapshot Compressive Imaging. (arXiv:2109.06548v1 [eess.IV])
    (2 min) Snapshot compressive imaging (SCI) aims to record three-dimensional signals via a two-dimensional camera. For the sake of building a fast and accurate SCI recovery algorithm, we incorporate the interpretability of model-based methods and the speed of learning-based ones and present a novel dense deep unfolding network (DUN) with 3D-CNN prior for SCI, where each phase is unrolled from an iteration of Half-Quadratic Splitting (HQS). To better exploit the spatial-temporal correlation among frames and address the problem of information loss between adjacent phases in existing DUNs, we propose to adopt the 3D-CNN prior in our proximal mapping module and develop a novel dense feature map (DFM) strategy, respectively. Besides, in order to promote network robustness, we further propose a dense feature map adaption (DFMA) module to allow inter-phase information to fuse adaptively. All the parameters are learned in an end-to-end fashion. Extensive experiments on simulation data and real data verify the superiority of our method. The source code is available at https://github.com/jianzhangcs/SCI3D.
    Dodging Attack Using Carefully Crafted Natural Makeup. (arXiv:2109.06467v1 [cs.CV])
    (2 min) Deep learning face recognition models are used by state-of-the-art surveillance systems to identify individuals passing through public areas (e.g., airports). Previous studies have demonstrated the use of adversarial machine learning (AML) attacks to successfully evade identification by such systems, both in the digital and physical domains. Attacks in the physical domain, however, require significant manipulation to the human participant's face, which can raise suspicion by human observers (e.g. airport security officers). In this study, we present a novel black-box AML attack which carefully crafts natural makeup, which, when applied on a human participant, prevents the participant from being identified by facial recognition models. We evaluated our proposed attack against the ArcFace face recognition model, with 20 participants in a real-world setup that includes two cameras, different shooting angles, and different lighting conditions. The evaluation results show that in the digital domain, the face recognition system was unable to identify all of the participants, while in the physical domain, the face recognition system was able to identify the participants in only 1.22% of the frames (compared to 47.57% without makeup and 33.73% with random natural makeup), which is below a reasonable threshold of a realistic operational environment.
    Multi-Level Features Contrastive Networks for Unsupervised Domain Adaptation. (arXiv:2109.06543v1 [cs.CV])
    (2 min) Unsupervised domain adaptation aims to train a model on the labeled source domain to make predictions on the unlabeled target domain when the data distributions of the two domains differ. As a result, it needs to reduce the data distribution difference between the two domains to improve the model's generalization ability. Existing methods tend to align the two domains directly at the domain level, or perform class-level domain alignment based on deep features. The former ignores the relationship between the various classes in the two domains, which may cause serious negative transfer; the latter alleviates this by introducing pseudo-labels for the target domain, but does not consider the importance of performing class-level alignment on shallow feature representations. In this paper, we build on class-level alignment methods. The proposed method dramatically reduces the difference between the two domains by aligning multi-level features. In the case where the two domains share the label space, class-level alignment is implemented by introducing Multi-Level Feature Contrastive Networks (MLFCNet). In practice, since the categories of samples in the target domain are unavailable, we iteratively use a clustering algorithm to obtain pseudo-labels, and then minimize the Multi-Level Contrastive Discrepancy (MLCD) loss to achieve more accurate class-level alignment. Experiments on three real-world benchmarks, ImageCLEF-DA, Office-31 and Office-Home, demonstrate that MLFCNet compares favorably against existing state-of-the-art domain adaptation methods.
    Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding. (arXiv:2109.06400v1 [cs.CV])
    (2 min) A key solution to temporal sentence grounding (TSG) lies in how to learn effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single-step process. However, such single-step attention is insufficient in practice, since complicated relations between inter- and intra-modality are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for the TSG task, which iteratively interacts inter- and intra-modal features within multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism in a parallel manner. To further calibrate the misaligned attention caused by each reasoning step, we also devise a calibration module following each attention module to refine the alignment knowledge. With such an iterative alignment scheme, our IA-Net can robustly capture the fine-grained relations between the vision and language domains step-by-step for progressively reasoning about the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model performs better than the state-of-the-art methods.
    Image-Based Alignment of 3D Scans. (arXiv:2109.06526v1 [cs.CV])
    (2 min) Full 3D scanning can efficiently be obtained using structured light scanning combined with a rotation stage. In this setting it is, however, necessary to reposition the object and scan it in different poses in order to cover the entire object. In this case, correspondence between the scans is lost, since the object was moved. In this paper, we propose a fully automatic method for aligning the scans of an object in two different poses. This is done by matching 2D features between images from two poses and utilizing correspondence between the images and the scanned point clouds. To demonstrate the approach, we present the results of scanning three dissimilar objects.
    Space Time Recurrent Memory Network. (arXiv:2109.06474v1 [cs.CV])
    (2 min) We propose a novel visual memory network architecture for the learning and inference problem in the spatial-temporal domain. Different from the popular transformers, we maintain a fixed set of memory slots in our memory network and explore designs to input new information into the memory, combine the information in different memory slots and decide when to discard old memory slots. Finally, this architecture is benchmarked on the video object segmentation and video prediction problems. Through the experiments, we show that our memory architecture can achieve competitive results with state-of-the-art while maintaining constant memory capacity.
    Anomaly Attribution of Multivariate Time Series using Counterfactual Reasoning. (arXiv:2109.06562v1 [cs.LG])
    (2 min) There are numerous methods for detecting anomalies in time series, but that is only the first step to understanding them. We strive to exceed this by explaining those anomalies. Thus we develop a novel attribution scheme for multivariate time series relying on counterfactual reasoning. We aim to answer the counterfactual question of would the anomalous event have occurred if the subset of the involved variables had been more similarly distributed to the data outside of the anomalous interval. Specifically, we detect anomalous intervals using the Maximally Divergent Interval (MDI) algorithm, replace a subset of variables with their in-distribution values within the detected interval and observe if the interval has become less anomalous, by re-scoring it with MDI. We evaluate our method on multivariate temporal and spatio-temporal data and confirm the accuracy of our anomaly attribution of multiple well-understood extreme climate events such as heatwaves and hurricanes.
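    The counterfactual re-scoring step can be illustrated schematically (Python/NumPy; `score_fn` is a placeholder standing in for the MDI anomaly scorer, and the use of the outside-interval mean as the "in-distribution" replacement is an assumption of this sketch):

        import numpy as np
        from itertools import combinations

        def attribute_anomaly(series, interval, score_fn, max_subset=2):
            # series: (time, variables); interval: (start, end) of a detected anomaly;
            # score_fn stands in for the MDI anomaly scorer (assumption of this sketch)
            start, end = interval
            baseline = score_fn(series[start:end])
            outside = np.concatenate([series[:start], series[end:]], axis=0)
            attributions = []
            for k in range(1, max_subset + 1):
                for subset in combinations(range(series.shape[1]), k):
                    cols = list(subset)
                    counterfactual = series[start:end].copy()
                    # replace the chosen variables with typical in-distribution values
                    counterfactual[:, cols] = outside[:, cols].mean(axis=0)
                    attributions.append((subset, baseline - score_fn(counterfactual)))
            # variables whose replacement removes most of the anomaly score explain it best
            return sorted(attributions, key=lambda t: t[1], reverse=True)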
    Improved Few-shot Segmentation by Redifinition of the Roles of Multi-level CNN Features. (arXiv:2109.06432v1 [cs.CV])
    (2 min) This study is concerned with few-shot segmentation, i.e., segmenting the region of an unseen object class in a query image, given support image(s) of its instances. The current methods rely on the pretrained CNN features of the support and query images. The key to good performance lies in the proper fusion of their mid-level and high-level features; the former contain shape-oriented information, while the latter have class-oriented information. Current state-of-the-art methods follow the approach of Tian et al., which gives the mid-level features the primary role and the high-level features the secondary role. In this paper, we reinterpret this widely employed approach by redefining the roles of the multi-level features; we swap the primary and secondary roles. Specifically, we regard the current methods as improving the initial estimate generated from the high-level features using the mid-level features. This reinterpretation suggests a new application of the current methods: applying the same network multiple times to iteratively update the estimate of the object's region, starting from its initial estimate. Our experiments show that this method is effective and has updated the previous state-of-the-art on COCO-20$^i$ in the 1-shot and 5-shot settings and on PASCAL-5$^i$ in the 1-shot setting.
    From Heatmaps to Structural Explanations of Image Classifiers. (arXiv:2109.06365v1 [cs.CV])
    (2 min) This paper summarizes our endeavors in the past few years in terms of explaining image classifiers, with the aim of including negative results and insights we have gained. The paper starts by describing the explainable neural network (XNN), which attempts to extract and visualize several high-level concepts purely from the deep network, without relying on human linguistic concepts. This helps users understand network classifications that are less intuitive and substantially improves user performance on a difficult fine-grained classification task of discriminating among different species of seagulls. Realizing that an important missing piece is a reliable heatmap visualization tool, we developed I-GOS and iGOS++, which utilize integrated gradients to avoid local optima in heatmap generation and improved the performance across all resolutions. During the development of those visualizations, we realized that for a significant number of images, the classifier has multiple different paths to reach a confident prediction. This has led to our recent development of structured attention graphs (SAGs), an approach that utilizes beam search to locate multiple coarse heatmaps for a single image, and compactly visualizes a set of heatmaps by capturing how different combinations of image regions impact the confidence of a classifier. Through the research process, we have learned much about building deep network explanations, the existence and frequency of multiple explanations, and various tricks of the trade that make explanations work. In this paper, we attempt to share those insights and opinions with the readers in the hope that some of them will be informative for future researchers on explainable deep learning.
    3-Dimensional Deep Learning with Spatial Erasing for Unsupervised Anomaly Segmentation in Brain MRI. (arXiv:2109.06540v1 [eess.IV])
    (2 min) Purpose. Brain Magnetic Resonance Images (MRIs) are essential for the diagnosis of neurological diseases. Recently, deep learning methods for unsupervised anomaly detection (UAD) have been proposed for the analysis of brain MRI. These methods rely on healthy brain MRIs and eliminate the requirement for pixel-wise annotated data compared to supervised deep learning. While a wide range of methods for UAD have been proposed, most are 2D and only learn from MRI slices, disregarding that brain lesions are inherently 3D, so the spatial context of MRI volumes remains unexploited. Methods. We investigate whether using increased spatial context, by learning from MRI volumes combined with spatial erasing, leads to improved unsupervised anomaly segmentation performance compared to learning from slices. We evaluate and compare 2D variational autoencoders (VAEs) to their 3D counterparts, propose 3D input erasing, and systematically study the impact of the dataset size on performance. Results. Using two publicly available segmentation datasets for evaluation, 3D VAEs outperform their 2D counterparts, highlighting the advantage of volumetric context. Also, our 3D erasing methods allow for further performance improvements. Our best-performing 3D VAE with input erasing leads to an average DICE score of 31.40% compared to 25.76% for the 2D VAE. Conclusions. We propose 3D deep learning methods for UAD in brain MRI combined with 3D erasing and demonstrate that 3D methods clearly outperform their 2D counterparts for anomaly segmentation. Also, our spatial erasing method allows for further performance improvements and reduces the requirement for large datasets.
    Physics Driven Domain Specific Transporter Framework with Attention Mechanism for Ultrasound Imaging. (arXiv:2109.06346v1 [eess.IV])
    (3 min) Most applications of deep learning techniques in medical imaging are supervised and require a large number of labeled data, which is expensive and requires many hours of careful annotation by experts. In this paper, we propose an unsupervised, physics-driven, domain-specific transporter framework with an attention mechanism to identify relevant key points, with applications in ultrasound imaging. The proposed framework identifies key points that provide a concise geometric representation highlighting regions with high structural variation in ultrasound videos. We incorporate physics-driven domain-specific information as a feature probability map and use the Radon transform to highlight features in specific orientations. The proposed framework has been trained on 130 lung ultrasound (LUS) videos and 113 wrist ultrasound (WUS) videos and validated on 100 LUS videos and 58 WUS videos acquired from multiple centers across the globe. Images from both datasets were independently assessed by experts to identify clinically relevant features such as A-lines, B-lines and the pleura from LUS videos, and the radial metaphysis, radial epiphysis and carpal bones from WUS videos. The key points detected from both datasets showed high sensitivity (LUS = 99\%, WUS = 74\%) in detecting the image landmarks identified by experts. Also, when employed for classifying a given lung image into normal and abnormal classes, the proposed approach, even with no prior training, achieved an average accuracy of 97\% and an average F1-score of 95\% on the task of co-classification with 3-fold cross-validation. Given the purely unsupervised nature of the proposed approach, we expect the key point detection approach to increase the applicability of ultrasound in various examinations performed in emergency and point-of-care settings.
    Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos. (arXiv:2109.06398v1 [cs.CV])
    (2 min) We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals. Although they have achieved decent performance, the proposals are handcrafted and redundant. Recently, the bottom-up framework has attracted increasing attention due to its superior efficiency. It directly predicts the probability of each frame being a boundary. However, the performance of the bottom-up model is inferior to its top-down counterpart, as it fails to exploit segment-level interaction. In this paper, we propose an Adaptive Proposal Generation Network (APGN) to maintain segment-level interaction while improving efficiency. Specifically, we first perform foreground-background classification on the video and regress on the foreground frames to adaptively generate proposals. In this way, the handcrafted proposal design is discarded and redundant proposals are reduced. Then, a proposal consolidation module is further developed to enhance the semantics of the generated proposals. Finally, we locate the target moments with these generated proposals following the top-down framework. Extensive experiments on three challenging benchmarks show that our proposed APGN significantly outperforms previous state-of-the-art methods.
    Monocular Camera Localization for Automated Vehicles Using Image Retrieval. (arXiv:2109.06296v1 [cs.CV])
    (2 min) We address the problem of finding the current position and heading angle of an autonomous vehicle in real time using a single camera. Compared to methods which require LiDARs and high-definition (HD) 3D maps in real time, the proposed approach is easily scalable and computationally efficient, at the price of lower precision. The new method combines and adapts existing algorithms in three different fields: image retrieval, mapping databases, and particle filtering. The result is a simple, real-time localization method using an image retrieval approach whose performance is comparable to other monocular camera localization methods which use a map built with LiDARs. We evaluate the proposed method using the KITTI odometry dataset and via closed-loop experiments with an indoor 1:10 autonomous vehicle. The tests demonstrate real-time capability and 10 cm-level accuracy. Also, experimental results of the closed-loop indoor tests show the presence of a positive feedback loop between the localization error and the control error. This phenomenon is analysed in detail at the end of the article.
    Spiking Neural Networks for Visual Place Recognition via Weighted Neuronal Assignments. (arXiv:2109.06452v1 [cs.CV])
    (2 min) Spiking neural networks (SNNs) offer both compelling potential advantages, including energy efficiency and low latencies, and challenges including the non-differentiable nature of event spikes. Much of the initial research in this area has converted deep neural networks to equivalent SNNs, but this conversion approach potentially negates some of the potential advantages of SNN-based approaches developed from scratch. One promising area for high performance SNNs is template matching and image recognition. This research introduces the first high performance SNN for the Visual Place Recognition (VPR) task: given a query image, the SNN has to find the closest match out of a list of reference images. At the core of this new system is a novel assignment scheme that implements a form of ambiguity-informed salience, by up-weighting single-place-encoding neurons and down-weighting "ambiguous" neurons that respond to multiple different reference places. In a range of experiments on the challenging Oxford RobotCar and Nordland datasets, we show that our SNN achieves comparable VPR performance to state-of-the-art and classical techniques, and degrades gracefully in performance with an increasing number of reference places. Our results provide a significant milestone towards SNNs that can provide robust, energy-efficient and low latency robot localization.
    AdaPruner: Adaptive Channel Pruning and Effective Weights Inheritance. (arXiv:2109.06397v1 [cs.CV])
    (2 min) Channel pruning is one of the major compression approaches for deep neural networks. While previous pruning methods have mostly focused on identifying unimportant channels, channel pruning has been considered a special case of neural architecture search in recent years. However, existing methods are either complicated or prone to sub-optimal pruning. In this paper, we propose a pruning framework that adaptively determines the number of channels in each layer as well as the weight inheritance criteria for the sub-network. First, it evaluates the importance of each block in the network based on the mean of the scaling parameters of the BN layers. Second, it uses the bisection method to quickly find the compact sub-network satisfying the budget. Finally, it adaptively and efficiently chooses the weight inheritance criterion that fits the current architecture and fine-tunes the pruned network to recover performance. AdaPruner allows obtaining pruned networks quickly, accurately and efficiently, taking into account both the structure and the initialization weights. We prune the currently popular CNN models (VGG, ResNet, MobileNetV2) on different image classification datasets, and the experimental results demonstrate the effectiveness of our proposed method. On ImageNet, we reduce the FLOPs of MobileNetV2 by 32.8% with only a 0.62% decrease in top-1 accuracy, which exceeds all previous state-of-the-art channel pruning methods. The code will be released.
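    A hedged sketch of the two search ingredients named above, the BN-based block importance and the bisection search for a sub-network within a compute budget (Python/PyTorch; treating the bisection variable as a global threshold over per-channel scores and the budget as a FLOPs target are assumptions of this illustration, and `flops_of` is a hypothetical helper):

        import torch.nn as nn

        def block_importance(block):
            # mean absolute BN scaling factor (gamma) of a block, a simplified reading
            # of the importance criterion described above
            gammas = [m.weight.abs().mean().item()
                      for m in block.modules() if isinstance(m, nn.BatchNorm2d)]
            return sum(gammas) / max(len(gammas), 1)

        def find_pruning_threshold(flops_of, budget, lo=0.0, hi=1.0, iters=30):
            # bisection over a global threshold on per-channel BN scores; channels whose
            # score falls below the threshold are pruned. flops_of(threshold) is assumed
            # to return the FLOPs of the resulting sub-network (hypothetical helper).
            for _ in range(iters):
                mid = 0.5 * (lo + hi)
                if flops_of(mid) > budget:   # still too large -> prune more aggressively
                    lo = mid
                else:                        # within budget -> try to keep more channels
                    hi = mid
            return hi                        # smallest tested threshold that meets the budget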
    Sensor Adversarial Traits: Analyzing Robustness of 3D Object Detection Sensor Fusion Models. (arXiv:2109.06363v1 [cs.CV])
    (2 min) A critical aspect of autonomous vehicles (AVs) is the object detection stage, which is increasingly being performed with sensor fusion models: multimodal 3D object detection models which utilize both 2D RGB image data and 3D data from a LIDAR sensor as inputs. In this work, we perform the first study to analyze the robustness of a high-performance, open-source sensor fusion model architecture towards adversarial attacks, and challenge the popular belief that the use of additional sensors automatically mitigates the risk of adversarial attacks. We find that despite the use of a LIDAR sensor, the model is vulnerable to our purposefully crafted image-based adversarial attacks, including disappearance, universal patch, and spoofing. After identifying the underlying reason, we explore some potential defenses and provide some recommendations for improved sensor fusion models.
    Cross-Modality Domain Adaptation for Vestibular Schwannoma and Cochlea Segmentation. (arXiv:2109.06274v1 [eess.IV])
    (2 min) Automatic methods to segment the vestibular schwannoma (VS) tumors and the cochlea from magnetic resonance imaging (MRI) are critical to VS treatment planning. Although supervised methods have achieved satisfactory performance in VS segmentation, they require full annotations by experts, which is laborious and time-consuming. In this work, we aim to tackle the VS and cochlea segmentation problem in an unsupervised domain adaptation setting. Our proposed method leverages both the image-level domain alignment to minimize the domain divergence and semi-supervised training to further boost the performance. Furthermore, we propose to fuse the labels predicted from multiple models via noisy label correction. Our results on the challenge validation leaderboard showed that our unsupervised method has achieved promising VS and cochlea segmentation performance with mean dice score of 0.8261 $\pm$ 0.0416; The mean dice value for the tumor is 0.8302 $\pm$ 0.0772. This is comparable to the weakly-supervised based method.
    Incremental Abstraction in Distributed Probabilistic SLAM Graphs. (arXiv:2109.06241v1 [cs.CV])
    (2 min) Scene graphs represent the key components of a scene in a compact and semantically rich way, but are difficult to build during incremental SLAM operation because of the challenges of robustly identifying abstract scene elements and optimising continually changing, complex graphs. We present a distributed, graph-based SLAM framework for incrementally building scene graphs based on two novel components. First, we propose an incremental abstraction framework in which a neural network proposes abstract scene elements that are incorporated into the factor graph of a feature-based monocular SLAM system. Scene elements are confirmed or rejected through optimisation and incrementally replace the points yielding a more dense, semantic and compact representation. Second, enabled by our novel routing procedure, we use Gaussian Belief Propagation (GBP) for distributed inference on a graph processor. The time per iteration of GBP is structure-agnostic and we demonstrate the speed advantages over direct methods for inference of heterogeneous factor graphs. We run our system on real indoor datasets using planar abstractions and recover the major planes with significant compression.
    MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks. (arXiv:2109.06275v1 [cs.AI])
    (2 min) An ideal integration of autonomous agents in a human world implies that they are able to collaborate on human terms. In particular, theory of mind plays an important role in maintaining common ground during human collaboration and communication. To enable theory of mind modeling in situated interactions, we introduce a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners' beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication. As a first step towards our goal of developing embodied AI agents able to infer belief states of collaborative partners in situ, we build and present results on computational models for several theory of mind tasks.
    Cross-Region Domain Adaptation for Class-level Alignment. (arXiv:2109.06422v1 [cs.CV])
    (2 min) Semantic segmentation requires a lot of training data, which necessitates costly annotation. There have been many studies on unsupervised domain adaptation (UDA) from one domain to another, e.g., from computer graphics to real images. However, there is still a gap in accuracy between UDA and supervised training on native domain data, which is arguably attributable to class-level misalignment between the source and target domain data. To cope with this, we propose a method that applies adversarial training to align two feature distributions in the target domain. It uses a self-training framework to split the image into two regions (i.e., trusted and untrusted), which form two distributions to align in the feature space. We term this approach cross-region adaptation (CRA) to distinguish it from the previous methods of aligning different domain distributions, which we call cross-domain adaptation (CDA). CRA can be applied after any CDA method. Experimental results show that it consistently improves the accuracy of the combined CDA method, updating the state-of-the-art.
    Generatively Augmented Neural Network Watchdog for Image Classification Networks. (arXiv:2109.06168v1 [cs.CV])
    (2 min) The identification of out-of-distribution data is vital to the deployment of classification networks. For example, a generic neural network that has been trained to differentiate between images of dogs and cats can only classify an input as either a dog or a cat. If a picture of a car or a kumquat were supplied to this classifier, the result would still be either a dog or a cat. To mitigate this, techniques such as the neural network watchdog have been developed. The compression of the image input into the latent layer of an autoencoder defines the in-distribution region of the image space. This in-distribution set of input data has a corresponding boundary in the image space. The watchdog assesses whether inputs fall inside or outside this boundary. This paper demonstrates how to sharpen this boundary using generative-network training-data augmentation, thereby improving the discrimination and overall performance of the watchdog.
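    A minimal sketch of the watchdog idea (Python/PyTorch; the convolutional autoencoder, the reconstruction-error criterion and the fixed threshold are generic assumptions for illustration, not the paper's exact architecture or boundary definition):

        import torch
        import torch.nn as nn

        class Watchdog(nn.Module):
            """An autoencoder trained on in-distribution images; at inference time,
            inputs whose reconstruction error exceeds a threshold are flagged as
            out-of-distribution before they reach the classifier."""
            def __init__(self, threshold=0.05):
                super().__init__()
                self.threshold = threshold
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 8, 3, stride=2, padding=1), nn.ReLU(),
                )
                self.decoder = nn.Sequential(
                    nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
                )

            def forward(self, x):
                return self.decoder(self.encoder(x))

            def in_distribution(self, x):
                err = ((self(x) - x) ** 2).mean(dim=(1, 2, 3))  # per-image reconstruction error
                return err < self.threshold                      # True -> pass to the classifier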
  • cs.IR updates on arXiv.org

    Semantic Answer Type Prediction using BERT: IAI at the ISWC SMART Task 2020. (arXiv:2109.06714v1 [cs.CL])
    (2 min) This paper summarizes our participation in the SMART Task of the ISWC 2020 Challenge. A particular question we are interested in answering is how well neural methods, and specifically transformer models, such as BERT, perform on the answer type prediction task compared to traditional approaches. Our main finding is that coarse-grained answer types can be identified effectively with standard text classification methods, with over 95% accuracy, and BERT can bring only marginal improvements. For fine-grained type detection, on the other hand, BERT clearly outperforms previous retrieval-based approaches.
    Session-aware Recommendation: A Surprising Quest for the State-of-the-art. (arXiv:2011.03424v2 [cs.IR] UPDATED)
    (2 min) Recommender systems are designed to help users in situations of information overload. In recent years, we observed increased interest in session-based recommendation scenarios, where the problem is to make item suggestions to users based only on interactions observed in an ongoing session. However, in cases where interactions from previous user sessions are available, the recommendations can be personalized according to the users' long-term preferences, a process called session-aware recommendation. Today, research in this area is scattered, and many existing works only compare session-aware with session-based models. This makes it challenging to understand what represents the state-of-the-art. To close this research gap, we benchmarked recent session-aware algorithms against each other and against a number of session-based recommendation algorithms and trivial extensions thereof. Our comparison, to some surprise, revealed that (i) simple techniques based on nearest neighbors consistently outperform recent neural techniques and that (ii) session-aware models were mostly not better than approaches that do not use long-term preference information. Our work therefore not only points to potential methodological issues where new methods are compared to weak baselines, but also indicates that there remains a huge potential for more sophisticated session-aware recommendation algorithms.
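    To make concrete what a "simple nearest-neighbor technique" looks like in this setting, a hedged sketch follows (Python; the Jaccard similarity, the neighborhood size k and the scoring rule are assumptions chosen for illustration rather than any specific baseline from the benchmark):

        from collections import Counter

        def session_knn_recommend(current_session, past_sessions, k=50, top_n=10):
            # rank past sessions by similarity to the ongoing session, then score
            # items from the k most similar sessions by their sessions' similarity
            current = set(current_session)
            sims = []
            for sess in past_sessions:
                s = set(sess)
                jaccard = len(current & s) / len(current | s)
                sims.append((jaccard, s))
            neighbours = sorted(sims, key=lambda t: t[0], reverse=True)[:k]
            scores = Counter()
            for sim, sess in neighbours:
                for item in sess - current:          # only recommend unseen items
                    scores[item] += sim
            return [item for item, _ in scores.most_common(top_n)]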
    Quantum Mathematics in Artificial Intelligence. (arXiv:2101.04255v4 [cs.AI] UPDATED)
    (3 min) In the decade since 2010, successes in artificial intelligence have been at the forefront of computer science and technology, and vector space models have solidified a position at the forefront of artificial intelligence. At the same time, quantum computers have become much more powerful, and announcements of major advances are frequently in the news. The mathematical techniques underlying both these areas have more in common than is sometimes realized. Vector spaces took a position at the axiomatic heart of quantum mechanics in the 1930s, and this adoption was a key motivation for the derivation of logic and probability from the linear geometry of vector spaces. Quantum interactions between particles are modelled using the tensor product, which is also used to express objects and operations in artificial neural networks. This paper describes some of these common mathematical areas, including examples of how they are used in artificial intelligence (AI), particularly in automated reasoning and natural language processing (NLP). Techniques discussed include vector spaces, scalar products, subspaces and implication, orthogonal projection and negation, dual vectors, density matrices, positive operators, and tensor products. Application areas include information retrieval, categorization and implication, modelling word-senses and disambiguation, inference in knowledge bases, and semantic composition. Some of these approaches can potentially be implemented on quantum hardware. Many of the practical steps in this implementation are in early stages, and some are already realized. Explaining some of the common mathematical tools can help researchers in both AI and quantum computing further exploit these overlaps, recognizing and exploring new directions along the way.
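    Two of the listed techniques, negation via orthogonal projection and composition via tensor products, are easy to state concretely (Python/NumPy; the toy vectors are purely illustrative):

        import numpy as np

        def negate(a, b):
            # "a NOT b": project a onto the subspace orthogonal to b
            b_hat = b / np.linalg.norm(b)
            return a - (a @ b_hat) * b_hat

        def compose(a, b):
            # the tensor (outer) product models the interaction/composition of two vectors
            return np.outer(a, b).ravel()

        a = np.array([0.8, 1.0, 0.3])
        b = np.array([0.0, 1.0, 0.0])
        print(negate(a, b))          # component along b removed -> [0.8, 0.0, 0.3]
        print(compose(a, b).shape)   # (9,)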
    Information Cocoons in Online Navigation. (arXiv:2109.06589v1 [physics.soc-ph])
    (2 min) Social media and online navigation bring us enjoyable experience in accessing information, and simultaneously create information cocoons (ICs) in which we are unconsciously trapped with limited and biased information. We provide a formal definition of IC in the scenario of online navigation. Subsequently, by analyzing real recommendation networks extracted from Science, PNAS and Amazon websites, and testing mainstream algorithms in disparate recommender systems, we demonstrate that similarity-based recommendation techniques result in ICs, which suppress the system navigability by hundreds of times. We further propose a flexible recommendation strategy that solves the IC-induced problem and improves retrieval accuracy in navigation, demonstrated by simulations on real data and online experiments on the largest video website in China.
    Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework. (arXiv:2109.05897v2 [cs.CL] UPDATED)
    (2 min) Answering questions asked from instructional corpora such as E-manuals, recipe books, etc., has been far less studied than open-domain factoid context-based question answering. This can be primarily attributed to the absence of standard benchmark datasets. In this paper we meticulously create a large amount of data connected with E-manuals and develop a suitable algorithm to exploit it. We collect E-Manual Corpus, a huge corpus of 307,957 E-manuals, and pretrain RoBERTa on this large corpus. We create various benchmark QA datasets which include question-answer pairs curated by experts based upon two E-manuals, real user questions from a Community Question Answering Forum pertaining to E-manuals, etc. We introduce EMQAP (E-Manual Question Answering Pipeline) that answers questions pertaining to electronic devices. Built upon the pretrained RoBERTa, it harbors a supervised multi-task learning framework which efficiently performs the dual tasks of identifying the section in the E-manual where the answer can be found and the exact answer span within that section. For E-Manual annotated question-answer pairs, we show an improvement of about 40% in ROUGE-L F1 scores over the most competitive baseline. We perform a detailed ablation study and establish the versatility of EMQAP across different circumstances. The code and datasets are shared at https://github.com/abhi1nandy2/EMNLP-2021-Findings, and the corresponding project website is https://sites.google.com/view/emanualqa/home.
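    The multi-task idea of a shared encoder with one head for section relevance and one for answer spans can be sketched schematically; the snippet below is an assumed, simplified stand-in (model name, heads, and toy inputs are ours), not the EMQAP code:
        import torch
        import torch.nn as nn
        from transformers import AutoModel, AutoTokenizer

        class DualTaskQA(nn.Module):
            """Shared encoder with two heads: section relevance and answer span.

            A schematic stand-in for the multi-task idea described above, not EMQAP itself.
            """
            def __init__(self, model_name="roberta-base"):
                super().__init__()
                self.encoder = AutoModel.from_pretrained(model_name)
                hidden = self.encoder.config.hidden_size
                self.section_head = nn.Linear(hidden, 2)   # is this section relevant to the question?
                self.span_head = nn.Linear(hidden, 2)      # start / end logits per token

            def forward(self, input_ids, attention_mask):
                out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
                tokens = out.last_hidden_state                    # (B, T, H)
                section_logits = self.section_head(tokens[:, 0])  # pooled first token
                start_logits, end_logits = self.span_head(tokens).split(1, dim=-1)
                return section_logits, start_logits.squeeze(-1), end_logits.squeeze(-1)

        tok = AutoTokenizer.from_pretrained("roberta-base")
        model = DualTaskQA()
        enc = tok("how do I reset the device?", "Hold the power button for ten seconds.",
                  return_tensors="pt")
        print([t.shape for t in model(enc["input_ids"], enc["attention_mask"])])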
    Robust Retrieval Augmented Generation for Zero-shot Slot Filling. (arXiv:2108.13934v2 [cs.CL] UPDATED)
    (2 min) Automatically inducing high quality knowledge graphs from a given collection of documents still remains a challenging problem in AI. One way to make headway on this problem is through advancements in a related task known as slot filling. In this task, given an entity query in the form of [Entity, Slot, ?], a system is asked to fill the slot by generating or extracting the missing value, exploiting evidence extracted from relevant passage(s) in the given document collection. Recent works in the field try to solve this task in an end-to-end fashion using retrieval-based language models. In this paper, we present a novel approach to zero-shot slot filling that extends dense passage retrieval with hard negatives and robust training procedures for retrieval augmented generation models. Our model reports large improvements on both the T-REx and zsRE slot filling datasets, improving both passage retrieval and slot value generation, and ranking at the top-1 position on the KILT leaderboard. Moreover, we demonstrate the robustness of our system by showing its domain adaptation capability on a new variant of the TACRED dataset for slot filling, through a combination of zero/few-shot learning. We release the source code and pre-trained models.
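    A common recipe for training dense retrievers with hard negatives is a contrastive loss over in-batch negatives plus one mined hard negative per query; the sketch below shows that generic recipe (our simplification, not the paper's training code):
        import torch
        import torch.nn.functional as F

        def dense_retrieval_loss(q_emb, pos_emb, hard_neg_emb):
            """Contrastive loss with in-batch negatives plus one mined hard negative per query.

            q_emb:        (B, d) query embeddings
            pos_emb:      (B, d) embeddings of gold passages
            hard_neg_emb: (B, d) embeddings of mined hard-negative passages
            """
            # scores against all gold passages in the batch (in-batch negatives) ...
            scores = q_emb @ pos_emb.T                                   # (B, B)
            # ... plus against each query's own hard negative
            hard_scores = (q_emb * hard_neg_emb).sum(-1, keepdim=True)   # (B, 1)
            logits = torch.cat([scores, hard_scores], dim=1)             # (B, B + 1)
            targets = torch.arange(q_emb.size(0))   # diagonal entries are the positives
            return F.cross_entropy(logits, targets)

        # toy usage with random embeddings
        B, d = 4, 8
        loss = dense_retrieval_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
        print(float(loss))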
    The Impact of User Demographics and Task Types on Cross-App Mobile Search. (arXiv:2109.06573v1 [cs.HC])
    (2 min) Recent developments in the mobile app industry have resulted in various types of mobile apps, each targeting a different need and a specific audience. Consequently, users access distinct apps to complete their information need tasks. This leads to the use of various apps not only separately, but also collaboratively in the same session to achieve a single goal. Recent work has argued the need for a unified mobile search system that would act as metasearch on users' mobile devices. The system would identify the target apps for the user's query, submit the query to the apps, and present the results to the user in a unified way. In this work, we aim to deepen our understanding of user behavior while accessing information on their mobile phones by conducting an extensive analysis of various aspects related to the search process. In particular, we study the effect of task type and user demographics on their behavior in interacting with mobile apps. Our findings reveal trends and patterns that can inform the design of a more effective mobile information access environment.
    Position Paper on Simulating Privacy Dynamics in Recommender Systems. (arXiv:2109.06473v1 [cs.IR])
    (2 min) In this position paper, we discuss the merits of simulating privacy dynamics in recommender systems. We study this issue at hand from two perspectives: Firstly, we present a conceptual approach to integrate privacy into recommender system simulations, whose key elements are privacy agents. These agents can enhance users' profiles with different privacy preferences, e.g., their inclination to disclose data to the recommender system. Plus, they can protect users' privacy by guarding all actions that could be a threat to privacy. For example, agents can prohibit a user's privacy-threatening actions or apply privacy-enhancing techniques, e.g., Differential Privacy, to make actions less threatening. Secondly, we identify three critical topics for future research in privacy-aware recommender system simulations: (i) How could we model users' privacy preferences and protect users from performing any privacy-threatening actions? (ii) To what extent do privacy agents modify the users' document preferences? (iii) How do privacy preferences and privacy protections impact recommendations and privacy of others? Our conceptual privacy-aware simulation approach makes it possible to investigate the impact of privacy preferences and privacy protection on the micro-level, i.e., a single user, but also on the macro-level, i.e., all recommender system users. With this work, we hope to present perspectives on how privacy-aware simulations could be realized, such that they enable researchers to study the dynamics of privacy within a recommender system.
    L2R2: Leveraging Ranking for Abductive Reasoning. (arXiv:2005.11223v2 [cs.IR] UPDATED)
    (2 min) The abductive natural language inference task ($\alpha$NLI) is proposed to evaluate the abductive reasoning ability of a learning system. In the $\alpha$NLI task, two observations are given, and the task is to pick the most plausible hypothesis out of the candidates. Existing methods simply formulate it as a classification problem and thus use a cross-entropy log-loss objective during training. However, discriminating true from false does not measure the plausibility of a hypothesis: all the hypotheses have a chance of occurring, only with different probabilities. To fill this gap, we switch to a ranking perspective that sorts the hypotheses in order of their plausibility. With this new perspective, a novel $L2R^2$ approach is proposed under the learning-to-rank framework. First, training samples are reorganized into a ranking form, where the two observations and their hypotheses are treated as the query and a set of candidate documents, respectively. Then, an ESIM model or a pre-trained language model, e.g. BERT or RoBERTa, is used as the scoring function. Finally, the loss function for the ranking task can be either pair-wise or list-wise for training. The experimental results on the ART dataset reach the state of the art on the public leaderboard.
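    The pairwise variant of such a ranking objective fits in a few lines; a hypothetical sketch with a margin-based pairwise loss over hypothesis scores (the scoring model itself is omitted):
        import torch
        import torch.nn.functional as F

        def pairwise_ranking_loss(scores, plausibility, margin=1.0):
            """Pairwise margin loss: every more-plausible hypothesis should outscore
            every less-plausible one for the same pair of observations.

            scores:       (H,) model scores for H candidate hypotheses
            plausibility: (H,) gold plausibility ranks (higher = more plausible)
            """
            diff_score = scores.unsqueeze(1) - scores.unsqueeze(0)        # s_i - s_j
            should_win = (plausibility.unsqueeze(1) > plausibility.unsqueeze(0)).float()
            # hinge on pairs where hypothesis i is ranked above hypothesis j
            loss = should_win * F.relu(margin - diff_score)
            return loss.sum() / should_win.sum().clamp(min=1.0)

        scores = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)
        gold   = torch.tensor([1.0, 2.0, 0.0])   # hypothesis 1 is the most plausible
        print(pairwise_ranking_loss(scores, gold).item())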
    conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers. (arXiv:2109.06501v1 [cs.CL])
    (2 min) In this paper we focus on constructing useful embeddings of textual information in vacancies and resumes, which we aim to incorporate as features into job to job seeker matching models alongside other features. We explain our task, in which noisy data from parsed resumes, the heterogeneous nature of the different data sources, and cross-linguality and multilinguality present domain-specific challenges. We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT) model, which we call conSultantBERT, using a large-scale, real-world, and high-quality dataset of over 270,000 resume-vacancy pairs labeled by our staffing consultants. We show how our fine-tuned model significantly outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted feature vectors and BERT embeddings. In addition, we find our model successfully matches cross-lingual and multilingual textual content.
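    A siamese SBERT fine-tuning loop of this general shape can be written with the sentence-transformers library; the snippet below is a hedged sketch in which the base model name and the toy resume-vacancy pairs are placeholders, not the authors' setup:
        import numpy as np
        from torch.utils.data import DataLoader
        from sentence_transformers import SentenceTransformer, InputExample, losses

        # Placeholder data: (resume text, vacancy text, consultant label in [0, 1]).
        pairs = [
            ("python developer, 5 years backend experience", "backend engineer vacancy", 1.0),
            ("retail store assistant", "senior data scientist vacancy", 0.0),
        ]

        model = SentenceTransformer("bert-base-multilingual-cased")  # any multilingual encoder
        examples = [InputExample(texts=[r, v], label=s) for r, v, s in pairs]
        loader = DataLoader(examples, shuffle=True, batch_size=2)
        loss = losses.CosineSimilarityLoss(model)  # siamese setup: shared encoder, cosine objective

        model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)

        # At inference time, rank vacancies for a resume by cosine similarity of embeddings.
        emb = model.encode(["python developer, 5 years backend experience",
                            "backend engineer vacancy"])
        print(float(emb[0] @ emb[1] / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))))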
    A Semantic Indexing Structure for Image Retrieval. (arXiv:2109.06583v1 [cs.CV])
    (2 min) In large-scale image retrieval, many indexing methods have been proposed to narrow down the searching scope of retrieval. The features extracted from images usually are of high dimensions or unfixed sizes due to the existence of key points. Most existing index structures suffer from the curse of dimensionality, unfixed feature sizes, and/or the loss of semantic similarity. In this paper a new classification-based indexing structure, called Semantic Indexing Structure (SIS), is proposed, in which we utilize semantic categories rather than clustering centers to create database partitions, such that the proposed index SIS can be combined with feature extractors without any restriction on dimensions. Besides, it is observed that the size of each semantic partition is positively correlated with the semantic distribution of the database. Along this way, we found that when the partition number is normalized to five, the proposed algorithm performed very well in all the tests. Compared with state-of-the-art models, SIS achieves outstanding performance.
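    The core idea, partitioning the database by predicted semantic category and searching only within the query's partition, can be sketched as follows (a toy illustration, not the SIS implementation):
        import numpy as np
        from collections import defaultdict

        def build_semantic_index(features, labels):
            """Group database items by their (predicted) semantic category."""
            index = defaultdict(list)
            for i, lab in enumerate(labels):
                index[lab].append(i)
            return {lab: np.array(ids) for lab, ids in index.items()}

        def search(query_feat, query_label, features, index, top_k=5):
            """Search only within the partition matching the query's predicted category."""
            candidates = index.get(query_label, np.arange(len(features)))
            dists = np.linalg.norm(features[candidates] - query_feat, axis=1)
            order = np.argsort(dists)[:top_k]
            return candidates[order]

        # toy usage: 100 items, 4-dim features, 5 semantic categories
        rng = np.random.default_rng(0)
        feats = rng.normal(size=(100, 4))
        labels = rng.integers(0, 5, size=100)
        index = build_semantic_index(feats, labels)
        print(search(feats[0], labels[0], feats, index))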
    UNIQORN: Unified Question Answering over RDF Knowledge Graphs and Natural Language Text. (arXiv:2108.08614v2 [cs.IR] UPDATED)
    (2 min) Question answering over knowledge graphs and other RDF data has been greatly advanced, with a number of good systems providing crisp answers for natural language questions or telegraphic queries. Some of these systems incorporate textual sources as additional evidence for the answering process, but cannot compute answers that are present in text alone. Conversely, systems from the IR and NLP communities have addressed QA over text, but barely utilize semantic data and knowledge. This paper presents the first QA system that can seamlessly operate over RDF datasets and text corpora, or both together, in a unified framework. Our method, called UNIQORN, builds a context graph on the fly, by retrieving question-relevant triples from the RDF data and/or the text corpus, where the latter case is handled by automatic information extraction. The resulting graph is typically rich but highly noisy. UNIQORN copes with this input by advanced graph algorithms for Group Steiner Trees that identify the best answer candidates in the context graph. Experimental results on several benchmarks of complex questions with multiple entities and relations show that UNIQORN, an unsupervised method with only five parameters, produces results comparable to the state-of-the-art on KGs, text corpora, and heterogeneous sources. The graph-based methodology provides user-interpretable evidence for the complete answering process.
    ARGO: Modeling Heterogeneity in E-commerce Recommendation. (arXiv:2109.05789v2 [cs.IR] UPDATED)
    (2 min) Nowadays, E-commerce is increasingly integrated into our daily lives. Meanwhile, the shopping process has also changed incrementally from a single behavior (purchase) to multiple behaviors (such as view, carting and purchase). Therefore, utilizing interaction data from auxiliary behaviors has drawn a lot of attention in E-commerce recommender systems. However, all existing models ignore two kinds of intrinsic heterogeneity which are helpful for capturing the difference in user preferences and the difference in item attributes. First (intra-heterogeneity), each user has multiple distinct social identities, and these different identities can result in quite different interaction preferences. Second (inter-heterogeneity), each item can transfer an item-specific percentage of score from low-level behaviors to high-level behaviors, reflecting the gradual relationship among multiple behaviors. Thus, the lack of consideration of these heterogeneities damages recommendation ranking performance. To model the above heterogeneities, we propose a novel method named the intra- and inter-heterogeneity recommendation model (ARGO). Specifically, we embed each user into multiple vectors representing the user's identities, and the maximum of the identity scores indicates the interaction preference. Besides, we regard the item-specific transition percentage as a trainable transition probability between different behaviors. Extensive experiments on two real-world datasets show that ARGO performs much better than the state-of-the-art in multi-behavior scenarios.
    Automatic selection of clustering algorithms using supervised graph embedding. (arXiv:2011.08225v3 [cs.LG] UPDATED)
    (2 min) The widespread adoption of machine learning (ML) techniques and the extensive expertise required to apply them have led to increased interest in automated ML solutions that reduce the need for human intervention. One of the main challenges in applying ML to previously unseen problems is algorithm selection - the identification of high-performing algorithm(s) for a given dataset, task, and evaluation measure. This study addresses the algorithm selection challenge for data clustering, a fundamental task in data mining that is aimed at grouping similar objects. We present MARCO-GE, a novel meta-learning approach for the automated recommendation of clustering algorithms. MARCO-GE first transforms datasets into graphs and then utilizes a graph convolutional neural network technique to extract their latent representation. Using the embedding representations obtained, MARCO-GE trains a ranking meta-model capable of accurately recommending top-performing algorithms for a new dataset and clustering evaluation measure. Extensive evaluation on 210 datasets, 13 clustering algorithms, and 10 clustering measures demonstrates the effectiveness of our approach and its superiority in terms of predictive and generalization performance over state-of-the-art clustering meta-learning approaches.
    Sequential Modelling with Applications to Music Recommendation, Fact-Checking, and Speed Reading. (arXiv:2109.06736v1 [cs.IR])
    (2 min) Sequential modelling entails making sense of sequential data, which naturally occurs in a wide array of domains. One example is systems that interact with users, log user actions and behaviour, and make recommendations of items of potential interest to users on the basis of their previous interactions. In such cases, the sequential order of user interactions is often indicative of what the user is interested in next. Similarly, for systems that automatically infer the semantics of text, capturing the sequential order of words in a sentence is essential, as even a slight re-ordering could significantly alter its original meaning. This thesis makes methodological contributions and new investigations of sequential modelling for the specific application areas of systems that recommend music tracks to listeners and systems that process text semantics in order to automatically fact-check claims, or "speed read" text for efficient further classification. (Rest of abstract omitted due to arXiv abstract limit)
    Detecting Layout Templates in Complex Multiregion Files. (arXiv:2109.06630v1 [cs.IR])
    (2 min) Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performance in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.
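    As a naive illustration of the multiregion problem, the snippet below splits a sheet into candidate regions at empty rows; it is only a rough stand-in for the first phase, not the Mondrian pipeline:
        import pandas as pd

        def split_regions_by_empty_rows(df: pd.DataFrame):
            """Split a spreadsheet-like DataFrame into candidate regions separated by empty rows.

            A naive stand-in for region detection: real multiregion files may also be
            separated by empty columns or share rows, which this ignores.
            """
            empty = df.isna().all(axis=1)
            regions, start = [], None
            for i, is_empty in enumerate(empty):
                if not is_empty and start is None:
                    start = i
                elif is_empty and start is not None:
                    regions.append(df.iloc[start:i].dropna(axis=1, how="all"))
                    start = None
            if start is not None:
                regions.append(df.iloc[start:].dropna(axis=1, how="all"))
            return regions

        # toy usage: two regions separated by one empty row
        sheet = pd.DataFrame([[1, 2], [3, 4], [None, None], ["a", None], ["b", None]])
        print(len(split_regions_by_empty_rows(sheet)))  # -> 2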
    BERT for Target Apps Selection: Analyzing the Diversity and Performance of BERT in Unified Mobile Search. (arXiv:2109.06306v1 [cs.IR])
    (2 min) A unified mobile search framework aims to identify the mobile apps that can satisfy a user's information need and route the user's query to them. Previous work has shown that resource descriptions for mobile apps are sparse as they rely on the app's previous queries. This problem makes certain apps dominant and leaves resource-scarce apps out of the top ranks. In this case, we need a ranker that goes beyond simple lexical matching. Therefore, our goal is to study the extent to which a BERT-based ranker can improve the quality and diversity of app selection. To this end, we compare the results of the BERT-based ranker with other information retrieval models, focusing on the diversification of the selected apps. Our analysis shows that the BERT-based ranker selects more diverse apps while improving the quality of the baseline results, by selecting relevant apps such as Facebook and Contacts for more personal queries and by decreasing the bias towards dominant resources such as the Google Search app.
    YES SIR! Optimizing Semantic Space of Negatives with Self-Involvement Ranker. (arXiv:2109.06436v1 [cs.IR])
    (2 min) Pre-trained models such as BERT have proved to be effective tools for dealing with Information Retrieval (IR) problems. Due to their inspiring performance, they have been widely used to tackle real-world IR problems such as document ranking. Recently, researchers have found that selecting "hard" rather than "random" negative samples is beneficial for fine-tuning pre-trained models on ranking tasks. However, it remains elusive how to leverage hard negative samples in a principled way. To address this issue, we propose a fine-tuning strategy for document ranking, namely the Self-Involvement Ranker (SIR), which dynamically selects hard negative samples to construct a high-quality semantic space for training a high-quality ranking model. Specifically, SIR consists of sequential compressors implemented with pre-trained models. The front compressor selects hard negative samples for the rear compressor. Moreover, SIR leverages a supervisory signal to adaptively adjust the semantic space of negative samples. Finally, the supervisory signal in the rear compressor is computed based on conditional probability and can thus control sample dynamics and further enhance model performance. SIR is a lightweight and general framework for pre-trained models, which simplifies the ranking process in industry practice. We test our proposed solution on MS MARCO in the document ranking setting, and the results show that SIR can significantly improve the ranking performance of various pre-trained models. Moreover, our method became the new SOTA model anonymously on the MS MARCO Document ranking leaderboard in May 2021.
    Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse. (arXiv:2109.06424v1 [cs.IR])
    (2 min) This paper calls attention to the missing component of the recommender system evaluation process: statistical inference. There is active research on several components of the recommender system evaluation process: selecting baselines, standardizing benchmarks, and target item sampling. However, there has not yet been significant work on the role and use of statistical inference for analyzing recommender system evaluation results. In this paper, we argue that the use of statistical inference is a key component of the evaluation process that has not been given sufficient attention. We support this argument with a systematic review of recent RecSys papers to understand how statistical inference is currently being used, along with a brief survey of studies on the use of statistical inference in the information retrieval community. We present several challenges that exist for inference in recommendation experiments, which buttresses the need for empirical studies to aid in appropriately selecting and applying statistical inference techniques.
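    For concreteness, one standard form of inference argued for here, comparing two recommenders on per-user metrics with a paired test and a bootstrap confidence interval, might look like the following (hypothetical scores, illustrative only):
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        # hypothetical per-user nDCG scores for two recommenders evaluated on the same users
        ndcg_a = rng.beta(5, 3, size=500)
        ndcg_b = np.clip(ndcg_a + rng.normal(0.01, 0.05, size=500), 0, 1)

        # paired t-test on the per-user differences
        t, p = stats.ttest_rel(ndcg_b, ndcg_a)
        print(f"mean diff = {np.mean(ndcg_b - ndcg_a):.4f}, t = {t:.2f}, p = {p:.4f}")

        # bootstrap 95% confidence interval for the mean difference
        diffs = ndcg_b - ndcg_a
        boots = [np.mean(rng.choice(diffs, size=len(diffs), replace=True)) for _ in range(2000)]
        print("95% CI:", np.percentile(boots, [2.5, 97.5]))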
    MMCoVaR: Multimodal COVID-19 Vaccine Focused Data Repository for Fake News Detection and a Baseline Architecture for Classification. (arXiv:2109.06416v1 [cs.IR])
    (3 min) The outbreak of COVID-19 has resulted in an "infodemic" that has encouraged the propagation of misinformation about COVID-19 and cure methods which, in turn, could negatively affect the adoption of recommended public health measures in the larger population. In this paper, we provide a new multimodal (consisting of images, text and temporal information) labeled dataset containing news articles and tweets on the COVID-19 vaccine. We collected 2,593 news articles from 80 publishers for one year between Feb 16th 2020 and May 8th 2021, and 24,184 Twitter posts (collected between April 17th 2021 and May 8th 2021). We combine ratings from three news media ranking sites: the Media Bias Chart, News Guard and Media Bias/Fact Check (MBFC) to classify the news dataset into two levels of credibility: reliable and unreliable. The combination of three filters allows for higher precision of labeling. We also propose a stance detection mechanism to annotate tweets into three levels of credibility: reliable, unreliable and inconclusive. We provide several statistics as well as other analytics, such as publisher distribution, publication date distribution, topic analysis, etc. We also provide a novel architecture that classifies the news data into misinformation or truth to provide a baseline performance for this dataset. We find that the proposed architecture has an F-Score of 0.919 and accuracy of 0.882 for fake news detection. Furthermore, we provide benchmark performance for misinformation detection on the tweet dataset. This new multimodal dataset can be used in research on the COVID-19 vaccine, including misinformation detection, the influence of fake COVID-19 vaccine information, etc.
  • cs.LG updates on arXiv.org

    Entropic Inequality Constraints from $e$-separation Relations in Directed Acyclic Graphs with Hidden Variables. (arXiv:2107.07087v2 [stat.ML] UPDATED)
    (2 min) Directed acyclic graphs (DAGs) with hidden variables are often used to characterize causal relations between variables in a system. When some variables are unobserved, DAGs imply a notoriously complicated set of constraints on the distribution of observed variables. In this work, we present entropic inequality constraints that are implied by $e$-separation relations in hidden variable DAGs with discrete observed variables. The constraints can intuitively be understood to follow from the fact that the capacity of variables along a causal pathway to convey information is restricted by their entropy; e.g., in the extreme case, a variable with entropy $0$ can convey no information. We show how these constraints can be used to learn about the true causal model from an observed data distribution. In addition, we propose a measure of causal influence called the minimal mediary entropy, and demonstrate that it can augment traditional measures such as the average causal effect.
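    The entropy bound can be checked directly on a toy chain X -> M -> Y: the information Y carries about X can never exceed the entropy of the mediator M. A small numerical illustration (ours, not from the paper):
        import numpy as np

        def entropy(p):
            p = p[p > 0]
            return -(p * np.log2(p)).sum()

        # Toy chain X -> M -> Y with X uniform over {0,1,2,3}, M = X mod 2, Y = M.
        xs = np.arange(4)
        p_x = np.full(4, 0.25)
        m_of_x = xs % 2                       # the mediator has only 1 bit of capacity
        joint_xy = np.zeros((4, 2))
        for x, px in zip(xs, p_x):
            joint_xy[x, m_of_x[x]] = px       # Y = M deterministically

        p_y = joint_xy.sum(axis=0)
        h_m = entropy(np.array([0.5, 0.5]))   # H(M) = 1 bit
        mi_xy = entropy(p_x) + entropy(p_y) - entropy(joint_xy.ravel())
        print(f"I(X;Y) = {mi_xy:.3f} bits <= H(M) = {h_m:.3f} bits")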
    Bilateral Denoising Diffusion Models. (arXiv:2108.11514v3 [cs.LG] UPDATED)
    (2 min) Denoising diffusion probabilistic models (DDPMs) have emerged as competitive generative models, yet they pose challenges for efficient sampling. In this paper, we propose novel bilateral denoising diffusion models (BDDMs), which take significantly fewer steps to generate high-quality samples. From a bilateral modeling objective, BDDMs parameterize the forward and reverse processes with a score network and a scheduling network, respectively. We show that a new lower bound, tighter than the standard evidence lower bound, can be derived as a surrogate objective for training the two networks. In particular, BDDMs are efficient, simple to train, and capable of further improving any pre-trained DDPM by optimizing the inference noise schedules. Our experiments demonstrate that BDDMs can generate high-fidelity samples with as few as 3 sampling steps, and with only 16 sampling steps can produce samples of comparable or even higher quality than DDPMs using 1000 steps (a 62x speedup).
    What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey. (arXiv:2109.06808v1 [cs.CR])
    (2 min) Cybersecurity researchers have contributed to the automated extraction of CTI from textual sources, such as threat reports and online articles, where cyberattack strategies, procedures, and tools are described. The goal of this article is to help cybersecurity researchers understand the current techniques used for cyberthreat intelligence extraction from text through a survey of relevant studies in the literature. We systematically collect "CTI extraction from text"-related studies from the literature and categorize the CTI extraction purposes. We propose a CTI extraction pipeline abstracted from these studies. We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline. Our work finds ten types of extraction purposes, such as extraction of indicators of compromise, TTPs (tactics, techniques, and procedures of attack), and cybersecurity keywords. We also identify seven types of textual sources for CTI extraction; textual data obtained from hacker forums, threat reports, social media posts, and online news articles have been used by almost 90% of the studies. Natural language processing along with both supervised and unsupervised machine learning techniques, such as named entity recognition, topic modelling, dependency parsing, supervised classification, and clustering, are used for CTI extraction. We observe technical challenges associated with these studies related to obtaining clean, labelled data that could assure replication, validation, and further extension of the studies. As the existing studies focus on CTI extraction from text, we advocate building upon the current work to help cybersecurity practitioners with proactive decision making, such as threat prioritization and automated threat modelling that utilizes knowledge from past cybersecurity incidents.
    Support Recovery in Universal One-bit Compressed Sensing. (arXiv:2107.09091v3 [cs.IT] UPDATED)
    (2 min) One-bit compressed sensing (1bCS) is an extreme-quantized signal acquisition method that has been intermittently studied in the past decade. In 1bCS, linear samples of a high dimensional signal are quantized to only one bit per sample (the sign of the measurement). The extreme quantization makes it an interesting case study of the more general single-index or generalized linear models. At the same time it can also be thought of as a `design' version of learning a binary linear classifier or halfspace-learning. Assuming the original signal vector to be sparse, existing results in 1bCS either aim to find the support of the vector, or approximate the signal within an $\epsilon$-ball. The focus of this paper is support recovery, which often also computationally facilitates approximate signal recovery. A \emph{universal} measurement matrix for 1bCS refers to one set of measurements that work \emph{for all} sparse signals. With universality, it is known that $\tilde{\Theta}(k^2)$ 1bCS measurements are necessary and sufficient for support recovery (where $k$ denotes the sparsity). In this work, we show that it is possible to universally recover the support with a small number of false positives with $\tilde{O}(k^{3/2})$ measurements. If the dynamic range of the signal vector is known, then with a different technique, this result can be improved to only $\tilde{O}(k)$ measurements. Other results on universal but approximate support recovery are also provided in this paper. All of our main recovery algorithms are simple and polynomial-time.
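    A minimal simulation of the 1bCS measurement model, with a naive correlation-based support estimate rather than any of the paper's algorithms:
        import numpy as np

        rng = np.random.default_rng(0)
        n, m, k = 200, 120, 5                 # signal dimension, measurements, sparsity

        x = np.zeros(n)
        support = rng.choice(n, size=k, replace=False)
        x[support] = rng.normal(size=k)

        A = rng.normal(size=(m, n))           # Gaussian measurement matrix
        y = np.sign(A @ x)                    # 1-bit measurements: only the sign is kept

        # naive support estimate: coordinates most correlated with the sign pattern
        est = np.argsort(-np.abs(A.T @ y))[:k]
        print("true support     :", np.sort(support))
        print("estimated support:", np.sort(est))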
    On the use of test smells for prediction of flaky tests. (arXiv:2108.11781v2 [cs.SE] UPDATED)
    (2 min) Regression testing is an important phase for delivering software with quality. However, flaky tests hamper the evaluation of test results and can increase costs. This is because a flaky test may pass or fail non-deterministically, and properly identifying the flakiness of a test requires rerunning the test suite multiple times. To cope with this challenge, approaches have been proposed based on prediction models and machine learning. Existing approaches based on the test case vocabulary may be context-sensitive and prone to overfitting, presenting low performance when executed in a cross-project scenario. To overcome these limitations, we investigate the use of test smells as predictors of flaky tests. We conducted an empirical study to understand whether test smells perform well as a classifier to predict flakiness in the cross-project context, and analyzed the information gain of each test smell. We also compared the test smell-based approach with the vocabulary-based one. As a result, we obtained a classifier with reasonable performance (Random Forest, 0.83) for predicting flakiness in the testing phase. This classifier presented better performance than the vocabulary-based model for cross-project prediction. The Assertion Roulette and Sleepy Test test smell types are the ones associated with the best information gain values.
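    In scikit-learn terms, a smell-based flakiness classifier can be sketched in a few lines; the feature names and toy table below are placeholders, not the study's data:
        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # hypothetical table: one row per test case, columns = test-smell counts + flaky label
        data = pd.DataFrame({
            "assertion_roulette": [3, 0, 5, 1, 0, 4],
            "sleepy_test":        [1, 0, 2, 0, 0, 1],
            "eager_test":         [0, 1, 1, 0, 2, 0],
            "is_flaky":           [1, 0, 1, 0, 0, 1],
        })

        X = data.drop(columns=["is_flaky"])
        y = data["is_flaky"]

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        # a real cross-project setup would split folds by project rather than at random
        scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
        print("F1 per fold:", scores)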
    Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles. (arXiv:2109.06126v1 [cs.SE] CROSS LISTED)
    (2 min) Self-driving cars and trucks, autonomous vehicles (AVs), should not be accepted by regulatory bodies and the public until there is much higher confidence in their safety and reliability -- which can most practically and convincingly be achieved by testing. But existing testing methods are inadequate for checking the end-to-end behaviors of AV controllers against complex, real-world corner cases involving interactions with multiple independent agents such as pedestrians and human-driven vehicles. While test-driving AVs on streets and highways fails to capture many rare events, existing simulation-based testing methods mainly focus on simple scenarios and do not scale well for complex driving situations that require sophisticated awareness of the surroundings. To address these limitations, we propose a new fuzz testing technique, called AutoFuzz, which can leverage widely-used AV simulators' API grammars to generate semantically and temporally valid complex driving scenarios (sequences of scenes). AutoFuzz is guided by a constrained Neural Network (NN) evolutionary search over the API grammar to generate scenarios seeking to find unique traffic violations. Evaluation of our prototype on one state-of-the-art learning-based controller and two rule-based controllers shows that AutoFuzz efficiently finds hundreds of realistic traffic violations resembling real-world crashes. Further, fine-tuning the learning-based controller with the traffic violations found by AutoFuzz successfully reduced the traffic violations found in the new version of the AV controller software.
    Automatic hippocampal surface generation via 3D U-net and active shape modeling with hybrid particle swarm optimization. (arXiv:2109.06817v1 [cs.NE])
    (2 min) In this paper, we proposed and validated a fully automatic pipeline for hippocampal surface generation via 3D U-net coupled with active shape modeling (ASM). Principally, the proposed pipeline consisted of three steps. In the beginning, for each magnetic resonance image, a 3D U-net was employed to obtain the automatic hippocampus segmentation at each hemisphere. Secondly, ASM was performed on a group of pre-obtained template surfaces to generate mean shape and shape variation parameters through principal component analysis. Ultimately, hybrid particle swarm optimization was utilized to search for the optimal shape variation parameters that best match the segmentation. The hippocampal surface was then generated from the mean shape and the shape variation parameters. The proposed pipeline was observed to provide hippocampal surfaces at both hemispheres with high accuracy, correct anatomical topology, and sufficient smoothness.
    IGNNITION: Bridging the Gap Between Graph Neural Networks and Networking Systems. (arXiv:2109.06715v1 [cs.LG])
    (2 min) Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems). In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing). The main novelty of GNNs is their ability to generalize to other networks unseen during training, which is an essential feature for developing practical Machine Learning (ML) solutions for networking. However, implementing a functional GNN prototype is currently a cumbersome task that requires strong skills in neural network programming. This poses an important barrier to network engineers that often do not have the necessary ML expertise. In this article, we present IGNNITION, a novel open-source framework that enables fast prototyping of GNNs for networking systems. IGNNITION is based on an intuitive high-level abstraction that hides the complexity behind GNNs, while still offering great flexibility to build custom GNN architectures. To showcase the versatility and performance of this framework, we implement two state-of-the-art GNN models applied to different networking use cases. Our results show that the GNN models produced by IGNNITION are equivalent in terms of accuracy and performance to their native implementations in TensorFlow.
    A Unified Objective for Novel Class Discovery. (arXiv:2108.08536v3 [cs.CV] UPDATED)
    (2 min) In this paper, we study the problem of Novel Class Discovery (NCD). NCD aims at inferring novel object categories in an unlabeled set by leveraging prior knowledge of a labeled set containing different, but related, classes. Existing approaches tackle this problem by considering multiple objective functions, usually involving specialized loss terms for the labeled and the unlabeled samples respectively, and often requiring auxiliary regularization terms. In this paper, we depart from this traditional scheme and introduce a UNified Objective function (UNO) for discovering novel classes, with the explicit purpose of favoring synergy between supervised and unsupervised learning. Using a multi-view self-labeling strategy, we generate pseudo-labels that can be treated homogeneously with ground truth labels. This leads to a single classification objective operating on both known and unknown classes. Despite its simplicity, UNO outperforms the state of the art by a significant margin on several benchmarks (~+10% on CIFAR-100 and +8% on ImageNet). The project page is available at: https://ncd-uno.github.io.
    PAC Learnability of Approximate Nash Equilibrium in Bimatrix Games. (arXiv:2108.07472v2 [cs.GT] UPDATED)
    (2 min) Computing Nash equilibrium in bimatrix games is PPAD-hard, and many works have focused on approximate solutions. When games are generated from a fixed unknown distribution, learning a Nash predictor via data-driven approaches can be preferable. In this paper, we study the learnability of approximate Nash equilibrium in bimatrix games. We prove that the Lipschitz function class is agnostically Probably Approximately Correct (PAC) learnable with respect to the Nash approximation loss. Additionally, to demonstrate the advantages of learning a Nash predictor, we develop a model that can efficiently approximate solutions for games under the same distribution. We show by experiments that the solutions from our Nash predictor can serve as effective initializing points for other Nash solvers.
    Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework. (arXiv:2109.05897v2 [cs.CL] UPDATED)
    (2 min) Answering questions asked from instructional corpora such as E-manuals, recipe books, etc., has been far less studied than open-domain factoid context-based question answering. This can be primarily attributed to the absence of standard benchmark datasets. In this paper we meticulously create a large amount of data connected with E-manuals and develop a suitable algorithm to exploit it. We collect E-Manual Corpus, a huge corpus of 307,957 E-manuals, and pretrain RoBERTa on this large corpus. We create various benchmark QA datasets which include question-answer pairs curated by experts based upon two E-manuals, real user questions from a Community Question Answering Forum pertaining to E-manuals, etc. We introduce EMQAP (E-Manual Question Answering Pipeline) that answers questions pertaining to electronic devices. Built upon the pretrained RoBERTa, it harbors a supervised multi-task learning framework which efficiently performs the dual tasks of identifying the section in the E-manual where the answer can be found and the exact answer span within that section. For E-Manual annotated question-answer pairs, we show an improvement of about 40% in ROUGE-L F1 scores over the most competitive baseline. We perform a detailed ablation study and establish the versatility of EMQAP across different circumstances. The code and datasets are shared at https://github.com/abhi1nandy2/EMNLP-2021-Findings, and the corresponding project website is https://sites.google.com/view/emanualqa/home.
    Distributed Proximal Splitting Algorithms with Rates and Acceleration. (arXiv:2010.00952v2 [math.OC] UPDATED)
    (2 min) We analyze several generic proximal splitting algorithms well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new rates on the function value suboptimality or distance to the solution, as well as new accelerated versions, using varying stepsizes. In addition, we propose distributed variants of these algorithms, which can be accelerated as well. While most existing results are ergodic, our nonergodic results significantly broaden our understanding of primal-dual optimization algorithms.
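    For reference, the building block these splitting methods compose is the proximal operator; the textbook definition and the basic forward-backward step for minimizing a smooth $f$ plus a nonsmooth $g$ are (standard formulas, not results specific to this paper):
        \mathrm{prox}_{\gamma g}(x) = \arg\min_{y} \; g(y) + \tfrac{1}{2\gamma}\|y - x\|^2,
        \qquad
        x^{k+1} = \mathrm{prox}_{\gamma g}\bigl(x^k - \gamma \nabla f(x^k)\bigr).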
    Meta-repository of screening mammography classifiers. (arXiv:2108.04800v2 [cs.LG] UPDATED)
    (2 min) Artificial intelligence (AI) is showing promise in improving clinical diagnosis. In breast cancer screening, several recent studies show that AI has the potential to improve radiologists' accuracy, subsequently helping in early cancer diagnosis and reducing unnecessary workup. As the number of proposed models and their complexity grows, it is becoming increasingly difficult to re-implement them in order to reproduce the results and to compare different approaches. To enable reproducibility of research in this application area and to enable comparison between different methods, we release a meta-repository containing deep learning models for classification of screening mammograms. This meta-repository creates a framework that enables the evaluation of machine learning models on any private or public screening mammography data set. At its inception, our meta-repository contains five state-of-the-art models with open-source implementations and cross-platform compatibility. We compare their performance on six international data sets: two New York University breast cancer screening data sets, DDSM, INbreast, OPTIMAM and Chinese Mammography Database. Our framework has a flexible design that can be generalized to other medical image analysis tasks. The meta-repository is available at https://www.github.com/nyukat/mammography_metarepository.
    Shared Certificates for Neural Network Verification. (arXiv:2109.00542v2 [cs.LG] UPDATED)
    (2 min) Existing neural network verifiers compute a proof that each input is handled correctly under a given perturbation by propagating a convex set of reachable values at each layer. This process is repeated independently for each input (e.g., image) and perturbation (e.g., rotation), leading to an expensive overall proof effort when handling an entire dataset. In this work we introduce a new method for reducing this verification cost based on the key insight that convex sets obtained at intermediate layers can overlap across different inputs and perturbations. Leveraging this insight, we introduce the general concept of shared certificates, enabling proof effort reuse across multiple inputs and driving down overall verification costs. We validate our insight via an extensive experimental evaluation and demonstrate the effectiveness of shared certificates on a range of datasets and attack specifications including geometric, patch and $\ell_\infty$ input perturbations.
    A Machine Learning Modelling Benchmark for Temperature Field Reconstruction of Heat-Source Systems. (arXiv:2108.08298v4 [cs.LG] UPDATED)
    (2 min) Temperature field reconstruction of heat-source systems (TFR-HSS) with limited monitoring sensors, as occurs in thermal management, plays an important role in real-time health detection systems for electronic equipment in engineering. However, prior methods based on common interpolations usually cannot provide the required reconstruction accuracy. In addition, there exists no public dataset for broad research on reconstruction methods to further boost reconstruction performance and engineering applications. To overcome this problem, this work develops a machine learning modelling benchmark for the TFR-HSS task. First, the TFR-HSS task is mathematically modelled from the real-world engineering problem, and four types of numerical modelling have been constructed to transform the problem into discrete mapping forms. Then, this work proposes a set of machine learning modelling methods, including general machine learning methods and deep learning methods, to advance the state of the art in temperature field reconstruction. More importantly, this work develops a novel benchmark dataset, namely the Temperature Field Reconstruction Dataset (TFRD), to evaluate these machine learning modelling methods for the TFR-HSS task. Finally, a performance analysis of typical methods is given on TFRD, which can serve as baseline results on this benchmark.
    Sparse PointPillars: Maintaining and Exploiting Input Sparsity to Improve Runtime on Embedded Systems. (arXiv:2106.06882v2 [cs.CV] UPDATED)
    (2 min) Bird's Eye View (BEV) is a popular representation for processing 3D point clouds, and by its nature is fundamentally sparse. Motivated by the computational limitations of mobile robot platforms, we take a fast, high-performance BEV 3D object detector - PointPillars - and modify its backbone to maintain and exploit this input sparsity, leading to decreased runtimes. We present results on KITTI, a canonical 3D detection dataset, and Matterport-Chair, a novel Matterport3D-derived chair detection dataset from scenes in real furnished homes. We evaluate runtime characteristics using a desktop GPU, an embedded ML accelerator, and a robot CPU, demonstrating that our method results in significant runtime decreases (2x or more) for embedded systems with only a modest decrease in detection quality. Our work represents a new approach for practitioners to optimize models for embedded systems by maintaining and exploiting input sparsity throughout their entire pipeline to reduce runtime and resource usage while preserving detection performance. All models, weights, experimental configurations, and datasets used are publicly available.
    Simulations in Recommender Systems: An industry perspective. (arXiv:2109.06723v1 [cs.LG])
    (2 min) The construction of effective Recommender Systems (RS) is a complex process, mainly due to the nature of RSs, which involve large-scale software systems and human interactions. Iterative development processes require a deep understanding of the current baseline as well as the ability to estimate the impact of changes in multiple variables of interest. Simulations are well suited to address both challenges, potentially leading to a high-velocity construction process, a fundamental requirement in commercial contexts. Recently, there has been significant interest in RS Simulation Platforms, which allow RS developers to easily craft simulated environments where their systems can be analysed. In this work we discuss how simulations help to increase velocity, we look at the literature around RS Simulation Platforms, analyse strengths and gaps, and distill a set of guiding principles for the design of RS Simulation Platforms that we believe will maximize the velocity of iterative RS construction processes.
    A Pilot Study on Visually Stimulated Cognitive Tasks for EEG-Based Dementia Recognition. (arXiv:2103.03854v3 [cs.LG] UPDATED)
    (2 min) Dementia is yet to be cured. Precise diagnosis prior to the onset of the symptoms can prevent the rapid progression of the emerging cognitive impairment. Recent progress has shown that Electroencephalography (EEG) is a promising and cost-effective test to facilitate the detection of neurocognitive disorders. However, most existing works have used only resting-state EEG. The efficacy of EEG signals from various cognitive tasks for dementia classification has yet to be thoroughly investigated. In this study, we designed four cognitive tasks that engage different cognitive performances: attention, working memory, and executive function. We investigated these tasks using statistical analysis in both the time and frequency domains of EEG signals from three classes of human subjects: Dementia (DEM), Mild Cognitive Impairment (MCI), and Normal Control (NC). We further evaluated the classification performance of two feature extraction methods: Principal Component Analysis (PCA) and Filter Bank Common Spatial Pattern (FBCSP). We found that the working-memory-related tasks yielded good performance for dementia recognition using both PCA and FBCSP. Moreover, FBCSP with features combined from the four tasks achieved the best sensitivity of 0.87 and specificity of 0.80. To the best of our knowledge, this is the first work that concurrently investigates several cognitive tasks for dementia recognition using both statistical analysis and classification scores. Our results yield essential information to design and aid in conducting further experimental tasks for the early diagnosis of dementia patients.
    Surface Warping Incorporating Machine Learning Assisted Domain Likelihood Estimation: A New Paradigm in Mine Geology Modelling and Automation. (arXiv:2103.03923v3 [physics.geo-ph] UPDATED)
    (3 min) This paper illustrates an application of machine learning (ML) within a complex system that performs grade estimation. In surface mining, assay measurements taken from production drilling often provide useful information that allows initially inaccurate surfaces created using sparse exploration data to be revised and subsequently improved. Recently, a Bayesian warping technique has been proposed to reshape modeled surfaces using geochemical and spatial constraints imposed by newly acquired blasthole data. This paper focuses on incorporating machine learning into this warping framework to make the likelihood computation generalizable. The technique works by adjusting the position of vertices on the surface to maximize the integrity of modeled geological boundaries with respect to sparse geochemical observations. Its foundation is laid by a Bayesian derivation in which the geological domain likelihood given the chemistry, p(g|c), plays a similar role to p(y(c)|g). This observation allows a manually calibrated process centered around the latter to be automated since ML techniques may be used to estimate the former in a data-driven way. Machine learning performance is evaluated for gradient boosting, neural network, random forest and other classifiers in a binary and multi-class context using precision and recall rates. Once ML likelihood estimators are integrated in the surface warping framework, surface shaping performance is evaluated using unseen data by examining the categorical distribution of test samples located above and below the warped surface. Large-scale validation experiments are performed to assess the overall efficacy of ML assisted surface warping as a fully integrated component within an ore grade estimation system where the posterior mean is obtained via Gaussian Process inference with a Matern 3/2 kernel.
    Towards Understanding the Generative Capability of Adversarially Robust Classifiers. (arXiv:2108.09093v2 [cs.LG] UPDATED)
    (2 min) Recently, some works have found an interesting phenomenon: adversarially robust classifiers can generate good images comparable to those from generative models. We investigate this phenomenon from an energy perspective and provide a novel explanation. We reformulate adversarial example generation, adversarial training, and image generation in terms of an energy function. We find that adversarial training contributes to obtaining an energy function that is flat and has low energy around the real data, which is the key to generative capability. Based on our new understanding, we further propose a better adversarial training method, Joint Energy Adversarial Training (JEAT), which can generate high-quality images and achieve new state-of-the-art robustness under a wide range of attacks. The Inception Score of the images (CIFAR-10) generated by JEAT is 8.80, much better than that of the original robust classifiers (7.50).
    ROMAX: Certifiably Robust Deep Multiagent Reinforcement Learning via Convex Relaxation. (arXiv:2109.06795v1 [cs.LG])
    (2 min) In a multirobot system, a number of cyber-physical attacks (e.g., communication hijack, observation perturbations) can challenge the robustness of agents. This robustness issue worsens in multiagent reinforcement learning because of the non-stationarity of the environment, caused by simultaneously learning agents whose changing policies affect the transition and reward functions. In this paper, we propose a minimax MARL approach to infer the worst-case policy update of other agents. As the minimax formulation is computationally intractable to solve, we apply the convex relaxation of neural networks to solve the inner minimization problem. Such convex relaxation enables robustness in interacting with peer agents that may have significantly different behaviors and also achieves a certified bound of the original optimization problem. We evaluate our approach on multiple mixed cooperative-competitive tasks and show that our method outperforms the previous state-of-the-art approaches on this topic.
    Excess Capacity and Backdoor Poisoning. (arXiv:2109.00685v2 [cs.LG] UPDATED)
    (2 min) A backdoor data poisoning attack is an adversarial attack wherein the attacker injects several watermarked, mislabeled training examples into a training set. The watermark does not impact the test-time performance of the model on typical data; however, the model reliably errs on watermarked examples. To gain a better foundational understanding of backdoor data poisoning attacks, we present a formal theoretical framework within which one can discuss backdoor data poisoning attacks for classification problems. We then use this to analyze important statistical and computational issues surrounding these attacks. On the statistical front, we identify a parameter we call the memorization capacity that captures the intrinsic vulnerability of a learning problem to a backdoor attack. This allows us to argue about the robustness of several natural learning problems to backdoor attacks. Our results favoring the attacker involve presenting explicit constructions of backdoor attacks, and our robustness results show that some natural problem settings cannot yield successful backdoor attacks. From a computational standpoint, we show that under certain assumptions, adversarial training can detect the presence of backdoors in a training set. We then show that under similar assumptions, two closely related problems we call backdoor filtering and robust generalization are nearly equivalent. This implies that it is both asymptotically necessary and sufficient to design algorithms that can identify watermarked examples in the training set in order to obtain a learning algorithm that both generalizes well to unseen data and is robust to backdoors.
    Traffic Signal Control with Communicative Deep Reinforcement Learning Agents: a Case Study. (arXiv:2107.01347v2 [cs.MA] UPDATED)
    (2 min) In this work we analyze Multi-Agent Advantage Actor-Critic (MA2C), a recently proposed multi-agent reinforcement learning algorithm that can be applied to adaptive traffic signal control (ATSC) problems. To evaluate its potential we compare MA2C with Independent Advantage Actor-Critic (IA2C) and other reinforcement learning or heuristic based algorithms. Specifically, we analyze MA2C theoretically with the framework provided by non-Markov decision processes, which allows deeper insight into the algorithm, and we critically examine the effectiveness and the robustness of the method by testing it in two traffic areas located in Bologna (Italy) simulated in SUMO, a software modeling tool for ATSC problems. Our results indicate that MA2C, trained with pseudo-random vehicle flows, is a promising technique able to outperform the alternative methods.
    From the Greene--Wu Convolution to Gradient Estimation over Riemannian Manifolds. (arXiv:2108.07406v2 [cs.LG] UPDATED)
    (2 min) Over a complete Riemannian manifold of finite dimension, Greene and Wu introduced a convolution, known as Greene-Wu (GW) convolution. In this paper, we study properties of the GW convolution and apply it to non-Euclidean machine learning problems. In particular, we derive a new formula for how the curvature of the space would affect the curvature of the function through the GW convolution. Also, following the study of the GW convolution, a new method for gradient estimation over Riemannian manifolds is introduced.
    Formalizing Distribution Inference Risks. (arXiv:2106.03699v3 [cs.LG] UPDATED)
    (2 min) Property inference attacks reveal statistical properties about a training set but are difficult to distinguish from the primary purposes of statistical machine learning, which is to produce models that capture statistical properties about a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal and generic definition of property inference attacks. The proposed notion describes attacks that can distinguish between possible training distributions, extending beyond previous property inference attacks that infer the ratio of a particular type of data in the training data set. In this paper, we show how our definition captures previous property inference attacks as well as a new attack that reveals the average degree of nodes of a training graph and report on experiments giving insight into the potential risks of property inference attacks.
    Learning to Navigate Intersections with Unsupervised Driver Trait Inference. (arXiv:2109.06783v1 [cs.RO])
    (2 min) Navigation through uncontrolled intersections is one of the key challenges for autonomous vehicles. Identifying the subtle differences in hidden traits of other drivers can bring significant benefits when navigating in such environments. We propose an unsupervised method for inferring driver traits such as driving styles from observed vehicle trajectories. We use a variational autoencoder with recurrent neural networks to learn a latent representation of traits without any ground truth trait labels. Then, we use this trait representation to learn a policy for an autonomous vehicle to navigate through a T-intersection with deep reinforcement learning. Our pipeline enables the autonomous vehicle to adjust its actions when dealing with drivers of different traits to ensure safety and efficiency. Our method demonstrates promising performance and outperforms state-of-the-art baselines in the T-intersection scenario.
    Stochastic geometry to generalize the Mondrian Process. (arXiv:2002.00797v3 [stat.ML] UPDATED)
    (2 min) The stable under iterated tessellation (STIT) process is a stochastic process that produces a recursive partition of space with cut directions drawn independently from a distribution over the sphere. The case of random axis-aligned cuts is known as the Mondrian process. Random forests and Laplace kernel approximations built from the Mondrian process have led to efficient online learning methods and Bayesian optimization. In this work, we utilize tools from stochastic geometry to resolve some fundamental questions concerning STIT processes in machine learning. First, we show that a STIT process with cut directions drawn from a discrete distribution can be efficiently simulated by lifting to a higher dimensional axis-aligned Mondrian process. Second, we characterize all possible kernels that stationary STIT processes and their mixtures can approximate. We also give a uniform convergence rate for the approximation error of the STIT kernels to the targeted kernels, generalizing the work of [3] for the Mondrian case. Third, we obtain consistency results for STIT forests in density estimation and regression. Finally, we give a formula for the density estimator arising from an infinite STIT random forest. This allows for precise comparisons between the Mondrian forest, the Mondrian kernel and the Laplace kernel in density estimation. Our paper calls for further developments at the novel intersection of stochastic geometry and machine learning.
    Biclustering with Alternating K-Means. (arXiv:2009.04550v2 [cs.LG] UPDATED)
    (2 min) Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-diagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.
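    A loose sketch of the alternating idea, using scikit-learn's standard KMeans rather than the authors' adapted variant or their empirical-clustering-risk objective, and assuming a dense matrix with the same number k of row and column clusters:

        import numpy as np
        from sklearn.cluster import KMeans

        def alternating_kmeans(X, k, n_iter=10, seed=0):
            """Alternately cluster rows and columns of X with plain k-means,
            each time summarizing the other dimension by cluster-mean profiles.
            Assumes no cluster becomes empty; illustrative sketch only."""
            row_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
            col_labels = None
            for _ in range(n_iter):
                # Cluster columns, described by their mean value in each row cluster.
                row_profiles = np.vstack([X[row_labels == r].mean(axis=0) for r in range(k)])
                col_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(row_profiles.T)
                # Cluster rows, described by their mean value in each column cluster.
                col_profiles = np.vstack([X[:, col_labels == c].mean(axis=1) for c in range(k)])
                row_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(col_profiles.T)
            return row_labels, col_labels

    Pairing row cluster r with column cluster r then suggests block-diagonal biclusters of the kind the paper targets.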
    Benchmarking the Spectrum of Agent Capabilities. (arXiv:2109.06780v1 [cs.AI])
    (2 min) Evaluating the general abilities of intelligent agents requires complex simulation environments. Existing benchmarks typically evaluate only one narrow task per environment, requiring researchers to perform expensive training runs on many different environments. We introduce Crafter, an open world survival game with visual inputs that evaluates a wide range of general abilities within a single environment. Agents either learn from the provided reward signal or through intrinsic objectives and are evaluated by semantically meaningful achievements that can be unlocked during each episode, such as discovering resources and crafting tools. Consistently unlocking all achievements requires strong generalization, deep exploration, and long-term reasoning. We experimentally verify that Crafter is of appropriate difficulty to drive future research and provide baseline scores for reward agents and unsupervised agents. Furthermore, we observe sophisticated behaviors emerging from maximizing the reward signal, such as building tunnel systems, bridges, houses, and plantations. We hope that Crafter will accelerate research progress by quickly evaluating a wide spectrum of abilities.
    Automatic selection of clustering algorithms using supervised graph embedding. (arXiv:2011.08225v3 [cs.LG] UPDATED)
    (2 min) The widespread adoption of machine learning (ML) techniques and the extensive expertise required to apply them have led to increased interest in automated ML solutions that reduce the need for human intervention. One of the main challenges in applying ML to previously unseen problems is algorithm selection - the identification of high-performing algorithm(s) for a given dataset, task, and evaluation measure. This study addresses the algorithm selection challenge for data clustering, a fundamental task in data mining that is aimed at grouping similar objects. We present MARCO-GE, a novel meta-learning approach for the automated recommendation of clustering algorithms. MARCO-GE first transforms datasets into graphs and then utilizes a graph convolutional neural network technique to extract their latent representation. Using the embedding representations obtained, MARCO-GE trains a ranking meta-model capable of accurately recommending top-performing algorithms for a new dataset and clustering evaluation measure. Extensive evaluation on 210 datasets, 13 clustering algorithms, and 10 clustering measures demonstrates the effectiveness of our approach and its superiority in terms of predictive and generalization performance over state-of-the-art clustering meta-learning approaches.
    Hidden Markov models as recurrent neural networks: an application to Alzheimer's disease. (arXiv:2006.03151v3 [stat.ML] UPDATED)
    (2 min) Hidden Markov models (HMMs) are commonly used for disease progression modeling when the true patient health state is not fully known. Since HMMs typically have multiple local optima, incorporating additional patient covariates can improve parameter estimation and predictive performance. To allow for this, we develop hidden Markov recurrent neural networks (HMRNNs), a special case of recurrent neural networks that combine neural networks' flexibility with HMMs' interpretability. The HMRNN can be reduced to a standard HMM, with an identical likelihood function and parameter interpretations, but can also be combined with other predictive neural networks that take patient information as input. The HMRNN estimates all parameters simultaneously via gradient descent. Using a dataset of Alzheimer's disease patients, we demonstrate how the HMRNN can combine an HMM with other predictive neural networks to improve disease forecasting and to offer a novel clinical interpretation compared with a standard HMM trained via expectation-maximization.
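    The core trick of expressing an HMM likelihood as a recurrent computation, so every parameter can be fit jointly by gradient descent, can be sketched with a log-space forward recursion in PyTorch (a generic discrete-observation sketch, not the authors' HMRNN or its covariate networks):

        import torch

        class HMMForward(torch.nn.Module):
            """Discrete-observation HMM whose log-likelihood is computed by an
            RNN-like forward recursion, so initial, transition, and emission
            parameters can all be trained with gradient descent."""
            def __init__(self, n_states, n_obs):
                super().__init__()
                self.log_pi = torch.nn.Parameter(torch.zeros(n_states))
                self.log_A = torch.nn.Parameter(torch.zeros(n_states, n_states))
                self.log_B = torch.nn.Parameter(torch.zeros(n_states, n_obs))

            def forward(self, obs):                      # obs: LongTensor of length T
                log_A = torch.log_softmax(self.log_A, dim=1)
                log_B = torch.log_softmax(self.log_B, dim=1)
                alpha = torch.log_softmax(self.log_pi, dim=0) + log_B[:, obs[0]]
                for t in range(1, len(obs)):             # recurrent update of log alpha
                    alpha = torch.logsumexp(alpha.unsqueeze(1) + log_A, dim=0) + log_B[:, obs[t]]
                return torch.logsumexp(alpha, dim=0)     # log-likelihood of the sequence

    Training then amounts to minimizing the negative returned log-likelihood with any PyTorch optimizer, which is what allows the HMM parameters to be fit jointly with additional neural components.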
    A geometric perspective on functional outlier detection. (arXiv:2109.06849v1 [stat.ML])
    (2 min) We consider functional outlier detection from a geometric perspective, specifically: for functional data sets drawn from a functional manifold which is defined by the data's modes of variation in amplitude and phase. Based on this manifold, we develop a conceptualization of functional outlier detection that is more widely applicable and realistic than previously proposed. Our theoretical and experimental analyses demonstrate several important advantages of this perspective: It considerably improves theoretical understanding and allows us to describe and analyse complex functional outlier scenarios consistently and in full generality, by differentiating between structurally anomalous outlier data that are off-manifold and distributionally outlying data that are on-manifold but at its margins. This improves the practical feasibility of functional outlier detection: We show that simple manifold learning methods can be used to reliably infer and visualize the geometric structure of functional data sets. We also show that standard outlier detection methods requiring tabular data inputs can be applied to functional data very successfully by simply using their vector-valued representations learned from manifold learning methods as input features. Our experiments on synthetic and real data sets demonstrate that this approach leads to outlier detection performances at least on par with existing functional data-specific methods in a large variety of settings, without the highly specialized, complex methodology and narrow domain of application these methods often entail.
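    The "embed, then reuse standard tabular detectors" recipe can be pictured with a small scikit-learn sketch on synthetic curves (hypothetical data, with Isomap and LOF chosen only as representative manifold-learning and outlier-detection methods):

        import numpy as np
        from sklearn.manifold import Isomap
        from sklearn.neighbors import LocalOutlierFactor

        rng = np.random.default_rng(0)
        t = np.linspace(0, 1, 100)
        curves = np.sin(2 * np.pi * (t + rng.normal(0, 0.05, (200, 1))))  # 200 similar curves
        curves[:5] += 1.5                                                  # 5 structurally shifted curves

        # Learn a low-dimensional embedding of the functional data set ...
        emb = Isomap(n_neighbors=10, n_components=2).fit_transform(curves)
        # ... and apply an off-the-shelf tabular outlier detector to the embedding.
        labels = LocalOutlierFactor(n_neighbors=20).fit_predict(emb)       # -1 marks outliers
        print(np.where(labels == -1)[0])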
    Performance of a Markovian neural network versus dynamic programming on a fishing control problem. (arXiv:2109.06856v1 [math.OC])
    (2 min) Fishing quotas are unpleasant but efficient tools for controlling the productivity of a fishing site. A popular model describes the biomass with a stochastic differential equation, on which a stochastic dynamic programming or Hamilton-Jacobi-Bellman algorithm can be used to find the stochastic control -- the fishing quota. We compare the solutions obtained by dynamic programming against those obtained with a neural network which preserves the Markov property of the solution. The method is extended to a similar multi-species model to check its robustness in high dimension.
    Formalizing and Estimating Distribution Inference Risks. (arXiv:2109.06024v2 [cs.LG] UPDATED)
    (0 min) Property inference attacks reveal statistical properties about a training set but are difficult to distinguish from the intrinsic purpose of statistical machine learning, namely to produce models that capture statistical properties about a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal and general definition of property inference attacks. The proposed notion describes attacks that can distinguish between possible training distributions, extending beyond previous property inference attacks that infer the ratio of a particular type of data in the training data set such as the proportion of females. We show how our definition captures previous property inference attacks as well as a new attack that can reveal the average node degree or clustering coefficient of a training graph. Our definition also enables a theorem that connects the maximum possible accuracy of inference attacks distinguishing between distributions to the effective size of dataset leaked by the model. To quantify and understand property inference risks, we conduct a series of experiments across a range of different distributions using both black-box and white-box attacks. Our results show that inexpensive attacks are often as effective as expensive meta-classifier attacks, and that there are surprising asymmetries in the effectiveness of attacks. We also extend the state-of-the-art property inference attack to work on convolutional neural networks, and propose techniques to help identify parameters in a model that leak the most information, thus significantly lowering resource requirements for meta-classifier attacks.
    A Survey on GANs for Anomaly Detection. (arXiv:1906.11632v2 [cs.LG] UPDATED)
    (2 min) Anomaly detection is a significant problem faced in several research areas. Detecting and correctly classifying something unseen as anomalous is a challenging problem that has been tackled in many different manners over the years. Generative Adversarial Networks (GANs) and the adversarial training process have been recently employed to face this task yielding remarkable results. In this paper we survey the principal GAN-based anomaly detection methods, highlighting their pros and cons. Our contributions are the empirical validation of the main GAN models for anomaly detection, the extension of the experimental results to additional datasets, and the public release of a complete open-source toolbox for anomaly detection using GANs.
    Clustering acoustic emission data streams with sequentially appearing clusters using mixture models. (arXiv:2108.11211v2 [stat.ML] UPDATED)
    (0 min) The interpretation of unlabeled acoustic emission (AE) data classically relies on general-purpose clustering methods. While several external criteria have been used in the past to select the hyperparameters of those algorithms, few studies have paid attention to the development of dedicated objective functions in clustering methods able to cope with the specificities of AE data. We investigate how to explicitly represent cluster onsets in mixture models in general, and in Gaussian Mixture Models (GMM) in particular. By modifying the internal criterion of such models, we propose the first clustering method able to provide, through parameters estimated by an expectation-maximization procedure, information about when clusters occur (onsets), how they grow (kinetics) and their level of activation through time. This new objective function accommodates continuous timestamps of AE signals and, thus, their order of occurrence. The method, called GMMSEQ, is experimentally validated to characterize the loosening phenomenon in bolted structures under vibration. A comparison with three standard clustering methods on raw streaming data from five experimental campaigns shows that GMMSEQ not only provides useful qualitative information about the timeline of clusters, but also shows better performance in terms of cluster characterization. In view of developing an open acoustic emission initiative and according to the FAIR principles, the datasets and the codes are made available to reproduce the research of this paper.
    Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition. (arXiv:2109.06870v1 [cs.CL])
    (2 min) This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
    A pragmatic approach to estimating average treatment effects from EHR data: the effect of prone positioning on mechanically ventilated COVID-19 patients. (arXiv:2109.06707v1 [cs.LG])
    (3 min) Despite the recent progress in the field of causal inference, to date there is no agreed upon methodology to glean treatment effect estimates from observational data. The consequence on clinical practice is that, when lacking results from a randomized trial, medical personnel are left without guidance on what seems to be effective in a real-world scenario. This article showcases a pragmatic methodology to obtain preliminary estimates of treatment effects from observational studies. Our approach was tested on the estimation of the treatment effect of the proning maneuver on oxygenation levels, in a cohort of COVID-19 Intensive Care patients. We modeled our study design on a recent RCT for proning (the PROSEVA trial). Linear regression, propensity score models such as blocking and DR-IPW, BART and two versions of Counterfactual Regression were employed to provide estimates on observational data comprising first wave COVID-19 ICU patient data from 25 Dutch hospitals. 6371 data points, from 745 mechanically ventilated patients, were included in the study. Estimates for the early effect of proning -- P/F ratio from 2 to 8 hours after proning -- ranged between 14.54 and 20.11 mm Hg depending on the model. Estimates for the late effect of proning -- oxygenation from 12 to 24 hours after proning -- ranged between 13.53 and 15.26 mm Hg. All confidence intervals were strictly above zero, indicating that the effect of proning on oxygenation for COVID-19 patients was positive and comparable in magnitude to the effect on non-COVID-19 patients. These results provide further evidence of the effectiveness of proning in the treatment of COVID-19 patients. This study, along with the accompanying open-source code, provides a blueprint for treatment effect estimation in scenarios where RCT data is lacking. Funding: SIDN fund, CovidPredict consortium, Pacmed.
    Learning to Scan: A Deep Reinforcement Learning Approach for Personalized Scanning in CT Imaging. (arXiv:2006.02420v5 [physics.med-ph] UPDATED)
    (2 min) Computed Tomography (CT) takes X-ray measurements on the subjects to reconstruct tomographic images. As X-rays are a form of ionizing radiation, it is desirable to control the total X-ray dose for safety reasons. Therefore, we can only select a limited number of measurement angles and assign each of them a limited amount of dose. Traditional methods such as compressed sensing usually randomly select the angles and equally distribute the allowed dose on them. In most CT reconstruction models, the emphasis is on designing effective image representations, while much less emphasis is placed on improving the scanning strategy. The simple scanning strategy of random angle selection and equal dose distribution performs well in general, but it may not be ideal for each individual subject. It is more desirable to design a personalized scanning strategy for each subject to obtain better reconstruction results. In this paper, we propose to use Reinforcement Learning (RL) to learn a personalized scanning policy to select the angles and the dose at each chosen angle for each individual subject. We first formulate the CT scanning process as an MDP, and then use modern deep RL methods to solve it. The learned personalized scanning strategy not only leads to better reconstruction results, but also shows strong generalization to be combined with different reconstruction algorithms.
    Depth separation beyond radial functions. (arXiv:2102.01621v3 [cs.LG] UPDATED)
    (2 min) High-dimensional depth separation results for neural networks show that certain functions can be efficiently approximated by two-hidden-layer networks but not by one-hidden-layer ones in high-dimensions $d$. Existing results of this type mainly focus on functions with an underlying radial or one-dimensional structure, which are usually not encountered in practice. The first contribution of this paper is to extend such results to a more general class of functions, namely functions with piece-wise oscillatory structure, by building on the proof strategy of (Eldan and Shamir, 2016). We complement these results by showing that, if the domain radius and the rate of oscillation of the objective function are constant, then approximation by one-hidden-layer networks holds at a $\mathrm{poly}(d)$ rate for any fixed error threshold. A common theme in the proofs of depth-separation results is the fact that one-hidden-layer networks fail to approximate high-energy functions whose Fourier representation is spread in the domain. On the other hand, existing approximation results of a function by one-hidden-layer neural networks rely on the function having a sparse Fourier representation. The choice of the domain also represents a source of gaps between upper and lower approximation bounds. Focusing on a fixed approximation domain, namely the sphere $\mathbb{S}^{d-1}$ in dimension $d$, we provide a characterisation of both functions which are efficiently approximable by one-hidden-layer networks and of functions which are provably not, in terms of their Fourier expansion.
    Attention-augmented Spatio-Temporal Segmentation for Land Cover Mapping. (arXiv:2105.02963v2 [cs.CV] UPDATED)
    (2 min) The availability of massive earth observing satellite data provides huge opportunities for land use and land cover mapping. However, such mapping efforts are challenging due to the existence of various land cover classes, noisy data, and the lack of proper labels. Also, each land cover class typically has its own unique temporal pattern and can be identified only during certain periods. In this article, we introduce a novel architecture that incorporates the UNet structure with Bidirectional LSTM and Attention mechanism to jointly exploit the spatial and temporal nature of satellite data and to better identify the unique temporal patterns of each land cover. We evaluate this method for mapping crops in multiple regions over the world. We compare our method with other state-of-the-art methods both quantitatively and qualitatively on two real-world datasets which involve multiple land cover classes. We also visualise the attention weights to study its effectiveness in mitigating noise and identifying discriminative time periods.
    Automatic Curricula via Expert Demonstrations. (arXiv:2106.09159v2 [cs.LG] UPDATED)
    (2 min) We propose Automatic Curricula via Expert Demonstrations (ACED), a reinforcement learning (RL) approach that combines the ideas of imitation learning and curriculum learning in order to solve challenging robotic manipulation tasks with sparse reward functions. Curriculum learning solves complicated RL tasks by introducing a sequence of auxiliary tasks with increasing difficulty, yet how to automatically design effective and generalizable curricula remains a challenging research problem. ACED extracts curricula from a small amount of expert demonstration trajectories by dividing demonstrations into sections and initializing training episodes to states sampled from different sections of demonstrations. Through moving the reset states from the end to the beginning of demonstrations as the learning agent improves its performance, ACED not only learns challenging manipulation tasks with unseen initializations and goals, but also discovers novel solutions that are distinct from the demonstrations. In addition, ACED can be naturally combined with other imitation learning methods to utilize expert demonstrations in a more efficient manner, and we show that a combination of ACED with behavior cloning allows pick-and-place tasks to be learned with as few as 1 demonstration and block stacking tasks to be learned with 20 demonstrations.
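    A hedged sketch of the reset-state curriculum described above (generic logic, not the authors' implementation; the demonstrations and the stage-advancement rule are placeholders):

        import random

        def make_reset_sampler(demos, n_sections=5):
            """Split each demonstration (a list of states) into n_sections contiguous
            sections. Early in training, episodes reset near the end of a demonstration
            (close to the goal); as the agent improves, the reset point moves toward
            the start. Illustrative sketch of the reset-state curriculum idea only."""
            def sample_reset_state(stage):
                # stage: n_sections - 1 (easiest, near the goal) down to 0 (demo start).
                demo = random.choice(demos)
                sec_len = max(1, len(demo) // n_sections)
                start = min(stage * sec_len, len(demo) - 1)
                end = min(len(demo), start + sec_len)
                return random.choice(demo[start:end])
            return sample_reset_state

        # Usage: sampler = make_reset_sampler(demos)
        # state = sampler(stage=4)  # move toward stage 0 as the success rate improves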
    Honest-but-Curious Nets: Sensitive Attributes of Private Inputs Can Be Secretly Coded into the Classifiers' Outputs. (arXiv:2105.12049v3 [cs.LG] UPDATED)
    (2 min) It is known that deep neural networks, trained for the classification of non-sensitive target attributes, can reveal sensitive attributes of their input data through internal representations extracted by the classifier. We take a step forward and show that deep classifiers can be trained to secretly encode a sensitive attribute of their input data into the classifier's outputs for the target attribute, at inference time. Our proposed attack works even if users have a full white-box view of the classifier, can keep all internal representations hidden, and only release the classifier's estimations for the target attribute. We introduce an information-theoretical formulation for such attacks and present efficient empirical implementations for training honest-but-curious (HBC) classifiers: classifiers that can be accurate in predicting their target attribute, but can also exploit their outputs to secretly encode a sensitive attribute. Our work highlights a vulnerability that can be exploited by malicious machine learning service providers to attack their user's privacy in several seemingly safe scenarios; such as encrypted inferences, computations at the edge, or private knowledge distillation. Experimental results on several attributes in two face-image datasets show that a semi-trusted server can train classifiers that are not only perfectly honest but also accurately curious. We conclude by showing the difficulties in distinguishing between standard and HBC classifiers, discussing challenges in defending against this vulnerability of deep classifiers, and enumerating related open directions for future studies.
    DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation. (arXiv:2103.11109v4 [cs.LG] UPDATED)
    (3 min) Recent success of deep neural networks (DNNs) hinges on the availability of large-scale datasets; however, training on such datasets often poses privacy risks for sensitive training information. In this paper, we aim to explore the power of generative models and gradient sparsity, and propose a scalable privacy-preserving generative model DATALENS. Compared with the standard PATE privacy-preserving framework, which allows teachers to vote on one-dimensional predictions, voting on the high dimensional gradient vectors is challenging in terms of privacy preservation. As dimension reduction techniques are required, we need to navigate a delicate tradeoff space between (1) the improvement of privacy preservation and (2) the slowdown of SGD convergence. To tackle this, we take advantage of communication efficient learning and propose a novel noise compression and aggregation approach TOPAGG by combining top-k compression for dimension reduction with a corresponding noise injection mechanism. We theoretically prove that the DATALENS framework guarantees differential privacy for its generated data, and provide analysis on its convergence. To demonstrate the practical usage of DATALENS, we conduct extensive experiments on diverse datasets including MNIST, Fashion-MNIST, and high dimensional CelebA, and we show that DATALENS significantly outperforms other baseline DP generative models. In addition, we adapt the proposed TOPAGG approach, which is one of the key building blocks in DATALENS, to DP SGD training, and show that it is able to achieve higher utility than the state-of-the-art DP SGD approach in most cases. Our code is publicly available at https://github.com/AI-secure/DataLens.
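    The flavor of a top-k-compress-then-add-noise step can be conveyed with a short NumPy sketch; the clipping and noise constants here are arbitrary placeholders, not the calibrated differential-privacy parameters of TOPAGG:

        import numpy as np

        def topk_sign_with_noise(grad, k, clip=1.0, noise_std=0.5, rng=None):
            """Keep only the k largest-magnitude coordinates of a 1-D gradient
            vector, clip them, add Gaussian noise, and reduce to a sparse sign
            vector. Loose sketch of a top-k + noise-injection step; a real DP
            mechanism calibrates the noise to a privacy budget."""
            rng = rng or np.random.default_rng()
            out = np.zeros_like(grad)
            idx = np.argsort(np.abs(grad))[-k:]                 # top-k coordinates
            vals = np.clip(grad[idx], -clip, clip)              # bound each contribution
            vals = vals + rng.normal(0.0, noise_std, size=k)    # noise for privacy
            out[idx] = np.sign(vals)
            return out

        # Aggregation across teachers: sum the noisy sparse votes coordinate-wise.
        # agg = sum(topk_sign_with_noise(g, k=100) for g in teacher_grads)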
    Reactive and Safe Road User Simulations using Neural Barrier Certificates. (arXiv:2109.06689v1 [cs.MA])
    (2 min) Reactive and safe agent modeling is important for today's traffic simulator designs and safe planning applications. In this work, we propose a reactive agent model which can ensure safety without compromising the original purposes, by learning only high-level decisions from expert data together with a low-level decentralized controller guided by the jointly learned decentralized barrier certificates. Empirical results show that our learned road user simulation models can achieve a significant improvement in safety compared to state-of-the-art imitation learning and pure control-based methods, while remaining similar to human agents, with smaller errors relative to the expert data. Moreover, our learned reactive agents are shown to generalize better to unseen traffic conditions and to react better to other road users, and therefore can help in understanding challenging planning problems pragmatically.
    Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition. (arXiv:2103.09154v2 [cs.CV] UPDATED)
    (2 min) Emotional expressions are the behaviors that communicate our emotional state or attitude to others. They are expressed through verbal and non-verbal communication. Complex human behavior can be understood by studying physical features from multiple modalities; mainly facial, vocal and physical gestures. Recently, spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis. In this paper, we propose a new deep learning-based approach for audio-visual emotion recognition. Our approach leverages recent advances in deep learning like knowledge distillation and high-performing deep architectures. The deep feature representations of the audio and visual modalities are fused based on a model-level fusion strategy. A recurrent neural network is then used to capture the temporal dynamics. Our proposed approach substantially outperforms state-of-the-art approaches in predicting valence on the RECOLA dataset. Moreover, our proposed visual facial expression feature extraction network outperforms state-of-the-art results on the AffectNet and Google Facial Expression Comparison datasets.
    Few-shot Quality-Diversity Optimisation. (arXiv:2109.06826v1 [cs.LG])
    (0 min) In the past few years, a considerable amount of research has been dedicated to the exploitation of previous learning experiences and the design of Few-shot and Meta Learning approaches, in problem domains ranging from Computer Vision to Reinforcement Learning based control. A notable exception, where, to the best of our knowledge, little to no effort has been made in this direction, is Quality-Diversity (QD) optimisation. QD methods have been shown to be effective tools in dealing with deceptive minima and sparse rewards in Reinforcement Learning. However, they remain costly due to their reliance on inherently sample inefficient evolutionary processes. We show that, given examples from a task distribution, information about the paths taken by optimisation in parameter space can be leveraged to build a prior population, which when used to initialise QD methods in unseen environments, allows for few-shot adaptation. Our proposed method does not require backpropagation. It is simple to implement and scale, and furthermore, it is agnostic to the underlying models that are being trained. Experiments carried out in both sparse and dense reward settings using robotic manipulation and navigation benchmarks show that it considerably reduces the number of generations that are required for QD optimisation in these environments.
    Stiff Neural Ordinary Differential Equations. (arXiv:2103.15341v3 [math.NA] UPDATED)
    (0 min) Neural Ordinary Differential Equations (ODE) are a promising approach to learn dynamic models from time-series data in science and engineering applications. This work aims at learning Neural ODE for stiff systems, which usually arise from chemical kinetic modeling in chemical and biological systems. We first show the challenges of learning neural ODE in the classical stiff ODE systems of Robertson's problem and propose techniques to mitigate the challenges associated with scale separations in stiff systems. We then present successful demonstrations in stiff systems of Robertson's problem and an air pollution problem. The demonstrations show that the usage of deep networks with rectified activations, proper scaling of the network outputs as well as loss functions, and stabilized gradient calculations are the key techniques enabling the learning of stiff neural ODE. The success of learning stiff neural ODE opens up possibilities of using neural ODEs in applications with widely varying time-scales, like chemical dynamics in energy conversion, environmental engineering, and the life sciences.
    Causal Decision Making and Causal Effect Estimation Are Not the Same... and Why It Matters. (arXiv:2104.04103v2 [stat.ML] UPDATED)
    (0 min) Causal decision making (CDM) based on machine learning has become a routine part of business. Businesses algorithmically target offers, incentives, and recommendations to affect consumer behavior. Recently, we have seen an acceleration of research related to CDM and causal effect estimation (CEE) using machine-learned models. This article highlights an important perspective: CDM is not the same as CEE, and counterintuitively, accurate CEE is not necessary for accurate CDM. Our experience is that this is not well understood by practitioners or most researchers. Technically, the estimand of interest is different, and this has important implications both for modeling and for the use of statistical models for CDM. We draw on prior research to highlight three implications. (1) We should consider carefully the objective function of the causal machine learning, and if possible, optimize for accurate treatment assignment rather than for accurate effect-size estimation. (2) Confounding does not have the same effect on CDM as it does on CEE. The upshot is that for supporting CDM it may be just as good or even better to learn with confounded data as with unconfounded data. Finally, (3) causal statistical modeling may not be necessary to support CDM because a proxy target for statistical modeling might do as well or better. This third observation helps to explain at least one broad common CDM practice that seems wrong at first blush: the widespread use of non-causal models for targeting interventions. The last two implications are particularly important in practice, as acquiring (unconfounded) data on all counterfactuals can be costly and often impracticable. These observations open substantial research ground. We hope to facilitate research in this area by pointing to related articles from multiple contributing fields, including two dozen articles published the last three to four years.
    AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax Optimization. (arXiv:2106.16101v3 [math.OC] UPDATED)
    (0 min) In this paper, we propose a class of faster adaptive Gradient Descent Ascent (GDA) methods for solving the nonconvex-strongly-concave minimax problems based on unified adaptive matrices, which include almost all existing coordinate-wise and global adaptive learning rates. Specifically, we propose a fast Adaptive Gradient Descent Ascent (AdaGDA) method based on the basic momentum technique, which reaches a lower sample complexity of $O(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, improving the results of the existing adaptive GDA methods by a factor of $O(\sqrt{\kappa})$. At the same time, we present an accelerated version of AdaGDA (VR-AdaGDA) method based on the momentum-based variance reduced technique, which achieves a lower sample complexity of $O(\kappa^{4.5}\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches, improving the results of the existing adaptive GDA methods by a factor of $O(\epsilon^{-1})$. Moreover, we prove that our VR-AdaGDA method reaches the best known sample complexity of $O(\kappa^{3}\epsilon^{-3})$ with the mini-batch size $O(\kappa^3)$. In particular, we provide an effective convergence analysis framework for our adaptive GDA methods. Experimental results on fair classifier and policy evaluation tasks demonstrate the efficiency of our algorithms.
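    The plain gradient descent ascent template underlying these methods (with fixed step sizes and no adaptive matrices, momentum, or variance reduction) can be sketched on a toy nonconvex-strongly-concave problem:

        import torch

        # Toy objective: f(x, y) = sin(x) * y - 0.5 * y^2, strongly concave in y,
        # with max_y f(x, y) = 0.5 * sin(x)^2 nonconvex in x.
        x = torch.tensor(1.0, requires_grad=True)
        y = torch.tensor(0.0, requires_grad=True)
        eta_x, eta_y = 0.05, 0.2          # descent / ascent step sizes

        for _ in range(500):
            f = torch.sin(x) * y - 0.5 * y ** 2
            gx, gy = torch.autograd.grad(f, (x, y))
            with torch.no_grad():
                x -= eta_x * gx           # gradient descent on the min variable
                y += eta_y * gy           # gradient ascent on the max variable

        print(x.item(), y.item())         # approaches a stationary point of max_y f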
    Session-aware Recommendation: A Surprising Quest for the State-of-the-art. (arXiv:2011.03424v2 [cs.IR] UPDATED)
    (2 min) Recommender systems are designed to help users in situations of information overload. In recent years, we observed increased interest in session-based recommendation scenarios, where the problem is to make item suggestions to users based only on interactions observed in an ongoing session. However, in cases where interactions from previous user sessions are available, the recommendations can be personalized according to the users' long-term preferences, a process called session-aware recommendation. Today, research in this area is scattered and many existing works only compare session-aware with session-based models. This makes it challenging to understand what represents the state-of-the-art. To close this research gap, we benchmarked recent session-aware algorithms against each other and against a number of session-based recommendation algorithms and trivial extensions thereof. Our comparison, to some surprise, revealed that (i) simple techniques based on nearest neighbors consistently outperform recent neural techniques and that (ii) session-aware models were mostly not better than approaches that do not use long-term preference information. Our work therefore not only points to potential methodological issues where new methods are compared to weak baselines, but also indicates that there remains a huge potential for more sophisticated session-aware recommendation algorithms.
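    The kind of nearest-neighbour baseline referred to in (i) can be illustrated with a tiny toy sketch (generic code, not any specific method from the benchmark): score candidate items by the similarity between the current session and the past sessions that contain them.

        from collections import Counter

        def session_knn_recommend(current_session, past_sessions, top_n=5):
            """Score items from the most similar past sessions.
            current_session is a list of item ids; past_sessions is a list of such lists.
            Toy sketch of a session-based nearest-neighbour recommender."""
            cur = set(current_session)
            scores = Counter()
            for sess in past_sessions:
                s = set(sess)
                sim = len(cur & s) / len(cur | s)          # Jaccard similarity
                if sim == 0:
                    continue
                for item in s - cur:                        # don't re-recommend seen items
                    scores[item] += sim
            return [item for item, _ in scores.most_common(top_n)]

        print(session_knn_recommend([1, 2], [[1, 2, 3], [2, 4], [5, 6]]))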
    Online learning-based trajectory tracking for underactuated vehicles with uncertain dynamics. (arXiv:2009.06689v2 [eess.SY] UPDATED)
    (2 min) Underactuated vehicles have gained much attention in recent years due to the increasing number of aerial and underwater vehicles as well as nanosatellites. Trajectory tracking control of these vehicles is an essential aspect of an increasing range of application domains. However, external disturbances and parts of the internal dynamics are often unknown or very time-consuming to model. To overcome this issue, we present a tracking control law for underactuated rigid-body dynamics using an online learning-based oracle for the prediction of the unknown dynamics. We show that Gaussian process models are of particular interest for the role of the oracle. The presented approach guarantees a bounded tracking error with high probability where the bound is explicitly given. A numerical example highlights the effectiveness of the proposed control law.
    PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models. (arXiv:2109.06777v1 [cs.LG])
    (2 min) What should a malicious user write next to fool a detection model? Identifying malicious users is critical to ensure the safety and integrity of internet platforms. Several deep learning based detection models have been created. However, malicious users can evade deep detection models by manipulating their behavior, rendering these models of little use. The vulnerability of such deep detection models against adversarial attacks is unknown. Here we create a novel adversarial attack model against deep user sequence embedding-based classification models, which use the sequence of user posts to generate user embeddings and detect malicious users. In the attack, the adversary generates a new post to fool the classifier. We propose a novel end-to-end Personalized Text Generation Attack model, called PETGEN, that simultaneously reduces the efficacy of the detection model and generates posts that have several key desirable properties. Specifically, PETGEN generates posts that are personalized to the user's writing style, have knowledge about a given target context, are aware of the user's historical posts on the target context, and encapsulate the user's recent topical interests. We conduct extensive experiments on two real-world datasets (Yelp and Wikipedia, both with ground-truth of malicious users) to show that PETGEN significantly reduces the performance of popular deep user sequence embedding-based classification models. PETGEN outperforms five attack baselines in terms of text quality and attack efficacy in both white-box and black-box classifier settings. Overall, this work paves the path towards the next generation of adversary-aware sequence classification models.
    Predicting Loss Risks for B2B Tendering Processes. (arXiv:2109.06815v1 [cs.LG])
    (2 min) Sellers and executives who maintain a bidding pipeline of sales engagements with multiple clients for many opportunities significantly benefit from data-driven insight into the health of each of their bids. There are many predictive models that offer likelihood insights and win prediction modeling for these opportunities. Currently, these win prediction models are in the form of binary classification and only make a prediction for the likelihood of a win or loss. The binary formulation is unable to offer any insight as to why a particular deal might be predicted as a loss. This paper offers a multi-class classification model to predict win probability, with the three loss classes offering specific reasons as to why a loss is predicted, including no bid, customer did not pursue, and lost to competition. These classes offer an indicator of how that opportunity might be handled given the nature of the prediction. Besides offering baseline results on the multi-class classification, this paper also offers results on the model after class imbalance handling, with the results achieving a high accuracy of 85% and an average AUC score of 0.94.
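    A generic sketch of a multi-class win/loss model with simple class-imbalance handling (synthetic stand-in features and a random forest with balanced class weights; the paper's actual features, model, and imbalance technique are not specified here):

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import classification_report, roc_auc_score
        from sklearn.model_selection import train_test_split

        # Synthetic stand-in for tender features; 4 classes: win, no bid,
        # customer did not pursue, lost to competition (hypothetical encoding 0-3).
        X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                                   n_classes=4, weights=[0.55, 0.15, 0.15, 0.15],
                                   random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                     random_state=0).fit(X_tr, y_tr)
        print(classification_report(y_te, clf.predict(X_te)))
        print("macro AUC:", roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr"))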
    Improving and Simplifying Pattern Exploiting Training. (arXiv:2103.11955v2 [cs.CL] UPDATED)
    (2 min) Recently, pre-trained language models (LMs) have achieved strong performance when fine-tuned on difficult benchmarks like SuperGLUE. However, performance can suffer when there are very few labeled examples available for fine-tuning. Pattern Exploiting Training (PET) is a recent approach that leverages patterns for few-shot learning. However, PET uses task-specific unlabeled data. In this paper, we focus on few-shot learning without any unlabeled data and introduce ADAPET, which modifies PET's objective to provide denser supervision during fine-tuning. As a result, ADAPET outperforms PET on SuperGLUE without any task-specific unlabeled data. Our code can be found at https://github.com/rrmenon10/ADAPET.
    Fast Federated Edge Learning with Overlapped Communication and Computation and Channel-Aware Fair Client Scheduling. (arXiv:2109.06710v1 [cs.IT])
    (2 min) We consider federated edge learning (FEEL) over wireless fading channels taking into account the downlink and uplink channel latencies, and the random computation delays at the clients. We speed up the training process by overlapping the communication with computation. With fountain coded transmission of the global model update, clients receive the global model asynchronously, and start performing local computations right away. Then, we propose a dynamic client scheduling policy, called MRTP, for uploading local model updates to the parameter server (PS), which, at any time, schedules the client with the minimum remaining upload time. However, MRTP can lead to biased participation of clients in the update process, resulting in performance degradation in non-iid data scenarios. To overcome this, we propose two alternative schemes with fairness considerations, termed as age-aware MRTP (A-MRTP), and opportunistically fair MRTP (OF-MRTP). In A-MRTP, the remaining clients are scheduled according to the ratio between their remaining transmission time and the update age, while in OF-MRTP, the selection mechanism utilizes the long term average channel rate of the clients to further reduce the latency while ensuring fair participation of the clients. It is shown through numerical simulations that OF-MRTP provides significant reduction in latency without sacrificing test accuracy.
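    One plausible reading of the MRTP and A-MRTP selection rules, as a minimal sketch over hypothetical client records (not the paper's simulator):

        def mrtp(clients):
            """MRTP: schedule the client with the minimum remaining upload time."""
            return min(clients, key=lambda c: c["remaining_upload_time"])

        def a_mrtp(clients):
            """A-MRTP: schedule by the ratio of remaining upload time to update age,
            so stale clients are not starved (one plausible reading of the rule)."""
            return min(clients, key=lambda c: c["remaining_upload_time"] / max(c["update_age"], 1e-9))

        clients = [
            {"id": 0, "remaining_upload_time": 2.0, "update_age": 1.0},
            {"id": 1, "remaining_upload_time": 3.0, "update_age": 10.0},
        ]
        print(mrtp(clients)["id"], a_mrtp(clients)["id"])   # prints 0, then 1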
    ImUnity: a generalizable VAE-GAN solution for multicenter MR image harmonization. (arXiv:2109.06756v1 [eess.IV])
    (2 min) ImUnity is an original deep-learning model designed for efficient and flexible MR image harmonization. A VAE-GAN network, coupled with a confusion module and an optional biological preservation module, uses multiple 2D-slices taken from different anatomical locations in each subject of the training database, as well as image contrast transformations for its self-supervised training. It eventually generates 'corrected' MR images that can be used for various multi-center population studies. Using 3 open source databases (ABIDE, OASIS and SRPBS), which contain MR images from multiple acquisition scanner types or vendors and a large range of subject ages, we show that ImUnity: (1) outperforms state-of-the-art methods in terms of quality of images generated using traveling subjects; (2) removes site or scanner biases while improving patient classification; (3) harmonizes data coming from new sites or scanners without the need for additional fine-tuning and (4) allows the selection of multiple MR reconstructed images according to the desired applications. Tested here on T1-weighted images, ImUnity could be used to harmonize other types of medical images.
    Joint Use of Node Attributes and Proximity for Semi-Supervised Classification on Graphs. (arXiv:2010.11536v2 [cs.SI] UPDATED)
    (0 min) The task of node classification is to infer unknown node labels, given the labels for some of the nodes along with the network structure and other node attributes. Typically, approaches for this task assume homophily, whereby neighboring nodes have similar attributes and a node's label can be predicted from the labels of its neighbors or other proximate (i.e., nearby) nodes in the network. However, such an assumption may not always hold -- in fact, there are cases where labels are better predicted from the individual attributes of each node rather than the labels of its proximate nodes. Ideally, node classification methods should flexibly adapt to a range of settings wherein unknown labels are predicted either from labels of proximate nodes, or individual node attributes, or partly both. In this paper, we propose a principled approach, JANE, based on a generative probabilistic model that jointly weighs the role of attributes and node proximity via embeddings in predicting labels. Our experiments on a variety of network datasets demonstrate that JANE exhibits the desired combination of versatility and competitive performance compared to standard baselines.
    Exploration in Deep Reinforcement Learning: A Comprehensive Survey. (arXiv:2109.06668v1 [cs.AI])
    (0 min) Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant success across a wide range of domains, such as game AI, autonomous vehicles, robotics and finance. However, DRL and deep MARL agents are widely known to be sample-inefficient and millions of interactions are usually needed even for relatively simple game settings, thus preventing their wide application in real-world industrial scenarios. One bottleneck challenge behind this is the well-known exploration problem, i.e., how to efficiently explore the unknown environments and collect informative experiences that could benefit the policy learning most. In this paper, we conduct a comprehensive survey on existing exploration methods in DRL and deep MARL for the purpose of providing understanding and insights into the critical problems and solutions. We first identify several key challenges to achieve efficient exploration, which most of the exploration methods aim at addressing. Then we provide a systematic survey of existing approaches by classifying them into two major categories: uncertainty-oriented exploration and intrinsic motivation-oriented exploration. The essence of uncertainty-oriented exploration is to leverage the quantification of the epistemic and aleatoric uncertainty to derive efficient exploration. By contrast, intrinsic motivation-oriented exploration methods usually incorporate different reward agnostic information for intrinsic exploration guidance. Beyond the above two main branches, we also review other exploration methods that adopt sophisticated techniques but are difficult to classify into the above two categories. In addition, we provide a comprehensive empirical comparison of exploration methods for DRL on a set of commonly used benchmarks. Finally, we summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.
    Tuna-AI: tuna biomass estimation with Machine Learning models trained on oceanography and echosounder FAD data. (arXiv:2109.06732v1 [stat.ML])
    (2 min) Echo-sounder data registered by buoys attached to drifting FADs provide a very valuable source of information on populations of tuna and their behaviour. This value increases when these data are supplemented with oceanographic data coming from CMEMS. We use these sources to develop Tuna-AI, a Machine Learning model aimed at predicting tuna biomass under a given buoy, which uses a 3-day window of echo-sounder data to capture the daily spatio-temporal patterns characteristic of tuna schools. As the supervised signal for training, we employ more than 5000 set events with their corresponding tuna catch reported by the AGAC tuna purse seine fleet.
    Comparing Reconstruction- and Contrastive-based Models for Visual Task Planning. (arXiv:2109.06737v1 [cs.RO])
    (2 min) Learning state representations enables robotic planning directly from raw observations such as images. Most methods learn state representations by utilizing losses based on the reconstruction of the raw observations from a lower-dimensional latent space. The similarity between observations in the space of images is often assumed and used as a proxy for estimating similarity between the underlying states of the system. However, observations commonly contain task-irrelevant factors of variation which are nonetheless important for reconstruction, such as varying lighting and different camera viewpoints. In this work, we define relevant evaluation metrics and perform a thorough study of different loss functions for state representation learning. We show that models exploiting task priors, such as Siamese networks with a simple contrastive loss, outperform reconstruction-based representations in visual task planning.
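    The contrast between the two families of losses can be sketched in PyTorch with a toy encoder and decoder (generic code, not the paper's networks or evaluation metrics):

        import torch
        import torch.nn.functional as F

        enc = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
        dec = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 64))

        def reconstruction_loss(x):
            # Forces the latent code to retain everything needed to rebuild the
            # observation, including task-irrelevant factors such as lighting.
            return F.mse_loss(dec(enc(x)), x)

        def contrastive_loss(x1, x2, same_state, margin=1.0):
            # Siamese objective: pull together pairs known (via task priors) to share
            # the same underlying state, push apart pairs that do not.
            d = F.pairwise_distance(enc(x1), enc(x2))
            return (same_state * d.pow(2) +
                    (1 - same_state) * F.relu(margin - d).pow(2)).mean()

        x1, x2 = torch.randn(16, 64), torch.randn(16, 64)
        same = torch.randint(0, 2, (16,)).float()
        print(reconstruction_loss(x1).item(), contrastive_loss(x1, x2, same).item())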
    NURBS-Diff: A differentiable programming module for NURBS. (arXiv:2104.14547v2 [cs.LG] UPDATED)
    (0 min) Boundary representations (B-reps) using Non-Uniform Rational B-splines (NURBS) are the de facto standard used in CAD, but their utility in deep learning-based approaches is not well researched. We propose a differentiable NURBS module to integrate the NURBS representation of CAD models with deep learning methods. We mathematically define the derivatives of the NURBS curves or surfaces with respect to the input parameters. These derivatives are used to define an approximate Jacobian that can be used to perform the "backward" evaluation used while training deep learning models. We have implemented our NURBS module using GPU-accelerated algorithms and integrated it with PyTorch, a popular deep learning framework. We demonstrate the efficacy of our NURBS module in performing CAD operations such as curve or surface fitting and surface offsetting. Further, we show its utility in deep learning for unsupervised point cloud reconstruction. These examples show that our module performs better for certain deep learning frameworks and can be directly integrated with any deep-learning framework requiring NURBS.
    MIMIR: Deep Regression for Automated Analysis of UK Biobank Body MRI. (arXiv:2106.11731v2 [eess.IV] UPDATED)
    (0 min) UK Biobank (UKB) conducts large-scale examinations of more than half a million volunteers, collecting health-related information on genetics, lifestyle, blood biochemistry, and more. Medical imaging of 100,000 subjects, with 70,000 follow-up sessions, enables measurements of organs, muscle, and body composition. With up to 170,000 mounting MR images, various methodologies are accordingly engaged in large-scale image analysis. This work presents an experimental inference engine that can automatically predict a comprehensive profile of subject metadata from UKB neck-to-knee body MRI. It was evaluated in cross-validation for baseline characteristics such as age, height, weight, and sex, but also measurements of body composition, organ volumes, and abstract properties like grip strength, pulse rate, and type 2 diabetic status. It predicted subsequently released test data covering twelve body composition metrics with a 3% median error. The proposed system can automatically analyze one thousand subjects within ten minutes, providing individual confidence intervals. The underlying methodology utilizes convolutional neural networks for image-based mean-variance regression on two-dimensional representations of the MRI data. This work aims to make the proposed system available for free to researchers, who can use it to obtain fast and fully-automated estimates of 72 different measurements immediately upon release of new UKB image data.
    Sequential Modelling with Applications to Music Recommendation, Fact-Checking, and Speed Reading. (arXiv:2109.06736v1 [cs.IR])
    (2 min) Sequential modelling entails making sense of sequential data, which naturally occurs in a wide array of domains. One example is systems that interact with users, log user actions and behaviour, and make recommendations of items of potential interest to users on the basis of their previous interactions. In such cases, the sequential order of user interactions is often indicative of what the user is interested in next. Similarly, for systems that automatically infer the semantics of text, capturing the sequential order of words in a sentence is essential, as even a slight re-ordering could significantly alter its original meaning. This thesis makes methodological contributions and new investigations of sequential modelling for the specific application areas of systems that recommend music tracks to listeners and systems that process text semantics in order to automatically fact-check claims, or "speed read" text for efficient further classification. (Rest of abstract omitted due to arXiv abstract limit)
    COVID-Net Clinical ICU: Enhanced Prediction of ICU Admission for COVID-19 Patients via Explainability and Trust Quantification. (arXiv:2109.06711v1 [cs.LG])
    (2 min) The COVID-19 pandemic continues to have a devastating global impact, and has placed a tremendous burden on struggling healthcare systems around the world. Given the limited resources, accurate patient triaging and care planning is critical in the fight against COVID-19, and one crucial task within care planning is determining if a patient should be admitted to a hospital's intensive care unit (ICU). Motivated by the need for transparent and trustworthy ICU admission clinical decision support, we introduce COVID-Net Clinical ICU, a neural network for ICU admission prediction based on patient clinical data. Driven by a transparent, trust-centric methodology, the proposed COVID-Net Clinical ICU was built using a clinical dataset from Hospital Sirio-Libanes comprising 1,925 COVID-19 patients, and is able to predict when a COVID-19 positive patient would require ICU admission with an accuracy of 96.9% to facilitate better care planning for hospitals amidst the on-going pandemic. We conducted system-level insight discovery using a quantitative explainability strategy to study the decision-making impact of different clinical features and gain actionable insights for enhancing predictive performance. We further leveraged a suite of trust quantification metrics to gain deeper insights into the trustworthiness of COVID-Net Clinical ICU. By digging deeper into when and why clinical predictive models make certain decisions, we can uncover key factors in decision making for critical clinical decision support tasks such as ICU admission prediction and identify the situations under which clinical predictive models can be trusted for greater accountability.
    An Apparatus for the Simulation of Breathing Disorders: Physically Meaningful Generation of Surrogate Data. (arXiv:2109.06699v1 [physics.med-ph])
    (2 min) Whilst debilitating breathing disorders, such as chronic obstructive pulmonary disease (COPD), are rapidly increasing in prevalence, we witness a continued integration of artificial intelligence into healthcare. While this promises improved detection and monitoring of breathing disorders, AI techniques are "data hungry", which highlights the importance of generating physically meaningful surrogate data. Such domain knowledge aware surrogates would enable both an improved understanding of respiratory waveform changes with different breathing disorders and different severities, and enhance the training of machine learning algorithms. To this end, we introduce an apparatus comprising PVC tubes and 3D-printed parts as a simple yet effective method of simulating both obstructive and restrictive respiratory waveforms in healthy subjects. Independent control over both inspiratory and expiratory resistances allows for the simulation of obstructive breathing disorders through the whole spectrum of FEV1/FVC spirometry ratios (used to classify COPD), ranging from healthy values to values seen in severe chronic obstructive pulmonary disease. Moreover, waveform characteristics of breathing disorders, such as a change in inspiratory duty cycle or peak flow, are also observed in the waveforms resulting from use of the artificial breathing disorder simulation apparatus. Overall, the proposed apparatus provides us with a simple, effective and physically meaningful way to generate surrogate breathing disorder waveforms, a prerequisite for the use of artificial intelligence in respiratory health.
    W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. (arXiv:2108.06209v2 [cs.LG] UPDATED)
    (0 min) Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations via solving a masked prediction task consuming the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model shows 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec 2.0 by more than 30% relatively.
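    A minimal sketch of the two-objective idea described above -- a contrastive term over discretized targets plus a masked-prediction term -- assuming PyTorch tensors; the shapes, module structure, and loss weighting are illustrative assumptions, not the paper's implementation:

        import torch
        import torch.nn.functional as F

        def w2v_bert_style_loss(context_vectors, quantized_targets, distractors,
                                mlm_logits, masked_token_ids, mask, mlm_weight=1.0):
            """Illustrative combination of a contrastive objective (pick the true
            quantized target among K distractors) and a masked-prediction objective
            over the discretized tokens. All shapes and the weighting are assumptions."""
            # Contrastive term: similarity of each context vector with the true target
            # vs. the distractors, trained with cross-entropy on index 0.
            candidates = torch.cat([quantized_targets.unsqueeze(1), distractors], dim=1)  # (B, 1+K, D)
            sims = F.cosine_similarity(context_vectors.unsqueeze(1), candidates, dim=-1)  # (B, 1+K)
            contrastive = F.cross_entropy(sims, torch.zeros(sims.size(0), dtype=torch.long))

            # MLM term: predict the discrete token id at masked positions only.
            mlm = F.cross_entropy(mlm_logits[mask], masked_token_ids[mask])
            return contrastive + mlm_weight * mlm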
    Types of Out-of-Distribution Texts and How to Detect Them. (arXiv:2109.06827v1 [cs.CL])
    (0 min) Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of OOD examples and how to best detect them. We categorize these examples by whether they exhibit a background shift or a semantic shift, and find that the two major approaches to OOD detection, model calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings, while performing worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, highlighting a weak spot for current methods. Since no single method works well across all settings, our results call for an explicit definition of OOD examples when evaluating different detection methods.
    Greenformer: Factorization Toolkit for Efficient Deep Neural Networks. (arXiv:2109.06762v1 [cs.LG])
    (2 min) While the recent advances in deep neural networks (DNN) bring remarkable success, the computational cost also increases considerably. In this paper, we introduce Greenformer, a toolkit to accelerate the computation of neural networks through matrix factorization while maintaining performance. Greenformer can be easily applied with a single line of code to any DNN model. Our experimental results show that Greenformer is effective for a wide range of scenarios. We provide a showcase of Greenformer at https://samuelcahyawijaya.github.io/greenformer-demo/.
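    The toolkit itself is available at the link above; as a generic illustration of the underlying idea (low-rank factorization of a dense weight matrix to cut matmul cost), not Greenformer's actual API, a NumPy sketch:

        import numpy as np

        def factorize_dense_weight(W, rank):
            """Approximate a dense weight matrix W (out x in) by two thin factors
            so that a matmul costs O((out + in) * rank) instead of O(out * in)."""
            U, s, Vt = np.linalg.svd(W, full_matrices=False)
            A = U[:, :rank] * s[:rank]          # (out, rank)
            B = Vt[:rank, :]                    # (rank, in)
            return A, B

        # Example: replace y = W @ x with the cheaper y ~= A @ (B @ x)
        W = np.random.randn(1024, 1024)
        A, B = factorize_dense_weight(W, rank=64)
        x = np.random.randn(1024)
        y_full, y_lowrank = W @ x, A @ (B @ x)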
    Neural Upscaling from Residue-level Protein Structure Networks to Atomistic Structure. (arXiv:2109.06700v1 [q-bio.BM])
    (0 min) Coarse-graining is a powerful tool for extending the reach of dynamic models of proteins and other biological macromolecules. Topological coarse-graining, in which biomolecules or sets thereof are represented via graph structures, is a particularly useful way of obtaining highly compressed representations of molecular structure, and simulations operating via such representations can achieve substantial computational savings. A drawback of coarse-graining, however, is the loss of atomistic detail - an effect that is especially acute for topological representations such as protein structure networks (PSNs). Here, we introduce an approach based on a combination of machine learning and physically-guided refinement for inferring atomic coordinates from PSNs. This "neural upscaling" procedure exploits the constraints implied by PSNs on possible configurations, as well as differences in the likelihood of observing different configurations with the same PSN. Using a 1 $\mu$s atomistic molecular dynamics trajectory of A$\beta_{1-40}$, we show that neural upscaling is able to effectively recapitulate detailed structural information for intrinsically disordered proteins, being particularly successful in recovering features such as transient secondary structure. These results suggest that scalable network-based models for protein structure and dynamics may be used in settings where atomistic detail is desired, with upscaling employed to impute atomic coordinates from PSNs.
    An Insect-Inspired Randomly Weighted Neural Network with Random Fourier Features For Neuro-Symbolic Relational Learning. (arXiv:2109.06663v1 [cs.CV])
    (3 min) Insects, such as fruit flies and honey bees, can solve simple associative learning tasks and learn abstract concepts such as "sameness" and "difference", which is viewed as a higher-order cognitive function and typically thought to depend on top-down neocortical processing. Empirical research with fruit flies strongly supports that a randomized representational architecture is used in olfactory processing in insect brains. Based on these results, we propose a Randomly Weighted Feature Network (RWFN) that incorporates randomly drawn, untrained weights in an encoder that uses an adapted linear model as a decoder. The randomized projections between input neurons and higher-order processing centers in the insect brain are mimicked in RWFN by a single-hidden-layer neural network that specially structures latent representations in the hidden layer using random Fourier features, which better represent complex relationships between inputs using kernel approximation. Because of this special representation, RWFNs can effectively learn the degree of relationship among inputs by training only a linear decoder model. We compare the performance of RWFNs to Logic Tensor Networks (LTNs) for Semantic Image Interpretation (SII) tasks, which have been used as a representative example of how LTNs utilize reasoning over first-order logic to surpass the performance of solely data-driven methods. We demonstrate that compared to LTNs, RWFNs can achieve better or similar performance for both object classification and detection of the part-of relations between objects in SII tasks while using far fewer learnable parameters (1:62 ratio) and a faster learning process (1:2 ratio of running speed). Furthermore, we show that because the randomized weights do not depend on the data, several decoders can share a single randomized encoder, giving RWFNs a unique economy of spatial scale for simultaneous classification tasks.
    Expert Knowledge-Guided Length-Variant Hierarchical Label Generation for Proposal Classification. (arXiv:2109.06661v1 [cs.LG])
    (2 min) To advance the development of science and technology, research proposals are submitted to open-court competitive programs developed by government agencies (e.g., NSF). Proposal classification is one of the most important tasks for achieving effective and fair review assignments. Proposal classification aims to classify a proposal into a length-variant sequence of labels. In this paper, we formulate the proposal classification problem as a hierarchical multi-label classification task. Although there are certain prior studies, proposal classification exhibits unique features: 1) the classification result of a proposal is in a hierarchical discipline structure with different levels of granularity; 2) proposals contain multiple types of documents; 3) domain experts can empirically provide partial labels that can be leveraged to improve task performance. In this paper, we focus on developing a new deep proposal classification framework to jointly model the three features. In particular, to sequentially generate labels, we leverage previously generated labels to predict the label of the next level; to integrate partial labels from experts, we use the embedding of these empirical partial labels to initialize the state of the neural network. Our model can automatically identify the best length of the label sequence, stopping further label prediction at that point. Finally, we present extensive results to demonstrate that our method can jointly model partial labels, textual information, and semantic dependencies in label sequences, and thus achieve strong performance.
    LRWR: Large-Scale Benchmark for Lip Reading in Russian language. (arXiv:2109.06692v1 [cs.CV])
    (2 min) Lipreading, also known as visual speech recognition, aims to identify the speech content from videos by analyzing the visual deformations of lips and nearby areas. One of the significant obstacles for research in this field is the lack of proper datasets for a wide variety of languages: so far, these methods have been focused only on English or Chinese. In this paper, we introduce a naturally distributed large-scale benchmark for lipreading in the Russian language, named LRWR, which contains 235 classes and 135 speakers. We provide a detailed description of the dataset collection pipeline and dataset statistics. We also present a comprehensive comparison of the current popular lipreading methods on LRWR and conduct a detailed analysis of their performance. The results demonstrate the differences between the benchmarked languages and provide several promising directions for fine-tuning lipreading models. Thanks to our findings, we also achieved new state-of-the-art results on the LRW benchmark.
    Deep hierarchical reinforcement agents for automated penetration testing. (arXiv:2109.06449v1 [cs.AI])
    (2 min) Penetration testing, the organised attack of a computer system in order to test existing defences, has been used extensively to evaluate network security. This is a time-consuming process and requires in-depth knowledge for the establishment of a strategy that resembles a real cyber-attack. This paper presents a novel deep reinforcement learning architecture with hierarchically structured agents called HA-DRL, which employs an algebraic action decomposition strategy to address the large discrete action space of an autonomous penetration testing simulator, where the number of actions increases exponentially with the complexity of the designed cybersecurity network. The proposed architecture is shown to find the optimal attacking policy faster and more stably than a conventional deep Q-learning agent, which is commonly used as a method to apply artificial intelligence in automatic penetration testing.
    A Machine-learning Framework for Acoustic Design Assessment in Early Design Stages. (arXiv:2109.06459v1 [cs.SD])
    (2 min) In time- and cost-constrained scale model studies, simulation is the commonly preferred method for predicting acoustic performance. In this field, building acoustic simulation tools are complicated by several challenges, including the high cost of acoustic tools, the need for acoustic expertise, and the time-consuming process of acoustic simulation. The goal of this project is to introduce a simple model with a short calculation time to estimate the room acoustic condition in the early design stages of a building. This paper presents a working prototype for a new machine learning (ML) method to approximate a series of typical room acoustic parameters using only geometric data as input characteristics. A novel dataset consisting of acoustical simulations of a single room with 2916 different configurations is used to train and test the proposed model. In the simulation process, features that include room dimensions, window size, material absorption coefficient, furniture, and shading type have been analysed using Pachyderm acoustic software. This dataset is used as the input to seven machine-learning models based on fully connected Deep Neural Networks (DNN). The average error of the ML models is between 1% and 3%, and the average error on new predicted samples after the validation process is between 2% and 12%.
    Knowledge-guided Self-supervised Learning for estimating River-Basin Characteristics. (arXiv:2109.06429v1 [cs.LG])
    (2 min) Machine learning is being extensively used in hydrology, especially for streamflow prediction of basins/watersheds. Basin characteristics are essential for modeling the rainfall-runoff response of these watersheds, and therefore data-driven methods must take this ancillary characteristics data into account. However, there are several limitations, namely uncertainty in the measured characteristics, partially missing characteristics for some of the basins, or unknown characteristics that may not be present in the known measured set. In this paper we present an inverse model that uses a knowledge-guided self-supervised learning algorithm to infer basin characteristics using the meteorological drivers and streamflow response data. We evaluate our model on the CAMELS dataset, and the results validate its ability to reduce measurement uncertainty, impute missing characteristics, and identify unknown characteristics.
    When Machine Unlearning Jeopardizes Privacy. (arXiv:2005.02205v2 [cs.CR] UPDATED)
    (3 min) The right to be forgotten states that a data owner has the right to erase their data from an entity storing it. In the context of machine learning (ML), the right to be forgotten requires an ML model owner to remove the data owner's data from the training set used to build the ML model, a process known as machine unlearning. While originally designed to protect the privacy of the data owner, we argue that machine unlearning may leave some imprint of the data in the ML model and thus create unintended privacy risks. In this paper, we perform the first study on investigating the unintended information leakage caused by machine unlearning. We propose a novel membership inference attack that leverages the different outputs of an ML model's two versions to infer whether a target sample is part of the training set of the original model but out of the training set of the corresponding unlearned model. Our experiments demonstrate that the proposed membership inference attack achieves strong performance. More importantly, we show that our attack in multiple cases outperforms the classical membership inference attack on the original ML model, which indicates that machine unlearning can have counterproductive effects on privacy. We notice that the privacy degradation is especially significant for well-generalized ML models where classical membership inference does not perform well. We further investigate four mechanisms to mitigate the newly discovered privacy risks and show that releasing the predicted label only, temperature scaling, and differential privacy are effective. We believe that our results can help improve privacy protection in practical implementations of machine unlearning. Our code is available at https://github.com/MinChen00/UnlearningLeaks.
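    The authors' code is at the repository linked above; purely as an illustration of the general recipe described in the abstract (contrast the posteriors of the original and unlearned model versions and train an attack classifier on shadow data), with hypothetical feature choices:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def attack_features(p_original, p_unlearned):
            """Build membership-inference features for a target sample from the
            posteriors of the original and the unlearned model (a hypothetical
            feature choice: concatenation plus their element-wise difference)."""
            return np.concatenate([p_original, p_unlearned, p_original - p_unlearned])

        # Shadow-model style training of the attack classifier (sketch):
        # rows of X are attack_features(...) for samples known to be deleted (label 1)
        # or never included in training (label 0); here filled with placeholders.
        X = np.random.rand(200, 3 * 10)   # placeholder shadow data, 10 classes
        y = np.random.randint(0, 2, 200)
        attacker = LogisticRegression(max_iter=1000).fit(X, y)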
    Deep Convolutional Generative Modeling for Artificial Microstructure Development of Aluminum-Silicon Alloy. (arXiv:2109.06635v1 [eess.IV])
    (2 min) Machine learning, a sub-domain of artificial intelligence, is finding various applications in the manufacturing and materials science sectors. In the present study, deep generative modeling, a type of unsupervised machine learning technique, has been adapted for constructing artificial microstructures of an Aluminium-Silicon alloy. Deep Generative Adversarial Networks have been used to develop artificial microstructures from the given microstructure image dataset. The results obtained show that the developed models learnt to replicate the lining features seen in certain images of the microstructures.
    A machine-learning framework for daylight and visual comfort assessment in early design stages. (arXiv:2109.06450v1 [cs.LG])
    (2 min) This research is mainly focused on the assessment of machine learning algorithms for the prediction of daylight and visual comfort metrics in the early design stages. A dataset was primarily developed from 2880 simulations derived from Honeybee for Grasshopper. The simulations were done for a shoebox space with a window on one side. The alternatives emerged from different physical features, including room dimensions, interior surface reflectance, window dimensions and orientations, number of windows, and shading states. 5 metrics were used for daylight evaluations, including UDI, sDA, mDA, ASE, and sVD. Quality Views were analyzed for the same shoebox spaces via a Grasshopper-based algorithm, developed from the LEED v4 evaluation framework for Quality Views. The dataset was further analyzed with an Artificial Neural Network algorithm written in Python. The accuracy of the predictions was estimated at 97% on average. The developed model could be used in early design stage analyses without the need for time-consuming simulations in previously used platforms and programs.
    Identifying partial mouse brain microscopy images from Allen reference atlas using a contrastively learned semantic space. (arXiv:2109.06662v1 [cs.CV])
    (2 min) Precise identification of mouse brain microscopy images is a crucial first step when anatomical structures in the mouse brain are to be registered to a reference atlas. Practitioners usually rely on manual comparison of images or tools that assume the presence of complete images. This work explores Siamese Networks as the method for finding corresponding 2D reference atlas plates for given partial 2D mouse brain images. Siamese networks are a class of convolutional neural networks (CNNs) that use weight-shared paths to obtain low dimensional embeddings of pairs of input images. The correspondence between the partial mouse brain image and reference atlas plate is determined based on the distance between low dimensional embeddings of brain slices and atlas plates that are obtained from Siamese networks using contrastive learning. Experiments showed that Siamese CNNs can precisely identify brain slices using the Allen mouse brain atlas when training and testing images come from the same source. They achieved TOP-1 and TOP-5 accuracy of 25% and 100%, respectively, taking only 7.2 seconds to identify 29 images.
    Tesla-Rapture: A Lightweight Gesture Recognition System from mmWave Radar Point Clouds. (arXiv:2109.06448v1 [cs.CV])
    (2 min) We present Tesla-Rapture, a gesture recognition interface for point clouds generated by mmWave radars. State-of-the-art gesture recognition models are either too resource-consuming or not sufficiently accurate for integration into real-life scenarios using wearable or constrained equipment such as IoT devices (e.g. Raspberry Pi), XR hardware (e.g. HoloLens), or smartphones. To tackle this issue, we developed Tesla, a Message Passing Neural Network (MPNN) graph convolution approach for mmWave radar point clouds. The model outperforms the state of the art on two datasets in terms of accuracy while reducing the computational complexity and, hence, the execution time. In particular, the approach is able to predict a gesture almost 8 times faster than the most accurate competitor. Our performance evaluation in different scenarios (environments, angles, distances) shows that Tesla generalizes well and improves the accuracy by up to 20% in challenging scenarios like a through-wall setting and sensing at extreme angles. Utilizing Tesla, we develop Tesla-Rapture, a real-time implementation using a mmWave radar on a Raspberry Pi 4, and evaluate its accuracy and time complexity. We also publish the source code, the trained models, and the implementation of the model for embedded devices.
    Complexity-aware Adaptive Training and Inference for Edge-Cloud Distributed AI Systems. (arXiv:2109.06440v1 [cs.LG])
    (2 min) The ubiquitous use of IoT and machine learning applications is creating large amounts of data that require accurate and real-time processing. Although edge-based smart data processing can be enabled by deploying pretrained models, the energy and memory constraints of edge devices necessitate distributed deep learning between the edge and the cloud for complex data. In this paper, we propose a distributed AI system to exploit both the edge and the cloud for training and inference. We propose a new architecture, MEANet, with a main block, an extension block, and an adaptive block for the edge. The inference process can terminate at either the main block, the extension block, or the cloud. MEANet is trained to categorize inputs into easy/hard/complex classes. The main block identifies instances of easy/hard classes and classifies easy classes with high confidence. Only data with high probabilities of belonging to hard classes are sent to the extension block for prediction. Further, only if the neural network at the edge shows low confidence in the prediction is the instance considered complex and sent to the cloud for further processing. The training technique allows the majority of inference to be performed on edge devices, with the cloud used only for a small set of complex jobs, as determined by the edge. The performance of the proposed system is evaluated via extensive experiments using modified models of ResNets and MobileNetV2 on the CIFAR-100 and ImageNet datasets. The results show that the proposed distributed model achieves improved accuracy and energy consumption, indicating its capacity to adapt.
Deep Generative Models for Drug Design and Response. (arXiv:2109.06469v1 [cs.LG])
    (2 min) Designing new chemical compounds with desired pharmaceutical properties is a challenging task and takes years of development and testing. Still, a majority of new drugs fail to prove effective. The recent success of deep generative modeling holds promise for the generation and optimization of new molecules. In this review paper, we provide an overview of the current generative models and describe the necessary biological and chemical terminology, including molecular representations needed to understand the field of drug design and drug response. We present commonly used chemical and biological databases, and tools for generative modeling. Finally, we summarize the current state of generative modeling for drug design and drug response prediction, highlighting the state-of-the-art approaches and the limitations the field is currently facing.
    Specified Certainty Classification, with Application to Read Classification for Reference-Guided Metagenomic Assembly. (arXiv:2109.06677v1 [q-bio.QM])
    (2 min) Specified Certainty Classification (SCC) is a new paradigm for employing classifiers whose outputs carry uncertainties, typically in the form of Bayesian posterior probabilities. By allowing the classifier output to be less precise than one of a set of atomic decisions, SCC allows all decisions to achieve a specified level of certainty, as well as provides insights into classifier behavior by examining all decisions that are possible. Our primary illustration is read classification for reference-guided genome assembly, but we demonstrate the breadth of SCC by also analyzing COVID-19 vaccination data.
    Structure-Enhanced Pop Music Generation via Harmony-Aware Learning. (arXiv:2109.06441v1 [cs.SD])
    (2 min) Automatically composing pop music with a satisfactory structure is an attractive but challenging topic. Although musical structure is easy for humans to perceive, it is difficult to describe clearly and define accurately, and how structure should be modeled in pop music generation remains far from solved. In this paper, we propose to leverage harmony-aware learning for structure-enhanced pop music generation. On the one hand, one of the participants of harmony, the chord, represents the harmonic set of multiple notes, which is integrated closely with the spatial structure of music, texture. On the other hand, the other participant of harmony, chord progression, usually accompanies the development of the music, which promotes the temporal structure of music, form. Besides, when chords evolve into a chord progression, the texture and the form can be bridged by the harmony naturally, which contributes to the joint learning of the two structures. Furthermore, we propose the Harmony-Aware Hierarchical Music Transformer (HAT), which can exploit the structure adaptively from the music and interact with the music tokens at multiple levels to enhance the signals of the structure in various musical elements. Results of subjective and objective evaluations demonstrate that HAT significantly improves the quality of generated music, especially in terms of structure.
    Towards Better Model Understanding with Path-Sufficient Explanations. (arXiv:2109.06181v1 [cs.LG])
    (2 min) Feature-based local attribution methods are amongst the most prevalent in the explainable artificial intelligence (XAI) literature. Going beyond standard correlation, methods have recently been proposed that highlight what should be minimally sufficient to justify the classification of an input (viz. pertinent positives). While minimal sufficiency is an attractive property, the resulting explanations are often too sparse for a human to understand and evaluate the local behavior of the model, thus making it difficult to judge its overall quality. To overcome these limitations, we propose a novel method called the Path-Sufficient Explanations Method (PSEM) that outputs a sequence of sufficient explanations for a given input of strictly decreasing size (or value) -- from the original input to a minimally sufficient explanation -- which can be thought of as tracing the local boundary of the model in a smooth manner, thus providing better intuition about the local model behavior for the specific input. We validate these claims, both qualitatively and quantitatively, with experiments that show the benefit of PSEM across all three modalities (image, tabular and text). A user study depicts the strength of the method in communicating the local behavior, where (many) users are able to correctly determine the prediction made by a model.
    safe-control-gym: a Unified Benchmark Suite for Safe Learning-based Control and Reinforcement Learning. (arXiv:2109.06325v1 [cs.RO])
    (2 min) In recent years, reinforcement learning and learning-based control -- as well as the study of their safety, crucial for deployment in real-world robots -- have gained significant traction. However, to adequately gauge the progress and applicability of new results, we need the tools to equitably compare the approaches proposed by the controls and reinforcement learning communities. Here, we propose a new open-source benchmark suite, called safe-control-gym. Our starting point is OpenAI's Gym API, which is one of the de facto standards in reinforcement learning research. Yet, we highlight the reasons for its limited appeal to control theory researchers -- and safe control, in particular -- for example, the lack of analytical models and constraint specifications. Thus, we propose to extend this API with (i) the ability to specify (and query) symbolic models and constraints and (ii) simulated disturbances in the control inputs, measurements, and inertial properties. We provide implementations for three dynamic systems -- the cart-pole and the 1D and 2D quadrotors -- and two control tasks -- stabilization and trajectory tracking. To demonstrate our proposal -- and in an attempt to bring research communities closer together -- we show how to use safe-control-gym to quantitatively compare the control performance, data efficiency, and safety of multiple approaches from the areas of traditional control, learning-based control, and reinforcement learning.
    Rationales for Sequential Predictions. (arXiv:2109.06387v1 [cs.CL])
    (0 min) Sequence models are a critical component of modern NLP systems, but their predictions are difficult to explain. We consider model explanations through rationales, subsets of context that can explain individual model predictions. We find sequential rationales by solving a combinatorial optimization: the best rationale is the smallest subset of input tokens that would predict the same output as the full sequence. Enumerating all subsets is intractable, so we propose an efficient greedy algorithm to approximate this objective. The algorithm, called greedy rationalization, applies to any model. For this approach to be effective, the model should form compatible conditional distributions when making predictions on incomplete subsets of the context. This condition can be enforced with a short fine-tuning step. We study greedy rationalization on language modeling and machine translation. Compared to existing baselines, greedy rationalization is best at optimizing the combinatorial objective and provides the most faithful rationales. On a new dataset of annotated sequential rationales, greedy rationales are most similar to human rationales.
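    A sketch of what a greedy approximation of this objective could look like, assuming a hypothetical predict_proba callback that scores the model on a chosen subset of token positions (as the abstract notes, the model must handle such incomplete contexts, e.g. after a short fine-tuning step); this is an illustration of the idea, not the authors' exact algorithm:

        def greedy_rationalization(tokens, predict_proba, target, max_size=None):
            """Greedily grow a subset of context tokens until the model's prediction
            on the subset matches `target`. `predict_proba(subset_indices)` is a
            hypothetical callback returning the model's output distribution when
            conditioned only on the chosen token positions."""
            chosen, remaining = [], list(range(len(tokens)))
            max_size = max_size or len(tokens)
            while remaining and len(chosen) < max_size:
                # Add the token whose inclusion most increases the target probability.
                best = max(remaining, key=lambda i: predict_proba(chosen + [i])[target])
                chosen.append(best)
                remaining.remove(best)
                probs = predict_proba(chosen)
                if max(range(len(probs)), key=probs.__getitem__) == target:
                    break  # the rationale already yields the same prediction
            return sorted(chosen)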
    End-to-End Natural Language Understanding Pipeline for Bangla Conversational Agents. (arXiv:2107.05541v4 [cs.CL] UPDATED)
    (0 min) Chatbots are intelligent software built to be used as a replacement for human interaction. However, existing studies typically do not provide enough support for low-resource languages like Bangla. Moreover, due to the increasing popularity of social media, we can also see the rise of interactions in Bangla transliteration (mostly in English) among the native Bangla speakers. In this paper, we propose a novel approach to build a Bangla chatbot aimed to be used as a business assistant which can communicate in Bangla and Bangla Transliteration in English with high confidence consistently. Since annotated data was not available for this purpose, we had to work on the whole machine learning life cycle (data preparation, machine learning modeling, and model deployment) using Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks. While working with the skewed annotated dataset, we try out different setups and pipelines to evaluate which works best and provide possible reasoning behind the observed results. Finally, we present a pipeline for intent classification and entity extraction which achieves reasonable performance (accuracy: 83.02%, precision: 80.82%, recall: 83.02%, F1-score: 80%).
    LM-Critic: Language Models for Unsupervised Grammatical Error Correction. (arXiv:2109.06822v1 [cs.CL])
    (0 min) Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical / grammatical sentence pairs, but manually annotating such pairs can be expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. In this work, we show how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. We apply this LM-Critic and BIFI along with a large set of unlabeled sentences to bootstrap realistic ungrammatical / grammatical pairs for training a corrector. We evaluate our approach on GEC datasets across multiple domains (CoNLL-2014, BEA-2019, GMEG-wiki and GMEG-yahoo) and show that it outperforms existing methods in both the unsupervised setting (+7.7 F0.5) and the supervised setting (+0.5 F0.5).
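    The criterion itself is simple enough to sketch; log_prob and perturb below are hypothetical stand-ins for an LM scorer and a local perturbation generator, not the paper's exact implementation:

        def lm_critic(sentence, log_prob, perturb):
            """Judge `sentence` as grammatical iff the LM assigns it a log-probability
            at least as high as every sentence in its local perturbation neighborhood."""
            score = log_prob(sentence)
            return all(score >= log_prob(candidate) for candidate in perturb(sentence))

        # Example usage with stand-in callbacks (both hypothetical):
        # is_grammatical = lm_critic("He went to school.", gpt2_log_prob, word_level_perturbations)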
    Machine-Learned Prediction Equilibrium for Dynamic Traffic Assignment. (arXiv:2109.06713v1 [cs.GT])
    (0 min) We study a dynamic traffic assignment model, where agents base their instantaneous routing decisions on real-time delay predictions. We formulate a mathematically concise model and derive properties of the predictors that ensure a dynamic prediction equilibrium exists. We demonstrate the versatility of our framework by showing that it subsumes the well-known full information and instantaneous information models, in addition to admitting further realistic predictors as special cases. We complement our theoretical analysis by an experimental study, in which we systematically compare the induced average travel times of different predictors, including a machine-learning model trained on data gained from previously computed equilibrium flows, both on a synthetic and a real road network.
    Physics Driven Domain Specific Transporter Framework with Attention Mechanism for Ultrasound Imaging. (arXiv:2109.06346v1 [eess.IV])
    (0 min) Most applications of deep learning techniques in medical imaging are supervised and require a large amount of labeled data, which is expensive and requires many hours of careful annotation by experts. In this paper, we propose an unsupervised, physics-driven, domain-specific transporter framework with an attention mechanism to identify relevant key points, with applications in ultrasound imaging. The proposed framework identifies key points that provide a concise geometric representation highlighting regions with high structural variation in ultrasound videos. We incorporate physics-driven domain-specific information as a feature probability map and use the Radon transform to highlight features in specific orientations. The proposed framework has been trained on 130 lung ultrasound (LUS) videos and 113 wrist ultrasound (WUS) videos and validated on 100 LUS videos and 58 WUS videos acquired from multiple centers across the globe. Images from both datasets were independently assessed by experts to identify clinically relevant features such as A-lines, B-lines and pleura from LUS, and radial metaphysis, radial epiphysis and carpal bones from WUS videos. The key points detected from both datasets showed high sensitivity (LUS = 99%, WUS = 74%) in detecting the image landmarks identified by experts. Also, when employed for classifying the given lung images into normal and abnormal classes, the proposed approach, even with no prior training, achieved an average accuracy of 97% and an average F1-score of 95% on the task of co-classification with 3-fold cross-validation. Given the purely unsupervised nature of the proposed approach, we expect the key point detection approach to increase the applicability of ultrasound in various examinations performed in emergency and point-of-care settings.
    Exploring the Connection between Knowledge Distillation and Logits Matching. (arXiv:2109.06458v1 [cs.LG])
    (0 min) Knowledge distillation is a generalized logits matching technique for model compression. Their equivalence was previously established under the conditions of $\textit{infinity temperature}$ and $\textit{zero-mean normalization}$. In this paper, we prove that with only $\textit{infinity temperature}$, the effect of knowledge distillation equals logits matching with an extra regularization. Furthermore, we reveal that an additional weaker condition -- $\textit{equal-mean initialization}$ rather than the original $\textit{zero-mean normalization}$ -- already suffices to set up the equivalence. The key to our proof is the observation that in modern neural networks with the cross-entropy loss and softmax activation, the mean of the back-propagated gradient on the logits always remains zero.
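    One standard way to see the connection (a textbook high-temperature expansion in the style of Hinton et al., not the paper's exact derivation): with student logits $z_i$, teacher logits $v_i$, softened probabilities $q_i$, $p_i$, $N$ classes, and temperature $T$,
    $$\frac{\partial \mathcal{L}_{\mathrm{KD}}}{\partial z_i} \;=\; \frac{1}{T}\,(q_i - p_i) \;\approx\; \frac{1}{NT^2}\Big((z_i - \bar z) - (v_i - \bar v)\Big), \qquad \bar z = \tfrac{1}{N}\sum_j z_j,\; \bar v = \tfrac{1}{N}\sum_j v_j,$$
    which, up to scale, is the gradient of the logits-matching loss $\tfrac{1}{2}\sum_i \big((z_i-\bar z)-(v_i-\bar v)\big)^2$; the mean terms are exactly where conditions such as zero-mean normalization or equal-mean initialization enter.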
    In-filter Computing For Designing Ultra-light Acoustic Pattern Recognizers. (arXiv:2109.06171v1 [eess.AS])
    (2 min) We present a novel in-filter computing framework that can be used for designing ultra-light acoustic classifiers for use in smart internet-of-things (IoTs). Unlike a conventional acoustic pattern recognizer, where the feature extraction and classification are designed independently, the proposed architecture integrates the convolution and nonlinear filtering operations directly into the kernels of a Support Vector Machine (SVM). The result of this integration is a template-based SVM whose memory and computational footprint (training and inference) is light enough to be implemented on an FPGA-based IoT platform. While the proposed in-filter computing framework is general enough, in this paper, we demonstrate this concept using a Cascade of Asymmetric Resonator with Inner Hair Cells (CAR-IHC) based acoustic feature extraction algorithm. The complete system has been optimized using time-multiplexing and parallel-pipeline techniques for a Xilinx Spartan 7 series Field Programmable Gate Array (FPGA). We show that the system can achieve robust classification performance on benchmark sound recognition tasks using only ~ 1.5k Look-Up Tables (LUTs) and ~ 2.8k Flip-Flops (FFs), a significant improvement over other approaches.
    Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach. (arXiv:2109.06332v1 [cs.LG])
    (2 min) Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment. The problem becomes more challenging when the decision requirement includes satisfying some safety constraints. The problem is mathematically formulated as constrained Markov decision process (CMDP). In the literature, various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$ feasible policies. An $\epsilon$-feasible policy implies that it suffers from constraint violation. An important question here is whether we can achieve $\epsilon$-optimal cumulative reward with zero constraint violations or not. To achieve that, we advocate the use of a randomized primal-dual approach to solving the CMDP problems and propose a conservative stochastic primal-dual algorithm (CSPDA) which is shown to exhibit $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity to achieve $\epsilon$-optimal cumulative reward with zero constraint violations. In the prior works, the best available sample complexity for the $\epsilon$-optimal policy with zero constraint violation is $\tilde{\mathcal{O}}(1/\epsilon^5)$. Hence, the proposed algorithm provides a significant improvement as compared to the state of the art.
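    As a generic illustration of the primal-dual idea (not the paper's CSPDA algorithm), one gradient-style update on the Lagrangian, with a conservative slack that tightens the constraint during learning in the spirit of the zero-violation goal; all step sizes are illustrative:

        def primal_dual_step(policy_grad_reward, policy_grad_cost, theta, lam,
                             constraint_violation, lr_theta=1e-2, lr_lam=1e-2, slack=0.0):
            """One generic primal-dual update for a constrained RL objective
            max_theta min_{lam >= 0}  J_r(theta) - lam * (J_c(theta) - budget).
            `constraint_violation` is an estimate of J_c(theta) - budget; `slack`
            conservatively tightens the constraint during learning (sketch only)."""
            theta = theta + lr_theta * (policy_grad_reward - lam * policy_grad_cost)
            lam = max(0.0, lam + lr_lam * (constraint_violation + slack))
            return theta, lam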
    From Heatmaps to Structural Explanations of Image Classifiers. (arXiv:2109.06365v1 [cs.CV])
    (2 min) This paper summarizes our endeavors in the past few years in terms of explaining image classifiers, with the aim of including negative results and insights we have gained. The paper starts by describing the explainable neural network (XNN), which attempts to extract and visualize several high-level concepts purely from the deep network, without relying on human linguistic concepts. This helps users understand network classifications that are less intuitive and substantially improves user performance on a difficult fine-grained classification task of discriminating among different species of seagulls. Realizing that an important missing piece is a reliable heatmap visualization tool, we have developed I-GOS and iGOS++, which utilize integrated gradients to avoid local optima in heatmap generation and improve performance across all resolutions. During the development of those visualizations, we realized that for a significant number of images, the classifier has multiple different paths to reach a confident prediction. This has led to our recent development of structured attention graphs (SAGs), an approach that utilizes beam search to locate multiple coarse heatmaps for a single image and compactly visualizes a set of heatmaps by capturing how different combinations of image regions impact the confidence of a classifier. Through the research process, we have gained many insights into building deep network explanations, the existence and frequency of multiple explanations, and various tricks of the trade that make explanations work. In this paper, we attempt to share those insights and opinions with the readers in the hope that some of them will be informative for future researchers working on explainable deep learning.
    Detecting Safety Problems of Multi-Sensor Fusion in Autonomous Driving. (arXiv:2109.06404v1 [cs.RO])
    (0 min) Autonomous driving (AD) systems have been thriving in recent years. In general, they receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor inputs, AD systems usually leverage multi-sensor fusion (MSF) to fuse the sensor inputs and produce a more reliable understanding of the surroundings. However, MSF cannot completely eliminate the uncertainties since it lacks the knowledge about which sensor provides the most accurate data. As a result, critical consequences might happen unexpectedly. In this work, we observed that the popular MSF methods in an industry-grade Advanced Driver-Assistance System (ADAS) can mislead the car control and result in serious safety hazards. Misbehavior can happen regardless of the fusion method used and even when at least one sensor provides accurate data. To attribute the safety hazards to a MSF method, we formally define the fusion errors and propose a way to distinguish safety violations causally induced by such errors. Further, we develop a novel evolutionary-based domain-specific search framework, FusionFuzz, for the efficient detection of fusion errors. We evaluate our framework on two widely used MSF methods. Experimental results show that FusionFuzz identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the MSF methods under study.
    Mitigating Sampling Bias and Improving Robustness in Active Learning. (arXiv:2109.06321v1 [cs.LG])
    (2 min) This paper presents simple and efficient methods to mitigate sampling bias in active learning while achieving state-of-the-art accuracy and model robustness. We introduce supervised contrastive active learning by leveraging the contrastive loss for active learning under a supervised setting. We propose an unbiased query strategy that selects informative data samples of diverse feature representations with our methods: supervised contrastive active learning (SCAL) and deep feature modeling (DFM). We empirically demonstrate that our proposed methods reduce sampling bias and achieve state-of-the-art accuracy and model calibration in an active learning setup, with the query computation 26x faster than Bayesian active learning by disagreement and 11x faster than CoreSet. The proposed SCAL method outperforms by a large margin in robustness to dataset shift and out-of-distribution data.
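    For reference, a compact sketch of a supervised contrastive loss of the kind SCAL builds on (a SupCon-style formulation; the exact loss used in the paper may differ), assuming L2-normalized PyTorch embeddings:

        import torch

        def supervised_contrastive_loss(features, labels, temperature=0.1):
            """SupCon-style sketch: embeddings of the same class are pulled together,
            others pushed apart. `features` are L2-normalized (B, D) embeddings."""
            sim = features @ features.T / temperature                  # (B, B)
            logits_mask = ~torch.eye(len(labels), dtype=torch.bool)   # drop self-pairs
            sim = sim.masked_fill(~logits_mask, float('-inf'))
            log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
            pos_mask = (labels[:, None] == labels[None, :]) & logits_mask
            pos_counts = pos_mask.sum(1).clamp(min=1)
            # Average log-probability over positives for each anchor that has positives.
            loss = -(log_prob.masked_fill(~pos_mask, 0).sum(1) / pos_counts)
            return loss[pos_mask.any(1)].mean()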
    ML Based Lineage in Databases. (arXiv:2109.06339v1 [cs.DB])
    (2 min) In this work, we track the lineage of tuples throughout their database lifetime. That is, we consider a scenario in which tuples (records) that are produced by a query may affect other tuple insertions into the DB, as part of a normal workflow. As time goes on, exact provenance explanations for such tuples become deeply nested, increasingly consuming space, and resulting in decreased clarity and readability. We present a novel approach for approximating lineage tracking, using a Machine Learning (ML) and Natural Language Processing (NLP) technique; namely, word embedding. The basic idea is summarizing (and approximating) the lineage of each tuple via a small set of constant-size vectors (the number of vectors per-tuple is a hyperparameter). Therefore, our solution does not suffer from space complexity blow-up over time, and it "naturally ranks" explanations to the existence of a tuple. We devise an alternative and improved lineage tracking mechanism, that of keeping track of and querying lineage at the column level; thereby, we manage to better distinguish between the provenance features and the textual characteristics of a tuple. We integrate our lineage computations into the PostgreSQL system via an extension (ProvSQL) and experimentally exhibit useful results in terms of accuracy against exact, semiring-based, justifications. In the experiments, we focus on tuples with multiple generations of tuples in their lifelong lineage and analyze them in terms of direct and distant lineage. The experiments suggest a high usefulness potential for the proposed approximate lineage methods and the further suggested enhancements. This especially holds for the column-based vectors method which exhibits high precision and high per-level recall.
    A Practical Adversarial Attack on Contingency Detection of Smart Energy Systems. (arXiv:2109.06358v1 [cs.LG])
    (2 min) Due to the advances in computing and sensing, deep learning (DL) has been widely applied in smart energy systems (SESs). These DL-based solutions have proved their potential in improving the effectiveness and adaptiveness of control systems. However, in recent years, increasing evidence shows that DL techniques can be manipulated by adversarial attacks with carefully-crafted perturbations. Adversarial attacks have been studied in computer vision and natural language processing. However, there is very limited work focusing on adversarial attack deployment and mitigation in energy systems. In this regard, to better prepare SESs against potential adversarial attacks, we propose an innovative adversarial attack model that can practically compromise dynamical controls of the energy system. We also optimize the deployment of the proposed adversarial attack model by employing deep reinforcement learning (RL) techniques. In this paper, we present our first-stage work in this direction. In the simulation section, we evaluate the performance of our proposed adversarial attack model using the standard IEEE 9-bus system.
    Policy Optimization Using Semiparametric Models for Dynamic Pricing. (arXiv:2109.06368v1 [cs.LG])
    (2 min) In this paper, we study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise. Products are sold one at a time, and only a binary response indicating success or failure of a sale is observed. Our model setting is similar to Javanmard and Nazerzadeh [2019] except that we expand the demand curve to a semiparametric model and need to learn dynamically both parametric and nonparametric components. We propose a dynamic statistical learning and decision-making policy that combines semiparametric estimation from a generalized linear model with an unknown link and online decision-making to minimize regret (maximize revenue). Under mild conditions, we show that for a market noise c.d.f. $F(\cdot)$ with $m$-th order derivative ($m\geq 2$), our policy achieves a regret upper bound of $\tilde{O}_{d}(T^{\frac{2m+1}{4m-1}})$, where $T$ is time horizon and $\tilde{O}_{d}$ is the order that hides logarithmic terms and the dimensionality of feature $d$. The upper bound is further reduced to $\tilde{O}_{d}(\sqrt{T})$ if $F$ is super smooth whose Fourier transform decays exponentially. In terms of dependence on the horizon $T$, these upper bounds are close to $\Omega(\sqrt{T})$, the lower bound where $F$ belongs to a parametric class. We further generalize these results to the case with dynamically dependent product features under the strong mixing condition.
    A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space. (arXiv:2109.06324v1 [cs.CL])
    (2 min) In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family training data as a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by looking at the effect of morphological segmentation on English-Inuktitut alignment, in addition to examining the effect of word order agreement on isomorphism for 66 zero-shot language pairs from a different corpus. We make the data and code for our experiments publicly available.
    State Relevance for Off-Policy Evaluation. (arXiv:2109.06310v1 [cs.LG])
    (2 min) Importance sampling-based estimators for off-policy evaluation (OPE) are valued for their simplicity, unbiasedness, and reliance on relatively few assumptions. However, the variance of these estimators is often high, especially when trajectories are of different lengths. In this work, we introduce Omitting-States-Irrelevant-to-Return Importance Sampling (OSIRIS), an estimator which reduces variance by strategically omitting likelihood ratios associated with certain states. We formalize the conditions under which OSIRIS is unbiased and has lower variance than ordinary importance sampling, and we demonstrate these properties empirically.
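    A mechanical sketch of the difference between ordinary per-trajectory importance sampling and an estimator that omits ratios at states deemed irrelevant to the return; the `relevant` predicate is a hypothetical stand-in for the paper's criterion, which is what establishes when the omission preserves unbiasedness:

        def per_trajectory_is(traj, pi_e, pi_b, relevant):
            """Return (ordinary IS estimate, OSIRIS-style estimate) for one trajectory.
            `traj` is a list of (state, action, reward); pi_e / pi_b give action
            probabilities under the evaluation and behavior policies."""
            full, pruned, ret = 1.0, 1.0, 0.0
            for s, a, r in traj:
                ratio = pi_e(a, s) / pi_b(a, s)
                full *= ratio
                if relevant(s):          # skip the ratio at states judged irrelevant
                    pruned *= ratio
                ret += r
            return full * ret, pruned * ret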
    Pre-emptive learning-to-defer for sequential medical decision-making under uncertainty. (arXiv:2109.06312v1 [cs.LG])
    (2 min) We propose SLTD ('Sequential Learning-to-Defer'), a framework for learning to defer pre-emptively to an expert in sequential decision-making settings. SLTD measures the likelihood that deferring now, rather than later, will improve the value, based on the underlying uncertainty in the dynamics. In particular, we focus on the non-stationarity in the dynamics to accurately learn the deferral policy. We demonstrate that our pre-emptive deferral can identify regions where the current policy has a low probability of improving outcomes. SLTD outperforms existing non-sequential learning-to-defer baselines, whilst reducing overall uncertainty on multiple synthetic and real-world simulators with non-stationary dynamics. We further derive and decompose the propagated (long-term) uncertainty for interpretation by the domain expert to provide an indication of when the model's performance is reliable.
    Exploring Personality and Online Social Engagement: An Investigation of MBTI Users on Twitter. (arXiv:2109.06402v1 [cs.CL])
    (0 min) Text-based personality prediction by computational models is an emerging field with the potential to significantly improve on key weaknesses of survey-based personality assessment. We investigate 3848 profiles from Twitter with self-labeled Myers-Briggs personality traits (MBTI) - a framework closely related to the Five Factor Model of personality - to better understand how text-based digital traces from social engagement online can be used to predict user personality traits. We leverage BERT, a state-of-the-art NLP architecture based on deep learning, to analyze various sources of text that hold most predictive power for our task. We find that biographies, statuses, and liked tweets contain significant predictive power for all dimensions of the MBTI system. We discuss our findings and their implications for the validity of the MBTI and the lexical hypothesis, a foundational theory underlying the Five Factor Model that links language use and behavior. Our results hold optimistic implications for personality psychologists, computational linguists, and other social scientists aiming to predict personality from observational text data and explore the links between language and core behavioral traits.
    Exploring the Long Short-Term Dependencies to Infer Shot Influence in Badminton Matches. (arXiv:2109.06431v1 [cs.LG])
    (0 min) Identifying significant shots in a rally is important for evaluating players' performance in badminton matches. While several studies have quantified player performance in other sports, analysis of badminton data has remained largely untouched. In this paper, we introduce a badminton language to fully describe the process of a shot and propose a deep learning model composed of a novel short-term extractor and a long-term encoder for capturing a shot-by-shot sequence in a badminton rally by framing the problem as predicting a rally result. Our model incorporates an attention mechanism to enable the transparency of the action sequence to the rally result, which is essential for badminton experts to gain interpretable predictions. Experimental evaluation based on a real-world dataset demonstrates that our proposed model outperforms the strong baselines. The source code is publicly available at https://github.com/yao0510/Shot-Influence.
    On the regularized risk of distributionally robust learning over deep neural networks. (arXiv:2109.06294v1 [math.OC])
    (2 min) In this paper we explore the relation between distributionally robust learning and different forms of regularization to enforce robustness of deep neural networks. In particular, starting from a concrete min-max distributionally robust problem, and using tools from optimal transport theory, we derive first-order and second-order approximations to the distributionally robust problem in terms of appropriate regularized risk minimization problems. In the context of deep ResNet models, we identify the structure of the resulting regularization problems as mean-field optimal control problems where the number and dimension of state variables is within a dimension-free factor of the dimension of the original unrobust problem. Using the Pontryagin maximum principles associated with these problems, we motivate a family of scalable algorithms for the training of robust neural networks. Our analysis recovers some results and algorithms known in the literature (in settings explained throughout the paper) and provides many other theoretical and algorithmic insights that to our knowledge are novel. In our analysis we employ tools that we deem useful for a future analysis of more general adversarial learning problems.
    Theoretical Guarantees of Fictitious Discount Algorithms for Episodic Reinforcement Learning and Global Convergence of Policy Gradient Methods. (arXiv:2109.06362v1 [cs.LG])
    (2 min) When designing algorithms for finite-time-horizon episodic reinforcement learning problems, a common approach is to introduce a fictitious discount factor and use stationary policies for approximations. Empirically, it has been shown that the fictitious discount factor helps reduce variance, and stationary policies serve to save the per-iteration computational cost. Theoretically, however, there is no existing work on convergence analysis for algorithms with this fictitious discount recipe. This paper takes the first step towards analyzing these algorithms. It focuses on two vanilla policy gradient (VPG) variants: the first being a widely used variant with discounted advantage estimations (DAE), the second with an additional fictitious discount factor in the score functions of the policy gradient estimators. Non-asymptotic convergence guarantees are established for both algorithms, and the additional discount factor is shown to reduce the bias introduced in DAE and thus improve the algorithm convergence asymptotically. A key ingredient of our analysis is to connect three settings of Markov decision processes (MDPs): the finite-time-horizon, the average reward and the discounted settings. To our best knowledge, this is the first theoretical guarantee on fictitious discount algorithms for the episodic reinforcement learning of finite-time-horizon MDPs, which also leads to the (first) global convergence of policy gradient methods for finite-time-horizon episodic reinforcement learning.
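    A schematic of the two ingredients discussed above -- discounted advantage estimation and an optional extra fictitious discount in the score-function weights -- with the shapes, baseline, and discount value purely illustrative, not the paper's exact estimators:

        import numpy as np

        def vpg_terms(rewards, logpi_grads, values, gamma=0.99, extra_discount=False):
            """Sketch of the two VPG variants: discounted advantage estimation (DAE),
            optionally with an additional fictitious discount factor multiplying the
            score-function terms. `logpi_grads[t]` is grad_theta log pi(a_t|s_t);
            `values[t]` is a baseline estimate V(s_t)."""
            T = len(rewards)
            returns, running = np.zeros(T), 0.0
            for t in reversed(range(T)):
                running = rewards[t] + gamma * running
                returns[t] = running
            advantages = returns - values
            grad = 0.0
            for t in range(T):
                weight = (gamma ** t) if extra_discount else 1.0
                grad = grad + weight * advantages[t] * logpi_grads[t]
            return grad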
    Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation. (arXiv:2109.06379v1 [cs.CL])
    (0 min) Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives and calls for different properties of the generated text. The complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on specific intuitions. In this paper, we propose a unifying perspective based on the nature of information change in NLG tasks, including compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). Information alignment between input, context, and output text plays a common central role in characterizing the generation. With automatic alignment prediction models, we develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data. Experiments show the uniformly designed metrics achieve stronger or comparable correlations with human judgement compared to state-of-the-art metrics in each of the diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog.
    TREATED: Towards Universal Defense against Textual Adversarial Attacks. (arXiv:2109.06176v1 [cs.LG])
    (2 min) Recent work shows that deep neural networks are vulnerable to adversarial examples. Much work studies adversarial example generation, while very little work focuses on more critical adversarial defense. Existing adversarial detection methods usually make assumptions about the adversarial example and attack method (e.g., the word frequency of the adversarial example, the perturbation level of the attack method). However, this limits the applicability of the detection method. To this end, we propose TREATED, a universal adversarial detection method that can defend against attacks of various perturbation levels without making any assumptions. TREATED identifies adversarial examples through a set of well-designed reference models. Extensive experiments on three competitive neural networks and two widely used datasets show that our method achieves better detection performance than baselines. We finally conduct ablation studies to verify the effectiveness of our method.
    Neural Networks with Physics-Informed Architectures and Constraints for Dynamical Systems Modeling. (arXiv:2109.06407v1 [cs.LG])
    (2 min) Effective inclusion of physics-based knowledge into deep neural network models of dynamical systems can greatly improve data efficiency and generalization. Such a-priori knowledge might arise from physical principles (e.g., conservation laws) or from the system's design (e.g., the Jacobian matrix of a robot), even if large portions of the system dynamics remain unknown. We develop a framework to learn dynamics models from trajectory data while incorporating a-priori system knowledge as inductive bias. More specifically, the proposed framework uses physics-based side information to inform the structure of the neural network itself, and to place constraints on the values of the outputs and the internal states of the model. It represents the system's vector field as a composition of known and unknown functions, the latter of which are parametrized by neural networks. The physics-informed constraints are enforced via the augmented Lagrangian method during the model's training. We experimentally demonstrate the benefits of the proposed approach on a variety of dynamical systems -- including a benchmark suite of robotics environments featuring large state spaces, non-linear dynamics, external forces, contact forces, and control inputs. By exploiting a-priori system knowledge during training, the proposed approach learns to predict the system dynamics two orders of magnitude more accurately than a baseline approach that does not include prior knowledge, given the same training dataset.
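    A minimal sketch of the two ingredients described above, with hypothetical names and dimensions: a vector field composed of a known physics term plus a neural residual, and an augmented Lagrangian term for an equality constraint (the paper's actual constraint handling and multiplier updates are not reproduced here):

        import torch
        import torch.nn as nn

        class HybridDynamics(nn.Module):
            """dx/dt = f_known(x, u) + f_nn(x, u): known physics plus a learned residual."""
            def __init__(self, f_known, state_dim, control_dim):
                super().__init__()
                self.f_known = f_known
                self.f_nn = nn.Sequential(
                    nn.Linear(state_dim + control_dim, 64), nn.Tanh(),
                    nn.Linear(64, state_dim))

            def forward(self, x, u):
                return self.f_known(x, u) + self.f_nn(torch.cat([x, u], dim=-1))

        def augmented_lagrangian_loss(pred, target, constraint_violation, lam, mu):
            # Fit error plus augmented Lagrangian terms for an equality constraint g = 0.
            fit = ((pred - target) ** 2).mean()
            g = constraint_violation.mean()
            return fit + lam * g + 0.5 * mu * g ** 2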
    Automatic Tuning of Tensorflow's CPU Backend using Gradient-Free Optimization Algorithms. (arXiv:2109.06266v1 [cs.LG])
    (2 min) Modern deep learning (DL) applications are built using DL libraries and frameworks such as TensorFlow and PyTorch. These frameworks have complex parameters and tuning them to obtain good training and inference performance is challenging for typical users, such as DL developers and data scientists. Manual tuning requires deep knowledge of the user-controllable parameters of DL frameworks as well as the underlying hardware. It is a slow and tedious process, and it typically delivers sub-optimal solutions. In this paper, we treat the problem of tuning parameters of DL frameworks to improve training and inference performance as a black-box optimization problem. We then investigate applicability and effectiveness of Bayesian optimization (BO), genetic algorithm (GA), and Nelder-Mead simplex (NMS) to tune the parameters of TensorFlow's CPU backend. While prior work has already investigated the use of Nelder-Mead simplex for a similar problem, it does not provide insights into the applicability of other more popular algorithms. Towards that end, we provide a systematic comparative analysis of all three algorithms in tuning TensorFlow's CPU backend on a variety of DL models. Our findings reveal that Bayesian optimization performs the best on the majority of models. There are, however, cases where it does not deliver the best results.
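    One plausible way to pose this tuning as a black-box problem, sketched here with SciPy's Nelder-Mead (one of the three algorithms compared). The benchmark script name and the parameter set are placeholders; the script itself would set the thread counts via TensorFlow's tf.config.threading API before building the model:

        import subprocess, time
        import numpy as np
        from scipy.optimize import minimize

        def step_time(params):
            # Black-box objective: launch a (hypothetical) one-step training benchmark
            # with the candidate thread counts and return the measured wall-clock time.
            # Inside the script, tf.config.threading.set_intra_op_parallelism_threads and
            # set_inter_op_parallelism_threads would be called before any TensorFlow op.
            intra, inter = (max(1, int(round(p))) for p in params)
            start = time.time()
            subprocess.run(["python", "benchmark_one_step.py",
                            "--intra", str(intra), "--inter", str(inter)], check=True)
            return time.time() - start

        result = minimize(step_time, x0=np.array([4.0, 2.0]),
                          method="Nelder-Mead", options={"maxfev": 40})
        print("best (intra, inter):", result.x, "seconds:", result.fun)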
    Machine Learning for Online Algorithm Selection under Censored Feedback. (arXiv:2109.06234v1 [cs.LG])
    (2 min) In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms. For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime. As the latter is known to exhibit a heavy-tail distribution, an algorithm is normally stopped when exceeding a predefined upper time limit. As a consequence, machine learning methods used to optimize an algorithm selection strategy in a data-driven manner need to deal with right-censored samples, a problem that has received little attention in the literature so far. In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem. Moreover, we adapt them towards runtime-oriented losses, allowing for partially censored data while keeping space and time complexity independent of the time horizon. In an extensive experimental evaluation on an adapted version of the ASlib benchmark, we demonstrate that theoretically well-founded methods based on Thompson sampling perform particularly strongly and improve upon existing methods.
    Program-to-Circuit: Exploiting GNNs for Program Representation and Circuit Translation. (arXiv:2109.06265v1 [cs.LG])
    (2 min) Circuit design is complicated and requires extensive domain-specific expertise. One major obstacle to agile hardware development is the considerable time consumed by accurate circuit quality evaluation. To significantly expedite circuit evaluation during the translation from behavioral languages to circuit designs, we formulate it as a Program-to-Circuit problem, aiming to exploit the representation power of graph neural networks (GNNs) by representing C/C++ programs as graphs. The goal of this work is four-fold. First, we build a standard benchmark containing 40k C/C++ programs, each of which is translated to a circuit design with actual hardware quality metrics, aiming to facilitate the development of effective GNNs targeting this high-demand circuit design area. Second, 14 state-of-the-art GNN models are analyzed on the Program-to-Circuit problem. We identify key design challenges of this problem that existing GNNs do not yet solve and that must be handled carefully. The goal is to provide domain-specific knowledge for designing GNNs with suitable inductive biases. Third, we discuss three sets of real-world benchmarks for GNN generalization evaluation, and analyze the performance gap between standard programs and the real-case ones. The goal is to enable transfer learning from limited training data to real-world large-scale circuit design problems. Fourth, the Program-to-Circuit problem is a representative within the Program-to-X framework, a set of program-based analysis problems with various downstream tasks. An in-depth understanding of the strengths and weaknesses of GNNs applied to Program-to-Circuit could largely benefit the entire Program-to-X family. Pioneering in this direction, we expect more GNN endeavors to revolutionize this high-demand Program-to-Circuit problem and to enrich the expressiveness of GNNs on programs.
    Mitigating Catastrophic Forgetting in Scheduled Sampling with Elastic Weight Consolidation in Neural Machine Translation. (arXiv:2109.06308v1 [cs.CL])
    (2 min) Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e. a discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and often empirically successful approach which addresses this issue by incorporating model-generated prefixes into the training process. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that it ameliorates exposure bias by increasing model reliance on the input sequence. We also observe that as a side-effect, it worsens performance when the model-generated prefix is correct, a form of catastrophic forgetting. We propose using Elastic Weight Consolidation as a trade-off between mitigating exposure bias and retaining output quality. Experiments on two IWSLT'14 translation tasks demonstrate that our approach alleviates catastrophic forgetting and significantly improves BLEU compared to standard scheduled sampling.
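    A minimal sketch of the Elastic Weight Consolidation penalty used as the trade-off term; the Fisher diagonal and anchor parameters are assumed to be precomputed on the teacher-forced (maximum-likelihood) model, and all names are placeholders:

        import torch

        def ewc_penalty(model, fisher_diag, anchor_params, lam=1.0):
            # EWC: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2, where F is the
            # diagonal Fisher information and theta_star are the pre-scheduled-sampling weights.
            penalty = 0.0
            for name, p in model.named_parameters():
                if name in fisher_diag:
                    penalty = penalty + (fisher_diag[name] * (p - anchor_params[name]) ** 2).sum()
            return 0.5 * lam * penalty

        # total_loss = scheduled_sampling_loss + ewc_penalty(model, fisher, anchors, lam=0.1)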
    Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation. (arXiv:2109.06253v1 [cs.CL])
    (0 min) Neural Machine Translation (NMT) is known to suffer from a beam-search problem: after a certain point, increasing beam size causes an overall drop in translation quality. This effect is especially pronounced for long sentences. While much work has been done analyzing this phenomenon, primarily for autoregressive NMT models, there is still no consensus on its underlying cause. In this work, we analyze errors that cause major quality degradation with large beams in NMT and Automatic Speech Recognition (ASR). We show that a factor that strongly contributes to the quality degradation with large beams is dataset length bias: NMT datasets are strongly biased towards short sentences. To mitigate this issue, we propose a new data augmentation technique, Multi-Sentence Resampling (MSR). This technique extends the training examples by concatenating several sentences from the original dataset to make a long training example. We demonstrate that MSR significantly reduces degradation with growing beam size and improves final translation quality on the IWSLT15 En-Vi, IWSLT17 En-Fr, and WMT14 En-De datasets.
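    The augmentation itself is simple to sketch; the separator, sampling scheme, and maximum number of concatenated sentences below are assumptions, not the paper's exact settings:

        import random

        def multi_sentence_resample(src_sents, tgt_sents, max_concat=4, seed=0):
            # Build a new parallel corpus in which each example is the concatenation
            # of 1..max_concat randomly sampled original sentence pairs.
            rng = random.Random(seed)
            new_src, new_tgt = [], []
            for _ in range(len(src_sents)):
                k = rng.randint(1, max_concat)
                idx = [rng.randrange(len(src_sents)) for _ in range(k)]
                new_src.append(" ".join(src_sents[i] for i in idx))
                new_tgt.append(" ".join(tgt_sents[i] for i in idx))
            return new_src, new_tgt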
    Deep Generative Models to Extend Active Directory Graphs with Honeypot Users. (arXiv:2109.06180v1 [cs.CR])
    (0 min) Active Directory (AD) is a crucial element of large organizations, given its central role in managing access to resources. Since AD is used by all users in the organization, it is hard to detect attackers. We propose to generate and place fake users (honeyusers) in AD structures to help detect attacks. However, not any honeyuser will attract attackers. Our method generates honeyusers with a Variational Autoencoder that enriches the AD structure with well-positioned honeyusers. It first learns the embeddings of the original nodes and edges in the AD, then it uses a modified Bidirectional DAG-RNN to encode the parameters of the probability distribution of the latent space of node representations. Finally, it samples nodes from this distribution and uses an MLP to decide where the nodes are connected. The model was evaluated by the similarity of the generated AD with the original, by the positions of the new nodes, by the similarity with GraphRNN and finally by making real intruders attack the generated AD structure to see if they select the honeyusers. Results show that our machine learning model is good enough to generate well-placed honeyusers for existing AD structures so that intruders are lured into them.
    Generatively Augmented Neural Network Watchdog for Image Classification Networks. (arXiv:2109.06168v1 [cs.CV])
    (0 min) The identification of out-of-distribution data is vital to the deployment of classification networks. For example, a generic neural network that has been trained to differentiate between images of dogs and cats can only classify an input as either a dog or a cat. If a picture of a car or a kumquat were to be supplied to this classifier, the result would still be either a dog or a cat. In order to mitigate this, techniques such as the neural network watchdog have been developed. A watchdog pairs the classifier with an autoencoder: the compression of the image input into the latent layer of the autoencoder defines the region of in-distribution in the image space. This in-distribution set of input data has a corresponding boundary in the image space. The watchdog assesses whether inputs are inside or outside this boundary. This paper demonstrates how to sharpen this boundary using generative network training data augmentation, thereby improving the discrimination and overall performance of the watchdog.
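    A minimal sketch of the watchdog gating described above; the autoencoder and classifier are assumed to be already trained callables, and the quantile-based threshold calibration is an assumption:

        import numpy as np

        def fit_threshold(autoencoder, in_dist_images, quantile=0.99):
            # Calibrate the watchdog boundary as a quantile of the reconstruction
            # error observed on known in-distribution data.
            errors = [np.mean((autoencoder(x) - x) ** 2) for x in in_dist_images]
            return np.quantile(errors, quantile)

        def watchdog_gate(autoencoder, classifier, x, threshold):
            # Only pass inputs whose reconstruction error falls inside the boundary.
            if np.mean((autoencoder(x) - x) ** 2) > threshold:
                return "out-of-distribution"   # refuse to classify
            return classifier(x)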
    Few-Shot Intent Detection via Contrastive Pre-Training and Fine-Tuning. (arXiv:2109.06349v1 [cs.CL])
    (0 min) In this work, we focus on a more challenging few-shot intent detection scenario where many intents are fine-grained and semantically similar. We present a simple yet effective few-shot intent detection schema via contrastive pre-training and fine-tuning. Specifically, we first conduct self-supervised contrastive pre-training on collected intent datasets, which implicitly learns to discriminate semantically similar utterances without using any labels. We then perform few-shot intent detection together with supervised contrastive learning, which explicitly pulls utterances from the same intent closer and pushes utterances across different intents farther. Experimental results show that our proposed method achieves state-of-the-art performance on three challenging intent detection datasets under 5-shot and 10-shot settings.
    MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks. (arXiv:2109.06275v1 [cs.AI])
    (0 min) An ideal integration of autonomous agents in a human world implies that they are able to collaborate on human terms. In particular, theory of mind plays an important role in maintaining common ground during human collaboration and communication. To enable theory of mind modeling in situated interactions, we introduce a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners' beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication. As a first step towards our goal of developing embodied AI agents able to infer belief states of collaborative partners in situ, we build and present results on computational models for several theory of mind tasks.
    Statistical limits of dictionary learning: random matrix theory and the spectral replica method. (arXiv:2109.06610v1 [cs.IT])
    (2 min) We consider increasingly complex models of matrix denoising and dictionary learning in the Bayes-optimal setting, in the challenging regime where the matrices to infer have a rank growing linearly with the system size. This is in contrast with most existing literature concerned with the low-rank (i.e., constant-rank) regime. We first consider a class of rotationally invariant matrix denoising problems whose mutual information and minimum mean-square error are computable using standard techniques from random matrix theory. Next, we analyze the more challenging models of dictionary learning. To do so we introduce a novel combination of the replica method from statistical mechanics together with random matrix theory, coined spectral replica method. It allows us to conjecture variational formulas for the mutual information between hidden representations and the noisy data as well as for the overlaps quantifying the optimal reconstruction error. The proposed methods reduce the number of degrees of freedom from $\Theta(N^2)$ (matrix entries) to $\Theta(N)$ (eigenvalues or singular values), and yield Coulomb gas representations of the mutual information which are reminiscent of matrix models in physics. The main ingredients are the use of Harish-Chandra-Itzykson-Zuber spherical integrals combined with a new replica symmetric decoupling ansatz at the level of the probability distributions of eigenvalues (or singular values) of certain overlap matrices.
    Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation. (arXiv:2109.06604v1 [cs.CL])
    (2 min) Recently, $k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, it heavily relies on high-quality in-domain parallel corpora, limiting its capability on unsupervised domain adaptation, where in-domain parallel corpora are scarce or nonexistent. In this paper, we propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval. To this end, we first introduce an autoencoder task based on the target language, and then insert lightweight adapters into the original NMT model to map the token-level representation of this task to the ideal representation of translation task. Experiments on multi-domain datasets demonstrate that our proposed approach significantly improves the translation accuracy with target-side monolingual data, while achieving comparable performance with back-translation.
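    The underlying $k$NN-MT interpolation step can be sketched as follows; the datastore construction from target-side monolingual data with adapters, which is this paper's contribution, is not reproduced, and the temperature, $k$, and interpolation weight lambda are placeholders:

        import numpy as np

        def knn_mt_distribution(query, keys, values, p_nmt, k=8, temperature=10.0, lam=0.5):
            # query:  decoder hidden state for the current step, shape (d,)
            # keys:   datastore keys, shape (N, d); values: target token ids, shape (N,)
            # p_nmt:  NMT model distribution over the vocabulary, shape (V,)
            d2 = np.sum((keys - query) ** 2, axis=1)   # squared L2 distances
            nn = np.argsort(d2)[:k]                    # k nearest neighbors
            w = np.exp(-d2[nn] / temperature)
            w = w / w.sum()
            p_knn = np.zeros_like(p_nmt)
            np.add.at(p_knn, values[nn], w)            # scatter neighbor weights onto tokens
            return lam * p_knn + (1.0 - lam) * p_nmt   # interpolate kNN and NMT distributions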
    DSDF: An approach to handle stochastic agents in collaborative multi-agent reinforcement learning. (arXiv:2109.06609v1 [cs.LG])
    (2 min) Multi-agent reinforcement learning has received a lot of attention in recent years and has applications in many different areas. Existing methods involving centralized training and decentralized execution attempt to train the agents towards learning a pattern of coordinated actions that arrives at an optimal joint policy. However, if some agents are stochastic to varying degrees, these methods often fail to converge and provide poor coordination among agents. In this paper we show how this stochasticity of agents, which could be a result of malfunction or aging of robots, can add to the uncertainty in coordination and thereby contribute to unsatisfactory global coordination. In this case, the deterministic agents have to understand the behavior and limitations of the stochastic agents while arriving at an optimal joint policy. Our solution, DSDF, tunes the discount factor for each agent according to its uncertainty and uses these values to update the utility networks of the individual agents. DSDF also imparts a degree of reliability to the coordination, granting stochastic agents tasks that are immediate and of shorter trajectory, with the deterministic agents taking the tasks that involve longer planning. Such a method enables joint coordination among agents, some of which may be only partially performing, and can thereby reduce or delay the investment in agent/robot replacement in many circumstances. Results on benchmark environments for different scenarios show the efficacy of the proposed approach when compared with existing approaches.
    Learnable Discrete Wavelet Pooling (LDW-Pooling) For Convolutional Networks. (arXiv:2109.06638v1 [cs.CV])
    (2 min) Pooling is a simple but essential layer in modern deep CNN architectures for feature aggregation and extraction. Typical CNN design focuses on the conv layers and activation functions, while leaving the pooling layers with fewer options. We introduce Learnable Discrete Wavelet Pooling (LDW-Pooling), which can be applied universally to replace standard pooling operations and better extract features with improved accuracy and efficiency. Motivated by wavelet theory, we adopt the low-pass (L) and high-pass (H) filters horizontally and vertically for pooling on a 2D feature map. Feature signals are decomposed into four (LL, LH, HL, HH) subbands to retain features better and avoid dropping information. The wavelet transform ensures features after pooling can be fully preserved and recovered. We then adopt energy-based attention learning to select crucial and representative features. LDW-Pooling is effective and efficient when compared with other state-of-the-art pooling techniques such as WaveletPooling and LiftPooling. Extensive experimental validation shows that LDW-Pooling can be applied to a wide range of standard CNN architectures and consistently outperforms standard (max, mean, mixed, and stochastic) pooling operations.
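    For intuition, a minimal sketch of one level of a fixed 2D Haar decomposition into the four subbands named above (the paper's learnable filters and the energy-based attention are not reproduced here):

        import numpy as np

        def haar_pool(x):
            # One level of a 2D Haar transform on a feature map x of shape (H, W),
            # H and W even. Returns the LL, LH, HL, HH subbands, each (H//2, W//2).
            a = x[0::2, 0::2]; b = x[0::2, 1::2]
            c = x[1::2, 0::2]; d = x[1::2, 1::2]
            ll = (a + b + c + d) / 2.0   # low-pass in both directions (pooled output)
            lh = (a - b + c - d) / 2.0   # horizontal detail
            hl = (a + b - c - d) / 2.0   # vertical detail
            hh = (a - b - c + d) / 2.0   # diagonal detail
            return ll, lh, hl, hh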
    Sum-Product-Attention Networks: Leveraging Self-Attention in Probabilistic Circuits. (arXiv:2109.06587v1 [cs.LG])
    (2 min) Probabilistic circuits (PCs) have become the de-facto standard for learning and inference in probabilistic modeling. We introduce Sum-Product-Attention Networks (SPAN), a new generative model that integrates probabilistic circuits with Transformers. SPAN uses self-attention to select the most relevant parts of a probabilistic circuit, here sum-product networks, to improve the modeling capability of the underlying sum-product network. We show that while modeling, SPAN focuses on a specific set of independence assumptions in every product layer of the sum-product network. Our empirical evaluations show that SPAN outperforms state-of-the-art probabilistic generative models on various benchmark data sets, as well as being an efficient generative image model.
    Sensor Adversarial Traits: Analyzing Robustness of 3D Object Detection Sensor Fusion Models. (arXiv:2109.06363v1 [cs.CV])
    (2 min) A critical aspect of autonomous vehicles (AVs) is the object detection stage, which is increasingly being performed with sensor fusion models: multimodal 3D object detection models which utilize both 2D RGB image data and 3D data from a LIDAR sensor as inputs. In this work, we perform the first study to analyze the robustness of a high-performance, open source sensor fusion model architecture towards adversarial attacks and challenge the popular belief that the use of additional sensors automatically mitigates the risk of adversarial attacks. We find that despite the use of a LIDAR sensor, the model is vulnerable to our purposefully crafted image-based adversarial attacks, including disappearance, universal patch, and spoofing. After identifying the underlying reason, we explore some potential defenses and provide some recommendations for improved sensor fusion models.
    Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization. (arXiv:2008.08170v4 [math.OC] UPDATED)
    (2 min) In the paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method to solve stochastic mini-optimization problems. We prove that the Acc-ZOM method achieves a lower query complexity of $\tilde{O}(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$ where $d$ denotes the parameter dimension. In particular, the Acc-ZOM does not need large batches that are required in the existing zeroth-order stochastic algorithms. At the same time, we propose an accelerated zeroth-order momentum descent ascent (Acc-ZOMDA) method for black-box minimax-optimization. We prove that the Acc-ZOMDA method reaches the best known query complexity of $\tilde{O}((d_1+d_2)\kappa_y^{3}\epsilon^{-3})$ without large batches for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote dimensions of optimization parameters and $\kappa_y$ is condition number. Moreover, we propose an accelerated first-order momentum descent ascent (Acc-MDA) method for solving white-box minimax problems, and prove that it achieves a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ given batch size $b=\kappa_y^{4}$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(\kappa_y^{1/2})$. Extensive experimental results on the black-box adversarial attack to deep neural networks (DNNs) and poisoning attack demonstrate the efficiency of our algorithms.
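    The basic building block of such zeroth-order methods is a two-point finite-difference gradient estimate along random directions. The sketch below shows the standard Gaussian-smoothing estimator, not the paper's exact estimator or its accelerated momentum scheme:

        import numpy as np

        def zo_gradient(f, x, mu=1e-3, num_dirs=10, seed=0):
            # Two-point zeroth-order gradient estimate:
            #   g = average over random Gaussian directions u of (f(x + mu*u) - f(x)) / mu * u
            rng = np.random.default_rng(seed)
            fx = f(x)
            g = np.zeros_like(x)
            for _ in range(num_dirs):
                u = rng.standard_normal(x.shape)
                g += (f(x + mu * u) - fx) / mu * u
            return g / num_dirs

        # usage: zo_gradient(lambda z: float(np.sum(z ** 2)), np.ones(5))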
    PRS-Net: Planar Reflective Symmetry Detection Net for 3D Models. (arXiv:1910.06511v6 [cs.GR] UPDATED)
    (2 min) In geometry processing, symmetry is a universal type of high-level structural information of 3D models and benefits many geometry processing tasks including shape segmentation, alignment, matching, and completion. Thus it is an important problem to analyze various symmetry forms of 3D shapes. Planar reflective symmetry is the most fundamental one. Traditional methods based on spatial sampling can be time-consuming and may not be able to identify all the symmetry planes. In this paper, we present a novel learning framework to automatically discover global planar reflective symmetry of a 3D shape. Our framework trains an unsupervised 3D convolutional neural network to extract global model features and then outputs possible global symmetry parameters, where input shapes are represented using voxels. We introduce a dedicated symmetry distance loss along with a regularization loss to avoid generating duplicated symmetry planes. Our network can also identify generalized cylinders by predicting their rotation axes. We further provide a method to remove invalid and duplicated planes and axes. We demonstrate that our method is able to produce reliable and accurate results. Our neural network based method is hundreds of times faster than the state-of-the-art methods, which are based on sampling. Our method is also robust even with noisy or incomplete input surfaces.
    Bayesian AirComp with Sign-Alignment Precoding for Wireless Federated Learning. (arXiv:2109.06579v1 [eess.SP])
    (2 min) In this paper, we consider the problem of wireless federated learning based on the sign stochastic gradient descent (signSGD) algorithm via a multiple access channel. When sending locally computed gradient sign information, each mobile device needs to apply precoding to circumvent wireless fading effects. In practice, however, acquiring perfect knowledge of channel state information (CSI) at all mobile devices is infeasible. In this paper, we present a simple yet effective precoding method with limited channel knowledge, called sign-alignment precoding. The idea of sign-alignment precoding is to protect the transmitted signs against flipping errors caused by wireless fading. Under the Gaussian prior assumption on the local gradients, we also derive the mean squared error (MSE)-optimal aggregation function called Bayesian over-the-air computation (BayAirComp). Our key finding is that one-bit precoding with BayAirComp aggregation can provide better learning performance than the existing precoding method, even when the latter uses perfect CSI with AirComp aggregation.
    HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO. (arXiv:2109.06716v1 [cs.LG])
    (2 min) To achieve peak predictive performance, hyperparameter optimization (HPO) is a crucial component of machine learning and its applications. Over the last years, the number of efficient algorithms and tools for HPO has grown substantially. At the same time, the community is still lacking realistic, diverse, computationally cheap, and standardized benchmarks. This is especially the case for multi-fidelity HPO methods. To close this gap, we propose HPOBench, which includes 7 existing and 5 new benchmark families, with more than 100 multi-fidelity benchmark problems in total. HPOBench allows running this extendable set of multi-fidelity HPO benchmarks in a reproducible way by isolating and packaging the individual benchmarks in containers. It also provides surrogate and tabular benchmarks for computationally affordable yet statistically sound evaluations. To demonstrate the broad compatibility of HPOBench and its usefulness, we conduct an exemplary large-scale study evaluating 6 well-known multi-fidelity HPO tools.
    Nonlinearities in Steerable SO(2)-Equivariant CNNs. (arXiv:2109.06861v1 [cs.LG])
    (2 min) Invariance under symmetry is an important problem in machine learning. Our paper looks specifically at equivariant neural networks where transformations of inputs yield homomorphic transformations of outputs. Here, steerable CNNs have emerged as the standard solution. An inherent problem of steerable representations is that general nonlinear layers break equivariance, thus restricting architectural choices. Our paper applies harmonic distortion analysis to illuminate the effect of nonlinearities on Fourier representations of SO(2). We develop a novel FFT-based algorithm for computing representations of non-linearly transformed activations while maintaining band-limitation. It yields exact equivariance for polynomial (approximations of) nonlinearities, as well as approximate solutions with tunable accuracy for general functions. We apply the approach to build a fully E(3)-equivariant network for sampled 3D surface data. In experiments with 2D and 3D data, we obtain results that compare favorably to the state-of-the-art in terms of accuracy while permitting continuous symmetry and exact equivariance.
    Scalable Scene Flow from Point Clouds in the Real World. (arXiv:2103.01306v4 [cs.CV] UPDATED)
    (2 min) Autonomous vehicles operate in highly dynamic environments necessitating an accurate assessment of which aspects of a scene are moving and where they are moving to. A popular approach to 3D motion estimation, termed scene flow, is to employ 3D point cloud data from consecutive LiDAR scans, although such approaches have been limited by the small size of real-world, annotated LiDAR data. In this work, we introduce a new large-scale dataset for scene flow estimation derived from corresponding tracked 3D objects, which is $\sim$1,000$\times$ larger than previous real-world datasets in terms of the number of annotated frames. We demonstrate how previous works were bounded based on the amount of real LiDAR data available, suggesting that larger datasets are required to achieve state-of-the-art predictive performance. Furthermore, we show how previous heuristics for operating on point clouds such as down-sampling heavily degrade performance, motivating a new class of models that are tractable on the full point cloud. To address this issue, we introduce the FastFlow3D architecture which provides real time inference on the full point cloud. Additionally, we design human-interpretable metrics that better capture real world aspects by accounting for ego-motion and providing breakdowns per object type. We hope that this dataset may provide new opportunities for developing real world scene flow systems.
    Learning Density Distribution of Reachable States for Autonomous Systems. (arXiv:2109.06728v1 [cs.AI])
    (2 min) State density distribution, in contrast to worst-case reachability, can be leveraged for safety-related problems to better quantify the likelihood of the risk for potentially hazardous situations. In this work, we propose a data-driven method to compute the density distribution of reachable states for nonlinear and even black-box systems. Our semi-supervised approach learns the system dynamics and the state density jointly from trajectory data, guided by the fact that the state density evolution follows the Liouville partial differential equation. With the help of neural network reachability tools, our approach can estimate the set of all possible future states as well as their density. Moreover, we can perform online safety verification with probability ranges for unsafe behaviors to occur. We use an extensive set of experiments to show that our learned solution can produce a much more accurate estimate of the density distribution, and can quantify risks less conservatively and more flexibly than worst-case analysis.
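    For reference, the Liouville equation referred to above couples the state density $\rho(x,t)$ to the dynamics $\dot{x} = f(x)$ (notation assumed, not taken from the paper):

        \frac{\partial \rho(x,t)}{\partial t} + \nabla \cdot \big( \rho(x,t)\, f(x) \big) = 0,
        \qquad \text{equivalently along trajectories} \qquad
        \frac{d}{dt} \ln \rho(x(t), t) = -\nabla \cdot f(x(t)).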
    Towards optimized actions in critical situations of soccer games with deep reinforcement learning. (arXiv:2109.06625v1 [cs.LG])
    (2 min) Soccer is a sparse-reward game: any smart or careless action in critical situations can change the result of the match. Therefore players, coaches, and scouts are all curious about the best action to be performed in critical situations, such as moments with a high probability of losing ball possession or scoring a goal. This work proposes a new state representation for the soccer game and a batch reinforcement learning approach to train a smart policy network. This network gets the contextual information of the situation and proposes the optimal action to maximize the expected goal for the team. We performed extensive numerical experiments on the soccer logs made by InStat for 104 European soccer matches. The results show that in all 104 games, the optimized policy obtains higher rewards than its counterpart in the behavior policy. Besides, our framework learns policies that are close to the expected behavior in the real world. For instance, in the optimized policy, we observe that some actions, such as a foul or playing the ball out, can sometimes be more rewarding than a shot in specific situations.
    Scalable Font Reconstruction with Dual Latent Manifolds. (arXiv:2109.06627v1 [cs.CV])
    (2 min) We propose a deep generative model that performs typography analysis and font reconstruction by learning disentangled manifolds of both font style and character shape. Our approach enables us to massively scale up the number of character types we can effectively model compared to previous methods. Specifically, we infer separate latent variables representing character and font via a pair of inference networks which take as input sets of glyphs that either all share a character type, or belong to the same font. This design allows our model to generalize to characters that were not observed during training time, an important task in light of the relative sparsity of most fonts. We also put forward a new loss, adapted from prior work that measures likelihood using an adaptive distribution in a projected space, resulting in more natural images without requiring a discriminator. We evaluate on the task of font reconstruction over various datasets representing character types of many languages, and compare favorably to modern style transfer systems according to both automatic and manually-evaluated metrics.
    Universal Graph Transformer Self-Attention Networks. (arXiv:1909.11855v11 [cs.LG] UPDATED)
    (2 min) The transformer has been extensively used in research domains such as computer vision, image processing, and natural language processing. The transformer, however, has not been actively used in graph neural networks. To this end, we introduce a transformer-based advanced GNN model, named UGformer, to learn graph representations. In particular, given an input graph, we present two UGformer variants. The first variant is to leverage the transformer on a set of sampled neighbors for each node, while the second is to leverage the transformer directly on the input graph. Experimental results demonstrate that our UGformer achieves state-of-the-art accuracies on well-known benchmark datasets for graph classification and inductive text classification. The code is available on Github: \url{https://github.com/daiquocnguyen/Graph-Transformer}.
    Conditional Synthetic Data Generation for Robust Machine Learning Applications with Limited Pandemic Data. (arXiv:2109.06486v1 [cs.LG])
    (2 min) Background: At the onset of a pandemic, such as COVID-19, data with proper labeling/attributes corresponding to the new disease might be unavailable or sparse. Machine Learning (ML) models trained with the available data, which is limited in quantity and poor in diversity, will often be biased and inaccurate. At the same time, ML algorithms designed to fight pandemics must have good performance and be developed in a time-sensitive manner. To tackle the challenges of limited data, and label scarcity in the available data, we propose generating conditional synthetic data, to be used alongside real data for developing robust ML models. Methods: We present a hybrid model consisting of a conditional generative flow and a classifier for conditional synthetic data generation. The classifier decouples the feature representation for the condition, which is fed to the flow to extract the local noise. We generate synthetic data by manipulating the local noise with fixed conditional feature representation. We also propose a semi-supervised approach to generate synthetic samples in the absence of labels for a majority of the available data. Results: We performed conditional synthetic generation for chest computed tomography (CT) scans corresponding to normal, COVID-19, and pneumonia afflicted patients. We show that our method significantly outperforms existing models both on qualitative and quantitative performance, and our semi-supervised approach can efficiently synthesize conditional samples under label scarcity. As an example of downstream use of synthetic data, we show improvement in COVID-19 detection from CT scans with conditional synthetic data augmentation.
    Instance-wise Graph-based Framework for Multivariate Time Series Forecasting. (arXiv:2109.06489v1 [cs.LG])
    (2 min) Multivariate time series forecasting has attracted increasing attention because of its vital role in real-world fields such as finance, traffic, and weather. In recent years, many research efforts have been proposed for forecasting multivariate time series. Although some previous work considers the interdependencies among different variables at the same timestamp, existing work overlooks the inter-connections between different variables at different timestamps. In this paper, we propose a simple yet efficient instance-wise graph-based framework to utilize the inter-dependencies of different variables at different timestamps for multivariate time series forecasting. The key idea of our framework is aggregating information from the historical time series of different variables to the current time series that we need to forecast. We conduct experiments on the Traffic, Electricity, and Exchange-Rate multivariate time series datasets. The results show that our proposed model outperforms the state-of-the-art baseline methods.
    Dodging Attack Using Carefully Crafted Natural Makeup. (arXiv:2109.06467v1 [cs.CV])
    (2 min) Deep learning face recognition models are used by state-of-the-art surveillance systems to identify individuals passing through public areas (e.g., airports). Previous studies have demonstrated the use of adversarial machine learning (AML) attacks to successfully evade identification by such systems, both in the digital and physical domains. Attacks in the physical domain, however, require significant manipulation to the human participant's face, which can raise suspicion by human observers (e.g. airport security officers). In this study, we present a novel black-box AML attack which carefully crafts natural makeup, which, when applied on a human participant, prevents the participant from being identified by facial recognition models. We evaluated our proposed attack against the ArcFace face recognition model, with 20 participants in a real-world setup that includes two cameras, different shooting angles, and different lighting conditions. The evaluation results show that in the digital domain, the face recognition system was unable to identify all of the participants, while in the physical domain, the face recognition system was able to identify the participants in only 1.22% of the frames (compared to 47.57% without makeup and 33.73% with random natural makeup), which is below a reasonable threshold of a realistic operational environment.
    Variation-Incentive Loss Re-weighting for Regression Analysis on Biased Data. (arXiv:2109.06565v1 [cs.LG])
    (2 min) Both classification and regression tasks are susceptible to the biased distribution of training data. However, existing approaches are focused on the class-imbalanced learning and cannot be applied to the problems of numerical regression where the learning targets are continuous values rather than discrete labels. In this paper, we aim to improve the accuracy of the regression analysis by addressing the data skewness/bias during model training. We first introduce two metrics, uniqueness and abnormality, to reflect the localized data distribution from the perspectives of their feature (i.e., input) space and target (i.e., output) space. Combining these two metrics we propose a Variation-Incentive Loss re-weighting method (VILoss) to optimize the gradient descent-based model training for regression analysis. We have conducted comprehensive experiments on both synthetic and real-world data sets. The results show significant improvement in the model quality (reduction in error by up to 11.9%) when using VILoss as the loss criterion in training.
    PalmTree: Learning an Assembly Language Model for Instruction Embedding. (arXiv:2103.03809v3 [cs.LG] UPDATED)
    (2 min) Deep learning has demonstrated its strengths in numerous binary analysis tasks, including function boundary detection, binary code search, function prototype inference, value set analysis, etc. When applying deep learning to binary analysis tasks, we need to decide what input should be fed into the neural network model. More specifically, we need to answer how to represent an instruction in a fixed-length vector. The idea of automatically learning instruction representations is intriguing, however the existing schemes fail to capture the unique characteristics of disassembly. These schemes ignore the complex intra-instruction structures and mainly rely on control flow in which the contextual information is noisy and can be influenced by compiler optimizations. In this paper, we propose to pre-train an assembly language model called PalmTree for generating general-purpose instruction embeddings by conducting self-supervised training on large-scale unlabeled binary corpora. PalmTree utilizes three pre-training tasks to capture various characteristics of assembly language. These training tasks overcome the problems in existing schemes, thus can help to generate high-quality representations. We conduct both intrinsic and extrinsic evaluations, and compare PalmTree with other instruction embedding schemes. PalmTree has the best performance for intrinsic metrics, and outperforms the other instruction embedding schemes for all downstream tasks.
    CLAIMED: A CLAssification-Incorporated Minimum Energy Design to explore a multivariate response surface with feasibility constraints. (arXiv:2006.05021v2 [stat.ML] UPDATED)
    (2 min) Motivated by the problem of optimization of force-field systems in physics using large-scale computer simulations, we consider exploration of a deterministic complex multivariate response surface. The objective is to find input combinations that generate output close to some desired or "target" vector. In spite of reducing the problem to exploration of the input space with respect to a one-dimensional loss function, the search is nontrivial and challenging due to infeasible input combinations, the high dimensionality of the input and output spaces, multiple "desirable" regions in the input space, and the difficulty of emulating the objective function well with a surrogate model. We propose an approach that is based on combining machine learning techniques with smart experimental design ideas to locate multiple good regions in the input space.
    Multiple shooting with neural differential equations. (arXiv:2109.06786v1 [cs.LG])
    (2 min) Neural differential equations have recently emerged as a flexible data-driven/hybrid approach to model time-series data. This work experimentally demonstrates that if the data contains oscillations, then standard fitting of a neural differential equation may give a flattened-out trajectory that fails to describe the data. We then introduce the multiple shooting method and present successful demonstrations of this method for the fitting of a neural differential equation to two datasets (synthetic and experimental) that the standard approach fails to fit. Constraints introduced by multiple shooting can be satisfied using a penalty or augmented Lagrangian method.
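    A minimal sketch of the multiple-shooting objective: the trajectory is split into segments, each segment gets its own learnable initial state, and a penalty ties each segment's end state to the next segment's start. A forward-Euler integrator stands in for a proper ODE solver, and the penalty weight rho is a placeholder for the paper's penalty or augmented Lagrangian treatment:

        import torch

        def euler_rollout(f, x0, dt, steps):
            # Integrate dx/dt = f(x) with forward Euler (placeholder for a real ODE solver).
            xs = [x0]
            for _ in range(steps):
                xs.append(xs[-1] + dt * f(xs[-1]))
            return torch.stack(xs)                         # (steps + 1, dim)

        def multiple_shooting_loss(f, segments, init_states, dt, rho=10.0):
            # segments: list of observed tensors (T_i, dim); init_states: one learnable
            # initial state per segment (e.g. entries of a torch.nn.ParameterList).
            loss, ends = 0.0, []
            for seg, x0 in zip(segments, init_states):
                traj = euler_rollout(f, x0, dt, seg.shape[0] - 1)
                loss = loss + ((traj - seg) ** 2).mean()   # fit within each segment
                ends.append(traj[-1])
            for end, nxt in zip(ends[:-1], init_states[1:]):
                loss = loss + rho * ((end - nxt) ** 2).mean()  # continuity penalty
            return loss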
    One-Class Meta-Learning: Towards Generalizable Few-Shot Open-Set Classification. (arXiv:2109.06859v1 [cs.CV])
    (2 min) Real-world classification tasks are frequently required to work in an open-set setting. This is especially challenging for few-shot learning problems due to the small sample size for each known category, which prevents existing open-set methods from working effectively; however, most multiclass few-shot methods are limited to closed-set scenarios. In this work, we address the problem of few-shot open-set classification by first proposing methods for few-shot one-class classification and then extending them to few-shot multiclass open-set classification. We introduce two independent few-shot one-class classification methods: Meta Binary Cross-Entropy (Meta-BCE), which learns a separate feature representation for one-class classification, and One-Class Meta-Learning (OCML), which learns to generate one-class classifiers given standard multiclass feature representation. Both methods can augment any existing few-shot learning method without requiring retraining to work in a few-shot multiclass open-set setting without degrading its closed-set performance. We demonstrate the benefits and drawbacks of both methods in different problem settings and evaluate them on three standard benchmark datasets, miniImageNet, tieredImageNet, and Caltech-UCSD-Birds-200-2011, where they surpass the state-of-the-art methods in the few-shot multiclass open-set and few-shot one-class tasks.
    Vision Transformer for Learning Driving Policies in Complex Multi-Agent Environments. (arXiv:2109.06514v1 [cs.LG])
    (2 min) Driving in a complex urban environment is a difficult task that requires a complex decision policy. In order to make informed decisions, one needs to gain an understanding of the long-range context and the importance of other vehicles. In this work, we propose to use Vision Transformer (ViT) to learn a driving policy in urban settings with birds-eye-view (BEV) input images. The ViT network learns the global context of the scene more effectively than with earlier proposed Convolutional Neural Networks (ConvNets). Furthermore, ViT's attention mechanism helps to learn an attention map for the scene which allows the ego car to determine which surrounding cars are important to its next decision. We demonstrate that a DQN agent with a ViT backbone outperforms baseline algorithms with ConvNet backbones pre-trained in various ways. In particular, the proposed method helps reinforcement learning algorithms to learn faster, with increased performance and less data than baselines.
    Anomaly Attribution of Multivariate Time Series using Counterfactual Reasoning. (arXiv:2109.06562v1 [cs.LG])
    (2 min) There are numerous methods for detecting anomalies in time series, but detection is only the first step towards understanding them. We strive to go beyond this by explaining those anomalies. Thus we develop a novel attribution scheme for multivariate time series relying on counterfactual reasoning. We aim to answer the counterfactual question of whether the anomalous event would have occurred if a subset of the involved variables had been distributed more similarly to the data outside of the anomalous interval. Specifically, we detect anomalous intervals using the Maximally Divergent Interval (MDI) algorithm, replace a subset of variables with their in-distribution values within the detected interval, and observe whether the interval has become less anomalous by re-scoring it with MDI. We evaluate our method on multivariate temporal and spatio-temporal data and confirm the accuracy of our anomaly attribution for multiple well-understood extreme climate events such as heatwaves and hurricanes.
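    The replace-and-rescore loop can be sketched generically; here a placeholder score function stands in for the MDI scorer, and resampling from data outside the interval stands in for the paper's in-distribution replacement:

        import numpy as np

        def attribution_by_replacement(score, window, background, anomaly_interval, seed=0):
            # For each variable, replace its values inside the detected interval with
            # values resampled from data outside the interval and measure how much the
            # anomaly score drops. `score` is a placeholder for the interval scorer.
            rng = np.random.default_rng(seed)
            s, e = anomaly_interval
            base = score(window[s:e])
            drops = {}
            for j in range(window.shape[1]):
                cf = window.copy()
                cf[s:e, j] = rng.choice(background[:, j], size=e - s)
                drops[j] = base - score(cf[s:e])           # larger drop => more responsible
            return drops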
    Uncertainty-Aware Machine Translation Evaluation. (arXiv:2109.06352v1 [cs.CL])
    (2 min) Several neural-based metrics have been recently proposed to evaluate machine translation quality. However, all of them resort to point estimates, which provide limited information at segment level. This is made worse as they are trained on noisy, biased and scarce human judgements, often resulting in unreliable quality predictions. In this paper, we introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality. We combine the COMET framework with two uncertainty estimation methods, Monte Carlo dropout and deep ensembles, to obtain quality scores along with confidence intervals. We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task, augmented with MQM annotations. We experiment with varying numbers of references and further discuss the usefulness of uncertainty-aware quality estimation (without references) to flag possibly critical translation mistakes.
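    A generic sketch of the Monte Carlo dropout variant; the actual COMET model is not used here, and the number of samples and the normal-approximation interval are assumptions:

        import torch

        def mc_dropout_scores(model, batch, n_samples=30):
            # Score each segment n_samples times with dropout kept active, then report
            # the mean and a normal-approximation 95% confidence interval per segment.
            model.eval()
            for m in model.modules():
                if isinstance(m, torch.nn.Dropout):
                    m.train()                              # re-enable only the dropout layers
            with torch.no_grad():
                samples = torch.stack([model(batch) for _ in range(n_samples)])  # (T, B)
            mean, std = samples.mean(dim=0), samples.std(dim=0)
            return mean, (mean - 1.96 * std, mean + 1.96 * std)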

2021-09-14

  • cs.CL updates on arXiv.org

    Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies. (arXiv:2108.12084v2 [cs.CL] UPDATED)
    (2 min) Gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. However, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. These harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understanding of non-binary genders in society. In this paper, we explain the complexity of gender and language around it, and survey non-binary persons to understand harms associated with the treatment of gender as binary in English language technologies. We also detail how current language representations (e.g., GloVe, BERT) capture and perpetuate these harms and related challenges that need to be acknowledged and addressed for representations to equitably encode gender information.
    Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment. (arXiv:2106.06381v2 [cs.CL] UPDATED)
    (2 min) The cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform the above two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on the token-level tasks, such as question answering, and structured prediction. Moreover, the model can serve as a pretrained word aligner, which achieves reasonably low error rates on the alignment benchmarks. The code and pretrained parameters are available at https://github.com/CZWin32768/XLM-Align.
    Disentangling Representations of Text by Masking Transformers. (arXiv:2104.07155v2 [cs.CL] UPDATED)
    (0 min) Representations from large pretrained models such as BERT encode a range of features into monolithic vectors, affording strong predictive accuracy across a multitude of downstream tasks. In this paper we explore whether it is possible to learn disentangled representations by identifying existing subnetworks within pretrained models that encode distinct, complementary aspect representations. Concretely, we learn binary masks over transformer weights or hidden units to uncover subsets of features that correlate with a specific factor of variation; this eliminates the need to train a disentangled model from scratch for a particular task. We evaluate this method with respect to its ability to disentangle representations of sentiment from genre in movie reviews, "toxicity" from dialect in Tweets, and syntax from semantics. By combining masking with magnitude pruning we find that we can identify sparse subnetworks within BERT that strongly encode particular aspects (e.g., toxicity) while only weakly encoding others (e.g., race). Moreover, despite only learning masks, we find that disentanglement-via-masking performs as well as -- and often better than -- previously proposed methods based on variational autoencoders and adversarial training.
    DuoRAT: Towards Simpler Text-to-SQL Models. (arXiv:2010.11119v2 [cs.CL] UPDATED)
    (0 min) Recent neural text-to-SQL models can effectively translate natural language questions to corresponding SQL queries on unseen databases. Working mostly on the Spider dataset, researchers have proposed increasingly sophisticated solutions to the problem. Contrary to this trend, in this paper we focus on simplifications. We begin by building DuoRAT, a re-implementation of the state-of-the-art RAT-SQL model that unlike RAT-SQL is using only relation-aware or vanilla transformers as the building blocks. We perform several ablation experiments using DuoRAT as the baseline model. Our experiments confirm the usefulness of some techniques and point out the redundancy of others, including structural SQL features and features that link the question with the schema.
    Unsupervised Paraphrasing with Pretrained Language Models. (arXiv:2010.12885v2 [cs.CL] UPDATED)
    (0 min) Paraphrase generation has benefited extensively from recent progress in the designing of training objectives and model architectures. However, previous explorations have largely focused on supervised methods, which require a large amount of labeled data that is costly to collect. To address this drawback, we adopt a transfer learning approach and propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting. Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking (DB). To enforce a surface form dissimilar from the input, whenever the language model emits a token contained in the source sequence, DB prevents the model from outputting the subsequent source token for the next generation step. We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair (QQP) and the ParaNMT datasets and is robust to domain shift between the two datasets of distinct distributions. We also demonstrate that our model transfers to paraphrasing in other languages without any additional finetuning.
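    The Dynamic Blocking constraint described above is easy to sketch; the version below applies the block deterministically at every step, whereas the full method may sample which source tokens to block (such details are omitted):

        def dynamic_blocking(source_ids, generated_ids, logits, neg_inf=-1e9):
            # logits: 1-D array (or list) of next-token scores over the vocabulary.
            # If the token just emitted occurs in the source at position i, forbid the
            # source token at position i + 1 for this step, discouraging verbatim copies.
            if not generated_ids:
                return logits
            last = generated_ids[-1]
            for i, tok in enumerate(source_ids[:-1]):
                if tok == last:
                    logits[source_ids[i + 1]] = neg_inf
            return logits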
    MT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs. (arXiv:2104.08692v2 [cs.CL] UPDATED)
    (0 min) Multilingual T5 (mT5) pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on eight multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
    Can NLI Models Verify QA Systems' Predictions?. (arXiv:2104.08731v2 [cs.CL] UPDATED)
    (0 min) To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just "good enough" in the context of imperfect QA datasets. We explore the use of natural language inference (NLI) as a way to achieve this goal, as NLI inherently requires the premise (document context) to contain all necessary information to support the hypothesis (proposed answer to the question). We leverage large pre-trained models and recent prior datasets to construct powerful question converter and decontextualization modules, which can reformulate QA instances as premise-hypothesis pairs with very high reliability. Then, by combining standard NLI datasets with NLI examples automatically derived from QA training data, we can train NLI models to judge the correctness of QA models' proposed answers. We show that our NLI approach can generally improve the confidence estimation of a QA model across different domains, evaluated in a selective QA setting. Careful manual analysis over the predictions of our NLI model shows that it can further identify cases where the QA model produces the right answer for the wrong reason, or where the answer cannot be verified as addressing all aspects of the question.
    Cross-Domain Label-Adaptive Stance Detection. (arXiv:2104.07467v2 [cs.CL] UPDATED)
    (0 min) Stance detection concerns the classification of a writer's viewpoint towards a target. There are different task variants, e.g., stance of a tweet vs. a full article, or stance with respect to a claim vs. an (implicit) topic. Moreover, task definitions vary, which includes the label inventory, the data collection, and the annotation protocol. All these aspects hinder cross-domain studies, as they require changes to standard domain adaptation approaches. In this paper, we perform an in-depth analysis of 16 stance detection datasets, and we explore the possibility for cross-domain learning from them. Moreover, we propose an end-to-end unsupervised framework for out-of-domain prediction of unseen, user-defined labels. In particular, we combine domain adaptation techniques such as mixture of experts and domain-adversarial training with label embeddings, and we demonstrate sizable performance gains over strong baselines, both (i) in-domain, i.e., for seen targets, and (ii) out-of-domain, i.e., for unseen targets. Finally, we perform an exhaustive analysis of the cross-domain results, and we highlight the important factors influencing the model performance.
    Toward Deconfounding the Influence of Entity Demographics for Question Answering Accuracy. (arXiv:2104.07571v2 [cs.CL] UPDATED)
    (0 min) The goal of question answering (QA) is to answer any question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, model accuracy analysis reveals little evidence that accuracy is lower for people based on gender or nationality; instead, there is more variation on professions (question topic). But QA's lack of representation could itself hide evidence of bias, necessitating QA datasets that better represent global diversity.
    End-to-End Conversational Search for Online Shopping with Utterance Transfer. (arXiv:2109.05460v1 [cs.CL])
    (0 min) Successful conversational search systems can present a natural, adaptive and interactive shopping experience for online shopping customers. However, building such systems from scratch faces real-world challenges from both imperfect product schema/knowledge and a lack of training dialog data. In this work we first propose ConvSearch, an end-to-end conversational search system that deeply combines the dialog system with search. It leverages the text profile to retrieve products, which is more robust against imperfect product schema/knowledge compared with using product attributes alone. We then address the lack-of-data challenge by proposing an utterance transfer approach that generates dialogue utterances by using existing dialogs from other domains and leveraging the search behavior data from an e-commerce retailer. With utterance transfer, we introduce a new conversational search dataset for online shopping. Experiments show that our utterance transfer method can significantly improve the availability of training dialogue data without crowd-sourcing, and the conversational search system significantly outperforms the best tested baseline.
    GraphCodeBERT: Pre-training Code Representations with Data Flow. (arXiv:2009.08366v4 [cs.SE] UPDATED)
    (0 min) Pre-trained models for programming languages have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of taking the syntactic-level structure of code, such as the abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring the unnecessarily deep hierarchy of an AST, a property that makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
    On Classifying Continuous Constraint Satisfaction problems. (arXiv:2106.02397v2 [cs.CC] UPDATED)
    (0 min) A continuous constraint satisfaction problem (CCSP) is a constraint satisfaction problem (CSP) with a domain $U \subset \mathbb{R}$. We engage in a systematic study to classify CCSPs that are complete for the Existential Theory of the Reals, i.e., ER-complete. To define this class, we first consider the problem ETR, which also stands for Existential Theory of the Reals. In an instance of this problem we are given some sentence of the form $\exists x_1, \ldots, x_n \in \mathbb{R} : \Phi(x_1, \ldots, x_n)$, where $\Phi$ is a well-formed quantifier-free formula consisting of the symbols $\{0, 1, +, \cdot, \geq, >, \wedge, \vee, \neg\}$; the goal is to check whether this sentence is true. Now the class ER is the family of all problems that admit a polynomial-time reduction to ETR. It is known that NP $\subseteq$ ER $\subseteq$ PSPACE. We restrict our attention to CCSPs with addition constraints ($x + y = z$) and some other mild technical condition. Previously, it was shown that multiplication constraints ($x \cdot y = z$), squaring constraints ($x^2 = y$), or inversion constraints ($x\cdot y = 1$) are sufficient to establish ER-completeness. We extend this in the strongest possible sense for equality constraints as follows. We show that CCSPs (with addition constraints and some other mild technical condition) that have any one well-behaved curved equality constraint ($f(x,y) = 0$) are ER-complete. We further extend our results to inequality constraints. We show that any well-behaved convexly curved and any well-behaved concavely curved inequality constraint ($f(x,y) \geq 0$ and $g(x,y) \geq 0$) imply ER-completeness on the class of such CCSPs. We apply our findings to geometric packing and answer an open question by Abrahamsen et al. [FOCS 2020]. Namely, we establish ER-completeness of packing convex pieces into a square container under rotations and translations.
    Levenshtein Training for Word-level Quality Estimation. (arXiv:2109.05611v1 [cs.CL])
    (0 min) We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we propose a two-stage transfer learning procedure on both augmented data and human post-editing data. We also propose heuristics to construct reference labels that are compatible with subword-level finetuning and inference. Results on WMT 2020 QE shared task dataset show that our proposed method has superior data efficiency under the data-constrained setting and competitive performance under the unconstrained setting.
    DialBERT: A Hierarchical Pre-Trained Model for Conversation Disentanglement. (arXiv:2004.03760v2 [cs.CL] UPDATED)
    (0 min) Disentanglement is a problem in which multiple conversations occur in the same channel simultaneously, and the listener should decide which utterance is part of the conversation they will respond to. We propose a new model, named Dialogue BERT (DialBERT), which integrates local and global semantics in a single stream of messages to disentangle the conversations that are mixed together. We employ BERT to capture the matching information in each utterance pair at the utterance level, and use a BiLSTM to aggregate and incorporate the context-level information. With only a 3% increase in parameters, a 12% improvement has been attained in comparison to BERT, based on the F1-Score. The model achieves a state-of-the-art result on a new dataset proposed by IBM and surpasses previous work by a substantial margin.
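    To illustrate the hierarchical idea of utterance-level matching followed by context-level aggregation, here is a minimal PyTorch sketch in which precomputed utterance-pair embeddings (e.g., BERT [CLS] vectors, replaced by random tensors here) are aggregated with a BiLSTM and scored. Dimensions and the scoring head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HierarchicalDisentangler(nn.Module):
    """Toy two-level model: pair embeddings -> BiLSTM context -> match score."""

    def __init__(self, pair_dim=768, hidden=256):
        super().__init__()
        self.context_rnn = nn.LSTM(pair_dim, hidden, batch_first=True,
                                   bidirectional=True)
        self.scorer = nn.Linear(2 * hidden, 1)    # one match score per pair

    def forward(self, pair_embeddings):           # (batch, num_pairs, pair_dim)
        context, _ = self.context_rnn(pair_embeddings)
        return self.scorer(context).squeeze(-1)   # (batch, num_pairs)

pairs = torch.randn(2, 10, 768)                   # 2 dialogs, 10 candidate pairs each
print(HierarchicalDisentangler()(pairs).shape)    # torch.Size([2, 10])
```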
    Reducing Discontinuous to Continuous Parsing with Pointer Network Reordering. (arXiv:2104.06239v2 [cs.CL] UPDATED)
    (0 min) Discontinuous constituent parsers have always lagged behind continuous approaches in terms of accuracy and speed, as the presence of constituents with discontinuous yield introduces extra complexity to the task. However, a discontinuous tree can be converted into a continuous variant by reordering tokens. Based on that, we propose to reduce discontinuous parsing to a continuous problem, which can then be directly solved by any off-the-shelf continuous parser. To that end, we develop a Pointer Network capable of accurately generating the continuous token arrangement for a given input sentence and define a bijective function to recover the original order. Experiments on the main benchmarks with two continuous parsers prove that our approach is on par in accuracy with purely discontinuous state-of-the-art algorithms, but considerably faster.
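    The reduction hinges on a bijection between the original token order and a continuous-friendly arrangement. The short sketch below shows only that bijection (applying a predicted permutation and inverting it afterwards); the Pointer Network that predicts the permutation is not shown, and the example permutation is made up.

```python
def reorder(tokens, order):
    """Apply a predicted permutation to obtain the continuous arrangement."""
    return [tokens[i] for i in order]

def recover(reordered, order):
    """Invert the permutation to restore the original sentence order."""
    original = [None] * len(order)
    for new_pos, old_pos in enumerate(order):
        original[old_pos] = reordered[new_pos]
    return original

tokens = ["What", "did", "you", "eat", "?"]
order = [0, 3, 1, 2, 4]                       # hypothetical predicted arrangement
continuous = reorder(tokens, order)
assert recover(continuous, order) == tokens   # bijective: nothing is lost
```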
    RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models. (arXiv:2109.05620v1 [cs.CL])
    (0 min) To audit the robustness of named entity recognition (NER) models, we propose RockNER, a simple yet effective method to create natural adversarial examples. Specifically, at the entity level, we replace target entities with other entities of the same semantic class in Wikidata; at the context level, we use pre-trained language models (e.g., BERT) to generate word substitutions. Together, the two levels of attack produce natural adversarial examples that result in a shifted distribution from the training data on which our target models have been trained. We apply the proposed method to the OntoNotes dataset and create a new benchmark named OntoRock for evaluating the robustness of existing NER models via a systematic evaluation protocol. Our experiments and analysis reveal that even the best model has a significant performance drop, and these models seem to memorize in-domain entity patterns instead of reasoning from the context. Our work also studies the effects of a few simple data augmentation methods to improve the robustness of NER models.
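    As a toy illustration of the entity-level attack, the sketch below swaps each target entity for another entity of the same semantic class; the tiny lookup table stands in for Wikidata, and the context-level BERT substitutions are not shown.

```python
import random

# Hypothetical same-class entity inventory (a stand-in for Wikidata queries).
same_class_entities = {
    "PERSON": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "GPE": ["Norway", "Chile", "Vietnam"],
}

def rock_entities(tokens, spans):
    """spans: (start, end, label) token spans; processed right-to-left so
    earlier indices are unaffected by length changes."""
    out = list(tokens)
    for start, end, label in sorted(spans, reverse=True):
        out[start:end] = random.choice(same_class_entities[label]).split()
    return out

sentence = "Barack Obama visited France last May".split()
spans = [(0, 2, "PERSON"), (3, 4, "GPE")]
print(" ".join(rock_entities(sentence, spans)))
```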
    YASO: A Targeted Sentiment Analysis Evaluation Dataset for Open-Domain Reviews. (arXiv:2012.14541v2 [cs.CL] UPDATED)
    (0 min) Current targeted sentiment analysis (TSA) evaluation in a cross-domain setup is restricted to the small set of review domains available in existing datasets. Such an evaluation is limited, and may not reflect true performance on sites like Amazon or Yelp that host diverse reviews from many domains. To address this gap, we present YASO - a new TSA evaluation dataset of open-domain user reviews. YASO contains 2,215 English sentences from dozens of review domains, annotated with target terms and their sentiment. Our analysis verifies the reliability of these annotations, and explores the characteristics of the collected data. Benchmark results using five contemporary TSA systems show there is ample room for improvement on this challenging new dataset. YASO is available at https://github.com/IBM/yaso-tsa.
    The Logic Traps in Evaluating Post-hoc Interpretations. (arXiv:2109.05463v1 [cs.LG])
    (0 min) Post-hoc interpretation aims to explain a trained model and reveal how the model arrives at a decision. Though research on post-hoc interpretations has developed rapidly, one growing pain in this field is the difficulty in evaluating interpretations. There are some crucial logic traps behind existing evaluation methods, which are ignored by most works. In this opinion piece, we summarize four kinds of evaluation methods and point out the corresponding logic traps behind them. We argue that we should be clear about these traps rather than ignore them and draw conclusions assertively.
    Do Transformer Modifications Transfer Across Implementations and Applications?. (arXiv:2102.11972v2 [cs.LG] UPDATED)
    (0 min) The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results.
    Good-Enough Example Extrapolation. (arXiv:2109.05602v1 [cs.CL])
    (0 min) This paper asks whether extrapolating the hidden space distribution of text examples from one class onto another is a valid inductive bias for data augmentation. To operationalize this question, I propose a simple data augmentation protocol called "good-enough example extrapolation" (GE3). GE3 is lightweight and has no hyperparameters. Applied to three text classification datasets for various data imbalance scenarios, GE3 improves performance more than upsampling and other hidden-space data augmentation methods.
    MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts. (arXiv:2006.12289v2 [cs.CL] UPDATED)
    (0 min) We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author or not. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also describe the application of the above authorship verification system, using these datasets as training data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.
    Understanding Guided Image Captioning Performance across Domains. (arXiv:2012.02339v2 [cs.CV] UPDATED)
    (0 min) Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. On the other hand, VQA models generally lack the ability to provide long descriptive answers, while expecting the textual question to be quite precise. We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text that refers to either groundable or ungroundable concepts in the image. Our model consists of a Transformer-based multimodal encoder that uses the guiding text together with global and object-level image features to derive early-fusion representations used to generate the guided caption. While models trained on Visual Genome data have an in-domain advantage of fitting well when guided with automatic object labels, we find that guided captioning models trained on Conceptual Captions generalize better on out-of-domain images and guiding texts. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.
    Counter-Interference Adapter for Multilingual Machine Translation. (arXiv:2104.08154v2 [cs.CL] UPDATED)
    (0 min) Developing a unified multilingual model has long been a pursuit for machine translation. However, existing approaches suffer from performance degradation -- a single multilingual model is inferior to separately trained bilingual ones on rich-resource languages. We conjecture that such a phenomenon is due to interference caused by joint training with multiple languages. To address this issue, we propose CIAT, an adapted Transformer model with a small parameter overhead for multilingual machine translation. We evaluate CIAT on multiple benchmark datasets, including IWSLT, OPUS-100, and WMT. Experiments show that CIAT consistently outperforms strong multilingual baselines on 64 of 66 language directions, 42 of which see above 0.5 BLEU improvement. Our code is available at \url{https://github.com/Yaoming95/CIAT}.
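    For readers unfamiliar with adapters, here is a generic bottleneck adapter block of the kind such approaches insert into Transformer layers; the sizes and the residual design are common defaults and may differ from CIAT's actual counter-interference adapter.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    residual connection; only a few extra parameters per Transformer layer."""

    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden):                   # (batch, seq, d_model)
        return hidden + self.up(torch.relu(self.down(hidden)))

x = torch.randn(2, 7, 512)
print(Adapter()(x).shape)                        # torch.Size([2, 7, 512])
```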
    Pushing the Limits of Non-Autoregressive Speech Recognition. (arXiv:2104.03416v4 [eess.AS] UPDATED)
    (0 min) We apply recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
    Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society. (arXiv:2005.00033v3 [cs.CL] UPDATED)
    (0 min) With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that (i) focuses on COVID-19, (ii) combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and (iii) covers Arabic, Bulgarian, Dutch, and English. Finally, we show strong evaluation results using pretrained Transformers, thus confirming the practical utility of the dataset in monolingual vs. multilingual, and single task vs. multitask settings.
    Exploring the Role of BERT Token Representations to Explain Sentence Probing Results. (arXiv:2104.01477v2 [cs.CL] UPDATED)
    (0 min) Several studies have been carried out on revealing linguistic features captured by BERT. This is usually achieved by training a diagnostic classifier on the representations obtained from different layers of BERT. The subsequent classification accuracy is then interpreted as the ability of the model in encoding the corresponding linguistic property. Despite providing insights, these studies have left out the potential role of token representations. In this paper, we provide a more in-depth analysis on the representation space of BERT in search for distinct and meaningful subspaces that can explain the reasons behind these probing results. Based on a set of probing tasks and with the help of attribution methods we show that BERT tends to encode meaningful knowledge in specific token representations (which are often ignored in standard classification setups), allowing the model to detect syntactic and semantic abnormalities, and to distinctively separate grammatical number and tense subspaces.
    Scaling End-to-End Models for Large-Scale Multilingual ASR. (arXiv:2104.14830v2 [cs.CL] UPDATED)
    (0 min) Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.6K to 53.5K hours. We adopt GShard [1] to efficiently scale up to 10B parameters. Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck - our 500M-param model already outperforms monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days - the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth works better than width and large encoders do better than large decoders; (4) with continuous training, they can be adapted to new languages and domains.
    On Robustness and Bias Analysis of BERT-based Relation Extraction. (arXiv:2009.06206v4 [cs.CL] UPDATED)
    (0 min) Fine-tuning pre-trained models has achieved impressive performance on standard natural language processing benchmarks. However, the resultant model generalizability remains poorly understood. We do not know, for example, whether excellent benchmark performance implies that a model generalizes well. In this study, we analyze a fine-tuned BERT model from different perspectives using relation extraction. We also characterize the differences in generalization techniques according to our proposed improvements. From empirical experimentation, we find that BERT suffers a bottleneck in terms of robustness by way of randomizations, adversarial and counterfactual tests, and biases (i.e., selection and semantic). These findings highlight opportunities for future improvements. Our open-sourced testbed DiagnoseRE is available at \url{https://github.com/zjunlp/DiagnoseRE}.
    Detecting Polarized Topics Using Partisanship-aware Contextualized Topic Embeddings. (arXiv:2104.07814v2 [cs.CL] UPDATED)
    (0 min) Growing polarization of the news media has been blamed for fanning disagreement, controversy and even violence. Early identification of polarized topics is thus an urgent matter that can help mitigate conflict. However, accurate measurement of topic-wise polarization is still an open research challenge. To address this gap, we propose Partisanship-aware Contextualized Topic Embeddings (PaCTE), a method to automatically detect polarized topics from partisan news sources. Specifically, utilizing a language model that has been finetuned on recognizing partisanship of the news articles, we represent the ideology of a news corpus on a topic by corpus-contextualized topic embedding and measure the polarization using cosine distance. We apply our method to a dataset of news articles about the COVID-19 pandemic. Extensive experiments on different news sources and topics demonstrate the efficacy of our method to capture topical polarization, as indicated by its effectiveness of retrieving the most polarized topics.
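    A simplified sketch of the final measurement step: each partisan corpus gets a topic embedding (here just the mean of its article embeddings, a simplifying assumption in place of the paper's corpus-contextualized topic embedding) and polarization is the cosine distance between the two.

```python
import numpy as np

def corpus_topic_embedding(article_embeddings):
    # Simplification: mean pooling over article embeddings for one topic.
    return np.mean(article_embeddings, axis=0)

def polarization(left_articles, right_articles):
    l = corpus_topic_embedding(left_articles)
    r = corpus_topic_embedding(right_articles)
    cosine = np.dot(l, r) / (np.linalg.norm(l) * np.linalg.norm(r))
    return 1.0 - cosine                      # cosine distance as polarization score

rng = np.random.default_rng(0)
left = rng.normal(size=(50, 128))            # toy article embeddings, one outlet
right = rng.normal(loc=0.5, size=(50, 128))  # toy article embeddings, another outlet
print(round(polarization(left, right), 3))
```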
    WeChat Neural Machine Translation Systems for WMT21. (arXiv:2108.02401v2 [cs.CL] UPDATED)
    (0 min) This paper introduces WeChat AI's participation in WMT 2021 shared news translation task on English->Chinese, English->Japanese, Japanese->English and English->German. Our systems are based on the Transformer (Vaswani et al., 2017) with several novel and effective variants. In our experiments, we employ data filtering, large-scale synthetic data generation (i.e., back-translation, knowledge distillation, forward-translation, iterative in-domain knowledge transfer), advanced finetuning approaches, and boosted Self-BLEU based model ensemble. Our constrained systems achieve 36.9, 46.9, 27.8 and 31.3 case-sensitive BLEU scores on English->Chinese, English->Japanese, Japanese->English and English->German, respectively. The BLEU scores of English->Chinese, English->Japanese and Japanese->English are the highest among all submissions, and that of English->German is the highest among all constrained submissions.
    Sentence Bottleneck Autoencoders from Transformer Language Models. (arXiv:2109.00055v2 [cs.CL] UPDATED)
    (0 min) Representation learning for text via pretraining a language model on a large corpus has become a standard starting point for building NLP systems. This approach stands in contrast to autoencoders, also trained on raw text, but with the objective of learning to encode each input as a vector that allows full reconstruction. Autoencoders are attractive because of their latent space structure and generative properties. We therefore explore the construction of a sentence-level autoencoder from a pretrained, frozen transformer language model. We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder. We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer (an example of controlled generation), and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
    Multi-Sense Language Modelling. (arXiv:2012.05776v2 [cs.CL] UPDATED)
    (0 min) The effectiveness of a language model is influenced by its token representations, which must encode contextual information and handle the same word form having a plurality of meanings (polysemy). Currently, none of the common language modelling architectures explicitly model polysemy. We propose a language model which not only predicts the next word, but also its sense in context. We argue that this higher prediction granularity may be useful for end tasks such as assistive writing, and allow for a more precise linking of language models with knowledge bases. We find that multi-sense language modelling requires architectures that go beyond standard language models, and here propose a structured prediction framework that decomposes the task into a word followed by a sense prediction task. To aid sense prediction, we utilise a Graph Attention Network, which encodes definitions and example uses of word senses. Overall, we find that multi-sense language modelling is a highly challenging task, and suggest that future work focus on the creation of more annotated training datasets.
    Stylistic Retrieval-based Dialogue System with Unparallel Training Data. (arXiv:2109.05477v1 [cs.CL])
    (0 min) The ability of a dialog system to express a consistent language style during conversations has a direct, positive impact on its usability and on user satisfaction. Although previous studies have demonstrated that style transfer is feasible with a large amount of parallel data, it is often impossible to collect such data for different styles. In this paper, instead of manually constructing conversation data with a certain style, we propose a flexible framework that adapts a generic retrieval-based dialogue system to mimic the language style of a specified persona without any parallel data. Our approach is based on the automatic generation of stylized data by learning the usage of jargon, and then rewriting generic conversations into stylized ones by incorporating the jargon. In experiments we implemented dialogue systems with five distinct language styles, and the results show our framework significantly outperforms baselines in terms of the average score of responses' relevance and style degree, and content diversity. A/B testing on a commercial chatbot shows that users are more satisfied with our system. This study demonstrates the feasibility of building stylistic dialogue systems by simple data augmentation.
    Fine-Grained Element Identification in Complaint Text of Internet Fraud. (arXiv:2108.08676v2 [cs.CL] UPDATED)
    (0 min) Existing systems dealing with online complaints provide a final decision without explanations. We propose to analyse the complaint text of internet fraud in a fine-grained manner. Considering that the complaint text includes multiple clauses with various functions, we propose to identify the role of each clause and classify them into different types of fraud elements. We construct a large labeled dataset originating from a real finance service platform. We build an element identification model on top of BERT and propose two additional modules to utilize the context of the complaint text for better element label classification, namely, a global context encoder and a label refiner. Experimental results show the effectiveness of our model.
    Program Enhanced Fact Verification with Verbalization and Graph Attention Network. (arXiv:2010.03084v6 [cs.AI] UPDATED)
    (0 min) Performing fact verification based on structured data is important for many real-life applications and is a challenging research problem, particularly when it involves both symbolic operations and informal inference based on language understanding. In this paper, we present a Program-enhanced Verbalization and Graph Attention Network (ProgVGAT) to integrate programs and execution into textual inference models. Specifically, a verbalization with program execution model is proposed to accumulate evidence that is embedded in operations over the tables. Built on that, we construct the graph attention verification networks, which are designed to fuse different sources of evidence from verbalized program execution, program structures, and the original statements and tables, to make the final verification decision. To support the above framework, we propose a program selection module optimized with a new training strategy based on margin loss, to produce more accurate programs, which is shown to be effective in enhancing the final verification results. Experimental results show that the proposed framework achieves the new state-of-the-art performance, a 74.4% accuracy, on the benchmark dataset TABFACT.
    Enhancing Interpretable Clauses Semantically using Pretrained Word Representation. (arXiv:2104.06901v2 [cs.CL] UPDATED)
    (0 min) Tsetlin Machine (TM) is an interpretable pattern recognition algorithm based on propositional logic, which has demonstrated competitive performance in many Natural Language Processing (NLP) tasks, including sentiment analysis, text classification, and Word Sense Disambiguation. To obtain human-level interpretability, legacy TM employs Boolean input features such as bag-of-words (BOW). However, the BOW representation makes it difficult to use any pre-trained information, for instance, word2vec and GloVe word representations. This restriction has constrained the performance of TM compared to deep neural networks (DNNs) in NLP. To reduce the performance gap, in this paper, we propose a novel way of using pre-trained word representations for TM. The approach significantly enhances the performance and interpretability of TM. We achieve this by extracting semantically related words from pre-trained word representations as input features to the TM. Our experiments show that the accuracy of the proposed approach is significantly higher than the previous BOW-based TM, reaching the level of DNN-based models.
    Towards Improving Adversarial Training of NLP Models. (arXiv:2109.00544v2 [cs.CL] UPDATED)
    (0 min) Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models' performance, and the benefits are mainly uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP models, which we name Attacking to Training (A2T). The core part of A2T is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use A2T to train BERT and RoBERTa models on IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results empirically show that it is possible to train robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with A2T can improve an NLP model's robustness to the attack it was originally trained with and also defend the model against other types of word substitution attacks. Furthermore, we show that A2T can improve NLP models' standard accuracy, cross-domain generalization, and interpretability. Code is available at https://github.com/QData/Textattack-A2T .
    Convex Aggregation for Opinion Summarization. (arXiv:2104.01371v2 [cs.CL] UPDATED)
    (0 min) Recent advances in text autoencoders have significantly improved the quality of the latent space, which enables models to generate grammatical and consistent text from aggregated latent vectors. As a successful application of this property, unsupervised opinion summarization models generate a summary by decoding the aggregated latent vectors of inputs. More specifically, they perform the aggregation via simple average. However, little is known about how the vector aggregation step affects the generation quality. In this study, we revisit the commonly used simple average approach by examining the latent space and generated summaries. We found that text autoencoders tend to generate overly generic summaries from simply averaged latent vectors due to an unexpected $L_2$-norm shrinkage in the aggregated latent vectors, which we refer to as summary vector degeneration. To overcome this issue, we develop a framework Coop, which searches input combinations for the latent vector aggregation using input-output word overlap. Experimental results show that Coop successfully alleviates the summary vector degeneration issue and establishes new state-of-the-art performance on two opinion summarization benchmarks. Code is available at \url{https://github.com/megagonlabs/coop}.
    ESTER: A Machine Reading Comprehension Dataset for Event Semantic Relation Reasoning. (arXiv:2104.08350v2 [cs.CL] UPDATED)
    (0 min) Understanding how events are semantically related to each other is the essence of reading comprehension. Recent event-centric reading comprehension datasets focus mostly on event arguments or temporal relations. While these tasks partially evaluate machines' ability of narrative understanding, human-like reading comprehension requires the capability to process event-based information beyond arguments and temporal reasoning. For example, to understand causality between events, we need to infer motivation or purpose; to establish event hierarchy, we need to understand the composition of events. To facilitate these tasks, we introduce ESTER, a comprehensive machine reading comprehension (MRC) dataset for Event Semantic Relation Reasoning. The dataset leverages natural language queries to reason about the five most common event semantic relations, provides more than 6K questions and captures 10.1K event relation pairs. Experimental results show that the current SOTA systems achieve 22.1%, 63.3%, and 83.5% for token-based exact-match, F1, and event-based HIT@1 scores, which are all significantly below human performances (36.0%, 79.6%, 100% respectively), highlighting our dataset as a challenging benchmark.
    Extracting Event Temporal Relations via Hyperbolic Geometry. (arXiv:2109.05527v1 [cs.CL])
    (0 min) Detecting events and their evolution through time is a crucial task in natural language understanding. Recent neural approaches to event temporal relation extraction typically map events to embeddings in the Euclidean space and train a classifier to detect temporal relations between event pairs. However, embeddings in the Euclidean space cannot capture richer asymmetric relations such as event temporal relations. We thus propose to embed events into hyperbolic spaces, which are intrinsically oriented at modeling hierarchical structures. We introduce two approaches to encode events and their temporal relations in hyperbolic spaces. One approach leverages hyperbolic embeddings to directly infer event relations through simple geometrical operations. In the second one, we devise an end-to-end architecture composed of hyperbolic neural units tailored for the temporal relation extraction task. Thorough experimental assessments on widely used datasets have shown the benefits of revisiting the tasks on a different geometrical space, resulting in state-of-the-art performance on several standard metrics. Finally, the ablation study and several qualitative analyses highlighted the rich event semantics implicitly encoded into hyperbolic spaces.
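    The key ingredient when moving from Euclidean to hyperbolic space is the distance function. Below is the standard Poincare-ball distance that such hyperbolic approaches build on; the toy event embeddings are random and the surrounding architecture is not shown.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance between two points inside the unit Poincare ball."""
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / (denom + eps))

event_a = np.array([0.1, 0.2])    # toy 2-d event embeddings (norm < 1)
event_b = np.array([0.4, -0.3])
print(poincare_distance(event_a, event_b))
```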
    Learning From How Human Correct For Data-Centric Deep Learning. (arXiv:2102.00225v4 [cs.CL] UPDATED)
    (0 min) In industrial NLP applications, manually labeled data contains a certain amount of noise. We present a simple method to find the noisy examples and relabel them manually, collecting the correction information in the process. We then present a novel method to incorporate this human correction information into a deep learning model: since humans know how to correct noisy data, the correction information can be injected into the model. We experiment on our own manually labeled text classification dataset, in which we relabeled the noisy data for our industry application. The results show that our method improves classification accuracy from 91.7% to 92.5%. The 91.7% baseline, which is hard to surpass, comes from training BERT on the corrected dataset.
    Multi-task Language Modeling for Improving Speech Recognition of Rare Words. (arXiv:2011.11715v4 [cs.CL] UPDATED)
    (0 min) End-to-end automatic speech recognition (ASR) systems are increasingly popular due to their relative architectural simplicity and competitive performance. However, even though the average accuracy of these systems may be high, the performance on rare content words often lags behind hybrid ASR systems. To address this problem, second-pass rescoring is often applied leveraging upon language modeling. In this paper, we propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance. We show that our rescoring model trained with these additional tasks outperforms the baseline rescoring model, trained with only the language modeling task, by 1.4% on a general test set and by 2.6% on a rare word test set in terms of relative word error rate (WERR). Our best ASR system with the multi-task LM shows a 4.6% WERR reduction compared with the RNN Transducer-only ASR baseline for rare word recognition.
    ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning. (arXiv:2012.15283v2 [cs.CL] UPDATED)
    (0 min) While pre-trained language models (PTLMs) have achieved noticeable success on many NLP tasks, they still struggle for tasks that require event temporal reasoning, which is essential for event-centric applications. We present a continual pre-training approach that equips PTLMs with targeted knowledge about event temporal relations. We design self-supervised learning objectives to recover masked-out event and temporal indicators and to discriminate sentences from their corrupted counterparts (where event or temporal indicators got replaced). By further pre-training a PTLM with these objectives jointly, we reinforce its attention to event and temporal information, yielding enhanced capability on event temporal reasoning. This effective continual pre-training framework for event temporal reasoning (ECONET) improves the PTLMs' fine-tuning performances across five relation extraction and question answering tasks and achieves new or on-par state-of-the-art performances in most of our downstream tasks.
    Leveraging Table Content for Zero-shot Text-to-SQL with Meta-Learning. (arXiv:2109.05395v1 [cs.CL])
    (0 min) Single-table text-to-SQL aims to transform a natural language question into a SQL query according to one single table. Recent work has made promising progress on this task by pre-trained language models and a multi-submodule framework. However, zero-shot tables, that is, tables unseen in the training set, are currently the most critical bottleneck restricting the application of existing approaches to real-world scenarios. Although some work has utilized auxiliary tasks to help handle zero-shot tables, expensive extra manual annotation limits their practicality. In this paper, we propose a new approach for the zero-shot text-to-SQL task which does not rely on any additional manual annotations. Our approach consists of two parts. First, we propose a new model that leverages the abundant information of table content to help establish the mapping between questions and zero-shot tables. Further, we propose a simple but efficient meta-learning strategy to train our model. The strategy utilizes a two-step gradient update to force the model to learn a generalization ability towards zero-shot tables. We conduct extensive experiments on a public open-domain text-to-SQL dataset WikiSQL and a domain-specific dataset ESQL. Compared to existing approaches using the same pre-trained model, our approach achieves significant improvements on both datasets. Compared to the larger pre-trained model and the tabular-specific pre-trained model, our approach is still competitive. More importantly, on the zero-shot subsets of both datasets, our approach yields even larger improvements.
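    The abstract mentions a two-step gradient update without spelling it out, so the following is only a generic, first-order sketch of such an inner/outer update (in the spirit of MAML-style meta-learning) under that assumption; the model, losses, and data here are toy stand-ins, not the paper's procedure.

```python
import copy
import torch
import torch.nn as nn

def two_step_update(model, loss_fn, support, query, inner_lr=1e-2, outer_lr=1e-3):
    # Step 1: adapt a copy of the model on the support examples (a seen table).
    fast = copy.deepcopy(model)
    s_x, s_y = support
    grads = torch.autograd.grad(loss_fn(fast(s_x), s_y), fast.parameters())
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g
    # Step 2: update the original parameters with the query-loss gradient
    # computed on the adapted copy (first-order approximation).
    q_x, q_y = query
    outer_grads = torch.autograd.grad(loss_fn(fast(q_x), q_y), fast.parameters())
    with torch.no_grad():
        for p, g in zip(model.parameters(), outer_grads):
            p -= outer_lr * g

model = nn.Linear(8, 2)                          # toy stand-in for a text-to-SQL model
loss_fn = nn.CrossEntropyLoss()
support = (torch.randn(4, 8), torch.randint(0, 2, (4,)))
query = (torch.randn(4, 8), torch.randint(0, 2, (4,)))
two_step_update(model, loss_fn, support, query)
```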
    GooAQ: Open Question Answering with Diverse Answer Types. (arXiv:2104.08727v2 [cs.CL] UPDATED)
    (0 min) While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GooAQ and observe that: (a) in line with recent work, LMs' strong performance on GooAQ's short-answer questions heavily benefits from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as 'how' and 'why' questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types.
    Leveraging pre-trained representations to improve access to untranscribed speech from endangered languages. (arXiv:2103.14583v2 [cs.CL] UPDATED)
    (0 min) Pre-trained speech representations like wav2vec 2.0 are a powerful tool for automatic speech recognition (ASR). Yet many endangered languages lack sufficient data for pre-training such models, or are predominantly oral vernaculars without a standardised writing system, precluding fine-tuning. Query-by-example spoken term detection (QbE-STD) offers an alternative for iteratively indexing untranscribed speech corpora by locating spoken query terms. Using data from 7 Australian Aboriginal languages and a regional variety of Dutch, all of which are endangered or vulnerable, we show that QbE-STD can be improved by leveraging representations developed for ASR (wav2vec 2.0: the English monolingual model and XLSR53 multilingual model). Surprisingly, the English model outperformed the multilingual model on 4 Australian language datasets, raising questions around how to optimally leverage self-supervised speech representations for QbE-STD. Nevertheless, we find that wav2vec 2.0 representations (either English or XLSR53) offer large improvements (56-86% relative) over state-of-the-art approaches on our endangered language datasets.
    Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts. (arXiv:2108.11830v2 [cs.CL] UPDATED)
    (0 min) Dialogue models trained on human conversations inadvertently learn to generate toxic responses. In addition to producing explicitly offensive utterances, these models can also implicitly insult a group or individual by aligning themselves with an offensive statement. To better understand the dynamics of contextually offensive language, we investigate the stance of dialogue model responses in offensive Reddit conversations. Specifically, we create ToxiChat, a crowd-annotated dataset of 2,000 Reddit threads and model responses labeled with offensive language and stance. Our analysis reveals that 42% of human responses agree with toxic comments, whereas only 13% agree with safe comments. This undesirable behavior is learned by neural dialogue models, such as DialoGPT, which we show are two times more likely to agree with offensive comments. To enable automatic detection of offensive language, we fine-tuned transformer-based classifiers on ToxiChat that achieve 0.71 F1 for offensive labels and 0.53 Macro-F1 for stance labels. Finally, we quantify the effectiveness of controllable text generation (CTG) methods to mitigate the tendency of neural dialogue models to agree with offensive comments. Compared to the baseline, our best CTG model achieves a 19% reduction in agreement with offensive comments and produces 29% fewer offensive replies. Our work highlights the need for further efforts to characterize and analyze inappropriate behavior in dialogue models, in order to help make them safer. Our code and corpus are available at https://github.com/abaheti95/ToxiChat .
    Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval. (arXiv:2109.05523v1 [cs.CV])
    (0 min) Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and sentences usually happens at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision for the better identification of mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. In order to integrate supervision at both the sentence level and the phrase level, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions among multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching losses from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to some state-of-the-art models.
    OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings. (arXiv:2007.00049v2 [cs.CL] UPDATED)
    (0 min) Language representations are known to carry stereotypical biases and, as a result, lead to biased predictions in downstream tasks. While existing methods are effective at mitigating biases by linear projection, such methods are too aggressive: they not only remove bias, but also erase valuable information from word embeddings. We develop new measures for evaluating specific information retention that demonstrate the tradeoff between bias removal and information retention. To address this challenge, we propose OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale. Our experiments on gender biases show that OSCaR is a well-balanced approach that ensures that semantic information is retained in the embeddings and bias is also effectively mitigated.
    Not All Negatives are Equal: Label-Aware Contrastive Loss for Fine-grained Text Classification. (arXiv:2109.05427v1 [cs.CL])
    (0 min) Fine-grained classification involves dealing with datasets with a larger number of classes with subtle differences between them. Guiding the model to focus on differentiating dimensions between these commonly confusable classes is key to improving performance on fine-grained tasks. In this work, we analyse the contrastive fine-tuning of pre-trained language models on two fine-grained text classification tasks, emotion classification and sentiment analysis. We adaptively embed class relationships into a contrastive objective function to help differently weigh the positives and negatives, and in particular, weighting closely confusable negatives more than less similar negative examples. We find that Label-aware Contrastive Loss outperforms previous contrastive methods, in the presence of a larger number of classes and/or more confusable classes, and helps models to produce output distributions that are more differentiated.
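    A hedged sketch of a label-aware contrastive objective: negatives whose classes are more confusable with the anchor's class receive larger weights in the denominator, so those classes are pushed apart harder. The weighting scheme, temperature, and similarity function here are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive(anchor, positive, negatives, neg_weights, tau=0.1):
    """anchor, positive: (d,); negatives: (n, d); neg_weights: (n,) class-aware weights."""
    pos = torch.exp(F.cosine_similarity(anchor, positive, dim=0) / tau)
    neg = torch.exp(F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / tau)
    return -torch.log(pos / (pos + (neg_weights * neg).sum()))

dim = 16
anchor, positive = torch.randn(dim), torch.randn(dim)
negatives = torch.randn(5, dim)
neg_weights = torch.tensor([2.0, 1.5, 1.0, 1.0, 0.5])  # confusable classes weigh more
print(label_aware_contrastive(anchor, positive, negatives, neg_weights))
```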
    TEASEL: A Transformer-Based Speech-Prefixed Language Model. (arXiv:2109.05522v1 [cs.CL])
    (2 min) Multimodal language analysis is a burgeoning field of NLP that aims to simultaneously model a speaker's words, acoustical annotations, and facial expressions. In this area, lexicon features usually outperform other modalities because they are pre-trained on large corpora via Transformer-based models. Despite their strong performance, training a new self-supervised learning (SSL) Transformer on any modality is not usually attainable due to insufficient data, which is the case in multimodal language learning. This work proposes a Transformer-Based Speech-Prefixed Language Model called TEASEL to address these constraints without training a complete Transformer model. Unlike a conventional language model, the TEASEL model includes the speech modality as a dynamic prefix alongside the textual modality. This method exploits a conventional pre-trained language model as a cross-modal Transformer model. We evaluated TEASEL on the multimodal sentiment analysis task defined by the CMU-MOSI dataset. Extensive experiments show that our model outperforms unimodal baseline language models by 4% and outperforms the current multimodal state-of-the-art (SoTA) model by 1% in F1-score. Additionally, our proposed method is 72% smaller than the SoTA model.
    Exploring Task Difficulty for Few-Shot Relation Extraction. (arXiv:2109.05473v1 [cs.CL])
    (2 min) Few-shot relation extraction (FSRE) focuses on recognizing novel relations by learning with merely a handful of annotated instances. Meta-learning has been widely adopted for such a task, which trains on randomly generated few-shot tasks to learn generic data representations. Despite impressive results achieved, existing models still perform suboptimally when handling hard FSRE tasks, where the relations are fine-grained and similar to each other. We argue this is largely because existing models do not distinguish hard tasks from easy ones in the learning process. In this paper, we introduce a novel approach based on contrastive learning that learns better representations by exploiting relation label information. We further design a method that allows the model to adaptively learn how to focus on hard tasks. Experiments on two standard datasets demonstrate the effectiveness of our method.
    Unsupervised Domain Adaptation Schemes for Building ASR in Low-resource Languages. (arXiv:2109.05494v1 [cs.CL])
    (2 min) Building an automatic speech recognition (ASR) system from scratch requires a large amount of annotated speech data, which is difficult to collect in many languages. However, there are cases where the low-resource language shares a common acoustic space with a high-resource language having enough annotated data to build an ASR. In such cases, we show that the domain-independent acoustic models learned from the high-resource language through unsupervised domain adaptation (UDA) schemes can enhance the performance of the ASR in the low-resource language. We use the specific example of Hindi in the source domain and Sanskrit in the target domain. We explore two architectures: i) domain adversarial training using gradient reversal layer (GRL) and ii) domain separation networks (DSN). The GRL and DSN architectures give absolute improvements of 6.71% and 7.32%, respectively, in word error rate over the baseline deep neural network model when trained on just 5.5 hours of data in the target domain. We also show that choosing a proper language (Telugu) in the source domain can bring further improvement. The results suggest that UDA schemes can be helpful in the development of ASR systems for low-resource languages, mitigating the hassle of collecting large amounts of annotated speech data.
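    Since the first architecture relies on domain adversarial training with a gradient reversal layer, here is a minimal PyTorch GRL: features pass through unchanged in the forward pass, while gradients flowing back from the domain classifier are negated (and scaled by lambda), pushing the encoder toward domain-invariant representations. This is the generic building block, not the paper's code.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)                      # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None    # reversed, scaled gradient

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

features = torch.randn(4, 16, requires_grad=True)   # toy acoustic features
domain_loss = grad_reverse(features, lambd=0.5).sum()
domain_loss.backward()
print(features.grad[0, :3])   # gradients are negated and scaled by 0.5
```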
    Modeling Concentrated Cross-Attention for Neural Machine Translation with Gaussian Mixture Model. (arXiv:2109.05244v1 [cs.CL])
    (2 min) Cross-attention is an important component of neural machine translation (NMT), which is always realized by dot-product attention in previous methods. However, dot-product attention only considers the pair-wise correlation between words, resulting in dispersion when dealing with long sentences and neglect of source neighboring relationships. Inspired by linguistics, the above issues are caused by ignoring a type of cross-attention, called concentrated attention, which focuses on several central words and then spreads around them. In this work, we apply Gaussian Mixture Model (GMM) to model the concentrated attention in cross-attention. Experiments and analyses we conducted on three datasets show that the proposed method outperforms the baseline and has significant improvement on alignment quality, N-gram accuracy, and long sentence translation.
    Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy. (arXiv:2109.05238v1 [cs.CL])
    (2 min) Simultaneous machine translation (SiMT) generates translation before reading the entire source sentence and hence it has to trade off between translation quality and latency. To fulfill the requirements of different translation quality and latency in practical applications, the previous methods usually need to train multiple SiMT models for different latency levels, resulting in large computational costs. In this paper, we propose a universal SiMT model with Mixture-of-Experts Wait-k Policy to achieve the best translation quality under arbitrary latency with only one trained model. Specifically, our method employs multi-head attention to accomplish the mixture of experts where each head is treated as a wait-k expert with its own waiting words number, and given a test latency and source inputs, the weights of the experts are accordingly adjusted to produce the best translation. Experiments on three datasets show that our method outperforms all the strong baselines under different latency, including the state-of-the-art adaptive policy.
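    For readers unfamiliar with the wait-k schedule that each expert head follows, below is a toy read/write scheduler: read k source tokens first, then alternate writing and reading until the source is exhausted. The function and its arguments are illustrative, not the authors' implementation, and the mixture-of-experts weighting is not shown.

```python
def wait_k_schedule(num_source_tokens, num_target_tokens, k):
    """Return the READ/WRITE action sequence of a standard wait-k policy."""
    actions, read, written = [], 0, 0
    while written < num_target_tokens:
        if read < min(k + written, num_source_tokens):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

print(wait_k_schedule(num_source_tokens=6, num_target_tokens=5, k=3))
# ['READ', 'READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE',
#  'READ', 'WRITE', 'WRITE']
```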
    Modular Self-Supervision for Document-Level Relation Extraction. (arXiv:2109.05362v1 [cs.CL])
    (2 min) Extracting relations across large text spans has been relatively underexplored in NLP, but it is particularly important for high-value domains such as biomedicine, where obtaining high recall of the latest findings is crucial for practical applications. Compared to conventional information extraction confined to short text spans, document-level relation extraction faces additional challenges in both inference and learning. Given longer text spans, state-of-the-art neural architectures are less effective and task-specific self-supervision such as distant supervision becomes very noisy. In this paper, we propose decomposing document-level relation extraction into relation detection and argument resolution, taking inspiration from Davidsonian semantics. This enables us to incorporate explicit discourse modeling and leverage modular self-supervision for each sub-problem, which is less noise-prone and can be further refined end-to-end via variational EM. We conduct a thorough evaluation in biomedical machine reading for precision oncology, where cross-paragraph relation mentions are prevalent. Our method outperforms prior state of the art, such as multi-scale learning and graph neural networks, by over 20 absolute F1 points. The gain is particularly pronounced among the most challenging relation instances whose arguments never co-occur in a paragraph.
    Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy. (arXiv:2109.05312v1 [cs.CL])
    (2 min) Generating goal-oriented questions in Visual Dialogue tasks is a challenging and long-standing problem. State-Of-The-Art systems are shown to generate questions that, although grammatically correct, often lack an effective strategy and sound unnatural to humans. Inspired by the cognitive literature on information search and cross-situational word learning, we design Confirm-it, a model based on a beam search re-ranking algorithm that guides an effective goal-oriented strategy by asking questions that confirm the model's conjecture about the referent. We take the GuessWhat?! game as a case-study. We show that dialogues generated by Confirm-it are more natural and effective than beam search decoding without re-ranking.
    To Protect and To Serve? Analyzing Entity-Centric Framing of Police Violence. (arXiv:2109.05325v1 [cs.CL])
    (2 min) Framing has significant but subtle effects on public opinion and policy. We propose an NLP framework to measure entity-centric frames. We use it to understand media coverage on police violence in the United States in a new Police Violence Frames Corpus of 82k news articles spanning 7k police killings. Our work uncovers more than a dozen framing devices and reveals significant differences in the way liberal and conservative news sources frame both the issue of police violence and the entities involved. Conservative sources emphasize when the victim is armed or attacking an officer and are more likely to mention the victim's criminal record. Liberal sources focus more on the underlying systemic injustice, highlighting the victim's race and that they were unarmed. We discover temporary spikes in these injustice frames near high-profile shooting events, and finally, we show protest volume correlates with and precedes media framing decisions.
    Implicit Premise Generation with Discourse-aware Commonsense Knowledge Models. (arXiv:2109.05358v1 [cs.CL])
    (2 min) Enthymemes are defined as arguments where a premise or conclusion is left implicit. We tackle the task of generating the implicit premise in an enthymeme, which requires not only an understanding of the stated conclusion and premise but also additional inferences that could depend on commonsense knowledge. The largest available dataset for enthymemes (Habernal et al., 2018) consists of 1.7k samples, which is not large enough to train a neural text generation model. To address this issue, we take advantage of a similar task and dataset: Abductive reasoning in narrative text (Bhagavatula et al., 2020). However, we show that simply using a state-of-the-art seq2seq model fine-tuned on this data might not generate meaningful implicit premises associated with the given enthymemes. We demonstrate that encoding discourse-aware commonsense during fine-tuning improves the quality of the generated implicit premises and outperforms all other baselines both in automatic and human evaluations on three different datasets.
    Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search. (arXiv:2109.05433v1 [cs.CV])
    (2 min) Internet search affects people's cognition of the world, so mitigating biases in search results and learning fair models is imperative for social good. We study a unique gender bias in image search in this work: the search images are often gender-imbalanced for gender-neutral natural language queries. We diagnose two typical image search models, the specialized model trained on in-domain datasets and the generalized representation model pre-trained on massive image and text data across the internet. Both models suffer from severe gender bias. Therefore, we introduce two novel debiasing approaches: an in-processing fair sampling method to address the gender imbalance issue for training models, and a post-processing feature clipping method based on mutual information to debias multimodal representations of pre-trained models. Extensive experiments on MS-COCO and Flickr30K benchmarks show that our methods significantly reduce the gender bias in image search models.
    COSMic: A Coherence-Aware Generation Metric for Image Descriptions. (arXiv:2109.05281v1 [cs.CL])
    (2 min) Developers of text generation models rely on automated evaluation metrics as a stand-in for slow and expensive manual evaluations. However, image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of output text. We address this weakness by introducing the first discourse-aware learned generation metric for evaluating image descriptions. Our approach is inspired by computational theories of discourse for capturing information goals using coherence. We present a dataset of image-description pairs annotated with coherence relations. We then train a coherence-aware metric on a subset of the Conceptual Captions dataset and measure its effectiveness (its ability to predict human ratings of output captions) on a test set composed of out-of-domain images. We demonstrate a higher Kendall Correlation Coefficient for our proposed metric with the human judgments for the results of a number of state-of-the-art coherence-aware caption generation models when compared to several other metrics including recently proposed learned metrics such as BLEURT and BERTScore.
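    For context, meta-evaluating such a metric usually means checking rank agreement between metric scores and human ratings; a small illustration with SciPy's Kendall tau is shown below (the scores are made up).

```python
from scipy.stats import kendalltau

# Hypothetical system-level scores: one entry per caption generation system.
human_ratings = [3.1, 4.2, 2.5, 3.8, 4.6]
metric_scores = [0.42, 0.55, 0.31, 0.49, 0.61]

tau, p_value = kendalltau(human_ratings, metric_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```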
    Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets. (arXiv:2109.05250v1 [cs.CL])
    (2 min) Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice and more loosely-related coreference anaphora, e.g., bridging or near-identity relations. In this paper, we qualitatively and quantitatively compare annotation schemes of ECB+ and NewsWCL50 with multiple criteria. We propose a phrasing diversity metric (PD) that compares lexical diversity within coreference chains on a more detailed level than previously proposed metrics, e.g., the number of unique lemmas. We discuss the different tasks that both CDCR datasets create, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.
    Multilingual Translation via Grafting Pre-trained Language Models. (arXiv:2109.05256v1 [cs.CL])
    (2 min) Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder can be challenging in machine translation, for GPT-like models lack a cross-attention component that is needed in seq2seq decoders. In this paper, we propose Graformer to graft separately pre-trained (masked) language models for machine translation. With monolingual data for pre-training and parallel data for grafting training, we make maximal use of both types of data. Experiments on 60 directions show that our method achieves average improvements of 5.8 BLEU in x2en and 2.9 BLEU in en2x directions compared with the multilingual Transformer of the same size.
    Guiding Topic Flows in the Generative Chatbot by Enhancing the ConceptNet with the Conversation Corpora. (arXiv:2109.05406v1 [cs.CL])
    (2 min) Human conversations consist of reasonable and natural topic flows, observed as shifts of the mentioned concepts across utterances. Previous chatbots that incorporate an external commonsense knowledge graph show that modeling concept shifts can effectively alleviate the dull and uninformative response dilemma. However, there remains a gap between the concept relations in natural conversation and those in the external commonsense knowledge graph: the latter are built from world knowledge rather than conversational scenarios, which makes them insufficient for chatbot construction. To bridge this gap, we propose a method that supplies additional concept relations extracted from conversational corpora and reconstructs an enhanced concept graph for chatbot construction. In addition, we present a novel, powerful, and fast graph encoding architecture, the Edge-Transformer, to replace the traditional GNN architecture. Experimental results on the Reddit conversation dataset indicate that our proposed method significantly outperforms strong baseline systems and achieves new SOTA results. Further analysis individually proves the effectiveness of the enhanced concept graph and the Edge-Transformer architecture.
    Scaling and Acceleration of Three-dimensional Structure Determination for Single-Particle Imaging Experiments with SpiniFEL. (arXiv:2109.05339v1 [physics.comp-ph])
    (0 min) The Linac Coherent Light Source (LCLS) is an X-ray free electron laser (XFEL) facility enabling the study of the structure and dynamics of single macromolecules. A major upgrade will bring the repetition rate of the X-ray source from 120 to 1 million pulses per second. Exascale high performance computing (HPC) capabilities will be required to process the corresponding data rates. We present SpiniFEL, an application used for structure determination of proteins from single-particle imaging (SPI) experiments. An emerging technique for imaging individual proteins and other large molecular complexes by outrunning radiation damage, SPI breaks free from the need for crystallization (which is difficult for some proteins) and allows for imaging molecular dynamics at near ambient conditions. SpiniFEL is being developed to run on supercomputers in near real-time while an experiment is taking place, so that feedback about the data can guide the data collection strategy. We describe here how we reformulated the mathematical framework for a parallelizable implementation and accelerated the most compute-intensive parts of the application. We also describe the use of Pygion, a Python interface for the Legion task-based programming model, and compare it to our existing MPI+GPU implementation.
    What's in a Name? Answer Equivalence For Open-Domain Question Answering. (arXiv:2109.05289v1 [cs.CL])
    (2 min) A flaw in QA evaluation is that annotations often provide only one gold answer. Thus, model predictions that are semantically equivalent to the answer but superficially different are considered incorrect. This work explores mining alias entities from knowledge bases and using them as additional gold answers (i.e., equivalent answers). We incorporate these answers in two settings: evaluation with additional answers and model training with equivalent answers. We analyse three QA benchmarks: Natural Questions, TriviaQA, and SQuAD. Answer expansion increases the exact match score on all datasets for evaluation, while incorporating equivalent answers during training helps on real-world datasets. We ensure the additional answers are valid through a human post hoc evaluation.
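    A minimal sketch of the evaluation-side idea, assuming a hypothetical alias table mined from a knowledge base: the prediction counts as correct if it matches any gold answer or any of its aliases after standard answer normalization.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_with_aliases(prediction: str, gold_answers, alias_table) -> bool:
    """Score a prediction against the gold answers plus any KB aliases of those answers."""
    expanded = set(gold_answers)
    for ans in gold_answers:
        expanded.update(alias_table.get(ans, []))
    return any(normalize(prediction) == normalize(ans) for ans in expanded)

# Hypothetical alias table entry mined from a knowledge base.
aliases = {"New York City": ["NYC", "the Big Apple"]}
print(exact_match_with_aliases("nyc", ["New York City"], aliases))  # True
```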
    "Let Your Characters Tell Their Story": A Dataset for Character-Centric Narrative Understanding. (arXiv:2109.05438v1 [cs.CL])
    (2 min) When reading a literary piece, readers often make inferences about various characters' roles, personalities, relationships, intents, actions, etc. While humans can readily draw upon their past experiences to build such a character-centric view of the narrative, understanding characters in narratives can be a challenging task for machines. To encourage research in this field of character-centric narrative understanding, we present LiSCU -- a new dataset of literary pieces and their summaries paired with descriptions of characters that appear in them. We also introduce two new tasks on LiSCU: Character Identification and Character Description Generation. Our experiments with several pre-trained language models adapted for these tasks demonstrate that there is a need for better models of narrative comprehension.
    Bayesian Topic Regression for Causal Inference. (arXiv:2109.05317v1 [stat.ML])
    (2 min) Causal inference using observational text data is becoming increasingly popular in many research areas. This paper presents the Bayesian Topic Regression (BTR) model that uses both text and numerical information to model an outcome variable. It allows estimation of both discrete and continuous treatment effects. Furthermore, it allows for the inclusion of additional numerical confounding factors next to text data. To this end, we combine a supervised Bayesian topic model with a Bayesian regression framework and perform supervised representation learning for the text features jointly with the regression parameter training, respecting the Frisch-Waugh-Lovell theorem. Our paper makes two main contributions. First, we provide a regression framework that allows causal inference in settings when both text and numerical confounders are of relevance. We show with synthetic and semi-synthetic datasets that our joint approach recovers ground truth with lower bias than any benchmark model, when text and numerical features are correlated. Second, experiments on two real-world datasets demonstrate that a joint and supervised learning strategy also yields superior prediction results compared to strategies that estimate regression weights for text and non-text features separately, being even competitive with more complex deep neural networks.
    Knowledge Enhanced Fine-Tuning for Better Handling Unseen Entities in Dialogue Generation. (arXiv:2109.05487v1 [cs.CL])
    (2 min) Although pre-training models have achieved great success in dialogue generation, their performance drops dramatically when the input contains an entity that does not appear in the pre-training and fine-tuning datasets (an unseen entity). To address this issue, existing methods leverage an external knowledge base to generate appropriate responses. In real-world scenarios, however, the entity may be missing from the knowledge base, or knowledge retrieval may be imprecise. To deal with this problem, instead of introducing the knowledge base as input, we force the model to learn a better semantic representation by predicting the information in the knowledge base based only on the input context. Specifically, with the help of a knowledge base, we introduce two auxiliary training objectives: 1) Interpret Masked Word, which conjectures the meaning of the masked entity given the context; 2) Hypernym Generation, which predicts the hypernym of the entity based on the context. Experimental results on two dialogue corpora verify the effectiveness of our methods under both knowledge-available and knowledge-unavailable settings.
    College Student Retention Risk Analysis From Educational Database using Multi-Task Multi-Modal Neural Fusion. (arXiv:2109.05178v1 [cs.CL])
    (2 min) We develop a Multimodal Spatiotemporal Neural Fusion network for Multi-Task Learning (MSNF-MTCL) to predict five important student retention risks: future dropout, next-semester dropout, type of dropout, duration of dropout, and cause of dropout. First, we develop a general-purpose multi-modal neural fusion network model, MSNF, for learning students' academic information representation by fusing spatial and temporal unstructured advising notes with spatiotemporal structured data. MSNF combines a Bidirectional Encoder Representations from Transformers (BERT)-based document embedding framework to represent each advising note, a Long Short-Term Memory (LSTM) network to model temporal advising note embeddings, an LSTM network to model students' temporal performance variables, and students' static demographics. The final fused representation from MSNF is used in a Multi-Task Cascade Learning (MTCL) model to build MSNF-MTCL for predicting the five student retention risks. We evaluate MSNF-MTCL on a large educational database of 36,445 college students spanning an 18-year period and observe promising performance compared with the nearest state-of-the-art models. Additionally, we test the fairness of the model given the existence of biases.
    Semantic Categorization of Social Knowledge for Commonsense Question Answering. (arXiv:2109.05168v1 [cs.CL])
    (2 min) Large pre-trained language models (PLMs) have led to great success on various commonsense question answering (QA) tasks in an end-to-end fashion. However, little attention has been paid to what commonsense knowledge is needed to deeply characterize these QA tasks. In this work, we propose categorizing the semantics needed for these tasks, using SocialIQA as an example. Building upon our labeled social knowledge category dataset on top of SocialIQA, we further train neural QA models to incorporate such social knowledge categories and relation information from a knowledge base. Unlike previous work, we observe that our models with semantic categorizations of social knowledge achieve comparable performance with a relatively simple architecture and smaller size than other, more complex approaches.
    The Impact of Positional Encodings on Multilingual Compression. (arXiv:2109.05388v1 [cs.CL])
    (2 min) In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, by (for instance) adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings and token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models. We then answer why that is: Sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. Higher variances in multilingual training distributions require higher compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn compositionality and cross-lingual alignment. In other words, while sinusoidal positional encodings were originally designed for monolingual applications, they are particularly useful in multilingual language models.
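    For reference, the sinusoidal encodings discussed above follow the original Transformer formulation; the short NumPy sketch below reproduces them (an even model dimension is assumed).

```python
import numpy as np

def sinusoidal_encodings(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from the original Transformer:
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The compositionality property the abstract alludes to: PE(pos + k) is a fixed linear
# function of PE(pos), so relative offsets can be expressed as linear projections.
pe = sinusoidal_encodings(max_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```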
    Total Recall: a Customized Continual Learning Method for Neural Semantic Parsers. (arXiv:2109.05186v1 [cs.CL])
    (2 min) This paper investigates continual learning for semantic parsing. In this setting, a neural semantic parser learns tasks sequentially without accessing full training data from previous tasks. Directly applying SOTA continual learning algorithms to this problem fails to achieve performance comparable to re-training models on all seen tasks, because these algorithms do not consider the special properties of the structured outputs yielded by semantic parsers. Therefore, we propose TotalRecall, a continual learning method designed for neural semantic parsers from two aspects: i) a sampling method for memory replay that diversifies logical form templates and balances distributions of parse actions in a memory; ii) a two-stage training method that significantly improves the generalization capability of the parsers across tasks. We conduct extensive experiments to study the research problems involved in continual semantic parsing and demonstrate that a neural semantic parser trained with TotalRecall achieves superior performance to one trained directly with SOTA continual learning algorithms, with a 3-6x speedup compared to re-training from scratch. Code and datasets are available at: https://github.com/zhuang-li/cl_nsp.
    An Objective Metric for Explainable AI: How and Why to Estimate the Degree of Explainability. (arXiv:2109.05327v1 [cs.AI])
    (2 min) Numerous government initiatives (e.g. the EU with GDPR) are coming to the conclusion that the increasing complexity of modern software systems must be contrasted with some Rights to Explanation and metrics for the Impact Assessment of these tools, which allow humans to understand and oversee the output of Automated Decision Making systems. Explainable AI was born as a pathway to allow humans to explore and understand the inner workings of complex systems. But establishing what an explanation is, and objectively evaluating explainability, are not trivial tasks. With this paper, we present a new model-agnostic metric to measure the Degree of eXplainability of correct information in an objective way, exploiting a specific model from Ordinary Language Philosophy called Achinstein's Theory of Explanations. In order to understand whether this metric actually behaves as explainability is expected to, we designed a few experiments and a user study on two realistic AI-based systems for healthcare and finance, involving well-known AI technologies including Artificial Neural Networks and TreeSHAP. The results we obtained are very encouraging, suggesting that our proposed metric for measuring the Degree of eXplainability is robust across several scenarios and can eventually be exploited for a lawful Impact Assessment of an Automated Decision Making system.
    Reference-Centric Models for Grounded Collaborative Dialogue. (arXiv:2109.05042v1 [cs.CL])
    (2 min) We present a grounded neural dialogue model that successfully collaborates with people in a partially-observable reference game. We focus on a setting where two agents each observe an overlapping part of a world context and need to identify and agree on some object they share. Therefore, the agents should pool their information and communicate pragmatically to solve the task. Our dialogue agent accurately grounds referents from the partner's utterances using a structured reference resolver, conditions on these referents using a recurrent memory, and uses a pragmatic generation procedure to ensure the partner can resolve the references the agent produces. We evaluate on the OneCommon spatial grounding dialogue task (Udagawa and Aizawa 2019), involving a number of dots arranged on a board with continuously varying positions, sizes, and shades. Our agent substantially outperforms the previous state of the art for the task, obtaining a 20% relative improvement in successful task completion in self-play evaluations and a 50% relative improvement in success in human evaluations.
    MURAL: Multimodal, Multitask Retrieval Across Languages. (arXiv:2109.05125v1 [cs.IR])
    (2 min) Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
    MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets. (arXiv:2109.05184v1 [cs.MM])
    (2 min) Internet memes have become powerful means to transmit political, psychological, and socio-cultural ideas. Although memes are typically humorous, recent days have witnessed an escalation of harmful memes used for trolling, cyberbullying, and abusing social entities. Detecting such harmful memes is challenging as they can be highly satirical and cryptic. Moreover, while previous work has focused on specific aspects of memes such as hate speech and propaganda, there has been little work on harm in general, and only one specialized dataset for it. Here, we focus on bridging this gap. In particular, we aim to solve two novel tasks: detecting harmful memes and identifying the social entities they target. We further extend the recently released HarMeme dataset to generalize on two prevalent topics - COVID-19 and US politics and name the two datasets as Harm-C and Harm-P, respectively. We then propose MOMENTA (MultimOdal framework for detecting harmful MemEs aNd Their tArgets), a novel multimodal (text + image) deep neural model, which uses global and local perspectives to detect harmful memes. MOMENTA identifies the object proposals and attributes and uses a multimodal model to perceive the comprehensive context in which the objects and the entities are portrayed in a given meme. MOMENTA is interpretable and generalizable, and it outperforms numerous baselines.
    Uncovering Main Causalities for Long-tailed Information Extraction. (arXiv:2109.05213v1 [cs.CL])
    (2 min) Information Extraction (IE) aims to extract structural information from unstructured texts. In practice, long-tailed distributions caused by the selection bias of a dataset may lead to incorrect correlations, also known as spurious correlations, between entities and labels in the conventional likelihood models. This motivates us to propose counterfactual IE (CFIE), a novel framework that aims to uncover the main causalities behind data in the view of causal inference. Specifically, 1) we first introduce a unified structural causal model (SCM) for various IE tasks, describing the relationships among variables; 2) with our SCM, we then generate counterfactuals based on an explicit language structure to better calculate the direct causal effect during the inference stage; 3) we further propose a novel debiasing approach to yield more robust predictions. Experiments on three IE tasks across five public datasets show the effectiveness of our CFIE model in mitigating the spurious correlation issues.
    Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework. (arXiv:2109.05357v1 [cs.CL])
    (2 min) In this work, we study the problem of named entity recognition (NER) in a low resource scenario, focusing on few-shot and zero-shot settings. Built upon large-scale pre-trained language models, we propose a novel NER framework, namely SpanNER, which learns from natural language supervision and enables the identification of never-seen entity classes without using in-domain labeled data. We perform extensive experiments on 5 benchmark datasets and evaluate the proposed method in the few-shot learning, domain transfer and zero-shot learning settings. The experimental results show that the proposed method can bring 10%, 23% and 26% improvements in average over the best baselines in few-shot learning, domain transfer and zero-shot learning settings respectively.
    Speaker-Oriented Latent Structures for Dialogue-Based Relation Extraction. (arXiv:2109.05182v1 [cs.CL])
    (2 min) Dialogue-based relation extraction (DiaRE) aims to detect the structural information from unstructured utterances in dialogues. Existing relation extraction models may be unsatisfactory under such a conversational setting, due to the entangled logic and information sparsity issues in utterances involving multiple speakers. To this end, we introduce SOLS, a novel model which can explicitly induce speaker-oriented latent structures for better DiaRE. Specifically, we learn latent structures to capture the relationships among tokens beyond the utterance boundaries, alleviating the entangled logic issue. During the learning process, our speaker-specific regularization method progressively highlights speaker-related key clues and erases the irrelevant ones, alleviating the information sparsity issue. Experiments on three public datasets demonstrate the effectiveness of our proposed approach.
    Partially-supervised novel object captioning leveraging context from paired data. (arXiv:2109.05115v1 [cs.CV])
    (2 min) In this paper, we propose an approach to improve image captioning solutions for images with novel objects that do not have caption labels in the training dataset. Our approach is agnostic to model architecture, and primarily focuses on training technique that uses existing fully paired image-caption data and the images with only the novel object detection labels (partially paired data). We create synthetic paired captioning data for these novel objects by leveraging context from existing image-caption pairs. We further re-use these partially paired images with novel objects to create pseudo-label captions that are used to fine-tune the captioning model. Using a popular captioning model (Up-Down) as baseline, our approach achieves state-of-the-art results on held-out MS COCO out-of-domain test split, and improves F1 metric and CIDEr for novel object images by 75.8 and 26.6 points respectively, compared to baseline model that does not use partially paired images during training.
    Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. (arXiv:2109.05322v1 [cs.CL])
    (2 min) Hate speech has grown significantly on social media, causing serious consequences for victims of all demographics. Despite much attention being paid to characterize and detect discriminatory speech, most work has focused on explicit or overt hate speech, failing to address a more pervasive form based on coded or indirect language. To fill this gap, this work introduces a theoretically-justified taxonomy of implicit hate speech and a benchmark corpus with fine-grained labels for each message and its implication. We present systematic analyses of our dataset using contemporary baselines to detect and explain implicit hate speech, and we discuss key features that challenge existing models. This dataset will continue to serve as a useful benchmark for understanding this multifaceted issue.
    Natural SQL: Making SQL Easier to Infer from Natural Language Specifications. (arXiv:2109.05153v1 [cs.CL])
    (2 min) Addressing the mismatch between natural language descriptions and the corresponding SQL queries is a key challenge for text-to-SQL translation. To bridge this gap, we propose an SQL intermediate representation (IR) called Natural SQL (NatSQL). Specifically, NatSQL preserves the core functionalities of SQL, while it simplifies the queries as follows: (1) dispensing with operators and keywords such as GROUP BY, HAVING, FROM, JOIN ON, which are usually hard to find counterparts for in the text descriptions; (2) removing the need for nested subqueries and set operators; and (3) making schema linking easier by reducing the required number of schema items. On Spider, a challenging text-to-SQL benchmark that contains complex and nested SQL queries, we demonstrate that NatSQL outperforms other IRs, and significantly improves the performance of several previous SOTA models. Furthermore, for existing models that do not support executable SQL generation, NatSQL easily enables them to generate executable SQL queries, and achieves the new state-of-the-art execution accuracy.
    XCoref: Cross-document Coreference Resolution in the Wild. (arXiv:2109.05252v1 [cs.CL])
    (2 min) Datasets and methods for cross-document coreference resolution (CDCR) focus on events or entities with strict coreference relations. They lack, however, annotation and resolution of coreference mentions with more abstract or loose relations that may occur when news articles report about controversial and polarized events. Bridging and loose coreference relations trigger associations that may expose news readers to bias by word choice and labeling. For example, coreferential mentions of "direct talks between U.S. President Donald Trump and Kim" such as "an extraordinary meeting following months of heated rhetoric" or "great chance to solve a world problem" form a more positive perception of this event. A step towards bringing awareness of bias by word choice and labeling is the reliable resolution of coreferences with high lexical diversity. We propose an unsupervised CDCR method named XCoref, which capably resolves not only previously prevalent entities, such as persons, e.g., "Donald Trump," but also abstractly defined concepts, such as groups of persons, "caravan of immigrants," and events and actions, e.g., "marching to the U.S. border." In an extensive evaluation, we compare the proposed XCoref to a state-of-the-art CDCR method and a previous method, TCA, that resolves such complex coreference relations, and find that XCoref outperforms these methods. Outperforming an established CDCR model shows that new CDCR models need to be evaluated on semantically complex mentions with looser coreference relations to demonstrate their applicability to resolving mentions in the "wild" of political news articles.
    HYDRA -- Hyper Dependency Representation Attentions. (arXiv:2109.05349v1 [cs.CL])
    (2 min) Attention is all we need, as long as we have enough data. Even so, it is sometimes not easy to determine how much data is enough while models become larger and larger. In this paper, we propose HYDRA heads, lightweight pretrained linguistic self-attention heads that inject knowledge into transformer models without pretraining them again. Our approach is a balanced paradigm between leaving the models to learn unsupervised and forcing them to conform rigidly to linguistic knowledge, as suggested in previous studies. Our experiments show that the approach not only boosts model performance but is also lightweight and architecture-friendly. We empirically verify our framework on benchmark datasets to show the contribution of linguistic knowledge to a transformer model. This is a promising result for a new approach to transferring knowledge from linguistic resources into transformer-based models.
    Improved Latent Tree Induction with Distant Supervision via Span Constraints. (arXiv:2109.05112v1 [cs.CL])
    (2 min) For over thirty years, researchers have developed and analyzed methods for latent tree induction as an approach for unsupervised syntactic parsing. Nonetheless, modern systems still do not perform well enough compared to their supervised counterparts to have any practical use as structural annotation of text. In this work, we present a technique that uses distant supervision in the form of span constraints (i.e. phrase bracketing) to improve performance in unsupervised constituency parsing. Using a relatively small number of span constraints we can substantially improve the output from DIORA, an already competitive unsupervised parsing system. Compared with full parse tree annotation, span constraints can be acquired with minimal effort, such as with a lexicon derived from Wikipedia, to find exact text matches. Our experiments show span constraints based on entities improve constituency parsing on the English WSJ Penn Treebank by more than 5 F1. Furthermore, our method extends to any domain where span constraints are easily attainable, and as a case study we demonstrate its effectiveness by parsing biomedical text from the CRAFT dataset.
    Extract, Integrate, Compete: Towards Verification Style Reading Comprehension. (arXiv:2109.05149v1 [cs.CL])
    (2 min) In this paper, we present a new verification style reading comprehension dataset named VGaokao from Chinese Language tests of Gaokao. Different from existing efforts, the new dataset is originally designed for native speakers' evaluation, thus requiring more advanced language understanding skills. To address the challenges in VGaokao, we propose a novel Extract-Integrate-Compete approach, which iteratively selects complementary evidence with a novel query updating mechanism and adaptively distills supportive evidence, followed by a pairwise competition to push models to learn the subtle difference among similar text pieces. Experiments show that our methods outperform various baselines on VGaokao with retrieved complementary evidence, while having the merits of efficiency and explainability. Our dataset and code are released for further research.
    Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems. (arXiv:2109.05217v1 [cs.CL])
    (2 min) In recent years, several high-performance conversational systems have been proposed based on the Transformer encoder-decoder model. Although previous studies analyzed the effects of the model parameters and the decoding method on subjective dialogue evaluations with overall metrics, they did not analyze how differences in fine-tuning datasets affect users' detailed impressions. In addition, the Transformer-based approach has only been verified for English, not for languages such as Japanese that are linguistically distant from English. In this study, we develop large-scale Transformer-based Japanese dialogue models and Japanese chit-chat datasets to examine the effectiveness of the Transformer-based approach for building chit-chat dialogue systems. We evaluated and analyzed the impressions of human dialogues with different fine-tuning datasets, model parameters, and the use of additional information.
    TopicRefine: Joint Topic Prediction and Dialogue Response Generation for Multi-turn End-to-End Dialogue System. (arXiv:2109.05187v1 [cs.CL])
    (2 min) A multi-turn dialogue always follows a specific topic thread, and topic shift at the discourse level occurs naturally as the conversation progresses, necessitating the model's ability to capture different topics and generate topic-aware responses. Previous research has either predicted the topic first and then generated the relevant response, or simply applied the attention mechanism to all topics, ignoring the joint distribution of the topic prediction and response generation models and resulting in uncontrollable and unrelated responses. In this paper, we propose a joint framework with a topic refinement mechanism to learn these two tasks simultaneously. Specifically, we design a three-pass iteration mechanism that generates a coarse response first, then predicts the corresponding topics, and finally generates a refined response conditioned on the predicted topics. Moreover, we utilize GPT2DoubleHeads and BERT for the topic prediction task, respectively, aiming to investigate the effects of joint learning and the understanding ability of the GPT model. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on the response generation task and shows the strong understanding capability of the GPT model.
    D-REX: Dialogue Relation Extraction with Explanations. (arXiv:2109.05126v1 [cs.CL])
    (2 min) Existing research studies on cross-sentence relation extraction in long-form multi-party conversations aim to improve relation extraction without considering the explainability of such methods. This work addresses that gap by focusing on extracting explanations that indicate that a relation exists while using only partially labeled data. We propose our model-agnostic framework, D-REX, a policy-guided semi-supervised algorithm that explains and ranks relations. We frame relation extraction as a re-ranking task and include relation- and entity-specific explanations as an intermediate step of the inference process. We find that about 90% of the time, human annotators prefer D-REX's explanations over a strong BERT-based joint relation extraction and explanation model. Finally, our evaluations on a dialogue relation extraction dataset show that our method is simple yet effective and achieves a state-of-the-art F1 score on relation extraction, improving upon existing methods by 13.5%.
    COMBO: State-of-the-Art Morphosyntactic Analysis. (arXiv:2109.05361v1 [cs.CL])
    (2 min) We introduce COMBO - a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features whilst also exposing their vector representations, extracted from hidden layers. COMBO is an easy-to-install Python package with automatically downloadable pre-trained models for over 40 languages. It maintains a balance between efficiency and quality. As it is an end-to-end system and its modules are jointly trained, its training is competitively fast. As its models are optimised for accuracy, they often achieve better prediction quality than SOTA. The COMBO library is available at: https://gitlab.clarin-pl.eu/syntactic-tools/combo.
    Pairwise Supervised Contrastive Learning of Sentence Representations. (arXiv:2109.05424v1 [cs.CL])
    (2 min) Many recent successes in sentence representation learning have been achieved by simply fine-tuning on the Natural Language Inference (NLI) datasets with triplet loss or siamese loss. Nevertheless, they share a common weakness: sentences in a contradiction pair are not necessarily from different semantic categories. Therefore, optimizing the semantic entailment and contradiction reasoning objective alone is inadequate to capture the high-level semantic structure. The drawback is compounded by the fact that the vanilla siamese or triplet losses only learn from individual sentence pairs or triplets, which often suffer from bad local optima. In this paper, we propose PairSupCon, an instance discrimination based approach aiming to bridge semantic entailment and contradiction understanding with high-level categorical concept encoding. We evaluate PairSupCon on various downstream tasks that involve understanding sentence semantics at different granularities. We outperform the previous state-of-the-art method with 10%-13% averaged improvement on eight clustering tasks, and 5%-6% averaged improvement on seven semantic textual similarity (STS) tasks.
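    The snippet below sketches a generic in-batch supervised contrastive objective of the kind PairSupCon builds on: embeddings sharing a label attract, while all others repel. It is a simplified stand-in, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """In-batch supervised contrastive loss: embeddings with the same label attract,
    all others repel. A simplified stand-in for a PairSupCon-style objective."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                                 # scaled cosine similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)             # positives share a label
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    mask = mask & ~self_mask                                      # exclude self-pairs
    denom = torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - denom
    pos_counts = mask.sum(1).clamp(min=1)
    loss = -(log_prob * mask).sum(1) / pos_counts                 # mean log-likelihood of positives
    return loss[mask.sum(1) > 0].mean()

# Toy batch: four sentence embeddings, two semantic classes.
loss = supervised_contrastive_loss(torch.randn(4, 16), torch.tensor([0, 0, 1, 1]))
print(loss.item())
```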
    Eliciting Knowledge from Language Models for Event Extraction. (arXiv:2109.05190v1 [cs.CL])
    (2 min) Eliciting knowledge contained in language models via prompt-based learning has shown great potential in many natural language processing tasks, such as text classification and generation. However, applications to more complex tasks such as event extraction are less studied, since prompt design is not straightforward given the complicated event types and arguments. In this paper, we explore eliciting knowledge from pre-trained language models for event trigger detection and argument extraction. Specifically, we present various joint trigger/argument prompt methods, which can elicit more complementary knowledge by modeling the interactions between different triggers or arguments. Experimental results on the benchmark dataset ACE2005 show the great advantages of our proposed approach. In particular, our approach is superior to recent advanced methods in the few-shot scenario where only a few samples are used for training.
    HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge. (arXiv:2109.05097v1 [cs.CL])
    (2 min) A hyperbole is an intentional and creative exaggeration not to be taken literally. Despite its ubiquity in daily life, the computational explorations of hyperboles are scarce. In this paper, we tackle the under-explored and challenging task: sentence-level hyperbole generation. We start with a representative syntactic pattern for intensification and systematically study the semantic (commonsense and counterfactual) relationships between each component in such hyperboles. Next, we leverage the COMeT and reverse COMeT models to do commonsense and counterfactual inference. We then generate multiple hyperbole candidates based on our findings from the pattern, and train neural classifiers to rank and select high-quality hyperboles. Automatic and human evaluations show that our generation method is able to generate hyperboles creatively with high success rate and intensity scores.
    AdaK-NER: An Adaptive Top-K Approach for Named Entity Recognition with Incomplete Annotations. (arXiv:2109.05233v1 [cs.CL])
    (2 min) State-of-the-art Named Entity Recognition (NER) models rely heavily on large amounts of fully annotated training data. However, accessible data are often incompletely annotated since the annotators usually lack comprehensive knowledge of the target domain. Normally the unannotated tokens are regarded as non-entities by default, while we underline that these tokens could either be non-entities or part of any entity. Here, we study NER modeling with incompletely annotated data where only a fraction of the named entities are labeled, and the unlabeled tokens are equivalently multi-labeled by every possible label. Taking multi-labeled tokens into account, the numerous possible paths can distract the training model from the gold path (ground-truth label sequence), and thus hinder the learning ability. In this paper, we propose AdaK-NER, named the adaptive top-K approach, to help the model focus on a smaller feasible region where the gold path is more likely to be located. We demonstrate the superiority of our approach through extensive experiments on both English and Chinese datasets, on average improving F-score by 2% on CoNLL-2003 and by over 10% on two Chinese datasets compared with prior state-of-the-art works.
    Prior Omission of Dissimilar Source Domain(s) for Cost-Effective Few-Shot Learning. (arXiv:2109.05234v1 [cs.CL])
    (2 min) Few-shot slot tagging is an emerging research topic in the field of Natural Language Understanding (NLU). With sufficient annotated data from source domains, the key challenge is how to train and adapt the model to another target domain that has only a few labels. Conventional few-shot approaches use all the data from the source domains without considering inter-domain relations and implicitly assume each sample in the domain contributes equally. However, our experiments show that the data distribution bias among different domains significantly affects adaptation performance. Moreover, transferring knowledge from dissimilar domains can even introduce extra noise that hurts model performance. To tackle this problem, we propose an effective similarity-based method to select data from the source domains. In addition, we propose a Shared-Private Network (SP-Net) for the few-shot slot tagging task. Words from the same class share some common features; we extract those shared features from the limited annotated data in the target domain and merge them as label embeddings to help predict other unlabelled data in the target domain. Experiments show that our method outperforms state-of-the-art approaches while using less source data. The results also show that some training data from dissimilar sources are redundant and even harmful for adaptation.
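    A toy sketch of similarity-based source selection in the spirit described above: keep only source domains whose centroid representation is cosine-similar enough to the target domain centroid. The vectors and threshold are hypothetical placeholders, not the paper's actual criterion.

```python
import numpy as np

def select_source_domains(target_vec, source_vecs, threshold=0.5):
    """Return source domains (and their similarity) whose centroid embedding is at least
    `threshold` cosine-similar to the target domain centroid."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {name: cos(target_vec, vec) for name, vec in source_vecs.items()
            if cos(target_vec, vec) >= threshold}

# Hypothetical domain centroids (e.g., averaged sentence embeddings per domain).
rng = np.random.default_rng(0)
sources = {"hotel": rng.normal(size=16), "weather": rng.normal(size=16)}
print(select_source_domains(rng.normal(size=16), sources, threshold=0.0))
```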
    Speaker Turn Modeling for Dialogue Act Classification. (arXiv:2109.05056v1 [cs.CL])
    (2 min) Dialogue Act (DA) classification is the task of classifying utterances with respect to the function they serve in a dialogue. Existing approaches to DA classification model utterances without incorporating the turn changes among speakers throughout the dialogue, therefore treating it no differently from non-interactive written text. In this paper, we propose to integrate the turn changes in conversations among speakers when modeling DAs. Specifically, we learn conversation-invariant speaker turn embeddings to represent the speaker turns in a conversation; the learned speaker turn embeddings are then merged with the utterance embeddings for the downstream task of DA classification. With this simple yet effective mechanism, our model is able to capture the semantics from the dialogue content while accounting for different speaker turns in a conversation. Validation on three benchmark public datasets demonstrates the superior performance of our model.
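    A minimal sketch of the merging step: a learned speaker-turn embedding (e.g., same speaker vs. turn change) concatenated with the utterance encoding before the dialogue act classifier. Dimensions and names are illustrative, not the paper's full model.

```python
import torch
import torch.nn as nn

class SpeakerTurnDAClassifier(nn.Module):
    """Utterance encoding concatenated with a learned speaker-turn embedding before the DA head."""
    def __init__(self, utt_dim=768, turn_dim=32, n_turn_states=2, n_acts=42):
        super().__init__()
        self.turn_embedding = nn.Embedding(n_turn_states, turn_dim)  # 0 = same speaker, 1 = turn change
        self.classifier = nn.Linear(utt_dim + turn_dim, n_acts)

    def forward(self, utt_vecs, turn_ids):
        # utt_vecs: (batch, utt_dim) sentence encodings; turn_ids: (batch,) turn-change indicators
        merged = torch.cat([utt_vecs, self.turn_embedding(turn_ids)], dim=-1)
        return self.classifier(merged)

model = SpeakerTurnDAClassifier()
logits = model(torch.randn(8, 768), torch.randint(0, 2, (8,)))
print(logits.shape)  # torch.Size([8, 42])
```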
    Asking Questions Like Educational Experts: Automatically Generating Question-Answer Pairs on Real-World Examination Data. (arXiv:2109.05179v1 [cs.CL])
    (2 min) Generating high-quality question-answer pairs is a hard but meaningful task. Although previous works have achieved great results on answer-aware question generation, it is difficult to apply them in practical applications in the education field. This paper is the first to address the question-answer pair generation task on real-world examination data, and proposes a new unified framework on RACE. To capture the important information of the input passage, we first automatically generate (rather than extract) keyphrases, thus reducing this task to keyphrase-question-answer triplet joint generation. Accordingly, we propose a multi-agent communication model to generate and optimize the question and keyphrases iteratively, and then apply the generated question and keyphrases to guide the generation of answers. To establish a solid benchmark, we build our model on a strong generative pre-training model. Experimental results show that our model makes substantial progress on the question-answer pair generation task. Moreover, we provide a comprehensive analysis of our model, suggesting new directions for this challenging task.
    Entity-Based Knowledge Conflicts in Question Answering. (arXiv:2109.05052v1 [cs.CL])
    (2 min) Knowledge-dependent tasks typically use two sources of knowledge: parametric, learned at training time, and contextual, given as a passage at inference time. To understand how models use these sources together, we formalize the problem of knowledge conflicts, where the contextual information contradicts the learned information. Analyzing the behaviour of popular models, we measure their over-reliance on memorized information (the cause of hallucinations), and uncover important factors that exacerbate this behaviour. Lastly, we propose a simple method to mitigate over-reliance on parametric knowledge, which minimizes hallucination, and improves out-of-distribution generalization by 4%-7%. Our findings demonstrate the importance for practitioners to evaluate model tendency to hallucinate rather than read, and show that our mitigation strategy encourages generalization to evolving information (i.e., time-dependent queries). To encourage these practices, we have released our framework for generating knowledge conflicts.
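    One simple way to construct such a knowledge conflict for probing (not necessarily the authors' exact procedure) is to substitute the answer entity in the gold passage with another entity of the same type, then check whether the model answers from the passage or from its parametric memory.

```python
def make_knowledge_conflict(context: str, answer: str, substitute: str):
    """Create a conflicting context by swapping the answer entity for another same-type entity.
    Illustrative only; a real pipeline would match entity types and handle surface variants."""
    conflicted_context = context.replace(answer, substitute)
    return conflicted_context, substitute  # the substitute becomes the new in-context answer

ctx = "Marie Curie won the Nobel Prize in Physics in 1903."
print(make_knowledge_conflict(ctx, "Marie Curie", "Ada Lovelace"))
```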
    Enhancing Self-Disclosure In Neural Dialog Models By Candidate Re-ranking. (arXiv:2109.05090v1 [cs.CL])
    (2 min) Neural language modelling has advanced the state of the art in different downstream Natural Language Processing (NLP) tasks. One such area is open-domain dialog modelling: neural dialog models based on GPT-2, such as DialoGPT, have shown promising performance in single-turn conversation. However, such (neural) dialog models have been criticized for generating responses which, although possibly relevant to the previous human response, tend to quickly dissipate human interest and descend into trivial conversation. One reason for such performance is the lack of an explicit conversation strategy in human-machine conversation. Humans employ a range of conversation strategies while engaging in a conversation; one key social strategy is self-disclosure (SD), the phenomenon of revealing information about oneself to others. Social penetration theory (SPT) proposes that communication between two people moves from shallow to deeper levels as the relationship progresses, primarily through self-disclosure. Disclosure helps in creating rapport among the participants engaged in a conversation. In this paper, a Self-Disclosure Enhancement Architecture (SDEA) is introduced that utilizes a Self-Disclosure Topic Model (SDTM) during the inference stage of a neural dialog model to re-rank response candidates and enhance self-disclosure in single-turn responses from the model.
    StreamHover: Livestream Transcript Summarization and Annotation. (arXiv:2109.05160v1 [cs.CL])
    (2 min) With the explosive growth of livestream broadcasting, there is an urgent need for new summarization technology that enables us to create a preview of streamed content and tap into this wealth of knowledge. However, the problem is nontrivial due to the informal nature of spoken language. Further, there has been a shortage of annotated datasets that are necessary for transcript summarization. In this paper, we present StreamHover, a framework for annotating and summarizing livestream transcripts. With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora. We explore a neural extractive summarization model that leverages vector-quantized variational autoencoder to learn latent vector representations of spoken utterances and identify salient utterances from the transcripts to form summaries. We show that our model generalizes better and improves performance over strong baselines. The results of this study provide an avenue for future research to improve summarization solutions for efficient browsing of livestreams.
    Refocusing on Relevance: Personalization in NLG. (arXiv:2109.05140v1 [cs.CL])
    (2 min) Many NLG tasks such as summarization, dialogue response, or open domain question answering focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user's intent or context of work is not easily recoverable based solely on that source text -- a scenario that we argue is more of the rule than the exception. In this work, we argue that NLG systems in general should place a much higher level of emphasis on making use of additional context, and suggest that relevance (as used in Information Retrieval) be thought of as a crucial tool for designing user-oriented text-generating tasks. We further discuss possible harms and hazards around such personalization, and argue that value-sensitive design represents a crucial path forward through these challenges.
    Towards Zero-shot Commonsense Reasoning with Self-supervised Refinement of Language Models. (arXiv:2109.05105v1 [cs.CL])
    (2 min) Can we take existing language models and refine them for zero-shot commonsense reasoning? This paper presents an initial study exploring the feasibility of zero-shot commonsense reasoning for the Winograd Schema Challenge by formulating the task as self-supervised refinement of a pre-trained language model. In contrast to previous studies that rely on fine-tuning annotated datasets, we seek to boost conceptualization via loss landscape refinement. To this end, we propose a novel self-supervised learning approach that refines the language model utilizing a set of linguistic perturbations of similar concept relationships. Empirical analysis of our conceptually simple framework demonstrates the viability of zero-shot commonsense reasoning on multiple benchmarks.
    A Survey on Multi-modal Summarization. (arXiv:2109.05199v1 [cs.CL])
    (2 min) The new era of technology has brought us to the point where it is convenient for people to share their opinions over an abundance of platforms. These platforms have a provision for the users to express themselves in multiple forms of representations, including text, images, videos, and audio. This, however, makes it difficult for users to obtain all the key information about a topic, making the task of automatic multi-modal summarization (MMS) essential. In this paper, we present a comprehensive survey of the existing research in the area of MMS.
    Attention-based Contrastive Learning for Winograd Schemas. (arXiv:2109.05108v1 [cs.CL])
    (2 min) Self-supervised learning has recently attracted considerable attention in the NLP community for its ability to learn discriminative features using a contrastive objective. This paper investigates whether contrastive learning can be extended to Transformer attention to tackle the Winograd Schema Challenge. To this end, we propose a novel self-supervised framework, leveraging a contrastive loss directly at the level of self-attention. Experimental analysis of our attention-based models on multiple datasets demonstrates superior commonsense reasoning capabilities. The proposed approach outperforms all comparable unsupervised approaches while occasionally surpassing supervised ones.
    FBERT: A Neural Transformer for Identifying Offensive Content. (arXiv:2109.05074v1 [cs.CL])
    (2 min) Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances. We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.
    PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. (arXiv:2109.05093v1 [cs.CL])
    (2 min) Large pre-trained language models for textual data have an unconstrained output space; at each decoding step, they can produce any of 10,000s of sub-word tokens. When fine-tuned to target constrained formal languages like SQL, these models often generate invalid code, rendering it unusable. We propose PICARD (code and trained models available at https://github.com/ElementAI/picard), a method for constraining auto-regressive decoders of language models through incremental parsing. PICARD helps to find valid output sequences by rejecting inadmissible tokens at each decoding step. On the challenging Spider and CoSQL text-to-SQL translation tasks, we show that PICARD transforms fine-tuned T5 models with passable performance into state-of-the-art solutions.
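    The snippet below sketches the general idea of rejection-based constrained decoding, with a toy is_valid_prefix check standing in for PICARD's incremental SQL parser; the vocabulary, scores, and target statement are illustrative assumptions rather than the released implementation:
        # Toy greedy decoder that rejects tokens whose continuation cannot be a valid prefix.
        vocab = ["SELECT", "*", "FROM", "t", ";", "banana"]

        def is_valid_prefix(tokens):
            # Stand-in for an incremental parser: accept prefixes of "SELECT * FROM t ;".
            target = ["SELECT", "*", "FROM", "t", ";"]
            return tokens == target[:len(tokens)]

        def fake_scores(tokens):
            # Pretend the model always prefers the invalid token "banana".
            return {tok: (0.9 if tok == "banana" else 0.5) for tok in vocab}

        def constrained_greedy_decode(max_len=5):
            tokens = []
            for _ in range(max_len):
                scores = fake_scores(tokens)
                admissible = {t: s for t, s in scores.items()
                              if is_valid_prefix(tokens + [t])}
                if not admissible:
                    break
                tokens.append(max(admissible, key=admissible.get))
            return tokens

        print(constrained_greedy_decode())  # ['SELECT', '*', 'FROM', 't', ';']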
    Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization. (arXiv:2109.05157v1 [cs.CL])
    (2 min) Recently, there has been significant progress in studying neural networks for translating text descriptions into SQL queries under the zero-shot cross-domain setting. Despite achieving good performance on some public benchmarks, we observe that existing text-to-SQL models do not generalize when facing domain knowledge that does not frequently appear in the training data, which can degrade prediction performance on unseen domains. In this work, we investigate the robustness of text-to-SQL models when the questions require rarely observed domain knowledge. In particular, we define five types of domain knowledge and introduce Spider-DK (DK is the abbreviation of domain knowledge), a human-curated dataset based on the Spider benchmark for text-to-SQL translation. The natural language questions in Spider-DK are selected from Spider, and we modify some samples by adding domain knowledge that reflects real-world question paraphrases. We demonstrate that prediction accuracy drops dramatically on samples that require such domain knowledge, even if the domain knowledge appears in the training set and the model provides correct predictions for related training samples.
  • cs.CV updates on arXiv.org

    FINet: Dual Branches Feature Interaction for Partial-to-Partial Point Cloud Registration. (arXiv:2106.03479v2 [cs.CV] UPDATED)
    (2 min) Data association is important in point cloud registration. In this work, we propose to solve the partial-to-partial registration from a new perspective, by introducing multi-level feature interactions between the source and the reference clouds at the feature extraction stage, such that the registration can be realized without attention mechanisms or explicit mask estimation for overlap detection, as adopted previously. Specifically, we present FINet, a feature interaction-based structure with the capability to enable and strengthen the information association between the inputs at multiple stages. To achieve this, we first split the features into two components, one for rotation and one for translation, based on the fact that they belong to different solution spaces, yielding a dual-branch structure. Second, we insert several interaction modules into the feature extractor for the data association. Third, we propose a transformation sensitivity loss to obtain rotation-attentive and translation-attentive features. Experiments demonstrate that our method achieves higher precision and robustness than state-of-the-art traditional and learning-based methods. Code will be available at https://github.com/HaoXu-Work/FINet.
    Contrastive Quantization with Code Memory for Unsupervised Image Retrieval. (arXiv:2109.05205v1 [cs.CV])
    (2 min) The high efficiency in computation and storage makes hashing (including binary hashing and quantization) a common strategy in large-scale retrieval systems. To alleviate the reliance on expensive annotations, unsupervised deep hashing becomes an important research problem. This paper provides a novel solution to unsupervised deep quantization, namely Contrastive Quantization with Code Memory (MeCoQ). Different from existing reconstruction-based strategies, we learn unsupervised binary descriptors by contrastive learning, which can better capture discriminative visual semantics. Besides, we uncover that codeword diversity regularization is critical to prevent contrastive learning-based quantization from model degeneration. Moreover, we introduce a novel quantization code memory module that boosts contrastive learning with lower feature drift than conventional feature memories. Extensive experiments on benchmark datasets show that MeCoQ outperforms state-of-the-art methods.
    MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans. (arXiv:2103.11161v2 [cs.CV] UPDATED)
    (2 min) We propose a novel method for reconstructing floor plans from noisy 3D point clouds. Our main contribution is a principled approach that relies on the Monte Carlo Tree Search (MCTS) algorithm to maximize a suitable objective function efficiently despite the complexity of the problem. Like previous work, we first project the input point cloud to a top view to create a density map and extract room proposals from it. Our method selects and optimizes the polygonal shapes of these room proposals jointly to fit the density map and outputs an accurate vectorized floor map even for large complex scenes. To do this, we adapted MCTS, an algorithm originally designed to learn to play games, to select the room proposals by maximizing an objective function combining the fitness with the density map as predicted by a deep network and regularizing terms on the room shapes. We also introduce a refinement step to MCTS that adjusts the shape of the room proposals. For this step, we propose a novel differentiable method for rendering the polygonal shapes of these proposals. We evaluate our method on the recent and challenging Structured3D and Floor-SP datasets and show a significant improvement over the state-of-the-art, without imposing any hard constraints nor assumptions on the floor plan configurations.
    Cycle-Consistent Inverse GAN for Text-to-Image Synthesis. (arXiv:2108.01361v2 [cs.CV] UPDATED)
    (2 min) This paper investigates an open research task of text-to-image synthesis for automatically generating or manipulating images from text descriptions. Prevailing methods mainly use the text as conditions for GAN generation, and train different models for the text-guided image generation and manipulation tasks. In this paper, we propose a novel unified framework of Cycle-consistent Inverse GAN (CI-GAN) for both text-to-image generation and text-guided image manipulation tasks. Specifically, we first train a GAN model without text input, aiming to generate images with high diversity and quality. Then we learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image, where we introduce the cycle-consistency training to learn more robust and consistent inverted latent codes. We further uncover the latent space semantics of the trained GAN model, by learning a similarity model between text representations and the latent codes. In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes. Extensive experiments on the Recipe1M and CUB datasets validate the efficacy of our proposed framework.
    MovieCuts: A New Dataset and Benchmark for Cut Type Recognition. (arXiv:2109.05569v1 [cs.CV])
    (0 min) Understanding movies and their structural patterns is a crucial task to decode the craft of video editing. While previous works have developed tools for general analysis such as detecting characters or recognizing cinematography properties at the shot level, less effort has been devoted to understanding the most basic video edit, the Cut. This paper introduces the cut type recognition task, which requires modeling of multi-modal information. To ignite research in the new task, we construct a large-scale dataset called MovieCuts, which contains more than 170K video clips labeled among ten cut types. We benchmark a series of audio-visual approaches, including some that deal with the problem's multi-modal and multi-label nature. Our best model achieves 45.7% mAP, which suggests that the task is challenging and that attaining highly accurate cut type recognition is an open research problem.
    EventPoint: Self-Supervised Local Descriptor Learning for Event Cameras. (arXiv:2109.00210v2 [cs.CV] UPDATED)
    (0 min) We propose a method, called EventPoint, for extracting interest points and descriptors from frame-based event data using self-supervised learning. Unlike other feature extraction methods for event data, we train our model on DSEC, a real driving dataset in event form, with the self-supervised training procedure we propose, which fully considers the characteristics of event data. To verify the effectiveness of our work, we conducted several evaluations: we emulated DART and carried out feature-matching experiments on the N-Caltech101 dataset, and the results show that EventPoint outperforms DART; we used the Vid2e tool provided by UZH to convert the Oxford RobotCar data into event-based format and, combined with the provided INS information, carried out a global pose estimation experiment, which is important in SLAM. To the best of our knowledge, this is the first work to tackle this challenging task. Extensive experimental results show that EventPoint obtains better results while running in real time on a CPU.
    Attention Augmented ConvLSTM for Environment Prediction. (arXiv:2010.09662v3 [cs.CV] UPDATED)
    (0 min) Safe and proactive planning in robotic systems generally requires accurate predictions of the environment. Prior work on environment prediction applied video frame prediction techniques to bird's-eye view environment representations, such as occupancy grids. ConvLSTM-based frameworks used previously often result in significant blurring and vanishing of moving objects, thus hindering their applicability for use in safety-critical applications. In this work, we propose two extensions to the ConvLSTM to address these issues. We present the Temporal Attention Augmented ConvLSTM (TAAConvLSTM) and Self-Attention Augmented ConvLSTM (SAAConvLSTM) frameworks for spatiotemporal occupancy prediction, and demonstrate improved performance over baseline architectures on the real-world KITTI and Waymo datasets.
    Learning Statistical Representation with Joint Deep Embedded Clustering. (arXiv:2109.05232v1 [cs.CV])
    (0 min) One of the most promising approaches for unsupervised learning is combining deep representation learning and deep clustering. Some recent works propose to simultaneously learn representations using deep neural networks and perform clustering by defining a clustering loss on top of embedded features. However, these approaches are sensitive to imbalanced data and out-of-distribution samples, since they optimize clustering by pushing data close to randomly initialized cluster centers. This is problematic when the number of instances varies greatly across classes, or when a cluster with few samples has less chance of being assigned a good centroid. To overcome these limitations, we introduce StatDEC, a new unsupervised framework for joint statistical representation learning and clustering. StatDEC simultaneously trains two deep learning models, a deep statistics network that captures the data distribution, and a deep clustering network that learns embedded features and performs clustering by explicitly defining a clustering loss. Specifically, the clustering network and representation network both take advantage of our proposed statistics pooling layer, which represents mean, variance, and cardinality, to handle out-of-distribution samples as well as class imbalance. Our experiments show that using these representations, one can considerably improve results on imbalanced image clustering across a variety of image datasets. Moreover, the learned representations generalize well when transferred to an out-of-distribution dataset.
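    As a rough illustration of the statistics-pooling idea mentioned above, the sketch below aggregates mean, variance, and a cardinality cue over a set of embedded features; the shapes and the log-cardinality encoding are assumptions, not the paper's exact layer:
        # Illustrative statistics pooling over a set of per-sample embeddings.
        import numpy as np

        def statistics_pool(features):
            """features: (n_samples, dim) -> vector of [mean, variance, log-cardinality]."""
            mean = features.mean(axis=0)
            var = features.var(axis=0)
            cardinality = np.array([np.log1p(features.shape[0])])  # set-size cue
            return np.concatenate([mean, var, cardinality])

        rng = np.random.default_rng(0)
        majority_cluster = rng.normal(0.0, 1.0, size=(200, 8))
        minority_cluster = rng.normal(3.0, 0.5, size=(12, 8))
        print(statistics_pool(majority_cluster).shape)   # (17,)
        print(statistics_pool(minority_cluster)[-1])     # smaller cardinality cue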
    Channelized Axial Attention for Semantic Segmentation -- Considering Channel Relation within Spatial Attention for Semantic Segmentation. (arXiv:2101.07434v5 [cs.CV] UPDATED)
    (2 min) Spatial and channel attentions, modelling the semantic interdependencies in spatial and channel dimensions respectively, have recently been widely used for semantic segmentation. However, computing spatial and channel attentions separately sometimes causes errors, especially for those difficult cases. In this paper, we propose Channelized Axial Attention (CAA) to seamlessly integrate channel attention and spatial attention into a single operation with negligible computation overhead. Specifically, we break down the dot-product operation of the spatial attention into two parts and insert channel relation in between, allowing for independently optimized channel attention on each spatial location. We further develop grouped vectorization, which allows our model to run with very little memory consumption without slowing down the running speed. Comparative experiments conducted on multiple benchmark datasets, including Cityscapes, PASCAL Context, and COCO-Stuff, demonstrate that our CAA outperforms many state-of-the-art segmentation models (including dual attention) on all tested datasets.
    Instance-Conditioned GAN. (arXiv:2109.05070v1 [cs.CV])
    (0 min) Generative Adversarial Networks (GANs) can generate near-photorealistic images in narrow domains such as human faces. Yet, modeling complex distributions of datasets such as ImageNet and COCO-Stuff remains challenging in unconditional settings. In this paper, we take inspiration from kernel density estimation techniques and introduce a non-parametric approach to modeling distributions of complex datasets. We partition the data manifold into a mixture of overlapping neighborhoods described by a datapoint and its nearest neighbors, and introduce a model, called instance-conditioned GAN (IC-GAN), which learns the distribution around each datapoint. Experimental results on ImageNet and COCO-Stuff show that IC-GAN significantly improves over unconditional models and unsupervised data partitioning baselines. Moreover, we show that IC-GAN can effortlessly transfer to datasets not seen during training by simply changing the conditioning instances, and still generate realistic images. Finally, we extend IC-GAN to the class-conditional case and show semantically controllable generation and competitive quantitative results on ImageNet, while improving over BigGAN on ImageNet-LT. We will open-source our code and trained models to reproduce the reported results.
    Alleviating Mode Collapse in GAN via Diversity Penalty Module. (arXiv:2108.02353v4 [cs.CV] UPDATED)
    (3 min) The vanilla GAN (Goodfellow et al. 2014) suffers severely from mode collapse, which usually manifests as generated images that are highly similar to one another even though their corresponding latent vectors are very different. In this paper, we introduce a pluggable diversity penalty module (DPM) to alleviate mode collapse in GANs. It reduces the similarity of image pairs in feature space, i.e., if two latent vectors are different, then we enforce the generator to generate two images with different features. The normalized Gram matrix is used to measure the similarity. We compare the proposed method with Unrolled GAN (Metz et al. 2016), BourGAN (Xiao, Zhong, and Zheng 2018), PacGAN (Lin et al. 2018), VEEGAN (Srivastava et al. 2017) and ALI (Dumoulin et al. 2016) on a 2D synthetic dataset, and the results show that the diversity penalty module helps the GAN capture many more modes of the data distribution. Further, in classification tasks, we apply this method as image data augmentation on MNIST, Fashion-MNIST and CIFAR-10, and the classification test accuracy improves by 0.24%, 1.34% and 0.52%, respectively, compared with WGAN-GP (Gulrajani et al. 2017). In domain translation, the diversity penalty module helps StarGAN (Choi et al. 2018) generate more accurate attention masks and accelerates convergence. Finally, we quantitatively evaluate the proposed method with IS and FID on CelebA, CIFAR-10, MNIST and Fashion-MNIST, and the results suggest that a GAN with the diversity penalty module achieves a much higher IS and lower FID than several SOTA GAN architectures.
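    A toy version of a diversity penalty in this spirit appears below: the off-diagonal entries of a normalized Gram (cosine) matrix over generated-image features are averaged, so a collapsed batch scores high and a diverse batch scores near zero. The feature extractor and batch values are illustrative, not the paper's exact formulation:
        # Illustrative diversity penalty: average pairwise feature similarity within a batch.
        import numpy as np

        def diversity_penalty(features):
            """features: (batch, dim) generator features -> mean off-diagonal similarity."""
            normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
            gram = normed @ normed.T                      # normalized Gram (cosine) matrix
            off_diag = gram - np.diag(np.diag(gram))
            b = features.shape[0]
            return float(off_diag.sum() / (b * (b - 1)))

        rng = np.random.default_rng(0)
        collapsed = np.tile(rng.normal(size=(1, 32)), (8, 1)) + 0.01 * rng.normal(size=(8, 32))
        diverse = rng.normal(size=(8, 32))
        print(diversity_penalty(collapsed), diversity_penalty(diverse))  # high vs. near zero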
    Understanding Guided Image Captioning Performance across Domains. (arXiv:2012.02339v2 [cs.CV] UPDATED)
    (2 min) Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. On the other hand, VQA models generally lack the ability to provide long descriptive answers, while expecting the textual question to be quite precise. We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text that refers to either groundable or ungroundable concepts in the image. Our model consists of a Transformer-based multimodal encoder that uses the guiding text together with global and object-level image features to derive early-fusion representations used to generate the guided caption. While models trained on Visual Genome data have an in-domain advantage of fitting well when guided with automatic object labels, we find that guided captioning models trained on Conceptual Captions generalize better on out-of-domain images and guiding texts. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.
    Perceptually Optimized Deep High-Dynamic-Range Image Tone Mapping. (arXiv:2109.00180v3 [cs.CV] UPDATED)
    (2 min) We describe a deep high-dynamic-range (HDR) image tone mapping operator that is computationally efficient and perceptually optimized. We first decompose an HDR image into a normalized Laplacian pyramid, and use two deep neural networks (DNNs) to estimate the Laplacian pyramid of the desired tone-mapped image from the normalized representation. We then end-to-end optimize the entire method over a database of HDR images by minimizing the normalized Laplacian pyramid distance (NLPD), a recently proposed perceptual metric. Qualitative and quantitative experiments demonstrate that our method produces images with better visual quality, and runs the fastest among existing local tone mapping algorithms.
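    For readers unfamiliar with the decomposition step, a minimal Laplacian-pyramid construction on a luminance image might look like the sketch below (NumPy only, simple box down/upsampling, power-of-two image size assumed; the normalization used in NLPD is not reproduced):
        # Minimal Laplacian pyramid with box-filter down/upsampling (illustrative only).
        import numpy as np

        def downsample(img):
            h, w = img.shape
            return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

        def upsample(img):
            return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

        def laplacian_pyramid(img, levels=3):
            pyramid, current = [], img.astype(float)
            for _ in range(levels - 1):
                low = downsample(current)
                pyramid.append(current - upsample(low))   # band-pass detail layer
                current = low
            pyramid.append(current)                        # coarsest residual
            return pyramid

        hdr_luminance = np.random.default_rng(0).random((64, 64)) ** 4  # toy HDR-like image
        print([level.shape for level in laplacian_pyramid(hdr_luminance)])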
    TypeNet: Deep Learning Keystroke Biometrics. (arXiv:2101.05570v3 [cs.CV] UPDATED)
    (2 min) We study the performance of Long Short-Term Memory networks for keystroke biometric authentication at large scale in free-text scenarios. For this we explore the performance of Long Short-Term Memory (LSTMs) networks trained with a moderate number of keystrokes per identity and evaluated under different scenarios including: i) three learning approaches depending on the loss function (softmax, contrastive, and triplet loss); ii) different number of training samples and lengths of keystroke sequences; iii) four databases based on two device types (physical vs touchscreen keyboard); and iv) comparison with existing approaches based on both traditional statistical methods and deep learning architectures. Our approach called TypeNet achieves state-of-the-art keystroke biometric authentication performance with an Equal Error Rate of 2.2% and 9.2% for physical and touchscreen keyboards, respectively, significantly outperforming previous approaches. Our experiments demonstrate a moderate increase in error with up to 100,000 subjects, demonstrating the potential of TypeNet to operate at an Internet scale. To the best of our knowledge, the databases used in this work are the largest existing free-text keystroke databases available for research with more than 136 million keystrokes from 168,000 subjects in physical keyboards, and 60,000 subjects with more than 63 million keystrokes acquired on mobile touchscreens.
    Physical Adversarial Attacks on an Aerial Imagery Object Detector. (arXiv:2108.11765v2 [cs.CV] UPDATED)
    (2 min) Deep neural networks (DNNs) have become essential for processing the vast amounts of aerial imagery collected using earth-observing satellite platforms. However, DNNs are vulnerable towards adversarial examples, and it is expected that this weakness also plagues DNNs for aerial imagery. In this work, we demonstrate one of the first efforts on physical adversarial attacks on aerial imagery, whereby adversarial patches were optimised, fabricated and installed on or near target objects (cars) to significantly reduce the efficacy of an object detector applied on overhead images. Physical adversarial attacks on aerial images, particularly those captured from satellite platforms, are challenged by atmospheric factors (lighting, weather, seasons) and the distance between the observer and target. To investigate the effects of these challenges, we devised novel experiments and metrics to evaluate the efficacy of physical adversarial attacks against object detectors in aerial scenes. Our results indicate the palpable threat posed by physical adversarial attacks towards DNNs for processing satellite imagery.
    Spatiotemporal Pattern Mining for Nowcasting Extreme Earthquakes in Southern California. (arXiv:2012.14336v3 [physics.geo-ph] UPDATED)
    (0 min) Over the past few decades, geoscience and seismology have used advanced technologies and equipment to monitor seismic events globally. With the enormous amount of data collected, modern GPU-powered deep learning presents a promising approach to analyzing the data and discovering patterns. In recent years, many deep learning models have been applied successfully to picking seismic waves. However, forecasting extreme earthquakes, which can cause disasters, remains an underdeveloped topic. Related research on spatiotemporal dynamics mining and forecasting, a crucial topic in many scientific fields, has produced some successful predictions, many of them relying on deep neural networks. In geology and Earth science, earthquake prediction is one of the world's most challenging problems, and cutting-edge deep learning technologies may help discover valuable patterns. In this project, we propose a deep learning modeling approach, namely \tseqpre, to mine spatiotemporal patterns from data and nowcast extreme earthquakes by discovering visual dynamics in regional coarse-grained spatial grids over time. In this approach, we combine deep neural networks with domain knowledge in geoscience and seismology to exploit earthquake patterns for prediction, using convolutional long short-term memory neural networks. Our experiments show a strong correlation between location prediction and magnitude prediction for earthquakes in Southern California. Ablation studies and visualizations validate the effectiveness of the proposed modeling method.
    Deep One-Class Classification via Interpolated Gaussian Descriptor. (arXiv:2101.10043v4 [cs.CV] UPDATED)
    (0 min) One-class classification (OCC) aims to learn an effective data description to enclose all normal training samples and detect anomalies based on the deviation from the data description. Current state-of-the-art OCC models learn a compact normality description by hyper-sphere minimisation, but they often suffer from overfitting the training data, especially when the training set is small or contaminated with anomalous samples. To address this issue, we introduce the interpolated Gaussian descriptor (IGD) method, a novel OCC model that learns a one-class Gaussian anomaly classifier trained with adversarially interpolated training samples. The Gaussian anomaly classifier differentiates the training samples based on their distance to the Gaussian centre and the standard deviation of these distances, offering the model a discriminability w.r.t. the given samples during training. The adversarial interpolation is enforced to consistently learn a smooth Gaussian descriptor, even when the training data is small or contaminated with anomalous samples. This enables our model to learn the data description based on the representative normal samples rather than fringe or anomalous samples, resulting in significantly improved normality description. In extensive experiments on diverse popular benchmarks, including MNIST, Fashion MNIST, CIFAR10, MVTec AD and two medical datasets, IGD achieves better detection accuracy than current state-of-the-art models. IGD also shows better robustness in problems with small or contaminated training sets. Code is available at https://github.com/tianyu0207/IGD.
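    A minimal sketch of scoring anomalies against a Gaussian-style data description, using the distance to the centre of the normal training embeddings standardized by the spread of training distances (NumPy only; this simplifies the IGD model and omits its adversarial interpolation):
        # Toy Gaussian-descriptor anomaly score: standardized distance to the normal centre.
        import numpy as np

        rng = np.random.default_rng(0)
        normal_train = rng.normal(0.0, 1.0, size=(500, 16))   # embeddings of normal samples

        center = normal_train.mean(axis=0)
        train_dists = np.linalg.norm(normal_train - center, axis=1)
        mu, sigma = train_dists.mean(), train_dists.std()

        def anomaly_score(x):
            return (np.linalg.norm(x - center) - mu) / sigma   # larger = more anomalous

        print(anomaly_score(rng.normal(0.0, 1.0, size=16)))    # small for normal-like input
        print(anomaly_score(rng.normal(4.0, 1.0, size=16)))    # large for shifted input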
    Noise Estimation for Generative Diffusion Models. (arXiv:2104.02600v2 [cs.LG] UPDATED)
    (0 min) Generative diffusion models have emerged as leading models in speech and image generation. However, in order to perform well with a small number of denoising steps, a costly tuning of the set of noise parameters is needed. In this work, we present a simple and versatile learning scheme that can step-by-step adjust those noise parameters, for any given number of steps, while the previous work needs to retune for each number separately. Furthermore, without modifying the weights of the diffusion model, we are able to significantly improve the synthesis results, for a small number of steps. Our approach comes at a negligible computation cost.
    PERT: A Progressively Region-based Network for Scene Text Removal. (arXiv:2106.13029v2 [cs.CV] UPDATED)
    (0 min) Scene text removal (STR) contains two processes: text localization and background reconstruction. Through integrating both processes into a single network, previous methods provide an implicit erasure guidance by modifying all pixels in the entire image. However, there exist two problems: 1) the implicit erasure guidance causes excessive erasure of non-text areas; 2) the one-stage erasure lacks exhaustive removal of text regions. In this paper, we propose a ProgrEssively Region-based scene Text eraser (PERT), introducing an explicit erasure guidance and performing balanced multi-stage erasure for accurate and exhaustive text removal. Firstly, we introduce a new region-based modification strategy (RegionMS) to explicitly guide the erasure process. Different from previous implicitly guided methods, RegionMS performs targeted and regional erasure on only the text region, and adaptively perceives stroke-level information to improve the integrity of non-text areas with only bounding-box-level annotations. Secondly, PERT performs balanced multi-stage erasure with several progressive erasing stages. Each erasing stage takes an equal step toward the text-erased image to ensure the exhaustive erasure of text regions. Compared with previous methods, PERT outperforms them by a large margin without the need for an adversarial loss, obtaining SOTA results at high speed (71 FPS) and with at least 25% lower parameter complexity. Code is available at https://github.com/wangyuxin87/PERT.
    An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification. (arXiv:2109.00201v2 [cs.LG] UPDATED)
    (0 min) In predictive tasks, real-world datasets often present different degrees of imbalanced (i.e., long-tailed or skewed) distributions. While the majority (the head) classes have sufficient samples, the minority (the tail) classes can be under-represented by a rather limited number of samples. Data pre-processing has been shown to be very effective in dealing with such problems. On one hand, data re-sampling is a common approach to tackling class imbalance. On the other hand, dimension reduction, which reduces the feature space, is a conventional technique for reducing noise and inconsistencies in a dataset. However, the possible synergy between feature selection and data re-sampling for high-performance imbalance classification has rarely been investigated before. To address this issue, we carry out a comprehensive empirical study on the joint influence of feature selection and re-sampling on two-class imbalance classification. Specifically, we study the performance of two opposite pipelines for imbalance classification by applying feature selection before or after data re-sampling. We conduct a large number of experiments, with a total of 9225 tests, on 52 publicly available datasets, using 9 feature selection methods, 6 re-sampling approaches for class imbalance learning, and 3 well-known classification algorithms. Experimental results show that there is no constant winner between the two pipelines; thus both of them should be considered to derive the best performing model for imbalance classification. We find that the performance of an imbalance classification model not only depends on the classifier adopted and the ratio between the number of majority and minority samples, but also depends on the ratio between the number of samples and features. Overall, this study should provide new reference value for researchers and practitioners in imbalance learning.
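    A toy illustration of the two opposite pipelines the study compares, feature selection before versus after re-sampling, using scikit-learn and simple random oversampling; the specific selector, classifier, and oversampling scheme here are illustrative choices, not the paper's experimental setup:
        # Toy comparison of "select then resample" vs "resample then select" pipelines.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import f1_score
        from sklearn.model_selection import train_test_split

        def oversample(X, y, seed=0):
            # Duplicate random minority-class rows until both classes have equal counts.
            rng = np.random.default_rng(seed)
            minority = np.flatnonzero(y == 1)
            extra = rng.choice(minority, size=np.sum(y == 0) - minority.size, replace=True)
            idx = np.concatenate([np.arange(len(y)), extra])
            return X[idx], y[idx]

        X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                                   weights=[0.95, 0.05], random_state=0)
        Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

        # Pipeline A: feature selection first, then oversampling.
        sel_a = SelectKBest(f_classif, k=10).fit(Xtr, ytr)
        Xa, ya = oversample(sel_a.transform(Xtr), ytr)
        clf_a = LogisticRegression(max_iter=1000).fit(Xa, ya)
        f1_a = f1_score(yte, clf_a.predict(sel_a.transform(Xte)))

        # Pipeline B: oversampling first, then feature selection.
        Xb, yb = oversample(Xtr, ytr)
        sel_b = SelectKBest(f_classif, k=10).fit(Xb, yb)
        clf_b = LogisticRegression(max_iter=1000).fit(sel_b.transform(Xb), yb)
        f1_b = f1_score(yte, clf_b.predict(sel_b.transform(Xte)))

        print(f"select-then-resample F1={f1_a:.3f}, resample-then-select F1={f1_b:.3f}")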
    Differential Diagnosis of Frontotemporal Dementia and Alzheimer's Disease using Generative Adversarial Network. (arXiv:2109.05627v1 [eess.IV])
    (2 min) Frontotemporal dementia (FTD) and Alzheimer's disease (AD) are two common forms of dementia and are easily misdiagnosed as each other due to their similar pattern of clinical symptoms. Differentiating between the two dementia types is crucial for determining disease-specific intervention and treatment. Recent deep-learning-based approaches in the field of medical image computing have delivered some of the best performance for many binary classification tasks, although their application to differential diagnosis, such as neuroimage-based differentiation of multiple types of dementia, has not been explored. In this study, a novel framework is proposed that uses the Generative Adversarial Network technique to distinguish FTD, AD and normal control subjects, using volumetric features extracted at coarse-to-fine structural scales from Magnetic Resonance Imaging scans. Experiments with 10-fold cross-validation on 1,954 images achieved high accuracy. With the proposed framework, we have demonstrated that the combination of multi-scale structural features and synthetic data augmentation based on a generative adversarial network can improve performance on challenging tasks such as differentiating dementia sub-types.
    PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds. (arXiv:2109.05566v1 [cs.CV])
    (0 min) 3D scene understanding from point clouds plays a vital role in various robotic applications. Unfortunately, current state-of-the-art methods use separate neural networks for different tasks such as object detection or room layout estimation. Such a scheme has two limitations: 1) storing and running several networks for different tasks is expensive for typical robotic platforms; 2) the intrinsic structure of the separate outputs is ignored and potentially violated. To this end, we propose the first transformer architecture that predicts 3D objects and layouts simultaneously, using point cloud inputs. Unlike existing methods that either estimate layout keypoints or edges, we directly parameterize room layout as a set of quads. As such, the proposed architecture is termed P(oint)Q(uad)-Transformer. Along with the novel quad representation, we propose a tailored physical constraint loss function that discourages object-layout interference. The quantitative and qualitative evaluations on the public benchmark ScanNet show that the proposed PQ-Transformer succeeds in jointly parsing 3D objects and layouts, running at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization. Moreover, the new physical constraint loss improves strong baselines, and the F1-score of the room layout is significantly promoted from 37.9% to 57.9%.
    On The Radon-Nikodym Spectral Approach With Optimal Clustering. (arXiv:1906.00460v17 [cs.LG] UPDATED)
    (0 min) Problems of interpolation, classification, and clustering are considered. In the tenets of the Radon-Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where $\psi(\mathbf{x})$ is a linear function of the input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature [1] technique. Whereas in a Bayesian approach new observations change only the outcome probabilities, in the Radon-Nikodym approach not only the outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ change with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can also be applied to the optimal clustering problem, which is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant under any non-degenerate linear transform of the input vector $\mathbf{x}$ components. A software product implementing the algorithms of interpolation, classification, and optimal clustering is available from the authors.
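    A small numerical sketch of the generalized eigenproblem at the core of the approach, building the moment matrices ⟨f φ_j φ_k⟩ and ⟨φ_j φ_k⟩ from sampled data over a linear basis and solving them with SciPy; the basis choice and toy data are illustrative assumptions, and this is not the authors' software:
        # Toy Radon-Nikodym-style generalized eigenproblem on sampled (x, f(x)) data.
        import numpy as np
        from scipy.linalg import eigh

        rng = np.random.default_rng(0)
        x = rng.uniform(-1, 1, size=(500, 2))                    # input attributes
        f = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=500)     # observed outcome

        phi = np.hstack([np.ones((500, 1)), x])                  # linear basis [1, x1, x2]
        G = phi.T @ phi / 500                                    # <phi_j phi_k>
        F = phi.T @ (f[:, None] * phi) / 500                     # <f phi_j phi_k>

        # Generalized symmetric eigenproblem F v = lambda G v; the eigenvalues play the
        # role of outcome values and the eigenvectors define the psi^[i] states.
        eigenvalues, eigenvectors = eigh(F, G)
        print(eigenvalues)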
    Pyramid Medical Transformer for Medical Image Segmentation. (arXiv:2104.14702v2 [cs.CV] UPDATED)
    (0 min) Deep neural networks have been a prevailing technique in the field of medical image processing. However, the most popular convolutional neural networks (CNNs) based methods for medical image segmentation are imperfect because they model long-range dependencies by stacking layers or enlarging filters. Transformers and the self-attention mechanism are recently proposed to effectively learn long-range dependencies by modeling all pairs of word-to-word attention regardless of their positions. The idea has also been extended to the computer vision field by creating and treating image patches as embeddings. Considering the computation complexity for whole image self-attention, current transformer-based models settle for a rigid partitioning scheme that potentially loses informative relations. Besides, current medical transformers model global context on full resolution images, leading to unnecessary computation costs. To address these issues, we developed a novel method to integrate multi-scale attention and CNN feature extraction using a pyramidal network architecture, namely Pyramid Medical Transformer (PMTrans). The PMTrans captured multi-range relations by working on multi-resolution images. An adaptive partitioning scheme was implemented to retain informative relations and to access different receptive fields efficiently. Experimental results on three medical image datasets (gland segmentation, MoNuSeg, and HECKTOR datasets) showed that PMTrans outperformed the latest CNN-based and transformer-based models for medical image segmentation.
    Anomaly Detection in Medical Imaging with Deep Perceptual Autoencoders. (arXiv:2006.13265v3 [cs.CV] UPDATED)
    (0 min) Anomaly detection is the problem of recognizing abnormal inputs based on the seen examples of normal data. Despite recent advances of deep learning in recognizing image anomalies, these methods still prove incapable of handling complex medical images, such as barely visible abnormalities in chest X-rays and metastases in lymph nodes. To address this problem, we introduce a new powerful method of image anomaly detection. It relies on the classical autoencoder approach with a re-designed training pipeline to handle high-resolution, complex images and a robust way of computing an image abnormality score. We revisit the very problem statement of fully unsupervised anomaly detection, where no abnormal examples at all are provided during the model setup. We propose to relax this unrealistic assumption by using a very small number of anomalies of confined variability merely to initiate the search of hyperparameters of the model. We evaluate our solution on natural image datasets with a known benchmark, as well as on two medical datasets containing radiology and digital pathology images. The proposed approach suggests a new strong baseline for image anomaly detection and outperforms state-of-the-art approaches in complex medical image analysis tasks.
    Partially-supervised novel object captioning leveraging context from paired data. (arXiv:2109.05115v1 [cs.CV])
    (0 min) In this paper, we propose an approach to improve image captioning solutions for images with novel objects that do not have caption labels in the training dataset. Our approach is agnostic to model architecture, and primarily focuses on training technique that uses existing fully paired image-caption data and the images with only the novel object detection labels (partially paired data). We create synthetic paired captioning data for these novel objects by leveraging context from existing image-caption pairs. We further re-use these partially paired images with novel objects to create pseudo-label captions that are used to fine-tune the captioning model. Using a popular captioning model (Up-Down) as baseline, our approach achieves state-of-the-art results on held-out MS COCO out-of-domain test split, and improves F1 metric and CIDEr for novel object images by 75.8 and 26.6 points respectively, compared to baseline model that does not use partially paired images during training.
    VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows. (arXiv:2108.05015v2 [cs.CV] UPDATED)
    (0 min) Different from visible cameras, which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous and sparse events with much lower latency. In practice, visible cameras can better perceive texture details and slow motion, while event cameras are free from motion blur and have a larger dynamic range, which enables them to work well under fast motion and low illumination. Therefore, the two sensors can cooperate with each other to achieve more reliable object tracking. In this work, we propose a large-scale Visible-Event benchmark (termed VisEvent), addressing the lack of a realistic and large-scale dataset for this task. Our dataset consists of 820 video pairs captured under low illumination, high speed, and background clutter scenarios, and it is divided into a training subset and a testing subset, which contain 500 and 320 videos, respectively. Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods by extending current single-modality trackers into dual-modality versions. More importantly, we further build a simple but effective tracking algorithm by proposing a cross-modality transformer, to achieve more effective feature fusion between visible and event data. Extensive experiments on the proposed VisEvent dataset, FE108, and two simulated datasets (i.e., OTB-DVS and VOT-DVS) validate the effectiveness of our model. The dataset and source code have been released at our project page: https://sites.google.com/view/viseventtrack/.
    Anomaly Detection in Video via Self-Supervised and Multi-Task Learning. (arXiv:2011.07491v3 [cs.CV] UPDATED)
    (0 min) Anomaly detection in video is a challenging computer vision problem. Due to the lack of anomalous events at training time, anomaly detection requires the design of learning methods without full supervision. In this paper, we approach anomalous event detection in video through self-supervised and multi-task learning at the object level. We first utilize a pre-trained detector to detect objects. Then, we train a 3D convolutional neural network to produce discriminative anomaly-specific information by jointly learning multiple proxy tasks: three self-supervised and one based on knowledge distillation. The self-supervised tasks are: (i) discrimination of forward/backward moving objects (arrow of time), (ii) discrimination of objects in consecutive/intermittent frames (motion irregularity) and (iii) reconstruction of object-specific appearance information. The knowledge distillation task takes into account both classification and detection information, generating large prediction discrepancies between teacher and student models when anomalies occur. To the best of our knowledge, we are the first to approach anomalous event detection in video as a multi-task learning problem, integrating multiple self-supervised and knowledge distillation proxy tasks in a single architecture. Our lightweight architecture outperforms the state-of-the-art methods on three benchmarks: Avenue, ShanghaiTech and UCSD Ped2. Additionally, we perform an ablation study demonstrating the importance of integrating self-supervised learning and normality-specific distillation in a multi-task learning setting.
    DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval. (arXiv:2109.05534v1 [cs.CV])
    (0 min) Many previous methods for text-based person retrieval are devoted to learning a latent common space mapping, with the purpose of extracting modality-invariant features from both the visual and textual modalities. Nevertheless, due to the complexity of high-dimensional data, such unconstrained mapping paradigms are not able to properly capture discriminative clues about the corresponding person while dropping the misaligned information. Intuitively, the information contained in visual data can be divided into person information (PI) and surroundings information (SI), which are mutually exclusive. To this end, we propose a novel Deep Surroundings-person Separation Learning (DSSL) model in this paper to effectively extract and match person information, and hence achieve superior retrieval accuracy. A surroundings-person separation and fusion mechanism plays the key role in realizing an accurate and effective surroundings-person separation under a mutual-exclusion constraint. In order to adequately utilize multi-modal and multi-granular information for higher retrieval accuracy, five diverse alignment paradigms are adopted. Extensive experiments are carried out to evaluate the proposed DSSL on CUHK-PEDES, which is currently the only accessible dataset for the text-based person retrieval task. DSSL achieves state-of-the-art performance on CUHK-PEDES. To properly evaluate DSSL in real scenarios, a Real Scenarios Text-based Person Re-identification (RSTPReid) dataset is constructed to benefit future research on text-based person retrieval; it will be made publicly available.
    SphereFace Revived: Unifying Hyperspherical Face Recognition. (arXiv:2109.05565v1 [cs.CV])
    (0 min) This paper addresses the deep face recognition problem under an open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. To this end, hyperspherical face recognition, as a promising line of research, has attracted increasing attention and gradually become a major focus in face recognition research. As one of the earliest works in hyperspherical face recognition, SphereFace explicitly proposed to learn face embeddings with large inter-class angular margin. However, SphereFace still suffers from severe training instability which limits its application in practice. In order to address this problem, we introduce a unified framework to understand large angular margin in hyperspherical face recognition. Under this framework, we extend the study of SphereFace and propose an improved variant with substantially better training stability -- SphereFace-R. Specifically, we propose two novel ways to implement the multiplicative margin, and study SphereFace-R under three different feature normalization schemes (no feature normalization, hard feature normalization and soft feature normalization). We also propose an implementation strategy -- "characteristic gradient detachment" -- to stabilize training. Extensive experiments on SphereFace-R show that it is consistently better than or competitive with state-of-the-art methods.
    A New Split for Evaluating True Zero-Shot Action Recognition. (arXiv:2107.13029v2 [cs.CV] UPDATED)
    (0 min) Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets (e.g., UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, whose classes largely overlap with the classes in the zero-shot evaluation datasets. As a result, classes which are supposed to be unseen are present during supervised pre-training, invalidating the condition of the zero-shot setting. A similar concern was noted several years ago for image-based zero-shot recognition but has not been considered by the zero-shot action recognition community. In this paper, we propose a new split for true zero-shot action recognition with no overlap between unseen test classes and training or pre-training classes. We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation. In our extensive analysis, we find that our TruZe splits are significantly harder than comparable random splits because nothing leaks from pre-training, i.e. unseen performance is consistently lower, by up to 8.9% for zero-shot action recognition. In an additional evaluation we also find that similar issues exist in the splits used in few-shot action recognition, where we see differences of up to 17.1%. We publish our splits and hope that our benchmark analysis will change how the field evaluates zero- and few-shot action recognition moving forward.
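    A tiny sketch of the kind of overlap check such a split requires, comparing evaluation class names against pre-training class names; the class lists and the exact-match criterion below are illustrative, whereas the paper uses a more careful matching procedure:
        # Illustrative construction of a "true zero-shot" unseen set: drop any evaluation
        # class that also appears among the pre-training classes.
        pretraining_classes = {"playing guitar", "archery", "surfing", "brushing teeth"}
        evaluation_classes = {"archery", "surfing", "juggling balls", "fencing", "typing"}

        leaked = evaluation_classes & pretraining_classes
        true_unseen = sorted(evaluation_classes - pretraining_classes)
        print("leaked from pre-training:", sorted(leaked))
        print("usable unseen classes:", true_unseen)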
    A semi-supervised self-training method to develop assistive intelligence for segmenting multiclass bridge elements from inspection videos. (arXiv:2109.05078v1 [cs.CV])
    (0 min) Bridge inspection is an important step in preserving and rehabilitating transportation infrastructure and extending its service life. The advancement of mobile robotic technology allows the rapid collection of a large amount of inspection video data. However, the data are mainly images of complex scenes, wherein a bridge consisting of various structural elements is mixed with a cluttered background. Assisting bridge inspectors in extracting structural elements of bridges from the large volume of complex video data, and sorting them out by class, prepares inspectors for element-wise inspection to determine the condition of bridges. This paper is motivated to develop an assistive intelligence model for segmenting multiclass bridge elements from inspection videos captured by an aerial inspection platform. With a small initial training dataset labeled by inspectors, a Mask Region-based Convolutional Neural Network (Mask R-CNN) pre-trained on a large public dataset was transferred to the new task of multiclass bridge element segmentation. In addition, a temporal coherence analysis attempts to recover false negatives and identify the weaknesses that the neural network can learn from to improve. Furthermore, a semi-supervised self-training (S^3T) method was developed to engage experienced inspectors in refining the network iteratively. Quantitative and qualitative results from evaluating the developed deep neural network demonstrate that the proposed method can utilize a small amount of time and guidance from experienced inspectors (3.58 hours for labeling 66 images) to build a network with excellent performance (91.8% precision, 93.6% recall, and 92.7% F1-score). Importantly, the paper illustrates an approach to incorporating the domain knowledge and experience of bridge professionals into computational intelligence models to efficiently adapt the models to varied bridges in the National Bridge Inventory.
    Learning From Long-Tailed Data With Noisy Labels. (arXiv:2108.11096v2 [cs.CV] UPDATED)
    (0 min) Class imbalance and noisy labels are the norm rather than the exception in many large-scale classification datasets. Nevertheless, most works in machine learning typically assume balanced and clean data. There have been some recent attempts to tackle, on one side, the problem of learning from noisy labels and, on the other side, learning from long-tailed data. Each group of methods makes simplifying assumptions about the other. Due to this separation, the proposed solutions often underperform when both assumptions are violated. In this work, we present a simple two-stage approach based on recent advances in self-supervised learning to treat both challenges simultaneously. It consists of, first, task-agnostic self-supervised pre-training, followed by task-specific fine-tuning using an appropriate loss. Most significantly, we find that self-supervised learning approaches are effectively able to cope with severe class imbalance. In addition, the resulting learned representations are also remarkably robust to label noise, when fine-tuned with an imbalance- and noise-resistant loss function. We validate our claims with experiments on CIFAR-10 and CIFAR-100 augmented with synthetic imbalance and noise, as well as the large-scale inherently noisy Clothing-1M dataset.
    Voxel Transformer for 3D Object Detection. (arXiv:2109.02497v2 [cs.CV] UPDATED)
    (0 min) We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds. Conventional 3D convolutional backbones in voxel-based 3D detectors cannot efficiently capture large context information, which is crucial for object recognition and localization, owing to the limited receptive fields. In this paper, we resolve the problem by introducing a Transformer-based architecture that enables long-range relationships between voxels by self-attention. Given the fact that non-empty voxels are naturally sparse but numerous, directly applying standard Transformer on voxels is non-trivial. To this end, we propose the sparse voxel module and the submanifold voxel module, which can operate on the empty and non-empty voxel positions effectively. To further enlarge the attention range while maintaining comparable computational overhead to the convolutional counterparts, we propose two attention mechanisms for multi-head attention in those two modules: Local Attention and Dilated Attention, and we further propose Fast Voxel Query to accelerate the querying process in multi-head attention. VoTr contains a series of sparse and submanifold voxel modules and can be applied in most voxel-based detectors. Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open dataset.
    Unsupervised Domain Adaptation for Video Semantic Segmentation. (arXiv:2107.11052v2 [cs.CV] UPDATED)
    (0 min) Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to real (Sim2Real) by largely cutting out the laborious per pixel labeling efforts at real. In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation. As it became easy to obtain large-scale video labels through simulation, we believe attempting to maximize Sim2Real knowledge transferability is one of the promising directions for resolving the fundamental data-hungry issue in the video. To tackle this new problem, we present a novel two-phase adaptation scheme. In the first step, we exhaustively distill source domain knowledge using supervised loss functions. Simultaneously, video adversarial training (VAT) is employed to align the features from source to target utilizing video context. In the second step, we apply video self-training (VST), focusing only on the target data. To construct robust pseudo labels, we exploit the temporal information in the video, which has been rarely explored in the previous image-based self-training approaches. We set strong baseline scores on 'VIPER to CityscapeVPS' adaptation scenario. We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.
    Learning Temporal Dynamics from Cycles in Narrated Video. (arXiv:2101.02337v2 [cs.CV] UPDATED)
    (0 min) Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We propose a self-supervised solution to this problem using temporal cycle consistency jointly in vision and language, training on narrated video. Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed. This constraint leads to the discovery of high-level transitions between moments in time, since such transitions are easily inverted and shared across modalities. We justify the design of our model with an ablation study on different configurations of the cycle consistency problem. We then show qualitatively and quantitatively that our approach yields a meaningful, high-level model of the future and past. We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images. Project page: https://dave.ml/mmcc
    Learning Neural Network Subspaces. (arXiv:2102.10472v3 [cs.LG] UPDATED)
    (0 min) Recent observations have advanced our understanding of the neural network optimization landscape, revealing the existence of (1) paths of high accuracy containing diverse solutions and (2) wider minima offering improved performance. Previous methods observing diverse paths require multiple training runs. In contrast we aim to leverage both property (1) and (2) with a single method and in a single training run. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks. These neural network subspaces contain diverse solutions that can be ensembled, approaching the ensemble performance of independently trained networks without the training cost. Moreover, using the subspace midpoint boosts accuracy, calibration, and robustness to label noise, outperforming Stochastic Weight Averaging.
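    As a flavour of what a "line" of networks means, the sketch below linearly interpolates two weight vectors and evaluates points along the segment, using pure NumPy on a toy least-squares problem; the subspace training objective from the paper is not reproduced, and the endpoints here are just two fits on different data subsets:
        # Toy illustration: evaluate a line segment between two "network" weight vectors.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

        def loss(w):
            return float(np.mean((X @ w - y) ** 2))

        # Two independently trained endpoints (here: least squares on different subsets).
        w1 = np.linalg.lstsq(X[:100], y[:100], rcond=None)[0]
        w2 = np.linalg.lstsq(X[100:], y[100:], rcond=None)[0]

        for alpha in np.linspace(0.0, 1.0, 5):
            w = (1 - alpha) * w1 + alpha * w2        # point on the line of solutions
            print(f"alpha={alpha:.2f}  loss={loss(w):.4f}")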
    VSAC: Efficient and Accurate Estimator for H and F. (arXiv:2106.10240v2 [cs.CV] UPDATED)
    (0 min) We present VSAC, a RANSAC-type robust estimator with a number of novelties. It benefits from the introduction of the concept of independent inliers, which significantly improves the efficacy of dominant-plane handling and also allows near error-free rejection of incorrect models, without false positives. The local optimization process and its application are improved so that it is run on average only once. Further technical improvements include adaptive sequential hypothesis verification and efficient model estimation via Gaussian elimination. Experiments on four standard datasets show that VSAC is significantly faster than all its predecessors and runs on average in 1-2 ms on a CPU. It is two orders of magnitude faster and yet as precise as MAGSAC++, the currently most accurate estimator of two-view geometry. In repeated runs on the EVD, HPatches, PhotoTourism, and Kusvod2 datasets, it never failed.
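    For readers unfamiliar with the RANSAC family that VSAC accelerates, the underlying hypothesize-and-verify loop is shown below on a toy 2D line-fitting problem. This is a generic sketch, not VSAC itself, and omits its independent-inlier, local-optimization, and sequential-verification machinery:

        import numpy as np

        def ransac_line(points, n_iters=1000, threshold=0.05, rng=np.random.default_rng(0)):
            best_inliers, best_model = None, None
            for _ in range(n_iters):
                # 1. Hypothesize: fit a line through a minimal sample of 2 points.
                i, j = rng.choice(len(points), size=2, replace=False)
                p, q = points[i], points[j]
                direction = q - p
                normal = np.array([-direction[1], direction[0]])
                norm = np.linalg.norm(normal)
                if norm < 1e-12:
                    continue
                normal /= norm
                # 2. Verify: count points within `threshold` of the hypothesized line.
                dist = np.abs((points - p) @ normal)
                inliers = dist < threshold
                if best_inliers is None or inliers.sum() > best_inliers.sum():
                    best_inliers, best_model = inliers, (p, normal)
            return best_model, best_inliers

        rng = np.random.default_rng(1)
        line_pts = np.column_stack([np.linspace(0, 1, 50), np.linspace(0, 1, 50)])
        pts = np.vstack([line_pts, rng.uniform(0, 1, size=(20, 2))])   # inliers + outliers
        model, inliers = ransac_line(pts)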
    Hyperspectral and Multispectral Classification for Coastal Wetland Using Depthwise Feature Interaction Network. (arXiv:2106.06896v3 [cs.CV] UPDATED)
    (0 min) The monitoring of coastal wetlands is of great importance to the protection of marine and terrestrial ecosystems. However, due to the complex environment, severe vegetation mixture, and difficulty of access, it is impossible to accurately classify coastal wetlands and identify their species with traditional classifiers. Despite the integration of multisource remote sensing data for performance enhancement, there are still challenges in acquiring and exploiting the complementary merits of multisource data. In this paper, the Depthwise Feature Interaction Network (DFINet) is proposed for wetland classification. A depthwise cross-attention module is designed to extract self-correlation and cross-correlation from multisource feature pairs. In this way, meaningful complementary information is emphasized for classification. DFINet is optimized by coordinating consistency loss, discrimination loss, and classification loss. Accordingly, DFINet reaches the standard solution space under the regularity of the loss functions, while spatial consistency and feature discrimination are preserved. Comprehensive experimental results on two hyperspectral and multispectral wetland datasets demonstrate that the proposed DFINet outperforms other competitive methods in terms of overall accuracy.
    Multiple Domain Experts Collaborative Learning: Multi-Source Domain Generalization For Person Re-Identification. (arXiv:2105.12355v2 [cs.CV] UPDATED)
    (0 min) Recent years have witnessed significant progress in person re-identification (ReID). However, current ReID approaches still suffer from considerable performance degradation when unseen testing domains exhibit different characteristics from the source training ones, known as the domain generalization problem. Given multiple source training domains, previous Domain Generalizable ReID (DG-ReID) methods usually learn all domains together using a shared network, which cannot learn sufficient knowledge from each domain. In this paper, we propose a novel Multiple Domain Experts Collaborative Learning (MECL) framework for better exploiting all training domains, which benefits from the proposed Domain-Domain Collaborative Learning (DDCL) and Universal-Domain Collaborative Learning (UDCL). DDCL utilizes domain-specific experts to fully exploit each domain and prevents the experts from over-fitting their corresponding domains using a meta-learning strategy. In UDCL, a universal expert supervises the learning of the domain experts and continuously gathers knowledge from all of them. Note that only the universal expert is used for inference. Extensive experiments on DG-ReID benchmarks demonstrate the effectiveness of DDCL and UDCL, and show that the whole MECL framework significantly outperforms the state of the art. Experimental results on DG-classification benchmarks also reveal the great potential of applying MECL to other DG tasks.
    Internet of Things (IoT) Based Video Analytics: a use case of Smart Doorbell. (arXiv:2105.06508v2 [cs.CV] UPDATED)
    (0 min) The vision of the internet of things (IoT) is now a reality. IoT devices are getting cheaper and smaller, and they are becoming more computationally and energy-efficient. The global market for IoT-based video analytics has seen significant growth in recent years and is expected to remain a growing segment. Any IoT-based video analytics application has a few key requirements, such as cost-effectiveness, widespread applicability, flexible design, accurate scene detection, and reusability of the framework. The video-based smart doorbell is one such application domain for video analytics, with many commercial offerings available in the consumer market. However, such existing offerings are costly, monolithic, and proprietary, and they involve a trade-off between accuracy and portability. To address these problems, this paper proposes a distributed framework for video analytics with a smart doorbell system as the use case. The proposed framework uses AWS cloud services as the base platform, and to meet the affordability constraint, the system was implemented on an affordable Raspberry Pi. The smart doorbell can recognize known/unknown persons with high accuracy. The system also provides additional detection functionality, such as harmful weapon detection, noteworthy vehicle detection, and animal/pet detection. An iOS application was developed for this implementation, which receives notifications from the smart doorbell in real time. Finally, the paper also discusses classical approaches for video analytics, their feasibility for this use case, and a comparative analysis in terms of accuracy and the time required to detect an object in the frame. The results conclude that the AWS cloud-based approach is well suited to this smart doorbell use case.
    The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification. (arXiv:2107.02314v2 [cs.CV] UPDATED)
    (0 min) The BraTS 2021 challenge celebrates its 10th anniversary and is jointly organized by the Radiological Society of North America (RSNA), the American Society of Neuroradiology (ASNR), and the Medical Image Computing and Computer Assisted Interventions (MICCAI) society. Since its inception, BraTS has been focusing on being a common benchmarking venue for brain glioma segmentation algorithms, with well-curated multi-institutional multi-parametric magnetic resonance imaging (mpMRI) data. Gliomas are the most common primary malignancies of the central nervous system, with varying degrees of aggressiveness and prognosis. The RSNA-ASNR-MICCAI BraTS 2021 challenge targets the evaluation of computational algorithms assessing the same tumor compartmentalization, as well as the underlying tumor's molecular characterization, in pre-operative baseline mpMRI data from 2,040 patients. Specifically, the two tasks that BraTS 2021 focuses on are: a) the segmentation of the histologically distinct brain tumor sub-regions, and b) the classification of the tumor's O[6]-methylguanine-DNA methyltransferase (MGMT) promoter methylation status. The performance evaluation of all participating algorithms in BraTS 2021 will be conducted through the Sage Bionetworks Synapse platform (Task 1) and Kaggle (Task 2), concluding in distributing to the top ranked participants monetary awards of $60,000 collectively.
    Border-SegGCN: Improving Semantic Segmentation by Refining the Border Outline using Graph Convolutional Network. (arXiv:2109.05353v1 [cs.CV])
    (0 min) We present Border-SegGCN, a novel architecture to improve semantic segmentation by refining the border outline using graph convolutional networks (GCN). A semantic segmentation network such as U-Net or DeepLabV3+ is used as a base network to produce a pre-segmented output. This output is converted into a graph structure and fed into the GCN to improve the border pixel predictions of the pre-segmented output. We studied factors such as border thickness, the number of edges per node, and the number of features fed into the GCN through experiments. We demonstrate the effectiveness of Border-SegGCN on the CamVid and CARLA datasets, achieving a test-set performance of 81.96% mIoU on CamVid without any post-processing, which is 0.404% higher than the previously reported state of the art on that dataset.
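    To make the graph construction step concrete, the sketch below (our own illustration, not the paper's code) extracts border pixels from a pre-segmented mask, connects spatially adjacent border pixels, and applies one graph-convolution step X' = D^{-1/2}(A + I)D^{-1/2} X W to their features:

        import numpy as np

        def border_pixels(mask):
            """Boolean map of pixels whose 4-neighbourhood contains a different label."""
            return ((mask != np.roll(mask, 1, axis=0)) | (mask != np.roll(mask, -1, axis=0)) |
                    (mask != np.roll(mask, 1, axis=1)) | (mask != np.roll(mask, -1, axis=1)))

        def gcn_layer(features, adjacency, weight):
            a_hat = adjacency + np.eye(adjacency.shape[0])          # add self-loops
            d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
            return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weight, 0.0)

        mask = np.zeros((16, 16), dtype=int)
        mask[4:12, 4:12] = 1                                        # toy pre-segmented region
        ys, xs = np.nonzero(border_pixels(mask))
        coords = np.stack([ys, xs], axis=1)

        # Connect border pixels within 1.5 px of each other (8-neighbourhood).
        dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        adjacency = ((dists > 0) & (dists < 1.5)).astype(float)

        rng = np.random.default_rng(0)
        features = rng.normal(size=(len(coords), 8))                # e.g. per-pixel CNN logits
        refined = gcn_layer(features, adjacency, rng.normal(size=(8, 8)))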
    Co-Correcting: Noise-tolerant Medical Image Classification via mutual Label Correction. (arXiv:2109.05159v1 [eess.IV])
    (0 min) With the development of deep learning, medical image classification has been significantly improved. However, deep learning requires massive amounts of labeled data. While labeling samples by human experts is expensive and time-consuming, collecting labels from crowd-sourcing suffers from noise, which may degrade the accuracy of classifiers. Therefore, approaches that can effectively handle label noise are highly desired. Unfortunately, recent progress on handling label noise in deep learning has gone largely unnoticed by the medical imaging community. To fill the gap, this paper proposes a noise-tolerant medical image classification framework named Co-Correcting, which significantly improves classification accuracy and obtains more accurate labels through dual-network mutual learning, label probability estimation, and curriculum label correcting. On two representative medical image datasets and the MNIST dataset, we test six recent Learning-with-Noisy-Labels methods and conduct comparative studies. The experiments show that Co-Correcting achieves the best accuracy and generalization under different noise ratios in various tasks. Our project can be found at: https://github.com/JiarunLiu/Co-Correcting.
    CDN-MEDAL: Two-stage Density and Difference Approximation Framework for Motion Analysis. (arXiv:2106.03776v2 [cs.CV] UPDATED)
    (0 min) Background modeling is a promising research area in video analysis with a variety of video surveillance applications. Recent years have witnessed the proliferation of deep neural networks via effective learning-based approaches in motion analysis. However, these techniques provide only a limited description of the observed scenes, as a single-valued mapping is learned to approximate the temporal conditional average of the target background. On the other hand, statistical learning in the image domain, notably Gaussian mixture models combined with a foreground extraction step, has become one of the most prevalent approaches owing to its high adaptability to dynamic context changes. In this work, we propose a novel two-stage method for change detection with two convolutional neural networks. The first architecture is grounded on unsupervised Gaussian-mixture statistical learning to describe the scenes' salient features. The second one implements a lightweight pipeline for foreground detection. Our two-stage framework contains approximately 3.5K parameters in total but still maintains rapid convergence to intricate motion patterns. Our experiments on publicly available datasets show that the proposed networks are not only capable of generalizing to regions of moving objects in unseen cases with promising results but are also competitive in efficiency and effectiveness for foreground segmentation.
    Convolutional Hough Matching Networks for Robust and Efficient Visual Correspondence. (arXiv:2109.05221v1 [cs.CV])
    (0 min) Despite advances in feature representation, leveraging geometric relations is crucial for establishing reliable visual correspondences under large variations of images. In this work we introduce a Hough transform perspective on convolutional matching and propose an effective geometric matching algorithm, dubbed Convolutional Hough Matching (CHM). The method distributes similarities of candidate matches over a geometric transformation space and evaluates them in a convolutional manner. We cast it into a trainable neural layer with a semi-isotropic high-dimensional kernel, which learns non-rigid matching with a small number of interpretable parameters. To further improve the efficiency of high-dimensional voting, we also propose to use an efficient kernel decomposition with center-pivot neighbors, which significantly sparsifies the proposed semi-isotropic kernels without performance degradation. To validate the proposed techniques, we develop the neural network with CHM layers that perform convolutional matching in the space of translation and scaling. Our method sets a new state of the art on standard benchmarks for semantic visual correspondence, proving its strong robustness to challenging intra-class variations.
    A Decidability-Based Loss Function. (arXiv:2109.05524v1 [cs.CV])
    (0 min) Nowadays, deep learning is the standard approach for a wide range of problems, including biometrics such as face recognition and speech recognition. Biometric problems often use deep learning models to extract features from images, also known as embeddings. Moreover, the loss function used during training strongly influences the quality of the generated embeddings. In this work, a loss function based on the decidability index is proposed to improve the quality of embeddings for the verification routine. Our proposal, the D-loss, avoids some disadvantages of Triplet-based losses, such as the use of hard samples and tricky parameter tuning, which can lead to slow convergence. The proposed approach is compared against the Softmax (cross-entropy), Triplets Soft-Hard, and Multi-Similarity losses on four different benchmarks: MNIST, Fashion-MNIST, CIFAR10 and CASIA-IrisV4. The achieved results show the efficacy of the proposal when compared to other popular losses in the literature. The D-loss computation, besides being simple, non-parametric and easy to implement, favors both the inter-class and intra-class scenarios.
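    The decidability index the loss builds on is the classic separation measure d' = |mu_intra - mu_inter| / sqrt((sigma_intra^2 + sigma_inter^2) / 2) between genuine (intra-class) and impostor (inter-class) score distributions. A minimal NumPy sketch of computing it from embedding distances (the exact D-loss formulation is in the paper and is not reproduced here):

        import numpy as np

        def decidability(intra_scores, inter_scores):
            mu_i, mu_e = intra_scores.mean(), inter_scores.mean()
            var_i, var_e = intra_scores.var(), inter_scores.var()
            return abs(mu_i - mu_e) / np.sqrt((var_i + var_e) / 2.0)

        rng = np.random.default_rng(0)
        genuine = rng.normal(0.3, 0.10, size=1000)    # distances between same-identity pairs
        impostor = rng.normal(0.8, 0.15, size=1000)   # distances between different-identity pairs
        print(decidability(genuine, impostor))        # larger d' means better-separated embeddings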
    Real-time multimodal image registration with partial intraoperative point-set data. (arXiv:2109.05023v1 [eess.IV])
    (0 min) We present Free Point Transformer (FPT) - a deep neural network architecture for non-rigid point-set registration. Consisting of two modules, a global feature extraction module and a point transformation module, FPT does not assume explicit constraints based on point vicinity, thereby overcoming a common requirement of previous learning-based point-set registration methods. FPT is designed to accept unordered and unstructured point-sets with a variable number of points and uses a "model-free" approach without heuristic constraints. Training FPT is flexible and involves minimizing an intuitive unsupervised loss function, but supervised, semi-supervised, and partially- or weakly-supervised training are also supported. This flexibility makes FPT amenable to multimodal image registration problems where the ground-truth deformations are difficult or impossible to measure. In this paper, we demonstrate the application of FPT to non-rigid registration of prostate magnetic resonance (MR) imaging and sparsely-sampled transrectal ultrasound (TRUS) images. The registration errors were 4.71 mm and 4.81 mm for complete TRUS imaging and sparsely-sampled TRUS imaging, respectively. The results indicate superior accuracy to the alternative rigid and non-rigid registration algorithms tested and substantially lower computation time. The rapid inference possible with FPT makes it particularly suitable for applications where real-time registration is beneficial.
    U-Net Convolutional Network for Recognition of Vessels and Materials in Chemistry Lab. (arXiv:2109.05585v1 [cs.CV])
    (0 min) Convolutional networks have been widely applied in computer vision systems. Encouraged by these results, a U-Net convolutional network was applied to the recognition of vessels and materials in a chemistry lab using the recent Vector-LabPics dataset, which contains 2187 images of materials within mostly transparent vessels in a chemistry lab and other general settings, labeled with 13 classes. By optimizing hyperparameters, including learning rates and learning rate decays, 87% accuracy in vessel recognition was achieved. For relatively small training and test sets (relatively rare material states, with fewer than 500 training samples and fewer than 100 test samples), improvements of over 18% in IoU and 19% in accuracy were achieved for the best model. Further improvements may be achievable by incorporating improved convolutional network structures into our models.
    Conditional Generation of Synthetic Geospatial Images from Pixel-level and Feature-level Inputs. (arXiv:2109.05201v1 [cs.CV])
    (0 min) Training robust supervised deep learning models for many geospatial applications of computer vision is difficult due to the dearth of class-balanced and diverse training data. Moreover, obtaining enough training data for many applications is financially prohibitive or may be infeasible, especially when the application involves modeling rare or extreme events. Synthetically generating data (and labels) using a generative model that can sample from a target distribution and exploit the multi-scale nature of images can be an inexpensive solution to address the scarcity of labeled data. Towards this goal, we present a deep conditional generative model, called VAE-Info-cGAN, that combines a Variational Autoencoder (VAE) with a conditional Information Maximizing Generative Adversarial Network (InfoGAN), for synthesizing semantically rich images simultaneously conditioned on a pixel-level condition (PLC) and a macroscopic feature-level condition (FLC). The PLC can differ from the synthesized image only in the channel dimension and is meant to be a task-specific input. The FLC is modeled as an attribute vector in the latent space of the generated image which controls the contributions of various characteristic attributes germane to the target distribution. Experiments on a GPS trajectories dataset show that the proposed model can accurately generate various forms of spatiotemporal aggregates across different geographic locations while conditioned only on a raster representation of the road network. The primary intended application of the VAE-Info-cGAN is synthetic data (and label) generation for targeted data augmentation for computer vision-based modeling of problems relevant to geospatial analysis and remote sensing.
    Stabilizing Deep Tomographic Reconstruction. (arXiv:2008.01846v5 [eess.IV] UPDATED)
    (0 min) Tomographic image reconstruction with deep learning is an emerging field, but a recent landmark study reveals that several deep reconstruction networks are unstable for computed tomography (CT) and magnetic resonance imaging (MRI). Specifically, three kinds of instabilities were reported: (1) strong image artefacts from tiny perturbations, (2) small features missing in a deeply reconstructed image, and (3) decreased imaging performance with increased input data. On the other hand, compressed sensing (CS) inspired reconstruction methods do not suffer from these instabilities because of their built-in kernel awareness. For deep reconstruction to realize its full potential and become a mainstream approach for tomographic imaging, it is thus critically important to meet this challenge by stabilizing deep reconstruction networks. Here we propose an Analytic Compressed Iterative Deep (ACID) framework to address this challenge. ACID synergizes a deep reconstruction network trained on big data, kernel awareness from CS-inspired processing, and iterative refinement to minimize the data residual relative to real measurement. Our study demonstrates that the deep reconstruction using ACID is accurate and stable, and sheds light on the converging mechanism of the ACID iteration under a Bounded Relative Error Norm (BREN) condition. In particular, the study shows that ACID-based reconstruction is resilient against adversarial attacks, superior to classic sparsity-regularized reconstruction alone, and eliminates the three kinds of instabilities. We anticipate that this integrative data-driven approach will help promote development and translation of deep tomographic image reconstruction networks into clinical applications.
    DynaNet: Neural Kalman Dynamical Model for Motion Estimation and Prediction. (arXiv:1908.03918v3 [cs.LG] UPDATED)
    (0 min) Dynamical models estimate and predict the temporal evolution of physical systems. State Space Models (SSMs) in particular represent the system dynamics with many desirable properties, such as being able to model uncertainty in both the model and measurements, and optimal (in the Bayesian sense) recursive formulations e.g. the Kalman Filter. However, they require significant domain knowledge to derive the parametric form and considerable hand-tuning to correctly set all the parameters. Data driven techniques e.g. Recurrent Neural Networks have emerged as compelling alternatives to SSMs with wide success across a number of challenging tasks, in part due to their ability to extract relevant features from rich inputs. They however lack interpretability and robustness to unseen conditions. In this work, we present DynaNet, a hybrid deep learning and time-varying state-space model which can be trained end-to-end. Our neural Kalman dynamical model allows us to exploit the relative merits of each approach. We demonstrate state-of-the-art estimation and prediction on a number of physically challenging tasks, including visual odometry, sensor fusion for visual-inertial navigation and pendulum control. In addition we show how DynaNet can indicate failures through investigation of properties such as the rate of innovation (Kalman Gain).
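    For reference, the classical linear Kalman filter recursion that DynaNet's neural state-space model builds on is easy to state in code. A plain NumPy sketch with illustrative matrices (a 1D constant-velocity toy model, not the paper's learned model):

        import numpy as np

        def kalman_step(x, P, z, F, H, Q, R):
            # Predict
            x_pred = F @ x
            P_pred = F @ P @ F.T + Q
            # Update
            S = H @ P_pred @ H.T + R                  # innovation covariance
            K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
            x_new = x_pred + K @ (z - H @ x_pred)     # correct with the measurement residual
            P_new = (np.eye(len(x)) - K @ H) @ P_pred
            return x_new, P_new, K

        dt = 0.1
        F = np.array([[1.0, dt], [0.0, 1.0]])         # state: [position, velocity]
        H = np.array([[1.0, 0.0]])                    # we observe position only
        Q = 1e-3 * np.eye(2)
        R = np.array([[1e-2]])

        x, P = np.zeros(2), np.eye(2)
        for z in [0.1, 0.22, 0.29, 0.41]:             # noisy position measurements
            x, P, K = kalman_step(x, P, np.array([z]), F, H, Q, R)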
    Preliminary Wildfire Detection Using State-of-the-art PTZ (Pan, Tilt, Zoom) Camera Technology and Convolutional Neural Networks. (arXiv:2109.05083v1 [cs.CV])
    (0 min) Wildfires are uncontrolled fires in the environment that can be caused by humans or nature. In 2020 alone, wildfires in California, exacerbated by climate change and rising average global temperatures, burned 4.2 million acres, damaged 10,500 buildings or structures, and killed more than 31 people. The costs of extinguishing these treacherous wildfires have risen accordingly. The objective of this research is to detect forest fires in their early stages to prevent them from spreading, limit the damage they cause, and, most importantly, reduce or eliminate the chance of someone dying from a wildfire. A fire detection system should be efficient and accurate so that wildfires can be extinguished in their early stages before they spread. Computer vision is potentially the reliable, fast, and widely deployable method that is needed. Current research on preliminary fire detection suffers from unrepresentative training data and varying degrees of label imbalance across dataset classes. We propose a dataset that is more representative and more evenly distributed in terms of settings, lighting, atmospheric conditions, and class distribution. Examining the results, the dataset proves to be a viable resource when tested in the real world on unfamiliar data: as the model trains on the dataset, it is able to generalize from it, confirming that this is a practical machine learning setting.
    MosaicOS: A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection. (arXiv:2102.08884v3 [cs.CV] UPDATED)
    (0 min) Many objects do not appear frequently enough in complex scenes (e.g., certain handbags in living rooms) for training an accurate object detector, but are often found frequently by themselves (e.g., in product images). Yet, these object-centric images are not effectively leveraged for improving object detection in scene-centric images. In this paper, we propose Mosaic of Object-centric images as Scene-centric images (MosaicOS), a simple and novel framework that is surprisingly effective at tackling the challenges of long-tailed object detection. Keys to our approach are three-fold: (i) pseudo scene-centric image construction from object-centric images for mitigating domain differences, (ii) high-quality bounding box imputation using the object-centric images' class labels, and (iii) a multi-stage training procedure. On LVIS object detection (and instance segmentation), MosaicOS leads to a massive 60% (and 23%) relative improvement in average precision for rare object categories. We also show that our framework can be compatibly used with other existing approaches to achieve even further gains. Our pre-trained models are publicly available at https://github.com/czhang0528/MosaicOS/.
    Improved Techniques for Quantizing Deep Networks with Adaptive Bit-Widths. (arXiv:2103.01435v2 [cs.CV] UPDATED)
    (0 min) Quantizing deep networks with adaptive bit-widths is a promising technique for efficient inference across many devices and resource constraints. In contrast to static methods that repeat the quantization process and train different models for different constraints, adaptive quantization enables us to flexibly adjust the bit-widths of a single deep network during inference for instant adaptation in different scenarios. While existing research shows encouraging results on common image classification benchmarks, this paper investigates how to train such adaptive networks more effectively. Specifically, we present two novel techniques for quantizing deep neural networks with adaptive bit-widths of weights and activations. First, we propose a collaborative strategy to choose a high-precision teacher for transferring knowledge to the low-precision student while jointly optimizing the model with all bit-widths. Second, to effectively transfer knowledge, we develop a dynamic block swapping method by randomly replacing the blocks in the lower-precision student network with the corresponding blocks in the higher-precision teacher network. Extensive experiments on multiple image classification datasets and, for the first time, on video classification benchmarks clearly demonstrate the efficacy of our approach over state-of-the-art methods.
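    As background for what "adjusting the bit-width at inference" means, the sketch below fake-quantizes the same weight tensor at several bit-widths with a symmetric uniform quantizer. It is a generic illustration, not the paper's training procedure:

        import torch

        def quantize_symmetric(w, bits):
            qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
            scale = w.abs().max() / qmax
            q = torch.clamp(torch.round(w / scale), -qmax, qmax)
            return q * scale                            # de-quantized ("fake-quantized") weights

        w = torch.randn(64, 64)
        for bits in (8, 6, 4, 2):                       # the same tensor at several bit-widths
            w_q = quantize_symmetric(w, bits)
            print(bits, (w - w_q).abs().mean().item())  # quantization error grows as bits shrink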
    Rethinking deinterlacing for early interlaced videos. (arXiv:2011.13675v2 [cs.CV] UPDATED)
    (0 min) With the rapid development of image restoration techniques, high-definition reconstruction of early videos has achieved impressive results. However, there are few studies on the interlacing artifacts that often appear in early videos and significantly affect visual perception. Traditional deinterlacing approaches are mainly focused on early interlaced scanning systems and thus cannot handle the complex artifacts in real-world early interlaced videos. Hence, this paper proposes a specific deinterlacing network (DIN), which is motivated by the traditional deinterlacing strategy. The proposed DIN consists of two stages, i.e., a cooperative vertical interpolation stage for split fields, and a merging stage that is applied to perceive movements and remove ghost artifacts. Experimental results demonstrate that the proposed method can effectively remove complex artifacts in early interlaced videos.
    Evaluating the Progress of Deep Learning for Visual Relational Concepts. (arXiv:2001.10857v3 [cs.CV] UPDATED)
    (0 min) Convolutional Neural Networks (CNNs) have become the state of the art method for image classification in the last ten years. Despite the fact that they achieve superhuman classification accuracy on many popular datasets, they often perform much worse on more abstract image classification tasks. We will show that these difficult tasks are linked to relational concepts from cognitive psychology and that despite progress over the last few years, such relational reasoning tasks still remain difficult for current neural network architectures. We will review deep learning research that is linked to relational concept learning, even if it was not originally presented from this angle. Reviewing the current literature, we will argue that some form of attention will be an important component of future systems to solve relational tasks. In addition, we will point out the shortcomings of currently used datasets, and we will recommend steps to make future datasets more relevant for testing systems on relational reasoning.
    MSGDD-cGAN: Multi-Scale Gradients Dual Discriminator Conditional Generative Adversarial Network. (arXiv:2109.05614v1 [cs.CV])
    (0 min) Conditional Generative Adversarial Networks (cGANs) have been used in many image processing tasks. However, they still have serious problems maintaining the balance between conditioning the output on the input and creating the output with the desired distribution based on the corresponding ground truth. The traditional cGANs, similar to most conventional GANs, suffer from vanishing gradients, which backpropagate from the discriminator to the generator. Moreover, the traditional cGANs are sensitive to architectural changes due to the aforementioned gradient problems; therefore, balancing the architecture of a cGAN is almost impossible. Recently, MSG-GAN was proposed to stabilize the performance of GANs by applying multiple connections between the generator and discriminator. In this work, we propose a method called MSGDD-cGAN, which first stabilizes the performance of cGANs using multi-connection gradient flow. Secondly, the proposed network architecture balances the correlation of the output to the input and the fit of the output to the target distribution. This balance is achieved by the proposed dual discrimination procedure. We tested our model on segmentation of fetal ultrasound images. Our model shows a 3.18% increase in F1 score compared to the pix2pix variant of cGANs.
    Check Your Other Door! Establishing Backdoor Attacks in the Frequency Domain. (arXiv:2109.05507v1 [cs.CR])
    (0 min) Deep Neural Networks (DNNs) have been utilized in various applications ranging from image classification and facial recognition to medical imagery analysis and real-time object detection. As our models become more sophisticated and complex, the computational cost of training such models becomes a burden for small companies and individuals; for this reason, outsourcing the training process has been the go-to option for such users. Unfortunately, outsourcing the training process comes at the cost of vulnerability to backdoor attacks. These attacks aim at establishing hidden backdoors in the DNN such that the model performs well on benign samples but outputs a particular target label when a trigger is applied to the input. Current backdoor attacks rely on generating triggers in the image/pixel domain; however, as we show in this paper, it is not the only domain to exploit and one should always "check the other doors". In this work, we propose a complete pipeline for generating a dynamic, efficient, and invisible backdoor attack in the frequency domain. We show the advantages of utilizing the frequency domain for establishing undetectable and powerful backdoor attacks through extensive experiments on various datasets and network architectures. The backdoored models are shown to break various state-of-the-art defences. We also show two possible defences that succeed against frequency-based backdoor attacks and possible ways for the attacker to bypass them. We conclude the work with some remarks regarding a network's learning capacity and the capability of embedding a backdoor attack in the model.
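    As a toy illustration of what a frequency-domain trigger looks like (our own fixed, hand-picked perturbation, not the dynamic and invisible trigger generation proposed in the paper), one can perturb a few mid-frequency FFT coefficients of an image and transform back, leaving the pixel values nearly unchanged:

        import numpy as np

        def add_frequency_trigger(image, magnitude=5.0):
            spectrum = np.fft.fftshift(np.fft.fft2(image))
            h, w = image.shape
            # Bump a small block of mid-frequency coefficients (the hand-picked "trigger").
            spectrum[h // 4 : h // 4 + 3, w // 4 : w // 4 + 3] += magnitude
            poisoned = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
            return np.clip(poisoned, 0.0, 1.0)

        clean = np.random.default_rng(0).uniform(0.0, 1.0, size=(32, 32))
        poisoned = add_frequency_trigger(clean)
        print(np.abs(poisoned - clean).max())   # per-pixel change stays small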
    Jacobian Regularization for Mitigating Universal Adversarial Perturbations. (arXiv:2104.10459v2 [cs.LG] UPDATED)
    (0 min) Universal Adversarial Perturbations (UAPs) are input perturbations that can fool a neural network on large sets of data. They are a class of attacks that represents a significant threat as they facilitate realistic, practical, and low-cost attacks on neural networks. In this work, we derive upper bounds for the effectiveness of UAPs based on norms of data-dependent Jacobians. We empirically verify that Jacobian regularization greatly increases model robustness to UAPs by up to four times whilst maintaining clean performance. Our theoretical analysis also allows us to formulate a metric for the strength of shared adversarial perturbations between pairs of inputs. We apply this metric to benchmark datasets and show that it is highly correlated with the actual observed robustness. This suggests that realistic and practical universal attacks can be reliably mitigated without sacrificing clean accuracy, which shows promise for the robustness of machine learning systems.
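    In practice, a Jacobian-norm penalty can be estimated with a single random projection per batch (a Hutchinson-style estimator: E_v ||J^T v||^2 = ||J||_F^2 for v ~ N(0, I)). The PyTorch sketch below uses placeholder model and data and is our illustration of the general technique rather than the paper's exact code:

        import torch
        import torch.nn as nn

        model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
        criterion = nn.CrossEntropyLoss()
        lambda_jr = 0.01

        x = torch.randn(16, 1, 28, 28, requires_grad=True)
        y = torch.randint(0, 10, (16,))

        logits = model(x)
        v = torch.randn_like(logits)                         # random probe vector per sample
        jac_vec = torch.autograd.grad((logits * v).sum(), x, create_graph=True)[0]
        jacobian_penalty = jac_vec.pow(2).sum(dim=(1, 2, 3)).mean()

        loss = criterion(logits, y) + lambda_jr * jacobian_penalty
        loss.backward()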
    More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching. (arXiv:2105.09597v2 [cs.CV] UPDATED)
    (0 min) Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to its capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods could be sub-optimal and inaccurate because there is no direct supervision provided during the training process. In this work, we propose two novel training strategies, namely Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address such limitations. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be easily integrated into existing cross-modal attention models. Additionally, we introduce three metrics including Attention Precision, Recall, and F1-Score to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both Flickr30k and MS-COCO datasets demonstrate that integrating these constraints improves the model performance in terms of both retrieval performance and attention metrics.
    DeepPyram: Enabling Pyramid View and Deformable Pyramid Reception for Semantic Segmentation in Cataract Surgery Videos. (arXiv:2109.05352v1 [cs.CV])
    (0 min) Semantic segmentation in cataract surgery has a wide range of applications contributing to surgical outcome enhancement and clinical risk reduction. However, the varying issues in segmenting the different relevant instances make the design of a single network quite challenging. This paper proposes a semantic segmentation network termed DeepPyram that can achieve superior performance in segmenting relevant objects in cataract surgery videos with varying issues. This superiority mainly originates from three modules: (i) Pyramid View Fusion, which provides a varying-angle global view of the surrounding region centered at each pixel position in the input convolutional feature map; (ii) Deformable Pyramid Reception, which enables a wide deformable receptive field that can adapt to geometric transformations in the object of interest; and (iii) Pyramid Loss, which adaptively supervises multi-scale semantic feature maps. These modules can effectively boost semantic segmentation performance, especially in the case of transparency, deformability, scalability, and blunt edges in objects. The proposed approach is evaluated using four datasets of cataract surgery for objects with different contextual features and compared with thirteen state-of-the-art segmentation networks. The experimental results confirm that DeepPyram outperforms the rival approaches without imposing additional trainable parameters. Our comprehensive ablation study further proves the effectiveness of the proposed modules.
    OMNet: Learning Overlapping Mask for Partial-to-Partial Point Cloud Registration. (arXiv:2103.00937v5 [cs.CV] UPDATED)
    (0 min) Point cloud registration is a key task in many computational fields. Previous correspondence-matching-based methods require the inputs to have distinctive geometric structures to fit a 3D rigid transformation according to point-wise sparse feature matches. However, the accuracy of the transformation heavily relies on the quality of the extracted features, which are prone to errors with respect to partiality and noise. In addition, they cannot utilize the geometric knowledge of all the overlapping regions. On the other hand, previous global-feature-based approaches can utilize the entire point cloud for the registration; however, they ignore the negative effect of non-overlapping points when aggregating global features. In this paper, we present OMNet, a global-feature-based iterative network for partial-to-partial point cloud registration. We learn overlapping masks to reject non-overlapping regions, which converts the partial-to-partial registration to the registration of the same shape. Moreover, the previously used data is sampled only once from the CAD model of each object, resulting in the same point clouds for the source and reference. We propose a more practical manner of data generation where a CAD model is sampled twice for the source and reference, avoiding the previously prevalent over-fitting issue. Experimental results show that our method achieves state-of-the-art performance compared to traditional and deep learning based methods. Code is available at https://github.com/megvii-research/OMNet.
    Semi-Supervised Learning with Multi-Head Co-Training. (arXiv:2107.04795v2 [cs.LG] UPDATED)
    (0 min) Co-training, extended from self-training, is one of the frameworks for semi-supervised learning. Without a natural split of features, single-view co-training works at the cost of training extra classifiers, where the algorithm should be delicately designed to prevent individual classifiers from collapsing into each other. To remove these obstacles, which deter the adoption of single-view co-training, we present a simple and efficient algorithm, Multi-Head Co-Training. By integrating base learners into a multi-head structure, the model adds only a minimal number of extra parameters. Every classification head in the unified model interacts with its peers through a "Weak and Strong Augmentation" strategy, in which the diversity is naturally brought by the strong data augmentation. Therefore, the proposed method facilitates single-view co-training by (1) promoting diversity implicitly and (2) requiring only a small extra computational overhead. The effectiveness of Multi-Head Co-Training is demonstrated in an empirical study on standard semi-supervised learning benchmarks.
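    A condensed sketch of the multi-head idea follows: one shared backbone, several classification heads, and each head learning on unlabeled data from pseudo-labels produced by its peer heads on a weakly augmented view while it sees a strongly augmented view. All names, thresholds, and augmentations here are placeholders, not the authors' implementation:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU())
        heads = nn.ModuleList([nn.Linear(256, 10) for _ in range(3)])

        x_weak = torch.randn(8, 3, 32, 32)       # weakly augmented unlabeled batch
        x_strong = torch.randn(8, 3, 32, 32)     # strongly augmented view of the same images

        feat_weak, feat_strong = backbone(x_weak), backbone(x_strong)
        loss = torch.zeros(())
        for k, head in enumerate(heads):
            with torch.no_grad():
                # Peers' average prediction on the weak view serves as head k's pseudo-label.
                peer_probs = torch.stack([F.softmax(h(feat_weak), dim=1)
                                          for j, h in enumerate(heads) if j != k]).mean(0)
                conf, pseudo = peer_probs.max(dim=1)
                keep = conf > 0.95               # keep only confident pseudo-labels
            if keep.any():
                loss = loss + F.cross_entropy(head(feat_strong)[keep], pseudo[keep])
        if loss.requires_grad:
            loss.backward()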
    Efficient Re-parameterization Residual Attention Network For Nonhomogeneous Image Dehazing. (arXiv:2109.05479v1 [eess.IV])
    (0 min) This paper proposes an end-to-end Efficient Re-parameterization Residual Attention Network (ERRA-Net) to directly restore nonhomogeneous hazy images. The contribution of this paper has three main aspects: 1) A novel Multi-branch Attention (MA) block. The spatial attention mechanism better reconstructs high-frequency features, and the channel attention mechanism treats the features of different channels differently. The multi-branch structure dramatically improves the representation ability of the model and can be converted into a single-path structure after re-parameterization to speed up inference. A local residual connection allows low-frequency information in nonhomogeneous areas to pass through the block without processing, so that the block can focus on detailed features. 2) A lightweight network structure. We use cascaded MA blocks to extract high-frequency features step by step, and a multi-layer attention fusion tail combines the shallow and deep features of the model to finally obtain the residual of the clean image. 3) Two novel loss functions, the Color Attenuation loss and the Laplace Pyramid loss, to help reconstruct the haze-free image. ERRA-Net has impressive speed, processing 1200x1600 HD-quality images at an average of 166.11 fps. Extensive evaluations demonstrate that ERRA-Net performs favorably against the SOTA approaches on real-world hazy images.
    Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging. (arXiv:2001.01293v2 [cs.CV] UPDATED)
    (0 min) X-ray security screening is widely used to maintain aviation/transport security, and its significance poses a particular interest in automated screening systems. This paper aims to review computerised X-ray security imaging algorithms by taxonomising the field into conventional machine learning and contemporary deep learning applications. The first part briefly discusses the classical machine learning approaches utilised within X-ray security imaging, while the latter part thoroughly investigates the use of modern deep learning algorithms. The proposed taxonomy sub-categorises the use of deep learning approaches into supervised, semi-supervised and unsupervised learning, with a particular focus on object classification, detection, segmentation and anomaly detection tasks. The paper further explores well-established X-ray datasets and provides a performance benchmark. Based on the current and future trends in deep learning, the paper finally presents a discussion and future directions for X-ray security imagery.
    Generating Datasets of 3D Garments with Sewing Patterns. (arXiv:2109.05633v1 [cs.CV])
    (0 min) Garments are ubiquitous in the real world and in many virtual worlds. They are highly deformable objects, exhibit an immense variety of designs and shapes, and yet, most garments are created from a set of regularly shaped flat pieces. Exploration of garment structure presents a peculiar case for an object structure estimation task and might prove useful for downstream tasks of neural 3D garment modeling and reconstruction by providing a strong prior on garment shapes. To facilitate research in these directions, we propose a method for generating large synthetic datasets of 3D garment designs and their sewing patterns. Our method consists of a flexible description structure for specifying parametric sewing pattern templates and an automatic generation pipeline that produces garment 3D models with little to no manual intervention. To add realism, the pipeline additionally creates corrupted versions of the final meshes that imitate artifacts of 3D scanning. With this pipeline, we created the first large-scale synthetic dataset of 3D garment models with their sewing patterns. The dataset contains more than 20000 garment design variations produced from 19 different base types. Seven of these garment types are specifically designed to target evaluation of the generalization across garment sewing pattern topologies.
    Medulloblastoma Tumor Classification using Deep Transfer Learning with Multi-Scale EfficientNets. (arXiv:2109.05025v1 [eess.IV])
    (0 min) Medulloblastoma (MB) is the most common malignant brain tumor in childhood. The diagnosis is generally based on the microscopic evaluation of histopathological tissue slides. However, visual-only assessment of histopathological patterns is a tedious and time-consuming task and is also affected by observer variability. Hence, automated MB tumor classification could assist pathologists by promoting consistency and robust quantification. Recently, convolutional neural networks (CNNs) have been proposed for this task, while transfer learning has shown promising results. In this work, we propose an end-to-end MB tumor classification and explore transfer learning with various input sizes and matching network dimensions. We focus on differentiating between the histological subtypes classic and desmoplastic/nodular. For this purpose, we systematically evaluate recently proposed EfficientNets, which uniformly scale all dimensions of a CNN. Using a data set with 161 cases, we demonstrate that pre-trained EfficientNets with larger input resolutions lead to significant performance improvements compared to commonly used pre-trained CNN architectures. Also, we highlight the importance of transfer learning, when using such large architectures. Overall, our best performing method achieves an F1-Score of 80.1%.
    A Self-Supervised Deep Framework for Reference Bony Shape Estimation in Orthognathic Surgical Planning. (arXiv:2109.05191v1 [cs.CV])
    (0 min) Virtual orthognathic surgical planning involves simulating surgical corrections of jaw deformities on 3D facial bony shape models. Due to the lack of necessary guidance, the planning procedure is highly experience-dependent and the planning results are often suboptimal. A reference facial bony shape model representing normal anatomies can provide an objective guidance to improve planning accuracy. Therefore, we propose a self-supervised deep framework to automatically estimate reference facial bony shape models. Our framework is an end-to-end trainable network, consisting of a simulator and a corrector. In the training stage, the simulator maps jaw deformities of a patient bone to a normal bone to generate a simulated deformed bone. The corrector then restores the simulated deformed bone back to normal. In the inference stage, the trained corrector is applied to generate a patient-specific normal-looking reference bone from a real deformed bone. The proposed framework was evaluated using a clinical dataset and compared with a state-of-the-art method that is based on a supervised point-cloud network. Experimental results show that the estimated shape models given by our approach are clinically acceptable and significantly more accurate than that of the competing method.
    A Deep Learning-Based Unified Framework for Red Lesions Detection on Retinal Fundus Images. (arXiv:2109.05021v1 [eess.IV])
    (0 min) Red lesions, i.e., microaneurysms (MAs) and hemorrhages (HMs), are the early signs of diabetic retinopathy (DR). The automatic detection of MAs and HMs on retinal fundus images is a challenging task. Most existing methods detect either only MAs or only HMs because of the differences in their texture, size, and morphology. Though some methods detect both MAs and HMs, they suffer from the curse of dimensionality of shape and color features and fail to detect all shape variations of HMs, such as flame-shaped HMs. Leveraging the progress in deep learning, we propose a two-stream red lesion detection system dealing simultaneously with small and large red lesions. For this system, we introduce a new ROI candidate generation method for large red lesions on fundus images; it is based on blood vessel segmentation and morphological operations, reduces computational complexity, and enhances detection accuracy by generating a small number of potential candidates. For detection, we adapt the Faster RCNN framework with two streams. We use a pre-trained VGGNet as the backbone and carry out several extensive experiments to tune it for vessel segmentation and candidate generation, and finally to learn the appropriate mapping, which yields better detection of red lesions compared with state-of-the-art methods. The experimental results validate the effectiveness of the system in detecting both MAs and HMs; the method yields higher per-lesion detection sensitivity than state-of-the-art methods at 4 FPIs on the DiaretDB1-MA and DiaretDB1-HM datasets and at 1 FPI on the e-ophtha and ROCh datasets, with respect to various evaluation metrics. For DR screening, the system outperforms other methods on the DiaretDB1-MA, DiaretDB1-HM, and e-ophtha datasets.
    Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification. (arXiv:2109.05542v1 [cs.CV])
    (0 min) Person re-identification (re-ID) has gained more and more attention due to its widespread applications in intelligent video surveillance. Unfortunately, mainstream deep learning methods still need a large quantity of labeled data to train models, and annotating data is expensive in real-world scenarios. In addition, due to domain gaps between different datasets, the performance drops dramatically when re-ID models pre-trained on label-rich datasets (source domain) are directly applied to other unlabeled datasets (target domain). In this paper, we attempt to remedy these problems from two aspects, namely data and methodology. Firstly, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them, which frees humans from heavy data collection and annotation. Based on them, we build two synthetic person re-ID datasets with different scales, the "GSPR" and "mini-GSPR" datasets. Secondly, we propose a synthesis-based multi-domain collaborative refinement (SMCR) network, which contains a synthetic pretraining module and two collaborative-refinement modules to implement sufficient learning of the valuable knowledge from multiple domains. Extensive experiments show that our proposed framework obtains significant performance improvements over the state-of-the-art methods on multiple unsupervised domain adaptation tasks of person re-ID.
    Semi-supervised Semantic Segmentation of Prostate and Organs-at-Risk on 3D Pelvic CT Images. (arXiv:2009.09571v4 [cs.CV] UPDATED)
    (0 min) Automated segmentation can assist radiotherapy treatment planning by saving manual contouring effort and reducing intra-observer and inter-observer variation. The recent development of deep learning approaches has revolutionized medical data processing, including semantic segmentation, by dramatically improving performance. However, training effective deep learning models usually requires a large amount of high-quality labeled data, which is often costly to collect. We developed a novel semi-supervised adversarial deep learning approach for 3D pelvic CT image semantic segmentation. Unlike supervised deep learning methods, the new approach can utilize both annotated and un-annotated data for training. It generates un-annotated synthetic data by a data augmentation scheme using generative adversarial networks (GANs). We applied the new approach to segmenting multiple organs in male pelvic CT images, where CT images without annotations and GAN-synthesized un-annotated images were used in semi-supervised learning. Experimental results, evaluated by three metrics (Dice similarity coefficient, average Hausdorff distance, and average surface Hausdorff distance), showed that the new method achieved either comparable performance with substantially fewer annotated images or better performance with the same amount of annotated data, outperforming the existing state-of-the-art methods.
    Sickle Cell Disease Severity Prediction from Percoll Gradient Images using Graph Convolutional Networks. (arXiv:2109.05372v1 [eess.IV])
    (0 min) Sickle cell disease (SCD) is a severe genetic hemoglobin disorder that results in premature destruction of red blood cells. Assessment of the severity of the disease is a challenging task in clinical routine, since the causes of the broad variance in SCD manifestation despite the common genetic cause remain unclear. Identification of biomarkers that would predict the severity grade is important for prognosis and for assessing patients' responsiveness to therapy. Detection of changes in red blood cell (RBC) density through separation on a Percoll density gradient could be such a marker, as it allows resolving intercellular differences and following the most damaged dense cells prone to destruction and vaso-occlusion. Quantification of the images obtained from the distribution of RBCs in the Percoll gradient and interpretation of the resulting data are important prerequisites for establishing this approach. Here, we propose a novel approach combining a graph convolutional network, a convolutional neural network, the fast Fourier transform, and recursive feature elimination to predict the severity of SCD directly from a Percoll image. Measurements of two important but expensive laboratory blood test parameters are used for training the graph convolutional network. To make the model independent of such tests during prediction, the two parameters are estimated by a neural network directly from the Percoll image. On a cohort of 216 subjects, we achieve a prediction performance that is only slightly below an approach where the ground-truth laboratory measurements are used. Our proposed method is the first computational approach for the difficult task of SCD severity prediction. The two-step approach relies solely on inexpensive and simple blood analysis tools and can have a significant impact on patients' survival in underdeveloped countries where access to medical instruments and doctors is limited.
    An Unsupervised Deep-Learning Method for Fingerprint Classification: the CCAE Network and the Hybrid Clustering Strategy. (arXiv:2109.05526v1 [cs.CV])
    (0 min) Fingerprint classification is an important and effective way to speed up and improve the accuracy of the fingerprint matching process. Conventional supervised methods need a large amount of pre-labeled data and thus consume immense human resources. In this paper, we propose a new and efficient unsupervised deep learning method that can extract fingerprint features and classify fingerprint patterns automatically. In this approach, a new model named constraint convolutional auto-encoder (CCAE) is used to extract fingerprint features, and a hybrid clustering strategy is applied to obtain the final clusters. A set of experiments on the NIST-DB4 dataset shows that the proposed unsupervised method exhibits efficient performance on fingerprint classification. For example, the CCAE achieves an accuracy of 97.3% using only 1000 unlabeled fingerprints from NIST-DB4.
    Temporal Memory Attention for Video Semantic Segmentation. (arXiv:2102.08643v2 [cs.CV] UPDATED)
    (0 min) Video semantic segmentation requires exploiting the complex temporal relations between frames of the video sequence. Previous works usually rely on accurate optical flow to leverage the temporal relations, which suffers from heavy computational cost. In this paper, we propose a Temporal Memory Attention Network (TMANet) to adaptively integrate the long-range temporal relations over the video sequence based on the self-attention mechanism, without exhaustive optical flow prediction. Specifically, we construct a memory from several past frames to store the temporal information for the current frame. We then propose a temporal memory attention module to capture the relation between the current frame and the memory to enhance the representation of the current frame. Our method achieves new state-of-the-art performance on two challenging video semantic segmentation datasets, particularly 80.3% mIoU on Cityscapes and 76.5% mIoU on CamVid with ResNet-50.
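    The memory-attention step can be pictured as standard query-key-value attention where the query comes from the current frame and the keys/values come from features of past frames. A minimal PyTorch sketch with illustrative shapes and projections (not the TMANet code):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        d = 64
        query_proj, key_proj, value_proj = nn.Linear(256, d), nn.Linear(256, d), nn.Linear(256, d)

        current = torch.randn(1, 1024, 256)       # current frame: 1024 spatial positions
        memory = torch.randn(1, 3 * 1024, 256)    # memory: features of 3 past frames

        q, k, v = query_proj(current), key_proj(memory), value_proj(memory)
        attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # (1, 1024, 3*1024)
        context = attn @ v                                           # temporal context per position

        enhanced = torch.cat([current, context], dim=-1)   # fuse memory context with current frame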
    DEF: Deep Estimation of Sharp Geometric Features in 3D Shapes. (arXiv:2011.15081v2 [cs.CV] UPDATED)
    (0 min) We propose Deep Estimators of Features (DEFs), a learning-based framework for predicting sharp geometric features in sampled 3D shapes. Differently from existing data-driven methods, which reduce this problem to feature classification, we propose to regress a scalar field representing the distance from point samples to the closest feature line on local patches. Our approach is the first that scales to massive point clouds by fusing distance-to-feature estimates obtained on individual patches. We extensively evaluate our approach against five baselines on newly proposed synthetic and real-world 3D CAD model benchmarks. Our approach not only outperforms the baselines (with improvements in Recall and False Positives Rates), but generalizes to real-world scans after training our model on synthetic data and fine-tuning it on a small dataset of scanned data. We demonstrate a downstream application, where we reconstruct an explicit representation of straight and curved sharp feature lines from range scan data.
    CPF: Learning a Contact Potential Field to Model the Hand-Object Interaction. (arXiv:2012.00924v4 [cs.CV] UPDATED)
    (0 min) Modeling hand-object (HO) interaction requires not only estimating the HO pose but also attending to the contact that arises from the interaction. While significant progress has been made in estimating the hand and the object separately with deep learning methods, simultaneous HO pose estimation and contact modeling has not yet been fully explored. In this paper, we present an explicit contact representation, namely the Contact Potential Field (CPF), and a learning-fitting hybrid framework, namely MIHO, to Model the Interaction of Hand and Object. In CPF, we treat each contacting HO vertex pair as a spring-mass system; hence the whole system forms a potential field with minimal elastic energy at the grasp position. Extensive experiments on two commonly used benchmarks have demonstrated that our method achieves state-of-the-art results on several reconstruction metrics and allows us to produce more physically plausible HO poses even when the ground truth exhibits severe interpenetration or disjointedness. Our code is available at https://github.com/lixiny/CPF.
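    The spring-mass reading of contact given in the abstract corresponds to a simple elastic energy: each contacting hand-object vertex pair contributes 0.5 * k * ||p_h - p_o||^2, and grasp refinement can descend on the total. A purely illustrative PyTorch sketch with placeholder meshes and contact pairs:

        import torch

        def elastic_contact_energy(hand_verts, obj_verts, pairs, stiffness=1.0):
            """pairs: (M, 2) long tensor of (hand_index, object_index) contacting vertex pairs."""
            diff = hand_verts[pairs[:, 0]] - obj_verts[pairs[:, 1]]
            return 0.5 * stiffness * (diff ** 2).sum(dim=1).sum()

        hand = torch.randn(778, 3, requires_grad=True)   # e.g. a MANO-sized hand mesh
        obj = torch.randn(1000, 3)
        contact_pairs = torch.stack([torch.randint(0, 778, (50,)),
                                     torch.randint(0, 1000, (50,))], dim=1)

        energy = elastic_contact_energy(hand, obj, contact_pairs)
        energy.backward()                                # gradients can drive grasp refinement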
    Follow the Curve: Robotic-Ultrasound Navigation with Learning Based Localization of Spinous Processes for Scoliosis Assessment. (arXiv:2109.05196v1 [eess.IV])
    (0 min) Scoliosis progression in adolescents requires close monitoring so that treatment measures can be taken in time. Ultrasound imaging is a radiation-free and low-cost alternative for scoliosis assessment to the X-rays typically used in clinical practice. However, ultrasound images are prone to speckle noise, making it challenging for sonographers to detect bony features and follow the spine's curvature. This paper introduces a robotic-ultrasound approach for spinal curvature tracking and automatic navigation. A fully connected network with deconvolutional heads is developed to locate the spinous process efficiently in real-time ultrasound images. We use this machine learning-based method to guide the motion of the robot-held ultrasound probe and follow the spinal curvature while capturing ultrasound images and the corresponding positions. We also developed a new force-driven controller that automatically adjusts the probe's pose relative to the skin surface to ensure good acoustic coupling between the probe and the skin. After scanning, the acquired data are used to reconstruct the coronal spinal image, on which the deformity of the scoliotic spine can be assessed and measured. To evaluate the performance of our methodology, we conducted an experimental study with human subjects in which the deviations from the image center during the robotized procedure are compared to those obtained from manual scanning. The angles of spinal deformity measured on the spinal reconstruction images were similar for both methods, implying that they reflect human anatomy equally well.
    A Joint Graph and Image Convolution Network for Automatic Brain Tumor Segmentation. (arXiv:2109.05580v1 [eess.IV])
    (0 min) We present a joint graph convolution-image convolution neural network as our submission to the Brain Tumor Segmentation (BraTS) 2021 challenge. We model each brain as a graph composed of distinct image regions, which is initially segmented by a graph neural network (GNN). Subsequently, the tumorous volume identified by the GNN is further refined by a simple (voxel) convolutional neural network (CNN), which produces the final segmentation. This approach captures both global brain feature interactions via the graphical representation and local image details through the use of convolutional filters. We find that the GNN component by itself can effectively identify and segment the brain tumors. The addition of the CNN further improves the median performance of the model by 2 percent across all metrics evaluated. On the validation set, our joint GNN-CNN model achieves mean Dice scores of 0.89, 0.81, 0.73 and mean Hausdorff distances (95th percentile) of 6.8, 12.6, 28.2mm on the whole tumor, core tumor, and enhancing tumor, respectively.
    Multiresolution Deep Implicit Functions for 3D Shape Representation. (arXiv:2109.05591v1 [cs.CV])
    (0 min) We introduce Multiresolution Deep Implicit Functions (MDIF), a hierarchical representation that can recover fine geometry detail, while being able to perform global operations such as shape completion. Our model represents a complex 3D shape with a hierarchy of latent grids, which can be decoded into different levels of detail and also achieve better accuracy. For shape completion, we propose latent grid dropout to simulate partial data in the latent space and therefore defer the completing functionality to the decoder side. This along with our multires design significantly improves the shape completion quality under decoder-only latent optimization. To the best of our knowledge, MDIF is the first deep implicit function model that can at the same time (1) represent different levels of detail and allow progressive decoding; (2) support both encoder-decoder inference and decoder-only latent optimization, and fulfill multiple applications; (3) perform detailed decoder-only shape completion. Experiments demonstrate its superior performance against prior art in various 3D reconstruction tasks.
    Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search. (arXiv:2109.05433v1 [cs.CV])
    (0 min) Internet search affects people's cognition of the world, so mitigating biases in search results and learning fair models is imperative for social good. We study a unique gender bias in image search in this work: the images returned are often gender-imbalanced for gender-neutral natural language queries. We diagnose two typical image search models: a specialized model trained on in-domain datasets and a generalized representation model pre-trained on massive image and text data from across the internet. Both models suffer from severe gender bias. Therefore, we introduce two novel debiasing approaches: an in-processing fair sampling method to address the gender imbalance when training models, and a post-processing feature clipping method based on mutual information to debias multimodal representations of pre-trained models. Extensive experiments on MS-COCO and Flickr30K benchmarks show that our methods significantly reduce the gender bias in image search models.
    Semantic-guided Automatic Natural Image Matting with Trimap Generation Network and Light-weight Non-local Attention. (arXiv:2103.17020v3 [cs.CV] UPDATED)
    (0 min) Natural image matting aims to precisely separate foreground objects from the background using an alpha matte. Fully automatic natural image matting without external annotation is challenging. Well-performing matting methods usually require an accurate, labor-intensive, handcrafted trimap as extra input, while the performance of automatic trimap generation by dilating a foreground segmentation fluctuates with segmentation quality. We therefore argue that how to handle the trade-off of additional input information is a major issue in automatic matting. This paper presents a semantic-guided automatic natural image matting pipeline with a Trimap Generation Network and light-weight non-local attention, which needs neither a trimap nor a background image as input. Specifically, guided by foreground segmentation, the Trimap Generation Network estimates an accurate trimap. Then, with the estimated trimap as guidance, our light-weight Non-local Matting Network with Refinement produces the final alpha matte; its trimap-guided global aggregation attention block is equipped with strided downsampling convolution, reducing computational complexity and improving performance. Experimental results show that our matting algorithm is competitive with state-of-the-art methods in both trimap-free and trimap-needed settings.
    Pixel-in-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild. (arXiv:2003.03771v3 [cs.CV] UPDATED)
    (0 min) Recently, heatmap regression models have become popular due to their superior performance in locating facial landmarks. However, three major problems still exist among these models: (1) they are computationally expensive; (2) they usually lack explicit constraints on global shapes; (3) domain gaps are commonly present. To address these problems, we propose Pixel-in-Pixel Net (PIPNet) for facial landmark detection. The proposed model is equipped with a novel detection head based on heatmap regression, which conducts score and offset predictions simultaneously on low-resolution feature maps. By doing so, repeated upsampling layers are no longer necessary, enabling the inference time to be largely reduced without sacrificing model accuracy. Besides, a simple but effective neighbor regression module is proposed to enforce local constraints by fusing predictions from neighboring landmarks, which enhances the robustness of the new detection head. To further improve the cross-domain generalization capability of PIPNet, we propose self-training with curriculum. This training strategy is able to mine more reliable pseudo-labels from unlabeled data across domains by starting with an easier task, then gradually increasing the difficulty to provide more precise labels. Extensive experiments demonstrate the superiority of PIPNet, which obtains state-of-the-art results on three out of six popular benchmarks under the supervised setting. The results on two cross-domain test sets are also consistently improved compared to the baselines. Notably, our lightweight version of PIPNet runs at 35.7 FPS and 200 FPS on CPU and GPU, respectively, while still maintaining a competitive accuracy to state-of-the-art methods. The code of PIPNet is available at https://github.com/jhb86253817/PIPNet.
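    The key design described is a head that predicts a score map and offset maps on a low-resolution feature map, so landmarks can be decoded without upsampling layers. Below is a minimal sketch of that idea; the layer layout, stride, channel ordering of the offsets, and decoding are assumptions, not PIPNet's exact head.

```python
# Minimal sketch of a score + offset detection head on a low-resolution map:
# a landmark is decoded as its best-scoring grid cell plus a sub-cell offset.
import torch
import torch.nn as nn

class ScoreOffsetHead(nn.Module):
    def __init__(self, in_channels, num_landmarks):
        super().__init__()
        self.score = nn.Conv2d(in_channels, num_landmarks, 1)       # one score map per landmark
        self.offset = nn.Conv2d(in_channels, 2 * num_landmarks, 1)  # (dx, dy) per landmark

    def forward(self, feat):                     # feat: (B, C, h, w) low-resolution map
        return self.score(feat), self.offset(feat)

def decode(score, offset, stride):
    B, L, h, w = score.shape
    idx = score.view(B, L, -1).argmax(dim=-1)    # best cell per landmark: (B, L)
    ys, xs = idx // w, idx % w
    off = offset.view(B, L, 2, -1).gather(
        3, idx[:, :, None, None].expand(B, L, 2, 1)).squeeze(-1)    # (B, L, 2)
    x = (xs.float() + off[:, :, 0]) * stride     # grid cell + offset back to pixels
    y = (ys.float() + off[:, :, 1]) * stride
    return torch.stack([x, y], dim=-1)           # (B, L, 2) landmark coordinates
```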
    Towards Robust Monocular Visual Odometry for Flying Robots on Planetary Missions. (arXiv:2109.05509v1 [cs.RO])
    (0 min) In the future, extraterrestrial expeditions will be conducted not only by rovers but also by flying robots. The technology demonstration drone Ingenuity, which just landed on Mars, marks the beginning of a new era of exploration unhindered by terrain traversability. Robust self-localization is crucial for that. Cameras, which are lightweight, cheap, and information-rich sensors, are already used to estimate the ego-motion of vehicles. However, methods proven to work in man-made environments cannot simply be deployed on other planets: the highly repetitive textures in the wastelands of Mars pose a huge challenge for descriptor-matching-based approaches. In this paper, we present a robust monocular odometry algorithm that uses efficient optical flow tracking to obtain feature correspondences between images, together with a refined keyframe selection criterion. In contrast to most other approaches, our framework can also handle rotation-only motions, which are particularly challenging for monocular odometry systems. Furthermore, we present a novel approach to estimate the current risk of scale drift based on a principal component analysis of the relative translation information matrix, yielding an implicit measure of uncertainty. We evaluate our approach on all sequences of a challenging real-world dataset captured in a Mars-like environment and show that it outperforms state-of-the-art approaches.
    Prioritized Subnet Sampling for Resource-Adaptive Supernet Training. (arXiv:2109.05432v1 [cs.CV])
    (2 min) A resource-adaptive supernet adjusts its subnets for inference to fit the dynamically available resources. In this paper, we propose Prioritized Subnet Sampling to train a resource-adaptive supernet, termed PSS-Net. We maintain multiple subnet pools, each of which stores the information of many subnets with similar resource consumption. Given a resource constraint, subnets satisfying this constraint are sampled from a pre-defined subnet structure space, and high-quality ones are inserted into the corresponding subnet pool. As training proceeds, sampling gradually shifts toward drawing subnets from the subnet pools. Moreover, when sampling from a subnet pool, subnets with better performance metrics are assigned higher priority for training PSS-Net. At the end of training, PSS-Net retains the best subnet in each pool, enabling a fast switch to high-quality subnets for inference when the available resources vary. Experiments on ImageNet using MobileNetV1/V2 show that our PSS-Net clearly outperforms state-of-the-art resource-adaptive supernets. Our project is at https://github.com/chenbong/PSS-Net.
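    As a rough illustration of the pool bookkeeping and prioritized sampling described above, here is a plain-Python sketch. The pool capacity, the pool-vs-exploration probability, and the metric-weighted sampling rule are assumptions, not the paper's exact schedule.

```python
# Minimal sketch of prioritized subnet sampling: keep a pool of good subnets per
# resource constraint, and increasingly sample from the pool, favoring subnets
# with better metrics. All hyperparameters here are illustrative assumptions.
import random

class SubnetPool:
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.entries = []                       # list of (metric, subnet_config)

    def maybe_insert(self, subnet, metric):
        self.entries.append((metric, subnet))
        self.entries.sort(key=lambda e: e[0], reverse=True)
        del self.entries[self.capacity:]        # keep only the best subnets

    def sample(self):
        metrics = [m for m, _ in self.entries]
        total = sum(metrics)
        return random.choices(self.entries, weights=[m / total for m in metrics])[0][1]

def sample_subnet(pool, search_space, pool_prob):
    """With probability pool_prob draw from the pool, else explore the space."""
    if pool.entries and random.random() < pool_prob:
        return pool.sample()                    # prioritized: better metric, higher weight
    return random.choice(search_space)          # exploration of the pre-defined space
```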
    Facial Anatomical Landmark Detection using Regularized Transfer Learning with Application to Fetal Alcohol Syndrome Recognition. (arXiv:2109.05485v1 [cs.CV])
    (2 min) Fetal alcohol syndrome (FAS), caused by prenatal alcohol exposure, can result in a series of cranio-facial anomalies as well as behavioral and neurocognitive problems. Current diagnosis of FAS is typically done by identifying a set of facial characteristics, often through manual examination. Anatomical landmark detection, which provides rich geometric information, is important for detecting the presence of FAS-associated facial anomalies. This imaging application is characterized by large variations in data appearance and limited availability of labeled data. Current deep learning-based heatmap regression methods designed for facial landmark detection in natural images assume the availability of large datasets and are therefore not well-suited for this application. To address this restriction, we develop a new regularized transfer learning approach that exploits the knowledge of a network learned on large facial recognition datasets. In contrast to standard transfer learning, which focuses on adjusting the pre-trained weights, the proposed approach regularizes the model behavior: it explicitly reuses the rich visual semantics of a domain-similar source model on the target task data as an additional supervisory signal for regularizing landmark detection optimization. Specifically, we develop four regularization constraints for the proposed transfer learning, including constraining the feature outputs from classification and intermediate layers, as well as matching activation attention maps at both spatial and channel levels. Experimental evaluation on a collected clinical imaging dataset demonstrates that the proposed approach effectively improves model generalizability under limited training samples and is advantageous compared to other approaches in the literature.
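    The attention-matching constraints mentioned above can be sketched as pushing the target model's intermediate features toward the spatial and channel attention of a frozen source model on the same input. The attention definition (squared-activation pooling) and the L2 penalty below are assumptions, not the paper's exact regularizers.

```python
# Minimal sketch of attention-map regularization for transfer learning; the
# resulting loss would be added to the landmark-detection loss as a regularizer.
import torch
import torch.nn.functional as F

def spatial_attention(feat):                 # feat: (B, C, H, W)
    return F.normalize((feat ** 2).mean(dim=1).flatten(1), dim=1)   # (B, H*W)

def channel_attention(feat):
    return F.normalize((feat ** 2).mean(dim=(2, 3)), dim=1)         # (B, C)

def attention_matching_loss(target_feat, source_feat, w_spatial=1.0, w_channel=1.0):
    loss = w_spatial * F.mse_loss(spatial_attention(target_feat),
                                  spatial_attention(source_feat).detach())
    loss += w_channel * F.mse_loss(channel_attention(target_feat),
                                   channel_attention(source_feat).detach())
    return loss
```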
    Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception. (arXiv:2109.05441v1 [cs.CV])
    (2 min) State-of-the-art methods for driving-scene LiDAR-based perception (including point cloud semantic segmentation, panoptic segmentation, and 3D detection, etc.) often project the point clouds to 2D space and then process them via 2D convolution. Although this projection is competitive on point clouds, it inevitably alters and abandons the 3D topology and geometric relations. A natural remedy is to use 3D voxelization and 3D convolution networks. However, we found that for outdoor point clouds the improvement obtained in this way is quite limited. An important reason is a property of outdoor point clouds, namely sparsity and varying density. Motivated by this investigation, we propose a new framework for outdoor LiDAR segmentation in which cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern while maintaining these inherent properties. The proposed model acts as a backbone, and the learned features can be used for downstream tasks such as point cloud semantic and panoptic segmentation or 3D detection. In this paper, we benchmark our model on these three tasks. For semantic segmentation, we evaluate the proposed model on several large-scale datasets, i.e., SemanticKITTI, nuScenes, and A2D2. Our method achieves state-of-the-art results on the SemanticKITTI leaderboard (both single-scan and multi-scan challenges) and significantly outperforms existing methods on the nuScenes and A2D2 datasets. Furthermore, the proposed 3D framework also shows strong performance and good generalization on LiDAR panoptic segmentation and LiDAR 3D detection.
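    The cylindrical partition amounts to voxelizing points in (rho, phi, z) rather than (x, y, z), so that far, sparse regions are covered by larger Cartesian cells. A minimal sketch follows; the grid resolution and range bounds are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of cylindrical partition for LiDAR points: convert (x, y, z)
# to (rho, phi, z) and assign each point to a cylindrical voxel.
import numpy as np

def cylindrical_voxel_ids(points, grid=(480, 360, 32),
                          rho_range=(0.0, 50.0), z_range=(-4.0, 2.0)):
    """points: (N, 3) array of x, y, z coordinates -> (N, 3) voxel indices."""
    rho = np.sqrt(points[:, 0] ** 2 + points[:, 1] ** 2)
    phi = np.arctan2(points[:, 1], points[:, 0])          # in [-pi, pi]
    z = points[:, 2]
    coords = np.stack([
        (rho - rho_range[0]) / (rho_range[1] - rho_range[0]),
        (phi + np.pi) / (2 * np.pi),
        (z - z_range[0]) / (z_range[1] - z_range[0]),
    ], axis=1)
    idx = np.clip((coords * np.array(grid)).astype(int), 0, np.array(grid) - 1)
    return idx                                            # voxel index per point
```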
    BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation. (arXiv:2109.05346v1 [cs.CV])
    (2 min) Scene graphs consist of nodes and edges representing objects and object-object relationships, respectively. Scene graph generation (SGG) aims to identify the objects and their relationships. We propose a bidirectional GRU (BiGRU) transformer network (BGT-Net) for scene graph generation from images. This model implements novel object-object communication to enhance the object information using a BiGRU layer, so that information about all objects in the image is available to every other object and can be leveraged later in the object prediction step. This object information is fed to a transformer encoder to predict the object class and to create object-specific edge information via another transformer encoder. To handle the dataset bias induced by the long-tailed relationship distribution, softening with a log-softmax function and adding a bias adaptation term that regulates the bias for every relation prediction individually proved to be an effective approach. We conducted an elaborate study with experiments and ablations on open-source datasets, i.e., the Visual Genome, Open Images, and Visual Relationship Detection datasets, demonstrating the effectiveness of the proposed model over the state of the art.
    LEA-Net: Layer-wise External Attention Network for Efficient Color Anomaly Detection. (arXiv:2109.05493v1 [cs.CV])
    (2 min) Utilizing prior knowledge about anomalies is an essential issue in anomaly detection. Recently, the visual attention mechanism has become a promising way to improve the performance of CNNs for some computer vision tasks. In this paper, we propose a novel model called Layer-wise External Attention Network (LEA-Net) for efficient image anomaly detection. The core idea is the integration of unsupervised and supervised anomaly detectors via the visual attention mechanism. Our strategy is as follows: (i) prior knowledge about anomalies is represented as an anomaly map generated by unsupervised learning of normal instances, (ii) the anomaly map is translated into an attention map by an external network, and (iii) the attention map is then incorporated into intermediate layers of the anomaly detection network. Notably, this layer-wise external attention can be applied to any CNN model in an end-to-end training manner. As a pilot study, we validate LEA-Net on color anomaly detection tasks. Through extensive experiments on the PlantVillage, MVTec AD, and Cloud datasets, we demonstrate that the proposed layer-wise visual attention mechanism consistently boosts the anomaly detection performance of an existing CNN model, even on imbalanced datasets. Moreover, we show that our attention mechanism successfully boosts the performance of several CNN models.
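    Steps (ii) and (iii) can be sketched as a small external network that maps the anomaly map to an attention map, which then re-weights an intermediate feature map of the supervised CNN. The external network's layers and the (1 + attention) gating are assumptions, not the paper's exact design.

```python
# Minimal sketch of layer-wise external attention: an unsupervised anomaly map
# is turned into an attention map and used to gate intermediate CNN features.
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(                      # anomaly map -> attention map
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, out_channels, 1), nn.Sigmoid(),
        )

    def forward(self, feat, anomaly_map):
        # feat: (B, C, H, W) intermediate features; anomaly_map: (B, 1, h, w)
        attn = self.net(nn.functional.interpolate(anomaly_map, size=feat.shape[-2:]))
        return feat * (1.0 + attn)                     # emphasize suspected anomalous regions
```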
    What happens in Face during a facial expression? Using data mining techniques to analyze facial expression motion vectors. (arXiv:2109.05457v1 [cs.CV])
    (2 min) One of the most common problems encountered in human-computer interaction is automatic facial expression recognition. Although it is easy for a human observer to recognize facial expressions, automatic recognition remains difficult for machines. One way machines can recognize facial expressions is by analyzing the changes in the face during the presentation of an expression. In this paper, an optical flow algorithm was used to extract the deformation or motion vectors created in the face by facial expressions. The positions and directions of these extracted motion vectors were then exploited for automatic facial expression recognition using different data mining techniques; that is, facial expressions were recognized using motion vector features as the data. Several state-of-the-art classification algorithms, such as C5.0, CRT, QUEST, CHAID, Deep Learning (DL), SVM, and Discriminant algorithms, were used to classify the extracted motion vectors. Their performance was measured using 10-fold cross validation, and to compare their performance more precisely, the test was repeated 50 times. The deformation of the face was also analyzed in this research; for example, what exactly happens in each part of the face when a person shows fear? Experimental results on the Extended Cohn-Kanade (CK+) facial expression dataset demonstrated that the best methods were DL, SVM, and C5.0, with accuracies of 95.3%, 92.8%, and 90.2%, respectively.
    BioLCNet: Reward-modulated Locally Connected Spiking Neural Networks. (arXiv:2109.05539v1 [cs.NE])
    (2 min) Recent studies have shown that convolutional neural networks (CNNs) are not the only feasible solution for image classification. Furthermore, weight sharing and backpropagation used in CNNs do not correspond to the mechanisms present in the primate visual system. To propose a more biologically plausible solution, we designed a locally connected spiking neural network (SNN) trained using spike-timing-dependent plasticity (STDP) and its reward-modulated variant (R-STDP) learning rules. The use of spiking neurons and local connections along with reinforcement learning (RL) led us to the nomenclature BioLCNet for our proposed architecture. Our network consists of a rate-coded input layer followed by a locally connected hidden layer and a decoding output layer. A spike population-based voting scheme is adopted for decoding in the output layer. We used the MNIST dataset to obtain image classification accuracy and to assess the robustness of our rewarding system to varying target responses.
    Impact of lung segmentation on the diagnosis and explanation of COVID-19 in chest X-ray images. (arXiv:2009.09780v4 [eess.IV] UPDATED)
    (3 min) COVID-19 frequently provokes pneumonia, which can be diagnosed using imaging exams. Chest X-ray (CXR) is often useful because it is cheap, fast, widespread, and uses less radiation. Here, we demonstrate the impact of lung segmentation on COVID-19 identification using CXR images and evaluate which contents of the image influenced the results the most. Semantic segmentation was performed using a U-Net CNN architecture, and classification using three CNN architectures (VGG, ResNet, and Inception). Explainable Artificial Intelligence techniques were employed to estimate the impact of segmentation. A three-class database was composed: lung opacity (pneumonia), COVID-19, and normal. We assessed the impact of creating a CXR image database from different sources, and the COVID-19 generalization from one source to another. The segmentation achieved a Jaccard distance of 0.034 and a Dice coefficient of 0.982. The classification using segmented images achieved an F1-score of 0.88 for the multi-class setup and 0.83 for COVID-19 identification. In the cross-dataset scenario, we obtained an F1-score of 0.74 and an area under the ROC curve of 0.9 for COVID-19 identification using segmented images. The experiments support the conclusion that even after segmentation, there is a strong bias introduced by underlying factors from the different sources.
    Challenges and Solutions in DeepFakes. (arXiv:2109.05397v1 [cs.CV])
    (2 min) Deep learning has been successfully applied to solve various complex problems ranging from big data analytics to computer vision. A recently emerged deep learning-powered application is DeepFake, which creates fake images and videos that humans cannot distinguish from real ones; it is an off-the-shelf manipulation technique that allows swapping two identities in a single video. DeepFake is a controversial technology with many wide-reaching issues impacting society. To counter this emerging problem, we introduce a dataset of 140k real and fake faces, containing 70k real faces from the Flickr dataset collected by Nvidia and 70k fake faces sampled from 1 million fake faces generated by StyleGAN. We train our model on this dataset so that it can identify real and fake faces.
    Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?. (arXiv:2109.05422v1 [cs.CV])
    (2 min) Transformers have sprung up in the field of computer vision. In this work, we explore whether the core self-attention module in Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on the existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies 1D MLP along the axial directions and the parameters are shared among rows or columns. By sparse connection and weight sharing, sMLP module significantly reduces the number of model parameters and computational complexity, avoiding the common over-fitting problem that plagues the performance of MLP-like models. When only trained on the ImageNet-1K dataset, the proposed sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters, which is much better than most CNNs and vision Transformers under the same model size constraint. When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer. The success of sMLPNet suggests that the self-attention mechanism is not necessarily a silver bullet in computer vision. Code will be made publicly available.
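    The token-mixing idea described above (1D MLPs along rows and columns with shared weights, instead of all-to-all mixing) can be sketched as follows. The fusion of the two axial branches with an identity path via a 1x1 convolution is an assumption, not necessarily the sMLPNet module.

```python
# Minimal sketch of sparse MLP token mixing: one linear layer mixes tokens along
# image rows and another along columns, with weights shared across the other axis.
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        self.mix_w = nn.Linear(width, width)     # mixes tokens within each row
        self.mix_h = nn.Linear(height, height)   # mixes tokens within each column
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):                        # x: (B, C, H, W) token map
        row = self.mix_w(x)                                  # linear over the last (W) axis
        col = self.mix_h(x.transpose(2, 3)).transpose(2, 3)  # linear over the H axis
        return self.fuse(torch.cat([x, row, col], dim=1))    # combine with identity path
```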
    A Complex Constrained Total Variation Image Denoising Algorithm with Application to Phase Retrieval. (arXiv:2109.05496v1 [eess.IV])
    (2 min) This paper considers the constrained total variation (TV) denoising problem for complex-valued images. We extend the definition of TV seminorms for real-valued images to complex-valued ones; in particular, we introduce two types of complex TV in both isotropic and anisotropic forms. To solve the constrained denoising problem, we adopt a dual approach and derive an accelerated gradient projection algorithm. We further generalize the proposed denoising algorithm as a key building block of a proximal gradient scheme to solve a large class of complex constrained optimization problems with TV regularizers. As an example, we apply the proposed algorithmic framework to phase retrieval, combining the complex TV regularizer with the conventional projection-based method within the constrained complex TV model. Initial results from both simulated and optical experiments demonstrate the validity of the constrained TV model in extracting sparsity priors within complex-valued images, while also utilizing physically tractable constraints that help speed up convergence.
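    One natural way to extend the anisotropic and isotropic TV seminorms to a complex image is to apply the complex modulus to the finite differences; the sketch below is that reading of the abstract, not necessarily the paper's exact definitions.

```latex
% Sketch: anisotropic / isotropic TV for a complex image $x \in \mathbb{C}^{m \times n}$,
% with $|\cdot|$ the complex modulus of the finite differences (an assumption).
\[
\mathrm{TV}_{\mathrm{aniso}}(x) = \sum_{i,j} \bigl( |x_{i+1,j} - x_{i,j}| + |x_{i,j+1} - x_{i,j}| \bigr),
\qquad
\mathrm{TV}_{\mathrm{iso}}(x) = \sum_{i,j} \sqrt{ |x_{i+1,j} - x_{i,j}|^{2} + |x_{i,j+1} - x_{i,j}|^{2} }.
\]
```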
    Dual-view Snapshot Compressive Imaging via Optical Flow Aided Recurrent Neural Network. (arXiv:2109.05287v1 [eess.IV])
    (2 min) Dual-view snapshot compressive imaging (SCI) aims to capture videos from two field-of-views (FoVs) using a 2D sensor (detector) in a single snapshot, achieving joint FoV and temporal compressive sensing, and thus enjoying the advantages of low-bandwidth, low-power, and low-cost. However, it is challenging for existing model-based decoding algorithms to reconstruct each individual scene, which usually require exhaustive parameter tuning with extremely long running time for large scale data. In this paper, we propose an optical flow-aided recurrent neural network for dual video SCI systems, which provides high-quality decoding in seconds. Firstly, we develop a diversity amplification method to enlarge the differences between scenes of two FoVs, and design a deep convolutional neural network with dual branches to separate different scenes from the single measurement. Secondly, we integrate the bidirectional optical flow extracted from adjacent frames with the recurrent neural network to jointly reconstruct each video in a sequential manner. Extensive results on both simulation and real data demonstrate the superior performance of our proposed model in a short inference time. The code and data are available at https://github.com/RuiyingLu/OFaNet-for-Dual-view-SCI.
    Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval. (arXiv:2109.05523v1 [cs.CV])
    (2 min) Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, the semantic mismatch between an image and sentences usually happens at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision for better identification of mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level: we construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. In order to integrate both sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions of multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching losses from both global and local perspectives and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to state-of-the-art models.
    Evaluating Computer Vision Techniques for Urban Mobility on Large-Scale, Unconstrained Roads. (arXiv:2109.05226v1 [cs.CV])
    (2 min) Conventional approaches for addressing road safety rely on manual interventions or immobile CCTV infrastructure. Such methods are expensive in enforcing compliance to traffic rules and do not scale to large road networks. This paper proposes a simple mobile imaging setup to address several common problems in road safety at scale. We use recent computer vision techniques to identify possible irregularities on roads, the absence of street lights, and defective traffic signs using videos from a moving camera-mounted vehicle. Beyond the inspection of static road infrastructure, we also demonstrate the mobile imaging solution's applicability to spot traffic violations. Before deploying our system in the real-world, we investigate the strengths and shortcomings of computer vision techniques on thirteen condition-based hierarchical labels. These conditions include different timings, road type, traffic density, and state of road damage. Our demonstrations are then carried out on 2000 km of unconstrained road scenes, captured across an entire city. Through this, we quantitatively measure the overall safety of roads in the city through carefully constructed metrics. We also show an interactive dashboard for visually inspecting and initiating action in a time, labor and cost-efficient manner. Code, models, and datasets used in this work will be publicly released.
    RobustART: Benchmarking Robustness on Architecture Design and Training Techniques. (arXiv:2109.05211v1 [cs.CV])
    (2 min) Deep neural networks (DNNs) are vulnerable to adversarial noise, which motivates benchmarking model robustness. Existing benchmarks mainly focus on evaluating defenses, but there are no comprehensive studies of how architecture design and general training techniques affect robustness. Comprehensively benchmarking their relationships is highly beneficial for better understanding and developing robust DNNs. Thus, we propose RobustART, the first comprehensive Robustness investigation benchmark on ImageNet (including an open-source toolkit, a pre-trained model zoo, datasets, and analyses) regarding ARchitecture design (44 human-designed off-the-shelf architectures and 1200+ networks from neural architecture search) and Training techniques (10+ general techniques, e.g., data augmentation) against diverse noises (adversarial, natural, and system noises). Extensive experiments revealed and substantiated several insights for the first time, for example: (1) adversarial training largely improves the clean accuracy and all types of robustness for Transformers and MLP-Mixers; (2) at comparable sizes, CNNs > Transformers > MLP-Mixers on robustness against natural and system noises, and Transformers > MLP-Mixers > CNNs on adversarial robustness; (3) for some light-weight architectures (e.g., EfficientNet, MobileNetV2, and MobileNetV3), increasing model size or using extra training data cannot improve robustness. Our benchmark (this http URL): (1) presents an open-source platform for conducting comprehensive evaluations of diverse robustness types; (2) provides a variety of pre-trained models with different training techniques to facilitate robustness evaluation; (3) proposes a new view for better understanding the mechanisms behind designing robust DNN architectures, backed up by the analysis. We will continuously contribute to building this ecosystem for the community.
    RVMDE: Radar Validated Monocular Depth Estimation for Robotics. (arXiv:2109.05265v1 [cs.RO])
    (2 min) Stereoscopy exposits a natural perception of distance in a scene, and its manifestation in 3D world understanding is an intuitive phenomenon. However, an innate rigid calibration of binocular vision sensors is crucial for accurate depth estimation. Alternatively, a monocular camera alleviates this limitation at the expense of depth accuracy, and the challenge is exacerbated in harsh environmental conditions. Moreover, an optical sensor often fails to acquire vital signals in harsh environments, in which case radar is used instead, providing coarse but more reliable signals. This work explores the utility of coarse radar signals fused with fine-grained data from a monocular camera for depth estimation in harsh environmental conditions. A variant of the feature pyramid network (FPN) operates on fine-grained image features at multiple scales with fewer parameters. The FPN feature maps are fused with sparse radar features extracted with a convolutional neural network, and the concatenated hierarchical features are used to predict depth with ordinal regression. We performed experiments on the nuScenes dataset, and the proposed architecture leads quantitative evaluations with reduced parameters and faster inference. The depth estimation results suggest that the proposed technique can be used as an alternative to stereo depth estimation in critical applications in robotics and self-driving cars. The source code will be available at \url{https://github.com/MI-Hussain/RVMDE}.
    ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis. (arXiv:2109.05488v1 [cs.CV])
    (2 min) Estimating the articulated 3D hand-object pose from a single RGB image is a highly ambiguous and challenging problem, requiring large-scale datasets that contain diverse hand poses, object poses, and camera viewpoints. Most real-world datasets lack this diversity. In contrast, synthetic datasets can easily ensure vast diversity, but learning from them is inefficient and suffers from heavy training consumption. To address these issues, we propose ArtiBoost, a lightweight online data enrichment method that boosts articulated hand-object pose estimation from the data perspective. ArtiBoost is employed along with a real-world source dataset. During training, ArtiBoost alternately performs data exploration and synthesis. ArtiBoost can cover various hand-object poses and camera viewpoints based on a Compositional hand-object Configuration and Viewpoint space (CCV-space) and can adaptively enrich the currently hard-to-discern samples via a mining strategy. We apply ArtiBoost to a simple learning baseline network and demonstrate the performance boost on several hand-object benchmarks. As an illustrative example, with ArtiBoost, even a simple baseline network can outperform the previous Transformer-based state of the art on the HO3D dataset. Our code is available at https://github.com/MVIG-SJTU/ArtiBoost.
    Team NeuroPoly: Description of the Pipelines for the MICCAI 2021 MS New Lesions Segmentation Challenge. (arXiv:2109.05409v1 [eess.IV])
    (2 min) This paper gives a detailed description of the pipelines used for the 2nd edition of the MICCAI 2021 Challenge on Multiple Sclerosis Lesion Segmentation. An overview of the data preprocessing steps applied is provided along with a brief description of the pipelines used, in terms of the architecture and the hyperparameters. Our code for this work can be found at: https://github.com/ivadomed/ms-challenge-2021.
    CAN3D: Fast 3D Medical Image Segmentation via Compact Context Aggregation. (arXiv:2109.05443v1 [eess.IV])
    (2 min) Direct automatic segmentation of objects in 3D medical imaging, such as magnetic resonance (MR) imaging, is challenging, as it often involves accurately identifying a number of individual objects with complex geometries within a large volume under investigation. To address these challenges, most deep learning approaches typically enhance their learning capability by substantially increasing the complexity or the number of trainable parameters within their models. Consequently, these models generally require long inference times on standard workstations operating clinical MR systems and are restricted to high-performance computing hardware due to their large memory requirements. Further, to fit 3D datasets through these large models with limited computer memory, trade-off techniques such as patch-wise training are often used, which sacrifice fine-scale geometric information from the input images that could be clinically significant for diagnostic purposes. To address these challenges, we present a compact convolutional neural network with a small memory footprint that efficiently reduces the number of model parameters required for state-of-the-art performance. This is critical for practical deployment, as most clinical environments only have low-end hardware with limited computing power and memory. The proposed network can maintain data integrity by directly processing large full-size 3D input volumes with no patches required, and it significantly reduces the computational time required for both training and inference. We also propose a novel loss function with an extra shape constraint to improve accuracy for imbalanced classes in 3D MR images.
    COSMic: A Coherence-Aware Generation Metric for Image Descriptions. (arXiv:2109.05281v1 [cs.CL])
    (2 min) Developers of text generation models rely on automated evaluation metrics as a stand-in for slow and expensive manual evaluations. However, image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of output text. We address this weakness by introducing the first discourse-aware learned generation metric for evaluating image descriptions. Our approach is inspired by computational theories of discourse for capturing information goals using coherence. We present a dataset of image$\unicode{x2013}$description pairs annotated with coherence relations. We then train a coherence-aware metric on a subset of the Conceptual Captions dataset and measure its effectiveness$\unicode{x2014}$its ability to predict human ratings of output captions$\unicode{x2014}$on a test set composed of out-of-domain images. We demonstrate a higher Kendall Correlation Coefficient for our proposed metric with the human judgments for the results of a number of state-of-the-art coherence-aware caption generation models when compared to several other metrics including recently proposed learned metrics such as BLEURT and BERTScore.
    Class-Distribution-Aware Calibration for Long-Tailed Visual Recognition. (arXiv:2109.05263v1 [cs.CV])
    (2 min) Despite impressive accuracy, deep neural networks are often miscalibrated and tend to make overly confident predictions. Recent techniques like temperature scaling (TS) and label smoothing (LS) are effective at obtaining a well-calibrated model by smoothing logits and hard labels with scalar factors, respectively. However, a uniform TS or LS factor may not be optimal for calibrating models trained on a long-tailed dataset, where the model produces overly confident probabilities for high-frequency classes. In this study, we propose class-distribution-aware TS (CDA-TS) and LS (CDA-LS) by incorporating class-frequency information into model calibration under a long-tailed distribution. In CDA-TS, the scalar temperature value is replaced with a CDA temperature vector encoded with class frequency to compensate for the over-confidence. Similarly, CDA-LS uses a vector smoothing factor and flattens the hard labels according to their corresponding class distribution. We also integrate the CDA optimal temperature vector with a distillation loss, which reduces miscalibration in self-distillation (SD). We empirically show that class-distribution-aware TS and LS accommodate the imbalanced data distribution, yielding superior performance in both calibration error and predictive accuracy. We also observe that SD with an extremely imbalanced dataset is less effective in terms of calibration performance. Code is available at https://github.com/mobarakol/Class-Distribution-Aware-TS-LS.
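    The general idea of replacing scalar TS/LS factors with per-class vectors driven by class frequency can be sketched as below. The specific mapping from frequency to temperature and smoothing strength is an illustrative assumption, not the paper's formula.

```python
# Minimal sketch of class-distribution-aware calibration: frequent (typically
# over-confident) classes get a hotter temperature and stronger label smoothing.
import torch
import torch.nn.functional as F

def cda_temperature(class_counts, base_t=1.5):
    freq = class_counts / class_counts.sum()
    return base_t * (1.0 + freq / freq.max())        # per-class temperature vector

def cda_temperature_scale(logits, class_counts):
    return logits / cda_temperature(class_counts)    # broadcasts over the batch

def cda_label_smoothing(labels, num_classes, class_counts, base_eps=0.1):
    eps = base_eps * class_counts[labels] / class_counts.max()     # per-sample smoothing
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - eps[:, None]) + eps[:, None] / num_classes
```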
    Pyramid Hybrid Pooling Quantization for Efficient Fine-Grained Image Retrieval. (arXiv:2109.05206v1 [cs.CV])
    (2 min) Deep hashing approaches, including deep quantization and deep binary hashing, have become a common solution to large-scale image retrieval due to high computation and storage efficiency. Most existing hashing methods can not produce satisfactory results for fine-grained retrieval, because they usually adopt the outputs of the last CNN layer to generate binary codes, which is less effective to capture subtle but discriminative visual details. To improve fine-grained image hashing, we propose Pyramid Hybrid Pooling Quantization (PHPQ). Specifically, we propose a Pyramid Hybrid Pooling (PHP) module to capture and preserve fine-grained semantic information from multi-level features. Besides, we propose a learnable quantization module with a partial attention mechanism, which helps to optimize the most relevant codewords and improves the quantization. Comprehensive experiments demonstrate that PHPQ outperforms state-of-the-art methods.
    Bornon: Bengali Image Captioning with Transformer-based Deep learning approach. (arXiv:2109.05218v1 [cs.CV])
    (2 min) Image captioning using an Encoder-Decoder approach, where a CNN serves as the encoder and a sequence generator such as an RNN as the decoder, has proven to be very effective. However, this method has the drawback that the sequence must be processed in order. To overcome this drawback, some researchers have utilized the Transformer model to generate captions from images using English datasets; however, none of them generated captions in Bengali using the Transformer model. As a result, we utilized three different Bengali datasets to generate Bengali captions from images using the Transformer model. Additionally, we compared the performance of the Transformer-based model with a visual attention-based Encoder-Decoder approach. Finally, we compared the results of the Transformer-based model with other models that employed different Bengali image captioning datasets.
    iMAP: Implicit Mapping and Positioning in Real-Time. (arXiv:2103.12352v2 [cs.CV] UPDATED)
    (2 min) We show for the first time that a multilayer perceptron (MLP) can serve as the only scene representation in a real-time SLAM system for a handheld RGB-D camera. Our network is trained in live operation without prior data, building a dense, scene-specific implicit 3D model of occupancy and colour which is also immediately used for tracking. Achieving real-time SLAM via continual training of a neural network against a live image stream requires significant innovation. Our iMAP algorithm uses a keyframe structure and multi-processing computation flow, with dynamic information-guided pixel sampling for speed, with tracking at 10 Hz and global map updating at 2 Hz. The advantages of an implicit MLP over standard dense SLAM techniques include efficient geometry representation with automatic detail control and smooth, plausible filling-in of unobserved regions such as the back surfaces of objects.
  • cs.IR updates on arXiv.org

    Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society. (arXiv:2005.00033v3 [cs.CL] UPDATED)
    (3 min) With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that (i) focuses on COVID-19, (ii) combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and (iii) covers Arabic, Bulgarian, Dutch, and English. Finally, we show strong evaluation results using pretrained Transformers, thus confirming the practical utility of the dataset in monolingual vs. multilingual, and single task vs. multitask settings.
    MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts. (arXiv:2006.12289v2 [cs.CL] UPDATED)
    (2 min) We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author or not. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also describe the application of the above authorship verification system, using these datasets as training data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.
    Dealing with Typos for BERT-based Passage Retrieval and Ranking. (arXiv:2108.12139v2 [cs.IR] UPDATED)
    (2 min) Passage retrieval and ranking is a key task in open-domain question answering and information retrieval. Current effective approaches mostly rely on pre-trained deep language model-based retrievers and rankers. These methods have been shown to effectively model the semantic matching between queries and passages, also in presence of keyword mismatch, i.e. passages that are relevant to a query but do not contain important query keywords. In this paper we consider the Dense Retriever (DR), a passage retrieval method, and the BERT re-ranker, a popular passage re-ranking method. In this context, we formally investigate how these models respond and adapt to a specific type of keyword mismatch -- that caused by keyword typos occurring in queries. Through empirical investigation, we find that typos can lead to a significant drop in retrieval and ranking effectiveness. We then propose a simple typos-aware training framework for DR and BERT re-ranker to address this issue. Our experimental results on the MS MARCO passage ranking dataset show that, with our proposed typos-aware training, DR and BERT re-ranker can become robust to typos in queries, resulting in significantly improved effectiveness compared to models trained without appropriately accounting for typos.
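    The typos-aware training idea above relies on exposing the model to misspelled query variants during training. Below is a minimal sketch of such character-level typo augmentation; the operations, probabilities, and the pairing comment are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch of typos-aware query augmentation: with some probability inject
# a character-level typo (deletion, swap, or substitution) into a training query.
import random
import string

def inject_typo(query, p=0.5):
    if random.random() > p or len(query) < 4:
        return query
    i = random.randrange(len(query) - 1)
    op = random.choice(["delete", "swap", "substitute"])
    if op == "delete":
        return query[:i] + query[i + 1:]
    if op == "swap":
        return query[:i] + query[i + 1] + query[i] + query[i + 2:]
    return query[:i] + random.choice(string.ascii_lowercase) + query[i + 1:]

# During training, both the clean and the typoed query can be paired with the
# same relevant passage, e.g. train_pair = (inject_typo(query), positive_passage).
```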
    A Framework for App Store Optimization. (arXiv:1905.11668v3 [cs.IR] UPDATED)
    (2 min) In this paper a framework for app store optimization is proposed. The framework is based on two main areas: developer dependent elements and user dependent elements. Developer dependent elements are similar to factors in search engine optimization. User dependent elements are similar to activities in social media. The proposed framework is modelled after downloading sample data from two leading app stores: Google Play and Apple iTunes. Results show that developer dependent elements can be better optimized. Names and descriptions of mobile apps are not fully utilized.
    An Improved Hybrid Recommender System: Integrating Document Context-Based and Behavior-Based Methods. (arXiv:2109.05516v1 [cs.IR])
    (2 min) One of the main challenges in recommender systems is data sparsity, which leads to high variance. Several attempts have been made to improve the bias-variance trade-off using auxiliary information. In particular, document modeling-based methods have improved model accuracy by using textual data such as reviews, abstracts, and storylines when the user-to-item rating matrix is sparse. However, such models are insufficient to learn optimal representations for users and items. User-based and item-based collaborative filtering, owing to their efficiency and interpretability, have long been used for building recommender systems; they profile each user by the items they have historically interacted with, and each item by the users who have interacted with it. This work combines these two approaches with document context-aware recommender systems by considering users' opinions on these items. Another advantage of our model is that it supports online personalization: if a user has new interactions, the model only needs to refresh the user and item history representation vectors instead of updating the model parameters. The proposed algorithm is implemented and tested on three real-world datasets, demonstrating our model's effectiveness over the baseline methods.
    YASO: A Targeted Sentiment Analysis Evaluation Dataset for Open-Domain Reviews. (arXiv:2012.14541v2 [cs.CL] UPDATED)
    (2 min) Current TSA evaluation in a cross-domain setup is restricted to the small set of review domains available in existing datasets. Such an evaluation is limited, and may not reflect true performance on sites like Amazon or Yelp that host diverse reviews from many domains. To address this gap, we present YASO - a new TSA evaluation dataset of open-domain user reviews. YASO contains 2,215 English sentences from dozens of review domains, annotated with target terms and their sentiment. Our analysis verifies the reliability of these annotations, and explores the characteristics of the collected data. Benchmark results using five contemporary TSA systems show there is ample room for improvement on this challenging new dataset. YASO is available at https://github.com/IBM/yaso-tsa.
    Memory Augmented Multi-Instance Contrastive Predictive Coding for Sequential Recommendation. (arXiv:2109.00368v3 [cs.IR] UPDATED)
    (2 min) The sequential recommendation aims to recommend items, such as products, songs and places, to users based on the sequential patterns of their historical records. Most existing sequential recommender models consider the next item prediction task as the training signal. Unfortunately, there are two essential challenges for these methods: (1) the long-term preference is difficult to capture, and (2) the supervision signal is too sparse to effectively train a model. In this paper, we propose a novel sequential recommendation framework to overcome these challenges based on a memory augmented multi-instance contrastive predictive coding scheme, denoted as MMInfoRec. The basic contrastive predictive coding (CPC) serves as encoders of sequences and items. The memory module is designed to augment the auto-regressive prediction in CPC to enable a flexible and general representation of the encoded preference, which can improve the ability to capture the long-term preference. For effective training of the MMInfoRec model, a novel multi-instance noise contrastive estimation (MINCE) loss is proposed, using multiple positive samples, which offers effective exploitation of samples inside a mini-batch. The proposed MMInfoRec framework falls into the contrastive learning style, within which, however, a further finetuning step is not required given that its contrastive training task is well aligned with the target recommendation task. With extensive experiments on four benchmark datasets, MMInfoRec can outperform the state-of-the-art baselines.
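    The "multiple positive samples" idea behind the MINCE loss can be illustrated with a contrastive loss in which each of several positives is contrasted against shared negatives and the terms are averaged. This is a sketch of that idea only, not the exact MINCE objective; the normalization, temperature, and shapes are assumptions.

```python
# Minimal sketch of a multi-positive noise-contrastive loss with shared negatives.
import torch
import torch.nn.functional as F

def multi_positive_nce(anchor, positives, negatives, temperature=0.1):
    # anchor: (D,), positives: (P, D), negatives: (N, D) embedding vectors
    anchor = F.normalize(anchor, dim=0)
    pos = F.normalize(positives, dim=1) @ anchor / temperature     # (P,)
    neg = F.normalize(negatives, dim=1) @ anchor / temperature     # (N,)
    # each positive is contrasted against all negatives; average over positives
    logits = torch.cat([pos.unsqueeze(1), neg.unsqueeze(0).expand(pos.shape[0], -1)], dim=1)
    return F.cross_entropy(logits, torch.zeros(pos.shape[0], dtype=torch.long))
```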
    Fast-adapting and Privacy-preserving Federated Recommender System. (arXiv:2104.00919v3 [cs.IR] UPDATED)
    (2 min) In the mobile Internet era, the recommender system has become an irreplaceable tool to help users discover useful items and thus alleviate the information overload problem. Recent deep neural network (DNN)-based recommender system research has made significant progress in improving prediction accuracy, largely thanks to access to large amounts of users' personal data collected from their devices and centrally stored in the cloud server. However, as concerns rise around the globe about user privacy leakage on online platforms, the public is becoming anxious about such abuse of user privacy. Therefore, it is urgent and beneficial to develop a recommender system that can achieve both high prediction accuracy and a high degree of user privacy protection. To this end, we propose a DNN-based recommendation model called PrivRec running in a decentralized federated learning (FL) environment, which ensures that a user's data never leaves his/her device during the course of model training. On the other hand, to better embrace the data heterogeneity commonly existing in FL, we introduce a first-order meta-learning method that enables fast in-device personalization with only a few data points. Furthermore, to defend against potentially malicious participants that pose serious security threats to other users, we develop a user-level differentially private model, DP-PrivRec, so that it is impossible to determine whether a particular user is present or not solely based on the trained model. Finally, we conduct extensive experiments on two large-scale datasets in a simulated FL environment, and the results validate the superiority of our proposed PrivRec and DP-PrivRec.
    User Fairness, Item Fairness, and Diversity for Rankings in Two-Sided Markets. (arXiv:2010.01470v3 [cs.IR] UPDATED)
    (2 min) Ranking items by their probability of relevance has long been the goal of conventional ranking systems. While this maximizes traditional criteria of ranking performance, there is a growing understanding that it is an oversimplification in online platforms that serve not only a diverse user population, but also the producers of the items. In particular, ranking algorithms are expected to be fair in how they serve all groups of users -- not just the majority group -- and they also need to be fair in how they divide exposure among the items. These fairness considerations can partially be met by adding diversity to the rankings, as done in several recent works. However, we show in this paper that user fairness, item fairness and diversity are fundamentally different concepts. In particular, we find that algorithms that consider only one of the three desiderata can fail to satisfy and even harm the other two. To overcome this shortcoming, we present the first ranking algorithm that explicitly enforces all three desiderata. The algorithm optimizes user and item fairness as a convex optimization problem which can be solved optimally. From its solution, a ranking policy can be derived via a novel Birkhoff-von Neumann decomposition algorithm that optimizes diversity. Beyond the theoretical analysis, we investigate empirically on a new benchmark dataset how effectively the proposed ranking algorithm can control user fairness, item fairness and diversity, as well as the trade-offs between them.
    Top-N Recommendation with Counterfactual User Preference Simulation. (arXiv:2109.02444v2 [cs.IR] UPDATED)
    (2 min) Top-N recommendation, which aims to learn users' ranking-based preferences, has long been a fundamental problem in a wide range of applications. Traditional models usually motivate themselves by designing complex or tailored architectures based on different assumptions. However, the training data of a recommender system can be extremely sparse and imbalanced, which poses great challenges for boosting recommendation performance. To alleviate this problem, in this paper, we propose to reformulate the recommendation task within the causal inference framework, which enables us to counterfactually simulate users' ranking-based preferences to handle the data scarcity problem. The core of our model lies in the counterfactual question: "what would be the user's decision if the recommended items had been different?". To answer this question, we first formulate the recommendation process with a series of structural equation models (SEMs), whose parameters are optimized based on the observed data. Then, we actively indicate many recommendation lists (called interventions in the causal inference terminology) which are not recorded in the dataset, and simulate user feedback according to the learned SEMs to generate new training samples. Instead of randomly intervening on the recommendation list, we design a learning-based method to discover more informative training samples. Considering that the learned SEMs may not be perfect, we finally provide a theoretical analysis of the relation between the number of generated samples and the model prediction error, based on which a heuristic method is designed to control the negative effect brought by the prediction error. Extensive experiments are conducted on both synthetic and real-world datasets to demonstrate the effectiveness of our framework.
    Self-supervised Recommendation with Cross-channel Matching Representation and Hierarchical Contrastive Learning. (arXiv:2109.00676v2 [cs.IR] UPDATED)
    (2 min) Recently, modeling social semantic information with multiple channels, and using self-supervised learning tasks to maintain the characteristics of each channel when fusing the information, has proven to be very promising. However, how to thoroughly exploit the relationships between different channels while preserving the uniqueness of each channel is a problem that has not been well studied or resolved in this field. This paper explores and empirically verifies the deficiency of directly constructing contrastive learning tasks on different channels, and proposes a scheme of interactive modeling and matching representations across channels. To our knowledge this is the first such attempt in the field of recommender systems, and we believe the insight of this paper is relevant to future self-supervised learning research based on multi-channel information. To address the problem, we propose a cross-channel matching representation model based on attentive interaction, which efficiently models the relationships between cross-channel information. Building on this, we also propose a hierarchical self-supervised learning model that performs self-supervised learning at two levels, within and between channels, and improves the ability of self-supervised tasks to autonomously mine potential information at different levels. Extensive experiments on multiple public datasets show that the proposed method yields significant improvements over state-of-the-art methods, in both the general and the cold-start scenario. A model-variant analysis further verifies the benefits of the cross-channel matching representation model and the hierarchical self-supervised model.
    Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion. (arXiv:2108.08513v2 [cs.IR] UPDATED)
    (2 min) BERT-based information retrieval models are expensive in both time (query latency) and computational resources (energy, hardware cost), making many of these models impractical, especially under resource constraints. The reliance on a query encoder that only performs tokenization, and on the pre-processing of passage representations at indexing, has allowed the recently proposed TILDE method to overcome the high query latency typical of BERT-based models. This, however, comes at the expense of lower effectiveness compared to other BERT-based re-rankers and dense retrievers. In addition, the original TILDE method is characterised by indexes with a very high memory footprint, as it expands each passage to the size of the BERT vocabulary. In this paper, we propose TILDEv2, a new model that stems from the original TILDE but addresses its limitations. TILDEv2 relies on contextualized exact term matching with expanded passages. This requires storing in the index only the scores of tokens that appear in the expanded passages (rather than the whole vocabulary), producing indexes that are 99% smaller than those of TILDE. This matching mechanism also improves ranking effectiveness by 24%, without adding to the query latency. This makes TILDEv2 the state-of-the-art passage re-ranking method for CPU-only environments, capable of maintaining query latency below 100ms on commodity hardware.
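    A rough sketch of the exact term matching idea as described: at indexing time, each expanded passage stores a precomputed score per token; at query time, a passage's score is simply the sum of its stored scores for the query's tokens. The token scores and identifiers below are placeholders, not TILDEv2's actual weights or implementation.

        # Illustrative exact-term-matching re-ranking with precomputed token
        # scores (inspired by, but not identical to, TILDEv2).
        # At indexing time each (expanded) passage stores a score per token;
        # the numbers here are dummies standing in for contextualized weights.
        index = {
            "p1": {"cambridge": 2.1, "train": 1.4, "station": 0.9},
            "p2": {"london": 1.8, "train": 1.1, "ticket": 0.7},
        }

        def score(query_tokens, passage_id):
            stored = index[passage_id]
            # Sum the stored scores of query tokens present in the passage.
            return sum(stored.get(tok, 0.0) for tok in query_tokens)

        query = ["train", "to", "cambridge"]
        ranked = sorted(index, key=lambda pid: score(query, pid), reverse=True)
        print(ranked)  # ['p1', 'p2']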
    Fairness of Exposure in Stochastic Bandits. (arXiv:2103.02735v2 [cs.LG] UPDATED)
    (2 min) Contextual bandit algorithms have become widely used for recommendation in online systems (e.g. marketplaces, music streaming, news), where they now wield substantial influence on which items get exposed to the users. This raises questions of fairness to the items -- and to the sellers, artists, and writers that benefit from this exposure. We argue that the conventional bandit formulation can lead to an undesirable and unfair winner-takes-all allocation of exposure. To remedy this problem, we propose a new bandit objective that guarantees merit-based fairness of exposure to the items while optimizing utility to the users. We formulate fairness regret and reward regret in this setting, and present algorithms for both stochastic multi-armed bandits and stochastic linear bandits. We prove that the algorithms achieve sub-linear fairness regret and reward regret. Beyond the theoretical analysis, we also provide empirical evidence that these algorithms can fairly allocate exposure to different arms effectively.
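    As a toy illustration of the merit-based fairness of exposure idea described above (exposure allocated in proportion to estimated merit rather than winner-takes-all), assuming made-up merit estimates:

        # Toy illustration: each arm's share of exposure is proportional to an
        # estimate of its merit, instead of all exposure going to the top arm.
        import numpy as np

        merits = np.array([0.9, 0.85, 0.3])            # estimated expected rewards
        exposure = merits / merits.sum()               # merit-proportional allocation
        greedy = np.eye(len(merits))[merits.argmax()]  # winner-takes-all baseline

        print("fair exposure  :", np.round(exposure, 3))  # [0.439 0.415 0.146]
        print("greedy exposure:", greedy)                 # [1. 0. 0.]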
    Detecting Polarized Topics Using Partisanship-aware Contextualized Topic Embeddings. (arXiv:2104.07814v2 [cs.CL] UPDATED)
    (2 min) Growing polarization of the news media has been blamed for fanning disagreement, controversy and even violence. Early identification of polarized topics is thus an urgent matter that can help mitigate conflict. However, accurate measurement of topic-wise polarization is still an open research challenge. To address this gap, we propose Partisanship-aware Contextualized Topic Embeddings (PaCTE), a method to automatically detect polarized topics from partisan news sources. Specifically, utilizing a language model that has been finetuned on recognizing partisanship of the news articles, we represent the ideology of a news corpus on a topic by corpus-contextualized topic embedding and measure the polarization using cosine distance. We apply our method to a dataset of news articles about the COVID-19 pandemic. Extensive experiments on different news sources and topics demonstrate the efficacy of our method to capture topical polarization, as indicated by its effectiveness of retrieving the most polarized topics.
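    A minimal sketch of the final scoring step as described: polarization on a topic is measured as the cosine distance between the two corpora's topic embeddings. The vectors below are random placeholders for the corpus-contextualized topic embeddings a partisanship-finetuned language model would produce.

        # Topic polarization as cosine distance between two corpora's topic
        # embeddings (random placeholders for the real corpus-contextualized
        # embeddings).
        import numpy as np

        def cosine_distance(u, v):
            return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

        rng = np.random.default_rng(0)
        left_topic_emb = rng.normal(size=768)   # one topic, left-leaning corpus
        right_topic_emb = rng.normal(size=768)  # same topic, right-leaning corpus

        polarization = cosine_distance(left_topic_emb, right_topic_emb)
        print(f"polarization score: {polarization:.3f}")  # higher = more polarized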
    Mismatched Estimation of rank-one symmetric matrices under Gaussian noise. (arXiv:2107.08927v2 [cs.IT] UPDATED)
    (2 min) We consider the estimation of an n-dimensional vector s from the noisy element-wise measurements of $\mathbf{s}\mathbf{s}^T$, a generic problem that arises in statistics and machine learning. We study a mismatched Bayesian inference setting, where some of the parameters are not known to the statistician. We derive the full exact analytic expression of the asymptotic mean squared error (MSE) in the large system size limit for the particular case of Gaussian priors and additive noise. From our formulas, we see that estimation is still possible in the mismatched case; and also that the minimum MSE (MMSE) can be achieved if the statistician chooses suitable parameters. Our technique relies on the asymptotics of the spherical integrals and can be applied as long as the statistician chooses a rotationally invariant prior.
    Pyramid Hybrid Pooling Quantization for Efficient Fine-Grained Image Retrieval. (arXiv:2109.05206v1 [cs.CV])
    (2 min) Deep hashing approaches, including deep quantization and deep binary hashing, have become a common solution to large-scale image retrieval due to high computation and storage efficiency. Most existing hashing methods can not produce satisfactory results for fine-grained retrieval, because they usually adopt the outputs of the last CNN layer to generate binary codes, which is less effective to capture subtle but discriminative visual details. To improve fine-grained image hashing, we propose Pyramid Hybrid Pooling Quantization (PHPQ). Specifically, we propose a Pyramid Hybrid Pooling (PHP) module to capture and preserve fine-grained semantic information from multi-level features. Besides, we propose a learnable quantization module with a partial attention mechanism, which helps to optimize the most relevant codewords and improves the quantization. Comprehensive experiments demonstrate that PHPQ outperforms state-of-the-art methods.
    MURAL: Multimodal, Multitask Retrieval Across Languages. (arXiv:2109.05125v1 [cs.IR])
    (2 min) Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
    Existence conditions for hidden feedback loops in online recommender systems. (arXiv:2109.05278v1 [cs.IR])
    (2 min) We explore the hidden feedback loop effect in online recommender systems. Feedback loops degrade online multi-armed bandit (MAB) recommendations to a small subset of items, with a loss of coverage and novelty. We study how uncertainty and noise in user interests influence the existence of feedback loops. First, we show that unbiased additive random noise in user interests does not prevent a feedback loop. Second, we demonstrate that a non-zero probability of resetting user interests is sufficient to limit the feedback loop, and we estimate the size of the effect. Our experiments confirm the theoretical findings in a simulated environment for four bandit algorithms.
    Efficient-FedRec: Efficient Federated Learning Framework for Privacy-Preserving News Recommendation. (arXiv:2109.05446v1 [cs.IR])
    (2 min) News recommendation is critical for personalized news access. Most existing news recommendation methods rely on centralized storage of users' historical news click behavior data, which may lead to privacy concerns and hazards. Federated learning is a privacy-preserving framework that lets multiple clients collaboratively train models without sharing their private data. However, the computation and communication costs of directly learning many existing news recommendation models in a federated way are unacceptable for user clients. In this paper, we propose an efficient federated learning framework for privacy-preserving news recommendation. Instead of training and communicating the whole model, we decompose the news recommendation model into a large news model maintained on the server and a lightweight user model shared by both server and clients, where news representations and the user model are communicated between server and clients. More specifically, the clients request the user model and news representations from the server, and send their locally computed gradients to the server for aggregation. The server updates its global user model with the aggregated gradients, and further updates its news model to infer updated news representations. Since the local gradients may contain private information, we propose a secure aggregation method to aggregate gradients in a privacy-preserving way. Experiments on two real-world datasets show that our method can reduce the computation and communication cost on clients while keeping promising model performance.
    Uni-FedRec: A Unified Privacy-Preserving News Recommendation Framework for Model Training and Online Serving. (arXiv:2109.05236v1 [cs.IR])
    (2 min) News recommendation is important for personalized online news services. Most existing news recommendation methods rely on centrally stored user behavior data to both train models offline and provide online recommendation services. However, user data is usually highly privacy-sensitive, and centrally storing them may raise privacy concerns and risks. In this paper, we propose a unified news recommendation framework, which can utilize user data locally stored in user clients to train models and serve users in a privacy-preserving way. Following a widely used paradigm in real-world recommender systems, our framework contains two stages. The first one is for candidate news generation (i.e., recall) and the second one is for candidate news ranking (i.e., ranking). At the recall stage, each client locally learns multiple interest representations from clicked news to comprehensively model user interests. These representations are uploaded to the server to recall candidate news from a large news pool, which are further distributed to the user client at the ranking stage for personalized news display. In addition, we propose an interest decomposer-aggregator method with perturbation noise to better protect private user information encoded in user interest representations. Besides, we collaboratively train both recall and ranking models on the data decentralized in a large number of user clients in a privacy-preserving way. Experiments on two real-world news datasets show that our method can outperform baseline methods and effectively protect user privacy.
    CauseRec: Counterfactual User Sequence Synthesis for Sequential Recommendation. (arXiv:2109.05261v1 [cs.IR])
    (2 min) Learning user representations based on historical behaviors lies at the core of modern recommender systems. Recent advances in sequential recommenders have convincingly demonstrated high capability in extracting effective user representations from the given behavior sequences. Despite significant progress, we argue that solely modeling the observational behavior sequences may yield a brittle and unstable system, due to the noisy and sparse nature of logged user interactions. In this paper, we propose to learn accurate and robust user representations, which should be less sensitive to (attacks on) noisy behaviors and rely more on the indispensable ones, by modeling the counterfactual data distribution. Specifically, given an observed behavior sequence, the proposed CauseRec framework identifies dispensable and indispensable concepts at both the fine-grained item level and the abstract interest level. CauseRec conditionally samples user concept sequences from the counterfactual data distributions by replacing dispensable and indispensable concepts within the original concept sequence. With user representations obtained from the synthesized user sequences, CauseRec performs contrastive user representation learning by contrasting the counterfactual with the observational. We conduct extensive experiments on real-world public recommendation benchmarks and justify the effectiveness of CauseRec with multi-aspect model analysis. The results demonstrate that the proposed CauseRec outperforms state-of-the-art sequential recommenders by learning accurate and robust user representations.
    Contrastive Quantization with Code Memory for Unsupervised Image Retrieval. (arXiv:2109.05205v1 [cs.CV])
    (2 min) The high efficiency in computation and storage makes hashing (including binary hashing and quantization) a common strategy in large-scale retrieval systems. To alleviate the reliance on expensive annotations, unsupervised deep hashing becomes an important research problem. This paper provides a novel solution to unsupervised deep quantization, namely Contrastive Quantization with Code Memory (MeCoQ). Different from existing reconstruction-based strategies, we learn unsupervised binary descriptors by contrastive learning, which can better capture discriminative visual semantics. Besides, we uncover that codeword diversity regularization is critical to prevent contrastive learning-based quantization from model degeneration. Moreover, we introduce a novel quantization code memory module that boosts contrastive learning with lower feature drift than conventional feature memories. Extensive experiments on benchmark datasets show that MeCoQ outperforms state-of-the-art methods.
    Discovering Technology Gaps using the IntSight Knowledge Navigator. (arXiv:2109.05142v1 [cs.DB])
    (2 min) Knowledge analysis is an important application of knowledge graphs. In this paper, we present a complex knowledge analysis problem that discovers the gaps in the technology areas of interest to an organization. Our knowledge graph is developed on a heterogeneous data management platform. The analysis combines semantic search, graph analytics, and polystore query optimization.
  • cs.LG updates on arXiv.org

    On the Accuracy of Analog Neural Network Inference Accelerators. (arXiv:2109.01262v2 [cs.AR] UPDATED)
    (2 min) Specialized accelerators have recently garnered attention as a method to reduce the power consumption of neural network inference. A promising category of accelerators utilizes nonvolatile memory arrays to both store weights and perform $\textit{in situ}$ analog computation inside the array. While prior work has explored the design space of analog accelerators to optimize performance and energy efficiency, there is seldom a rigorous evaluation of the accuracy of these accelerators. This work shows how architectural design decisions, particularly in mapping neural network parameters to analog memory cells, influence inference accuracy. When evaluated using ResNet50 on ImageNet, the resilience of the system to analog non-idealities - cell programming errors, analog-to-digital converter resolution, and array parasitic resistances - all improve when analog quantities in the hardware are made proportional to the weights in the network. Moreover, contrary to the assumptions of prior work, nearly equivalent resilience to cell imprecision can be achieved by fully storing weights as analog quantities, rather than spreading weight bits across multiple devices, often referred to as bit slicing. By exploiting proportionality, analog system designers have the freedom to match the precision of the hardware to the needs of the algorithm, rather than attempting to guarantee the same level of precision in the intermediate results as an equivalent digital accelerator. This ultimately results in an analog accelerator that is more accurate, more robust to analog errors, and more energy-efficient.
    YASO: A Targeted Sentiment Analysis Evaluation Dataset for Open-Domain Reviews. (arXiv:2012.14541v2 [cs.CL] UPDATED)
    (2 min) Current TSA evaluation in a cross-domain setup is restricted to the small set of review domains available in existing datasets. Such an evaluation is limited, and may not reflect true performance on sites like Amazon or Yelp that host diverse reviews from many domains. To address this gap, we present YASO - a new TSA evaluation dataset of open-domain user reviews. YASO contains 2,215 English sentences from dozens of review domains, annotated with target terms and their sentiment. Our analysis verifies the reliability of these annotations, and explores the characteristics of the collected data. Benchmark results using five contemporary TSA systems show there is ample room for improvement on this challenging new dataset. YASO is available at https://github.com/IBM/yaso-tsa.
    Langevin Monte Carlo: random coordinate descent and variance reduction. (arXiv:2007.14209v6 [stat.ML] UPDATED)
    (2 min) Langevin Monte Carlo (LMC) is a popular Bayesian sampling method. For log-concave distributions, the method converges exponentially fast, up to a controllable discretization error. However, the method requires the evaluation of a full gradient in each iteration, and for a problem on $\mathbb{R}^d$ this amounts to $d$ partial derivative evaluations per iteration. The cost is high when $d\gg1$. In this paper, we investigate how to enhance computational efficiency by applying RCD (random coordinate descent) to LMC. The theory has two sides: (1) by blindly applying RCD to LMC, one replaces the full gradient with a single randomly selected directional derivative per iteration; although the cost per iteration is reduced, the total number of iterations needed to achieve a preset error tolerance increases, so ultimately there is no computational gain. (2) We then incorporate variance reduction techniques, such as SAGA (stochastic average gradient) and SVRG (stochastic variance reduced gradient), into RCD-LMC. We prove that the cost is reduced compared with classical LMC, and that in the underdamped case convergence is achieved with the same number of iterations while each iteration requires merely one directional derivative. This means we obtain the best possible computational cost in the underdamped-LMC framework.
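    A sketch of the "blind" RCD-LMC variant in point (1), assuming a standard Gaussian target: each iteration uses a single randomly chosen partial derivative, scaled by $d$ so that in expectation it matches the full-gradient drift. The variance-reduced (SAGA/SVRG) versions and the underdamped dynamics are not shown.

        # Overdamped Langevin Monte Carlo where only one randomly chosen
        # coordinate of the gradient is used per iteration (no variance
        # reduction). Toy target: a standard Gaussian.
        import numpy as np

        def grad_log_density(x):
            # grad log p(x) = -x for a standard Gaussian.
            return -x

        def rcd_lmc(x0, step=1e-3, n_iters=20000, seed=0):
            rng = np.random.default_rng(seed)
            x = x0.astype(float).copy()
            d = x.size
            for _ in range(n_iters):
                i = rng.integers(d)            # pick one coordinate
                # One directional derivative (for the toy target we simply
                # index the full gradient).
                g_i = grad_log_density(x)[i]
                x += np.sqrt(2.0 * step) * rng.normal(size=d)
                # d * e_i * grad_i is an unbiased surrogate of the full gradient.
                x[i] += step * d * g_i
            return x

        print(rcd_lmc(np.full(5, 3.0)))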
    Utility Fairness for the Differentially Private Federated Learning. (arXiv:2109.05267v1 [cs.LG])
    (2 min) Federated learning (FL) allows predictive model training on data sensed in a wireless Internet of Things (IoT) network, avoiding data collection costs in terms of energy, time, and privacy. In this paper, for an FL setting, we model the learning gain achieved by an IoT device against its participation cost as its utility. The local model quality and the associated cost differ from device to device due to device heterogeneity, which can be time-varying. We identify that this results in utility unfairness because the same global model is shared among the devices. In the vanilla FL setting, the master is unaware of the devices' local model computation and transmission costs, so it is unable to address the utility unfairness problem. In addition, a device may exploit this lack of knowledge at the master to intentionally reduce its expenditure and thereby boost its utility. We propose to control the quality of the global model shared with the devices in each round, based on their contribution and expenditure. This is achieved by employing differential privacy to curtail global model divulgence based on the learning contribution. Furthermore, we devise adaptive computation and transmission policies for each device to control its expenditure in order to mitigate utility unfairness. Our results show that the proposed scheme reduces the standard deviation of the devices' energy cost by 99% compared to the benchmark scheme, while the standard deviation of the devices' training loss varies around 0.103.
    Protein sequence-to-structure learning: Is this the end(-to-end revolution)?. (arXiv:2105.07407v2 [q-bio.BM] UPDATED)
    (2 min) The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, i.e. learning on representations such as graphs, 3D Voronoi tessellations, and point clouds; (ii) pre-trained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; (vi) and finally truly end-to-end architectures, i.e. differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last two years and widely used in CASP14.
    Benchmarking Processor Performance by Multi-Threaded Machine Learning Algorithms. (arXiv:2109.05276v1 [cs.DC])
    (2 min) Machine learning algorithms have enabled computers to make predictions by learning from previous data. Data storage and processing power are increasing rapidly, and with them the number of machine learning and artificial intelligence applications. Much past work has aimed at improving model accuracy, with little research devoted to determining the computational costs of machine learning. In this paper, I pursue this latter line of research and compare the performance of multi-threaded machine learning algorithms. I work with Linear Regression, Random Forest, and K-Nearest Neighbors to determine the performance characteristics of the algorithms as well as the computational cost of obtaining the results. I benchmark system hardware performance by running these multi-threaded algorithms to train and test models on a dataset, noting the differences in the performance metrics of the algorithms. Finally, I report the best-performing algorithms with respect to their performance efficiency on my system.
    Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information. (arXiv:2109.05198v1 [cs.LG])
    (2 min) We present a novel adaptive optimization algorithm for large-scale machine learning problems. Equipped with a low-cost estimate of local curvature and Lipschitz smoothness, our method dynamically adapts the search direction and step-size. The search direction contains gradient information preconditioned by a well-scaled diagonal preconditioning matrix that captures the local curvature information. Our methodology does not require the tedious task of learning rate tuning, as the learning rate is updated automatically without adding an extra hyperparameter. We provide convergence guarantees on a comprehensive collection of optimization problems, including convex, strongly convex, and nonconvex problems, in both deterministic and stochastic regimes. We also conduct an extensive empirical evaluation on standard machine learning problems, justifying our algorithm's versatility and demonstrating its strong performance compared to other state-of-the-art first-order and second-order methods.
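    The key ingredient, a gradient step preconditioned by a diagonal matrix capturing local curvature, can be illustrated on a badly scaled quadratic; the curvature and Lipschitz estimates and the step-size rule of the actual method are not reproduced here.

        # Diagonally preconditioned gradient descent on an ill-conditioned
        # quadratic f(x) = 0.5 x^T A x; the preconditioner is the diagonal of
        # the Hessian (exact here, estimated cheaply in the method above).
        import numpy as np

        A = np.diag([100.0, 1.0])           # badly scaled quadratic
        x = np.array([1.0, 1.0])
        eps = 1e-8

        for _ in range(50):
            grad = A @ x
            diag_hess = np.diag(A)          # local curvature estimate
            x -= grad / (diag_hess + eps)   # preconditioned step, no tuned rate

        print(x)  # converges to the minimizer [0, 0] despite the conditioning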
    Learning From Long-Tailed Data With Noisy Labels. (arXiv:2108.11096v2 [cs.CV] UPDATED)
    (2 min) Class imbalance and noisy labels are the norm rather than the exception in many large-scale classification datasets. Nevertheless, most works in machine learning typically assume balanced and clean data. There have been some recent attempts to tackle, on one side, the problem of learning from noisy labels and, on the other side, learning from long-tailed data. Each group of methods makes simplifying assumptions about the other. Due to this separation, the proposed solutions often underperform when both assumptions are violated. In this work, we present a simple two-stage approach based on recent advances in self-supervised learning to treat both challenges simultaneously. It consists of, first, task-agnostic self-supervised pre-training, followed by task-specific fine-tuning using an appropriate loss. Most significantly, we find that self-supervised learning approaches are effectively able to cope with severe class imbalance. In addition, the resulting learned representations are also remarkably robust to label noise when fine-tuned with an imbalance- and noise-resistant loss function. We validate our claims with experiments on CIFAR-10 and CIFAR-100 augmented with synthetic imbalance and noise, as well as the large-scale inherently noisy Clothing-1M dataset.
    TS2Vec: Towards Universal Representation of Time Series. (arXiv:2106.10466v2 [cs.LG] UPDATED)
    (2 min) This paper presents TS2Vec, a universal framework for learning representations of time series in an arbitrary semantic level. Unlike existing methods, TS2Vec performs contrastive learning in a hierarchical way over augmented context views, which enables a robust contextual representation for each timestamp. Furthermore, to obtain the representation of an arbitrary sub-sequence in the time series, we can apply a simple aggregation over the representations of corresponding timestamps. We conduct extensive experiments on time series classification tasks to evaluate the quality of time series representations. As a result, TS2Vec achieves significant improvement over existing SOTAs of unsupervised time series representation on 125 UCR datasets and 29 UEA datasets. The learned timestamp-level representations also achieve superior results in time series forecasting and anomaly detection tasks. A linear regression trained on top of the learned representations outperforms previous SOTAs of time series forecasting. Furthermore, we present a simple way to apply the learned representations for unsupervised anomaly detection, which establishes SOTA results in the literature. The source code is publicly available at https://github.com/yuezhihan/ts2vec.
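    The aggregation step mentioned above (a sub-sequence representation obtained by pooling timestamp-level representations) can be sketched as follows; the random matrix stands in for encoder outputs, and the released implementation may use a different pooling.

        # Deriving a sub-sequence representation from timestamp-level
        # representations by simple pooling, as described in the abstract.
        import numpy as np

        T, D = 200, 64
        rng = np.random.default_rng(0)
        timestamp_repr = rng.normal(size=(T, D))  # per-timestamp embeddings

        def subsequence_repr(reprs, start, end):
            """Representation of the sub-sequence [start, end) via max pooling."""
            return reprs[start:end].max(axis=0)

        window = subsequence_repr(timestamp_repr, 50, 120)
        whole_series = subsequence_repr(timestamp_repr, 0, T)
        print(window.shape, whole_series.shape)   # (64,) (64,)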
    Toward Communication Efficient Adaptive Gradient Method. (arXiv:2109.05109v1 [cs.LG])
    (2 min) In recent years, distributed optimization is proven to be an effective approach to accelerate training of large scale machine learning models such as deep neural networks. With the increasing computation power of GPUs, the bottleneck of training speed in distributed training is gradually shifting from computation to communication. Meanwhile, in the hope of training machine learning models on mobile devices, a new distributed training paradigm called ``federated learning'' has become popular. The communication time in federated learning is especially important due to the low bandwidth of mobile devices. While various approaches to improve the communication efficiency have been proposed for federated learning, most of them are designed with SGD as the prototype training algorithm. While adaptive gradient methods have been proven effective for training neural nets, the study of adaptive gradient methods in federated learning is scarce. In this paper, we propose an adaptive gradient method that can guarantee both the convergence and the communication efficiency for federated learning.
    Unified Framework for Feature Extraction based on Contrastive Learning. (arXiv:2101.11703v2 [cs.LG] UPDATED)
    (2 min) Feature extraction is an efficient approach for alleviating the issue of dimensionality in high-dimensional data. As a popular self-supervised learning method, contrastive learning has recently garnered considerable attention. In this study, we propose a unified framework based on a new perspective on contrastive learning (CL) that is suitable for both unsupervised and supervised feature extraction. The proposed framework first constructs two CL graphs to uniquely define the positive and negative pairs. Subsequently, the projection matrix is determined by minimizing the contrastive loss function. In addition, the proposed framework considers both similar and dissimilar samples to unify unsupervised and supervised feature extraction. Moreover, we propose three specific methods: an unsupervised contrastive learning method, supervised contrastive learning method 1, and supervised contrastive learning method 2. Finally, numerical experiments on five real datasets demonstrate the superior performance of the proposed framework in comparison with existing methods.
    Beneficial Perturbations Network for Defending Adversarial Examples. (arXiv:2009.12724v3 [cs.LG] UPDATED)
    (2 min) Deep neural networks can be fooled by adversarial attacks: adding carefully computed small adversarial perturbations to clean inputs can cause misclassification on state-of-the-art machine learning models. The reason is that neural networks fail to accommodate the distribution drift of the input data caused by adversarial perturbations. Here, we present a new solution, the Beneficial Perturbation Network (BPN), to defend against adversarial attacks by fixing the distribution drift. During training, BPN generates and leverages beneficial perturbations (somewhat opposite to well-known adversarial perturbations) by adding new, out-of-network biasing units. Biasing units influence the parameter space of the network to preempt and neutralize future adversarial perturbations on input data samples. To achieve this, BPN creates reverse adversarial attacks during training, at very little cost, by recycling the training gradients already computed. Reverse attacks are captured by the biasing units, and the biases can in turn effectively defend against future adversarial examples. Reverse attacks are a shortcut, i.e., they affect the network's parameters without requiring instantiation of adversarial examples that could assist training. We provide comprehensive empirical evidence showing that (1) BPN is robust to adversarial examples and is much more memory- and compute-efficient than classical adversarial training; (2) BPN can defend against adversarial examples with negligible additional computation and parameter costs compared to training only on clean examples; (3) BPN hurts the accuracy on clean examples much less than classic adversarial training; (4) BPN can improve the generalization of the network; and (5) BPN trained only with the Fast Gradient Sign Attack generalizes to defend against PGD attacks.
    Quaternion Graph Neural Networks. (arXiv:2008.05089v4 [cs.LG] UPDATED)
    (2 min) Recently, graph neural networks (GNNs) have become an important and active research direction in deep learning. It is worth noting that most existing GNN-based methods learn graph representations within the Euclidean vector space. Beyond the Euclidean space, learning representations and embeddings in hyper-complex spaces has also been shown to be a promising and effective approach. To this end, we propose Quaternion Graph Neural Networks (QGNN) and Gated Quaternion Graph Neural Networks (GQGNN) to learn graph representations within the Quaternion space. As demonstrated, the Quaternion space, a hyper-complex vector space, provides highly meaningful computations and analogical calculus through the Hamilton product compared to the Euclidean and complex vector spaces. Our QGNN obtains state-of-the-art results on a range of benchmark datasets for graph classification and node classification. Besides, regarding knowledge graphs, our QGNN-based knowledge graph embedding method achieves state-of-the-art results on three new and challenging benchmark datasets for knowledge graph completion. Furthermore, regarding text graphs, our GQGNN-based text classification method works better than state-of-the-art methods on benchmark datasets for inductive text classification. Our code is available at: \url{https://github.com/daiquocnguyen/QGNN}.
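    For reference, the Hamilton product mentioned above, i.e. quaternion multiplication, can be written directly; this snippet is purely illustrative and unrelated to the QGNN codebase.

        # Hamilton product of quaternions q = (a, b, c, d) = a + bi + cj + dk,
        # the operation quaternion-valued models use in place of ordinary
        # real or complex multiplication inside their feature transformations.
        def hamilton_product(q1, q2):
            a1, b1, c1, d1 = q1
            a2, b2, c2, d2 = q2
            return (
                a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
                a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
                a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
                a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
            )

        # i * j = k, while j * i = -k, illustrating non-commutativity.
        print(hamilton_product((0, 1, 0, 0), (0, 0, 1, 0)))  # (0, 0, 0, 1)
        print(hamilton_product((0, 0, 1, 0), (0, 1, 0, 0)))  # (0, 0, 0, -1)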
    Causal Learner: A Toolbox for Causal Structure and Markov Blanket Learning. (arXiv:2103.06544v2 [cs.LG] UPDATED)
    (2 min) Causal Learner is a toolbox for learning causal structure and Markov blanket (MB) from data. It integrates functions for generating simulated Bayesian network data, a set of state-of-the-art global causal structure learning algorithms, a set of state-of-the-art local causal structure learning algorithms, a set of state-of-the-art MB learning algorithms, and functions for evaluating algorithms. The data generation part of Causal Learner is written in R, and the rest of Causal Learner is written in MATLAB. Causal Learner aims to provide researchers and practitioners with an open-source platform for causal learning from data and for the development and evaluation of new causal learning algorithms. The Causal Learner project is available at this http URL
    Knowledge Graphs. (arXiv:2003.02320v6 [cs.AI] UPDATED)
    (2 min) In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
    Machine Learning for Naval Architecture, Ocean and Marine Engineering. (arXiv:2109.05574v1 [cs.LG])
    (2 min) Machine learning (ML) based algorithms have had a significant impact in many fields of engineering and science where datasets are available from experiments and high-fidelity numerical simulations. These datasets are generally used in a machine learning model to extract information about the underlying physics and to derive functional relationships mapping input variables to target quantities of interest. Commonplace machine learning algorithms in Scientific Machine Learning (SciML) include neural networks, regression trees, random forests, support vector machines, etc. The focus of this article is to review the applications of ML to naval architecture, ocean, and marine engineering problems, and to identify priority directions of research. We discuss applications of machine learning algorithms to problems such as wave height prediction, calculation of wind loads on ships, damage detection for offshore platforms, calculation of ship added resistance, and various other applications in coastal and marine environments. Details of the datasets used in developing the ML models, including their sources, are included. The features used as inputs to the ML models are presented in detail, and finally the methods employed to optimize the ML models are discussed. Based on this comprehensive analysis, we point out future directions of research that may be fruitful for the application of ML to ocean and marine engineering problems.
    Disentangling Representations of Text by Masking Transformers. (arXiv:2104.07155v2 [cs.CL] UPDATED)
    (2 min) Representations from large pretrained models such as BERT encode a range of features into monolithic vectors, affording strong predictive accuracy across a multitude of downstream tasks. In this paper we explore whether it is possible to learn disentangled representations by identifying existing subnetworks within pretrained models that encode distinct, complementary aspect representations. Concretely, we learn binary masks over transformer weights or hidden units to uncover subsets of features that correlate with a specific factor of variation; this eliminates the need to train a disentangled model from scratch for a particular task. We evaluate this method with respect to its ability to disentangle representations of sentiment from genre in movie reviews, "toxicity" from dialect in Tweets, and syntax from semantics. By combining masking with magnitude pruning we find that we can identify sparse subnetworks within BERT that strongly encode particular aspects (e.g., toxicity) while only weakly encoding others (e.g., race). Moreover, despite only learning masks, we find that disentanglement-via-masking performs as well as -- and often better than -- previously proposed methods based on variational autoencoders and adversarial training.
    Noise Estimation for Generative Diffusion Models. (arXiv:2104.02600v2 [cs.LG] UPDATED)
    (2 min) Generative diffusion models have emerged as leading models in speech and image generation. However, in order to perform well with a small number of denoising steps, a costly tuning of the set of noise parameters is needed. In this work, we present a simple and versatile learning scheme that can step-by-step adjust those noise parameters, for any given number of steps, while the previous work needs to retune for each number separately. Furthermore, without modifying the weights of the diffusion model, we are able to significantly improve the synthesis results, for a small number of steps. Our approach comes at a negligible computation cost.
    Mixing between the Cross Entropy and the Expectation Loss Terms. (arXiv:2109.05635v1 [cs.LG])
    (2 min) The cross entropy loss is widely used due to its effectiveness and solid theoretical grounding. However, as training progresses, the loss tends to focus on hard-to-classify samples, which may prevent the network from obtaining further gains in performance. While most work in the field suggests ways to classify hard negatives, we suggest strategically leaving hard negatives behind in order to focus on misclassified samples with higher probabilities. We show that adding the expectation loss, which is a better approximation of the zero-one loss, to the optimization goal helps the network achieve better accuracy. We therefore propose to shift between the two losses during training, gradually focusing more on the expectation loss during the later stages of training. Our experiments show that the new training protocol improves performance across a diverse set of classification domains, including computer vision, natural language processing, tabular data, and sequences. Our code and scripts are available in the supplementary material.
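    A minimal sketch of shifting between the two losses during training, assuming a linear schedule and taking the expectation loss to be one minus the probability assigned to the true class; both choices are illustrative rather than the paper's exact protocol.

        # Linearly shifting weight from cross entropy toward an expectation-style
        # loss over the course of training (illustrative schedule and loss form).
        import torch
        import torch.nn.functional as F

        def mixed_loss(logits, targets, epoch, total_epochs):
            probs = F.softmax(logits, dim=-1)
            ce = F.cross_entropy(logits, targets)
            # Expectation-style term: 1 - probability of the true class.
            expectation = (1.0 - probs[torch.arange(len(targets)), targets]).mean()
            alpha = epoch / total_epochs   # 0 -> pure CE, 1 -> pure expectation loss
            return (1.0 - alpha) * ce + alpha * expectation

        logits = torch.randn(8, 5, requires_grad=True)
        targets = torch.randint(0, 5, (8,))
        loss = mixed_loss(logits, targets, epoch=30, total_epochs=100)
        loss.backward()
        print(loss.item())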
    Data Analytics for Smart cities: Challenges and Promises. (arXiv:2109.05581v1 [cs.LG])
    (2 min) The explosion of advances in artificial intelligence, sensor technologies, and wireless communication enables ubiquitous sensing through distributed sensors. These sensors span various network domains and lead to smart systems in healthcare, transportation, the environment, and other related areas. Collaborative interaction among these smart systems connects end-user devices to each other, enabling a new integrated entity called Smart Cities. The goal of this study is to provide a comprehensive survey of data analytics in smart cities. In this paper, we focus on one of the important branches of smart cities, namely Smart Mobility, and its ample positive impact on the smart city decision-making process. Intelligent decision-making systems in smart mobility offer many advantages, such as saving energy, relieving city traffic, and, more importantly, reducing air pollution, by offering real-time useful information and imperative knowledge. Timely decision making in smart cities is challenging due to numerous high-dimensional factors and parameters that are not frequently collected. In this paper, we first address current challenges in smart cities and provide an overview of potential solutions to these challenges. Then, we offer a framework for these solutions, called universal smart cities decision making, with three main sections (data capturing, data analysis, and decision making) to optimize smart mobility within smart cities. With this framework, we elaborate on fundamental concepts of big data, machine learning, and deep learning algorithms that have been applied to smart cities, and discuss the role of these algorithms in decision making for smart mobility in smart cities.
    Critical Learning Periods in Federated Learning. (arXiv:2109.05613v1 [cs.LG])
    (2 min) Federated learning (FL) is a popular technique to train machine learning (ML) models with decentralized data. Extensive works have studied the performance of the global model; however, it is still unclear how the training process affects the final test accuracy. Exacerbating this problem is the fact that FL executions differ significantly from traditional ML, with heterogeneous data characteristics across clients and more hyperparameters involved. In this work, we show that the final test accuracy of FL is dramatically affected by the early phase of the training process; i.e., FL exhibits critical learning periods, in which small gradient errors can have an irrecoverable impact on the final test accuracy. To further explain this phenomenon, we generalize the trace of the Fisher Information Matrix (FIM) to FL and define a new notion called FedFIM, a quantity reflecting the local curvature of each client from the beginning of training in FL. Our findings suggest that the {\em initial learning phase} plays a critical role in understanding FL performance. This is in contrast to many existing works, which generally do not connect the final accuracy of FL to the early phase of training. Finally, seizing critical learning periods in FL is of independent interest and could be useful for other problems, such as choosing hyperparameters (e.g., the number of clients selected per round and the batch size), so as to improve the performance of FL training and testing.
    Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances. (arXiv:2105.12221v2 [cs.LG] UPDATED)
    (2 min) We study how permutation symmetries in overparameterized multi-layer neural networks generate `symmetry-induced' critical points. Assuming a network with $ L $ layers of minimal widths $ r_1^*, \ldots, r_{L-1}^* $ reaches a zero-loss minimum at $ r_1^*! \cdots r_{L-1}^*! $ isolated points that are permutations of one another, we show that adding one extra neuron to each layer is sufficient to connect all these previously discrete minima into a single manifold. For a two-layer overparameterized network of width $ r^*+ h =: m $ we explicitly describe the manifold of global minima: it consists of $ T(r^*, m) $ affine subspaces of dimension at least $ h $ that are connected to one another. For a network of width $m$, we identify the number $G(r,m)$ of affine subspaces containing only symmetry-induced critical points that are related to the critical points of a smaller network of width $r<r^*$. Via a combinatorial analysis, we derive closed-form formulas for $ T $ and $ G $ and show that the number of symmetry-induced critical subspaces dominates the number of affine subspaces forming the global minima manifold in the mildly overparameterized regime (small $ h $) and vice versa in the vastly overparameterized regime ($h \gg r^*$). Our results provide new insights into the minimization of the non-convex loss function of overparameterized neural networks.
    Quick and Robust Feature Selection: the Strength of Energy-efficient Sparse Training for Autoencoders. (arXiv:2012.00560v2 [cs.LG] UPDATED)
    (2 min) Major complications arise from the recent increase in the amount of high-dimensional data, including high computational costs and memory requirements. Feature selection, which identifies the most relevant and informative attributes of a dataset, has been introduced as a solution to this problem. Most of the existing feature selection methods are computationally inefficient; inefficient algorithms lead to high energy consumption, which is not desirable for devices with limited computational and energy resources. In this paper, a novel and flexible method for unsupervised feature selection is proposed. This method, named QuickSelection, introduces the strength of the neuron in sparse neural networks as a criterion to measure the feature importance. This criterion, blended with sparsely connected denoising autoencoders trained with the sparse evolutionary training procedure, derives the importance of all input features simultaneously. We implement QuickSelection in a purely sparse manner as opposed to the typical approach of using a binary mask over connections to simulate sparsity. It results in a considerable speed increase and memory reduction. When tested on several benchmark datasets, including five low-dimensional and three high-dimensional datasets, the proposed method is able to achieve the best trade-off of classification and clustering accuracy, running time, and maximum memory usage, among widely used approaches for feature selection. Besides, our proposed method requires the least amount of energy among the state-of-the-art autoencoder-based feature selection methods.
    Link Scheduling using Graph Neural Networks. (arXiv:2109.05536v1 [eess.SP])
    (2 min) Efficient scheduling of transmissions is a key problem in wireless networks. The main challenge stems from the fact that optimal link scheduling involves solving a maximum weighted independent set (MWIS) problem, which is known to be NP-hard. For practical link scheduling schemes, centralized and distributed greedy heuristics are commonly used to approximate the solution to the MWIS problem. However, these greedy schemes mostly ignore important topological information of the wireless network. To overcome this limitation, we propose fast heuristics based on graph convolutional networks (GCNs) that can be implemented in centralized and distributed manners. Our centralized MWIS solver is based on tree search guided by a trainable GCN module and 1-step rollout. In our distributed MWIS solver, a trainable GCN module learns topology-aware node embeddings that are combined with the network weights before calling a distributed greedy solver. Test results on medium-sized wireless networks show that a GCN-based centralized MWIS solver can reach a near-optimal solution quickly. Moreover, we demonstrate that a shallow GCN-based distributed MWIS scheduler can reduce by nearly half the suboptimality gap of the distributed greedy solver with minimal increase in complexity. The proposed scheduling solutions also exhibit good generalizability across graph and weight distributions.
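    The greedy MWIS heuristic that the proposed GCN-based schedulers build on can be sketched as follows; in the distributed variant, the weights would first be combined with learned topology-aware node embeddings.

        # Greedy maximum-weighted-independent-set heuristic on a conflict graph:
        # repeatedly take the highest-weight remaining link and discard its
        # conflicting neighbours.
        def greedy_mwis(adj, weights):
            remaining = set(adj)
            schedule = []
            while remaining:
                v = max(remaining, key=weights.get)
                schedule.append(v)
                remaining -= {v} | adj[v]
            return schedule

        # 4-cycle conflict graph: link i conflicts with its two neighbours.
        adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
        w = {0: 1.0, 1: 2.5, 2: 0.7, 3: 1.8}
        print(greedy_mwis(adj, w))  # [1, 3]: links 1 and 3 can transmit together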
    Team NeuroPoly: Description of the Pipelines for the MICCAI 2021 MS New Lesions Segmentation Challenge. (arXiv:2109.05409v1 [eess.IV])
    (2 min) This paper gives a detailed description of the pipelines used for the 2nd edition of the MICCAI 2021 Challenge on Multiple Sclerosis Lesion Segmentation. An overview of the data preprocessing steps applied is provided along with a brief description of the pipelines used, in terms of the architecture and the hyperparameters. Our code for this work can be found at: https://github.com/ivadomed/ms-challenge-2021.
    Split Computing and Early Exiting for Deep Learning Applications: Survey and Research Challenges. (arXiv:2103.04505v2 [eess.SP] UPDATED)
    (2 min) Mobile devices such as smartphones and autonomous vehicles increasingly rely on deep neural networks (DNNs) to execute complex inference tasks such as image classification and speech recognition, among others. However, continuously executing the entire DNN on the mobile device can quickly deplete its battery. Although task offloading to edge servers may decrease the mobile device's computational burden, erratic patterns in channel quality and in network and edge server load can lead to a significant delay in task execution. Recently, approaches based on split computing (SC) have been proposed, where the DNN is split into a head model and a tail model, executed respectively on the mobile device and on the edge server. Ultimately, this may reduce bandwidth usage as well as energy consumption. Another approach, called early exiting (EE), trains models to present multiple "exits" earlier in the architecture, each providing increasingly higher target accuracy. Therefore, the trade-off between accuracy and delay can be tuned according to the current conditions or application demands. In this paper, we provide a comprehensive survey of the state of the art in SC and EE strategies by presenting a comparison of the most relevant approaches. We conclude the paper by providing a set of compelling research challenges.
    BioLCNet: Reward-modulated Locally Connected Spiking Neural Networks. (arXiv:2109.05539v1 [cs.NE])
    (2 min) Recent studies have shown that convolutional neural networks (CNNs) are not the only feasible solution for image classification. Furthermore, weight sharing and backpropagation used in CNNs do not correspond to the mechanisms present in the primate visual system. To propose a more biologically plausible solution, we designed a locally connected spiking neural network (SNN) trained using spike-timing-dependent plasticity (STDP) and its reward-modulated variant (R-STDP) learning rules. The use of spiking neurons and local connections along with reinforcement learning (RL) led us to the nomenclature BioLCNet for our proposed architecture. Our network consists of a rate-coded input layer followed by a locally connected hidden layer and a decoding output layer. A spike population-based voting scheme is adopted for decoding in the output layer. We used the MNIST dataset to obtain image classification accuracy and to assess the robustness of our rewarding system to varying target responses.
    Real-time multimodal image registration with partial intraoperative point-set data. (arXiv:2109.05023v1 [eess.IV])
    (2 min) We present Free Point Transformer (FPT) - a deep neural network architecture for non-rigid point-set registration. Consisting of two modules, a global feature extraction module and a point transformation module, FPT does not assume explicit constraints based on point vicinity, thereby overcoming a common requirement of previous learning-based point-set registration methods. FPT is designed to accept unordered and unstructured point-sets with a variable number of points and uses a "model-free" approach without heuristic constraints. Training FPT is flexible and involves minimizing an intuitive unsupervised loss function, but supervised, semi-supervised, and partially- or weakly-supervised training are also supported. This flexibility makes FPT amenable to multimodal image registration problems where the ground-truth deformations are difficult or impossible to measure. In this paper, we demonstrate the application of FPT to non-rigid registration of prostate magnetic resonance (MR) imaging and sparsely-sampled transrectal ultrasound (TRUS) images. The registration errors were 4.71 mm and 4.81 mm for complete TRUS imaging and sparsely-sampled TRUS imaging, respectively. The results indicate superior accuracy to the alternative rigid and non-rigid registration algorithms tested and substantially lower computation time. The rapid inference possible with FPT makes it particularly suitable for applications where real-time registration is beneficial.
    On the Compression of Neural Networks Using $\ell_0$-Norm Regularization and Weight Pruning. (arXiv:2109.05075v1 [cs.LG])
    (2 min) Despite the growing availability of high-capacity computational platforms, implementation complexity remains a great concern for the real-world deployment of neural networks. This concern is due not only to the huge costs of state-of-the-art network architectures, but also to the recent push towards edge intelligence and the use of neural networks in embedded applications. In this context, network compression techniques have been gaining interest due to their ability to reduce deployment costs while keeping inference accuracy at satisfactory levels. The present paper is dedicated to the development of a novel compression scheme for neural networks. To this end, a new $\ell_0$-norm-based regularization approach is first developed, which is capable of inducing strong sparseness in the network during training. Then, by targeting the smaller weights of the trained network with pruning techniques, smaller yet highly effective networks can be obtained. The proposed compression scheme also involves the use of $\ell_2$-norm regularization to avoid overfitting, as well as fine-tuning to improve the performance of the pruned network. Experimental results are presented to show the effectiveness of the proposed scheme and to compare it with competing approaches.
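    The pruning step can be illustrated with plain magnitude pruning of a trained layer; the $\ell_0$-norm-based regularizer that drives weights toward zero during training, and the subsequent fine-tuning, are not reproduced here.

        # Magnitude pruning: zero out the smallest-magnitude weights of a
        # trained layer (the sparsity-inducing training itself is not shown).
        import numpy as np

        def magnitude_prune(weights, sparsity=0.9):
            """Return a copy of `weights` with the smallest `sparsity` fraction zeroed."""
            threshold = np.quantile(np.abs(weights).ravel(), sparsity)
            return np.where(np.abs(weights) >= threshold, weights, 0.0)

        rng = np.random.default_rng(0)
        W = rng.normal(scale=0.1, size=(256, 128))   # stand-in for a trained layer
        W_pruned = magnitude_prune(W, sparsity=0.9)
        print("nonzero fraction:", (W_pruned != 0).mean())   # roughly 0.1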
    Black-Box Optimization via Generative Adversarial Nets. (arXiv:2102.03888v3 [cs.LG] UPDATED)
    (2 min) Black-box optimization (BBO) algorithms are concerned with finding the best solutions to problems with missing analytical details. Most classical methods for such problems are based on strong, fixed a priori assumptions such as Gaussianity. However, complex real-world problems, especially when the global optimum is desired, can be very far from these a priori assumptions because of their diversity, creating unexpected obstacles for such methods. In this paper, we present a generative adversarial nets-based optimizer (OPT-GAN) that adapts to diverse black-box problems by estimating the distribution of optima. The method learns the extensive distribution of the optimal region dominated by selectively and randomly moving candidates, balancing exploration and exploitation. Experiments conducted on Black-box Optimization Benchmarking (BBOB) problems and several other benchmarks with diversified distributions show that OPT-GAN outperforms many traditional and neural-net-based BBO algorithms.
    Pathfinder: Parallel quasi-Newton variational inference. (arXiv:2108.03782v3 [stat.ML] UPDATED)
    (0 min) We introduce Pathfinder, a variational method for approximately sampling from differentiable log densities. Starting from a random initialization, Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer. Pathfinder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the true posterior. We evaluate Pathfinder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance. Compared to ADVI and short dynamic HMC runs, Pathfinder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors. Importance resampling over multiple runs of Pathfinder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes. The Monte Carlo KL-divergence estimates are embarrassingly parallelizable in the core Pathfinder algorithm, as are multiple runs in the resampling version, further increasing Pathfinder's speed advantage with multiple cores.
    Tight FPT Approximation for Socially Fair Clustering. (arXiv:2106.06755v2 [cs.DS] UPDATED)
    (2 min) In this work, we study the socially fair $k$-median/$k$-means problem. We are given a set of points $P$ in a metric space $\mathcal{X}$ with a distance function $d(.,.)$. There are $\ell$ groups: $P_1,\dotsc,P_{\ell} \subseteq P$. We are also given a set $F$ of feasible centers in $\mathcal{X}$. The goal in the socially fair $k$-median problem is to find a set $C \subseteq F$ of $k$ centers that minimizes the maximum average cost over all the groups. That is, find $C$ that minimizes the objective function $\Phi(C,P) \equiv \max_{j} \Big\{ \sum_{x \in P_j} d(C,x)/|P_j| \Big\}$, where $d(C,x)$ is the distance of $x$ to the closest center in $C$. The socially fair $k$-means problem is defined similarly by using squared distances, i.e., $d^{2}(.,.)$ instead of $d(.,.)$. The current best approximation guarantee for both the problems is $O\left( \frac{\log \ell}{\log \log \ell} \right)$ due to Makarychev and Vakilian [COLT 2021]. In this work, we study the fixed parameter tractability of the problems with respect to parameter $k$. We design $(3+\varepsilon)$ and $(9 + \varepsilon)$ approximation algorithms for the socially fair $k$-median and $k$-means problems, respectively, in FPT (fixed parameter tractable) time $f(k,\varepsilon) \cdot n^{O(1)}$, where $f(k,\varepsilon) = (k/\varepsilon)^{{O}(k)}$ and $n = |P \cup F|$. Furthermore, we show that if Gap-ETH holds, then better approximation guarantees are not possible in FPT time.
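    Transcribing the stated objective directly, a small NumPy helper for evaluating $\Phi(C,P)$ on a candidate center set (and its $k$-means variant with squared distances) might look as follows; the points, groups, and centers here are placeholders.

        import numpy as np

        def socially_fair_cost(points, groups, centers, squared=False):
            """Max over groups of the average distance from each group point to its
            closest center; squared distances give the socially fair k-means variant."""
            diffs = points[:, None, :] - centers[None, :, :]          # (n, k, dim)
            dists = np.linalg.norm(diffs, axis=-1).min(axis=1)        # distance to nearest center
            if squared:
                dists = dists ** 2
            return max(dists[g].mean() for g in groups)

        rng = np.random.default_rng(1)
        P = rng.normal(size=(100, 2))
        groups = [np.arange(0, 60), np.arange(60, 100)]               # two groups P_1, P_2
        C = P[rng.choice(len(P), size=3, replace=False)]              # 3 candidate centers
        print("fair k-median cost:", socially_fair_cost(P, groups, C))
        print("fair k-means  cost:", socially_fair_cost(P, groups, C, squared=True))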
    An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification. (arXiv:2109.00201v2 [cs.LG] UPDATED)
    (3 min) In predictive tasks, real-world datasets often present different degrees of imbalanced (i.e., long-tailed or skewed) distributions. While the majority (the head) classes have sufficient samples, the minority (the tail) classes can be under-represented by a rather limited number of samples. Data pre-processing has been shown to be very effective in dealing with such problems. On one hand, data re-sampling is a common approach to tackling class imbalance. On the other hand, dimension reduction, which reduces the feature space, is a conventional technique for reducing noise and inconsistencies in a dataset. However, the possible synergy between feature selection and data re-sampling for high-performance imbalance classification has rarely been investigated. To address this issue, we carry out a comprehensive empirical study on the joint influence of feature selection and re-sampling on two-class imbalance classification. Specifically, we study the performance of two opposite pipelines for imbalance classification, applying feature selection either before or after data re-sampling. We conduct a large number of experiments, with a total of 9225 tests, on 52 publicly available datasets, using 9 feature selection methods, 6 re-sampling approaches for class imbalance learning, and 3 well-known classification algorithms. Experimental results show that there is no constant winner between the two pipelines; thus both of them should be considered to derive the best performing model for imbalance classification. We find that the performance of an imbalance classification model depends not only on the classifier adopted and the ratio between the number of majority and minority samples, but also on the ratio between the number of samples and features. Overall, this study should serve as a new point of reference for researchers and practitioners in imbalance learning.
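    The two pipelines being compared are simply an ordering question: feature selection then re-sampling, or re-sampling then feature selection. A hedged scikit-learn sketch is below, using plain random oversampling in place of the six re-sampling approaches studied and ANOVA-based selection in place of the nine selectors; the dataset is synthetic.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import f1_score

        def random_oversample(X, y, rng):
            """Duplicate minority-class samples until both classes are balanced."""
            classes, counts = np.unique(y, return_counts=True)
            minority = classes[np.argmin(counts)]
            idx = np.where(y == minority)[0]
            extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
            return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

        rng = np.random.default_rng(0)
        X, y = make_classification(n_samples=2000, n_features=40, weights=[0.9, 0.1], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        # Pipeline A: feature selection BEFORE re-sampling
        sel_a = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
        Xa, ya = random_oversample(sel_a.transform(X_tr), y_tr, rng)
        pred_a = LogisticRegression(max_iter=1000).fit(Xa, ya).predict(sel_a.transform(X_te))

        # Pipeline B: feature selection AFTER re-sampling
        Xb, yb = random_oversample(X_tr, y_tr, rng)
        sel_b = SelectKBest(f_classif, k=10).fit(Xb, yb)
        pred_b = LogisticRegression(max_iter=1000).fit(sel_b.transform(Xb), yb).predict(sel_b.transform(X_te))

        print(f"F1 (select -> resample): {f1_score(y_te, pred_a):.3f}")
        print(f"F1 (resample -> select): {f1_score(y_te, pred_b):.3f}")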
    Enhancing Interpretable Clauses Semantically using Pretrained Word Representation. (arXiv:2104.06901v2 [cs.CL] UPDATED)
    (0 min) Tsetlin Machine (TM) is an interpretable pattern recognition algorithm based on propositional logic, which has demonstrated competitive performance in many Natural Language Processing (NLP) tasks, including sentiment analysis, text classification, and Word Sense Disambiguation. To obtain human-level interpretability, legacy TM employs Boolean input features such as bag-of-words (BOW). However, the BOW representation makes it difficult to use any pre-trained information, for instance, word2vec and GloVe word representations. This restriction has constrained the performance of TM compared to deep neural networks (DNNs) in NLP. To reduce the performance gap, in this paper, we propose a novel way of using pre-trained word representations for TM. The approach significantly enhances the performance and interpretability of TM. We achieve this by extracting semantically related words from pre-trained word representations as input features to the TM. Our experiments show that the accuracy of the proposed approach is significantly higher than the previous BOW-based TM, reaching the level of DNN-based models.
    Existence conditions for hidden feedback loops in online recommender systems. (arXiv:2109.05278v1 [cs.IR])
    (0 min) We explore the hidden feedback loop effect in online recommender systems. Feedback loops cause online multi-armed bandit (MAB) recommendations to degenerate to a small subset of items, with a loss of coverage and novelty. We study how uncertainty and noise in user interests influence the existence of feedback loops. First, we show that unbiased additive random noise in user interests does not prevent a feedback loop. Second, we demonstrate that a non-zero probability of resetting user interests is sufficient to limit the feedback loop, and we estimate the size of the effect. Our experiments confirm the theoretical findings in a simulated environment for four bandit algorithms.
    A Novel Intrinsic Measure of Data Separability. (arXiv:2109.05180v1 [cs.LG])
    (0 min) In machine learning, the performance of a classifier depends on both the classifier model and the separability/complexity of datasets. To quantitatively measure the separability of datasets, we create an intrinsic measure -- the Distance-based Separability Index (DSI), which is independent of the classifier model. We consider the situation in which different classes of data are mixed in the same distribution to be the most difficult for classifiers to separate. We then formally show that the DSI can indicate whether the distributions of datasets are identical for any dimensionality. We verify that the DSI is an effective separability measure by comparing it to several state-of-the-art separability/complexity measures on synthetic and real datasets. Having demonstrated the DSI's ability to compare distributions of samples, we also discuss some of its other promising applications, such as measuring the performance of generative adversarial networks (GANs) and evaluating the results of clustering methods.
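    The abstract does not spell out the DSI formula, but one classifier-independent way to compare distance distributions in the spirit described is to contrast intra-class pairwise distances with between-class pairwise distances using a two-sample statistic. The sketch below uses the Kolmogorov-Smirnov statistic purely as an illustrative assumption, not as the paper's exact definition.

        import numpy as np
        from scipy.spatial.distance import pdist, cdist
        from scipy.stats import ks_2samp

        def separability_score(X, y):
            """Illustrative separability score: KS distance between the distributions of
            intra-class and between-class pairwise distances. Near 0 when classes are
            mixed in the same distribution, larger when they are separable."""
            classes = np.unique(y)
            intra = np.concatenate([pdist(X[y == c]) for c in classes])
            between = np.concatenate([
                cdist(X[y == classes[i]], X[y == classes[j]]).ravel()
                for i in range(len(classes)) for j in range(i + 1, len(classes))
            ])
            return ks_2samp(intra, between).statistic

        rng = np.random.default_rng(0)
        mixed = rng.normal(size=(200, 5))
        mixed_labels = rng.integers(0, 2, 200)                       # identical class distributions
        apart = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(4, 1, (100, 5))])
        print("mixed classes:", separability_score(mixed, mixed_labels))
        print("separated    :", separability_score(apart, np.repeat([0, 1], 100)))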
    Remaining Useful Life Estimation of Hard Disk Drives using Bidirectional LSTM Networks. (arXiv:2109.05351v1 [cs.LG])
    (0 min) Physical and cloud storage services are well-served by functioning and reliable high-volume storage systems. Recent observations point to hard disk reliability as one of the most pressing reliability issues in data centers containing massive volumes of storage devices such as HDDs. In this regard, early detection of impending failure at the disk level aids in reducing system downtime and operational loss, making proactive health monitoring a priority for AIOps in such settings. In this work, we introduce methods for extracting meaningful attributes associated with operational failure and for pre-processing the highly imbalanced health statistics data for subsequent prediction tasks using data-driven approaches. We use a Bidirectional LSTM with a multi-day look-back period to learn the temporal progression of health indicators and baseline it against vanilla LSTM and Random Forest models, yielding several key metrics that establish the usefulness and superiority of our model under some tightly defined operational constraints. For example, using a 15-day look-back period, our approach can predict the occurrence of disk failure with an accuracy of 96.4% on test data 60 days before failure. This helps to alert operations maintenance well in advance about potential mitigation needs. In addition, our model reports a mean absolute error of 0.12 for predicting failure up to 60 days in advance, placing it among the state-of-the-art in recent literature.
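    As a rough PyTorch sketch of the modeling side only (the SMART-attribute feature engineering, imbalance handling, and exact look-back/alerting setup are specific to the paper and not reproduced), a bidirectional LSTM over a multi-day window of health indicators ending in a failure/healthy prediction could look like this; the dimensions are placeholders.

        import torch
        import torch.nn as nn

        class BiLSTMFailurePredictor(nn.Module):
            """Bidirectional LSTM over a look-back window of daily health indicators."""
            def __init__(self, n_features: int, hidden: int = 64):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
                self.head = nn.Linear(2 * hidden, 1)   # 2x hidden because of the two directions

            def forward(self, x):                      # x: (batch, lookback_days, n_features)
                out, _ = self.lstm(x)
                return self.head(out[:, -1, :])        # logit for "failure within horizon"

        model = BiLSTMFailurePredictor(n_features=12)
        window = torch.randn(8, 15, 12)                # 8 disks, 15-day look-back, 12 attributes
        logits = model(window)
        loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.randint(0, 2, (8,)).float())
        loss.backward()
        print(logits.shape, float(loss))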
    Cost-Effective Federated Learning in Mobile Edge Networks. (arXiv:2109.05411v1 [cs.LG])
    (0 min) Federated learning (FL) is a distributed learning paradigm that enables a large number of mobile devices to collaboratively learn a model under the coordination of a central server without sharing their raw data. Despite its practical efficiency and effectiveness, the iterative on-device learning process (e.g., local computations and global communications with the server) incurs a considerable cost in terms of learning time and energy consumption, which depends crucially on the number of selected clients and the number of local iterations in each training round. In this paper, we analyze how to design adaptive FL in mobile edge networks that optimally chooses these essential control variables to minimize the total cost while ensuring convergence. We establish the analytical relationship between the total cost and the control variables with the convergence upper bound. To efficiently solve the cost minimization problem, we develop a low-cost sampling-based algorithm to learn the convergence related unknown parameters. We derive important solution properties that effectively identify the design principles for different optimization metrics. Practically, we evaluate our theoretical results both in a simulated environment and on a hardware prototype. Experimental evidence verifies our derived properties and demonstrates that our proposed solution achieves near-optimal performance for different optimization metrics for various datasets and heterogeneous system and statistical settings.
    SphereFace Revived: Unifying Hyperspherical Face Recognition. (arXiv:2109.05565v1 [cs.CV])
    (0 min) This paper addresses the deep face recognition problem under an open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. To this end, hyperspherical face recognition, as a promising line of research, has attracted increasing attention and gradually become a major focus in face recognition research. As one of the earliest works in hyperspherical face recognition, SphereFace explicitly proposed to learn face embeddings with large inter-class angular margin. However, SphereFace still suffers from severe training instability which limits its application in practice. In order to address this problem, we introduce a unified framework to understand large angular margin in hyperspherical face recognition. Under this framework, we extend the study of SphereFace and propose an improved variant with substantially better training stability -- SphereFace-R. Specifically, we propose two novel ways to implement the multiplicative margin, and study SphereFace-R under three different feature normalization schemes (no feature normalization, hard feature normalization and soft feature normalization). We also propose an implementation strategy -- "characteristic gradient detachment" -- to stabilize training. Extensive experiments on SphereFace-R show that it is consistently better than or competitive with state-of-the-art methods.
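    SphereFace-R's specific margin implementations, normalization schemes, and the "characteristic gradient detachment" trick are the paper's contributions and are not reproduced here. The sketch below only shows the generic multiplicative angular-margin idea this line of work builds on (scale the target class's angle by m before taking the cosine, with normalized features and class weights); the margin and scale values are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        def multiplicative_margin_logits(features, class_weights, labels, m=1.35, scale=30.0):
            """Generic multiplicative angular margin: replace cos(theta_y) by cos(m * theta_y)
            for the ground-truth class, on the unit hypersphere."""
            f = F.normalize(features, dim=1)                  # (batch, dim)
            w = F.normalize(class_weights, dim=1)             # (classes, dim)
            cosine = (f @ w.t()).clamp(-1 + 1e-7, 1 - 1e-7)   # (batch, classes)
            theta = torch.acos(cosine)
            logits = cosine.clone()
            rows = torch.arange(features.size(0))
            logits[rows, labels] = torch.cos(m * theta[rows, labels])
            return scale * logits                             # feed into cross-entropy

        feats = torch.randn(4, 128)
        cls_w = torch.randn(10, 128)
        y = torch.tensor([0, 3, 7, 7])
        loss = F.cross_entropy(multiplicative_margin_logits(feats, cls_w, y), y)
        print(float(loss))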
    On the Efficiency of Subclass Knowledge Distillation in Classification Tasks. (arXiv:2109.05587v1 [cs.LG])
    (2 min) This work introduces a novel knowledge distillation framework for classification tasks where information on existing subclasses is available and taken into consideration. In classification tasks with a small number of classes or binary detection (two classes), the amount of information transferred from the teacher to the student network is restricted, thus limiting the utility of knowledge distillation. Performance can be improved by leveraging information about possible subclasses within the available classes in the classification task. To that end, we propose the so-called Subclass Knowledge Distillation (SKD) framework, which is the process of transferring the subclasses' prediction knowledge from a large teacher model into a smaller student one. Through SKD, additional meaningful information which is not in the teacher's class logits but exists in subclasses (e.g., similarities inside classes) will be conveyed to the student and boost its performance. Mathematically, we measure how many extra information bits the teacher can provide for the student via the SKD framework. The framework is evaluated in a clinical application, namely colorectal polyp binary classification. In this application, clinician-provided annotations are used to define subclasses based on the annotation label's variability in a curriculum style of learning. A lightweight, low-complexity student trained with the proposed framework achieves an F1-score of 85.05%, a gain of 2.14% and 1.49% over the student trained without and with conventional knowledge distillation, respectively. These results show that the extra subclass knowledge (i.e., 0.4656 label bits per training sample in our experiment) can provide more information about the teacher's generalization, and therefore SKD can use this additional information to increase the student's performance.
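    The core mechanism, distilling subclass-level predictions into the student while the hard labels stay at the class level, can be sketched roughly as below; the subclass-to-class mapping, temperature, and weighting are placeholders rather than the paper's settings.

        import torch
        import torch.nn.functional as F

        def subclass_kd_loss(student_sub_logits, teacher_sub_logits, class_labels,
                             sub_to_class, T=4.0, alpha=0.5):
            """KL between teacher/student SUBCLASS distributions, plus cross-entropy on CLASS
            labels, where class probabilities are sums of the corresponding subclass probabilities."""
            kd = F.kl_div(F.log_softmax(student_sub_logits / T, dim=1),
                          F.softmax(teacher_sub_logits / T, dim=1),
                          reduction="batchmean") * (T * T)
            n_classes = int(sub_to_class.max()) + 1
            agg = F.one_hot(sub_to_class, num_classes=n_classes).float()   # (n_subclasses, n_classes)
            class_probs = F.softmax(student_sub_logits, dim=1) @ agg       # sum subclass probs per class
            ce = F.nll_loss(torch.log(class_probs + 1e-8), class_labels)
            return alpha * kd + (1 - alpha) * ce

        # toy setting: 2 classes, each split into 3 subclasses
        sub_to_class = torch.tensor([0, 0, 0, 1, 1, 1])
        student = torch.randn(8, 6, requires_grad=True)
        teacher = torch.randn(8, 6)
        labels = torch.randint(0, 2, (8,))
        loss = subclass_kd_loss(student, teacher, labels, sub_to_class)
        loss.backward()
        print(float(loss))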
    Stochastic Adversarial Koopman Model for Dynamical Systems. (arXiv:2109.05095v1 [cs.LG])
    (0 min) Dynamical systems are ubiquitous and are often modeled using a non-linear system of governing equations. Numerical solution procedures for many dynamical systems have existed for several decades, but can be slow due to the high-dimensional state space of the dynamical system. Thus, deep learning-based reduced order models (ROMs) are of interest and one such family of algorithms along these lines are based on the Koopman theory. This paper extends a recently developed adversarial Koopman model (Balakrishnan & Upadhyay, arXiv:2006.05547) to stochastic space, where the Koopman operator applies on the probability distribution of the latent encoding of an encoder. Specifically, the latent encoding of the system is modeled as a Gaussian, and is advanced in time by using an auxiliary neural network that outputs two Koopman matrices $K_{\mu}$ and $K_{\sigma}$. Adversarial and gradient losses are used and this is found to lower the prediction errors. A reduced Koopman formulation is also undertaken where the Koopman matrices are assumed to have a tridiagonal structure, and this yields predictions comparable to the baseline model with full Koopman matrices. The efficacy of the stochastic Koopman model is demonstrated on different test problems in chaos, fluid dynamics, combustion, and reaction-diffusion models. The proposed model is also applied in a setting where the Koopman matrices are conditioned on other input parameters for generalization and this is applied to simulate the state of a Lithium-ion battery in time. The Koopman models discussed in this study are very promising for the wide range of problems considered.
    Jacobian Regularization for Mitigating Universal Adversarial Perturbations. (arXiv:2104.10459v2 [cs.LG] UPDATED)
    (0 min) Universal Adversarial Perturbations (UAPs) are input perturbations that can fool a neural network on large sets of data. They are a class of attacks that represents a significant threat as they facilitate realistic, practical, and low-cost attacks on neural networks. In this work, we derive upper bounds for the effectiveness of UAPs based on norms of data-dependent Jacobians. We empirically verify that Jacobian regularization greatly increases model robustness to UAPs by up to four times whilst maintaining clean performance. Our theoretical analysis also allows us to formulate a metric for the strength of shared adversarial perturbations between pairs of inputs. We apply this metric to benchmark datasets and show that it is highly correlated with the actual observed robustness. This suggests that realistic and practical universal attacks can be reliably mitigated without sacrificing clean accuracy, which shows promise for the robustness of machine learning systems.
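    The regularizer itself is generic enough to sketch: penalize an estimate of the input-output Jacobian norm alongside the task loss. The PyTorch snippet below uses a single random-projection estimate of the squared Frobenius norm per batch, one common way to keep the cost low; the exact estimator and weighting used in the paper may differ.

        import torch
        import torch.nn as nn

        def jacobian_penalty(model, x, n_proj=1):
            """Random-projection estimate of the mean squared Frobenius norm of the
            input-output Jacobian over the batch (v uniform on the output sphere)."""
            x = x.clone().requires_grad_(True)
            out = model(x)
            penalty = 0.0
            for _ in range(n_proj):
                v = torch.randn_like(out)
                v = v / v.norm(dim=1, keepdim=True)
                (grad,) = torch.autograd.grad((out * v).sum(), x, create_graph=True)
                penalty = penalty + grad.pow(2).sum(dim=tuple(range(1, grad.dim()))).mean()
            return out.size(1) * penalty / n_proj      # rescale for the 1/d factor of E[vv^T]

        model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
        x, y = torch.randn(32, 20), torch.randint(0, 10, (32,))
        loss = nn.CrossEntropyLoss()(model(x), y) + 0.01 * jacobian_penalty(model, x)
        loss.backward()
        print(float(loss))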
    Low-Latency Federated Learning over Wireless Channels with Differential Privacy. (arXiv:2106.13039v2 [cs.DC] UPDATED)
    (0 min) In federated learning (FL), model training is distributed over clients and local models are aggregated by a central server. The performance of uploaded models in such situations can vary widely due to imbalanced data distributions, potential demands on privacy protections, and quality of transmissions. In this paper, we aim to minimize FL training delay over wireless channels, constrained by overall training performance as well as each client's differential privacy (DP) requirement. We solve this problem in the framework of multi-agent multi-armed bandits (MAMAB) to deal with the situation where multiple clients confront different unknown transmission environments, e.g., channel fading and interference. Specifically, we first transform the long-term constraints on both training performance and each client's DP into a virtual queue based on the Lyapunov drift technique. Then, we convert the MAMAB problem to a max-min bipartite matching problem at each communication round, by estimating rewards with the upper confidence bound (UCB) approach. More importantly, we propose two efficient solutions to this matching problem, i.e., a modified Hungarian algorithm and greedy matching with a better alternative (GMBA): the first achieves the optimal solution at high complexity, while the second achieves a better trade-off, with verified low complexity and little performance loss. In addition, we develop an upper bound on the expected regret of this MAMAB-based FL framework, which grows linearly in the logarithm of the number of communication rounds, justifying its theoretical feasibility. Extensive experiments are conducted to validate the effectiveness of our proposed algorithms, and the impacts of various parameters on FL performance over wireless edge networks are also discussed.
    Realtime Robust Malicious Traffic Detection via Frequency Domain Analysis. (arXiv:2106.14707v2 [cs.CR] UPDATED)
    (0 min) Machine learning (ML) based malicious traffic detection is an emerging security paradigm, particularly for zero-day attack detection, and is complementary to existing rule-based detection. However, existing ML-based detection achieves low detection accuracy and low throughput due to inefficient traffic feature extraction, and thus cannot detect attacks in real time, especially in high-throughput networks. In particular, like existing rule-based detection, these detection systems can be easily evaded by sophisticated attacks. To this end, we propose Whisper, a realtime ML based malicious traffic detection system that achieves both high accuracy and high throughput by utilizing frequency domain features. It utilizes sequential features represented by the frequency domain features to achieve bounded information loss, which ensures high detection accuracy, and meanwhile constrains the scale of features to achieve high detection throughput. In particular, attackers cannot easily interfere with the frequency domain features, and thus Whisper is robust against various evasion attacks. Our experiments with 42 types of attacks demonstrate that, compared with state-of-the-art systems, Whisper can accurately detect various sophisticated and stealthy attacks, achieving at most an 18.36% improvement and two orders of magnitude higher throughput. Even under various evasion attacks, Whisper is still able to maintain around 90% detection accuracy.
    Neural network based order parameter for phase transitions and its applications in high-entropy alloys. (arXiv:2109.05598v1 [cond-mat.mtrl-sci])
    (0 min) Phase transition is one of the most important phenomena in nature and plays a central role in materials design. All phase transitions are characterized by suitable order parameters, including the order-disorder phase transition. However, finding a representative order parameter for complex systems, such as high-entropy alloys, is nontrivial. Given the variational autoencoder's (VAE) strength in reducing high-dimensional data into a few principal components, here we coin a new concept, the "VAE order parameter". We propose that the Manhattan distance in the VAE latent space can serve as a generic order parameter for order-disorder phase transitions. The physical properties of the order parameter are quantitatively interpreted and demonstrated on multiple refractory high-entropy alloys. Assisted by it, a generally applicable alloy design concept is proposed by mimicking the natural mixing of elements. Our physically interpretable "VAE order parameter" lays the foundation for understanding chemical ordering and for alloy design based on it.
    Upper Bounds on the Generalization Error of Private Algorithms for Discrete Data. (arXiv:2005.05889v3 [cs.IT] UPDATED)
    (0 min) In this work, we study the generalization capability of algorithms from an information-theoretic perspective. It has been shown that the expected generalization error of an algorithm is bounded from above by a function of the relative entropy between the conditional probability distribution of the algorithm's output hypothesis, given the dataset with which it was trained, and its marginal probability distribution. We build upon this fact and introduce a mathematical formulation to obtain upper bounds on this relative entropy. Assuming that the data is discrete, we then develop a strategy using this formulation, based on the method of types and typicality, to find explicit upper bounds on the generalization error of stable algorithms, i.e., algorithms that produce similar output hypotheses given similar input datasets. In particular, we show the bounds obtained with this strategy for the case of $\epsilon$-DP and $\mu$-GDP algorithms.
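    For concreteness, the bound alluded to here is usually cited (following Xu and Raginsky, 2017) in the mutual-information form below, under the assumption of a $\sigma$-sub-Gaussian loss; the dataset-averaged relative entropy described in the abstract is exactly the mutual information $I(W;S)$ between the output hypothesis $W$ and the training sample $S$ of size $n$:

        $\big|\mathbb{E}[\mathrm{gen}(W,S)]\big| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W;S)}, \qquad I(W;S) = \mathbb{E}_{S}\big[ D_{\mathrm{KL}}( P_{W \mid S} \,\Vert\, P_{W} ) \big].$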
    HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation. (arXiv:2109.05490v1 [cs.LG])
    (0 min) Discrete-continuous hybrid action space is a natural setting in many practical problems, such as robot control and game AI. However, most previous Reinforcement Learning (RL) works only demonstrate success in controlling with either a discrete or a continuous action space, and seldom take the hybrid action space into account. One naive way to address hybrid action RL is to convert the hybrid action space into a unified homogeneous action space by discretization or continualization, so that conventional RL algorithms can be applied. However, this ignores the underlying structure of the hybrid action space and also induces scalability issues and additional approximation difficulties, thus leading to degenerated results. In this paper, we propose Hybrid Action Representation (HyAR) to learn a compact and decodable latent representation space for the original hybrid action space. HyAR constructs the latent space and embeds the dependence between the discrete action and the continuous parameter via an embedding table and a conditional Variational Auto-Encoder (VAE). To further improve the effectiveness, the action representation is trained to be semantically smooth through unsupervised environmental dynamics prediction. Finally, the agent learns its policy with conventional DRL algorithms in the learned representation space and interacts with the environment by decoding the hybrid action embeddings to the original action space. We evaluate HyAR in a variety of environments with discrete-continuous action space. The results demonstrate the superiority of HyAR when compared with previous baselines, especially for high-dimensional action spaces.
    CDN-MEDAL: Two-stage Density and Difference Approximation Framework for Motion Analysis. (arXiv:2106.03776v2 [cs.CV] UPDATED)
    (0 min) Background modeling is a promising research area in video analysis with a variety of video surveillance applications. Recent years have witnessed the proliferation of deep neural networks via effective learning-based approaches in motion analysis. However, these techniques provide only a limited description of the observed scenes, since a single-valued mapping is learned to approximate the temporal conditional averages of the target background. On the other hand, statistical learning in the imagery domain, notably Gaussian Mixture Models combined with a foreground extraction step, has become one of the most prevalent approaches, with high adaptability to dynamic context transformation. In this work, we propose a novel, two-stage method of change detection with two convolutional neural networks. The first architecture is grounded in unsupervised Gaussian-mixture statistical learning to describe the scenes' salient features. The second one implements a lightweight pipeline for foreground detection. Our two-stage framework contains approximately 3.5K parameters in total but still maintains rapid convergence on intricate motion patterns. Our experiments on publicly available datasets show that our proposed networks not only are capable of generalizing to regions of moving objects in unseen cases with promising results but also are competitive in efficiency and effectiveness regarding foreground segmentation.
    Automatic Componentwise Boosting: An Interpretable AutoML System. (arXiv:2109.05583v1 [stat.ML])
    (0 min) In practice, machine learning (ML) workflows require various steps, from data preprocessing, missing value imputation, and model selection to model tuning and model evaluation. Many of these steps rely on human ML experts. AutoML - the field of automating these ML pipelines - tries to help practitioners apply ML off-the-shelf without any expert knowledge. Most modern AutoML systems, like auto-sklearn, H2O AutoML, or TPOT, aim for high predictive performance, thereby generating ensembles that consist almost exclusively of black-box models. This, in turn, makes interpretation for the layperson more intricate and adds another layer of opacity for users. We propose an AutoML system that constructs an interpretable additive model that can be fitted using a highly scalable componentwise boosting algorithm. Our system provides tools for easy model interpretation, such as visualizing partial effects and pairwise interactions, allows for a straightforward calculation of feature importance, and gives insights into the model complexity required to fit the given task. We introduce the general framework and outline its implementation, autocompboost. To demonstrate the framework's efficacy, we compare autocompboost to other existing systems based on the OpenML AutoML Benchmark. Despite its restriction to an interpretable model space, our system is competitive in terms of predictive performance on most data sets while being more user-friendly and transparent.
    Locally Valid and Discriminative Predictive Intervals for Deep Learning Models. (arXiv:2106.00225v2 [cs.LG] UPDATED)
    (0 min) Crucial for building trust in deep learning models for critical real-world applications is efficient and theoretically sound uncertainty quantification, a task that continues to be challenging. Useful uncertainty information is expected to have two key properties: It should be valid (guaranteeing coverage) and discriminative (more uncertain when the expected risk is high). Moreover, when combined with deep learning (DL) methods, it should be scalable and affect the DL model performance minimally. Most existing Bayesian methods lack frequentist coverage guarantees and usually affect model performance. The few available frequentist methods are rarely discriminative and/or violate coverage guarantees due to unrealistic assumptions. Moreover, many methods are expensive or require substantial modifications to the base neural network. Building upon recent advances in conformal prediction [13, 31] and leveraging the classical idea of kernel regression, we propose Locally Valid and Discriminative predictive intervals (LVD), a simple, efficient, and lightweight method to construct discriminative predictive intervals (PIs) for almost any DL model. With no assumptions on the data distribution, such PIs also offer finite-sample local coverage guarantees (contrasted to the simpler marginal coverage). We empirically verify, using diverse datasets, that besides being the only locally valid method, LVD also exceeds or matches the performance (including coverage rate and prediction accuracy) of existing uncertainty quantification methods, while offering additional benefits in scalability and flexibility.
    Edge-augmented Graph Transformers: Global Self-attention is Enough for Graphs. (arXiv:2108.03348v2 [cs.LG] UPDATED)
    (0 min) Transformer neural networks have achieved state-of-the-art results for unstructured data such as text and images but their adoption for graph-structured data has been limited. This is partly due to the difficulty of incorporating complex structural information in the basic transformer framework. We propose a simple yet powerful extension to the transformer - residual edge channels. The resultant framework, which we call Edge-augmented Graph Transformer (EGT), can directly accept, process and output structural information as well as node information. It allows us to use global self-attention, the key element of transformers, directly for graphs and comes with the benefit of long-range interaction among nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. In addition, we introduce a generalized positional encoding scheme for graphs based on Singular Value Decomposition which can improve the performance of EGT. Our framework, which relies on global node feature aggregation, achieves better performance compared to Convolutional/Message-Passing Graph Neural Networks, which rely on local feature aggregation within a neighborhood. We verify the performance of EGT in a supervised learning setting on a wide range of experiments on benchmark datasets. Our findings indicate that convolutional aggregation is not an essential inductive bias for graphs and global self-attention can serve as a flexible and adaptive alternative.
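    The SVD-based positional encoding mentioned at the end is easy to sketch in spirit: factor the (possibly asymmetric) adjacency matrix and hand each node the corresponding rows of the leading singular vectors, scaled by the singular values. The precise construction in the paper (normalization, sign handling, number of components) is an assumption here, not a reproduction.

        import numpy as np

        def svd_positional_encoding(adj: np.ndarray, k: int = 4) -> np.ndarray:
            """Per-node positional features from the top-k SVD of the adjacency matrix:
            concatenate U_k*sqrt(S_k) and V_k*sqrt(S_k) so source/target roles are both kept."""
            U, S, Vt = np.linalg.svd(adj, full_matrices=False)
            scale = np.sqrt(S[:k])
            return np.hstack([U[:, :k] * scale, Vt[:k, :].T * scale])   # (n_nodes, 2k)

        # toy directed graph on 5 nodes
        A = np.array([[0, 1, 0, 0, 1],
                      [0, 0, 1, 0, 0],
                      [1, 0, 0, 1, 0],
                      [0, 0, 0, 0, 1],
                      [1, 0, 0, 0, 0]], dtype=float)
        print(svd_positional_encoding(A, k=2).shape)    # (5, 4)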
    MURAL: Multimodal, Multitask Retrieval Across Languages. (arXiv:2109.05125v1 [cs.IR])
    (0 min) Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
    Distribution Aligning Refinery of Pseudo-label for Imbalanced Semi-supervised Learning. (arXiv:2007.08844v2 [cs.LG] UPDATED)
    (0 min) While semi-supervised learning (SSL) has proven to be a promising way for leveraging unlabeled data when labeled data is scarce, the existing SSL algorithms typically assume that training class distributions are balanced. However, these SSL algorithms trained under imbalanced class distributions can severely suffer when generalizing to a balanced testing criterion, since they utilize biased pseudo-labels of unlabeled data toward majority classes. To alleviate this issue, we formulate a convex optimization problem to softly refine the pseudo-labels generated from the biased model, and develop a simple algorithm, named Distribution Aligning Refinery of Pseudo-label (DARP) that solves it provably and efficiently. Under various class-imbalanced semi-supervised scenarios, we demonstrate the effectiveness of DARP and its compatibility with state-of-the-art SSL schemes.
    DynSTGAT: Dynamic Spatial-Temporal Graph Attention Network for Traffic Signal Control. (arXiv:2109.05491v1 [cs.LG])
    (0 min) Adaptive traffic signal control plays a significant role in the construction of smart cities. This task is challenging because of many essential factors, such as cooperation among neighboring intersections and dynamic traffic scenarios. First, to facilitate cooperation of traffic signals, existing work adopts graph neural networks to incorporate the temporal and spatial influences of the surrounding intersections into the target intersection, where spatial-temporal information is used separately. However, one drawback of these methods is that the spatial-temporal correlations are not adequately exploited to obtain a better control scheme. Second, in a dynamic traffic environment, the historical state of the intersection is also critical for predicting future signal switching. Previous work mainly solves this problem using the current intersection's state, neglecting the fact that traffic flow is continuously changing both spatially and temporally and does not handle the historical state. In this paper, we propose a novel neural network framework named DynSTGAT, which integrates dynamic historical state into a new spatial-temporal graph attention network to address the above two problems. More specifically, our DynSTGAT model employs a novel multi-head graph attention mechanism, which aims to adequately exploit the joint relations of spatial-temporal information. Then, to efficiently utilize the historical state information of the intersection, we design a sequence model with the temporal convolutional network (TCN) to capture the historical information and further merge it with the spatial information to improve its performance. Extensive experiments conducted in the multi-intersection scenario on synthetic data and real-world data confirm that our method can achieve superior performance in travel time and throughput against the state-of-the-art methods.
    Particle reconstruction of volumetric particle image velocimetry with strategy of machine learning. (arXiv:1909.07815v2 [eess.IV] UPDATED)
    (0 min) Three-dimensional particle reconstruction with limited two-dimensional projections is an under-determined inverse problem for which an exact solution is often difficult to obtain. In general, approximate solutions can be obtained by iterative optimization methods. In the current work, a practical particle reconstruction method based on a convolutional neural network (CNN) with geometry-informed features is proposed. The proposed technique can refine the particle reconstruction from a very coarse initial guess of the particle distribution generated by any traditional algebraic reconstruction technique (ART) based method. Compared with available ART-based algorithms, the novel technique makes significant improvements in reconstruction quality and robustness to noise, and is at least an order of magnitude faster in the offline stage.
    Check Your Other Door! Establishing Backdoor Attacks in the Frequency Domain. (arXiv:2109.05507v1 [cs.CR])
    (0 min) Deep Neural Networks (DNNs) have been utilized in various applications ranging from image classification and facial recognition to medical imagery analysis and real-time object detection. As our models become more sophisticated and complex, the computational cost of training such models becomes a burden for small companies and individuals; for this reason, outsourcing the training process has been the go-to option for such users. Unfortunately, outsourcing the training process comes at the cost of vulnerability to backdoor attacks. These attacks aim at establishing hidden backdoors in the DNN such that the model performs well on benign samples but outputs a particular target label when a trigger is applied to the input. Current backdoor attacks rely on generating triggers in the image/pixel domain; however, as we show in this paper, it is not the only domain to exploit and one should always "check the other doors". In this work, we propose a complete pipeline for generating a dynamic, efficient, and invisible backdoor attack in the frequency domain. We show the advantages of utilizing the frequency domain for establishing undetectable and powerful backdoor attacks through extensive experiments on various datasets and network architectures. The backdoored models are shown to break various state-of-the-art defences. We also show two possible defences that succeed against frequency-based backdoor attacks and possible ways for the attacker to bypass them. We conclude the work with some remarks regarding a network's learning capacity and the capability of embedding a backdoor attack in the model.
    COSMic: A Coherence-Aware Generation Metric for Image Descriptions. (arXiv:2109.05281v1 [cs.CL])
    (0 min) Developers of text generation models rely on automated evaluation metrics as a stand-in for slow and expensive manual evaluations. However, image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of output text. We address this weakness by introducing the first discourse-aware learned generation metric for evaluating image descriptions. Our approach is inspired by computational theories of discourse for capturing information goals using coherence. We present a dataset of image-description pairs annotated with coherence relations. We then train a coherence-aware metric on a subset of the Conceptual Captions dataset and measure its effectiveness, i.e., its ability to predict human ratings of output captions, on a test set composed of out-of-domain images. We demonstrate a higher Kendall Correlation Coefficient for our proposed metric with the human judgments for the results of a number of state-of-the-art coherence-aware caption generation models when compared to several other metrics including recently proposed learned metrics such as BLEURT and BERTScore.
    Optimizing a domestic battery and solar photovoltaic system with deep reinforcement learning. (arXiv:2109.05024v1 [cs.LG])
    (2 min) A drop in the cost of batteries and solar PV systems has led to high uptake of solar battery home systems. In this work, we use the deep deterministic policy gradient algorithm to optimise the charging and discharging behaviour of a battery within such a system. Our approach outputs continuous actions for charging and discharging the battery, and can function well in a stochastic environment. We show good performance of this algorithm, lowering the expenditure of a single household on electricity to almost $1 AUD for large batteries across selected weeks within a year.
    Improved Algorithms for Misspecified Linear Markov Decision Processes. (arXiv:2109.05546v1 [cs.LG])
    (0 min) For the misspecified linear Markov decision process (MLMDP) model of Jin et al. [2020], we propose an algorithm with three desirable properties. (P1) Its regret after $K$ episodes scales as $K \max \{ \varepsilon_{\text{mis}}, \varepsilon_{\text{tol}} \}$, where $\varepsilon_{\text{mis}}$ is the degree of misspecification and $\varepsilon_{\text{tol}}$ is a user-specified error tolerance. (P2) Its space and per-episode time complexities remain bounded as $K \rightarrow \infty$. (P3) It does not require $\varepsilon_{\text{mis}}$ as input. To our knowledge, this is the first algorithm satisfying all three properties. For concrete choices of $\varepsilon_{\text{tol}}$, we also improve existing regret bounds (up to log factors) while achieving either (P2) or (P3) (existing algorithms satisfy neither). At a high level, our algorithm generalizes (to MLMDPs) and refines the Sup-Lin-UCB algorithm, which Takemura et al. [2021] recently showed satisfies (P3) in the contextual bandit setting.
    Semi-Supervised Learning with Multi-Head Co-Training. (arXiv:2107.04795v2 [cs.LG] UPDATED)
    (0 min) Co-training, extended from self-training, is one of the frameworks for semi-supervised learning. Without a natural split of features, single-view co-training works at the cost of training extra classifiers, and the algorithm must be delicately designed to prevent the individual classifiers from collapsing into each other. To remove these obstacles, which deter the adoption of single-view co-training, we present a simple and efficient algorithm, Multi-Head Co-Training. By integrating base learners into a multi-head structure, the model adds only a minimal number of extra parameters. Every classification head in the unified model interacts with its peers through a "Weak and Strong Augmentation" strategy, in which the diversity is naturally brought by the strong data augmentation. Therefore, the proposed method facilitates single-view co-training by 1) promoting diversity implicitly and 2) requiring only a small extra computational overhead. The effectiveness of Multi-Head Co-Training is demonstrated in an empirical study on standard semi-supervised learning benchmarks.
    Towards a Rigorous Evaluation of Time-series Anomaly Detection. (arXiv:2109.05257v1 [cs.LG])
    (0 min) In recent years, studies on time-series anomaly detection (TAD) have reported high F1 scores on benchmark TAD datasets, giving the impression of clear improvements. However, most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring. In this paper, we theoretically and experimentally reveal that the PA protocol is highly likely to overestimate detection performance; that is, even a random anomaly score can easily appear to be a state-of-the-art TAD method. Therefore, comparing TAD methods using F1 scores after the PA protocol can lead to misguided rankings. Furthermore, we question the potential of existing TAD methods by showing that an untrained model obtains detection performance comparable to the existing methods even without PA. Based on our findings, we propose a new baseline and an evaluation protocol. We expect that our study will support the rigorous evaluation of TAD and lead to further improvement in future research.
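    The point-adjustment step being criticized is simple to state: if any time step inside a ground-truth anomaly segment is flagged, every time step of that segment is counted as detected before precision and recall are computed. A small NumPy sketch of the protocol, and of how it can inflate F1 even for a random score, follows; the threshold and segment layout are arbitrary.

        import numpy as np
        from sklearn.metrics import f1_score

        def point_adjust(pred: np.ndarray, label: np.ndarray) -> np.ndarray:
            """If any point of a true anomaly segment is predicted, mark the whole segment as predicted."""
            adjusted = pred.copy()
            in_segment, start = False, 0
            for i, l in enumerate(np.append(label, 0)):   # sentinel 0 closes a trailing segment
                if l == 1 and not in_segment:
                    in_segment, start = True, i
                elif l == 0 and in_segment:
                    in_segment = False
                    if adjusted[start:i].any():
                        adjusted[start:i] = 1
            return adjusted

        rng = np.random.default_rng(0)
        label = np.zeros(1000, dtype=int)
        label[100:160] = 1; label[400:480] = 1; label[700:705] = 1   # three anomaly segments
        random_pred = (rng.random(1000) > 0.95).astype(int)          # random detector, ~5% positives

        print("F1 without PA:", round(f1_score(label, random_pred), 3))
        print("F1 with    PA:", round(f1_score(label, point_adjust(random_pred, label)), 3))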
    Program Enhanced Fact Verification with Verbalization and Graph Attention Network. (arXiv:2010.03084v6 [cs.AI] UPDATED)
    (0 min) Performing fact verification based on structured data is important for many real-life applications and is a challenging research problem, particularly when it involves both symbolic operations and informal inference based on language understanding. In this paper, we present a Program-enhanced Verbalization and Graph Attention Network (ProgVGAT) to integrate programs and execution into textual inference models. Specifically, a verbalization with program execution model is proposed to accumulate evidences that are embedded in operations over the tables. Built on that, we construct the graph attention verification networks, which are designed to fuse different sources of evidences from verbalized program execution, program structures, and the original statements and tables, to make the final verification decision. To support the above framework, we propose a program selection module optimized with a new training strategy based on margin loss, to produce more accurate programs, which is shown to be effective in enhancing the final verification results. Experimental results show that the proposed framework achieves the new state-of-the-art performance, a 74.4% accuracy, on the benchmark dataset TABFACT.
    FedTriNet: A Pseudo Labeling Method with Three Players for Federated Semi-supervised Learning. (arXiv:2109.05612v1 [cs.LG])
    (0 min) Federated Learning has shown great potential for distributed data utilization and privacy protection. Most existing federated learning approaches focus on the supervised setting, which means that all the data stored on each client has labels. However, in real-world applications, the client data cannot realistically be fully labeled. Thus, how to exploit the unlabeled data is a new challenge for federated learning. Although a few studies attempt to overcome this challenge, they may suffer from information leakage or misleading information usage. To tackle these issues, in this paper, we propose a novel federated semi-supervised learning method named FedTriNet, which consists of two learning phases. In the first phase, we pre-train FedTriNet using labeled data with FedAvg. In the second phase, we aim to make the most of the unlabeled data to help model learning. In particular, we propose to use three networks and a dynamic quality control mechanism to generate high-quality pseudo labels for unlabeled data, which are added to the training set. Finally, FedTriNet uses the new training set to retrain the model. Experimental results on three publicly available datasets show that the proposed FedTriNet outperforms state-of-the-art baselines under both IID and Non-IID settings.
    Hyperspectral and Multispectral Classification for Coastal Wetland Using Depthwise Feature Interaction Network. (arXiv:2106.06896v3 [cs.CV] UPDATED)
    (0 min) The monitoring of coastal wetlands is of great importance to the protection of marine and terrestrial ecosystems. However, due to the complex environment, severe vegetation mixture, and difficulty of access, it is impossible to accurately classify coastal wetlands and identify their species with traditional classifiers. Despite the integration of multisource remote sensing data for performance enhancement, there are still challenges in acquiring and exploiting the complementary merits of multisource data. In this paper, the Depthwise Feature Interaction Network (DFINet) is proposed for wetland classification. A depthwise cross attention module is designed to extract self-correlation and cross-correlation from multisource feature pairs. In this way, meaningful complementary information is emphasized for classification. DFINet is optimized by coordinating consistency loss, discrimination loss, and classification loss. Accordingly, DFINet reaches the standard solution-space under the regularity of loss functions, while the spatial consistency and feature discrimination are preserved. Comprehensive experimental results on two hyperspectral and multispectral wetland datasets demonstrate that the proposed DFINet outperforms other competitive methods in terms of overall accuracy.
    Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies. (arXiv:2108.12084v2 [cs.CL] UPDATED)
    (0 min) Gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. However, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. These harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understanding of non-binary genders in society. In this paper, we explain the complexity of gender and language around it, and survey non-binary persons to understand harms associated with the treatment of gender as binary in English language technologies. We also detail how current language representations (e.g., GloVe, BERT) capture and perpetuate these harms and related challenges that need to be acknowledged and addressed for representations to equitably encode gender information.
    Anomaly Detection in Video via Self-Supervised and Multi-Task Learning. (arXiv:2011.07491v3 [cs.CV] UPDATED)
    (0 min) Anomaly detection in video is a challenging computer vision problem. Due to the lack of anomalous events at training time, anomaly detection requires the design of learning methods without full supervision. In this paper, we approach anomalous event detection in video through self-supervised and multi-task learning at the object level. We first utilize a pre-trained detector to detect objects. Then, we train a 3D convolutional neural network to produce discriminative anomaly-specific information by jointly learning multiple proxy tasks: three self-supervised and one based on knowledge distillation. The self-supervised tasks are: (i) discrimination of forward/backward moving objects (arrow of time), (ii) discrimination of objects in consecutive/intermittent frames (motion irregularity) and (iii) reconstruction of object-specific appearance information. The knowledge distillation task takes into account both classification and detection information, generating large prediction discrepancies between teacher and student models when anomalies occur. To the best of our knowledge, we are the first to approach anomalous event detection in video as a multi-task learning problem, integrating multiple self-supervised and knowledge distillation proxy tasks in a single architecture. Our lightweight architecture outperforms the state-of-the-art methods on three benchmarks: Avenue, ShanghaiTech and UCSD Ped2. Additionally, we perform an ablation study demonstrating the importance of integrating self-supervised learning and normality-specific distillation in a multi-task learning setting.
    Cooperative Multi-Agent Fairness and Equivariant Policies. (arXiv:2106.05727v2 [cs.AI] UPDATED)
    (0 min) We study fairness through the lens of cooperative multi-agent learning. Our work is motivated by empirical evidence that naive maximization of team reward yields unfair outcomes for individual team members. To address fairness in multi-agent contexts, we introduce team fairness, a group-based fairness measure for multi-agent learning. We then prove that it is possible to enforce team fairness during policy optimization by transforming the team's joint policy into an equivariant map. We refer to our multi-agent learning strategy as Fairness through Equivariance (Fair-E) and demonstrate its effectiveness empirically. We then introduce Fairness through Equivariance Regularization (Fair-ER) as a soft-constraint version of Fair-E and show that it reaches higher levels of utility than Fair-E and fairer outcomes than non-equivariant policies. Finally, we present novel findings regarding the fairness-utility trade-off in multi-agent settings, showing that the magnitude of the trade-off depends on agent skill level.
    OmniLytics: A Blockchain-based Secure Data Market for Decentralized Machine Learning. (arXiv:2107.05252v2 [cs.CR] UPDATED)
    (0 min) We propose OmniLytics, a blockchain-based secure data trading marketplace for machine learning applications. Utilizing OmniLytics, many distributed data owners can contribute their private data to collectively train an ML model requested by some model owners, and receive compensation for data contribution. OmniLytics enables such model training while simultaneously providing 1) model security against curious data owners; 2) data security against curious model and data owners; 3) resilience to malicious data owners who provide faulty results to poison model training; and 4) resilience to malicious model owners who intend to evade payment. OmniLytics is implemented as a blockchain smart contract to guarantee the atomicity of payment. In OmniLytics, a model owner splits its model into private and public parts and publishes the public part on the contract. Through the execution of the contract, the participating data owners securely aggregate their locally trained models to update the model owner's public model and receive reimbursement through the contract. We implement a working prototype of OmniLytics on the Ethereum blockchain and perform extensive experiments to measure its gas cost, execution time, and model quality under various parameter combinations. For training a CNN on the MNIST dataset, the model owner is able to boost its model accuracy from 62% to 83% within 500 ms of blockchain processing time. This demonstrates the effectiveness of OmniLytics for practical deployment.
    Meta-Interpretive Learning as Metarule Specialisation. (arXiv:2106.07464v2 [cs.LG] UPDATED)
    (0 min) In Meta-Interpretive Learning (MIL) the metarules, second-order datalog clauses acting as inductive bias, are manually defined by the user. In this work we show that second-order metarules for MIL can be learned by MIL. We define a generality ordering of metarules by $\theta$-subsumption and show that user-defined sort metarules are derivable by specialisation of the most-general matrix metarules in a language class; and that these matrix metarules are in turn derivable by specialisation of third-order punch metarules with variables that range over the set of second-order literals and for which only an upper bound on their number of literals need be user-defined. We show that the cardinality of a metarule language is polynomial in the number of literals in punch metarules. We re-frame MIL as metarule specialisation by resolution. We modify the MIL metarule specialisation operator to return new metarules rather than first-order clauses and prove the correctness of the new operator. We implement the new operator as TOIL, a sub-system of the MIL system Louise. Our experiments show that as user-defined sort metarules are progressively replaced by sort metarules learned by TOIL, Louise's predictive accuracy is maintained at the cost of a small increase in training times. We conclude that automatically derived metarules can replace user-defined metarules.
    Understanding and Resolving Performance Degradation in Graph Convolutional Networks. (arXiv:2006.07107v3 [cs.LG] UPDATED)
    (0 min) A Graph Convolutional Network (GCN) stacks several layers and in each layer performs a PROPagation operation (PROP) and a TRANsformation operation (TRAN) for learning node representations over graph-structured data. Though powerful, GCNs tend to suffer performance drop when the model gets deep. Previous works focus on PROPs to study and mitigate this issue, but the role of TRANs is barely investigated. In this work, we study performance degradation of GCNs by experimentally examining how stacking only TRANs or PROPs works. We find that TRANs contribute significantly, or even more than PROPs, to declining performance, and moreover that they tend to amplify node-wise feature variance in GCNs, causing variance inflammation that we identify as a key factor for causing performance drop. Motivated by such observations, we propose a variance-controlling technique termed Node Normalization (NodeNorm), which scales each node's features using its own standard deviation. Experimental results validate the effectiveness of NodeNorm on addressing performance degradation of GCNs. Specifically, it enables deep GCNs to outperform shallow ones in cases where deep models are needed, and to achieve comparable results with shallow ones on 6 benchmark datasets. NodeNorm is a generic plug-in and can well generalize to other GNN architectures. Code is publicly available at https://github.com/miafei/NodeNorm.
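    The proposed NodeNorm operation, as described, is a per-node rescaling by that node's own feature standard deviation; a minimal PyTorch version is shown below (any root/power variant or epsilon handling in the released code is not reproduced here).

        import torch

        def node_norm(h: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
            """Scale each node's feature vector by its own standard deviation (per row)."""
            std = h.std(dim=1, keepdim=True)          # (num_nodes, 1)
            return h / (std + eps)

        # nodes with very different feature variances
        H = torch.randn(5, 16) * torch.tensor([[0.1], [1.0], [10.0], [0.5], [3.0]])
        H_norm = node_norm(H)
        print(H.std(dim=1))        # heterogeneous per-node variance
        print(H_norm.std(dim=1))   # roughly 1 for every node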
    Towards Improving Adversarial Training of NLP Models. (arXiv:2109.00544v2 [cs.CL] UPDATED)
    (0 min) Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models' performance, and the benefits are mainly uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP models, which we name Attacking to Training (A2T). The core part of A2T is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use A2T to train BERT and RoBERTa models on IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results empirically show that it is possible to train robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with A2T can improve an NLP model's robustness to the attack it was originally trained with and also defend the model against other types of word substitution attacks. Furthermore, we show that A2T can improve NLP models' standard accuracy, cross-domain generalization, and interpretability. Code is available at https://github.com/QData/Textattack-A2T .
    Constrained Non-Affine Alignment of Embeddings. (arXiv:1910.05862v2 [cs.LG] UPDATED)
    (0 min) Embeddings are one of the fundamental building blocks for data analysis tasks. Embeddings are already essential tools for large language models and image analysis, and their use is being extended to many other research domains. The generation of these distributed representations is often a data- and computation-expensive process; yet the holistic analysis and adjustment of them after they have been created is still a developing area. In this paper, we first propose a very general, quantitative measure for the presence of features in the embedding data, based on whether those features can be learned from the embedding. We then devise a method to remove or alleviate undesired features in the embedding while retaining the essential structure of the data. We use a Domain Adversarial Network (DAN) to generate a non-affine transformation, but we add constraints to ensure the essential structure of the embedding is preserved. Our empirical results demonstrate that the proposed algorithm significantly outperforms the state-of-the-art unsupervised algorithm on several data sets, including novel applications from industry.
    Data-Driven Reachability Analysis Using Matrix Zonotopes. (arXiv:2011.08472v3 [eess.SY] UPDATED)
    (2 min) In this paper, we propose a data-driven reachability analysis approach for unknown system dynamics. Reachability analysis is an essential tool for guaranteeing safety properties. However, most current reachability analysis heavily relies on the existence of a suitable system model, which is often not directly available in practice. We instead propose a data-driven reachability analysis approach from noisy data. More specifically, we first provide an algorithm for over-approximating the reachable set of a linear time-invariant system using matrix zonotopes. Then we introduce an extension for Lipschitz nonlinear systems. We provide theoretical guarantees in both cases. Numerical examples show the potential and applicability of the introduced methods.
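    For readers unfamiliar with zonotope-based reachability, the sketch below propagates a reachable set through a known linear system x_{k+1} = A x_k + w with bounded noise. It illustrates only the zonotope arithmetic (linear map and Minkowski sum); the paper's actual contribution, identifying a matrix zonotope of models consistent with noisy data and propagating through it, is not reproduced here, and all numbers are illustrative.

    ```python
    import numpy as np

    class Zonotope:
        """Z = {c + G b : ||b||_inf <= 1}, with center c and generator matrix G."""
        def __init__(self, c, G):
            self.c, self.G = np.asarray(c, float), np.asarray(G, float)

        def linear_map(self, A):              # A Z = {A c + A G b}
            return Zonotope(A @ self.c, A @ self.G)

        def minkowski_sum(self, other):       # centers add, generators concatenate
            return Zonotope(self.c + other.c, np.hstack([self.G, other.G]))

        def interval_hull(self):              # axis-aligned box enclosing Z
            r = np.abs(self.G).sum(axis=1)
            return self.c - r, self.c + r

    # illustrative system and noise bound (assumed, not from the paper)
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    X = Zonotope([1.0, 1.0], 0.1 * np.eye(2))   # initial set
    W = Zonotope([0.0, 0.0], 0.01 * np.eye(2))  # process-noise zonotope

    for _ in range(5):                          # R_{k+1} = A R_k (+) W
        X = X.linear_map(A).minkowski_sum(W)
    print(X.interval_hull())
    ```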
    Learning To Describe Player Form in The MLB. (arXiv:2109.05280v1 [cs.LG])
    (2 min) Major League Baseball (MLB) has a storied history of using statistics to better understand and discuss the game of baseball, with an entire discipline of statistics dedicated to the craft, known as sabermetrics. At their core, all sabermetrics seek to quantify some aspect of the game, often a specific aspect of a player's skill set - such as a batter's ability to drive in runs (RBI) or a pitcher's ability to keep batters from reaching base (WHIP). While useful, such statistics are fundamentally limited by the fact that they are derived from an account of what happened on the field, not how it happened. As a first step towards alleviating this shortcoming, we present a novel, contrastive learning-based framework for describing player form in the MLB. We use form to refer to the way in which a player has impacted the course of play in their recent appearances. Concretely, a player's form is described by a 72-dimensional vector. By comparing clusters of players resulting from our form representations with those resulting from traditional sabermetrics, we demonstrate that our form representations contain information about how players impact the course of play that is not present in traditional, publicly available statistics. We believe these embeddings could be utilized to predict both in-game and game-level events, such as the result of an at-bat or the winner of a game.
    Pairwise Supervised Contrastive Learning of Sentence Representations. (arXiv:2109.05424v1 [cs.CL])
    (2 min) Many recent successes in sentence representation learning have been achieved by simply fine-tuning on the Natural Language Inference (NLI) datasets with triplet loss or siamese loss. Nevertheless, they share a common weakness: sentences in a contradiction pair are not necessarily from different semantic categories. Therefore, optimizing the semantic entailment and contradiction reasoning objective alone is inadequate to capture the high-level semantic structure. The drawback is compounded by the fact that the vanilla siamese or triplet losses only learn from individual sentence pairs or triplets, which often suffer from bad local optima. In this paper, we propose PairSupCon, an instance discrimination based approach aiming to bridge semantic entailment and contradiction understanding with high-level categorical concept encoding. We evaluate PairSupCon on various downstream tasks that involve understanding sentence semantics at different granularities. We outperform the previous state-of-the-art method with $10\%$--$13\%$ averaged improvement on eight clustering tasks, and $5\%$--$6\%$ averaged improvement on seven semantic textual similarity (STS) tasks.
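    As a reference point for the kind of objective involved, here is a generic supervised contrastive (instance-discrimination) loss over labeled sentence embeddings in PyTorch. It is not the PairSupCon objective itself, which couples entailment/contradiction pairs with categorical instance discrimination; the function name, temperature, and pooling are assumptions for illustration.

    ```python
    import torch
    import torch.nn.functional as F

    def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.07):
        """Generic SupCon-style loss: pull together embeddings sharing a label.

        z: (batch, dim) sentence embeddings; labels: (batch,) integer class ids.
        """
        z = F.normalize(z, dim=1)
        sim = z @ z.t() / tau                            # cosine similarities / temperature
        self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        pos_counts = pos_mask.sum(dim=1)
        loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts.clamp(min=1)
        return loss[pos_counts > 0].mean()               # anchors with >= 1 positive
    ```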
    On The Radon-Nikodym Spectral Approach With Optimal Clustering. (arXiv:1906.00460v17 [cs.LG] UPDATED)
    (3 min) Problems of interpolation, classification, and clustering are considered. In the tenets of the Radon--Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where $\psi(\mathbf{x})$ is a linear function of the input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature [1] technique. Whereas in a Bayesian approach new observations change only the outcome probabilities, in the Radon-Nikodym approach not only the outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ change with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can also be applied to the optimal clustering problem. The problem is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant relative to any non-degenerate linear transform of the input vector $\mathbf{x}$ components. A software product implementing the algorithms of interpolation, classification, and optimal clustering is available from the authors.
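    When $\psi(\mathbf{x})$ is linear in the attributes, the generalized eigenproblem above can be set up directly from data: build the two moment matrices $\langle f\,\psi_j\psi_k\rangle$ and $\langle\psi_j\psi_k\rangle$ and hand them to a symmetric generalized eigensolver. The NumPy/SciPy sketch below does exactly that on synthetic data; it is an illustration of the construction, not the authors' software.

    ```python
    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(0)
    n, d = 500, 4
    X = rng.normal(size=(n, d))                     # input attributes x
    f = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # observed f(x)

    Phi = np.hstack([np.ones((n, 1)), X])           # psi basis: constant + linear terms
    G = Phi.T @ Phi / n                             # <psi_j psi_k>
    F = Phi.T @ (f[:, None] * Phi) / n              # <f psi_j psi_k>

    lam, V = eigh(F, G)                             # generalized eigenproblem F v = lambda G v
    # lam are the quadrature values of f; columns of V define the psi^{[i]} states
    print(lam)
    ```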
    Detecting Polarized Topics Using Partisanship-aware Contextualized Topic Embeddings. (arXiv:2104.07814v2 [cs.CL] UPDATED)
    (2 min) Growing polarization of the news media has been blamed for fanning disagreement, controversy and even violence. Early identification of polarized topics is thus an urgent matter that can help mitigate conflict. However, accurate measurement of topic-wise polarization is still an open research challenge. To address this gap, we propose Partisanship-aware Contextualized Topic Embeddings (PaCTE), a method to automatically detect polarized topics from partisan news sources. Specifically, utilizing a language model that has been fine-tuned on recognizing the partisanship of news articles, we represent the ideology of a news corpus on a topic by a corpus-contextualized topic embedding and measure the polarization between corpora using cosine distance. We apply our method to a dataset of news articles about the COVID-19 pandemic. Extensive experiments on different news sources and topics demonstrate the efficacy of our method in capturing topical polarization, as indicated by its effectiveness in retrieving the most polarized topics.
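    The polarization score itself reduces to a cosine distance between the two corpora's embeddings of a topic; a minimal version is below. How the corpus-contextualized topic embeddings are produced from the partisanship-finetuned language model is the substance of PaCTE and is not shown, so the vectors here are placeholders.

    ```python
    import numpy as np

    def topic_polarization(emb_left: np.ndarray, emb_right: np.ndarray) -> float:
        """Cosine distance between two corpora's embeddings of the same topic (range [0, 2])."""
        cos = emb_left @ emb_right / (np.linalg.norm(emb_left) * np.linalg.norm(emb_right))
        return float(1.0 - cos)

    # placeholder embeddings standing in for corpus-contextualized topic vectors
    left, right = np.random.randn(768), np.random.randn(768)
    print(topic_polarization(left, right))
    ```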
    MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans. (arXiv:2103.11161v2 [cs.CV] UPDATED)
    (2 min) We propose a novel method for reconstructing floor plans from noisy 3D point clouds. Our main contribution is a principled approach that relies on the Monte Carlo Tree Search (MCTS) algorithm to maximize a suitable objective function efficiently despite the complexity of the problem. Like previous work, we first project the input point cloud to a top view to create a density map and extract room proposals from it. Our method selects and optimizes the polygonal shapes of these room proposals jointly to fit the density map and outputs an accurate vectorized floor map even for large complex scenes. To do this, we adapted MCTS, an algorithm originally designed to learn to play games, to select the room proposals by maximizing an objective function combining the fitness with the density map as predicted by a deep network and regularizing terms on the room shapes. We also introduce a refinement step to MCTS that adjusts the shape of the room proposals. For this step, we propose a novel differentiable method for rendering the polygonal shapes of these proposals. We evaluate our method on the recent and challenging Structured3D and Floor-SP datasets and show a significant improvement over the state-of-the-art, without imposing any hard constraints or assumptions on the floor plan configurations.
    Machine-Learned Phase Diagrams of Generalized Kitaev Honeycomb Magnets. (arXiv:2102.01103v2 [cond-mat.str-el] UPDATED)
    (2 min) We use a recently developed interpretable and unsupervised machine-learning method, the tensorial kernel support vector machine (TK-SVM), to investigate the low-temperature classical phase diagram of a generalized Heisenberg-Kitaev-$\Gamma$ ($J$-$K$-$\Gamma$) model on a honeycomb lattice. Aside from reproducing phases reported by previous quantum and classical studies, our machine finds a hitherto missed nested zigzag-stripy order and establishes the robustness of a recently identified modulated $S_3 \times Z_3$ phase, which emerges through the competition between the Kitaev and $\Gamma$ spin liquids, against Heisenberg interactions. The results imply that, in the restricted parameter space spanned by the three primary exchange interactions -- $J$, $K$, and $\Gamma$, the representative Kitaev material $\alpha$-${\rm RuCl}_3$ lies close to the boundaries of several phases, including a simple ferromagnet, the unconventional $S_3 \times Z_3$ and nested zigzag-stripy magnets. A zigzag order is stabilized by a finite $\Gamma^{\prime}$ and/or $J_3$ term, whereas the four magnetic orders may compete in particular if $\Gamma^{\prime}$ is anti-ferromagnetic.
    Physics-based machine learning for modeling stochastic IP3-dependent calcium dynamics. (arXiv:2109.05053v1 [cs.LG])
    (2 min) We present a machine learning method for model reduction which incorporates domain-specific physics through candidate functions. Our method estimates an effective probability distribution and differential equation model from stochastic simulations of a reaction network. The close connection between reduced and fine scale descriptions allows approximations derived from the master equation to be introduced into the learning problem. This representation is shown to improve generalization and allows a large reduction in network size for a classic model of inositol trisphosphate (IP3) dependent calcium oscillations in non-excitable cells.
    Systematic Generalisation through Task Temporal Logic and Deep Reinforcement Learning. (arXiv:2006.08767v3 [cs.LG] UPDATED)
    (2 min) This work introduces a neuro-symbolic agent that combines deep reinforcement learning (DRL) with temporal logic (TL) to achieve systematic zero-shot, i.e., never-seen-before, generalisation of formally specified instructions. In particular, we present a neuro-symbolic framework where a symbolic module transforms TL specifications into a form that helps the training of a DRL agent targeting generalisation, while a neural module learns systematically to solve the given tasks. We study the emergence of systematic learning in different settings and find that the architecture of the convolutional layers is key when generalising to new instructions. We also provide evidence that systematic learning can emerge with abstract operators such as negation when learning from a few training examples, which previous research has struggled with.
    Nested Counterfactual Identification from Arbitrary Surrogate Experiments. (arXiv:2107.03190v2 [cs.AI] UPDATED)
    (2 min) The Ladder of Causation describes three qualitatively different types of activities an agent may be interested in engaging in, namely, seeing (observational), doing (interventional), and imagining (counterfactual) (Pearl and Mackenzie, 2018). The inferential challenge imposed by the causal hierarchy is that data is collected by an agent observing or intervening in a system (layers 1 and 2), while its goal may be to understand what would have happened had it taken a different course of action, contrary to what factually ended up happening (layer 3). While there exists a solid understanding of the conditions under which cross-layer inferences are allowed from observations to interventions, the results are somewhat scarcer when targeting counterfactual quantities. In this paper, we study the identification of nested counterfactuals from an arbitrary combination of observations and experiments. Specifically, building on a more explicit definition of nested counterfactuals, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones. For instance, applications in mediation and fairness analysis usually evoke notions of direct, indirect, and spurious effects, which naturally require nesting. Second, we introduce a sufficient and necessary graphical condition for counterfactual identification from an arbitrary combination of observational and experimental distributions. Lastly, we develop an efficient and complete algorithm for identifying nested counterfactuals; failure of the algorithm to return an expression for a query implies that the query is not identifiable.
    Local-Global Knowledge Distillation in Heterogeneous Federated Learning with Non-IID Data. (arXiv:2107.00051v2 [cs.LG] UPDATED)
    (0 min) Federated learning enables multiple clients to collaboratively learn a global model by periodically aggregating the clients' models without transferring the local data. However, due to the heterogeneity of the system and data, many approaches suffer from the "client-drift" issue that could significantly slow down the convergence of the global model training. As clients perform local updates on heterogeneous data through heterogeneous systems, their local models drift apart. To tackle this issue, one intuitive idea is to guide the local model training by the global teachers, i.e., past global models, where each client learns the global knowledge from past global models via adaptive knowledge distillation techniques. Building on these insights, we propose a novel approach for heterogeneous federated learning, namely FedGKD, which fuses the knowledge from historical global models for local training to alleviate the "client-drift" issue. In this paper, we evaluate FedGKD with extensive experiments on various CV/NLP datasets (i.e., CIFAR-10/100, Tiny-ImageNet, AG News, SST5) and different heterogeneous settings. The proposed method is guaranteed to converge under common assumptions, and achieves superior empirical accuracy in fewer communication rounds than five state-of-the-art methods.
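    The local objective sketched in the abstract, an ordinary supervised loss plus distillation towards past global models, can be written compactly. The PyTorch snippet below is a hedged sketch with a single global teacher; FedGKD's actual fusion of several historical global models and its weighting coefficient are not reproduced, and the temperature and gamma values are placeholders.

    ```python
    import torch
    import torch.nn.functional as F

    def fedgkd_local_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          labels: torch.Tensor,
                          T: float = 2.0, gamma: float = 0.1) -> torch.Tensor:
        """Cross-entropy on local labels + distillation towards a (historical) global teacher.

        A minimal sketch of the idea in the abstract; the exact ensemble of past global
        models and the weighting used by FedGKD are assumptions here.
        """
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction="batchmean") * T * T
        return ce + gamma * kd
    ```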
    On the Fundamental Limits of Matrix Completion: Leveraging Hierarchical Similarity Graphs. (arXiv:2109.05408v1 [cs.IT])
    (2 min) We study the matrix completion problem that leverages hierarchical similarity graphs as side information in the context of recommender systems. Under a hierarchical stochastic block model that well respects practically-relevant social graphs and a low-rank rating matrix model, we characterize the exact information-theoretic limit on the number of observed matrix entries (i.e., optimal sample complexity) by proving sharp upper and lower bounds on the sample complexity. In the achievability proof, we demonstrate that probability of error of the maximum likelihood estimator vanishes for sufficiently large number of users and items, if all sufficient conditions are satisfied. On the other hand, the converse (impossibility) proof is based on the genie-aided maximum likelihood estimator. Under each necessary condition, we present examples of a genie-aided estimator to prove that the probability of error does not vanish for sufficiently large number of users and items. One important consequence of this result is that exploiting the hierarchical structure of social graphs yields a substantial gain in sample complexity relative to the one that simply identifies different groups without resorting to the relational structure across them. More specifically, we analyze the optimal sample complexity and identify different regimes whose characteristics rely on quality metrics of side information of the hierarchical similarity graph. Finally, we present simulation results to corroborate our theoretical findings and show that the characterized information-theoretic limit can be asymptotically achieved.
    The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training. (arXiv:2007.12826v2 [stat.ML] UPDATED)
    (3 min) Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by the one of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).
    Towards Real-World BCI: CCSPNet, A Compact Subject-Independent Motor Imagery Framework. (arXiv:2012.13567v5 [cs.LG] UPDATED)
    (3 min) A conventional subject-dependent (SD) brain-computer interface (BCI) requires a complete data-gathering, training, and calibration phase for each user before it can be used. In recent years, a number of subject-independent (SI) BCIs have been developed. However, there are many problems preventing them from being used in real-world BCI applications. A weaker performance compared to the subject-dependent (SD) approach, and a relatively large model requiring high computational power are the most important ones. Therefore, a potential real-world BCI would greatly benefit from a compact low-power subject-independent BCI framework, ready to be used immediately after the user puts it on. To move towards this goal, we propose a novel subject-independent BCI framework named CCSPNet (Convolutional Common Spatial Pattern Network) trained on the motor imagery (MI) paradigm of a large-scale electroencephalography (EEG) signals database consisting of 21600 trials for 54 subjects performing two-class hand-movement MI tasks. The proposed framework applies a wavelet kernel convolutional neural network (WKCNN) and a temporal convolutional neural network (TCNN) in order to represent and extract the diverse spectral features of EEG signals. The outputs of the convolutional layers go through a common spatial pattern (CSP) algorithm for spatial feature extraction. The number of CSP features is reduced by a dense neural network, and the final class label is determined by a linear discriminant analysis (LDA) classifier. The CCSPNet framework evaluation results show that it is possible to have a low-power compact BCI that achieves both SD and SI performance comparable to complex and computationally expensive models.
    GraphITE: Estimating Individual Effects of Graph-structured Treatments. (arXiv:2009.14061v3 [cs.LG] UPDATED)
    (2 min) Outcome estimation of treatments for target individuals is an important foundation for decision making based on causal relations. Most existing outcome estimation methods deal with binary or multiple-choice treatments; however, in some applications, the number of treatments can be significantly large, while the treatments themselves have rich information. In this study, we considered one important instance of such cases: the outcome estimation problem of graph-structured treatments such as drugs. Owing to the large number of possible treatments, the counterfactual nature of observational data that appears in conventional treatment effect estimation becomes more of a concern for this problem. Our proposed method, GraphITE (pronounced "graphite") learns the representations of graph-structured treatments using graph neural networks while mitigating observation biases using Hilbert-Schmidt Independence Criterion regularization, which increases the independence of the representations of the targets and treatments. Experiments on two real-world datasets show that GraphITE outperforms baselines, especially in cases with a large number of treatments.
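    The HSIC regularizer mentioned above has a standard (biased) empirical estimator, trace(KHLH)/(n-1)^2, which can be added to an outcome-prediction loss to push treatment and target representations towards independence. Below is a self-contained PyTorch sketch of that estimator with RBF kernels; the kernel choice and bandwidth are assumptions, not GraphITE's settings.

    ```python
    import torch

    def rbf_kernel(x: torch.Tensor, sigma: float) -> torch.Tensor:
        d2 = torch.cdist(x, x) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    def hsic(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
        """Biased HSIC estimator between two batches of representations (n, d_x), (n, d_y)."""
        n = x.size(0)
        H = torch.eye(n, device=x.device) - torch.ones(n, n, device=x.device) / n
        K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
        return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

    # usage: total_loss = outcome_loss + lam * hsic(target_repr, treatment_repr)
    ```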
    TEASEL: A Transformer-Based Speech-Prefixed Language Model. (arXiv:2109.05522v1 [cs.CL])
    (2 min) Multimodal language analysis is a burgeoning field of NLP that aims to simultaneously model a speaker's words, acoustical annotations, and facial expressions. In this area, lexicon features usually outperform other modalities because they are pre-trained on large corpora via Transformer-based models. Despite their strong performance, training a new self-supervised learning (SSL) Transformer on any modality is not usually attainable due to insufficient data, which is the case in multimodal language learning. This work proposes a Transformer-Based Speech-Prefixed Language Model, called TEASEL, to approach these constraints without training a complete Transformer model. In contrast to a conventional language model, the TEASEL model includes the speech modality as a dynamic prefix alongside the textual modality. This method exploits a conventional pre-trained language model as a cross-modal Transformer model. We evaluated TEASEL on the multimodal sentiment analysis task defined by the CMU-MOSI dataset. Extensive experiments show that our model outperforms unimodal baseline language models by 4% and outperforms the current multimodal state-of-the-art (SoTA) model by 1% in F1-score. Additionally, our proposed method is 72% smaller than the SoTA model.
    A Joint Graph and Image Convolution Network for Automatic Brain Tumor Segmentation. (arXiv:2109.05580v1 [eess.IV])
    (2 min) We present a joint graph convolution-image convolution neural network as our submission to the Brain Tumor Segmentation (BraTS) 2021 challenge. We model each brain as a graph composed of distinct image regions, which is initially segmented by a graph neural network (GNN). Subsequently, the tumorous volume identified by the GNN is further refined by a simple (voxel) convolutional neural network (CNN), which produces the final segmentation. This approach captures both global brain feature interactions via the graphical representation and local image details through the use of convolutional filters. We find that the GNN component by itself can effectively identify and segment the brain tumors. The addition of the CNN further improves the median performance of the model by 2 percent across all metrics evaluated. On the validation set, our joint GNN-CNN model achieves mean Dice scores of 0.89, 0.81, 0.73 and mean Hausdorff distances (95th percentile) of 6.8, 12.6, 28.2mm on the whole tumor, core tumor, and enhancing tumor, respectively.
    Tensor Estimation with Nearly Linear Samples Given Weak Side Information. (arXiv:2007.00736v2 [stat.ML] UPDATED)
    (2 min) Tensor completion exhibits an interesting computational-statistical gap in terms of the number of samples needed to perform tensor estimation. While there are only $\Theta(tn)$ degrees of freedom in a $t$-order tensor with $n^t$ entries, the best known polynomial time algorithm requires $O(n^{t/2})$ samples in order to guarantee consistent estimation. In this paper, we show that weak side information is sufficient to reduce the sample complexity to $O(n)$. The side information consists of a weight vector for each of the modes which is not orthogonal to any of the latent factors along that mode; this is significantly weaker than assuming noisy knowledge of the subspaces. We provide an algorithm that utilizes this side information to produce a consistent estimator with $O(n^{1+\kappa})$ samples for any small constant $\kappa > 0$.
    Efficient and Interpretable Robot Manipulation with Graph Neural Networks. (arXiv:2102.13177v3 [cs.RO] UPDATED)
    (2 min) Manipulation tasks like loading a dishwasher can be seen as a sequence of spatial constraints and relationships between different objects. For example, a plate can be placed in a tray only if the tray is open. We aim to discover such task-specific rules from demonstrations. We pose manipulation as a classification problem over a graph, whose nodes represent task relevant entities like objects and goals, transform the environment scene into a graph and learn a graph neural network (GNN) policy using imitation learning. In our experiments, a single learned GNN policy, trained using 20 expert demonstrations, can solve multiple blockstacking and rearrangement tasks in both simulation and on hardware, without any task description. The policy successfully generalizes over the number of objects in the environment, their positions, and goal configurations (trained on single stacks, generalizes to pyramids and multiple stacks). We also apply our approach to a complex simulated dishwasher environment, where a robot learns to load a dishwasher from only 5 high-level human demonstrations. These experiments show that imitation learning on a graphical state and policy is a simple, yet powerful tool for solving complex long-horizon manipulation problems, without requiring detailed task descriptions. Videos can be found at: https://youtu.be/x9hcKBh6K0A.
    Differentially Private Variable Selection via the Knockoff Filter. (arXiv:2109.05402v1 [stat.ML])
    (2 min) The knockoff filter, recently developed by Barber and Candes, is an effective procedure to perform variable selection with a controlled false discovery rate (FDR). We propose a private version of the knockoff filter by incorporating Gaussian and Laplace mechanisms, and show that variable selection with controlled FDR can be achieved. Simulations demonstrate that our setting has reasonable statistical power.
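    A compact way to see the construction: compute the usual knockoff statistics W, perturb them with a Laplace (or Gaussian) mechanism, and apply the knockoff+ threshold to the noised statistics. The NumPy sketch below does this; the sensitivity calibration and the exact placement of the mechanism are the paper's contribution and are only stand-ins here.

    ```python
    import numpy as np

    def private_knockoff_select(W: np.ndarray, q: float = 0.1,
                                epsilon: float = 1.0, sensitivity: float = 1.0,
                                rng=None) -> np.ndarray:
        """Knockoff+ selection on Laplace-noised statistics W (one W_j per variable).

        The noise scale sensitivity/epsilon is a placeholder; the paper's calibration
        of the mechanism to the knockoff statistics is not reproduced here.
        """
        rng = rng or np.random.default_rng()
        W = W + rng.laplace(scale=sensitivity / epsilon, size=W.shape)
        for t in np.sort(np.abs(W[W != 0])):          # candidate thresholds
            fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
            if fdp_hat <= q:                          # knockoff+ threshold reached
                return np.flatnonzero(W >= t)
        return np.array([], dtype=int)                # nothing selected
    ```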
    A review of possible effects of cognitive biases on the interpretation of rule-based machine learning models. (arXiv:1804.02969v7 [stat.ML] UPDATED)
    (2 min) While the interpretability of machine learning models is often equated with their mere syntactic comprehensibility, we think that interpretability goes beyond that, and that human interpretability should also be investigated from the point of view of cognitive science. The goal of this paper is to discuss to what extent cognitive biases may affect human understanding of interpretable machine learning models, in particular of logical rules discovered from data. Twenty cognitive biases are covered, as are possible debiasing techniques that can be adopted by designers of machine learning algorithms and software. Our review transfers results obtained in cognitive psychology to the domain of machine learning, aiming to bridge the current gap between these two areas. It needs to be followed by empirical studies specifically focused on the machine learning domain.
    Spatiotemporal Pattern Mining for Nowcasting Extreme Earthquakes in Southern California. (arXiv:2012.14336v3 [physics.geo-ph] UPDATED)
    (2 min) Geoscience and seismology have employed advanced technologies and equipment to monitor seismic events globally over the past few decades. With the enormous amount of data collected, modern GPU-powered deep learning presents a promising approach to analyzing the data and discovering patterns. In recent years, there have been many successful deep learning models for picking seismic waves. However, forecasting extreme earthquakes, which can cause disasters, remains an underdeveloped topic. Related research on mining and forecasting spatiotemporal dynamics, a crucial topic in many scientific fields, has produced some successful predictions, many of them using deep neural networks. In geology and Earth science, earthquake prediction is one of the most challenging problems, and cutting-edge deep learning technologies may help discover valuable patterns. In this project, we propose a deep learning modeling approach that mines spatiotemporal patterns from data to nowcast extreme earthquakes by discovering visual dynamics in regional coarse-grained spatial grids over time. In this modeling approach, we use deep neural networks informed by domain knowledge in geoscience and seismology to exploit earthquake patterns for prediction, using convolutional long short-term memory neural networks. Our experiments show a strong correlation between location prediction and magnitude prediction for earthquakes in Southern California. Ablation studies and visualization validate the effectiveness of the proposed modeling method.
    Real-time End-to-End Federated Learning: An Automotive Case Study. (arXiv:2103.11879v2 [cs.LG] UPDATED)
    (0 min) With the development of, and increasing interest in, ML/DL fields, companies are eager to apply machine learning/deep learning approaches to increase service quality and customer experience. Federated Learning was introduced as an effective model training method for distributing and accelerating time-consuming model training while protecting user data privacy. However, common Federated Learning approaches use a synchronous protocol to conduct model aggregation, which is inflexible and unable to adapt to rapidly changing environments and heterogeneous hardware settings in real-world scenarios. In this paper, we present an approach to real-time end-to-end Federated Learning combined with a novel asynchronous model aggregation protocol. Our method is validated in an industrial use case in the automotive domain, focusing on steering wheel angle prediction for autonomous driving. Our findings show that asynchronous Federated Learning can significantly improve the prediction performance of local edge models while maintaining the same level of accuracy as centralized machine learning. Furthermore, by using a sliding training window, the approach can minimize communication overhead, accelerate model training speed and consume real-time streaming data, proving highly efficient when deploying ML/DL components to heterogeneous real-world embedded systems.
    Cross-Domain Label-Adaptive Stance Detection. (arXiv:2104.07467v2 [cs.CL] UPDATED)
    (0 min) Stance detection concerns the classification of a writer's viewpoint towards a target. There are different task variants, e.g., stance of a tweet vs. a full article, or stance with respect to a claim vs. an (implicit) topic. Moreover, task definitions vary, which includes the label inventory, the data collection, and the annotation protocol. All these aspects hinder cross-domain studies, as they require changes to standard domain adaptation approaches. In this paper, we perform an in-depth analysis of 16 stance detection datasets, and we explore the possibility for cross-domain learning from them. Moreover, we propose an end-to-end unsupervised framework for out-of-domain prediction of unseen, user-defined labels. In particular, we combine domain adaptation techniques such as mixture of experts and domain-adversarial training with label embeddings, and we demonstrate sizable performance gains over strong baselines, both (i) in-domain, i.e., for seen targets, and (ii) out-of-domain, i.e., for unseen targets. Finally, we perform an exhaustive analysis of the cross-domain results, and we highlight the important factors influencing the model performance.
    Multi-source Heterogeneous Domain Adaptation with Conditional Weighting Adversarial Network. (arXiv:2008.02714v2 [cs.LG] UPDATED)
    (0 min) Heterogeneous domain adaptation (HDA) tackles the learning of cross-domain samples with both different probability distributions and feature representations. Most of the existing HDA studies focus on the single-source scenario. In reality, however, it is not uncommon to obtain samples from multiple heterogeneous domains. In this article, we study the multisource HDA problem and propose a conditional weighting adversarial network (CWAN) to address it. The proposed CWAN adversarially learns a feature transformer, a label classifier, and a domain discriminator. To quantify the importance of different source domains, CWAN introduces a sophisticated conditional weighting scheme to calculate the weights of the source domains according to the conditional distribution divergence between the source and target domains. Different from existing weighting schemes, the proposed conditional weighting scheme not only weights the source domains but also implicitly aligns the conditional distributions during the optimization process. Experimental results clearly demonstrate that the proposed CWAN performs much better than several state-of-the-art methods on four real-world datasets.
    HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge. (arXiv:2109.05097v1 [cs.CL])
    (0 min) A hyperbole is an intentional and creative exaggeration not to be taken literally. Despite its ubiquity in daily life, computational explorations of hyperboles are scarce. In this paper, we tackle the under-explored and challenging task of sentence-level hyperbole generation. We start with a representative syntactic pattern for intensification and systematically study the semantic (commonsense and counterfactual) relationships between the components of such hyperboles. Next, we leverage the COMeT and reverse COMeT models to perform commonsense and counterfactual inference. We then generate multiple hyperbole candidates based on our findings from the pattern, and train neural classifiers to rank and select high-quality hyperboles. Automatic and human evaluations show that our generation method is able to generate hyperboles creatively with a high success rate and high intensity scores.
    On the Initial Behavior Monitoring Issues in Federated Learning. (arXiv:2109.05385v1 [cs.LG])
    (0 min) In Federated Learning (FL), a group of workers participate in building a global model under the coordination of one node, the chief. Regarding the cybersecurity of FL, some attacks aim at injecting fabricated local model updates into the system. Some defenses are based on malicious worker detection and behavioral pattern analysis. In this context, without timely and dynamic monitoring methods, the chief cannot detect and remove malicious or unreliable workers from the system. Our work emphasizes the urgency of preparing the federated learning process for monitoring and, eventually, behavioral pattern analysis. We study the information inside the learning process in the early stages of training, propose a monitoring process, and evaluate the monitoring period required. The aim is to analyse when it is appropriate to start the detection algorithm in order to remove malicious or unreliable workers from the system and optimise the deployment of the defense mechanism. We tested our strategy on a behavioral pattern analysis defense applied to the FL process of different benchmark systems for text and image classification. Our results show that the monitoring process lowers false positives and false negatives and consequently increases system efficiency by enabling the distributed learning system to achieve better performance in the early stages of training.
    Context-aware surrogate modeling for balancing approximation and sampling costs in multi-fidelity importance sampling and Bayesian inverse problems. (arXiv:2010.11708v2 [math.NA] UPDATED)
    (0 min) Multi-fidelity methods leverage low-cost surrogate models to speed up computations and make occasional recourse to expensive high-fidelity models to establish accuracy guarantees. Because surrogate and high-fidelity models are used together, poor predictions by surrogate models can be compensated with frequent recourse to high-fidelity models. Thus, there is a trade-off between investing computational resources to improve the accuracy of surrogate models versus simply making more frequent recourse to expensive high-fidelity models; however, this trade-off is ignored by traditional modeling methods that construct surrogate models that are meant to replace high-fidelity models rather than being used together with high-fidelity models. This work considers multi-fidelity importance sampling and theoretically and computationally trades off increasing the fidelity of surrogate models for constructing more accurate biasing densities and the numbers of samples that are required from the high-fidelity models to compensate poor biasing densities. Numerical examples demonstrate that such context-aware surrogate models for multi-fidelity importance sampling have lower fidelity than what typically is set as tolerance in traditional model reduction, leading to runtime speedups of up to one order of magnitude in the presented examples.
    Understanding Structural Vulnerability in Graph Convolutional Networks. (arXiv:2108.06280v2 [cs.LG] CROSS LISTED)
    (0 min) Recent studies have shown that Graph Convolutional Networks (GCNs) are vulnerable to adversarial attacks on the graph structure. Although multiple works have been proposed to improve their robustness against such structural adversarial attacks, the reasons for the success of the attacks remain unclear. In this work, we theoretically and empirically demonstrate that structural adversarial examples can be attributed to the non-robust aggregation scheme (i.e., the weighted mean) of GCNs. Specifically, our analysis takes advantage of the breakdown point which can quantitatively measure the robustness of aggregation schemes. The key insight is that weighted mean, as the basic design of GCNs, has a low breakdown point and its output can be dramatically changed by injecting a single edge. We show that adopting the aggregation scheme with a high breakdown point (e.g., median or trimmed mean) could significantly enhance the robustness of GCNs against structural attacks. Extensive experiments on four real-world datasets demonstrate that such a simple but effective method achieves the best robustness performance compared to state-of-the-art models.
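    The proposed fix is to swap the weighted-mean neighbourhood aggregation for one with a higher breakdown point. A minimal dense-adjacency sketch of element-wise median aggregation is shown below; it illustrates the idea (median over the closed neighbourhood) rather than the authors' implementation, and a trimmed mean can be substituted the same way.

    ```python
    import torch

    def median_aggregate(h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """Element-wise median over each node's closed neighbourhood.

        h: (n, d) node features; adj: (n, n) dense 0/1 adjacency matrix.
        """
        n = h.size(0)
        out = torch.empty_like(h)
        for i in range(n):
            nbrs = adj[i].nonzero(as_tuple=True)[0]
            nbrs = torch.cat([nbrs, torch.tensor([i], device=h.device)])  # include self
            out[i] = h[nbrs].median(dim=0).values
        return out
    ```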
    Stochastic Iterative Graph Matching. (arXiv:2106.02206v2 [cs.LG] UPDATED)
    (0 min) Recent works leveraging Graph Neural Networks to approach graph matching tasks have shown promising results. Recent progress in learning discrete distributions poses new opportunities for learning graph matching models. In this work, we propose a new model, Stochastic Iterative Graph MAtching (SIGMA), to address the graph matching problem. Our model defines a distribution of matchings for a graph pair so the model can explore a wide range of possible matchings. We further introduce a novel multi-step matching procedure, which learns how to refine a graph pair's matching results incrementally. The model also includes dummy nodes so that the model does not have to find matchings for nodes without correspondence. We fit this model to data via scalable stochastic optimization. We conduct extensive experiments across synthetic graph datasets as well as biochemistry and computer vision applications. Across all tasks, our results show that SIGMA can produce significantly improved graph matching results compared to state-of-the-art models. Ablation studies verify that each of our components (stochastic training, iterative matching, and dummy nodes) offers noticeable improvement.
    Accelerating Federated Learning with a Global Biased Optimiser. (arXiv:2108.09134v2 [cs.LG] UPDATED)
    (0 min) Federated Learning (FL) is a recent development in the field of machine learning that collaboratively trains models without the training data leaving client devices, to preserve data privacy. In realistic FL settings, the training set is distributed over clients in a highly non-Independent and Identically Distributed (non-IID) fashion, which has been shown extensively to harm FL convergence speed and final model performance. To address this challenge, we propose a novel, generalised approach for incorporating adaptive optimisation techniques into FL with the Federated Global Biased Optimiser (FedGBO) algorithm. FedGBO accelerates FL by employing a set of global biased optimiser values during the client-training phase, which helps to reduce `client-drift' from non-IID data, whilst also benefiting from adaptive optimisation. We show that the FedGBO update with a generic optimiser can be reformulated as centralised training using biased gradients and optimiser updates, and apply this theoretical framework to prove the convergence of FedGBO using momentum-Stochastic Gradient Descent (SGDm). We also conduct extensive experiments using 4 realistic FL benchmark datasets (CIFAR100, Sent140, FEMNIST, Shakespeare) and 3 popular adaptive optimisers (RMSProp, SGDm, Adam) to compare the performance of state-of-the-art adaptive-FL algorithms. The results demonstrate that FedGBO has highly competitive performance whilst achieving lower communication and computation costs, and provide practical insights into the trade-offs associated with the different adaptive-FL algorithms and optimisers for real-world FL deployments.
    MO-PaDGAN: Reparameterizing Engineering Designs for Augmented Multi-objective Optimization. (arXiv:2009.07110v2 [cs.LG] UPDATED)
    (0 min) Multi-objective optimization is key to solving many Engineering Design problems, where design parameters are optimized for several performance indicators. However, optimization results are highly dependent on how the designs are parameterized. Researchers have shown that deep generative models can learn compact design representations, providing a new way of parameterizing designs to achieve faster convergence and improved optimization performance. Despite their success in capturing complex distributions, existing generative models face three challenges when used for design problems: 1) generated designs have limited design space coverage, 2) the generator ignores design performance, and 3) the new parameterization is unable to represent designs beyond the training data. To address these challenges, we propose MO-PaDGAN, which adds a Determinantal Point Processes based loss function to the generative adversarial network to simultaneously model diversity and (multi-variate) performance. MO-PaDGAN can thus improve the performance and coverage of generated designs, and even generate designs with performances exceeding those of the training data. When using MO-PaDGAN as a new parameterization in multi-objective optimization, we can discover much better Pareto fronts even though the training data do not cover those Pareto fronts. In a real-world multi-objective airfoil design example, we demonstrate that MO-PaDGAN achieves, on average, an over 180% improvement in the hypervolume indicator when compared to the vanilla GAN or other state-of-the-art parameterization methods.
    No Size Fits All: Automated Radio Configuration for LPWANs. (arXiv:2109.05103v1 [cs.NI])
    (0 min) Low power long-range networks like LoRa have become increasingly mainstream for Internet of Things deployments. Given the versatility of applications that these protocols enable, they support many data rates and bandwidths. Yet, for a given network that supports hundreds of devices over multiple miles, the network operator typically needs to specify the same configuration or among a small subset of configurations for all the client devices to communicate with the gateway. This one-size-fits-all approach is highly inefficient in large networks. We propose an alternative approach -- we allow network devices to transmit at any data rate they choose. The gateway uses the first few symbols in the preamble to classify the correct data rate, switches its configuration, and then decodes the data. Our design leverages the inherent asymmetry in outdoor IoT deployments where the clients are power-starved and resource-constrained, but the gateway is not. Our gateway design, Proteus, runs a neural network architecture and is backward compatible with existing LoRa protocols. Our experiments reveal that Proteus can identify the correct configuration with over 97% accuracy in both indoor and outdoor deployments. Our network architecture leads to a 3.8 to 11 times increase in throughput for our LoRa testbed.
    Physics-based Deep Learning. (arXiv:2109.05237v1 [cs.LG])
    (0 min) This digital book contains a practical and comprehensive introduction of everything related to deep learning in the context of physical simulations. As much as possible, all topics come with hands-on code examples in the form of Jupyter notebooks to quickly get started. Beyond standard supervised learning from data, we'll look at physical loss constraints, more tightly coupled learning algorithms with differentiable simulations, as well as reinforcement learning and uncertainty modeling. We live in exciting times: these methods have a huge potential to fundamentally change what computer simulations can achieve.
    MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning. (arXiv:2109.05294v1 [physics.geo-ph])
    (0 min) Among the biggest challenges we face in utilizing neural networks trained on waveform data (i.e., seismic, electromagnetic, or ultrasound) is its application to real data. The requirement for accurate labels forces us to develop solutions using synthetic data, where labels are readily available. However, synthetic data often do not capture the reality of the field/real experiment, and we end up with poor performance of the trained neural network (NN) at the inference stage. We describe a novel approach to enhance supervised training on synthetic data with real data features (domain adaptation). Specifically, for tasks in which the absolute values of the vertical axis (time or depth) of the input data are not crucial, like classification, or can be corrected afterward, like velocity model building using a well-log, we suggest a series of linear operations on the input so the training and application data have similar distributions. This is accomplished by applying two operations on the input data to the NN model: 1) The crosscorrelation of the input data (i.e., shot gather, seismic image, etc.) with a fixed reference trace from the same dataset. 2) The convolution of the resulting data with the mean (or a random sample) of the autocorrelated data from another domain. In the training stage, the input data are from the synthetic domain and the auto-correlated data are from the real domain, and random samples from real data are drawn at every training epoch. In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections are from the synthetic data subset domain. Example applications on passive seismic data for microseismic event source location determination and active seismic data for predicting low frequencies are used to demonstrate the power of this approach in improving the applicability of trained models to real data.
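    On a single 1-D trace, the two linear operations described above look like the NumPy sketch below: cross-correlate the input with a fixed reference trace from its own dataset, then convolve the result with the mean autocorrelation of traces from the other domain. The real method operates on full shot gathers/images and randomizes the real-data sample at every epoch; this per-trace version is only meant to make the two operations concrete.

    ```python
    import numpy as np

    def mean_autocorrelation(traces: np.ndarray) -> np.ndarray:
        """Mean autocorrelation over a set of traces (rows), same length as each trace."""
        return np.mean([np.correlate(t, t, mode="same") for t in traces], axis=0)

    def mlreal_transform(trace: np.ndarray, ref_trace: np.ndarray,
                         other_domain_autocorr: np.ndarray) -> np.ndarray:
        """1) cross-correlate with a fixed reference trace from the trace's own dataset,
           2) convolve with the mean autocorrelation taken from the other domain."""
        xcorr = np.correlate(trace, ref_trace, mode="same")
        return np.convolve(xcorr, other_domain_autocorr, mode="same")

    # training time: trace is synthetic, other_domain_autocorr comes from real data;
    # inference time: trace is real, other_domain_autocorr comes from the synthetic set.
    ```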
    Near Instance Optimal Model Selection for Pure Exploration Linear Bandits. (arXiv:2109.05131v1 [stat.ML])
    (2 min) The model selection problem in the pure exploration linear bandit setting is introduced and studied in both the fixed confidence and fixed budget settings. The model selection problem considers a nested sequence of hypothesis classes of increasing complexities. Our goal is to automatically adapt to the instance-dependent complexity measure of the smallest hypothesis class containing the true model, rather than suffering from the complexity measure related to the largest hypothesis class. We provide evidence showing that a standard doubling trick over dimension fails to achieve the optimal instance-dependent sample complexity. Our algorithms define a new optimization problem based on experimental design that leverages the geometry of the action set to efficiently identify a near-optimal hypothesis class. Our fixed budget algorithm uses a novel application of a selection-validation trick in bandits. This provides a new method for the understudied fixed budget setting in linear bandits (even without the added challenge of model selection). We further generalize the model selection problem to the misspecified regime, adapting our algorithms in both fixed confidence and fixed budget settings.
    Explainable Deep Learning: A Field Guide for the Uninitiated. (arXiv:2004.14545v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks (DNNs) have become a proven and indispensable machine learning tool. As a black-box model, it remains difficult to diagnose what aspects of the model's input drive the decisions of a DNN. In countless real-world domains, from legislation and law enforcement to healthcare, such diagnosis is essential to ensure that DNN decisions are driven by aspects appropriate in the context of its use. The development of methods and studies enabling the explanation of a DNN's decisions has thus blossomed into an active, broad area of research. A practitioner wanting to study explainable deep learning may be intimidated by the plethora of orthogonal directions the field has taken. This complexity is further exacerbated by competing definitions of what it means "to explain" the actions of a DNN and to evaluate an approach's "ability to explain". This article offers a field guide to explore the space of explainable deep learning aimed at those uninitiated in the field. The field guide: i) Introduces three simple dimensions defining the space of foundational methods that contribute to explainable deep learning, ii) discusses the evaluations for model explanations, iii) places explainability in the context of other related deep learning research areas, and iv) finally elaborates on user-oriented explanation designing and potential future directions on explainable deep learning. We hope the guide is used as an easy-to-digest starting point for those just embarking on research in this field.
    Visualization of Labeled Mixed-featured Datasets. (arXiv:1904.06366v4 [stat.ML] UPDATED)
    (2 min) We develop methodology for the visualization of labeled mixed-featured datasets. We first investigate datasets with continuous features, where our Max-Ratio Projection (MRP) method utilizes the group information in high dimensions to provide distinctive lower-dimensional projections that are then displayed using RadViz3D. Our methodology is extended to datasets with discrete and continuous features, where a Gaussianized distributional transform is used in conjunction with copula models before applying MRP and visualizing the result using RadViz3D. An R package, radviz3d, implementing our complete methodology is available.
    Improved Analysis of the Tsallis-INF Algorithm in Stochastically Constrained Adversarial Bandits and Stochastic Bandits with Adversarial Corruptions. (arXiv:2103.12487v2 [cs.LG] UPDATED)
    (0 min) We derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021). We show that in adversarial regimes with a $(\Delta,C,T)$ self-bounding constraint the algorithm achieves $\mathcal{O}\left(\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)^2}\right)+\sqrt{C\left(\sum_{i\neq i^*}\frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{C\sum_{i\neq i^*}\frac{1}{\Delta_i}}\right)}\right)$ regret bound, where $T$ is the time horizon, $K$ is the number of arms, $\Delta_i$ are the suboptimality gaps, $i^*$ is the best arm, $C$ is the corruption magnitude, and $\log_+(x) = \max\left(1,\log x\right)$. The regime includes stochastic bandits, stochastically constrained adversarial bandits, and stochastic bandits with adversarial corruptions as special cases. Additionally, we provide a general analysis, which allows to achieve the same kind of improvement for generalizations of Tsallis-INF to other settings beyond multiarmed bandits.
    Federated Ensemble Model-based Reinforcement Learning. (arXiv:2109.05549v1 [cs.LG])
    (0 min) Federated learning (FL) is a privacy-preserving machine learning paradigm that enables collaborative training among geographically distributed and heterogeneous users without gathering their data. Extending FL beyond the conventional supervised learning paradigm, federated Reinforcement Learning (RL) was proposed to handle sequential decision-making problems for various privacy-sensitive applications such as autonomous driving. However, the existing federated RL algorithms directly combine model-free RL with FL, and thus generally have high sample complexity and lack theoretical guarantees. To address the above challenges, we propose a new federated RL algorithm that incorporates model-based RL and ensemble knowledge distillation into FL. Specifically, we utilise FL and knowledge distillation to create an ensemble of dynamics models from clients, and then train the policy by solely using the ensemble model without interacting with the real environment. Furthermore, we theoretically prove that the monotonic improvement of the proposed algorithm is guaranteed. Extensive experimental results demonstrate that our algorithm obtains significantly higher sample efficiency compared to federated model-free RL algorithms in the challenging continuous control benchmark environments. The results also show the impact of non-IID client data and local update steps on the performance of federated RL, validating the insights obtained from our theoretical analysis.
    Pushing the Limits of Non-Autoregressive Speech Recognition. (arXiv:2104.03416v4 [eess.AS] UPDATED)
    (0 min) We apply recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe is CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
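    As a concrete illustration of the CTC objective underpinning this recipe, the following is a minimal PyTorch sketch of computing a CTC loss over (here random) encoder log-probabilities; the shapes, label indices, and stand-in tensors are assumptions for illustration and do not reproduce the paper's Conformer/wav2vec2 setup.

```python
import torch
import torch.nn as nn

# Minimal CTC loss sketch (illustrative only; not the paper's Conformer recipe).
# torch.nn.CTCLoss expects log_probs of shape (T, N, C), with the blank symbol included in C.
T, N, C = 50, 4, 32                     # time steps, batch size, vocab size (blank at index 0)
encoder_out = torch.randn(T, N, C, requires_grad=True)   # stand-in for encoder outputs
log_probs = encoder_out.log_softmax(dim=-1)

# Hypothetical targets: label indices in 1..C-1, padded to a common length of 20.
targets = torch.randint(low=1, high=C, size=(N, 20), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)                        # encoder output lengths
target_lengths = torch.randint(low=5, high=21, size=(N,), dtype=torch.long)  # true label lengths

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # gradients flow back into the encoder outputs
print(float(loss))
```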
    Learning Temporal Dynamics from Cycles in Narrated Video. (arXiv:2101.02337v2 [cs.CV] UPDATED)
    (0 min) Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We propose a self-supervised solution to this problem using temporal cycle consistency jointly in vision and language, training on narrated video. Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed. This constraint leads to the discovery of high-level transitions between moments in time, since such transitions are easily inverted and shared across modalities. We justify the design of our model with an ablation study on different configurations of the cycle consistency problem. We then show qualitatively and quantitatively that our approach yields a meaningful, high-level model of the future and past. We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images. Project page: https://dave.ml/mmcc
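    The core constraint can be sketched in a few lines: a forward model predicts ahead in time, a backward model predicts behind, and their composition should return to the start. The modules and dimensions below are hypothetical stand-ins, not the paper's multimodal architecture.

```python
import torch
import torch.nn as nn

# Illustrative cycle-consistency sketch with hypothetical modules.
dim = 128
forward_model = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
backward_model = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

x_t = torch.randn(16, dim)                  # embeddings of "present" moments (batch of 16)
x_future = forward_model(x_t)               # predict forward in time
x_cycle = backward_model(x_future)          # predict backward again

cycle_loss = nn.functional.mse_loss(x_cycle, x_t)   # composing the two should undo itself
cycle_loss.backward()
```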
    Impact of lung segmentation on the diagnosis and explanation of COVID-19 in chest X-ray images. (arXiv:2009.09780v4 [eess.IV] UPDATED)
    (0 min) COVID-19 frequently provokes pneumonia, which can be diagnosed using imaging exams. Chest X-ray (CXR) is often useful because it is cheap, fast, widespread, and uses less radiation. Here, we demonstrate the impact of lung segmentation in COVID-19 identification using CXR images and evaluate which contents of the image influenced the predictions the most. Semantic segmentation was performed using a U-Net CNN architecture, and the classification using three CNN architectures (VGG, ResNet, and Inception). Explainable Artificial Intelligence techniques were employed to estimate the impact of segmentation. A three-class database was composed: lung opacity (pneumonia), COVID-19, and normal. We assessed the impact of creating a CXR image database from different sources, and the COVID-19 generalization from one source to another. The segmentation achieved a Jaccard distance of 0.034 and a Dice coefficient of 0.982. The classification using segmented images achieved an F1-Score of 0.88 for the multi-class setup, and 0.83 for COVID-19 identification. In the cross-dataset scenario, we obtained an F1-Score of 0.74 and an area under the ROC curve of 0.9 for COVID-19 identification using segmented images. Experiments support the conclusion that even after segmentation, there is a strong bias introduced by underlying factors from different sources.
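    For reference, the reported Dice coefficient and Jaccard distance are standard overlap metrics for binary masks; a small NumPy sketch (with toy masks) is below.

```python
import numpy as np

def dice_and_jaccard(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Overlap metrics for binary masks (1 = lung pixel, 0 = background)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    jaccard_index = (intersection + eps) / (union + eps)
    return dice, 1.0 - jaccard_index        # Dice coefficient, Jaccard *distance*

# Toy 4x4 masks for illustration.
pred = np.array([[1, 1, 0, 0]] * 4)
target = np.array([[1, 0, 0, 0]] * 4)
print(dice_and_jaccard(pred, target))
```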
    Lenient Regret for Multi-Armed Bandits. (arXiv:2008.03959v4 [cs.LG] UPDATED)
    (0 min) We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took. While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results. For example, in large problems, or when the interaction with the environment is brief, finding an optimal arm is infeasible, and regret-minimizing algorithms tend to over-explore. To overcome this issue, algorithms for such settings should instead focus on playing near-optimal arms. To this end, we suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$. We then present a variant of the Thompson Sampling (TS) algorithm, called $\epsilon$-TS, and prove its asymptotic optimality in terms of the lenient regret. Importantly, we show that when the mean of the optimal arm is high enough, the lenient regret of $\epsilon$-TS is bounded by a constant. Finally, we show that $\epsilon$-TS can be applied to improve the performance when the agent knows a lower bound of the suboptimality gaps.
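    To make the lenient criterion concrete, here is a toy Bernoulli Thompson Sampling loop that accumulates lenient regret, i.e., it only counts pulls of arms whose gap exceeds $\epsilon$. The arm means, threshold, and horizon are assumptions for illustration; the $\epsilon$-TS modification itself is not reproduced, only standard TS with the lenient accounting.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.80, 0.78, 0.50])   # arm means; the 0.02 gap below eps is "forgiven"
eps = 0.05                             # leniency threshold
T = 5000

alpha = np.ones(len(means))            # Beta posterior successes + 1
beta = np.ones(len(means))             # Beta posterior failures + 1
lenient_regret = 0.0

for _ in range(T):
    samples = rng.beta(alpha, beta)    # Thompson sampling: one draw per posterior
    arm = int(np.argmax(samples))
    reward = float(rng.random() < means[arm])
    alpha[arm] += reward
    beta[arm] += 1.0 - reward
    gap = means.max() - means[arm]
    if gap > eps:                      # lenient regret ignores near-optimal arms
        lenient_regret += gap

print("lenient regret:", lenient_regret)
```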
    AstronomicAL: An interactive dashboard for visualisation, integration and classification of data using Active Learning. (arXiv:2109.05207v1 [astro-ph.IM])
    (0 min) AstronomicAL is a human-in-the-loop interactive labelling and training dashboard that allows users to create reliable datasets and robust classifiers using active learning. This technique prioritises data that offer high information gain, leading to improved performance using substantially less data. The system allows users to visualise and integrate data from different sources and deal with incorrect or missing labels and imbalanced class sizes. AstronomicAL enables experts to visualise domain-specific plots and key information relating both to broader context and details of a point of interest drawn from a variety of data sources, ensuring reliable labels. In addition, AstronomicAL provides functionality to explore all aspects of the training process, including custom models and query strategies. This makes the software a tool for experimenting with both domain-specific classifications and more general-purpose machine learning strategies. We illustrate the use of the system with an astronomical dataset, chosen due to the field's immediate need; however, AstronomicAL has been designed for datasets from any discipline. Finally, by exporting a simple configuration file, entire layouts, models, and assigned labels can be shared with the community. This allows for complete transparency and ensures that the process of reproducing results is effortless.
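    The active-learning loop at the heart of such a dashboard can be sketched with plain scikit-learn: fit on the labelled pool, query the most uncertain unlabelled point, and repeat. The classifier, query strategy, and toy data below are assumptions for illustration, not AstronomicAL's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy pool-based active learning with least-confidence (uncertainty) sampling.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

labelled = list(rng.choice(len(X), size=20, replace=False))     # small seed set
pool = [i for i in range(len(X)) if i not in set(labelled)]

for _ in range(10):                                              # ten querying rounds
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[labelled], y[labelled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)                        # least-confident points first
    query = pool.pop(int(np.argmax(uncertainty)))                # the "expert" labels this point
    labelled.append(query)

print("final training-set size:", len(labelled))
```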
    DynaNet: Neural Kalman Dynamical Model for Motion Estimation and Prediction. (arXiv:1908.03918v3 [cs.LG] UPDATED)
    (0 min) Dynamical models estimate and predict the temporal evolution of physical systems. State Space Models (SSMs) in particular represent the system dynamics with many desirable properties, such as the ability to model uncertainty in both the model and the measurements, and optimal (in the Bayesian sense) recursive formulations, e.g., the Kalman filter. However, they require significant domain knowledge to derive the parametric form and considerable hand-tuning to correctly set all the parameters. Data-driven techniques, e.g., recurrent neural networks, have emerged as compelling alternatives to SSMs with wide success across a number of challenging tasks, in part due to their ability to extract relevant features from rich inputs. However, they lack interpretability and robustness to unseen conditions. In this work, we present DynaNet, a hybrid deep learning and time-varying state-space model which can be trained end-to-end. Our neural Kalman dynamical model allows us to exploit the relative merits of each approach. We demonstrate state-of-the-art estimation and prediction on a number of physically challenging tasks, including visual odometry, sensor fusion for visual-inertial navigation and pendulum control. In addition, we show how DynaNet can indicate failures through investigation of properties such as the rate of innovation (Kalman gain).
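    For orientation, the quantities DynaNet exposes (innovation, Kalman gain) come straight from the textbook linear Kalman filter; a minimal NumPy predict/update step is shown below with an assumed constant-velocity toy model, not the learned, time-varying parameterization of the paper.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One textbook linear Kalman filter step: predict, then update with measurement z."""
    x_pred = F @ x                                  # predicted state
    P_pred = F @ P @ F.T + Q                        # predicted covariance
    innovation = z - H @ x_pred                     # measurement residual
    S = H @ P_pred @ H.T + R                        # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)             # Kalman gain
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, K

# Toy constant-velocity model: state = [position, velocity], only position is measured.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 1e-3 * np.eye(2)
R = np.array([[1e-2]])
x, P = np.zeros(2), np.eye(2)
x, P, K = kalman_step(x, P, z=np.array([0.5]), F=F, H=H, Q=Q, R=R)
print("Kalman gain:", K.ravel())
```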
    Facial Anatomical Landmark Detection using Regularized Transfer Learning with Application to Fetal Alcohol Syndrome Recognition. (arXiv:2109.05485v1 [cs.CV])
    (0 min) Fetal alcohol syndrome (FAS) caused by prenatal alcohol exposure can result in a series of cranio-facial anomalies, and behavioral and neurocognitive problems. Current diagnosis of FAS is typically done by identifying a set of facial characteristics, which are often obtained by manual examination. Anatomical landmark detection, which provides rich geometric information, is important to detect the presence of FAS-associated facial anomalies. This imaging application is characterized by large variations in data appearance and limited availability of labeled data. Current deep learning-based heatmap regression methods designed for facial landmark detection in natural images assume availability of large datasets and are therefore not well-suited for this application. To address this restriction, we develop a new regularized transfer learning approach that exploits the knowledge of a network learned on large facial recognition datasets. In contrast to standard transfer learning, which focuses on adjusting the pre-trained weights, the proposed learning approach regularizes the model behavior. It explicitly reuses the rich visual semantics of a domain-similar source model on the target task data as an additional supervisory signal for regularizing landmark detection optimization. Specifically, we develop four regularization constraints for the proposed transfer learning, including constraining the feature outputs from classification and intermediate layers, as well as matching activation attention maps at both spatial and channel levels. Experimental evaluation on a collected clinical imaging dataset demonstrates that the proposed approach can effectively improve model generalizability under limited training samples, and is advantageous over other approaches in the literature.
    Generating Datasets of 3D Garments with Sewing Patterns. (arXiv:2109.05633v1 [cs.CV])
    (0 min) Garments are ubiquitous in the real world and in many virtual worlds. They are highly deformable objects, exhibit an immense variety of designs and shapes, and yet, most garments are created from a set of regularly shaped flat pieces. Exploration of garment structure presents a peculiar case for an object structure estimation task and might prove useful for downstream tasks of neural 3D garment modeling and reconstruction by providing a strong prior on garment shapes. To facilitate research in these directions, we propose a method for generating large synthetic datasets of 3D garment designs and their sewing patterns. Our method consists of a flexible description structure for specifying parametric sewing pattern templates and an automatic generation pipeline to produce garment 3D models with little to no manual intervention. To add realism, the pipeline additionally creates corrupted versions of the final meshes that imitate artifacts of 3D scanning. With this pipeline, we created the first large-scale synthetic dataset of 3D garment models with their sewing patterns. The dataset contains more than 20000 garment design variations produced from 19 different base types. Seven of these garment types are specifically designed to target evaluation of the generalization across garment sewing pattern topologies.
    LEA-Net: Layer-wise External Attention Network for Efficient Color Anomaly Detection. (arXiv:2109.05493v1 [cs.CV])
    (0 min) The utilization of prior knowledge about anomalies is an essential issue for anomaly detection. Recently, the visual attention mechanism has become a promising way to improve the performance of CNNs for some computer vision tasks. In this paper, we propose a novel model called Layer-wise External Attention Network (LEA-Net) for efficient image anomaly detection. The core idea relies on the integration of unsupervised and supervised anomaly detectors via the visual attention mechanism. Our strategy is as follows: (i) prior knowledge about anomalies is represented as an anomaly map generated by unsupervised learning of normal instances, (ii) the anomaly map is translated to an attention map by an external network, (iii) the attention map is then incorporated into intermediate layers of the anomaly detection network. Notably, this layer-wise external attention can be applied to any CNN model in an end-to-end training manner. For a pilot study, we validate LEA-Net on color anomaly detection tasks. Through extensive experiments on PlantVillage, MVTec AD, and Cloud datasets, we demonstrate that the proposed layer-wise visual attention mechanism consistently boosts the anomaly detection performance of an existing CNN model, even on imbalanced datasets. Moreover, we show that our attention mechanism successfully boosts the performance of several CNN models.
    Regularized Orthogonal Machine Learning for Nonlinear Semiparametric Models. (arXiv:1806.04823v8 [math.ST] UPDATED)
    (0 min) This paper proposes a Lasso-type estimator for a high-dimensional sparse parameter identified by a single index conditional moment restriction (CMR). In addition to this parameter, the moment function can also depend on a nuisance function, such as the propensity score or the conditional choice probability, which we estimate by modern machine learning tools. We first adjust the moment function so that the gradient of the future loss function is insensitive (formally, Neyman-orthogonal) with respect to the first-stage regularization bias, preserving the single index property. We then take the loss function to be an indefinite integral of the adjusted moment function with respect to the single index. The proposed Lasso estimator converges at the oracle rate, where the oracle knows the nuisance function and solves only the parametric problem. We demonstrate our method by estimating the short-term heterogeneous impact of Connecticut's Jobs First welfare reform experiment on women's welfare participation decision.
    Constrained Ensemble Langevin Monte Carlo. (arXiv:2102.04279v3 [stat.ML] UPDATED)
    (0 min) The classical Langevin Monte Carlo (LMC) method looks for samples from a target distribution by descending the samples along the gradient of the target distribution. The method enjoys a fast convergence rate. However, the numerical cost is sometimes high because each iteration requires the computation of a gradient. One approach to eliminate the gradient computation is to employ the concept of "ensemble": a large number of particles are evolved together so that neighboring particles provide gradient information to each other. In this article, we discuss two algorithms that integrate the ensemble feature into LMC, along with their associated properties. In particular, we find that if one directly surrogates the gradient using the ensemble approximation, the algorithm, termed Ensemble Langevin Monte Carlo, is unstable due to a high variance term. If the gradients are replaced by the ensemble approximations only in a constrained manner, to protect against the unstable points, the algorithm, termed Constrained Ensemble Langevin Monte Carlo, resembles the classical LMC up to an ensemble error but removes most of the gradient computation.
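    For reference, the classical LMC update that the ensemble variants aim to approximate is a single gradient step plus Gaussian noise; a minimal NumPy sketch on an assumed 2D Gaussian target is below (the ensemble gradient surrogates discussed in the article are not reproduced).

```python
import numpy as np

# Classical (unadjusted) Langevin Monte Carlo on a toy 2D Gaussian target.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
prec = np.linalg.inv(cov)

def grad_log_p(x):
    return -prec @ x                    # gradient of log N(0, cov)

step = 0.05
x = np.zeros(2)
samples = []
for _ in range(5000):
    noise = rng.standard_normal(2)
    x = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise   # LMC update
    samples.append(x)

print("empirical covariance:\n", np.cov(np.array(samples).T))
```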
    A Modified AUC for Training Convolutional Neural Networks: Taking Confidence into Account. (arXiv:2006.04836v2 [cs.LG] UPDATED)
    (0 min) The receiver operating characteristic (ROC) curve is an informative tool in binary classification, and the Area Under the ROC Curve (AUC) is a popular metric for reporting the performance of binary classifiers. In this paper, we first present a comprehensive review of the ROC curve and the AUC metric. Next, we propose a modified version of AUC that takes the confidence of the model into account and, at the same time, incorporates AUC into the binary cross-entropy (BCE) loss used for training a convolutional neural network for classification tasks. We demonstrate this on three datasets: MNIST, prostate MRI, and brain MRI. Furthermore, we have published GenuineAI, a new Python library, which provides the functions for conventional AUC and the proposed modified AUC along with metrics including sensitivity, specificity, recall, precision, and F1 for each point of the ROC curve.
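    The abstract does not spell out the modified AUC, so the sketch below only illustrates the general idea of folding a differentiable AUC term into a BCE training loss, using a generic pairwise sigmoid surrogate; the function name, weighting, and temperature are assumptions, not the paper's formulation or the GenuineAI API.

```python
import torch
import torch.nn.functional as F

def bce_plus_soft_auc(logits, labels, auc_weight=0.5, tau=1.0):
    """BCE combined with a generic pairwise sigmoid surrogate for AUC (illustrative only)."""
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    pos, neg = logits[labels == 1], logits[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return bce
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)     # all (positive, negative) score differences
    soft_auc = torch.sigmoid(diff / tau).mean()    # differentiable AUC proxy in [0, 1]
    return bce + auc_weight * (1.0 - soft_auc)

logits = torch.randn(32, requires_grad=True)
labels = torch.randint(0, 2, (32,))
loss = bce_plus_soft_auc(logits, labels)
loss.backward()
```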
    RVMDE: Radar Validated Monocular Depth Estimation for Robotics. (arXiv:2109.05265v1 [cs.RO])
    (0 min) Stereoscopy provides a natural perception of distance in a scene, and its manifestation in 3D world understanding is an intuitive phenomenon. However, a precise rigid calibration of binocular vision sensors is crucial for accurate depth estimation. Alternatively, a monocular camera alleviates this limitation at the expense of accuracy in estimating depth, and the challenge is exacerbated in harsh environmental conditions. Moreover, an optical sensor often fails to acquire vital signals in harsh environments, and radar is used instead, which gives coarse but more accurate signals. This work explores the utility of coarse signals from radar when fused with fine-grained data from a monocular camera for depth estimation in harsh environmental conditions. A variant of the feature pyramid network (FPN) operates extensively on fine-grained image features at multiple scales with fewer parameters. The FPN feature maps are fused with sparse radar features extracted with a convolutional neural network. The concatenated hierarchical features are used to predict the depth with ordinal regression. We performed experiments on the nuScenes dataset, and the proposed architecture leads the quantitative evaluations with reduced parameters and faster inference. The depth estimation results suggest that the proposed techniques can be used as an alternative to stereo depth estimation in critical applications in robotics and self-driving cars. The source code will be available at: \url{https://github.com/MI-Hussain/RVMDE}.
    Attention Augmented ConvLSTM for Environment Prediction. (arXiv:2010.09662v3 [cs.CV] UPDATED)
    (0 min) Safe and proactive planning in robotic systems generally requires accurate predictions of the environment. Prior work on environment prediction applied video frame prediction techniques to bird's-eye view environment representations, such as occupancy grids. ConvLSTM-based frameworks used previously often result in significant blurring and vanishing of moving objects, thus hindering their applicability for use in safety-critical applications. In this work, we propose two extensions to the ConvLSTM to address these issues. We present the Temporal Attention Augmented ConvLSTM (TAAConvLSTM) and Self-Attention Augmented ConvLSTM (SAAConvLSTM) frameworks for spatiotemporal occupancy prediction, and demonstrate improved performance over baseline architectures on the real-world KITTI and Waymo datasets.
    Single-Read Reconstruction for DNA Data Storage Using Transformers. (arXiv:2109.05478v1 [cs.ET])
    (0 min) As the global need for large-scale data storage is rising exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA-based storage a potential solution for the future of data storage. Several studies have introduced DNA-based storage systems with high information density (petabytes/gram). However, DNA synthesis and sequencing technologies yield erroneous outputs. Algorithmic approaches for correcting these errors depend on reading multiple copies of each sequence and result in excessive reading costs. The unprecedented success of Transformers as a deep learning architecture for language modeling has led to its repurposing for solving a variety of tasks across various domains. In this work, we propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA-based data storage. We address the error correction process as a self-supervised sequence-to-sequence task and use synthetic noise injection to train the model using only the decoded reads. Our approach exploits the inherent redundancy of each decoded file to learn its underlying structure. To demonstrate our proposed approach, we encode text, image and code-script files to DNA, produce errors with a high-fidelity error simulator, and reconstruct the original files from the noisy reads. Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand compared to state-of-the-art algorithms using 2-3 copies. This is the first demonstration of using deep learning models for single-read reconstruction in DNA-based storage, which allows for a reduction of the overall cost of the process. We show that this approach is applicable for various domains and can be generalized to new domains as well.
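    The self-supervised training data can be generated exactly as described: take a decoded (clean) strand and synthesize noisy "reads" from it. Below is a small sketch of such noise injection with assumed substitution/indel rates; the Transformer itself is not shown.

```python
import random

random.seed(0)
ALPHABET = "ACGT"

def inject_noise(seq: str, sub_p=0.03, del_p=0.01, ins_p=0.01) -> str:
    """Create a synthetic noisy read from a clean strand (substitutions, deletions, insertions)."""
    out = []
    for base in seq:
        r = random.random()
        if r < del_p:
            continue                                               # deletion
        elif r < del_p + sub_p:
            out.append(random.choice(ALPHABET.replace(base, "")))   # substitution
        else:
            out.append(base)
        if random.random() < ins_p:
            out.append(random.choice(ALPHABET))                     # insertion
    return "".join(out)

clean = "".join(random.choice(ALPHABET) for _ in range(60))
noisy = inject_noise(clean)
# (noisy, clean) pairs like this one form the sequence-to-sequence training data:
# the model reads `noisy` and is trained to reconstruct `clean`.
print(clean)
print(noisy)
```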
    Generalized Second Order Value Iteration in Markov Decision Processes. (arXiv:1905.03927v2 [cs.LG] UPDATED)
    (0 min) Value iteration is a fixed point iteration technique utilized to obtain the optimal value function and policy in a discounted reward Markov Decision Process (MDP). Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first order method and therefore it may take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique that can be applied to solve a fixed point equation. It has been shown in the literature that, under a special structure of the MDP, successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second order value iteration procedure that is obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the global convergence of our algorithm to the optimal solution asymptotically and show the second order convergence. Through experiments, we demonstrate the effectiveness of our proposed approach.
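    As a point of reference, the first-order scheme being accelerated can be sketched in a few lines of NumPy: standard value iteration with a successive-relaxation step (the Newton-Raphson second-order step proposed in the paper is not shown). The toy MDP and the relaxation parameter are assumptions chosen so that the relaxed operator is still a contraction.

```python
import numpy as np

# Value iteration with a successive-relaxation step on a toy 2-state, 2-action MDP.
P = np.array([                       # P[a, s, s'] transition probabilities
    [[0.9, 0.1], [0.2, 0.8]],        # action 0
    [[0.5, 0.5], [0.6, 0.4]],        # action 1
])
R = np.array([[1.0, 0.0],            # R[a, s] expected immediate reward
              [0.5, 2.0]])
gamma, omega = 0.9, 1.05             # discount, relaxation (|1-omega| + omega*gamma < 1 here)

V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * (P @ V)          # Q[a, s]: one-step lookahead values
    T_V = Q.max(axis=0)              # standard Bellman backup
    V_new = (1 - omega) * V + omega * T_V   # relaxed update (omega = 1 recovers plain VI)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print("optimal values:", V)
```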
    OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings. (arXiv:2007.00049v2 [cs.CL] UPDATED)
    (0 min) Language representations are known to carry stereotypical biases and, as a result, lead to biased predictions in downstream tasks. While existing methods are effective at mitigating biases by linear projection, such methods are too aggressive: they not only remove bias, but also erase valuable information from word embeddings. We develop new measures for evaluating specific information retention that demonstrate the tradeoff between bias removal and information retention. To address this challenge, we propose OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale. Our experiments on gender biases show that OSCaR is a well-balanced approach that ensures that semantic information is retained in the embeddings and bias is also effectively mitigated.
    Data Generation Method for Learning a Low-dimensional Safe Region in Safe Reinforcement Learning. (arXiv:2109.05077v1 [eess.SY])
    (0 min) Safe reinforcement learning aims to learn a control policy while ensuring that neither the system nor the environment gets damaged during the learning process. For implementing safe reinforcement learning on highly nonlinear and high-dimensional dynamical systems, one possible approach is to find a low-dimensional safe region via data-driven feature extraction methods, which provides safety estimates to the learning algorithm. As the reliability of the learned safety estimates is data-dependent, we investigate in this work how different training data will affect the safe reinforcement learning approach. By balancing between the learning performance and the risk of being unsafe, a data generation method that combines two sampling methods is proposed to generate representative training data. The performance of the method is demonstrated with a three-link inverted pendulum example.
    Learning Neural Network Subspaces. (arXiv:2102.10472v3 [cs.LG] UPDATED)
    (0 min) Recent observations have advanced our understanding of the neural network optimization landscape, revealing the existence of (1) paths of high accuracy containing diverse solutions and (2) wider minima offering improved performance. Previous methods observing diverse paths require multiple training runs. In contrast, we aim to leverage both properties (1) and (2) with a single method and in a single training run. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks. These neural network subspaces contain diverse solutions that can be ensembled, approaching the ensemble performance of independently trained networks without the training cost. Moreover, using the subspace midpoint boosts accuracy, calibration, and robustness to label noise, outperforming Stochastic Weight Averaging.
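    A flavour of the idea is easy to sketch: given two sets of weights for the same architecture, any point on the line between them is again a network that can be evaluated. The toy models below are untrained stand-ins; the paper trains the whole subspace jointly in a single run rather than interpolating after the fact.

```python
import copy
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

net_a, net_b = make_net(), make_net()          # stand-ins for two endpoint networks

def interpolate(model_a, model_b, alpha=0.5):
    """Return a new model whose weights are (1 - alpha) * A + alpha * B."""
    midpoint = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    mid_state = {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}
    midpoint.load_state_dict(mid_state)
    return midpoint

midpoint_net = interpolate(net_a, net_b, alpha=0.5)   # the subspace midpoint
x = torch.randn(4, 20)
print(midpoint_net(x).shape)                          # the midpoint is just another network
```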
    Towards a variational Jordan-Lee-Preskill quantum algorithm. (arXiv:2109.05547v1 [quant-ph])
    (0 min) Rapid developments of quantum information technology show promising opportunities for simulating quantum field theory in near-term quantum devices. In this work, we formulate the theory of (time-dependent) variational quantum simulation, explicitly designed for quantum simulation of quantum field theory. We develop hybrid quantum-classical algorithms for crucial ingredients in particle scattering experiments, including encoding, state preparation, and time evolution, with several numerical simulations to demonstrate our algorithms in the 1+1 dimensional $\lambda \phi^4$ quantum field theory. These algorithms could be understood as near-term analogs of the Jordan-Lee-Preskill algorithm, the basic algorithm for simulating quantum field theory using universal quantum devices. Our contribution also includes a bosonic version of the Unitary Coupled Cluster ansatz with physical interpretation in quantum field theory, a discussion about the subspace fidelity, a comparison among different bases in the 1+1 dimensional $\lambda \phi^4$ theory, and the "spectral crowding" in the quantum field theory simulation.
    Estimation of Local Average Treatment Effect by Data Combination. (arXiv:2109.05175v1 [stat.ML])
    (0 min) It is important to estimate the local average treatment effect (LATE) when compliance with a treatment assignment is incomplete. The previously proposed methods for LATE estimation required all relevant variables to be jointly observed in a single dataset; however, it is sometimes difficult or even impossible to collect such data in many real-world problems for technical or privacy reasons. We consider a novel problem setting in which LATE, as a function of covariates, is nonparametrically identified from the combination of separately observed datasets. For estimation, we show that the direct least squares method, which was originally developed for estimating the average treatment effect under complete compliance, is applicable to our setting. However, model selection and hyperparameter tuning for the direct least squares estimator can be unstable in practice since it is defined as a solution to the minimax problem. We then propose a weighted least squares estimator that enables simpler model selection by avoiding the minimax objective formulation. Unlike the inverse probability weighted (IPW) estimator, the proposed estimator directly uses the pre-estimated weight without inversion, avoiding the problems caused by the IPW methods. We demonstrate the effectiveness of our method through experiments using synthetic and real-world datasets.
    Detecting Handwritten Mathematical Terms with Sensor Based Data. (arXiv:2109.05594v1 [cs.LG])
    (0 min) In this work we propose a solution to the UbiComp 2021 Challenge by Stabilo, in which handwritten mathematical terms are to be automatically classified based on time-series sensor data captured on the DigiPen. The input data set contains data from different writers, with label strings constructed from a total of 15 different possible characters. The label string must first be split into separate characters so that they can be classified one by one. This issue is solved by applying a data-dependent and rule-based information extraction algorithm to the labeled data. Using the resulting data, two classifiers are constructed. The first is a binary classifier that is able to predict, for unknown data, whether a sample is part of a writing activity; it consists of a Deep Neural Network feature extractor followed by a Random Forest that is trained to classify the extracted features at an F1 score of >90%. The second classifier is a Deep Neural Network that combines convolution layers with recurrent layers to predict windows with a single label, out of the 15 possible classes, at an F1 score of >60%. A simulation of the challenge evaluation procedure reports a Levenshtein distance of 8 and shows that the chosen approach still falls short in overall accuracy and real-time applicability.
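    The reported Levenshtein distance is the standard edit distance between the predicted and ground-truth label strings; a compact dynamic-programming sketch (with a made-up example pair) is below.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical predicted vs. ground-truth label strings.
print(levenshtein("3+4=7", "3+4=17"))   # -> 1
```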
    Bilevel Programs Meet Deep Learning: A Unifying View on Inference Learning Methods. (arXiv:2105.07231v2 [cs.LG] UPDATED)
    (0 min) In this work we unify a number of inference learning methods that have been proposed in the literature as alternative training algorithms to the ones based on regular error back-propagation. These inference learning methods were developed with very diverse motivations, mainly aiming to enhance the biological plausibility of deep neural networks and to improve the intrinsic parallelism of training methods. We show that these superficially very different methods can all be obtained by successively applying a particular reformulation of bilevel optimization programs. As a by-product, it also becomes evident that all considered inference learning methods include back-propagation as a special case, and therefore at least approximate error back-propagation in typical settings. Finally, we propose Fenchel back-propagation, which replaces the propagation of infinitesimal corrections performed in standard back-propagation with finite targets as the learning signal. Fenchel back-propagation can therefore be seen as an instance of learning via explicit target propagation.
    DRo: A data-scarce mechanism to revolutionize the performance of Deep Learning based Security Systems. (arXiv:2109.05470v1 [cs.CR])
    (0 min) Supervised deep learning requires plenty of labeled data to converge, and hence to perform optimally for task-specific learning. Therefore, we propose a novel mechanism named DRo (for Deep Routing) for data-scarce domains like security. The DRo approach builds upon some of the recent developments in deep clustering. In particular, it exploits the self-augmented training mechanism using synthetically generated local perturbations. DRo not only alleviates the challenges of sparsely labeled data but also offers many unique advantages. We also developed a system named DRoID that uses the DRo mechanism for enhancing the performance of an existing malware detection system that uses low-information features, namely Android implicit Intents, as the only features. We conducted experiments on DRoID using a popular and standardized Android malware dataset and found that the DRo mechanism could successfully reduce the false alarms generated by the downstream classifier by 67.9%, while simultaneously boosting its accuracy by 11.3%. This is significant not only because the gains achieved are unparalleled but also because the features used were never considered rich enough to train a classifier on; hence, no competitive performance had been reported to date by any malware classification system using these features in isolation. Owing to the results achieved, the DRo mechanism claims a dominant position amongst all known systems that aim to enhance the classification performance of deep learning models with sparsely labeled data.
    2-in-1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency. (arXiv:2109.05223v1 [cs.LG])
    (0 min) The recent breakthroughs of deep neural networks (DNNs) and the advent of billions of Internet of Things (IoT) devices have excited an explosive demand for intelligent IoT devices equipped with domain-specific DNN accelerators. However, the deployment of DNN accelerator enabled intelligent functionality into real-world IoT devices still remains particularly challenging. First, powerful DNNs often come at prohibitive complexities, whereas IoT devices often suffer from stringent resource constraints. Second, while DNNs are vulnerable to adversarial attacks especially on IoT devices exposed to complex real-world environments, many IoT applications require strict security. Existing DNN accelerators mostly tackle only one of the two aforementioned challenges (i.e., efficiency or adversarial robustness) while neglecting or even sacrificing the other. To this end, we propose a 2-in-1 Accelerator, an integrated algorithm-accelerator co-design framework aiming at winning both the adversarial robustness and efficiency of DNN accelerators. Specifically, we first propose a Random Precision Switch (RPS) algorithm that can effectively defend DNNs against adversarial attacks by enabling random DNN quantization as an in-situ model switch. Furthermore, we propose a new precision-scalable accelerator featuring (1) a new precision-scalable MAC unit architecture which spatially tiles the temporal MAC units to boost both the achievable efficiency and flexibility and (2) a systematically optimized dataflow that is searched by our generic accelerator optimizer. Extensive experiments and ablation studies validate that our 2-in-1 Accelerator can not only aggressively boost both the adversarial robustness and efficiency of DNN accelerators under various attacks, but also naturally support instantaneous robustness-efficiency trade-offs adapting to varied resources without the necessity of DNN retraining.
    Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection. (arXiv:2109.05468v1 [cs.LG])
    (0 min) Gradient Boosting Machines (GBM) are among the go-to algorithms on tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias was extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We show that although these implementations demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining roughly the same level of prediction accuracy.
    Fundamental limits of over-the-air optimization: Are analog schemes optimal?. (arXiv:2109.05222v1 [cs.IT])
    (0 min) We consider over-the-air convex optimization on a $d$-dimensional space where coded gradients are sent over an additive Gaussian noise channel with variance $\sigma^2$. The codewords satisfy an average power constraint $P$, resulting in a signal-to-noise ratio (SNR) of $P/\sigma^2$. We derive bounds for the convergence rates for over-the-air optimization. Our first result is a lower bound for the convergence rate showing that any code must slow down the convergence rate by a factor of roughly $\sqrt{d/\log(1 + \mathrm{SNR})}$. Next, we consider a popular class of schemes called analog coding, where a linear function of the gradient is sent. We show that a simple scaled transmission analog coding scheme results in a slowdown in convergence rate by a factor of $\sqrt{d(1 + 1/\mathrm{SNR})}$. This matches the previous lower bound up to constant factors for low SNR, making the scaled transmission scheme optimal at low SNR. However, we show that this slowdown is necessary for any analog coding scheme. In particular, a slowdown in convergence by a factor of $\sqrt{d}$ for analog coding remains even when the SNR tends to infinity. Remarkably, we present a simple quantize-and-modulate scheme that uses Amplitude Shift Keying and almost attains the optimal convergence rate at all SNRs.
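    A toy simulation makes the scaled-transmission idea tangible: scale the gradient to meet the power budget, add channel noise, rescale at the receiver, and take a descent step. The quadratic objective, dimensions, and constants below are assumptions for illustration, not the paper's setting.

```python
import numpy as np

# Toy "scaled transmission" analog coding for over-the-air gradient descent.
rng = np.random.default_rng(0)
d, P, sigma2, lr = 10, 1.0, 0.1, 0.05
A = np.diag(np.linspace(0.5, 2.0, d))          # quadratic objective f(x) = 0.5 x^T A x

x = rng.standard_normal(d)
for _ in range(500):
    g = A @ x                                   # true gradient at the transmitter
    scale = np.sqrt(P) / (np.linalg.norm(g) + 1e-12)
    sent = scale * g                            # scaled to satisfy the power budget
    received = sent + np.sqrt(sigma2) * rng.standard_normal(d)   # additive Gaussian noise
    g_hat = received / scale                    # receiver rescales to estimate the gradient
    x = x - lr * g_hat

print("final objective:", 0.5 * x @ A @ x)
```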
    EMVLight: A Decentralized Reinforcement Learning Framework for Efficient Passage of Emergency Vehicles. (arXiv:2109.05429v1 [cs.LG])
    (0 min) Emergency vehicles (EMVs) play a crucial role in responding to time-critical events such as medical emergencies and fire outbreaks in urban areas. The less time EMVs spend traveling through traffic, the more likely they are to help save lives and reduce property loss. To reduce the travel time of EMVs, prior work has used route optimization based on historical traffic-flow data and traffic signal pre-emption based on the optimal route. However, traffic signal pre-emption dynamically changes the traffic flow which, in turn, modifies the optimal route of an EMV. In addition, traffic signal pre-emption practices usually lead to significant disturbances in traffic flow and subsequently increase the travel time for non-EMVs. In this paper, we propose EMVLight, a decentralized reinforcement learning (RL) framework for simultaneous dynamic routing and traffic signal control. EMVLight extends Dijkstra's algorithm to efficiently update the optimal route for EMVs in real time as they travel through the traffic network. The decentralized RL agents learn network-level cooperative traffic signal phase strategies that not only reduce EMV travel time but also reduce the average travel time of non-EMVs in the network. This benefit has been demonstrated through comprehensive experiments with synthetic and real-world maps. These experiments show that EMVLight outperforms benchmark transportation engineering techniques and existing RL-based signal control methods.
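    For context, the textbook Dijkstra routine that EMVLight builds on is a few lines with a binary heap; the toy intersection graph below is hypothetical, and the paper's real-time extension is not reproduced.

```python
import heapq

def dijkstra(graph, source):
    """Textbook Dijkstra over a dict-of-dicts adjacency map with nonnegative travel times."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                             # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# Toy intersection graph with travel times (minutes) per road segment.
graph = {"A": {"B": 2.0, "C": 5.0}, "B": {"C": 1.0, "D": 4.0}, "C": {"D": 1.0}, "D": {}}
print(dijkstra(graph, "A"))   # {'A': 0.0, 'B': 2.0, 'C': 3.0, 'D': 4.0}
```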
    Internet of Things (IoT) Based Video Analytics: a use case of Smart Doorbell. (arXiv:2105.06508v2 [cs.CV] UPDATED)
    (0 min) The vision of the internet of things (IoT) is a reality now. IoT devices are getting cheaper and smaller, and they are becoming more and more computationally and energy-efficient. The global market of IoT-based video analytics has seen significant growth in recent years and it is expected to be a growing market segment. For any IoT-based video analytics application, a few key properties are required, such as cost-effectiveness, widespread use, flexible design, accurate scene detection, and reusability of the framework. The video-based smart doorbell system is one such application domain for video analytics where many commercial offerings are available in the consumer market. However, such existing offerings are costly, monolithic, and proprietary, and there is a trade-off between accuracy and portability. To address these problems, I propose a distributed framework for video analytics with a use case of a smart doorbell system. The proposed framework uses AWS cloud services as a base platform and, to meet the affordability constraint, the system was implemented on an affordable Raspberry Pi. The smart doorbell is able to recognize known and unknown persons with high accuracy. The smart doorbell system also provides additional detection functionality, such as harmful weapon detection, noteworthy vehicle detection, and animal/pet detection. An iOS application was specifically developed for this implementation, which can receive notifications from the smart doorbell in real time. Finally, the paper also discusses classical approaches for video analytics, their feasibility for this use case, and a comparative analysis in terms of accuracy and the time required to detect an object in the frame. Results conclude that the AWS cloud-based approach is well suited to this smart doorbell use case.
    Space Meets Time: Local Spacetime Neural Network For Traffic Flow Forecasting. (arXiv:2109.05225v1 [cs.LG])
    (0 min) Traffic flow forecasting is a crucial task in urban computing. The challenge arises as traffic flows often exhibit intrinsic and latent spatio-temporal correlations that cannot be identified by extracting the spatial and temporal patterns of traffic data separately. We argue that such correlations are universal and play a pivotal role in traffic flow. We put forward spacetime interval learning as a paradigm to explicitly capture these correlations through a unified analysis of both spatial and temporal features. Unlike the state-of-the-art methods, which are restricted to a particular road network, we model the universal spatio-temporal correlations that are transferable from city to city. To this end, we propose a new spacetime interval learning framework that constructs a local-spacetime context of a traffic sensor comprising the data from its neighbors within close time points. Based on this idea, we introduce the spacetime neural network (STNN), which employs novel spacetime convolution and attention mechanisms to learn the universal spatio-temporal correlations. The proposed STNN captures local traffic patterns and does not depend on a specific network structure. As a result, a trained STNN model can be applied to any unseen traffic network. We evaluate the proposed STNN on two public real-world traffic datasets and a simulated dataset on dynamic networks. The experimental results show that STNN not only improves prediction accuracy by 15% over state-of-the-art methods, but also effectively handles the case when the traffic network undergoes dynamic changes, and exhibits superior generalization capability.
    Conditional Generation of Synthetic Geospatial Images from Pixel-level and Feature-level Inputs. (arXiv:2109.05201v1 [cs.CV])
    (0 min) Training robust supervised deep learning models for many geospatial applications of computer vision is difficult due to a dearth of class-balanced and diverse training data. Moreover, obtaining enough training data for many applications is financially prohibitive or may be infeasible, especially when the application involves modeling rare or extreme events. Synthetically generating data (and labels) using a generative model that can sample from a target distribution and exploit the multi-scale nature of images can be an inexpensive solution to address the scarcity of labeled data. Towards this goal, we present a deep conditional generative model, called VAE-Info-cGAN, that combines a Variational Autoencoder (VAE) with a conditional Information Maximizing Generative Adversarial Network (InfoGAN), for synthesizing semantically rich images simultaneously conditioned on a pixel-level condition (PLC) and a macroscopic feature-level condition (FLC). Dimensionally, the PLC can only vary in the channel dimension from the synthesized image and is meant to be a task-specific input. The FLC is modeled as an attribute vector in the latent space of the generated image which controls the contributions of various characteristic attributes germane to the target distribution. Experiments on a GPS trajectories dataset show that the proposed model can accurately generate various forms of spatiotemporal aggregates across different geographic locations while conditioned only on a raster representation of the road network. The primary intended application of the VAE-Info-cGAN is synthetic data (and label) generation for targeted data augmentation for computer vision-based modeling of problems relevant to geospatial analysis and remote sensing.
    CupNet -- Pruning a network for geometric data. (arXiv:2005.05276v2 [cs.LG] UPDATED)
    (0 min) Using data from a simulated cup drawing process, we demonstrate how the inherent geometrical structure of cup meshes can be used to effectively prune an artificial neural network in a straightforward way.
    Auto-encoding brain networks with applications to analyzing large-scale brain imaging datasets. (arXiv:1911.02728v3 [stat.ML] UPDATED)
    (0 min) There has been huge interest in studying human brain connectomes inferred from different imaging modalities and exploring their relationship with human traits, such as cognition. Brain connectomes are usually represented as networks, with nodes corresponding to different regions of interest (ROIs) and edges to connection strengths between ROIs. Due to the high-dimensionality and non-Euclidean nature of networks, it is challenging to depict their population distribution and relate them to human traits. Current approaches focus on summarizing the network using either pre-specified topological features or principal components analysis (PCA). In this paper, building on recent advances in deep learning, we develop a nonlinear latent factor model to characterize the population distribution of brain graphs and infer the relationships between brain structural connectomes and human traits. We refer to our method as Graph AuTo-Encoding (GATE). We applied GATE to two large-scale brain imaging datasets, the Adolescent Brain Cognitive Development (ABCD) study and the Human Connectome Project (HCP) for adults, to understand the structural brain connectome and its relationship with cognition. Numerical results demonstrate huge advantages of GATE over competitors in terms of prediction accuracy, statistical inference and computing efficiency. We found that structural connectomes have a stronger association with a wide range of human cognitive traits than was apparent using previous approaches.
    Robust Max Entrywise Error Bounds for Sparse Tensor Estimation via Similarity Based Collaborative Filtering. (arXiv:1908.01241v3 [cs.LG] UPDATED)
    (0 min) Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for sparse tensor estimation and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a low rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \kappa})$ for any arbitrarily small $\kappa > 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\epsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by ${\sf poly}(\epsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\epsilon \geq 0$ by a finite rank tensor, then the estimation error converges to ${\sf poly}(\epsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.
    Debiased Machine Learning of Set-Identified Linear Models. (arXiv:1712.10024v4 [stat.ML] UPDATED)
    (0 min) This paper provides estimation and inference methods for an identified set's boundary (i.e., support function) where the selection among a very large number of covariates is based on modern regularized tools. I characterize the boundary using a semiparametric moment equation. Combining Neyman-orthogonality and sample splitting ideas, I construct a root-N consistent, uniformly asymptotically Gaussian estimator of the boundary and propose a multiplier bootstrap procedure to conduct inference. I apply this result to the partially linear model and the partially linear IV model with an interval-valued outcome.
    No True State-of-the-Art? OOD Detection Methods are Inconsistent across Datasets. (arXiv:2109.05554v1 [cs.LG])
    (0 min) Out-of-distribution detection is an important component of reliable ML systems. Prior literature has proposed various methods (e.g., MSP (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), Mahalanobis (Lee et al., 2018)), claiming they are state-of-the-art by showing they outperform previous methods on a selected set of in-distribution (ID) and out-of-distribution (OOD) datasets. In this work, we show that none of these methods are inherently better at OOD detection than others on a standardized set of 16 (ID, OOD) pairs. We give possible explanations for these inconsistencies with simple toy datasets where whether one method outperforms another depends on the structure of the ID and OOD datasets in question. Finally, we show that a method outperforming another on a certain (ID, OOD) pair may not do so in a low-data regime. In the low-data regime, we propose a distance-based method, Pairwise OOD detection (POD), which is based on Siamese networks and improves over Mahalanobis by sidestepping the expensive covariance estimation step. Our results suggest that the OOD detection problem may be too broad, and we should consider more specific structures for leverage.
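    For reference, the Mahalanobis baseline mentioned here scores a test point by its minimal Mahalanobis distance to class-conditional Gaussians fitted on in-distribution features (with a shared covariance); a small NumPy sketch on toy features follows. POD itself and its Siamese-network training are not reproduced.

```python
import numpy as np

def fit_mahalanobis(features, labels):
    """Fit class-conditional means with a shared (tied) covariance on in-distribution features."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    precision = np.linalg.pinv(np.cov(centered, rowvar=False))
    return means, precision

def ood_score(x, means, precision):
    """Higher = more out-of-distribution (minimal Mahalanobis distance to any class mean)."""
    return min(float((x - mu) @ precision @ (x - mu)) for mu in means.values())

# Toy in-distribution features from two classes, plus one far-away test point.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(3, 1, (100, 8))])
labels = np.array([0] * 100 + [1] * 100)
means, precision = fit_mahalanobis(feats, labels)
print(ood_score(rng.normal(0, 1, 8), means, precision))   # small: looks in-distribution
print(ood_score(np.full(8, 10.0), means, precision))      # large: flagged as OOD
```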
    Adaptive network reliability analysis: Methodology and applications to power grid. (arXiv:2109.05360v1 [cs.LG])
    (0 min) Flow network models can capture the underlying physics and operational constraints of many networked systems including the power grid and transportation and water networks. However, analyzing reliability of systems using computationally expensive flow-based models faces substantial challenges, especially for rare events. Existing actively trained meta-models, which present a new promising direction in reliability analysis, are not applicable to networks due to the inability of these methods to handle high-dimensional problems as well as discrete or mixed variable inputs. This study presents the first adaptive surrogate-based Network Reliability Analysis using Bayesian Additive Regression Trees (ANR-BART). This approach integrates BART and Monte Carlo simulation (MCS) via an active learning method that identifies the most valuable training samples based on the credible intervals derived by BART over the space of predictor variables as well as the proximity of the points to the estimated limit state. Benchmark power grids including IEEE 30, 57, 118, and 300-bus systems and their power flow models for cascading failure analysis are considered to investigate ANR-BART, MCS, subset simulation, and passively-trained optimal deep neural networks and BART. Results indicate that ANR-BART is robust and yields accurate estimates of network failure probability, while significantly reducing the computational cost of reliability analysis.
    Structure-preserving Sparse Identification of Nonlinear Dynamics for Data-driven Modeling. (arXiv:2109.05364v1 [cs.LG])
    (0 min) Discovery of dynamical systems from data forms the foundation for data-driven modeling and recently, structure-preserving geometric perspectives have been shown to provide improved forecasting, stability, and physical realizability guarantees. We present here a unification of the Sparse Identification of Nonlinear Dynamics (SINDy) formalism with neural ordinary differential equations. The resulting framework allows learning of both "black-box" dynamics and learning of structure preserving bracket formalisms for both reversible and irreversible dynamics. We present a suite of benchmarks demonstrating effectiveness and structure preservation, including for chaotic systems.
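    For readers unfamiliar with SINDy, its core step is a sparse regression of numerically estimated derivatives onto a library of candidate terms; a minimal NumPy sketch with sequentially thresholded least squares on a simulated harmonic oscillator is below. The structure-preserving bracket formalisms and neural-ODE unification of this paper are not shown.

```python
import numpy as np

# Core SINDy step on a simulated harmonic oscillator: x' = y, y' = -x.
dt, T = 0.01, 2000
X = np.zeros((T, 2))
X[0] = [1.0, 0.0]
for k in range(T - 1):
    x, y = X[k]
    X[k + 1] = X[k] + dt * np.array([y, -x])      # Euler-integrated trajectory
dXdt = np.gradient(X, dt, axis=0)                 # numerically estimated derivatives

names = ["1", "x", "y", "x^2", "xy", "y^2"]
x, y = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])   # candidate library

def stls(Theta, dXdt, threshold=0.05, iters=10):
    """Sequentially thresholded least squares: fit, zero small coefficients, refit."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        Xi[np.abs(Xi) < threshold] = 0.0
        for j in range(dXdt.shape[1]):
            active = np.abs(Xi[:, j]) >= threshold
            if active.any():
                Xi[active, j] = np.linalg.lstsq(Theta[:, active], dXdt[:, j], rcond=None)[0]
    return Xi

Xi = stls(Theta, dXdt)
for j, lhs in enumerate(["dx/dt", "dy/dt"]):
    terms = [f"{Xi[i, j]:+.2f} {names[i]}" for i in range(len(names)) if Xi[i, j] != 0]
    print(lhs, "=", " ".join(terms))              # expect dx/dt ~ +1.00 y, dy/dt ~ -1.00 x
```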
    Stabilizing Deep Tomographic Reconstruction. (arXiv:2008.01846v5 [eess.IV] UPDATED)
    (0 min) Tomographic image reconstruction with deep learning is an emerging field, but a recent landmark study reveals that several deep reconstruction networks are unstable for computed tomography (CT) and magnetic resonance imaging (MRI). Specifically, three kinds of instabilities were reported: (1) strong image artefacts from tiny perturbations, (2) small features missing in a deeply reconstructed image, and (3) decreased imaging performance with increased input data. On the other hand, compressed sensing (CS) inspired reconstruction methods do not suffer from these instabilities because of their built-in kernel awareness. For deep reconstruction to realize its full potential and become a mainstream approach for tomographic imaging, it is thus critically important to meet this challenge by stabilizing deep reconstruction networks. Here we propose an Analytic Compressed Iterative Deep (ACID) framework to address this challenge. ACID synergizes a deep reconstruction network trained on big data, kernel awareness from CS-inspired processing, and iterative refinement to minimize the data residual relative to real measurement. Our study demonstrates that the deep reconstruction using ACID is accurate and stable, and sheds light on the converging mechanism of the ACID iteration under a Bounded Relative Error Norm (BREN) condition. In particular, the study shows that ACID-based reconstruction is resilient against adversarial attacks, superior to classic sparsity-regularized reconstruction alone, and eliminates the three kinds of instabilities. We anticipate that this integrative data-driven approach will help promote development and translation of deep tomographic image reconstruction networks into clinical applications.
    Making Table Understanding Work in Practice. (arXiv:2109.05173v1 [cs.DB])
    (0 min) Understanding the semantics of tables at scale is crucial for tasks like data integration, preparation, and search. Table understanding methods aim at detecting a table's topic, semantic column types, column relations, or entities. With the rise of deep learning, powerful models have been developed for these tasks with excellent accuracy on benchmarks. However, we observe that there exists a gap between the performance of these models on these benchmarks and their applicability in practice. In this paper, we address the question: what do we need for these models to work in practice? We discuss three challenges of deploying table understanding models and propose a framework to address them. These challenges include 1) difficulty in customizing models to specific domains, 2) lack of training data for typical database tables often found in enterprises, and 3) lack of confidence in the inferences made by models. We present SigmaTyper which implements this framework for the semantic column type detection task. SigmaTyper encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model. Lastly, we highlight avenues for future research that further close the gap towards making table understanding effective in practice.
    FBERT: A Neural Transformer for Identifying Offensive Content. (arXiv:2109.05074v1 [cs.CL])
    (0 min) Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over $1.4$ million offensive instances. We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.
    The Logic Traps in Evaluating Post-hoc Interpretations. (arXiv:2109.05463v1 [cs.LG])
    (0 min) Post-hoc interpretation aims to explain a trained model and reveal how the model arrives at a decision. Though research on post-hoc interpretations has developed rapidly, one growing pain in this field is the difficulty in evaluating interpretations. There are some crucial logic traps behind existing evaluation methods, which are ignored by most works. In this opinion piece, we summarize four kinds of evaluation methods and point out the corresponding logic traps behind them. We argue that we should be clear about these traps rather than ignoring them and drawing conclusions assertively.
    On the Ergodicity, Bias and Asymptotic Normality of Randomized Midpoint Sampling Method. (arXiv:2011.03176v2 [stat.ML] UPDATED)
    (0 min) The randomized midpoint method, proposed by [SL19], has emerged as an optimal discretization procedure for simulating the continuous time Langevin diffusions. Focusing on the case of strong-convex and smooth potentials, in this paper, we analyze several probabilistic properties of the randomized midpoint discretization method for both overdamped and underdamped Langevin diffusions. We first characterize the stationary distribution of the discrete chain obtained with constant step-size discretization and show that it is biased away from the target distribution. Notably, the step-size needs to go to zero to obtain asymptotic unbiasedness. Next, we establish the asymptotic normality for numerical integration using the randomized midpoint method and highlight the relative advantages and disadvantages over other discretizations. Our results collectively provide several insights into the behavior of the randomized midpoint discretization method, including obtaining confidence intervals for numerical integrations.
    Memory-based Deep Reinforcement Learning for POMDPs. (arXiv:2102.12344v5 [cs.LG] UPDATED)
    (0 min) A promising characteristic of Deep Reinforcement Learning (DRL) is its capability to learn an optimal policy in an end-to-end manner without relying on feature engineering. However, most approaches assume a fully observable state space, i.e. fully observable Markov Decision Processes (MDPs). In real-world robotics, this assumption is impractical because of issues such as limited sensor sensitivity, sensor noise, and the lack of knowledge about whether the observation design is complete or not. These scenarios lead to Partially Observable MDPs (POMDPs). In this paper, we propose Long-Short-Term-Memory-based Twin Delayed Deep Deterministic Policy Gradient (LSTM-TD3) by introducing a memory component to TD3, and compare its performance with other DRL algorithms in both MDPs and POMDPs. Our results demonstrate the significant advantages of the memory component in addressing POMDPs, including the ability to handle missing and noisy observation data.
    Fairness of Exposure in Stochastic Bandits. (arXiv:2103.02735v2 [cs.LG] UPDATED)
    (0 min) Contextual bandit algorithms have become widely used for recommendation in online systems (e.g. marketplaces, music streaming, news), where they now wield substantial influence on which items get exposed to the users. This raises questions of fairness to the items -- and to the sellers, artists, and writers that benefit from this exposure. We argue that the conventional bandit formulation can lead to an undesirable and unfair winner-takes-all allocation of exposure. To remedy this problem, we propose a new bandit objective that guarantees merit-based fairness of exposure to the items while optimizing utility to the users. We formulate fairness regret and reward regret in this setting, and present algorithms for both stochastic multi-armed bandits and stochastic linear bandits. We prove that the algorithms achieve sub-linear fairness regret and reward regret. Beyond the theoretical analysis, we also provide empirical evidence that these algorithms allocate exposure across arms both fairly and effectively.
    Do Transformer Modifications Transfer Across Implementations and Applications?. (arXiv:2102.11972v2 [cs.LG] UPDATED)
    (0 min) The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results.
    DeepPyram: Enabling Pyramid View and Deformable Pyramid Reception for Semantic Segmentation in Cataract Surgery Videos. (arXiv:2109.05352v1 [cs.CV])
    (0 min) Semantic segmentation in cataract surgery has a wide range of applications contributing to surgical outcome enhancement and clinical risk reduction. However, the varied challenges in segmenting the different relevant instances make designing a single network for all of them quite difficult. This paper proposes a semantic segmentation network termed DeepPyram that can achieve superior performance in segmenting relevant objects in cataract surgery videos despite these varied challenges. This superiority mainly originates from three modules: (i) Pyramid View Fusion, which provides a varying-angle global view of the surrounding region centering at each pixel position in the input convolutional feature map; (ii) Deformable Pyramid Reception, which enables a wide deformable receptive field that can adapt to geometric transformations in the object of interest; and (iii) Pyramid Loss, which adaptively supervises multi-scale semantic feature maps. These modules can effectively boost semantic segmentation performance, especially in the case of transparency, deformability, scalability, and blunt edges in objects. The proposed approach is evaluated using four datasets of cataract surgery for objects with different contextual features and compared with thirteen state-of-the-art segmentation networks. The experimental results confirm that DeepPyram outperforms the rival approaches without imposing additional trainable parameters. Our comprehensive ablation study further proves the effectiveness of the proposed modules.
    Machine learning reveals how personalized climate communication can both succeed and backfire. (arXiv:2109.05104v1 [cs.LG])
    (0 min) Different advertising messages work for different people. Machine learning can be an effective way to personalise climate communications. In this paper we use machine learning to reanalyse findings from a recent study, showing that online advertisements increased some people's belief in climate change while resulting in decreased belief in others. In particular, we show that the effect of the advertisements could change depending on people's age and ethnicity.
    Accurate Prediction Using Triangular Type-2 Fuzzy Linear Regression. (arXiv:2109.05461v1 [cs.LG])
    (0 min) Much work has been done on handling the uncertainties in data using type-1 fuzzy regression. A few type-2 fuzzy regression works have used interval type-2 sets for modeling indeterminacy with type-1 fuzzy membership. This study proposes a triangular type-2 fuzzy regression (TT2FR) model that improves model efficiency by handling the uncertainty in the data. A triangular secondary membership function is used instead of the widely used interval type models. In the proposed model, vagueness in the primary and secondary fuzzy sets is minimized, and a specified x-plane of the observed value is included in the same $\alpha$-plane of the predicted value. Complex calculations of the type-2 fuzzy (T2F) model are simplified by reducing the three-dimensional type-2 fuzzy set (3DT2FS) into two-dimensional interval type-2 fuzzy (2DIT2F) models. The study presents a new T2F regression model that considers a more general form of T2F membership functions and thus avoids high complexity. The performance of the developed model is evaluated using the TAIEX and COVID-19 forecasting datasets. Our model reached the highest performance compared to the other state-of-the-art techniques. The method is ready to be tested with more uncertain data and has the potential to be used for weather and stock prediction.
    Adversarial Representation Learning With Closed-Form Solvers. (arXiv:2109.05535v1 [cs.LG])
    (0 min) Adversarial representation learning aims to learn data representations for a target task while removing unwanted sensitive information at the same time. Existing methods learn model parameters iteratively through stochastic gradient descent-ascent, which is often unstable and unreliable in practice. To overcome this challenge, we adopt closed-form solvers for the adversary and target task. We model them as kernel ridge regressors and analytically determine an upper bound on the optimal dimensionality of representation. Our solution, dubbed OptNet-ARL, reduces to a stable one-shot optimization problem that can be solved reliably and efficiently. OptNet-ARL can be easily generalized to the case of multiple target tasks and sensitive attributes. Numerical experiments, on both small and large scale datasets, show that, from an optimization perspective, OptNet-ARL is stable and exhibits three to five times faster convergence. Performance-wise, when the target and sensitive attributes are dependent, OptNet-ARL learns representations that offer a better trade-off front than existing solutions between (a) utility and bias for fair classification and (b) utility and privacy, by mitigating leakage of private information.
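    The closed-form solver at the heart of this approach is ordinary kernel ridge regression, whose dual solution comes from a single linear solve rather than gradient descent-ascent. Below is a minimal sketch of that building block only (the RBF kernel, regularization value, and toy data are assumptions; it is not the full OptNet-ARL procedure):

        import numpy as np

        def rbf_kernel(A, B, gamma=1.0):
            # Gaussian (RBF) kernel from pairwise squared distances.
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * sq)

        def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
            # Closed-form dual coefficients: alpha = (K + lam * I)^{-1} y.
            K = rbf_kernel(X, X, gamma)
            return np.linalg.solve(K + lam * np.eye(len(X)), y)

        def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
            return rbf_kernel(X_test, X_train, gamma) @ alpha

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 3))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
        alpha = kernel_ridge_fit(X, y)
        print(kernel_ridge_predict(X, alpha, X[:5]))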
    Global and Local Interpretation of black-box Machine Learning models to determine prognostic factors from early COVID-19 data. (arXiv:2109.05087v1 [cs.LG])
    (0 min) The COVID-19 coronavirus has claimed 4.1 million lives, as of July 24, 2021. A variety of machine learning models have been applied to related data to predict important factors such as the severity of the disease and the infection rate, and to discover important prognostic factors. Often the usefulness of the findings from the use of these techniques is reduced due to a lack of method interpretability. Some recent progress made on the interpretability of machine learning models has the potential to unravel more insights while using conventional machine learning models. In this work, we analyze COVID-19 blood work data with some of the popular machine learning models; then we apply state-of-the-art post-hoc local interpretability techniques (e.g., SHAP, LIME) and global interpretability techniques (e.g., symbolic metamodeling) to the trained black-box models to draw interpretable conclusions. In the gamut of machine learning algorithms, regressions remain one of the simplest and most explainable models with a clear mathematical formulation. We explore one of the most recent techniques called symbolic metamodeling to find the mathematical expression of the machine learning models for COVID-19. We identify Acute Kidney Injury (AKI), initial Albumin level (ALBI), Aspartate aminotransferase (ASTI), Total Bilirubin initial (TBILI) and D-Dimer initial (DIMER) as major prognostic factors of disease severity. Our contributions are: (i) uncovering the underlying mathematical expression for the black-box models on the COVID-19 severity prediction task, (ii) being the first to apply symbolic metamodeling to this task, and (iii) discovering important features and feature interactions.
    Concave Utility Reinforcement Learning with Zero-Constraint Violations. (arXiv:2109.05439v1 [cs.LG])
    (0 min) We consider the problem of tabular infinite horizon concave utility reinforcement learning (CURL) with convex constraints. Various learning applications with constraints, such as robotics, do not allow for policies that can violate constraints. To this end, we propose a model-based learning algorithm that achieves zero constraint violations. To obtain this result, we assume that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures. We then solve a tighter optimization problem to ensure that the constraints are never violated despite the imprecise model knowledge and model stochasticity. We also propose a novel Bellman-error-based analysis for tabular infinite-horizon setups which allows us to analyse stochastic policies. Combining the Bellman-error-based analysis and the tighter optimization problem, for $T$ interactions with the environment, we obtain a regret guarantee for the objective that grows as $\Tilde{O}(1/\sqrt{T})$, excluding other factors.
    An Empirical Comparison of Off-policy Prediction Learning Algorithms in the Four Rooms Environment. (arXiv:2109.05110v1 [cs.LG])
    (0 min) Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. We empirically compare 11 off-policy prediction learning algorithms with linear function approximation on two small tasks: the Rooms task, and the High Variance Rooms task. The tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$ and can sometimes be two. To control the high variance caused by the product of the importance sampling ratios, step size should be set small, which in turn slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as $2^{14}\times 25$. This paper builds upon the empirical study of off-policy prediction learning algorithms by Ghiassian and Sutton (2021). We consider the same set of algorithms as theirs and employ the same experimental methodology. The algorithms considered are: Off-policy TD($\lambda$), five Gradient-TD algorithms, two Emphatic-TD algorithms, Tree Backup($\lambda$), Vtrace($\lambda$), and ABTD($\zeta$). We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. The data shows that Tree Backup($\lambda$), Vtrace($\lambda$), and ABTD($\zeta$) are not affected by the high variance as much as other algorithms but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD($\lambda$) tends to have lower asymptotic error than other algorithms, but might learn more slowly in some cases. We suggest algorithms for practitioners based on their problem of interest, and suggest approaches that can be applied to specific algorithms that might result in substantially improved algorithms.
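    The per-step importance sampling ratio driving the variance discussed above is rho_t = pi(a_t|s_t) / b(a_t|s_t). The sketch below is a minimal off-policy TD(0) update with linear (here one-hot) features on a toy chain; the MDP, policies, and step size are illustrative assumptions and are unrelated to the Rooms tasks:

        import numpy as np

        rng = np.random.default_rng(0)
        n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.05
        phi = np.eye(n_states)                                   # one-hot (tabular) features
        w = np.zeros(n_states)

        behavior = np.full((n_states, n_actions), 0.5)           # b(a|s): uniform
        target = np.tile(np.array([0.9, 0.1]), (n_states, 1))    # pi(a|s): prefers "right"

        def step(s, a):
            # Toy chain: action 0 moves right, action 1 moves left; reward at the far end.
            s2 = min(s + 1, n_states - 1) if a == 0 else max(s - 1, 0)
            return s2, (1.0 if s2 == n_states - 1 else 0.0)

        s = 0
        for _ in range(50_000):
            a = rng.choice(n_actions, p=behavior[s])
            s2, r = step(s, a)
            rho = target[s, a] / behavior[s, a]                  # importance sampling ratio
            td_error = r + gamma * phi[s2] @ w - phi[s] @ w
            w += alpha * rho * td_error * phi[s]                 # off-policy TD(0) update
            s = s2
        print(np.round(w, 2))                                    # value estimates under pi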
    HyP-ABC: A Novel Automated Hyper-Parameter Tuning Algorithm Using Evolutionary Optimization. (arXiv:2109.05319v1 [cs.LG])
    (0 min) Machine learning techniques lend themselves as promising decision-making and analytic tools in a wide range of applications. Different ML algorithms have various hyper-parameters. In order to tailor an ML model towards a specific application, a large number of hyper-parameters should be tuned. Tuning the hyper-parameters directly affects the performance (accuracy and run-time). However, for large-scale search spaces, efficiently exploring the ample number of combinations of hyper-parameters is computationally challenging. Existing automated hyper-parameter tuning techniques suffer from high time complexity. In this paper, we propose HyP-ABC, an automatic innovative hybrid hyper-parameter optimization algorithm using the modified artificial bee colony approach, to measure the classification accuracy of three ML algorithms, namely random forest, extreme gradient boosting, and support vector machine. Compared to the state-of-the-art techniques, HyP-ABC is more efficient and has a limited number of parameters to be tuned, making it worthwhile for real-world hyper-parameter optimization problems. We further compare our proposed HyP-ABC algorithm with state-of-the-art techniques. In order to ensure the robustness of the proposed method, the algorithm takes a wide range of feasible hyper-parameter values, and is tested using a real-world educational dataset.
    Multi-task Language Modeling for Improving Speech Recognition of Rare Words. (arXiv:2011.11715v4 [cs.CL] UPDATED)
    (0 min) End-to-end automatic speech recognition (ASR) systems are increasingly popular due to their relative architectural simplicity and competitive performance. However, even though the average accuracy of these systems may be high, the performance on rare content words often lags behind hybrid ASR systems. To address this problem, second-pass rescoring is often applied leveraging upon language modeling. In this paper, we propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance. We show that our rescoring model trained with these additional tasks outperforms the baseline rescoring model, trained with only the language modeling task, by 1.4% on a general test set and by 2.6% on a rare word test set in terms of relative word error rate reduction (WERR). Our best ASR system with the multi-task LM shows a 4.6% WERR reduction compared with an RNN-Transducer-only ASR baseline on rare word recognition.
    Stochastic Cluster Embedding. (arXiv:2108.08003v2 [cs.LG] UPDATED)
    (0 min) Neighbor Embedding (NE) aims to preserve pairwise similarities between data items and has been shown to yield an effective principle for data visualization. However, even the best existing NE methods such as Stochastic Neighbor Embedding (SNE) may leave large-scale patterns such as clusters hidden, despite strong signals being present in the data. To address this, we propose a new cluster visualization method based on the Neighbor Embedding principle. We first present a family of Neighbor Embedding methods which generalizes SNE by using non-normalized Kullback-Leibler divergence with a scale parameter. In this family, much better cluster visualizations often appear with a parameter value different from the one corresponding to SNE. We also develop efficient software which employs asynchronous stochastic block coordinate descent to optimize the new family of objective functions. Our experimental results demonstrate that the method consistently and substantially improves visualization of data clusters compared with the state-of-the-art NE approaches.
    CoG: a Two-View Co-training Framework for Defending Adversarial Attacks on Graph. (arXiv:2109.05558v1 [cs.LG])
    (0 min) Graph neural networks exhibit remarkable performance in graph data analysis. However, the robustness of GNN models remains a challenge. As a result, they are not reliable enough to be deployed in critical applications. Recent studies demonstrate that GNNs could be easily fooled with adversarial perturbations, especially structural perturbations. Such vulnerability is attributed to the excessive dependence on the structure information to make predictions. To achieve better robustness, it is desirable to build the prediction of GNNs with more comprehensive features. Graph data, in most cases, has two views of information, namely structure information and feature information. In this paper, we propose CoG, a simple yet effective co-training framework to combine these two views for the purpose of robustness. CoG trains sub-models from the feature view and the structure view independently and allows them to distill knowledge from each other by adding their most confident unlabeled data into the training set. The orthogonality of these two views diversifies the sub-models, thus enhancing the robustness of their ensemble. We evaluate our framework on three popular datasets, and results show that CoG significantly improves the robustness of graph models against adversarial attacks without sacrificing their performance on clean data. We also show that CoG still achieves good robustness when both node features and graph structures are perturbed.
    Ordering-Based Causal Discovery with Reinforcement Learning. (arXiv:2105.06631v3 [cs.LG] UPDATED)
    (0 min) Discovering causal relations among a set of variables is a long-standing question in many empirical sciences. Recently, Reinforcement Learning (RL) has achieved promising results in causal discovery from observational data. However, searching the space of directed graphs and enforcing acyclicity by implicit penalties tend to be inefficient and restrict existing RL-based methods to small-scale problems. In this work, we propose a novel RL-based approach for causal discovery, by incorporating RL into the ordering-based paradigm. Specifically, we formulate the ordering search problem as a multi-step Markov decision process, implement the ordering generating process with an encoder-decoder architecture, and finally use RL to optimize the proposed model based on the reward mechanisms designed for each ordering. A generated ordering would then be processed using variable selection to obtain the final causal graph. We analyze the consistency and computational complexity of the proposed method, and empirically show that a pretrained model can be exploited to accelerate training. Experimental results on both synthetic and real data sets show that the proposed method achieves much improved performance over existing RL-based methods.
    Bayesian Topic Regression for Causal Inference. (arXiv:2109.05317v1 [stat.ML])
    (0 min) Causal inference using observational text data is becoming increasingly popular in many research areas. This paper presents the Bayesian Topic Regression (BTR) model that uses both text and numerical information to model an outcome variable. It allows estimation of both discrete and continuous treatment effects. Furthermore, it allows for the inclusion of additional numerical confounding factors next to text data. To this end, we combine a supervised Bayesian topic model with a Bayesian regression framework and perform supervised representation learning for the text features jointly with the regression parameter training, respecting the Frisch-Waugh-Lovell theorem. Our paper makes two main contributions. First, we provide a regression framework that allows causal inference in settings when both text and numerical confounders are of relevance. We show with synthetic and semi-synthetic datasets that our joint approach recovers ground truth with lower bias than any benchmark model, when text and numerical features are correlated. Second, experiments on two real-world datasets demonstrate that a joint and supervised learning strategy also yields superior prediction results compared to strategies that estimate regression weights for text and non-text features separately, being even competitive with more complex deep neural networks.
    Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?. (arXiv:2109.05641v1 [cs.LG])
    (0 min) Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by using graph structures based on the relational inductive bias (homophily assumption). Though GNNs are believed to outperform NNs in real-world tasks, their performance advantages over graph-agnostic NNs are not always evident. Heterophily has been considered a main cause, and numerous works have been put forward to address it. In this paper, we first show that not all cases of heterophily are harmful for GNNs with aggregation operation. Then, we propose new metrics based on a similarity matrix which considers the influence of both graph structure and input features on GNNs. The metrics demonstrate advantages over the commonly used homophily metrics in tests on synthetic graphs. From the metrics and the observations, we find that some cases of harmful heterophily can be addressed by a diversification operation. With this fact and knowledge of filterbanks, we propose the Adaptive Channel Mixing (ACM) framework to adaptively exploit aggregation, diversification and identity channels in each GNN layer to address harmful heterophily. We validate the ACM-augmented baselines on 10 real-world node classification tasks. They consistently achieve significant performance gains and exceed the state-of-the-art GNNs on most of the tasks without incurring significant computational burden.
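    For reference, the homophily statistic most commonly used in this line of work (and which the similarity-based metrics above aim to improve on) is simply the fraction of edges joining same-label nodes. A minimal sketch, with a toy graph as an assumption:

        import numpy as np

        def edge_homophily(edge_index, labels):
            # Fraction of edges whose two endpoints share the same label.
            src, dst = edge_index
            return float(np.mean(labels[src] == labels[dst]))

        # Toy graph: 4 nodes, edges given as (source, destination) index arrays.
        edge_index = np.array([[0, 1, 2, 3],
                               [1, 0, 3, 2]])
        labels = np.array([0, 0, 0, 1])
        print(edge_homophily(edge_index, labels))   # 0.5 -> fairly heterophilous graph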
    Potential-based Reward Shaping in Sokoban. (arXiv:2109.05022v1 [cs.LG])
    (0 min) Learning to solve sparse-reward reinforcement learning problems is difficult, due to the lack of guidance towards the goal. But in some problems, prior knowledge can be used to augment the learning process. Reward shaping is a way to incorporate prior knowledge into the original reward function in order to speed up the learning. While previous work has investigated the use of expert knowledge to generate potential functions, in this work, we study whether we can use a search algorithm (A*) to automatically generate a potential function for reward shaping in Sokoban, a well-known planning task. The results show that learning with a shaped reward function is faster than learning from scratch. Our results indicate that distance functions could be suitable potential functions for Sokoban. This work demonstrates the possibility of solving multiple instances with the help of reward shaping. The result can be compressed into a single policy, which can be seen as a first phase towards training a general policy that is able to solve unseen instances.
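    Potential-based shaping adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward, a form known to leave the optimal policy unchanged. The sketch below uses a hypothetical negative-distance potential as a stand-in for the A*-derived potential studied in the paper:

        GAMMA = 0.99

        def potential(state, goal):
            # Hypothetical potential: negative Manhattan distance to the goal cell.
            return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

        def shaped_reward(env_reward, state, next_state, goal, gamma=GAMMA):
            # Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s).
            return env_reward + gamma * potential(next_state, goal) - potential(state, goal)

        # Moving one step closer to the goal earns a small positive shaping bonus.
        print(shaped_reward(0.0, state=(0, 0), next_state=(0, 1), goal=(3, 3)))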
    Omnipredictors. (arXiv:2109.05389v1 [cs.LG])
    (0 min) Loss minimization is a dominant paradigm in machine learning, where a predictor is trained to minimize some loss function that depends on an uncertain event (e.g., "will it rain tomorrow?"). Different loss functions imply different learning algorithms and, at times, very different predictors. While widespread and appealing, a clear drawback of this approach is that the loss function may not be known at the time of learning, requiring the algorithm to use a best-guess loss function. We suggest a rigorous new paradigm for loss minimization in machine learning where the loss function can be ignored at the time of learning and only be taken into account when deciding an action. We introduce the notion of an (${\mathcal{L}},\mathcal{C}$)-omnipredictor, which could be used to optimize any loss in a family ${\mathcal{L}}$. Once the loss function is set, the outputs of the predictor can be post-processed (a simple univariate data-independent transformation of individual predictions) to do well compared with any hypothesis from the class $\mathcal{C}$. The post-processing is essentially what one would perform if the outputs of the predictor were true probabilities of the uncertain events. In a sense, omnipredictors extract all the predictive power from the class $\mathcal{C}$, irrespective of the loss function in $\mathcal{L}$. We show that such "loss-oblivious" learning is feasible through a connection to multicalibration, a notion introduced in the context of algorithmic fairness. In addition, we show how multicalibration can be viewed as a solution concept for agnostic boosting, shedding new light on past results. Finally, we transfer our insights back to the context of algorithmic fairness by providing omnipredictors for multi-group loss minimization.
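    The post-processing step described above is a one-dimensional optimization: treat the prediction as if it were the true probability of the event and pick the action minimizing expected loss under whichever loss is chosen later. A minimal sketch with two illustrative losses (the losses, action grid, and prediction value are assumptions):

        import numpy as np

        def post_process(p, loss):
            # Choose the action in [0, 1] minimizing expected loss, as if p were
            # the true probability of the uncertain event.
            actions = np.linspace(0.0, 1.0, 101)
            expected = p * loss(actions, 1.0) + (1.0 - p) * loss(actions, 0.0)
            return actions[np.argmin(expected)]

        squared = lambda a, y: (a - y) ** 2        # optimal action: a = p
        absolute = lambda a, y: np.abs(a - y)      # optimal action: round p to 0 or 1

        p = 0.7
        print(post_process(p, squared))            # ~0.7
        print(post_process(p, absolute))           # 1.0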
    Compute and Energy Consumption Trends in Deep Learning Inference. (arXiv:2109.05472v1 [cs.LG])
    (0 min) The progress of some AI paradigms such as deep learning is said to be linked to an exponential growth in the number of parameters. There are many studies corroborating these trends, but does this translate into an exponential increase in energy consumption? In order to answer this question we focus on inference costs rather than training costs, as the former account for most of the computing effort, solely because of the multiplicative factors. Also, apart from algorithmic innovations, we account for more specific and powerful hardware (leading to higher FLOPS) that is usually accompanied by important energy efficiency optimisations. We also move the focus from the first implementation of a breakthrough paper towards the consolidated version of the techniques one or two years later. Under this distinctive and comprehensive perspective, we study relevant models in the areas of computer vision and natural language processing: for a sustained increase in performance we see a much softer growth in energy consumption than previously anticipated. The only caveat is, yet again, the multiplicative factor, as AI increases its penetration and becomes more pervasive.
    Entity-Based Knowledge Conflicts in Question Answering. (arXiv:2109.05052v1 [cs.CL])
    (0 min) Knowledge-dependent tasks typically use two sources of knowledge: parametric, learned at training time, and contextual, given as a passage at inference time. To understand how models use these sources together, we formalize the problem of knowledge conflicts, where the contextual information contradicts the learned information. Analyzing the behaviour of popular models, we measure their over-reliance on memorized information (the cause of hallucinations), and uncover important factors that exacerbate this behaviour. Lastly, we propose a simple method to mitigate over-reliance on parametric knowledge, which minimizes hallucination, and improves out-of-distribution generalization by 4%-7%. Our findings demonstrate the importance for practitioners to evaluate model tendency to hallucinate rather than read, and show that our mitigation strategy encourages generalization to evolving information (i.e., time-dependent queries). To encourage these practices, we have released our framework for generating knowledge conflicts.
    Kernel PCA with the Nystr\"om method. (arXiv:2109.05578v1 [stat.ML])
    (0 min) Kernel methods are powerful but computationally demanding techniques for non-linear learning. A popular remedy, the Nystr\"om method has been shown to be able to scale up kernel methods to very large datasets with little loss in accuracy. However, kernel PCA with the Nystr\"om method has not been widely studied. In this paper we derive kernel PCA with the Nystr\"om method and study its accuracy, providing a finite-sample confidence bound on the difference between the Nystr\"om and standard empirical reconstruction errors. The behaviours of the method and bound are illustrated through extensive computer experiments on real-world data. As an application of the method we present kernel principal component regression with the Nystr\"om method.
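    A minimal sketch of the Nyström idea applied to kernel PCA: eigendecompose the kernel on m landmark points, use it to build an approximate feature map for all n points, and run ordinary PCA on those features. The RBF kernel, landmark count, and the approximate handling of centering are simplifying assumptions, not the paper's exact construction:

        import numpy as np

        def rbf(A, B, gamma=0.5):
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * sq)

        def nystrom_kpca(X, m=50, k=2, gamma=0.5, seed=0):
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(X), size=m, replace=False)   # landmark points
            W = rbf(X[idx], X[idx], gamma)                    # m x m kernel block
            C = rbf(X, X[idx], gamma)                         # n x m cross-kernel
            evals, evecs = np.linalg.eigh(W)
            evals = np.clip(evals, 1e-12, None)
            # Approximate feature map F with F F^T = C W^{-1} C^T ~ K.
            F = C @ evecs / np.sqrt(evals)
            F = F - F.mean(axis=0)                            # approximate centering
            U, S, _ = np.linalg.svd(F, full_matrices=False)
            return U[:, :k] * S[:k]                           # top-k kernel principal components

        X = np.random.default_rng(1).normal(size=(500, 10))
        print(nystrom_kpca(X).shape)   # (500, 2)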
    An Unsupervised Deep-Learning Method for Fingerprint Classification: the CCAE Network and the Hybrid Clustering Strategy. (arXiv:2109.05526v1 [cs.CV])
    (0 min) Fingerprint classification is an important and effective way to speed up fingerprint matching and improve its accuracy. Conventional supervised methods need a large amount of pre-labeled data and thus consume immense human resources. In this paper, we propose a new and efficient unsupervised deep learning method that can extract fingerprint features and classify fingerprint patterns automatically. In this approach, a new model named constraint convolutional auto-encoder (CCAE) is used to extract fingerprint features and a hybrid clustering strategy is applied to obtain the final clusters. A set of experiments on the NIST-DB4 dataset shows that the proposed unsupervised method exhibits efficient performance on fingerprint classification. For example, the CCAE achieves an accuracy of 97.3% on only 1000 unlabeled fingerprints in NIST-DB4.
    Graph Attention Network Based Single-Pixel Compressive Direction of Arrival Estimation. (arXiv:2109.05466v1 [eess.SP])
    (0 min) In this paper, we present a single-pixel compressive direction of arrival (DoA) estimation technique leveraging a graph attention network (GAT) based deep-learning framework. The physical layer compression is achieved using a coded-aperture technique, probing the spectrum of far-field sources incident on the aperture using a set of spatio-temporally incoherent modes. This information is then encoded and compressed into the channel of the coded-aperture. The coded-aperture-based receiver has a single channel, replacing the conventional multichannel raster-scan-based solutions for DoA estimation. The GAT network enables the compressive DoA estimation framework to learn the DoA information directly from the measurements acquired using the coded-aperture. This step eliminates the need for an additional reconstruction step and significantly simplifies the processing layer to obtain the DoA estimate. We show that the presented GAT-integrated single-pixel radar framework can retrieve high-fidelity DoA information even under relatively low signal-to-noise ratio (SNR) levels.
    Gradients and Subgradients of Buffered Failure Probability. (arXiv:2109.05391v1 [math.OC])
    (0 min) Gradients and subgradients are central to optimization and sensitivity analysis of buffered failure probabilities. We furnish a characterization of subgradients based on subdifferential calculus in the case of finite probability distributions and, under additional assumptions, also a gradient expression for general distributions. Several examples illustrate the application of the results, especially in the context of optimality conditions.
    Back to the Basics: A Quantitative Analysis of Statistical and Graph-Based Term Weighting Schemes for Keyword Extraction. (arXiv:2104.08028v2 [cs.LG] UPDATED)
    (0 min) Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf as a default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings such as the advantages of the less-known lexical specificity with respect to tf-idf, or the qualitative differences between statistical and graph-based methods. Finally, based on our findings we discuss and devise some suggestions for practitioners. Source code to reproduce our experimental results, including a keyword extraction library, is available in the following repository: https://github.com/asahi417/kex
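    As a point of reference, the tf-idf default that the paper compares against can be written in a few lines: score each candidate term of a document by its term frequency times inverse document frequency. The toy corpus and smoothing choice below are assumptions:

        import math
        from collections import Counter

        corpus = [
            "graph neural networks for node classification",
            "term weighting schemes for keyword extraction",
            "keyword extraction with graph based term weighting",
        ]

        def tfidf_scores(doc_idx, corpus):
            docs = [d.split() for d in corpus]
            tf = Counter(docs[doc_idx])
            n_docs = len(docs)
            scores = {}
            for term, count in tf.items():
                df = sum(term in d for d in docs)               # document frequency
                idf = math.log((1 + n_docs) / (1 + df)) + 1.0   # smoothed idf
                scores[term] = (count / len(docs[doc_idx])) * idf
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        print(tfidf_scores(2, corpus)[:3])   # top-3 candidate keywords of document 2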
    Instance-Conditioned GAN. (arXiv:2109.05070v1 [cs.CV])
    (0 min) Generative Adversarial Networks (GANs) can generate near photo-realistic images in narrow domains such as human faces. Yet, modeling complex distributions of datasets such as ImageNet and COCO-Stuff remains challenging in unconditional settings. In this paper, we take inspiration from kernel density estimation techniques and introduce a non-parametric approach to modeling distributions of complex datasets. We partition the data manifold into a mixture of overlapping neighborhoods described by a datapoint and its nearest neighbors, and introduce a model, called instance-conditioned GAN (IC-GAN), which learns the distribution around each datapoint. Experimental results on ImageNet and COCO-Stuff show that IC-GAN significantly improves over unconditional models and unsupervised data partitioning baselines. Moreover, we show that IC-GAN can effortlessly transfer to datasets not seen during training by simply changing the conditioning instances, and still generate realistic images. Finally, we extend IC-GAN to the class-conditional case and show semantically controllable generation and competitive quantitative results on ImageNet, while improving over BigGAN on ImageNet-LT. We will open-source our code and trained models to reproduce the reported results.
    Co-Correcting: Noise-tolerant Medical Image Classification via mutual Label Correction. (arXiv:2109.05159v1 [eess.IV])
    (0 min) With the development of deep learning, medical image classification has been significantly improved. However, deep learning requires massive data with labels. While labeling the samples by human experts is expensive and time-consuming, collecting labels from crowd-sourcing suffers from noise, which may degrade the accuracy of classifiers. Therefore, approaches that can effectively handle label noise are highly desired. Unfortunately, recent progress on handling label noise in deep learning has gone largely unnoticed by the medical imaging community. To fill the gap, this paper proposes a noise-tolerant medical image classification framework named Co-Correcting, which significantly improves classification accuracy and obtains more accurate labels through dual-network mutual learning, label probability estimation, and curriculum label correcting. On two representative medical image datasets and the MNIST dataset, we test six recent Learning-with-Noisy-Labels methods and conduct comparative studies. The experiments show that Co-Correcting achieves the best accuracy and generalization under different noise ratios in various tasks. Our project can be found at: https://github.com/JiarunLiu/Co-Correcting.
    Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks. (arXiv:2103.04886v3 [cs.LG] UPDATED)
    (0 min) Attention based neural networks are state of the art in a large range of applications. However, their performance tends to degrade when the number of layers increases. In this work, we show that enforcing Lipschitz continuity by normalizing the attention scores can significantly improve the performance of deep attention models. First, we show that, for deep graph attention networks (GAT), gradient explosion appears during training, leading to poor performance of gradient-based training algorithms. To address this issue, we derive a theoretical analysis of the Lipschitz continuity of attention modules and introduce LipschitzNorm, a simple and parameter-free normalization for self-attention mechanisms that enforces the model to be Lipschitz continuous. We then apply LipschitzNorm to GAT and Graph Transformers and show that their performance is substantially improved in the deep setting (10 to 30 layers). More specifically, we show that a deep GAT model with LipschitzNorm achieves state of the art results for node label prediction tasks that exhibit long-range dependencies, while showing consistent improvements over their unnormalized counterparts in benchmark node classification tasks.
    RealFormer: Transformer Likes Residual Attention. (arXiv:2012.11747v3 [cs.LG] UPDATED)
    (0 min) Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
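    The core mechanism is to carry the raw (pre-softmax) attention scores of the previous layer forward and add them to the current layer's scores. Below is a minimal single-head PyTorch sketch of that residual edge; the dimensions, absence of masking, and single-head setting are simplifying assumptions rather than the full RealFormer architecture:

        import math
        import torch

        def residual_attention(q, k, v, prev_scores=None):
            # Single-head attention with a residual connection on the raw scores.
            # q, k, v: tensors of shape (batch, seq, dim).
            scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
            if prev_scores is not None:
                scores = scores + prev_scores      # residual edge on attention scores
            out = torch.softmax(scores, dim=-1) @ v
            return out, scores                     # pass scores on to the next layer

        q = k = v = torch.randn(2, 5, 16)
        out1, s1 = residual_attention(q, k, v)
        out2, s2 = residual_attention(q, k, v, prev_scores=s1)
        print(out2.shape, s2.shape)                # (2, 5, 16) and (2, 5, 5)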
    Controlled Gaussian Process Dynamical Models with Application to Robotic Cloth Manipulation. (arXiv:2103.06615v2 [cs.RO] UPDATED)
    (0 min) Over recent years, robotic cloth manipulation has gained relevance within the research community. While significant advances have been made in robotic manipulation of rigid objects, the manipulation of non-rigid objects such as cloth garments is still a challenging problem. The uncertainty in how cloth behaves often requires the use of model-based approaches. However, cloth models have a very high dimensionality. Therefore, it is difficult to find a middle ground between providing a manipulator with a dynamics model of cloth and working with a state space of tractable dimensionality. For this reason, most cloth manipulation approaches in the literature perform static or quasi-static manipulation. In this paper, we propose a variation of Gaussian Process Dynamical Models (GPDMs) to model cloth dynamics in a low-dimensional manifold. GPDMs project a high-dimensional state space into a lower-dimensional latent space which is capable of preserving the dynamic properties. Using such an approach, we add control variables to the original formulation. In this way, it is possible to take into account the robot commands exerted on the cloth dynamics. We call this new version the Controlled Gaussian Process Dynamical Model (CGPDM). Moreover, we propose an alternative parametric structure for the model that is richer than the one employed in previous GPDM realizations. The modeling capacity of our proposal has been tested in both a simulated and a real scenario, where CGPDM proved capable of generalizing over a wide range of movements and correctly predicting the cloth motions obtained from previously unseen sequences of control actions.
    Direct Random Search for Fine Tuning of Deep Reinforcement Learning Policies. (arXiv:2109.05604v1 [cs.RO])
    (0 min) Researchers have demonstrated that Deep Reinforcement Learning (DRL) is a powerful tool for finding policies that perform well on complex robotic systems. However, these policies are often unpredictable and can induce highly variable behavior when evaluated with only slightly different initial conditions. Training considerations constrain DRL algorithm designs in that most algorithms must use stochastic policies during training. The resulting policy used during deployment, however, can and frequently is a deterministic one that uses the Maximum Likelihood Action (MLA) at each step. In this work, we show that a direct random search is very effective at fine-tuning DRL policies by directly optimizing them using deterministic rollouts. We illustrate this across a large collection of reinforcement learning environments, using a wide variety of policies obtained from different algorithms. Our results show that this method yields more consistent and higher performing agents on the environments we tested. Furthermore, we demonstrate how this method can be used to extend our previous work on shrinking the dimensionality of the reachable state space of closed-loop systems run under Deep Neural Network (DNN) policies.
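    The basic loop is easy to sketch: perturb the flattened policy parameters, score each candidate with deterministic rollouts, and keep only improvements. In the toy sketch below a quadratic function stands in for the average return of deterministic rollouts, and the perturbation scale and iteration budget are assumptions:

        import numpy as np

        rng = np.random.default_rng(0)

        def deterministic_return(theta):
            # Stand-in for the average return of deterministic rollouts under policy theta.
            return -np.sum((theta - 1.0) ** 2)

        def direct_random_search(theta, sigma=0.1, iters=2000):
            best = deterministic_return(theta)
            for _ in range(iters):
                candidate = theta + sigma * rng.normal(size=theta.shape)
                score = deterministic_return(candidate)
                if score > best:                   # keep only improving perturbations
                    theta, best = candidate, score
            return theta, best

        theta0 = rng.normal(size=8)                # e.g. flattened policy weights
        theta, best = direct_random_search(theta0)
        print(round(best, 4))                      # approaches 0 as theta approaches 1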
    Machine Learning Interpretability Meets TLS Fingerprinting. (arXiv:2011.06304v2 [cs.NI] UPDATED)
    (0 min) Protecting users' privacy over the Internet is of great importance; however, it becomes harder and harder to maintain due to the increasing complexity of network protocols and components. Therefore, investigating and understanding how data is leaked from information transmission platforms and protocols can lead us to a more secure environment. In this paper, we propose a framework to systematically find the most vulnerable information fields in a network protocol. To this end, focusing on the transport layer security (TLS) protocol, we perform different machine-learning-based fingerprinting attacks on the collected data from more than 70 domains (websites) to understand how and where this information leakage occurs in the TLS protocol. Then, by employing the interpretation techniques developed in the machine learning community and applying our framework, we find the most vulnerable information fields in the TLS protocol. Our findings demonstrate that the TLS handshake (which is mainly unencrypted), the TLS record length appearing in the TLS application data header, and the initialization vector (IV) field are, respectively, among the most critical sources of leakage in this protocol.
    A decreasing scaling transition scheme from Adam to SGD. (arXiv:2106.06749v2 [cs.LG] UPDATED)
    (2 min) The adaptive gradient algorithm (AdaGrad) and its variants, such as RMSProp, Adam, and AMSGrad, have been widely used in deep learning. Although these algorithms are faster in the early phase of training, their generalization performance is often not as good as that of stochastic gradient descent (SGD). Hence, a trade-off method that transforms Adam to SGD after a certain iteration to gain the merits of both algorithms is theoretically and practically significant. To that end, we propose a decreasing scaling transition scheme to achieve a smooth and stable transition from Adam to SGD, which we call DSTAdam. The convergence of the proposed DSTAdam is also proved in an online convex setting. Finally, the effectiveness of DSTAdam is verified on the CIFAR-10/100 datasets. Our implementation is available at: https://github.com/kunzeng/DSTAdam.
    Variational Graph Normalized Auto-Encoders. (arXiv:2108.08046v2 [cs.LG] UPDATED)
    (2 min) Link prediction is one of the key problems for graph-structured data. With the advancement of graph neural networks, graph autoencoders (GAEs) and variational graph autoencoders (VGAEs) have been proposed to learn graph embeddings in an unsupervised way. It has been shown that these methods are effective for link prediction tasks. However, they do not work well for link prediction when a node whose degree is zero (i.e., an isolated node) is involved. We have found that GAEs/VGAEs make embeddings of isolated nodes close to zero regardless of their content features. In this paper, we propose a novel Variational Graph Normalized AutoEncoder (VGNAE) that utilizes L2-normalization to derive better embeddings for isolated nodes. We show that our VGNAEs outperform the existing state-of-the-art models for link prediction tasks. The code is available at https://github.com/SeongJinAhn/VGNAE.
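    The key change relative to a plain GAE encoder is to L2-normalize the transformed node features (scaled by a constant) so that isolated nodes no longer collapse towards zero. A minimal PyTorch sketch of such a normalized encoder layer; the scale value and the omission of the graph propagation step are simplifying assumptions:

        import torch
        import torch.nn.functional as F

        class NormalizedEncoderLayer(torch.nn.Module):
            # Linear transform followed by L2 normalization: every node embedding,
            # including those of isolated nodes, ends up with the same norm `scale`.
            def __init__(self, in_dim, out_dim, scale=1.8):
                super().__init__()
                self.lin = torch.nn.Linear(in_dim, out_dim)
                self.scale = scale

            def forward(self, x):
                z = self.lin(x)
                return self.scale * F.normalize(z, p=2, dim=-1)

        x = torch.randn(4, 16)                 # 4 nodes, 16-dimensional features
        layer = NormalizedEncoderLayer(16, 8)
        print(layer(x).norm(dim=-1))           # all rows have norm == scale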
    A physics-informed variational DeepONet for predicting the crack path in brittle materials. (arXiv:2108.06905v2 [cs.LG] UPDATED)
    (2 min) Failure trajectories, identifying the probable failure zones, and damage statistics are some of the key quantities of relevance in brittle fracture applications. High-fidelity numerical solvers that reliably estimate these relevant quantities exist, but they are computationally demanding, requiring a high resolution of the crack. Moreover, independent intensive simulations need to be carried out even for a small change in domain parameters and/or material properties. Therefore, fast and generalizable surrogate models are needed to alleviate the computational burden, but the discontinuous nature of fracture mechanics presents a major challenge to developing such models. We propose a physics-informed variational formulation of DeepONet (V-DeepONet) for brittle fracture analysis. V-DeepONet is trained to map the initial configuration of the defect to the relevant fields of interest (e.g., damage and displacement fields). Once the network is trained, the entire global solution can be rapidly obtained for any initial crack configuration and loading steps on that domain. While the original DeepONet is solely data-driven, we take a different path to train the V-DeepONet by imposing the governing equations in variational form, and we also use some labelled data. We demonstrate the effectiveness of V-DeepONet through two benchmarks of brittle fracture, and we verify its accuracy using results from high-fidelity solvers. Encoding the physical laws and also some data to train the network renders the surrogate model capable of accurately performing both interpolation and extrapolation tasks, considering that fracture modeling is very sensitive to fluctuations. The proposed hybrid training of V-DeepONet is superior to state-of-the-art methods and can be applied to a wide array of dynamical systems with complex responses.
    AdViCE: Aggregated Visual Counterfactual Explanations for Machine Learning Model Validation. (arXiv:2109.05629v1 [cs.HC])
    (2 min) Rapid improvements in the performance of machine learning models have pushed them to the forefront of data-driven decision-making. Meanwhile, the increased integration of these models into various application domains has further highlighted the need for greater interpretability and transparency. To identify problems such as bias, overfitting, and incorrect correlations, data scientists require tools that explain the mechanisms with which these model decisions are made. In this paper we introduce AdViCE, a visual analytics tool that aims to guide users in black-box model debugging and validation. The solution rests on two main visual user interface innovations: (1) an interactive visualization design that enables the comparison of decisions on user-defined data subsets; (2) an algorithm and visual design to compute and visualize counterfactual explanations - explanations that depict model outcomes when data features are perturbed from their original values. We provide a demonstration of the tool through a use case that showcases the capabilities and potential limitations of the proposed approach.

2021-09-13

  • cs.CL updates on arXiv.org

    Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning. (arXiv:2104.08799v2 [cs.CL] UPDATED)
    (2 min) Aiming to generate a set of keyphrases, Keyphrase Generation (KG) is a classical task for capturing the central idea of a given document. Based on Seq2Seq models, the previous reinforcement learning framework for KG tasks utilizes evaluation metrics to further improve well-trained neural models. However, these KG evaluation metrics such as $F_1@5$ and $F_1@M$ are only aware of the exact correctness of predictions at the phrase level and ignore the semantic similarities between similar predictions and targets, which inhibits the model from learning deep linguistic patterns. In response to this problem, we propose a new fine-grained evaluation metric to improve the RL framework, which considers different granularities: token-level $F_1$ score, edit distance, duplication, and prediction quantities. On the whole, the new framework includes two reward functions: the fine-grained evaluation score and the vanilla $F_1$ score. This framework helps the model identify partial-match phrases which can be further optimized into exact matches. Experiments on KG benchmarks show that our proposed training framework outperforms the previous RL training frameworks on all evaluation scores. In addition, our method can effectively ease the synonym problem and generate higher-quality predictions. The source code is available at \url{https://github.com/xuyige/FGRL4KG}.
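    The token-level $F_1$ component of the fine-grained reward can be sketched directly; the combination with edit distance, duplication, and prediction-quantity terms follows the paper and is not reproduced here:

        from collections import Counter

        def token_f1(prediction, target):
            # Token-level F1 between a predicted keyphrase and a target keyphrase.
            pred, gold = prediction.split(), target.split()
            overlap = sum((Counter(pred) & Counter(gold)).values())
            if overlap == 0:
                return 0.0
            precision, recall = overlap / len(pred), overlap / len(gold)
            return 2 * precision * recall / (precision + recall)

        # A partial match earns a graded reward instead of the 0 given by exact-match F1.
        print(token_f1("neural keyphrase generation", "keyphrase generation"))   # 0.8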
    SimCSE: Simple Contrastive Learning of Sentence Embeddings. (arXiv:2104.08821v3 [cs.CL] UPDATED)
    (2 min) This paper presents SimCSE, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework, by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to previous best results. We also show -- both theoretically and empirically -- that contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.
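    The unsupervised objective amounts to in-batch contrastive learning where the two "views" of each sentence come from two independent dropout passes through the same encoder. A minimal PyTorch sketch with a toy encoder standing in for BERT; the encoder, temperature, and batch are assumptions:

        import torch
        import torch.nn.functional as F

        torch.manual_seed(0)

        # Toy encoder standing in for BERT; dropout is the only source of noise.
        encoder = torch.nn.Sequential(
            torch.nn.Linear(32, 64), torch.nn.ReLU(),
            torch.nn.Dropout(p=0.1), torch.nn.Linear(64, 16),
        )

        def simcse_unsup_loss(x, temperature=0.05):
            # Encode the same batch twice; dropout makes the two passes differ.
            z1, z2 = encoder(x), encoder(x)
            sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
            labels = torch.arange(x.size(0))       # the positive pair sits on the diagonal
            return F.cross_entropy(sim, labels)

        batch = torch.randn(8, 32)                 # stand-in for pooled sentence inputs
        loss = simcse_unsup_loss(batch)
        loss.backward()
        print(float(loss))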
    Linguistic Dependencies and Statistical Dependence. (arXiv:2104.08685v2 [cs.CL] UPDATED)
    (2 min) Are pairs of words that tend to occur together also likely to stand in a linguistic dependency? This empirical question is motivated by a long history of literature in cognitive science, psycholinguistics, and NLP. In this work we contribute an extensive analysis of the relationship between linguistic dependencies and statistical dependence between words. Improving on previous work, we introduce the use of large pretrained language models to compute contextualized estimates of the pointwise mutual information between words (CPMI). For multiple models and languages, we extract dependency trees which maximize CPMI, and compare to gold standard linguistic dependencies. Overall, we find that CPMI dependencies achieve an unlabelled undirected attachment score of at most $\approx 0.5$. While far above chance, and consistently above a non-contextualized PMI baseline, this score is generally comparable to a simple baseline formed by connecting adjacent words. We analyze which kinds of linguistic dependencies are best captured in CPMI dependencies, and also find marked differences between the estimates of the large pretrained language models, illustrating how their different training schemes affect the type of dependencies they capture.
    Detect and Classify -- Joint Span Detection and Classification for Health Outcomes. (arXiv:2104.07789v2 [cs.CL] UPDATED)
    (2 min) A health outcome is a measurement or an observation used to capture and assess the effect of a treatment. Automatic detection of health outcomes from text would undoubtedly speed up access to evidence necessary in healthcare decision making. Prior work on outcome detection has modelled this task as either (a) a sequence labelling task, where the goal is to detect which text spans describe health outcomes, or (b) a classification task, where the goal is to classify a text into a pre-defined set of categories depending on an outcome that is mentioned somewhere in that text. However, this decoupling of span detection and classification is problematic from a modelling perspective and ignores global structural correspondences between sentence-level and word-level information present in a given text. To address this, we propose a method that uses both word-level and sentence-level information to simultaneously perform outcome span detection and outcome type classification. In addition to injecting contextual information to hidden vectors, we use label attention to appropriately weight both word and sentence level information. Experimental results on several benchmark datasets for health outcome detection show that our proposed method consistently outperforms decoupled methods, reporting competitive results.
    Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little. (arXiv:2104.06644v2 [cs.CL] UPDATED)
    (2 min) A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks -- including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
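    The word-order ablation itself is a very small augmentation step; a minimal sketch that destroys the order of the tokens in a sentence while keeping its bag of words (the exact shuffling granularity used in the paper may differ):

        import random

        def shuffle_word_order(sentence, seed=None):
            # Destroy word order within a sentence while keeping its bag of words.
            rng = random.Random(seed)
            tokens = sentence.split()
            rng.shuffle(tokens)
            return " ".join(tokens)

        print(shuffle_word_order("the cat sat on the mat", seed=0))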
    MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment. (arXiv:2109.04870v1 [cs.CL])
    (2 min) Readability assessment is the task of determining how difficult or easy a text is, or which level/grade it has. Traditionally, language-dependent readability formulae have been used, but these formulae take few text characteristics into account. However, Natural Language Processing (NLP) tools that assess the complexity of texts are able to measure many more features and can be adapted to different languages. In this paper, we present the MultiAzterTest tool: (i) an open-source NLP tool which analyzes texts on over 125 measures of cohesion, language, and readability for English, Spanish and Basque, but whose architecture is designed to be easily adapted to other languages; (ii) readability assessment classifiers that improve the performance of Coh-Metrix in English, Coh-Metrix-Esp in Spanish and ErreXail in Basque; (iii) a web tool. MultiAzterTest obtains 90.09% accuracy when classifying into three reading levels (elementary, intermediate, and advanced) in English, and 95.50% in Basque and 90% in Spanish when classifying into two reading levels (simple and complex) using an SMO classifier. Using cross-lingual features, MultiAzterTest also obtains competitive results, above all in the complex vs. simple distinction.
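    One of the traditional language-dependent formulae that such tools are contrasted with is the Flesch Reading Ease score, which uses only average sentence length and syllables per word. A minimal sketch with a naive vowel-group syllable counter (the syllable heuristic is an English-specific assumption):

        import re

        def count_syllables(word):
            # Crude heuristic: count groups of consecutive vowels.
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def flesch_reading_ease(text):
            sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
            words = re.findall(r"[A-Za-z]+", text)
            syllables = sum(count_syllables(w) for w in words)
            return (206.835
                    - 1.015 * (len(words) / len(sentences))
                    - 84.6 * (syllables / len(words)))

        print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))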
    Universal Sentence Representation Learning with Conditional Masked Language Model. (arXiv:2012.14388v3 [cs.CL] UPDATED)
    (2 min) This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g. 10% improvement upon baseline models on cross-lingual semantic search. We explore the same language bias of the learned representations, and propose a simple, post-training and model agnostic approach to remove the language identifying information from the representation while still retaining sentence semantics.
    XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment. (arXiv:2104.08597v2 [cs.CL] UPDATED)
    (2 min) Cross-lingual named-entity lexica are an important resource to multilingual NLP tasks such as machine translation and cross-lingual wikification. While knowledge bases contain a large number of entities in high-resource languages such as English and French, corresponding entities for lower-resource languages are often missing. To address this, we propose Lexical-Semantic-Phonetic Align (LSP-Align), a technique to automatically mine cross-lingual entity lexica from mined web data. We demonstrate LSP-Align outperforms baselines at extracting cross-lingual entity pairs and mine 164 million entity pairs from 120 different languages aligned with English. We release these cross-lingual entity pairs along with the massively multilingual tagged named entity corpus as a resource to the NLP community.
    AutoTriggER: Named Entity Recognition with Auxiliary Trigger Extraction. (arXiv:2109.04726v1 [cs.CL])
    (2 min) Deep neural models for low-resource named entity recognition (NER) have shown impressive results by leveraging distant supervision or other meta-level information (e.g. explanations). However, the costs of acquiring such additional information are generally prohibitive, especially in domains where existing resources (e.g. databases to be used for distant supervision) may not exist. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging "entity triggers", which are essentially human-readable clues in the text that can help guide the model to make better decisions. Thus, the framework is able to both create and leverage auxiliary supervision by itself. Through experiments on three well-studied NER datasets, we show that our automatically extracted triggers are well-matched to human triggers, and that AutoTriggER improves performance over a RoBERTa-CRF architecture by nearly 0.5 F1 points on average and much more in a low-resource setting.
    SummerTime: Text Summarization Toolkit for Non-experts. (arXiv:2108.12738v2 [cs.CL] UPDATED)
    (2 min) Recent advances in summarization provide models that can generate summaries of higher quality. Such models now exist for a number of summarization tasks, including query-based summarization, dialogue summarization, and multi-document summarization. While such models and tasks are rapidly growing in the research field, it has also become challenging for non-experts to keep track of them. To make summarization methods more accessible to a wider audience, we develop SummerTime by rethinking the summarization task from the perspective of an NLP non-expert. SummerTime is a complete toolkit for text summarization, including various models, datasets and evaluation metrics, for a full spectrum of summarization-related tasks. SummerTime integrates with libraries designed for NLP researchers, and enables users with easy-to-use APIs. With SummerTime, users can locate pipeline solutions and search for the best model with their own data, and visualize the differences, all with a few lines of code. We also provide explanations for models and evaluation metrics to help users understand the model behaviors and select models that best suit their needs. Our library, along with a notebook demo, is available at https://github.com/Yale-LILY/SummerTime.
    Visual Goal-Step Inference using wikiHow. (arXiv:2104.05845v2 [cs.CV] UPDATED)
    (2 min) Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
    Contrastive Explanations for Model Interpretability. (arXiv:2103.01378v2 [cs.CL] UPDATED)
    (2 min) Contrastive explanations clarify why an event occurred in contrast to another. They are inherently more intuitive for humans to both produce and comprehend. We propose a methodology to produce contrastive explanations for classification models by modifying the representation to disregard non-contrastive information, and modifying model behavior to only be based on contrastive reasoning. Our method is based on projecting the model representation to a latent space that captures only the features that are useful (to the model) to differentiate two potential decisions. We demonstrate the value of contrastive explanations by analyzing two different scenarios, using both high-level abstract concept attribution and low-level input token/span attribution, on two widely used text classification tasks. Specifically, we produce explanations for answering: for which label, and against which alternative label, is some aspect of the input useful? And which aspects of the input are useful for and against particular decisions? Overall, our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decisions.
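    For intuition only, the sketch below shows one simplified way to realize a label-contrastive projection: keeping only the component of a hidden representation that lies along the direction separating a label from a foil label. The weight vectors and dimensionality are invented for illustration; this is not the paper's exact construction.

    ```python
    # Simplified label-contrastive projection: retain only the component of a
    # representation that separates label y from a foil label y', using the
    # classifier's weight vectors (illustrative, not the authors' method).
    import numpy as np

    def contrastive_projection(h: np.ndarray, w_y: np.ndarray, w_foil: np.ndarray) -> np.ndarray:
        """Project hidden state h onto the direction that contrasts y with the foil."""
        d = w_y - w_foil                      # contrastive direction
        d = d / np.linalg.norm(d)
        return (h @ d) * d                    # component of h along that direction

    h = np.random.randn(768)                  # a hypothetical hidden representation
    w_y, w_foil = np.random.randn(768), np.random.randn(768)
    h_contrastive = contrastive_projection(h, w_y, w_foil)
    ```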
    Rule-based Morphological Inflection Improves Neural Terminology Translation. (arXiv:2109.04620v1 [cs.CL])
    (2 min) Current approaches to incorporating terminology constraints in machine translation (MT) typically assume that the constraint terms are provided in their correct morphological forms. This limits their application to real-world scenarios where constraint terms are provided as lemmas. In this paper, we introduce a modular framework for incorporating lemma constraints in neural MT (NMT) in which linguistic knowledge and diverse types of NMT models can be flexibly applied. It is based on a novel cross-lingual inflection module that inflects the target lemma constraints based on the source context. We explore linguistically motivated rule-based and data-driven neural-based inflection modules and design English-German health and English-Lithuanian news test suites to evaluate them in domain adaptation and low-resource MT settings. Results show that our rule-based inflection module helps NMT models incorporate lemma constraints more accurately than a neural module and outperforms the existing end-to-end approach with lower training costs.
    Active learning for reducing labeling effort in text classification tasks. (arXiv:2109.04847v1 [cs.CL])
    (2 min) Labeling data can be an expensive task as it is usually performed manually by domain experts. This is cumbersome for deep learning, as it depends on large labeled datasets. Active learning (AL) is a paradigm that aims to reduce labeling effort by only using the data which the model deems most informative. Little research has been done on AL in a text classification setting, and next to none has involved the more recent, state-of-the-art NLP models. Here, we present an empirical study that compares different uncertainty-based algorithms with BERT$_{base}$ as the classifier. We evaluate the algorithms on two NLP classification datasets: Stanford Sentiment Treebank and KvK-Frontpages. Additionally, we explore heuristics that aim to solve presupposed problems of uncertainty-based AL, namely that it is unscalable and that it is prone to selecting outliers. Furthermore, we explore the influence of the query-pool size on the performance of AL. While the proposed heuristics did not improve the performance of AL, our results show that using uncertainty-based AL with BERT$_{base}$ outperforms random sampling of data. This difference in performance can decrease as the query-pool size grows.
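    For readers unfamiliar with uncertainty-based AL, a minimal sketch of least-confidence pool selection is shown below; the mock probability matrix stands in for any classifier's predictions over an unlabeled pool and is not tied to the paper's setup.

    ```python
    # Sketch of uncertainty-based pool selection (least confidence), assuming a
    # probability matrix from any probabilistic classifier over an unlabeled pool.
    import numpy as np

    def select_queries(probs: np.ndarray, query_pool_size: int) -> np.ndarray:
        """Pick the indices of the examples the model is least confident about."""
        confidence = probs.max(axis=1)               # top predicted probability
        return np.argsort(confidence)[:query_pool_size]

    # Example with mock predictions over a pool of 5 unlabeled examples.
    probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.51, 0.49], [0.99, 0.01]])
    print(select_queries(probs, query_pool_size=2))   # -> the two most uncertain indices
    ```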
    Bursting Scientific Filter Bubbles: Boosting Innovation via Novel Author Discovery. (arXiv:2108.05669v2 [cs.DL] UPDATED)
    (2 min) Isolated silos of scientific research and the growing challenge of information overload limit awareness across the literature and hinder innovation. Algorithmic curation and recommendation, which often prioritize relevance, can further reinforce these informational "filter bubbles." In response, we describe Bridger, a system for facilitating discovery of scholars and their work, to explore design tradeoffs between relevant and novel recommendations. We construct a faceted representation of authors with information gleaned from their papers and inferred author personas, and use it to develop an approach that locates commonalities ("bridges") and contrasts between scientists -- retrieving partially similar authors rather than aiming for strict similarity. In studies with computer science researchers, this approach helps users discover authors considered useful for generating novel research directions, outperforming a state-of-art neural model. In addition to recommending new content, we also demonstrate an approach for displaying it in a manner that boosts researchers' ability to understand the work of authors with whom they are unfamiliar. Finally, our analysis reveals that Bridger connects authors who have different citation profiles, publish in different venues, and are more distant in social co-authorship networks, raising the prospect of bridging diverse communities and facilitating discovery.
    Zero-Shot Dialogue State Tracking via Cross-Task Transfer. (arXiv:2109.04655v1 [cs.CL])
    (2 min) Zero-shot transfer learning for dialogue state tracking (DST) enables us to handle a variety of task-oriented dialogue domains without the expense of collecting in-domain data. In this work, we propose to transfer the \textit{cross-task} knowledge from general question answering (QA) corpora for the zero-shot DST task. Specifically, we propose TransferQA, a transferable generative QA model that seamlessly combines extractive QA and multi-choice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle "none" value slots in the zero-shot DST setting. The extensive experiments show that our approaches substantially improve the existing zero-shot and few-shot results on MultiWoz. Moreover, compared to the fully trained baseline on the Schema-Guided Dialogue dataset, our approach shows better generalization ability in unseen domains.
    Heterogeneous Graph Neural Networks for Keyphrase Generation. (arXiv:2109.04703v1 [cs.CL])
    (2 min) The encoder-decoder framework achieves state-of-the-art results in keyphrase generation (KG) tasks by predicting both present keyphrases that appear in the source document and absent keyphrases that do not. However, relying solely on the source document can result in generating uncontrollable and inaccurate absent keyphrases. To address these problems, we propose a novel graph-based method that can capture explicit knowledge from related references. Our model first retrieves some document-keyphrases pairs similar to the source document from a pre-defined index as references. Then a heterogeneous graph is constructed to capture relationships of different granularities between the source document and its references. To guide the decoding process, a hierarchical attention and copy mechanism is introduced, which directly copies appropriate words from both the source document and its references based on their relevance and significance. The experimental results on multiple KG benchmarks show that the proposed model achieves significant improvements against other baseline models, especially with regard to the absent keyphrase prediction.
    COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval. (arXiv:2010.12800v2 [cs.CL] UPDATED)
    (2 min) We present a large, challenging dataset, COUGH, for COVID-19 FAQ retrieval. Similar to a standard FAQ dataset, COUGH consists of three parts: FAQ Bank, Query Bank and Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce Query Bank and Relevance Set, where the former contains 1,236 human-paraphrased queries while the latter contains ~32 human-annotated FAQ items for each query. We analyze COUGH by testing different FAQ retrieval models built on top of BM25 and BERT, among which the best model achieves 48.8 under P@5, indicating a great challenge presented by COUGH and encouraging future research for further improvement. Our COUGH dataset is available at https://github.com/sunlab-osu/covid-faq.
    Analyzing the Surprising Variability in Word Embedding Stability Across Languages. (arXiv:2004.14876v2 [cs.CL] UPDATED)
    (0 min) Word embeddings are powerful representations that form the foundation of many natural language processing architectures, both in English and in other languages. To gain further insight into word embeddings, we explore their stability (e.g., overlap between the nearest neighbors of a word in different embedding spaces) in diverse languages. We discuss linguistic properties that are related to stability, drawing out insights about correlations with affixing, language gender systems, and other features. This has implications for embedding use, particularly in research that uses them to study language trends.
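    A minimal sketch of one such stability measure, the overlap between a word's nearest neighbors in two embedding spaces, might look as follows; the random matrices merely stand in for embeddings trained with different seeds.

    ```python
    # Sketch: stability of a word as the overlap of its k nearest neighbors
    # across two embedding spaces (e.g. trained with different random seeds).
    import numpy as np

    def nearest_neighbors(emb: np.ndarray, idx: int, k: int) -> set:
        sims = emb @ emb[idx] / (np.linalg.norm(emb, axis=1) * np.linalg.norm(emb[idx]) + 1e-9)
        order = np.argsort(-sims)
        return set(order[order != idx][:k])

    def stability(emb_a: np.ndarray, emb_b: np.ndarray, idx: int, k: int = 10) -> float:
        nn_a, nn_b = nearest_neighbors(emb_a, idx, k), nearest_neighbors(emb_b, idx, k)
        return len(nn_a & nn_b) / k

    vocab_size, dim = 1000, 50
    emb_a, emb_b = np.random.randn(vocab_size, dim), np.random.randn(vocab_size, dim)
    print(stability(emb_a, emb_b, idx=0))   # random spaces give overlap near 0
    ```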
    Pre-train or Annotate? Domain Adaptation with a Constrained Budget. (arXiv:2109.04711v1 [cs.CL])
    (2 min) Recent work has demonstrated that pre-training in-domain language models can boost performance when adapting to a new domain. However, the costs associated with pre-training raise an important question: given a fixed budget, what steps should an NLP practitioner take to maximize performance? In this paper, we study domain adaptation under budget constraints, and approach it as a customer choice problem between data annotation and pre-training. Specifically, we measure the annotation cost of three procedural text datasets and the pre-training cost of three in-domain language models. Then we evaluate the utility of different combinations of pre-training and data annotation under varying budget constraints to assess which combination strategy works best. We find that, for small budgets, spending all funds on annotation leads to the best performance; once the budget becomes large enough, a combination of data annotation and in-domain pre-training works better. We therefore suggest that task-specific data annotation should be part of an economical strategy when adapting an NLP model to a new domain.
    Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT. (arXiv:2109.04810v1 [cs.CL])
    (2 min) Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate our MoP with three biomedical BERTs (SciBERT, BioBERT, PubmedBERT) on six downstream tasks (incl. NLI, QA, Classification), and the results show that our MoP consistently enhances the underlying BERTs in task performance, and achieves new SOTA performances on five evaluated datasets.
    Probing Commonsense Explanation in Dialogue Response Generation. (arXiv:2104.09574v4 [cs.CL] UPDATED)
    (0 min) Humans use commonsense reasoning (CSR) implicitly to produce natural and coherent responses in conversations. Aiming to close the gap between current response generation (RG) models and human communication abilities, we want to understand why RG models respond as they do by probing RG models' understanding of the commonsense reasoning that elicits proper responses. We formalize the problem by framing commonsense as a latent variable in the RG task and using explanations for responses as a textual form of commonsense. We collect 6k annotated explanations justifying responses from four dialogue datasets and ask humans to verify them, and propose two probing settings to evaluate RG models' CSR capabilities. Probing results show that models fail to capture the logical relations between commonsense explanations and responses, and that fine-tuning on in-domain data and increasing model size do not lead to understanding of CSR for RG. We hope our study motivates more research into making RG models emulate the human reasoning process in pursuit of smooth human-AI communication.
    Pseudo Zero Pronoun Resolution Improves Zero Anaphora Resolution. (arXiv:2104.07425v2 [cs.CL] UPDATED)
    (0 min) Masked language models (MLMs) have contributed to drastic performance improvements with regard to zero anaphora resolution (ZAR). To further improve this approach, in this study, we made two proposals. The first is a new pretraining task that trains MLMs on anaphoric relations with explicit supervision, and the second proposal is a new finetuning method that remedies a notorious issue, the pretrain-finetune discrepancy. Our experiments on Japanese ZAR demonstrated that our two proposals boost the state-of-the-art performance, and our detailed analysis provides new insights on the remaining challenges.
    Studying word order through iterative shuffling. (arXiv:2109.04867v1 [cs.CL])
    (0 min) As neural language models approach human performance on NLP benchmark tasks, their advances are widely seen as evidence of an increasingly complex understanding of syntax. This view rests upon a hypothesis that has not yet been empirically tested: that word order encodes meaning essential to performing these tasks. We refute this hypothesis in many cases: in the GLUE suite and in various genres of English text, the words in a sentence or phrase can rarely be permuted to form a phrase carrying substantially different information. Our surprising result relies on inference by iterative shuffling (IBIS), a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model. IBIS can use any black-box model without additional training and is superior to existing word ordering algorithms. Coalescing our findings, we discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.
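    To illustrate the interface such a procedure exposes, here is a toy hill-climbing search over orderings scored by a black-box function; it is a deliberate simplification and not the IBIS algorithm itself, and the integer "tokens" and scorer are invented.

    ```python
    # Toy search for a high-scoring ordering of a bag of tokens under a
    # black-box scorer (a stand-in for a language-model log-likelihood).
    import random

    def score(tokens):
        """Toy scorer: highest when tokens appear in sorted order."""
        return -sum(abs(i - t) for i, t in enumerate(tokens))

    def hill_climb_order(tokens, steps=200, seed=0):
        rng = random.Random(seed)
        best = list(tokens)
        for _ in range(steps):
            i, j = rng.sample(range(len(best)), 2)
            cand = best[:]
            cand[i], cand[j] = cand[j], cand[i]      # propose a swap of two positions
            if score(cand) > score(best):            # keep it only if the score improves
                best = cand
        return best

    print(hill_climb_order([3, 0, 2, 1, 4]))         # typically recovers [0, 1, 2, 3, 4]
    ```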
    A Simple and Effective Method To Eliminate the Self Language Bias in Multilingual Representations. (arXiv:2109.04727v1 [cs.CL])
    (2 min) Language-agnostic and semantic-language information isolation is an emerging research direction for multilingual representation models. We explore this problem from a novel angle of geometric algebra and semantic space. A simple but highly effective method, "Language Information Removal (LIR)", factors out language identity information from semantically related components in multilingual representations pre-trained on multi-monolingual data. A post-training and model-agnostic method, LIR only uses simple linear operations, e.g. matrix factorization and orthogonal projection. LIR reveals that for weak-alignment multilingual systems, the principal components of semantic spaces primarily encode language identity information. We first evaluate LIR on a cross-lingual question answer retrieval task (LAReQA), which requires strong alignment of the multilingual embedding space. Experiments show that LIR is highly effective on this task, yielding almost 100% relative improvement in MAP for weak-alignment models. We then evaluate LIR on the Amazon Reviews and XEVAL datasets, observing that removing language information improves cross-lingual transfer performance.
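    A minimal sketch of the orthogonal-projection idea, removing the top principal directions from a set of pooled multilingual embeddings, is given below; the array shapes and number of removed components are arbitrary choices, not the paper's configuration.

    ```python
    # Sketch: remove dominant (language-identifying) directions from embeddings
    # by projecting onto the orthogonal complement of the top principal components.
    import numpy as np

    def remove_top_components(embeddings: np.ndarray, n_components: int = 1) -> np.ndarray:
        centered = embeddings - embeddings.mean(axis=0, keepdims=True)
        # Principal directions via SVD of the centered embedding matrix.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        top = vt[:n_components]                       # (n_components, dim)
        # Subtract each embedding's component along the top directions.
        return centered - centered @ top.T @ top

    emb = np.random.randn(200, 64)                    # e.g. pooled sentence vectors
    cleaned = remove_top_components(emb, n_components=2)
    ```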
    What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers. (arXiv:2109.04650v1 [cs.CL])
    (2 min) GPT-3 demonstrates the remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billions of tokens. Here we address some remaining issues less reported by the GPT-3 paper, such as a non-English LM, the performance of different-sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, a Korean 82B-parameter variant of GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean-specific tokenization, HyperCLOVA with our training configuration shows state-of-the-art in-context zero-shot and few-shot learning performance on various downstream tasks in Korean. Also, we show the performance benefits of prompt-based learning and demonstrate how it can be integrated into the prompt engineering pipeline. Then we discuss the possibility of materializing the No Code AI paradigm by providing AI prototyping capabilities to non-experts of ML through HyperCLOVA Studio, an interactive prompt engineering interface. Lastly, we demonstrate the potential of our methods with three successful in-house applications.
    Rethinking Zero-shot Neural Machine Translation: From a Perspective of Latent Variables. (arXiv:2109.04705v1 [cs.CL])
    (2 min) Zero-shot translation, directly translating between language pairs unseen in training, is a promising capability of multilingual neural machine translation (NMT). However, it usually suffers from capturing spurious correlations between the output language and language invariant semantics due to the maximum likelihood training objective, leading to poor transfer performance on zero-shot translation. In this paper, we introduce a denoising autoencoder objective based on pivot language into traditional training objective to improve the translation accuracy on zero-shot directions. The theoretical analysis from the perspective of latent variables shows that our approach actually implicitly maximizes the probability distributions for zero-shot directions. On two benchmark machine translation datasets, we demonstrate that the proposed method is able to effectively eliminate the spurious correlations and significantly outperforms state-of-the-art methods with a remarkable performance. Our code is available at https://github.com/Victorwz/zs-nmt-dae.
    Predicting emergent linguistic compositions through time: Syntactic frame extension via multimodal chaining. (arXiv:2109.04652v1 [cs.CL])
    (2 min) Natural language relies on a finite lexicon to express an unbounded set of emerging ideas. One result of this tension is the formation of new compositions, such that existing linguistic units can be combined with emerging items into novel expressions. We develop a framework that exploits the cognitive mechanisms of chaining and multimodal knowledge to predict emergent compositional expressions through time. We present the syntactic frame extension model (SFEM) that draws on the theory of chaining and knowledge from "percept", "concept", and "language" to infer how verbs extend their frames to form new compositions with existing and novel nouns. We evaluate SFEM rigorously on the 1) modalities of knowledge and 2) categorization models of chaining, in a syntactically parsed English corpus over the past 150 years. We show that multimodal SFEM predicts newly emerged verb syntax and arguments substantially better than competing models using purely linguistic or unimodal knowledge. We find support for an exemplar view of chaining as opposed to a prototype view and reveal how the joint approach of multimodal chaining may be fundamental to the creation of literal and figurative language uses including metaphor and metonymy.
    Semantic Parsing in Task-Oriented Dialog with Recursive Insertion-based Encoder. (arXiv:2109.04500v1 [cs.CL])
    (2 min) We introduce a Recursive INsertion-based Encoder (RINE), a novel approach for semantic parsing in task-oriented dialog. Our model consists of an encoder network that incrementally builds the semantic parse tree by predicting the non-terminal label and its positions in the linearized tree. At generation time, the model constructs the semantic parse tree by recursively inserting the predicted non-terminal labels at the predicted positions until termination. RINE achieves state-of-the-art exact match accuracy on low- and high-resource versions of the conversational semantic parsing benchmark TOP (Gupta et al., 2018; Chen et al., 2020), outperforming strong sequence-to-sequence models and transition-based parsers. We also show that our model design is applicable to the nested named entity recognition task, where it performs on par with the state-of-the-art approach designed for that task. Finally, we demonstrate that our approach is 2-3.5 times faster than the sequence-to-sequence model at inference time.
    Constructing Contrastive samples via Summarization for Text Classification with limited annotations. (arXiv:2104.05094v2 [cs.CL] UPDATED)
    (2 min) Contrastive learning has emerged as a powerful representation learning method and facilitates various downstream tasks, especially when supervised data is limited. How to construct efficient contrastive samples through data augmentation is key to its success. Unlike vision tasks, the data augmentation method for contrastive learning has not been investigated sufficiently in language tasks. In this paper, we propose a novel approach to construct contrastive samples for language tasks using text summarization. We use these samples for supervised contrastive learning to gain better text representations, which greatly benefits text classification tasks with limited annotations. To further improve the method, we mix up samples from different classes and add an extra regularization, named Mixsum, in addition to the cross-entropy loss. Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News, and IMDb) demonstrate the effectiveness of the proposed contrastive learning framework with summarization-based data augmentation and Mixsum regularization.
    A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systems. (arXiv:2010.12251v2 [cs.CL] UPDATED)
    (2 min) Natural Language Understanding (NLU) is an established component within a conversational AI or digital assistant system, and it is responsible for producing semantic understanding of a user request. We propose a scalable and automatic approach for improving NLU in a large-scale conversational AI system by leveraging implicit user feedback, with an insight that user interaction data and dialog context have rich information embedded from which user satisfaction and intention can be inferred. In particular, we propose a general domain-agnostic framework for curating new supervision data for improving NLU from live production traffic. With an extensive set of experiments, we show the results of applying the framework and improving NLU for a large-scale production system and show its impact across 10 domains.
    CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification. (arXiv:2109.04853v1 [cs.CL])
    (2 min) Large-Scale Multi-Label Text Classification (LMTC) includes tasks with hierarchical label spaces, such as automatic assignment of ICD-9 codes to discharge summaries. Performance of models in prior art is evaluated with standard precision, recall, and F1 measures without regard for the rich hierarchical structure. In this work we argue for hierarchical evaluation of the predictions of neural LMTC models. With the example of the ICD-9 ontology we describe a structural issue in the representation of the structured label space in prior art, and propose an alternative representation based on the depth of the ontology. We propose a set of metrics for hierarchical evaluation using the depth-based representation. We compare the evaluation scores from the proposed metrics with previously used metrics on prior art LMTC models for ICD-9 coding in MIMIC-III. We also propose further avenues of research involving the proposed ontological representation.
    TIAGE: A Benchmark for Topic-Shift Aware Dialog Modeling. (arXiv:2109.04562v1 [cs.CL])
    (2 min) Human conversations naturally evolve around different topics and fluently move between them. In research on dialog systems, the ability to actively and smoothly transition to new topics is often ignored. In this paper we introduce TIAGE, a new topic-shift aware dialog benchmark constructed utilizing human annotations on topic shifts. Based on TIAGE, we introduce three tasks to investigate different scenarios of topic-shift modeling in dialog settings: topic-shift detection, topic-shift triggered response generation and topic-aware dialog generation. Experiments on these tasks show that the topic-shift signals in TIAGE are useful for topic-shift response generation. On the other hand, dialog systems still struggle to decide when to change topic. This indicates further research is needed in topic-shift aware dialog modeling.
    Revisiting Robust Neural Machine Translation: A Transformer Case Study. (arXiv:2012.15710v2 [cs.CL] UPDATED)
    (2 min) Transformers (Vaswani et al., 2017) have brought a remarkable improvement in the performance of neural machine translation (NMT) systems, but they can be surprisingly vulnerable to noise. In this work, we investigate how noise breaks Transformers and whether there are solutions to deal with such issues. There is a large body of work in the NMT literature on analyzing the behavior of conventional models for the problem of noise, but Transformers are relatively understudied in this context. Motivated by this, we introduce a novel data-driven technique called Target Augmented Fine-tuning (TAFT) to incorporate noise during training. This idea is comparable to the well-known fine-tuning strategy. Moreover, we propose two other novel extensions to the original Transformer, Controlled Denoising (CD) and Dual-Channel Decoding (DCD), which modify the neural architecture as well as the training process to handle noise. One important characteristic of our techniques is that they only impact the training phase and do not impose any overhead at inference time. We evaluated our techniques on English--German translation in both directions and observed that our models have a higher tolerance to noise. More specifically, they perform with no deterioration even when up to 10% of all test words are affected by noise.
    An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model. (arXiv:2109.04834v1 [cs.CL])
    (2 min) Multi-turn response selection models have recently shown comparable performance to humans on several benchmark datasets. However, in real environments, these models often have weaknesses, such as making incorrect predictions based heavily on superficial patterns without a comprehensive understanding of the context. For example, these models often give a high score to a wrong response candidate that contains several keywords related to the context but uses an inconsistent tense. In this study, we analyze the weaknesses of open-domain Korean multi-turn response selection models and publish an adversarial dataset to evaluate these weaknesses. We also suggest a strategy for building a robust model in this adversarial environment.
    We went to look for meaning and all we got were these lousy representations: aspects of meaning representation for computational semantics. (arXiv:2109.04949v1 [cs.CL])
    (2 min) In this paper we examine different meaning representations that are commonly used in different natural language applications today and discuss their limits, both in terms of the aspects of the natural language meaning they are modelling and in terms of the aspects of the application for which they are used.
    LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation. (arXiv:2109.04993v1 [cs.CV])
    (2 min) Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. Transformer-based models learn inter- and intra-modal attention through a list of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks, GAN-based image synthesis and image captioning. We also propose a new evaluation metric measuring the similarity between the learnt visual and textual embeddings. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.
    Dual-State Capsule Networks for Text Classification. (arXiv:2109.04762v1 [cs.CL])
    (2 min) Text classification systems based on contextual embeddings are not viable options for many low-resource languages. On the other hand, recently introduced capsule networks have shown performance on par with these text classification models. Thus, they could be considered a viable alternative for text classification in languages that do not have pre-trained contextual embedding models. However, current capsule networks depend upon spatial patterns without considering the sequential features of the text. They are also sub-optimal in capturing context-level information in longer sequences. This paper presents a novel Dual-State Capsule (DS-Caps) network-based technique for text classification, which is optimized to mitigate these issues. Two varieties of states, namely sentence-level and word-level, are integrated with capsule layers to capture deeper context-level information for language modeling. The dynamic routing process among capsules is also optimized using the context-level information obtained through sentence-level states. DS-Caps networks outperform existing capsule network architectures on multiple datasets, particularly for tasks with longer sequences of text. We also demonstrate the superiority of DS-Caps in text classification for a low-resource language.
    Assessing the Reliability of Word Embedding Gender Bias Measures. (arXiv:2109.04732v1 [cs.CL])
    (2 min) Various measures have been proposed to quantify human-like social biases in word embeddings. However, bias scores based on these measures can suffer from measurement error. One indication of measurement quality is reliability, concerning the extent to which a measure produces consistent results. In this paper, we assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency. Specifically, we investigate the consistency of bias scores across different choices of random seeds, scoring rules and words. Furthermore, we analyse the effects of various factors on these measures' reliability scores. Our findings inform better design of word embedding gender bias measures. Moreover, we urge researchers to be more critical about the application of such measures.
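    As a toy example of a test-retest check, the sketch below computes a simple cosine-based association score per target word in two embedding spaces and correlates the resulting score vectors; the word lists, random vectors, and scoring rule are placeholders rather than the measures studied in the paper.

    ```python
    # Sketch of test-retest reliability: the same bias score computed in two
    # embedding spaces (e.g. different random seeds), then correlated.
    import numpy as np

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def bias_score(emb, word, male_words, female_words):
        m = np.mean([cos(emb[word], emb[w]) for w in male_words])
        f = np.mean([cos(emb[word], emb[w]) for w in female_words])
        return m - f

    rng = np.random.default_rng(0)
    vocab = ["doctor", "nurse", "engineer", "teacher", "he", "she", "man", "woman"]
    emb_run1 = {w: rng.normal(size=50) for w in vocab}   # stand-ins for trained embeddings
    emb_run2 = {w: rng.normal(size=50) for w in vocab}
    targets = ["doctor", "nurse", "engineer", "teacher"]
    male, female = ["he", "man"], ["she", "woman"]
    scores1 = [bias_score(emb_run1, w, male, female) for w in targets]
    scores2 = [bias_score(emb_run2, w, male, female) for w in targets]
    print(np.corrcoef(scores1, scores2)[0, 1])           # test-retest consistency
    ```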
    SPECTRA: Sparse Structured Text Rationalization. (arXiv:2109.04552v1 [cs.CL])
    (2 min) Selective rationalization aims to produce decisions along with rationales (e.g., text highlights or word alignments between two sentences). Commonly, rationales are modeled as stochastic binary masks, requiring sampling-based gradient estimators, which complicates training and requires careful hyperparameter tuning. Sparse attention mechanisms are a deterministic alternative, but they lack a way to regularize the rationale extraction (e.g., to control the sparsity of a text highlight or the number of alignments). In this paper, we present a unified framework for deterministic extraction of structured explanations via constrained inference on a factor graph, forming a differentiable layer. Our approach greatly eases training and rationale regularization, generally outperforming previous work in terms of performance and plausibility of the extracted rationales. We further provide a comparative study of stochastic and deterministic methods for rationale extraction for classification and natural language inference tasks, jointly assessing their predictive power, quality of the explanations, and model variability.
    What to Pre-Train on? Efficient Intermediate Task Selection. (arXiv:2104.08247v2 [cs.CL] UPDATED)
    (2 min) Intermediate task fine-tuning has been shown to yield large transfer gains across many NLP tasks. With an abundance of candidate datasets as well as pre-trained language models, it has become infeasible to run the cross-product of all combinations to find the best transfer setting. In this work we first establish that similar sequential fine-tuning gains can be achieved in adapter settings, and subsequently consolidate previously proposed methods that efficiently identify beneficial tasks for intermediate transfer learning. We experiment with a diverse set of 42 intermediate and 11 target English classification, multiple choice, question answering, and sequence tagging tasks. Our results show that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches. Our best methods achieve an average Regret@3 of less than 1% across all target tasks, demonstrating that we are able to efficiently identify the best datasets for intermediate training.
    Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training. (arXiv:2104.08645v2 [cs.CL] UPDATED)
    (2 min) Pre-trained multilingual language encoders, such as multilingual BERT and XLM-R, show great potential for zero-shot cross-lingual transfer. However, these multilingual encoders do not precisely align words and phrases across languages. In particular, learning alignments in the multilingual embedding space usually requires sentence-level or word-level parallel corpora, which are expensive to obtain for low-resource languages. An alternative is to make the multilingual encoders more robust: when fine-tuning the encoder on a downstream task, we train the encoder to tolerate noise in the contextual embedding spaces, so that even if the representations of different languages are not aligned well, the model can still achieve good performance on zero-shot cross-lingual transfer. In this work, we propose a learning strategy for training robust models by drawing connections between adversarial examples and the failure cases of zero-shot cross-lingual transfer. We adopt two widely used robust training methods, adversarial training and randomized smoothing, to train the desired robust model. The experimental results demonstrate that robust training improves zero-shot cross-lingual transfer on text classification tasks. The improvement is more significant in the generalized cross-lingual transfer setting, where the pair of input sentences belong to two different languages.
    The Effect of Efficient Messaging and Input Variability on Neural-Agent Iterated Language Learning. (arXiv:2104.07637v2 [cs.CL] UPDATED)
    (2 min) Natural languages display a trade-off among different strategies to convey syntactic structure, such as word order or inflection. This trade-off, however, has not appeared in recent simulations of iterated language learning with neural network agents (Chaabouni et al., 2019b). We re-evaluate this result in light of three factors that play an important role in comparable experiments from the Language Evolution field: (i) speaker bias towards efficient messaging, (ii) non-systematic input languages, and (iii) a learning bottleneck. Our simulations show that neural agents mainly strive to maintain the utterance type distribution observed during learning, instead of developing a more efficient or systematic language.
    UNKs Everywhere: Adapting Multilingual Language Models to New Scripts. (arXiv:2012.15562v3 [cs.CL] UPDATED)
    (2 min) Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model's embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT's and target language vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.
    TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning. (arXiv:2104.06979v3 [cs.CL] UPDATED)
    (2 min) Learning sentence embeddings often requires a large amount of labeled data. However, for most tasks and domains, labeled data is seldom available and creating it is expensive. In this work, we present a new state-of-the-art unsupervised method based on pre-trained Transformers and Sequential Denoising Auto-Encoder (TSDAE) which outperforms previous approaches by up to 6.4 points. It can achieve up to 93.1% of the performance of in-domain supervised approaches. Further, we show that TSDAE is a strong domain adaptation and pre-training method for sentence embeddings, significantly outperforming other approaches like Masked Language Model. A crucial shortcoming of previous studies is the narrow evaluation: Most work mainly evaluates on the single task of Semantic Textual Similarity (STS), which does not require any domain knowledge. It is unclear if these proposed methods generalize to other domains and tasks. We fill this gap and evaluate TSDAE and other recent approaches on four different datasets from heterogeneous domains.
    Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent. (arXiv:2010.09697v3 [cs.LG] UPDATED)
    (2 min) The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
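    Tracking the quantity in question is straightforward; a minimal PyTorch sketch that logs the global l2 parameter norm during a toy training loop is shown below (the model, objective, and hyperparameters are arbitrary stand-ins, not the paper's setup).

    ```python
    # Sketch: log the global l2 norm of a model's parameters during training,
    # the quantity whose growth the paper studies (toy model and objective).
    import torch
    import torch.nn as nn

    def global_param_norm(model: nn.Module) -> float:
        """Square root of the sum of squared parameter entries."""
        return torch.sqrt(sum(p.detach().pow(2).sum() for p in model.parameters())).item()

    model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for step in range(100):
        x = torch.randn(8, 10, 64)                 # toy batch: (batch, seq, d_model)
        loss = (model(x) - x).pow(2).mean()        # placeholder objective
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 20 == 0:
            print(step, global_param_norm(model))
    ```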
    Query-driven Segment Selection for Ranking Long Documents. (arXiv:2109.04611v1 [cs.IR])
    (2 min) Transformer-based rankers have shown state-of-the-art performance. However, their self-attention operation is mostly unable to process long sequences. One common approach to training these rankers is to heuristically select some segments of each document, such as the first segment, as training data. However, these segments may not contain the query-related parts of the documents. To address this problem, we propose query-driven segment selection from long documents to build training data. The segment selector provides relevant samples with more accurate labels and non-relevant samples which are harder to predict. The experimental results show that a basic BERT-based ranker trained with the proposed segment selector significantly outperforms one trained on heuristically selected segments, and performs on par with the state-of-the-art model with localized self-attention that can process longer input sequences. Our findings open up a new direction for designing efficient transformer-based rankers.
    Visual Cues and Error Correction for Translation Robustness. (arXiv:2103.07352v2 [cs.CL] UPDATED)
    (2 min) Neural Machine Translation models are sensitive to noise in the input texts, such as misspelled words and ungrammatical constructions. Existing robustness techniques generally fail when faced with unseen types of noise and their performance degrades on clean texts. In this paper, we focus on three types of realistic noise that are commonly generated by humans and introduce the idea of visual context to improve translation robustness for noisy texts. In addition, we describe a novel error correction training regime that can be used as an auxiliary task to further improve translation robustness. Experiments on English-French and English-German translation show that both multimodal and error correction components improve model robustness to noisy texts, while still retaining translation quality on clean texts.
    Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation. (arXiv:2105.03432v2 [cs.CL] UPDATED)
    (2 min) Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language. Previous approaches in this task have been able to generalise to rare or unseen instances by relying on a delexicalisation of the input. However, this often requires that the input appears verbatim in the output text. This poses challenges in multilingual settings, where the task expands to generate the output text in multiple languages given the same input. In this paper, we explore the application of multilingual models in concept-to-text and propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings, and employs a character-level post-editing model to inflect words in their correct form during relexicalisation. Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text and that our framework outperforms previous approaches, especially for low resource languages.
    Lessons on Parameter Sharing across Layers in Transformers. (arXiv:2104.06022v2 [cs.CL] UPDATED)
    (2 min) We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique, which shares the parameters of one layer with all layers as in Universal Transformers (Dehghani et al., 2019), to improve efficiency in computational time. We propose three strategies, Sequence, Cycle, and Cycle (rev), to assign parameters to each layer. Experimental results show that the proposed strategies are efficient in terms of parameter size and computational time. Moreover, we show that the proposed strategies are also effective in configurations with large amounts of training data, such as the recent WMT competition.
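    One plausible reading of the three assignment strategies, for M distinct parameter sets shared across N layers, is sketched below; the exact definitions may differ from the paper, so treat this as an assumption-laden illustration.

    ```python
    # Sketch: three ways to assign M shared parameter sets to N layers
    # (a plausible reading of Sequence, Cycle, and Cycle (rev), not verified).
    def assign(strategy: str, n_layers: int, n_params: int):
        if strategy == "sequence":            # consecutive layers share the same set
            per = n_layers // n_params
            return [min(i // per, n_params - 1) for i in range(n_layers)]
        if strategy == "cycle":               # repeat the sets 0..M-1 cyclically
            return [i % n_params for i in range(n_layers)]
        if strategy == "cycle_rev":           # cycle, but reverse the order in the last pass
            base = [i % n_params for i in range(n_layers - n_params)]
            return base + list(range(n_params - 1, -1, -1))
        raise ValueError(strategy)

    for s in ("sequence", "cycle", "cycle_rev"):
        print(s, assign(s, n_layers=6, n_params=3))
    ```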
    Retrieval Augmented Code Generation and Summarization. (arXiv:2108.11601v2 [cs.SE] UPDATED)
    (2 min) Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they wrote in the past while implementing software or documenting it. To mimic developers' code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER has two distinguishing features. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets for code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.
    "Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks. (arXiv:2104.08384v2 [cs.CL] UPDATED)
    (2 min) We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for a seed parallel dataset to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. a supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily supervised translation models to unsupervised image captioning and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a translated version of the English captioning data, produced with our wikily-supervised translation models. Our captioning results on Arabic are slightly better than those of its supervised counterpart. In dependency parsing, we translate a large amount of monolingual text and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.
    De-identification of Unstructured Clinical Texts from Sequence to Sequence Perspective. (arXiv:2108.07971v2 [cs.CL] UPDATED)
    (2 min) In this work, we propose a novel problem formulation for de-identification of unstructured clinical text. We formulate the de-identification problem as a sequence-to-sequence learning problem instead of a token classification problem. Our approach is inspired by the recent state-of-the-art performance of sequence-to-sequence learning models for named entity recognition. Early experimentation with our proposed approach achieved a 98.91% recall rate on the i2b2 dataset. This performance is comparable to current state-of-the-art models for unstructured clinical text de-identification.
    HELMHOLTZ: A Verifier for Tezos Smart Contracts Based on Refinement Types. (arXiv:2108.12971v2 [cs.PL] UPDATED)
    (2 min) A smart contract is a program executed on a blockchain, based on which many cryptocurrencies are implemented, and is being used for automating transactions. Due to the large amount of money that smart contracts deal with, there is a surging demand for a method that can statically and formally verify them. This article describes our type-based static verification tool HELMHOLTZ for Michelson, which is a statically typed stack-based language for writing smart contracts that are executed on the blockchain platform Tezos. HELMHOLTZ is designed on top of our extension of Michelson's type system with refinement types. HELMHOLTZ takes as input a Michelson program annotated with a user-defined specification written in the form of a refinement type; it then typechecks the program against the specification based on the refinement type system, discharging the generated verification conditions with the SMT solver Z3. We briefly introduce our refinement type system for the core calculus Mini-Michelson of Michelson, which incorporates characteristic features such as compound datatypes (e.g., lists and pairs), higher-order functions, and invocation of another contract. HELMHOLTZ successfully verifies several practical Michelson programs, including one that transfers money to an account and one that checks a digital signature.
    NegatER: Unsupervised Discovery of Negatives in Commonsense Knowledge Bases. (arXiv:2011.07497v2 [cs.AI] UPDATED)
    (2 min) Codifying commonsense knowledge in machines is a longstanding goal of artificial intelligence. Recently, much progress toward this goal has been made with automatic knowledge base (KB) construction techniques. However, such techniques focus primarily on the acquisition of positive (true) KB statements, even though negative (false) statements are often also important for discriminative reasoning over commonsense KBs. As a first step toward the latter, this paper proposes NegatER, a framework that ranks potential negatives in commonsense KBs using a contextual language model (LM). Importantly, as most KBs do not contain negatives, NegatER relies only on the positive knowledge in the LM and does not require ground-truth negative examples. Experiments demonstrate that, compared to multiple contrastive data augmentation approaches, NegatER yields negatives that are more grammatical, coherent, and informative -- leading to statistically significant accuracy improvements in a challenging KB completion task and confirming that the positive knowledge in LMs can be "re-purposed" to generate negative knowledge.
    A Search Engine for Discovery of Scientific Challenges and Directions. (arXiv:2108.13751v2 [cs.CL] UPDATED)
    (2 min) Keeping track of scientific challenges, advances and emerging directions is a fundamental part of research. However, researchers face a flood of papers that hinders discovery of important knowledge. In biomedicine, this directly impacts human lives. To address this problem, we present a novel task of extraction and search of scientific challenges and directions, to facilitate rapid knowledge discovery. We construct and release an expert-annotated corpus of texts sampled from full-length papers, labeled with novel semantic categories that generalize across many types of challenges and directions. We focus on a large corpus of interdisciplinary work relating to the COVID-19 pandemic, ranging from biomedicine to areas such as AI and economics. We apply a model trained on our data to identify challenges and directions across the corpus and build a dedicated search engine. In experiments with 19 researchers and clinicians using our system, we outperform a popular scientific search engine in assisting knowledge discovery. Finally, we show that models trained on our resource generalize to the wider biomedical domain and to AI papers, highlighting its broad utility. We make our data, model and search engine publicly available. https://challenges.apps.allenai.org/
    EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion in Task-Oriented Dialogue Systems. (arXiv:2109.04919v1 [cs.CL])
    (2 min) The ability to recognise emotions lends a conversational artificial intelligence a human touch. While emotions in chit-chat dialogues have received substantial attention, emotions in task-oriented dialogues have been largely overlooked despite having an equally important role, such as signalling failure or success. Existing emotion-annotated task-oriented corpora are limited in size, label richness, and public availability, creating a bottleneck for downstream tasks. To lay a foundation for studies on emotions in task-oriented dialogues, we introduce EmoWOZ, a large-scale manually emotion-annotated corpus of task-oriented dialogues. EmoWOZ is based on MultiWOZ, a multi-domain task-oriented dialogue dataset. It contains more than 11K dialogues with more than 83K emotion annotations of user utterances. In addition to Wizard-of-Oz dialogues from MultiWOZ, we collect human-machine dialogues within the same set of domains to sufficiently cover the space of emotions that can occur during the lifetime of a data-driven dialogue system. To the best of our knowledge, this is the first large-scale open-source corpus of its kind. We propose a novel emotion labelling scheme, which is tailored to task-oriented dialogues. We report a set of experimental results to show the usability of this corpus for emotion recognition and state tracking in task-oriented dialogues.
    Relational World Knowledge Representation in Contextual Language Models: A Review. (arXiv:2104.05837v2 [cs.CL] UPDATED)
    (2 min) Relational knowledge bases (KBs) are commonly used to represent world knowledge in machines. However, while advantageous for their high degree of precision and interpretability, KBs are usually organized according to manually-defined schemas, which limit their expressiveness and require significant human efforts to engineer and maintain. In this review, we take a natural language processing perspective to these limitations, examining how they may be addressed in part by training deep contextual language models (LMs) to internalize and express relational knowledge in more flexible forms. We propose to organize knowledge representation strategies in LMs by the level of KB supervision provided, from no KB supervision at all to entity- and relation-level supervision. Our contributions are threefold: (1) We provide a high-level, extensible taxonomy for knowledge representation in LMs; (2) Within our taxonomy, we highlight notable models, evaluation tasks, and findings, in order to provide an up-to-date review of current knowledge representation capabilities in LMs; and (3) We suggest future research directions that build upon the complementary aspects of LMs and KBs as knowledge representations.
    Generative Spoken Language Modeling from Raw Audio. (arXiv:2102.01192v2 [cs.CL] UPDATED)
    (2 min) We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
    Box Embeddings: An open-source library for representation learning using geometric structures. (arXiv:2109.04997v1 [cs.CL])
    (2 min) A major factor contributing to the success of modern representation learning is the ease of performing various vector operations. Recently, objects with geometric structures (e.g., distributions, complex or hyperbolic vectors, or regions such as cones, disks, or boxes) have been explored for their alternative inductive biases and additional representational capacities. In this work, we introduce Box Embeddings, a Python library that enables researchers to easily apply and extend probabilistic box embeddings.
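    For readers unfamiliar with box embeddings, the following minimal sketch (written from scratch in PyTorch, not using the Box Embeddings library's actual API) illustrates the core operations: each entity is an axis-aligned hyper-rectangle, intersections are computed coordinate-wise, and soft volumes yield containment probabilities.
```python
import torch
import torch.nn.functional as F

class BoxEmbedding(torch.nn.Module):
    """Minimal probabilistic box embedding: each entity is an axis-aligned
    hyper-rectangle parameterized by a lower corner and a positive offset."""

    def __init__(self, num_entities, dim):
        super().__init__()
        self.lower = torch.nn.Embedding(num_entities, dim)
        self.offset = torch.nn.Embedding(num_entities, dim)  # softplus -> positive widths

    def boxes(self, idx):
        lo = self.lower(idx)
        hi = lo + F.softplus(self.offset(idx))
        return lo, hi

    @staticmethod
    def log_volume(lo, hi, temp=1.0):
        # Softplus smoothing keeps gradients alive when boxes barely overlap.
        return F.softplus((hi - lo) / temp).log().sum(dim=-1)

    def log_conditional(self, a_idx, b_idx):
        """log P(a | b) = log Vol(a intersect b) - log Vol(b)."""
        a_lo, a_hi = self.boxes(a_idx)
        b_lo, b_hi = self.boxes(b_idx)
        inter_lo = torch.max(a_lo, b_lo)
        inter_hi = torch.min(a_hi, b_hi)
        return self.log_volume(inter_lo, inter_hi) - self.log_volume(b_lo, b_hi)

# Usage: score how likely entity 3 is contained in entity 7.
model = BoxEmbedding(num_entities=100, dim=16)
print(model.log_conditional(torch.tensor([3]), torch.tensor([7])))
```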
    Does Pretraining for Summarization Require Knowledge Transfer?. (arXiv:2109.04953v1 [cs.CL])
    (2 min) Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that by pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer.
    FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging. (arXiv:2012.15781v2 [cs.LG] UPDATED)
    (0 min) Influence functions approximate the "influences" of training data-points for test predictions and have a wide variety of applications. Despite their popularity, the computational cost of influence functions does not scale well with model and training data size. We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time. We use k-Nearest Neighbors (kNN) to narrow the search space down to a subset of good candidate data points, identify the configurations that best balance the speed-quality trade-off in estimating the inverse Hessian-vector product, and introduce a fast parallel variant. Our proposed method achieves about 80X speedup while being highly correlated with the original influence values. With the availability of the fast influence functions, we demonstrate their usefulness in four applications. First, we examine whether influential data-points can "explain" test time behavior using the framework of simulatability. Second, we visualize the influence interactions between training and test data-points. Third, we show that we can correct model errors by additional fine-tuning on certain influential data-points, improving the accuracy of a trained MultiNLI model by 2.5% on the HANS dataset. Finally, we experiment with a similar setup but fine-tuning on datapoints not seen during training, improving the model accuracy by 2.8% and 1.7% on HANS and ANLI datasets respectively. Overall, our fast influence functions can be efficiently applied to large models and datasets, and our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors. Code is available at https://github.com/salesforce/fast-influence-functions
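    A rough sketch of the two-stage idea on a toy linear classifier is shown below; the model, data, and the identity-matrix stand-in for the inverse Hessian are simplifications for illustration, not the paper's actual implementation (which estimates inverse Hessian-vector products and runs on large NLP models).
```python
import torch
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins: a "trained" linear classifier plus train/test features and labels.
torch.manual_seed(0)
model = torch.nn.Linear(8, 2)
loss_fn = torch.nn.CrossEntropyLoss()
X_train, y_train = torch.randn(500, 8), torch.randint(0, 2, (500,))
x_test, y_test = torch.randn(1, 8), torch.tensor([1])

def flat_grad(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# Step 1 (the FastIF idea): restrict the search to the k nearest training points
# in feature space instead of scanning the whole training set.
knn = NearestNeighbors(n_neighbors=50).fit(X_train.numpy())
_, idx = knn.kneighbors(x_test.numpy())
candidates = idx[0]

# Step 2: score candidates by gradient alignment. A faithful implementation would
# multiply by an estimate of the inverse Hessian (e.g. via LiSSA); here the identity
# matrix is used as a crude stand-in to keep the sketch short.
g_test = flat_grad(x_test, y_test)
influences = {
    int(i): -float(g_test @ flat_grad(X_train[i : i + 1], y_train[i : i + 1]))
    for i in candidates
}
print(sorted(influences, key=influences.get)[:5])  # candidates ranked by influence score
```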
    Learning Hard Retrieval Decoder Attention for Transformers. (arXiv:2009.14658v2 [cs.CL] UPDATED)
    (0 min) The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by jointly attending to information from different representation subspaces at different positions. In this paper, we present an approach to learning a hard retrieval attention where an attention head only attends to one token in the sentence rather than all tokens. The matrix multiplication between attention probabilities and the value sequence in the standard scaled dot-product attention can thus be replaced by a simple and efficient retrieval operation. We show that our hard retrieval attention mechanism is 1.43 times faster in decoding, while preserving translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
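    The core replacement can be illustrated in a few lines of PyTorch: the softmax-weighted sum over values becomes an argmax over the scores followed by a gather. This is only an inference-time sketch under that reading of the abstract; how the hard heads are trained is beyond this snippet.
```python
import torch

def hard_retrieval_attention(q, k, v):
    """Each query attends to exactly one key: take the argmax of the scaled
    dot-product scores and gather the corresponding value, instead of computing
    a softmax-weighted sum over all positions."""
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5      # (batch, tq, tk)
    best = scores.argmax(dim=-1)                               # (batch, tq)
    return torch.gather(v, 1, best.unsqueeze(-1).expand(-1, -1, v.size(-1)))

q = torch.randn(2, 5, 64)   # queries
k = torch.randn(2, 7, 64)   # keys
v = torch.randn(2, 7, 64)   # values
print(hard_retrieval_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```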
    Knowledge-Aware Meta-learning for Low-Resource Text Classification. (arXiv:2109.04707v1 [cs.CL])
    (0 min) Meta-learning has achieved great success in leveraging the historical learned knowledge to facilitate the learning process of the new task. However, merely learning the knowledge from the historical tasks, adopted by current meta-learning algorithms, may not generalize well to testing tasks when they are not well-supported by training tasks. This paper studies a low-resource text classification problem and bridges the gap between meta-training and meta-testing tasks by leveraging the external knowledge bases. Specifically, we propose KGML to introduce additional representation for each sentence learned from the extracted sentence-specific knowledge graph. The extensive experiments on three datasets demonstrate the effectiveness of KGML under both supervised adaptation and unsupervised adaptation settings.
    Speechformer: Reducing Information Loss in Direct Speech Translation. (arXiv:2109.04574v1 [cs.CL])
    (0 min) Transformer-based models have gained increasing popularity achieving state-of-the-art performance in many research fields including speech translation. However, Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. Therefore, potentially useful linguistic information is not accessible to higher-level layers in the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level according to more informed linguistic criteria. Experiments on three language pairs (en->de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and of up to 4.0 BLEU in a low resource scenario.
    FNet: Mixing Tokens with Fourier Transforms. (arXiv:2105.03824v3 [cs.CL] UPDATED)
    (0 min) We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
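    The token-mixing sublayer described above is simple enough to sketch directly: an FFT along the hidden dimension, an FFT along the sequence dimension, and keeping the real part, with no learned parameters. In the full encoder this sublayer sits inside the usual residual, layer-norm, and feed-forward structure, which is omitted here.
```python
import torch

class FNetMixingLayer(torch.nn.Module):
    """Parameter-free token mixing via a 2D Fourier Transform: apply an FFT along
    the hidden dimension, then along the sequence dimension, and keep the real part."""
    def forward(self, x):                      # x: (batch, seq_len, hidden)
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 128, 256)
print(FNetMixingLayer()(x).shape)  # torch.Size([2, 128, 256])
```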
    Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding. (arXiv:2109.04947v1 [cs.CL])
    (0 min) Large-scale, pre-trained language models (LMs) have achieved human-level performance on a breadth of language understanding tasks. However, evaluations only based on end task performance shed little light on machines' true ability in language understanding and reasoning. In this paper, we highlight the importance of evaluating the underlying reasoning process in addition to end performance. Toward this goal, we introduce Tiered Reasoning for Intuitive Physics (TRIP), a novel commonsense reasoning dataset with dense annotations that enable multi-tiered evaluation of machines' reasoning process. Our empirical results show that while large LMs can achieve high end performance, they struggle to support their predictions with valid supporting evidence. The TRIP dataset and our baseline results will motivate verifiable evaluation of commonsense reasoning and facilitate future research toward developing better language understanding and reasoning models.
    IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. (arXiv:2109.04607v1 [cs.CL])
    (0 min) We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.
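    The vocabulary-initialization trick is easy to reproduce with the Hugging Face transformers API; in the sketch below the checkpoint name and the added words are placeholders, not the actual IndoBERTweet setup.
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical example: extend a BERT-style checkpoint with new domain words and
# initialize each new embedding as the average of the embeddings of the subwords
# the original tokenizer splits that word into.
model_name = "bert-base-multilingual-cased"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

new_words = ["gaspol", "mager"]               # hypothetical domain-specific words
old_vocab_size = len(tokenizer)
subword_ids = [tokenizer.encode(w, add_special_tokens=False) for w in new_words]

tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for i, ids in enumerate(subword_ids):
        emb[old_vocab_size + i] = emb[ids].mean(dim=0)   # average of subword embeddings
```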
    How Does Fine-tuning Affect the Geometry of Embedding Space: A Case Study on Isotropy. (arXiv:2109.04740v1 [cs.CL])
    (0 min) It is widely accepted that fine-tuning pre-trained language models usually brings about performance improvements in downstream tasks. However, there are limited studies on the reasons behind this effectiveness, particularly from the viewpoint of structural changes in the embedding space. Trying to fill this gap, in this paper, we analyze the extent to which the isotropy of the embedding space changes after fine-tuning. We demonstrate that, even though isotropy is a desirable geometrical property, fine-tuning does not necessarily result in isotropy enhancements. Moreover, local structures in pre-trained contextual word representations (CWRs), such as those encoding token types or frequency, undergo a massive change during fine-tuning. Our experiments show dramatic growth in the number of elongated directions in the embedding space, which, in contrast to pre-trained CWRs, carry the essential linguistic knowledge in the fine-tuned embedding space, making existing isotropy enhancement methods ineffective.
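    As a concrete reference point, one commonly used isotropy estimate (the partition-function ratio of Mu and Viswanath, 2018) can be computed as follows; this illustrates the kind of measure involved and is not necessarily the exact metric used in the paper.
```python
import numpy as np

def isotropy(embeddings):
    """Isotropy estimate of Mu & Viswanath (2018): min_c Z(c) / max_c Z(c), where
    Z(c) = sum_w exp(c . w) and c ranges over eigenvectors of the Gram matrix W^T W.
    Values close to 1 mean the embedding space is close to isotropic."""
    W = embeddings - embeddings.mean(axis=0)          # mean-center, as is common
    _, eigvecs = np.linalg.eigh(W.T @ W)              # columns are eigenvectors
    Z = np.exp(W @ eigvecs).sum(axis=0)               # partition function per direction
    return Z.min() / Z.max()

print(isotropy(np.random.randn(1000, 64)))            # random vectors: fairly isotropic
```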
    An Exploratory Study on Long Dialogue Summarization: What Works and What's Next. (arXiv:2109.04609v1 [cs.CL])
    (0 min) Dialogue summarization helps readers capture salient information from long conversations in meetings, interviews, and TV series. However, real-world dialogues pose a great challenge to current summarization models, as the dialogue length typically exceeds the input limits imposed by recent transformer-based pre-trained models, and the interactive nature of dialogues makes relevant information more context-dependent and sparsely distributed than news articles. In this work, we perform a comprehensive study on long dialogue summarization by investigating three strategies to deal with the lengthy input problem and locate relevant information: (1) extended transformer models such as Longformer, (2) retrieve-then-summarize pipeline models with several dialogue utterance retrieval methods, and (3) hierarchical dialogue encoding models such as HMNet. Our experimental results on three long dialogue datasets (QMSum, MediaSum, SummScreen) show that the retrieve-then-summarize pipeline models yield the best performance. We also demonstrate that the summary quality can be further improved with a stronger retrieval model and pretraining on proper external summarization datasets.
    Modeling Disclosive Transparency in NLP Application Descriptions. (arXiv:2101.00433v4 [cs.CL] UPDATED)
    (0 min) Broader disclosive transparency -- truth and clarity in communication regarding the function of AI systems -- is widely considered desirable. Unfortunately, it is a nebulous concept, difficult to both define and quantify. This is problematic, as previous work has demonstrated possible trade-offs and negative consequences to disclosive transparency, such as a confusion effect, where "too much information" clouds a reader's understanding of what a system description means. Disclosive transparency's subjective nature has rendered deep study into these problems and their remedies difficult. To improve this state of affairs, we introduce neural language model-based probabilistic metrics to directly model disclosive transparency, and demonstrate that they correlate with user and expert opinions of system transparency, making them a valid objective proxy. Finally, we demonstrate the use of these metrics in a pilot study quantifying the relationships between transparency, confusion, and user perceptions in a corpus of real NLP system descriptions.
    AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages. (arXiv:2109.04715v1 [cs.CL])
    (0 min) Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized reproducible benchmarks for many African languages, many of which are used by millions of speakers but have less digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.
    Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations. (arXiv:2109.04602v1 [cs.CL])
    (0 min) Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.
    Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training. (arXiv:2109.05003v1 [cs.CL])
    (0 min) We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that the distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme comprised of a new loss function and a noisy label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
    Asking It All: Generating Contextualized Questions for any Semantic Role. (arXiv:2109.04832v1 [cs.CL])
    (0 min) Asking questions about a situation is an inherent step towards understanding it. To this end, we introduce the task of role question generation, which, given a predicate mention and a passage, requires producing a set of questions asking about all possible semantic roles of the predicate. We develop a two-stage model for this task, which first produces a context-independent question prototype for each role and then revises it to be contextually appropriate for the passage. Unlike most existing approaches to question generation, our approach does not require conditioning on existing answers in the text. Instead, we condition on the type of information to inquire about, regardless of whether the answer appears explicitly in the text, could be inferred from it, or should be sought elsewhere. Our evaluation demonstrates that we generate diverse and well-formed questions for a large, broad-coverage ontology of predicates and roles.
    Euphemistic Phrase Detection by Masked Language Model. (arXiv:2109.04666v1 [cs.CL])
    (2 min) It is a well-known approach for fringe groups and organizations to use euphemisms -- ordinary-sounding and innocent-looking words with a secret meaning -- to conceal what they are discussing. For instance, drug dealers often use "pot" for marijuana and "avocado" for heroin. From a social media content moderation perspective, though recent advances in NLP have enabled the automatic detection of such single-word euphemisms, no existing work is capable of automatically detecting multi-word euphemisms, such as "blue dream" (marijuana) and "black tar" (heroin). Our paper tackles the problem of euphemistic phrase detection without human effort for the first time, as far as we are aware. We first perform phrase mining on a raw text corpus (e.g., social media posts) to extract quality phrases. Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates. Finally, we rank those candidates by a masked language model -- SpanBERT. Compared to strong baselines, we report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.
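    The final ranking step can be approximated with a generic masked language model: mask the phrase slot in a context and score each candidate by the log-probability the model assigns to its tokens. The sketch below uses bert-base-uncased and a made-up context purely for illustration; the paper's pipeline works over mined phrases and uses SpanBERT.
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def phrase_score(context, candidate):
    """Average log-probability the masked LM assigns to the candidate's tokens
    when the phrase slot in the context is replaced by [MASK] tokens."""
    cand_ids = tokenizer.encode(candidate, add_special_tokens=False)
    masked = context.replace("[PHRASE]", " ".join([tokenizer.mask_token] * len(cand_ids)))
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    log_probs = logits.log_softmax(dim=-1)
    positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, cand_ids)) / len(cand_ids)

# Hypothetical context and candidate phrases, for illustration only.
context = "we just got a fresh batch of [PHRASE] , message me for prices"
candidates = ["blue dream", "new phones", "running shoes"]
print(sorted(candidates, key=lambda c: phrase_score(context, c), reverse=True))
```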
    Measuring Association Between Labels and Free-Text Rationales. (arXiv:2010.12762v3 [cs.CL] UPDATED)
    (2 min) In interpretable NLP, we require faithful rationales that reflect the model's decision-making process for an explained instance. While prior work focuses on extractive rationales (a subset of the input words), we investigate their less-studied counterpart: free-text natural language rationales. We demonstrate that pipelines, existing models for faithful extractive rationalization on information-extraction style tasks, do not extend as reliably to "reasoning" tasks requiring free-text rationales. We turn to models that jointly predict and rationalize, a class of widely used high-performance models for free-text rationalization whose faithfulness is not yet established. We define label-rationale association as a necessary property for faithfulness: the internal mechanisms of the model producing the label and the rationale must be meaningfully correlated. We propose two measurements to test this property: robustness equivalence and feature importance agreement. We find that state-of-the-art T5-based joint models exhibit both properties for rationalizing commonsense question-answering and natural language inference, indicating their potential for producing faithful free-text rationales.
    Controlled Neural Sentence-Level Reframing of News Articles. (arXiv:2109.04957v1 [cs.CL])
    (2 min) Framing a news article means to portray the reported event from a specific perspective, e.g., from an economic or a health perspective. Reframing means to change this perspective. Depending on the audience or the submessage, reframing can become necessary to achieve the desired effect on the readers. Reframing is related to adapting style and sentiment, which can be tackled with neural text generation techniques. However, it is more challenging since changing a frame requires rewriting entire sentences rather than single phrases. In this paper, we study how to computationally reframe sentences in news articles while maintaining their coherence to the context. We treat reframing as a sentence-level fill-in-the-blank task for which we train neural models on an existing media frame corpus. To guide the training, we propose three strategies: framed-language pretraining, named-entity preservation, and adversarial learning. We evaluate respective models automatically and manually for topic consistency, coherence, and successful reframing. Our results indicate that generating properly-framed text works well but with tradeoffs.
    Evaluating the Morphosyntactic Well-formedness of Generated Texts. (arXiv:2103.16590v2 [cs.CL] UPDATED)
    (2 min) Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various rules governing morphosyntax directly from dependency treebanks. To tackle the noisy outputs from text generation systems, we propose a simple methodology to train robust parsers. We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
    Integrating Approaches to Word Representation. (arXiv:2109.04876v1 [cs.CL])
    (2 min) The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing. I present a survey of the distributional, compositional, and relational approaches to addressing this task, and discuss various means of integrating them into systems, with special emphasis on the word level and the out-of-vocabulary phenomenon.
    RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms. (arXiv:2005.00782v4 [cs.CL] UPDATED)
    (2 min) Pre-trained language models (PTLMs) have achieved impressive performance on commonsense inference benchmarks, but their ability to employ commonsense to make robust inferences, which is crucial for effective communications with humans, is debated. In the pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA: Robust Inference capability based on Commonsense Axioms, that evaluates robust commonsense inference despite textual perturbations. To generate data for this challenge, we develop a systematic and scalable procedure using commonsense knowledge bases and probe PTLMs across two different evaluation settings. Extensive experiments on our generated probe sets with more than 10k statements show that PTLMs perform no better than random guessing in the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks. We also find that fine-tuning on similar statements offers limited gains, as PTLMs still fail to generalize to unseen inferences. Our new large-scale benchmark exposes a significant gap between PTLMs and human-level language understanding and offers a new challenge for PTLMs to demonstrate commonsense.
    Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes. (arXiv:2109.04921v1 [cs.CL])
    (2 min) State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features and learn a projection based only on monolingual annotated datasets. We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages. We observe that for languages closely related to English, no transformation is needed. The evaluated information is encoded in a shared cross-lingual embedding space. For other languages, it is beneficial to apply an orthogonal transformation learned separately for each language. We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
    Scaling Creative Inspiration with Fine-Grained Functional Facets of Ideas. (arXiv:2102.09761v2 [cs.HC] UPDATED)
    (2 min) Large repositories of products, patents and scientific papers offer an opportunity for building systems that scour millions of ideas and help users discover inspirations. However, idea descriptions are typically in the form of unstructured text, lacking key structure that is required for supporting creative innovation interactions. Prior work has explored idea representations that were limited in expressivity, required significant manual effort from users, or dependent on curated knowledge bases with poor coverage. We explore a novel representation that automatically breaks up products into fine-grained functional facets capturing the purposes and mechanisms of ideas, and use it to support important creative innovation interactions: functional search for ideas, and exploration of the design space around a focal problem by viewing related problem perspectives pooled from across many products. In user studies, our approach boosts the quality of creative search and inspirations, outperforming strong baselines by 50-60%.
    FR-Detect: A Multi-Modal Framework for Early Fake News Detection on Social Media Using Publishers Features. (arXiv:2109.04835v1 [cs.SI])
    (2 min) In recent years, with the expansion of the Internet and attractive social media infrastructures, people prefer to follow the news through these media. Despite the many advantages of these media in the news field, the lack of any control and verification mechanism has led to the spread of fake news, as one of the most important threats to democracy, economy, journalism and freedom of expression. Designing and using automatic methods to detect fake news on social media has become a significant challenge. In this paper, we examine the publishers' role in detecting fake news on social media. We also propose a highly accurate multi-modal framework, namely FR-Detect, using user-related and content-related features with early detection capability. For this purpose, two new user-related features, namely Activity Credibility and Influence, have been introduced for publishers. Furthermore, a sentence-level convolutional neural network is provided to combine these features with latent textual content features properly. Experimental results have shown that the publishers' features can improve the performance of content-based models by up to 13% and 29% in accuracy and F1-score, respectively.
    BiSECT: Learning to Split and Rephrase Sentences with Bitexts. (arXiv:2109.05006v1 [cs.CL])
    (2 min) An important task in NLP applications such as sentence simplification is the ability to take a long, complex sentence and split it into shorter sentences, rephrasing as necessary. We introduce a novel dataset and a new model for this `split and rephrase' task. Our BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. We obtain these by extracting 1-2 sentence alignments in bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language. BiSECT contains higher quality training examples than previous Split and Rephrase corpora, with sentence splits that require more significant modifications. We categorize examples in our corpus, and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited. Moreover, we show that models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.
    Semi-supervised Relation Extraction via Incremental Meta Self-Training. (arXiv:2010.16410v2 [cs.CL] UPDATED)
    (2 min) To alleviate human efforts from obtaining large-scale annotations, Semi-Supervised Relation Extraction methods aim to leverage unlabeled data in addition to learning from limited samples. Existing self-training methods suffer from the gradual drift problem, where noisy pseudo labels on unlabeled data are incorporated during training. To alleviate the noise in pseudo labels, we propose a method called MetaSRE, where a Relation Label Generation Network generates quality assessment on pseudo labels by (meta) learning from the successful and failed attempts on Relation Classification Network as an additional meta-objective. To reduce the influence of noisy pseudo labels, MetaSRE adopts a pseudo label selection and exploitation scheme which assesses pseudo label quality on unlabeled samples and only exploits high-quality pseudo labels in a self-training fashion to incrementally augment labeled samples for both robustness and accuracy. Experimental results on two public datasets demonstrate the effectiveness of the proposed approach.
    Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization. (arXiv:2109.04994v1 [cs.CL])
    (2 min) Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors, exchanging information with each other. In such a scenario, the topic of a conversation can vary upon progression and the key information for a certain topic is often scattered across multiple utterances of different speakers, which poses challenges to abstractly summarize dialogues. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available at https://github.com/Junpliu/ConDigSum.
    EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling. (arXiv:2109.04699v1 [cs.CL])
    (2 min) While large scale pre-training has achieved great success in bridging the gap between vision and language, it still faces several challenges. First, pre-training is expensive. Second, there is no efficient way to handle the data noise which degrades model performance. Third, previous methods only leverage limited image-text paired data, while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose an EfficientCLIP method via Ensemble Confident Learning to obtain a less noisy data subset. Extra non-paired single-modal text data is used to boost the generalization of the text branch. We achieve the state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 training resources compared to CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.
    Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers. (arXiv:2109.04922v1 [cs.CL])
    (2 min) As large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks, statistical bias in benchmark data and probing studies have recently called into question their true capabilities. For a more informative evaluation than accuracy on text classification tasks can offer, we propose evaluating systems through a novel measure of prediction coherence. We apply our framework to two existing language understanding benchmarks with different properties to demonstrate its versatility. Our experimental results show that this evaluation framework, although simple in ideas and implementation, is a quick, effective, and versatile measure to provide insight into the coherence of machines' predictions.
    Lawyers are Dishonest? Quantifying Representational Harms in Commonsense Knowledge Resources. (arXiv:2103.11320v2 [cs.CL] UPDATED)
    (2 min) Warning: this paper contains content that may be offensive or upsetting. Numerous natural language processing models have tried injecting commonsense by using the ConceptNet knowledge base to improve performance on different tasks. ConceptNet, however, is mostly crowdsourced from humans and may reflect human biases such as "lawyers are dishonest." It is important that these biases are not conflated with the notion of commonsense. We study this missing yet important problem by first defining and quantifying biases in ConceptNet as two types of representational harms: overgeneralization of polarized perceptions and representation disparity. We find that ConceptNet contains severe biases and disparities across four demographic categories. In addition, we analyze two downstream models that use ConceptNet as a source for commonsense knowledge and find the existence of biases in those models as well. We further propose a filter-based bias-mitigation approach and examine its effectiveness. We show that our mitigation approach can reduce the issues in both the resource and the models but leads to a performance drop, leaving room for future work to build fairer and stronger commonsense models.
    EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation. (arXiv:2109.04600v1 [cs.CV])
    (2 min) Temporal grounding aims to predict a time interval of a video clip corresponding to a natural language query input. In this work, we present EVOQUER, a temporal grounding framework incorporating an existing text-to-video grounding model and a video-assisted query generation network. Given a query and an untrimmed video, the temporal grounding model predicts the target interval, and the predicted video clip is fed into a video translation task by generating a simplified version of the input query. EVOQUER forms closed-loop learning by incorporating loss functions from both temporal grounding and query generation serving as feedback. Our experiments on two widely used datasets, Charades-STA and ActivityNet, show that EVOQUER achieves promising improvements of 1.05 and 1.31 at R@0.7. We also discuss how the query generation task could facilitate error analysis by explaining temporal grounding model behavior.
    Improving Multilingual Translation by Representation and Gradient Regularization. (arXiv:2109.04778v1 [cs.CL])
    (2 min) Multilingual Neural Machine Translation (NMT) enables one model to serve all translation directions, including ones that are unseen during training, i.e. zero-shot translation. Despite being theoretically attractive, current models often produce low quality translations -- commonly failing to even produce outputs in the right target language. In this work, we observe that off-target translation is dominant even in strong multilingual systems, trained on massive multilingual corpora. To address this issue, we propose a joint approach to regularize NMT models at both representation-level and gradient-level. At the representation level, we leverage an auxiliary target language prediction task to regularize decoder outputs to retain information about the target language. At the gradient level, we leverage a small amount of direct data (in thousands of sentence pairs) to regularize model gradients. Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance by +5.59 and +10.38 BLEU on WMT and OPUS datasets respectively. Moreover, experiments show that our method also works well when the small amount of direct data is not available.
    A Strong Baseline for Query Efficient Attacks in a Black Box Setting. (arXiv:2109.04775v1 [cs.CL])
    (2 min) Existing black box search methods have achieved high success rates in generating adversarial attacks against NLP models. However, such search methods are inefficient as they do not consider the number of queries required to generate adversarial attacks. Also, prior attacks do not maintain a consistent search space while comparing different search methods. In this paper, we propose a query efficient attack strategy to generate plausible adversarial examples on text classification and entailment tasks. Our attack jointly leverages the attention mechanism and locality sensitive hashing (LSH) to reduce the query count. We demonstrate the efficacy of our approach by comparing our attack with four baselines across three different search spaces. Further, we benchmark our results across the same search space used in prior attacks. In comparison to previously proposed attacks, on average, we reduce the query count by 75% across all datasets and target models. We also demonstrate that our attack achieves a higher success rate when compared to prior attacks in a limited query setting.
    DIALKI: Knowledge Identification in Conversational Systems through Dialogue-Document Contextualization. (arXiv:2109.04673v1 [cs.CL])
    (2 min) Identifying relevant knowledge to be used in conversational systems that are grounded in long documents is critical to effective response generation. We introduce a knowledge identification model that leverages the document structure to provide dialogue-contextualized passage encodings and better locate knowledge relevant to the conversation. An auxiliary loss captures the history of dialogue-document connections. We demonstrate the effectiveness of our model on two document-grounded conversational datasets and provide analyses showing generalization to unseen documents and long dialogue contexts.
    Model Selection for Cross-Lingual Transfer. (arXiv:2010.06127v2 [cs.CL] UPDATED)
    (2 min) Transformers that are pre-trained on multilingual corpora, such as mBERT and XLM-RoBERTa, have achieved impressive cross-lingual transfer capabilities. In the zero-shot transfer setting, only English training data is used, and the fine-tuned model is evaluated on another target language. While this works surprisingly well, substantial variance has been observed in target language performance between different fine-tuning runs, and in the zero-shot setup, no target-language development data is available to select among multiple fine-tuned models. Prior work has relied on English dev data to select among models that are fine-tuned with different learning rates, numbers of steps, and other hyperparameters, often resulting in suboptimal choices. In this paper, we show that it is possible to select consistently better models when small amounts of annotated data are available in auxiliary pivot languages. We propose a machine learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities. In extensive experiments we find that this method consistently selects better models than English validation data across twenty-five languages (including eight low-resource languages), and often achieves results that are comparable to model selection using target language development data.
    Does It Capture STEL? A Modular, Similarity-based Linguistic Style Evaluation Framework. (arXiv:2109.04817v1 [cs.CL])
    (2 min) Style is an integral part of natural language. However, evaluation methods for style measures are rare, often task-specific and usually do not control for content. We propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL) to test the performance of any model that can compare two sentences on style. We illustrate STEL with two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contrac'tion and numb3r substitution). We find that BERT-based methods outperform simple versions of commonly used style measures like 3-grams, punctuation frequency and LIWC-based approaches. We invite the addition of further tasks and task instances to STEL and hope to facilitate the improvement of style-sensitive measures.
    Neural Machine Translation Quality and Post-Editing Performance. (arXiv:2109.05016v1 [cs.CL])
    (2 min) We test the natural expectation that using MT in professional translation saves human processing time. The last such study was carried out by Sanchez-Torron and Koehn (2016) with phrase-based MT, artificially reducing the translation quality. In contrast, we focus on neural MT (NMT) of high quality, which has become the state-of-the-art approach since then and also got adopted by most translation companies. Through an experimental study involving over 30 professional translators for English -> Czech translation, we examine the relationship between NMT performance and post-editing time and quality. Across all models, we found that better MT systems indeed lead to fewer changes in the sentences in this industry setting. The relation between system quality and post-editing time is however not straightforward and, contrary to the results on phrase-based MT, BLEU is definitely not a stable predictor of the time or final output quality.
    Block Pruning For Faster Transformers. (arXiv:2109.04838v1 [cs.LG])
    (2 min) Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.
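    To make the notion of pruning whole blocks concrete, here is a small sketch that zeroes out low-scoring, fixed-size blocks of a weight matrix by magnitude; the paper instead learns block scores with movement pruning during fine-tuning, so this is only an illustration of the block structure, not the method itself.
```python
import torch

def block_magnitude_prune(weight, block=(32, 32), sparsity=0.5):
    """Structured pruning sketch: partition a weight matrix into fixed-size blocks,
    score each block by its mean absolute value, and zero out the lowest-scoring
    fraction. (Magnitude scoring here is a simple stand-in for learned block scores.)"""
    rows, cols = weight.shape
    br, bc = block
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    scores = blocks.abs().mean(dim=(1, 3))                     # one score per block
    k = int(scores.numel() * sparsity)
    threshold = scores.flatten().kthvalue(k).values
    mask = (scores > threshold).float()[:, None, :, None]
    return (blocks * mask).reshape(rows, cols)

w = torch.randn(768, 768)
pruned = block_magnitude_prune(w)
print((pruned == 0).float().mean())    # roughly half of the entries are zeroed
```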
    BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation. (arXiv:2109.04588v1 [cs.CL])
    (2 min) The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to attempt to incorporate these pre-trained models into neural machine translation (NMT) systems. However, proposed methods for incorporating pre-trained models are non-trivial and mainly focus on BERT, which lacks a comparison of the impact that other pre-trained models may have on translation performance. In this paper, we demonstrate that simply using the output (contextualized embeddings) of a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art translation performance. Moreover, we also propose a stochastic layer selection approach and a concept of dual-directional translation model to ensure the sufficient utilization of contextualized embeddings. In the case of without using back translation, our best models achieve BLEU scores of 30.45 for En->De and 38.61 for De->En on the IWSLT'14 dataset, and 31.26 for En->De and 34.94 for De->En on the WMT'14 dataset, which exceeds all published numbers.
    Genre as Weak Supervision for Cross-lingual Dependency Parsing. (arXiv:2109.04733v1 [cs.CL])
    (2 min) Recent work has shown that monolingual masked language models learn to represent data-driven notions of language variation which can be used for domain-targeted training data selection. Dataset genre labels are already frequently available, yet remain largely unexplored in cross-lingual setups. We harness this genre metadata as a weak supervision signal for targeted data selection in zero-shot dependency parsing. Specifically, we project treebank-level genre information to the finer-grained sentence level, with the goal to amplify information implicitly stored in unsupervised contextualized representations. We demonstrate that genre is recoverable from multilingual contextual embeddings and that it provides an effective signal for training data selection in cross-lingual, zero-shot scenarios. For 12 low-resource language treebanks, six of which are test-only, our genre-specific methods significantly outperform competitive baselines as well as recent embedding-based methods for data selection. Moreover, genre-based data selection provides new state-of-the-art results for three of these target languages.
    Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution. (arXiv:2109.04712v1 [cs.CL])
    (2 min) Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when class distribution is long-tailed. Resampling and re-weighting are common approaches used for addressing the class imbalance problem, however, they are not effective when there is label dependency besides class imbalance because they result in oversampling of common labels. Here, we introduce the application of balancing loss functions for multi-label text classification. We perform experiments on a general domain dataset with 90 labels (Reuters-21578) and a domain-specific dataset from PubMed with 18211 labels. We find that a distribution-balanced loss function, which inherently addresses both the class imbalance and label linkage problems, outperforms commonly used loss functions. Distribution balancing methods have been successfully used in the image recognition field. Here, we show their effectiveness in natural language processing. Source code is available at https://github.com/blessu/BalancedLossNLP.
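    As a minimal illustration of loss re-balancing for long-tailed multi-label data (not the distribution-balanced loss evaluated in the paper), one can up-weight the positive terms of rare labels in a weighted binary cross-entropy:
```python
import torch

def frequency_weighted_bce(logits, targets, label_freq, eps=1e-8):
    """Simple illustration of loss re-weighting for long-tailed multi-label data:
    positive terms for rare labels are up-weighted in inverse proportion to label
    frequency. (This is a plain weighted BCE, not the full distribution-balanced
    loss studied in the paper.)"""
    pos_weight = 1.0 / (label_freq + eps)
    pos_weight = pos_weight / pos_weight.mean()        # normalize around 1
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=pos_weight
    )

logits = torch.randn(4, 90)                            # batch of 4, 90 labels
targets = torch.randint(0, 2, (4, 90)).float()
label_freq = torch.rand(90)                            # fraction of training docs per label
print(frequency_weighted_bce(logits, targets, label_freq))
```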
    Dynamic Terminology Integration for COVID-19 and other Emerging Domains. (arXiv:2109.04708v1 [cs.CL])
    (2 min) The majority of language domains require prudent use of terminology to ensure clarity and adequacy of the information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom available for less-resourced languages and niche domains. Furthermore, as exemplified by COVID-19 recently, no domain-specific parallel data is readily available for emerging domains. However, the gravity of this recent calamity created a high demand for reliable translation of critical information regarding the pandemic and infection prevention. This work is part of WMT2021 Shared Task: Machine Translation using Terminologies, where we describe Tilde MT systems that are capable of dynamic terminology integration at the time of translation. Our systems achieve up to 94% COVID-19 term use accuracy on the test set of the EN-FR language pair without having access to any form of in-domain information during system training. We conclude our work with a broader discussion considering the Shared Task itself and terminology translation in MT.
    Generic resources are what you need: Style transfer tasks without task-specific parallel training data. (arXiv:2109.04543v1 [cs.CL])
    (0 min) Style transfer aims to rewrite a source text in a different target style while preserving its content. We propose a novel approach to this task that leverages generic resources, and without using any task-specific parallel (source-target) data outperforms existing unsupervised approaches on the two most popular style transfer tasks: formality transfer and polarity swap. In practice, we adopt a multi-step procedure which builds on a generic pre-trained sequence-to-sequence model (BART). First, we strengthen the model's ability to rewrite by further pre-training BART on both an existing collection of generic paraphrases, as well as on synthetic pairs created using a general-purpose lexical resource. Second, through an iterative back-translation approach, we train two models, each in a transfer direction, so that they can provide each other with synthetically generated pairs, dynamically in the training process. Lastly, we let our best resulting model generate static synthetic pairs to be used in a supervised training regime. Besides methodology and state-of-the-art results, a core contribution of this work is a reflection on the nature of the two tasks we address, and how their differences are highlighted by their response to our approach.
    Efficient Test Time Adapter Ensembling for Low-resource Language Varieties. (arXiv:2109.04877v1 [cs.CL])
    (2 min) Adapters are light-weight modules that allow parameter-efficient fine-tuning of pretrained models. Specialized language and task adapters have recently been proposed to facilitate cross-lingual transfer of multilingual pretrained models (Pfeiffer et al., 2020b). However, this approach requires training a separate language adapter for every language one wishes to support, which can be impractical for languages with limited data. An intuitive solution is to use a related language adapter for the new language variety, but we observe that this solution can lead to sub-optimal performance. In this paper, we aim to improve the robustness of language adapters to uncovered languages without training new adapters. We find that ensembling multiple existing language adapters makes the fine-tuned model significantly more robust to other language varieties not included in these adapters. Building upon this observation, we propose Entropy Minimized Ensemble of Adapters (EMEA), a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. Experiments on three diverse groups of language varieties show that our method leads to significant improvements on both named entity recognition and part-of-speech tagging across all languages.
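    The per-sentence entropy-minimization step can be sketched as a small optimization over softmax-normalized ensemble weights; the tensor shapes and the adapter logits below are hypothetical, and the real method feeds the learned weights back into the adapter outputs inside the model rather than mixing final logits only.
```python
import torch

def emea_weights(per_adapter_logits, steps=10, lr=0.5):
    """Entropy-minimized ensembling sketch: given the logits each language adapter
    produces for one test sentence (n_adapters, n_tokens, n_labels), learn softmax
    ensemble weights that minimize the entropy of the mixed prediction."""
    alpha = torch.zeros(per_adapter_logits.size(0), requires_grad=True)
    opt = torch.optim.SGD([alpha], lr=lr)
    for _ in range(steps):
        w = torch.softmax(alpha, dim=0)                       # ensemble weights
        probs = torch.einsum("a,atl->tl", w, per_adapter_logits.softmax(dim=-1))
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return torch.softmax(alpha.detach(), dim=0)

# Hypothetical logits from 3 language adapters for a 12-token sentence with 9 NER tags.
logits = torch.randn(3, 12, 9)
print(emea_weights(logits))
```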
    How May I Help You? Using Neural Text Simplification to Improve Downstream NLP Tasks. (arXiv:2109.04604v1 [cs.CL])
    (2 min) The general goal of text simplification (TS) is to reduce text complexity for human consumption. This paper investigates another potential use of neural TS: assisting machines performing natural language processing (NLP) tasks. We evaluate the use of neural TS in two ways: simplifying input texts at prediction time and augmenting data to provide machines with additional information during training. We demonstrate that the latter scenario provides positive effects on machine performance on two separate datasets. In particular, the latter use of TS improves the performances of LSTM (1.82-1.98%) and SpanBERT (0.7-1.3%) extractors on TACRED, a complex, large-scale, real-world relation extraction task. Further, the same setting yields improvements of up to 0.65% matched and 0.62% mismatched accuracies for a BERT text classifier on MNLI, a practical natural language inference dataset.
    Investigating Numeracy Learning Ability of a Text-to-Text Transfer Model. (arXiv:2109.04672v1 [cs.CL])
    (2 min) The transformer-based pre-trained language models have been tremendously successful in most of the conventional NLP tasks. But they often struggle in those tasks where numerical understanding is required. Some possible reasons can be the tokenizers and pre-training objectives which are not specifically designed to learn and preserve numeracy. Here we investigate the ability of text-to-text transfer learning model (T5), which has outperformed its predecessors in the conventional NLP tasks, to learn numeracy. We consider four numeracy tasks: numeration, magnitude order prediction, finding minimum and maximum in a series, and sorting. We find that, although T5 models perform reasonably well in the interpolation setting, they struggle considerably in the extrapolation setting across all four tasks.
    Math Word Problem Generation with Mathematical Consistency and Problem Context Constraints. (arXiv:2109.04546v1 [cs.CL])
    (2 min) We study the problem of generating arithmetic math word problems (MWPs) given a math equation that specifies the mathematical computation and a context that specifies the problem scenario. Existing approaches are prone to generating MWPs that are either mathematically invalid or have unsatisfactory language quality. They also either ignore the context or require manual specification of a problem template, which compromises the diversity of the generated MWPs. In this paper, we develop a novel MWP generation approach that leverages i) pre-trained language models and a context keyword selection model to improve the language quality of the generated MWPs and ii) an equation consistency constraint for math equations to improve the mathematical validity of the generated MWPs. Extensive quantitative and qualitative experiments on three real-world MWP datasets demonstrate the superior performance of our approach compared to various baselines.
    Graph-Based Decoding for Task Oriented Semantic Parsing. (arXiv:2109.04587v1 [cs.CL])
    (2 min) The dominant paradigm for semantic parsing in recent years is to formulate parsing as a sequence-to-sequence task, generating predictions with auto-regressive sequence decoders. In this work, we explore an alternative paradigm. We formulate semantic parsing as a dependency parsing task, applying graph-based decoding techniques developed for syntactic parsing. We compare various decoding techniques given the same pre-trained Transformer encoder on the TOP dataset, including settings where training data is limited or contains only partially-annotated examples. We find that our graph-based approach is competitive with sequence decoders on the standard setting, and offers significant improvements in data efficiency and settings where partially-annotated data is available.
    CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems. (arXiv:2109.04645v1 [cs.CL])
    (2 min) As labeling cost for different modules in task-oriented dialog (ToD) systems is high, a major challenge in practice is to learn different tasks with the least amount of labeled data. Recently, prompting methods over pre-trained language models (PLMs) have shown promising results for few-shot learning in ToD. To better utilize the power of PLMs, this paper proposes Comprehensive Instruction (CINS) that exploits PLMs with extra task-specific instructions. We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD, i.e., intent classification, dialog state tracking, and natural language generation. A sequence-to-sequence model (T5) is adopted to solve these three tasks in a unified framework. Extensive experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data. Empirical results demonstrate that the proposed CINS approach consistently improves techniques that finetune PLMs with raw input or short prompts.
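    The instruction format below is a hypothetical mock-up (the schema fields follow the abstract, but the wording and the t5-small checkpoint are stand-ins) showing how a definition, constraint, and prompt can be concatenated with the input and fed to a sequence-to-sequence model.

        from transformers import T5ForConditionalGeneration, T5TokenizerFast

        tokenizer = T5TokenizerFast.from_pretrained("t5-small")
        model = T5ForConditionalGeneration.from_pretrained("t5-small")

        # Illustrative instruction pieces; not the paper's exact schema text.
        definition = "Intent classification: decide what the user wants to do."
        constraint = "Answer with one intent from: book_hotel, find_restaurant, ask_weather."
        prompt = "What is the intent of the utterance?"
        utterance = "I need a table for two somewhere cheap tonight."

        text = f"{definition} {constraint} {prompt} Utterance: {utterance}"
        inputs = tokenizer(text, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=8)
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))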
    Subword Mapping and Anchoring across Languages. (arXiv:2109.04556v1 [cs.CL])
    (2 min) State-of-the-art multilingual systems rely on shared vocabularies that sufficiently cover all considered languages. To this end, a simple and frequently used approach makes use of subword vocabularies constructed jointly over several languages. We hypothesize that such vocabularies are suboptimal due to false positives (identical subwords with different meanings across languages) and false negatives (different subwords with similar meanings). To address these issues, we propose Subword Mapping and Anchoring across Languages (SMALA), a method to construct bilingual subword vocabularies. SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique and uses them to create cross-lingual anchors based on subword similarities. We demonstrate the benefits of SMALA for cross-lingual natural language inference (XNLI), where it improves zero-shot transfer to an unseen language without task-specific data, but only by sharing subword embeddings. Moreover, in neural machine translation, we show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
    ReasonBERT: Pre-trained to Reason with Distant Supervision. (arXiv:2109.04912v1 [cs.CL])
    (2 min) We present ReasonBert, a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid contexts. Unlike existing pre-training methods that only harvest learning signals from local contexts of naturally occurring texts, we propose a generalized notion of distant supervision to automatically connect multiple pieces of text and tables to create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. We conduct a comprehensive evaluation on a variety of extractive question answering datasets ranging from single-hop to multi-hop and from text-only to table-only to hybrid that require various reasoning capabilities and show that ReasonBert achieves remarkable improvement over an array of strong baselines. Few-shot experiments further demonstrate that our pre-training method substantially improves sample efficiency.
    Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach. (arXiv:2109.04513v1 [cs.CL])
    (2 min) We present models which complete missing text given transliterations of ancient Mesopotamian documents, originally written on cuneiform clay tablets (2500 BCE - 100 CE). Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts in the text in a subjective and time-consuming process. We identify that this challenge can be formulated as a masked language modelling task, used mostly as a pretraining objective for contextualized language models. Building on this formulation, we develop several architectures focusing on the Akkadian language, the lingua franca of the time. We find that despite data scarcity (1M tokens) we can achieve state-of-the-art performance on missing tokens prediction (89% hit@5) using a greedy decoding scheme and pretraining on data from other languages and different time periods. Finally, we conduct human evaluations showing the applicability of our models in assisting experts to transcribe texts in extinct languages.
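    The task formulation itself is easy to reproduce with a generic fill-mask pipeline; the multilingual BERT checkpoint and the English placeholder sentence below are stand-ins for the Akkadian-focused models and transliterations used in the paper.

        from transformers import pipeline

        # Generic masked-language-modelling demo; not the paper's Akkadian models.
        fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
        line = "The king of the land [MASK] the temple."
        for candidate in fill(line, top_k=5):       # hit@5-style inspection
            print(candidate["token_str"], round(candidate["score"], 3))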
    Document-level Entity-based Extraction as Template Generation. (arXiv:2109.04901v1 [cs.CL])
    (2 min) Document-level entity-based extraction (EE), aiming at extracting entity-centric information such as entity roles and entity relations, is key to automatic knowledge acquisition from text corpora for various domains. Most document-level EE systems build extractive models, which struggle to model long-term dependencies among entities at the document level. To address this issue, we propose a generative framework for two document-level EE tasks: role-filler entity extraction (REE) and relation extraction (RE). We first formulate them as a template generation problem, allowing models to efficiently capture cross-entity dependencies, exploit label semantics, and avoid the exponential computation complexity of identifying N-ary relations. A novel cross-attention guided copy mechanism, TopK Copy, is incorporated into a pre-trained sequence-to-sequence model to enhance the capabilities of identifying key information in the input document. Experiments on the MUC-4 and SciREX datasets show new state-of-the-art results on REE (+3.26%), binary RE (+4.8%), and 4-ary RE (+2.7%) in F1 score.
    A Large-Scale Study of Machine Translation in the Turkic Languages. (arXiv:2109.04593v1 [cs.CL])
    (2 min) Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including, i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 2 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public.
    RoR: Read-over-Read for Long Document Machine Reading Comprehension. (arXiv:2109.04780v1 [cs.CL])
    (2 min) Transformer-based pre-trained models, such as BERT, have achieved remarkable results on machine reading comprehension. However, due to the constraint of encoding length (e.g., 512 WordPiece tokens), a long document is usually split into multiple chunks that are independently read. It results in the reading field being limited to individual chunks without information collaboration for long document machine reading comprehension. To address this problem, we propose RoR, a read-over-read method, which expands the reading field from chunk to document. Specifically, RoR includes a chunk reader and a document reader. The former first predicts a set of regional answers for each chunk, which are then compacted into a highly-condensed version of the original document, guaranteeing to be encoded once. The latter further predicts the global answers from this condensed document. Eventually, a voting strategy is utilized to aggregate and rerank the regional and global answers for final prediction. Extensive experiments on two benchmarks QuAC and TriviaQA demonstrate the effectiveness of RoR for long document reading. Notably, RoR ranks 1st place on the QuAC leaderboard (https://quac.ai/) at the time of submission (May 17th, 2021).
    Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation. (arXiv:2109.04653v1 [cs.CL])
    (2 min) Pre-trained language-vision models have shown remarkable performance on the visual question answering (VQA) task. However, most pre-trained models are trained by only considering monolingual learning, especially for resource-rich languages like English. Training such models for multilingual setups demands high computing resources and a multilingual language-vision dataset, which hinders their application in practice. To alleviate these challenges, we propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student). Unlike the existing knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model learns and imitates the teacher from multiple intermediate layers (language and vision encoders) with appropriately designed distillation objectives for incremental knowledge extraction. We also create a large-scale multilingual and code-mixed VQA dataset covering eleven different language setups across multiple Indian and European languages. Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over the pre-trained language-vision models on eleven diverse language setups.
    Identifying Morality Frames in Political Tweets using Relational Learning. (arXiv:2109.04535v1 [cs.CL])
    (2 min) Extracting moral sentiment from text is a vital component in understanding public opinion, social movements, and policy decisions. The Moral Foundation Theory identifies five moral foundations, each associated with a positive and negative polarity. However, moral sentiment is often motivated by its targets, which can correspond to individuals or collective entities. In this paper, we introduce morality frames, a representation framework for organizing moral attitudes directed at different entities, and come up with a novel and high-quality annotated dataset of tweets written by US politicians. Then, we propose a relational learning model to predict moral attitudes towards entities and moral foundations jointly. We do qualitative and quantitative evaluations, showing that moral sentiment towards entities differs highly across political ideologies.
    SeDyT: A General Framework for Multi-Step Event Forecasting via Sequence Modeling on Dynamic Entity Embeddings. (arXiv:2109.04550v1 [cs.LG])
    (2 min) Temporal Knowledge Graphs store events in the form of subjects, relations, objects, and timestamps which are often represented by dynamic heterogeneous graphs. Event forecasting is a critical and challenging task in Temporal Knowledge Graph reasoning that predicts the subject or object of an event in the future. To obtain temporal embeddings multi-step away in the future, existing methods learn generative models that capture the joint distribution of the observed events. To reduce the high computation costs, these methods rely on unrealistic assumptions of independence and approximations in training and inference. In this work, we propose SeDyT, a discriminative framework that performs sequence modeling on the dynamic entity embeddings to solve the multi-step event forecasting problem. SeDyT consists of two components: a Temporal Graph Neural Network that generates dynamic entity embeddings in the past and a sequence model that predicts the entity embeddings in the future. Compared with the generative models, SeDyT does not rely on any heuristic-based probability model and has low computation complexity in both training and inference. SeDyT is compatible with most Temporal Graph Neural Networks and sequence models. We also design an efficient training method that trains the two components in one gradient descent propagation. We evaluate the performance of SeDyT on five popular datasets. By combining temporal Graph Neural Network models and sequence models, SeDyT achieves an average of 2.4% MRR improvement when not using the validation set and more than 10% MRR improvement when using the validation set.
    Exophoric Pronoun Resolution in Dialogues with Topic Regularization. (arXiv:2109.04787v1 [cs.CL])
    (2 min) Resolving pronouns to their referents has long been studied as a fundamental natural language understanding problem. Previous works on pronoun coreference resolution (PCR) mostly focus on resolving pronouns to mentions in text while ignoring the exophoric scenario. Exophoric pronouns are common in daily communications, where speakers may directly use pronouns to refer to some objects present in the environment without introducing the objects first. Although such objects are not mentioned in the dialogue text, they can often be disambiguated by the general topics of the dialogue. Motivated by this, we propose to jointly leverage the local context and global topics of dialogues to solve the out-of-text PCR problem. Extensive experiments demonstrate the effectiveness of adding topic regularization for resolving exophoric pronouns.
    Dimensional Emotion Detection from Categorical Emotion. (arXiv:1911.02499v2 [cs.CL] UPDATED)
    (2 min) We present a model to predict fine-grained emotions along the continuous dimensions of valence, arousal, and dominance (VAD) with a corpus with categorical emotion annotations. Our model is trained by minimizing the EMD (Earth Mover's Distance) loss between the predicted VAD score distribution and the categorical emotion distributions sorted along VAD, and it can simultaneously classify the emotion categories and predict the VAD scores for a given sentence. We use pre-trained RoBERTa-Large and fine-tune on three different corpora with categorical labels and evaluate on the EmoBank corpus with VAD scores. We show that our approach reaches comparable performance to that of the state-of-the-art classifiers in categorical emotion classification and shows significant positive correlations with the ground truth VAD scores. Also, further training with supervision of VAD labels leads to improved performance, especially when the dataset is small. We also present examples of predictions of appropriate emotion words that are not part of the original annotations.
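    For categories sorted along a single VAD dimension, the Earth Mover's Distance between two discrete distributions reduces to the L1 distance between their cumulative distributions, which gives a compact sketch of the loss (batch size and class count below are illustrative):

        import torch

        def emd_loss(pred_probs, target_probs):
            # Both tensors: (batch, n_categories); rows sum to 1 and categories are
            # assumed to be pre-sorted along the VAD dimension of interest.
            pred_cdf = torch.cumsum(pred_probs, dim=-1)
            target_cdf = torch.cumsum(target_probs, dim=-1)
            return (pred_cdf - target_cdf).abs().sum(dim=-1).mean()

        pred = torch.softmax(torch.randn(4, 6), dim=-1)     # e.g. 6 emotion classes
        target = torch.softmax(torch.randn(4, 6), dim=-1)
        print(emd_loss(pred, target))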
    Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars. (arXiv:2109.04939v1 [cs.CL])
    (2 min) In computational linguistics, it has been shown that hierarchical structures make language models (LMs) more human-like. However, the previous literature has been agnostic about a parsing strategy of the hierarchical models. In this paper, we investigated whether hierarchical structures make LMs more human-like, and if so, which parsing strategy is most cognitively plausible. In order to address this question, we evaluated three LMs against human reading times in Japanese with head-final left-branching structures: Long Short-Term Memory (LSTM) as a sequential model and Recurrent Neural Network Grammars (RNNGs) with top-down and left-corner parsing strategies as hierarchical models. Our computational modeling demonstrated that left-corner RNNGs outperformed top-down RNNGs and LSTM, suggesting that hierarchical and left-corner architectures are more cognitively plausible than top-down or sequential architectures. In addition, the relationships between the cognitive plausibility and (i) perplexity, (ii) parsing, and (iii) beam size will also be discussed.
    Few-Shot Keyword Spotting in Any Language. (arXiv:2104.01454v4 [cs.CL] UPDATED)
    (2 min) We introduce a few-shot transfer learning method for keyword spotting in any language. Leveraging open speech corpora in nine languages, we automate the extraction of a large multilingual keyword bank and use it to train an embedding model. With just five training examples, we fine-tune the embedding model for keyword spotting and achieve an average F1 score of 0.75 on keyword classification for 180 new keywords unseen by the embedding model in these nine languages. This embedding model also generalizes to new languages. We achieve an average F1 score of 0.65 on 5-shot models for 260 keywords sampled across 13 new languages unseen by the embedding model. We investigate streaming accuracy for our 5-shot models in two contexts: keyword spotting and keyword search. Across 440 keywords in 22 languages, we achieve an average streaming keyword spotting accuracy of 87.4% with a false acceptance rate of 4.3%, and observe promising initial results on keyword search.
    Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition. (arXiv:2109.04783v1 [cs.SD])
    (2 min) When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% WERR compared to a state-of-the-art fixed beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection between the SACC and traditional beamformers, and analyze the intermediate outputs of the SACC.
    Artificial Text Detection via Examining the Topology of Attention Maps. (arXiv:2109.04825v1 [cs.CL])
    (2 min) The impressive capabilities of recent generative models to create texts that are challenging to distinguish from human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA), which is currently understudied in the field of NLP. We empirically show that the features derived from the BERT model outperform count- and neural-based baselines by up to 10% on three common datasets, and tend to be the most robust towards unseen GPT-style generation models as opposed to existing methods. The probing analysis of the features reveals their sensitivity to the surface and syntactic properties. The results demonstrate that TDA is a promising direction for NLP tasks, specifically those that incorporate surface and structural information.
    WHOSe Heritage: Classification of UNESCO World Heritage "Outstanding Universal Value" Documents with Soft Labels. (arXiv:2104.05547v2 [cs.CL] UPDATED)
    (2 min) The UNESCO World Heritage List (WHL) includes the exceptionally valuable cultural and natural heritage to be preserved for mankind. Evaluating and justifying the Outstanding Universal Value (OUV) is essential for each site inscribed in the WHL, and yet a complex task, even for experts, since the selection criteria of OUV are not mutually exclusive. Furthermore, manual annotation of heritage values and attributes from multi-source textual data, which is currently dominant in heritage studies, is knowledge-demanding and time-consuming, impeding systematic analysis of such authoritative documents in terms of their implications on heritage management. This study applies state-of-the-art NLP models to build a classifier on a new dataset containing Statements of OUV, seeking an explainable and scalable automation tool to facilitate the nomination, evaluation, research, and monitoring processes of World Heritage sites. Label smoothing is innovatively adapted to improve the model performance by adding prior inter-class relationship knowledge to generate soft labels. The study shows that the best models fine-tuned from BERT and ULMFiT can reach 94.3% top-3 accuracy. A human study with expert evaluation on the model prediction shows that the models are sufficiently generalizable. The approach shows promise for further development and application in heritage research and practice.
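    The soft-label idea can be sketched as mixing each one-hot label with a row of a prior inter-class similarity matrix instead of with a uniform distribution; the similarity values below are invented for illustration and are not the study's actual prior.

        import numpy as np

        def soft_labels(y, similarity, eps=0.1):
            # y: integer class indices, shape (n,); similarity: (C, C) nonnegative prior.
            sim = similarity / similarity.sum(axis=1, keepdims=True)   # row-normalize
            one_hot = np.eye(similarity.shape[0])[y]
            return (1.0 - eps) * one_hot + eps * sim[y]

        similarity = np.array([[1.0, 0.6, 0.1],
                               [0.6, 1.0, 0.3],
                               [0.1, 0.3, 1.0]])    # e.g. overlap between criteria
        print(soft_labels(np.array([0, 2]), similarity))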
    Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning. (arXiv:2109.04689v1 [cs.CL])
    (2 min) Motivated by suggested question generation in conversational news recommendation systems, we propose a model for generating question-answer pairs (QA pairs) with self-contained, summary-centric questions and length-constrained, article-summarizing answers. We begin by collecting a new dataset of news articles with questions as titles and pairing them with summaries of varying length. This dataset is used to learn a QA pair generation model producing summaries as answers that balance brevity with sufficiency jointly with their corresponding questions. We then reinforce the QA pair generation process with a differentiable reward function to mitigate exposure bias, a common problem in natural language generation. Both automatic metrics and human evaluation demonstrate these QA pairs successfully capture the central gists of the articles and achieve high answer accuracy.
  • cs.CV updates on arXiv.org

    Temporally Coherent Person Matting Trained on Fake-Motion Dataset. (arXiv:2109.04843v1 [cs.CV])
    (2 min) We propose a novel neural-network-based method to perform matting of videos depicting people that does not require additional user input such as trimaps. Our architecture achieves temporal stability of the resulting alpha mattes by using motion-estimation-based smoothing of image-segmentation algorithm outputs, combined with convolutional-LSTM modules on U-Net skip connections. We also propose a fake-motion algorithm that generates training clips for the video-matting network given photos with ground-truth alpha mattes and background videos. We apply random motion to photos and their mattes to simulate movement one would find in real videos and composite the result with the background clips. It lets us train a deep neural network operating on videos in an absence of a large annotated video dataset and provides ground-truth training-clip foreground optical flow for use in loss functions.
    Better Self-training for Image Classification through Self-supervision. (arXiv:2109.00778v2 [cs.CV] UPDATED)
    (2 min) Self-training is a simple semi-supervised learning approach: Unlabelled examples that attract high-confidence predictions are labelled with their predictions and added to the training set, with this process being repeated multiple times. Recently, self-supervision -- learning without manual supervision by solving an automatically-generated pretext task -- has gained prominence in deep learning. This paper investigates three different ways of incorporating self-supervision into self-training to improve accuracy in image classification: self-supervision as pretraining only, self-supervision performed exclusively in the first iteration of self-training, and self-supervision added to every iteration of self-training. Empirical results on the SVHN, CIFAR-10, and PlantVillage datasets, using both training from scratch, and Imagenet-pretrained weights, show that applying self-supervision only in the first iteration of self-training can greatly improve accuracy, for a modest increase in computation time.
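    The self-training loop itself is simple; the sketch below uses a logistic-regression classifier and synthetic data as stand-ins for the image models and datasets above, and omits the self-supervised pretraining step.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def self_train(X_lab, y_lab, X_unlab, rounds=3, threshold=0.9):
            X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
            for _ in range(rounds):
                clf = LogisticRegression(max_iter=1000).fit(X, y)
                if len(pool) == 0:
                    break
                proba = clf.predict_proba(pool)
                conf = proba.max(axis=1)
                pred = clf.classes_[proba.argmax(axis=1)]
                keep = conf >= threshold                 # high-confidence predictions
                X = np.vstack([X, pool[keep]])           # add pseudo-labelled examples
                y = np.concatenate([y, pred[keep]])
                pool = pool[~keep]
            return clf

        rng = np.random.default_rng(0)
        X_lab = rng.normal(size=(20, 5))
        y_lab = (X_lab[:, 0] > 0).astype(int)
        model = self_train(X_lab, y_lab, rng.normal(size=(200, 5)))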
    VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. (arXiv:2103.16874v2 [cs.CV] UPDATED)
    (2 min) The task of image-based virtual try-on aims to transfer a target clothing item onto the corresponding region of a person, which is commonly tackled by fitting the item to the desired body part and fusing the warped item with the person. While an increasing number of studies have been conducted, the resolution of synthesized images is still limited to low (e.g., 256x192), which acts as the critical limitation against satisfying online consumers. We argue that the limitation stems from several challenges: as the resolution increases, the artifacts in the misaligned areas between the warped clothes and the desired clothing regions become noticeable in the final results; the architectures used in existing methods have low performance in generating high-quality body parts and maintaining the texture sharpness of the clothes. To address the challenges, we propose a novel virtual try-on method called VITON-HD that successfully synthesizes 1024x768 virtual try-on images. Specifically, we first prepare the segmentation map to guide our virtual try-on synthesis, and then roughly fit the target clothing item to a given person's body. Next, we propose ALIgnment-Aware Segment (ALIAS) normalization and ALIAS generator to handle the misaligned areas and preserve the details of 1024x768 inputs. Through rigorous comparison with existing methods, we demonstrate that VITON-HD highly surpasses the baselines in terms of synthesized image quality both qualitatively and quantitatively. Code is available at https://github.com/shadow2496/VITON-HD.
    Web image search engine based on LSH index and CNN Resnet50. (arXiv:2108.13301v1 [cs.IR] CROSS LISTED)
    (2 min) To implement a good Content Based Image Retrieval (CBIR) system, it is essential to adopt efficient search methods. One way to achieve this is by exploiting approximate search techniques. In fact, when we deal with very large collections of data, using an exact search method makes the system very slow. In this project, we adopt the Locality Sensitive Hashing (LSH) index to implement a CBIR system that allows us to perform fast similarity search on deep features. Specifically, we exploit transfer learning techniques to extract deep features from images; this phase is done using two well-known Convolutional Neural Networks (CNNs) as feature extractors: Resnet50 and Resnet50v2, both pre-trained on ImageNet. Then we try out several fully connected deep neural networks, built on top of both of the previously mentioned CNNs, in order to fine-tune them on our dataset. In both cases, we index the features within our LSH index implementation and within a sequential scan, to better understand how much the introduction of the index affects the results. Finally, we carry out a performance analysis: we evaluate the relevance of the result set, computing the mAP (mean Average Precision) value obtained during the different experiments with respect to the number of comparisons performed, while varying the hyper-parameter values of the LSH index.
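    A minimal version of the indexing step can be written with random-hyperplane hashing over pre-extracted features (the Resnet50 feature extraction is omitted and the 2048-dimensional vectors below are random stand-ins); only vectors falling in the query's bucket are compared exactly.

        import numpy as np
        from collections import defaultdict

        class HyperplaneLSH:
            def __init__(self, dim, n_bits=16, seed=0):
                self.planes = np.random.default_rng(seed).normal(size=(n_bits, dim))
                self.buckets = defaultdict(list)

            def _key(self, v):
                return tuple((self.planes @ v > 0).astype(int))

            def add(self, idx, v):
                self.buckets[self._key(v)].append((idx, v))

            def query(self, v, k=5):
                candidates = self.buckets.get(self._key(v), [])
                candidates.sort(key=lambda p: np.linalg.norm(p[1] - v))  # exact re-rank
                return [idx for idx, _ in candidates[:k]]

        features = np.random.default_rng(1).normal(size=(1000, 2048))    # stand-ins
        index = HyperplaneLSH(dim=2048)
        for i, f in enumerate(features):
            index.add(i, f)
        print(index.query(features[0]))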
    MobileStyleGAN: A Lightweight Convolutional Neural Network for High-Fidelity Image Synthesis. (arXiv:2104.04767v2 [cs.CV] UPDATED)
    (2 min) In recent years, the use of Generative Adversarial Networks (GANs) has become very popular in generative image modeling. While style-based GAN architectures yield state-of-the-art results in high-fidelity image synthesis, computationally, they are highly complex. In our work, we focus on the performance optimization of style-based generative models. We analyze the most computationally hard parts of StyleGAN2, and propose changes in the generator network to make it possible to deploy style-based generative networks in the edge devices. We introduce MobileStyleGAN architecture, which has x3.5 fewer parameters and is x9.5 less computationally complex than StyleGAN2, while providing comparable quality.
    Saliency Guided Experience Packing for Replay in Continual Learning. (arXiv:2109.04954v1 [cs.LG])
    (2 min) Artificial learning systems aspire to mimic human intelligence by continually learning from a stream of tasks without forgetting past knowledge. One way to enable such learning is to store past experiences in the form of input examples in episodic memory and replay them when learning new tasks. However, the performance of such methods suffers as the size of the memory becomes smaller. In this paper, we propose a new approach for experience replay, where we select the past experiences by looking at the saliency maps which provide visual explanations for the model's decision. Guided by these saliency maps, we pack the memory with only the parts or patches of the input images important for the model's prediction. While learning a new task, we replay these memory patches with appropriate zero-padding to remind the model about its past decisions. We evaluate our algorithm on diverse image classification datasets and report better performance than the state-of-the-art approaches. With qualitative and quantitative analyses we show that our method captures a richer summary of past experiences without any memory increase, and hence performs well with small episodic memory.
    Efficiently Identifying Task Groupings for Multi-Task Learning. (arXiv:2109.04617v1 [cs.LG])
    (2 min) Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naively training all tasks together in one model often degrades performance, and exhaustively searching through combinations of task groupings can be prohibitively expensive. As a result, efficiently identifying the tasks that would benefit from co-training remains a challenging design question without a clear solution. In this paper, we suggest an approach to select which tasks should train together in multi-task learning models. Our method determines task groupings in a single training run by co-training all tasks together and quantifying the extent to which one task's gradient would affect another task's loss. On the large-scale Taskonomy computer vision dataset, we find this method can decrease test loss by 10.0% compared to simply training all tasks together while operating 11.6 times faster than a state-of-the-art task grouping method.
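    The gradient-based grouping signal can be sketched as a one-step lookahead: update a copy of the shared parameters with task i's gradient and check how task j's loss changes. The tiny linear model and toy losses below are placeholders, not the paper's setup.

        import copy
        import torch

        def affinity(model, loss_i, loss_j, batch, lr=0.01):
            before = loss_j(model, batch).item()
            lookahead = copy.deepcopy(model)
            grads = torch.autograd.grad(loss_i(lookahead, batch), lookahead.parameters())
            with torch.no_grad():
                for p, g in zip(lookahead.parameters(), grads):
                    p -= lr * g                      # one-step lookahead update
            after = loss_j(lookahead, batch).item()
            return 1.0 - after / before              # > 0: task i's step helped task j

        shared = torch.nn.Linear(8, 2)               # stand-in shared encoder
        batch = torch.randn(16, 8)
        loss_a = lambda m, b: m(b).pow(2).mean()
        loss_b = lambda m, b: m(b).abs().mean()
        print(affinity(shared, loss_a, loss_b, batch))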
    AAformer: Auto-Aligned Transformer for Person Re-Identification. (arXiv:2104.00921v2 [cs.CV] UPDATED)
    (2 min) In person re-identification, extracting part-level features from person images has been verified to be crucial. Most of existing CNN-based methods only locate the human parts coarsely, or rely on pre-trained human parsing models and fail in locating the identifiable non-human parts (e.g., knapsack). In this paper, we introduce an alignment scheme in Transformer architecture for the first time and propose the Auto-Aligned Transformer (AAformer) to automatically locate both the human parts and non-human ones at patch-level. We introduce the "part tokens", which are learnable vectors, to extract part features in Transformer. A part token only interacts with a local subset of patches in self-attention and learns to be the part representation. To adaptively group the image patches into different subsets, we design the Auto-Alignment. Auto-Alignment employs a fast variant of Optimal Transport algorithm to online cluster the patch embeddings into several groups with the part tokens as their prototypes. We harmoniously integrate the part alignment into the self-attention and the output part tokens can be directly used for retrieval. Extensive experiments validate the effectiveness of part tokens and the superiority of AAformer over various state-of-the-art methods.
    Detection of GAN-synthesized street videos. (arXiv:2109.04991v1 [cs.CV])
    (2 min) Research on the detection of AI-generated videos has focused almost exclusively on face videos, usually referred to as deepfakes. Manipulations like face swapping, face reenactment and expression manipulation have been the subject of an intense research with the development of a number of efficient tools to distinguish artificial videos from genuine ones. Much less attention has been paid to the detection of artificial non-facial videos. Yet, new tools for the generation of such kind of videos are being developed at a fast pace and will soon reach the quality level of deepfake videos. The goal of this paper is to investigate the detectability of a new kind of AI-generated videos framing driving street sequences (here referred to as DeepStreets videos), which, by their nature, can not be analysed with the same tools used for facial deepfakes. Specifically, we present a simple frame-based detector, achieving very good performance on state-of-the-art DeepStreets videos generated by the Vid2vid architecture. Noticeably, the detector retains very good performance on compressed videos, even when the compression level used during training does not match that used for the test videos.
    Object recognition for robotics from tactile time series data utilising different neural network architectures. (arXiv:2109.04573v1 [cs.RO])
    (2 min) Robots need to exploit high-quality information on grasped objects to interact with the physical environment. Haptic data can therefore be used for supplementing the visual modality. This paper investigates the use of Convolutional Neural Networks (CNN) and Long-Short Term Memory (LSTM) neural network architectures for object classification on Spatio-temporal tactile grasping data. Furthermore, we compared these methods using data from two different fingertip sensors (namely the BioTac SP and WTS-FT) in the same physical setup, allowing for a realistic comparison across methods and sensors for the same tactile object classification dataset. Additionally, we propose a way to create more training examples from the recorded data. The results show that the proposed method improves the maximum accuracy from 82.4% (BioTac SP fingertips) and 90.7% (WTS-FT fingertips) with complete time-series data to about 94% for both sensor types.
    Nerfies: Deformable Neural Radiance Fields. (arXiv:2011.12948v5 [cs.CV] UPDATED)
    (2 min) We present the first method capable of photorealistically reconstructing deformable scenes using photos/videos captured casually from mobile phones. Our approach augments neural radiance fields (NeRF) by optimizing an additional continuous volumetric deformation field that warps each observed point into a canonical 5D NeRF. We observe that these NeRF-like deformation fields are prone to local minima, and propose a coarse-to-fine optimization method for coordinate-based models that allows for more robust optimization. By adapting principles from geometry processing and physical simulation to NeRF-like models, we propose an elastic regularization of the deformation field that further improves robustness. We show that our method can turn casually captured selfie photos/videos into deformable NeRF models that allow for photorealistic renderings of the subject from arbitrary viewpoints, which we dub "nerfies." We evaluate our method by collecting time-synchronized data using a rig with two mobile phones, yielding train/validation images of the same pose at different viewpoints. We show that our method faithfully reconstructs non-rigidly deforming scenes and reproduces unseen views with high fidelity.
    Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding. (arXiv:2109.04872v1 [cs.CV])
    (2 min) Temporal grounding aims to temporally localize a video moment in the video whose semantics are related to a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation with a focus on designing complicated heads and fusion strategies. Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Dual Matching Network (DMN), to directly model the relations between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs from a dual matching scheme and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of two modalities via cross-modal pair discrimination to maximize their mutual information. Experiments show that DMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on DMN, we present a winner solution for STVG challenge of the 3rd PIC workshop. This suggests that metric-learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation in a joint embedding space.
    MEAL: Manifold Embedding-based Active Learning. (arXiv:2106.11858v3 [cs.CV] UPDATED)
    (2 min) Image segmentation is a common and challenging task in autonomous driving. Availability of sufficient pixel-level annotations for the training data is a hurdle. Active learning helps learning from small amounts of data by suggesting the most promising samples for labeling. In this work, we propose a new pool-based method for active learning, which proposes promising patches extracted from the full image in each acquisition step. The problem is framed in an exploration-exploitation framework by combining an embedding based on Uniform Manifold Approximation to model representativeness with entropy as an uncertainty measure to model informativeness. We applied our proposed method to the autonomous driving datasets CamVid and Cityscapes and performed a quantitative comparison with state-of-the-art baselines. We find that our active learning method achieves better performance compared to previous methods.
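    A generic version of such an exploration-exploitation acquisition score combines prediction entropy with a density-based representativeness term; the sketch below uses a plain k-nearest-neighbour density on random 2-D embeddings rather than the UMAP embedding used in the paper.

        import numpy as np

        def acquisition_scores(probs, embeddings, alpha=0.5, k=10):
            # probs: (n, C) predicted class probabilities; embeddings: (n, d).
            logp = np.log(np.clip(probs, 1e-9, None))
            entropy = -(probs * logp).sum(axis=1)                     # informativeness
            dists = np.linalg.norm(embeddings[:, None] - embeddings[None], axis=-1)
            knn_dist = np.sort(dists, axis=1)[:, 1:k + 1].mean(axis=1)
            density = 1.0 / (knn_dist + 1e-9)                         # representativeness
            norm = lambda v: (v - v.min()) / (np.ptp(v) + 1e-9)
            return alpha * norm(entropy) + (1 - alpha) * norm(density)

        rng = np.random.default_rng(0)
        probs = rng.dirichlet(np.ones(5), size=200)
        scores = acquisition_scores(probs, rng.normal(size=(200, 2)))
        print(scores.argsort()[-8:])     # indices of the most promising candidates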
    Resolving gas bubbles ascending in liquid metal from low-SNR neutron radiography images. (arXiv:2109.04883v1 [physics.flu-dyn])
    (2 min) We demonstrate a new image processing methodology for resolving gas bubbles travelling through liquid metal from dynamic neutron radiography images with intrinsically low signal-to-noise ratio. Image pre-processing, denoising and bubble segmentation are described in detail, with practical recommendations. Experimental validation is presented - stationary and moving reference bodies with neutron-transparent cavities are radiographed with imaging conditions similar to the cases with bubbles in liquid metal. The new methods are applied to our experimental data from previous and recent imaging campaigns, and the performance of the methods proposed in this paper is compared against our previously developed methods. Significant improvements are observed as well as the capacity to reliably extract physically meaningful information from measurements performed under highly adverse imaging conditions. The showcased image processing solution and separate elements thereof are readily extendable beyond the present application, and have been made open-source.
    What Makes for Hierarchical Vision Transformer?. (arXiv:2107.02174v2 [cs.CV] UPDATED)
    (2 min) Recent studies indicate that hierarchical Vision Transformer with a macro architecture of interleaved non-overlapped window-based self-attention & shifted-window operation is able to achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutional neural networks (CNNs) using densely slid kernels. Most follow-up works attempt to replace the shifted-window operation with other kinds of cross-window communication paradigms, while treating self-attention as the de-facto standard for window-based information aggregation. In this manuscript, we question whether self-attention is the only choice for hierarchical Vision Transformer to attain strong performance, and the effects of different kinds of cross-window communication. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture, termed LinMapper, can achieve very strong performance in ImageNet-1k image recognition. Moreover, we find that LinMapper is able to better leverage the pre-trained representations from image recognition and demonstrates excellent transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches, which all give similar competitive results. Our study reveals that the macro architecture of Swin model families, other than specific aggregation layers or specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to the ubiquitous CNN's dense sliding window paradigm. Code and models will be publicly available to facilitate future research.
    LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation. (arXiv:2109.04993v1 [cs.CV])
    (2 min) Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. The transformer-based models learn inter- and intra-modal attention through a list of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks, GAN-based image synthesis and image captioning. We also propose a new evaluation metric measuring the similarity between the learnt visual and textual embedding. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.
    Multi-Features Guidance Network for partial-to-partial point cloud registration. (arXiv:2011.12079v2 [cs.CV] UPDATED)
    (2 min) To eliminate the problems of large dimensional differences, big semantic gap, and mutual interference caused by hybrid features, in this paper, we propose a novel Multi-Features Guidance Network for partial-to-partial point cloud registration (MFG). The proposed network mainly includes four parts: keypoints' feature extraction, correspondences searching, correspondences credibility computation, and SVD, among which correspondences searching and correspondence credibility computation are the cores of the network. Unlike the previous work, we utilize the shape features and the spatial coordinates to guide correspondence search independently and fuse the matching results to obtain the final matching matrix. In the correspondences credibility computation module, based on the conflicted relationship between the features matching matrix and the coordinates matching matrix, we score the reliability for each correspondence, which can reduce the impact of mismatched or non-matched points. Experimental results show that our network outperforms the current state-of-the-art while maintaining computational efficiency.
    FineAction: A Fine-Grained Video Dataset for Temporal Action Localization. (arXiv:2105.11107v2 [cs.CV] UPDATED)
    (2 min) Temporal action localization (TAL) is an important and challenging problem in video understanding. However, most existing TAL benchmarks are built upon the coarse granularity of action classes, which exhibits two major limitations in this task. First, coarse-level actions can make the localization models overfit in high-level context information, and ignore the atomic action details in the video. Second, the coarse action classes often lead to the ambiguous annotations of temporal boundaries, which are inappropriate for temporal action localization. To tackle these problems, we develop a novel large-scale and fine-grained video dataset, coined as FineAction, for temporal action localization. In total, FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges for temporal action localization, thanks to its distinct characteristics of fine action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes. To benchmark FineAction, we systematically investigate the performance of several popular temporal localization methods on it, and deeply analyze the influence of short-duration and fine-grained instances in temporal action localization. We believe that FineAction can advance research of temporal action localization and beyond.
    Visual Goal-Step Inference using wikiHow. (arXiv:2104.05845v2 [cs.CV] UPDATED)
    (2 min) Understanding what sequence of steps are needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15 - 20%. Our task will facilitate multimodal reasoning about procedural events.
    Semi-Supervised Learning using Siamese Networks. (arXiv:2109.00794v2 [cs.LG] UPDATED)
    (2 min) Neural networks have been successfully used as classification models yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are more difficult to train successfully for semi-supervised problems where small amounts of labeled instances are available along with a large number of unlabeled instances. This work explores a new training method for semi-supervised learning that is based on similarity function learning using a Siamese network to obtain a suitable embedding. The learned representations are discriminative in Euclidean space, and hence can be used for labeling unlabeled instances using a nearest-neighbor classifier. Confident predictions of unlabeled instances are used as true labels for retraining the Siamese network on the expanded training set. This process is applied iteratively. We perform an empirical study of this iterative self-training algorithm. For improving unlabeled predictions, local learning with global consistency [22] is also evaluated.
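    The labelling step described above can be sketched with a nearest-neighbour classifier over embeddings; the identity embedding function and random data below are placeholders for the trained Siamese encoder and real features.

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        def embed(x):                     # placeholder for the trained Siamese encoder
            return x

        rng = np.random.default_rng(0)
        X_lab, y_lab = rng.normal(size=(30, 16)), rng.integers(0, 3, size=30)
        X_unlab = rng.normal(size=(300, 16))

        knn = KNeighborsClassifier(n_neighbors=5).fit(embed(X_lab), y_lab)
        proba = knn.predict_proba(embed(X_unlab))
        confident = proba.max(axis=1) >= 0.8              # keep confident predictions
        pseudo_X = X_unlab[confident]                     # examples for the next round
        pseudo_y = knn.classes_[proba[confident].argmax(axis=1)]
        print(f"{confident.sum()} confident pseudo-labels added to the training set")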
    Mean Shift for Self-Supervised Learning. (arXiv:2105.07269v2 [cs.CV] UPDATED)
    (2 min) Most recent self-supervised learning (SSL) algorithms learn features by contrasting between instances of images or by clustering the images and then contrasting between the image clusters. We introduce a simple mean-shift algorithm that learns representations by grouping images together without contrasting between them or imposing much prior structure on the clusters. We simply "shift" the embedding of each image to be close to the "mean" of its neighbors. Since in our setting, the closest neighbor is always another augmentation of the same image, our model will be identical to BYOL when using only one nearest neighbor instead of 5 as used in our experiments. Our model achieves 72.4% on ImageNet linear evaluation with ResNet50 at 200 epochs, outperforming BYOL. Our code is available here: https://github.com/UMBCvision/MSF
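    A compact sketch of the mean-shift objective: pull one augmentation's embedding toward its k nearest neighbours drawn from a memory bank of the other augmentation's embeddings (the encoders, the bank, and all sizes below are stand-ins):

        import torch
        import torch.nn.functional as F

        def mean_shift_loss(query, target, bank, k=5):
            # query, target: (d,) embeddings of two augmentations of the same image;
            # bank: (n, d) memory bank of past target embeddings, L2-normalized.
            query, target = F.normalize(query, dim=0), F.normalize(target, dim=0)
            nn_idx = (bank @ target).topk(k).indices            # k nearest neighbours
            neighbours = torch.cat([target[None], bank[nn_idx]], dim=0)
            return (1 - neighbours @ query).mean()              # mean cosine distance

        bank = F.normalize(torch.randn(1024, 128), dim=1)
        query, target = torch.randn(128), torch.randn(128)
        print(mean_shift_loss(query, target, bank))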
    Measuring and Harnessing Transference in Multi-Task Learning. (arXiv:2010.15413v3 [cs.LG] UPDATED)
    (2 min) Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naive formulations often degrade performance and in particular, identifying the tasks that would benefit from co-training remains a challenging design question. In this paper, we analyze the dynamics of information transfer, or transference, across tasks throughout training. Specifically, we develop a similarity measure that can quantify transference among tasks and use this quantity to both better understand the optimization dynamics of multi-task learning as well as improve overall learning performance. In the latter case, we propose two methods to leverage our transference metric. The first operates at a macro-level by selecting which tasks should train together while the second functions at a micro-level by determining how to combine task gradients at each training step. We find these methods can lead to significant improvement over prior work on three supervised multi-task learning benchmarks and one multi-task reinforcement learning paradigm.
    Learning normal appearance for fetal anomaly screening: Application to the unsupervised detection of Hypoplastic Left Heart Syndrome. (arXiv:2012.03679v2 [eess.IV] UPDATED)
    (2 min) Congenital heart disease is considered one of the most common groups of congenital malformations, affecting 6-11 per 1000 newborns. In this work, an automated framework for detection of cardiac anomalies during ultrasound screening is proposed and evaluated on the example of Hypoplastic Left Heart Syndrome (HLHS), a sub-category of congenital heart disease. We propose an unsupervised approach that learns healthy anatomy exclusively from clinically confirmed normal control patients. We evaluate a number of known anomaly detection frameworks together with a model architecture based on the α-GAN network and find evidence that the proposed model performs significantly better than the state-of-the-art in image-based anomaly detection, yielding an average AUC of 0.81 and better robustness towards initialisation compared to previous works.
    Variational Conditional Dependence Hidden Markov Models for Skeleton-Based Action Recognition. (arXiv:2002.05809v2 [cs.LG] UPDATED)
    (2 min) Hidden Markov Models (HMMs) comprise a powerful generative approach for modeling sequential data and time-series in general. However, the commonly employed assumption of the dependence of the current time frame to a single or multiple immediately preceding frames is unrealistic; more complicated dynamics potentially exist in real world scenarios. This paper revisits conventional sequential modeling approaches, aiming to address the problem of capturing time-varying temporal dependency patterns. To this end, we propose a different formulation of HMMs, whereby the dependence on past frames is dynamically inferred from the data. Specifically, we introduce a hierarchical extension by postulating an additional latent variable layer; therein, the (time-varying) temporal dependence patterns are treated as latent variables over which inference is performed. We leverage solid arguments from the Variational Bayes framework and derive a tractable inference algorithm based on the forward-backward algorithm. As we experimentally show, our approach can model highly complex sequential data and can effectively handle data with missing values.
    BAM: A Balanced Attention Mechanism for Single Image Super Resolution. (arXiv:2104.07566v3 [eess.IV] UPDATED)
    (2 min) Recovering texture information from the aliasing regions has always been a major challenge for the Single Image Super Resolution (SISR) task. These regions are often submerged in noise so that we have to restore texture details while suppressing noise. To address this issue, we propose a Balanced Attention Mechanism (BAM), which consists of an Avgpool Channel Attention Module (ACAM) and a Maxpool Spatial Attention Module (MSAM) in parallel. ACAM is designed to suppress extreme noise in the large scale feature maps while MSAM preserves high-frequency texture details. Thanks to the parallel structure, these two modules not only conduct self-optimization, but also mutual optimization to obtain the balance of noise reduction and high-frequency texture restoration during the back propagation process, and the parallel structure makes the inference faster. To verify the effectiveness and robustness of BAM, we applied it to 10 SOTA SISR networks. The results demonstrate that BAM can efficiently improve the networks' performance, and for those originally with attention mechanisms, the substitution with BAM further reduces the number of parameters and increases the inference speed. Moreover, we present a dataset with rich texture aliasing regions in real scenes, named realSR7. Experiments prove that BAM achieves better super-resolution results on the aliasing areas.
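    A hedged PyTorch sketch of the parallel-branch idea: an avgpool-driven channel attention branch and a maxpool-driven spatial attention branch rescale the feature map side by side. The layer sizes, kernel size, and the way the two branches are combined are illustrative choices, not the paper's exact design.

        import torch
        import torch.nn as nn

        class BalancedAttention(nn.Module):
            def __init__(self, channels, reduction=4):
                super().__init__()
                self.channel_branch = nn.Sequential(             # ACAM-style branch
                    nn.AdaptiveAvgPool2d(1),
                    nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
                    nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
                self.spatial_branch = nn.Sequential(             # MSAM-style branch
                    nn.Conv2d(1, 1, kernel_size=7, padding=3), nn.Sigmoid())

            def forward(self, x):
                ca = self.channel_branch(x)                                  # (B, C, 1, 1)
                sa = self.spatial_branch(x.max(dim=1, keepdim=True).values)  # (B, 1, H, W)
                return x * ca + x * sa                           # parallel combination

        x = torch.randn(2, 32, 16, 16)
        print(BalancedAttention(32)(x).shape)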
    Non-imaging real-time detection and tracking of fast-moving objects using a single-pixel detector. (arXiv:2108.06009v2 [eess.IV] UPDATED)
    (2 min) Detection and tracking of fast-moving objects have widespread utility in many fields. However, fulfilling this demand for fast and efficient detection and tracking using image-based techniques is problematic, owing to the complex calculations and limited data processing capabilities. To tackle this problem, we propose an image-free method to achieve real-time detection and tracking of fast-moving objects. It employs Hadamard patterns to illuminate the fast-moving object via a spatial light modulator, and the resulting light signal is collected by a single-pixel detector. The single-pixel measurement values are directly used to reconstruct the position information without image reconstruction. Furthermore, a new sampling method is used to optimize the pattern projection scheme to achieve an ultra-low sampling rate. Compared with state-of-the-art methods, our approach is not only capable of handling real-time detection and tracking, but also requires little computation and is highly efficient. We experimentally demonstrate that the proposed method, using a 22kHz digital micro-mirror device, can achieve a 105fps frame rate at a 1.28% sampling rate when tracking. Our method departs from traditional tracking approaches, enabling real-time object tracking without image reconstruction.
    Highdicom: A Python library for standardized encoding of image annotations and machine learning model outputs in pathology and radiology. (arXiv:2106.07806v2 [eess.IV] UPDATED)
    (3 min) Machine learning is revolutionizing image-based diagnostics in pathology and radiology. ML models have shown promising results in research settings, but their lack of interoperability has been a major barrier for clinical integration and evaluation. The DICOM standard specifies Information Object Definitions and Services for the representation and communication of digital images and related information, including image-derived annotations and analysis results. However, the complexity of the standard represents an obstacle for its adoption in the ML community and creates a need for software libraries and tools that simplify working with data sets in DICOM format. Here we present the highdicom library, which provides a high-level application programming interface for the Python programming language that abstracts low-level details of the standard and enables encoding and decoding of image-derived information in DICOM format in a few lines of Python code. The highdicom library ties into the extensive Python ecosystem for image processing and machine learning. Simultaneously, by simplifying creation and parsing of DICOM-compliant files, highdicom achieves interoperability with the medical imaging systems that hold the data used to train and run ML models, and ultimately communicate and store model outputs for clinical use. We demonstrate, through experiments with slide microscopy and computed tomography imaging, that by bridging these two ecosystems, highdicom enables developers to train and evaluate state-of-the-art ML models in pathology and radiology while remaining compliant with the DICOM standard and interoperable with clinical systems at all stages. To promote standardization of ML research and streamline the ML model development and deployment process, we have made the library freely available and open source.
    HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields. (arXiv:2106.13228v2 [cs.CV] UPDATED)
    (2 min) Neural Radiance Fields (NeRF) are able to reconstruct scenes with unprecedented fidelity, and various recent works have extended NeRF to handle dynamic scenes. A common approach to reconstruct such non-rigid scenes is through the use of a learned deformation field mapping from coordinates in each input image into a canonical template coordinate space. However, these deformation-based approaches struggle to model changes in topology, as topological changes require a discontinuity in the deformation field, but these deformation fields are necessarily continuous. We address this limitation by lifting NeRFs into a higher dimensional space, and by representing the 5D radiance field corresponding to each individual input image as a slice through this "hyper-space". Our method is inspired by level set methods, which model the evolution of surfaces as slices through a higher dimensional surface. We evaluate our method on two tasks: (i) interpolating smoothly between "moments", i.e., configurations of the scene, seen in the input images while maintaining visual plausibility, and (ii) novel-view synthesis at fixed moments. We show that our method, which we dub HyperNeRF, outperforms existing methods on both tasks. Compared to Nerfies, HyperNeRF reduces average error rates by 4.1% for interpolation and 8.6% for novel-view synthesis, as measured by LPIPS. Additional videos, results, and visualizations are available at https://hypernerf.github.io.
    "Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks. (arXiv:2104.08384v2 [cs.CL] UPDATED)
    (2 min) We present a simple but effective approach for leveraging Wikipedia for neural machine translation, as well as for the cross-lingual tasks of image captioning and dependency parsing, without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, provide strong signals for seed parallel data from which to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g., a supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily supervised translation models to unsupervised image captioning and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English, in which the Arabic training data is a translated version of the English captioning data, using our wikily supervised translation models. Our captioning results on Arabic are slightly better than those of the supervised model. In dependency parsing, we translate a large amount of monolingual text and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.
    View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose. (arXiv:2010.13321v2 [cs.CV] UPDATED)
    (3 min) Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints that make the recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well-studied in existing works. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses from projection and occlusion are difficult to represent through a deterministic mapping, and therefore we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 3D pose estimation models. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and largely reduce the embedding dimension from stacking frame-based embeddings for efficient large-scale retrieval. Furthermore, in order to enable our embeddings to work with partially visible input, we further investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings without any additional training achieves competitive performance relative to other models specifically trained for each task.
    Solid Texture Synthesis using Generative Adversarial Networks. (arXiv:2102.03973v4 [cs.CV] UPDATED)
    (2 min) Solid texture synthesis (STS), an effective way to extend a 2D exemplar to a 3D solid volume, exhibits advantages in numerous application domains. However, existing methods generally synthesize solid textures with specific features, which may fail to capture diversified textural information. In this paper, we propose a novel generative adversarial nets-based approach (STS-GAN) that hierarchically learns solid textures in a feature-free manner. Our multi-scale discriminators evaluate the similarity between patches from the exemplar and slices from the generated volume, promoting the generator to synthesize realistic solid textures. Experimental results demonstrate that the proposed method can generate high-quality solid textures with visual characteristics similar to the exemplar.
    Combining Morphological and Histogram based Text Line Segmentation in the OCR Context. (arXiv:2103.08922v2 [cs.CV] UPDATED)
    (2 min) Text line segmentation is one of the pre-processing stages of modern optical character recognition systems. The algorithmic approach proposed in this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques, morphological image operations and horizontal histogram projections. The method was developed to be applied to a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or presence of noise. For that reason, the segmenter in question could be of particular interest for cultural institutions that want access to robust line bounding boxes for a given historic document. Because of its promising segmentation results combined with low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the open source OCR software used.
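    A minimal OpenCV sketch (not the paper's implementation) of the general idea: merge characters into line blobs with morphological dilation, then cut lines using a horizontal projection profile. The file name and thresholds are placeholders.

    ```python
    import cv2
    import numpy as np

    def segment_lines(path, min_line_height=5):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Binarize: text becomes white (255) on a black background.
        _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # A wide dilation kernel merges characters of a line into one blob.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
        merged = cv2.dilate(binary, kernel, iterations=1)
        # Horizontal projection: count foreground pixels per row.
        profile = merged.sum(axis=1)
        rows = np.where(profile > 0)[0]
        # Group consecutive non-empty rows into line bands.
        boxes = []
        if rows.size:
            start = rows[0]
            for prev, cur in zip(rows[:-1], rows[1:]):
                if cur - prev > 1:                      # vertical gap between lines
                    if prev - start >= min_line_height:
                        boxes.append((start, prev))
                    start = cur
            if rows[-1] - start >= min_line_height:
                boxes.append((start, rows[-1]))
        return boxes  # list of (top_row, bottom_row) per detected text line

    print(segment_lines("page.png"))   # "page.png" is a hypothetical scan
    ```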
    Unsupervised Change Detection in Hyperspectral Images using Feature Fusion Deep Convolutional Autoencoders. (arXiv:2109.04990v1 [cs.CV])
    (2 min) Binary change detection in bi-temporal co-registered hyperspectral images is a challenging task due to the large number of spectral bands present in the data. Researchers therefore try to handle it by reducing dimensions. The proposed work aims to build a novel feature extraction system using a feature fusion deep convolutional autoencoder for detecting changes between a pair of such bi-temporal co-registered hyperspectral images. The feature fusion considers features across successive levels and multiple receptive fields and therefore adds a competitive edge over existing feature extraction methods. The change detection technique described is completely unsupervised and is much more elegant than supervised or semi-supervised methods that require some amount of label information. Different methods have been applied to the extracted features to find the changes in the two images, and the proposed method clearly outperforms the state-of-the-art methods in unsupervised change detection for all the datasets.
    MultiTask-CenterNet (MCN): Efficient and Diverse Multitask Learning using an Anchor Free Approach. (arXiv:2108.05060v2 [cs.CV] UPDATED)
    (2 min) Multitask learning is a common approach in machine learning that allows multiple objectives to be trained with a shared architecture. It has been shown that training multiple tasks together can save inference time and compute resources while keeping performance on each objective at a similar or even higher level. However, perception-related multitask networks typically combine only closely related tasks, such as object detection, instance and semantic segmentation, or depth estimation. Multitask networks with diverse tasks, and the effects of those tasks on one another's efficiency, are not well studied. In this paper we augment the CenterNet anchor-free approach to train multiple diverse perception-related tasks together, including object detection, semantic segmentation, and human pose estimation. We refer to this DNN as Multitask-CenterNet (MCN). Additionally, we study different MCN settings for efficiency. The MCN can perform several tasks at once while maintaining, and in some cases even exceeding, the performance values of its corresponding single-task networks. More importantly, the MCN architecture decreases inference time and reduces network size when compared to a composition of single-task networks.
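    A toy PyTorch sketch of the shared-backbone, multi-head layout implied above (one backbone, separate heads for detection heatmaps, segmentation, and pose keypoints). The backbone and head sizes are placeholders, not the MCN architecture.

    ```python
    import torch
    import torch.nn as nn

    class Head(nn.Module):
        """Small task-specific head on top of the shared feature map."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, out_ch, 1),
            )
        def forward(self, x):
            return self.net(x)

    class MultiTaskNet(nn.Module):
        """One shared backbone, several heads; the backbone here is a toy stand-in."""
        def __init__(self, num_classes=80, num_seg_classes=19, num_keypoints=17):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            )
            self.det_head = Head(128, num_classes)       # center heatmaps
            self.seg_head = Head(128, num_seg_classes)   # semantic segmentation logits
            self.pose_head = Head(128, num_keypoints)    # keypoint heatmaps

        def forward(self, x):
            feats = self.backbone(x)
            return {
                "detection": self.det_head(feats),
                "segmentation": self.seg_head(feats),
                "pose": self.pose_head(feats),
            }

    out = MultiTaskNet()(torch.randn(1, 3, 256, 256))
    print({k: v.shape for k, v in out.items()})
    ```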
    Concept Generalization in Visual Representation Learning. (arXiv:2012.05649v2 [cs.CV] UPDATED)
    (2 min) Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be leveraged to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially in a self-supervised learning framework. Nonetheless, the choice of unseen concepts for such an evaluation is usually made arbitrarily, and independently from the seen concepts used to train representations, thus ignoring any semantic relationships between the two. In this paper, we argue that the semantic relationships between seen and unseen concepts affect generalization performance and propose ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in a principled way. Our benchmark leverages expert knowledge that comes from WordNet in order to define a sequence of unseen IN-21K concept sets that are semantically more and more distant from the ImageNet-1K (IN-1K) subset, a ubiquitous training set. This allows us to benchmark visual representations learned on IN-1K out of the box. We conduct a large-scale study encompassing 31 convolutional and transformer-based models and show how different architectures, levels of supervision, regularization techniques and use of web data impact the concept generalization performance.
    ISD: Self-Supervised Learning by Iterative Similarity Distillation. (arXiv:2012.09259v3 [cs.CV] UPDATED)
    (2 min) Recently, contrastive learning has achieved great results in self-supervised learning, where the main idea is to push two augmentations of an image (positive pairs) closer compared to other random images (negative pairs). We argue that not all random images are equal. Hence, we introduce a self-supervised learning algorithm where we use a soft similarity for the negative images rather than a binary distinction between positive and negative pairs. We iteratively distill a slowly evolving teacher model into the student model by capturing the similarity of a query image to some random images and transferring that knowledge to the student. We argue that our method is less constrained compared to recent contrastive learning methods, so it can learn better features. Specifically, our method should handle unbalanced and unlabeled data better than existing contrastive learning methods, because the randomly chosen negative set might include many samples that are semantically similar to the query image. In this case, our method labels them as highly similar while standard contrastive methods label them as negative pairs. Our method achieves comparable results to the state-of-the-art models. We also show that our method performs better in settings where the unlabeled data is unbalanced. Our code is available here: https://github.com/UMBCvision/ISD.
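    A minimal sketch of the soft-similarity distillation idea: train the student to match the teacher's softened similarity distribution over a set of random anchor images, and update the teacher as a momentum average of the student. Temperatures, the momentum value, and the update rule are assumptions, not the authors' exact recipe.

    ```python
    import torch
    import torch.nn.functional as F

    def isd_loss(student_q, teacher_q, anchors, tau_s=0.1, tau_t=0.04):
        """student_q, teacher_q: (B, D) embeddings of the same query batch from the
        student and teacher; anchors: (N, D) teacher embeddings of random images."""
        student_q = F.normalize(student_q, dim=1)
        teacher_q = F.normalize(teacher_q, dim=1)
        anchors = F.normalize(anchors, dim=1)
        # Similarity distributions over the anchor set.
        p_teacher = F.softmax(teacher_q @ anchors.t() / tau_t, dim=1)
        log_p_student = F.log_softmax(student_q @ anchors.t() / tau_s, dim=1)
        # KL divergence pushes the student toward the teacher's soft similarities.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    @torch.no_grad()
    def ema_update(teacher, student, m=0.99):
        """Slowly evolving teacher via exponential moving average of student weights."""
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)
    ```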
    Pose Estimation for Robot Manipulators via Keypoint Optimization and Sim-to-Real Transfer. (arXiv:2010.08054v2 [cs.RO] UPDATED)
    (2 min) Keypoint detection is an essential building block for many robotic applications like motion capture and pose estimation. Historically, keypoints are detected using uniquely engineered markers such as checkerboards or fiducials. More recently, deep learning methods have been explored as they have the ability to detect user-defined keypoints in a marker-less manner. However, different manually selected keypoints can have uneven performance when it comes to detection and localization. An example of this can be found on symmetric robotic tools, where DNN detectors cannot solve the correspondence problem correctly. In this work, we propose a new and autonomous way to define the keypoint locations that overcomes these challenges. The approach involves finding the optimal set of keypoints on robotic manipulators for robust visual detection and localization. Using a robotic simulator as a medium, our algorithm utilizes synthetic data for DNN training, and the proposed algorithm is used to optimize the selection of keypoints through an iterative approach. The results show that when using the optimized keypoints, the detection performance of the DNNs improves significantly. We further use the optimized keypoints for real robotic applications by using domain randomization to bridge the reality gap between the simulator and the physical world. The physical world experiments show how the proposed method can be applied to the wide breadth of robotic applications that require visual feedback, such as camera-to-robot calibration, robotic tool tracking, and end-effector pose estimation.
    Transfer of Pretrained Model Weights Substantially Improves Semi-Supervised Image Classification. (arXiv:2109.00788v2 [cs.CV] UPDATED)
    (2 min) Deep neural networks produce state-of-the-art results when trained on a large number of labeled examples but tend to overfit when small amounts of labeled examples are used for training. Creating a large number of labeled examples requires considerable resources, time, and effort. If labeling new data is not feasible, so-called semi-supervised learning can achieve better generalisation than purely supervised learning by employing unlabeled instances as well as labeled ones. The work presented in this paper is motivated by the observation that transfer learning provides the opportunity to potentially further improve performance by exploiting models pretrained on a similar domain. More specifically, we explore the use of transfer learning when performing semi-supervised learning using self-learning. The main contribution is an empirical evaluation of transfer learning using different combinations of similarity metric learning methods and label propagation algorithms in semi-supervised learning. We find that transfer learning always substantially improves the model's accuracy when few labeled examples are available, regardless of the type of loss used for training the neural network. This finding is obtained by performing extensive experiments on the SVHN, CIFAR10, and Plant Village image classification datasets and applying pretrained weights from ImageNet for transfer learning.
    A New Backbone for Hyperspectral Image Reconstruction. (arXiv:2108.07739v2 [eess.IV] UPDATED)
    (2 min) The study of 3D hyperspectral image (HSI) reconstruction refers to the inverse process of snapshot compressive imaging, during which the optical system, e.g., the coded aperture snapshot spectral imaging (CASSI) system, captures the 3D spatial-spectral signal and encodes it as a 2D measurement. While numerous sophisticated neural networks have been elaborated for end-to-end reconstruction, trade-offs still need to be made among performance, efficiency (training and inference time), and feasibility (the ability to restore high-resolution HSI on limited GPU memory). This raises the challenge of designing a new baseline that jointly meets the above requirements. In this paper, we fill this gap by proposing a Spatial/Spectral Invariant Residual U-Net, namely SSI-ResU-Net. It differs from U-Net in three respects: 1) scale/spectral-invariant learning, 2) nested residual learning, and 3) computational efficiency. Benefiting from these three modules, the proposed SSI-ResU-Net outperforms the current state-of-the-art method TSA-Net by over 3 dB in PSNR and 0.036 in SSIM while using only 2.82% of the trainable parameters. Moreover, SSI-ResU-Net achieves competitive performance with an over 77.3% reduction in floating-point operations (FLOPs), which, for the first time, makes high-resolution HSI reconstruction feasible under practical application scenarios. Code and pre-trained models are made available at https://github.com/Jiamian-Wang/HSI_baseline.
    Face-NMS: A Core-set Selection Approach for Efficient Face Recognition. (arXiv:2109.04698v1 [cs.CV])
    (2 min) Recently, face recognition in the wild has achieved remarkable success, and one key engine is the increasing size of training data. For example, the largest face dataset, WebFace42M, contains about 2 million identities and 42 million faces. However, such a massive number of faces raises constraints on training time, computing resources, and memory cost. Current research on this problem mainly focuses on designing an efficient fully connected layer (FC) to reduce the GPU memory consumption caused by the large number of identities. In this work, we relax these constraints by resolving the redundancy problem of up-to-date face datasets caused by their greedy collection process (i.e., the core-set selection perspective). As the first attempt from this perspective on the face recognition problem, we find that existing methods are limited in both performance and efficiency. For superior cost-efficiency, we contribute a novel filtering strategy dubbed Face-NMS. Face-NMS works in feature space and simultaneously considers local and global sparsity in generating core sets. In practice, Face-NMS is analogous to Non-Maximum Suppression (NMS) in the object detection community. It ranks the faces by their potential contribution to the overall sparsity and, for local sparsity, filters out superfluous faces in pairs with high similarity. With respect to efficiency, Face-NMS accelerates the whole pipeline by applying a smaller but sufficient proxy dataset when training the proxy model. As a result, with Face-NMS, we successfully scale the WebFace42M dataset down to 60% while retaining its performance on the main benchmarks, offering 40% resource savings and 1.64 times acceleration. The code is publicly available for reference at https://github.com/HuangJunJie2017/Face-NMS.
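    A rough numpy sketch of NMS-style core-set selection in feature space, illustrating the analogy drawn above (not the released Face-NMS code): rank samples by a priority score, then greedily keep a sample only if it is not too similar to any already-kept sample.

    ```python
    import numpy as np

    def feature_nms(features, scores, sim_threshold=0.8):
        """features: (N, D) L2-normalized embeddings; scores: (N,) keep-priority.
        Returns indices of the retained core set."""
        order = np.argsort(-scores)              # highest priority first
        kept = []
        for i in order:
            f = features[i]
            # Cosine similarity to everything already kept; suppress if redundant.
            if kept and np.max(features[kept] @ f) > sim_threshold:
                continue
            kept.append(i)
        return np.array(kept)

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(1000, 64))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    core = feature_nms(feats, rng.random(1000), sim_threshold=0.5)
    print(f"kept {core.size} of 1000 samples")
    ```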
    Emerging AI Security Threats for Autonomous Cars -- Case Studies. (arXiv:2109.04865v1 [cs.CR])
    (2 min) Artificial Intelligence has made a significant contribution to autonomous vehicles, from object detection to path planning. However, AI models require a large amount of sensitive training data and are usually computationally intensive to build. The commercial value of such models motivates attackers to mount various attacks. Adversaries can launch model extraction attacks for monetization purposes or as a stepping stone towards other attacks like model evasion. In some cases, it can even destroy brand reputation, differentiation, and value proposition. In addition, IP laws and AI-related legalities are still evolving and are not uniform across countries. We discuss model extraction attacks in detail with two use cases and a generic kill-chain that can compromise autonomous cars. It is essential to investigate strategies to manage and mitigate the risk of model theft.
    An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA. (arXiv:2109.05014v1 [cs.CV])
    (2 min) Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.
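    A hedged sketch of the few-shot prompting pattern described above: convert the image to a caption, prepend a handful of in-context caption/question/answer examples, and ask an autoregressive LM for the answer. The prompt wording, the example data, and the `complete` call are all hypothetical stand-ins, not the paper's exact setup.

    ```python
    def build_prompt(in_context_examples, test_caption, test_question):
        header = "Please answer the question according to the context.\n\n"
        shots = ""
        for ex in in_context_examples:
            shots += (f"Context: {ex['caption']}\n"
                      f"Question: {ex['question']}\n"
                      f"Answer: {ex['answer']}\n\n")
        query = f"Context: {test_caption}\nQuestion: {test_question}\nAnswer:"
        return header + shots + query

    examples = [
        {"caption": "A man holding a tennis racket on a court.",
         "question": "What sport is being played?", "answer": "tennis"},
        {"caption": "A plate with a slice of pizza and a fork.",
         "question": "What utensil is on the plate?", "answer": "fork"},
    ]

    prompt = build_prompt(examples, "A red double-decker bus on a city street.",
                          "In which country would you likely see this bus?")
    # answer = complete(prompt, max_tokens=5)   # hypothetical GPT-3-style completion call
    print(prompt)
    ```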
    View Blind-spot as Inpainting: Self-Supervised Denoising with Mask Guided Residual Convolution. (arXiv:2109.04970v1 [eess.IV])
    (2 min) In recent years, self-supervised denoising methods have shown impressive performance; they circumvent the painstaking collection of noisy-clean image pairs required by supervised denoising methods and broaden the applicability of denoising in the real world. One well-known self-supervised denoising strategy is the blind-spot training scheme. However, few works have attempted to improve blind-spot-based self-denoisers at the level of network architecture. In this paper, we take an intuitive view of the blind-spot strategy and consider its process of using neighbor pixels to predict manipulated pixels as an inpainting process. We therefore propose a novel Mask Guided Residual Convolution (MGRConv), which can be inserted into common convolutional neural networks, e.g. U-Net, to promote blind-spot based denoising. Our MGRConv can be regarded as a soft partial convolution and strikes a trade-off among partial convolution, learnable attention maps, and gated convolution. It enables dynamic mask learning with an appropriate mask constraint. Unlike partial convolution and gated convolution, it provides moderate freedom for network learning. It also avoids leveraging external learnable parameters for mask activation, unlike learnable attention maps. The experiments show that our proposed plug-and-play MGRConv can help blind-spot based denoising networks reach promising results on both existing single-image-based and dataset-based methods.
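    A minimal PyTorch sketch of a partial-convolution-style, mask-guided layer: convolve only over unmasked pixels, renormalize by the number of valid pixels, and propagate an updated mask. The actual MGRConv uses a softer, learnable gating; this hard-mask version is only an assumption-based illustration of the idea.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskGuidedConv(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size=3):
            super().__init__()
            pad = kernel_size // 2
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
            # Fixed all-ones kernel used to count valid pixels under each window.
            self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
            self.pad = pad

        def forward(self, x, mask):
            """x: (B, C, H, W) features; mask: (B, 1, H, W), 1 = valid pixel."""
            out = self.conv(x * mask)
            valid = F.conv2d(mask, self.ones, padding=self.pad)
            # Renormalize by the number of valid pixels in each receptive field.
            out = out * (self.ones.numel() / valid.clamp(min=1.0))
            new_mask = (valid > 0).float()   # a pixel becomes valid if any neighbor was
            return out, new_mask

    layer = MaskGuidedConv(3, 16)
    x = torch.randn(1, 3, 64, 64)
    mask = (torch.rand(1, 1, 64, 64) > 0.1).float()   # blind spots set to 0
    y, m = layer(x, mask)
    print(y.shape, m.mean().item())
    ```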
    LibFewShot: A Comprehensive Library for Few-shot Learning. (arXiv:2109.04898v1 [cs.CV])
    (2 min) Few-shot learning, especially few-shot image classification, has received increasing attention and witnessed significant advances in recent years. Some recent studies implicitly show that many generic techniques or ``tricks'', such as data augmentation, pre-training, knowledge distillation, and self-supervision, may greatly boost the performance of a few-shot learning method. Moreover, different works may employ different software platforms, different training schedules, different backbone architectures and even different input image sizes, making fair comparisons difficult and leaving practitioners struggling with reproducibility. To address this situation, we propose a comprehensive library for few-shot learning (LibFewShot) by re-implementing seventeen state-of-the-art few-shot learning methods in a unified framework with the same single codebase in PyTorch. Furthermore, based on LibFewShot, we provide comprehensive evaluations on multiple benchmark datasets with multiple backbone architectures to evaluate common pitfalls and the effects of different training tricks. In addition, given the recent doubts about the necessity of a meta- or episodic-training mechanism, our evaluation results show that such a mechanism is still necessary, especially when combined with pre-training. We hope our work can not only lower the barriers for beginners to work on few-shot learning but also remove the effects of nontrivial tricks to facilitate intrinsic research on few-shot learning. The source code is available from https://github.com/RL-VIG/LibFewShot.
    TADA: Taxonomy Adaptive Domain Adaptation. (arXiv:2109.04813v1 [cs.CV])
    (2 min) Traditional domain adaptation addresses the task of adapting a model to a novel target domain under limited or no additional supervision. While tackling the input domain gap, the standard domain adaptation settings assume no domain change in the output space. In semantic prediction tasks, different datasets are often labeled according to different semantic taxonomies. In many real-world settings, the target domain task requires a different taxonomy than the one imposed by the source domain. We therefore introduce the more general taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent taxonomies between the two domains. We further propose an approach that jointly addresses image-level and label-level domain adaptation. On the label level, we employ a bilateral mixed sampling strategy to augment the target domain, and a relabelling method to unify and align the label spaces. We address the image-level domain gap by proposing an uncertainty-rectified contrastive learning method, leading to more domain-invariant and class-discriminative features. We extensively evaluate the effectiveness of our framework under different TADA settings: open taxonomy, coarse-to-fine taxonomy, and partially-overlapping taxonomy. Our framework outperforms previous state-of-the-art methods by a large margin, while being capable of adapting to new target domain taxonomies.
    Residual 3D Scene Flow Learning with Context-Aware Feature Extraction. (arXiv:2109.04685v1 [cs.CV])
    (2 min) Scene flow estimation is the task of predicting the point-wise 3D displacement vector between two consecutive frames of point clouds, which has important applications in fields such as service robots and autonomous driving. Although many previous works have explored scene flow estimation from point clouds, we point out two problems that have not been noticed or well solved before: 1) points of adjacent frames in repetitive patterns may be wrongly associated due to the similar spatial structure of their neighbourhoods; 2) scene flow between adjacent frames of point clouds with long-distance movement may be inaccurately estimated. To solve the first problem, we propose a novel context-aware set conv layer to exploit contextual structure information of Euclidean space and learn soft aggregation weights for local point features. Our design is inspired by human perception of contextual structure information during scene understanding. We incorporate the context-aware set conv layer in a context-aware point feature pyramid module of 3D point clouds for scene flow estimation. For the second problem, we propose an explicit residual flow learning structure in the residual flow refinement layer to cope with long-distance movement. The experiments and ablation study on the FlyingThings3D and KITTI scene flow datasets demonstrate the effectiveness of each proposed component and show that we address the problems of ambiguous inter-frame association and long-distance movement estimation. Quantitative results on both FlyingThings3D and KITTI scene flow datasets show that our method achieves state-of-the-art performance, surpassing, to the best of our knowledge, all previous works by at least 25%.
    ACFNet: Adaptively-Cooperative Fusion Network for RGB-D Salient Object Detection. (arXiv:2109.04627v1 [cs.CV])
    (2 min) The judicious use of RGB and depth data is of great significance for advancing computer vision tasks and robot-environment interaction. However, early and late fusion of the two types of data have different advantages and disadvantages. Besides, due to the diversity of object information, using a single type of data in a specific scenario tends to produce misleading semantics. Based on the above considerations, we propose an adaptively-cooperative fusion network (ACFNet) with a ResinRes structure for salient object detection. This structure is designed to flexibly exploit the advantages of feature fusion in the early and late stages. Secondly, an adaptively-cooperative semantic guidance (ACG) scheme is designed to suppress inaccurate features in the guidance phase. Further, we propose a type-based attention module (TAM) to optimize the network and enhance the multi-scale perception of different objects. For different objects, the features generated by different types of convolution are enhanced or suppressed by a gated mechanism for segmentation optimization. ACG and TAM optimize the transfer of feature streams according to their data attributes and convolution attributes, respectively. Extensive experiments conducted on RGB-D SOD datasets illustrate that the proposed network performs favorably against 18 state-of-the-art algorithms.
    S3G-ARM: Highly Compressive Visual Self-localization from Sequential Semantic Scene Graph Using Absolute and Relative Measurements. (arXiv:2109.04569v1 [cs.CV])
    (2 min) In this paper, we address the problem of image sequence-based self-localization (ISS) from a new highly compressive scene representation called sequential semantic scene graph (S3G). Recent developments in deep graph convolutional neural networks (GCNs) have enabled a highly compressive visual place classifier (VPC) that can use a scene graph as the input modality. However, in such a highly compressive application, the amount of information lost in the image-to-graph mapping is significant and can damage the classification performance. To address this issue, we propose a pair of similarity-preserving mappings, image-to-nodes and image-to-edges, such that the nodes and edges act as absolute and relative features, respectively, that complement each other. Moreover, the proposed GCN-VPC is applied to a new task of viewpoint planning (VP) of the query image sequence, which contributes to further improvement in the VPC performance. Experiments using the public NCLT dataset validated the effectiveness of the proposed method.
    Automatic Portrait Video Matting via Context Motion Network. (arXiv:2109.04598v1 [cs.CV])
    (2 min) Our automatic portrait video matting method does not require extra inputs. Most state-of-the-art matting methods rely on semantic segmentation methods to automatically generate the trimap. Their performance is compromised due to the lack of temporal information. Our method exploits semantic information as well as temporal information from optical flow and produces high-quality results.
    Mesh convolutional neural networks for wall shear stress estimation in 3D artery models. (arXiv:2109.04797v1 [cs.LG])
    (2 min) Computational fluid dynamics (CFD) is a valuable tool for personalised, non-invasive evaluation of hemodynamics in arteries, but its complexity and time-consuming nature prohibit large-scale use in practice. Recently, the use of deep learning for rapid estimation of CFD parameters like wall shear stress (WSS) on surface meshes has been investigated. However, existing approaches typically depend on a hand-crafted re-parametrisation of the surface mesh to match convolutional neural network architectures. In this work, we propose to instead use mesh convolutional neural networks that directly operate on the same finite-element surface mesh as used in CFD. We train and evaluate our method on two datasets of synthetic coronary artery models with and without bifurcation, using a ground truth obtained from CFD simulation. We show that our flexible deep learning model can accurately predict 3D WSS vectors on this surface mesh. Our method processes new meshes in less than 5 [s], consistently achieves a normalised mean absolute error of $\leq$ 1.6 [%], and peaks at 90.5 [%] median approximation accuracy over the held-out test set, comparing favorably to previously published work. This shows the feasibility of CFD surrogate modelling using mesh convolutional neural networks for hemodynamic parameter estimation in artery models.
    Spatio-Temporal Recurrent Networks for Event-Based Optical Flow Estimation. (arXiv:2109.04871v1 [cs.CV])
    (2 min) Event cameras offer a promising alternative for visual perception, especially in high-speed and high-dynamic-range scenes. Recently, many deep learning methods have shown great success in providing model-free solutions to many event-based problems, such as optical flow estimation. However, existing deep learning methods do not address the importance of temporal information well in their architecture design and cannot effectively extract spatio-temporal features. Another line of research that utilizes Spiking Neural Networks suffers from training issues for deeper architectures. To address these points, a novel input representation is proposed that captures the events' temporal distribution for signal enhancement. Moreover, we introduce a spatio-temporal recurrent encoding-decoding neural network architecture for event-based optical flow estimation, which utilizes Convolutional Gated Recurrent Units to extract feature maps from a series of event images. Besides, our architecture allows some traditional frame-based core modules, such as the correlation layer and the iterative residual refinement scheme, to be incorporated. The network is trained end to end with self-supervised learning on the Multi-Vehicle Stereo Event Camera dataset. We show that it outperforms all existing state-of-the-art methods by a large margin.
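    A generic ConvGRU cell in PyTorch, illustrating the recurrent spatio-temporal encoding of a sequence of event images mentioned above; this is a textbook sketch, not the authors' architecture, and the two-channel event representation is an assumption.

    ```python
    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        def __init__(self, in_ch, hidden_ch, kernel_size=3):
            super().__init__()
            pad = kernel_size // 2
            self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, kernel_size, padding=pad)
            self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel_size, padding=pad)
            self.hidden_ch = hidden_ch

        def forward(self, x, h=None):
            if h is None:
                h = x.new_zeros(x.size(0), self.hidden_ch, x.size(2), x.size(3))
            # Update (z) and reset (r) gates, then candidate hidden state.
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde

    # Feed a sequence of event images frame by frame, carrying the hidden state.
    cell = ConvGRUCell(in_ch=2, hidden_ch=32)
    h = None
    for _ in range(5):
        event_img = torch.randn(1, 2, 64, 64)   # e.g. positive/negative event counts
        h = cell(event_img, h)
    print(h.shape)
    ```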
    ReconfigISP: Reconfigurable Camera Image Processing Pipeline. (arXiv:2109.04760v1 [eess.IV])
    (2 min) Image Signal Processor (ISP) is a crucial component in digital cameras that transforms sensor signals into images for us to perceive and understand. Existing ISP designs always adopt a fixed architecture, e.g., several sequential modules connected in a rigid order. Such a fixed ISP architecture may be suboptimal for real-world applications, where camera sensors, scenes and tasks are diverse. In this study, we propose a novel Reconfigurable ISP (ReconfigISP) whose architecture and parameters can be automatically tailored to specific data and tasks. In particular, we implement several ISP modules, and enable backpropagation for each module by training a differentiable proxy, hence allowing us to leverage the popular differentiable neural architecture search and effectively search for the optimal ISP architecture. A proxy tuning mechanism is adopted to maintain the accuracy of proxy networks in all cases. Extensive experiments conducted on image restoration and object detection, with different sensors, light conditions and efficiency constraints, validate the effectiveness of ReconfigISP. Only hundreds of parameters need tuning for every task.
    Panoptic Narrative Grounding. (arXiv:2109.04988v1 [cs.CV])
    (2 min) This paper proposes Panoptic Narrative Grounding, a spatially fine and general formulation of the natural language visual grounding problem. We establish an experimental framework for the study of this new task, including new ground truth and metrics, and we propose a strong baseline method to serve as a stepping stone for future work. We exploit the intrinsic semantic richness in an image by including panoptic categories, and we approach visual grounding at a fine-grained level by using segmentations. In terms of ground truth, we propose an algorithm to automatically transfer Localized Narratives annotations to specific regions in the panoptic segmentations of the MS COCO dataset. To guarantee the quality of our annotations, we take advantage of the semantic structure contained in WordNet to exclusively incorporate noun phrases that are grounded to a meaningfully related panoptic segmentation region. The proposed baseline achieves a performance of 55.4 absolute Average Recall points. This result is a suitable foundation to push the envelope further in the development of methods for Panoptic Narrative Grounding.
    Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering. (arXiv:2109.04735v1 [cs.CV])
    (2 min) Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding. While existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, the interaction between the question and the visual information for textual semantics extraction is frequently ignored. Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules, namely Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Under the guidance of such question-specific semantics, VI infers the visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, we introduce a multimodal attention mechanism to aid the extraction of question-video interactions, with residual connections adopted for the information passing across different levels. Through extensive experiments on three VideoQA datasets, we demonstrate the better performance of the proposed method in comparison with the state of the art.
    EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling. (arXiv:2109.04699v1 [cs.CL])
    (2 min) While large-scale pre-training has achieved great success in bridging the gap between vision and language, it still faces several challenges. First, pre-training is expensive. Second, there is no efficient way to handle the data noise that degrades model performance. Third, previous methods only leverage limited image-text paired data, while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose EfficientCLIP, a method that uses Ensemble Confident Learning to obtain a less noisy data subset. Extra rich non-paired single-modal text data is used to boost the generalization of the text branch. We achieve state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources compared to CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.
    CrowdDriven: A New Challenging Dataset for Outdoor Visual Localization. (arXiv:2109.04527v1 [cs.CV])
    (2 min) Visual localization is the problem of estimating the position and orientation from which a given image (or a sequence of images) is taken in a known scene. It is an important part of a wide range of computer vision and robotics applications, from self-driving cars to augmented/virtual reality systems. Visual localization techniques should work reliably and robustly under a wide range of conditions, including seasonal, weather, illumination and man-made changes. Recent benchmarking efforts model this by providing images under different conditions, and the community has made rapid progress on these datasets since their inception. However, they are limited to a few geographical regions and often recorded with a single device. We propose a new benchmark for visual localization in outdoor scenes, using crowd-sourced data to cover a wide range of geographical regions and camera devices with a focus on the failure cases of current algorithms. Experiments with state-of-the-art localization approaches show that our dataset is very challenging, with all evaluated methods failing on its hardest parts. As part of the dataset release, we provide the tooling used to generate it, enabling efficient and effective 2D correspondence annotation to obtain reference poses.
    Per Garment Capture and Synthesis for Real-time Virtual Try-on. (arXiv:2109.04654v1 [cs.GR])
    (2 min) Virtual try-on is a promising application of computer graphics and human computer interaction that can have a profound real-world impact, especially during this pandemic. Existing image-based works try to synthesize a try-on image from a single image of a target garment, but this inherently limits the ability to react to possible interactions. It is difficult to reproduce the change of wrinkles caused by pose and body size changes, as well as pulling and stretching of the garment by hand. In this paper, we propose an alternative per-garment capture and synthesis workflow to handle such rich interactions by training the model with many systematically captured images. Our workflow is composed of two parts: garment capturing and clothed person image synthesis. We designed an actuated mannequin and an efficient capturing process that collects the detailed deformations of the target garments under diverse body sizes and poses. Furthermore, we propose to use a custom-designed measurement garment, and we captured paired images of the measurement garment and the target garments. We then learn a mapping between the measurement garment and the target garments using deep image-to-image translation. The customer can then try on the target garments interactively during online shopping.
    Automatic Displacement and Vibration Measurement in Laboratory Experiments with A Deep Learning Method. (arXiv:2109.04960v1 [eess.IV])
    (2 min) This paper proposes a pipeline to automatically track and measure displacement and vibration of structural specimens during laboratory experiments. The latest Mask Regional Convolutional Neural Network (Mask R-CNN) can locate the targets and monitor their movement from videos recorded by a stationary camera. To improve precision and remove the noise, techniques such as Scale-invariant Feature Transform (SIFT) and various filters for signal processing are included. Experiments on three small-scale reinforced concrete beams and a shaking table test are utilized to verify the proposed method. Results show that the proposed deep learning method can achieve the goal to automatically and precisely measure the motion of tested structural members during laboratory experiments.
    Accurate Lung Nodules Segmentation with Detailed Representation Transfer and Soft Mask Supervision. (arXiv:2007.14556v2 [eess.IV] UPDATED)
    (2 min) Accurate lung nodule segmentation from Computed Tomography (CT) images is crucial to the analysis and diagnosis of lung diseases such as COVID-19 and lung cancer. However, due to the smallness and variety of lung nodules and the lack of high-quality labeling, accurate lung nodule segmentation is still a challenging problem. To address these issues, we propose a complete paradigm for accurate lung nodule segmentation. First, we introduce a new segmentation mask named Soft Mask, which has a richer and more accurate description of edge details and better visualization. Correspondingly, we develop a universal semi-automatic Soft Mask annotation pipeline to deal with different datasets. Second, a novel Network with Detailed representation transfer and Soft Mask supervision (DSNet) is proposed to process the input low-resolution images of lung nodules into high-quality segmentation results. In our DSNet, we design a novel Selective Detailed Representation Fusion Module to reconstruct the detailed representation and alleviate the small size of lung nodule images. In addition, an adversarial training framework with Soft Mask is proposed to further improve the accuracy of segmentation. Extensive experiments validate that our DSNet outperforms the state-of-the-art methods for accurate lung nodule segmentation. Our method also demonstrates competitive results in other accurate medical segmentation tasks. Besides, we provide a new challenging lung nodule segmentation dataset for further studies.
    PIP: Physical Interaction Prediction via Mental Imagery with Span Selection. (arXiv:2109.04683v1 [cs.CV])
    (2 min) To align advanced artificial intelligence (AI) with human values and promote safe AI, it is important for AI to predict the outcome of physical interactions. Even with the ongoing debates on how humans predict the outcomes of physical interactions among objects in the real world, there are works attempting to tackle this task via cognitive-inspired AI approaches. However, there is still a lack of AI approaches that mimic the mental imagery humans use to predict physical interactions in the real world. In this work, we propose a novel PIP scheme: Physical Interaction Prediction via Mental Imagery with Span Selection. PIP utilizes a deep generative model to output future frames of physical interactions among objects before extracting crucial information for predicting physical interactions by focusing on salient frames using span selection. To evaluate our model, we propose a large-scale SPACE+ dataset of synthetic video frames, including three physical interaction events in a 3D environment. Our experiments show that PIP outperforms baselines and human performance in physical interaction prediction for both seen and unseen objects. Furthermore, PIP's span selection scheme can effectively identify the frames where physical interactions among objects occur within the generated frames, allowing for added interpretability.
    EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation. (arXiv:2109.04600v1 [cs.CV])
    (2 min) Temporal grounding aims to predict a time interval of a video clip corresponding to a natural language query input. In this work, we present EVOQUER, a temporal grounding framework incorporating an existing text-to-video grounding model and a video-assisted query generation network. Given a query and an untrimmed video, the temporal grounding model predicts the target interval, and the predicted video clip is fed into a video translation task that generates a simplified version of the input query. EVOQUER forms closed-loop learning by incorporating loss functions from both temporal grounding and query generation, which serve as feedback. Our experiments on two widely used datasets, Charades-STA and ActivityNet, show that EVOQUER achieves promising improvements of 1.05 and 1.31 at R@0.7. We also discuss how the query generation task could facilitate error analysis by explaining temporal grounding model behavior.
    Generative Modelling of BRDF Textures from Flash Images. (arXiv:2102.11861v2 [cs.GR] UPDATED)
    (2 min) We learn a latent space for easy capture, consistent interpolation, and efficient reproduction of visual material appearance. When users provide a photo of a stationary natural material captured under flashlight illumination, first it is converted into a latent material code. Then, in the second step, conditioned on the material code, our method produces an infinite and diverse spatial field of BRDF model parameters (diffuse albedo, normals, roughness, specular albedo) that subsequently allows rendering in complex scenes and illuminations, matching the appearance of the input photograph. Technically, we jointly embed all flash images into a latent space using a convolutional encoder, and -- conditioned on these latent codes -- convert random spatial fields into fields of BRDF parameters using a convolutional neural network (CNN). We condition these BRDF parameters to match the visual characteristics (statistics and spectra of visual features) of the input under matching light. A user study compares our approach favorably to previous work, even those with access to BRDF supervision.
    Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization. (arXiv:2109.04753v1 [cs.CV])
    (2 min) Along with feature points for image matching, line features provide additional constraints for solving visual geometric problems in robotics and computer vision (CV). Although recent convolutional neural network (CNN)-based line descriptors are promising for viewpoint changes or dynamic environments, we claim that the CNN architecture has innate disadvantages in abstracting lines of variable length into a fixed-dimensional descriptor. In this paper, we introduce Line-Transformers, which deal with variable-length lines effectively. Inspired by natural language processing (NLP) tasks, where sentences can be understood and abstracted well in neural nets, we view a line segment as a sentence that contains points (words). By dynamically attending to well-describable points on a line, our descriptor performs excellently on lines of variable length. We also propose line signature networks that share a line's geometric attributes with its neighborhood. Acting as group descriptors, the networks enhance line descriptors by understanding lines' relative geometries. Finally, we present the proposed line descriptor and matching in a Point and Line Localization (PL-Loc) framework. We show that visual localization with feature points can be improved using our line features. We validate the proposed method for homography estimation and visual localization.
    Is Attention Better Than Matrix Decomposition? (arXiv:2109.04553v1 [cs.CV])
    (2 min) As an essential ingredient of modern deep learning, the attention mechanism, especially self-attention, plays a vital role in discovering global correlations. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) models developed 20 years ago in terms of performance and computational cost for encoding long-distance dependencies. We model the global context issue as a low-rank recovery problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module, self-attention, when gradients back-propagated through the MDs are handled carefully. Comprehensive experiments are conducted on vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants.
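    A rough PyTorch sketch of using a matrix-decomposition routine as a global-context block: flatten the feature map, run a few multiplicative NMF updates to get a low-rank reconstruction, and add it back residually. The Hamburger module adds "bread" projections and careful gradient handling through the solver; this shows only the core idea, with rank, iteration count, and gradient treatment as assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NMFContext(nn.Module):
        def __init__(self, rank=16, iters=6):
            super().__init__()
            self.rank, self.iters = rank, iters

        @torch.no_grad()  # treat the factorization as a non-differentiated inner solver
        def factorize(self, X):
            # X: (B, C, N) non-negative matrix; factor as D @ A with D:(B,C,r), A:(B,r,N).
            B, C, N = X.shape
            D = X.new_empty(B, C, self.rank).uniform_(0.1, 1.0)
            A = X.new_empty(B, self.rank, N).uniform_(0.1, 1.0)
            eps = 1e-6
            for _ in range(self.iters):  # standard multiplicative NMF updates
                A = A * (D.transpose(1, 2) @ X) / (D.transpose(1, 2) @ D @ A + eps)
                D = D * (X @ A.transpose(1, 2)) / (D @ (A @ A.transpose(1, 2)) + eps)
            return D @ A

        def forward(self, x):
            B, C, H, W = x.shape
            X = F.relu(x).reshape(B, C, H * W)        # non-negativity for NMF
            low_rank = self.factorize(X).reshape(B, C, H, W)
            return x + low_rank                        # residual global-context injection

    y = NMFContext()(torch.randn(2, 64, 32, 32))
    print(y.shape)
    ```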
  • cs.IR updates on arXiv.org

    A Search Engine for Discovery of Scientific Challenges and Directions. (arXiv:2108.13751v2 [cs.CL] UPDATED)
    (2 min) Keeping track of scientific challenges, advances and emerging directions is a fundamental part of research. However, researchers face a flood of papers that hinders discovery of important knowledge. In biomedicine, this directly impacts human lives. To address this problem, we present a novel task of extraction and search of scientific challenges and directions, to facilitate rapid knowledge discovery. We construct and release an expert-annotated corpus of texts sampled from full-length papers, labeled with novel semantic categories that generalize across many types of challenges and directions. We focus on a large corpus of interdisciplinary work relating to the COVID-19 pandemic, ranging from biomedicine to areas such as AI and economics. We apply a model trained on our data to identify challenges and directions across the corpus and build a dedicated search engine. In experiments with 19 researchers and clinicians using our system, we outperform a popular scientific search engine in assisting knowledge discovery. Finally, we show that models trained on our resource generalize to the wider biomedical domain and to AI papers, highlighting its broad utility. We make our data, model and search engine publicly available. https://challenges.apps.allenai.org/
    Web image search engine based on LSH index and CNN Resnet50. (arXiv:2108.13301v1 [cs.IR] CROSS LISTED)
    (2 min) To implement a good Content Based Image Retrieval (CBIR) system, it is essential to adopt efficient search methods. One way to achieve this is to exploit approximate search techniques. In fact, when we deal with very large collections of data, using an exact search method makes the system very slow. In this project, we adopt the Locality Sensitive Hashing (LSH) index to implement a CBIR system that allows us to perform fast similarity search on deep features. Specifically, we exploit transfer learning techniques to extract deep features from images; this phase is done using two famous Convolutional Neural Networks (CNNs) as feature extractors: Resnet50 and Resnet50v2, both pre-trained on ImageNet. Then we try out several fully connected deep neural networks, built on top of both of the previously mentioned CNNs, in order to fine-tune them on our dataset. In both cases, we index the features with our LSH index implementation and with a sequential scan, to better understand how much the introduction of the index affects the results. Finally, we carry out a performance analysis: we evaluate the relevance of the result set by computing the mAP (mean Average Precision) obtained in the different experiments with respect to the number of comparisons performed, while varying the hyper-parameter values of the LSH index.
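    A minimal numpy sketch of a random-hyperplane LSH index over deep features, illustrating the approximate-search idea; the bucket layout, number of tables/bits, and the re-ranking step are assumptions, not the project's implementation.

    ```python
    import numpy as np
    from collections import defaultdict

    class HyperplaneLSH:
        def __init__(self, dim, n_bits=16, n_tables=8, seed=0):
            rng = np.random.default_rng(seed)
            # One random projection matrix per hash table.
            self.planes = rng.normal(size=(n_tables, n_bits, dim))
            self.tables = [defaultdict(list) for _ in range(n_tables)]
            self.vectors = []

        def _keys(self, v):
            bits = (self.planes @ v > 0)          # (n_tables, n_bits) sign bits
            return [tuple(row) for row in bits]

        def add(self, v):
            idx = len(self.vectors)
            self.vectors.append(v)
            for table, key in zip(self.tables, self._keys(v)):
                table[key].append(idx)

        def query(self, v, k=5):
            candidates = set()
            for table, key in zip(self.tables, self._keys(v)):
                candidates.update(table[key])
            if not candidates:
                return []
            cand = np.array(sorted(candidates))
            feats = np.stack([self.vectors[i] for i in cand])
            # Re-rank the candidate bucket by exact cosine similarity.
            sims = feats @ v / (np.linalg.norm(feats, axis=1) * np.linalg.norm(v) + 1e-9)
            return [int(cand[j]) for j in np.argsort(-sims)[:k]]

    rng = np.random.default_rng(1)
    index = HyperplaneLSH(dim=2048)               # e.g. pooled ResNet50 features
    for f in rng.normal(size=(1000, 2048)):
        index.add(f)
    print(index.query(rng.normal(size=2048)))
    ```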
    You Get What You Chat: Using Conversations to Personalize Search-based Recommendations. (arXiv:2109.04716v1 [cs.IR])
    (2 min) Prior work on personalized recommendations has focused on exploiting explicit signals from user-specific queries, clicks, likes, and ratings. This paper investigates tapping into a different source of implicit signals of interests and tastes: online chats between users. The paper develops an expressive model and effective methods for personalizing search-based entity recommendations. User models derived from chats augment different methods for re-ranking entity answers for medium-grained queries. The paper presents specific techniques to enhance the user models by capturing domain-specific vocabularies and by entity-based expansion. Experiments are based on a collection of online chats from a controlled user study covering three domains: books, travel, and food. We evaluate different configurations and compare chat-based user models against concise user profiles from questionnaires. Overall, these two variants perform on par in terms of NDCG@20, but each has advantages in certain domains.
    COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval. (arXiv:2010.12800v2 [cs.CL] UPDATED)
    (2 min) We present a large, challenging dataset, COUGH, for COVID-19 FAQ retrieval. Similar to a standard FAQ dataset, COUGH consists of three parts: FAQ Bank, Query Bank and Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce Query Bank and Relevance Set, where the former contains 1,236 human-paraphrased queries while the latter contains ~32 human-annotated FAQ items for each query. We analyze COUGH by testing different FAQ retrieval models built on top of BM25 and BERT, among which the best model achieves 48.8 under P@5, indicating a great challenge presented by COUGH and encouraging future research for further improvement. Our COUGH dataset is available at https://github.com/sunlab-osu/covid-faq.
    AutoTriggER: Named Entity Recognition with Auxiliary Trigger Extraction. (arXiv:2109.04726v1 [cs.CL])
    (2 min) Deep neural models for low-resource named entity recognition (NER) have shown impressive results by leveraging distant supervision or other meta-level information (e.g., explanations). However, the costs of acquiring such additional information are generally prohibitive, especially in domains where existing resources (e.g., databases to be used for distant supervision) may not exist. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging "entity triggers" which are essentially human-readable clues in the text that can help guide the model to make better decisions. Thus, the framework is able to both create and leverage auxiliary supervision by itself. Through experiments on three well-studied NER datasets, we show that our automatically extracted triggers are well-matched to human triggers, and AutoTriggER improves performance over a RoBERTa-CRF architecture by nearly 0.5 F1 points on average and much more in a low-resource setting.
    Personalized Entity Search by Sparse and Scrutable User Profiles. (arXiv:2109.04713v1 [cs.IR])
    (2 min) Prior work on personalizing web search results has focused on considering query-and-click logs to capture users' individual interests. For product search, extensive user histories about purchases and ratings have been exploited. However, for general entity search, such as for books on specific topics or travel destinations with certain features, personalization is largely underexplored. In this paper, we address personalization of book search, as an exemplary case of entity search, by exploiting sparse user profiles obtained through online questionnaires. We devise and compare a variety of re-ranking methods based on language models or neural learning. Our experiments show that even very sparse information about individuals can enhance the effectiveness of the search results.
    Trust your neighbors: A comprehensive survey of neighborhood-based methods for recommender systems. (arXiv:2109.04584v1 [cs.IR])
    (2 min) Collaborative recommendation approaches based on nearest-neighbors are still highly popular today due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. This chapter offers a comprehensive survey of neighborhood-based methods for the item recommendation problem. It presents the main characteristics and benefits of such methods, describes key design choices for implementing a neighborhood-based recommender system, and gives practical information on how to make these choices. A broad range of methods is covered in the chapter, including traditional algorithms like k-nearest neighbors as well as advanced approaches based on matrix factorization, sparse coding and random walks.
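    A minimal item-based nearest-neighbor recommender of the kind surveyed here can be sketched as follows; the tiny rating matrix and the weighted-average prediction rule are illustrative choices rather than any specific method from the chapter.

        # Item-based kNN: score a user's unrated items via cosine similarity
        # between item rating vectors and a similarity-weighted average of the
        # user's own ratings on the nearest items.
        import numpy as np

        R = np.array([  # users x items, 0 = unrated
            [5, 4, 0, 1],
            [4, 0, 0, 1],
            [1, 1, 0, 5],
            [0, 1, 5, 4],
        ], dtype=float)

        norms = np.linalg.norm(R, axis=0) + 1e-9
        S = (R.T @ R) / np.outer(norms, norms)   # item-item cosine similarity
        np.fill_diagonal(S, 0.0)

        def recommend(user, k=2, top_n=2):
            unseen = np.where(R[user] == 0)[0]
            scores = np.zeros(R.shape[1])
            for item in unseen:
                neighbors = np.argsort(-S[item])[:k]              # k most similar items
                rated = [j for j in neighbors if R[user, j] > 0]  # ...that the user rated
                if rated:
                    w = S[item, rated]
                    scores[item] = w @ R[user, rated] / (w.sum() + 1e-9)
            return unseen[np.argsort(-scores[unseen])][:top_n]

        print(recommend(user=0, top_n=1))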
    Bursting Scientific Filter Bubbles: Boosting Innovation via Novel Author Discovery. (arXiv:2108.05669v2 [cs.DL] UPDATED)
    (2 min) Isolated silos of scientific research and the growing challenge of information overload limit awareness across the literature and hinder innovation. Algorithmic curation and recommendation, which often prioritize relevance, can further reinforce these informational "filter bubbles." In response, we describe Bridger, a system for facilitating discovery of scholars and their work, to explore design tradeoffs between relevant and novel recommendations. We construct a faceted representation of authors with information gleaned from their papers and inferred author personas, and use it to develop an approach that locates commonalities ("bridges") and contrasts between scientists -- retrieving partially similar authors rather than aiming for strict similarity. In studies with computer science researchers, this approach helps users discover authors considered useful for generating novel research directions, outperforming a state-of-the-art neural model. In addition to recommending new content, we also demonstrate an approach for displaying it in a manner that boosts researchers' ability to understand the work of authors with whom they are unfamiliar. Finally, our analysis reveals that Bridger connects authors who have different citation profiles, publish in different venues, and are more distant in social co-authorship networks, raising the prospect of bridging diverse communities and facilitating discovery.
    Assessing the Quality of the Datasets by Identifying Mislabeled Samples. (arXiv:2109.05000v1 [cs.LG])
    (2 min) Due to the over-emphasis on the quantity of data, data quality has often been overlooked. However, not all training data points contribute equally to learning. In particular, mislabeled points might actively damage the performance of the model and its ability to generalize out of distribution, as the model might end up learning spurious artifacts present in the dataset. This problem gets compounded by the prevalence of heavily parameterized and complex deep neural networks, which can, with their high capacity, end up memorizing the noise present in the dataset. This paper proposes a novel statistic -- the noise score -- as a measure of the quality of each data point, used to identify such mislabeled samples based on the variations in the latent space representation. In our work, we use the representations derived by the inference network of a data-quality-supervised variational autoencoder (AQUAVS). Our method leverages the fact that samples belonging to the same class will have similar latent representations. Therefore, by identifying the outliers in the latent space, we can find the mislabeled samples. We validate our proposed statistic through experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets in different noise settings for the task of identifying mislabeled samples. We further show significant improvements in accuracy for the classification task for each dataset.
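    A heavily simplified stand-in for such a latent-space noise score is sketched below: each sample is scored by how far its latent code lies from its labeled class centroid, standardized within the class. The encoder producing the latent codes is assumed to exist; this is not the AQUAVS model itself.

        # Flag samples whose latent representation is far from its labeled class
        # centroid; high scores mark candidate mislabeled points.
        import numpy as np

        def noise_scores(Z, y):
            """Z: (n, d) latent codes from some encoder, y: (n,) integer labels."""
            scores = np.zeros(len(y))
            for c in np.unique(y):
                idx = np.where(y == c)[0]
                centroid = Z[idx].mean(axis=0)
                dist = np.linalg.norm(Z[idx] - centroid, axis=1)
                # Standardize within the class so scores are comparable across classes.
                scores[idx] = (dist - dist.mean()) / (dist.std() + 1e-9)
            return scores

        Z = np.random.randn(200, 16)
        y = np.random.randint(0, 5, 200)
        suspects = np.argsort(-noise_scores(Z, y))[:10]
        print(suspects)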
    Dynamic Terminology Integration for COVID-19 and other Emerging Domains. (arXiv:2109.04708v1 [cs.CL])
    (2 min) The majority of language domains require prudent use of terminology to ensure clarity and adequacy of information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom available for less-resourced languages and niche domains. Furthermore, as exemplified by COVID-19 recently, no domain-specific parallel data is readily available for emerging domains. However, the gravity of this recent calamity created a high demand for reliable translation of critical information regarding the pandemic and infection prevention. This work is part of the WMT2021 Shared Task on Machine Translation using Terminologies, for which we describe Tilde MT systems that are capable of dynamic terminology integration at translation time. Our systems achieve up to 94% COVID-19 term use accuracy on the test set of the EN-FR language pair without having access to any form of in-domain information during system training. We conclude with a broader discussion of the Shared Task itself and of terminology translation in MT.
    Query-driven Segment Selection for Ranking Long Documents. (arXiv:2109.04611v1 [cs.IR])
    (2 min) Transformer-based rankers have shown state-of-the-art performance. However, their self-attention operation is mostly unable to process long sequences. One of the common approaches to train these rankers is to heuristically select some segments of each document, such as the first segment, as training data. However, these segments may not contain the query-related parts of documents. To address this problem, we propose query-driven segment selection from long documents to build training data. The segment selector provides relevant samples with more accurate labels and non-relevant samples which are harder to predict. The experimental results show that the basic BERT-based ranker trained with the proposed segment selector significantly outperforms the one trained on heuristically selected segments, and performs on par with the state-of-the-art model with localized self-attention that can process longer input sequences. Our findings open up a new direction for designing efficient transformer-based rankers.
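    As a rough sketch of query-driven segment selection, the snippet below splits a long document into overlapping windows and keeps the window with the highest term overlap with the query; the windowing parameters and the simple overlap scorer are assumptions, not the paper's selector.

        def select_segment(query, document, window=128, stride=64):
            # Keep the window whose tokens overlap most with the query terms.
            q_terms = set(query.lower().split())
            tokens = document.lower().split()
            best_seg, best_score = tokens[:window], -1.0
            for start in range(0, len(tokens), stride):
                seg = tokens[start:start + window]
                score = len(q_terms.intersection(seg)) / (len(q_terms) + 1e-9)
                if score > best_score:
                    best_seg, best_score = seg, score
            return " ".join(best_seg)

        doc = ("background text that never mentions the topic of interest " * 40 +
               "we study transformer rankers for long documents and ranking quality " * 2)
        print(select_segment("transformer ranking long documents", doc))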
  • cs.LG updates on arXiv.org

    Active learning for reducing labeling effort in text classification tasks. (arXiv:2109.04847v1 [cs.CL])
    (2 min) Labeling data can be an expensive task as it is usually performed manually by domain experts. This is cumbersome for deep learning, as it is dependent on large labeled datasets. Active learning (AL) is a paradigm that aims to reduce labeling effort by only using the data which the model deems most informative. Little research has been done on AL in a text classification setting and next to none has involved the more recent, state-of-the-art NLP models. Here, we present an empirical study that compares different uncertainty-based algorithms with BERT$_{base}$ as the classifier. We evaluate the algorithms on two NLP classification datasets: Stanford Sentiment Treebank and KvK-Frontpages. Additionally, we explore heuristics that aim to solve presupposed problems of uncertainty-based AL; namely, that it is unscalable and that it is prone to selecting outliers. Furthermore, we explore the influence of the query-pool size on the performance of AL. Although the proposed heuristics did not improve the performance of AL, our results show that using uncertainty-based AL with BERT$_{base}$ outperforms random sampling of data. This difference in performance can decrease as the query-pool size gets larger.
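    A minimal uncertainty-sampling loop of the kind evaluated here is sketched below; scikit-learn logistic regression on synthetic features stands in for the BERT$_{base}$ classifier, and the query-pool handling is deliberately simplified.

        # Entropy-based uncertainty sampling: repeatedly fit on the labeled set and
        # query the pool points the model is least certain about.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def active_learning(X, y, n_init=20, query_size=10, rounds=5, seed=0):
            rng = np.random.default_rng(seed)
            labeled = rng.choice(len(y), n_init, replace=False).tolist()
            pool = [i for i in range(len(y)) if i not in set(labeled)]
            clf = None
            for _ in range(rounds):
                clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
                proba = clf.predict_proba(X[pool])
                entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
                picked = np.argsort(-entropy)[:query_size].tolist()   # most uncertain
                labeled += [pool[i] for i in picked]
                pool = [idx for j, idx in enumerate(pool) if j not in set(picked)]
            return clf, labeled

        X = np.random.randn(500, 20)
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        clf, labeled = active_learning(X, y)
        print(len(labeled), round(clf.score(X, y), 3))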
    Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent. (arXiv:2010.09697v3 [cs.LG] UPDATED)
    (2 min) The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
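    A short sketch of how the growth of the $\ell_2$ parameter norm can be monitored during training is given below; the tiny encoder layer and dummy objective are placeholders, and only the norm logging reflects the measurement discussed above.

        # Log the overall l2 norm of all parameters at intervals during a toy loop.
        import torch
        import torch.nn as nn

        def param_norm(model):
            return torch.sqrt(sum((p ** 2).sum() for p in model.parameters())).item()

        model = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for step in range(201):
            x = torch.randn(8, 16, 32)        # (batch, sequence, d_model)
            loss = model(x).pow(2).mean()     # dummy objective, placeholder only
            opt.zero_grad(); loss.backward(); opt.step()
            if step % 50 == 0:
                print(step, round(param_norm(model), 3))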
    Visual Goal-Step Inference using wikiHow. (arXiv:2104.05845v2 [cs.CV] UPDATED)
    (2 min) Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
    ReasonBERT: Pre-trained to Reason with Distant Supervision. (arXiv:2109.04912v1 [cs.CL])
    (2 min) We present ReasonBert, a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid contexts. Unlike existing pre-training methods that only harvest learning signals from local contexts of naturally occurring texts, we propose a generalized notion of distant supervision to automatically connect multiple pieces of text and tables to create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. We conduct a comprehensive evaluation on a variety of extractive question answering datasets ranging from single-hop to multi-hop and from text-only to table-only to hybrid that require various reasoning capabilities and show that ReasonBert achieves remarkable improvement over an array of strong baselines. Few-shot experiments further demonstrate that our pre-training method substantially improves sample efficiency.
    Multi-label Classification of Aircraft Heading Changes Using Neural Network to Resolve Conflicts. (arXiv:2109.04767v1 [cs.LG])
    (2 min) An aircraft conflict occurs when two or more aircraft pass within a certain distance of each other at the same time. Specific air traffic controllers are assigned to solve such conflicts. A controller needs to consider various types of information in order to solve a conflict. The most common and preliminary information is the coordinate position of the involved aircraft. Additionally, a controller has to take into account more information such as flight planning, weather, restricted territory, etc. The most important challenge a controller faces is to weigh all of this information and make a decision in a very short time. Due to the increased number of aircraft, it is crucial to reduce the workload of the controllers and help them make quick decisions. A conflict can be solved in many ways; therefore, we treat this problem as a multi-label classification problem. We propose a multi-label classification model, named CRMLnet, which is based on a novel application of a multi-layer neural network, provides multiple heading advisories for a given conflict, and supports the controllers in their decisions. When compared to other machine learning models, our CRMLnet has achieved the best results with an accuracy of 98.72% and ROC of 0.999. The simulated data set that we have developed and used in our experiments will be delivered to the research community.
    C-MinHash: Practically Reducing Two Permutations to Just One. (arXiv:2109.04595v1 [cs.DS])
    (2 min) Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be (e.g.,) 1024 or even larger, depending on applications. The recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed. An initial permutation is applied to break whatever structures which might exist in the data, and a second permutation is re-used $K$ times to produce $K$ hashes, via a circulant shifting fashion. (Li and Li, 2021) has proved that, perhaps surprisingly, even though the $K$ hashes are correlated, the estimation variance is strictly smaller than the variance of the traditional MinHash. It has been demonstrated in (Li and Li, 2021) that the initial permutation in C-MinHash is indeed necessary. For the ease of theoretical analysis, they have used two independent permutations. In this paper, we show that one can actually simply use one permutation. That is, one single permutation is used for both the initial pre-processing step to break the structures in the data and the circulant hashing step to generate $K$ hashes. Although the theoretical analysis becomes very complicated, we are able to explicitly write down the expression for the expectation of the estimator. The new estimator is no longer unbiased but the bias is extremely small and has essentially no impact on the estimation accuracy (mean square errors). An extensive set of experiments are provided to verify our claim for using just one permutation.
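    The following sketch shows one plausible reading of the circulant construction with a single permutation: the same permutation pre-scrambles the coordinates and, circularly shifted $k$ times, yields the $k$-th hash. It is a hedged illustration only, not a verified reimplementation of C-MinHash.

        import numpy as np

        def c_minhash(active, perm, K):
            """active: indices of the 1s in a D-dimensional binary vector;
            perm: a single permutation of range(D)."""
            D = len(perm)
            scrambled = perm[active]                              # initial scrambling step
            return np.array([perm[(scrambled + k) % D].min() for k in range(K)])

        def jaccard_estimate(h1, h2):
            return float(np.mean(h1 == h2))                       # hash-collision estimator

        rng = np.random.default_rng(0)
        D, K = 4096, 256
        perm = rng.permutation(D)
        a = rng.choice(D, 300, replace=False)
        extra = rng.choice(np.setdiff1d(np.arange(D), a), 100, replace=False)
        b = np.concatenate([a[:200], extra])                      # true Jaccard = 200/400 = 0.5
        est = jaccard_estimate(c_minhash(a, perm, K), c_minhash(b, perm, K))
        print("estimated Jaccard:", round(est, 3))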
    BAM: A Balanced Attention Mechanism for Single Image Super Resolution. (arXiv:2104.07566v3 [eess.IV] UPDATED)
    (2 min) Recovering texture information from aliasing regions has always been a major challenge for the Single Image Super Resolution (SISR) task. These regions are often submerged in noise, so that we have to restore texture details while suppressing noise. To address this issue, we propose a Balanced Attention Mechanism (BAM), which consists of an Avgpool Channel Attention Module (ACAM) and a Maxpool Spatial Attention Module (MSAM) in parallel. ACAM is designed to suppress extreme noise in the large-scale feature maps while MSAM preserves high-frequency texture details. Thanks to the parallel structure, these two modules conduct not only self-optimization but also mutual optimization to balance noise reduction and high-frequency texture restoration during the back-propagation process, and the parallel structure makes the inference faster. To verify the effectiveness and robustness of BAM, we applied it to 10 SOTA SISR networks. The results demonstrate that BAM can efficiently improve the networks' performance, and for those originally equipped with an attention mechanism, substituting BAM further reduces the number of parameters and increases the inference speed. Moreover, we present a dataset with rich texture aliasing regions in real scenes, named realSR7. Experiments show that BAM achieves better super-resolution results in the aliasing areas.
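    A rough sketch of the two parallel branches, an average-pool channel attention and a max-pool spatial attention fused by averaging, is given below; the exact layer sizes and fusion rule used in BAM may differ.

        import torch
        import torch.nn as nn

        class ACAM(nn.Module):           # avg-pool channel attention
            def __init__(self, channels, reduction=16):
                super().__init__()
                self.fc = nn.Sequential(
                    nn.Linear(channels, channels // reduction), nn.ReLU(),
                    nn.Linear(channels // reduction, channels), nn.Sigmoid())
            def forward(self, x):
                w = self.fc(x.mean(dim=(2, 3)))                   # B x C channel weights
                return x * w[:, :, None, None]

        class MSAM(nn.Module):           # max-pool spatial attention
            def __init__(self, kernel_size=7):
                super().__init__()
                self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)
            def forward(self, x):
                w = torch.sigmoid(self.conv(x.max(dim=1, keepdim=True).values))  # B x 1 x H x W
                return x * w

        class BAMBlock(nn.Module):
            def __init__(self, channels):
                super().__init__()
                self.acam, self.msam = ACAM(channels), MSAM()
            def forward(self, x):
                return 0.5 * (self.acam(x) + self.msam(x))        # parallel branches fused

        x = torch.randn(2, 64, 32, 32)
        print(BAMBlock(64)(x).shape)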
    A conditional one-output likelihood formulation for multitask Gaussian processes. (arXiv:2006.03495v3 [cs.LG] UPDATED)
    (2 min) Multitask Gaussian processes (MTGP) are the Gaussian process (GP) framework's solution for multioutput regression problems in which the $T$ elements of the regressors cannot be considered conditionally independent given the observations. Standard MTGP models assume that there exist both a multitask covariance matrix as a function of an intertask matrix, and a noise covariance matrix. These matrices need to be approximated by a low rank simplification of order $P$ in order to reduce the number of parameters to be learnt from $T^2$ to $TP$. Here we introduce a novel approach that simplifies the multitask learning by reducing it to a set of conditioned univariate GPs without the need for any low rank approximations, therefore completely eliminating the requirement to select an adequate value for hyperparameter $P$. At the same time, by extending this approach with both a hierarchical and an approximate model, the proposed extensions are capable of recovering the multitask covariance and noise matrices after learning only $2T$ parameters, avoiding the validation of any model hyperparameter and reducing the overall complexity of the model as well as the risk of overfitting. Experimental results over synthetic and real problems confirm the advantages of this inference approach in its ability to accurately recover the original noise and signal matrices, as well as the achieved performance improvement in comparison to other state of art MTGP approaches. We have also integrated the model with standard GP toolboxes, showing that it is computationally competitive with state of the art options.
    Rapid Model Architecture Adaption for Meta-Learning. (arXiv:2109.04925v1 [cs.LG])
    (2 min) Network Architecture Search (NAS) methods have recently gathered much attention. They design networks with better performance and use a much shorter search time compared to traditional manual tuning. Despite their efficiency in model deployments, most NAS algorithms target a single task on a fixed hardware system. However, real-life few-shot learning environments often cover a great number of tasks (T ) and deployments on a wide variety of hardware platforms (H ). The combinatorial search complexity T times H creates a fundamental search efficiency challenge if one naively applies existing NAS methods to these scenarios. To overcome this issue, we show, for the first time, how to rapidly adapt model architectures to new tasks in a many-task many-hardware few-shot learning setup by integrating Model Agnostic Meta Learning (MAML) into the NAS flow. The proposed NAS method (H-Meta-NAS) is hardware-aware and performs optimisation in the MAML framework. H-Meta-NAS shows a Pareto dominance compared to a variety of NAS and manual baselines in popular few-shot learning benchmarks with various hardware platforms and constraints. In particular, on the 5-way 1-shot Mini-ImageNet classification task, the proposed method outperforms the best manual baseline by a large margin (5.21% in accuracy) using 60% less computation.
    Dual-State Capsule Networks for Text Classification. (arXiv:2109.04762v1 [cs.CL])
    (2 min) Text classification systems based on contextual embeddings are not viable options for many low-resource languages. On the other hand, recently introduced capsule networks have shown performance on par with these text classification models. Thus, they could be considered a viable alternative for text classification for languages that do not have pre-trained contextual embedding models. However, current capsule networks depend upon spatial patterns without considering the sequential features of the text. They are also sub-optimal in capturing the context-level information in longer sequences. This paper presents a novel Dual-State Capsule (DS-Caps) network-based technique for text classification, which is optimized to mitigate these issues. Two varieties of states, namely sentence-level and word-level, are integrated with capsule layers to capture deeper context-level information for language modeling. The dynamic routing process among capsules was also optimized using the context-level information obtained through sentence-level states. The DS-Caps networks outperform the existing capsule network architectures for multiple datasets, particularly for tasks with longer sequences of text. We also demonstrate the superiority of DS-Caps in text classification for a low-resource language.
    Multi-agent deep reinforcement learning (MADRL) meets multi-user MIMO systems. (arXiv:2109.04986v1 [cs.IT])
    (2 min) Multi-agent deep reinforcement learning (MADRL) is a promising approach to challenging problems in wireless environments involving multiple decision-makers (or actors) with high-dimensional continuous action spaces. In this paper, we present a MADRL-based approach that can jointly optimize precoders to achieve the outer boundary, called the Pareto boundary, of the achievable rate region for a multiple-input single-output (MISO) interference channel (IFC). In order to address two main challenges, namely, multiple actors (or agents) with partial observability and a multi-dimensional continuous action space in the MISO IFC setup, we adopt a multi-agent deep deterministic policy gradient (MA-DDPG) framework in which decentralized actors with partial observability can learn a multi-dimensional continuous policy in a centralized manner with the aid of a shared critic with global information. Meanwhile, we also address a phase ambiguity issue with the conventional complex baseband representation of signals widely used in radio communications. In order to mitigate the impact of phase ambiguity on training performance, we propose a training method, called phase ambiguity elimination (PAE), that leads to faster learning and better performance of MA-DDPG in wireless communication systems. The simulation results show that MA-DDPG is capable of learning a near-optimal precoding strategy in a MISO IFC environment. To the best of our knowledge, this is the first work to demonstrate that the MA-DDPG framework can jointly optimize precoders to achieve the Pareto boundary of the achievable rate region in a multi-cell multi-user multi-antenna system.
    A Large-Scale Study of Machine Translation in the Turkic Languages. (arXiv:2109.04593v1 [cs.CL])
    (2 min) Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including, i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 2 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public.
    The Effect of Efficient Messaging and Input Variability on Neural-Agent Iterated Language Learning. (arXiv:2104.07637v2 [cs.CL] UPDATED)
    (2 min) Natural languages display a trade-off among different strategies to convey syntactic structure, such as word order or inflection. This trade-off, however, has not appeared in recent simulations of iterated language learning with neural network agents (Chaabouni et al., 2019b). We re-evaluate this result in light of three factors that play an important role in comparable experiments from the Language Evolution field: (i) speaker bias towards efficient messaging, (ii) non systematic input languages, and (iii) learning bottleneck. Our simulations show that neural agents mainly strive to maintain the utterance type distribution observed during learning, instead of developing a more efficient or systematic language.
    Mesh convolutional neural networks for wall shear stress estimation in 3D artery models. (arXiv:2109.04797v1 [cs.LG])
    (2 min) Computational fluid dynamics (CFD) is a valuable tool for personalised, non-invasive evaluation of hemodynamics in arteries, but its complexity and time-consuming nature prohibit large-scale use in practice. Recently, the use of deep learning for rapid estimation of CFD parameters like wall shear stress (WSS) on surface meshes has been investigated. However, existing approaches typically depend on a hand-crafted re-parametrisation of the surface mesh to match convolutional neural network architectures. In this work, we propose to instead use mesh convolutional neural networks that directly operate on the same finite-element surface mesh as used in CFD. We train and evaluate our method on two datasets of synthetic coronary artery models with and without bifurcation, using a ground truth obtained from CFD simulation. We show that our flexible deep learning model can accurately predict 3D WSS vectors on this surface mesh. Our method processes new meshes in less than 5 [s], consistently achieves a normalised mean absolute error of $\leq$ 1.6 [%], and peaks at 90.5 [%] median approximation accuracy over the held-out test set, comparing favorably to previously published work. This shows the feasibility of CFD surrogate modelling using mesh convolutional neural networks for hemodynamic parameter estimation in artery models.
    Enhancing Unsupervised Anomaly Detection with Score-Guided Network. (arXiv:2109.04684v1 [cs.LG])
    (2 min) Anomaly detection plays a crucial role in various real-world applications, including healthcare and finance systems. Owing to the limited number of anomaly labels in these complex systems, unsupervised anomaly detection methods have attracted great attention in recent years. Two major challenges faced by the existing unsupervised methods are: (i) distinguishing between normal and abnormal data in the transition field, where normal and abnormal data are highly mixed together; (ii) defining an effective metric to maximize the gap between normal and abnormal data in a hypothesis space, which is built by a representation learner. To that end, this work proposes a novel scoring network with a score-guided regularization to learn and enlarge the anomaly score disparities between normal and abnormal data. With such a score-guided strategy, the representation learner can gradually learn more informative representations during the model training stage, especially for the samples in the transition field. We next propose a score-guided autoencoder (SG-AE), incorporating the scoring network into an autoencoder framework for anomaly detection, as well as into three other state-of-the-art models, to further demonstrate the effectiveness and transferability of the design. Extensive experiments on both synthetic and real-world datasets demonstrate the state-of-the-art performance of these score-guided models (SGMs).
    Saliency Guided Experience Packing for Replay in Continual Learning. (arXiv:2109.04954v1 [cs.LG])
    (2 min) Artificial learning systems aspire to mimic human intelligence by continually learning from a stream of tasks without forgetting past knowledge. One way to enable such learning is to store past experiences in the form of input examples in episodic memory and replay them when learning new tasks. However, the performance of such methods suffers as the size of the memory becomes smaller. In this paper, we propose a new approach for experience replay, where we select past experiences by looking at the saliency maps which provide visual explanations for the model's decisions. Guided by these saliency maps, we pack the memory with only the parts or patches of the input images important for the model's prediction. While learning a new task, we replay these memory patches with appropriate zero-padding to remind the model about its past decisions. We evaluate our algorithm on diverse image classification datasets and report better performance than the state-of-the-art approaches. With qualitative and quantitative analyses we show that our method captures a richer summary of past experiences without any memory increase, and hence performs well with a small episodic memory.
    Projected State-action Balancing Weights for Offline Reinforcement Learning. (arXiv:2109.04640v1 [cs.LG])
    (2 min) Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights, and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we make a first attempt towards characterizing the difficulty of OPE problems, which may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
    Semi-Supervised Learning using Siamese Networks. (arXiv:2109.00794v2 [cs.LG] UPDATED)
    (0 min) Neural networks have been successfully used as classification models yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are more difficult to train successfully for semi-supervised problems where small amounts of labeled instances are available along with a large number of unlabeled instances. This work explores a new training method for semi-supervised learning that is based on similarity function learning using a Siamese network to obtain a suitable embedding. The learned representations are discriminative in Euclidean space, and hence can be used for labeling unlabeled instances using a nearest-neighbor classifier. Confident predictions of unlabeled instances are used as true labels for retraining the Siamese network on the expanded training set. This process is applied iteratively. We perform an empirical study of this iterative self-training algorithm. For improving unlabeled predictions, local learning with global consistency [22] is also evaluated.
    Solid Texture Synthesis using Generative Adversarial Networks. (arXiv:2102.03973v4 [cs.CV] UPDATED)
    (0 min) Solid texture synthesis (STS), as an effective way to extend 2D exemplar to a 3D solid volume, exhibits advantages in numerous application domains. However, existing methods generally synthesize solid texture with specific features, which may result in the failure of capturing diversified textural information. In this paper, we propose a novel generative adversarial nets-based approach (STS-GAN) to hierarchically learn solid texture with a feature-free nature. Our multi-scale discriminators evaluate the similarity between patch from exemplar and slice from the generated volume, promoting the generator to synthesize realistic solid textures. Experimental results demonstrate that the proposed method can generate high-quality solid textures with similar visual characteristics to the exemplar.
    Multi-task learning for virtual flow metering. (arXiv:2103.08713v2 [cs.LG] UPDATED)
    (2 min) Virtual flow metering (VFM) is a cost-effective and non-intrusive technology for inferring multiphase flow rates in petroleum assets. Inferences about flow rates are fundamental to decision support systems that operators extensively rely on. Data-driven VFM, where mechanistic models are replaced with machine learning models, has recently gained attention due to its promise of lower maintenance costs. While excellent performances in small sample studies have been reported in the literature, there is still considerable doubt about the robustness of data-driven VFM. In this paper, we propose a new multi-task learning (MTL) architecture for data-driven VFM. Our method differs from previous methods in that it enables learning across oil and gas wells. We study the method by modeling 55 wells from four petroleum assets and compare the results with two single-task baseline models. Our findings show that MTL improves robustness over single-task methods, without sacrificing performance. MTL yields a 25-50% error reduction on average for the assets where single-task architectures are struggling.
    Differentiable Implicit Soft-Body Physics. (arXiv:2102.05791v3 [cs.LG] UPDATED)
    (2 min) We present a differentiable soft-body physics simulator that can be composed with neural networks as a differentiable layer. In contrast to other differentiable physics approaches that use explicit forward models to define state transitions, we focus on implicit state transitions defined via function minimization. Implicit state transitions appear in implicit numerical integration methods, which offer the benefits of large time steps and excellent numerical stability, but require a special treatment to achieve differentiability due to the absence of an explicit differentiable forward pass. In contrast to other implicit differentiation approaches that require explicit formulas for the force function and the force Jacobian matrix, we present an energy-based approach that allows us to compute these derivatives automatically and in a matrix-free fashion via reverse-mode automatic differentiation. This allows for more flexibility and productivity when defining physical models and is particularly important in the context of neural network training, which often relies on reverse-mode automatic differentiation (backpropagation). We demonstrate the effectiveness of our differentiable simulator in policy optimization for locomotion tasks and show that it achieves better sample efficiency than model-free reinforcement learning.
    NegatER: Unsupervised Discovery of Negatives in Commonsense Knowledge Bases. (arXiv:2011.07497v2 [cs.AI] UPDATED)
    (2 min) Codifying commonsense knowledge in machines is a longstanding goal of artificial intelligence. Recently, much progress toward this goal has been made with automatic knowledge base (KB) construction techniques. However, such techniques focus primarily on the acquisition of positive (true) KB statements, even though negative (false) statements are often also important for discriminative reasoning over commonsense KBs. As a first step toward the latter, this paper proposes NegatER, a framework that ranks potential negatives in commonsense KBs using a contextual language model (LM). Importantly, as most KBs do not contain negatives, NegatER relies only on the positive knowledge in the LM and does not require ground-truth negative examples. Experiments demonstrate that, compared to multiple contrastive data augmentation approaches, NegatER yields negatives that are more grammatical, coherent, and informative -- leading to statistically significant accuracy improvements in a challenging KB completion task and confirming that the positive knowledge in LMs can be "re-purposed" to generate negative knowledge.
    Measuring and Harnessing Transference in Multi-Task Learning. (arXiv:2010.15413v3 [cs.LG] UPDATED)
    (2 min) Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naive formulations often degrade performance and in particular, identifying the tasks that would benefit from co-training remains a challenging design question. In this paper, we analyze the dynamics of information transfer, or transference, across tasks throughout training. Specifically, we develop a similarity measure that can quantify transference among tasks and use this quantity to both better understand the optimization dynamics of multi-task learning as well as improve overall learning performance. In the latter case, we propose two methods to leverage our transference metric. The first operates at a macro-level by selecting which tasks should train together while the second functions at a micro-level by determining how to combine task gradients at each training step. We find these methods can lead to significant improvement over prior work on three supervised multi-task learning benchmarks and one multi-task reinforcement learning paradigm.
    PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams. (arXiv:2109.05013v1 [cs.LG])
    (2 min) As the number of Internet of Things (IoT) devices and systems has surged, IoT data analytics techniques have been developed to detect malicious cyber-attacks and secure IoT systems; however, concept drift issues often occur in IoT data analytics, as IoT data often consists of dynamic data streams that change over time, causing model degradation and attack detection failure. This is because traditional data analytics models are static models that cannot adapt to data distribution changes. In this paper, we propose a Performance Weighted Probability Averaging Ensemble (PWPAE) framework for drift-adaptive IoT anomaly detection through IoT data stream analytics. Experiments on two public datasets show the effectiveness of our proposed PWPAE method compared with state-of-the-art methods.
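    A simplified sketch of performance-weighted probability averaging over a data stream is shown below: each base learner's class probabilities are weighted by its accuracy on a sliding window of recent samples. The base models, window size, and prequential loop are placeholder choices, not the paper's PWPAE implementation.

        import numpy as np
        from collections import deque
        from sklearn.linear_model import SGDClassifier
        from sklearn.naive_bayes import GaussianNB

        class PWPAEnsemble:
            """Weight each base learner's class probabilities by its recent accuracy."""
            def __init__(self, models, window=200):
                self.models = models
                self.recent = [deque(maxlen=window) for _ in models]

            def predict_proba(self, x):
                x = x.reshape(1, -1)
                probs = [m.predict_proba(x)[0] for m in self.models]
                weights = np.array([np.mean(h) if h else 1.0 for h in self.recent])
                weights = weights / (weights.sum() + 1e-9)
                return (weights[:, None] * np.array(probs)).sum(axis=0)

            def update(self, x, y):
                x = x.reshape(1, -1)
                for m, hist in zip(self.models, self.recent):
                    hist.append(int(m.predict(x)[0] == y))   # track recent accuracy
                    m.partial_fit(x, [y])                    # then learn from the new sample

        # Prequential loop on a toy stream: predict first, then reveal the label.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 5)); y = (X[:, 0] > 0).astype(int)
        models = [SGDClassifier(loss="log_loss"), GaussianNB()]  # use loss="log" on older sklearn
        for m in models:
            m.partial_fit(X[:50], y[:50], classes=[0, 1])        # warm-up batch
        ens, correct = PWPAEnsemble(models), 0
        for xi, yi in zip(X[50:], y[50:]):
            correct += int(np.argmax(ens.predict_proba(xi)) == yi)
            ens.update(xi, yi)
        print("prequential accuracy:", correct / (len(y) - 50))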
    Citizen centric optimal electric vehicle charging stations locations in a full city: case of Malaga. (arXiv:2109.04975v1 [cs.NE])
    (2 min) This article presents the problem of locating electric vehicle (EV) charging stations in a city by defining the Electric Vehicle Charging Stations Locations (EV-CSL) problem. The idea is to minimize the distance the citizens have to travel to charge their vehicles. EV-CSL takes into account the maximum number of charging stations to install and the electric power requirements. Two metaheuristics are applied to address the underlying optimization problem: a genetic algorithm (GA) and a variable neighborhood search (VNS). The experimental analysis over a realistic scenario of the city of Malaga, Spain, shows that the metaheuristics are able to find competitive solutions which dramatically improve on the existing installation of stations in Malaga. The GA provided the best results, with statistical significance.
    De-identification of Unstructured Clinical Texts from Sequence to Sequence Perspective. (arXiv:2108.07971v2 [cs.CL] UPDATED)
    (2 min) In this work, we propose a novel problem formulation for de-identification of unstructured clinical text. We formulate the de-identification problem as a sequence-to-sequence learning problem instead of a token classification problem. Our approach is inspired by the recent state-of-the-art performance of sequence-to-sequence learning models for named entity recognition. Early experimentation with our proposed approach achieved a 98.91% recall rate on the i2b2 dataset. This performance is comparable to current state-of-the-art models for unstructured clinical text de-identification.
    Generative Modelling of BRDF Textures from Flash Images. (arXiv:2102.11861v2 [cs.GR] UPDATED)
    (2 min) We learn a latent space for easy capture, consistent interpolation, and efficient reproduction of visual material appearance. When users provide a photo of a stationary natural material captured under flashlight illumination, first it is converted into a latent material code. Then, in the second step, conditioned on the material code, our method produces an infinite and diverse spatial field of BRDF model parameters (diffuse albedo, normals, roughness, specular albedo) that subsequently allows rendering in complex scenes and illuminations, matching the appearance of the input photograph. Technically, we jointly embed all flash images into a latent space using a convolutional encoder, and -- conditioned on these latent codes -- convert random spatial fields into fields of BRDF parameters using a convolutional neural network (CNN). We condition these BRDF parameters to match the visual characteristics (statistics and spectra of visual features) of the input under matching light. A user study compares our approach favorably to previous work, even those with access to BRDF supervision.
    Lessons on Parameter Sharing across Layers in Transformers. (arXiv:2104.06022v2 [cs.CL] UPDATED)
    (2 min) We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), in order to improve computational efficiency. We propose three strategies, Sequence, Cycle, and Cycle (rev), to assign parameters to each layer. Experimental results show that the proposed strategies are efficient in terms of parameter size and computational time. Moreover, we show that the proposed strategies are also effective in settings with large amounts of training data, such as the recent WMT competition.
    Secure Multi-Function Computation with Private Remote Sources. (arXiv:2106.09485v3 [cs.IT] UPDATED)
    (2 min) We consider a distributed function computation problem in which parties observing noisy versions of a remote source facilitate the computation of a function of their observations at a fusion center through public communication. The distributed function computation is subject to constraints, including not only reliability and storage but also privacy and secrecy. Specifically, 1) the remote source should remain private from an eavesdropper and the fusion center, measured in terms of the information leaked about the remote source; 2) the function computed should remain secret from the eavesdropper, measured in terms of the information leaked about the arguments of the function, to ensure secrecy regardless of the exact function used. We derive the exact rate regions for lossless and lossy single-function computation and illustrate the lossy single-function computation rate region for an information bottleneck example, in which the optimal auxiliary random variables are characterized for binary-input symmetric-output channels. We extend the approach to lossless and lossy asynchronous multiple-function computations with joint secrecy and privacy constraints, in which case inner and outer bounds for the rate regions differing only in the Markov chain conditions imposed are characterized.
    Towards control of opinion diversity by introducing zealots into a polarised social group. (arXiv:2006.07265v5 [cs.SI] UPDATED)
    (2 min) We explore a method to influence or even control the diversity of opinions within a polarised social group. We leverage the voter model in which users hold binary opinions and repeatedly update their beliefs based on others they connect with. Stubborn agents who never change their minds ("zealots") are also disseminated through the network, which is modelled by a connected graph. Building on earlier results, we provide a closed-form expression for the average opinion of the group at equilibrium. This leads us to a strategy to inject zealots into a polarised network in order to shift the average opinion towards any target value. We account for the possible presence of a backfire effect, which may lead the group to react negatively and reinforce its level of polarisation in response. Our results are supported by numerical experiments on synthetic data.
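    A toy simulation of the binary voter model with zealots is sketched below; the random graph, the number of zealots, and their placement are illustrative, and the paper's closed-form equilibrium expression is not reproduced here.

        # Binary voter model with zealots: zealots never update, regular nodes
        # repeatedly copy the opinion of a random neighbor; we track the average opinion.
        import numpy as np
        import networkx as nx

        rng = np.random.default_rng(1)
        G = nx.erdos_renyi_graph(200, 0.05, seed=1)
        opinion = rng.integers(0, 2, G.number_of_nodes()).astype(float)
        zealots_one = rng.choice(G.number_of_nodes(), 15, replace=False)  # stuck at opinion 1
        opinion[zealots_one] = 1.0
        frozen = set(zealots_one)

        for _ in range(50_000):
            i = rng.integers(G.number_of_nodes())
            if i in frozen:
                continue
            neighbors = list(G.neighbors(i))
            if neighbors:
                opinion[i] = opinion[rng.choice(neighbors)]

        print("average opinion:", opinion.mean())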
    Learning the Hypotheses Space from data: Learning Space and U-curve Property. (arXiv:2001.09532v3 [stat.ML] UPDATED)
    (2 min) This paper presents an extension of the classical agnostic PAC learning model in which learning problems are modelled not only by a Hypothesis Space $\mathcal{H}$, but also by a Learning Space $\mathbb{L}(\mathcal{H})$, which is a cover of $\mathcal{H}$, constrained by a VC-dimension property, that is a suitable domain for Model Selection algorithms. Our main contribution is a data driven general learning algorithm to perform regularized Model Selection on $\mathbb{L}(\mathcal{H})$. A remarkable, formally proved, consequence of this approach are conditions on $\mathbb{L}(\mathcal{H})$ and on the loss function that lead to estimated out-of-sample error surfaces which are true U-curves on $\mathbb{L}(\mathcal{H})$ chains, enabling a more efficient search on $\mathbb{L}(\mathcal{H})$. To our knowledge, this is the first rigorous result asserting that a non exhaustive search of a family of candidate models can return an optimal solution. In this new framework, an U-curve optimization algorithm becomes a natural component of Model Selection, hence of learning algorithms. The abstract general framework proposed here may have important implications on modern learning models and on areas such as Neural Architecture Search.
    FR-Detect: A Multi-Modal Framework for Early Fake News Detection on Social Media Using Publishers Features. (arXiv:2109.04835v1 [cs.SI])
    (2 min) In recent years, with the expansion of the Internet and attractive social media infrastructures, people increasingly prefer to follow the news through these media. Despite the many advantages of these media in the news field, the lack of any control and verification mechanism has led to the spread of fake news, one of the most important threats to democracy, the economy, journalism and freedom of expression. Designing and using automatic methods to detect fake news on social media has become a significant challenge. In this paper, we examine the publishers' role in detecting fake news on social media. We also propose a highly accurate multi-modal framework, namely FR-Detect, which uses user-related and content-related features and has early detection capability. For this purpose, two new user-related features, namely Activity Credibility and Influence, have been introduced for publishers. Furthermore, a sentence-level convolutional neural network is used to properly combine these features with latent textual content features. Experimental results have shown that the publishers' features can improve the performance of content-based models by up to 13% and 29% in accuracy and F1-score, respectively.
    Unsupervised Change Detection in Hyperspectral Images using Feature Fusion Deep Convolutional Autoencoders. (arXiv:2109.04990v1 [cs.CV])
    (2 min) Binary change detection in bi-temporal co-registered hyperspectral images is a challenging task due to the large number of spectral bands present in the data. Researchers therefore try to handle it by reducing the dimensionality. The proposed work aims to build a novel feature extraction system using a feature fusion deep convolutional autoencoder for detecting changes between a pair of such bi-temporal co-registered hyperspectral images. The feature fusion considers features across successive levels and multiple receptive fields and therefore adds a competitive edge over the existing feature extraction methods. The change detection technique described is completely unsupervised and is much more elegant than other supervised or semi-supervised methods which require some amount of label information. Different methods have been applied to the extracted features to find the changes in the two images, and it is found that the proposed method clearly outperforms state-of-the-art methods in unsupervised change detection on all the datasets.
    Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning. (arXiv:2109.04689v1 [cs.CL])
    (2 min) Motivated by suggested question generation in conversational news recommendation systems, we propose a model for generating question-answer pairs (QA pairs) with self-contained, summary-centric questions and length-constrained, article-summarizing answers. We begin by collecting a new dataset of news articles with questions as titles and pairing them with summaries of varying length. This dataset is used to learn a QA pair generation model producing summaries as answers that balance brevity with sufficiency jointly with their corresponding questions. We then reinforce the QA pair generation process with a differentiable reward function to mitigate exposure bias, a common problem in natural language generation. Both automatic metrics and human evaluation demonstrate these QA pairs successfully capture the central gists of the articles and achieve high answer accuracy.
    Dynamic Collective Intelligence Learning: Finding Efficient Sparse Model via Refined Gradients for Pruned Weights. (arXiv:2109.04660v1 [cs.LG])
    (2 min) With the growth of deep neural networks (DNN), the number of DNN parameters has drastically increased. This makes DNN models hard to deploy on resource-limited embedded systems. To alleviate this problem, dynamic pruning methods have emerged, which try to find diverse sparsity patterns during training by utilizing the Straight-Through Estimator (STE) to approximate the gradients of pruned weights. STE can help the pruned weights revive in the process of finding dynamic sparsity patterns. However, using these coarse gradients causes training instability and performance degradation owing to the unreliable gradient signal of the STE approximation. In this work, to tackle this issue, we introduce refined gradients to update the pruned weights by forming dual forwarding paths from two sets (pruned and unpruned) of weights. We propose a novel Dynamic Collective Intelligence Learning (DCIL) which makes use of the learning synergy between the collective intelligence of both weight sets. We verify the usefulness of the refined gradients by showing enhancements in the training stability and the model performance on the CIFAR and ImageNet datasets. DCIL outperforms various previously proposed pruning schemes including other dynamic pruning methods, with enhanced stability during training.
    Investigating Numeracy Learning Ability of a Text-to-Text Transfer Model. (arXiv:2109.04672v1 [cs.CL])
    (2 min) The transformer-based pre-trained language models have been tremendously successful in most of the conventional NLP tasks. But they often struggle in those tasks where numerical understanding is required. Some possible reasons can be the tokenizers and pre-training objectives which are not specifically designed to learn and preserve numeracy. Here we investigate the ability of text-to-text transfer learning model (T5), which has outperformed its predecessors in the conventional NLP tasks, to learn numeracy. We consider four numeracy tasks: numeration, magnitude order prediction, finding minimum and maximum in a series, and sorting. We find that, although T5 models perform reasonably well in the interpolation setting, they struggle considerably in the extrapolation setting across all four tasks.
    Efficient Locally Optimal Number Set Partitioning for Scheduling, Allocation and Fair Selection. (arXiv:2109.04809v1 [cs.DS])
    (2 min) We study the optimization version of the set partition problem (where the difference between the partition sums is minimized), which has numerous applications in the decision theory literature. Since the set partitioning problem is NP-hard and requires exponential time to solve exactly (i.e., it is intractable), we formulate a weaker version of this NP-hard problem, where the goal is to find a locally optimal solution. We show that our proposed algorithms can find a locally optimal solution in near linear time. Our algorithms require neither positive nor integer elements in the input set, hence they are more widely applicable.
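    As an illustration of local optimality in number partitioning, the sketch below greedily builds a two-way split and then keeps moving single elements across the split while that shrinks the difference of the sums; it illustrates the notion of a locally optimal partition only and is not the paper's near-linear-time algorithm.

        def local_partition(values):
            # Greedy initialization: assign each value to the currently lighter side.
            a, b = [], []
            sa = sb = 0.0
            for v in sorted(values, reverse=True):
                if sa <= sb:
                    a.append(v); sa += v
                else:
                    b.append(v); sb += v
            # Local search: moving v from a to b changes (sa - sb) by -2*v,
            # moving v from b to a changes it by +2*v, i.e. by -flip*2*v below.
            improved = True
            while improved:
                improved = False
                for src, dst, flip in ((a, b, 1), (b, a, -1)):
                    for i, v in enumerate(src):
                        if abs((sa - sb) - flip * 2 * v) < abs(sa - sb):
                            dst.append(src.pop(i))
                            sa -= flip * v; sb += flip * v
                            improved = True
                            break
                    if improved:
                        break
            return a, b, abs(sa - sb)

        print(local_partition([8, 7, 6, 5, 4, 3, 2, 2])[2])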
    Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition. (arXiv:2109.04783v1 [cs.SD])
    (2 min) When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% WERR compared to a state-of-the-art fixed beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection between the SACC and traditional beamformers, and analyze the intermediate outputs of the SACC.
    ReLU Regression with Massart Noise. (arXiv:2109.04623v1 [cs.LG])
    (2 min) We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is {\em at most} $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.
    On the validity of pre-trained transformers for natural language processing in the software engineering domain. (arXiv:2109.04738v1 [cs.SE])
    (2 min) Transformers are the current state-of-the-art of natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding regarding the validity of transformers within the software engineering domain, i.e., how good such models are at understanding words and sentences within a software engineering context and how this improves the state-of-the-art. Within this article, we shed light on this complex but crucial issue. We compare BERT transformer models trained with software engineering data with transformers based on general domain data in multiple dimensions: their vocabulary, their ability to understand which words are missing, and their performance in classification tasks. Our results show that for tasks that require understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, also within the software engineering domain.
    Supervising the Decoder of Variational Autoencoders to Improve Scientific Utility. (arXiv:2109.04561v1 [stat.ML])
    (2 min) Probabilistic generative models are attractive for scientific modeling because their inferred parameters can be used to generate hypotheses and design experiments. This requires that the learned model provide an accurate representation of the input data and yield a latent space that effectively predicts outcomes relevant to the scientific question. Supervised Variational Autoencoders (SVAEs) have previously been used for this purpose, where a carefully designed decoder can be used as an interpretable generative model while the supervised objective ensures a predictive latent representation. Unfortunately, the supervised objective forces the encoder to learn a biased approximation to the generative posterior distribution, which renders the generative parameters unreliable when used in scientific models. This issue has remained undetected as reconstruction losses commonly used to evaluate model performance do not detect bias in the encoder. We address this previously-unreported issue by developing a second order supervision framework (SOS-VAE) that influences the decoder to induce a predictive latent representation. This ensures that the associated encoder maintains a reliable generative interpretation. We extend this technique to allow the user to trade-off some bias in the generative parameters for improved predictive performance, acting as an intermediate option between SVAEs and our new SOS-VAE. We also use this methodology to address missing data issues that often arise when combining recordings from multiple scientific experiments. We demonstrate the effectiveness of these developments using synthetic data and electrophysiological recordings with an emphasis on how our learned representations can be used to design scientific experiments.
    SPECTRA: Sparse Structured Text Rationalization. (arXiv:2109.04552v1 [cs.CL])
    (2 min) Selective rationalization aims to produce decisions along with rationales (e.g., text highlights or word alignments between two sentences). Commonly, rationales are modeled as stochastic binary masks, requiring sampling-based gradient estimators, which complicates training and requires careful hyperparameter tuning. Sparse attention mechanisms are a deterministic alternative, but they lack a way to regularize the rationale extraction (e.g., to control the sparsity of a text highlight or the number of alignments). In this paper, we present a unified framework for deterministic extraction of structured explanations via constrained inference on a factor graph, forming a differentiable layer. Our approach greatly eases training and rationale regularization, generally outperforming previous work in terms of the performance and plausibility of the extracted rationales. We further provide a comparative study of stochastic and deterministic methods for rationale extraction for classification and natural language inference tasks, jointly assessing their predictive power, quality of the explanations, and model variability.
    SanitAIs: Unsupervised Data Augmentation to Sanitize Trojaned Neural Networks. (arXiv:2109.04566v1 [cs.LG])
    (2 min) The application of self-supervised methods has resulted in broad improvements to neural network performance by leveraging large, untapped collections of unlabeled data to learn generalized underlying structure. In this work, we harness unsupervised data augmentation (UDA) to mitigate backdoor or Trojan attacks on deep neural networks. We show that UDA is more effective at removing the effects of a trigger than current state-of-the-art methods for both feature space and point triggers. These results demonstrate that UDA is both an effective and practical approach to mitigating the effects of backdoors on neural networks.
    SeDyT: A General Framework for Multi-Step Event Forecasting via Sequence Modeling on Dynamic Entity Embeddings. (arXiv:2109.04550v1 [cs.LG])
    (2 min) Temporal Knowledge Graphs store events in the form of subjects, relations, objects, and timestamps, which are often represented by dynamic heterogeneous graphs. Event forecasting is a critical and challenging task in Temporal Knowledge Graph reasoning that predicts the subject or object of an event in the future. To obtain temporal embeddings multiple steps into the future, existing methods learn generative models that capture the joint distribution of the observed events. To reduce the high computation costs, these methods rely on unrealistic assumptions of independence and approximations in training and inference. In this work, we propose SeDyT, a discriminative framework that performs sequence modeling on dynamic entity embeddings to solve the multi-step event forecasting problem. SeDyT consists of two components: a Temporal Graph Neural Network that generates dynamic entity embeddings in the past and a sequence model that predicts the entity embeddings in the future. Compared with the generative models, SeDyT does not rely on any heuristic-based probability model and has low computational complexity in both training and inference. SeDyT is compatible with most Temporal Graph Neural Networks and sequence models. We also design an efficient training method that trains the two components in one gradient descent propagation. We evaluate the performance of SeDyT on five popular datasets. By combining temporal Graph Neural Network models and sequence models, SeDyT achieves an average of 2.4% MRR improvement when not using the validation set and more than 10% MRR improvement when using the validation set.
    Knowledge-Aware Meta-learning for Low-Resource Text Classification. (arXiv:2109.04707v1 [cs.CL])
    (2 min) Meta-learning has achieved great success in leveraging the historical learned knowledge to facilitate the learning process of the new task. However, merely learning the knowledge from the historical tasks, adopted by current meta-learning algorithms, may not generalize well to testing tasks when they are not well-supported by training tasks. This paper studies a low-resource text classification problem and bridges the gap between meta-training and meta-testing tasks by leveraging the external knowledge bases. Specifically, we propose KGML to introduce additional representation for each sentence learned from the extracted sentence-specific knowledge graph. The extensive experiments on three datasets demonstrate the effectiveness of KGML under both supervised adaptation and unsupervised adaptation settings.
    Transfer Learning and Curriculum Learning in Sokoban. (arXiv:2105.11702v2 [cs.AI] UPDATED)
    (2 min) Transfer learning can speed up training in machine learning and is regularly used in classification tasks. It reuses prior knowledge from other tasks to pre-train networks for new tasks. In reinforcement learning, learning actions for a behavior policy that can be applied to new environments is still a challenge, especially for tasks that involve much planning. Sokoban is a challenging puzzle game. It has been used widely as a benchmark in planning-based reinforcement learning. In this paper, we show how prior knowledge improves learning in Sokoban tasks. We find that reusing feature representations learned previously can accelerate learning new, more complex, instances. In effect, we show how curriculum learning, from simple to complex tasks, works in Sokoban. Furthermore, feature representations learned in simpler instances are more general, and thus lead to positive transfers towards more complex tasks, but not vice versa. We have also studied which part of the knowledge is most important for transfer to succeed, and identify which layers should be used for pre-training.
    Global Optimization of Objective Functions Represented by ReLU Networks. (arXiv:2010.03258v3 [cs.LG] UPDATED)
    (0 min) Neural networks can learn complex, non-convex functions, and it is challenging to guarantee their correct behavior in safety-critical contexts. Many approaches exist to find failures in networks (e.g., adversarial examples), but these cannot guarantee the absence of failures. Verification algorithms address this need and provide formal guarantees about a neural network by answering "yes or no" questions. For example, they can answer whether a violation exists within certain bounds. However, individual "yes or no" questions cannot answer qualitative questions such as "what is the largest error within these bounds"; the answers to these lie in the domain of optimization. Therefore, we propose strategies to extend existing verifiers to perform optimization and find: (i) the most extreme failure in a given input region and (ii) the minimum input perturbation required to cause a failure. A naive approach using a bisection search with an off-the-shelf verifier results in many expensive and overlapping calls to the verifier. Instead, we propose an approach that tightly integrates the optimization process into the verification procedure, achieving better runtime performance than the naive approach. We evaluate our approach implemented as an extension of Marabou, a state-of-the-art neural network verifier, and compare its performance with the bisection approach and MIPVerify, an optimization-based verifier. We observe complementary performance between our extension of Marabou and MIPVerify.
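    A naive bisection strategy of the kind the authors improve upon can be sketched with a hypothetical yes/no verifier oracle (the oracle below is a toy stand-in, not a call to Marabou or MIPVerify):
```python
def maximize_violation(verifier, lo, hi, tol=1e-3):
    """Naive bisection: find the largest error magnitude t for which the
    verifier still reports a counterexample inside the input region.
    'verifier(t)' is a hypothetical yes/no oracle; each call would be
    expensive when backed by a real verification tool."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if verifier(mid):   # a violation of magnitude >= mid exists
            lo = mid
        else:               # no violation of magnitude >= mid
            hi = mid
    return lo

# Toy oracle standing in for an expensive verifier call:
print(round(maximize_violation(lambda t: t <= 0.7368, 0.0, 10.0), 3))
```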
    FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging. (arXiv:2012.15781v2 [cs.LG] UPDATED)
    (0 min) Influence functions approximate the "influences" of training data-points for test predictions and have a wide variety of applications. Despite their popularity, their computational cost does not scale well with model and training data size. We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time. We use k-Nearest Neighbors (kNN) to narrow the search space down to a subset of good candidate data points, identify the configurations that best balance the speed-quality trade-off in estimating the inverse Hessian-vector product, and introduce a fast parallel variant. Our proposed method achieves about 80X speedup while being highly correlated with the original influence values. With the availability of the fast influence functions, we demonstrate their usefulness in four applications. First, we examine whether influential data-points can "explain" test time behavior using the framework of simulatability. Second, we visualize the influence interactions between training and test data-points. Third, we show that we can correct model errors by additional fine-tuning on certain influential data-points, improving the accuracy of a trained MultiNLI model by 2.5% on the HANS dataset. Finally, we experiment with a similar setup but fine-tune on data-points not seen during training, improving the model accuracy by 2.8% and 1.7% on HANS and ANLI datasets respectively. Overall, our fast influence functions can be efficiently applied to large models and datasets, and our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors. Code is available at https://github.com/salesforce/fast-influence-functions
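    The kNN narrowing step can be sketched as follows, assuming each training and test example has already been mapped to a fixed-size feature vector (e.g., a final-layer representation); only the returned candidates would then go through the expensive influence computation:
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def candidate_training_points(train_feats, test_feat, k=100):
    """Sketch of kNN candidate narrowing: restrict influence estimation to
    the k training points closest to the test example in feature space."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    _, idx = nn.kneighbors(test_feat.reshape(1, -1))
    return idx[0]   # indices passed on to the expensive influence step

train_feats = np.random.randn(10000, 768)       # toy stand-in features
cands = candidate_training_points(train_feats, np.random.randn(768), k=50)
print(cands.shape)                               # (50,)
```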
    Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation. (arXiv:2105.03432v2 [cs.CL] UPDATED)
    (0 min) Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language. Previous approaches in this task have been able to generalise to rare or unseen instances by relying on a delexicalisation of the input. However, this often requires that the input appears verbatim in the output text. This poses challenges in multilingual settings, where the task expands to generate the output text in multiple languages given the same input. In this paper, we explore the application of multilingual models in concept-to-text and propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings, and employs a character-level post-editing model to inflect words in their correct form during relexicalisation. Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text and that our framework outperforms previous approaches, especially for low resource languages.
    Probing Commonsense Explanation in Dialogue Response Generation. (arXiv:2104.09574v4 [cs.CL] UPDATED)
    (0 min) Humans use commonsense reasoning (CSR) implicitly to produce natural and coherent responses in conversations. Aiming to close the gap between current response generation (RG) models and human communication abilities, we want to understand why RG models respond as they do by probing an RG model's understanding of the commonsense reasoning that elicits proper responses. We formalize the problem by framing commonsense as a latent variable in the RG task and using explanations for responses as a textual form of commonsense. We collect 6k annotated explanations justifying responses from four dialogue datasets, ask humans to verify them, and propose two probing settings to evaluate RG models' CSR capabilities. Probing results show that models fail to capture the logical relations between commonsense explanations and responses, and that fine-tuning on in-domain data and increasing model sizes do not lead to understanding of CSR for RG. We hope our study motivates more research in making RG models emulate the human reasoning process in pursuit of smooth human-AI communication.
    Query-driven Segment Selection for Ranking Long Documents. (arXiv:2109.04611v1 [cs.IR])
    (0 min) Transformer-based rankers have shown state-of-the-art performance. However, their self-attention operation is mostly unable to process long sequences. One of the common approaches to train these rankers is to heuristically select some segments of each document, such as the first segment, as training data. However, these segments may not contain the query-related parts of documents. To address this problem, we propose query-driven segment selection from long documents to build training data. The segment selector provides relevant samples with more accurate labels and non-relevant samples that are harder to predict. The experimental results show that the basic BERT-based ranker trained with the proposed segment selector significantly outperforms the one trained on heuristically selected segments, and performs on par with the state-of-the-art model with localized self-attention that can process longer input sequences. Our findings open up a new direction for designing efficient transformer-based rankers.
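    A minimal sketch of query-driven segment selection, using simple term overlap as a stand-in for the paper's relevance scoring, could look like this:
```python
def select_segment(document, query, seg_len=128, stride=64):
    """Slide a window over the document and keep the segment with the
    highest term overlap with the query (toy relevance score, not the
    paper's selector)."""
    doc_toks = document.lower().split()
    q_toks = set(query.lower().split())
    best, best_score = doc_toks[:seg_len], -1
    for start in range(0, max(1, len(doc_toks) - seg_len + 1), stride):
        seg = doc_toks[start:start + seg_len]
        score = sum(tok in q_toks for tok in seg)
        if score > best_score:
            best, best_score = seg, score
    return " ".join(best)

doc = "long documents often bury the query relevant passage deep inside " * 20
print(select_segment(doc, "query relevant passage", seg_len=16, stride=8))
```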
    Relational World Knowledge Representation in Contextual Language Models: A Review. (arXiv:2104.05837v2 [cs.CL] UPDATED)
    (0 min) Relational knowledge bases (KBs) are commonly used to represent world knowledge in machines. However, while advantageous for their high degree of precision and interpretability, KBs are usually organized according to manually-defined schemas, which limit their expressiveness and require significant human efforts to engineer and maintain. In this review, we take a natural language processing perspective to these limitations, examining how they may be addressed in part by training deep contextual language models (LMs) to internalize and express relational knowledge in more flexible forms. We propose to organize knowledge representation strategies in LMs by the level of KB supervision provided, from no KB supervision at all to entity- and relation-level supervision. Our contributions are threefold: (1) We provide a high-level, extensible taxonomy for knowledge representation in LMs; (2) Within our taxonomy, we highlight notable models, evaluation tasks, and findings, in order to provide an up-to-date review of current knowledge representation capabilities in LMs; and (3) We suggest future research directions that build upon the complementary aspects of LMs and KBs as knowledge representations.
    Optimised one-class classification performance. (arXiv:2102.02618v2 [cs.LG] UPDATED)
    (0 min) We provide a thorough treatment of one-class classification with hyperparameter optimisation for five data descriptors: Support Vector Machine (SVM), Nearest Neighbour Distance (NND), Localised Nearest Neighbour Distance (LNND), Local Outlier Factor (LOF) and Average Localised Proximity (ALP). The hyperparameters of SVM and LOF have to be optimised through cross-validation, while NND, LNND and ALP allow an efficient form of leave-one-out validation and the reuse of a single nearest-neighbour query. We experimentally evaluate the effect of hyperparameter optimisation with 246 classification problems drawn from 50 datasets. From a selection of optimisation algorithms, the recent Malherbe-Powell proposal optimises the hyperparameters of all data descriptors most efficiently. We calculate the increase in test AUROC and the amount of overfitting as a function of the number of hyperparameter evaluations. After 50 evaluations, ALP and SVM significantly outperform LOF, NND and LNND, and LOF and NND outperform LNND. The performance of ALP and SVM is comparable, but ALP can be optimised more efficiently so constitutes a good default choice. Alternatively, using validation AUROC as a selection criterion between ALP or SVM gives the best overall result, and NND is the least computationally demanding option. We thus end up with a clear trade-off between three choices, allowing practitioners to make an informed decision.
    Semi-supervised Relation Extraction via Incremental Meta Self-Training. (arXiv:2010.16410v2 [cs.CL] UPDATED)
    (0 min) To alleviate human efforts from obtaining large-scale annotations, Semi-Supervised Relation Extraction methods aim to leverage unlabeled data in addition to learning from limited samples. Existing self-training methods suffer from the gradual drift problem, where noisy pseudo labels on unlabeled data are incorporated during training. To alleviate the noise in pseudo labels, we propose a method called MetaSRE, where a Relation Label Generation Network generates quality assessment on pseudo labels by (meta) learning from the successful and failed attempts on Relation Classification Network as an additional meta-objective. To reduce the influence of noisy pseudo labels, MetaSRE adopts a pseudo label selection and exploitation scheme which assesses pseudo label quality on unlabeled samples and only exploits high-quality pseudo labels in a self-training fashion to incrementally augment labeled samples for both robustness and accuracy. Experimental results on two public datasets demonstrate the effectiveness of the proposed approach.
    Neural Networks for Latent Budget Analysis of Compositional Data. (arXiv:2109.04875v1 [stat.ML])
    (0 min) Compositional data are non-negative data collected in a rectangular matrix with a constant row sum. Due to the non-negativity, the focus is on conditional proportions that add up to 1 for each row. A row of conditional proportions is called an observed budget. Latent budget analysis (LBA) assumes a mixture of latent budgets that explains the observed budgets. LBA is usually fitted to a contingency table, where the rows are levels of one or more explanatory variables and the columns the levels of a response variable. In prospective studies, there is only knowledge about the explanatory variables of individuals and interest lies in predicting the response variable. Thus, a form of LBA is needed that has the functionality of prediction. Previous studies proposed a constrained neural network (NN) extension of LBA that was hampered by an unsatisfying prediction ability. Here we propose LBA-NN, a feed-forward NN model that yields a similar interpretation to LBA but equips LBA with better predictive ability. A stable and plausible interpretation of LBA-NN is obtained through the use of importance plots and tables, which show the relative importance of all explanatory variables on the response variable. An LBA-NN-K-means approach that applies K-means clustering on the importance table is used to produce K clusters that are comparable to K latent budgets in LBA. Here we provide different experiments where LBA-NN is implemented and compared with LBA. In our analysis, LBA-NN outperforms LBA in prediction in terms of accuracy, specificity, recall and mean square error. We provide open-source software at GitHub.
    Artificial Text Detection via Examining the Topology of Attention Maps. (arXiv:2109.04825v1 [cs.CL])
    (0 min) The impressive capabilities of recent generative models to create texts that are challenging to distinguish from human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA), which is currently understudied in the field of NLP. We empirically show that the features derived from the BERT model outperform count- and neural-based baselines by up to 10% on three common datasets, and tend to be the most robust towards unseen GPT-style generation models as opposed to existing methods. The probing analysis of the features reveals their sensitivity to surface and syntactic properties. The results demonstrate that TDA is a promising line of research for NLP tasks, specifically the ones that incorporate surface and structural information.
    Best-Arm Identification in Correlated Multi-Armed Bandits. (arXiv:2109.04941v1 [stat.ML])
    (0 min) In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed confidence setting, where the goal is to identify, with probability $1-\delta$ for some $\delta>0$, the arm with the highest mean reward in the minimum possible number of samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards corresponding to different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms in the form of upper bounds on the expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm C-LUCB, which generalizes the LUCB algorithm, utilizes this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification. More interestingly, we show that the total samples obtained by C-LUCB are of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$ as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent reward setting. The improvement comes because the $\mathcal{O}(\log(1/\delta))$ term is summed only for the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. The size of the set $\mathcal{C}$, depending on the problem setting, can be as small as $2$, and hence using C-LUCB in the correlated bandits setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the Movielens and Goodreads recommendation datasets.
    EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling. (arXiv:2109.04699v1 [cs.CL])
    (0 min) While large scale pre-training has made great strides in bridging the gap between vision and language, it still faces several challenges. First, pre-training is expensive. Second, there is no efficient way to handle the data noise that degrades model performance. Third, previous methods only leverage limited image-text paired data, while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose an EfficientCLIP method via Ensemble Confident Learning to obtain a less noisy data subset. Extra rich non-paired single-modal text data is used for boosting the generalization of the text branch. We achieve state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources compared to CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.
    DeepOPF: A Feasibility-Optimized Deep Neural Network Approach for AC Optimal Power Flow Problems. (arXiv:2007.01002v4 [eess.SY] UPDATED)
    (0 min) High penetration of renewable energy generation introduces significant uncertainty into power systems. This requires grid operators to solve alternating current optimal power flow (AC-OPF) problems more frequently for economical and reliable operation in both transmission and distribution grids. In this paper, we develop a Deep Neural Network (DNN) approach, called DeepOPF, for solving AC-OPF problems in a fraction of the time used by conventional solvers. A key difficulty in applying machine learning techniques to AC-OPF problems lies in ensuring that the obtained solutions respect the equality and inequality physical and operational constraints. Generalizing the two-stage procedure in [1], [2], DeepOPF first trains a DNN model to predict a set of independent operating variables and then directly computes the remaining dependent ones by solving power flow equations. Such an approach not only preserves the power-flow balance equality constraints but also reduces the number of variables to predict by the DNN, cutting down the number of neurons and training data needed. DeepOPF then employs a penalty approach with a zero-order gradient estimation technique in the training process to preserve the remaining inequality constraints. As another contribution, we derive a condition for tuning the size of the DNN according to the desired approximation accuracy, which measures the DNN generalization capability. It provides theoretical justification for using DNNs to solve the AC-OPF problem. Simulation results on IEEE 30/118/300-bus and synthetic 2000-bus test cases show that DeepOPF speeds up the computing time by up to two orders of magnitude as compared to a state-of-the-art solver, at the expense of $<$0.1% cost difference.
    Highdicom: A Python library for standardized encoding of image annotations and machine learning model outputs in pathology and radiology. (arXiv:2106.07806v2 [eess.IV] UPDATED)
    (0 min) Machine learning is revolutionizing image-based diagnostics in pathology and radiology. ML models have shown promising results in research settings, but their lack of interoperability has been a major barrier to clinical integration and evaluation. The DICOM standard specifies Information Object Definitions and Services for the representation and communication of digital images and related information, including image-derived annotations and analysis results. However, the complexity of the standard represents an obstacle to its adoption in the ML community and creates a need for software libraries and tools that simplify working with data sets in DICOM format. Here we present the highdicom library, which provides a high-level application programming interface for the Python programming language that abstracts low-level details of the standard and enables encoding and decoding of image-derived information in DICOM format in a few lines of Python code. The highdicom library ties into the extensive Python ecosystem for image processing and machine learning. Simultaneously, by simplifying creation and parsing of DICOM-compliant files, highdicom achieves interoperability with the medical imaging systems that hold the data used to train and run ML models, and ultimately communicate and store model outputs for clinical use. We demonstrate, through experiments with slide microscopy and computed tomography imaging, that by bridging these two ecosystems, highdicom enables developers to train and evaluate state-of-the-art ML models in pathology and radiology while remaining compliant with the DICOM standard and interoperable with clinical systems at all stages. To promote standardization of ML research and streamline the ML model development and deployment process, we have made the library freely available as open source.
    Counterfactual Adversarial Learning with Representation Interpolation. (arXiv:2109.04746v1 [cs.LG])
    (0 min) Deep learning models exhibit a preference for statistical fitting over logical reasoning. Spurious correlations might be memorized when there exists statistical bias in training data, which severely limits the model performance especially in small data scenarios. In this work, we introduce Counterfactual Adversarial Training framework (CAT) to tackle the problem from a causality perspective. Particularly, for a specific sample, CAT first generates a counterfactual representation through latent space interpolation in an adversarial manner, and then performs Counterfactual Risk Minimization (CRM) on each original-counterfactual pair to adjust sample-wise loss weight dynamically, which encourages the model to explore the true causal effect. Extensive experiments demonstrate that CAT achieves substantial performance improvement over SOTA across different downstream tasks, including sentence classification, natural language inference and question answering.
    Multimodal Federated Learning. (arXiv:2109.04833v1 [cs.LG])
    (0 min) Federated learning is proposed as an alternative to centralized machine learning since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with IoT devices, local data on clients are generated from different modalities such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its accuracy. In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to testing data from other modalities to achieve decent accuracy (e.g., approximately 70% as the best performance), especially when combining contributions from both unimodal clients and multimodal clients.
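    The aggregation step can be sketched as a size-weighted average of client parameters (a plain FedAvg-style update; the mapping of modality-specific sub-modules to clients below is an assumption for illustration):
```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sketch of FedAvg-style aggregation: average each parameter array
    across clients, weighted by local dataset size. In a multimodal
    setting, clients would contribute the autoencoder sub-modules
    matching their local modalities."""
    total = float(sum(client_sizes))
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in client_weights[0]
    }

# Two toy clients sharing one encoder layer:
w1 = {"encoder.w": np.ones((2, 2))}
w2 = {"encoder.w": 3 * np.ones((2, 2))}
print(fedavg([w1, w2], [100, 300])["encoder.w"])   # every entry equals 2.5
```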
    Does Pretraining for Summarization Require Knowledge Transfer?. (arXiv:2109.04953v1 [cs.CL])
    (0 min) Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that by pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer.
    Learning to Teach with Student Feedback. (arXiv:2109.04641v1 [cs.LG])
    (0 min) Knowledge distillation (KD) has gained much attention due to its effectiveness in compressing large-scale pre-trained models. In typical KD methods, the small student model is trained to match the soft targets generated by the big teacher model. However, the interaction between student and teacher is one-way. The teacher is usually fixed once trained, resulting in static soft targets to be distilled. This one-way interaction leads to the teacher's inability to perceive the characteristics of the student and its training progress. To address this issue, we propose Interactive Knowledge Distillation (IKD), which also allows the teacher to learn to teach from the feedback of the student. In particular, IKD trains the teacher model to generate specific soft target at each training step for a certain student. Joint optimization for both teacher and student is achieved by two iterative steps: a course step to optimize student with the soft target of teacher, and an exam step to optimize teacher with the feedback of student. IKD is a general framework that is orthogonal to most existing knowledge distillation methods. Experimental results show that IKD outperforms traditional KD methods on various NLP tasks.
    What Makes for Hierarchical Vision Transformer?. (arXiv:2107.02174v2 [cs.CV] UPDATED)
    (2 min) Recent studies indicate that hierarchical Vision Transformer with a macro architecture of interleaved non-overlapped window-based self-attention \& shifted-window operation is able to achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutional neural networks (CNNs) using densely slid kernels. Most follow-up works attempt to replace the shifted-window operation with other kinds of cross-window communication paradigms, while treating self-attention as the de-facto standard for window-based information aggregation. In this manuscript, we question whether self-attention is the only choice for hierarchical Vision Transformer to attain strong performance, and the effects of different kinds of cross-window communication. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture termed as LinMapper can achieve very strong performance in ImageNet-1k image recognition. Moreover, we find that LinMapper is able to better leverage the pre-trained representations from image recognition and demonstrates excellent transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches, which all give similar competitive results. Our study reveals that the \textbf{macro architecture} of Swin model families, other than specific aggregation layers or specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to the ubiquitous CNN's dense sliding window paradigm. Code and models will be publicly available to facilitate future research.
    Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder. (arXiv:2102.09206v2 [cs.LG] UPDATED)
    (2 min) Dense retrieval requires high-quality text sequence embeddings to support effective search in the representation space. Autoencoder-based language models are appealing in dense retrieval as they train the encoder to output high-quality embedding that can reconstruct the input texts. However, in this paper, we provide theoretical analyses and show empirically that an autoencoder language model with a low reconstruction loss may not provide good sequence representations because the decoder may take shortcuts by exploiting language patterns. To address this, we propose a new self-learning method that pre-trains the autoencoder using a \textit{weak} decoder, with restricted capacity and attention flexibility to push the encoder to provide better text representations. Our experiments on web search, news recommendation, and open domain question answering show that our pre-trained model significantly boosts the effectiveness and few-shot ability of dense retrieval models. Our code is available at https://github.com/microsoft/SEED-Encoder/.
    Optimal Clustering from Noisy Binary Feedback. (arXiv:1910.06002v3 [stat.ML] UPDATED)
    (2 min) We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the {\it hardness} of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select hard-to-cluster items and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.
    KNODE-MPC: A Knowledge-based Data-driven Predictive Control Framework for Aerial Robots. (arXiv:2109.04821v1 [cs.RO])
    (2 min) In this work, we consider the problem of deriving and incorporating accurate dynamic models for model predictive control (MPC) with an application to quadrotor control. MPC relies on precise dynamic models to achieve the desired closed-loop performance. However, the presence of uncertainties in complex systems and the environments they operate in poses a challenge in obtaining sufficiently accurate representations of the system dynamics. In this work, we make use of a deep learning tool, knowledge-based neural ordinary differential equations (KNODE), to augment a model obtained from first principles. The resulting hybrid model encompasses both a nominal first-principle model and a neural network learnt from simulated or real-world experimental data. Using a quadrotor, we benchmark our hybrid model against a state-of-the-art Gaussian Process (GP) model and show that the hybrid model provides more accurate predictions of the quadrotor dynamics and is able to generalize beyond the training data. To improve closed-loop performance, the hybrid model is integrated into a novel MPC framework, known as KNODE-MPC. Results show that the integrated framework achieves 73% improvement in simulations and more than 14% in physical experiments, in terms of trajectory tracking performance.
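    The hybrid-model idea (nominal first-principles dynamics plus a learned residual term) can be sketched as below; the nominal model, dimensions, and integration step are placeholders for illustration, not the paper's quadrotor model:
```python
import torch
import torch.nn as nn

class HybridDynamics(nn.Module):
    """Sketch of a knowledge-based hybrid model: nominal first-principles
    dynamics plus a small neural network learning the residual."""
    def __init__(self, state_dim, hidden=32):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, state_dim)
        )

    def nominal(self, x):
        return -0.1 * x          # placeholder first-principles term

    def forward(self, t, x):
        return self.nominal(x) + self.residual(x)

# One explicit Euler step of the hybrid ODE:
f = HybridDynamics(state_dim=6)
x = torch.zeros(6)
x_next = x + 0.01 * f(torch.tensor(0.0), x)
```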
    Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization. (arXiv:2109.04697v1 [cs.LG])
    (2 min) Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer. However, unfolding a proximal splitting algorithm with a positive semi-definite (PSD) cone projection operator per iteration is expensive, due to the required full matrix eigen-decomposition. In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph classifier, where the PSD cone constraint is replaced by a set of "tightest possible" linear constraints per iteration. As a result, each iteration only requires computing a linear program (LP) and one extreme eigenvector. Inside the unrolled network, we optimize parameters via stochastic gradient descent (SGD) that determine graph edge weights in two ways: i) a metric matrix that computes feature distances, and ii) a sparse weight matrix computed via local linear embedding (LLE). Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
    Representation Memorization for Fast Learning New Knowledge without Forgetting. (arXiv:2108.12596v2 [cs.LG] UPDATED)
    (2 min) The ability to quickly learn new knowledge (e.g. new classes or data distributions) is a big step towards human-level intelligence. In this paper, we consider scenarios that require learning new classes or data distributions quickly and incrementally over time, as it often occurs in real-world dynamic environments. We propose "Memory-based Hebbian Parameter Adaptation" (Hebb) to tackle the two major challenges (i.e., catastrophic forgetting and sample efficiency) towards this goal in a unified framework. To mitigate catastrophic forgetting, Hebb augments a regular neural classifier with a continuously updated memory module to store representations of previous data. To improve sample efficiency, we propose a parameter adaptation method based on the well-known Hebbian theory, which directly "wires" the output network's parameters with similar representations retrieved from the memory. We empirically verify the superior performance of Hebb through extensive experiments on a wide range of learning tasks (image classification, language model) and learning scenarios (continual, incremental, online). We demonstrate that Hebb effectively mitigates catastrophic forgetting, and it indeed learns new knowledge better and faster than the current state-of-the-art.
    Constructing Contrastive samples via Summarization for Text Classification with limited annotations. (arXiv:2104.05094v2 [cs.CL] UPDATED)
    (2 min) Contrastive Learning has emerged as a powerful representation learning method and facilitates various downstream tasks especially when supervised data is limited. How to construct efficient contrastive samples through data augmentation is key to its success. Unlike vision tasks, the data augmentation method for contrastive learning has not been investigated sufficiently in language tasks. In this paper, we propose a novel approach to construct contrastive samples for language tasks using text summarization. We use these samples for supervised contrastive learning to gain better text representations which greatly benefit text classification tasks with limited annotations. To further improve the method, we mix up samples from different classes and add an extra regularization, named Mixsum, in addition to the cross-entropy-loss. Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News, and IMDb) demonstrate the effectiveness of the proposed contrastive learning framework with summarization-based data augmentation and Mixsum regularization.
    Foreseeing the Benefits of Incidental Supervision. (arXiv:2006.05500v2 [cs.LG] UPDATED)
    (2 min) Real-world applications often require improved models by leveraging a range of cheap incidental supervision signals. These could include partial labels, noisy labels, knowledge-based constraints, and cross-domain or cross-task annotations -- all having statistical associations with gold annotations but not exactly the same. However, we currently lack a principled way to measure the benefits of these signals to a given target task, and the common practice of evaluating these benefits is through exhaustive experiments with various models and hyperparameters. This paper studies whether we can, in a single framework, quantify the benefits of various types of incidental signals for a given target task without going through combinatorial experiments. We propose a unified PAC-Bayesian motivated informativeness measure, PABI, that characterizes the uncertainty reduction provided by incidental supervision signals. We demonstrate PABI's effectiveness by quantifying the value added by various types of incidental signals to sequence tagging tasks. Experiments on named entity recognition (NER) and question answering (QA) show that PABI's predictions correlate well with learning performance, providing a promising way to determine, ahead of learning, which supervision signals would be beneficial.
    Assessing the Quality of the Datasets by Identifying Mislabeled Samples. (arXiv:2109.05000v1 [cs.LG])
    (2 min) Due to the over-emphasis on the quantity of data, data quality has often been overlooked. However, not all training data points contribute equally to learning. In particular, mislabeled data points might actively damage the performance of the model and its ability to generalize out of distribution, as the model might end up learning spurious artifacts present in the dataset. This problem gets compounded by the prevalence of heavily parameterized and complex deep neural networks, which can, with their high capacity, end up memorizing the noise present in the dataset. This paper proposes a novel statistic, the noise score, as a measure of the quality of each data point to identify such mislabeled samples based on the variations in the latent space representation. In our work, we use the representations derived by the inference network of a data-quality-supervised variational autoencoder (AQUAVS). Our method leverages the fact that samples belonging to the same class will have similar latent representations. Therefore, by identifying the outliers in the latent space, we can find the mislabeled samples. We validate our proposed statistic through experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets in different noise settings for the task of identifying mislabeled samples. We further show significant improvements in accuracy for the classification task for each dataset.
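    A simplified stand-in for such a latent-space statistic (not the exact noise score or the AQUAVS model) is the normalized distance of each sample's latent code from its class centroid:
```python
import numpy as np

def noise_scores(latents, labels):
    """Toy latent-space outlier statistic: distance of each sample's latent
    code from its class centroid, normalized by the class's average
    distance. Large values flag likely mislabeled samples."""
    scores = np.zeros(len(labels), dtype=float)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = latents[idx].mean(axis=0)
        d = np.linalg.norm(latents[idx] - centroid, axis=1)
        scores[idx] = d / (d.mean() + 1e-8)
    return scores

z = np.random.randn(1000, 16)                # toy latent codes
y = np.random.randint(0, 10, size=1000)      # toy (possibly noisy) labels
suspects = np.argsort(-noise_scores(z, y))[:20]
```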
    SimCSE: Simple Contrastive Learning of Sentence Embeddings. (arXiv:2104.08821v3 [cs.CL] UPDATED)
    (2 min) This paper presents SimCSE, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework, by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to previous best results. We also show -- both theoretically and empirically -- that contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.
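    The unsupervised objective can be sketched in a few lines: encode the same batch twice so that dropout provides two views of each sentence, then apply an in-batch cross-entropy over cosine similarities. The `encoder` below is an assumed function returning one embedding per sentence (e.g., a pooled BERT output with dropout active):
```python
import torch
import torch.nn.functional as F

def unsup_simcse_loss(encoder, input_ids, attention_mask, temp=0.05):
    """Minimal sketch of the unsupervised contrastive objective: two forward
    passes with different dropout masks act as the two views; the matching
    view is the positive, the rest of the batch are negatives."""
    z1 = encoder(input_ids, attention_mask)   # (B, d), dropout active
    z2 = encoder(input_ids, attention_mask)   # second pass, new dropout mask
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```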
    Unsupervised classification of simulated magnetospheric regions. (arXiv:2109.04916v1 [physics.space-ph])
    (2 min) In magnetospheric missions, burst mode data sampling should be triggered in the presence of processes of scientific or operational interest. We present an unsupervised classification method for magnetospheric regions, that could constitute the first-step of a multi-step method for the automatic identification of magnetospheric processes of interest. Our method is based on Self Organizing Maps (SOMs), and we test it preliminarily on data points from global magnetospheric simulations obtained with the OpenGGCM-CTIM-RCM code. The dimensionality of the data is reduced with Principal Component Analysis before classification. The classification relies exclusively on local plasma properties at the selected data points, without information on their neighborhood or on their temporal evolution. We classify the SOM nodes into an automatically selected number of classes, and we obtain clusters that map to well defined magnetospheric regions. We validate our classification results by plotting the classified data in the simulated space and by comparing with K-means classification. For the sake of result interpretability, we examine the SOM feature maps (magnetospheric variables are called features in the context of classification), and we use them to unlock information on the clusters. We repeat the classification experiments using different sets of features, we quantitatively compare different classification results, and we obtain insights on which magnetospheric variables make more effective features for unsupervised classification.
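    A minimal pipeline in this spirit, PCA for dimensionality reduction followed by a SOM, could be sketched as follows (the `minisom` package, the map size, and the toy data below are assumptions, not the authors' setup):
```python
import numpy as np
from sklearn.decomposition import PCA
from minisom import MiniSom   # assumed dependency (pip install minisom)

# Toy stand-in for simulated data points: rows = samples, columns = plasma variables
X = np.random.rand(5000, 8)

X_red = PCA(n_components=3).fit_transform(X)            # reduce dimensionality first
som = MiniSom(10, 10, X_red.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_red, 5000)

# Assign each sample to its best-matching SOM node; the nodes themselves
# would then be grouped into a small number of classes (e.g., via K-means).
node_of = np.array([som.winner(x) for x in X_red])
```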
    AutoPilot: Automating SoC Design Space Exploration for SWaP Constrained Autonomous UAVs. (arXiv:2102.02988v3 [cs.RO] UPDATED)
    (2 min) Building domain-specific accelerators for autonomous unmanned aerial vehicles (UAVs) is challenging due to a lack of systematic methodology for designing onboard compute. Balancing a computing system for a UAV requires considering both the cyber (e.g., sensor rate, compute performance) and physical (e.g., payload weight) characteristics that affect overall performance. Iterating over the many component choices results in a combinatorial explosion of the number of possible combinations: from 10s of thousands to billions, depending on implementation details. Manually selecting combinations of these components is tedious and expensive. To navigate the {cyber-physical design space} efficiently, we introduce \emph{AutoPilot}, a framework that automates full-system UAV co-design. AutoPilot uses Bayesian optimization to navigate a large design space and automatically select a combination of autonomy algorithm and hardware accelerator while considering the cross-product effect of other cyber and physical UAV components. We show that the AutoPilot methodology consistently outperforms general-purpose hardware selections like Xavier NX and Jetson TX2, as well as dedicated hardware accelerators built for autonomous UAVs, across a range of representative scenarios (three different UAV types and three deployment environments). Designs generated by AutoPilot increase the number of missions on average by up to 2.25x, 1.62x, and 1.43x for nano, micro, and mini-UAVs respectively over baselines. Our work demonstrates the need for holistic full-UAV co-design to achieve maximum overall UAV performance and the need for automated flows to simplify the design process for autonomous cyber-physical systems.
    Hybrid modeling of the human cardiovascular system using NeuralFMUs. (arXiv:2109.04880v1 [cs.LG])
    (2 min) Hybrid modeling, the combination of first-principles and machine learning models, is an emerging research field that is gathering more and more attention. Even though hybrid models produce formidable results for academic examples, there are still different technical challenges that hinder the use of hybrid modeling in real-world applications. By presenting NeuralFMUs, the fusion of an FMU, a numerical ODE solver and an ANN, we are paving the way for the use of a variety of first-principles models from different modeling tools as parts of hybrid models. This contribution handles the hybrid modeling of a complex, real-world example: Starting with a simplified 1D-fluid model of the human cardiovascular system (arterial side), the aim is to learn neglected physical effects like arterial elasticity from data. We will show that the hybrid modeling process is more convenient, needs less system knowledge and is therefore less error-prone compared to modeling based solely on first principles. Further, the resulting hybrid model has improved computational performance, compared to a pure first-principles white-box model, while still fulfilling the requirements regarding accuracy of the considered hemodynamic quantities. The use of the presented techniques is explained in a general manner and the considered use-case can serve as an example for other modeling and simulation applications in and beyond the medical domain.
    Trust your neighbors: A comprehensive survey of neighborhood-based methods for recommender systems. (arXiv:2109.04584v1 [cs.IR])
    (2 min) Collaborative recommendation approaches based on nearest-neighbors are still highly popular today due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. This chapter offers a comprehensive survey of neighborhood-based methods for the item recommendation problem. It presents the main characteristics and benefits of such methods, describes key design choices for implementing a neighborhood-based recommender system, and gives practical information on how to make these choices. A broad range of methods is covered in the chapter, including traditional algorithms like k-nearest neighbors as well as advanced approaches based on matrix factorization, sparse coding and random walks.
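    As a small illustration of the item-based k-nearest-neighbour idea the chapter covers, the sketch below scores items for a user as a similarity-weighted average of that user's ratings on the k most similar items (cosine similarity on a toy user-item matrix; details such as mean-centering and shrinkage are omitted):
```python
import numpy as np

def item_knn_scores(ratings, user, k=10):
    """Toy item-based kNN recommender. 'ratings' is a (users x items)
    matrix with 0 meaning 'unrated'."""
    norms = np.linalg.norm(ratings, axis=0, keepdims=True) + 1e-8
    sim = (ratings.T @ ratings) / (norms.T * norms)      # item-item cosine
    np.fill_diagonal(sim, 0.0)
    scores = np.zeros(ratings.shape[1])
    rated = np.nonzero(ratings[user])[0]
    for i in range(ratings.shape[1]):
        neigh = rated[np.argsort(-sim[i, rated])[:k]]    # k most similar rated items
        w = sim[i, neigh]
        if w.sum() > 0:
            scores[i] = (w @ ratings[user, neigh]) / w.sum()
    return scores

R = np.random.randint(0, 6, size=(50, 30)).astype(float)    # toy ratings
print(np.argsort(-item_knn_scores(R, user=0))[:5])           # top-5 item ids
```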
    Heading Estimation Using Ultra-Wideband Received Signal Strength and Gaussian Processes. (arXiv:2109.04868v1 [cs.RO])
    (2 min) It is essential that a robot has the ability to determine its position and orientation to execute tasks autonomously. Heading estimation is especially challenging in indoor environments where magnetic distortions make magnetometer-based heading estimation difficult. Ultra-wideband (UWB) transceivers are common in indoor localization problems. This letter experimentally demonstrates how to use UWB range and received signal strength (RSS) measurements to estimate robot heading. The RSS of a UWB antenna varies with its orientation. As such, a Gaussian process (GP) is used to learn a data-driven relationship from UWB range and RSS inputs to orientation outputs. Combined with a gyroscope in an invariant extended Kalman filter, this realizes a heading estimation method that uses only UWB and gyroscope measurements.
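    A minimal sketch of the regression component, using scikit-learn's Gaussian process regressor on toy (range, RSS) to heading data, could look like this; the kernel choice and data are assumptions, and the paper additionally fuses the GP output with a gyroscope in an invariant extended Kalman filter:
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy calibration data: inputs = [range (m), RSS (dBm)], target = heading (rad).
X = np.column_stack([np.random.uniform(1, 10, 200),
                     np.random.uniform(-95, -60, 200)])
y = np.random.uniform(-np.pi, np.pi, 200)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=[1.0, 5.0]) + WhiteKernel(),
                              normalize_y=True).fit(X, y)

# Predicted heading and its uncertainty for one new (range, RSS) measurement:
heading_mean, heading_std = gp.predict(np.array([[4.2, -78.0]]), return_std=True)
```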
    Fairness without the sensitive attribute via Causal Variational Autoencoder. (arXiv:2109.04999v1 [cs.LG])
    (0 min) In recent years, most fairness strategies in machine learning models focus on mitigating unwanted biases by assuming that the sensitive information is observed. However, this is not always possible in practice. For privacy reasons and due to various regulations such as the RGPD in the EU, many personal sensitive attributes are frequently not collected. We notice a lack of approaches for mitigating bias in such difficult settings, in particular for achieving classical fairness objectives such as Demographic Parity and Equalized Odds. By leveraging recent developments for approximate inference, we propose an approach to fill this gap. Based on a causal graph, we rely on a new variational auto-encoding based framework named SRCVAE to infer a sensitive information proxy, which serves for bias mitigation in an adversarial fairness approach. We empirically demonstrate significant improvements over existing works in the field. We observe that the generated proxy's latent space recovers sensitive information and that our approach achieves a higher accuracy while obtaining the same level of fairness on two real datasets, as measured using common fairness definitions.
    Automated Machine Learning, Bounded Rationality, and Rational Metareasoning. (arXiv:2109.04744v1 [cs.AI])
    (0 min) The notion of bounded rationality originated from the insight that perfectly rational behavior cannot be realized by agents with limited cognitive or computational resources. Research on bounded rationality, mainly initiated by Herbert Simon, has a longstanding tradition in economics and the social sciences, but also plays a major role in modern AI and intelligent agent design. Taking actions under bounded resources requires an agent to reflect on how to use these resources in an optimal way - hence, to reason and make decisions on a meta-level. In this paper, we will look at automated machine learning (AutoML) and related problems from the perspective of bounded rationality, essentially viewing an AutoML tool as an agent that has to train a model on a given set of data, and the search for a good way of doing so (a suitable "ML pipeline") as deliberation on a meta-level.
    CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems. (arXiv:2109.04645v1 [cs.CL])
    (0 min) As the labeling cost for different modules in task-oriented dialog (ToD) systems is high, a major challenge in practice is to learn different tasks with the least amount of labeled data. Recently, prompting methods over pre-trained language models (PLMs) have shown promising results for few-shot learning in ToD. To better utilize the power of PLMs, this paper proposes Comprehensive Instruction (CINS) that exploits PLMs with extra task-specific instructions. We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD, i.e., intent classification, dialog state tracking, and natural language generation. A sequence-to-sequence model (T5) is adopted to solve these three tasks in a unified framework. Extensive experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data. Empirical results demonstrate that the proposed CINS approach consistently improves techniques that finetune PLMs with raw input or short prompts.
    A Fast PC Algorithm with Reversed-order Pruning and A Parallelization Strategy. (arXiv:2109.04626v1 [cs.LG])
    (2 min) The PC algorithm is the state-of-the-art algorithm for causal structure discovery on observational data. It can be computationally expensive in the worst case because the conditional independence tests are performed in an exhaustive-search manner. This makes the algorithm computationally intractable when the task contains several hundred or thousand nodes, particularly when the true underlying causal graph is dense. We make the critical observation that the conditioning set rendering two nodes independent is non-unique, and that including certain redundant nodes does not sacrifice result accuracy. Based on this finding, the innovations of our work are twofold. First, we introduce a reversed-order linkage pruning PC algorithm which significantly increases the algorithm's efficiency. Second, we propose a parallel computing strategy for statistical independence tests by leveraging tensor computation, which brings further speedup. We also prove that the proposed algorithm does not induce statistical power loss under mild graph and data dimensionality assumptions. Experimental results show that the single-threaded version of the proposed algorithm can achieve a 6-fold speedup compared to the PC algorithm on a dense 95-node graph, and the parallel version can achieve an 825-fold speedup. We also provide proof that the proposed algorithm is consistent under the same set of conditions as the conventional PC algorithm.
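    For context, the conditional independence test that PC-style algorithms invoke many times is typically the Fisher-z partial-correlation test sketched below; batching many such tests is where a tensorized or parallel strategy like the paper's would enter. The toy data and significance level are illustrative.

        # Sketch of a single Fisher-z conditional independence test (Gaussian case).
        import numpy as np
        from scipy.stats import norm

        def fisher_z_ci_test(data, i, j, cond_set, alpha=0.05):
            """Test X_i independent of X_j given X_S via partial correlation."""
            idx = [i, j] + list(cond_set)
            corr = np.corrcoef(data[:, idx], rowvar=False)
            prec = np.linalg.inv(corr)                          # precision matrix
            r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
            n, k = data.shape[0], len(cond_set)
            z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
            p_value = 2 * (1 - norm.cdf(abs(z)))
            return p_value > alpha                              # True -> accept independence

        rng = np.random.default_rng(1)
        X = rng.standard_normal((500, 4))
        X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(500)      # make X0 and X1 dependent
        print(fisher_z_ci_test(X, 0, 1, cond_set=[2, 3]))       # expected: False (dependent)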
    Integrating Approaches to Word Representation. (arXiv:2109.04876v1 [cs.CL])
    (0 min) The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing. I present a survey of the distributional, compositional, and relational approaches to addressing this task, and discuss various means of integrating them into systems, with special emphasis on the word level and the out-of-vocabulary phenomenon.
    MEAL: Manifold Embedding-based Active Learning. (arXiv:2106.11858v3 [cs.CV] UPDATED)
    (2 min) Image segmentation is a common and challenging task in autonomous driving. The availability of sufficient pixel-level annotations for the training data is a hurdle. Active learning helps learning from small amounts of data by suggesting the most promising samples for labeling. In this work, we propose a new pool-based method for active learning, which suggests promising patches extracted from full images in each acquisition step. The problem is framed in an exploration-exploitation framework by combining an embedding based on Uniform Manifold Approximation, to model representativeness, with entropy as an uncertainty measure, to model informativeness. We applied our proposed method to the autonomous driving datasets CamVid and Cityscapes and performed a quantitative comparison with state-of-the-art baselines. We find that our active learning method achieves better performance compared to previous methods.
    Estimates on the generalization error of Physics Informed Neural Networks (PINNs) for approximating PDEs. (arXiv:2006.16144v2 [math.NA] UPDATED)
    (2 min) Physics informed neural networks (PINNs) have recently been widely used for robust and accurate approximation of PDEs. We provide rigorous upper bounds on the generalization error of PINNs approximating solutions of the forward problem for PDEs. An abstract formalism is introduced and stability properties of the underlying PDE are leveraged to derive an estimate for the generalization error in terms of the training error and number of training samples. This abstract framework is illustrated with several examples of nonlinear PDEs. Numerical experiments, validating the proposed theory, are also presented.
    In Defense of Uniform Convergence: Generalization via derandomization with an application to interpolating predictors. (arXiv:1912.04265v3 [cs.LG] UPDATED)
    (2 min) We propose to study the generalization error of a learned predictor $\hat h$ in terms of that of a surrogate (potentially randomized) predictor that is coupled to $\hat h$ and designed to trade empirical risk for control of generalization error. In the case where $\hat h$ interpolates the data, it is interesting to consider theoretical surrogate classifiers that are partially derandomized or rerandomized, e.g., fit to the training data but with modified label noise. We also show that replacing $\hat h$ by its conditional distribution with respect to an arbitrary $\sigma$-field is a convenient way to derandomize. We study two examples, inspired by the work of Nagarajan and Kolter (2019) and Bartlett et al. (2019), where the learned classifier $\hat h$ interpolates the training data with high probability, has small risk, and, yet, does not belong to a nonrandom class with a tight uniform bound on two-sided generalization error. At the same time, we bound the risk of $\hat h$ in terms of surrogates constructed by conditioning and denoising, respectively, and shown to belong to nonrandom classes with uniformly small generalization error.
    Box Embeddings: An open-source library for representation learning using geometric structures. (arXiv:2109.04997v1 [cs.CL])
    (2 min) A major factor contributing to the success of modern representation learning is the ease of performing various vector operations. Recently, objects with geometric structures (e.g., distributions, complex or hyperbolic vectors, or regions such as cones, disks, or boxes) have been explored for their alternative inductive biases and additional representational capacities. In this work, we introduce Box Embeddings, a Python library that enables researchers to easily apply and extend probabilistic box embeddings.
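    A minimal sketch of the underlying box-embedding operations (not the library's API): boxes are axis-aligned hyper-rectangles, and probabilistic scores come from intersection volumes. The dimensions and values below are illustrative.

        # Sketch: box volumes and intersections as used in probabilistic box embeddings.
        import torch

        def box_volume(lo, hi, eps=1e-6):
            return torch.clamp(hi - lo, min=eps).prod(dim=-1)

        def intersection(lo1, hi1, lo2, hi2):
            return torch.maximum(lo1, lo2), torch.minimum(hi1, hi2)

        # two 4-dimensional boxes
        lo_a, hi_a = torch.zeros(4), torch.ones(4)
        lo_b, hi_b = torch.full((4,), 0.5), torch.full((4,), 1.5)

        lo_i, hi_i = intersection(lo_a, hi_a, lo_b, hi_b)
        # P(B | A) ~ vol(A intersect B) / vol(A), a typical probabilistic box score
        print((box_volume(lo_i, hi_i) / box_volume(lo_a, hi_a)).item())  # ~0.0625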
    Multi-Armed Bandits with Correlated Arms. (arXiv:1911.03959v4 [stat.ML] UPDATED)
    (2 min) We consider a multi-armed bandit framework where the rewards obtained by pulling different arms are correlated. We develop a unified approach to leverage these reward correlations and present fundamental generalizations of classic bandit algorithms to the correlated setting. We present a unified proof technique to analyze the proposed algorithms. Rigorous analysis of C-UCB (the correlated bandit version of the Upper Confidence Bound algorithm) reveals that the algorithm ends up pulling certain sub-optimal arms, termed non-competitive, only O(1) times, as opposed to the O(log T) pulls required by classic bandit algorithms such as UCB and Thompson sampling (TS). We present a regret lower bound and show that when arms are correlated through a latent random source, our algorithms obtain order-optimal regret. We validate the proposed algorithms via experiments on the MovieLens and Goodreads datasets, and show significant improvement over classical bandit algorithms.
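    As a point of reference, the classic UCB1 baseline that the paper generalizes is sketched below; C-UCB would additionally use pseudo-rewards derived from the known correlation structure to restrict the argmax to a competitive subset of arms. The Bernoulli reward model and horizon are illustrative assumptions.

        # Sketch of classic UCB1 (the baseline C-UCB generalizes), on Bernoulli arms.
        import numpy as np

        rng = np.random.default_rng(0)
        true_means = np.array([0.3, 0.5, 0.7])
        T, K = 5000, len(true_means)
        counts, sums = np.zeros(K), np.zeros(K)

        for t in range(T):
            if t < K:
                arm = t                                    # pull each arm once
            else:
                ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
                arm = int(np.argmax(ucb))                  # (C-UCB: argmax over competitive arms only)
            reward = float(rng.random() < true_means[arm])
            counts[arm] += 1
            sums[arm] += reward

        print(counts)  # most pulls should go to the best arm (index 2)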
    Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little. (arXiv:2104.06644v2 [cs.CL] UPDATED)
    (2 min) A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks -- including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
    Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training. (arXiv:2109.05003v1 [cs.CL])
    (2 min) We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that the distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme comprised of a new loss function and a noisy label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
    FNet: Mixing Tokens with Fourier Transforms. (arXiv:2105.03824v3 [cs.CL] UPDATED)
    (2 min) We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
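    A minimal sketch of the Fourier token-mixing idea described above: the self-attention sublayer is replaced by a 2-D FFT over the sequence and hidden dimensions, keeping the real part, followed by the usual residual, LayerNorm, and feed-forward sublayers. The dimensions are illustrative, and this is not the authors' full implementation.

        # Sketch of an FNet-style encoder block with unparameterized Fourier mixing.
        import torch
        import torch.nn as nn

        class FourierMixerBlock(nn.Module):
            def __init__(self, d_model=256, d_ff=1024):
                super().__init__()
                self.norm1 = nn.LayerNorm(d_model)
                self.norm2 = nn.LayerNorm(d_model)
                self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                        nn.Linear(d_ff, d_model))

            def forward(self, x):                             # x: (batch, seq_len, d_model)
                mixed = torch.fft.fft2(x, dim=(-2, -1)).real  # token mixing, no learned parameters
                x = self.norm1(x + mixed)
                return self.norm2(x + self.ff(x))

        block = FourierMixerBlock()
        print(block(torch.randn(2, 128, 256)).shape)          # torch.Size([2, 128, 256])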
    Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences. (arXiv:2109.05019v1 [q-bio.GN])
    (3 min) With the rapid global spread of COVID-19, more and more data related to this virus is becoming available, including genomic sequence data. The total number of genomic sequences that are publicly available on platforms such as GISAID is currently several million and is increasing every day. The availability of such "Big Data" creates a new opportunity for researchers to study this virus in detail. This is particularly important given the dynamics of the COVID-19 variants that emerge and circulate. This rich data source will give us insights into the best ways to perform genomic surveillance for this and future pandemic threats, with the ultimate goal of mitigating or eliminating such threats. Analyzing and processing the several million genomic sequences is a challenging task. Although traditional methods for sequence classification are proven to be effective, they are not designed to deal with these specific types of genomic sequences. Moreover, most of the existing methods also face the issue of scalability. Previous studies which were tailored to coronavirus genomic data proposed to use spike sequences (corresponding to a subsequence of the genome), rather than using the complete genomic sequence, to perform different machine learning (ML) tasks such as classification and clustering. However, those methods suffer from scalability issues. In this paper, we propose an approach called Spike2Vec, an efficient and scalable feature vector representation for each spike sequence that can be used for downstream ML tasks. Through experiments, we show that Spike2Vec is not only scalable on several million spike sequences, but also outperforms the baseline models in terms of prediction accuracy, F1-score, etc.
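    A simplified sketch of one way to turn a spike (amino-acid) sequence into a fixed-length feature vector via k-mer counts, in the spirit of the approach described above; the value of k, the alphabet handling, and the toy fragment are illustrative assumptions, not the exact Spike2Vec construction.

        # Sketch: fixed-length k-mer count featurization of an amino-acid sequence.
        from itertools import product

        AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
        K = 3
        KMER_INDEX = {"".join(kmer): i for i, kmer in enumerate(product(AMINO_ACIDS, repeat=K))}

        def kmer_feature_vector(seq, k=K):
            vec = [0] * len(KMER_INDEX)
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                if kmer in KMER_INDEX:          # skip k-mers with ambiguous residues
                    vec[KMER_INDEX[kmer]] += 1
            return vec

        toy_fragment = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS"  # illustrative 50-residue fragment
        features = kmer_feature_vector(toy_fragment)
        print(len(features), sum(features))     # 8000 possible 3-mers, 48 counted occurrences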
    Generalizing to unseen domains via distribution matching. (arXiv:1911.00804v5 [cs.LG] UPDATED)
    (2 min) Supervised learning results typically rely on assumptions of i.i.d. data. Unfortunately, those assumptions are commonly violated in practice. In this work, we tackle this problem by focusing on domain generalization: a formalization where the data generating process at test time may yield samples from never-before-seen domains (distributions). Our work relies on the following lemma: by minimizing a notion of discrepancy between all pairs from a set of given domains, we also minimize the discrepancy between any pairs of mixtures of domains. Using this result, we derive a generalization bound for our setting. We then show that low risk over unseen domains can be achieved by representing the data in a space where (i) the training distributions are indistinguishable, and (ii) relevant information for the task at hand is preserved. Minimizing the terms in our bound yields an adversarial formulation which estimates and minimizes pairwise discrepancies. We validate our proposed strategy on standard domain generalization benchmarks, outperforming a number of recently introduced methods. Notably, we tackle a real-world application where the underlying data corresponds to multi-channel electroencephalography time series from different subjects, each considered as a distinct domain.
    A Neural Tangent Kernel Perspective of Infinite Tree Ensembles. (arXiv:2109.04983v1 [cs.LG])
    (2 min) In practical situations, the ensemble tree model is one of the most popular models along with neural networks. A soft tree is one of the variants of a decision tree. Instead of using a greedy method for searching splitting rules, the soft tree is trained using a gradient method in which the whole splitting operation is formulated in a differentiable form. Although ensembles of such soft trees have been increasingly used in recent years, little theoretical work has been done for understanding their behavior. In this paper, by considering an ensemble of infinite soft trees, we introduce and study the Tree Neural Tangent Kernel (TNTK), which provides new insights into the behavior of the infinite ensemble of soft trees. Using the TNTK, we succeed in theoretically finding several non-trivial properties, such as the effect of the oblivious tree structure and the degeneracy of the TNTK induced by the deepening of the trees. Moreover, we empirically examine the performance of an ensemble of infinite soft trees using the TNTK.
    Differentially Private Federated Learning with Laplacian Smoothing. (arXiv:2005.00218v2 [cs.LG] UPDATED)
    (2 min) Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users. However, an adversary may still be able to infer the private training data by attacking the released model. Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models. In this paper, we investigate a utility enhancement scheme based on Laplacian smoothing for differentially private federated learning (DP-Fed-LS), where the parameter aggregation with injected Gaussian noise is improved in statistical precision without losing privacy budget. Our key observation is that the aggregated gradients in federated learning often enjoy a type of smoothness, i.e. sparsity in the graph Fourier basis with polynomial decays of Fourier coefficients as frequency grows, which can be exploited by the Laplacian smoothing efficiently. Under a prescribed differential privacy budget, convergence error bounds with tight rates are provided for DP-Fed-LS with uniform subsampling of heterogeneous Non-IID data, revealing possible utility improvement of Laplacian smoothing in effective dimensionality and variance reduction, among others. Experiments over MNIST, SVHN, and Shakespeare datasets show that the proposed method can improve model accuracy with DP-guarantee and membership privacy under both uniform and Poisson subsampling mechanisms.
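    A toy sketch of the two ingredients named above, on a synthetic vector: Gaussian noise injected into the aggregated update (the differential privacy mechanism) and Laplacian smoothing of the noisy update, solved in the Fourier domain with a periodic 1-D discrete Laplacian. The noise scale and smoothing parameter are illustrative assumptions.

        # Sketch: Gaussian noise for DP, then Laplacian smoothing of the noisy update.
        import numpy as np

        def laplacian_smooth(v, sigma=1.0):
            d = v.shape[0]
            eig = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(d) / d)  # eigenvalues of the periodic 1-D Laplacian
            return np.real(np.fft.ifft(np.fft.fft(v) / (1.0 + sigma * eig)))

        rng = np.random.default_rng(0)
        aggregated_update = np.sin(np.linspace(0, 4 * np.pi, 512))    # stand-in for an averaged model update
        noisy_update = aggregated_update + rng.normal(0.0, 0.3, 512)  # Gaussian mechanism (noise scale illustrative)
        smoothed_update = laplacian_smooth(noisy_update, sigma=2.0)

        print(np.linalg.norm(noisy_update - aggregated_update),
              np.linalg.norm(smoothed_update - aggregated_update))    # smoothing brings the update closer to the clean one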
    Contrastive Explanations for Model Interpretability. (arXiv:2103.01378v2 [cs.CL] UPDATED)
    (2 min) Contrastive explanations clarify why an event occurred in contrast to another. They are inherently more intuitive for humans to both produce and comprehend. We propose a methodology to produce contrastive explanations for classification models by modifying the representation to disregard non-contrastive information, and modifying model behavior to only be based on contrastive reasoning. Our method is based on projecting the model representation onto a latent space that captures only the features that are useful (to the model) to differentiate two potential decisions. We demonstrate the value of contrastive explanations by analyzing two different scenarios, using both high-level abstract concept attribution and low-level input token/span attribution, on two widely used text classification tasks. Specifically, we produce explanations for answering: for which label, and against which alternative label, is some aspect of the input useful? And which aspects of the input are useful for and against particular decisions? Overall, our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
    Concept Generalization in Visual Representation Learning. (arXiv:2012.05649v2 [cs.CV] UPDATED)
    (2 min) Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be leveraged to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially in a self-supervised learning framework. Nonetheless, the choice of unseen concepts for such an evaluation is usually made arbitrarily, and independently from the seen concepts used to train representations, thus ignoring any semantic relationships between the two. In this paper, we argue that the semantic relationships between seen and unseen concepts affect generalization performance and propose ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in a principled way. Our benchmark leverages expert knowledge that comes from WordNet in order to define a sequence of unseen IN-21K concept sets that are semantically more and more distant from the ImageNet-1K (IN-1K) subset, a ubiquitous training set. This allows us to benchmark visual representations learned on IN-1K out of the box. We conduct a large-scale study encompassing 31 convolutional and transformer-based models and show how different architectures, levels of supervision, regularization techniques and use of web data impact the concept generalization performance.
    Learning the Hypotheses Space from data Part II: Convergence and Feasibility. (arXiv:2001.11578v2 [stat.ML] UPDATED)
    (2 min) In Part I, we proposed a structure for a general Hypotheses Space $\mathcal{H}$, the Learning Space $\mathbb{L}(\mathcal{H})$, which can be employed to avoid overfitting when estimating in a complex space with a relative shortage of examples. Also, we presented the U-curve property, which can be taken advantage of in order to select a Hypotheses Space without exhaustively searching $\mathbb{L}(\mathcal{H})$. In this paper, we carry our agenda further by showing the consistency of a model selection framework based on Learning Spaces, in which one selects from data the Hypotheses Space on which to learn. The method developed in this paper adds to the state-of-the-art in model selection, by extending Vapnik-Chervonenkis Theory to random Hypotheses Spaces, i.e., Hypotheses Spaces learned from data. In this framework, one estimates a random subspace $\hat{\mathcal{M}} \in \mathbb{L}(\mathcal{H})$ which converges with probability one to a target Hypotheses Space $\mathcal{M}^{\star} \in \mathbb{L}(\mathcal{H})$ with desired properties. As the convergence implies asymptotically unbiased estimators, we have a consistent framework for model selection, showing that it is feasible to learn the Hypotheses Space from data. Furthermore, we show that the generalization errors of learning on $\hat{\mathcal{M}}$ are smaller than those incurred when learning on $\mathcal{H}$, so it is more efficient to learn on a subspace learned from data.
    Variational Conditional Dependence Hidden Markov Models for Skeleton-Based Action Recognition. (arXiv:2002.05809v2 [cs.LG] UPDATED)
    (2 min) Hidden Markov Models (HMMs) comprise a powerful generative approach for modeling sequential data and time-series in general. However, the commonly employed assumption that the current time frame depends on a single or multiple immediately preceding frames is unrealistic; more complicated dynamics potentially exist in real-world scenarios. This paper revisits conventional sequential modeling approaches, aiming to address the problem of capturing time-varying temporal dependency patterns. To this end, we propose a different formulation of HMMs, whereby the dependence on past frames is dynamically inferred from the data. Specifically, we introduce a hierarchical extension by postulating an additional latent variable layer; therein, the (time-varying) temporal dependence patterns are treated as latent variables over which inference is performed. We leverage solid arguments from the Variational Bayes framework and derive a tractable inference algorithm based on the forward-backward algorithm. As we experimentally show, our approach can model highly complex sequential data and can effectively handle data with missing values.
    Robust Dynamic Multi-Modal Data Fusion: A Model Uncertainty Perspective. (arXiv:2105.06018v4 [cs.LG] UPDATED)
    (2 min) This paper is concerned with multi-modal data fusion (MMDF) under unexpected modality failures in nonlinear non-Gaussian dynamic processes. An efficient framework to tackle this problem is proposed. In particular, a notion termed modality "usefulness", which takes a value of 1 or 0, is used to indicate whether the observation of this modality is useful or not. For $n$ modalities involved, $2^n$ combinations of their "usefulness" values exist. Each combination defines one hypothetical model of the true data generative process. Then the problem of concern is formalized as a task of nonlinear non-Gaussian state filtering under model uncertainty, which is addressed by a dynamic model averaging (DMA) based particle filter (PF) algorithm. This DMA algorithm employs $2^n$ models, while all models share the same state-transition function and a unique set of particle values. That makes the computational complexity of this algorithm only slightly larger than a single-model PF algorithm, especially for scenarios in which $n$ is small. Experimental results show that the proposed solution remarkably outperforms state-of-the-art methods. Code and data are available at https://github.com/robinlau1981/fusion.
    SplitAVG: A heterogeneity-aware federated deep learning method for medical imaging. (arXiv:2107.02375v2 [cs.LG] UPDATED)
    (2 min) Federated learning is an emerging research paradigm for collaboratively training deep learning models without sharing patient data. However, the data from different institutions are usually heterogeneous, which may reduce the performance of models trained using federated learning. In this study, we propose a novel heterogeneity-aware federated learning method, SplitAVG, to overcome the performance drops caused by data heterogeneity in federated learning. Unlike previous federated methods that require complex heuristic training or hyperparameter tuning, SplitAVG leverages simple network split and feature map concatenation strategies to encourage the federated model to be trained as an unbiased estimator of the target data distribution. We compare SplitAVG with seven state-of-the-art federated learning methods, using centrally hosted training data as the baseline, on a suite of both synthetic and real-world federated datasets. We find that the performance of models trained using all the comparison federated learning methods degrades significantly with increasing degrees of data heterogeneity. In contrast, SplitAVG achieves results comparable to the baseline method under all heterogeneous settings: it achieves 96.2% of the accuracy and 110.4% of the mean absolute error obtained by the baseline on a diabetic retinopathy binary classification dataset and a bone age prediction dataset, respectively, on highly heterogeneous data partitions. We conclude that SplitAVG can effectively overcome the performance drops from variability in data distributions across institutions. Experimental results also show that SplitAVG can be adapted to different base networks and generalized to various types of medical imaging tasks.
    RL-IoT: Reinforcement Learning to Interact with IoT Devices. (arXiv:2105.00884v3 [cs.LG] UPDATED)
    (2 min) Our life is getting filled by Internet of Things (IoT) devices. These devices often rely on closed or poorly documented protocols, with unknown formats and semantics. Learning how to interact with such devices in an autonomous manner is the key for interoperability and automatic verification of their capabilities. In this paper, we propose RL-IoT, a system that explores how to automatically interact with possibly unknown IoT devices. We leverage reinforcement learning (RL) to recover the semantics of protocol messages and to take control of the device to reach a given goal, while minimizing the number of interactions. We assume to know only a database of possible IoT protocol messages, whose semantics are however unknown. RL-IoT exchanges messages with the target IoT device, learning those commands that are useful to reach the given goal. Our results show that RL-IoT is able to solve both simple and complex tasks. With properly tuned parameters, RL-IoT learns how to perform actions with the target device, a Yeelight smart bulb in our case study, completing non-trivial patterns with as few as 400 interactions. RL-IoT paves the road for automatic interactions with poorly documented IoT protocols, thus enabling interoperable systems.
    A Unifying View on Implicit Bias in Training Linear Neural Networks. (arXiv:2010.02501v3 [cs.LG] UPDATED)
    (2 min) We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.
    Model Selection for Cross-Lingual Transfer. (arXiv:2010.06127v2 [cs.CL] UPDATED)
    (2 min) Transformers that are pre-trained on multilingual corpora, such as mBERT and XLM-RoBERTa, have achieved impressive cross-lingual transfer capabilities. In the zero-shot transfer setting, only English training data is used, and the fine-tuned model is evaluated on another target language. While this works surprisingly well, substantial variance has been observed in target language performance between different fine-tuning runs, and in the zero-shot setup, no target-language development data is available to select among multiple fine-tuned models. Prior work has relied on English dev data to select among models that are fine-tuned with different learning rates, number of steps and other hyperparameters, often resulting in suboptimal choices. In this paper, we show that it is possible to select consistently better models when small amounts of annotated data are available in auxiliary pivot languages. We propose a machine learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities. In extensive experiments we find that this method consistently selects better models than English validation data across twenty-five languages (including eight low-resource languages), and often achieves results that are comparable to model selection using target language development data.
    Hamiltonian neural networks for solving equations of motion. (arXiv:2001.11107v3 [physics.comp-ph] UPDATED)
    (2 min) There has been a wave of interest in applying machine learning to study dynamical systems. In particular, neural networks have been applied to solve the equations of motion, and therefore, track the evolution of a system. In contrast to other applications of neural networks and machine learning, dynamical systems possess invariants such as energy, momentum, and angular momentum, depending on their underlying symmetries. Traditional numerical integration methods sometimes violate these conservation laws, propagating errors in time, ultimately reducing the predictability of the method. We present a data-free Hamiltonian neural network that solves the differential equations that govern dynamical systems. This is an equation-driven unsupervised learning method where the optimization process of the network depends solely on the predicted functions without using any ground truth data. This unsupervised model learns solutions that satisfy identically, up to an arbitrarily small error, Hamilton's equations and, therefore, conserve the Hamiltonian invariants. Once the network is optimized, the proposed architecture is considered a symplectic unit due to the introduction of an efficient parametric form of solutions. In addition, the choice of an appropriate activation function drastically improves the predictability of the network. An error analysis is derived and states that the numerical errors depend on the overall network performance. The symplectic architecture is then employed to solve the equations for the nonlinear oscillator and the chaotic Henon-Heiles dynamical system. In both systems, a symplectic Euler integrator requires two orders more evaluation points than the Hamiltonian network in order to achieve the same order of the numerical error in the predicted phase space trajectories.
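    A minimal sketch of the data-free training signal described above: a network maps time t to (q, p) and is penalized by the residual of Hamilton's equations for a chosen Hamiltonian, here the harmonic oscillator H = (p^2 + q^2)/2. The architecture, optimizer, and hyperparameters are illustrative, not the authors' exact setup.

        # Sketch: unsupervised loss = residual of Hamilton's equations, via autograd.
        import math
        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 2))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        t = torch.linspace(0.0, 2.0 * math.pi, 200).reshape(-1, 1).requires_grad_(True)

        for step in range(2000):
            q, p = net(t).split(1, dim=1)
            dq = torch.autograd.grad(q.sum(), t, create_graph=True)[0]
            dp = torch.autograd.grad(p.sum(), t, create_graph=True)[0]
            # Hamilton's equations for H = (p^2 + q^2) / 2:  dq/dt = p,  dp/dt = -q
            residual = (dq - p).pow(2).mean() + (dp + q).pow(2).mean()
            ic = (q[0, 0] - 1.0) ** 2 + (p[0, 0] - 0.0) ** 2   # pin q(0)=1, p(0)=0
            loss = residual + ic
            opt.zero_grad()
            loss.backward()
            opt.step()

        print(loss.item())   # exact solution of this system: q(t) = cos t, p(t) = -sin t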
    ProcK: Machine Learning for Knowledge-Intensive Processes. (arXiv:2109.04881v1 [cs.LG])
    (2 min) Process mining deals with extraction of knowledge from business process execution logs. Traditional process mining tasks, like process model generation or conformance checking, rely on a minimalistic feature set where each event is characterized only by its case identifier, activity type, and timestamp. In contrast, the success of modern machine learning is based on models that take any available data as direct input and build layers of features automatically during training. In this work, we introduce ProcK (Process & Knowledge), a novel pipeline to build business process prediction models that take into account both sequential data in the form of event logs and rich semantic information represented in a graph-structured knowledge base. The hybrid approach enables ProcK to flexibly make use of all information residing in the databases of organizations. Components to extract inter-linked event logs and knowledge bases from relational databases are part of the pipeline. We demonstrate the power of ProcK by training it for prediction tasks on the OULAD e-learning dataset, where we achieve state-of-the-art performance on the tasks of predicting student dropout from courses and predicting their success. We also apply our method on a number of additional machine learning tasks, including exam score prediction and early predictions that only take into account data recorded during the first weeks of the courses.
    A Study of Joint Graph Inference and Forecasting. (arXiv:2109.04979v1 [cs.LG])
    (2 min) We study a recent class of models which uses graph neural networks (GNNs) to improve forecasting in multivariate time series. The core assumption behind these models is that there is a latent graph between the time series (nodes) that governs the evolution of the multivariate time series. By parameterizing a graph in a differentiable way, the models aim to improve forecasting quality. We compare four recent models of this class on the forecasting task. Further, we perform ablations to study their behavior under changing conditions, e.g., when disabling the graph-learning modules and providing the ground-truth relations instead. Based on our findings, we propose novel ways of combining the existing architectures.
    Detect and Classify -- Joint Span Detection and Classification for Health Outcomes. (arXiv:2104.07789v2 [cs.CL] UPDATED)
    (2 min) A health outcome is a measurement or an observation used to capture and assess the effect of a treatment. Automatic detection of health outcomes from text would undoubtedly speed up access to evidence necessary in healthcare decision making. Prior work on outcome detection has modelled this task as either (a) a sequence labelling task, where the goal is to detect which text spans describe health outcomes, or (b) a classification task, where the goal is to classify a text into a pre-defined set of categories depending on an outcome that is mentioned somewhere in that text. However, this decoupling of span detection and classification is problematic from a modelling perspective and ignores global structural correspondences between sentence-level and word-level information present in a given text. To address this, we propose a method that uses both word-level and sentence-level information to simultaneously perform outcome span detection and outcome type classification. In addition to injecting contextual information to hidden vectors, we use label attention to appropriately weight both word and sentence level information. Experimental results on several benchmark datasets for health outcome detection show that our proposed method consistently outperforms decoupled methods, reporting competitive results.
    Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization. (arXiv:2109.04994v1 [cs.CL])
    (2 min) Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors, exchanging information with each other. In such a scenario, the topic of a conversation can vary as it progresses and the key information for a certain topic is often scattered across multiple utterances of different speakers, which poses challenges for abstractive dialogue summarization. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available at https://github.com/Junpliu/ConDigSum.
    ECO: Enabling Energy-Neutral IoT Devices through Runtime Allocation of Harvested Energy. (arXiv:2102.13605v2 [eess.SY] UPDATED)
    (2 min) Energy harvesting offers an attractive and promising mechanism to power low-energy devices. However, it alone is insufficient to enable an energy-neutral operation, which can eliminate tedious battery charging and replacement requirements. Achieving an energy-neutral operation is challenging since the uncertainties in harvested energy undermine the quality of service requirements. To address this challenge, we present a runtime energy-allocation framework that optimizes the utility of the target device under energy constraints using a rollout algorithm, which is a sequential approach to solve dynamic optimization problems. The proposed framework uses an efficient iterative algorithm to compute initial energy allocations at the beginning of a day. The initial allocations are then corrected at every interval to compensate for the deviations from the expected energy harvesting pattern. We evaluate this framework using solar and motion energy harvesting modalities and American Time Use Survey data from 4772 different users. Compared to prior techniques, the proposed framework achieves up to 35% higher utility even under energy-limited scenarios. Moreover, measurements on a wearable device prototype show that the proposed framework has 1000x smaller energy overhead than iterative approaches with a negligible loss in utility.
    FedProf: Selective Federated Learning with Representation Profiling. (arXiv:2102.01733v6 [cs.LG] UPDATED)
    (2 min) Federated Learning (FL) has shown great potential as a privacy-preserving solution to learning from decentralized data that are only accessible to end devices (i.e., clients). In many scenarios however, a large proportion of the clients are probably in possession of low-quality data that are biased, noisy or even irrelevant. As a result, they could significantly slow down the convergence of the global model we aim to build and also compromise its quality. In light of this, we propose FedProf, a novel algorithm for optimizing FL under such circumstances without breaching data privacy. The key of our approach is a data representation profiling and matching scheme that uses the global model to dynamically profile data representations and allows for low-cost, lightweight representation matching. Based on the scheme we adaptively score each client and adjust its participation probability so as to mitigate the impact of low-value clients on the training process. We have conducted extensive experiments on public datasets using various FL settings. The results show that FedProf effectively reduces the number of communication rounds and overall time (up to 4.5x speedup) for the global model to converge and provides accuracy gain.
    Differential Privacy in Personalized Pricing with Nonparametric Demand Models. (arXiv:2109.04615v1 [stat.ML])
    (2 min) In recent decades, advances in information technology and abundant personal data have facilitated the application of algorithmic personalized pricing. However, this leads to growing concern about potential privacy violations due to adversarial attacks. To address the privacy issue, this paper studies a dynamic personalized pricing problem with unknown nonparametric demand models under data privacy protection. Two concepts of data privacy, which have been widely applied in practice, are introduced: central differential privacy (CDP) and local differential privacy (LDP), the latter of which is proved to be stronger than CDP in many cases. We develop two algorithms which make pricing decisions and learn the unknown demand on the fly, while satisfying the CDP and LDP guarantees respectively. In particular, for the algorithm with CDP guarantee, the regret is proved to be at most $\tilde O(T^{(d+2)/(d+4)}+\varepsilon^{-1}T^{d/(d+4)})$. Here, the parameter $T$ denotes the length of the time horizon, $d$ is the dimension of the personalized information vector, and the key parameter $\varepsilon>0$ measures the strength of privacy (smaller $\varepsilon$ indicates a stronger privacy protection). On the other hand, for the algorithm with LDP guarantee, its regret is proved to be at most $\tilde O(\varepsilon^{-2/(d+2)}T^{(d+1)/(d+2)})$, which is near-optimal as we prove a lower bound of $\Omega(\varepsilon^{-2/(d+2)}T^{(d+1)/(d+2)})$ for any algorithm with LDP guarantee.
    FedCon: A Contrastive Framework for Federated Semi-Supervised Learning. (arXiv:2109.04533v1 [cs.LG])
    (2 min) Federated Semi-Supervised Learning (FedSSL) has gained rising attention from both academic and industrial researchers, due to its unique characteristics of co-training machine learning models with isolated yet unlabeled data. Most existing FedSSL methods focus on the classical scenario, i.e., the labeled and unlabeled data are stored at the client side. However, in real-world applications, client users may not provide labels without any incentive. Thus, the scenario of labels at the server side is more practical. Since unlabeled data and labeled data are decoupled, most existing FedSSL approaches may fail to deal with such a scenario. To overcome this problem, in this paper, we propose FedCon, which introduces a new learning paradigm, i.e., contrastive learning, to FedSSL. Experimental results on three datasets show that FedCon achieves the best performance with the contrastive framework compared with state-of-the-art baselines under both IID and Non-IID settings. Besides, ablation studies demonstrate the characteristics of the proposed FedCon framework.
    Inverse design of 3d molecular structures with conditional generative neural networks. (arXiv:2109.04824v1 [cs.LG])
    (2 min) The rational design of molecules with desired properties is a long-standing challenge in chemistry. Generative neural networks have emerged as a powerful approach to sample novel molecules from a learned distribution. Here, we propose a conditional generative neural network for 3d molecular structures with specified structural and chemical properties. This approach is agnostic to chemical bonding and enables targeted sampling of novel molecules from conditional distributions, even in domains where reference calculations are sparse. We demonstrate the utility of our method for inverse design by generating molecules with specified composition or motifs, discovering particularly stable molecules, and jointly targeting multiple electronic properties beyond the training regime.
    Block Pruning For Faster Transformers. (arXiv:2109.04838v1 [cs.LG])
    (2 min) Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.
    Asynchronous Iterations in Optimization: New Sequence Results and Sharper Algorithmic Guarantees. (arXiv:2109.04522v1 [math.OC])
    (2 min) We introduce novel convergence results for asynchronous iterations which appear in the analysis of parallel and distributed optimization algorithms. The results are simple to apply and give explicit estimates for how the degree of asynchrony impacts the convergence rates of the iterates. Our results shorten, streamline and strengthen existing convergence proofs for several asynchronous optimization methods, and allow us to establish convergence guarantees for popular algorithms that were thus far lacking a complete theoretical understanding. Specifically, we use our results to derive better iteration complexity bounds for proximal incremental aggregated gradient methods, to provide less conservative analyses of the speedup conditions for asynchronous block-coordinate implementations of Krasnoselskii-Mann iterations, and to quantify the convergence rates for totally asynchronous iterations under various assumptions on communication delays and update rates.
    Feature-based Individual Fairness in k-Clustering. (arXiv:2109.04554v1 [cs.LG])
    (2 min) Ensuring fairness in machine learning algorithms is a challenging and important task. We consider the problem of clustering a set of points while ensuring fairness constraints. While there have been several attempts to capture group fairness in the k-clustering problem, fairness at an individual level is not well-studied. We introduce a new notion of individual fairness in k-clustering based on features that are not necessarily used for clustering. We show that this problem is NP-hard and does not admit a constant factor approximation. We then design a randomized algorithm that guarantees approximation both in terms of minimizing the clustering distance objective as well as individual fairness under natural restrictions on the distance metric and fairness constraints. Finally, our experimental results validate that our algorithm produces lower clustering costs compared to existing algorithms while being competitive in individual fairness.
    TENET: Temporal CNN with Attention for Anomaly Detection in Automotive Cyber-Physical Systems. (arXiv:2109.04565v1 [cs.LG])
    (2 min) Modern vehicles have multiple electronic control units (ECUs) that are connected together as part of a complex distributed cyber-physical system (CPS). The ever-increasing communication between ECUs and external electronic systems has made these vehicles particularly susceptible to a variety of cyber-attacks. In this work, we present a novel anomaly detection framework called TENET to detect anomalies induced by cyber-attacks on vehicles. TENET uses temporal convolutional neural networks with an integrated attention mechanism to detect anomalous attack patterns. TENET is able to achieve an improvement of 32.70% in False Negative Rate, 19.14% in the Matthews Correlation Coefficient, and 17.25% in the ROC-AUC metric, with 94.62% fewer model parameters, an 86.95% decrease in memory footprint, and 48.14% lower inference time when compared to the best-performing prior work on automotive anomaly detection.
    Spatially Focused Attack against Spatiotemporal Graph Neural Networks. (arXiv:2109.04608v1 [cs.LG])
    (2 min) Spatiotemporal forecasting plays an essential role in various applications in intelligent transportation systems (ITS), such as route planning, navigation, and traffic control and management. Deep spatiotemporal graph neural networks (GNNs), which capture both spatial and temporal patterns, have achieved great success in traffic forecasting applications. Understanding how GNN-based forecasting works, and the vulnerability and robustness of these models, becomes critical to real-world applications. For example, if spatiotemporal GNNs are vulnerable in real-world traffic prediction applications, a hacker can easily manipulate the results and cause serious traffic congestion and even a city-scale breakdown. However, although recent studies have demonstrated that deep neural networks (DNNs) are vulnerable to carefully designed perturbations in multiple domains like object classification and graph representation, current adversarial works cannot be directly applied to spatiotemporal forecasting due to the causal nature and spatiotemporal mechanisms in forecasting models. To fill this gap, in this paper we design Spatially Focused Attack (SFA) to break spatiotemporal GNNs by attacking a single vertex. To achieve this, we first propose the inverse estimation to address the causality issue; then, we apply genetic algorithms with a universal attack method as the evaluation function to locate the weakest vertex; finally, perturbations are generated by solving an inverse estimation-based optimization problem. We conduct experiments on real-world traffic data and our results show that perturbations in one vertex designed by SFA can be diffused into a large part of the graph.
    Simulating the Effects of Eco-Friendly Transportation Selections for Air Pollution Reduction. (arXiv:2109.04831v1 [cs.LG])
    (2 min) Reducing air pollution, such as CO2 and PM2.5 emissions, is one of the most important issues for many countries worldwide. Selecting an environmentally friendly transport mode can be an effective approach of individuals to reduce air pollution in daily life. In this study, we propose a method to simulate the effectiveness of an eco-friendly transport mode selection for reducing air pollution by using map search logs. We formulate the transport mode selection as a combinatorial optimization problem with the constraints regarding the total amount of CO2 emissions as an example of air pollution and the average travel time. The optimization results show that the total amount of CO2 emissions can be reduced by 9.23%, whereas the average travel time can in fact be reduced by 9.96%. Our research proposal won first prize in Regular Machine Learning Competition Track Task 2 at KDD Cup 2019.
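    A toy sketch of the constrained selection problem described above: pick one transport mode per trip so as to minimize total CO2 while keeping the average travel time under a budget, here by brute force on a tiny instance. The emission and time figures are made up for illustration.

        # Sketch: brute-force mode selection minimizing CO2 under an average-time budget.
        from itertools import product

        # (mode, co2_grams, minutes) options per trip -- made-up numbers
        trips = [
            [("car", 1200, 20), ("bus", 400, 35), ("bike", 0, 50)],
            [("car", 800, 15), ("train", 300, 25), ("walk", 0, 60)],
            [("car", 2000, 30), ("bus", 700, 45)],
        ]
        max_avg_minutes = 40

        best = None
        for choice in product(*trips):
            total_co2 = sum(c for _, c, _ in choice)
            avg_time = sum(t for _, _, t in choice) / len(choice)
            if avg_time <= max_avg_minutes and (best is None or total_co2 < best[0]):
                best = (total_co2, avg_time, [m for m, _, _ in choice])

        print(best)   # lowest-CO2 mode assignment that respects the time budget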
    Unsupervised Causal Binary Concepts Discovery with VAE for Black-box Model Explanation. (arXiv:2109.04518v1 [cs.LG])
    (2 min) We aim to explain a black-box classifier with the form: "data X is classified as class Y because X has A, B and does not have C", in which A, B, and C are high-level concepts. The challenge is that we have to discover in an unsupervised manner a set of concepts, i.e., A, B and C, that is useful for explaining the classifier. We first introduce a structural generative model that is suitable to express and discover such concepts. We then propose a learning process that simultaneously learns the data distribution and encourages certain concepts to have a large causal influence on the classifier output. Our method also allows easy integration of a user's prior knowledge to induce high interpretability of concepts. Using multiple datasets, we demonstrate that our method can discover useful binary concepts for explanation.
    Is Attention Better Than Matrix Decomposition? (arXiv:2109.04553v1 [cs.CV])
    (2 min) As an essential ingredient of modern deep learning, the attention mechanism, especially self-attention, plays a vital role in global correlation discovery. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) model developed 20 years ago in terms of performance and computational cost for encoding long-distance dependencies. We model the global context issue as a low-rank recovery problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. Comprehensive experiments are conducted on vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants.
    Truth Discovery in Sequence Labels from Crowds. (arXiv:2109.04470v1 [cs.HC])
    (2 min) Annotation quality and quantity positively affect the performance of sequence labeling, a vital task in Natural Language Processing. Hiring domain experts to annotate a corpus is very costly in terms of money and time. Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been deployed to assist in this purpose. However, these platforms are prone to human errors due to the lack of expertise; hence, one worker's annotations cannot be directly used to train the model. Existing literature on annotation aggregation focuses more on binary or multiple-choice problems. In recent years, handling the sequential label aggregation tasks on imbalanced datasets with complex dependencies between tokens has been challenging. To address this challenge, we propose an optimization-based method that infers the best set of aggregated annotations using labels provided by workers. The proposed Aggregation method for Sequential Labels from Crowds ($AggSLC$) jointly considers the characteristics of sequential labeling tasks, workers' reliabilities, and advanced machine learning techniques. We evaluate $AggSLC$ on different crowdsourced data for Named Entity Recognition (NER), information extraction tasks in the biomedical domain (PICO), and a simulated dataset. Our results show that the proposed method outperforms the state-of-the-art aggregation methods. To gain insights into the framework, we study the effectiveness of $AggSLC$'s components through ablation studies by evaluating our model in the absence of the prediction module and the inconsistency loss function. Theoretical analysis of our algorithm's convergence shows that the proposed $AggSLC$ halts after a finite number of iterations.
    Efficiently Identifying Task Groupings for Multi-Task Learning. (arXiv:2109.04617v1 [cs.LG])
    (2 min) Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naively training all tasks together in one model often degrades performance, and exhaustively searching through combinations of task groupings can be prohibitively expensive. As a result, efficiently identifying the tasks that would benefit from co-training remains a challenging design question without a clear solution. In this paper, we suggest an approach to select which tasks should train together in multi-task learning models. Our method determines task groupings in a single training run by co-training all tasks together and quantifying the extent to which one task's gradient update would affect another task's loss. On the large-scale Taskonomy computer vision dataset, we find this method can decrease test loss by 10.0% compared to simply training all tasks together while operating 11.6 times faster than a state-of-the-art task grouping method.
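    A minimal sketch of the affinity measurement described above: take a lookahead gradient step on shared parameters using task i's loss and check whether task j's loss decreases. The toy model, data, and step size are illustrative assumptions, not the paper's full procedure.

        # Sketch: inter-task affinity via lookahead updates on a shared encoder.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        torch.manual_seed(0)
        shared = nn.Linear(8, 16)                                    # shared encoder
        heads = nn.ModuleList([nn.Linear(16, 1) for _ in range(3)])  # one head per task
        x = torch.randn(32, 8)
        targets = [torch.randn(32, 1) for _ in range(3)]
        lr = 0.1

        def task_loss(j, w, b):
            h = torch.relu(F.linear(x, w, b))
            return F.mse_loss(heads[j](h), targets[j])

        affinity = torch.zeros(3, 3)
        for i in range(3):
            gw, gb = torch.autograd.grad(task_loss(i, shared.weight, shared.bias),
                                         [shared.weight, shared.bias])
            with torch.no_grad():
                for j in range(3):
                    before = task_loss(j, shared.weight, shared.bias)
                    after = task_loss(j, shared.weight - lr * gw, shared.bias - lr * gb)
                    affinity[i, j] = 1.0 - (after / before).item()   # >0: task i's step lowers task j's loss
        print(affinity)   # tasks with mutually positive affinity are candidates to group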
    Bootstrapped Meta-Learning. (arXiv:2109.04504v1 [cs.LG])
    (2 min) Meta-learning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging meta-optimisation problem that often exhibits ill-conditioning and myopic meta-objectives. We propose an algorithm that tackles these issues by letting the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under a chosen (pseudo-)metric. Focusing on meta-learning with gradients, we establish conditions that guarantee performance improvements and show that the improvement is related to the target distance. Thus, by controlling curvature, the distance measure can be used to ease meta-optimisation, for instance by reducing ill-conditioning. Further, the bootstrapping mechanism can extend the effective meta-learning horizon without requiring backpropagation through all updates. The algorithm is versatile and easy to implement. We achieve a new state of the art for model-free agents on the Atari ALE benchmark, improve upon MAML in few-shot learning, and demonstrate how our approach opens up new possibilities by meta-learning efficient exploration in a Q-learning agent.
    6MapNet: Representing soccer players from tracking data by a triplet network. (arXiv:2109.04720v1 [cs.LG])
    (2 min) Although the values of individual soccer players have become astronomical, subjective judgments still play a big part in player analysis. Recently, there have been new attempts to quantitatively grasp players' styles using video-based event stream data. However, these approaches have limited scalability due to high annotation costs and the sparsity of event stream data. In this paper, we build a triplet network named 6MapNet that can effectively capture the movement styles of players using in-game GPS data. Without any annotation of soccer-specific actions, we use players' locations and velocities to generate two types of heatmaps. Our subnetworks then map these heatmap pairs into feature vectors whose similarity corresponds to the actual similarity of playing styles. The experimental results show that players can be accurately identified with only a small number of matches by our method.
    Style Pooling: Automatic Text Style Obfuscation for Improved Classification Fairness. (arXiv:2109.04624v1 [cs.LG])
    (2 min) Text style can reveal sensitive attributes of the author (e.g. race or age) to the reader, which can, in turn, lead to privacy violations and bias in both human and algorithmic decisions based on text. For example, the style of writing in job applications might reveal protected attributes of the candidate which could lead to bias in hiring decisions, regardless of whether hiring decisions are made algorithmically or by humans. We propose a VAE-based framework that obfuscates stylistic features of human-generated text through style transfer by automatically re-writing the text itself. Our framework operationalizes obfuscated style in a flexible way that enables two distinct notions: (1) a minimal notion that effectively intersects the various styles seen in training, and (2) a maximal notion that seeks to obfuscate by adding stylistic features of all sensitive attributes to text, in effect computing a union of styles. Our style-obfuscation framework can be used for multiple purposes; here, we demonstrate its effectiveness in improving the fairness of downstream classifiers. We also conduct a comprehensive study of style pooling's effect on fluency, semantic consistency, and attribute removal from text, in two- and three-domain style obfuscation.
    Deciphering Environmental Air Pollution with Large Scale City Data. (arXiv:2109.04572v1 [cs.LG])
    (2 min) Out of the numerous hazards posing a threat to sustainable environmental conditions in the 21st century, only a few have a graver impact than air pollution. Its importance in determining the health and living standards in urban settings is only expected to increase with time. Various factors, ranging from emissions from traffic and power plants to household emissions and natural causes, are known to be primary causal agents or influencers behind rising air pollution levels. However, the lack of large-scale data involving the major factors has hindered research on the causes and relations governing the variability of the different air pollutants. Through this work, we introduce a large-scale, city-wise dataset for exploring the relationships among these agents over a long period of time. We analyze and explore the dataset to draw out the inferences that can be derived by modeling the data. We also provide a set of benchmarks for the problem of estimating or forecasting pollutant levels using a diverse set of models and methodologies. Through our paper, we seek to lay the groundwork for further research into a domain that will demand critical attention in the near future.
    Notes on Generalizing the Maximum Entropy Principle to Uncertain Data. (arXiv:2109.04530v1 [cs.IT])
    (2 min) The principle of maximum entropy is a broadly applicable technique for computing a distribution with the least amount of information possible while commonly constrained to match empirically estimated feature expectations. We seek to generalize this principle to scenarios where the empirical feature expectations cannot be computed because the model variables are only partially observed, which introduces a dependency on the learned model. Extending and generalizing the principle of latent maximum entropy, we introduce uncertain maximum entropy and describe an expectation-maximization based solution to approximately solve these problems. We show that our technique generalizes the principle of maximum entropy and latent maximum entropy and discuss a generally applicable regularization technique for adding error terms to feature expectation constraints in the event of limited data.
    Identifying Morality Frames in Political Tweets using Relational Learning. (arXiv:2109.04535v1 [cs.CL])
    (2 min) Extracting moral sentiment from text is a vital component in understanding public opinion, social movements, and policy decisions. The Moral Foundation Theory identifies five moral foundations, each associated with a positive and negative polarity. However, moral sentiment is often motivated by its targets, which can correspond to individuals or collective entities. In this paper, we introduce morality frames, a representation framework for organizing moral attitudes directed at different entities, and present a novel, high-quality annotated dataset of tweets written by US politicians. We then propose a relational learning model to jointly predict moral attitudes towards entities and moral foundations. We conduct qualitative and quantitative evaluations, showing that moral sentiment towards entities differs substantially across political ideologies.
    Ergodic Limits, Relaxations, and Geometric Properties of Random Walk Node Embeddings. (arXiv:2109.04526v1 [stat.ML])
    (2 min) Random walk based node embedding algorithms learn vector representations of nodes by optimizing an objective function of node embedding vectors and skip-bigram statistics computed from random walks on the network. They have been applied to many supervised learning problems such as link prediction and node classification and have demonstrated state-of-the-art performance. Yet, their properties remain poorly understood. This paper studies properties of random walk based node embeddings in the unsupervised setting of discovering hidden block structure in the network, i.e., learning node representations whose cluster structure in Euclidean space reflects their adjacency structure within the network. We characterize the ergodic limits of the embedding objective, its generalization, and related convex relaxations to derive corresponding non-randomized versions of the node embedding objectives. We also characterize the optimal node embedding Grammians of the non-randomized objectives for the expected graph of a two-community Stochastic Block Model (SBM). We prove that the solution Grammian has rank $1$ for a suitable nuclear norm relaxation of the non-randomized objective. Comprehensive experimental results on SBM random networks reveal that our non-randomized ergodic objectives yield node embeddings whose distribution is Gaussian-like, centered at the node embeddings of the expected network within each community, and concentrate in the linear degree-scaling regime as the number of nodes increases.
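    The skip-bigram statistics these objectives build on can be illustrated with a small numpy sketch: simulate random walks on an adjacency matrix and count node co-occurrences within a context window. The toy graph, walk length, and window size are arbitrary choices for illustration only.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)              # toy undirected graph
    P = A / A.sum(axis=1, keepdims=True)                   # random-walk transition matrix

    def sample_walk(start, length):
        walk = [start]
        for _ in range(length - 1):
            walk.append(rng.choice(len(P), p=P[walk[-1]]))
        return walk

    window, counts = 2, np.zeros_like(A)
    for start in range(len(A)):
        for _ in range(100):                               # walks per start node
            walk = sample_walk(start, 20)
            for i, u in enumerate(walk):
                for v in walk[max(0, i - window):i]:       # co-occurrences within the window
                    counts[u, v] += 1
                    counts[v, u] += 1

    print(counts / counts.sum())                           # empirical skip-bigram distribution
    ```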

2021-09-10

  • cs.CL updates on arXiv.org

    LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding. (arXiv:2104.08836v3 [cs.CL] UPDATED)
    (2 min) Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
    LightNER: A Lightweight Generative Framework with Prompt-guided Attention for Low-resource NER. (arXiv:2109.00720v2 [cs.CL] UPDATED)
    (2 min) Most existing NER methods rely on extensive labeled data for model training and struggle in low-resource scenarios with limited training data. Recently, prompt-tuning methods for pre-trained language models have achieved remarkable performance in few-shot learning by exploiting prompts as task guidance to reduce the gap between pre-training and downstream tuning. Inspired by prompt learning, we propose a novel lightweight generative framework with prompt-guided attention for low-resource NER (LightNER). Specifically, we construct a semantic-aware answer space of entity categories for prompt learning to generate the entity span sequence and entity categories without any label-specific classifiers. We further propose prompt-guided attention by incorporating continuous prompts into the self-attention layer to re-modulate the attention and adapt the pre-trained weights. Note that we tune only the continuous prompts, keeping all parameters of the pre-trained language model fixed, which makes our approach lightweight and flexible for low-resource scenarios and better able to transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings by tuning only a small part of the parameters.
    Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval. (arXiv:2104.08801v2 [cs.CL] UPDATED)
    (2 min) In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA) from source to target domain. While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between the target domain and synthetic data distribution, and reduces model overfitting to the source domain. We run UDA experiments on question generation and passage retrieval from the \textit{Natural Questions} domain to machine learning and biomedical domains. We find that back-training vastly outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6\% top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset- \textit{MLQuestions} containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.
    KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. (arXiv:2104.07650v5 [cs.CL] UPDATED)
    (2 min) Recently, prompt-tuning has achieved promising results for certain few-shot classification tasks. The core idea of prompt-tuning is to insert text pieces (i.e., templates) into the input and transform a classification task into a masked language modeling problem. However, for relation extraction, determining an appropriate prompt template requires domain expertise, and it is cumbersome and time-consuming to obtain a suitable label word. Furthermore, there exists abundant semantic knowledge among the entities and relations that cannot be ignored. To this end, we focus on incorporating knowledge into prompt-tuning for relation extraction and propose a knowledge-aware prompt-tuning approach with synergistic optimization (KnowPrompt). Specifically, we inject entity and relation knowledge into prompt construction with learnable virtual template words as well as answer words, and synergistically optimize their representation with knowledge constraints. Extensive experimental results on five datasets with standard and low-resource settings demonstrate the effectiveness of our approach.
    Variational Latent-State GPT for Semi-supervised Task-Oriented Dialog Systems. (arXiv:2109.04314v1 [cs.CL])
    (2 min) Recently, two approaches, fine-tuning large pre-trained language models and variational training, have separately attracted significant interest for semi-supervised end-to-end task-oriented dialog (TOD) systems. In this paper, we propose the Variational Latent-State GPT model (VLS-GPT), which is the first to combine the strengths of the two approaches. Among the many modeling options, we propose a generative model and an inference model for variational learning of the end-to-end TOD system, both as auto-regressive language models based on GPT-2, which can be further trained over a mix of labeled and unlabeled dialog data in a semi-supervised manner. We develop a sampling-then-forward-computation strategy, which successfully overcomes the memory explosion issue of using GPT in variational learning and speeds up training. Semi-supervised TOD experiments are conducted on two benchmark multi-domain datasets of different languages, MultiWOZ2.1 and CrossWOZ. VLS-GPT is shown to significantly outperform both supervised-only and semi-supervised baselines.
    All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. (arXiv:2109.04404v1 [cs.CL])
    (2 min) Similarity measures are a vital tool for understanding how language models represent and process language. Standard representational similarity measures such as cosine similarity and Euclidean distance have been successfully used in static word embedding models to understand how words cluster in semantic space. Recently, these measures have been applied to embeddings from contextualized models such as BERT and GPT-2. In this work, we call into question the informativity of such measures for contextualized language models. We find that a small number of rogue dimensions, often just 1-3, dominate these measures. Moreover, we find a striking mismatch between the dimensions that dominate similarity measures and those which are important to the behavior of the model. We show that simple postprocessing techniques such as standardization are able to correct for rogue dimensions and reveal underlying representational quality. We argue that accounting for rogue dimensions is essential for any similarity-based analysis of contextual language models.
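    A quick numpy sketch of the standardization fix mentioned above, under assumed synthetic data: z-scoring each embedding dimension over a sample of vectors removes the large shared offset of a single "rogue" dimension that would otherwise dominate cosine similarity.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 64))                  # stand-in for contextual embeddings
    emb[:, 0] += 40.0                                  # one rogue dimension with a huge shared offset

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    standardized = (emb - emb.mean(axis=0)) / emb.std(axis=0)
    print("raw cosine:         ", cosine(emb[0], emb[1]))                     # near 1, driven by the rogue dimension
    print("standardized cosine:", cosine(standardized[0], standardized[1]))   # near 0 for unrelated vectors
    ```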
    MATE: Multi-view Attention for Table Transformer Efficiency. (arXiv:2109.04312v1 [cs.CL])
    (2 min) This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly with respect to speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state-of-the-art for three table reasoning datasets. For HybridQA (Chen et al., 2020b), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
    AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models. (arXiv:2109.04413v1 [cs.CL])
    (2 min) Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments demonstrate that, on the task of detecting idiomatic usage, these models perform reasonably well in the one-shot and few-shot scenarios, but that there is significant scope for improvement in the zero-shot scenario. On the task of representing idiomaticity, we find that pre-training is not always effective, while fine-tuning could provide a sample efficient method of learning representations of sentences containing MWEs.
    Understanding Mention Detector-Linker Interaction in Neural Coreference Resolution. (arXiv:2009.09363v2 [cs.CL] UPDATED)
    (2 min) Despite significant recent progress in coreference resolution, the quality of current state-of-the-art systems still considerably trails behind human-level performance. Using the CoNLL-2012 and PreCo datasets, we dissect the best instantiation of the mainstream end-to-end coreference resolution model that underlies most current best-performing coreference systems, and empirically analyze the behavior of its two components: mention detector and mention linker. While the detector traditionally focuses heavily on recall as a design decision, we demonstrate the importance of precision, calling for their balance. However, we point out the difficulty in building a precise detector due to its inability to make important anaphoricity decisions. We also highlight the enormous room for improving the linker and show that the rest of its errors mainly involve pronoun resolution. We propose promising next steps and hope our findings will help future research in coreference resolution.
    Compositional Generalization via Semantic Tagging. (arXiv:2010.11818v2 [cs.CL] UPDATED)
    (2 min) Although neural sequence-to-sequence models have been successfully applied to semantic parsing, they fail at compositional generalization, i.e., they are unable to systematically generalize to unseen compositions of seen components. Motivated by traditional semantic parsing where compositionality is explicitly accounted for by symbolic grammars, we propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models while featuring lexicon-style alignments and disentangled information processing. Specifically, we decompose decoding into two phases where an input utterance is first tagged with semantic symbols representing the meaning of individual words, and then a sequence-to-sequence model is used to predict the final meaning representation conditioning on the utterance and the predicted tag sequence. Experimental results on three semantic parsing datasets show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.
    Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning. (arXiv:2109.04144v1 [cs.CL])
    (2 min) Recent prompt-based approaches allow pretrained language models to achieve strong performance on few-shot finetuning by reformulating downstream tasks as a language modeling problem. In this work, we demonstrate that, despite their advantages in low-data regimes, finetuned prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap, e.g., models incorrectly assuming a sentence pair has the same meaning because it consists of the same set of words. Interestingly, we find that this particular inference heuristic is significantly less present in the zero-shot evaluation of the prompt-based model, indicating how finetuning can be destructive to useful knowledge learned during pretraining. We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning. Our evaluation on three datasets demonstrates promising improvements on the three corresponding challenge datasets used to diagnose the inference heuristics.
    Lexico-semantic and affective modelling of Spanish poetry: A semi-supervised learning approach. (arXiv:2109.04152v1 [cs.AI])
    (2 min) Text classification tasks have improved substantially in recent years through the use of transformers. However, the majority of research focuses on prose, with poetry receiving less attention, especially for the Spanish language. In this paper, we propose a semi-supervised learning approach for inferring 21 psychological categories evoked by a corpus of 4572 sonnets, along with 10 affective and lexico-semantic multiclass categories. The subset of poems used for training and evaluation includes 270 sonnets. With our approach, we achieve an AUC beyond 0.7 for 76% of the psychological categories, and an AUC over 0.65 for 60% of the multiclass ones. The sonnets are modelled using transformers, through sentence embeddings, along with lexico-semantic and affective features obtained from external lexicons. We find that this approach provides an AUC increase of up to 0.12 compared to using transformers alone.
    Word-Level Coreference Resolution. (arXiv:2109.04127v1 [cs.CL])
    (2 min) Recent coreference resolution models rely heavily on span representations to find coreference links between word spans. As the number of spans is $O(n^2)$ in the length of text and the number of potential links is $O(n^4)$, various pruning techniques are necessary to make this approach computationally feasible. We propose instead to consider coreference links between individual words rather than word spans and then reconstruct the word spans. This reduces the complexity of the coreference model to $O(n^2)$ and allows it to consider all potential mentions without pruning any of them out. We also demonstrate that, with these changes, SpanBERT for coreference resolution will be significantly outperformed by RoBERTa. While being highly efficient, our model performs competitively with recent coreference resolution systems on the OntoNotes benchmark.
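    A toy numpy sketch of the word-level formulation: antecedent links are scored between individual words with a bilinear scorer, keeping the pair space at $O(n^2)$, and each word picks its best-scoring earlier word or a dummy "no antecedent" option. The random scorer stands in for a trained model, and the span-reconstruction step is omitted.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10, 16
    words = rng.normal(size=(n, d))                    # contextual word representations (stand-ins)
    W = rng.normal(size=(d, d))                        # bilinear antecedent scorer (untrained)

    scores = words @ W @ words.T                       # (n, n) pairwise link scores
    scores[np.triu_indices(n)] = -np.inf               # a word may only link to an earlier word
    scores = np.column_stack([np.zeros(n), scores])    # column 0 = "no antecedent" dummy
    antecedent = scores.argmax(axis=1) - 1             # -1 means the word starts a new chain
    print(antecedent)
    ```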
    Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph. (arXiv:2109.04400v1 [cs.CL])
    (2 min) Cross-lingual text classification requires task-specific training data in a high-resource source language, where the task is identical to that of the low-resource target language. However, collecting such training data can be infeasible because of labeling cost, task characteristics, and privacy concerns. This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries. First, we construct a dictionary-based heterogeneous graph (DHG) from bilingual dictionaries. This opens the possibility of using graph neural networks for cross-lingual transfer. The remaining challenge is the heterogeneity of the DHG because multiple languages are considered. To address this challenge, we propose a dictionary-based heterogeneous graph neural network (DHGNet) that effectively handles the heterogeneity of the DHG with two-step aggregations: word-level and language-level aggregation. Experimental results demonstrate that our method outperforms pretrained models even though it does not have access to large corpora. Furthermore, it performs well even when dictionaries contain many incorrect translations. This robustness allows the use of a wider range of dictionaries, such as automatically constructed and crowdsourced dictionaries, which are convenient for real-world applications.
    Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging. (arXiv:2010.03060v5 [cs.LG] UPDATED)
    (2 min) A key challenge in training neural networks for a given medical imaging task is often the difficulty of obtaining a sufficient number of manually labeled examples. In contrast, textual imaging reports, which are often readily available in medical records, contain rich but unstructured interpretations written by experts as part of standard clinical practice. We propose using these textual reports as a form of weak supervision to improve the image interpretation performance of a neural network without requiring additional manually labeled examples. We use an image-text matching task to train a feature extractor and then fine-tune it in a transfer learning setting for a supervised task using a small labeled dataset. The end result is a neural network that automatically interprets imagery without requiring textual reports during inference. This approach can be applied to any task for which text-image pairs are readily available. We evaluate our method on three classification tasks and find consistent performance improvements, reducing the need for labeled data by 67%-98%.
    Tracking Turbulence Through Financial News During COVID-19. (arXiv:2109.04369v1 [cs.CL])
    (2 min) Grave human toll notwithstanding, the COVID-19 pandemic created uniquely unstable conditions in financial markets. In this work we uncover and discuss relationships involving sentiment in financial publications during the 2020 pandemic-motivated U.S. financial crash. First, we introduce a set of expert annotations of financial sentiment for articles from major American financial news publishers. After an exploratory data analysis, we then describe a CNN-based architecture to address the task of predicting financial sentiment in this anomalous, tumultuous setting. Our best performing model achieves a maximum weighted F1 score of 0.746, establishing a strong performance benchmark. Using predictions from our top performing model, we close by conducting a statistical correlation study with real stock market data, finding interesting and strong relationships between financial news and the S\&P 500 index, trading volume, market volatility, and different single-factor ETFs.
    Uncertainty Measures in Neural Belief Tracking and the Effects on Dialogue Policy Performance. (arXiv:2109.04349v1 [cs.CL])
    (2 min) The ability to identify and resolve uncertainty is crucial for the robustness of a dialogue system. Indeed, this has been confirmed empirically on systems that utilise Bayesian approaches to dialogue belief tracking. However, such systems consider only confidence estimates and have difficulty scaling to more complex settings. Neural dialogue systems, on the other hand, rarely take uncertainties into account. They are therefore overconfident in their decisions and less robust. Moreover, the performance of the tracking task is often evaluated in isolation, without consideration of its effect on the downstream policy optimisation. We propose the use of different uncertainty measures in neural belief tracking. The effects of these measures on the downstream task of policy optimisation are evaluated by adding selected measures of uncertainty to the feature space of the policy and training policies through interaction with a user simulator. Both human and simulated user results show that incorporating these measures leads to improvements both of the performance and of the robustness of the downstream dialogue policy. This highlights the importance of developing neural dialogue belief trackers that take uncertainty into account.
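    As a minimal illustration of "adding selected measures of uncertainty to the feature space of the policy", the sketch below appends the entropy of a tracker's belief over a slot to the policy's input features. The belief vector is a made-up example, and entropy is only one of the measures considered.

    ```python
    import numpy as np

    belief = np.array([0.55, 0.30, 0.10, 0.05])            # tracker's distribution over slot values
    entropy = -(belief * np.log(belief)).sum()              # higher entropy = less certain tracker
    policy_features = np.concatenate([belief, [entropy]])   # uncertainty exposed to the policy
    print(policy_features)
    ```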
    TxT: Crossmodal End-to-End Learning with Transformers. (arXiv:2109.04422v1 [cs.CV])
    (2 min) Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.
    Generalised Unsupervised Domain Adaptation of Neural Machine Translation with Cross-Lingual Data Selection. (arXiv:2109.04292v1 [cs.CL])
    (2 min) This paper considers the unsupervised domain adaptation problem for neural machine translation (NMT), where we assume access to only monolingual text in either the source or the target language in the new domain. We propose a cross-lingual data selection method to extract in-domain sentences in the missing language side from a large generic monolingual corpus. Our proposed method trains an adaptive layer on top of multilingual BERT by contrastive learning to align the representations of the source and target languages. This then enables zero-shot transfer of the domain classifier between the languages. Once the in-domain data is detected by the classifier, the NMT model is adapted to the new domain by jointly learning translation and domain discrimination tasks. We evaluate our cross-lingual data selection method on NMT across five diverse domains in three language pairs, as well as a real-world scenario of translation for COVID-19. The results show that our proposed method outperforms other selection baselines by up to +1.5 BLEU.
    Automatic Text Evaluation through the Lens of Wasserstein Barycenters. (arXiv:2108.12463v2 [cs.CL] UPDATED)
    (2 min) A new metric, \texttt{BaryScore}, is introduced to evaluate text generation based on deep contextualized embeddings (e.g., BERT, RoBERTa, ELMo). This metric is motivated by a new framework relying on optimal transport tools, i.e., the Wasserstein distance and barycenter. By modelling the layer outputs of deep contextualized embeddings as a probability distribution rather than as a vector embedding, this framework provides a natural way to aggregate the different outputs through the Wasserstein space topology. In addition, it provides theoretical grounds for our metric and offers an alternative to available solutions (e.g., MoverScore and BertScore). Numerical evaluation is performed on four different tasks: machine translation, summarization, data2text generation and image captioning. Our results show that \texttt{BaryScore} outperforms other BERT-based metrics and exhibits more consistent behaviour, in particular for text summarization.
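    A rough numpy sketch of the underlying comparison, assuming random vectors in place of real contextual token embeddings: the candidate and reference sentences are treated as uniform discrete distributions over their token embeddings and compared with an entropy-regularised Wasserstein (Sinkhorn) cost. The actual metric aggregates several encoder layers and uses its own weighting.

    ```python
    import numpy as np

    def sinkhorn_cost(X, Y, reg=0.1, n_iter=200):
        """Entropy-regularised OT cost between uniform distributions over the rows of X and Y."""
        C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost matrix
        C = C / C.max()                                       # rescale for numerical stability
        K = np.exp(-C / reg)
        a, b = np.full(len(X), 1 / len(X)), np.full(len(Y), 1 / len(Y))
        u = np.ones_like(a)
        for _ in range(n_iter):                               # Sinkhorn iterations
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]                       # transport plan
        return (P * C).sum()

    rng = np.random.default_rng(0)
    cand, ref = rng.normal(size=(7, 32)), rng.normal(size=(9, 32))  # stand-in token embeddings
    print("transport cost:", sinkhorn_cost(cand, ref))
    ```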
    The Topic Confusion Task: A Novel Scenario for Authorship Attribution. (arXiv:2104.08530v2 [cs.CL] UPDATED)
    (2 min) Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether new, unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by a failure to capture authorship writing style or by a topic shift. Motivated by this, we propose the \emph{topic confusion} task where we switch the author-topic configuration between the training and testing sets. This setup allows us to distinguish two types of errors: those caused by the topic shift and those caused by the features' inability to capture the writing styles. We show that stylometric features with part-of-speech tags are the least susceptible to topic variations. We further show that combining them with other features leads to significantly lower topic confusion and higher attribution accuracy. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task and are surpassed by simple features such as word-level $n$-grams.
    How to Train BERT with an Academic Budget. (arXiv:2104.07705v2 [cs.CL] UPDATED)
    (2 min) While large language models a la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
    Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation. (arXiv:2106.07843v3 [cs.SD] UPDATED)
    (2 min) In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semi-supervised performance is comparable to a fully supervised separation system trained using ten times the amount of supervised data.
    $Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering. (arXiv:2104.08202v2 [cs.CL] UPDATED)
    (2 min) Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted $Q^2$, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of $Q^2$ against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.
    DRIFT: A Toolkit for Diachronic Analysis of Scientific Literature. (arXiv:2107.01198v4 [cs.CL] UPDATED)
    (2 min) In this work, we present to the NLP community, and to the wider research community as a whole, an application for the diachronic analysis of research corpora. We open source an easy-to-use tool coined: DRIFT, which allows researchers to track research trends and development over the years. The analysis methods are collated from well-cited research works, with a few of our own methods added for good measure. Succinctly put, some of the analysis methods are: keyword extraction, word clouds, predicting declining/stagnant/growing trends using Productivity, tracking bi-grams using Acceleration plots, finding the Semantic Drift of words, tracking trends using similarity, etc. To demonstrate the utility and efficacy of our tool, we perform a case study on the cs.CL corpus of the arXiv repository and draw inferences from the analysis methods. The toolkit and the associated code are available here: https://github.com/rajaswa/DRIFT.
    Code-switched inspired losses for generic spoken dialog representations. (arXiv:2108.12465v2 [cs.CL] UPDATED)
    (2 min) Spoken dialog systems need to be able to handle both multiple languages and multilinguality inside a conversation (\textit{e.g.} in the case of code-switching). In this work, we introduce new pretraining losses tailored to learn multilingual spoken dialog representations. The goal of these losses is to expose the model to code-switched language. To scale up training, we automatically build a pretraining corpus composed of multilingual conversations in five different languages (French, Italian, English, German and Spanish) from \texttt{OpenSubtitles}, a huge multilingual corpus composed of 24.3G tokens. We test the generic representations on \texttt{MIAM}, a new benchmark composed of five dialog act corpora in the same aforementioned languages, as well as on two novel multilingual downstream tasks (\textit{i.e.} multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new code-switched-inspired losses achieve better performance in both monolingual and multilingual settings.
    Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders. (arXiv:2104.08027v2 [cs.CL] UPDATED)
    (2 min) Pretrained Masked Language Models (MLMs) have revolutionised NLP in recent years. However, previous work has indicated that off-the-shelf MLMs are not effective as universal lexical or sentence encoders without further task-specific fine-tuning on NLI, sentence similarity, or paraphrasing tasks using annotated task data. In this work, we demonstrate that it is possible to turn MLMs into effective universal lexical and sentence encoders even without any additional data and without any supervision. We propose an extremely simple, fast and effective contrastive learning technique, termed Mirror-BERT, which converts MLMs (e.g., BERT and RoBERTa) into such encoders in 20-30 seconds without any additional external knowledge. Mirror-BERT relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during identity fine-tuning. We report huge gains over off-the-shelf MLMs with Mirror-BERT in both lexical-level and sentence-level tasks, across different domains and different languages. Notably, in the standard sentence semantic similarity (STS) tasks, our self-supervised Mirror-BERT model even matches the performance of the task-tuned Sentence-BERT models from prior work. Finally, we delve deeper into the inner workings of MLMs, and suggest some evidence on why this simple approach can yield effective universal lexical and sentence encoders.
    Improving Multimodal fusion via Mutual Dependency Maximisation. (arXiv:2109.00922v2 [cs.LG] UPDATED)
    (2 min) Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics. Acknowledging that humans communicate through a variety of channels (i.e., visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, considerable effort has been made on developing complex architectures allowing the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as $L_1$ or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to $4.3$ on accuracy) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: \texttt{CMU-MOSI} and \texttt{CMU-MOSEI}. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our method is a statistical network which can be used to interpret the high-dimensional representations learnt by the model.
    NAREOR: The Narrative Reordering Problem. (arXiv:2104.06669v3 [cs.CL] UPDATED)
    (2 min) Many implicit inferences exist in text depending on how it is structured that can critically impact the text's interpretation and meaning. One such structural aspect present in text with chronology is the order of its presentation. For narratives or stories, this is known as the narrative order. Reordering a narrative can impact the temporal, causal, event-based, and other inferences readers draw from it, which in turn can have strong effects both on its interpretation and interestingness. In this paper, we propose and investigate the task of Narrative Reordering (NAREOR) which involves rewriting a given story in a different narrative order while preserving its plot. We present a dataset, NAREORC, with human rewritings of stories within ROCStories in non-linear orders, and conduct a detailed analysis of it. Further, we propose novel task-specific training methods with suitable evaluation metrics. We perform experiments on NAREORC using state-of-the-art models such as BART and T5 and conduct extensive automatic and human evaluations. We demonstrate that although our models can perform decently, NAREOR is a challenging task with potential for further exploration. We also investigate two applications of NAREOR: generation of more interesting variations of stories and serving as adversarial sets for temporal/event-related tasks, besides discussing other prospective ones, such as for pedagogical setups related to language skills like essay writing and applications to medicine involving clinical narratives.
    Multiplex Graph Neural Network for Extractive Text Summarization. (arXiv:2108.12870v2 [cs.CL] UPDATED)
    (2 min) Extractive text summarization aims at extracting the most representative sentences from a given document as its summary. To extract a good summary from a long text document, sentence embedding plays an important role. Recent studies have leveraged graph neural networks to capture the inter-sentential relationship (e.g., the discourse graph) to learn contextual sentence embeddings. However, those approaches neither consider multiple types of inter-sentential relationships (e.g., semantic similarity & natural connection), nor model intra-sentential relationships (e.g., semantic & syntactic relationships among words). To address these problems, we propose a novel Multiplex Graph Convolutional Network (Multi-GCN) to jointly model different types of relationships among sentences and words. Based on Multi-GCN, we propose a Multiplex Graph Summarization (Multi-GraS) model for extractive text summarization. Finally, we evaluate the proposed models on the CNN/DailyMail benchmark dataset to demonstrate the effectiveness of our method.
    A Three-Stage Learning Framework for Low-Resource Knowledge-Grounded Dialogue Generation. (arXiv:2109.04096v1 [cs.CL])
    (2 min) Neural conversation models have shown great potential for generating fluent and informative responses by introducing external background knowledge. Nevertheless, it is laborious to construct such knowledge-grounded dialogues, and existing models usually perform poorly when transferred to new domains with limited training samples. Therefore, building a knowledge-grounded dialogue system under the low-resource setting remains a crucial issue. In this paper, we propose a novel three-stage learning framework based on weakly supervised learning, which benefits from large-scale ungrounded dialogues and an unstructured knowledge base. To better cooperate with this framework, we devise a Transformer variant with a decoupled decoder, which facilitates the disentangled learning of response generation and knowledge incorporation. Evaluation results on two benchmarks indicate that our approach can outperform other state-of-the-art methods with less training data, and even in the zero-resource scenario, our approach still performs well.
    MetaXT: Meta Cross-Task Transfer between Disparate Label Spaces. (arXiv:2109.04240v1 [cs.LG])
    (2 min) Despite the universal representational power of pre-trained language models, adapting them to a specific NLP task still requires a considerably large amount of labeled data. Effective task fine-tuning meets challenges when only a few labeled examples are present for the task. In this paper, we aim to address the problem of few-shot task learning by exploiting and transferring from a different task which admits a related but disparate label space. Specifically, we devise a label transfer network (LTN) to transform the labels from the source task to the target task of interest for training. Both the LTN and the model for task prediction are learned via a bi-level optimization framework, which we term MetaXT. MetaXT offers a principled solution to best adapt a pre-trained language model to the target task by transferring knowledge from the source task. Empirical evaluations on cross-task transfer settings for four NLP tasks, covering two different types of label space disparities, demonstrate the effectiveness of MetaXT, especially when the labeled data in the target task is limited.
    MapRE: An Effective Semantic Mapping Approach for Low-resource Relation Extraction. (arXiv:2109.04108v1 [cs.CL])
    (2 min) Neural relation extraction models have shown promising results in recent years; however, model performance drops dramatically when only a few training samples are available. Recent works leverage advances in few-shot learning to address this low-resource problem, training label-agnostic models to directly compare the semantic similarities among context sentences in the embedding space. However, the label-aware information, i.e., the relation label that carries the semantic knowledge of the relation itself, is often neglected for prediction. In this work, we propose a framework that considers both label-agnostic and label-aware semantic mapping information for low-resource relation extraction. We show that incorporating these two types of mapping information in both pretraining and fine-tuning can significantly improve model performance on low-resource relation extraction tasks.
    HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints. (arXiv:2109.04443v1 [cs.CL])
    (2 min) Back-translation (BT) of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT), especially for low-resource language pairs. To improve effectiveness of the available BT data, we introduce HintedBT -- a family of techniques which provides hints (through tags) to the encoder and decoder. First, we propose a novel method of using both high and low quality BT data by providing hints (as source tags on the encoder) to the model about the quality of each source-target pair. We don't filter out low quality data but instead show that these hints enable the model to learn effectively from noisy data. Second, we address the problem of predicting whether a source token needs to be translated or transliterated to the target language, which is common in cross-script translation tasks (i.e., where source and target do not share the written script). For such cases, we propose training the model with additional hints (as target tags on the decoder) that provide information about the operation required on the source (translation or both translation and transliteration). We conduct experiments and detailed analyses on standard WMT benchmarks for three cross-script low/medium-resource language pairs: {Hindi,Gujarati,Tamil}-to-English. Our methods compare favorably with five strong and well established baselines. We show that using these hints, both separately and together, significantly improves translation quality and leads to state-of-the-art performance in all three language pairs in corresponding bilingual settings.
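    An illustrative sketch of the source-side quality hints: each back-translated pair is binned by a quality score and the bin is prepended to the source as a tag, so the model can learn from noisy pairs without any filtering. The tag format, number of bins, and toy scores are assumptions; the decoder-side translate-vs-transliterate tags are not shown.

    ```python
    def add_quality_hint(source, quality, n_bins=3):
        """Prepend a quality-bin tag (e.g. <q0> = lowest bin) to a back-translated source sentence."""
        bin_id = min(int(quality * n_bins), n_bins - 1)
        return f"<q{bin_id}> {source}"

    # (source, target, assumed quality score) triples for illustration
    bt_pairs = [("yah kitab acchi hai", "this book is good", 0.92),
                ("vah kal aya tha shayad", "maybe he came yesterday or", 0.31)]
    for src, tgt, score in bt_pairs:
        print(add_quality_hint(src, score), "->", tgt)
    ```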
    TimeTraveler: Reinforcement Learning for Temporal Knowledge Graph Forecasting. (arXiv:2109.04101v1 [cs.LG])
    (2 min) Temporal knowledge graph (TKG) reasoning is a crucial task that has gained increasing research interest in recent years. Most existing methods focus on reasoning at past timestamps to complete missing facts, and only a few works reason on known TKGs to forecast future facts. Compared with the completion task, the forecasting task is more difficult and faces two main challenges: (1) how to effectively model the time information to handle future timestamps? (2) how to make inductive inferences to handle previously unseen entities that emerge over time? To address these challenges, we propose the first reinforcement learning method for forecasting. Specifically, the agent travels on historical knowledge graph snapshots to search for the answer. Our method defines a relative time encoding function to capture the timespan information, and we design a novel time-shaped reward based on the Dirichlet distribution to guide model learning. Furthermore, we propose a novel representation method for unseen entities to improve the inductive inference ability of the model. We evaluate our method on the link prediction task at future timestamps. Extensive experiments on four benchmark datasets demonstrate substantial performance improvements, along with higher explainability, less computation, and fewer parameters compared with existing state-of-the-art methods.
    Debiasing Methods in Natural Language Understanding Make Bias More Accessible. (arXiv:2109.04095v1 [cs.CL])
    (2 min) Model robustness to bias is often determined by the generalization on carefully designed out-of-distribution datasets. Recent debiasing methods in natural language understanding (NLU) improve performance on such datasets by pressuring models into making unbiased predictions. An underlying assumption behind such methods is that this also leads to the discovery of more robust features in the model's inner representations. We propose a general probing-based framework that allows for post-hoc interpretation of biases in language models, and use an information-theoretic approach to measure the extractability of certain biases from the model's representations. We experiment with several NLU datasets and known biases, and show that, counter-intuitively, the more a language model is pushed towards a debiased regime, the more bias is actually encoded in its inner representations.
    Smoothed Contrastive Learning for Unsupervised Sentence Embedding. (arXiv:2109.04321v1 [cs.CL])
    (2 min) Contrastive learning has gradually been applied to learn high-quality unsupervised sentence embeddings. Among previous unsupervised methods, the latest state-of-the-art method, as far as we know, is unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE uses the InfoNCE loss function in the training stage by pulling semantically similar sentences together and pushing apart dissimilar ones. Theoretically, we expect to use larger batches in unsup-SimCSE to get more adequate comparisons among samples and avoid overfitting. However, increasing the batch size does not always lead to improvements and can even cause performance degradation when the batch size exceeds a threshold. Through statistical observation, we find that this is probably due to the introduction of low-confidence negative pairs after increasing the batch size. To alleviate this problem, we introduce a simple smoothing strategy upon the InfoNCE loss function, termed Gaussian Smoothing InfoNCE (GS-InfoNCE). Specifically, we add random Gaussian noise vectors as negative samples, which act as a smoothing of the negative sample space. Though simple, the proposed smoothing strategy brings substantial improvements to unsup-SimCSE. We evaluate GS-InfoNCE on the standard semantic textual similarity (STS) task. GS-InfoNCE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 1.38%, 0.72%, 1.17% and 0.28% on the bases of BERT-base, BERT-large, RoBERTa-base and RoBERTa-large, respectively.
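    A toy PyTorch sketch of the smoothing strategy: the usual in-batch InfoNCE denominator is extended with random Gaussian vectors acting as extra negatives. The embedding size, noise scale, temperature, and random "sentence embeddings" are illustrative assumptions rather than the paper's configuration.

    ```python
    import torch
    import torch.nn.functional as F

    def gs_infonce(z1, z2, n_noise=16, noise_std=1.0, tau=0.05):
        """z1, z2: (batch, dim) embeddings of two views of the same sentences."""
        noise = torch.randn(n_noise, z1.size(1)) * noise_std      # Gaussian smoothing negatives
        z1n, z2n = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        sim_batch = z1n @ z2n.T / tau                              # (batch, batch), diagonal = positives
        sim_noise = z1n @ F.normalize(noise, dim=-1).T / tau       # (batch, n_noise) extra negatives
        logits = torch.cat([sim_batch, sim_noise], dim=1)
        labels = torch.arange(z1.size(0))                          # index of the positive in each row
        return F.cross_entropy(logits, labels)

    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)              # stand-in sentence embeddings
    print(gs_infonce(z1, z2))
    ```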
    Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. (arXiv:2109.04448v1 [cs.CL])
    (2 min) Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.
    Biomedical Question Answering: A Survey of Approaches and Challenges. (arXiv:2102.05281v2 [cs.CL] UPDATED)
    (2 min) Automatic Question Answering (QA) has been successfully applied in various domains such as search engines and chatbots. Biomedical QA (BQA), as an emerging QA task, enables innovative applications to effectively perceive, access and understand complex biomedical knowledge. There have been tremendous developments of BQA in the past two decades, which we classify into 5 distinctive approaches: classic, information retrieval, machine reading comprehension, knowledge base and question entailment approaches. In this survey, we introduce available datasets and representative methods of each BQA approach in detail. Despite the developments, BQA systems are still immature and rarely used in real-life settings. We identify and characterize several key challenges in BQA that might lead to this issue, and discuss some potential future directions to explore.
    PPT: Pre-trained Prompt Tuning for Few-shot Learning. (arXiv:2109.04332v1 [cs.CL])
    (2 min) Prompts for pre-trained language models (PLMs) have shown remarkable performance by bridging the gap between pre-training tasks and various downstream tasks. Among these methods, prompt tuning, which freezes PLMs and only tunes soft prompts, provides an efficient and effective solution for adapting large-scale PLMs to downstream tasks. However, prompt tuning is yet to be fully explored. In our pilot experiments, we find that prompt tuning performs comparably with conventional full-model fine-tuning when downstream data are sufficient, whereas it performs much worse under few-shot learning settings, which may hinder the application of prompt tuning in practice. We attribute this low performance to the manner of initializing soft prompts. Therefore, in this work, we propose to pre-train prompts by adding soft prompts into the pre-training stage to obtain a better initialization. We name this Pre-trained Prompt Tuning framework "PPT". To ensure the generalization of PPT, we formulate similar classification tasks into a unified task form and pre-train soft prompts for this unified task. Extensive experiments show that tuning pre-trained prompts for downstream tasks can reach or even outperform full-model fine-tuning under both full-data and few-shot settings. Our approach is effective and efficient for using large-scale PLMs in practice.
    Multi-granularity Textual Adversarial Attack with Behavior Cloning. (arXiv:2109.04367v1 [cs.CL])
    (2 min) Recently, textual adversarial attack models have become increasingly popular due to their success in estimating the robustness of NLP models. However, existing works have obvious deficiencies. (1) They usually consider only a single granularity of modification strategies (e.g. word-level or sentence-level), which is insufficient to explore the holistic textual space for generation; (2) They need to query victim models hundreds of times to make a successful attack, which is highly inefficient in practice. To address such problems, in this paper we propose MAYA, a Multi-grAnularitY Attack model to effectively generate high-quality adversarial samples with fewer queries to victim models. Furthermore, we propose a reinforcement-learning based method to train a multi-granularity attack agent through behavior cloning with expert knowledge from our MAYA algorithm to further reduce the query times. Additionally, we also adapt the agent to attack black-box models that only output labels without confidence scores. We conduct comprehensive experiments to evaluate our attack models by attacking BiLSTM, BERT and RoBERTa in two different black-box attack settings and three benchmark datasets. Experimental results show that our models achieve overall better attacking performance and produce more fluent and grammatical adversarial samples compared to baseline models. Besides, our adversarial attack agent significantly reduces the query times in both attack settings. Our codes are released at https://github.com/Yangyi-Chen/MAYA.
    Mathematical Word Problem Generation from Commonsense Knowledge Graph and Equations. (arXiv:2010.06196v3 [cs.CL] UPDATED)
    (2 min) There is an increasing interest in the use of mathematical word problem (MWP) generation in educational assessment. Different from standard natural question generation, MWP generation needs to maintain the underlying mathematical operations between quantities and variables, while at the same time ensuring the relevance between the output and the given topic. To address the above problem, we develop an end-to-end neural model to generate diverse MWPs in real-world scenarios from a commonsense knowledge graph and equations. The proposed model (1) learns representations from edge-enhanced Levi graphs of both symbolic equations and commonsense knowledge; (2) automatically fuses equation and commonsense knowledge information via a self-planning module when generating the MWPs. Experiments on an educational gold-standard set and a large-scale generated MWP set show that our approach is superior on the MWP generation task, and it outperforms the SOTA models in terms of both automatic evaluation metrics, i.e., BLEU-4, ROUGE-L, Self-BLEU, and human evaluation metrics, i.e., equation relevance, topic relevance, and language coherence. To encourage reproducible results, we make our code and MWP dataset publicly available at https://github.com/tal-ai/MaKE_EMNLP2021.
    SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. (arXiv:1911.03437v5 [cs.CL] UPDATED)
    (2 min) Transfer learning has fundamentally changed the landscape of natural language processing (NLP) research. Many existing state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely large capacity of pre-trained models, aggressive fine-tuning often causes the adapted model to overfit the data of downstream tasks and forget the knowledge of the pre-trained model. To address the above issue in a more principled manner, we propose a new computational framework for robust and efficient fine-tuning for pre-trained language models. Specifically, our proposed framework contains two important ingredients: 1. Smoothness-inducing regularization, which effectively manages the capacity of the model; 2. Bregman proximal point optimization, which is a class of trust-region methods and can prevent knowledge forgetting. Our experiments demonstrate that our proposed method achieves the state-of-the-art performance on multiple NLP benchmarks.
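    The smoothness-inducing ingredient can be sketched as a regularizer that penalizes divergence between the model's predictions on clean and slightly perturbed input embeddings. The sketch below uses a random perturbation and omits SMART's adversarial inner maximization and the Bregman proximal point step; `model` is assumed to be a Hugging Face-style sequence classifier that accepts inputs_embeds and returns logits.

    ```python
    # Illustrative sketch of smoothness-inducing regularization: penalize the
    # divergence between predictions on clean and perturbed embeddings. This is
    # only the regularizer; SMART's adversarial perturbation search and Bregman
    # proximal point optimization are not shown.
    import torch
    import torch.nn.functional as F

    def smoothness_loss(model, inputs_embeds, attention_mask, eps=1e-3):
        with torch.no_grad():
            clean_logits = model(inputs_embeds=inputs_embeds,
                                 attention_mask=attention_mask).logits
        noise = torch.randn_like(inputs_embeds) * eps        # small random perturbation
        pert_logits = model(inputs_embeds=inputs_embeds + noise,
                            attention_mask=attention_mask).logits
        # symmetric KL between clean and perturbed predictive distributions
        p = F.log_softmax(pert_logits, dim=-1)
        q = F.log_softmax(clean_logits, dim=-1)
        return (F.kl_div(p, q.exp(), reduction="batchmean")
                + F.kl_div(q, p.exp(), reduction="batchmean"))

    # total_loss = task_loss + lambda_s * smoothness_loss(model, emb, mask)
    ```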
    ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization. (arXiv:2109.04098v1 [cs.CL])
    (2 min) Abstractive text summarization is one of the areas influenced by the emergence of pre-trained language models. Current pre-training objectives for abstractive summarization reward summaries that share more words with the main text and pay less attention to the semantic similarity between generated sentences and the original document. We propose ARMAN, a Transformer-based encoder-decoder model pre-trained with three novel objectives to address this issue. In ARMAN, salient sentences from a document are selected according to a modified semantic score to be masked and form a pseudo summary. To summarize more accurately and more similarly to human writing patterns, we applied modified sentence reordering. We evaluated our proposed models on six downstream Persian summarization tasks. Experimental results show that our proposed model achieves state-of-the-art performance on all six summarization tasks measured by ROUGE and BERTScore. Our models also outperform prior works in textual entailment, question paraphrasing, and multiple choice question answering. Finally, we conducted a human evaluation and show that using the semantic score significantly improves summarization results.
    Editing Factual Knowledge in Language Models. (arXiv:2104.08164v2 [cs.CL] UPDATED)
    (2 min) The factual knowledge acquired during pre-training and stored in the parameters of Language Models (LMs) can be useful in downstream tasks (e.g., question answering or textual inference). However, some facts can be incorrectly induced or become obsolete over time. We present KnowledgeEditor, a method which can be used to edit this knowledge and, thus, fix 'bugs' or unexpected predictions without the need for expensive re-training or fine-tuning. Besides being computationally efficient, KnowledgeEditor does not require any modifications in LM pre-training (e.g., the use of meta-learning). In our approach, we train a hyper-network with constrained optimization to modify a fact without affecting the rest of the knowledge; the trained hyper-network is then used to predict the weight update at test time. We show KnowledgeEditor's efficacy with two popular architectures and knowledge-intensive tasks: i) a BERT model fine-tuned for fact-checking, and ii) a sequence-to-sequence BART model for question answering. With our method, changing a prediction on the specific wording of a query tends to result in a consistent change in predictions also for its paraphrases. We show that this can be further encouraged by exploiting (e.g., automatically-generated) paraphrases during training. Interestingly, our hyper-network can be regarded as a 'probe' revealing which components need to be changed to manipulate factual knowledge; our analysis shows that the updates tend to be concentrated on a small subset of components. Source code is available at https://github.com/nicola-decao/KnowledgeEditor.
    Mitigating False-Negative Contexts in Multi-document Question Answering with Retrieval Marginalization. (arXiv:2103.12235v2 [cs.CL] UPDATED)
    (2 min) Question Answering (QA) tasks requiring information from multiple documents often rely on a retrieval model to identify relevant information for reasoning. The retrieval model is typically trained to maximize the likelihood of the labeled supporting evidence. However, when retrieving from large text corpora such as Wikipedia, the correct answer can often be obtained from multiple evidence candidates. Moreover, not all such candidates are labeled as positive during annotation, rendering the training signal weak and noisy. This problem is exacerbated when the questions are unanswerable or when the answers are Boolean, since the model cannot rely on lexical overlap to make a connection between the answer and supporting evidence. We develop a new parameterization of set-valued retrieval that handles unanswerable queries, and we show that marginalizing over this set during training allows a model to mitigate false negatives in supporting evidence annotations. We test our method on two multi-document QA datasets, IIRC and HotpotQA. On IIRC, we show that joint modeling with marginalization improves model performance by 5.5 F1 points and achieves a new state-of-the-art performance of 50.5 F1. We also show that retrieval marginalization results in 4.1 QA F1 improvement over a non-marginalized baseline on HotpotQA in the fullwiki setting.
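    In generic form, marginalizing over retrieved evidence replaces the likelihood of a single labeled evidence set with a sum over candidate evidence weighted by retrieval scores. The sketch below shows only this generic objective; the paper's set-valued retrieval parameterization and its handling of unanswerable questions are not reproduced.

    ```python
    # Generic marginalized retrieval objective: instead of maximizing the
    # likelihood of one labeled evidence set, sum the answer likelihood over
    # candidate evidence, weighted by the retriever's scores.
    import torch

    def marginalized_nll(retrieval_logits, answer_log_probs):
        """retrieval_logits: (N,) scores for N evidence candidates;
        answer_log_probs: (N,) log p(answer | question, evidence_i)."""
        log_p_e = torch.log_softmax(retrieval_logits, dim=-1)
        log_p_answer = torch.logsumexp(log_p_e + answer_log_probs, dim=-1)
        return -log_p_answer   # minimize negative marginal log-likelihood
    ```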
    Enhanced Speaker-aware Multi-party Multi-turn Dialogue Comprehension. (arXiv:2109.04066v1 [cs.CL])
    (2 min) Multi-party multi-turn dialogue comprehension brings unprecedented challenges in handling complicated scenarios with multiple speakers and criss-crossed discourse relationships among speaker-aware utterances. Most existing methods treat dialogue contexts as plain texts and pay insufficient attention to crucial speaker-aware clues. In this work, we propose an enhanced speaker-aware model with masking attention and heterogeneous graph networks to comprehensively capture discourse clues from both speaker properties and speaker-aware relationships. With such comprehensive speaker-aware modeling, experimental results show that our speaker-aware model helps achieve state-of-the-art performance on the benchmark dataset Molweni. Case analysis shows that our model enhances the connections between utterances and their own speakers and captures the speaker-aware discourse relations, which are critical for dialogue modeling.
    ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding. (arXiv:2109.04380v1 [cs.CL])
    (2 min) Contrastive learning has been attracting much attention for learning unsupervised sentence embeddings. The current state-of-the-art unsupervised method is unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE uses dropout as a minimal data augmentation method and passes the same input sentence to a pre-trained Transformer encoder (with dropout turned on) twice to obtain the two corresponding embeddings that form a positive pair. Because the length of a sentence is generally encoded into its embedding through the position embeddings in the Transformer, each positive pair in unsup-SimCSE carries identical length information. Unsup-SimCSE trained with such positive pairs is therefore likely biased toward judging sentences of the same or similar length as more semantically similar. Through statistical observations, we find that unsup-SimCSE does have such a problem. To alleviate it, we apply a simple repetition operation to modify the input sentence, and then pass the input sentence and its modified counterpart to the pre-trained Transformer encoder, respectively, to obtain the positive pair. Additionally, we draw inspiration from the computer vision community and introduce momentum contrast, enlarging the number of negative pairs without additional calculations. The two proposed modifications are applied to positive and negative pairs respectively, and together they form a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE). We evaluate ESimCSE on several benchmark datasets for the semantic textual similarity (STS) task. Experimental results show that ESimCSE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on BERT-base.
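    The positive-pair construction can be sketched as follows: the positive view is the same sentence with a small fraction of words repeated, so the two views no longer share identical length, and the pair is trained with a standard in-batch contrastive loss. The duplication rate and temperature below are illustrative, and the momentum-contrast queue of extra negatives is omitted.

    ```python
    # Rough sketch of ESimCSE-style positive pairs: word repetition as the
    # augmentation, plus an in-batch InfoNCE loss. The momentum queue and the
    # exact hyperparameters from the paper are not reproduced here.
    import random
    import torch
    import torch.nn.functional as F

    def word_repetition(sentence, dup_rate=0.2):
        words = sentence.split()
        if not words:
            return sentence
        n_dup = max(1, int(len(words) * dup_rate))
        dup_idx = set(random.sample(range(len(words)), n_dup))
        out = []
        for i, w in enumerate(words):
            out.append(w)
            if i in dup_idx:
                out.append(w)              # repeat the sampled word once
        return " ".join(out)

    def info_nce(z1, z2, temperature=0.05):
        """z1: embeddings of the original sentences; z2: of the repeated views."""
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        sim = z1 @ z2.t() / temperature    # (B, B) similarity matrix
        labels = torch.arange(z1.size(0))  # positives sit on the diagonal
        return F.cross_entropy(sim, labels)
    ```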
    Fixing exposure bias with imitation learning needs powerful oracles. (arXiv:2109.04114v1 [cs.CL])
    (2 min) We apply imitation learning (IL) to tackle the NMT exposure bias problem with error-correcting oracles, and evaluate an SMT lattice-based oracle which, despite its excellent performance in an unconstrained oracle translation task, turned out to be too pruned and idiosyncratic to serve as the oracle for IL.
    Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring. (arXiv:2109.04411v1 [eess.AS])
    (2 min) This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. End-to-end speech translation models have several advantages over traditional cascade systems, such as reduced inference latency. However, conventional autoregressive (AR) decoding methods are not fast enough because each token is generated incrementally. NAR models, in contrast, can accelerate decoding by generating multiple tokens in parallel on the basis of a token-wise conditional independence assumption. We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of a shared encoder. The auxiliary shallow AR decoder selects the best hypothesis by rescoring multiple candidates generated from the NAR decoder in parallel (parallel AR rescoring). We adopt a conditional masked language model (CMLM) and a connectionist temporal classification (CTC)-based model as NAR decoders for Orthros, referred to as Orthros-CMLM and Orthros-CTC, respectively. We also propose two training methods to enhance the CMLM decoder. Experimental evaluations on three benchmark datasets with six language directions demonstrated that Orthros achieved large improvements in translation quality with very small overhead compared with the baseline NAR model. Moreover, the Conformer encoder architecture enabled large quality improvements, especially for CTC-based models. Orthros-CTC with the Conformer encoder increased decoding speed by 3.63x on CPU with translation quality comparable to that of an AR model.
    Cartography Active Learning. (arXiv:2109.04282v1 [cs.CL])
    (2 min) We propose Cartography Active Learning (CAL), a novel Active Learning (AL) algorithm that exploits the behavior of the model on individual instances during training as a proxy to find the most informative instances for labeling. CAL is inspired by data maps, which were recently proposed to derive insights into dataset quality (Swayamdipta et al., 2020). We compare our method on popular text classification tasks to commonly used AL strategies, which instead rely on post-training behavior. We demonstrate that CAL is competitive to other common AL methods, showing that training dynamics derived from small seed data can be successfully used for AL. We provide insights into our new AL method by analyzing batch-level statistics utilizing the data maps. Our results further show that CAL results in a more data-efficient learning strategy, achieving comparable or better results with considerably less training data.
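    The training-dynamics statistics that CAL builds on can be sketched directly from data maps: for each example, record the model probability of the gold label at every epoch, then summarize with confidence (mean) and variability (standard deviation). How CAL turns these statistics into an acquisition score is simplified below to "prefer ambiguous, low-confidence points", which is an illustrative stand-in rather than the paper's exact criterion.

    ```python
    # Sketch of data-map statistics (Swayamdipta et al., 2020) and a toy
    # acquisition rule built on them. The scoring rule is illustrative only.
    import numpy as np

    def data_map_stats(epoch_gold_probs):
        """epoch_gold_probs: array of shape (n_epochs, n_examples) holding the
        model probability of each example's gold label at each epoch."""
        probs = np.asarray(epoch_gold_probs)
        confidence = probs.mean(axis=0)          # high = easy-to-learn
        variability = probs.std(axis=0)          # high = ambiguous
        return confidence, variability

    def pick_for_labeling(confidence, variability, k=100):
        # Illustrative acquisition: favour low-confidence / high-variability
        # pool points, which data maps associate with informative examples.
        score = variability - confidence
        return np.argsort(-score)[:k]
    ```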
    Learning Opinion Summarizers by Selecting Informative Reviews. (arXiv:2109.04325v1 [cs.CL])
    (2 min) Opinion summarization has been traditionally approached with unsupervised, weakly-supervised and few-shot learning techniques. In this work, we collect a large dataset of summaries paired with user reviews for over 31,000 products, enabling supervised training. However, the number of reviews per product is large (320 on average), making summarization - and especially training a summarizer - impractical. Moreover, the content of many reviews is not reflected in the human-written summaries, and, thus, the summarizer trained on random review subsets hallucinates. In order to deal with both of these challenges, we formulate the task as jointly learning to select informative subsets of reviews and summarizing the opinions expressed in these subsets. The choice of the review subset is treated as a latent variable, predicted by a small and simple selector. The subset is then fed into a more powerful summarizer. For joint training, we use amortized variational inference and policy gradient methods. Our experiments demonstrate the importance of selecting informative reviews resulting in improved quality of summaries and reduced hallucinations.
    Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification. (arXiv:2109.04385v1 [cs.CL])
    (2 min) Natural language processing models are generally considered vulnerable to adversarial attacks; however, recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of whether valid attacks are actually feasible. In this work, we investigate this through the lens of human language ability. We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text, while receiving immediate model feedback, with the aim of causing a sentiment classification model to misclassify the example. Our findings suggest that humans are capable of generating a substantial number of adversarial examples using semantics-preserving word substitutions. We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms along the dimensions of naturalness, preservation of sentiment, grammaticality and substitution rate. Our findings suggest that humans are not more able than the best algorithms to generate natural-reading, sentiment-preserving examples, though the algorithms achieve this with far greater computational efficiency.
    Understanding Few-Shot Commonsense Knowledge Models. (arXiv:2101.00297v2 [cs.CL] UPDATED)
    (2 min) Providing natural language processing systems with commonsense knowledge is a critical challenge for achieving language understanding. Recently, commonsense knowledge models have emerged as a suitable approach for hypothesizing situation-relevant commonsense knowledge on-demand in natural language applications. However, these systems are limited by the fixed set of relations captured by schemas of the knowledge bases on which they're trained. To address this limitation, we investigate training commonsense knowledge models in a few-shot setting with limited tuples per commonsense relation in the graph. We perform five separate studies on different dimensions of few-shot commonsense knowledge learning, providing a roadmap on best practices for training these systems efficiently. Importantly, we find that human quality ratings for knowledge produced from a few-shot trained system can achieve performance within 6% of knowledge produced from fully supervised systems. This few-shot performance enables coverage of a wide breadth of relations in future commonsense systems.
    Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with Synthetic Data. (arXiv:2109.04319v1 [cs.CL])
    (2 min) While multilingual pretrained language models (LMs) fine-tuned on a single language have shown substantial cross-lingual task transfer capabilities, there is still a wide performance gap in semantic parsing tasks when target language supervision is available. In this paper, we propose a novel Translate-and-Fill (TaF) method to produce silver training data for a multilingual semantic parser. This method simplifies the popular Translate-Align-Project (TAP) pipeline and consists of a sequence-to-sequence filler model that constructs a full parse conditioned on an utterance and a view of the same parse. Our filler is trained on English data only but can accurately complete instances in other languages (i.e., translations of the English training utterances), in a zero-shot fashion. Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems which rely on traditional alignment techniques.
    What's Hidden in a One-layer Randomly Weighted Transformer?. (arXiv:2109.03939v1 [cs.CL])
    (2 min) We demonstrate that, hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance on machine translation tasks, without ever modifying the weight initializations. To find subnetworks for one-layer randomly weighted neural networks, we apply different binary masks to the same weight matrix to generate different layers. Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14. Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14. Furthermore, we demonstrate the effectiveness of larger and deeper transformers in this setting, as well as the impact of different initialization methods. We release the source code at https://github.com/sIncerass/one_layer_lottery_ticket.
    Competence-based Curriculum Learning for Multilingual Machine Translation. (arXiv:2109.04002v1 [cs.CL])
    (2 min) Multilingual machine translation is receiving more and more attention since it brings better performance for low-resource languages (LRLs) and saves space. However, existing multilingual machine translation models face a severe challenge: imbalance. As a result, the translation performance of different languages in multilingual translation models varies considerably. We argue that this imbalance problem stems from the different learning competencies of different languages. Therefore, we focus on balancing the learning competencies of different languages and propose Competence-based Curriculum Learning for Multilingual Machine Translation, named CCL-M. Specifically, we first define two competencies to help schedule the high-resource languages (HRLs) and the low-resource languages: 1) Self-evaluated Competence, evaluating how well the language itself has been learned; and 2) HRLs-evaluated Competence, evaluating whether an LRL is ready to be learned according to HRLs' Self-evaluated Competence. Based on the above competencies, we utilize the proposed CCL-M algorithm to gradually add new languages into the training set in a curriculum learning manner. Furthermore, we propose a novel competence-aware dynamic balancing sampling strategy for better selecting training samples in multilingual training. Experimental results show that our approach achieves a steady and significant performance gain compared to the previous state-of-the-art approach on the TED talks dataset.
    Efficient Nearest Neighbor Language Models. (arXiv:2109.04212v1 [cs.CL])
    (2 min) Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore, which allows them to learn through explicitly memorizing the training datapoints. While effective, these models often require retrieval from a large datastore at test time, significantly increasing the inference overhead and thus limiting the deployment of non-parametric NLMs in practical applications. In this paper, we take the recently proposed $k$-nearest neighbors language model (Khandelwal et al., 2019) as an example, exploring methods to improve its efficiency along various dimensions. Experiments on the standard WikiText-103 benchmark and domain-adaptation datasets show that our methods are able to achieve up to a 6x speed-up in inference speed while retaining comparable performance. The empirical analysis we present may provide guidelines for future research seeking to develop or deploy more efficient non-parametric NLMs.
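    The underlying kNN-LM mechanism (Khandelwal et al., 2019) that this efficiency work starts from can be sketched compactly: retrieve the k nearest datastore entries to the current hidden state, turn their negative distances into a distribution over the stored next tokens, and interpolate with the base LM. A real system would use an approximate index such as FAISS; the brute-force search, interpolation weight, and temperature below are illustrative.

    ```python
    # Minimal kNN-LM interpolation sketch: p = lam * p_kNN + (1 - lam) * p_LM,
    # with p_kNN built from softmax-weighted nearest neighbours in a datastore
    # of (hidden state -> next token) pairs. Brute-force search for clarity.
    import numpy as np

    def knn_lm_probs(hidden, lm_probs, keys, values, k=8, lam=0.25, temp=1.0):
        """hidden: (H,) query state; keys: (N, H) datastore keys;
        values: (N,) stored next-token ids; lm_probs: (V,) base LM distribution."""
        dists = np.linalg.norm(keys - hidden, axis=1)          # L2 to every key
        nn_idx = np.argsort(dists)[:k]
        weights = np.exp(-dists[nn_idx] / temp)
        weights /= weights.sum()
        knn_probs = np.zeros_like(lm_probs)
        np.add.at(knn_probs, values[nn_idx], weights)           # aggregate by token id
        return lam * knn_probs + (1.0 - lam) * lm_probs
    ```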
    Flexible Generation of Natural Language Deductions. (arXiv:2104.08825v2 [cs.CL] UPDATED)
    (2 min) An interpretable system for open-domain reasoning needs to express its reasoning process in a transparent form. Natural language is an attractive representation for this purpose -- it is both highly expressive and easy for humans to understand. However, manipulating natural language statements in logically consistent ways is hard: models must cope with variation in how meaning is expressed while remaining precise. In this paper, we describe ParaPattern, a method for building models to generate deductive inferences from diverse natural language inputs without direct human supervision. We train BART-based models (Lewis et al., 2020) to generate the result of applying a particular logical operation to one or more premise statements. Crucially, we develop a largely automated pipeline for constructing suitable training examples from Wikipedia. We evaluate our models using out-of-domain sentence compositions from the QASC (Khot et al., 2020) and EntailmentBank (Dalvi et al., 2021) datasets as well as targeted perturbation sets. Our results show that our models are substantially more accurate and flexible than baseline systems. ParaPattern achieves 85% validity on examples of the 'substitution' operation from EntailmentBank without the use of any in-domain training data, matching the performance of a model fine-tuned for EntailmentBank. The full source code for our method is publicly available.
    Position Information in Transformers: An Overview. (arXiv:2102.11090v2 [cs.CL] UPDATED)
    (2 min) Transformers are arguably the main workhorse in recent Natural Language Processing research. By definition, a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this article, we provide an overview and theoretical comparison of existing methods to incorporate position information into Transformer models. The objectives of this survey are to (1) showcase that position information in Transformers is a vibrant and extensive research area; (2) enable the reader to compare existing methods by providing a unified notation and systematization of different approaches along important model dimensions; (3) indicate what characteristics of an application should be taken into account when selecting a position encoding; and (4) provide stimuli for future research.
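    One concrete instance of the position-encoding families such a survey covers is the original fixed sinusoidal absolute encoding, which is simply added to the token embeddings; learned absolute and relative schemes differ in detail but slot into the same place in the model. A small sketch:

    ```python
    # Sinusoidal absolute positional encoding (as in the original Transformer):
    # even dimensions use sine, odd dimensions use cosine, with wavelengths
    # forming a geometric progression. The result is added to token embeddings.
    import numpy as np

    def sinusoidal_position_encoding(max_len, d_model):
        positions = np.arange(max_len)[:, None]                     # (L, 1)
        dims = np.arange(d_model)[None, :]                          # (1, D)
        angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])                       # even dims
        pe[:, 1::2] = np.cos(angles[:, 1::2])                       # odd dims
        return pe                                                    # add to embeddings
    ```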
    Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining. (arXiv:2109.04080v1 [cs.CL])
    (2 min) With the rapid increase in the volume of dialogue data from daily life, there is a growing demand for dialogue summarization. Unfortunately, training a large summarization model is generally infeasible due to the inadequacy of dialogue data with annotated summaries. Most existing works for low-resource dialogue summarization directly pretrain models in other domains, e.g., the news domain, but they generally neglect the huge difference between dialogues and conventional articles. To bridge the gap between out-of-domain pretraining and in-domain fine-tuning, in this work, we propose a multi-source pretraining paradigm to better leverage the external summary data. Specifically, we exploit large-scale in-domain non-summary data to separately pretrain the dialogue encoder and the summary decoder. The combined encoder-decoder model is then pretrained on the out-of-domain summary data using adversarial critics, aiming to facilitate domain-agnostic summarization. The experimental results on two public datasets show that with only limited training data, our approach achieves competitive performance and generalizes well in different dialogue scenarios.
    KELM: Knowledge Enhanced Pre-Trained Language Representations with Message Passing on Hierarchical Relational Graphs. (arXiv:2109.04223v1 [cs.CL])
    (2 min) Incorporating factual knowledge into pre-trained language models (PLMs) such as BERT is an emerging trend in recent NLP studies. However, most of the existing methods combine the external knowledge integration module with a modified pre-training loss and re-implement the pre-training process on a large-scale corpus. Re-pretraining these models is usually resource-consuming and difficult to adapt to another domain with a different knowledge graph (KG). Besides, those works either cannot embed knowledge context dynamically according to textual context or struggle with the knowledge ambiguity issue. In this paper, we propose a novel knowledge-aware language model framework based on the fine-tuning process, which equips PLMs with a unified knowledge-enhanced text graph that contains both text and multi-relational sub-graphs extracted from the KG. We design a hierarchical relational-graph-based message passing mechanism, which allows the representations of the injected KG and text to mutually update each other and can dynamically select ambiguous mentioned entities that share the same text. Our empirical results show that our model can efficiently incorporate world knowledge from KGs into existing language models such as BERT, and achieves significant improvement on the machine reading comprehension (MRC) task compared with other knowledge-enhanced models.
    Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. (arXiv:2103.00453v2 [cs.CL] UPDATED)
    (2 min) When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we first demonstrate a surprising finding: pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce. We refer to this capability as self-diagnosis. Based on this finding, we then propose a decoding algorithm that, given only a textual description of the undesired behavior, reduces the probability of a language model producing problematic text. We refer to this approach as self-debiasing. Self-debiasing does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. While we by no means eliminate the issue of language models generating biased text, we believe our approach to be an important step in this direction.
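    A rough sketch in the spirit of such decoding-time debiasing: compare the model's next-token distribution for the plain prompt with the distribution obtained when a textual description of the undesired behavior is prepended, then scale down tokens that the undesired-behavior prompt makes relatively more likely. The exact scaling function used in the paper differs; the version below is only an illustration of the idea.

    ```python
    # Illustrative decoding-time debiasing step: suppress tokens whose
    # probability increases when a description of the undesired behaviour is
    # prepended to the prompt. Not the paper's exact scaling scheme.
    import numpy as np

    def debiased_next_token_probs(p_plain, p_undesired, decay=10.0):
        """p_plain, p_undesired: next-token distributions of shape (V,)."""
        delta = p_plain - p_undesired            # negative where the bias prompt wins
        scale = np.where(delta < 0, np.exp(decay * delta), 1.0)
        p = p_plain * scale                       # suppress bias-favoured tokens
        return p / p.sum()
    ```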
    On the Linguistic Capacity of Real-Time Counter Automata. (arXiv:2004.06866v2 [cs.CL] UPDATED)
    (2 min) Counter machines have achieved a newfound relevance to the field of natural language processing (NLP): recent work suggests some strong-performing recurrent neural networks utilize their memory as counters. Thus, one potential way to understand the success of these networks is to revisit the theory of counter computation. Therefore, we study the abilities of real-time counter machines as formal grammars, focusing on formal properties that are relevant for NLP models. We first show that several variants of the counter machine converge to express the same class of formal languages. We also prove that counter languages are closed under complement, union, intersection, and many other common set operations. Next, we show that counter machines cannot evaluate boolean expressions, even though they can weakly validate their syntax. This has implications for the interpretability and evaluation of neural network systems: successfully matching syntactic patterns does not guarantee that counter memory accurately encodes compositional semantics. Finally, we consider whether counter languages are semilinear. This work makes general contributions to the theory of formal languages that are of potential interest for understanding recurrent neural networks.
    SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation. (arXiv:2108.04556v3 [cs.CL] UPDATED)
    (2 min) Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically consider the source code as a plain sequence of tokens, or inject the structure information (e.g., AST and data-flow) into the sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specifically, we design two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which are designed to predict identifiers and edges between two AST nodes, respectively. Meanwhile, to exploit the complementary information in semantically equivalent modalities (i.e., code, comment, AST) of the code, we propose a multi-modal contrastive learning strategy to maximize the mutual information among different modalities. Extensive experiments on four downstream tasks related to code intelligence show that SynCoBERT advances the state-of-the-art with the same pre-training corpus and model size.
    Learning from Uneven Training Data: Unlabeled, Single Label, and Multiple Labels. (arXiv:2109.04408v1 [cs.CL])
    (2 min) Training NLP systems typically assumes access to annotated data that has a single human label per example. Given imperfect labeling from annotators and the inherent ambiguity of language, we hypothesize that a single label is not sufficient to learn the spectrum of language interpretation. We explore new label annotation distribution schemes, assigning multiple labels per example for a small subset of training examples. Introducing such multi-label examples at the cost of annotating fewer examples brings clear gains on natural language inference and entity typing tasks, even when we simply first train with single-label data and then fine-tune with multi-label examples. Extending the MixUp data augmentation framework, we propose a learning algorithm that can learn from uneven training examples (with zero, one, or multiple labels). This algorithm efficiently combines signals from uneven training data and brings additional gains in low-annotation-budget and cross-domain settings. Together, our method achieves consistent gains in both accuracy and label distribution metrics on two tasks, suggesting that training with uneven data can be beneficial for many NLP tasks.
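    The MixUp building block this approach extends can be sketched for examples that carry label distributions, so that a multi-label-annotated example and a single-label example can be mixed: interpolate both the input representations and the soft label vectors. This is plain MixUp under those assumptions, not the paper's full scheme for zero-label examples.

    ```python
    # Generic MixUp sketch over sentence representations and soft label vectors.
    # The Beta parameter and the use of pooled features are illustrative choices.
    import torch

    def mixup(features, label_dists, alpha=0.4):
        """features: (B, H) sentence representations; label_dists: (B, C) soft labels."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(features.size(0))
        mixed_x = lam * features + (1 - lam) * features[perm]
        mixed_y = lam * label_dists + (1 - lam) * label_dists[perm]
        return mixed_x, mixed_y   # train with cross-entropy against mixed_y
    ```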
    NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions. (arXiv:2010.12758v2 [cs.CL] UPDATED)
    (2 min) Existing conversational systems are mostly agent-centric, assuming that user utterances will closely follow the system ontology (for NLU or dialogue state tracking). However, in real-world scenarios, it is highly desirable that users can speak freely in their own way. It is extremely hard, if not impossible, for users to adapt to an unknown system ontology. In this work, we attempt to build a user-centric dialogue system. As there is no clean mapping from a user's free-form utterance to an ontology, we first model the user preferences as estimated distributions over the system ontology and map the users' utterances to such distributions. Learning such a mapping poses new challenges on reasoning over existing knowledge, ranging from factoid knowledge and commonsense knowledge to the users' own situations. To this end, we build a new dataset named NUANCED that focuses on such realistic settings for conversational recommendation. Collected via dialogue simulation and paraphrasing, NUANCED contains 5.1k dialogues and 26k turns of high-quality user responses. We conduct experiments showing both the usefulness and the challenges of our problem setting. We believe NUANCED can serve as a valuable resource to push existing research from agent-centric systems to user-centric systems. The code and data are publicly available at https://github.com/facebookresearch/nuanced.
    Thinking Clearly, Talking Fast: Concept-Guided Non-Autoregressive Generation for Open-Domain Dialogue Systems. (arXiv:2109.04084v1 [cs.CL])
    (2 min) Human dialogue contains evolving concepts, and speakers naturally associate multiple concepts to compose a response. However, current dialogue models with the seq2seq framework lack the ability to effectively manage concept transitions and can hardly introduce multiple concepts to responses in a sequential decoding manner. To facilitate a controllable and coherent dialogue, in this work, we devise a concept-guided non-autoregressive model (CG-nAR) for open-domain dialogue generation. The proposed model comprises a multi-concept planning module that learns to identify multiple associated concepts from a concept graph and a customized Insertion Transformer that performs concept-guided non-autoregressive generation to complete a response. The experimental results on two public datasets show that CG-nAR can produce diverse and coherent responses, outperforming state-of-the-art baselines in both automatic and human evaluations with substantially faster inference speed.
    Fusing task-oriented and open-domain dialogues in conversational agents. (arXiv:2109.04137v1 [cs.CL])
    (2 min) The goal of building intelligent dialogue systems has largely been pursued separately under two paradigms: task-oriented dialogue (TOD) systems, which perform goal-oriented functions, and open-domain dialogue (ODD) systems, which focus on non-goal-oriented chitchat. The two dialogue modes can potentially be intertwined seamlessly in the same conversation, as easily done by a friendly human assistant. Such ability is desirable in conversational agents, as the integration makes them more accessible and useful. Our paper addresses this problem of fusing TODs and ODDs in multi-turn dialogues. Based on the popular TOD dataset MultiWOZ, we build a new dataset, FusedChat, by rewriting the existing TOD turns and adding new ODD turns. This procedure constructs conversation sessions containing exchanges from both dialogue modes. It features inter-mode contextual dependency, i.e., the dialogue turns from the two modes depend on each other. Rich dependency patterns, including co-reference and ellipsis, are featured. The new dataset, with 60k new human-written ODD turns and 5k re-written TOD turns, offers a benchmark to test a dialogue model's ability to perform inter-mode conversations. This is a more challenging task since the model has to determine the appropriate dialogue mode and generate the response based on the inter-mode context, but such models would better mimic human-level conversation capabilities. We evaluate baseline models on this task, including classification-based two-stage models and two-in-one fused models. We publicly release FusedChat and the baselines to propel future work on inter-mode dialogue systems: https://github.com/tomyoung903/FusedChat.
    Consistent Accelerated Inference via Confident Adaptive Transformers. (arXiv:2104.08803v2 [cs.CL] UPDATED)
    (2 min) We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs -- Confident Adaptive Transformers -- in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.
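    The mechanism CATs builds on can be sketched as generic early-exit inference: prediction heads attached to intermediate layers, with computation stopped once an exit criterion fires. CATs calibrates the stopping rule with conformal prediction and a meta consistency classifier; in the sketch below a plain entropy threshold stands in as a placeholder, and the layer and head modules are assumed generic.

    ```python
    # Generic early-exit sketch: run layers one at a time, predict from each
    # intermediate state, and stop when the prediction looks confident enough.
    # The entropy threshold is a stand-in for CATs' calibrated stopping rule.
    import torch
    import torch.nn.functional as F

    def early_exit_forward(layers, heads, hidden, threshold=0.3):
        """layers/heads: lists of per-layer modules; hidden: (B, L, H) states."""
        logits = None
        for layer, head in zip(layers, heads):
            hidden = layer(hidden)
            logits = head(hidden[:, 0])                       # predict from first token
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
            if entropy < threshold:                           # confident enough: stop
                return logits
        return logits                                          # fell through all layers
    ```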
    Analysis of Language Change in Collaborative Instruction Following. (arXiv:2109.04452v1 [cs.CL])
    (2 min) We analyze language change over time in a collaborative, goal-oriented instructional task, where utility-maximizing participants form conventions and increase their expertise. Prior work studied such scenarios mostly in the context of reference games, and consistently found that language complexity is reduced along multiple dimensions, such as utterance length, as conventions are formed. In contrast, we find that, given the ability to increase instruction utility, instructors increase language complexity along these previously studied dimensions to better collaborate with increasingly skilled instruction followers.
    Time-Aware Evidence Ranking for Fact-Checking. (arXiv:2009.06402v4 [cs.CL] UPDATED)
    (2 min) Truth can vary over time. Fact-checking decisions on claim veracity should therefore take into account temporal information of both the claim and supporting or refuting evidence. In this work, we investigate the hypothesis that the timestamp of a Web page is crucial to how it should be ranked for a given claim. We delineate four temporal ranking methods that constrain evidence ranking differently and simulate hypothesis-specific evidence rankings given the evidence timestamps as gold standard. Evidence ranking in three fact-checking models is ultimately optimized using a learning-to-rank loss function. Our study reveals that time-aware evidence ranking not only surpasses relevance assumptions based purely on semantic similarity or position in a search results list, but also improves veracity predictions of time-sensitive claims in particular.
    Optimal Embedding Calibration for Symbolic Music Similarity. (arXiv:2103.07656v2 [cs.SD] UPDATED)
    (2 min) In natural language processing (NLP), the semantic similarity task requires large-scale, high-quality human-annotated labels for fine-tuning or evaluation. By contrast, in cases of music similarity, such labels are expensive to collect and largely dependent on the annotator's artistic preferences. Recent research has demonstrated that embedding calibration techniques can greatly increase the semantic similarity performance of pre-trained language models without fine-tuning. However, it is yet unknown which calibration method is best and how much performance improvement can be achieved. To address these issues, we propose using composer information to construct labels for automatically evaluating music similarity. Under this paradigm, we discover the optimal combination of embedding calibration methods, which achieves better metrics than the baseline methods.
    Transformers in the loop: Polarity in neural models of language. (arXiv:2109.03926v1 [cs.CL])
    (2 min) Representation of linguistic phenomena in computational language models is typically assessed against the predictions of existing linguistic theories of these phenomena. Using the notion of polarity as a case study, we show that this is not always the most adequate set-up. We probe polarity via so-called 'negative polarity items' (in particular, English 'any') in two pre-trained Transformer-based models (BERT and GPT-2). We show that -- at least for polarity -- metrics derived from language models are more consistent with data from psycholinguistic experiments than linguistic theory predictions. Establishing this allows us to more adequately evaluate the performance of language models and also to use language models to discover new insights into natural language grammar beyond existing linguistic theories. Overall, our results encourage a closer tie between experiments with human subjects and with language models. We propose methods to enable this closer tie, with language models as part of the experimental pipeline, and show this pipeline at work.
    Multilingual Speech Recognition for Low-Resource Indian Languages using Multi-Task conformer. (arXiv:2109.03969v1 [cs.CL])
    (2 min) Transformers have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. In this work, we propose a multi-task learning-based transformer model for low-resource multilingual speech recognition for Indian languages. Our proposed model consists of a conformer [1] encoder and two parallel transformer decoders. We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict the grapheme sequence. We consider the phoneme recognition task as an auxiliary task for our multi-task learning framework. We jointly optimize the network for both phoneme and grapheme recognition tasks using Joint CTC-Attention [2] training. We use a conditional decoding scheme to inject the language information into the model before predicting the grapheme sequence. Our experiments show that our proposed approach can obtain significant improvement over previous approaches [4]. We also show that our conformer-based dual-decoder approach outperforms both the transformer-based dual-decoder approach and the single-decoder approach. Finally, we compare monolingual ASR models with our proposed multilingual ASR approach.
    Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation. (arXiv:2109.03858v1 [cs.CL])
    (2 min) Recent works have found evidence of gender bias in models of machine translation and coreference resolution using mostly synthetic diagnostic datasets. While these quantify bias in a controlled experiment, they often do so on a small scale and consist mostly of artificial, out-of-distribution sentences. In this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in a first large-scale gender bias dataset of 108K diverse real-world English sentences. We manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. We find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. Finally, we show that our dataset lends itself to finetuning a coreference resolution model, finding it mitigates bias on a held-out set. Our dataset and models are publicly available at www.github.com/SLAB-NLP/BUG. We hope they will spur future research into gender bias evaluation and mitigation techniques in realistic settings.
    Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering. (arXiv:2109.04014v1 [cs.CL])
    (2 min) Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. One dataset that is most commonly used in evaluating knowledge-based VQA is OK-VQA, but it lacks a gold-standard knowledge corpus for retrieval. Existing works leverage different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because of the varying knowledge bases, it is hard to fairly compare models' performance. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on the given knowledge. We introduce various ways to retrieve knowledge using text and images and two reader styles: classification and extraction. Both the retriever and reader are trained with weak supervision. Our experimental results show that a good retriever can significantly improve the reader's performance on the OK-VQA challenge. The code and corpus are provided at https://github.com/luomancs/retriever_reader_for_okvqa.git
    Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models. (arXiv:2109.03892v1 [cs.CL])
    (2 min) We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.
    Graph Based Network with Contextualized Representations of Turns in Dialogue. (arXiv:2109.04008v1 [cs.CL])
    (2 min) Dialogue-based relation extraction (RE) aims to extract relation(s) between two arguments that appear in a dialogue. Because dialogues have the characteristics of high personal pronoun occurrences and low information density, and since most relational facts in dialogues are not supported by any single sentence, dialogue-based relation extraction requires a comprehensive understanding of dialogue. In this paper, we propose the TUrn COntext awaRE Graph Convolutional Network (TUCORE-GCN) modeled by paying attention to the way people understand dialogues. In addition, we propose a novel approach which treats the task of emotion recognition in conversations (ERC) as a dialogue-based RE. Experiments on a dialogue-based RE dataset and three ERC datasets demonstrate that our model is very effective in various dialogue-based natural language understanding tasks. In these experiments, TUCORE-GCN outperforms the state-of-the-art models on most of the benchmark datasets. Our code is available at https://github.com/BlackNoodle/TUCORE-GCN.
    ELIT: Emory Language and Information Toolkit. (arXiv:2109.03903v1 [cs.CL])
    (2 min) We introduce ELIT, the Emory Language and Information Toolkit, which is a comprehensive NLP framework providing transformer-based end-to-end models for core tasks with a special focus on memory efficiency while maintaining state-of-the-art accuracy and speed. Compared to existing toolkits, ELIT features an efficient Multi-Task Learning (MTL) model with many downstream tasks that include lemmatization, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic role labeling, and AMR parsing. The backbone of ELIT's MTL framework is a pre-trained transformer encoder that is shared across tasks to speed up their inference. ELIT provides pre-trained models developed on a remix of eight datasets. To scale up its service, ELIT also integrates a RESTful Client/Server combination. On the server side, ELIT extends its functionality to cover other tasks such as tokenization and coreference resolution, providing an end user with agile research experience. All resources including the source codes, documentation, and pre-trained models are publicly available at https://github.com/emorynlp/elit.
    Table-based Fact Verification with Salience-aware Learning. (arXiv:2109.04053v1 [cs.CL])
    (2 min) Tables provide valuable knowledge that can be used to verify textual statements. While a number of works have considered table-based fact verification, direct alignments of tabular data with tokens in textual statements are rarely available. Moreover, training a generalized fact verification model requires abundant labeled training data. In this paper, we propose a novel system to address these problems. Inspired by counterfactual causality, our system identifies token-level salience in the statement with probing-based salience estimation. Salience estimation allows enhanced learning of fact verification from two perspectives. From one perspective, our system conducts masked salient token prediction to enhance the model for alignment and reasoning between the table and the statement. From the other perspective, our system applies salience-aware data augmentation to generate a more diverse set of training instances by replacing non-salient terms. Experimental results on TabFact show the effective improvement by the proposed salience-aware learning techniques, leading to the new SOTA performance on the benchmark. Our code is publicly available at https://github.com/luka-group/Salience-aware-Learning .
    A Bayesian Framework for Information-Theoretic Probing. (arXiv:2109.03853v1 [cs.CL])
    (2 min) Pimentel et al. (2020) recently analysed probing from an information-theoretic perspective. They argue that probing should be seen as approximating a mutual information. This led to the rather unintuitive conclusion that representations encode exactly the same information about a target task as the original sentences. The mutual information, however, assumes the true probability distribution of a pair of random variables is known, leading to unintuitive results in settings where it is not. This paper proposes a new framework to measure what we term Bayesian mutual information, which analyses information from the perspective of Bayesian agents -- allowing for more intuitive findings in scenarios with finite data. For instance, under Bayesian MI we have that data can add information, processing can help, and information can hurt, which makes it more intuitive for machine learning applications. Finally, we apply our framework to probing where we believe Bayesian mutual information naturally operationalises ease of extraction by explicitly limiting the available background knowledge to solve a task.
    Ensemble Fine-tuned mBERT for Translation Quality Estimation. (arXiv:2109.03914v1 [cs.CL])
    (2 min) Quality Estimation (QE) is an important component of the machine translation workflow as it assesses the quality of the translated output without consulting reference translations. In this paper, we discuss our submission to the WMT 2021 QE Shared Task. We participate in the Task 2 sentence-level sub-task, which challenges participants to predict the HTER score for sentence-level post-editing effort. Our proposed system is an ensemble of multilingual BERT (mBERT)-based regression models, which are generated by fine-tuning on different input settings. It demonstrates comparable performance with respect to the Pearson's correlation and beats the baseline system in MAE/RMSE for several language pairs. In addition, we adapt our system for the zero-shot setting by exploiting target language-relevant language pairs and pseudo-reference translations.
    Unsupervised Pre-training with Structured Knowledge for Improving Natural Language Inference. (arXiv:2109.03941v1 [cs.CL])
    (2 min) While recent research on natural language inference has considerably benefited from large annotated datasets, the amount of inference-related knowledge (including commonsense) provided in the annotated data is still rather limited. There have been two lines of approaches that can be used to further address the limitation: (1) unsupervised pretraining can leverage knowledge in much larger unstructured text data; (2) structured (often human-curated) knowledge has started to be considered in neural-network-based models for NLI. An immediate question is whether these two approaches complement each other, or how to develop models that can bring together their advantages. In this paper, we propose models that leverage structured knowledge in different components of pre-trained models. Our results show that the proposed models perform better than previous BERT-based state-of-the-art models. Although our models are proposed for NLI, they can be easily extended to other sentence or sentence-pair classification problems.
    Powering Comparative Classification with Sentiment Analysis via Domain Adaptive Knowledge Transfer. (arXiv:2109.03819v1 [cs.CL])
    (2 min) We study Comparative Preference Classification (CPC), which aims at predicting whether a preference comparison exists between two entities in a given sentence and, if so, which entity is preferred over the other. High-quality CPC models can significantly benefit applications such as comparative question answering and review-based recommendations. Among the existing approaches, non-deep-learning methods suffer from inferior performance. The state-of-the-art graph neural network-based ED-GAT (Ma et al., 2020) only considers syntactic information while ignoring the critical semantic relations and the sentiments toward the compared entities. We propose the Sentiment Analysis Enhanced COmparative Network (SAECON), which improves CPC accuracy with a sentiment analyzer that learns sentiments toward individual entities via domain adaptive knowledge transfer. Experiments on the CompSent-19 (Panchenko et al., 2019) dataset show a significant improvement in F1 scores over the best existing CPC approaches.
    A Recipe For Arbitrary Text Style Transfer with Large Language Models. (arXiv:2109.03910v1 [cs.CL])
    (2 min) In this paper, we leverage large language models (LMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as "make this melodramatic" or "insert a metaphor."
    Knowledge mining of unstructured information: application to cyber-domain. (arXiv:2109.03848v1 [cs.CR])
    (2 min) Cyber intelligence is widely and abundantly available in numerous open online sources, with reports on vulnerabilities and incidents. This constant stream of noisy information requires new tools and techniques if it is to be used for the benefit of analysts and investigators in various organizations. In this paper we present and implement a novel knowledge graph and knowledge mining framework for extracting relevant information from free-form text about incidents in the cyber domain. Our framework includes a machine-learning-based pipeline as well as crawling methods for generating graphs of entities, attackers and the related information with our non-technical cyber ontology. We test our framework on publicly available cyber incident datasets to evaluate the accuracy of our knowledge mining methods as well as the usefulness of the framework to cyber analysts. Our results show that, by analyzing the knowledge graph constructed using the novel framework, an analyst can infer additional information from the current cyber landscape in terms of risk to various entities and the propagation of risk between industries and countries. Expanding the framework to accommodate more technical and operational-level information can increase the accuracy and explainability of trends and risk in the knowledge graph.
    Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems. (arXiv:2109.03888v1 [cs.CL])
    (2 min) Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization. Training and inference using large transformer models can be computationally expensive. Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder. Modified encoder architectures such as LED or LoBART use local attention patterns to address this problem for summarization. In contrast, this work focuses on the transformer's encoder-decoder attention mechanism. The cost of this attention becomes more significant in inference or training approaches that require model-generated histories. First, we examine the complexity of the encoder-decoder attention. We demonstrate empirically that there is a sparse sentence structure in document summarization that can be exploited by constraining the attention mechanism to a subset of input sentences, whilst maintaining system performance. Second, we propose a modified architecture that selects the subset of sentences to constrain the encoder-decoder attention. Experiments are carried out on abstractive summarization tasks, including CNN/DailyMail, XSum, Spotify Podcast, and arXiv.
    Graphine: A Dataset for Graph-aware Terminology Definition Generation. (arXiv:2109.04018v1 [cs.CL])
    (2 min) Precisely defining terminology is the first step in scientific communication. Developing neural text generation models for definition generation can circumvent labor-intensive curation, further accelerating scientific discovery. Unfortunately, the lack of a large-scale terminology definition dataset hinders progress toward definition generation. In this paper, we present Graphine, a large-scale terminology definition dataset covering 2,010,648 terminology-definition pairs spanning 227 biomedical subdisciplines. Terminologies in each subdiscipline further form a directed acyclic graph, opening up new avenues for developing graph-aware text generation models. We then propose a novel graph-aware definition generation model, Graphex, that integrates a transformer with a graph neural network. Our model outperforms existing text generation models by exploiting the graph structure of terminologies. We further demonstrate how Graphine can be used to evaluate pretrained language models, compare graph representation learning methods and predict sentence granularity. We envision Graphine to be a unique resource for definition generation and many other NLP tasks in biomedicine.
    Distributionally Robust Multilingual Machine Translation. (arXiv:2109.04020v1 [cs.CL])
    (2 min) Multilingual neural machine translation (MNMT) learns to translate multiple language pairs with a single model, potentially improving both the accuracy and the memory-efficiency of deployed models. However, the heavy data imbalance between languages hinders the model from performing uniformly across language pairs. In this paper, we propose a new learning objective for MNMT based on distributionally robust optimization, which minimizes the worst-case expected loss over the set of language pairs. We further show how to practically optimize this objective for large translation corpora using an iterated best response scheme, which is both effective and incurs negligible additional computational cost compared to standard empirical risk minimization. We perform extensive experiments on three sets of languages from two datasets and show that our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
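    To make the distributionally robust objective concrete, here is a minimal PyTorch-style sketch of one plausible worst-case weighting step over language pairs. The exponentiated-gradient adversary update and all names (`per_pair_losses`, `weights`, `step_size`) are illustrative assumptions, not necessarily the paper's exact iterated best response scheme.

```python
import torch

def dro_weighted_loss(per_pair_losses, weights, step_size=0.1):
    """One robust-training step over language pairs (illustrative sketch).

    per_pair_losses: (P,) current expected loss of each of the P language pairs.
    weights:         (P,) current simplex weights over language pairs.
    """
    # Adversary step: shift probability mass toward the worst-performing pairs
    # (an exponentiated-gradient approximation of the worst-case distribution).
    new_weights = weights * torch.exp(step_size * per_pair_losses.detach())
    new_weights = new_weights / new_weights.sum()
    # Model step: minimize the loss re-weighted by the adversarial distribution.
    robust_loss = (new_weights * per_pair_losses).sum()
    return robust_loss, new_weights
```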
    A Formal Description of Sorani Kurdish Morphology. (arXiv:2109.03942v1 [cs.CL])
    (2 min) Sorani Kurdish, also known as Central Kurdish, has a complex morphology, particularly due to the patterns in which morphemes appear. Although several aspects of Kurdish morphology have been studied, such as pronominal endoclitics and Izafa constructions, Sorani Kurdish morphology has received little attention in computational linguistics. Moreover, some morphemes, such as the emphasis endoclitic =îş, and derivational morphemes have not been previously studied. To tackle the complex morphology of Sorani, we provide a thorough description of Sorani Kurdish morphological and morphophonological constructions in a formal way such that they can be used as finite-state transducers for morphological analysis and synthesis.
  • cs.CV updates on arXiv.org

    Adaptive Binary-Ternary Quantization. (arXiv:1909.12205v2 [cs.LG] UPDATED)
    (0 min) Neural network models are resource hungry. It is difficult to deploy such deep networks on devices with limited resources, like smart wearables, cellphones, drones, and autonomous vehicles. Low-bit quantization such as binary and ternary quantization is a common approach to alleviate these resource requirements. Ternary quantization provides a more flexible model and outperforms binary quantization in terms of accuracy, but doubles the memory footprint and increases the computational cost. In contrast to these approaches, mixed quantized models allow a trade-off between accuracy and memory footprint. In such models, the quantization depth is often chosen manually or tuned using a separate optimization routine; the latter requires training a quantized network multiple times. Here, we propose an adaptive combination of binary and ternary quantization, namely Smart Quantization (SQ), in which the quantization depth is modified directly via a regularization function, so that the model is trained only once. Our experimental results show that the proposed method adapts the quantization depth successfully while keeping the model accuracy high on the MNIST and CIFAR10 benchmarks.
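    For readers unfamiliar with the two quantization depths being combined, the sketch below shows the standard binary and ternary weight projections (in the style of BWN/TWN). It is only background for the abstract above, not the paper's regularization-based Smart Quantization, and the threshold ratio is an assumption.

```python
import torch

def binary_quantize(w):
    """Project weights onto {-alpha, +alpha}, with alpha the mean magnitude."""
    return w.abs().mean() * w.sign()

def ternary_quantize(w, delta_ratio=0.7):
    """Project weights onto {-alpha, 0, +alpha} using a magnitude threshold."""
    delta = delta_ratio * w.abs().mean()
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)
    return alpha * w.sign() * mask
```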
    Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss. (arXiv:2109.04290v1 [cs.CV])
    (2 min) Employing the large-scale pre-trained model CLIP for the video-text retrieval task (VTR) has become a new trend that exceeds previous VTR methods. However, due to the heterogeneity of structures and contents between video and text, previous CLIP-based models are prone to overfitting in the training phase, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two forms of heterogeneity. CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., and then aligns them with the corresponding parts of the text. In this stage, we conduct extensive explorations of the feature extraction and feature alignment modules. DSL is proposed to avoid the one-way optimum match that occurs in previous contrastive methods. Introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser to correct the similarity matrix and achieve the dual optimal match. DSL is easy to implement, requiring only one line of code, yet improves performance significantly. The results show that the proposed CAMoE and DSL are highly effective, and each of them is capable of achieving state-of-the-art (SOTA) performance individually on various benchmarks such as MSR-VTT, MSVD, and LSMDC. Further, with both of them, the performance is advanced to a large extent, surpassing the previous SOTA methods by around 4.6% R@1 on MSR-VTT.
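    The dual softmax idea can be sketched in a few lines. The snippet below revises a video-text similarity matrix with a prior taken from the opposite retrieval direction before the usual symmetric cross-entropy; the temperature value and the exact placement of the prior are assumptions, and the paper's formulation may differ in detail.

```python
import torch
import torch.nn.functional as F

def dual_softmax_retrieval_loss(sim, temperature=0.05):
    """sim: (B, B) cosine similarities between B videos and their B paired
    texts; the diagonal holds the positive pairs."""
    logits = sim / temperature
    # Prior from the opposite direction revises each direction's logits.
    v2t = logits * F.softmax(logits, dim=0)          # text-side prior for video -> text
    t2v = logits.t() * F.softmax(logits.t(), dim=0)  # video-side prior for text -> video
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(v2t, labels) + F.cross_entropy(t2v, labels))
```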
    General Partial Label Learning via Dual Bipartite Graph Autoencoder. (arXiv:2001.01290v2 [cs.CV] UPDATED)
    (2 min) We formulate a practical yet challenging problem: General Partial Label Learning (GPLL). Compared to the traditional Partial Label Learning (PLL) problem, GPLL relaxes the supervision assumption from instance-level -- a label set partially labels an instance -- to group-level: 1) a label set partially labels a group of instances, where the within-group instance-label link annotations are missing, and 2) cross-group links are allowed -- instances in a group may be partially linked to the label set from another group. Such ambiguous group-level supervision is more practical in real-world scenarios as additional annotation on the instance-level is no longer required, e.g., face-naming in videos where the group consists of faces in a frame, labeled by a name set in the corresponding caption. In this paper, we propose a novel graph convolutional network (GCN) called Dual Bipartite Graph Autoencoder (DB-GAE) to tackle the label ambiguity challenge of GPLL. First, we exploit the cross-group correlations to represent the instance groups as dual bipartite graphs: within-group and cross-group, which reciprocally complement each other to resolve the linking ambiguities. Second, we design a GCN autoencoder to encode and decode them, where the decodings are considered as the refined results. It is worth noting that DB-GAE is self-supervised and transductive, as it only uses the group-level supervision without a separate offline training stage. Extensive experiments on two real-world datasets demonstrate that DB-GAE significantly outperforms the best baseline by an absolute 0.159 F1-score and 24.8% accuracy. We further offer analysis on various levels of label ambiguities.
    Continuous Event-Line Constraint for Closed-Form Velocity Initialization. (arXiv:2109.04313v1 [cs.CV])
    (2 min) Event cameras trigger events asynchronously and independently upon a sufficient change of the logarithmic brightness level. The neuromorphic sensor has several advantages over standard cameras including low latency, absence of motion blur, and high dynamic range. Event cameras are particularly well suited to sense motion dynamics in agile scenarios. We propose the continuous event-line constraint, which relies on a constant-velocity motion assumption as well as trifocal tensor geometry in order to express a relationship between line observations given by event clusters as well as first-order camera dynamics. Our core result is a closed-form solver for up-to-scale linear camera velocity with known angular velocity. Nonlinear optimization is adopted to improve the performance of the algorithm. The feasibility of the approach is demonstrated through a careful analysis on both simulated and real data.
    Modified Supervised Contrastive Learning for Detecting Anomalous Driving Behaviours. (arXiv:2109.04021v1 [cs.CV])
    (2 min) Detecting distracted driving behaviours is important to reduce the millions of deaths and injuries occurring worldwide. Distracted or anomalous driving behaviours are deviations from 'normal' driving that need to be identified correctly to alert the driver. However, these driving behaviours do not comprise one specific type of driving style, and their distribution can be different during the training and testing phases of a classifier. We formulate this problem as a supervised contrastive learning approach to learn a visual representation to detect normal, and seen and unseen anomalous driving behaviours. We made a change to the standard contrastive loss function to adjust the similarity of negative pairs to aid the optimization. Normally, the (self-)supervised contrastive framework contains an encoder followed by a projection head, which is omitted during the testing phase as the encoding layers are considered to contain general visual representative information. However, we assert that for the supervised contrastive learning task, including the projection head is beneficial. We showed our results on a Driver Anomaly Detection dataset that contains 783 minutes of video recordings of normal and anomalous driving behaviours of 31 drivers, captured from top and front cameras (both depth and infrared). We also performed an extra step of fine-tuning the labels in this dataset. Out of 9 video modality combinations, our modified contrastive approach improved the ROC AUC on 7 in comparison to the baseline models (from 3.12% to 8.91% for different modalities); the remaining two models also had manual labelling. We performed statistical tests that showed evidence that our modifications perform better than the baseline contrastive models. Finally, the results showed that the fusion of depth and infrared modalities from the top and front views achieved the best AUC ROC of 0.9738 and AUC PR of 0.9772.
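    As background, a supervised contrastive loss with an explicit knob on negative-pair similarities might look like the sketch below. The `neg_scale` factor is purely illustrative; the abstract does not specify the authors' exact adjustment.

```python
import torch

def supcon_loss(features, labels, temperature=0.1, neg_scale=1.0):
    """features: (N, D) L2-normalized projection-head outputs; labels: (N,)."""
    sim = features @ features.t() / temperature
    eye = torch.eye(len(labels), device=features.device)
    pos_mask = (labels[:, None] == labels[None, :]).float() * (1 - eye)
    neg_mask = (1 - pos_mask) * (1 - eye)

    # Illustrative adjustment: rescale the similarity of negative pairs.
    adj = sim * (1 - neg_mask) + neg_scale * sim * neg_mask
    adj = adj.masked_fill(eye.bool(), float('-inf'))  # exclude self-pairs

    log_prob = adj - torch.logsumexp(adj, dim=1, keepdim=True)
    mean_log_prob_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```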
    Neighborhood Consensus Contrastive Learning for Backward-Compatible Representation. (arXiv:2108.03372v2 [cs.CV] UPDATED)
    (2 min) In object re-identification (ReID), the development of deep learning techniques often involves model updates and deployment. It is impractical to re-embed and re-index with the system suspended when deploying new models. Therefore, backward-compatible representation is proposed to enable "new" features to be compared with "old" features directly, which means that the database remains active when there are both "new" and "old" features in it. Thus we can scroll-refresh the database or even do nothing on the database to update it. The existing backward-compatible methods either require a strong overlap between old and new training data or simply impose constraints at the instance level. Thus they have difficulty handling complicated cluster structures and are limited in eliminating the impact of outliers in old embeddings, resulting in a risk of damaging the discriminative capability of new features. In this work, we propose a Neighborhood Consensus Contrastive Learning (NCCL) method. With no assumptions about the new training data, we estimate the sub-cluster structures of old embeddings. A new embedding is constrained with multiple old embeddings in both the embedding space and the discrimination space at the sub-class level. The effect of outliers is diminished, as the multiple samples serve as "mean teachers". Besides, we also propose a scheme to filter out old embeddings with low credibility, further improving the compatibility robustness. Our method ensures backward compatibility without impairing the accuracy of the new model, and it can even improve the new model's accuracy in most scenarios.
    NEAT: Neural Attention Fields for End-to-End Autonomous Driving. (arXiv:2109.04456v1 [cs.CV])
    (2 min) Efficient reasoning about the semantic, spatial, and temporal structure of a scene is a crucial prerequisite for autonomous driving. We present NEural ATtention fields (NEAT), a novel representation that enables such reasoning for end-to-end imitation learning models. NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation. This allows our model to selectively attend to relevant regions in the input while ignoring information irrelevant to the driving task, effectively associating the images with the BEV representation. In a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert used to generate its training data. Furthermore, visualizing the attention maps for models with NEAT intermediate representations provides improved interpretability.
    DSAM: A Distance Shrinking with Angular Marginalizing Loss for High Performance Vehicle Re-identification. (arXiv:2011.06228v3 [cs.CV] UPDATED)
    (2 min) Vehicle Re-identification (ReID) is an important yet challenging problem in computer vision. Compared to other visual objects like faces and persons, vehicles simultaneously exhibit much larger intraclass viewpoint variations and interclass visual similarities, making most existing loss functions designed for face recognition and person ReID unsuitable for vehicle ReID. To obtain a high-performance vehicle ReID model, we present a novel Distance Shrinking with Angular Marginalizing (DSAM) loss function to perform hybrid learning in both the Original Feature Space (OFS) and the Feature Angular Space (FAS) using the local verification and the global identification information. Specifically, it shrinks the distance between samples of the same class locally in the Original Feature Space while keeping samples of different classes far away in the Feature Angular Space. The shrinking and marginalizing operations are performed during each iteration of the training process and are suitable for different SoftMax based loss functions. We evaluate the DSAM loss function on three large vehicle ReID datasets with detailed analyses and extensive comparisons with many competing vehicle ReID methods. Experimental results show that our DSAM loss enhances the SoftMax loss by a large margin on the PKU-VD1-Large dataset: 10.41% for mAP, 5.29% for cmc1, and 4.60% for cmc5. Moreover, the mAP is increased by 9.34% on the PKU-VehicleID dataset and 6.13% on the VeRi-776 dataset. Source code will be released to facilitate further studies in this research direction.
    PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition. (arXiv:2109.04145v1 [cs.CV])
    (2 min) Scene text recognition has attracted increasing attention due to its various applications. Most state-of-the-art methods adopt an encoder-decoder framework with an attention mechanism, generating text autoregressively from left to right. Despite the convincing performance, the speed is limited by the one-by-one decoding strategy. In contrast to autoregressive models, non-autoregressive models predict the results in parallel with a much shorter inference time, but their accuracy falls considerably behind their autoregressive counterparts. In this paper, we propose a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster and an iterative generation mechanism to make the predictions more accurate. In each iteration, the context information is fully explored. To improve learning of the hidden layer, we exploit mimicking learning in the training phase, where an additional autoregressive decoder is adopted and the parallel decoder mimics the autoregressive decoder by fitting the outputs of the hidden layer. With the shared backbone between the two decoders, the proposed PIMNet can be trained end-to-end without pre-training. During inference, the branch of the autoregressive decoder is removed for faster speed. Extensive experiments on public benchmarks demonstrate the effectiveness and efficiency of PIMNet. Our code will be available at https://github.com/Pay20Y/PIMNet.
    Fair Conformal Predictors for Applications in Medical Imaging. (arXiv:2109.04392v1 [eess.IV])
    (2 min) Deep learning has the potential to augment many components of the clinical workflow, such as medical image interpretation. However, the translation of these black-box algorithms into clinical practice has been marred by a relative lack of transparency compared to conventional machine learning methods, hindering clinician trust in the systems for critical medical decision-making. Specifically, common deep learning approaches do not have intuitive ways of expressing uncertainty with respect to cases that might require further human review. Furthermore, the possibility of algorithmic bias has caused hesitancy regarding the use of developed algorithms in clinical settings. To these ends, we explore how conformal methods can complement deep learning models by providing both a clinically intuitive way (by means of confidence prediction sets) of expressing model uncertainty and facilitating model transparency in clinical workflows. In this paper, we conduct a field survey with clinicians to assess clinical use-cases of conformal predictions. Next, we conduct experiments with mammographic breast density and dermatology photography datasets to demonstrate the utility of conformal predictions in "rule-in" and "rule-out" disease scenarios. Further, we show that conformal predictors can be used to equalize coverage with respect to patient demographics such as race and skin tone. We find conformal predictions to be a promising framework with the potential to increase clinical usability and transparency for better collaboration between deep learning algorithms and clinicians.
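    The group-equalized coverage idea maps naturally onto split conformal prediction with one calibration threshold per demographic group. The sketch below shows that generic recipe; the variable names and the nonconformity score (one minus the true-class probability) are assumptions rather than the paper's exact choices.

```python
import numpy as np

def groupwise_conformal_thresholds(cal_probs, cal_labels, groups, alpha=0.1):
    """Split-conformal calibration with one threshold per group, so coverage
    is (approximately) equalized across groups.

    cal_probs:  (N, C) predicted class probabilities on a held-out set.
    cal_labels: (N,) true labels.
    groups:     (N,) group identifier per calibration example.
    """
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]  # nonconformity
    thresholds = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        n = len(s)
        k = int(np.ceil((n + 1) * (1 - alpha))) - 1   # finite-sample corrected quantile
        thresholds[g] = s[min(k, n - 1)]
    return thresholds

def prediction_set(probs, group, thresholds):
    """Return the conformal prediction set for one test example."""
    return np.where(1.0 - probs <= thresholds[group])[0]
```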
    Sharing Matters for Generalization in Deep Metric Learning. (arXiv:2004.05582v3 [cs.CV] UPDATED)
    (2 min) Learning the similarity between images constitutes the foundation for numerous vision tasks. The common paradigm is discriminative metric learning, which seeks an embedding that separates different training classes. However, the main challenge is to learn a metric that not only generalizes from training to novel, but related, test samples, but also transfers to different object classes. So what complementary information is missed by the discriminative paradigm? Besides finding characteristics that separate between classes, we also need them to likely occur in novel categories, which is indicated if they are shared across training classes. This work investigates how to learn such characteristics without the need for extra annotations or training data. By formulating our approach as a novel triplet sampling strategy, it can be easily applied on top of recent ranking loss frameworks. Experiments show that, independent of the underlying network architecture and the specific ranking loss, our approach significantly improves performance in deep metric learning, leading to new state-of-the-art results on various standard benchmark datasets. A preliminary early-access version can be found here: https://ieeexplore.ieee.org/document/9141449
    Improving Building Segmentation for Off-Nadir Satellite Imagery. (arXiv:2109.03961v1 [cs.CV])
    (2 min) Automatic building segmentation is an important task for satellite imagery analysis and scene understanding. Most existing segmentation methods focus on the case where the images are taken from directly overhead (i.e., low off-nadir/viewing angle). These methods often fail to provide accurate results on satellite images with larger off-nadir angles due to the higher noise level and lower spatial resolution. In this paper, we propose a method that is able to provide accurate building segmentation for satellite imagery captured from a large range of off-nadir angles. Based on Bayesian deep learning, we explicitly design our method to learn the data noise via aleatoric and epistemic uncertainty modeling. Satellite image metadata (e.g., off-nadir angle and ground sample distance) is also used in our model to further improve the result. We show that with uncertainty modeling and metadata injection, our method achieves better performance than the baseline method, especially for noisy images taken from large off-nadir angles.
    Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training. (arXiv:2101.12699v3 [cs.LG] UPDATED)
    (2 min) In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts of the network. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which in part explains the recently discovered phenomenon of neural collapse (Papyan et al., 2020). More importantly, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon is first predicted by the Layer-Peeled Model before being confirmed by our computational experiments.
    IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System. (arXiv:2109.04202v1 [q-bio.QM])
    (2 min) Like many scientific fields, chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and machine-readable molecule representations, text-based molecule descriptors like SMILES and SELFIES were created. These text-based molecule representations enable molecule generation but are unfortunately rarely present in published literature. In the absence of molecule descriptors, generating them from the 2-D images present in the literature is necessary to understand chemistry literature at scale. Successful methods such as Optical Structure Recognition Application (OSRA) and ChemSchematicResolver are able to extract the locations of molecular structures in chemistry papers and infer molecular descriptions and reactions. While effective, existing systems expect chemists to correct outputs, making them unsuitable for unsupervised large-scale data mining. Leveraging the task formulation of image captioning introduced by DECIMER, we introduce IMG2SMI, a model which leverages deep residual networks for image feature extraction and encoder-decoder Transformer layers for molecule description generation. Unlike previous neural-network-based systems, IMG2SMI builds around the task of molecule description generation, which enables IMG2SMI to outperform OSRA-based systems by 163% in molecule similarity prediction as measured by the molecular MACCS Fingerprint Tanimoto Similarity. Additionally, to facilitate further research on this task, we release a new molecule prediction dataset including 81 million molecules for molecule description generation.
    Just Noticeable Difference for Machine Perception and Generation of Regularized Adversarial Images with Minimal Perturbation. (arXiv:2102.08079v3 [cs.CV] UPDATED)
    (2 min) In this study, we introduce a measure for machine perception, inspired by the concept of Just Noticeable Difference (JND) in human perception. Based on this measure, we suggest an adversarial image generation algorithm, which iteratively distorts an image with additive noise until the model detects the change in the image by outputting a false label. The noise added to the original image is defined as the gradient of the cost function of the model. A novel cost function is defined to explicitly minimize the amount of perturbation applied to the input image while enforcing the perceptual similarity between the adversarial and input images. For this purpose, the cost function is regularized by the well-known total variation and bounded-range terms to preserve the natural appearance of the adversarial image. We evaluate the adversarial images generated by our algorithm both qualitatively and quantitatively on the CIFAR10, ImageNet, and MS COCO datasets. Our experiments on image classification and object detection tasks show that adversarial images generated by our JND method are both more successful in deceiving the recognition/detection models and less perturbed compared to images generated by the state-of-the-art methods, namely the FGV, FGSM, and DeepFool methods.
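    In rough form, the generation loop described above can be sketched as an iterative gradient attack whose cost is regularized by the total variation of the perturbation and a bounded-range clamp. The step size, weights, and stopping rule below are assumptions; the paper's exact cost function differs in its details.

```python
import torch
import torch.nn.functional as F

def total_variation(x):
    """Anisotropic total variation of a (B, C, H, W) image batch."""
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

def jnd_style_attack(model, image, label, step=1e-2, tv_weight=1e-2, max_iters=200):
    """Distort `image` with additive gradient noise until the predicted label flips."""
    target = torch.tensor([label], device=image.device)
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(max_iters):
        logits = model(adv)
        if logits.argmax(dim=1).item() != label:
            break                                   # change is now "noticeable" to the model
        cost = F.cross_entropy(logits, target) - tv_weight * total_variation(adv - image)
        cost.backward()
        with torch.no_grad():
            adv += step * adv.grad                  # ascend the regularized cost
            adv.clamp_(0.0, 1.0)                    # bounded-range term
            adv.grad.zero_()
    return adv.detach()
```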
    ACP++: Action Co-occurrence Priors for Human-Object Interaction Detection. (arXiv:2109.04047v1 [cs.CV])
    (2 min) A common problem in the task of human-object interaction (HOI) detection is that numerous HOI classes have only a small number of labeled examples, resulting in training sets with a long-tailed distribution. The lack of positive labels can lead to low classification accuracy for these classes. Towards addressing this issue, we observe that there exist natural correlations and anti-correlations among human-object interactions. In this paper, we model the correlations as action co-occurrence matrices and present techniques to learn these priors and leverage them for more effective training, especially on rare classes. The efficacy of our approach is demonstrated experimentally, where the performance of our approach consistently improves over the state-of-the-art methods on the two leading HOI detection benchmark datasets, HICO-Det and V-COCO.
    Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer. (arXiv:2108.01390v3 [cs.CV] UPDATED)
    (3 min) Vision transformers (ViTs) have recently received explosive popularity, but their huge computational cost remains a severe issue. Since the computational complexity of a ViT is quadratic with respect to the input sequence length, a mainstream paradigm for computation reduction is to reduce the number of tokens. Existing designs include structured spatial compression, which uses a progressive shrinking pyramid to reduce the computation over large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, the limitation of existing token pruning is twofold: 1) the incomplete spatial structure caused by pruning is not compatible with the structured spatial compression commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pre-training procedure. To tackle these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we conduct unstructured, instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then, we propose to update the selected informative tokens and the uninformative tokens with different computation paths, namely slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification.
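    The token-selection step can be illustrated with a short helper that splits patch tokens by the class token's attention to them; the keep ratio and the averaging over heads are assumptions, and the slow/fast update paths themselves are omitted.

```python
import torch

def split_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.5):
    """tokens:   (B, N, D) patch tokens (class token excluded).
    cls_attn: (B, N) attention from the class token to each patch,
              e.g. averaged over heads."""
    num_keep = max(1, int(tokens.size(1) * keep_ratio))
    idx = cls_attn.argsort(dim=1, descending=True)
    keep_idx, drop_idx = idx[:, :num_keep], idx[:, num_keep:]
    gather = lambda t, i: t.gather(1, i.unsqueeze(-1).expand(-1, -1, t.size(-1)))
    informative = gather(tokens, keep_idx)      # updated on the "slow" path
    uninformative = gather(tokens, drop_idx)    # updated on the cheap "fast" path
    return informative, uninformative, keep_idx, drop_idx
```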
    Efficient Weight Pruning using Pre-trained Lottery Jackpots. (arXiv:2104.08700v3 [cs.CV] UPDATED)
    (2 min) Network pruning is an effective approach to reduce network complexity without compromising performance. Existing studies achieve network sparsity via time-consuming weight tuning or complex searches over networks with expanded width, which greatly limits the application of network pruning. In this paper, we show that high-performing and sparse sub-networks, termed "lottery jackpots", exist in pre-trained models with unexpanded width and without any weight tuning. For example, we obtain a lottery jackpot that has only 10% of the parameters and still reaches the performance of the original dense VGGNet-19 without any modification of the pre-trained weights. Furthermore, we observe that the sparse masks derived from many existing pruning criteria have a high overlap with the searched mask of our lottery jackpot, and among them, magnitude-based pruning yields the most similar mask to ours. Based on this insight, we initialize our sparse mask using magnitude pruning, resulting in at least a 3x cost reduction for the lottery jackpot search while achieving comparable or even better performance. Specifically, our magnitude-based lottery jackpot removes 90% of the weights in ResNet-50 while easily obtaining more than 70% top-1 accuracy using only 10 search epochs on ImageNet.
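    The magnitude-based mask initialization mentioned above is straightforward to write down; the sketch below keeps the largest-magnitude weights of a pre-trained layer and zeroes the rest, which then serves as the starting point for the (omitted) mask search.

```python
import torch

def magnitude_mask(weight, sparsity=0.9):
    """Binary pruning mask from pre-trained weight magnitudes: keep the
    largest-magnitude (1 - sparsity) fraction, zero out the rest."""
    flat = weight.abs().flatten()
    k = int(flat.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = flat.kthvalue(k).values      # k-th smallest magnitude
    return (weight.abs() > threshold).float()

# usage: mask the weights of a pre-trained layer without tuning them
# layer.weight.data *= magnitude_mask(layer.weight.data, sparsity=0.9)
```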
    Energy-Efficient Mobile Robot Control via Run-time Monitoring of Environmental Complexity and Computing Workload. (arXiv:2109.04285v1 [cs.RO])
    (2 min) We propose an energy-efficient controller to minimize the energy consumption of a mobile robot by dynamically manipulating the mechanical and computational actuators of the robot. The mobile robot performs real-time vision-based applications based on an event-based camera. The actuators of the controller are CPU voltage/frequency for the computation part and motor voltage for the mechanical part. We show that independently considering speed control of the robot and voltage/frequency control of the CPU does not necessarily result in an energy-efficient solution. In fact, to obtain the highest efficiency, the computation and mechanical parts should be controlled together in synergy. We propose a fast hill-climbing optimization algorithm to allow the controller to find the best CPU/motor configuration at run-time and whenever the mobile robot is facing a new environment during its travel. Experimental results on a robot with Brushless DC Motors, Jetson TX2 board as the computing unit, and a DAVIS-346 event-based camera show that the proposed control algorithm can save battery energy by an average of 50.5%, 41%, and 30%, in low-complexity, medium-complexity, and high-complexity environments, over baselines.
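    A generic hill-climbing loop over the two actuators could look like the sketch below, where `energy_of(cfg)` is a hypothetical callback that measures or estimates power draw for a (CPU frequency, motor voltage) configuration; the neighborhood definition and stopping rule are assumptions, not the paper's exact algorithm.

```python
import random

def hill_climb(energy_of, cpu_freqs, motor_volts, start=None, max_steps=50):
    """Greedy local search over the joint (CPU frequency, motor voltage) space."""
    cfg = start or (random.choice(cpu_freqs), random.choice(motor_volts))
    best = energy_of(cfg)
    for _ in range(max_steps):
        f_idx, v_idx = cpu_freqs.index(cfg[0]), motor_volts.index(cfg[1])
        # Neighbors: one step up or down along either knob.
        neighbors = [
            (cpu_freqs[f_idx + df], motor_volts[v_idx + dv])
            for df, dv in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= f_idx + df < len(cpu_freqs) and 0 <= v_idx + dv < len(motor_volts)
        ]
        candidate = min(neighbors, key=energy_of)
        cand_energy = energy_of(candidate)
        if cand_energy >= best:
            break                                   # local minimum reached
        cfg, best = candidate, cand_energy
    return cfg, best
```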
    Preservational Learning Improves Self-supervised Medical Image Models by Reconstructing Diverse Contexts. (arXiv:2109.04379v1 [cs.CV])
    (2 min) Preserving maximal information is one of the principles of designing self-supervised learning methodologies. To reach this goal, contrastive learning adopts an implicit approach: contrasting image pairs. However, we believe it is not fully optimal to simply use contrastive estimation for preservation. Moreover, it is necessary and complementary to introduce an explicit solution to preserve more information. From this perspective, we introduce Preservational Learning, which reconstructs diverse image contexts in order to preserve more information in the learned representations. Together with the contrastive loss, we present Preservational Contrastive Representation Learning (PCRL) for learning self-supervised medical representations. PCRL provides very competitive results under the pretraining-finetuning protocol, substantially outperforming both self-supervised and supervised counterparts in 5 classification/segmentation tasks.
    Neural-IMLS: Learning Implicit Moving Least-Squares for Surface Reconstruction from Unoriented Point clouds. (arXiv:2109.04398v1 [cs.CV])
    (2 min) Surface reconstruction from noisy, non-uniform, and unoriented point clouds is a fascinating yet difficult problem in computer vision and computer graphics. In this paper, we propose Neural-IMLS, a novel approach that learns a noise-resistant signed distance function (SDF) for reconstruction. Instead of explicitly learning priors from ground-truth signed distance values, our method learns the SDF from raw point clouds directly in a self-supervised fashion by minimizing the loss between a pair of SDFs: one obtained by the implicit moving least-squares function (IMLS) and the other by our network. Finally, a watertight and smooth 2-manifold triangle mesh is produced by running Marching Cubes. We conduct extensive experiments on various benchmarks to demonstrate the performance of Neural-IMLS, especially for point clouds with noise.
    Reconstructing and grounding narrated instructional videos in 3D. (arXiv:2109.04409v1 [cs.CV])
    (2 min) Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product. Narrations may also have large variation in natural language expressions. We address these challenges by three contributions. First, we propose an approach for correspondence estimation combining learnt local features and dense flow. Second, we design a two-step divide and conquer reconstruction approach where the initial 3D reconstructions of individual videos are combined into a 3D alignment graph. Finally, we propose an unsupervised approach to ground natural language in obtained 3D reconstructions. We demonstrate the effectiveness of our approach for the domain of car maintenance. Given raw instructional videos and no manual supervision, our method successfully reconstructs engines of different car models and associates textual descriptions with corresponding objects in 3D.
    Attention-Based 3D Seismic Fault Segmentation Training by a Few 2D Slice Labels. (arXiv:2105.03857v5 [cs.CV] UPDATED)
    (3 min) Detecting faults in seismic data is a crucial step for seismic structural interpretation, reservoir characterization and well placement. Some recent works regard it as an image segmentation task. Image segmentation requires a huge number of labels, especially for 3D seismic data, which has a complex structure and much noise; its annotation therefore requires expert experience and a huge workload. In this study, we present lambda-BCE and lambda-smooth L1 losses to effectively train a 3D-CNN with only a few slices from 3D seismic data, so that the model can learn to segment 3D seismic data from a few 2D slices. In order to fully extract information from limited data and suppress seismic noise, we propose an attention module that can be used for active supervision training and embedded in the network. The attention heatmap label is generated from the original label and supervises the attention module using the lambda-smooth L1 loss. The experiments demonstrate the effectiveness of our loss functions: the method can extract 3D seismic features from a few 2D slice labels. They also show the advanced performance of the attention module, which can significantly suppress the noise in the seismic data while increasing the model's sensitivity to the foreground. Finally, on the public test set, we train using only 2D slice labels that account for 3.3% of the 3D volume labels and achieve performance similar to training with full 3D volume labels.
    Superpoint-guided Semi-supervised Semantic Segmentation of 3D Point Clouds. (arXiv:2107.03601v3 [cs.CV] UPDATED)
    (2 min) 3D point cloud semantic segmentation is a challenging topic in the computer vision field. Most of the existing methods in literature require a large amount of fully labeled training data, but it is extremely time-consuming to obtain these training data by manually labeling massive point clouds. Addressing this problem, we propose a superpoint-guided semi-supervised segmentation network for 3D point clouds, which jointly utilizes a small portion of labeled scene point clouds and a large number of unlabeled point clouds for network training. The proposed network is iteratively updated with its predicted pseudo labels, where a superpoint generation module is introduced for extracting superpoints from 3D point clouds, and a pseudo-label optimization module is explored for automatically assigning pseudo labels to the unlabeled points under the constraint of the extracted superpoints. Additionally, there are some 3D points without pseudo-label supervision. We propose an edge prediction module to constrain features of edge points. A superpoint feature aggregation module and a superpoint feature consistency loss function are introduced to smooth superpoint features. Extensive experimental results on two 3D public datasets demonstrate that our method can achieve better performance than several state-of-the-art point cloud segmentation networks and several popular semi-supervised segmentation methods with few labeled scenes.
    ProAI: An Efficient Embedded AI Hardware for Automotive Applications -- a Benchmark Study. (arXiv:2108.05170v2 [cs.CV] UPDATED)
    (2 min) Development in the field of Single Board Computers (SBCs) has been increasing for several years. They provide a good balance between computing performance and power consumption, which is usually required for mobile platforms such as applications in vehicles for Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD). However, there is an ever-increasing need for more powerful and efficient SBCs which can run power-intensive Deep Neural Networks (DNNs) in real-time and can also satisfy necessary functional safety requirements such as Automotive Safety Integrity Level (ASIL). ProAI is being developed by ZF mainly to run powerful and efficient applications such as multitask DNNs, and on top of that it also has the required safety certification for AD. In this work, we compare and discuss state-of-the-art SBCs on the basis of a power-intensive multitask DNN architecture called Multitask-CenterNet with respect to performance measures such as FPS and power efficiency. As an automotive supercomputer, ProAI delivers an excellent combination of performance and efficiency, managing nearly twice the number of FPS per watt of a modern workstation laptop and almost four times that of the Jetson Nano. Furthermore, it was also shown that there is still power in reserve for further and more complex tasks on the ProAI, based on the CPU and GPU utilization during the benchmark.
    Localization Uncertainty Estimation for Anchor-Free Object Detection. (arXiv:2006.15607v5 [cs.CV] UPDATED)
    (3 min) Since many safety-critical systems, such as surgical robots and autonomous driving cars, operate in unstable environments with sensor noise and incomplete data, it is desirable for object detectors to take localization uncertainty into account. However, the existing uncertainty estimation methods for anchor-based object detection have several limitations. 1) They model the uncertainty of heterogeneous object properties with different characteristics and scales, such as location (center point) and scale (width, height), which can be difficult to estimate. 2) They model box offsets as Gaussian distributions, which is not compatible with ground truth bounding boxes that follow a Dirac delta distribution. 3) Since anchor-based methods are sensitive to anchor hyperparameters, their localization uncertainty can also be highly sensitive to the choice of hyperparameters. To tackle these limitations, we propose a new localization uncertainty estimation method called UAD for anchor-free object detection. Our method captures the uncertainty in the four directions of box offsets (left, right, top, bottom), which are homogeneous, so that it can tell which direction is uncertain, and provides a quantitative value of uncertainty in $[0, 1]$. To enable such uncertainty estimation, we design a new uncertainty loss, the negative power log-likelihood loss, which measures the localization uncertainty by weighting the likelihood loss by its IoU, alleviating the model misspecification problem. Furthermore, we propose an uncertainty-aware focal loss that reflects the estimated uncertainty in the classification score. Experimental results on the COCO dataset demonstrate that our method significantly improves FCOS, by up to 1.8 points, without sacrificing computational efficiency.
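    A rough rendering of an IoU-weighted, per-direction likelihood loss is sketched below, using a Gaussian likelihood for concreteness; the distribution, weighting, and tensor shapes are assumptions rather than the paper's exact negative power log-likelihood formulation.

```python
import torch

def iou_weighted_nll(pred_offsets, pred_log_sigma, gt_offsets, iou):
    """pred_offsets, gt_offsets: (N, 4) distances to the four box sides
    (left, right, top, bottom).
    pred_log_sigma: (N, 4) predicted log standard deviations per direction.
    iou:            (N,) IoU of each predicted box with its target."""
    var = torch.exp(2 * pred_log_sigma)
    # Per-direction Gaussian negative log-likelihood, weighted by box quality.
    nll = 0.5 * ((pred_offsets - gt_offsets) ** 2 / var) + pred_log_sigma
    return (iou.unsqueeze(1) * nll).mean()
```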
    Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging. (arXiv:2010.03060v5 [cs.LG] UPDATED)
    (2 min) A key challenge in training neural networks for a given medical imaging task is often the difficulty of obtaining a sufficient number of manually labeled examples. In contrast, textual imaging reports, which are often readily available in medical records, contain rich but unstructured interpretations written by experts as part of standard clinical practice. We propose using these textual reports as a form of weak supervision to improve the image interpretation performance of a neural network without requiring additional manually labeled examples. We use an image-text matching task to train a feature extractor and then fine-tune it in a transfer learning setting for a supervised task using a small labeled dataset. The end result is a neural network that automatically interprets imagery without requiring textual reports during inference. This approach can be applied to any task for which text-image pairs are readily available. We evaluate our method on three classification tasks and find consistent performance improvements, reducing the need for labeled data by 67%-98%.
    IICNet: A Generic Framework for Reversible Image Conversion. (arXiv:2109.04242v1 [cs.CV])
    (2 min) Reversible image conversion (RIC) aims to build a reversible transformation between specific visual content (e.g., short videos) and an embedding image, where the original content can be restored from the embedding when necessary. This work develops Invertible Image Conversion Net (IICNet) as a generic solution to various RIC tasks due to its strong capacity and task-independent design. Unlike previous encoder-decoder based methods, IICNet maintains a highly invertible structure based on invertible neural networks (INNs) to better preserve the information during conversion. We use a relation module and a channel squeeze layer to improve the INN nonlinearity to extract cross-image relations and the network flexibility, respectively. Experimental results demonstrate that IICNet outperforms the specifically-designed methods on existing RIC tasks and can generalize well to various newly-explored tasks. With our generic IICNet, we no longer need to hand-engineer task-specific embedding networks for rapidly occurring visual content. Our source codes are available at: https://github.com/felixcheng97/IICNet.
    Image Cropping on Twitter: Fairness Metrics, their Limitations, and the Importance of Representation, Design, and Agency. (arXiv:2105.08667v2 [cs.CY] UPDATED)
    (2 min) Twitter uses machine learning to crop images, where crops are centered around the part predicted to be the most salient. In fall 2020, Twitter users raised concerns that the automated image cropping system favored light-skinned over dark-skinned individuals, as well as concerns that the system favored cropping to women's bodies instead of their heads. In order to address these concerns, we conduct an extensive analysis using formalized group fairness metrics. We find systematic disparities in cropping and identify contributing factors, including the fact that cropping based on the single most salient point can amplify the disparities because of an effect we term argmax bias. However, we demonstrate that formalized fairness metrics and quantitative analysis on their own are insufficient for capturing the risk of representational harm in automatic cropping. We suggest the removal of saliency-based cropping in favor of a solution that better preserves user agency. For developing a new solution that sufficiently addresses concerns related to representational harm, our critique motivates a combination of quantitative and qualitative methods that include human-centered design.
    Dynamic Modeling of Hand-Object Interactions via Tactile Sensing. (arXiv:2109.04378v1 [cs.RO])
    (2 min) Tactile sensing is critical for humans to perform everyday tasks. While significant progress has been made in analyzing object grasping from vision, it remains unclear how we can utilize tactile sensing to reason about and model the dynamics of hand-object interactions. In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects. We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model, which can then be used on its own during the test time. The tactile model aims to predict the 3d locations of both the hand and the object purely from the touch data by combining a predictive model and a contrastive learning module. This framework can reason about the interaction patterns from the tactile data, hallucinate the changes in the environment, estimate the uncertainty of the prediction, and generalize to unseen objects. We also provide detailed ablation studies regarding different system designs as well as visualizations of the predicted trajectories. This work takes a step on dynamics modeling in hand-object interactions from dense tactile sensing, which opens the door for future applications in activity learning, human-computer interactions, and imitation learning for robotics.
    HoHoNet: 360 Indoor Holistic Understanding with Latent Horizontal Features. (arXiv:2011.11498v3 [cs.CV] UPDATED)
    (2 min) We present HoHoNet, a versatile and efficient framework for holistic understanding of an indoor 360-degree panorama using a Latent Horizontal Feature (LHFeat). The compact LHFeat flattens the features along the vertical direction and has shown success in modeling per-column modality for room layout reconstruction. HoHoNet advances in two important aspects. First, the deep architecture is redesigned to run faster with improved accuracy. Second, we propose a novel horizon-to-dense module, which relaxes the per-column output shape constraint, allowing per-pixel dense prediction from LHFeat. HoHoNet is fast: It runs at 52 FPS and 110 FPS with ResNet-50 and ResNet-34 backbones respectively, for modeling dense modalities from a high-resolution $512 \times 1024$ panorama. HoHoNet is also accurate. On the tasks of layout estimation and semantic segmentation, HoHoNet achieves results on par with current state-of-the-art. On dense depth estimation, HoHoNet outperforms all the prior arts by a large margin.
    Leveraging Local Domains for Image-to-Image Translation. (arXiv:2109.04468v1 [cs.CV])
    (2 min) Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. For example, when translating from highway scenes to offroad ones, i2i networks easily focus on global color features but ignore traits that are obvious to humans, like the absence of lane markings. In this paper, we leverage human knowledge about spatial domain characteristics, which we refer to as 'local domains', and demonstrate its benefit for image-to-image translation. Relying on simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to the target. We experiment on three tasks ranging from unstructured environments to adverse weather. Our comprehensive evaluation setting shows we are able to generate realistic translations with minimal priors, training only on a few images. Furthermore, when trained on our translated images, all tested proxy tasks are significantly improved, without ever seeing the target domain at training time.
    Adversarial Attacks are Reversible with Natural Supervision. (arXiv:2103.14222v3 [cs.CV] UPDATED)
    (2 min) We find that images contain intrinsic structure that enables the reversal of many adversarial attacks. Attack vectors cause not only image classifiers to fail, but also collaterally disrupt incidental structure in the image. We demonstrate that modifying the attacked image to restore the natural structure will reverse many types of attacks, providing a defense. Experiments demonstrate significantly improved robustness for several state-of-the-art models across the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Our results show that our defense is still effective even if the attacker is aware of the defense mechanism. Since our defense is deployed during inference instead of training, it is compatible with pre-trained networks as well as most other defenses. Our results suggest deep networks are vulnerable to adversarial examples partly because their representations do not enforce the natural structure of images.
    Single Image 3D Object Estimation with Primitive Graph Networks. (arXiv:2109.04153v1 [cs.CV])
    (2 min) Reconstructing a 3D object from a single image (RGB or depth) is a fundamental problem in visual scene understanding, and it remains challenging due to its ill-posed nature and the complexity of real-world scenes. To address these challenges, we adopt a primitive-based representation for 3D objects and propose a two-stage graph network for primitive-based 3D object estimation, which consists of a sequential proposal module and a graph reasoning module. Given a 2D image, our proposal module first generates a sequence of 3D primitives from the input image with local feature attention. Then the graph reasoning module performs joint reasoning on a primitive graph to capture the global shape context for each primitive. Such a framework is capable of taking into account rich geometric and semantic constraints during 3D structure recovery, producing 3D objects with more coherent structure even under challenging viewing conditions. We train the entire graph neural network in a stage-wise strategy and evaluate it on three benchmarks: Pix3D, ModelNet and NYU Depth V2. Extensive experiments show that our approach outperforms the previous state of the art by a considerable margin.
    LightSAL: Lightweight Sign Agnostic Learning for Implicit Surface Representation. (arXiv:2103.14273v2 [cs.CV] UPDATED)
    (2 min) Recently, several works have addressed modeling of 3D shapes using deep neural networks to learn implicit surface representations. Up to now, the majority of works have concentrated on reconstruction quality, paying little or no attention to model size or training time. This work proposes LightSAL, a novel deep convolutional architecture for learning 3D shapes; the proposed work concentrates on efficiency both in network training time and resulting model size. We build on the recent concept of Sign Agnostic Learning for training the proposed network, relying on signed distance fields, with unsigned distance as ground truth. In the experimental section of the paper, we demonstrate that the proposed architecture outperforms previous work in model size and number of required training iterations, while achieving equivalent accuracy. Experiments are based on the D-Faust dataset that contains 41k 3D scans of human shapes. The proposed model has been implemented in PyTorch.
    M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks. (arXiv:2109.04275v1 [cs.CV])
    (2 min) In this paper, we aim to advance the research of multi-modal pre-training on E-commerce and contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs covering more than 6,000 categories and 5,000 attributes. Generally, existing multi-modal datasets are either limited in scale or modality diversity. In contrast, our M5Product is distinguished by the following aspects. First, the M5Product dataset is 500 times larger than the public multimodal dataset with the same number of modalities and nearly twice as large as the largest available text-image cross-modal dataset. Second, the dataset contains rich information from multiple modalities including image, text, table, video and audio, where each modality captures different views of semantic information (e.g. category, attributes, affordance, brand, preference) and complements the others. Third, to better reflect real-world problems, a small portion of M5Product contains incomplete modality pairs and noise, while also having a long-tailed distribution, which aligns well with real-world scenarios. Finally, we provide a baseline model, M5-MMT, that makes the first attempt to integrate the different modality configurations into a unified model for feature fusion to address the great challenge of semantic alignment. We also evaluate various state-of-the-art multi-modal pre-training methods to benchmark their capabilities in learning from unlabeled data under different numbers of modalities on the M5Product dataset. We conduct extensive experiments on four downstream tasks and provide some interesting findings on these modalities. Our dataset and related code are available at https://xiaodongsuper.github.io/M5Product_dataset.
    Copy-Move Image Forgery Detection Based on Evolving Circular Domains Coverage. (arXiv:2109.04381v1 [cs.CV])
    (2 min) The aim of this paper is to improve the accuracy of copy-move forgery detection (CMFD) in image forensics by proposing a novel scheme that integrates both block-based and keypoint-based forgery detection methods. Firstly, the speed-up robust feature (SURF) descriptor in log-polar space and the scale invariant feature transform (SIFT) descriptor are extracted from the entire forged image. Secondly, generalized 2 nearest neighbor (g2NN) matching is employed to obtain massive matched pairs. Then, the random sample consensus (RANSAC) algorithm is employed to filter out mismatched pairs, allowing rough localization of the counterfeit areas. To present these forgery areas more accurately, we propose an efficient and accurate algorithm, evolving circular domains coverage (ECDC), to cover them. This algorithm aims to find satisfactory threshold areas by extracting block features from jointly evolving circular domains, which are centered on the matched pairs. Finally, a morphological operation is applied to refine the detected forgery areas. The experimental results indicate that the proposed CMFD scheme achieves better detection performance under various attacks compared with other state-of-the-art CMFD schemes.
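    The keypoint stage (SIFT features, a g2NN-style generalized ratio test, and RANSAC filtering) can be prototyped with OpenCV as below; the thresholds are assumptions, and the log-polar SURF descriptor and the ECDC block stage from the abstract are not reproduced here.

```python
import cv2
import numpy as np

def copy_move_candidate_matches(gray, ratio=0.6, k=10):
    """Return geometrically consistent keypoint pairs within one grayscale image."""
    sift = cv2.SIFT_create()
    kps, des = sift.detectAndCompute(gray, None)
    matches = cv2.BFMatcher().knnMatch(des, des, k=k)

    pairs = []
    for m in matches:
        m = [x for x in m if x.distance > 1e-6]          # drop the trivial self-match
        # g2NN: accept neighbor i while d_i / d_{i+1} stays below the ratio.
        for i in range(len(m) - 1):
            if m[i].distance / (m[i + 1].distance + 1e-12) < ratio:
                pairs.append((kps[m[i].queryIdx].pt, kps[m[i].trainIdx].pt))
            else:
                break

    if len(pairs) < 4:
        return []
    src = np.float32([p for p, _ in pairs])
    dst = np.float32([q for _, q in pairs])
    _, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # filter mismatches
    return [pairs[i] for i in range(len(pairs)) if inliers is not None and inliers[i]]
```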
    PhysGNN: A Physics-Driven Graph Neural Network Based Model for Predicting Soft Tissue Deformation in Image-Guided Neurosurgery. (arXiv:2109.04352v1 [eess.IV])
    (2 min) Correctly capturing intraoperative brain shift in image-guided neurosurgical procedures is a critical task for aligning preoperative data with intraoperative geometry, ensuring effective surgical navigation and optimal surgical precision. While the finite element method (FEM) is a proven technique to effectively approximate soft tissue deformation through biomechanical formulations, its degree of success boils down to a trade-off between accuracy and speed. To circumvent this problem, the most recent works in this domain have proposed leveraging data-driven models obtained by training various machine learning algorithms, e.g. random forests and artificial neural networks (ANNs), on the results of finite element analysis (FEA) to speed up tissue deformation approximations by prediction. These methods, however, do not account for the structure of the finite element (FE) mesh during training, which provides information on node connectivity as well as the distances between nodes and can aid in approximating tissue deformation based on the proximity of force load points to the rest of the mesh nodes. Therefore, this work proposes a novel framework, PhysGNN, a data-driven model that approximates the solution of FEA by leveraging graph neural networks (GNNs), which can account for the mesh structural information and support inductive learning over unstructured grids and complex topological structures. Empirically, we demonstrate that the proposed architecture, PhysGNN, promises accurate and fast soft tissue deformation approximations while remaining computationally feasible and suitable for neurosurgical settings.
    Indoor Panorama Planar 3D Reconstruction via Divide and Conquer. (arXiv:2106.14166v2 [cs.CV] UPDATED)
    (2 min) Indoor panoramas typically consist of human-made structures parallel or perpendicular to gravity. We leverage this phenomenon to approximate the scene in a 360-degree image with (H)orizontal-planes and (V)ertical-planes. To this end, we propose an effective divide-and-conquer strategy that divides pixels based on their plane orientation estimation; the succeeding instance segmentation module then conquers the task of plane clustering more easily within each plane orientation group. In addition, the parameters of V-planes depend on camera yaw rotation, but translation-invariant CNNs are less sensitive to yaw changes. We thus propose a yaw-invariant V-planar reparameterization for CNNs to learn. We create a benchmark for indoor panorama planar reconstruction by extending existing 360 depth datasets with ground truth H\&V-planes (referred to as the PanoH&V dataset) and adopt state-of-the-art planar reconstruction methods to predict H\&V-planes as our baselines. Our method outperforms the baselines by a large margin on the proposed dataset.
    Energy Attack: On Transferring Adversarial Examples. (arXiv:2109.04300v1 [cs.LG])
    (2 min) In this work we propose Energy Attack, a transfer-based black-box $L_\infty$-adversarial attack. The attack is parameter-free and does not require gradient approximation. In particular, we first obtain white-box adversarial perturbations of a surrogate model and divide these perturbations into small patches. Then we extract the unit component vectors and eigenvalues of these patches with principal component analysis (PCA). Based on the eigenvalues, we can model the energy distribution of adversarial perturbations. We then perform black-box attacks by sampling from the perturbation patches according to their energy distribution, and tiling the sampled patches to form a full-size adversarial perturbation. This can be done without any access to victim models. Extensive experiments demonstrate that the proposed Energy Attack achieves state-of-the-art performance in black-box attacks on various models and several datasets. Moreover, the extracted distribution is able to transfer among different model architectures and different datasets, and is therefore intrinsic to vision architectures.
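    A rough sketch of the patch-energy idea described above, assuming white-box perturbations from a surrogate model are already available: PCA is fit on flattened perturbation patches, patches are sampled proportionally to the component energies (eigenvalues), and tiled into a full-size perturbation. The iterative query procedure of the actual attack is omitted.
```python
import numpy as np
from sklearn.decomposition import PCA

def build_energy_sampler(perturbations, patch=8):
    """Fit PCA on flattened patches of surrogate adversarial perturbations and
    return the unit components together with their normalized energies."""
    patches = []
    for p in perturbations:                      # p: (H, W) perturbation map
        for i in range(0, p.shape[0] - patch + 1, patch):
            for j in range(0, p.shape[1] - patch + 1, patch):
                patches.append(p[i:i + patch, j:j + patch].ravel())
    pca = PCA(n_components=min(32, patch * patch)).fit(np.array(patches))
    energy = pca.explained_variance_ / pca.explained_variance_.sum()
    return pca.components_, energy

def sample_perturbation(components, energy, shape, patch=8, eps=0.03, rng=None):
    """Tile patches sampled according to the energy distribution into a
    full-size L_inf-bounded perturbation (assumes shape divisible by patch)."""
    rng = rng or np.random.default_rng()
    out = np.zeros(shape)
    for i in range(0, shape[0], patch):
        for j in range(0, shape[1], patch):
            k = rng.choice(len(energy), p=energy)
            tile = components[k].reshape(patch, patch)
            out[i:i + patch, j:j + patch] = np.sign(tile) * eps
    return out
```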
    ConvMLP: Hierarchical Convolutional MLPs for Vision. (arXiv:2109.04454v1 [cs.CV])
    (2 min) MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach results comparable to convolutional and transformer-based methods. However, most adopt spatial MLPs which take fixed-dimension inputs, making it difficult to apply them to downstream tasks such as object detection and semantic segmentation. Moreover, single-stage designs further limit performance in other computer vision tasks, and fully connected layers bear heavy computation. To tackle these problems, we propose ConvMLP: a hierarchical Convolutional MLP for visual recognition, which is a light-weight, stage-wise co-design of convolution layers and MLPs. In particular, ConvMLP-S achieves 76.8% top-1 accuracy on ImageNet-1k with 9M parameters and 2.4G MACs (15% and 19% of MLP-Mixer-B/16, respectively). Experiments on object detection and semantic segmentation further show that the visual representation learned by ConvMLP can be seamlessly transferred and achieves competitive results with fewer parameters. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Convolutional-MLPs.
    Label Efficient Visual Abstractions for Autonomous Driving. (arXiv:2005.10091v2 [cs.CV] CROSS LISTED)
    (2 min) It is well known that semantic segmentation can be used as an effective intermediate representation for learning driving policies. However, the task of street scene semantic segmentation requires expensive annotations. Furthermore, segmentation algorithms are often trained irrespective of the actual driving task, using auxiliary image-space loss functions which are not guaranteed to maximize driving metrics such as safety or distance traveled per intervention. In this work, we seek to quantify the impact of reducing segmentation annotation costs on learned behavior cloning agents. We analyze several segmentation-based intermediate representations. We use these visual abstractions to systematically study the trade-off between annotation efficiency and driving performance, in terms of the types of classes labeled, the number of image samples used to learn the visual abstraction model, and their granularity (e.g., object masks vs. 2D bounding boxes). Our analysis uncovers several practical insights into how segmentation-based visual abstractions can be exploited in a more label-efficient manner. Surprisingly, we find that state-of-the-art driving performance can be achieved with an orders-of-magnitude reduction in annotation cost. Beyond label efficiency, we find several additional training benefits when leveraging visual abstractions, such as a significant reduction in the variance of the learned policy when compared to state-of-the-art end-to-end driving models.
    ErfAct: Non-monotonic smooth trainable Activation Functions. (arXiv:2109.04386v1 [cs.NE])
    (2 min) An activation function is a crucial component of a neural network that introduces non-linearity into the network. The state-of-the-art performance of a neural network depends on a careful choice of activation function. We propose two novel non-monotonic smooth trainable activation functions, called ErfAct-1 and ErfAct-2. Experiments suggest that the proposed functions improve network performance significantly compared to widely used activations such as ReLU, Swish, and Mish. Replacing ReLU with ErfAct-1 and ErfAct-2, we observe 5.21% and 5.04% improvements in top-1 accuracy with the PreactResNet-34 network on the CIFAR100 dataset, 2.58% and 2.76% improvements in top-1 accuracy with PreactResNet-34 on the CIFAR10 dataset, and 1.0% and 1.0% improvements in mean average precision (mAP) with the SSD300 model on the Pascal VOC dataset.
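    The abstract does not spell out the functional forms, so the snippet below only illustrates the general pattern of a smooth, trainable activation with learnable parameters in PyTorch; the actual ErfAct-1/ErfAct-2 definitions should be taken from the paper.
```python
import torch
import torch.nn as nn

class TrainableErfActivation(nn.Module):
    """Illustrative trainable smooth activation built around torch.erf.
    The parameters alpha and beta are learned jointly with the network;
    this is NOT the exact ErfAct-1/ErfAct-2 formula, only a hedged stand-in."""
    def __init__(self, alpha=1.0, beta=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))
        self.beta = nn.Parameter(torch.tensor(float(beta)))

    def forward(self, x):
        # Smooth and non-monotonic: the erf gate lets the output dip below zero
        # for moderately negative inputs, unlike ReLU.
        return x * torch.erf(self.alpha * torch.exp(self.beta * x))

# Drop-in replacement for ReLU inside an existing block, e.g.:
layer = nn.Sequential(nn.Linear(64, 64), TrainableErfActivation())
out = layer(torch.randn(4, 64))
```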
    TxT: Crossmodal End-to-End Learning with Transformers. (arXiv:2109.04422v1 [cs.CV])
    (2 min) Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.
    ISCL: Interdependent Self-Cooperative Learning for Unpaired Image Denoising. (arXiv:2102.09858v2 [cs.CV] UPDATED)
    (2 min) With the advent of advances in self-supervised learning, paired clean-noisy data are no longer required in deep learning-based image denoising. However, existing blind denoising methods still require assumptions about noise characteristics, such as a zero-mean noise distribution and pixel-wise noise-signal independence; this hinders wide adoption of these methods in the medical domain. On the other hand, unpaired learning can overcome such limitations on noise characteristics, which makes collecting training data in real-world scenarios more feasible. In this paper, we propose a novel image denoising scheme, Interdependent Self-Cooperative Learning (ISCL), that leverages unpaired learning by combining cyclic adversarial learning with self-supervised residual learning. Unlike existing unpaired image denoising methods that rely on matching data distributions in different domains, the two architectures in ISCL, designed for different tasks, complement each other and boost the learning process. To assess the performance of the proposed method, we conducted extensive experiments in various biomedical image degradation scenarios, such as noise caused by the physical characteristics of electron microscopy (EM) devices (film and charging noise), and structural noise found in low-dose computed tomography (CT). We demonstrate that the image quality of our method is superior to conventional and current state-of-the-art deep learning-based image denoising methods, including supervised learning.
    Semi-Supervised Domain Generalizable Person Re-Identification. (arXiv:2108.05045v2 [cs.CV] UPDATED)
    (2 min) Existing person re-identification (re-id) methods struggle when deployed to a new, unseen scenario despite their success in cross-camera person matching. Recent efforts have been substantially devoted to domain adaptive person re-id, where extensive unlabeled data in the new scenario are utilized in a transductive learning manner. However, for each scenario it is required to first collect enough data and then train such a domain adaptive re-id model, which restricts their practical application. Instead, we aim to explore multiple labeled datasets to learn generalized domain-invariant representations for person re-id, which are expected to be universally effective for each newly encountered re-id scenario. To pursue practicability in real-world systems, we collect all the person re-id datasets (20 datasets) in this field and select the three most frequently used datasets (i.e., Market1501, DukeMTMC, and MSMT17) as unseen target domains. In addition, we develop DataHunter, which collects over 300K weakly annotated images named YouTube-Human from YouTube street-view videos; these join the 17 remaining fully labeled datasets to form multiple source domains. On such a large and challenging benchmark called FastHuman (~440K labeled images), we further propose a simple yet effective Semi-Supervised Knowledge Distillation (SSKD) framework. SSKD effectively exploits the weakly annotated data by assigning soft pseudo labels to YouTube-Human to improve models' generalization ability. Experiments on several protocols verify the effectiveness of the proposed SSKD framework on domain generalizable person re-id, which is even comparable to supervised learning on the target domains. Lastly, and most importantly, we hope the proposed benchmark FastHuman can spur the next development of domain generalizable person re-id algorithms.
    Improving Deep Metric Learning by Divide and Conquer. (arXiv:2109.04003v1 [cs.CV])
    (2 min) Deep metric learning (DML) is a cornerstone of many computer vision applications. It aims at learning a mapping from the input domain to an embedding space, where semantically similar objects are located nearby and dissimilar objects far from one another. The target similarity on the training data is defined by the user in the form of ground-truth class labels. However, while the embedding space learns to mimic the user-provided similarity on the training data, it should also generalize to novel categories not seen during training. Besides the user-provided ground-truth training labels, many additional visual factors (such as viewpoint changes or shape peculiarities) exist and imply different notions of similarity between objects, affecting the generalization to images unseen during training. However, existing approaches usually learn a single embedding space directly on all available training data, struggle to encode all the different types of relationships, and do not generalize well. We propose to build a more expressive representation by jointly splitting the embedding space and the data hierarchically into smaller sub-parts. We successively focus on smaller subsets of the training data, reducing its variance and learning a different embedding subspace for each data subset. Moreover, the subspaces are learned jointly to cover not only the intricacies, but also the breadth of the data. Only after that do we build the final embedding from the subspaces in the conquering stage. The proposed algorithm acts as a transparent wrapper that can be placed around arbitrary existing DML methods. Our approach significantly improves upon the state-of-the-art on image retrieval, clustering, and re-identification tasks evaluated using the CUB200-2011, CARS196, Stanford Online Products, In-shop Clothes, and PKU VehicleID datasets.
    Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. (arXiv:2109.04448v1 [cs.CL])
    (2 min) Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.
    Towards Transferable Adversarial Attacks on Vision Transformers. (arXiv:2109.04176v1 [cs.CV])
    (2 min) Vision transformers (ViTs) have demonstrated impressive performance on a series of computer vision tasks, yet they still suffer from adversarial examples. In this paper, we posit that adversarial attacks on transformers should be specially tailored for their architecture, jointly considering both patches and self-attention, in order to achieve high transferability. More specifically, we introduce a dual attack framework, which contains a Pay No Attention (PNA) attack and a PatchOut attack, to improve the transferability of adversarial samples across different ViTs. We show that skipping the gradients of attention during backpropagation can generate adversarial examples with high transferability. In addition, adversarial perturbations generated by optimizing randomly sampled subsets of patches at each iteration achieve higher attack success rates than attacks using all patches. We evaluate the transferability of attacks on state-of-the-art ViTs, CNNs and robustly trained CNNs. The results of these experiments demonstrate that the proposed dual attack can greatly boost transferability between ViTs and from ViTs to CNNs. In addition, the proposed method can easily be combined with existing transfer methods to boost performance.
    Talk-to-Edit: Fine-Grained Facial Editing via Dialog. (arXiv:2109.04425v1 [cs.CV])
    (2 min) Facial editing is an important task in vision and graphics with numerous applications. However, existing works are incapable of delivering a continuous and fine-grained editing mode (e.g., editing a slightly smiling face into a big laughing one) with natural interactions with users. In this work, we propose Talk-to-Edit, an interactive facial editing framework that performs fine-grained attribute manipulation through dialog between the user and the system. Our key insight is to model a continual "semantic field" in the GAN latent space. 1) Unlike previous works that regard editing as traversing straight lines in the latent space, here fine-grained editing is formulated as finding a curving trajectory that respects the fine-grained attribute landscape on the semantic field. 2) The curvature at each step is location-specific and determined by the input image as well as the users' language requests. 3) To engage the users in a meaningful dialog, our system generates language feedback by considering both the user request and the current state of the semantic field. We also contribute CelebA-Dialog, a visual-language facial editing dataset to facilitate large-scale study. Specifically, each image is manually annotated with fine-grained attribute labels as well as template-based textual descriptions in natural language. Extensive quantitative and qualitative experiments demonstrate the superiority of our framework in terms of 1) the smoothness of fine-grained editing, 2) identity/attribute preservation, and 3) visual photorealism and dialog fluency. Notably, a user study validates that our overall system is consistently favored by around 80% of the participants. Our project page is https://www.mmlab-ntu.com/project/talkedit/.
    Tiny CNN for feature point description for document analysis: approach and dataset. (arXiv:2109.04134v1 [cs.CV])
    (2 min) In this paper, we study the problem of feature point description in the context of document analysis and template matching. Our study shows that task-specific training data is required, especially if we are to train a lightweight neural network that will be usable on devices with limited computational resources. In this paper, we construct and provide a dataset together with a method for retrieving training patches. We demonstrate the effectiveness of this data by training a lightweight neural network and show how it performs on both document and general patch matching. Training was done on the provided dataset, in comparison with the HPatches training dataset, and for testing we use the HPatches evaluation framework and two publicly available datasets with various documents pictured on complex backgrounds: MIDV-500 and MIDV-2019.
    Self Supervision to Distillation for Long-Tailed Visual Recognition. (arXiv:2109.04075v1 [cs.CV])
    (2 min) Deep learning has achieved remarkable progress in visual recognition on large-scale balanced datasets but still performs poorly on real-world long-tailed data. Previous methods often adopt class re-balanced training strategies to effectively alleviate the imbalance issue, but risk over-fitting the tail classes. The recent decoupling method overcomes this over-fitting issue with a multi-stage training scheme, yet it is still incapable of capturing tail class information in the feature learning stage. In this paper, we show that soft labels can serve as a powerful tool to incorporate label correlation into a multi-stage training scheme for long-tailed recognition. The intrinsic relation between classes embodied by soft labels turns out to be helpful for long-tailed recognition by transferring knowledge from head to tail classes. Specifically, we propose a conceptually simple yet particularly effective multi-stage training scheme, termed Self Supervision to Distillation (SSD). This scheme is composed of two parts. First, we introduce a self-distillation framework for long-tailed recognition, which can mine the label relation automatically. Second, we present a new distillation label generation module guided by self-supervision. The distilled labels integrate information from both label and data domains and can model the long-tailed distribution effectively. We conduct extensive experiments and our method achieves state-of-the-art results on three long-tailed recognition benchmarks: ImageNet-LT, CIFAR100-LT and iNaturalist 2018. Our SSD outperforms the strong LWS baseline by $2.7\%$ to $4.5\%$ on various datasets. The code is available at https://github.com/MCG-NJU/SSD-LT.
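    A minimal sketch of the soft-label distillation idea such a scheme relies on, assuming a teacher from an earlier training stage provides soft labels; the actual SSD label-generation module and multi-stage schedule are more involved than this.
```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, targets,
                                 T=2.0, alpha=0.5):
    """Combine hard-label cross-entropy with KL distillation on temperature-
    softened teacher outputs, so tail classes can benefit from the label
    correlations encoded in the soft labels."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Hypothetical usage:
student_logits = torch.randn(16, 100, requires_grad=True)
teacher_logits = torch.randn(16, 100)
targets = torch.randint(0, 100, (16,))
loss = soft_label_distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```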
    Efficient Visual Recognition with Deep Neural Networks: A Survey on Recent Advances and New Directions. (arXiv:2108.13055v2 [cs.CV] UPDATED)
    (2 min) Visual recognition is currently one of the most important and active research areas in computer vision, pattern recognition, and even the general field of artificial intelligence. It has great fundamental importance and strong industrial needs. Deep neural networks (DNNs) have largely boosted performance on many concrete tasks, with the help of large amounts of training data and new, powerful computational resources. Though recognition accuracy is usually the first concern for new methods, efficiency is actually rather important and sometimes critical for both academic research and industrial applications. Moreover, insightful views on the opportunities and challenges of efficiency are also much needed by the entire community. While general surveys on the efficiency of DNNs have been done from various perspectives, as far as we are aware, scarcely any of them has focused systematically on visual recognition, and thus it is unclear which advances are applicable to it and what else should be considered. In this paper, we present a review of recent advances, together with our suggestions on possible new directions towards improving the efficiency of DNN-related visual recognition approaches. We investigate not only the model but also the data point of view (which is not the case in existing surveys), and focus on the three most studied data types (images, videos and points). This paper attempts to provide a systematic summary via a comprehensive survey that can serve as a valuable reference and inspire both researchers and practitioners who work on visual recognition problems.
    IFBiD: Inference-Free Bias Detection. (arXiv:2109.04374v1 [cs.CV])
    (2 min) This paper is the first to explore an automatic way to detect bias in deep convolutional neural networks by simply looking at their weights. Furthermore, it is also a step towards understanding neural networks and how they work. We show that it is indeed possible to know whether a model is biased or not simply by looking at its weights, without running model inference on any specific input. We analyze how bias is encoded in the weights of deep networks through a toy example using the Colored MNIST database, and we also provide a realistic case study in gender detection from face images using state-of-the-art methods and experimental resources. To do so, we generated two databases with 36K and 48K biased models each. On the MNIST models we were able to detect whether they presented strong or low bias with more than 99% accuracy, and we were also able to classify between four levels of bias with more than 70% accuracy. For the face models, we achieved 90% accuracy in distinguishing between models biased towards Asian, Black, or Caucasian ethnicity.
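    As a hedged illustration of the general idea of classifying models from their weights alone, the sketch below flattens the weight tensors of a collection of hypothetical small models and trains an off-the-shelf classifier on bias labels; the paper's actual detector architecture is not reproduced here.
```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def weights_to_vector(model: torch.nn.Module) -> np.ndarray:
    """Flatten all parameters of a model into a single feature vector."""
    return torch.cat([p.detach().flatten() for p in model.parameters()]).numpy()

# Hypothetical collection of trained models with known bias labels (0 = low, 1 = strong).
models = [torch.nn.Linear(10, 2) for _ in range(200)]
bias_labels = np.random.randint(0, 2, size=len(models))   # placeholder labels

X = np.stack([weights_to_vector(m) for m in models])
clf = LogisticRegression(max_iter=1000).fit(X, bias_labels)
print("bias-detection accuracy on training set:", clf.score(X, bias_labels))
```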
    UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer. (arXiv:2109.04335v1 [cs.CV])
    (2 min) Most recent semantic segmentation methods adopt a U-Net framework with an encoder-decoder architecture. It is still challenging for U-Net with a simple skip connection scheme to model the global multi-scale context: 1) not every skip connection setting is effective, due to incompatible feature sets at the encoder and decoder stages; some skip connections even negatively influence segmentation performance; 2) the original U-Net performs worse than a variant without any skip connections on some datasets. Based on our findings, we propose a new segmentation framework, named UCTransNet (with a proposed CTrans module in U-Net), from the channel perspective with an attention mechanism. Specifically, the CTrans module is an alternative to the U-Net skip connections, consisting of a sub-module that conducts multi-scale Channel Cross fusion with Transformer (named CCT) and a Channel-wise Cross-Attention sub-module (named CCA) that guides the fused multi-scale channel-wise information to connect effectively to the decoder features, eliminating ambiguity. Hence, the proposed connection consisting of the CCT and CCA is able to replace the original skip connection and bridge the semantic gaps for accurate automatic medical image segmentation. The experimental results suggest that our UCTransNet produces more precise segmentation and achieves consistent improvements over the state-of-the-art for semantic segmentation across different datasets and conventional architectures involving transformer or U-shaped frameworks. Code: https://github.com/McGregorWwww/UCTransNet.
    Multilingual Audio-Visual Smartphone Dataset And Evaluation. (arXiv:2109.04138v1 [cs.CR])
    (2 min) Smartphones have been employed with biometric-based verification systems to provide security in highly sensitive applications. Audio-visual biometrics are gaining popularity due to their usability, and they are also challenging to spoof because of their multi-modal nature. In this work, we present an audio-visual smartphone dataset captured with five different recent smartphones. This new dataset contains 103 subjects captured in three different sessions covering different real-world scenarios. Three different languages are acquired in this dataset to reflect the language dependency of speaker recognition systems. These unique characteristics of this dataset will pave the way towards implementing novel state-of-the-art unimodal or audio-visual speaker recognition systems. We also report the performance of benchmarked biometric verification systems on our dataset. The robustness of the biometric algorithms is evaluated against multiple factors such as signal noise, device, and language, and against presentation attacks such as replay and synthesized signals, through extensive experiments. The obtained results raise many concerns about the generalization properties of state-of-the-art biometric methods on smartphones.
    SORNet: Spatial Object-Centric Representations for Sequential Manipulation. (arXiv:2109.03891v1 [cs.RO])
    (2 min) Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state, where the ability to reason about spatial relationships among object entities from raw sensor inputs is crucial. Prior works relying on explicit state estimation or end-to-end learning struggle with novel objects. In this work, we propose SORNet (Spatial Object-Centric Representation Network), which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest. We show that the object embeddings learned by SORNet generalize zero-shot to unseen object entities on three spatial reasoning tasks: spatial relationship classification, skill precondition classification and relative direction regression, significantly outperforming baselines. Further, we present real-world robotic experiments demonstrating the usage of the learned object embeddings in task planning for sequential manipulation.
    OSSR-PID: One-Shot Symbol Recognition in P&ID Sheets using Path Sampling and GCN. (arXiv:2109.03849v1 [cs.CV])
    (2 min) Piping and Instrumentation Diagrams (P&ID) are ubiquitous in several manufacturing and oil and gas enterprises for representing engineering schematics and equipment layout. There is an urgent need to extract and digitize information from P&IDs without the cost of annotating a varying set of symbols for each new use case. A robust one-shot learning approach for symbol recognition, i.e., localization followed by classification, would therefore go a long way towards this goal. Our method works by sampling pixels sequentially along the different contour boundaries in the image. These sampled points form paths which are used in the prototypical line diagram to construct a graph that captures the structure of the contours. Subsequently, the prototypical graphs are fed into a Dynamic Graph Convolutional Neural Network (DGCNN) which is trained to classify graphs into one of the given symbol classes. Further, we append embeddings from a ResNet-34 network trained on symbol images containing the sampled points to make the classification network more robust. Since many symbols in P&IDs are structurally very similar to each other, we utilize the ArcFace loss during DGCNN training, which helps maximize symbol class separability by producing highly discriminative embeddings. The images consist of components attached to the pipeline (a straight line), and the sampled points segregated around the symbol regions are used for the classification task. The proposed pipeline, named OSSR-PID, is fast and gives outstanding performance for recognition of symbols on a synthetic dataset of 100 P&ID diagrams. We also compare our method against prior work on a real-world private dataset of 12 P&ID sheets and obtain comparable or superior results. Remarkably, it is able to achieve such excellent performance using only one prototypical example per symbol.
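    Since the abstract highlights the ArcFace loss as a key ingredient, a compact version of the standard additive angular margin formulation is sketched below (generic ArcFace, not the authors' training code).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Standard ArcFace: cosine similarity between L2-normalized embeddings and
    class weights, with an additive angular margin m on the target class and a
    scale s, followed by cross-entropy."""
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)

# Hypothetical usage with 128-dimensional graph embeddings and 32 symbol classes.
head = ArcFaceHead(128, 32)
loss = head(torch.randn(8, 128), torch.randint(0, 32, (8,)))
```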
    Towards Robust Cross-domain Image Understanding with Unsupervised Noise Removal. (arXiv:2109.04284v1 [cs.CV])
    (3 min) Deep learning models usually require a large amount of labeled data to achieve satisfactory performance. In multimedia analysis, domain adaptation studies the problem of cross-domain knowledge transfer from a label-rich source domain to a label-scarce target domain, thus potentially alleviating the annotation requirement for deep learning models. However, we find that contemporary domain adaptation methods for cross-domain image understanding perform poorly when the source domain is noisy. Weakly Supervised Domain Adaptation (WSDA) studies the domain adaptation problem under the scenario where source data can be noisy. Prior WSDA methods remove noisy source data and align the marginal distributions across domains without considering the fine-grained semantic structure in the embedding space, which leads to class misalignment, e.g., features of cats in the target domain might be mapped near features of dogs in the source domain. In this paper, we propose a novel method, termed Noise Tolerant Domain Adaptation, for WSDA. Specifically, we adopt the cluster assumption and learn the cluster structure discriminatively with class prototypes in the embedding space. We propose to leverage the location information of the data points in the embedding space and model this location information with a Gaussian mixture model to identify noisy source data. We then design a network which incorporates the Gaussian mixture noise model as a sub-module for unsupervised noise removal and propose a novel cluster-level adversarial adaptation method which aligns unlabeled target data with the less noisy class prototypes to map the semantic structure across domains. We conduct extensive experiments to evaluate the effectiveness of our method on both general images and medical images from COVID-19 and e-commerce datasets. The results show that our method significantly outperforms state-of-the-art WSDA methods.
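    A small sketch of the Gaussian-mixture noise identification step described above, assuming per-sample distances to class prototypes in the embedding space have already been computed; the component with the larger mean distance is treated as the noisy one. The adversarial adaptation part is omitted.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

def flag_noisy_source(distances_to_prototype):
    """Fit a 2-component GMM on distances between source embeddings and their
    class prototypes; samples assigned to the high-mean component are flagged
    as likely label noise."""
    d = np.asarray(distances_to_prototype).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    noisy_component = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict(d) == noisy_component

# Hypothetical usage: clean samples cluster near their prototypes, noisy ones far away.
dist = np.concatenate([np.random.normal(0.5, 0.1, 900),
                       np.random.normal(2.0, 0.3, 100)])
mask = flag_noisy_source(dist)
print("flagged as noisy:", mask.sum())
```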
    Fine-grained Data Distribution Alignment for Post-Training Quantization. (arXiv:2109.04186v1 [cs.CV])
    (2 min) While post-training quantization is popular largely because it avoids access to the original complete training dataset, its poor performance also stems from this limitation. To alleviate this limitation, in this paper we leverage the synthetic data introduced by zero-shot quantization together with a calibration dataset, and we propose a fine-grained data distribution alignment (FDDA) method to boost the performance of post-training quantization. The method is based on two important properties of batch normalization statistics (BNS) we observed in the deep layers of the trained network, i.e., inter-class separation and intra-class incohesion. To preserve this fine-grained distribution information: 1) we calculate the per-class BNS of the calibration dataset as the BNS centers of each class and propose a BNS-centralized loss to force the synthetic data distributions of different classes to be close to their own centers; 2) we add Gaussian noise to the centers to imitate the incohesion and propose a BNS-distorted loss to force the synthetic data distribution of the same class to be close to the distorted centers. By introducing these two fine-grained losses, our method achieves state-of-the-art performance on ImageNet, especially when the first and last layers are quantized to low bit-widths as well. Our project is available at https://github.com/viperit/FDDA.
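    A hedged sketch of the two fine-grained losses described above: the per-class batch-norm statistics (mean/variance) of the calibration data serve as class centers, synthetic batches are pulled toward their own centers and toward Gaussian-distorted versions of those centers. Layer selection and loss weighting in the actual method follow the paper.
```python
import torch

def bns_centralized_loss(feat, labels, class_means, class_vars):
    """Match the batch statistics of synthetic features of each class to the
    per-class BNS centers computed on the calibration dataset."""
    loss = 0.0
    for c in labels.unique().tolist():
        f = feat[labels == c]
        loss = loss + (f.mean(0) - class_means[c]).pow(2).mean() \
                    + (f.var(0, unbiased=False) - class_vars[c]).pow(2).mean()
    return loss

def bns_distorted_loss(feat, labels, class_means, class_vars, sigma=0.1):
    """Same as above, but against centers perturbed with Gaussian noise to
    imitate intra-class incohesion."""
    noisy_means = {c: m + sigma * torch.randn_like(m) for c, m in class_means.items()}
    noisy_vars = {c: (v + sigma * torch.randn_like(v)).clamp(min=0.0)
                  for c, v in class_vars.items()}
    return bns_centralized_loss(feat, labels, noisy_means, noisy_vars)

# Hypothetical usage: 64-dim features for 3 classes, with illustrative centers.
feat = torch.randn(32, 64)
labels = torch.randint(0, 3, (32,))
means = {c: torch.zeros(64) for c in range(3)}
vars_ = {c: torch.ones(64) for c in range(3)}
total = bns_centralized_loss(feat, labels, means, vars_) + bns_distorted_loss(feat, labels, means, vars_)
```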
    Deep Hough Voting for Robust Global Registration. (arXiv:2109.04310v1 [cs.CV])
    (2 min) Point cloud registration is the task of estimating the rigid transformation that aligns a pair of point cloud fragments. We present an efficient and robust framework for pairwise registration of real-world 3D scans, leveraging Hough voting in the 6D transformation parameter space. First, deep geometric features are extracted from a point cloud pair to compute putative correspondences. We then construct a set of triplets of correspondences to cast votes on the 6D Hough space, representing the transformation parameters in sparse tensors. Next, a fully convolutional refinement module is applied to refine the noisy votes. Finally, we identify the consensus among the correspondences from the Hough space, which we use to predict our final transformation parameters. Our method outperforms state-of-the-art methods on 3DMatch and 3DLoMatch benchmarks while achieving comparable performance on KITTI odometry dataset. We further demonstrate the generalizability of our approach by setting a new state-of-the-art on ICL-NUIM dataset, where we integrate our module into a multi-way registration pipeline.
    Multi-Tensor Network Representation for High-Order Tensor Completion. (arXiv:2109.04022v1 [cs.CV])
    (2 min) This work studies the problem of completing high-dimensional data (referred to as tensors) from partially observed samplings. We consider that a tensor is a superposition of multiple low-rank components. In particular, each component can be represented as multilinear connections over several latent factors and naturally mapped to a specific tensor network (TN) topology. In this paper, we propose a fundamental tensor decomposition (TD) framework: Multi-Tensor Network Representation (MTNR), which can be regarded as a linear combination of a range of TD models, e.g., CANDECOMP/PARAFAC (CP) decomposition, Tensor Train (TT), and Tensor Ring (TR). Specifically, MTNR represents a high-order tensor as the addition of multiple TN models, and the topology of each TN is generated automatically instead of being manually pre-designed. For the optimization phase, an adaptive topology learning (ATL) algorithm is presented to obtain latent factors of each TN based on a rank incremental strategy and a projection error measurement strategy. In addition, we theoretically establish the fundamental multilinear operations for tensors with TN representation, and reveal the structural transformation of MTNR to a single TN. Finally, MTNR is applied to a typical task, tensor completion, and two effective algorithms are proposed for the exact recovery of incomplete data based on the Alternating Least Squares (ALS) scheme and the Alternating Direction Method of Multipliers (ADMM) framework. Extensive numerical experiments on synthetic data and real-world datasets demonstrate the effectiveness of MTNR compared with the state-of-the-art methods.
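    As a simplified, hedged stand-in for the completion task, the snippet below recovers a 3rd-order tensor from partial observations with a plain CP (CANDECOMP/PARAFAC) factorization optimized by gradient descent; MTNR itself combines multiple automatically generated TN topologies and uses ALS/ADMM solvers rather than this toy setup.
```python
import torch

def cp_completion(observed, mask, rank=8, steps=2000, lr=0.05):
    """Complete a 3rd-order tensor from observed entries (mask == 1) using a
    rank-R CP model T ~= sum_r a_r (x) b_r (x) c_r, fitted by Adam on the
    observed entries only."""
    I, J, K = observed.shape
    A = torch.randn(I, rank, requires_grad=True)
    B = torch.randn(J, rank, requires_grad=True)
    C = torch.randn(K, rank, requires_grad=True)
    opt = torch.optim.Adam([A, B, C], lr=lr)
    for _ in range(steps):
        recon = torch.einsum("ir,jr,kr->ijk", A, B, C)
        loss = ((recon - observed) * mask).pow(2).sum() / mask.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.einsum("ir,jr,kr->ijk", A, B, C).detach()

# Hypothetical usage: recover a low-rank tensor with 30% of entries observed.
T = torch.einsum("ir,jr,kr->ijk", torch.randn(10, 3), torch.randn(12, 3), torch.randn(14, 3))
mask = (torch.rand_like(T) < 0.3).float()
T_hat = cp_completion(T, mask)
```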
    HSMD: An object motion detection algorithm using a Hybrid Spiking Neural Network Architecture. (arXiv:2109.04119v1 [cs.CV])
    (2 min) The detection of moving objects is a trivial task performed by vertebrate retinas, yet a complex computer vision task. Object-motion-sensitive ganglion cells (OMS-GC) are specialised cells in the retina that sense moving objects. OMS-GC take continuous signals as input and produce spike patterns as output, which are transmitted to the visual cortex via the optic nerve. The Hybrid Sensitive Motion Detector (HSMD) algorithm proposed in this work enhances the GSOC dynamic background subtraction (DBS) algorithm with a customised 3-layer spiking neural network (SNN) that outputs spiking responses akin to those of the OMS-GC. The algorithm was compared against existing background subtraction (BS) approaches available in the OpenCV library, specifically on the 2012 change detection (CDnet2012) and the 2014 change detection (CDnet2014) benchmark datasets. The results show that HSMD was ranked first overall among the competing approaches and performed better than all the other algorithms on four of the categories across all eight test metrics. Furthermore, the HSMD proposed in this paper is the first to use an SNN to enhance an existing state-of-the-art DBS (GSOC) algorithm, and the results demonstrate that the SNN provides near real-time performance in realistic applications.
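    The GSOC dynamic background subtraction algorithm that HSMD builds on is available in OpenCV's contrib module; a basic usage sketch (without the SNN enhancement, which is specific to the paper) might look like the following.
```python
import cv2

# Requires opencv-contrib-python; createBackgroundSubtractorGSOC lives in cv2.bgsegm.
cap = cv2.VideoCapture("input.mp4")          # hypothetical input video
bs = cv2.bgsegm.createBackgroundSubtractorGSOC()

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = bs.apply(frame)                # foreground mask from the GSOC DBS
    # The HSMD paper feeds masks like this into a 3-layer spiking network to
    # mimic object-motion-sensitive ganglion cell responses (not shown here).
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```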
    Application of the Singular Spectrum Analysis on electroluminescence images of thin-film photovoltaic modules. (arXiv:2109.04048v1 [cs.CV])
    (2 min) This paper discusses an application of the singular spectrum analysis method (SSA) in the context of electroluminescence (EL) images of thin-film photovoltaic (PV) modules. We propose an EL image decomposition as a sum of three components: global intensity, cell, and aperiodic components. A parametric model of the extracted signal is used to perform several image processing tasks. The cell component is used to identify interconnection lines between PV cells at sub-pixel accuracy, as well as to correct incorrect stitching of EL images. Furthermore, an explicit expression of the cell component signal is used to estimate the inverse characteristic length, a physical parameter related to the resistances in a PV module.
    Towards Fully Automated Segmentation of Rat Cardiac MRI by Leveraging Deep Learning Frameworks. (arXiv:2109.04188v1 [eess.IV])
    (2 min) Automated segmentation of human cardiac magnetic resonance datasets has been steadily improving in recent years. However, these methods are not directly applicable in the preclinical context due to limited datasets and lower image resolution. Successful application of deep architectures for rat cardiac segmentation, although of critical importance for preclinical evaluation of cardiac function, has to our knowledge not yet been reported. We developed segmentation models that expand on the standard U-Net architecture and evaluated separate models for the systole and diastole phases (2MSA) and one model for all timepoints (1MSA). Furthermore, we calibrated model outputs using a Gaussian Process (GP)-based prior to improve phase selection. The resulting models approach human performance in terms of left ventricular segmentation quality and ejection fraction (EF) estimation in both the 1MSA and 2MSA settings (S{\o}rensen-Dice score 0.91 +/- 0.072 and 0.93 +/- 0.032, respectively). 2MSA achieved a mean absolute difference between estimated and reference EF of 3.5 +/- 2.5 %, while 1MSA resulted in 4.1 +/- 3.0 %. Applying Gaussian Processes to 1MSA allows the selection of systole and diastole phases to be automated. Combined with a novel cardiac phase selection strategy, our work presents an important first step towards a fully automated segmentation pipeline in the context of rat cardiac analysis.
    Automated LoD-2 Model Reconstruction from Very-High-Resolution Satellite-derived Digital Surface Model and Orthophoto. (arXiv:2109.03876v1 [cs.CV])
    (2 min) In this paper, we propose a model-driven method that reconstructs LoD-2 building models following a "decomposition-optimization-fitting" paradigm. The proposed method starts from building detection results produced by a deep learning-based detector and vectorizes individual segments into polygons using a "three-step" polygon extraction method, followed by a novel grid-based decomposition method that decomposes the complex and irregularly shaped building polygons into tightly combined elementary building rectangles ready to fit elementary building models. We optionally introduce OpenStreetMap (OSM) and Graph-Cut (GC) labeling to further refine the orientation of the 2D building rectangles. The 3D modeling step takes building-specific parameters such as hip lines, as well as non-rigid and regularized transformations, to optimize the flexibility of using a minimal set of elementary models. Finally, the roof type of the building models is refined and adjacent building models in one building segment are merged into a complex polygonal model. Our proposed method addresses a few technical caveats of existing methods, resulting in practically high-quality results, based on our evaluation and comparative study on a diverse set of experimental datasets of cities with different urban patterns.
    Learning Cross-Scale Visual Representations for Real-Time Image Geo-Localization. (arXiv:2109.04087v1 [cs.CV])
    (2 min) Robot localization remains a challenging task in GPS-denied environments. State estimation approaches based on local sensors, e.g. cameras or IMUs, are prone to drift on long-range missions as error accumulates. In this study, we aim to address this problem by localizing image observations in a 2D multi-modal geospatial map. We introduce the cross-scale dataset and a methodology to produce additional data from cross-modality sources. We propose a framework that learns cross-scale visual representations without supervision. Experiments are conducted on data from two different domains, underwater and aerial. In contrast to existing studies in cross-view image geo-localization, our approach a) performs better on smaller-scale multi-modal maps; b) is more computationally efficient for real-time applications; c) can serve directly in concert with state estimation pipelines.
    Taming Self-Supervised Learning for Presentation Attack Detection: In-Image De-Folding and Out-of-Image De-Mixing. (arXiv:2109.04100v1 [cs.CV])
    (2 min) Biometric systems are vulnerable to Presentation Attacks (PA) performed using various Presentation Attack Instruments (PAIs). Even though there are numerous Presentation Attack Detection (PAD) techniques based on both deep learning and hand-crafted features, the generalization of PAD to unknown PAIs is still a challenging problem. A common problem with existing deep learning-based PAD techniques is that they may struggle with local optima, resulting in weak generalization against different PAs. In this work, we propose to use self-supervised learning to find a reasonable initialization against the local-optimum trap, so as to improve the generalization ability in detecting PAs on the biometric system. The proposed method, denoted as IF-OM, is based on a global-local view coupled with De-Folding and De-Mixing to derive the task-specific representation for PAD. During De-Folding, the proposed technique learns region-specific features to represent samples in a local pattern by explicitly maximizing cycle consistency. Meanwhile, De-Mixing drives detectors to obtain instance-specific features with global information for a more comprehensive representation by maximizing topological consistency. Extensive experimental results show that the proposed method can achieve significant improvements in terms of both face and fingerprint PAD on more complicated and hybrid datasets, when compared with state-of-the-art methods. Specifically, when training on CASIA-FASD and Idiap Replay-Attack, the proposed method can achieve an 18.60% Equal Error Rate (EER) on OULU-NPU and MSU-MFSD, exceeding baseline performance by 9.54%. Code will be made publicly available.
  • cs.IR updates on arXiv.org

    Self-supervised Representation Learning for Trip Recommendation. (arXiv:2109.00968v2 [cs.IR] UPDATED)
    (2 min) Trip recommendation is a significant and engaging location-based service that can help new tourists make more customized travel plans. It often attempts to suggest a sequence of points of interest (POIs) for a user who issues a personalized travel request. Conventional methods either leverage heuristic algorithms (e.g., dynamic programming) or statistical analysis (e.g., Markov models) to search or rank a POI sequence. These procedures may fail to capture the diversity of human needs and transitional regularities. They may even provide recommendations that deviate from tourists' real travel intentions when the trip data is sparse. Although recent deep recursive models (e.g., RNNs) are capable of alleviating these concerns, existing solutions hardly recognize practical realities such as the diversity of tourist demands, uncertainties in trip generation, and complex visiting preferences. Inspired by advances in deep learning, we introduce a novel self-supervised representation learning framework for trip recommendation -- SelfTrip, aiming at tackling the aforementioned challenges. Specifically, we propose a two-step contrastive learning mechanism concerning POI representation as well as trip representation. Furthermore, we present four trip augmentation methods to capture the visiting uncertainties in trip planning. We evaluate SelfTrip on four real-world datasets, and extensive results demonstrate promising gains compared with several cutting-edge benchmarks, e.g., up to 4% and 12% on F1 and pair-F1, respectively.
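    A minimal sketch of the contrastive piece of such a framework: an InfoNCE-style loss over two augmented views of trip representations. The concrete POI/trip augmentations and the two-step scheme are described in the paper, and the names here are illustrative.
```python
import torch
import torch.nn.functional as F

def info_nce(view_a, view_b, temperature=0.1):
    """Contrastive loss over two augmented views of the same batch of trip
    embeddings: matching rows are positives, all other rows are negatives."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

# Hypothetical usage: 32 trips encoded twice under different trip augmentations.
emb_a = torch.randn(32, 128, requires_grad=True)
emb_b = torch.randn(32, 128, requires_grad=True)
loss = info_nce(emb_a, emb_b)
loss.backward()
```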
    CMML: Contextual Modulation Meta Learning for Cold-Start Recommendation. (arXiv:2108.10511v3 [cs.IR] UPDATED)
    (2 min) Practical recommender systems experience a cold-start problem when observed user-item interactions in the history are insufficient. Meta learning, especially gradient-based meta learning, can be adopted to tackle this problem by learning initial parameters of the model, thus allowing fast adaptation to a specific task from limited data examples. Though it brings significant performance improvements, it commonly suffers from two critical issues: incompatibility with mainstream industrial deployment and heavy computational burdens, both due to the inner-loop gradient operation. These two issues make it hard to apply in practical recommender systems. To enjoy the benefits of the meta learning framework while mitigating these problems, we propose a recommendation framework called Contextual Modulation Meta Learning (CMML). CMML is composed of fully feed-forward operations, so it is computationally efficient and completely compatible with mainstream industrial deployment. CMML consists of three components: a context encoder that can generate context embeddings to represent a specific task, a hybrid context generator that aggregates specific user-item features with task-level context, and a contextual modulation network, which can modulate the recommendation model to adapt effectively. We validate our approach on both scenario-specific and user-specific cold-start settings on various real-world datasets, showing that CMML can achieve comparable or even better performance than gradient-based methods with much higher computational efficiency and better interpretability.
    Compression Network with Transformer for Approximate Nearest Neighbor Search. (arXiv:2107.14415v2 [cs.IR] UPDATED)
    (2 min) We propose a generic feature compression method for Approximate Nearest Neighbor Search (ANNS) problems, which speeds up existing ANNS methods in a plug-and-play manner. Specifically, we propose a new network structure called Compression Network with Transformer (CNT) to compress the feature into a low-dimensional space, and an inhomogeneous neighborhood relationship preserving (INRP) loss that aims to maintain high search accuracy. In CNT, we use multiple compression projections to cast the feature into many low-dimensional spaces, and then use a transformer to globally optimize these projections such that the features are well compressed following the guidance from our loss function. The loss function is designed to assign high weights to point pairs that are close in the original feature space and to preserve their distances in the projected space. Keeping these distances helps maintain the eventual top-k retrieval accuracy, while down-weighting the other pairs creates room for feature compression. In experiments, we run our compression method on public datasets, and use the compressed features in graph-based, product quantization and scalar quantization based ANNS solutions. Experimental results show that our compression method can significantly improve the efficiency of these methods while preserving or even improving search accuracy, suggesting its broad potential impact on real-world applications.
    A Gumbel-based activation function for imbalanced datasets. (arXiv:2012.05009v2 [cs.IR] UPDATED)
    (2 min) Rating prediction is a core problem in recommender systems for quantifying users' preferences towards different items. Due to the imbalanced rating distributions in training data, existing recommendation methods suffer from the biased prediction problem, generating biased prediction results. Thus, their performance on predicting ratings which rarely appear in training data is unsatisfactory. In this paper, inspired by the superior capability of Extreme Value Distribution (EVD)-based methods in modeling the distribution of rare data, we propose a novel Gumbel Distribution-based Rating Prediction framework (GRP) which can accurately predict both frequent and rare ratings between users and items. In our approach, we first define different Gumbel distributions for each rating level, which can be learned from the historical rating statistics of users and items. Second, we incorporate the Gumbel-based representations of users and items with their original representations learned from the rating matrix and/or reviews to enrich the representations of users and items via a proposed multi-scale convolutional fusion layer. Third, we propose a data-driven rating prediction module to predict the ratings of user-item pairs. It is worth noting that our approach can be readily applied to existing recommendation methods to address their biased prediction problem. To verify the effectiveness of GRP, we conduct extensive experiments on eight benchmark datasets. Compared with several baseline models, the results show that: 1) GRP achieves state-of-the-art overall performance on all eight datasets; 2) GRP makes a substantial improvement in predicting rare ratings, which shows the effectiveness of our model in addressing the biased prediction problem.
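    The Gumbel distribution at the core of GRP has a simple closed-form CDF, F(x) = exp(-exp(-(x - mu)/beta)); a hedged sketch of using per-rating-level Gumbel CDFs over a predicted score is shown below. The actual GRP fusion layers and the learning of mu, beta from rating statistics are not reproduced here.
```python
import torch

def gumbel_cdf(x, mu, beta):
    """CDF of the Gumbel distribution: exp(-exp(-(x - mu) / beta))."""
    return torch.exp(-torch.exp(-(x - mu) / beta))

def rating_level_probs(score, mus, betas):
    """Turn a scalar user-item score into (normalized) probabilities for each
    rating level, one Gumbel distribution per level; mus/betas could be learned
    from historical rating statistics as in GRP."""
    probs = torch.stack([gumbel_cdf(score, m, b) for m, b in zip(mus, betas)], dim=-1)
    return probs / probs.sum(dim=-1, keepdim=True)

# Hypothetical usage: five rating levels with illustrative parameters.
score = torch.tensor([0.3, 1.2, 2.5])
mus = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
betas = torch.full((5,), 0.7)
print(rating_level_probs(score, mus, betas))
```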
    Mining Points of Interest via Address Embeddings: An Unsupervised Approach. (arXiv:2109.04467v1 [cs.LG])
    (2 min) Digital maps are commonly used across the globe for exploring places that users are interested in, commonly referred to as points of interest (PoI). In online food delivery platforms, PoIs could represent any major private compounds where customers could order from such as hospitals, residential complexes, office complexes, educational institutes and hostels. In this work, we propose an end-to-end unsupervised system design for obtaining polygon representations of PoIs (PoI polygons) from address locations and address texts. We preprocess the address texts using locality names and generate embeddings for the address texts using a deep learning-based architecture, viz. RoBERTa, trained on our internal address dataset. The PoI candidates are identified by jointly clustering the anonymised customer phone GPS locations (obtained during address onboarding) and the embeddings of the address texts. The final list of PoI polygons is obtained from these PoI candidates using novel post-processing steps. This algorithm identified 74.8 % more PoIs than those obtained using the Mummidi-Krumm baseline algorithm run on our internal dataset. The proposed algorithm achieves a median area precision of 98 %, a median area recall of 8 %, and a median F-score of 0.15. In order to improve the recall of the algorithmic polygons, we post-process them using building footprint polygons from the OpenStreetMap (OSM) database. The post-processing algorithm involves reshaping the algorithmic polygon using intersecting polygons and closed private roads from the OSM database, and accounting for intersection with public roads on the OSM database. We achieve a median area recall of 70 %, a median area precision of 69 %, and a median F-score of 0.69 on these post-processed polygons.
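    A toy sketch of the joint clustering step described above: GPS coordinates and address-text embeddings are concatenated (after scaling) and clustered with DBSCAN, and each cluster's convex hull serves as a crude PoI polygon. The actual system's features, scaling, and post-processing with OSM footprints are richer than this.
```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from scipy.spatial import ConvexHull

def poi_candidates(gps, text_emb, eps=0.5, min_samples=10):
    """Cluster addresses by concatenating scaled GPS locations with scaled text
    embeddings, then return a convex-hull polygon per cluster as a PoI candidate."""
    features = np.hstack([StandardScaler().fit_transform(gps),
                          StandardScaler().fit_transform(text_emb)])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    polygons = {}
    for c in set(labels) - {-1}:                      # -1 is DBSCAN noise
        pts = gps[labels == c]
        if len(pts) >= 3:
            polygons[c] = pts[ConvexHull(pts).vertices]
    return polygons

# Hypothetical usage with random stand-ins for GPS points and RoBERTa embeddings.
gps = np.random.randn(500, 2) * 0.01
text_emb = np.random.randn(500, 16)
print(len(poi_candidates(gps, text_emb)))
```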
    QUINT: Node embedding using network hashing. (arXiv:2109.04206v1 [cs.SI])
    (2 min) Representation learning using network embedding has received tremendous attention due to its efficacy in solving downstream tasks. Popular embedding methods (such as deepwalk, node2vec, LINE) are based on a neural architecture and thus are unable to scale to large networks in terms of both time and space usage. Recently, we proposed BinSketch, a sketching technique for compressing binary vectors to binary vectors. In this paper, we show how to extend BinSketch and use it for network hashing. Our proposal, named QUINT, is built upon BinSketch, and it embeds nodes of a sparse network into a low-dimensional space using simple bit-wise operations. QUINT is the first of its kind to provide tremendous gains in terms of speed and space usage without compromising much on the accuracy of downstream tasks. Extensive experiments are conducted to compare QUINT with seven state-of-the-art network embedding methods on two end tasks - link prediction and node classification. We observe huge performance gains for QUINT in terms of speedup (up to 7000x) and space saving (up to 800x) due to its bit-wise approach to obtaining node embeddings. Moreover, QUINT is a consistent top performer on both tasks among the baselines across all the datasets. Our empirical observations are backed by rigorous theoretical analysis to justify the effectiveness of QUINT. In particular, we prove that QUINT retains enough structural information which can be used further to approximate many topological properties of networks with high confidence.
    MATE: Multi-view Attention for Table Transformer Efficiency. (arXiv:2109.04312v1 [cs.CL])
    (2 min) This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly in both speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state-of-the-art for three table reasoning datasets. For HybridQA (Chen et al., 2020b), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
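    A rough sketch of the row/column sparsity pattern described above: given per-token row and column ids for a table, row-oriented heads are only allowed to attend within the same row and column-oriented heads within the same column. The real MATE implementation uses more efficient bucketed attention rather than dense masks, so this is illustrative only.
```python
import torch

def table_attention_masks(row_ids, col_ids):
    """Build boolean attention masks (True = may attend) for row-oriented and
    column-oriented heads from per-token row/column indices; index 0 is
    reserved here for non-table text tokens that attend everywhere."""
    same_row = row_ids[:, None] == row_ids[None, :]
    same_col = col_ids[:, None] == col_ids[None, :]
    is_text = (row_ids == 0)[:, None] | (row_ids == 0)[None, :]
    return same_row | is_text, same_col | is_text

# Hypothetical 6-token example: 2 text tokens followed by a 2x2 table.
row_ids = torch.tensor([0, 0, 1, 1, 2, 2])
col_ids = torch.tensor([0, 0, 1, 2, 1, 2])
row_mask, col_mask = table_attention_masks(row_ids, col_ids)
# Apply e.g. as: scores.masked_fill(~row_mask, float("-inf")) before softmax.
```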
    NU:BRIEF -- A Privacy-aware Newsletter Personalization Engine for Publishers. (arXiv:2109.03955v1 [cs.DL])
    (2 min) Newsletters have (re-)emerged as a powerful tool for publishers to engage with their readers directly and more effectively. Despite the diversity of their audiences, publishers' newsletters remain largely a one-size-fits-all offering, which is suboptimal. In this paper, we present NU:BRIEF, a web application for publishers that enables them to personalize their newsletters without harvesting personal data. Personalized newsletters build a habit and become a great conversion tool for publishers, providing an alternative reader-generated revenue model to a declining ad/clickbait-centered business model.
    Detecting and Mitigating Test-time Failure Risks via Model-agnostic Uncertainty Learning. (arXiv:2109.04432v1 [cs.LG])
    (2 min) Reliably predicting potential failure risks of machine learning (ML) systems when deployed with production data is a crucial aspect of trustworthy AI. This paper introduces Risk Advisor, a novel post-hoc meta-learner for estimating failure risks and predictive uncertainties of any already-trained black-box classification model. In addition to providing a risk score, the Risk Advisor decomposes the uncertainty estimates into aleatoric and epistemic uncertainty components, thus giving informative insights into the sources of uncertainty inducing the failures. Consequently, Risk Advisor can distinguish between failures caused by data variability, data shifts and model limitations and advise on mitigation actions (e.g., collecting more data to counter data shift). Extensive experiments on various families of black-box classification models and on real-world and synthetic datasets covering common ML failure scenarios show that the Risk Advisor reliably predicts deployment-time failure risks in all the scenarios, and outperforms strong baselines.
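    The abstract does not give the meta-learner's exact construction; the following hedged sketch conveys the general idea only: fit an ensemble of post-hoc models that predict the black-box model's errors, then read overall risk from the ensemble mean, epistemic uncertainty from ensemble disagreement, and aleatoric uncertainty from the members' own predictive noise.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_risk_advisor(X_val, base_preds, y_val, n_members=5):
    """Fit an ensemble of meta-models that predict whether the base model errs."""
    errors = (base_preds != y_val).astype(int)       # 1 where the black-box model failed
    rng = np.random.default_rng(0)
    members = []
    for m in range(n_members):
        idx = rng.choice(len(X_val), size=len(X_val), replace=True)   # bootstrap resample
        members.append(GradientBoostingClassifier(random_state=m).fit(X_val[idx], errors[idx]))
    return members

def risk_scores(members, X):
    """Return overall risk plus a rough aleatoric/epistemic decomposition."""
    p = np.stack([m.predict_proba(X)[:, 1] for m in members])   # (n_members, n_samples)
    risk = p.mean(axis=0)                    # expected failure probability
    epistemic = p.var(axis=0)                # disagreement between meta-models
    aleatoric = (p * (1 - p)).mean(axis=0)   # average Bernoulli noise of each member
    return risk, aleatoric, epistemic

# Toy usage: a black-box model that errs on roughly 30% of a synthetic validation set.
rng = np.random.default_rng(1)
X_val = rng.normal(size=(400, 5))
y_val = (X_val[:, 0] > 0).astype(int)
base_preds = np.where(rng.random(400) < 0.3, 1 - y_val, y_val)   # simulate mistakes
members = fit_risk_advisor(X_val, base_preds, y_val)
risk, alea, epis = risk_scores(members, X_val[:10])
print(risk.round(2))
```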
    Double-Scale Self-Supervised Hypergraph Learning for Group Recommendation. (arXiv:2109.04200v1 [cs.IR])
    (2 min) With the prevalence of social media, there has recently been a proliferation of recommenders that shift their focus from individual modeling to group recommendation. Since the group preference is a mixture of various predilections from group members, the fundamental challenge of group recommendation is to model the correlations among members. Existing methods mostly adopt heuristic or attention-based preference aggregation strategies to synthesize group preferences. However, these models mainly focus on the pairwise connections of users and ignore the complex high-order interactions within and beyond groups. Besides, group recommendation suffers seriously from the problem of data sparsity due to severely sparse group-item interactions. In this paper, we propose a self-supervised hypergraph learning framework for group recommendation to achieve two goals: (1) capturing the intra- and inter-group interactions among users; (2) alleviating the data sparsity issue with the raw data itself. Technically, for (1), a hierarchical hypergraph convolutional network based on the user- and group-level hypergraphs is developed to model the complex tuplewise correlations among users within and beyond groups. For (2), we design a double-scale node dropout strategy to create self-supervision signals that can regularize user representations with different granularities against the sparsity issue. The experimental analysis on multiple benchmark datasets demonstrates the superiority of the proposed model and also elucidates the rationality of the hypergraph modeling and the double-scale self-supervision.
    Knowledge mining of unstructured information: application to cyber-domain. (arXiv:2109.03848v1 [cs.CR])
    (2 min) Cyber intelligence is widely and abundantly available in numerous open online sources with reports on vulnerabilities and incidents. This constant stream of noisy information requires new tools and techniques if it is to be used for the benefit of analysts and investigators in various organizations. In this paper we present and implement a novel knowledge graph and knowledge mining framework for extracting relevant information from free-form text about incidents in the cyber domain. Our framework includes a machine learning based pipeline as well as crawling methods for generating graphs of entities, attackers and the related information with our non-technical cyber ontology. We test our framework on publicly available cyber incident datasets to evaluate the accuracy of our knowledge mining methods as well as the usefulness of the framework to cyber analysts. Our results show that, by analyzing the knowledge graph constructed with the proposed framework, an analyst can infer additional information from the current cyber landscape in terms of risk to various entities and the propagation of risk between industries and countries. Expanding the framework to accommodate more technical and operational level information can increase the accuracy and explainability of trends and risk in the knowledge graph.
    Follow the guides: disentangling human and algorithmic curation in online music consumption. (arXiv:2109.03915v1 [cs.CY])
    (3 min) The role of recommendation systems in the diversity of content consumption on platforms is a much-debated issue. The quantitative state of the art often overlooks the existence of individual attitudes toward guidance, and, by extension, of different categories of users in this regard. Focusing on the case of music streaming, we analyze the complete listening history of about 9k users over one year and demonstrate that there is no blanket answer to the intertwinement of recommendation use and consumption diversity: it depends on users. First, we compute for each user the relative importance of different access modes within their listening history, introducing a trichotomy distinguishing so-called `organic' use from algorithmic and editorial guidance. We thereby identify four categories of users. We then focus on two scales related to content diversity, both in terms of dispersion -- how much users consume the same content repeatedly -- and popularity -- how popular the content they consume is. We show that the two types of recommendation offered by music platforms -- algorithmic and editorial -- may drive the consumption of more or less diverse content in opposite directions, depending also strongly on the type of users. Finally, we compare users' streaming histories with the music programming of a selection of popular French radio stations during the same period. While radio programs are usually more tilted toward repetition than users' listening histories, they often program more songs from less popular artists. On the whole, our results highlight the nontrivial effects of platform-mediated recommendation on consumption, and lead us to speak of `filter niches' rather than `filter bubbles'. They hint at further ramifications for the study and design of recommendation systems.
    Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction. (arXiv:2109.03821v1 [cs.IR])
    (2 min) Compliments and concerns in reviews are valuable for understanding users' shopping interests and their opinions with respect to specific aspects of certain items. Existing review-based recommenders favor large and complex language encoders that can only learn latent and uninterpretable text representations. They lack explicit user attention and item property modeling, which however could provide valuable information beyond the ability to recommend items. Therefore, we propose a tightly coupled two-stage approach, including an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). Unsupervised ASPE mines Aspect-Sentiment pairs (AS-pairs) and APRE predicts ratings using AS-pairs as concrete aspect-level evidence. Extensive experiments on seven real-world Amazon Review Datasets demonstrate that ASPE can effectively extract AS-pairs which enable APRE to deliver superior accuracy over the leading baselines.
  • cs.LG updates on arXiv.org

    People Still Care About Facts: Twitter Users Engage More with Factual Discourse than Misinformation--A Comparison Between COVID and General Narratives on Twitter. (arXiv:2012.02164v3 [cs.SI] UPDATED)
    (2 min) Misinformation entails the dissemination of falsehoods that leads to the slow fracturing of society via decreased trust in democratic processes, institutions, and science. The public has grown aware of the role of social media as a superspreader of untrustworthy information, where even pandemics have not been immune. In this paper, we focus on COVID-19 misinformation and examine a subset of 2.1M tweets to understand misinformation as a function of engagement, tweet content (COVID-19- vs. non-COVID-19-related), and veracity (misleading or factual). Using correlation analysis, we show the most relevant feature subsets among over 126 features that most heavily correlate with misinformation or facts. We found that (i) factual tweets, regardless of whether COVID-related, were more engaging than misinformation tweets; and (ii) features that most heavily correlated with engagement varied depending on the veracity and content of the tweet.
    Extreme Bandits using Robust Statistics. (arXiv:2109.04433v1 [stat.ML])
    (2 min) We consider a multi-armed bandit problem motivated by situations where only the extreme values, as opposed to expected values in the classical bandit setting, are of interest. We propose distribution free algorithms using robust statistics and characterize the statistical properties. We show that the provided algorithms achieve vanishing extremal regret under weaker conditions than existing algorithms. Performance of the algorithms is demonstrated for the finite-sample setting using numerical experiments. The results show superior performance of the proposed algorithms compared to the well known algorithms.
    Detection of Epileptic Seizures on EEG Signals Using ANFIS Classifier, Autoencoders and Fuzzy Entropies. (arXiv:2109.04364v1 [eess.SP])
    (2 min) Epilepsy is one of the most crucial neurological disorders, and its early diagnosis will help clinicians provide accurate treatment for patients. Electroencephalogram (EEG) signals are widely used for epileptic seizure detection, as they provide specialists with substantial information about the functioning of the brain. In this paper, a novel diagnostic procedure using fuzzy theory and deep learning techniques is introduced. The proposed method is evaluated on the Bonn University dataset with six classification combinations and also on the Freiburg dataset. The tunable-Q wavelet transform (TQWT) is employed to decompose the EEG signals into different sub-bands. In the feature extraction step, 13 different fuzzy entropies are calculated from the TQWT sub-bands, and their computational complexities are reported to help researchers choose the best feature sets. Next, an autoencoder (AE) with six layers is employed for dimensionality reduction. Finally, the standard adaptive neuro-fuzzy inference system (ANFIS), as well as its variants with the grasshopper optimization algorithm (ANFIS-GOA), particle swarm optimization (ANFIS-PSO), and breeding swarm optimization (ANFIS-BS), are used for classification. With the proposed method, ANFIS-BS obtains an accuracy of 99.74% in two-class classification and 99.46% in ternary classification on the Bonn dataset, and 99.28% on the Freiburg dataset, reaching state-of-the-art performance on both.
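    Since fuzzy entropies are the core features here, a compact reference implementation of one standard fuzzy entropy (exponential membership over Chebyshev distances between templates) may be useful; it is a generic sketch with illustrative parameters, not any of the paper's thirteen specific variants.

```python
import numpy as np

def fuzzy_entropy(x, m=2, r=0.2, n=2):
    """Fuzzy entropy of a 1-D signal x.

    m : embedding dimension, r : tolerance (as a fraction of the signal's std),
    n : exponent of the exponential membership function.
    """
    x = np.asarray(x, dtype=float)
    r = r * x.std()

    def phi(dim):
        # Overlapping templates of length `dim`, each centred by removing its own mean.
        templates = np.array([x[i:i + dim] for i in range(len(x) - m)])
        templates = templates - templates.mean(axis=1, keepdims=True)
        # Chebyshev distance between every pair of distinct templates.
        d = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        mask = ~np.eye(len(templates), dtype=bool)
        # Average exponential membership (similarity) over all pairs.
        return np.exp(-(d[mask] ** n) / r).mean()

    return np.log(phi(m)) - np.log(phi(m + 1))

sig = np.sin(np.linspace(0, 8 * np.pi, 500)) + 0.1 * np.random.default_rng(0).normal(size=500)
print(fuzzy_entropy(sig))
```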
    Learning from Uneven Training Data: Unlabeled, Single Label, and Multiple Labels. (arXiv:2109.04408v1 [cs.CL])
    (2 min) Training NLP systems typically assumes access to annotated data that has a single human label per example. Given imperfect labeling from annotators and the inherent ambiguity of language, we hypothesize that a single label is not sufficient to learn the spectrum of language interpretation. We explore new label annotation distribution schemes, assigning multiple labels per example for a small subset of training examples. Introducing such multi-label examples at the cost of annotating fewer examples brings clear gains on natural language inference and entity typing tasks, even when we simply train first on single-label data and then fine-tune on multi-label examples. Extending a MixUp data augmentation framework, we propose a learning algorithm that can learn from uneven training examples (with zero, one, or multiple labels). This algorithm efficiently combines signals from uneven training data and brings additional gains in low-annotation-budget and cross-domain settings. Together, our method achieves consistent gains in both accuracy and label distribution metrics on two tasks, suggesting that training with uneven data can be beneficial for many NLP tasks.
    Neural Latents Benchmark '21: Evaluating latent variable models of neural population activity. (arXiv:2109.04463v1 [cs.LG])
    (2 min) Advances in neural recording present increasing opportunities to study neural activity in unprecedented detail. Latent variable models (LVMs) are promising tools for analyzing this rich activity across diverse neural systems and behaviors, as LVMs do not depend on known relationships between the activity and external experimental variables. However, progress in latent variable modeling is currently impeded by a lack of standardization, resulting in methods being developed and compared in an ad hoc manner. To coordinate these modeling efforts, we introduce a benchmark suite for latent variable modeling of neural population activity. We curate four datasets of neural spiking activity from cognitive, sensory, and motor areas to promote models that apply to the wide variety of activity seen across these areas. We identify unsupervised evaluation as a common framework for evaluating models across datasets, and apply several baselines that demonstrate benchmark diversity. We release this benchmark through EvalAI.
    Adversarial Attacks are Reversible with Natural Supervision. (arXiv:2103.14222v3 [cs.CV] UPDATED)
    (2 min) We find that images contain intrinsic structure that enables the reversal of many adversarial attacks. Attack vectors cause not only image classifiers to fail, but also collaterally disrupt incidental structure in the image. We demonstrate that modifying the attacked image to restore the natural structure will reverse many types of attacks, providing a defense. Experiments demonstrate significantly improved robustness for several state-of-the-art models across the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Our results show that our defense is still effective even if the attacker is aware of the defense mechanism. Since our defense is deployed during inference instead of training, it is compatible with pre-trained networks as well as most other defenses. Our results suggest deep networks are vulnerable to adversarial examples partly because their representations do not enforce the natural structure of images.
    Assessing Machine Learning Approaches to Address IoT Sensor Drift. (arXiv:2109.04356v1 [eess.SP])
    (2 min) The proliferation of IoT sensors and their deployment in various industries and applications has brought about numerous analysis opportunities in this Big Data era. However, drift of those sensor measurements poses major challenges to automated data analysis and the ability to effectively train and deploy models on a continuous basis. In this paper we study and test several approaches from the literature with regard to their ability to cope with and adapt to sensor drift under realistic conditions. Most of these approaches are recent and thus are representative of the current state-of-the-art. The testing was performed on a publicly available gas sensor dataset exhibiting drift over time. The results show substantial drops in sensing performance due to sensor drift despite these approaches. We then discuss several issues identified with current approaches and outline directions for future research to tackle them.
    Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders. (arXiv:2104.08027v2 [cs.CL] UPDATED)
    (2 min) Pretrained Masked Language Models (MLMs) have revolutionised NLP in recent years. However, previous work has indicated that off-the-shelf MLMs are not effective as universal lexical or sentence encoders without further task-specific fine-tuning on NLI, sentence similarity, or paraphrasing tasks using annotated task data. In this work, we demonstrate that it is possible to turn MLMs into effective universal lexical and sentence encoders even without any additional data and without any supervision. We propose an extremely simple, fast and effective contrastive learning technique, termed Mirror-BERT, which converts MLMs (e.g., BERT and RoBERTa) into such encoders in 20-30 seconds without any additional external knowledge. Mirror-BERT relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during identity fine-tuning. We report huge gains over off-the-shelf MLMs with Mirror-BERT in both lexical-level and sentence-level tasks, across different domains and different languages. Notably, in the standard sentence semantic similarity (STS) tasks, our self-supervised Mirror-BERT model even matches the performance of the task-tuned Sentence-BERT models from prior work. Finally, we delve deeper into the inner workings of MLMs, and suggest some evidence on why this simple approach can yield effective universal lexical and sentence encoders.
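    A minimal PyTorch sketch of the identity-based contrastive objective described above: each string is encoded twice, dropout makes the two views differ slightly, and an InfoNCE loss pulls the two views of the same string together. The toy encoder and hyper-parameters are placeholders, not the released Mirror-BERT code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for an MLM encoder; dropout provides the two slightly different views."""
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)   # mean-pools token embeddings per example
        self.drop = nn.Dropout(0.1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return self.proj(self.drop(self.emb(token_ids)))

def identity_contrastive_loss(z1, z2, temperature=0.05):
    """InfoNCE over a batch: view i of sentence i must match view i among all second views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

encoder = ToyEncoder()
optim = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
batch = torch.randint(0, 1000, (32, 12))    # 32 "sentences" of 12 token ids each
for _ in range(10):                         # a handful of steps; the paper reports 20-30 s in total
    loss = identity_contrastive_loss(encoder(batch), encoder(batch))
    optim.zero_grad(); loss.backward(); optim.step()
```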
    Quantizing data for distributed learning. (arXiv:2012.07913v3 [cs.LG] UPDATED)
    (3 min) We consider machine learning applications that train a model by leveraging data distributed over a trusted network, where communication constraints can create a performance bottleneck. A number of recent approaches propose to overcome this bottleneck through compression of gradient updates. However, as models become larger, so does the size of the gradient updates. In this paper, we propose an alternate approach to learn from distributed data that quantizes data instead of gradients, and can support learning over applications where the size of gradient updates is prohibitive. Our approach leverages the dependence of the computed gradient on the data samples, which lie in a much smaller space, in order to perform the quantization in that lower-dimensional data space. At the cost of an extra gradient computation, the gradient estimate can be refined by conveying the difference between the gradient at the quantized data point and the original gradient using a small number of bits. Lastly, in order to save communication, our approach adds a layer that decides whether to transmit a quantized data sample or not based on its importance for learning. We analyze the convergence of the proposed approach for smooth convex and non-convex objective functions and show that we can achieve order-optimal convergence rates with communication that mostly depends on the data rather than the model (gradient) dimension. We use our proposed algorithm to train ResNet models on the CIFAR-10 and ImageNet datasets, and show that we can achieve an order of magnitude savings over gradient compression methods. These communication savings come at the cost of increased computation at the learning agent, and thus our approach is beneficial in scenarios where communication load is the main problem.
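    A rough sketch of the core idea under stated assumptions: the agent transmits a coarsely quantized data point, the server computes a gradient at that point, and a few extra bits convey a quantized correction toward the true gradient. The uniform quantizer, the squared-loss example, and the bit budgets are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

def uniform_quantize(x, n_bits=4, lo=-1.0, hi=1.0):
    """Quantize each coordinate of x to one of 2**n_bits uniformly spaced levels in [lo, hi]."""
    levels = 2 ** n_bits
    x = np.clip(x, lo, hi)
    idx = np.round((x - lo) / (hi - lo) * (levels - 1))
    return lo + idx * (hi - lo) / (levels - 1)

def gradient(w, x, y):
    """Gradient of the squared loss 0.5 * (w.x - y)^2 with respect to w."""
    return (w @ x - y) * x

rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = rng.uniform(-1, 1, size=8); y = 0.3

x_q = uniform_quantize(x)                        # what the agent transmits (few bits per coordinate)
g_server = gradient(w, x_q, y)                   # gradient the server computes on its own
g_true = gradient(w, x, y)                       # gradient at the original data point
correction = uniform_quantize(g_true - g_server, n_bits=2, lo=-0.5, hi=0.5)
g_refined = g_server + correction                # refined estimate using a few extra bits
# Compare the two approximation errors relative to the true gradient.
print(np.linalg.norm(g_true - g_server), np.linalg.norm(g_true - g_refined))
```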
    Fisher Task Distance and Its Applications in Transfer Learning and Neural Architecture Search. (arXiv:2103.12827v3 [cs.LG] UPDATED)
    (2 min) We formulate an asymmetric (or non-commutative) distance between tasks based on Fisher Information Matrices. We provide proof of consistency for our distance through theorems and experiments on various classification tasks. We then apply our proposed measure of task distance in transfer learning on visual tasks in the Taskonomy dataset. Additionally, we show how the proposed distance between a target task and a set of baseline tasks can be used to reduce the neural architecture search space for the target task. The complexity reduction in search space for task-specific architectures is achieved by building on the optimized architectures for similar tasks instead of doing a full search without using this side information. Experimental results demonstrate the efficacy of the proposed approach and its improvements over other methods.
    Time-Aware Evidence Ranking for Fact-Checking. (arXiv:2009.06402v4 [cs.CL] UPDATED)
    (2 min) Truth can vary over time. Fact-checking decisions on claim veracity should therefore take into account temporal information of both the claim and supporting or refuting evidence. In this work, we investigate the hypothesis that the timestamp of a Web page is crucial to how it should be ranked for a given claim. We delineate four temporal ranking methods that constrain evidence ranking differently and simulate hypothesis-specific evidence rankings given the evidence timestamps as gold standard. Evidence ranking in three fact-checking models is ultimately optimized using a learning-to-rank loss function. Our study reveals that time-aware evidence ranking not only surpasses relevance assumptions based purely on semantic similarity or position in a search results list, but also improves veracity predictions of time-sensitive claims in particular.
    UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer. (arXiv:2109.04335v1 [cs.CV])
    (2 min) Most recent semantic segmentation methods adopt a U-Net framework with an encoder-decoder architecture. It remains challenging for a U-Net with a simple skip-connection scheme to model global multi-scale context: 1) not every skip-connection setting is effective, owing to incompatible feature sets between the encoder and decoder stages, and some skip connections even negatively influence segmentation performance; 2) on some datasets, the original U-Net performs worse than a variant without any skip connections. Based on our findings, we propose a new segmentation framework, named UCTransNet (with a proposed CTrans module in U-Net), from the channel perspective with an attention mechanism. Specifically, the CTrans module is an alternative to the U-Net skip connections; it consists of a sub-module that conducts multi-scale Channel Cross fusion with Transformer (named CCT) and a Channel-wise Cross-Attention sub-module (named CCA) that guides the fused multi-scale channel-wise information to effectively connect to the decoder features and eliminate ambiguity. Hence, the proposed connection consisting of CCT and CCA is able to replace the original skip connections and bridge the semantic gaps for accurate automatic medical image segmentation. The experimental results suggest that our UCTransNet produces more precise segmentation and achieves consistent improvements over the state of the art for semantic segmentation across different datasets and conventional architectures involving transformer or U-shaped frameworks. Code: https://github.com/McGregorWwww/UCTransNet.
    Fair Conformal Predictors for Applications in Medical Imaging. (arXiv:2109.04392v1 [eess.IV])
    (2 min) Deep learning has the potential to augment many components of the clinical workflow, such as medical image interpretation. However, the translation of these black-box algorithms into clinical practice has been marred by a relative lack of transparency compared to conventional machine learning methods, hindering clinician trust in the systems for critical medical decision-making. Specifically, common deep learning approaches do not have intuitive ways of expressing uncertainty with respect to cases that might require further human review. Furthermore, the possibility of algorithmic bias has caused hesitancy regarding the use of developed algorithms in clinical settings. To these ends, we explore how conformal methods can complement deep learning models by providing a clinically intuitive way (by means of confidence prediction sets) of expressing model uncertainty as well as by facilitating model transparency in clinical workflows. In this paper, we conduct a field survey with clinicians to assess clinical use-cases of conformal predictions. Next, we conduct experiments with mammographic breast density and dermatology photography datasets to demonstrate the utility of conformal predictions in "rule-in" and "rule-out" disease scenarios. Further, we show that conformal predictors can be used to equalize coverage with respect to patient demographics such as race and skin tone. We find conformal prediction to be a promising framework with the potential to increase clinical usability and transparency for better collaboration between deep learning algorithms and clinicians.
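    A minimal sketch of split conformal prediction sets for a classifier, of the kind referred to above; the 1 - softmax-score nonconformity measure and the toy probabilities are standard illustrations, not the authors' specific procedure.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal: nonconformity score = 1 - p(true class); take the conformal (1-alpha) quantile."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1      # index of the conformal quantile
    return np.sort(scores)[min(k, n - 1)]

def prediction_sets(test_probs, q):
    """Class k enters the prediction set whenever its score 1 - p_k is below the threshold."""
    return (1.0 - test_probs) <= q

# Toy usage with made-up softmax outputs for a 3-class problem.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=500)
cal_labels = rng.integers(0, 3, size=500)
q = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(rng.dirichlet(np.ones(3), size=10), q)
print(sets.sum(axis=1))   # sizes of the ten prediction sets; coverage is ~90% when data are exchangeable
```

    Group-wise coverage could be equalised in the same spirit by computing a separate threshold per demographic group on the calibration split.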
    NeuralFMU: Towards Structural Integration of FMUs into Neural Networks. (arXiv:2109.04351v1 [cs.LG])
    (2 min) This paper covers two major subjects: First, the presentation of a new open-source library called FMI.jl for integrating FMI into the Julia programming environment by providing the possibility to load, parameterize and simulate FMUs. Further, an extension to this library called FMIFlux.jl is introduced, that allows the integration of FMUs into a neural network topology to obtain a NeuralFMU. This structural combination of an industry typical black-box model and a data-driven machine learning model combines the different advantages of both modeling approaches in one single development environment. This allows for the usage of advanced data driven modeling techniques for physical effects that are difficult to model based on first principles.
    A fast and simple modification of Newton's method helping to avoid saddle points. (arXiv:2006.01512v4 [math.OC] UPDATED)
    (2 min) We propose in this paper New Q-Newton's method. The update rule is conceptually very simple, for example $x_{n+1}=x_n-w_n$ where $w_n=pr_{A_n,+}(v_n)-pr_{A_n,-}(v_n)$, with $A_n=\nabla ^2f(x_n)+\delta _n||\nabla f(x_n)||^2.Id$ and $v_n=A_n^{-1}.\nabla f(x_n)$. Here $\delta _n$ is an appropriate real number chosen so that $A_n$ is invertible, and $pr_{A_n,\pm}$ are projections onto the vector subspaces generated by eigenvectors of positive (respectively negative) eigenvalues of $A_n$. The main result of this paper roughly says that if $f$ is $C^3$ (possibly unbounded from below) and a sequence $\{x_n\}$, constructed by the New Q-Newton's method from a random initial point $x_0$, converges, then the limit point is a critical point and is not a saddle point, and the convergence rate is the same as that of Newton's method. The first author has recently been successful in incorporating Backtracking line search into New Q-Newton's method, thus resolving the convergence guarantee issue observed for some (non-smooth) cost functions. An application to quickly finding zeros of a univariate meromorphic function will be discussed. Various experiments against well-known algorithms such as BFGS and Adaptive Cubic Regularization are presented.
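    Because the update rule is stated explicitly in the abstract, a small NumPy sketch can make it concrete; the candidate values tried for $\delta _n$ and the toy objective below are illustrative assumptions.

```python
import numpy as np

def new_q_newton_step(grad_f, hess_f, x, deltas=(0.0, 1.0, 2.0)):
    """One step of New Q-Newton's method as described in the abstract.

    A = Hessian + delta * ||grad||^2 * I, with delta chosen so A is invertible;
    v = A^{-1} grad; w = (projection of v onto the positive eigenspace of A)
                        - (projection onto the negative eigenspace); x_next = x - w.
    """
    g = grad_f(x)
    H = hess_f(x)
    for delta in deltas:                       # pick the first delta making A invertible
        A = H + delta * np.dot(g, g) * np.eye(len(x))
        if abs(np.linalg.det(A)) > 1e-12:
            break
    eigvals, eigvecs = np.linalg.eigh(A)       # A is symmetric, so eigh applies
    v = np.linalg.solve(A, g)
    coeffs = eigvecs.T @ v                     # coordinates of v in the eigenbasis of A
    w = eigvecs @ (np.sign(eigvals) * coeffs)  # flip the negative-eigenvalue components
    return x - w

# Toy objective with a saddle point at the origin: f(x, y) = x^2 - y^2 + y^4.
grad = lambda p: np.array([2 * p[0], -2 * p[1] + 4 * p[1] ** 3])
hess = lambda p: np.array([[2.0, 0.0], [0.0, -2.0 + 12 * p[1] ** 2]])
x = np.array([0.3, 0.2])
for _ in range(20):
    x = new_q_newton_step(grad, hess, x)
print(x)   # converges to a minimiser near (0, 1/sqrt(2)) rather than the saddle at (0, 0)
```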
    PhysGNN: A Physics-Driven Graph Neural Network Based Model for Predicting Soft Tissue Deformation in Image-Guided Neurosurgery. (arXiv:2109.04352v1 [eess.IV])
    (2 min) Correctly capturing intraoperative brain shift in image-guided neurosurgical procedures is a critical task for aligning preoperative data with intraoperative geometry, ensuring effective surgical navigation and optimal surgical precision. While the finite element method (FEM) is a proven technique to effectively approximate soft tissue deformation through biomechanical formulations, its degree of success boils down to a trade-off between accuracy and speed. To circumvent this problem, the most recent works in this domain have proposed leveraging data-driven models obtained by training various machine learning algorithms, e.g. random forests, artificial neural networks (ANNs), with the results of finite element analysis (FEA) to speed up tissue deformation approximations by prediction. These methods, however, do not account for the structure of the finite element (FE) mesh during training that provides information on node connectivities as well as the distance between them, which can aid with approximating tissue deformation based on the proximity of force load points with the rest of the mesh nodes. Therefore, this work proposes a novel framework, PhysGNN, a data-driven model that approximates the solution of FEA by leveraging graph neural networks (GNNs), which are capable of accounting for the mesh structural information and inductive learning over unstructured grids and complex topological structures. Empirically, we demonstrate that the proposed architecture, PhysGNN, promises accurate and fast soft tissue deformation approximations while remaining computationally feasible, suitable for neurosurgical settings.
    Safe Learning Reference Governor: Theory and Application to Fuel Truck Rollover Avoidance. (arXiv:2101.09298v2 [eess.SY] UPDATED)
    (2 min) This paper proposes a learning reference governor (LRG) approach to enforce state and control constraints in systems for which an accurate model is unavailable, and this approach enables the reference governor to gradually improve command tracking performance through learning while enforcing the constraints during learning and after learning is completed. The learning can be performed either on a black-box type model of the system or directly on the hardware. After introducing the LRG algorithm and outlining its theoretical properties, this paper investigates LRG application to fuel truck (tank truck) rollover avoidance. Through simulations based on a fuel truck model that accounts for liquid fuel sloshing effects, we show that the proposed LRG can effectively protect fuel trucks from rollover accidents under various operating conditions.
    Fundamental Limits and Tradeoffs in Invariant Representation Learning. (arXiv:2012.10713v3 [cs.LG] UPDATED)
    (2 min) A wide range of machine learning applications such as privacy-preserving learning, algorithmic fairness, and domain adaptation/generalization among others, involve learning \emph{invariant representations} of the data that aim to achieve two competing goals: (a) maximize information or accuracy with respect to a target response, and (b) maximize invariance or independence with respect to a set of protected features (e.g.\ for fairness, privacy, etc). Despite their wide applicability, theoretical understanding of the optimal tradeoffs -- with respect to accuracy, and invariance -- achievable by invariant representations is still severely lacking. In this paper, we provide precisely such an information-theoretic analysis of such tradeoffs under both classification and regression settings. We provide a geometric characterization of the accuracy and invariance achievable by any representation of the data; we term this feasible region the information plane. We provide a lower bound for this feasible region for the classification case, and an exact characterization for the regression case, which allows us to either bound or exactly characterize the Pareto optimal frontier between accuracy and invariance. Although our contributions are mainly theoretical, a key practical application of our results is in certifying the potential sub-optimality of any given representation learning algorithm for either classification or regression tasks. Our results shed new light on the fundamental interplay between accuracy and invariance, and may be useful in guiding the design of future representation learning algorithms.
    Wave-Informed Matrix Factorization with Global Optimality Guarantees. (arXiv:2107.09144v2 [cs.LG] UPDATED)
    (2 min) With the recent success of representation learning methods, which includes deep learning as a special case, there has been considerable interest in developing representation learning techniques that can incorporate known physical constraints into the learned representation. As one example, in many applications that involve a signal propagating through physical media (e.g., optics, acoustics, fluid dynamics, etc), it is known that the dynamics of the signal must satisfy constraints imposed by the wave equation. Here we propose a matrix factorization technique that decomposes such signals into a sum of components, where each component is regularized to ensure that it satisfies wave equation constraints. Although our proposed formulation is non-convex, we prove that our model can be efficiently solved to global optimality in polynomial time. We demonstrate the benefits of our work by applications in structural health monitoring, where prior work has attempted to solve this problem using sparse dictionary learning approaches that do not come with any theoretical guarantees regarding convergence to global optimality and employ heuristics to capture desired physical constraints.
    The Price is (Probably) Right: Learning Market Equilibria from Samples. (arXiv:2012.14838v3 [cs.GT] UPDATED)
    (2 min) Equilibrium computation in markets usually considers settings where player valuation functions are known. We consider the setting where player valuations are unknown; using a PAC learning-theoretic framework, we analyze some classes of common valuation functions, and provide algorithms which output direct PAC equilibrium allocations, not estimates based on attempting to learn valuation functions. Since there exist trivial PAC market outcomes with an unbounded worst-case efficiency loss, we lower-bound the efficiency of our algorithms. While the efficiency loss under general distributions is rather high, we show that in some cases (e.g., unit-demand valuations), it is possible to find a PAC market equilibrium with significantly better utility.
    Leveraging Local Domains for Image-to-Image Translation. (arXiv:2109.04468v1 [cs.CV])
    (2 min) Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. For example, when translating from highway scenes to offroad ones, i2i networks easily focus on global color features but ignore traits obvious to humans, such as the absence of lane markings. In this paper, we leverage human knowledge about spatial domain characteristics, which we refer to as 'local domains', and demonstrate its benefit for image-to-image translation. Relying on simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to the target. We experiment on three tasks ranging from unstructured environments to adverse weather. Our comprehensive evaluation setting shows we are able to generate realistic translations with minimal priors, training only on a few images. Furthermore, we show that when trained on our translated images, all tested proxy tasks are significantly improved, without ever seeing the target domain at training time.
    Fast-Convergent Dynamics for Distributed Allocation of Resources Over Switching Sparse Networks with Quantized Communication Links. (arXiv:2012.08181v3 [eess.SY] UPDATED)
    (2 min) This paper proposes networked dynamics to solve resource allocation problems over time-varying multi-agent networks. The state of each agent represents the amount of used resources (or produced utilities) while the total amount of resources is fixed. The idea is to optimally allocate the resources among the group of agents by minimizing the overall cost function subject to a fixed sum of resources. Each agent's information is restricted to its own state and cost function and those of its immediate in-neighbors. This is motivated by distributed applications such as mobile edge-computing, economic dispatch over smart grids, and multi-agent coverage control. This work provides a fast convergent solution (in comparison with linear dynamics) while considering relaxed network connectivity with quantized communication links. The proposed dynamics reach the optimal solution over switching (possibly disconnected) undirected networks as long as their union over some bounded non-overlapping time intervals contains a spanning tree. We prove feasibility of the solution, uniqueness of the optimal state, and convergence to the optimal value under the proposed dynamics, where the analysis is applicable to similar 1st-order allocation dynamics with strongly sign-preserving nonlinearities, such as actuator saturation.
    Protein Folding Neural Networks Are Not Robust. (arXiv:2109.04460v1 [q-bio.BM])
    (2 min) Deep neural networks such as AlphaFold and RoseTTAFold predict remarkably accurate structures of proteins compared to other algorithmic approaches. It is known that biologically small perturbations in the protein sequence do not lead to drastic changes in the protein structure. In this paper, we demonstrate that RoseTTAFold does not exhibit such robustness despite its high accuracy, and biologically small perturbations of some input sequences result in radically different predicted protein structures. This raises the challenge of detecting when these predicted protein structures cannot be trusted. We define the robustness measure for the predicted structure of a protein sequence to be the inverse of the root-mean-square distance (RMSD) between the predicted structure and the predicted structure of its adversarially perturbed sequence. We use adversarial attack methods to create adversarial protein sequences, and show that the RMSD of the predicted protein structure ranges from 0.119 Å to 34.162 Å when the adversarial perturbations are bounded by 20 units in the BLOSUM62 distance. This demonstrates very high variance in the robustness measure of the predicted structures. We show that the magnitude of the correlation (0.917) between our robustness measure and the RMSD between the predicted structure and the ground truth is high; that is, predictions with a low robustness measure cannot be trusted. This is the first paper demonstrating the susceptibility of RoseTTAFold to adversarial attacks.
    Is a Classification Procedure Good Enough? A Goodness-of-Fit Assessment Tool for Classification Learning. (arXiv:1911.03063v2 [stat.ML] UPDATED)
    (2 min) In recent years, many non-traditional classification methods, such as random forests, boosting, and neural networks, have been widely used in applications. Their performance is typically measured in terms of classification accuracy. While the classification error rate and the like are important, they do not address a fundamental question: Is the classification method underfitted? To the best of our knowledge, there is no existing method that can assess the goodness-of-fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to assess is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities and the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, the BAGofT has a broader scope than the existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of the BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models.
    Editing Factual Knowledge in Language Models. (arXiv:2104.08164v2 [cs.CL] UPDATED)
    (2 min) The factual knowledge acquired during pre-training and stored in the parameters of Language Models (LMs) can be useful in downstream tasks (e.g., question answering or textual inference). However, some facts can be incorrectly induced or become obsolete over time. We present KnowledgeEditor, a method which can be used to edit this knowledge and, thus, fix 'bugs' or unexpected predictions without the need for expensive re-training or fine-tuning. Besides being computationally efficient, KnowledgeEditor does not require any modifications to LM pre-training (e.g., the use of meta-learning). In our approach, we train a hyper-network with constrained optimization to modify a fact without affecting the rest of the knowledge; the trained hyper-network is then used to predict the weight update at test time. We show KnowledgeEditor's efficacy with two popular architectures and knowledge-intensive tasks: i) a BERT model fine-tuned for fact-checking, and ii) a sequence-to-sequence BART model for question answering. With our method, changing a prediction on the specific wording of a query tends to result in a consistent change in predictions also for its paraphrases. We show that this can be further encouraged by exploiting (e.g., automatically-generated) paraphrases during training. Interestingly, our hyper-network can be regarded as a 'probe' revealing which components need to be changed to manipulate factual knowledge; our analysis shows that the updates tend to be concentrated on a small subset of components. Source code available at https://github.com/nicola-decao/KnowledgeEditor
    A Supervised Machine Learning Model For Imputing Missing Boarding Stops In Smart Card Data. (arXiv:2003.05285v2 [cs.LG] UPDATED)
    (2 min) Public transport has become an essential part of urban existence with increased population densities and environmental awareness. Large quantities of data are currently generated, allowing for more robust methods to understand travel behavior by harvesting smart card usage. However, public transport datasets suffer from data integrity problems; boarding stop information may be missing due to imperfect acquisition processes or inadequate reporting. We developed a supervised machine learning method to impute missing boarding stops based on ordinal classification using GTFS timetable, smart card, and geospatial datasets. A new metric, Pareto Accuracy, is suggested to evaluate algorithms where classes have an ordinal nature. Results are based on a case study in the city of Beer Sheva, Israel, consisting of one month of smart card data. We show that our proposed method is robust to irregular travelers and significantly outperforms well-known imputation methods without the need to mine any additional datasets. Validation of data from another Israeli city using transfer learning shows the presented model is general and context-free. The implications for transportation planning and travel behavior research are further discussed.
    Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training. (arXiv:2101.12699v3 [cs.LG] UPDATED)
    (0 min) In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts of the network. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which in part explains the recently discovered phenomenon of neural collapse (Papyan et al., 2020). More importantly, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon is first predicted by the Layer-Peeled Model before being confirmed by our computational experiments.
    High-Dimensional Differentially-Private EM Algorithm: Methods and Near-Optimal Statistical Guarantees. (arXiv:2104.00245v2 [stat.ML] UPDATED)
    (0 min) In this paper, we develop a general framework to design differentially private expectation-maximization (EM) algorithms in high-dimensional latent variable models, based on noisy iterative hard-thresholding. We derive the statistical guarantees of the proposed framework and apply it to three specific models: Gaussian mixture, mixture of regression, and regression with missing covariates. In each model, we establish the near-optimal rate of convergence with differential privacy constraints, and show the proposed algorithm is minimax rate optimal up to logarithmic factors. The technical tools developed for the high-dimensional setting are then extended to the classic low-dimensional latent variable models, and we propose a near rate-optimal EM algorithm with differential privacy guarantees in this setting. Simulation studies and real data analysis are conducted to support our results.
    ProAI: An Efficient Embedded AI Hardware for Automotive Applications -- a Benchmark Study. (arXiv:2108.05170v2 [cs.CV] UPDATED)
    (0 min) Development in the field of Single Board Computers (SBCs) has been increasing for several years. They provide a good balance between computing performance and power consumption, which is usually required for mobile platforms such as in-vehicle applications for Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD). However, there is an ever-increasing need for more powerful and efficient SBCs that can run power-intensive Deep Neural Networks (DNNs) in real time and can also satisfy necessary functional safety requirements such as Automotive Safety Integrity Level (ASIL). ProAI is being developed by ZF mainly to run powerful and efficient applications such as multitask DNNs, and on top of that it also has the safety certification required for AD. In this work, we compare and discuss state-of-the-art SBCs on the basis of a power-intensive multitask DNN architecture called Multitask-CenterNet, with respect to performance measures such as FPS and power efficiency. As an automotive supercomputer, ProAI delivers an excellent combination of performance and efficiency, managing nearly twice the number of FPS per watt of a modern workstation laptop and almost four times that of the Jetson Nano. Furthermore, the CPU and GPU utilization during the benchmark shows that there is still power in reserve for further and more complex tasks on the ProAI.
    HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks. (arXiv:2104.01002v2 [cs.SE] UPDATED)
    (0 min) Jupyter notebooks allow data scientists to write machine learning code together with its documentation in cells. In this paper, we propose a new task of code documentation generation (CDG) for computational notebooks. In contrast to previous CDG tasks, which focus on generating documentation for single code snippets, in a computational notebook one piece of documentation in a markdown cell often corresponds to multiple code cells, and these code cells have an inherent structure. We propose a new model (HAConvGNN) that uses a hierarchical attention mechanism to consider the relevant code cells and the relevant code-token information when generating the documentation. Tested on a new corpus constructed from well-documented Kaggle notebooks, we show that our model outperforms other baseline models.
    Distributed Thompson Sampling. (arXiv:2012.01789v2 [cs.AI] UPDATED)
    (0 min) We study a cooperative multi-agent multi-armed bandit problem with M agents and K arms. The goal of the agents is to minimize the cumulative regret. We adapt a traditional Thompson Sampling algorithm to the distributed setting. Given the agents' ability to communicate, we note that communication may further reduce the upper bound on the regret for a distributed Thompson Sampling approach. To further improve the performance of distributed Thompson Sampling, we propose a distributed elimination-based Thompson Sampling algorithm that allows the agents to learn collaboratively. We analyse the algorithm under Bernoulli rewards and derive a problem-dependent upper bound on the cumulative regret.
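    A toy sketch of the basic distributed setting: M agents run Beta-Bernoulli Thompson Sampling in parallel and periodically merge their success/failure counts. The naive full synchronisation used here is only for illustration; it is not the elimination-based algorithm the paper proposes.

```python
import numpy as np

def distributed_thompson(true_means, n_agents=3, horizon=2000, sync_every=50, seed=0):
    """M agents, K Bernoulli arms; Beta(1+s, 1+f) posteriors; counts merged every sync_every rounds."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    shared_s = np.zeros(k); shared_f = np.zeros(k)                         # counts every agent already knows
    local_s = np.zeros((n_agents, k)); local_f = np.zeros((n_agents, k))   # counts not yet broadcast
    regret = 0.0
    for t in range(horizon):
        for a in range(n_agents):
            s = shared_s + local_s[a]; f = shared_f + local_f[a]
            arm = int(np.argmax(rng.beta(s + 1, f + 1)))        # posterior sampling
            reward = rng.random() < true_means[arm]
            local_s[a, arm] += reward; local_f[a, arm] += 1 - reward
            regret += max(true_means) - true_means[arm]
        if (t + 1) % sync_every == 0:                           # communication round
            shared_s += local_s.sum(axis=0); shared_f += local_f.sum(axis=0)
            local_s[:] = 0; local_f[:] = 0
    return regret

print(distributed_thompson([0.2, 0.5, 0.55]))   # cumulative regret over all agents' pulls
```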
    ISCL: Interdependent Self-Cooperative Learning for Unpaired Image Denoising. (arXiv:2102.09858v2 [cs.CV] UPDATED)
    (0 min) With the advent of advances in self-supervised learning, paired clean-noisy data are no longer required in deep learning-based image denoising. However, existing blind denoising methods still require the assumption with regard to noise characteristics, such as zero-mean noise distribution and pixel-wise noise-signal independence; this hinders wide adaptation of the method in the medical domain. On the other hand, unpaired learning can overcome limitations related to the assumption on noise characteristics, which makes it more feasible for collecting the training data in real-world scenarios. In this paper, we propose a novel image denoising scheme, Interdependent Self-Cooperative Learning (ISCL), that leverages unpaired learning by combining cyclic adversarial learning with self-supervised residual learning. Unlike the existing unpaired image denoising methods relying on matching data distributions in different domains, the two architectures in ISCL, designed for different tasks, complement each other and boost the learning process. To assess the performance of the proposed method, we conducted extensive experiments in various biomedical image degradation scenarios, such as noise caused by physical characteristics of electron microscopy (EM) devices (film and charging noise), and structural noise found in low-dose computer tomography (CT). We demonstrate that the image quality of our method is superior to conventional and current state-of-the-art deep learning-based image denoising methods, including supervised learning.
    Automated Security Assessment for the Internet of Things. (arXiv:2109.04029v1 [cs.CR])
    (0 min) Internet of Things (IoT) based applications face an increasing number of potential security risks, which need to be systematically assessed and addressed. Expert-based manual assessment of IoT security is a predominant approach, which is usually inefficient. To address this problem, we propose an automated security assessment framework for IoT networks. Our framework first leverages machine learning and natural language processing to analyze vulnerability descriptions for predicting vulnerability metrics. The predicted metrics are then input into a two-layered graphical security model, which consists of an attack graph at the upper layer to present the network connectivity and an attack tree for each node in the network at the bottom layer to depict the vulnerability information. This security model automatically assesses the security of the IoT network by capturing potential attack paths. We evaluate the viability of our approach using a proof-of-concept smart building system model which contains a variety of real-world IoT devices and potential vulnerabilities. Our evaluation of the proposed framework demonstrates its effectiveness in terms of automatically predicting the vulnerability metrics of new vulnerabilities with more than 90% accuracy, on average, and identifying the most vulnerable attack paths within an IoT network. The produced assessment results can serve as a guideline for cybersecurity professionals to take further actions and mitigate risks in a timely manner.
    DROP: Deep relocating option policy for optimal ride-hailing vehicle repositioning. (arXiv:2109.04149v1 [cs.MA])
    (0 min) In a ride-hailing system, an optimal relocation of vacant vehicles can significantly reduce fleet idling time and balance the supply-demand distribution, enhancing system efficiency and promoting driver satisfaction and retention. Model-free deep reinforcement learning (DRL) has been shown to dynamically learn the relocating policy by actively interacting with the intrinsic dynamics in large-scale ride-hailing systems. However, the issues of sparse reward signals and unbalanced demand and supply distribution place critical barriers in developing effective DRL models. Conventional exploration strategy (e.g., the $\epsilon$-greedy) may barely work under such an environment because of dithering in low-demand regions distant from high-revenue regions. This study proposes the deep relocating option policy (DROP) that supervises vehicle agents to escape from oversupply areas and effectively relocate to potentially underserved areas. We propose to learn the Laplacian embedding of a time-expanded relocation graph, as an approximation representation of the system relocation policy. The embedding generates task-agnostic signals, which in combination with task-dependent signals, constitute the pseudo-reward function for generating DROPs. We present a hierarchical learning framework that trains a high-level relocation policy and a set of low-level DROPs. The effectiveness of our approach is demonstrated using a custom-built high-fidelity simulator with real-world trip record data. We report that DROP significantly improves baseline models with 15.7% more hourly revenue and can effectively resolve the dithering issue in low-demand areas.
    Compositional Affinity Propagation: When Clusters Have Compositional Structure. (arXiv:2109.04160v1 [cs.LG])
    (0 min) We consider a new kind of clustering problem in which clusters need not be independent of each other, but rather can have compositional relationships with other clusters (e.g., an image set consists of rectangles, circles, as well as combinations of rectangles and circles). This task is motivated by recent work in few-shot learning on compositional embedding models that structure the embedding space to distinguish the label sets, not just the individual labels, assigned to the examples. To tackle this clustering problem, we propose a new algorithm called Compositional Affinity Propagation (CAP). In contrast to standard Affinity Propagation as well as other algorithms for multi-view and hierarchical clustering, CAP can deduce compositionality among clusters automatically. We show promising results, compared to several existing clustering algorithms, on the MultiMNIST, OmniGlot, and LibriSpeech datasets. Our work has applications to multi-object image recognition and speaker diarization with simultaneous speech from multiple speakers.
    Sharp Lower Bounds on the Approximation Rate of Shallow Neural Networks. (arXiv:2106.14997v2 [stat.ML] UPDATED)
    (0 min) We consider the approximation rates of shallow neural networks with respect to the variation norm. Upper bounds on these rates have been established for sigmoidal and ReLU activation functions, but it has remained an important open problem whether these rates are sharp. In this article, we provide a solution to this problem by proving sharp lower bounds on the approximation rates for shallow neural networks, which are obtained by lower bounding the $L^2$-metric entropy of the convex hull of the neural network basis functions. In addition, our methods also give sharp lower bounds on the Kolmogorov $n$-widths of this convex hull, which show that the variation spaces corresponding to shallow neural networks cannot be efficiently approximated by linear methods. These lower bounds apply to both sigmoidal activation functions with bounded variation and to activation functions which are a power of the ReLU. Our results also quantify how much stronger the Barron spectral norm is than the variation norm and, combined with previous results, give the asymptotics of the $L^\infty$-metric entropy up to logarithmic factors in the case of the ReLU activation function.
    Matrix Completion of World Trade. (arXiv:2109.03930v1 [econ.GN])
    (0 min) This work applies Matrix Completion (MC) -- a class of machine-learning methods commonly used in the context of recommendation systems -- to analyse economic complexity. MC is applied to reconstruct the Revealed Comparative Advantage (RCA) matrix, whose elements express the relative advantage of countries in given classes of products, as evidenced by yearly trade flows. A high-accuracy binary classifier is derived from the application of MC, with the aim of discriminating between elements of the RCA matrix that are, respectively, higher or lower than one. We introduce a novel Matrix cOmpletion iNdex of Economic complexitY (MONEY) based on MC, which is related to the predictability of countries' RCA (the lower the predictability, the higher the complexity). Unlike previously developed indices of economic complexity, the MONEY index takes into account the various singular vectors of the matrix reconstructed by MC, whereas other indices are based on only one or two eigenvectors of a suitable symmetric matrix derived from the RCA matrix. Finally, MC is compared with a state-of-the-art economic complexity index (GENEPY). We show that the false positive rate per country of a binary classifier constructed from the average entry-wise output of MC can be used as a proxy for GENEPY.
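    The matrix completion step itself can be prototyped with a generic soft-thresholded SVD (SoftImpute-style) routine. The sketch below uses synthetic data, an illustrative penalty, and a zero-centred threshold in place of the RCA cutoff of one; it is not the paper's implementation.

```python
import numpy as np

def soft_impute(M, mask, rank_penalty=1.0, n_iters=100):
    """SoftImpute-style completion: repeatedly replace unobserved entries
    with a low-rank reconstruction obtained by soft-thresholding singular values."""
    X = np.where(mask, M, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - rank_penalty, 0.0)        # soft-threshold
        low_rank = (U * s) @ Vt
        X = np.where(mask, M, low_rank)              # keep observed entries fixed
    return low_rank

# Toy RCA-like matrix (countries x products) with some entries hidden.
rng = np.random.default_rng(1)
true = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 40))
mask = rng.random(true.shape) < 0.7                   # 70% observed
recon = soft_impute(true, mask)

# Binary classifier on the held-out entries (threshold 0 for this centred toy data).
pred = recon >= 0
print("held-out accuracy:", (pred == (true >= 0))[~mask].mean())
```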
    IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System. (arXiv:2109.04202v1 [q-bio.QM])
    (0 min) As in many scientific fields, the chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and machine-readable molecule representations, text-based molecule descriptors like SMILES and SELFIES were created. These text-based molecule representations enable molecule generation but are unfortunately rarely present in published literature. In the absence of molecule descriptors, generating them from the 2-D images present in the literature is necessary to understand chemistry literature at scale. Successful methods such as Optical Structure Recognition Application (OSRA) and ChemSchematicResolver are able to extract the locations of molecule structures in chemistry papers and infer molecular descriptions and reactions. While effective, existing systems expect chemists to correct outputs, making them unsuitable for unsupervised large-scale data mining. Leveraging the task formulation of image captioning introduced by DECIMER, we introduce IMG2SMI, a model which leverages Deep Residual Networks for image feature extraction and encoder-decoder Transformer layers for molecule description generation. Unlike previous Neural Network-based systems, IMG2SMI builds around the task of molecule description generation, which enables IMG2SMI to outperform OSRA-based systems by 163% in molecule similarity prediction as measured by the molecular MACCS Fingerprint Tanimoto Similarity. Additionally, to facilitate further research on this task, we release a new molecule prediction dataset, including 81 million molecules, for molecule description generation.
    Towards Fully Automated Segmentation of Rat Cardiac MRI by Leveraging Deep Learning Frameworks. (arXiv:2109.04188v1 [eess.IV])
    (0 min) Automated segmentation of human cardiac magnetic resonance datasets has been steadily improving during recent years. However, these methods are not directly applicable in the preclinical context due to limited datasets and lower image resolution. Successful application of deep architectures for rat cardiac segmentation, although of critical importance for preclinical evaluation of cardiac function, has to our knowledge not yet been reported. We developed segmentation models that expand on the standard U-Net architecture and evaluated separate models for the systole and diastole phases (2MSA) and one model for all timepoints (1MSA). Furthermore, we calibrated model outputs using a Gaussian Process (GP)-based prior to improve phase selection. The resulting models approach human performance in terms of left ventricular segmentation quality and ejection fraction (EF) estimation in both the 1MSA and 2MSA settings (S{\o}rensen-Dice score 0.91 +/- 0.072 and 0.93 +/- 0.032, respectively). 2MSA achieved a mean absolute difference between estimated and reference EF of 3.5 +/- 2.5 %, while 1MSA resulted in 4.1 +/- 3.0 %. Applying Gaussian Processes to 1MSA allows the selection of systole and diastole phases to be automated. Combined with a novel cardiac phase selection strategy, our work presents an important first step towards a fully automated segmentation pipeline in the context of rat cardiac analysis.
    DeepEMO: Deep Learning for Speech Emotion Recognition. (arXiv:2109.04081v1 [cs.SD])
    (0 min) We propose an industry-level deep learning approach for the speech emotion recognition task. In industry, carefully designed deep transfer learning delivers real results, mostly because of the limited amount of available training data, the cost of model training, and the need for specialized learning on dedicated AI tasks. The proposed framework, called DeepEMO, consists of two main pipelines: preprocessing to extract efficient main features, and a deep transfer learning model to train and recognize. The source code is available in the https://github.com/enkhtogtokh/deepemo repository.
    How to Train BERT with an Academic Budget. (arXiv:2104.07705v2 [cs.CL] UPDATED)
    (0 min) While large language models a la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
    Understanding Neural Code Intelligence Through Program Simplification. (arXiv:2106.03353v2 [cs.SE] UPDATED)
    (0 min) A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior.
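    The delta-debugging-style reduction that SIVAND builds on can be sketched as a greedy loop that keeps deleting token chunks while the model's prediction is unchanged. The `predict` stub below is a placeholder, not a real code-intelligence model, and the chunking schedule is the standard ddmin-style heuristic rather than the authors' exact procedure.

```python
def predict(tokens):
    # Placeholder for the CI model under study (e.g. a method-name predictor).
    # Toy "model": prediction depends only on whether 'return' appears.
    return "has_return" if "return" in tokens else "no_return"

def reduce_program(tokens):
    """Greedy 1-minimal reduction in the spirit of delta debugging:
    keep deleting chunks as long as the model's prediction is preserved."""
    target = predict(tokens)
    chunk = max(1, len(tokens) // 2)
    while chunk >= 1:
        i, shrunk = 0, False
        while i < len(tokens):
            candidate = tokens[:i] + tokens[i + chunk:]
            if candidate and predict(candidate) == target:
                tokens, shrunk = candidate, True   # deletion kept the prediction
            else:
                i += chunk
        if not shrunk:
            chunk //= 2
    return tokens

program = "def f ( x ) : y = x + 1 ; return y".split()
print(reduce_program(program))   # a much smaller token list, same prediction
```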
    Deep Reinforcement Learning for Equal Risk Pricing and Hedging under Dynamic Expectile Risk Measures. (arXiv:2109.04001v1 [q-fin.PR])
    (0 min) Recently equal risk pricing, a framework for fair derivative pricing, was extended to consider dynamic risk measures. However, all current implementations either employ a static risk measure that violates time consistency, or are based on traditional dynamic programming solution schemes that are impracticable in problems with a large number of underlying assets (due to the curse of dimensionality) or with incomplete asset dynamics information. In this paper, we extend for the first time a well-known off-policy deterministic actor-critic deep reinforcement learning (ACRL) algorithm to the problem of solving a risk-averse Markov decision process that models risk using a time-consistent recursive expectile risk measure. This new ACRL algorithm allows us to identify high-quality time-consistent hedging policies (and equal risk prices) for options, such as basket options, that cannot be handled using traditional methods, or in contexts where only historical trajectories of the underlying assets are available. Our numerical experiments, which involve both a simple vanilla option and a more exotic basket option, confirm that the new ACRL algorithm can produce: 1) in simple environments, nearly optimal hedging policies and highly accurate prices, simultaneously for a range of maturities; 2) in complex environments, good-quality policies and prices using a reasonable amount of computing resources; and 3) overall, hedging strategies that actually outperform the strategies produced using static risk measures when the risk is evaluated at later points in time.
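    For readers unfamiliar with expectiles, the $\tau$-expectile used as the building block of such a recursive risk measure is the minimiser of an asymmetric squared loss (the standard definition, not taken from this paper):

```latex
e_\tau(X) \;=\; \operatorname*{arg\,min}_{m \in \mathbb{R}} \;
  \mathbb{E}\!\left[\tau\,\big((X-m)_+\big)^2 \;+\; (1-\tau)\,\big((m-X)_+\big)^2\right],
\qquad (x)_+ := \max(x,0).
```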
    Mining Points of Interest via Address Embeddings: An Unsupervised Approach. (arXiv:2109.04467v1 [cs.LG])
    (0 min) Digital maps are commonly used across the globe for exploring places that users are interested in, commonly referred to as points of interest (PoI). In online food delivery platforms, PoIs could represent any major private compounds where customers could order from such as hospitals, residential complexes, office complexes, educational institutes and hostels. In this work, we propose an end-to-end unsupervised system design for obtaining polygon representations of PoIs (PoI polygons) from address locations and address texts. We preprocess the address texts using locality names and generate embeddings for the address texts using a deep learning-based architecture, viz. RoBERTa, trained on our internal address dataset. The PoI candidates are identified by jointly clustering the anonymised customer phone GPS locations (obtained during address onboarding) and the embeddings of the address texts. The final list of PoI polygons is obtained from these PoI candidates using novel post-processing steps. This algorithm identified 74.8 % more PoIs than those obtained using the Mummidi-Krumm baseline algorithm run on our internal dataset. The proposed algorithm achieves a median area precision of 98 %, a median area recall of 8 %, and a median F-score of 0.15. In order to improve the recall of the algorithmic polygons, we post-process them using building footprint polygons from the OpenStreetMap (OSM) database. The post-processing algorithm involves reshaping the algorithmic polygon using intersecting polygons and closed private roads from the OSM database, and accounting for intersection with public roads on the OSM database. We achieve a median area recall of 70 %, a median area precision of 69 %, and a median F-score of 0.69 on these post-processed polygons.
    NTS-NOTEARS: Learning Nonparametric Temporal DAGs With Time-Series Data and Prior Knowledge. (arXiv:2109.04286v1 [cs.LG])
    (0 min) We propose a score-based DAG structure learning method for time-series data that captures linear, nonlinear, lagged and instantaneous relations among variables while ensuring acyclicity throughout the entire graph. The proposed method extends nonparametric NOTEARS, a recent continuous optimization approach for learning nonparametric instantaneous DAGs. The proposed method is faster than constraint-based methods using nonlinear conditional independence tests. We also promote the use of optimization constraints to incorporate prior knowledge into the structure learning process. A broad set of experiments with simulated data demonstrates that the proposed method discovers better DAG structures than several recent comparison methods. We also evaluate the proposed method on complex real-world data acquired from NHL ice hockey games containing a mixture of continuous and discrete variables. The code is available at https://github.com/xiangyu-sun-789/NTS-NOTEARS/.
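    The acyclicity constraint that the NOTEARS family (and hence this extension) enforces throughout the graph is the smooth function $h(W)=\operatorname{tr}(e^{W\circ W})-d$, which is zero exactly when the weighted adjacency matrix $W$ encodes a DAG. A minimal numerical check of that property, with illustrative toy matrices:

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """NOTEARS-style smooth acyclicity measure:
    h(W) = tr(exp(W o W)) - d, which is 0 iff the weighted graph is a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, 2.0],
                [0.0, 0.0, 0.0]])
cyclic = dag + np.array([[0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0],
                         [0.7, 0.0, 0.0]])
print(acyclicity(dag))     # ~0: no directed cycles
print(acyclicity(cyclic))  # > 0: the 0 -> 1 -> 2 -> 0 cycle is penalised
```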
    Consistent Accelerated Inference via Confident Adaptive Transformers. (arXiv:2104.08803v2 [cs.CL] UPDATED)
    (2 min) We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs -- Confident Adaptive Transformers -- in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.
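    A minimal sketch of the early-exit mechanism (intermediate prediction heads plus a meta classifier that decides when to stop): the fixed `threshold` below stands in for the conformal calibration described in the abstract, and all module names and sizes are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy multilayer encoder with a prediction head after every layer and a
    meta classifier that decides when it is safe to stop early."""
    def __init__(self, dim=64, n_layers=6, n_classes=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.heads = nn.ModuleList([nn.Linear(dim, n_classes) for _ in range(n_layers)])
        self.meta = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_layers)])

    def forward(self, x, threshold=0.9):
        for layer, head, meta in zip(self.layers, self.heads, self.meta):
            x = torch.relu(layer(x))
            logits = head(x)
            confidence = torch.sigmoid(meta(x))
            # Stop allocating compute once the meta classifier is confident
            # that this early prediction agrees with the full model.
            if confidence.item() >= threshold:
                break
        return logits

model = EarlyExitEncoder()
with torch.no_grad():
    print(model(torch.randn(1, 64), threshold=0.55))
```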
    Graph Attention Multi-Layer Perceptron. (arXiv:2108.10097v2 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNNs) have recently achieved state-of-the-art performance in many graph-based applications. Despite their high expressive power, they typically need to perform an expensive recursive neighborhood expansion over multiple training epochs and face a scalability issue. Moreover, most of them are inflexible since they are restricted to fixed-hop neighborhoods and are insensitive to the actual receptive field demands of different nodes. We circumvent these limitations by introducing a scalable and flexible Graph Attention Multilayer Perceptron (GAMLP). By separating the non-linear transformation from feature propagation, GAMLP significantly improves scalability and efficiency by performing the propagation procedure in a pre-computed manner. With three principled receptive-field attention mechanisms, each node in GAMLP is flexible and adaptive in leveraging the propagated features over different receptive field sizes. We conduct extensive evaluations on three large open graph benchmarks (ogbn-papers100M, ogbn-products, and ogbn-mag), demonstrating that GAMLP not only achieves state-of-the-art performance but also provides high scalability and efficiency.
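    The decoupling of propagation from transformation that makes this family scalable can be sketched as a one-off precomputation of multi-hop features, after which only an MLP (with attention over the hops, in GAMLP's case) needs to be trained. The GCN-style symmetric normalisation below is a common choice used here as an assumption.

```python
import numpy as np

def precompute_hops(adj, X, n_hops=3):
    """Decoupled propagation: compute [X, AX, A^2 X, ...] once, offline,
    where A is the symmetrically normalised adjacency with self-loops."""
    A = adj + np.eye(adj.shape[0])
    d = A.sum(1)
    A_hat = A / np.sqrt(np.outer(d, d))
    hops, H = [X], X
    for _ in range(n_hops):
        H = A_hat @ H
        hops.append(H)
    return hops   # an MLP (with attention over the hops) is trained on these

adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 8))
features = precompute_hops(adj, X)
print(len(features), features[1].shape)   # 4 hop matrices, each of shape (4, 8)
```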
    Order preserving hierarchical agglomerative clustering. (arXiv:2004.12488v3 [cs.LG] UPDATED)
    (2 min) Partial orders and directed acyclic graphs are commonly recurring data structures that arise naturally in numerous domains and applications and are used to represent ordered relations between entities in the domains. Examples are task dependencies in a project plan, transaction order in distributed ledgers and execution sequences of tasks in computer programs, just to mention a few. We study the problem of order preserving hierarchical clustering of this kind of ordered data. That is, if we have $a < b$ in the original data and denote their respective clusters by $[a]$ and $[b]$, then we shall have $[a] < [b]$ in the produced clustering. The clustering is similarity based and uses standard linkage functions, such as single- and complete linkage, and is an extension of classical hierarchical clustering. To achieve this, we define the output from running classical hierarchical clustering on strictly ordered data to be partial dendrograms; sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is defined as the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the p-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting. A reference implementation is employed for experiments on both synthetic random data and real world data from a database of machine parts. When compared to existing methods, the experiments show that our method excels both in cluster quality and order preservation.
    Modeling Massive Spatial Datasets Using a Conjugate Bayesian Linear Regression Framework. (arXiv:2109.04447v1 [stat.ML])
    (2 min) Geographic Information Systems (GIS) and related technologies have generated substantial interest among statisticians with regard to scalable methodologies for analyzing large spatial datasets. A variety of scalable spatial process models have been proposed that can be easily embedded within a hierarchical modeling framework to carry out Bayesian inference. While the focus of statistical research has mostly been directed toward innovative and more complex model development, relatively limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article discusses how point-referenced spatial process models can be cast as a conjugate Bayesian linear regression that can rapidly deliver inference on spatial processes. The approach allows exact sampling directly (avoids iterative algorithms such as Markov chain Monte Carlo) from the joint posterior distribution of regression parameters, the latent process and the predictive random variables, and can be easily implemented on statistical programming environments such as R.
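    As a generic illustration of the conjugate mechanics (not the paper's spatial model), a Normal-Inverse-Gamma Bayesian linear regression admits exact, direct posterior sampling with no MCMC. The prior hyperparameters below are arbitrary placeholders.

```python
import numpy as np

def sample_posterior(X, y, n_draws=1000, a0=2.0, b0=1.0, tau2=100.0, seed=0):
    """Exact posterior draws for Bayesian linear regression with a conjugate
    Normal-Inverse-Gamma prior: beta | sigma2 ~ N(0, sigma2 * tau2 * I),
    sigma2 ~ InvGamma(a0, b0). No iterative sampler is needed."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V_inv = X.T @ X + np.eye(p) / tau2
    V = np.linalg.inv(V_inv)
    m = V @ X.T @ y
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * (y @ y - m @ V_inv @ m)
    sigma2 = 1.0 / rng.gamma(a_n, 1.0 / b_n, size=n_draws)    # InvGamma draws
    L = np.linalg.cholesky(V)
    z = rng.standard_normal((n_draws, p))
    beta = m + np.sqrt(sigma2)[:, None] * (z @ L.T)           # N(m, sigma2 * V)
    return beta, sigma2

X = np.random.default_rng(1).normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * np.random.default_rng(2).normal(size=200)
beta, sigma2 = sample_posterior(X, y)
print(beta.mean(0), sigma2.mean())
```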
    Adaptive Binary-Ternary Quantization. (arXiv:1909.12205v2 [cs.LG] UPDATED)
    (2 min) Neural network models are resource hungry. It is difficult to deploy such deep networks on devices with limited resources, like smart wearables, cellphones, drones, and autonomous vehicles. Low-bit quantization such as binary and ternary quantization is a common approach to alleviating these resource requirements. Ternary quantization provides a more flexible model and outperforms binary quantization in terms of accuracy, but it doubles the memory footprint and increases the computational cost. In contrast to these approaches, mixed quantized models allow a trade-off between accuracy and memory footprint. In such models, the quantization depth is often chosen manually or tuned using a separate optimization routine; the latter requires training a quantized network multiple times. Here, we propose an adaptive combination of binary and ternary quantization, namely Smart Quantization (SQ), in which the quantization depth is modified directly via a regularization function, so that the model is trained only once. Our experimental results show that the proposed method adapts the quantization depth successfully while keeping the model accuracy high on the MNIST and CIFAR10 benchmarks.
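    For contrast with the learned, regularisation-driven choice proposed here, the standard fixed binary and ternary quantisers can be written in a few lines. The threshold heuristic below is the common ternary-weight-network style rule, used purely as an illustrative assumption.

```python
import numpy as np

def ternarize(w, delta_factor=0.7):
    """Threshold-based ternary quantisation: weights near zero are dropped,
    the rest are mapped to {-alpha, +alpha}."""
    delta = delta_factor * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

def binarize(w):
    """Sign-based binary quantisation with a single scaling factor."""
    return np.abs(w).mean() * np.sign(w)

w = np.random.default_rng(0).normal(size=10)
print("binary :", binarize(w))
print("ternary:", ternarize(w))
```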
    Proactive and AoI-aware Failure Recovery for Stateful NFV-enabled Zero-Touch 6G Networks: Model-Free DRL Approach. (arXiv:2103.03817v3 [eess.SP] UPDATED)
    (2 min) In this paper, we propose a Zero-Touch, deep reinforcement learning (DRL)-based Proactive Failure Recovery framework called ZT-PFR for stateful network function virtualization (NFV)-enabled networks. To this end, we formulate a resource-efficient optimization problem minimizing a network cost function that includes resource cost and a wrong-decision penalty. As a solution, we propose state-of-the-art DRL-based methods such as soft actor-critic (SAC) and proximal policy optimization (PPO). In addition, to train and test our DRL agents, we propose a novel impending failure model. Moreover, to keep network status information at an acceptable freshness level for appropriate decision-making, we apply the concept of age of information to strike a balance between event-driven and scheduling-based monitoring. Several key system and DRL algorithm design insights for ZT-PFR are drawn from our analysis and simulation results. For example, we use a hybrid neural network, consisting of long short-term memory layers, in the DRL agents.
    Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies. (arXiv:2103.05245v2 [cs.DC] UPDATED)
    (2 min) Operation and maintenance of large distributed cloud applications can quickly become unmanageably complex, putting human operators under immense stress when problems occur. Utilizing machine learning for identification and localization of anomalies in such systems supports human experts and enables fast mitigation. However, due to the various inter-dependencies of system components, anomalies do not only affect their origin but propagate through the distributed system. Taking this into account, we present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies and placement as edges to improve the identification and localization of anomalies. Given a series of metric KPIs, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected. During our experiments, we simulate a distributed cloud application deployment and synthetically inject anomalies. The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus which incorporates information about system component dependencies.
    Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval. (arXiv:2104.08801v2 [cs.CL] UPDATED)
    (2 min) In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA) from source to target domain. While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between the target domain and synthetic data distribution, and reduces model overfitting to the source domain. We run UDA experiments on question generation and passage retrieval from the \textit{Natural Questions} domain to machine learning and biomedical domains. We find that back-training vastly outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6\% top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset- \textit{MLQuestions} containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.
    Online Enhanced Semantic Hashing: Towards Effective and Efficient Retrieval for Streaming Multi-Modal Data. (arXiv:2109.04260v1 [cs.MM])
    (2 min) With the vigorous development of multimedia equipment and applications, efficient retrieval of large-scale multi-modal data has become a trendy research topic. Among these techniques, hashing has become a prevalent choice due to its retrieval efficiency and low storage cost. Although multi-modal hashing has drawn much attention in recent years, some problems remain. First, existing methods are mainly designed in batch mode and are unable to efficiently handle streaming multi-modal data. Second, all existing online multi-modal hashing methods fail to effectively handle unseen new classes that arrive continuously with streaming data chunks. In this paper, we propose a new model, termed Online enhAnced SemantIc haShing (OASIS). We design a novel semantic-enhanced representation of the data, which helps handle newly arriving classes, and thereby construct an enhanced semantic objective function. An efficient and effective discrete online optimization algorithm is further proposed for OASIS. Extensive experiments show that our method outperforms state-of-the-art models. For good reproducibility and to benefit the community, our code and data are already available in the supplementary material and will be made publicly available.
    From Weakly Supervised Learning to Biquality Learning: an Introduction. (arXiv:2012.09632v3 [cs.LG] CROSS LISTED)
    (2 min) The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of "supervision deficiencies". In WSL use cases, a variety of situations exists where the collected "information" is imperfect. The paradigm of WSL attempts to list and cover these problems with associated solutions. In this paper, we review the research progress on WSL with the aim of providing a brief introduction to this field. We present the three axes of the WSL cube and an overview of most of the elements of its facets. We propose three measurable quantities that act as coordinates in the previously defined cube, namely Quality, Adaptability, and Quantity of information. We thus suggest that the Biquality Learning framework can be defined as a plane of the WSL cube and propose to re-discover previously unrelated patches of the WSL literature as a unified Biquality Learning literature.
    Deep controlled learning of MDP policies with an application to lost-sales inventory control. (arXiv:2011.15122v2 [cs.LG] UPDATED)
    (2 min) Recent literature established that neural networks can represent good policies across a range of stochastic dynamic models in supply chain and logistics. We incorporate variance reduction techniques in a newly proposed algorithm, to overcome limitations of the model-free algorithms typically employed to learn such neural network policies. For the classical lost sales inventory model, the algorithm learns neural network policies that are superior to those learned using model-free algorithms, while outperforming the best heuristic benchmarks by an order of magnitude. The algorithm is an interesting candidate to apply to other stochastic dynamic problems in supply chain and logistics, because the ideas in its development are generic.
    Dynamic Modeling of Hand-Object Interactions via Tactile Sensing. (arXiv:2109.04378v1 [cs.RO])
    (2 min) Tactile sensing is critical for humans to perform everyday tasks. While significant progress has been made in analyzing object grasping from vision, it remains unclear how we can utilize tactile sensing to reason about and model the dynamics of hand-object interactions. In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects. We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model, which can then be used on its own at test time. The tactile model aims to predict the 3D locations of both the hand and the object purely from the touch data by combining a predictive model and a contrastive learning module. This framework can reason about the interaction patterns from the tactile data, hallucinate the changes in the environment, estimate the uncertainty of the prediction, and generalize to unseen objects. We also provide detailed ablation studies regarding different system designs as well as visualizations of the predicted trajectories. This work takes a step toward dynamics modeling of hand-object interactions from dense tactile sensing, which opens the door for future applications in activity learning, human-computer interaction, and imitation learning for robotics.
    A divide-and-conquer algorithm for quantum state preparation. (arXiv:2008.01511v2 [quant-ph] UPDATED)
    (2 min) Advantages in several fields of research and industry are expected with the rise of quantum computers. However, the computational cost of loading classical data into quantum computers can impose restrictions on possible quantum speedups. Known algorithms to create arbitrary quantum states require quantum circuits with depth O(N) to load an N-dimensional vector. Here, we show that it is possible to load an N-dimensional vector with a quantum circuit of polylogarithmic depth and entangled information in ancillary qubits. Our results show that we can efficiently load data into quantum devices using a divide-and-conquer strategy that exchanges computational time for space. We demonstrate a proof of concept on a real quantum device and present two applications for quantum machine learning. We expect that this new loading strategy enables the quantum speedup of tasks that require loading a significant volume of information into quantum devices.
    Cross DQN: Cross Deep Q Network for Ads Allocation in Feed. (arXiv:2109.04353v1 [cs.LG])
    (2 min) E-commerce platforms usually display a mixed list of ads and organic items in a feed. One key problem is to allocate the limited slots in the feed to maximize overall revenue as well as improve user experience, which requires a good model of user preference. Instead of modeling the influence of individual items on user behavior, the arrangement signal models the influence of the arrangement of items and may lead to a better allocation strategy. However, most previous strategies fail to model such a signal and therefore result in suboptimal performance. To this end, we propose Cross Deep Q Network (Cross DQN) to extract the arrangement signal by crossing the embeddings of different items and processing the crossed sequence in the feed. Our model results in higher revenue and better user experience than state-of-the-art baselines in offline experiments. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on the Meituan feed to serve more than 300 million customers.
    Quantum Machine Learning for Finance. (arXiv:2109.04298v1 [quant-ph])
    (2 min) Quantum computers are expected to surpass the computational capabilities of classical computers during this decade and achieve disruptive impact on numerous industry sectors, particularly finance. In fact, finance is estimated to be the first industry sector to benefit from Quantum Computing, not only in the medium and long terms, but even in the short term. This review paper presents the state of the art of quantum algorithms for financial applications, with a particular focus on those use cases that can be solved via Machine Learning.
    MutualGraphNet: A novel model for motor imagery classification. (arXiv:2109.04361v1 [eess.SP])
    (2 min) Motor imagery classification is of great significance to humans with mobility impairments, and how to extract and utilize effective features from motor imagery electroencephalogram (EEG) channels has always been a focus of attention. There are many different methods for motor imagery classification, but the limited understanding of the human brain requires more effective methods for extracting the features of EEG data. Graph neural networks (GNNs) have demonstrated their effectiveness in classifying graph structures, and the use of GNNs provides new possibilities for extracting brain structure connection features. In this paper we propose a novel graph neural network based on the mutual information of the raw EEG channels, called MutualGraphNet. We use the mutual information as the adjacency matrix, which, combined with a spatial-temporal graph convolution network (ST-GCN), can extract the transition rules of the motor imagery EEG channel data more effectively. Experiments are conducted on a motor imagery EEG dataset, and we compare our model with current state-of-the-art approaches; the results suggest that MutualGraphNet is robust enough to learn interpretable features and outperforms the current state-of-the-art methods.
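    A plain sketch of the mutual-information adjacency construction (histogram-based MI between binned channels, which could then be fed to an ST-GCN). The bin count, the MI estimator, and the function names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_adjacency(signals, n_bins=16):
    """Adjacency matrix whose (i, j) entry is the mutual information between
    binned channel i and channel j; usable as a learned-graph input to a GNN."""
    n_channels = signals.shape[0]
    binned = np.stack([np.digitize(s, np.histogram_bin_edges(s, n_bins))
                       for s in signals])
    A = np.zeros((n_channels, n_channels))
    for i in range(n_channels):
        for j in range(n_channels):
            A[i, j] = mutual_info_score(binned[i], binned[j])
    np.fill_diagonal(A, 0.0)   # no self-loops
    return A

eeg = np.random.default_rng(0).normal(size=(8, 1000))   # 8 channels, 1000 samples
print(mi_adjacency(eeg).shape)                           # (8, 8)
```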
    OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching. (arXiv:2109.04307v1 [cs.LG])
    (2 min) Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. This limits IRL applications in the real world, where environment interactions can become highly expensive. To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts an off-policy data distribution instead of an on-policy one, enabling a significant reduction in the number of interactions with the environment, (2) learns a stationary reward function that is transferable with high generalization capability to changing dynamics, and (3) leverages mode-covering behavior for faster convergence. Through experiments, we demonstrate that our method is considerably more sample efficient and generalizes to novel environments. Our method achieves better or comparable policy performance than baselines with significantly fewer interactions. Furthermore, we empirically show that the recovered reward function generalizes to different tasks where prior arts are prone to fail.
    Energy Attack: On Transferring Adversarial Examples. (arXiv:2109.04300v1 [cs.LG])
    (2 min) In this work we propose Energy Attack, a transfer-based black-box $L_\infty$-adversarial attack. The attack is parameter-free and does not require gradient approximation. In particular, we first obtain white-box adversarial perturbations of a surrogate model and divide these perturbations into small patches. Then we extract the unit component vectors and eigenvalues of these patches with principal component analysis (PCA). Based on the eigenvalues, we can model the energy distribution of adversarial perturbations. We then perform black-box attacks by sampling from the perturbation patches according to their energy distribution, and tiling the sampled patches to form a full-size adversarial perturbation. This can be done without any access to the victim models. Extensive experiments demonstrate that the proposed Energy Attack achieves state-of-the-art performance in black-box attacks on various models and several datasets. Moreover, the extracted distribution is able to transfer among different model architectures and different datasets, and is therefore intrinsic to vision architectures.
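    A rough sketch of the sampling-and-tiling step as we read the abstract: PCA over surrogate perturbation patches gives unit components and an energy (eigenvalue) distribution, from which patches are drawn and tiled into an $L_\infty$-bounded perturbation. The surrogate perturbations are replaced by random noise here, and the patch size, sign flipping, and final sign projection are our assumptions rather than the authors' exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for white-box adversarial perturbations of a surrogate model,
# cut into small square patches (here: random noise, 4x4 patches flattened).
patches = rng.normal(size=(500, 16))

# PCA of the patches: eigenvectors = unit components, eigenvalues = energies.
patches -= patches.mean(0)
cov = patches.T @ patches / len(patches)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = np.clip(eigvals, 0.0, None)
probs = eigvals / eigvals.sum()                    # energy distribution

def sample_perturbation(image_shape=(32, 32), patch_hw=4, eps=8 / 255):
    """Tile components sampled from the energy distribution into a full-size
    L-infinity bounded perturbation (no access to the victim model needed)."""
    H, W = image_shape
    out = np.zeros(image_shape)
    for r in range(0, H, patch_hw):
        for c in range(0, W, patch_hw):
            k = rng.choice(len(eigvals), p=probs)
            sign = rng.choice([-1.0, 1.0])
            out[r:r + patch_hw, c:c + patch_hw] = sign * eigvecs[:, k].reshape(patch_hw, patch_hw)
    return eps * np.sign(out)                      # clip to the L_inf budget

print(sample_perturbation().shape)
```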
    Data-Driven Robust Optimization using Unsupervised Deep Learning. (arXiv:2011.09769v3 [math.OC] UPDATED)
    (2 min) Robust optimization has been established as a leading methodology to approach decision problems under uncertainty. To derive a robust optimization model, a central ingredient is to identify a suitable model for uncertainty, which is called the uncertainty set. An ongoing challenge in the recent literature is to derive uncertainty sets from given historical data that result in solutions that are robust regarding future scenarios. In this paper we use an unsupervised deep learning method to learn and extract hidden structures from data, leading to non-convex uncertainty sets and better robust solutions. We prove that most of the classical uncertainty classes are special cases of our derived sets and that optimizing over them is strongly NP-hard. Nevertheless, we show that the trained neural networks can be integrated into a robust optimization model by formulating the adversarial problem as a convex quadratic mixed-integer program. This allows us to derive robust solutions through an iterative scenario generation process. In our computational experiments, we compare this approach to a similar approach using kernel-based support vector clustering. We find that uncertainty sets derived by the unsupervised deep learning method find a better description of data and lead to robust solutions that outperform the comparison method both with respect to objective value and feasibility.
    All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. (arXiv:2109.04404v1 [cs.CL])
    (2 min) Similarity measures are a vital tool for understanding how language models represent and process language. Standard representational similarity measures such as cosine similarity and Euclidean distance have been successfully used in static word embedding models to understand how words cluster in semantic space. Recently, these measures have been applied to embeddings from contextualized models such as BERT and GPT-2. In this work, we call into question the informativity of such measures for contextualized language models. We find that a small number of rogue dimensions, often just 1-3, dominate these measures. Moreover, we find a striking mismatch between the dimensions that dominate similarity measures and those which are important to the behavior of the model. We show that simple postprocessing techniques such as standardization are able to correct for rogue dimensions and reveal underlying representational quality. We argue that accounting for rogue dimensions is essential for any similarity-based analysis of contextual language models.
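    The standardisation fix is easy to reproduce on synthetic embeddings: z-scoring each dimension removes the dominance of a single high-mean rogue dimension in cosine similarity. The embedding sizes and the injected rogue dimension below are fabricated for illustration.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 768))
emb[:, 0] += 40.0                     # one "rogue" dimension with a huge mean

a, b = emb[0], emb[1]
print("raw cosine         :", cosine(a, b))          # dominated by dimension 0

mu, sd = emb.mean(0), emb.std(0)
std = (emb - mu) / sd                 # standardise each dimension
print("standardised cosine:", cosine(std[0], std[1]))
```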
    Dubhe: Towards Data Unbiasedness with Homomorphic Encryption in Federated Learning Client Selection. (arXiv:2109.04253v1 [cs.CR])
    (2 min) Federated learning (FL) is a distributed machine learning paradigm that allows clients to collaboratively train a model over their own local data. FL promises the privacy of clients and its security can be strengthened by cryptographic methods such as additively homomorphic encryption (HE). However, the efficiency of FL could seriously suffer from the statistical heterogeneity in both the data distribution discrepancy among clients and the global distribution skewness. We mathematically demonstrate the cause of performance degradation in FL and examine the performance of FL over various datasets. To tackle the statistical heterogeneity problem, we propose a pluggable system-level client selection method named Dubhe, which allows clients to proactively participate in training, meanwhile preserving their privacy with the assistance of HE. Experimental results show that Dubhe is comparable with the optimal greedy method on the classification accuracy, with negligible encryption and communication overhead.
    fGOT: Graph Distances based on Filters and Optimal Transport. (arXiv:2109.04442v1 [cs.LG])
    (2 min) Graph comparison deals with identifying similarities and dissimilarities between graphs. A major obstacle is the unknown alignment of graphs, as well as the lack of accurate and inexpensive comparison metrics. In this work we introduce the filter graph distance. It is an optimal transport based distance which drives graph comparison through the probability distribution of filtered graph signals. This creates a highly flexible distance, capable of prioritising different spectral information in observed graphs, offering a wide range of choices for a comparison metric. We tackle the problem of graph alignment by computing graph permutations that minimise our new filter distances, which implicitly solves the graph comparison problem. We then propose a new approximate cost function that circumvents many computational difficulties inherent to graph comparison and permits the exploitation of fast algorithms such as mirror gradient descent, without grossly sacrificing the performance. We finally propose a novel algorithm derived from a stochastic version of mirror gradient descent, which accommodates the non-convexity of the alignment problem, offering a good trade-off between performance accuracy and speed. The experiments on graph alignment and classification show that the flexibility gained through filter graph distances can have a significant impact on performance, while the difference in speed offered by the approximation cost makes the framework applicable in practical settings.
    Supervised Linear Dimension-Reduction Methods: Review, Extensions, and Comparisons. (arXiv:2109.04244v1 [stat.ML])
    (2 min) Principal component analysis (PCA) is a well-known linear dimension-reduction method that has been widely used in data analysis and modeling. It is an unsupervised learning technique that identifies a suitable linear subspace for the input variable that contains maximal variation and preserves as much information as possible. PCA has also been used in prediction models where the original, high-dimensional space of predictors is reduced to a smaller, more manageable, set before conducting regression analysis. However, this approach does not incorporate information in the response during the dimension-reduction stage and hence can have poor predictive performance. To address this concern, several supervised linear dimension-reduction techniques have been proposed in the literature. This paper reviews selected techniques, extends some of them, and compares their performance through simulations. Two of these techniques, partial least squares (PLS) and least-squares PCA (LSPCA), consistently outperform the others in this study.
    On the Hardness of PAC-learning stabilizer States with Noise. (arXiv:2102.05174v2 [quant-ph] UPDATED)
    (2 min) We consider the problem of learning stabilizer states with noise in the Probably Approximately Correct (PAC) framework of Aaronson (2007) for learning quantum states. In the noiseless setting, an algorithm for this problem was recently given by Rocchetto (2018), but the noisy case was left open. Motivated by approaches to noise tolerance from classical learning theory, we introduce the Statistical Query (SQ) model for PAC-learning quantum states, and prove that algorithms in this model are indeed resilient to common forms of noise, including classification and depolarizing noise. We prove an exponential lower bound on learning stabilizer states in the SQ model. Even outside the SQ model, we prove that learning stabilizer states with noise is in general as hard as Learning Parity with Noise (LPN) using classical examples. Our results position the problem of learning stabilizer states as a natural quantum analogue of the classical problem of learning parities: easy in the noiseless setting, but seemingly intractable even with simple forms of noise.
    GNisi: A graph network for reconstructing Ising models from multivariate binarized data. (arXiv:2109.04257v1 [cs.LG])
    (2 min) Ising models are a simple generative approach to describing interacting binary variables. They have proven useful in a number of biological settings because they enable one to represent observed many-body correlations as the separable consequence of many direct, pairwise statistical interactions. The inference of Ising models from data can be computationally very challenging, and often one must be satisfied with numerical approximations or limited precision. In this paper we present a novel method for the determination of Ising parameters from data, called GNisi, which uses a graph neural network trained on known Ising models in order to construct the parameters for unseen data. We show that GNisi is more accurate than the existing state-of-the-art software, and we illustrate our method by applying GNisi to gene expression data.
    DAE : Discriminatory Auto-Encoder for multivariate time-series anomaly detection in air transportation. (arXiv:2109.04247v1 [cs.LG])
    (2 min) The Automatic Dependent Surveillance Broadcast protocol is one of the latest compulsory advances in air surveillance. While it supports the tracking of the ever-growing number of aircraft in the air, it also introduces cybersecurity issues that must be mitigated, e.g., false data injection attacks where an attacker emits fake surveillance information. The recent data sources and tools available for obtaining flight tracking records allow researchers to create datasets and develop Machine Learning models capable of detecting such anomalies in en-route trajectories. In this context, we propose a novel multivariate anomaly detection model called Discriminatory Auto-Encoder (DAE). It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase (e.g., climbing, cruising, or descending) during its training. To illustrate the DAE's efficiency, an evaluation dataset was created using real-life anomalies as well as realistically crafted ones, on which the DAE and three anomaly detection models from the literature were evaluated. Results show that the DAE achieves better results in both accuracy and speed of detection. The dataset, the model implementations, and the evaluation results are available in an online repository, thereby enabling replicability and facilitating future experiments.
    MATE: Multi-view Attention for Table Transformer Efficiency. (arXiv:2109.04312v1 [cs.CL])
    (2 min) This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly with respect to speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state-of-the-art for three table reasoning datasets. For HybridQA (Chen et al., 2020b), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
    SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. (arXiv:1911.03437v5 [cs.CL] UPDATED)
    (2 min) Transfer learning has fundamentally changed the landscape of natural language processing (NLP) research. Many existing state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely large capacity of pre-trained models, aggressive fine-tuning often causes the adapted model to overfit the data of downstream tasks and forget the knowledge of the pre-trained model. To address the above issue in a more principled manner, we propose a new computational framework for robust and efficient fine-tuning for pre-trained language models. Specifically, our proposed framework contains two important ingredients: 1. Smoothness-inducing regularization, which effectively manages the capacity of the model; 2. Bregman proximal point optimization, which is a class of trust-region methods and can prevent knowledge forgetting. Our experiments demonstrate that our proposed method achieves the state-of-the-art performance on multiple NLP benchmarks.
    The UU-test for Statistical Modeling of Unimodal Data. (arXiv:2008.12537v2 [cs.LG] UPDATED)
    (0 min) Deciding on the unimodality of a dataset is an important problem in data analysis and statistical modeling. It provides knowledge about the structure of the dataset, i.e., whether data points have been generated by a probability distribution with a single peak or with more than one peak. Such knowledge is very useful for several data analysis problems, such as deciding on the number of clusters and determining unimodal projections. We propose a technique called the UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset. The method operates on the empirical cumulative density function (ecdf) of the dataset. It attempts to build a piecewise linear approximation of the ecdf that is unimodal and models the data sufficiently, in the sense that the data corresponding to each linear segment follow the uniform distribution. A unique feature of this approach is that, in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model. We present experimental results in order to assess the ability of the method to decide on unimodality and perform comparisons with the well-known dip-test approach. In addition, for unimodal datasets we evaluate the Uniform Mixture Models provided by the proposed method using the test-set log-likelihood and the two-sample Kolmogorov-Smirnov (KS) test.
    Estimation of Corporate Greenhouse Gas Emissions via Machine Learning. (arXiv:2109.04318v1 [cs.LG])
    (2 min) As an important step to fulfill the Paris Agreement and achieve net-zero emissions by 2050, the European Commission adopted the most ambitious package of climate impact measures in April 2021 to improve the flow of capital towards sustainable activities. For these and other international measures to be successful, reliable data is key. The ability to see the carbon footprint of companies around the world will be critical for investors to comply with the measures. However, with only a small portion of companies volunteering to disclose their greenhouse gas (GHG) emissions, it is nearly impossible for investors to align their investment strategies with the measures. By training a machine learning model on disclosed GHG emissions, we are able to estimate the emissions of other companies globally who do not disclose their emissions. In this paper, we show that our model provides accurate estimates of corporate GHG emissions to investors such that they are able to align their investments with the regulatory measures and achieve net-zero goals.
    Combining resampling and reweighting for faithful stochastic optimization. (arXiv:2105.14694v2 [cs.LG] UPDATED)
    (0 min) Many machine learning and data science tasks require solving non-convex optimization problems. When the loss function is a sum of multiple terms, a popular method is the stochastic gradient descent. Viewed as a process for sampling the loss function landscape, the stochastic gradient descent is known to prefer flat minima. Though this is desired for certain optimization problems such as in deep learning, it causes issues when the goal is to find the global minimum, especially if the global minimum resides in a sharp valley. Illustrated with a simple motivating example, we show that the fundamental reason is that the difference in the Lipschitz constants of multiple terms in the loss function causes stochastic gradient descent to experience different variances at different minima. In order to mitigate this effect and perform faithful optimization, we propose a combined resampling-reweighting scheme to balance the variance at local minima and extend to general loss functions. We explain from the numerical stability perspective how the proposed scheme is more likely to select the true global minimum, and the local convergence analysis perspective how it converges to a minimum faster when compared with the vanilla stochastic gradient descent. Experiments from robust statistics and computational chemistry are provided to demonstrate the theoretical findings.
    A Black-box Adversarial Attack Strategy with Adjustable Sparsity and Generalizability for Deep Image Classifiers. (arXiv:2004.13002v3 [cs.CR] UPDATED)
    (2 min) Constructing adversarial perturbations for deep neural networks is an important direction of research. Crafting image-dependent adversarial perturbations using white-box feedback has hitherto been the norm for such adversarial attacks. However, black-box attacks are much more practical for real-world applications. Universal perturbations applicable across multiple images are gaining popularity due to their innate generalizability. There have also been efforts to restrict the perturbations to a few pixels in the image. This helps to retain visual similarity with the original images making such attacks hard to detect. This paper marks an important step which combines all these directions of research. We propose the DEceit algorithm for constructing effective universal pixel-restricted perturbations using only black-box feedback from the target network. We conduct empirical investigations using the ImageNet validation set on the state-of-the-art deep neural classifiers by varying the number of pixels to be perturbed from a meagre 10 pixels to as high as all pixels in the image. We find that perturbing only about 10% of the pixels in an image using DEceit achieves a commendable and highly transferable Fooling Rate while retaining the visual quality. We further demonstrate that DEceit can be successfully applied to image dependent attacks as well. In both sets of experiments, we outperformed several state-of-the-art methods.
    EEGDnet: Fusing Non-Local and Local Self-Similarity for 1-D EEG Signal Denoising with 2-D Transformer. (arXiv:2109.04235v1 [eess.SP])
    (2 min) Electroencephalography (EEG) has proven to be a useful approach for producing a brain-computer interface (BCI). However, the one-dimensional (1-D) EEG signal is easily disturbed by certain artifacts (a.k.a. noise) due to its high temporal resolution. Thus, it is crucial to remove the noise from the received EEG signal. Recently, deep learning-based EEG signal denoising approaches have achieved impressive performance compared with traditional ones. It is well known that the self-similarity characteristics (both non-local and local) of data (e.g., natural images and time-domain signals) are widely leveraged for denoising. However, existing deep learning-based EEG signal denoising methods ignore either the non-local self-similarity (e.g., 1-D convolutional neural networks) or the local one (e.g., fully connected networks and recurrent neural networks). To address this issue, we propose a novel 1-D EEG signal denoising network with a 2-D transformer, namely EEGDnet. Specifically, we comprehensively take into account the non-local and local self-similarity of the EEG signal through the transformer module. By fusing non-local self-similarity in self-attention blocks and local self-similarity in feed-forward blocks, the negative impact caused by noise and outliers can be reduced significantly. Extensive experiments show that, compared with other state-of-the-art models, EEGDnet achieves much better performance in terms of both quantitative and qualitative metrics.
    The challenge of reproducible ML: an empirical study on the impact of bugs. (arXiv:2109.03991v1 [cs.SE])
    (2 min) Reproducibility is a crucial requirement in scientific research. When the results of research studies and scientific papers have been found difficult or impossible to reproduce, we face a challenge known as the reproducibility crisis. Although the demand for reproducibility in Machine Learning (ML) is acknowledged in the literature, a main barrier is inherent non-determinism in ML training and inference. In this paper, we establish the fundamental factors that cause non-determinism in ML systems. A framework, ReproduceML, is then introduced for deterministic evaluation of ML experiments in a real, controlled environment. ReproduceML allows researchers to investigate the effects of software configuration on ML training and inference. Using ReproduceML, we run a case study: an investigation of the impact of bugs inside ML libraries on the performance of ML experiments. This study attempts to quantify the impact that the occurrence of bugs in a popular ML framework, PyTorch, has on the performance of trained models. To do so, a comprehensive methodology is proposed to collect buggy versions of ML libraries and run deterministic ML experiments using ReproduceML. Our initial finding is that, based on our limited dataset, there is no evidence to show that bugs which occurred in PyTorch affect the performance of trained models. The proposed methodology, as well as ReproduceML, can be employed for further research on non-determinism and bugs.
    Optimal Reservoir Operations using Long Short-Term Memory Network. (arXiv:2109.04255v1 [cs.LG])
    (2 min) A reliable forecast of inflows to the reservoir is a key factor in the optimal operation of reservoirs. Real-time operation of the reservoir based on forecasts of inflows can lead to substantial economic gains. However, the forecast of inflow is an intricate task, as it has to incorporate the impacts of climate and hydrological changes. Therefore, the major objective of the present work is to develop a novel approach based on long short-term memory (LSTM) networks for the forecast of inflows. A real-time inflow forecast, in other words the daily inflow at the reservoir, helps in the efficient operation of water resources. Also, daily variations in the release can be monitored efficiently and the reliability of operation is improved. This work also proposes a naive LSTM-based anomaly detection baseline, in other words a strong baseline for flood and drought forecasting for any deep learning-based prediction model. The practicality of the approach is demonstrated using the observed daily data of the past 20 years from the Bhakra Dam in India. The results of the simulations conducted herein clearly indicate the superiority of the LSTM approach over traditional methods of forecasting. Although the experiments are run on data from the Bhakra Dam reservoir in India, the LSTM model and the anomaly detection algorithm are general purpose and can be applied to any basin with minimal changes. A distinct practical advantage of the LSTM method presented herein is that it can adequately simulate non-stationarity and non-linearity in the historical data.
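    A minimal sliding-window LSTM forecaster of this kind looks as follows; the synthetic series, window length, and hyperparameters are illustrative stand-ins for the study's 20 years of daily Bhakra Dam inflows, not its actual configuration.

```python
import torch
import torch.nn as nn

class InflowLSTM(nn.Module):
    """One-step-ahead daily inflow forecaster from a window of past inflows."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # forecast for the next day

# Synthetic "daily inflow" series and sliding windows of length 30.
series = torch.sin(torch.linspace(0, 60, 2000)) + 0.1 * torch.randn(2000)
window = 30
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

model = InflowLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for epoch in range(5):                    # short demo training loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print("final MSE:", loss.item())
```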
    AutoSmart: An Efficient and Automatic Machine Learning framework for Temporal Relational Data. (arXiv:2109.04115v1 [cs.LG])
    (2 min) Temporal relational data, perhaps the most commonly used data type in industrial machine learning applications, requires labor-intensive feature engineering and data analysis to yield precise model predictions. An automatic machine learning framework is needed to ease the manual effort of fine-tuning models, so that experts can focus on problems that truly need human engagement, such as problem definition, deployment, and business services. However, there are three main challenges in building automatic solutions for temporal relational data: 1) how to effectively and automatically mine useful information from multiple tables and the relations between them; 2) how to self-adjust so that time and memory consumption stay within a given budget; and 3) how to provide generic solutions for a wide range of tasks. In this work, we propose a solution that successfully addresses these issues in an end-to-end automatic way. The proposed framework, AutoSmart, is the winning solution to the AutoML Track of the KDD Cup 2019, one of the largest AutoML competitions to date (860 teams with around 4,955 submissions). The framework includes automatic data processing, table merging, feature engineering, and model tuning, with a time and memory controller for efficiently and automatically formulating the models. The proposed framework outperforms the baseline solution significantly on several datasets in various domains.
    Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging. (arXiv:2010.03060v5 [cs.LG] UPDATED)
    (2 min) A key challenge in training neural networks for a given medical imaging task is often the difficulty of obtaining a sufficient number of manually labeled examples. In contrast, textual imaging reports, which are often readily available in medical records, contain rich but unstructured interpretations written by experts as part of standard clinical practice. We propose using these textual reports as a form of weak supervision to improve the image interpretation performance of a neural network without requiring additional manually labeled examples. We use an image-text matching task to train a feature extractor and then fine-tune it in a transfer learning setting for a supervised task using a small labeled dataset. The end result is a neural network that automatically interprets imagery without requiring textual reports during inference. This approach can be applied to any task for which text-image pairs are readily available. We evaluate our method on three classification tasks and find consistent performance improvements, reducing the need for labeled data by 67%-98%.
    Adaptive Federated Optimization. (arXiv:2003.00295v5 [cs.LG] UPDATED)
    (2 min) Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.
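    A rough sketch of the server-side idea for a FedAdam-style variant: clients run local training, and the server treats the averaged model delta as a pseudo-gradient for an Adam-like update. The hyperparameters and the stand-in client deltas below are illustrative only.

        import numpy as np

        def fedadam_round(w, client_deltas, state, lr=0.1, b1=0.9, b2=0.99, tau=1e-3):
            """One server round: average client deltas, apply an Adam-like update."""
            delta = np.mean(client_deltas, axis=0)                 # pseudo-gradient
            state["m"] = b1 * state["m"] + (1 - b1) * delta
            state["v"] = b2 * state["v"] + (1 - b2) * delta ** 2
            return w + lr * state["m"] / (np.sqrt(state["v"]) + tau), state

        w = np.zeros(5)                                            # toy 5-parameter model
        state = {"m": np.zeros(5), "v": np.zeros(5)}
        deltas = [np.random.randn(5) * 0.01 for _ in range(3)]     # stand-ins for local updates
        w, state = fedadam_round(w, deltas, state)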
    Sharp Bounds on the Approximation Rates, Metric Entropy, and $n$-widths of Shallow Neural Networks. (arXiv:2101.12365v7 [stat.ML] UPDATED)
    (2 min) In this article, we study approximation properties of the variation spaces corresponding to shallow neural networks with a variety of activation functions. We introduce two main tools for estimating the metric entropy, approximation rates, and $n$-widths of these spaces. First, we introduce the notion of a smoothly parameterized dictionary and give upper bounds on the non-linear approximation rates, metric entropy and $n$-widths of their absolute convex hull. The upper bounds depend upon the order of smoothness of the parameterization. This result is applied to dictionaries of ridge functions corresponding to shallow neural networks, and the resulting bounds improve upon existing results in many cases. Next, we provide a method for lower bounding the metric entropy and $n$-widths of variation spaces which contain certain classes of ridge functions. This result gives sharp lower bounds on the $L^2$-approximation rates, metric entropy, and $n$-widths for variation spaces corresponding to neural networks with a range of important activation functions, including ReLU$^k$, sigmoidal activation functions with bounded variation, and the B-spline activation functions.
    Multi-granularity Textual Adversarial Attack with Behavior Cloning. (arXiv:2109.04367v1 [cs.CL])
    (2 min) Recently, textual adversarial attack models have become increasingly popular due to their success in estimating the robustness of NLP models. However, existing works have obvious deficiencies. (1) They usually consider only a single granularity of modification strategies (e.g., word-level or sentence-level), which is insufficient to explore the holistic textual space for generation; (2) They need to query victim models hundreds of times to make a successful attack, which is highly inefficient in practice. To address these problems, in this paper we propose MAYA, a Multi-grAnularitY Attack model that effectively generates high-quality adversarial samples with fewer queries to victim models. Furthermore, we propose a reinforcement-learning-based method to train a multi-granularity attack agent through behavior cloning with expert knowledge from our MAYA algorithm to further reduce the number of queries. Additionally, we adapt the agent to attack black-box models that only output labels without confidence scores. We conduct comprehensive experiments to evaluate our attack models by attacking BiLSTM, BERT and RoBERTa in two different black-box attack settings and on three benchmark datasets. Experimental results show that our models achieve overall better attack performance and produce more fluent and grammatical adversarial samples than baseline models. Moreover, our adversarial attack agent significantly reduces the number of queries in both attack settings. Our code is released at https://github.com/Yangyi-Chen/MAYA.
    On the use of Wasserstein metric in topological clustering of distributional data. (arXiv:2109.04301v1 [cs.LG])
    (2 min) This paper deals with a clustering algorithm for histogram data based on Self-Organizing Map (SOM) learning. It combines dimension reduction by SOM with clustering of the data in the reduced space. A dissimilarity measure between distributions suited to this kind of data is used: the $L_2$ Wasserstein distance. Moreover, the number of clusters is not fixed in advance but is found automatically according to a local data density estimation in the original space. Applications on synthetic and real data sets corroborate the proposed strategy.
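    For one-dimensional histograms, the $L_2$ Wasserstein distance can be computed from the quantile functions; a small NumPy sketch with illustrative bins follows.

        import numpy as np

        def l2_wasserstein_hist(centers, p, q, n_grid=1000):
            """L2 Wasserstein distance between two histograms on shared bin centers,
            computed as the L2 distance between their quantile functions."""
            p, q = p / p.sum(), q / q.sum()
            t = (np.arange(n_grid) + 0.5) / n_grid      # probability grid in (0, 1)
            Fp, Fq = np.cumsum(p), np.cumsum(q)         # CDF values at the bin centers
            xp = centers[np.searchsorted(Fp, t)]        # quantile function of p
            xq = centers[np.searchsorted(Fq, t)]        # quantile function of q
            return np.sqrt(np.mean((xp - xq) ** 2))

        centers = np.linspace(0.0, 10.0, 50)
        h1 = np.exp(-0.5 * (centers - 3.0) ** 2)        # histogram centred at 3
        h2 = np.exp(-0.5 * (centers - 5.0) ** 2)        # histogram centred at 5
        print(l2_wasserstein_hist(centers, h1, h2))     # roughly the mean shift (about 2)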
    MetaXT: Meta Cross-Task Transfer between Disparate Label Spaces. (arXiv:2109.04240v1 [cs.LG])
    (2 min) Despite the universal representational power of pre-trained language models, adapting them to a specific NLP task still requires a considerably large amount of labeled data. Effective task fine-tuning is challenging when only a few labeled examples are available for the task. In this paper, we aim to address the problem of few-shot task learning by exploiting and transferring from a different task which admits a related but disparate label space. Specifically, we devise a label transfer network (LTN) that transforms the labels from the source task to the target task of interest for training. Both the LTN and the model for task prediction are learned via a bi-level optimization framework, which we term MetaXT. MetaXT offers a principled solution for best adapting a pre-trained language model to the target task by transferring knowledge from the source task. Empirical evaluations on cross-task transfer settings for four NLP tasks, covering two different types of label-space disparity, demonstrate the effectiveness of MetaXT, especially when the labeled data in the target task is limited.
    System Optimization in Synchronous Federated Training: A Survey. (arXiv:2109.03999v1 [cs.DC])
    (2 min) The unprecedented demand for collaborative machine learning in a privacy-preserving manner gives rise to a novel machine learning paradigm called federated learning (FL). Given a sufficient level of privacy guarantees, the practicality of an FL system mainly depends on its time-to-accuracy performance during the training process. Despite bearing some resemblance to traditional distributed training, FL has four distinct challenges that complicate the optimization towards shorter time-to-accuracy: information deficiency, coupling between contrasting factors, client heterogeneity, and a huge configuration space. Motivated by the need to inspire related research, in this paper we survey highly relevant attempts in the FL literature and organize them by the related training phases in the standard workflow: selection, configuration, and reporting. We also review exploratory work, including measurement studies and benchmarking tools, that supports FL developers. Although a few survey articles on FL already exist, our work differs from them in terms of focus, classification, and implications.
    Accounting for Variations in Speech Emotion Recognition with Nonparametric Hierarchical Neural Network. (arXiv:2109.04316v1 [cs.LG])
    (2 min) In recent years, deep-learning-based speech emotion recognition models have outperformed classical machine learning models. Previously, neural network designs, such as Multitask Learning, have accounted for variations in emotional expressions due to demographic and contextual factors. However, existing models face a few constraints: 1) they rely on a clear definition of domains (e.g. gender, noise condition, etc.) and the availability of domain labels; 2) they often attempt to learn domain-invariant features while emotion expressions can be domain-specific. In the present study, we propose the Nonparametric Hierarchical Neural Network (NHNN), a lightweight hierarchical neural network model based on Bayesian nonparametric clustering. In comparison to Multitask Learning approaches, the proposed model does not require domain/task labels. In our experiments, the NHNN models generally outperform the models with similar levels of complexity and state-of-the-art models in within-corpus and cross-corpus tests. Through clustering analysis, we show that the NHNN models are able to learn group-specific features and bridge the performance gap between groups.
    An Experimental Study of Class Imbalance in Federated Learning. (arXiv:2109.04094v1 [cs.LG])
    (2 min) Federated learning is a distributed machine learning paradigm that trains a global model for prediction based on a number of local models at clients while preserving local data privacy. Class imbalance is believed to be one of the factors that degrade global model performance. However, there has been very little research on whether and how class imbalance affects global performance. Class imbalance in federated learning is much more complex than in traditional non-distributed machine learning, due to the different class imbalance situations at local clients, and therefore needs to be re-defined for distributed learning environments. In this paper, we first propose two new metrics to characterize class imbalance -- the global class imbalance degree (MID) and the local difference of class imbalance among clients (WCS). We then conduct extensive experiments to analyze the impact of class imbalance on global performance in various scenarios based on our definitions. Our results show that a higher MID and a larger WCS degrade the performance of the global model more. Besides, WCS is shown to slow down the convergence of the global model by misdirecting the optimization.
    Iterated Vector Fields and Conservatism, with Applications to Federated Learning. (arXiv:2109.03973v1 [math.OC])
    (2 min) We study iterated vector fields and investigate whether they are conservative, in the sense that they are the gradient of some scalar-valued function. We analyze the conservatism of various iterated vector fields, including gradient vector fields associated to loss functions of generalized linear models. We relate this study to optimization and derive novel convergence results for federated learning algorithms. In particular, we show that for certain classes of functions (including non-convex functions), federated averaging is equivalent to gradient descent on a surrogate loss function. Finally, we discuss a variety of open questions spanning topics in geometry, dynamical systems, and optimization.
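    In the notation suggested by the abstract, the central object can be written compactly (a paraphrase, not the paper's exact statement; the link to K local gradient steps is one natural reading of the federated-averaging connection):

        % k-fold iterate of a vector field F : R^d -> R^d
        F^{(k)}(x) \;=\; \underbrace{(F \circ \cdots \circ F)}_{k \text{ times}}(x),
        \qquad
        F^{(k)} \text{ is conservative} \iff \exists\, f:\mathbb{R}^d \to \mathbb{R}
        \ \text{ such that } F^{(k)} = \nabla f .
        % One local round of federated averaging with K gradient steps of size eta
        % applies, on each client, the K-fold iterate of the map
        G(x) = x - \eta \nabla \ell(x) .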
    A Systematic Approach to Group Fairness in Automated Decision Making. (arXiv:2109.04230v1 [cs.CY])
    (2 min) While the field of algorithmic fairness has brought forth many ways to measure and improve the fairness of machine learning models, these findings are still not widely used in practice. We suspect that one reason for this is that the field of algorithmic fairness has produced a large number of definitions of fairness, which are difficult to navigate. The goal of this paper is to provide data scientists with an accessible introduction to group fairness metrics and to give some insight into the philosophical reasoning for caring about these metrics. We do this by considering in which sense socio-demographic groups are compared when making a statement about fairness.
    Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction. (arXiv:2109.03821v1 [cs.IR])
    (2 min) Compliments and concerns in reviews are valuable for understanding users' shopping interests and their opinions with respect to specific aspects of certain items. Existing review-based recommenders favor large and complex language encoders that can only learn latent and uninterpretable text representations. They lack explicit user attention and item property modeling, which however could provide valuable information beyond the ability to recommend items. Therefore, we propose a tightly coupled two-stage approach, including an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). Unsupervised ASPE mines Aspect-Sentiment pairs (AS-pairs) and APRE predicts ratings using AS-pairs as concrete aspect-level evidence. Extensive experiments on seven real-world Amazon Review Datasets demonstrate that ASPE can effectively extract AS-pairs which enable APRE to deliver superior accuracy over the leading baselines.
    Memory semantization through perturbed and adversarial dreaming. (arXiv:2109.04261v1 [q-bio.NC])
    (2 min) Classical theories of memory consolidation emphasize the importance of replay in extracting semantic information from episodic memories. However, the characteristic creative nature of dreams suggests that memory semantization may go beyond merely replaying previous experiences. We propose that rapid-eye-movement (REM) dreaming is essential for efficient memory semantization by randomly combining episodic memories to create new, virtual sensory experiences. We support this hypothesis by implementing a cortical architecture with hierarchically organized feedforward and feedback pathways, inspired by generative adversarial networks (GANs). Learning in our model is organized across three different global brain states mimicking wakefulness, non-REM (NREM) and REM sleep, optimizing different, but complementary objective functions. We train the model in an unsupervised fashion on standard datasets of natural images and evaluate the quality of the learned representations. Our results suggest that adversarial dreaming during REM sleep is essential for extracting memory contents, while perturbed dreaming during NREM sleep improves robustness of the latent representation to noisy sensory inputs. The model provides a new computational perspective on sleep states, memory replay and dreams and suggests a cortical implementation of GANs.
    Social Media Monitoring for IoT Cyber-Threats. (arXiv:2109.04306v1 [cs.CR])
    (2 min) The rapid development of IoT applications and their use in various fields of everyday life has resulted in an escalated number of different possible cyber-threats, and has consequently raised the need of securing IoT devices. Collecting Cyber-Threat Intelligence (e.g., zero-day vulnerabilities or trending exploits) from various online sources and utilizing it to proactively secure IoT systems or prepare mitigation scenarios has proven to be a promising direction. In this work, we focus on social media monitoring and investigate real-time Cyber-Threat Intelligence detection from the Twitter stream. Initially, we compare and extensively evaluate six different machine-learning based classification alternatives trained with vulnerability descriptions and tested with real-world data from the Twitter stream to identify the best-fitting solution. Subsequently, based on our findings, we propose a novel social media monitoring system tailored to the IoT domain; the system allows users to identify recent/trending vulnerabilities and exploits on IoT devices. Finally, to aid research on the field and support the reproducibility of our results we publicly release all annotated datasets created during this process.
    ECQ$^{\text{x}}$: Explainability-Driven Quantization for Low-Bit and Sparse DNNs. (arXiv:2109.04236v1 [cs.LG])
    (2 min) The remarkable success of deep neural networks (DNNs) in various applications is accompanied by a significant increase in network parameters and arithmetic operations. Such increases in memory and computational demands make deep learning prohibitive for resource-constrained hardware platforms such as mobile devices. Recent efforts aim to reduce these overheads while preserving model performance as much as possible, and include parameter reduction, parameter quantization, and lossless compression techniques. In this chapter, we develop and describe a novel quantization paradigm for DNNs that leverages concepts from explainable AI (XAI) and information theory: instead of assigning weight values based only on their distances to the quantization clusters, the assignment function additionally considers weight relevances obtained from Layer-wise Relevance Propagation (LRP) and the information content of the clusters (entropy optimization). The ultimate goal is to preserve the most relevant weights in the quantization clusters of highest information content. Experimental results show that this novel Entropy-Constrained and XAI-adjusted Quantization (ECQ$^{\text{x}}$) method generates ultra-low-precision (2-5 bit) and simultaneously sparse neural networks while maintaining or even improving model performance. Due to the reduced parameter precision and the high number of zero elements, the resulting networks are highly compressible in terms of file size, up to $103\times$ compared to the full-precision unquantized DNN model. Our approach was evaluated on different types of models and datasets (including Google Speech Commands and CIFAR-10) and compared with previous work.
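    A highly simplified sketch of the assignment idea (not the authors' exact formulation): each weight is assigned to a quantization centroid by trading off squared distance against a relevance score and the information content of the clusters. The relevance values and weighting constants below are placeholders.

        import numpy as np

        def xai_adjusted_assignment(weights, relevance, centroids, alpha=1.0, beta=0.1):
            """Toy stand-in for an ECQ^x-style assignment: distance is scaled by
            relevance, and clusters with higher information content are favoured."""
            dist = (weights[:, None] - centroids[None, :]) ** 2           # (n, k)
            # Distance-only assignment first, to estimate cluster probabilities.
            counts = np.bincount(dist.argmin(axis=1), minlength=len(centroids))
            info = -np.log(counts / len(weights) + 1e-12)                 # rarer cluster => more info
            cost = (1.0 + alpha * relevance[:, None]) * dist - beta * info[None, :]
            return cost.argmin(axis=1)

        w = np.random.randn(1000) * 0.05
        rel = np.abs(np.random.randn(1000))          # placeholder for LRP relevances
        centroids = np.array([-0.1, 0.0, 0.1])       # coarse grid with a zero cluster
        assignment = xai_adjusted_assignment(w, rel, centroids)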
    Where Did You Learn That From? Surprising Effectiveness of Membership Inference Attacks Against Temporally Correlated Data in Deep Reinforcement Learning. (arXiv:2109.03975v1 [cs.LG])
    (2 min) While significant research advances have been made in the field of deep reinforcement learning, a major challenge to its widespread industrial adoption that has recently surfaced but remains little explored is its potential vulnerability to privacy breaches. In particular, there have been no concrete adversarial attack strategies in the literature tailored to studying the vulnerability of deep reinforcement learning algorithms to membership inference attacks. To address this gap, we propose an adversarial attack framework designed for testing this vulnerability. More specifically, we design a series of experiments to investigate the impact of temporal correlation, which naturally exists in reinforcement learning training data, on the probability of information leakage. Furthermore, we study the differences in the performance of \emph{collective} and \emph{individual} membership attacks against deep reinforcement learning algorithms. Experimental results show that the proposed adversarial attack framework is surprisingly effective at inferring the data used during deep reinforcement learning training, with an accuracy exceeding $84\%$ in individual and $97\%$ in collective mode on two different control tasks in OpenAI Gym, which raises serious privacy concerns in the deployment of models resulting from deep reinforcement learning. Moreover, we show that the learning state of a reinforcement learning algorithm significantly influences the level of the privacy breach.
    DAE-PINN: A Physics-Informed Neural Network Model for Simulating Differential-Algebraic Equations with Application to Power Networks. (arXiv:2109.04304v1 [cs.LG])
    (2 min) Deep learning-based surrogate modeling is becoming a promising approach for learning and simulating dynamical systems. Deep-learning methods, however, find it very challenging to learn stiff dynamics. In this paper, we develop DAE-PINN, the first effective deep-learning framework for learning and simulating the solution trajectories of nonlinear differential-algebraic equations (DAEs), which present a form of infinite stiffness and describe, for example, the dynamics of power networks. DAE-PINN bases its effectiveness on the synergy between implicit Runge-Kutta time-stepping schemes (designed specifically for solving DAEs) and physics-informed neural networks (PINNs), i.e., deep neural networks trained to satisfy the dynamics of the underlying problem. Furthermore, our framework (i) enforces the neural network to satisfy the DAEs as (approximate) hard constraints using a penalty-based method and (ii) enables simulating DAEs over long time horizons. We showcase the effectiveness and accuracy of DAE-PINN by learning and simulating the solution trajectories of a three-bus power network.
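    For a semi-explicit DAE, the penalty-based training objective described above can be summarized schematically as follows (the paper additionally couples the network with implicit Runge-Kutta stages, omitted here):

        % Semi-explicit DAE: differential states x, algebraic states y
        \dot{x}(t) = f\big(x(t), y(t), t\big), \qquad 0 = g\big(x(t), y(t), t\big)
        % Penalty-based PINN loss over collocation points t_i, network parameters theta
        \mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}
            \big\| \dot{x}_\theta(t_i) - f\big(x_\theta(t_i), y_\theta(t_i), t_i\big) \big\|^2
            \;+\; \lambda\,\frac{1}{N}\sum_{i=1}^{N}
            \big\| g\big(x_\theta(t_i), y_\theta(t_i), t_i\big) \big\|^2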
    Constants of Motion: The Antidote to Chaos in Optimization and Game Dynamics. (arXiv:2109.03974v1 [math.OC])
    (2 min) Several recent works in online optimization and game dynamics have established strong negative complexity results, including the formal emergence of instability and chaos even in small settings, e.g., $2\times 2$ games. These results motivate the following question: which methodological tools can guarantee the regularity of such dynamics, and how can we apply them in standard settings of interest such as discrete-time first-order optimization dynamics? We show that proving the existence of invariant functions, i.e., constants of motion, is a fundamental contribution in this direction and establish a plethora of such positive results (e.g., for gradient descent, multiplicative weights update, alternating gradient descent and manifold gradient descent) both in optimization and in game settings. At a technical level, for some conservation laws we provide an explicit and concise closed form, whereas for others we present non-constructive proofs using tools from dynamical systems.
    Provably efficient RL with Rich Observations via Latent State Decoding. (arXiv:1901.09018v3 [cs.LG] UPDATED)
    (2 min) We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to estimate a mapping from the observations to latent states inductively through a sequence of regression and clustering steps -- where previously decoded latent states provide labels for later regression problems -- and use it to construct good exploration policies. We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems. Our method exponentially improves over $Q$-learning with na\"ive exploration, even when $Q$-learning has cheating access to latent states.
    Toward a Perspectivist Turn in Ground Truthing for Predictive Computing. (arXiv:2109.04270v1 [cs.LG])
    (2 min) Most Artificial Intelligence applications are based on supervised machine learning (ML), which ultimately rests on manually annotated data. The annotation process is often performed by majority vote, and this has been shown to be problematic, as highlighted by recent studies on the evaluation of ML models. In this article we describe and advocate a different paradigm, which we call data perspectivism. It moves away from traditional gold-standard datasets towards methods that integrate the opinions and perspectives of the human subjects involved in the knowledge-representation step of ML processes. Drawing on the previous work that inspired our proposal, we describe its potential not only for the more subjective tasks (e.g., those related to human language) but also for tasks commonly understood as objective (e.g., medical decision making), and present the main advantages of adopting a perspectivist stance in ML, possible disadvantages, and various ways in which such a stance can be implemented in practice. Finally, we share a set of recommendations and outline a research agenda to advance the perspectivist stance in ML.
    Bayesian Optimisation for Sequential Experimental Design with Applications in Additive Manufacturing. (arXiv:2107.12809v2 [cs.LG] UPDATED)
    (2 min) Bayesian optimization (BO) is an approach to globally optimizing black-box objective functions that are expensive to evaluate. BO-powered experimental design has found wide application in materials science, chemistry, experimental physics, drug development, etc. This work aims to bring attention to the benefits of applying BO in designing experiments and to provide a BO manual, covering both methodology and software, for the convenience of anyone who wants to apply or learn BO. In particular, we briefly explain the BO technique, review all the applications of BO in additive manufacturing, compare and exemplify the features of different open BO libraries, unlock new potential applications of BO to other types of data (e.g., preferential output). This article is aimed at readers with some understanding of Bayesian methods, but not necessarily with knowledge of additive manufacturing; the software performance overview and implementation instructions are instrumental for any experimental-design practitioner. Moreover, our review in the field of additive manufacturing highlights the current knowledge and technological trends of BO.
    Active Offline Policy Selection. (arXiv:2106.10251v2 [cs.LG] UPDATED)
    (2 min) This paper addresses the problem of policy selection in domains with abundant logged data but a very restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between evaluation by OPE and full online evaluation in the real environment. At the same time, a large amount of online interaction is often infeasible in practice. To overcome this problem, we introduce \emph{active offline policy selection} -- a novel sequential decision approach that combines logged data with online interaction to identify the best policy. This approach uses OPE estimates to warm-start the online evaluation. Then, in order to utilize the limited environment interactions wisely, it relies on a Bayesian optimization method, with a kernel function that represents policy similarity, to decide which policy to evaluate next. We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.
    Learning Opinion Summarizers by Selecting Informative Reviews. (arXiv:2109.04325v1 [cs.CL])
    (2 min) Opinion summarization has been traditionally approached with unsupervised, weakly-supervised and few-shot learning techniques. In this work, we collect a large dataset of summaries paired with user reviews for over 31,000 products, enabling supervised training. However, the number of reviews per product is large (320 on average), making summarization - and especially training a summarizer - impractical. Moreover, the content of many reviews is not reflected in the human-written summaries, and, thus, the summarizer trained on random review subsets hallucinates. In order to deal with both of these challenges, we formulate the task as jointly learning to select informative subsets of reviews and summarizing the opinions expressed in these subsets. The choice of the review subset is treated as a latent variable, predicted by a small and simple selector. The subset is then fed into a more powerful summarizer. For joint training, we use amortized variational inference and policy gradient methods. Our experiments demonstrate the importance of selecting informative reviews resulting in improved quality of summaries and reduced hallucinations.
    A Channel Coding Benchmark for Meta-Learning. (arXiv:2107.07579v2 [cs.LG] UPDATED)
    (2 min) Meta-learning provides a popular and effective family of methods for data-efficient learning of new tasks. However, several important issues in meta-learning have proven hard to study thus far. For example, performance degrades in real-world settings where meta-learners must learn from a wide and potentially multi-modal distribution of training tasks; and when distribution shift exists between meta-train and meta-test task distributions. These issues are typically hard to study since the shape of task distributions, and shift between them are not straightforward to measure or control in standard benchmarks. We propose the channel coding problem as a benchmark for meta-learning. Channel coding is an important practical application where task distributions naturally arise, and fast adaptation to new tasks is practically valuable. We use our MetaCC benchmark to study several aspects of meta-learning, including the impact of task distribution breadth and shift, which can be controlled in the coding problem. Going forward, MetaCC provides a tool for the community to study the capabilities and limitations of meta-learning, and to drive research on practically robust and effective meta-learners.
    An Interpretable Web-based Glioblastoma Multiforme Prognosis Prediction Tool using Random Forest Model. (arXiv:2108.13039v2 [cs.LG] UPDATED)
    (3 min) We propose predictive models that estimate GBM patients' health status one year after treatment (classification task) and predict the long-term prognosis of GBM patients at an individual level (survival task). We used a total of 467 GBM patients' clinical profiles, consisting of 13 features and two follow-up dates. Against the random forest classifier (RFC) and random survival forest (RSF) models, we introduced the generalized linear model (GLM) and support vector machine (SVM), and the Cox proportional hazards model (COX) and accelerated failure time model (AFT), respectively, as baselines. After preprocessing and fixing a stratified 5-fold data split, we generated the best-performing model of each type using a recursive feature elimination process. In total, 10, 4, and 13 features were selected for the best-performing one-year survival RFC model, one-year progression RFC model, and RSF model, respectively. In the classification task, the AUROC of the best-performing RFC was 0.6990 (one-year survival status) and 0.7076 (one-year progression status), while that of the second-best baseline models (GLM in both cases) was 0.6691 and 0.6997, respectively. In the survival task, the best-performing RSF model achieved the highest C-index of 0.7157 and the lowest IBS of 0.1038, while the second-best baseline models achieved 0.6556 and 0.1139, respectively. A simplified linear correlation between each feature and the prognosis of GBM patients (extracted from LIME and virtual patient group analysis) was consistent with established medical knowledge. Our machine learning models suggest that the top three prognostic factors for GBM patient survival are the MGMT gene promoter, the extent of resection, and age. To the best of our knowledge, this is the first study to introduce interpretable GBM prognosis prediction models that are consistent with medical knowledge.
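    A minimal sketch of the classification pipeline described (random forest plus recursive feature elimination, evaluated with stratified 5-fold AUROC), using synthetic data in place of the study's clinical features.

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_selection import RFE
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from sklearn.pipeline import make_pipeline

        # Stand-in for the 467-patient, 13-feature clinical profile.
        X, y = make_classification(n_samples=467, n_features=13, n_informative=6,
                                   random_state=0)

        selector = RFE(RandomForestClassifier(n_estimators=300, random_state=0),
                       n_features_to_select=10)        # keep the 10 most useful features
        clf = RandomForestClassifier(n_estimators=300, random_state=0)
        pipe = make_pipeline(selector, clf)

        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        auroc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
        print(auroc.mean())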
    Improving Multimodal fusion via Mutual Dependency Maximisation. (arXiv:2109.00922v2 [cs.LG] UPDATED)
    (2 min) Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics. Acknowledging that humans communicate through a variety of channels (i.e., visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, a substantial effort has been made on developing complex architectures that allow the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as $L_1$ or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to $4.3$ on accuracy) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: \texttt{CMU-MOSI} and \texttt{CMU-MOSEI}. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our method is a statistical network which can be used to interpret the high-dimensional representations learnt by the model.
    Model-predictive control and reinforcement learning in multi-energy system case studies. (arXiv:2104.09785v2 [eess.SY] UPDATED)
    (2 min) Model-predictive control (MPC) offers an optimal control technique to establish and ensure that the total operation cost of multi-energy systems remains at a minimum while fulfilling all system constraints. However, this method presumes an adequate model of the underlying system dynamics, which is prone to modelling errors and is not necessarily adaptive. This has an associated initial and ongoing project-specific engineering cost. In this paper, we present an on- and off-policy multi-objective reinforcement learning (RL) approach that does not assume a model a priori, benchmarking this against a linear MPC (LMPC - chosen to reflect current practice, though non-linear MPC performs better) - both derived from the general optimal control problem, highlighting their differences and similarities. In a simple multi-energy system (MES) configuration case study, we show that a twin delayed deep deterministic policy gradient (TD3) RL agent offers potential to match and outperform the perfect-foresight LMPC benchmark (101.5%), whereas the realistic LMPC, i.e., with imperfect predictions, only achieves 98%. In a more complex MES configuration, the RL agent's performance is generally lower (94.6%), yet still better than that of the realistic LMPC (88.9%). In both case studies, the RL agents outperformed the realistic LMPC after a training period of 2 years using quarterly interactions with the environment. We conclude that reinforcement learning is a viable optimal control technique for multi-energy systems given adequate constraint handling and pre-training, to avoid unsafe interactions and long training periods, as proposed in fundamental future work.
    Adaptive importance sampling for seismic fragility curve estimation. (arXiv:2109.04323v1 [stat.ML])
    (2 min) As part of Probabilistic Risk Assessment studies, it is necessary to study the fragility of mechanical and civil engineering structures when subjected to seismic loads. This risk can be measured with fragility curves, which express the probability of failure of the structure conditional on a seismic intensity measure. The estimation of fragility curves relies on time-consuming numerical simulations, so careful experimental design is required to gain the maximum information on the structure's fragility with a limited number of code evaluations. We propose and implement an active learning methodology based on adaptive importance sampling in order to reduce the variance of the training loss. The efficiency of the proposed method in terms of bias, standard deviation and prediction-interval coverage is characterized theoretically and numerically.
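    A toy sketch of the general idea: a lognormal fragility curve fitted by importance-weighted maximum likelihood, with intensity measures drawn from a proposal biased toward failures. The distributions and the "structure" below are synthetic, and the paper's adaptive/active-learning scheme for choosing the proposal is omitted.

        import numpy as np
        from scipy.optimize import minimize
        from scipy.stats import lognorm, norm

        rng = np.random.default_rng(0)

        p = lognorm(s=0.6, scale=0.3)     # nominal intensity-measure distribution
        q = lognorm(s=0.6, scale=0.8)     # proposal biased toward larger intensities

        a = q.rvs(200, random_state=rng)                                    # sampled intensities
        fail = (rng.random(200) < norm.cdf(np.log(a / 0.7) / 0.4)).astype(float)  # toy structure
        w = p.pdf(a) / q.pdf(a)                                             # importance weights

        def neg_log_lik(theta):
            alpha, beta = np.exp(theta)                     # median and log-std of the curve
            pf = norm.cdf(np.log(a / alpha) / beta).clip(1e-9, 1 - 1e-9)
            return -np.sum(w * (fail * np.log(pf) + (1 - fail) * np.log(1 - pf)))

        res = minimize(neg_log_lik, x0=np.log([0.5, 0.5]))
        print(np.exp(res.x))   # estimated fragility parameters (toy ground truth: 0.7, 0.4)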
    COLUMBUS: Automated Discovery of New Multi-Level Features for Domain Generalization via Knowledge Corruption. (arXiv:2109.04320v1 [cs.LG])
    (2 min) Machine learning models that can generalize to unseen domains are essential when applied in real-world scenarios involving strong domain shifts. We address the challenging domain generalization (DG) problem, where a model trained on a set of source domains is expected to generalize well in unseen domains without any exposure to their data. The main challenge of DG is that the features learned from the source domains are not necessarily present in the unseen target domains, leading to performance deterioration. We assume that learning a richer set of features is crucial to improve the transfer to a wider set of unknown domains. For this reason, we propose COLUMBUS, a method that enforces new feature discovery via a targeted corruption of the most relevant input and multi-level representations of the data. We conduct an extensive empirical evaluation to demonstrate the effectiveness of the proposed approach which achieves new state-of-the-art results by outperforming 18 DG algorithms on multiple DG benchmark datasets in the DomainBed framework.
    Fea2Fea: Exploring Structural Feature Correlations via Graph Neural Networks. (arXiv:2106.13061v4 [cs.LG] UPDATED)
    (2 min) Structural features are important features of a geometric graph. Although there are some covariance-based correlation analyses of features, there is no relevant research on structural feature correlation analysis with graph neural networks. In this paper, we introduce graph feature-to-feature (Fea2Fea) prediction pipelines, based on graph neural networks, in a low-dimensional space to explore preliminary results on structural feature correlation. The results show that there is high correlation between some of the structural features. An irredundant combination of structural features with initial node features, filtered by a graph neural network, improves classification accuracy in some graph-based tasks. We compare different methods of concatenating feature embeddings and show that the simplest is the best. Finally, we generalize to synthetic geometric graphs and confirm the results on the prediction difficulty between structural features.
    NEAT: Neural Attention Fields for End-to-End Autonomous Driving. (arXiv:2109.04456v1 [cs.CV])
    (2 min) Efficient reasoning about the semantic, spatial, and temporal structure of a scene is a crucial prerequisite for autonomous driving. We present NEural ATtention fields (NEAT), a novel representation that enables such reasoning for end-to-end imitation learning models. NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation. This allows our model to selectively attend to relevant regions in the input while ignoring information irrelevant to the driving task, effectively associating the images with the BEV representation. In a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert used to generate its training data. Furthermore, visualizing the attention maps for models with NEAT intermediate representations provides improved interpretability.
    Physics-Guided Deep Learning for Dynamical Systems: A Survey. (arXiv:2107.01272v3 [cs.LG] UPDATED)
    (2 min) Modeling complex physical dynamics is a fundamental task in science and engineering. Traditional physics-based models are sample efficient and interpretable but often rely on rigid assumptions. Furthermore, direct numerical approximation is usually computationally intensive, requiring significant computational resources and expertise. While deep learning (DL) provides novel alternatives for efficiently recognizing complex patterns and emulating nonlinear dynamics, its predictions do not necessarily obey the governing laws of physical systems, nor do they generalize well across different systems. Thus, the study of physics-guided DL has emerged and made great progress. Physics-guided DL aims to take the best from both physics-based modeling and state-of-the-art DL models to better solve scientific problems. In this paper, we provide a structured overview of existing methodologies for integrating prior physical knowledge or physics-based modeling into DL, with a special emphasis on learning dynamical systems. We also discuss the fundamental challenges and emerging opportunities in the area.
    SONIC: A Sparse Neural Network Inference Accelerator with Silicon Photonics for Energy-Efficient Deep Learning. (arXiv:2109.04459v1 [cs.LG])
    (2 min) Sparse neural networks can greatly facilitate the deployment of neural networks on resource-constrained platforms as they offer compact model sizes while retaining inference accuracy. Because of the sparsity in parameter matrices, sparse neural networks can, in principle, be exploited in accelerator architectures for improved energy-efficiency and latency. However, to realize these improvements in practice, there is a need to explore sparsity-aware hardware-software co-design. In this paper, we propose a novel silicon photonics-based sparse neural network inference accelerator called SONIC. Our experimental analysis shows that SONIC can achieve up to 5.8x better performance-per-watt and 8.4x lower energy-per-bit than state-of-the-art sparse electronic neural network accelerators; and up to 13.8x better performance-per-watt and 27.6x lower energy-per-bit than the best known photonic neural network accelerators.
    QueryNet: An Attack Framework with Surrogates Carrying Multiple Identities. (arXiv:2105.15010v2 [cs.LG] UPDATED)
    (2 min) Deep Neural Networks (DNNs) are acknowledged as vulnerable to adversarial attacks, while existing black-box attacks require extensive queries on the victim DNN to achieve high success rates. For query efficiency, surrogate models of the victim are adopted as transferable attackers in consideration of their Gradient Similarity (GS), i.e., the surrogates' attack gradients are similar to the victim's to some extent. However, it is generally neglected to exploit their similarity on outputs, namely the Prediction Similarity (PS), to filter out inefficient queries. To jointly utilize and also optimize surrogates' GS and PS, we develop QueryNet, an efficient attack network that can significantly reduce queries. QueryNet crafts several transferable Adversarial Examples (AEs) by surrogates, and then decides, also by surrogates, on the most promising AE, which is then sent to query the victim. That is to say, in QueryNet, surrogates are not only exploited as transferable attackers, but also as transferability evaluators for AEs. The AEs are generated using surrogates' GS and evaluated based on their PS, and therefore the query results can be back-propagated to optimize surrogates' parameters and also their architectures, enhancing both the GS and the PS. QueryNet is significantly query-efficient, i.e., it reduces queries on average by about an order of magnitude compared to recent SOTA methods, according to our comprehensive and real-world experiments: 11 victims (including 2 commercial models) on MNIST/CIFAR10/ImageNet, allowing only 8-bit image queries, and no access to the victim's training data.
    Leveraging Code Clones and Natural Language Processing for Log Statement Prediction. (arXiv:2109.03859v1 [cs.SE])
    (2 min) Software developers embed logging statements inside the source code as an imperative duty in modern software development as log files are necessary for tracking down runtime system issues and troubleshooting system management tasks. Prior research has emphasized the importance of logging statements in the operation and debugging of software systems. However, the current logging process is mostly manual and ad hoc, and thus, proper placement and content of logging statements remain as challenges. To overcome these challenges, methods that aim to automate log placement and log content, i.e., 'where, what, and how to log', are of high interest. Thus, we propose to accomplish the goal of this research, that is "to predict the log statements by utilizing source code clones and natural language processing (NLP)", as these approaches provide additional context and advantage for log prediction. We pursue the following four research objectives: (RO1) investigate whether source code clones can be leveraged for log statement location prediction, (RO2) propose a clone-based approach for log statement prediction, (RO3) predict log statement's description with code-clone and NLP models, and (RO4) examine approaches to automatically predict additional details of the log statement, such as its verbosity level and variables. For this purpose, we perform an experimental analysis on seven open-source java projects, extract their method-level code clones, investigate their attributes, and utilize them for log location and description prediction. Our work demonstrates the effectiveness of log-aware clone detection for automated log location and description prediction and outperforms the prior work.
    Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph. (arXiv:2109.04400v1 [cs.CL])
    (2 min) In cross-lingual text classification, it is required that task-specific training data in a high-resource source language be available, where the task is identical to that of the low-resource target language. However, collecting such training data can be infeasible because of labeling cost, task characteristics, and privacy concerns. This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries. First, we construct a dictionary-based heterogeneous graph (DHG) from bilingual dictionaries. This opens the possibility of using graph neural networks for cross-lingual transfer. The remaining challenge is the heterogeneity of the DHG, since multiple languages are considered. To address this challenge, we propose a dictionary-based heterogeneous graph neural network (DHGNet) that effectively handles the heterogeneity of the DHG with two-step aggregations: word-level and language-level aggregation. Experimental results demonstrate that our method outperforms pretrained models even though it does not have access to large corpora. Furthermore, it performs well even when dictionaries contain many incorrect translations. This robustness allows the use of a wider range of dictionaries, such as automatically constructed and crowdsourced dictionaries, which are convenient for real-world applications.
    Cartography Active Learning. (arXiv:2109.04282v1 [cs.CL])
    (2 min) We propose Cartography Active Learning (CAL), a novel Active Learning (AL) algorithm that exploits the behavior of the model on individual instances during training as a proxy to find the most informative instances for labeling. CAL is inspired by data maps, which were recently proposed to derive insights into dataset quality (Swayamdipta et al., 2020). We compare our method on popular text classification tasks to commonly used AL strategies, which instead rely on post-training behavior. We demonstrate that CAL is competitive to other common AL methods, showing that training dynamics derived from small seed data can be successfully used for AL. We provide insights into our new AL method by analyzing batch-level statistics utilizing the data maps. Our results further show that CAL results in a more data-efficient learning strategy, achieving comparable or better results with considerably less training data.
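    The training-dynamics statistics behind data maps can be computed in a few lines; the sketch below assumes the model's probability of the gold label has been logged for every training instance at every epoch (the array and the selection heuristic are illustrative).

        import numpy as np

        # gold_probs[e, i] = probability assigned to the gold label of instance i at epoch e.
        gold_probs = np.random.rand(10, 500)            # stand-in for logged training dynamics

        confidence = gold_probs.mean(axis=0)            # mean gold-label probability over epochs
        variability = gold_probs.std(axis=0)            # spread of that probability over epochs
        correctness = (gold_probs > 0.5).mean(axis=0)   # fraction of epochs predicted correctly (binary case)

        # Low-confidence / high-variability ("ambiguous") instances are natural
        # candidates to query next in an active-learning loop.
        query_order = np.argsort(confidence - variability)
        print(query_order[:10])                         # ten most ambiguous instances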
    Gradual (In)Compatibility of Fairness Criteria. (arXiv:2109.04399v1 [cs.LG])
    (2 min) Impossibility results show that important fairness measures (independence, separation, sufficiency) cannot be satisfied at the same time under reasonable assumptions. This paper explores whether we can satisfy and/or improve these fairness measures simultaneously to a certain degree. We introduce information-theoretic formulations of the fairness measures and define degrees of fairness based on these formulations. The information-theoretic formulations suggest unexplored theoretical relations between the three fairness measures. In the experimental part, we use the information-theoretic expressions as regularizers to obtain fairness-regularized predictors for three standard datasets. Our experiments show that a) fairness regularization directly increases fairness measures, in line with existing work, and b) some fairness regularizations indirectly increase other fairness measures, as suggested by our theoretical findings. This establishes that it is possible to increase the degree to which some fairness measures are satisfied at the same time -- some fairness measures are gradually compatible.
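    For reference, the three criteria mentioned are usually stated as (conditional) independence relations between the sensitive attribute $A$, the target $Y$, and the prediction $\hat{Y}$:

        \text{Independence:}\quad \hat{Y} \perp A
        \qquad
        \text{Separation:}\quad \hat{Y} \perp A \mid Y
        \qquad
        \text{Sufficiency:}\quad Y \perp A \mid \hat{Y}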
    Learning Intrusion Prevention Policies through Optimal Stopping. (arXiv:2106.07160v7 [cs.AI] UPDATED)
    (2 min) We study automated intrusion prevention using reinforcement learning. In a novel approach, we formulate the problem of intrusion prevention as an optimal stopping problem. This formulation gives us insight into the structure of the optimal policies, which turn out to be threshold based. Since computing the optimal defender policy using dynamic programming is not feasible for practical cases, we approximate the optimal policy through reinforcement learning in a simulation environment. To define the dynamics of the simulation, we emulate the target infrastructure and collect measurements. Our evaluations show that the learned policies are close to optimal and that they can indeed be expressed using thresholds.
    HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints. (arXiv:2109.04443v1 [cs.CL])
    (2 min) Back-translation (BT) of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT), especially for low-resource language pairs. To improve effectiveness of the available BT data, we introduce HintedBT -- a family of techniques which provides hints (through tags) to the encoder and decoder. First, we propose a novel method of using both high and low quality BT data by providing hints (as source tags on the encoder) to the model about the quality of each source-target pair. We don't filter out low quality data but instead show that these hints enable the model to learn effectively from noisy data. Second, we address the problem of predicting whether a source token needs to be translated or transliterated to the target language, which is common in cross-script translation tasks (i.e., where source and target do not share the written script). For such cases, we propose training the model with additional hints (as target tags on the decoder) that provide information about the operation required on the source (translation or both translation and transliteration). We conduct experiments and detailed analyses on standard WMT benchmarks for three cross-script low/medium-resource language pairs: {Hindi,Gujarati,Tamil}-to-English. Our methods compare favorably with five strong and well established baselines. We show that using these hints, both separately and together, significantly improves translation quality and leads to state-of-the-art performance in all three language pairs in corresponding bilingual settings.
    Generative Adversarial Network: Some Analytical Perspectives. (arXiv:2104.12210v2 [q-fin.MF] UPDATED)
    (2 min) Ever since their debut, generative adversarial networks (GANs) have attracted a tremendous amount of attention. Over the past years, different variations of GAN models have been developed and tailored to different applications in practice. Meanwhile, some issues regarding the performance and training of GANs have been noticed and investigated from various theoretical perspectives. This subchapter starts with an introduction to GANs from an analytical perspective, then moves on to the training of GANs via SDE approximations and finally discusses some applications of GANs in computing high-dimensional MFGs as well as tackling mathematical finance problems.
    ErfAct: Non-monotonic smooth trainable Activation Functions. (arXiv:2109.04386v1 [cs.NE])
    (2 min) An activation function is a crucial component of a neural network that introduces non-linearity into the network. The state-of-the-art performance of a neural network depends on the right choice of activation function. We propose two novel non-monotonic, smooth, trainable activation functions, called ErfAct-1 and ErfAct-2. Experiments suggest that the proposed functions improve network performance significantly compared to widely used activations like ReLU, Swish, and Mish. Replacing ReLU with ErfAct-1 and ErfAct-2, we obtain 5.21% and 5.04% improvements in top-1 accuracy with the PreactResNet-34 network on the CIFAR100 dataset, 2.58% and 2.76% improvements in top-1 accuracy with the PreactResNet-34 network on the CIFAR10 dataset, and 1.0% and 1.0% improvements in mean average precision (mAP) with the SSD300 model on the Pascal VOC dataset.
    Robust Optimal Classification Trees Against Adversarial Examples. (arXiv:2109.03857v1 [cs.LG])
    (2 min) Decision trees are a popular choice of explainable model, but just like neural networks, they suffer from adversarial examples. Existing algorithms for fitting decision trees robust against adversarial examples are greedy heuristics and lack approximation guarantees. In this paper we propose ROCT, a collection of methods to train decision trees that are optimally robust against user-specified attack models. We show that the min-max optimization problem that arises in adversarial learning can be solved using a single minimization formulation for decision trees with 0-1 loss. We propose such formulations in Mixed-Integer Linear Programming and Maximum Satisfiability, which widely available solvers can optimize. We also present a method that determines the upper bound on adversarial accuracy for any model using bipartite matching. Our experimental results demonstrate that the existing heuristics achieve close to optimal scores while ROCT achieves state-of-the-art scores.
    Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with Synthetic Data. (arXiv:2109.04319v1 [cs.CL])
    (2 min) While multilingual pretrained language models (LMs) fine-tuned on a single language have shown substantial cross-lingual task transfer capabilities, there is still a wide performance gap in semantic parsing tasks when target language supervision is available. In this paper, we propose a novel Translate-and-Fill (TaF) method to produce silver training data for a multilingual semantic parser. This method simplifies the popular Translate-Align-Project (TAP) pipeline and consists of a sequence-to-sequence filler model that constructs a full parse conditioned on an utterance and a view of the same parse. Our filler is trained on English data only but can accurately complete instances in other languages (i.e., translations of the English training utterances), in a zero-shot fashion. Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems which rely on traditional alignment techniques.
    Learning the Physics of Particle Transport via Transformers. (arXiv:2109.03951v1 [cs.LG])
    (2 min) Particle physics simulations are the cornerstone of nuclear engineering applications. Among them radiotherapy (RT) is crucial for society, with 50% of cancer patients receiving radiation treatments. For the most precise targeting of tumors, next generation RT treatments aim for real-time correction during radiation delivery, necessitating particle transport algorithms that yield precise dose distributions in sub-second times even in highly heterogeneous patient geometries. This is infeasible with currently available, purely physics based simulations. In this study, we present a data-driven dose calculation algorithm predicting the dose deposited by mono-energetic proton beams for arbitrary energies and patient geometries. Our approach frames particle transport as sequence modeling, where convolutional layers extract important spatial features into tokens and the transformer self-attention mechanism routes information between such tokens in the sequence and a beam energy token. We train our network and evaluate prediction accuracy using computationally expensive but accurate Monte Carlo (MC) simulations, considered the gold standard in particle physics. Our proposed model is 33 times faster than current clinical analytic pencil beam algorithms, improving upon their accuracy in the most heterogeneous and challenging geometries. With a relative error of 0.34% and very high gamma pass rate of 99.59% (1%, 3 mm), it also greatly outperforms the only published similar data-driven proton dose algorithm, even at a finer grid resolution. Offering MC precision 400 times faster, our model could overcome a major obstacle that has so far prohibited real-time adaptive proton treatments and significantly increase cancer treatment efficacy. Its potential to model physics interactions of other particles could also boost heavy ion treatment planning procedures limited by the speed of traditional methods.
    NU:BRIEF -- A Privacy-aware Newsletter Personalization Engine for Publishers. (arXiv:2109.03955v1 [cs.DL])
    (2 min) Newsletters have (re-)emerged as a powerful tool for publishers to engage with their readers directly and more effectively. Despite the diversity in their audiences, publishers' newsletters remain largely a one-size-fits-all offering, which is suboptimal. In this paper, we present NU:BRIEF, a web application for publishers that enables them to personalize their newsletters without harvesting personal data. Personalized newsletters build a habit and become a great conversion tool for publishers, providing an alternative reader-generated revenue model to a declining ad/clickbait-centered business model.
    On the estimation of discrete choice models to capture irrational customer behaviors. (arXiv:2109.03882v1 [econ.EM])
    (2 min) The Random Utility Maximization model is by far the most adopted framework to estimate consumer choice behavior. However, behavioral economics has provided strong empirical evidence of irrational choice behavior, such as halo effects, that are incompatible with this framework. Models belonging to the Random Utility Maximization family may therefore not accurately capture such irrational behavior. Hence, more general choice models, overcoming such limitations, have been proposed. However, the flexibility of such models comes at the price of increased risk of overfitting. As such, estimating such models remains a challenge. In this work, we propose an estimation method for the recently proposed Generalized Stochastic Preference choice model, which subsumes the family of Random Utility Maximization models and is capable of capturing halo effects. Specifically, we show how to use partially-ranked preferences to efficiently model rational and irrational customer types from transaction data. Our estimation procedure is based on column generation, where relevant customer types are efficiently extracted by expanding a tree-like data structure containing the customer behaviors. Further, we propose a new dominance rule among customer types whose effect is to prioritize low orders of interactions among products. An extensive set of experiments assesses the predictive accuracy of the proposed approach. Our results show that accounting for irrational preferences can boost predictive accuracy by 12.5% on average, when tested on a real-world dataset from a large chain of grocery and drug stores.
    Online Learning for Cooperative Multi-Player Multi-Armed Bandits. (arXiv:2109.03818v1 [cs.LG])
    (2 min) We introduce a framework for decentralized online learning for multi-armed bandits (MAB) with multiple cooperative players. The reward obtained by the players in each round depends on the actions taken by all the players. It is a team setting, and the objective is common. Information asymmetry is what makes the problem interesting and challenging. We consider three types of information asymmetry: action information asymmetry, when the actions of the players cannot be observed but the rewards received are common; reward information asymmetry, when the actions of the other players are observable but the rewards received are IID from the same distribution; and the case with both action and reward information asymmetry. For the first setting, we propose a UCB-inspired algorithm that achieves $O(\log T)$ regret whether the rewards are IID or Markovian. For the second setting, we construct an environment in which the algorithm given for the first setting incurs linear regret. For the third setting, we show that a variation of the `explore then commit' algorithm achieves near-logarithmic regret.
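    For readers unfamiliar with the upper-confidence-bound principle that the first algorithm builds on, here is a minimal single-player UCB1 loop; the paper's multi-player, information-asymmetric variant is not reproduced here.

```python
# A minimal single-player UCB1 sketch, illustrating the upper-confidence-bound
# rule behind O(log T) regret guarantees; arm means and horizon are illustrative.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])      # unknown arm means (Bernoulli rewards)
K, T = len(means), 5000
counts = np.zeros(K)
sums = np.zeros(K)

for t in range(T):
    if t < K:                           # pull each arm once to initialize
        arm = t
    else:
        ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < means[arm])
    counts[arm] += 1
    sums[arm] += reward

regret = T * means.max() - sums.sum()
print(f"empirical regret after {T} rounds: {regret:.1f}")
```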
    SORNet: Spatial Object-Centric Representations for Sequential Manipulation. (arXiv:2109.03891v1 [cs.RO])
    (2 min) Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state, where the ability to reason about spatial relationships among object entities from raw sensor inputs is crucial. Prior works relying on explicit state estimation or end-to-end learning struggle with novel objects. In this work, we propose SORNet (Spatial Object-Centric Representation Network), which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest. We show that the object embeddings learned by SORNet generalize zero-shot to unseen object entities on three spatial reasoning tasks: spatial relationship classification, skill precondition classification and relative direction regression, significantly outperforming baselines. Further, we present real-world robotic experiments demonstrating the usage of the learned object embeddings in task planning for sequential manipulation.
    Single Image 3D Object Estimation with Primitive Graph Networks. (arXiv:2109.04153v1 [cs.CV])
    (2 min) Reconstructing a 3D object from a single image (RGB or depth) is a fundamental problem in visual scene understanding, yet it remains challenging due to its ill-posed nature and the complexity of real-world scenes. To address these challenges, we adopt a primitive-based representation for 3D objects and propose a two-stage graph network for primitive-based 3D object estimation, which consists of a sequential proposal module and a graph reasoning module. Given a 2D image, our proposal module first generates a sequence of 3D primitives from the input image with local feature attention. Then the graph reasoning module performs joint reasoning on a primitive graph to capture the global shape context for each primitive. Such a framework is capable of taking into account rich geometric and semantic constraints during 3D structure recovery, producing 3D objects with more coherent structure even under challenging viewing conditions. We train the entire graph neural network in a stage-wise strategy and evaluate it on three benchmarks: Pix3D, ModelNet and NYU Depth V2. Extensive experiments show that our approach outperforms the previous state of the art by a considerable margin.
    Intriguing Parameters of Structural Causal Models. (arXiv:2105.12697v3 [cs.LG] UPDATED)
    (2 min) In recent years there has been a lot of focus on adversarial attacks, especially on deep neural networks. Here, we argue that they are more general in nature and can easily affect a larger class of models, e.g., any differentiable perturbed optimizers. We further show that such attacks can be determined by the hidden confounders in a domain, thus drawing a novel connection between such attacks and causality. This causal perspective is characterized by the influence of the structural causal model's data-generating process on the subsequent optimization, thereby exposing intriguing parameters of the former. We reveal the existence of such parameters for three combinatorial optimization problems, namely linear assignment, shortest path, and a real-world problem of energy systems. Our empirical examination also unveils worrisome consequences of these attacks on differentiable perturbed optimizers, thereby highlighting the criticality of our findings.
    Coordinate Descent Methods for DC Minimization. (arXiv:2109.04228v1 [math.OC])
    (2 min) Difference-of-Convex (DC) minimization, referring to the problem of minimizing the difference of two convex functions, has found rich applications in statistical learning and has been studied extensively for decades. However, existing methods are primarily based on multi-stage convex relaxation, leading only to weak optimality of critical points. This paper proposes a coordinate descent method for minimizing DC functions based on sequential nonconvex approximation. Our approach iteratively solves a nonconvex one-dimensional subproblem globally, and it is guaranteed to converge to a coordinate-wise stationary point. We prove that this new optimality condition is always stronger than the critical point condition and the directional point condition when the objective function is weakly convex. For comparison, we also include a naive variant of coordinate descent based on sequential convex approximation in our study. When the objective function satisfies an additional regularity condition called \emph{sharpness}, coordinate descent methods with an appropriate initialization converge \emph{linearly} to the optimal solution set. Moreover, for many applications of interest, we show that the nonconvex one-dimensional subproblem can be computed exactly and efficiently using a breakpoint searching method. We present some discussions and extensions of our proposed method. Finally, we have conducted extensive experiments on several statistical learning tasks to show the superiority of our approach. Keywords: Coordinate Descent, DC Minimization, DC Programming, Difference-of-Convex Programs, Nonconvex Optimization, Sparse Optimization, Binary Optimization.
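    A minimal sketch of the coordinate descent scheme: cycle over coordinates and globally minimize each (nonconvex) one-dimensional restriction. A dense grid search stands in for the paper's exact breakpoint-searching subroutine, and the toy separable DC objective is an illustrative assumption.

```python
# A minimal sketch of coordinate descent for a DC objective g(x) - h(x).
# Each pass cycles over coordinates and globally minimizes the nonconvex
# one-dimensional restriction; a dense grid search stands in for an exact
# breakpoint search, and the toy objective 0.5*||x - c||^2 - lam*||x||_1
# is an illustrative assumption.
import numpy as np

def f(x, c, lam):
    return 0.5 * np.sum((x - c) ** 2) - lam * np.sum(np.abs(x))

def dc_coordinate_descent(c, lam, passes=20, grid=np.linspace(-5, 5, 2001)):
    x = np.zeros_like(c)
    for _ in range(passes):
        for i in range(len(x)):
            # globally minimize the 1-D restriction t -> f(x with x_i = t)
            vals = [f(np.concatenate([x[:i], [t], x[i+1:]]), c, lam) for t in grid]
            x[i] = grid[int(np.argmin(vals))]
    return x

c = np.array([0.3, -1.2, 0.0])
x_star = dc_coordinate_descent(c, lam=0.5)
print(x_star, f(x_star, c, lam=0.5))
```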
    On the Approximation of Cooperative Heterogeneous Multi-Agent Reinforcement Learning (MARL) using Mean Field Control (MFC). (arXiv:2109.04024v1 [cs.LG])
    (2 min) Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem. We consider three scenarios where the reward and transition dynamics of all agents are respectively taken to be functions of $(1)$ joint state and action distributions across all classes, $(2)$ individual distributions of each class, and $(3)$ marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given as $e_1=\mathcal{O}(\frac{\sqrt{|\mathcal{X}||\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k})$, $e_2=\mathcal{O}(\sqrt{|\mathcal{X}||\mathcal{U}|}\sum_{k}\frac{1}{\sqrt{N_k}})$ and $e_3=\mathcal{O}\left(\sqrt{|\mathcal{X}||\mathcal{U}|}\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$, respectively, where $A, B$ are some constants and $|\mathcal{X}|,|\mathcal{U}|$ are the sizes of state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to an optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j\in\{1,2,3\}$, respectively.
    Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel. (arXiv:2011.04558v3 [stat.ML] UPDATED)
    (2 min) Spectral clustering is a popular method for community detection in network graphs: starting from a matrix representation of the graph, the nodes are clustered on a low dimensional projection obtained from a truncated spectral decomposition of the matrix. Estimating correctly the number of communities and the dimension of the reduced latent space is critical for good performance of spectral clustering algorithms. Furthermore, many real-world graphs, such as enterprise computer networks studied in cyber-security applications, often display heterogeneous within-community degree distributions. Such heterogeneous degree distributions are usually not well captured by standard spectral clustering algorithms. In this article, a novel spectral clustering algorithm is proposed for community detection under the degree-corrected stochastic blockmodel. The proposed method is based on a transformation of the spectral embedding to spherical coordinates, and a novel modelling assumption in the transformed space. The method allows for simultaneous and automated selection of the number of communities and the latent dimension for spectral embeddings of graphs with uneven node degrees. Results show improved performance over competing methods in representing computer networks.
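    A minimal sketch of the embedding transformation described above: compute an adjacency spectral embedding and map each row to spherical coordinates, discarding the degree-driven radius. The paper's model-based clustering in the transformed space is not reproduced here.

```python
# A minimal sketch: adjacency spectral embedding followed by a transformation
# of each embedding row to spherical (angular) coordinates. The subsequent
# model-based clustering step from the paper is omitted.
import numpy as np

def adjacency_spectral_embedding(A, d):
    # symmetric adjacency -> top-d eigenpairs by absolute eigenvalue
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(vals))[:d]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

def cartesian_to_spherical(x):
    # map a d-dimensional row to its d-1 angular coordinates,
    # discarding the radius (which largely reflects node degree)
    d = len(x)
    theta = np.zeros(d - 1)
    for i in range(d - 2):
        theta[i] = np.arctan2(np.linalg.norm(x[i + 1:]), x[i])
    theta[-1] = np.arctan2(x[-1], x[-2])
    return theta

rng = np.random.default_rng(1)
A = (rng.random((30, 30)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T                      # undirected simple graph
X = adjacency_spectral_embedding(A, d=3)
angles = np.array([cartesian_to_spherical(row) for row in X])
# `angles` can now be clustered (e.g. with a Gaussian mixture) to recover communities
```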
    An objective function for order preserving hierarchical clustering. (arXiv:2109.04266v1 [cs.LG])
    (2 min) We present an objective function for similarity based hierarchical clustering of partially ordered data that preserves the partial order in the sense that if $x \leq y$, and if $[x]$ and $[y]$ are the respective clusters of $x$ and $y$, then there is an order relation $\leq'$ on the clusters for which $[x] \leq' |y]$. The model distinguishes itself from existing methods and models for clustering of ordered data in that the order relation and the similarity are combined to obtain an optimal hierarchical clustering seeking to satisfy both, and that the order relation is equipped with a pairwise level of comparability in the range $[0,1]$. In particular, if the similarity and the order relation are not aligned, then order preservation may have to yield in favor of clustering. Finding an optimal solution is NP-hard, so we provide a polynomial time approximation algorithm, with a relative performance guarantee of $O(\log^{3/2}n)$, based on successive applications of directed sparsest cut. The model is an extension of the Dasgupta cost function for divisive hierarchical clustering.
    Relating Graph Neural Networks to Structural Causal Models. (arXiv:2109.04173v1 [cs.LG])
    (2 min) Causality can be described in terms of a structural causal model (SCM) that carries information on the variables of interest and their mechanistic relations. For most processes of interest the underlying SCM will only be partially observable, thus causal inference tries to leverage any exposed information. Graph neural networks (GNN) as universal approximators on structured input pose a viable candidate for causal learning, suggesting a tighter integration with SCM. To this effect we present a theoretical analysis from first principles that establishes a novel connection between GNN and SCM while providing an extended view on general neural-causal models. We then establish a new model class for GNN-based causal inference that is necessary and sufficient for causal effect identification. Our empirical illustration on simulations and standard benchmarks validates our theoretical proofs.
    Self-supervised Reinforcement Learning with Independently Controllable Subgoals. (arXiv:2109.04150v1 [cs.LG])
    (2 min) To successfully tackle challenging manipulation tasks, autonomous agents must learn a diverse set of skills and how to combine them. Recently, self-supervised agents that set their own abstract goals by exploiting the discovered structure in the environment were shown to perform well on many different tasks. In particular, some of them were applied to learn basic manipulation skills in compositional multi-object environments. However, these methods learn skills without taking the dependencies between objects into account. Thus, the learned skills are difficult to combine in realistic environments. We propose a novel self-supervised agent that estimates relations between environment components and uses them to independently control different parts of the environment state. In addition, the estimated relations between objects can be used to decompose a complex goal into a compatible sequence of subgoals. We show that, by using this framework, an agent can efficiently and automatically learn manipulation tasks in multi-object environments with different relations between objects.
    Towards Robust Cross-domain Image Understanding with Unsupervised Noise Removal. (arXiv:2109.04284v1 [cs.CV])
    (3 min) Deep learning models usually require a large amount of labeled data to achieve satisfactory performance. In multimedia analysis, domain adaptation studies the problem of cross-domain knowledge transfer from a label-rich source domain to a label-scarce target domain, thus potentially alleviating the annotation requirement for deep learning models. However, we find that contemporary domain adaptation methods for cross-domain image understanding perform poorly when the source domain is noisy. Weakly Supervised Domain Adaptation (WSDA) studies the domain adaptation problem under the scenario where source data can be noisy. Prior methods on WSDA remove noisy source data and align the marginal distribution across domains without considering the fine-grained semantic structure in the embedding space, which leads to class misalignment, e.g., features of cats in the target domain might be mapped near features of dogs in the source domain. In this paper, we propose a novel method, termed Noise Tolerant Domain Adaptation, for WSDA. Specifically, we adopt the cluster assumption and learn clusters discriminatively with class prototypes in the embedding space. We propose to leverage the location information of the data points in the embedding space and to model this location information with a Gaussian mixture model to identify noisy source data. We then design a network which incorporates the Gaussian mixture noise model as a sub-module for unsupervised noise removal, and we propose a novel cluster-level adversarial adaptation method which aligns unlabeled target data with the less noisy class prototypes to map the semantic structure across domains. We conduct extensive experiments to evaluate the effectiveness of our method on both general images and medical images from COVID-19 and e-commerce datasets. The results show that our method significantly outperforms state-of-the-art WSDA methods.
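    One plausible instantiation of the Gaussian-mixture noise model described above (not the paper's exact design): fit a two-component mixture to each source sample's distance from its class prototype in the embedding space and flag the high-distance component as noisy.

```python
# A minimal sketch of Gaussian-mixture-based noise identification in an
# embedding space: fit a 2-component GMM to each sample's distance from its
# class prototype and flag the high-distance component as noisy. This is one
# plausible instantiation of the idea above, not the exact published model.
import numpy as np
from sklearn.mixture import GaussianMixture

def flag_noisy_samples(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    prototypes = {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}
    dists = np.array([np.linalg.norm(e - prototypes[c]) for e, c in zip(embeddings, labels)])
    gmm = GaussianMixture(n_components=2, random_state=0).fit(dists.reshape(-1, 1))
    noisy_component = int(np.argmax(gmm.means_.ravel()))   # larger-mean component = noisy
    return gmm.predict(dists.reshape(-1, 1)) == noisy_component

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.3, size=(80, 2))
noisy = rng.normal(3.0, 0.3, size=(20, 2))              # mislabeled points far from the prototype
emb = np.vstack([clean, noisy])
lab = np.zeros(100, dtype=int)
print(flag_noisy_samples(emb, lab).sum(), "samples flagged as noisy")
```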
    Image Cropping on Twitter: Fairness Metrics, their Limitations, and the Importance of Representation, Design, and Agency. (arXiv:2105.08667v2 [cs.CY] UPDATED)
    (2 min) Twitter uses machine learning to crop images, where crops are centered around the part predicted to be the most salient. In fall 2020, Twitter users raised concerns that the automated image cropping system on Twitter favored light-skinned over dark-skinned individuals, as well as concerns that the system favored cropping women's bodies instead of their heads. In order to address these concerns, we conduct an extensive analysis using formalized group fairness metrics. We find systematic disparities in cropping and identify contributing factors, including the fact that cropping based on the single most salient point can amplify the disparities because of an effect we term argmax bias. However, we demonstrate that formalized fairness metrics and quantitative analysis on their own are insufficient for capturing the risk of representational harm in automatic cropping. We suggest the removal of saliency-based cropping in favor of a solution that better preserves user agency. For developing a new solution that sufficiently addresses concerns related to representational harm, our critique motivates a combination of quantitative and qualitative methods that include human-centered design.
    Incentivizing an Unknown Crowd. (arXiv:2109.04226v1 [cs.LG])
    (2 min) Motivated by the common strategic activities in crowdsourced labeling, we study the problem of sequentially eliciting information without verification (EIWV) from a heterogeneous and unknown crowd of workers. We propose a reinforcement learning-based approach that is effective against a wide range of settings, including potential irrationality and collusion among workers. With the aid of a costly oracle and an inference method, our approach dynamically decides the oracle calls and gains robustness even in the presence of frequent collusion activities. Extensive experiments show the advantage of our approach. Our results also present the first comprehensive experiments of EIWV on large-scale real datasets and the first thorough study of the effects of environmental variables.
    Juvenile state hypothesis: What we can learn from lottery ticket hypothesis researches?. (arXiv:2109.03862v1 [cs.LG])
    (2 min) The lottery ticket hypothesis revealed the relationship between network structure, initialization parameters, and the learning potential of neural networks. The original lottery ticket hypothesis performs pruning and weight resetting after training convergence, exposing it to the problems of forgetting learned knowledge and potentially high training cost. Therefore, we propose a strategy that combines the idea of neural network structure search with a pruning algorithm to alleviate this problem. This algorithm searches and extends the network structure on existing winning ticket sub-networks to produce new winning tickets recursively. This allows the training and pruning process to continue without compromising performance. A new winning ticket sub-network with a deeper network structure, better generalization ability, and better test performance can be obtained in this recursive manner. This method addresses the difficulty of training and the performance degradation of sub-networks after pruning, the forgetting of weights in the original lottery ticket hypothesis, and the difficulty of generating winning ticket sub-networks when the final network structure is not given. We validate this strategy on the MNIST and CIFAR-10 datasets. After relating it to similar biological phenomena and relevant lottery ticket hypothesis studies in recent years, we further propose a new hypothesis to discuss which factors can keep a network juvenile, i.e., those possible factors that influence the learning potential or generalization performance of a neural network during training.
    SensiX++: Bringing MLOPs and Multi-tenant Model Serving to Sensory Edge Devices. (arXiv:2109.03947v1 [cs.LG])
    (2 min) We present SensiX++ - a multi-tenant runtime for adaptive model execution with integrated MLOps on edge devices, e.g., a camera, a microphone, or IoT sensors. SensiX++ operates on two fundamental principles - highly modular componentisation to externalise data operations with clear abstractions and document-centric manifestation for system-wide orchestration. First, a data coordinator manages the lifecycle of sensors and serves models with correct data through automated transformations. Next, a resource-aware model server executes multiple models in isolation through model abstraction, pipeline automation and feature sharing. An adaptive scheduler then orchestrates the best-effort executions of multiple models across heterogeneous accelerators, balancing latency and throughput. Finally, microservices with REST APIs serve synthesised model predictions, system statistics, and continuous deployment. Collectively, these components enable SensiX++ to serve multiple models efficiently with fine-grained control on edge devices while minimising data operation redundancy, managing data and device heterogeneity, reducing resource contention and removing manual MLOps. We benchmark SensiX++ with ten different vision and acoustics models across various multi-tenant configurations on different edge accelerators (Jetson AGX and Coral TPU) designed for sensory devices. We report on the overall throughput and quantified benefits of various automation components of SensiX++ and demonstrate its efficacy to significantly reduce operational complexity and lower the effort to deploy, upgrade, reconfigure and serve embedded models on edge devices.
    Fixing exposure bias with imitation learning needs powerful oracles. (arXiv:2109.04114v1 [cs.CL])
    (2 min) We apply imitation learning (IL) to tackle the NMT exposure bias problem with error-correcting oracles, and evaluate an SMT lattice-based oracle which, despite its excellent performance in an unconstrained oracle translation task, turned out to be too pruned and idiosyncratic to serve as the oracle for IL.
    Table-based Fact Verification with Salience-aware Learning. (arXiv:2109.04053v1 [cs.CL])
    (2 min) Tables provide valuable knowledge that can be used to verify textual statements. While a number of works have considered table-based fact verification, direct alignments of tabular data with tokens in textual statements are rarely available. Moreover, training a generalized fact verification model requires abundant labeled training data. In this paper, we propose a novel system to address these problems. Inspired by counterfactual causality, our system identifies token-level salience in the statement with probing-based salience estimation. Salience estimation allows enhanced learning of fact verification from two perspectives. From one perspective, our system conducts masked salient token prediction to enhance the model for alignment and reasoning between the table and the statement. From the other perspective, our system applies salience-aware data augmentation to generate a more diverse set of training instances by replacing non-salient terms. Experimental results on TabFact show the effective improvement by the proposed salience-aware learning techniques, leading to the new SOTA performance on the benchmark. Our code is publicly available at https://github.com/luka-group/Salience-aware-Learning .
    TimeTraveler: Reinforcement Learning for Temporal Knowledge Graph Forecasting. (arXiv:2109.04101v1 [cs.LG])
    (2 min) Temporal knowledge graph (TKG) reasoning is a crucial task that has gained increasing research interest in recent years. Most existing methods focus on reasoning at past timestamps to complete missing facts, and only a few works reason on known TKGs to forecast future facts. Compared with the completion task, the forecasting task is more difficult and faces two main challenges: (1) how to effectively model time information to handle future timestamps? (2) how to make inductive inferences to handle previously unseen entities that emerge over time? To address these challenges, we propose the first reinforcement learning method for forecasting. Specifically, the agent travels on historical knowledge graph snapshots to search for the answer. Our method defines a relative time encoding function to capture the timespan information, and we design a novel time-shaped reward based on the Dirichlet distribution to guide the model learning. Furthermore, we propose a novel representation method for unseen entities to improve the inductive inference ability of the model. We evaluate our method on this link prediction task at future timestamps. Extensive experiments on four benchmark datasets demonstrate substantial performance improvements, together with higher explainability, less computation, and fewer parameters compared with existing state-of-the-art methods.
    QUINT: Node embedding using network hashing. (arXiv:2109.04206v1 [cs.SI])
    (2 min) Representation learning using network embedding has received tremendous attention due to its efficacy in solving downstream tasks. Popular embedding methods (such as deepwalk, node2vec, LINE) are based on a neural architecture and are thus unable to scale to large networks in terms of both time and space usage. Recently, we proposed BinSketch, a sketching technique for compressing binary vectors to binary vectors. In this paper, we show how to extend BinSketch and use it for network hashing. Our proposal, named QUINT, is built upon BinSketch, and it embeds nodes of a sparse network onto a low-dimensional space using simple bit-wise operations. QUINT is the first of its kind to provide tremendous gains in terms of speed and space usage without compromising much on the accuracy of the downstream tasks. Extensive experiments are conducted to compare QUINT with seven state-of-the-art network embedding methods on two end tasks - link prediction and node classification. We observe a huge performance gain for QUINT in terms of speedup (up to 7000x) and space saving (up to 800x) due to its bit-wise nature of obtaining node embeddings. Moreover, QUINT is a consistent top-performer on both tasks across all the datasets. Our empirical observations are backed by rigorous theoretical analysis to justify the effectiveness of QUINT. In particular, we prove that QUINT retains enough structural information which can be used further to approximate many topological properties of networks with high confidence.
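    A simplified sketch of the bucket-and-OR compression idea behind BinSketch, on which QUINT builds: randomly assign each input dimension to one of k buckets and OR the assigned bits. The similarity estimators and bias corrections of the actual method are omitted.

```python
# A simplified sketch of compressing long sparse binary vectors into short
# binary vectors: each of the n input dimensions is randomly assigned to one
# of k buckets and the assigned bits are OR-ed. The actual BinSketch/QUINT
# method additionally derives similarity estimators with bias correction,
# which is not shown here.
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 256
bucket = rng.integers(0, k, size=n)          # fixed random dimension-to-bucket map

def binsketch(x: np.ndarray) -> np.ndarray:
    """OR-aggregate a sparse binary vector x (shape (n,), bool) into k bits."""
    s = np.zeros(k, dtype=bool)
    s[bucket[np.flatnonzero(x)]] = True
    return s

# two sparse binary "adjacency rows" sharing most of their 1s
x = np.zeros(n, dtype=bool); x[rng.choice(n, 50, replace=False)] = True
y = x.copy(); y[rng.choice(n, 5, replace=False)] = True
sx, sy = binsketch(x), binsketch(y)
print("sketch overlap:", np.sum(sx & sy), "of", np.sum(sx | sy))
```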
    Versions of Gradient Temporal Difference Learning. (arXiv:2109.04033v1 [cs.LG])
    (2 min) Sutton, Szepesv\'{a}ri and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. The goal of this paper is (a) to propose some variants of GTDs with extensive comparative analysis and (b) to establish new theoretical analysis frameworks for the GTDs. These variants are based on convex-concave saddle-point interpretations of GTDs, which effectively unify all the GTDs into a single framework, and provide simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, numerical comparative analysis is given to evaluate these approaches.
    Local Augmentation for Graph Neural Networks. (arXiv:2109.03856v1 [cs.LG])
    (2 min) Data augmentation has been widely used on image data and linguistic data but remains under-explored on graph-structured data. Existing methods focus on augmenting graph data from a global perspective and largely fall into two genres: structural manipulation and adversarial training with feature noise injection. However, the structural manipulation approach suffers from information loss while the adversarial training approach may downgrade the feature quality by injecting noise. In this work, we introduce local augmentation, which enhances node features using their local subgraph structures. Specifically, we model the data augmentation as a feature generation process. Given the central node's feature, our local augmentation approach learns the conditional distribution of its neighbors' features and generates the neighbors' optimal features to boost the performance of downstream tasks. Based on the local augmentation, we further design a novel framework: LA-GNN, which can be applied to any GNN model in a plug-and-play manner. Extensive experiments and analyses show that local augmentation consistently yields performance improvements for various GNN architectures across a diverse set of benchmarks. Code is available at https://github.com/Soughing0823/LAGNN.
    Mean-Square Analysis with An Application to Optimal Dimension Dependence of Langevin Monte Carlo. (arXiv:2109.03839v1 [cs.LG])
    (2 min) Sampling algorithms based on discretizations of Stochastic Differential Equations (SDEs) compose a rich and popular subset of MCMC methods. This work provides a general framework for the non-asymptotic analysis of sampling error in 2-Wasserstein distance, which also leads to a bound of mixing time. The method applies to any consistent discretization of contractive SDEs. When applied to Langevin Monte Carlo algorithm, it establishes $\tilde{\mathcal{O}}\left( \frac{\sqrt{d}}{\epsilon} \right)$ mixing time, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the 3rd-order derivative of the potential of target measures at infinity. This bound improves the best previously known $\tilde{\mathcal{O}}\left( \frac{d}{\epsilon} \right)$ result and is optimal (in terms of order) in both dimension $d$ and accuracy tolerance $\epsilon$ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.
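    For reference, the Langevin Monte Carlo iteration whose mixing time is analyzed above is a one-line update; a minimal sketch for a Gaussian target follows.

```python
# A minimal (unadjusted) Langevin Monte Carlo sketch for a log-strongly-convex
# target: x_{k+1} = x_k - h * grad U(x_k) + sqrt(2h) * xi_k. This is the
# discretized SDE whose 2-Wasserstein sampling error the analysis above bounds;
# the target and step size here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, h, n_steps = 5, 0.05, 20_000
grad_U = lambda x: x                      # target N(0, I): U(x) = ||x||^2 / 2

x = np.zeros(d)
samples = np.empty((n_steps, d))
for t in range(n_steps):
    x = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(d)
    samples[t] = x

burn = n_steps // 2
print("sample variance per dimension:", np.var(samples[burn:], axis=0).round(2))  # ~ 1, up to O(h) bias
```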
    PowerGym: A Reinforcement Learning Environment for Volt-Var Control in Power Distribution Systems. (arXiv:2109.03970v1 [cs.LG])
    (2 min) We introduce PowerGym, an open-source reinforcement learning environment for Volt-Var control in power distribution systems. Following OpenAI Gym APIs, PowerGym targets minimizing power loss and voltage violations under physical networked constraints. PowerGym provides four distribution systems (13Bus, 34Bus, 123Bus, and 8500Node) based on IEEE benchmark systems and design variants for various control difficulties. To foster generalization, PowerGym offers a detailed customization guide for users working with their distribution systems. As a demonstration, we examine state-of-the-art reinforcement learning algorithms in PowerGym and validate the environment by studying controller behaviors.
    Stationary Density Estimation of It\^o Diffusions Using Deep Learning. (arXiv:2109.03992v1 [math.NA])
    (2 min) In this paper, we consider the density estimation problem associated with the stationary measure of ergodic It\^o diffusions from a discrete-time series that approximates the solutions of the stochastic differential equations. To take advantage of the characterization of the density function through the stationary solution of a parabolic-type Fokker-Planck PDE, we proceed as follows. First, we employ deep neural networks to approximate the drift and diffusion terms of the SDE by solving appropriate supervised learning tasks. Subsequently, we solve a steady-state Fokker-Planck equation associated with the estimated drift and diffusion coefficients with a neural-network-based least-squares method. We establish the convergence of the proposed scheme under appropriate mathematical assumptions, accounting for the generalization errors induced by regressing the drift and diffusion coefficients, and by the PDE solvers. This theoretical study relies on a recent result from the perturbation theory of Markov chains that shows a linear dependence of the density estimation error on the error in estimating the drift term, and on generalization error results for nonparametric regression and for PDE regression solutions obtained with neural-network models. The effectiveness of this method is reflected by numerical simulations of a two-dimensional Student's t distribution and a 20-dimensional Langevin dynamics.
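    A minimal sketch of the first stage described above, with a simple binned conditional average standing in for the neural-network regression: the drift and squared diffusion of a one-dimensional It\^o process are estimated from the conditional moments of the increments.

```python
# A minimal sketch of estimating drift and diffusion of a 1-D Ito process from
# a discretely observed path via E[dX | X] ~ a(X) dt and E[dX^2 | X] ~ b(X)^2 dt.
# The paper fits neural networks to these supervised targets; here a simple
# binned average stands in for that regression, and the OU process is a toy example.
import numpy as np

rng = np.random.default_rng(0)
dt, n = 1e-3, 200_000
a = lambda x: -x            # true drift (Ornstein-Uhlenbeck)
b = lambda x: 1.0           # true diffusion

# simulate the SDE with Euler-Maruyama
x = np.empty(n); x[0] = 0.0
for t in range(n - 1):
    x[t + 1] = x[t] + a(x[t]) * dt + b(x[t]) * np.sqrt(dt) * rng.standard_normal()

dx = np.diff(x)
bins = np.linspace(-2, 2, 21)
which = np.digitize(x[:-1], bins)
for k in range(1, len(bins)):
    mask = which == k
    if mask.sum() > 100:
        center = 0.5 * (bins[k - 1] + bins[k])
        drift_hat = dx[mask].mean() / dt
        diff_hat = np.sqrt((dx[mask] ** 2).mean() / dt)
        print(f"x~{center:+.1f}: drift ~ {drift_hat:+.2f} (true {a(center):+.2f}), diffusion ~ {diff_hat:.2f}")
```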
    Mapping Research Topics in Software Testing: A Bibliometric Analysis. (arXiv:2109.04086v1 [cs.DL])
    (2 min) In this study, we apply co-word analysis - a text mining technique based on the co-occurrence of terms - to map the topology of software testing research topics, with the goal of providing current and prospective researchers with a map, and observations about the evolution, of the software testing field. Our analysis enables the mapping of software testing research into clusters of connected topics, from which emerge a total of 16 high-level research themes and a further 18 subthemes. This map also suggests topics that are growing in importance, including topics related to web and mobile applications and artificial intelligence. Exploration of author and country-based collaboration patterns offers similar insight into the implicit and explicit factors that influence collaboration and suggests emerging sources of collaboration for future work. We make our observations - and the underlying mapping of research topics and research collaborations - available so that researchers can gain a deeper understanding of the topology of the software testing field, inspiration regarding new areas and connections to explore, and collaborators who will broaden their perspectives.
    Machine learning modeling of family wide enzyme-substrate specificity screens. (arXiv:2109.03900v1 [q-bio.BM])
    (2 min) Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate 6 different high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward in order to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.
    Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models. (arXiv:2109.03892v1 [cs.CL])
    (2 min) We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.
    Unsupervised Pre-training with Structured Knowledge for Improving Natural Language Inference. (arXiv:2109.03941v1 [cs.CL])
    (2 min) While recent research on natural language inference has considerably benefited from large annotated datasets, the amount of inference-related knowledge (including commonsense) provided in the annotated data is still rather limited. There have been two lines of approaches that can be used to further address the limitation: (1) unsupervised pretraining can leverage knowledge in much larger unstructured text data; (2) structured (often human-curated) knowledge has started to be considered in neural-network-based models for NLI. An immediate question is whether these two approaches complement each other, or how to develop models that can bring together their advantages. In this paper, we propose models that leverage structured knowledge in different components of pre-trained models. Our results show that the proposed models perform better than previous BERT-based state-of-the-art models. Although our models are proposed for NLI, they can be easily extended to other sentence or sentence-pair classification problems.
    Bag of Tricks for Optimizing Transformer Efficiency. (arXiv:2109.04030v1 [cs.LG])
    (2 min) Improving Transformer efficiency has become increasingly attractive recently. A wide range of methods has been proposed, e.g., pruning, quantization, new architectures, etc. But these methods are either sophisticated in implementation or dependent on hardware. In this paper, we show that the efficiency of Transformers can be improved by combining some simple and hardware-agnostic methods, including tuning hyper-parameters, better design choices, and training strategies. On the WMT news translation tasks, we improve the inference efficiency of a strong Transformer system by 3.80X on CPU and 2.52X on GPU. The code is publicly available at https://github.com/Lollipop321/mini-decoder-network.
    MaterialsAtlas.org: A Materials Informatics Web App Platform for Materials Discovery and Survey of State-of-the-Art. (arXiv:2109.04007v1 [cond-mat.mtrl-sci])
    (2 min) The availability and easy access of large scale experimental and computational materials data have enabled the emergence of accelerated development of algorithms and models for materials property prediction, structure prediction, and generative design of materials. However, lack of user-friendly materials informatics web servers has severely constrained the wide adoption of such tools in the daily practice of materials screening, tinkering, and design space exploration by materials scientists. Herein we first survey current materials informatics web apps and then propose and develop MaterialsAtlas.org, a web based materials informatics toolbox for materials discovery, which includes a variety of routinely needed tools for exploratory materials discovery, including materials composition and structure check (e.g. for neutrality, electronegativity balance, dynamic stability, Pauling rules), materials property prediction (e.g. band gap, elastic moduli, hardness, thermal conductivity), and search for hypothetical materials. These user-friendly tools can be freely accessed at \url{www.materialsatlas.org}. We argue that such materials informatics apps should be widely developed by the community to speed up the materials discovery processes.
    Popularity Adjusted Block Models are Generalized Random Dot Product Graphs. (arXiv:2109.04010v1 [stat.ML])
    (2 min) We connect two random graph models, the Popularity Adjusted Block Model (PABM) and the Generalized Random Dot Product Graph (GRDPG), by demonstrating that the PABM is a special case of the GRDPG in which communities correspond to mutually orthogonal subspaces of latent vectors. This insight allows us to construct new algorithms for community detection and parameter estimation for the PABM, as well as improve an existing algorithm that relies on Sparse Subspace Clustering. Using established asymptotic properties of Adjacency Spectral Embedding for the GRDPG, we derive asymptotic properties of these algorithms. In particular, we demonstrate that the absolute number of community detection errors tends to zero as the number of graph vertices tends to infinity. Simulation experiments illustrate these properties.
    Sensitive Samples Revisited: Detecting Neural Network Attacks Using Constraint Solvers. (arXiv:2109.03966v1 [cs.LG])
    (2 min) Neural Networks are used today in numerous security- and safety-relevant domains and are, as such, a popular target of attacks that subvert their classification capabilities, by manipulating the network parameters. Prior work has introduced sensitive samples -- inputs highly sensitive to parameter changes -- to detect such manipulations, and proposed a gradient ascent-based approach to compute them. In this paper we offer an alternative, using symbolic constraint solvers. We model the network and a formal specification of a sensitive sample in the language of the solver and ask for a solution. This approach supports a rich class of queries, corresponding, for instance, to the presence of certain types of attacks. Unlike earlier techniques, our approach does not depend on convex search domains, or on the suitability of a starting point for the search. We address the performance limitations of constraint solvers by partitioning the search space for the solver, and exploring the partitions according to a balanced schedule that still retains completeness of the search. We demonstrate the impact of the use of solvers in terms of functionality and search efficiency, using a case study for the detection of Trojan attacks on Neural Networks.
    Generation, augmentation, and alignment: A pseudo-source domain based method for source-free domain adaptation. (arXiv:2109.04015v1 [cs.LG])
    (2 min) Conventional unsupervised domain adaptation (UDA) methods need to access both labeled source samples and unlabeled target samples simultaneously to train the model. However, in some scenarios the source samples are not available to the target domain due to data privacy and safety concerns. To overcome this challenge, source-free domain adaptation (SFDA) has recently attracted the attention of researchers, where both a trained source model and unlabeled target samples are given. Existing SFDA methods either adopt a pseudo-label based strategy or generate more samples. However, these methods do not explicitly reduce the distribution shift across domains, which is the key to a good adaptation. Although there are no source samples available, fortunately, we find that some target samples are very similar to the source domain and can be used to approximate the source domain. This approximated domain is denoted as the pseudo-source domain. In this paper, inspired by this observation, we propose a novel method based on the pseudo-source domain. The proposed method first generates and augments the pseudo-source domain and then employs distribution alignment with four novel losses based on a pseudo-label based strategy. Among them, a domain adversarial loss is introduced between the pseudo-source domain and the remaining target domain to reduce the distribution shift. The results on three real-world datasets verify the effectiveness of the proposed method.
    Knowledge mining of unstructured information: application to cyber-domain. (arXiv:2109.03848v1 [cs.CR])
    (2 min) Cyber intelligence is widely and abundantly available in numerous open online sources with reports on vulnerabilities and incidents. This constant stream of noisy information requires new tools and techniques if it is to be used for the benefit of analysts and investigators in various organizations. In this paper we present and implement a novel knowledge graph and knowledge mining framework for extracting relevant information from free-form text about incidents in the cyber domain. Our framework includes a machine learning based pipeline as well as crawling methods for generating graphs of entities, attackers and the related information with our non-technical cyber ontology. We test our framework on publicly available cyber incident datasets to evaluate the accuracy of our knowledge mining methods as well as the usefulness of the framework for cyber analysts. Our results show that, by analyzing the knowledge graph constructed using the novel framework, an analyst can infer additional information from the current cyber landscape in terms of the risk to various entities and the propagation of risk between industries and countries. Expanding the framework to accommodate more technical and operational level information can increase the accuracy and explainability of trends and risk in the knowledge graph.
    Tom: Leveraging trend of the observed gradients for faster convergence. (arXiv:2109.03820v1 [cs.LG])
    (2 min) The success of deep learning can be attributed to various factors such as the increase in computational power, large datasets, deep convolutional neural networks, optimizers, etc. In particular, the choice of optimizer affects the generalization, convergence rate, and training stability. Stochastic Gradient Descent (SGD) is a first-order iterative optimizer that updates the gradient uniformly for all parameters. This uniform update may not be suitable across the entire training phase. A rudimentary solution for this is to employ a fine-tuned learning rate scheduler which decreases the learning rate as a function of iteration. To eliminate the dependency on learning rate schedulers, adaptive gradient optimizers such as AdaGrad, AdaDelta, RMSProp, and Adam employ a parameter-wise scaling term for the learning rate which is a function of the gradient itself. We propose Tom (Trend over Momentum), a novel variant of Adam that takes into account the trend observed in the gradients over the loss landscape traversed by the neural network. In the proposed Tom optimizer, an additional smoothing equation is introduced to model the trend observed during optimization. The smoothing parameter introduced for the trend requires no tuning and can be used with default values. Experimental results on classification datasets such as CIFAR-10, CIFAR-100, and CINIC-10 show that Tom outperforms Adagrad, Adadelta, RMSProp, and Adam in terms of both accuracy and convergence speed. The source code is publicly available at https://github.com/AnirudhMaiya/Tom
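    The abstract does not spell out Tom's update equations, so the snippet below is a hypothetical illustration only: an Adam-style step augmented with a third exponential moving average that tracks a trend term (here, the change in the smoothed gradient).

```python
# A hedged sketch of an Adam-style update augmented with a third smoothing
# equation for a "trend" term. The exact Tom update rules are not given in the
# abstract; the trend estimate below (an EMA of differences of the smoothed
# gradient) is a hypothetical illustration, not the published algorithm.
import numpy as np

def adam_with_trend_step(param, grad, state, lr=1e-3, b1=0.9, b2=0.999, b3=0.9, eps=1e-8):
    m_prev = state["m"].copy()
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad                            # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2                       # second moment
    state["trend"] = b3 * state["trend"] + (1 - b3) * (state["m"] - m_prev)   # trend EMA (hypothetical)
    m_hat = (state["m"] + state["trend"]) / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.array([5.0])
state = {"m": np.zeros(1), "v": np.zeros(1), "trend": np.zeros(1), "t": 0}
for _ in range(2000):
    grad = 2 * w                      # minimize f(w) = w^2
    w = adam_with_trend_step(w, grad, state, lr=0.05)
print(w)                              # -> close to 0
```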
    Model Explanations via the Axiomatic Causal Lens. (arXiv:2109.03890v1 [cs.LG])
    (2 min) Explaining the decisions of black-box models has been a central theme in the study of trustworthy ML. Numerous measures have been proposed in the literature; however, none of them have been able to adopt a provably causal take on explainability. Building upon Halpern and Pearl's formal definition of a causal explanation, we derive an analogous set of axioms for the classification setting, and use them to derive three explanation measures. Our first measure is a natural adaptation of Chockler and Halpern's notion of causal responsibility, whereas the other two correspond to existing game-theoretic influence measures. We present an axiomatic treatment for our proposed indices, showing that they can be uniquely characterized by a set of desirable properties. We complement this with computational analysis, providing probabilistic approximation schemes for all of our proposed measures. Thus, our work is the first to formally bridge the gap between model explanations, game-theoretic influence, and causal analysis.
    Distributionally Robust Multilingual Machine Translation. (arXiv:2109.04020v1 [cs.CL])
    (2 min) Multilingual neural machine translation (MNMT) learns to translate multiple language pairs with a single model, potentially improving both the accuracy and the memory-efficiency of deployed models. However, the heavy data imbalance between languages hinders the model from performing uniformly across language pairs. In this paper, we propose a new learning objective for MNMT based on distributionally robust optimization, which minimizes the worst-case expected loss over the set of language pairs. We further show how to practically optimize this objective for large translation corpora using an iterated best response scheme, which is both effective and incurs negligible additional computational cost compared to standard empirical risk minimization. We perform extensive experiments on three sets of languages from two datasets and show that our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
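    The iterated best response scheme itself is not detailed in the abstract; the sketch below shows the closely related group-DRO-style exponentiated update on the language-pair weights, which concentrates the training objective on the worst-performing pairs.

```python
# A hedged sketch of distributionally robust weighting over language pairs.
# The paper uses an iterated best response scheme; shown here is the closely
# related exponentiated-gradient update on the pair weights (as in group DRO),
# which up-weights the worst-performing pairs before taking the weighted loss.
import numpy as np

pair_losses = np.array([1.2, 0.4, 2.5, 0.9])       # per-language-pair losses (illustrative numbers)
q = np.ones_like(pair_losses) / len(pair_losses)   # weights on the probability simplex
eta = 0.5                                          # step size for the adversary

for _ in range(10):
    q = q * np.exp(eta * pair_losses)              # adversary's (approximate) best response
    q = q / q.sum()
robust_loss = float(np.dot(q, pair_losses))        # objective the model would minimize
print(q.round(3), robust_loss)                     # mass concentrates on the hardest pair
```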
    Detecting Attacks on IoT Devices using Featureless 1D-CNN. (arXiv:2109.03989v1 [cs.CR])
    (2 min) The generalization of deep learning has helped us, in the past, address challenges such as malware identification and anomaly detection in the network security domain. However, as effective as it is, the scarcity of memory and processing power makes it difficult to perform these tasks on Internet of Things (IoT) devices. This research finds a way out of this bottleneck by removing the need for feature engineering and subsequent processing in machine learning techniques. In this study, we introduce a featureless machine learning process to perform anomaly detection. It uses unprocessed byte streams of packets as training data. Featureless machine learning enables a low-cost, low-memory time-series analysis of network traffic. It benefits from eliminating the significant investment in subject matter experts and the time required for feature engineering.
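    A minimal PyTorch sketch of such a featureless detector: embed raw packet bytes, apply 1-D convolutions, pool over the byte sequence, and classify. Layer sizes and the packet length are illustrative assumptions, not the paper's configuration.

```python
# A minimal sketch of a "featureless" detector operating on raw packet bytes:
# embed each byte value, apply 1-D convolutions, pool over the byte sequence,
# and classify. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ByteCNN(nn.Module):
    def __init__(self, n_classes: int = 2, max_len: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(256, 32)                 # one vector per possible byte value
        self.conv = nn.Sequential(
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                       # pool over the byte dimension
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, max_len) integer tensor of raw packet bytes
        x = self.embed(byte_ids).transpose(1, 2)           # -> (batch, 32, max_len)
        return self.fc(self.conv(x).squeeze(-1))

model = ByteCNN()
packets = torch.randint(0, 256, (8, 1024))                 # a batch of raw byte streams
logits = model(packets)                                    # (8, 2): benign vs. anomalous scores
```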
    Learning the hypotheses space from data through a U-curve algorithm: a statistically consistent complexity regularizer for Model Selection. (arXiv:2109.03866v1 [stat.ML])
    (2 min) This paper proposes a data-driven, systematic, consistent, and non-exhaustive approach to Model Selection that is an extension of the classical agnostic PAC learning model. In this approach, learning problems are modeled not only by a hypothesis space $\mathcal{H}$, but also by a Learning Space $\mathbb{L}(\mathcal{H})$, a poset of subspaces of $\mathcal{H}$, which covers $\mathcal{H}$ and satisfies a property regarding the VC dimension of related subspaces, making it a suitable algebraic search space for Model Selection algorithms. Our main contributions are a data-driven general learning algorithm to perform regularized Model Selection on $\mathbb{L}(\mathcal{H})$ and a framework under which one can, theoretically, better estimate a target hypothesis with a given sample size by properly modeling $\mathbb{L}(\mathcal{H})$ and employing high computational power. A remarkable consequence of this approach is a set of conditions under which a non-exhaustive search of $\mathbb{L}(\mathcal{H})$ can return an optimal solution. The results of this paper lead to a practical property of Machine Learning: the lack of experimental data may be mitigated by high computational capacity. In a context of continuous popularization of computational power, this property may help explain why Machine Learning has become so important, even where data is expensive and hard to get.
    AdjointNet: Constraining machine learning models with physics-based codes. (arXiv:2109.03956v1 [math.NA])
    (2 min) Physics-informed Machine Learning has recently become attractive for learning physical parameters and features from simulation and observation data. However, most existing methods do not ensure that the physics, such as balance laws (e.g., mass, momentum, energy conservation), are constrained. Some recent works (e.g., physics-informed neural networks) softly enforce physics constraints by including partial differential equation (PDE)-based loss functions but need re-discretization of the PDEs using auto-differentiation. Training these neural nets on observational data showed that one could solve forward and inverse problems in one shot. They evaluate the state variables and the parameters in a PDE. This re-discretization of PDEs is not necessarily an attractive option for domain scientists that work with physics-based codes that have been developed for decades with sophisticated discretization techniques to solve complex process models and advanced equations of state. This paper proposes a physics constrained machine learning framework, AdjointNet, allowing domain scientists to embed their physics code in neural network training workflows. This embedding ensures that physics is constrained everywhere in the domain. Additionally, the mathematical properties such as consistency, stability, and convergence vital to the numerical solution of a PDE are still satisfied. We show that the proposed AdjointNet framework can be used for parameter estimation (and uncertainty quantification by extension) and experimental design using active learning. The applicability of our framework is demonstrated for four flow cases. Results show that AdjointNet-based inversion can estimate process model parameters with reasonable accuracy. These examples demonstrate the applicability of using existing software with no changes in source code to perform accurate and reliable inversion of model parameters.
    Initialization for Nonnegative Matrix Factorization: a Comprehensive Review. (arXiv:2109.03874v1 [math.OC])
    (2 min) Non-negative matrix factorization (NMF) has become a popular method for representing meaningful data by extracting a non-negative basis of features from an observed non-negative data matrix. Its ability to uncover hidden structure places it among the powerful methods in machine learning. NMF is a non-convex optimization problem, and the initial point has a significant effect on finding an efficient local solution. In this paper, we investigate the most popular initialization procedures proposed for NMF so far. We describe each method and present some of their advantages and disadvantages. Finally, numerical results illustrating the performance of each algorithm are presented.
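    To make the initialization question concrete, here is a small, self-contained NumPy sketch (not taken from the survey) that runs the standard multiplicative updates from a random start and from a simplified SVD-based start loosely reminiscent of NNDSVD; the data, rank, and iteration count are illustrative.
        import numpy as np

        def nmf_multiplicative(X, W, H, n_iter=100, eps=1e-9):
            """Standard multiplicative-update NMF (Lee & Seung) from a given start."""
            for _ in range(n_iter):
                H *= (W.T @ X) / (W.T @ W @ H + eps)
                W *= (X @ H.T) / (W @ H @ H.T + eps)
            return W, H, np.linalg.norm(X - W @ H)

        def random_init(X, r, seed=0):
            rng = np.random.default_rng(seed)
            return rng.random((X.shape[0], r)), rng.random((r, X.shape[1]))

        def svd_based_init(X, r):
            # Simplified SVD-based start: keep the non-negative parts of the
            # leading singular vectors (an NNDSVD-like idea, not the full algorithm).
            U, S, Vt = np.linalg.svd(X, full_matrices=False)
            W = np.clip(U[:, :r], 0, None) * np.sqrt(S[:r])
            H = np.sqrt(S[:r])[:, None] * np.clip(Vt[:r], 0, None)
            return W + 1e-6, H + 1e-6   # small offset avoids zero-locking

        X = np.abs(np.random.default_rng(1).normal(size=(60, 40)))
        for name, (W0, H0) in {"random": random_init(X, 5),
                               "svd-based": svd_based_init(X, 5)}.items():
            _, _, err = nmf_multiplicative(X, W0, H0)
            print(f"{name:10s} reconstruction error: {err:.3f}")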
    Powering Comparative Classification with Sentiment Analysis via Domain Adaptive Knowledge Transfer. (arXiv:2109.03819v1 [cs.CL])
    (2 min) We study Comparative Preference Classification (CPC), which aims at predicting whether a preference comparison exists between two entities in a given sentence and, if so, which entity is preferred over the other. High-quality CPC models can significantly benefit applications such as comparative question answering and review-based recommendations. Among the existing approaches, non-deep-learning methods suffer from inferior performance. The state-of-the-art graph neural network-based ED-GAT (Ma et al., 2020) only considers syntactic information while ignoring the critical semantic relations and the sentiments toward the compared entities. We propose the Sentiment Analysis Enhanced COmparative Network (SAECON), which improves CPC accuracy with a sentiment analyzer that learns sentiments toward individual entities via domain adaptive knowledge transfer. Experiments on the CompSent-19 (Panchenko et al., 2019) dataset show a significant improvement in F1 scores over the best existing CPC approaches.
    Simplified Quantum Algorithm for the Oracle Identification Problem. (arXiv:2109.03902v1 [quant-ph])
    (2 min) In the oracle identification problem we have oracle access to bits of an unknown string $x$ of length $n$, with the promise that it belongs to a known set $C\subseteq\{0,1\}^n$. The goal is to identify $x$ using as few queries to the oracle as possible. We develop a quantum query algorithm for this problem with query complexity $O\left(\sqrt{\frac{n\log M }{\log(n/\log M)+1}}\right)$, where $M$ is the size of $C$. This bound was already derived by Kothari in 2014; we provide a simpler and more elegant proof.
    LSB: Local Self-Balancing MCMC in Discrete Spaces. (arXiv:2109.03867v1 [cs.AI])
    (2 min) Markov Chain Monte Carlo (MCMC) methods are promising solutions to sample from target distributions in high dimensions. While MCMC methods enjoy nice theoretical properties, like guaranteed convergence and mixing to the true target, in practice their sampling efficiency depends on the choice of the proposal distribution and the target at hand. This work considers using machine learning to adapt the proposal distribution to the target, in order to improve the sampling efficiency in the purely discrete domain. Specifically, (i) it proposes a new parametrization for a family of proposal distributions, called locally balanced proposals, (ii) it defines an objective function based on mutual information and (iii) it devises a learning procedure to adapt the parameters of the proposal to the target, thus achieving fast convergence and fast mixing. We call the resulting sampler the Locally Self-Balancing Sampler (LSB). We show through experimental analysis on the Ising model and Bayesian networks that LSB is indeed able to improve efficiency over a state-of-the-art sampler based on locally balanced proposals, reducing the number of iterations required to converge while achieving comparable mixing performance.
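    For readers unfamiliar with locally balanced proposals, the toy sketch below (our own illustration, not the learned LSB method) runs a Metropolis-Hastings sampler on a small Ising chain where each single-spin flip is proposed with weight g(pi(x')/pi(x)) for g(t) = sqrt(t); the mutual-information objective and learned parametrization of LSB are not reproduced.
        import numpy as np

        rng = np.random.default_rng(0)

        def ising_logp(x, J=0.5, h=0.1):
            """Unnormalized log-probability of a 1D Ising chain with spins in {-1,+1}."""
            return J * np.sum(x[:-1] * x[1:]) + h * np.sum(x)

        def flip(x, i):
            y = x.copy()
            y[i] = -y[i]
            return y

        def locally_balanced_step(x, g=np.sqrt):
            """One MH step with a locally balanced single-flip proposal:
            site i is proposed with weight g(pi(flip_i(x)) / pi(x))."""
            logp_x = ising_logp(x)
            ratios = np.array([np.exp(ising_logp(flip(x, i)) - logp_x) for i in range(len(x))])
            weights = g(ratios)
            probs = weights / weights.sum()
            i = rng.choice(len(x), p=probs)
            y = flip(x, i)
            # reverse-move proposal probability for the MH correction
            logp_y = ising_logp(y)
            ratios_y = np.array([np.exp(ising_logp(flip(y, j)) - logp_y) for j in range(len(y))])
            weights_y = g(ratios_y)
            q_forward = probs[i]
            q_backward = weights_y[i] / weights_y.sum()
            accept = min(1.0, np.exp(logp_y - logp_x) * q_backward / q_forward)
            return y if rng.random() < accept else x

        x = rng.choice([-1, 1], size=20)
        for _ in range(1000):
            x = locally_balanced_step(x)
        print("mean magnetization after 1000 steps:", x.mean())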

2021-09-09

  • cs.CL updates on arXiv.org

    Discrete and Soft Prompting for Multilingual Models. (arXiv:2109.03630v1 [cs.CL])
    (0 min) It has been shown for English that discrete and soft prompting perform strongly in few-shot learning with pretrained language models (PLMs). In this paper, we show that discrete and soft prompting perform better than finetuning in multilingual cases: Crosslingual transfer and in-language training of multilingual natural language inference. For example, with 48 English training examples, finetuning obtains 33.74% accuracy in crosslingual transfer, barely surpassing the majority baseline (33.33%). In contrast, discrete and soft prompting outperform finetuning, achieving 36.43% and 38.79%. We also demonstrate good performance of prompting with training data in multiple languages other than English.
    A Dual-Channel Framework for Sarcasm Recognition by Detecting Sentiment Conflict. (arXiv:2109.03587v1 [cs.CL])
    (0 min) Sarcasm employs ambivalence: one says something positive but actually means something negative, and vice versa. Due to its sophisticated and obscure sentiment, sarcasm poses great challenges for sentiment analysis. In this paper, we show that the essence of sarcastic text is that the literal sentiment (expressed by the surface form of the text) is opposite to the deep sentiment (expressed by the actual meaning of the text). To this end, we propose a Dual-Channel Framework that models both literal and deep sentiments to recognize the sentiment conflict. Specifically, the proposed framework is capable of detecting the sentiment conflict between the literal and deep meanings of the input text. Experiments on the political debates and the Twitter datasets show that our framework achieves the best performance on sarcasm recognition.
    MergeBERT: Program Merge Conflict Resolution via Neural Transformers. (arXiv:2109.00084v2 [cs.SE] UPDATED)
    (0 min) Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. In this paper, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. Exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 64--69% precision of merge resolution synthesis, yielding nearly a 2x performance improvement over existing structured and neural program merge tools. Finally, we demonstrate the versatility of our model, which is able to perform program merge in a multilingual setting with Java, JavaScript, TypeScript, and C# programming languages, generalizing zero-shot to unseen languages.
    Stream-level Latency Evaluation for Simultaneous Machine Translation. (arXiv:2104.08817v2 [cs.CL] UPDATED)
    (0 min) Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and to this end multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation, resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream-level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, which is successfully evaluated under streaming conditions for a reference IWSLT task.
    Smelting Gold and Silver for Improved Multilingual AMR-to-Text Generation. (arXiv:2109.03808v1 [cs.CL])
    (0 min) Recent work on multilingual AMR-to-text generation has exclusively focused on data augmentation strategies that utilize silver AMR. However, this assumes a high quality of generated AMRs, potentially limiting the transferability to the target task. In this paper, we investigate different techniques for automatically generating AMR annotations, where we aim to study which source of information yields better multilingual results. Our models trained on gold AMR with silver (machine translated) sentences outperform approaches which leverage generated silver AMR. We find that combining both complementary sources of information further improves multilingual AMR-to-text generation. Our models surpass the previous state of the art for German, Italian, Spanish, and Chinese by a large margin.
    BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. (arXiv:2104.08663v3 [cs.IR] UPDATED)
    (0 min) Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards better robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
    Cross-Policy Compliance Detection via Question Answering. (arXiv:2109.03731v1 [cs.CL])
    (0 min) Policy compliance detection is the task of ensuring that a scenario conforms to a policy (e.g. a claim is valid according to government rules or a post in an online platform conforms to community guidelines). This task has been previously instantiated as a form of textual entailment, which results in poor accuracy due to the complexity of the policies. In this paper we propose to address policy compliance detection via decomposing it into question answering, where questions check whether the conditions stated in the policy apply to the scenario, and an expression tree combines the answers to obtain the label. Despite the initial upfront annotation cost, we demonstrate that this approach results in better accuracy, especially in the cross-policy setup where the policies during testing are unseen in training. In addition, it allows us to use existing question answering models pre-trained on existing large datasets. Finally, it explicitly identifies the information missing from a scenario in case policy compliance cannot be determined. We conduct our experiments using a recent dataset consisting of government policies, which we augment with expert annotations and find that the cost of annotating question answering decomposition is largely offset by improved inter-annotator agreement and speed.
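    As a rough illustration of the decomposition (not the paper's system), the sketch below combines hypothetical per-condition yes/no/unknown answers with a small boolean expression tree; "unknown" propagates so that missing information is surfaced explicitly. The policy, questions, and answers are made up for illustration.
        from dataclasses import dataclass
        from typing import List

        # Hypothetical per-condition answers a QA model might return for a scenario;
        # "unknown" marks information missing from the scenario.
        answers = {
            "Is the claimant a resident?": "yes",
            "Was the claim filed within 30 days?": "unknown",
            "Does the claim exceed the coverage limit?": "no",
        }

        @dataclass
        class Node:
            op: str                      # "AND", "OR", "NOT", or "LEAF"
            children: List["Node"] = None
            question: str = None

        def evaluate(node):
            """Three-valued evaluation: 'yes', 'no', or 'unknown' (missing info)."""
            if node.op == "LEAF":
                return answers[node.question]
            vals = [evaluate(c) for c in node.children]
            if node.op == "NOT":
                return {"yes": "no", "no": "yes"}.get(vals[0], "unknown")
            if node.op == "AND":
                if "no" in vals:
                    return "no"
                return "unknown" if "unknown" in vals else "yes"
            if node.op == "OR":
                if "yes" in vals:
                    return "yes"
                return "unknown" if "unknown" in vals else "no"

        policy = Node("AND", [
            Node("LEAF", question="Is the claimant a resident?"),
            Node("LEAF", question="Was the claim filed within 30 days?"),
            Node("NOT", [Node("LEAF", question="Does the claim exceed the coverage limit?")]),
        ])
        print("policy compliance:", evaluate(policy))   # -> "unknown" (one condition unanswered)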
    BERTnesia: Investigating the capture and forgetting of knowledge in BERT. (arXiv:2106.02902v2 [cs.CL] UPDATED)
    (0 min) Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this article, we probe BERT specifically to understand and measure the relational knowledge it captures in its parametric memory. While probing for linguistic understanding is commonly applied to all layers of BERT as well as fine-tuned models, this has not been done for factual knowledge. We utilize existing knowledge base completion tasks (LAMA) to probe every layer of pre-trained as well as fine-tuned BERT models (ranking, question answering, NER). Our findings show that knowledge is not just contained in BERT's final layers. Intermediate layers contribute a significant amount (17-60%) to the total knowledge found. Probing intermediate layers also reveals how different types of knowledge emerge at varying rates. When BERT is fine-tuned, relational knowledge is forgotten. The extent of forgetting is impacted by the fine-tuning objective and the training data. We found that ranking models forget the least and retain more knowledge in their final layer compared to masked language modeling and question-answering. However, masked language modeling performed the best at acquiring new knowledge from the training data. When it comes to learning facts, we found that capacity and fact density are key factors. We hope this initial work will spur further research into understanding the parametric memory of language models and the effect of training objectives on factual knowledge. The code to repeat the experiments is publicly available on GitHub.
    PermuteFormer: Efficient Relative Position Encoding for Long Sequences. (arXiv:2109.02377v2 [cs.CL] UPDATED)
    (0 min) A recent variation of Transformer, Performer, scales Transformer to longer sequences with a linear attention mechanism. However, it is not compatible with relative position encoding, which has advantages over absolute position encoding. In this paper, we discuss possible ways to add relative position encoding to Performer. Based on the analysis, we propose PermuteFormer, a Performer-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies a position-dependent transformation on queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by absolute positions of tokens. PermuteFormer introduces negligible computational overhead by design, so that it runs as fast as Performer. We evaluate PermuteFormer on Long-Range Arena, a dataset for long sequences, as well as WikiText-103, a language modeling dataset. The experiments show that PermuteFormer uniformly improves the performance of Performer with almost no computational overhead and outperforms vanilla Transformer on most of the tasks.
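    A toy NumPy check (our illustration, not the paper's exact construction) of why position-dependent permutations of query/key features encode relative position: permutations are orthogonal, so the dot product between a query permuted i times and a key permuted j times depends only on the offset j - i.
        import numpy as np

        rng = np.random.default_rng(0)
        d = 8

        # one fixed feature permutation, applied `pos` times to the token at position `pos`
        perm = rng.permutation(d)

        def apply_perm(v, times):
            for _ in range(times):
                v = v[perm]
            return v

        q = rng.normal(size=d)
        k = rng.normal(size=d)

        def score(i, j):
            """Attention score between a query at position i and a key at position j."""
            return apply_perm(q, i) @ apply_perm(k, j)

        # equal relative offsets give identical scores, regardless of absolute position
        print(np.isclose(score(2, 5), score(7, 10)))   # True: offset 3 in both cases
        print(np.isclose(score(2, 5), score(2, 6)))    # False in general: different offsets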
    On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets. (arXiv:2109.03537v1 [cs.CL])
    (0 min) Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for them to achieve exceptional downstream performance than their counterparts trained directly on the downstream tasks. In this work, we study which specific traits of the pre-training data, other than the semantics, make a pre-trained LM superior to counterparts trained from scratch on downstream tasks. We propose to use artificially constructed datasets as the pre-training data to exclude the effect of semantics, and further control what characteristics the pre-training corpora have. By fine-tuning the pre-trained models on the GLUE benchmark, we can learn how beneficial it is to transfer the knowledge from a model trained on a dataset possessing a specific trait. We define and discuss three different characteristics of the artificial datasets: 1) matching the token uni-gram or bi-gram distribution between pre-training and downstream fine-tuning, 2) the presence of explicit dependencies among the tokens in a sequence, and 3) the length of the implicit dependencies among the tokens in a sequence. Our experiments show that the explicit dependencies in the sequences of the pre-training data are critical to the downstream performance. Our results also reveal that models achieve better downstream performance when pre-trained on a dataset with a longer range of implicit dependencies. Based on our analysis, we find that models pre-trained with artificial datasets are prone to learning spurious correlations in downstream tasks. Our work reveals that even if LMs are not pre-trained on natural language, they still gain transferability on certain human-language downstream tasks once they learn to model the token dependencies in the sequences. This result helps us understand the exceptional transferability of pre-trained LMs.
    Open Aspect Target Sentiment Classification with Natural Language Prompts. (arXiv:2109.03685v1 [cs.CL])
    (0 min) For many business applications, we often seek to analyze sentiments associated with any arbitrary aspects of commercial products, despite having a very limited amount of labels or even without any labels at all. However, existing aspect target sentiment classification (ATSC) models are not trainable if annotated datasets are not available. Even with labeled data, they fall short of reaching satisfactory performance. To address this, we propose simple approaches that better solve ATSC with natural language prompts, enabling the task under zero-shot cases and enhancing supervised settings, especially for few-shot cases. Under the few-shot setting for SemEval 2014 Task 4 laptop domain, our method of reformulating ATSC as an NLI task outperforms supervised SOTA approaches by up to 24.13 accuracy points and 33.14 macro F1 points. Moreover, we demonstrate that our prompts could handle implicitly stated aspects as well: our models reach about 77% accuracy on detecting sentiments for aspect categories (e.g., food), which do not necessarily appear within the text, even though we trained the models only with explicitly mentioned aspect terms (e.g., fajitas) from just 16 reviews - while the accuracy of the no-prompt baseline is only around 65%.
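    A minimal sketch of the NLI-style reformulation using the Hugging Face zero-shot-classification pipeline; the prompt wording, review text, and model choice below are illustrative and may differ from the paper's setup.
        from transformers import pipeline

        # Any NLI-trained model works here; bart-large-mnli is a common default.
        classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

        review = "The battery lasts all day, but the keyboard feels cheap."
        aspect = "keyboard"

        result = classifier(
            review,
            candidate_labels=["positive", "negative", "neutral"],
            # The aspect is baked into the hypothesis; {} is filled with each label.
            hypothesis_template=f"The sentiment toward the {aspect} is {{}}.",
        )
        print(result["labels"][0])   # expected: "negative"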
    Label Verbalization and Entailment for Effective Zero- and Few-Shot Relation Extraction. (arXiv:2109.03659v1 [cs.CL])
    (0 min) Relation extraction systems require large amounts of labeled examples which are costly to annotate. In this work we reformulate relation extraction as an entailment task, with simple, hand-made verbalizations of relations produced in less than 15 minutes per relation. The system relies on a pretrained textual entailment engine which is run as-is (no training examples, zero-shot) or further fine-tuned on labeled examples (few-shot or fully trained). In our experiments on TACRED we attain 63% F1 zero-shot, 69% with 16 examples per relation (17 F1 points better than the best supervised system under the same conditions), and only 4 points short of the state-of-the-art (which uses 20 times more training data). We also show that the performance can be improved significantly with larger entailment models, up to 12 points in zero-shot, allowing us to report the best results to date on TACRED when fully trained. The analysis shows that our few-shot systems are especially effective when discriminating between relations, and that the performance difference in low data regimes comes mainly from identifying no-relation cases.
    Active Learning by Acquiring Contrastive Examples. (arXiv:2109.03764v1 [cs.CL])
    (0 min) Common acquisition functions for active learning use either uncertainty or diversity sampling, aiming to select difficult and diverse data points from the pool of unlabeled data, respectively. In this work, leveraging the best of both worlds, we propose an acquisition function that opts for selecting contrastive examples, i.e. data points that are similar in the model feature space yet receive maximally different predictive likelihoods from the model. We compare our approach, CAL (Contrastive Active Learning), with a diverse set of acquisition functions in four natural language understanding tasks and seven datasets. Our experiments show that CAL performs consistently better than or on par with the best performing baseline across all tasks, on both in-domain and out-of-domain data. We also conduct an extensive ablation study of our method and further analyze all actively acquired datasets, showing that CAL achieves a better trade-off between uncertainty and diversity compared to other strategies.
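    A rough sketch of the acquisition idea, with random stand-ins for model features and probabilities: score each unlabeled point by the KL divergence between its predicted distribution and those of its nearest neighbours in feature space, then acquire the highest-scoring points. The pool size, number of neighbours, and budget below are arbitrary.
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(0)

        # Stand-ins for model outputs on the unlabeled pool:
        # feature vectors (penultimate-layer embeddings) and predictive distributions.
        n_pool, d, n_classes = 500, 32, 4
        features = rng.normal(size=(n_pool, d))
        logits = rng.normal(size=(n_pool, n_classes))
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

        def kl(p, q, eps=1e-12):
            return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

        # For each candidate, find its k nearest neighbours in feature space and
        # score it by the mean KL divergence between its prediction and theirs.
        k = 10
        nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
        _, idx = nn.kneighbors(features)            # idx[:, 0] is the point itself
        neighbour_probs = probs[idx[:, 1:]]         # shape (n_pool, k, n_classes)
        scores = kl(probs[:, None, :], neighbour_probs).mean(axis=1)

        budget = 16
        acquired = np.argsort(-scores)[:budget]     # most "contrastive" examples first
        print("acquired pool indices:", acquired)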
    Hi, my name is Martha: Using names to measure and mitigate bias in generative dialogue models. (arXiv:2109.03300v1 [cs.CL])
    (0 min) All AI models are susceptible to learning biases in data that they are trained on. For generative dialogue models, being trained on real human conversations containing unbalanced gender and race/ethnicity references can lead to models that display learned biases, which we define here broadly as any measurable differences in the distributions of words or semantic content of conversations based on demographic groups. We measure the strength of such biases by producing artificial conversations between two copies of a dialogue model, conditioning one conversational partner to state a name commonly associated with a certain gender and/or race/ethnicity. We find that larger capacity models tend to exhibit more gender bias and greater stereotyping of occupations by gender. We show that several methods of tuning these dialogue models, specifically name scrambling, controlled generation, and unlikelihood training, are effective in reducing bias in conversation, including on a downstream conversational task. Name scrambling is also effective in lowering differences in token usage across conversations where partners have names associated with different genders or races/ethnicities.
    Continuous Entailment Patterns for Lexical Inference in Context. (arXiv:2109.03695v1 [cs.CL])
    (0 min) Combining a pretrained language model (PLM) with textual patterns has been shown to help in both zero- and few-shot settings. For zero-shot performance, it makes sense to design patterns that closely resemble the text seen during self-supervised pretraining because the model has never seen anything else. Supervised training allows for more flexibility. If we allow for tokens outside the PLM's vocabulary, patterns can be adapted more flexibly to a PLM's idiosyncrasies. Contrasting patterns where a "token" can be any continuous vector vs. those where a discrete choice between vocabulary elements has to be made, we call our method CONtinuous pAtterNs (CONAN). We evaluate CONAN on two established benchmarks for lexical inference in context (LIiC) a.k.a. predicate entailment, a challenging natural language understanding task with relatively small training sets. In a direct comparison with discrete patterns, CONAN consistently leads to improved performance, setting a new state of the art. Our experiments give valuable insights into the kind of pattern that enhances a PLM's performance on LIiC and raise important questions regarding our understanding of PLMs using text patterns.
    ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Archival News Collections. (arXiv:2109.03438v1 [cs.CL])
    (0 min) In the last few years, open-domain question answering (ODQA) has advanced rapidly due to the development of deep learning techniques and the availability of large-scale QA datasets. However, the current datasets are essentially designed for synchronic document collections (e.g., Wikipedia). Temporal news collections, such as long-term news archives spanning several decades, are rarely used in training the models despite being quite valuable for our society. In order to foster research in the field of ODQA on such historical collections, we present ArchivalQA, a large question answering dataset consisting of 1,067,056 question-answer pairs, designed for temporal news QA. In addition, we create four subparts of our dataset based on the question difficulty levels and the containment of temporal expressions, which we believe could be useful for training or testing ODQA systems characterized by different strengths and abilities. The novel QA dataset-constructing framework that we introduce can also be applied to create datasets over other types of collections.
    Sequence Level Contrastive Learning for Text Summarization. (arXiv:2109.03481v1 [cs.CL])
    (0 min) Contrastive learning models have achieved great success in unsupervised visual representation learning, maximizing the similarities between feature representations of different views of the same image while minimizing the similarities between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document and the two have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, where we view a document, its gold summary and its model-generated summaries as different views of the same mean representation and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings compared to its counterpart without contrastive objectives.
    ProtoInfoMax: Prototypical Networks with Mutual Information Maximization for Out-of-Domain Detection. (arXiv:2108.12229v4 [cs.CL] UPDATED)
    (0 min) The ability to detect Out-of-Domain (OOD) inputs has been a critical requirement in many real-world NLP applications, since the inclusion of unsupported OOD inputs may lead to catastrophic failure of systems. However, it remains an empirical question whether current algorithms can tackle this problem reliably in a realistic scenario where zero OOD training data is available. In this study, we propose ProtoInfoMax, a new architecture that extends Prototypical Networks to simultaneously process In-Domain (ID) and OOD sentences via a Mutual Information Maximization (InfoMax) objective. Experimental results show that our proposed method can substantially improve performance by up to 20% for OOD detection in low-resource settings of text classification. We also show that ProtoInfoMax is less prone to the typical over-confidence error of neural networks, leading to more reliable ID and OOD prediction outcomes.
    Corpus-based Open-Domain Event Type Induction. (arXiv:2109.03322v1 [cs.CL])
    (0 min) Traditional event extraction methods require predefined event types and their corresponding annotations to learn event extractors. These prerequisites are often hard to satisfy in real-world applications. This work presents a corpus-based open-domain event type induction method that automatically discovers a set of event types from a given corpus. As events of the same type could be expressed in multiple ways, we propose to represent each event type as a cluster of <predicate sense, object head> pairs. Specifically, our method (1) selects salient predicates and object heads, (2) disambiguates predicate senses using only a verb sense dictionary, and (3) obtains event types by jointly embedding and clustering these pairs in a latent spherical space. Our experiments on three datasets from different domains show that our method can discover salient and high-quality event types, according to both automatic and human evaluations.
    R2-D2: A Modular Baseline for Open-Domain Question Answering. (arXiv:2109.03502v1 [cs.CL])
    (0 min) This work presents a novel four-stage open-domain QA pipeline R2-D2 (Rank twice, reaD twice). The pipeline is composed of a retriever, a passage reranker, an extractive reader, a generative reader and a mechanism that aggregates the final prediction from all of the system's components. We demonstrate its strength across three open-domain QA datasets: NaturalQuestions, TriviaQA and EfficientQA, surpassing state-of-the-art on the first two. Our analysis demonstrates that: (i) combining the extractive and generative readers yields absolute improvements of up to 5 exact-match points and is at least twice as effective as the posterior averaging ensemble of the same models with different parameters, and (ii) the extractive reader with fewer parameters can match the performance of the generative reader on extractive QA datasets.
    Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. (arXiv:2104.01027v2 [cs.SD] UPDATED)
    (0 min) Self-supervised learning of speech representations has been a very active research area, but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data used for pre-training differs from the domain of the labeled data used for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at https://github.com/pytorch/fairseq.
    BERTnesia: Investigating the capture and forgetting of knowledge in BERT. (arXiv:2010.09313v2 [cs.CL] UPDATED)
    (0 min) Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this paper, we probe BERT specifically to understand and measure the relational knowledge it captures. We utilize knowledge base completion tasks to probe every layer of pre-trained as well as fine-tuned BERT (ranking, question answering, NER). Our findings show that knowledge is not just contained in BERT's final layers. Intermediate layers contribute a significant amount (17-60%) to the total knowledge found. Probing intermediate layers also reveals how different types of knowledge emerge at varying rates. When BERT is fine-tuned, relational knowledge is forgotten; the extent of forgetting is impacted by the fine-tuning objective but not by the size of the dataset. We found that ranking models forget the least and retain more knowledge in their final layer. We release our code on GitHub to repeat the experiments.
    Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models. (arXiv:2004.13805v4 [cs.CL] UPDATED)
    (2 min) As it has been unveiled that pre-trained language models (PLMs) are to some extent capable of recognizing syntactic concepts in natural language, much effort has been made to develop a method for extracting complete (binary) parses from PLMs without training separate parsers. We improve upon this paradigm by proposing a novel chart-based method and an effective top-K ensemble technique. Moreover, we demonstrate that we can broaden the scope of application of the approach into multilingual settings. Specifically, we show that by applying our method on multilingual PLMs, it becomes possible to induce non-trivial parses for sentences from nine languages in an integrated and language-agnostic manner, attaining performance superior or comparable to that of unsupervised PCFGs. We also verify that our approach is robust to cross-lingual transfer. Finally, we provide analyses on the inner workings of our method. For instance, we discover universal attention heads which are consistently sensitive to syntactic information irrespective of the input language.
    Memory and Knowledge Augmented Language Models for Inferring Salience in Long-Form Stories. (arXiv:2109.03754v1 [cs.CL])
    (2 min) Measuring event salience is essential in the understanding of stories. This paper takes a recent unsupervised method for salience detection derived from Barthes' Cardinal Functions and theories of surprise and applies it to longer narrative forms. We improve the standard transformer language model by incorporating an external knowledgebase (derived from Retrieval Augmented Generation) and adding a memory mechanism to enhance performance on longer works. We use a novel approach to derive salience annotation using chapter-aligned summaries from the Shmoop corpus for classic literary works. Our evaluation against this data demonstrates that our salience detection model improves performance over a language model without the knowledgebase and memory augmentation, and that both components are crucial to this improvement.
    Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization. (arXiv:2109.02401v2 [cs.CL] UPDATED)
    (2 min) Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pre-trained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs' powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability; and 2) where is the optimal place in GPLMs to inject the visual information? In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability. Results show that our best model significantly surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores on the How2 dataset, and our visual guidance method contributes 83.6% of the overall improvement. Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.
    Minimum projective linearizations of trees in linear time. (arXiv:2102.03277v4 [cs.DS] UPDATED)
    (2 min) The Minimum Linear Arrangement problem (MLA) consists of finding a mapping $\pi$ from vertices of a graph to distinct integers that minimizes $\sum_{\{u,v\}\in E}|\pi(u) - \pi(v)|$. In that setting, vertices are often assumed to lie on a horizontal line and edges are drawn as semicircles above said line. For trees, various algorithms are available to solve the problem in polynomial time in $n=|V|$. There exist variants of the MLA in which the arrangements are constrained. Iordanskii, and later Hochberg and Stallmann (HS), put forward $O(n)$-time algorithms that solve the problem when arrangements are constrained to be planar (also known as one-page book embeddings). We also consider linear arrangements of rooted trees that are constrained to be projective (planar embeddings where the root is not covered by any edge). Gildea and Temperley (GT) sketched an algorithm for projective arrangements which they claimed runs in $O(n)$ but did not provide any justification of its cost. In contrast, Park and Levy claimed that GT's algorithm runs in $O(n \log d_{max})$ where $d_{max}$ is the maximum degree but did not provide sufficient detail. Here we correct an error in HS's algorithm for the planar case, show its relationship with the projective case, and derive simple algorithms for the projective and planar cases that run without a doubt in $O(n)$ time.
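    For concreteness, the snippet below simply evaluates the MLA objective for a small tree by brute force; it illustrates the cost function, not the linear-time algorithms derived in the paper.
        import itertools

        def arrangement_cost(edges, pi):
            """Cost of a linear arrangement pi: sum of |pi(u) - pi(v)| over edges."""
            return sum(abs(pi[u] - pi[v]) for u, v in edges)

        # A small star tree: vertex 0 connected to vertices 1..4.
        edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
        n = 5

        # Brute force over all arrangements, just to see what the optimum looks like.
        best = min(
            (dict(zip(range(n), perm)) for perm in itertools.permutations(range(1, n + 1))),
            key=lambda pi: arrangement_cost(edges, pi),
        )
        print("optimal cost:", arrangement_cost(edges, best), "arrangement:", best)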
    TrollsWithOpinion: A Dataset for Predicting Domain-specific Opinion Manipulation in Troll Memes. (arXiv:2109.03571v1 [cs.SI])
    (2 min) Research into the classification of Image with Text (IWT) troll memes has recently become popular. Since the online community uses memes as a refuge to express themselves, there is an abundance of data in the form of memes. These memes have the potential to demean, harass, or bully targeted individuals. Moreover, the targeted individual could fall prey to opinion manipulation. To understand the use of memes in opinion manipulation, we define three specific domains (product, political or others), which we classify into troll or not-troll, with or without opinion manipulation. To enable this analysis, we enhanced an existing dataset by annotating the data with our defined classes, resulting in a dataset of 8,881 IWT or multimodal memes in the English language (the TrollsWithOpinion dataset). We perform baseline experiments on the annotated dataset, and our results show that existing state-of-the-art techniques only reach a weighted-average F1-score of 0.37. This shows the need for the development of techniques specifically designed for multimodal troll memes.
    Effective Sequence-to-Sequence Dialogue State Tracking. (arXiv:2108.13990v2 [cs.CL] UPDATED)
    (2 min) Sequence-to-sequence models have been applied to a wide variety of NLP tasks, but how to properly use them for dialogue state tracking has not been systematically investigated. In this paper, we study this problem from the perspectives of pre-training objectives as well as the formats of context representations. We demonstrate that the choice of pre-training objective makes a significant difference to the state tracking quality. In particular, we find that masked span prediction is more effective than auto-regressive language modeling. We also explore using Pegasus, a span prediction-based pre-training objective for text summarization, for the state tracking model. We found that pre-training for the seemingly distant summarization task works surprisingly well for dialogue state tracking. In addition, we found that while the recurrent state context representation also works reasonably well, the model may have a hard time recovering from earlier mistakes. We conducted experiments on the MultiWOZ 2.1-2.4, WOZ 2.0, and DSTC2 datasets with consistent observations.
    Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance for Multi-party Dialogue Reading Comprehension. (arXiv:2109.03772v1 [cs.CL])
    (2 min) Multi-party dialogue machine reading comprehension (MRC) poses tremendous challenges since it involves multiple speakers in one dialogue, resulting in intricate speaker information flows and noisy dialogue contexts. To alleviate such difficulties, previous models focus on how to incorporate this information using complex graph-based modules and additional manually labeled data, which is usually rare in real scenarios. In this paper, we design two labour-free self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to implicitly model the speaker information flows, and to capture salient clues in a long dialogue. Experimental results on two benchmark datasets justify the effectiveness of our method over competitive baselines and current state-of-the-art models.
    SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers. (arXiv:2103.12279v2 [cs.CL] UPDATED)
    (2 min) We introduce SelfExplain, a novel self-explaining model that explains a text classifier's predictions using phrase-based concepts. SelfExplain augments existing neural classifiers by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SelfExplain facilitates interpretability without sacrificing performance. Most importantly, explanations from SelfExplain show sufficiency for model predictions and are perceived as adequate, trustworthy and understandable by human judges compared to existing widely-used baselines.
    Highly Parallel Autoregressive Entity Linking with Discriminative Correction. (arXiv:2109.03792v1 [cs.CL])
    (2 min) Generative approaches have been recently shown to be effective for both Entity Disambiguation and Entity Linking (i.e., joint mention detection and disambiguation). However, the previously proposed autoregressive formulation for EL suffers from i) high computational cost due to a complex (deep) decoder, ii) non-parallelizable decoding that scales with the source sequence length, and iii) the need for training on a large amount of data. In this work, we propose a very efficient approach that parallelizes autoregressive linking across all potential mentions and relies on a shallow and efficient decoder. Moreover, we augment the generative objective with an extra discriminative component, i.e., a correction term which lets us directly optimize the generator's ranking. When taken together, these techniques tackle all the above issues: our model is >70 times faster and more accurate than the previous generative method, outperforming state-of-the-art approaches on the standard English dataset AIDA-CoNLL. Source code available at https://github.com/nicola-decao/efficient-autoregressive-EL
    Contrastive Out-of-Distribution Detection for Pretrained Transformers. (arXiv:2104.08812v2 [cs.CL] UPDATED)
    (2 min) Pretrained Transformers achieve remarkable performance when training and test data are from the same distribution. However, in real-world scenarios, the model often faces out-of-distribution (OOD) instances that can cause severe semantic shift problems at inference time. Therefore, in practice, a reliable model should identify such instances, and then either reject them during inference or pass them over to models that handle another distribution. In this paper, we develop an unsupervised OOD detection method, in which only the in-distribution (ID) data are used in training. We propose to fine-tune the Transformers with a contrastive loss, which improves the compactness of representations, such that OOD instances can be better differentiated from ID ones. These OOD instances can then be accurately detected using the Mahalanobis distance in the model's penultimate layer. We experiment with comprehensive settings and achieve near-perfect OOD detection performance, outperforming baselines drastically. We further investigate the rationales behind the improvement, finding that more compact representations through margin-based contrastive learning bring the improvement. We release our code to the community for future research.
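    A compact sketch of the Mahalanobis scoring step on synthetic penultimate-layer features (the contrastive fine-tuning that makes the representations compact is not shown): fit class-conditional means and a shared covariance on ID data, then score a test point by its minimum squared distance to any class mean. All data below are random stand-ins.
        import numpy as np

        rng = np.random.default_rng(0)

        # Stand-ins for penultimate-layer features of in-distribution (ID) training data.
        n_per_class, d, n_classes = 200, 16, 3
        means_true = rng.normal(scale=3.0, size=(n_classes, d))
        id_feats = np.concatenate([means_true[c] + rng.normal(size=(n_per_class, d))
                                   for c in range(n_classes)])
        id_labels = np.repeat(np.arange(n_classes), n_per_class)

        # Fit class means and a shared (tied) covariance on ID features.
        class_means = np.stack([id_feats[id_labels == c].mean(axis=0) for c in range(n_classes)])
        centered = id_feats - class_means[id_labels]
        cov = centered.T @ centered / len(id_feats)
        cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(d))

        def ood_score(x):
            """Minimum squared Mahalanobis distance to any class mean; larger = more OOD."""
            diffs = x - class_means                       # (n_classes, d)
            return np.min(np.einsum("cd,de,ce->c", diffs, cov_inv, diffs))

        id_test = means_true[0] + rng.normal(size=d)      # looks like class 0
        ood_test = rng.normal(scale=10.0, size=d)         # far from every class
        print("ID score: ", ood_score(id_test))
        print("OOD score:", ood_score(ood_test))          # expected to be much larger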
    Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi. (arXiv:2109.03552v1 [cs.CL])
    (2 min) The widespread presence of offensive language on social media has motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.
    RefineCap: Concept-Aware Refinement for Image Captioning. (arXiv:2109.03529v1 [cs.CL])
    (2 min) Automatically translating images to texts involves image scene understanding and language modeling. In this paper, we propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided visual semantics, and implicitly learns the mapping between visual tag words and images. The proposed Visual-Concept Refinement method can allow the generator to attend to semantic details in the image, thereby generating more semantically descriptive captions. Our model achieves superior performance on the MS-COCO dataset in comparison with previous visual-concept based models.
    Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion. (arXiv:2109.03551v1 [cs.SD])
    (2 min) Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. Its validity rests on the assumption that the same phonemes of the speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of EL speech can break this assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, as we assume that the lip movements of laryngectomees remain normal compared to those of healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method can significantly outperform audio-only alignment in terms of objective and subjective evaluations.
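    A plain DTW sketch over two random feature sequences; in the paper's setting the frames would be lip-image (or acoustic) representations rather than the random vectors used here.
        import numpy as np

        def dtw_path(src, tgt):
            """Classic O(N*M) dynamic time warping with Euclidean frame distance."""
            n, m = len(src), len(tgt)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = np.linalg.norm(src[i - 1] - tgt[j - 1])
                    cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            # backtrack to recover the alignment path
            path, i, j = [], n, m
            while i > 0 and j > 0:
                path.append((i - 1, j - 1))
                step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
                if step == 0:
                    i, j = i - 1, j - 1
                elif step == 1:
                    i -= 1
                else:
                    j -= 1
            return cost[n, m], path[::-1]

        rng = np.random.default_rng(0)
        src = rng.normal(size=(40, 8))    # e.g. per-frame features of the source utterance
        tgt = rng.normal(size=(55, 8))    # e.g. per-frame features of the target utterance
        total_cost, path = dtw_path(src, tgt)
        print(f"alignment cost {total_cost:.2f}, path length {len(path)}")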
    Sustainable Modular Debiasing of Language Models. (arXiv:2109.03646v1 [cs.CL])
    (2 min) Unfair stereotypical biases (e.g., gender, racial, or religious biases) encoded in modern pretrained language models (PLMs) have negative ethical implications for the widespread adoption of state-of-the-art language technology. To remedy this, a wide range of debiasing techniques have recently been introduced to remove such stereotypical biases from PLMs. Existing debiasing methods, however, directly modify all of the PLM's parameters, which -- besides being computationally expensive -- comes with the inherent risk of (catastrophic) forgetting of useful language knowledge acquired in pretraining. In this work, we propose a more sustainable modular debiasing approach based on dedicated debiasing adapters, dubbed ADELE. Concretely, we (1) inject adapter modules into the original PLM layers and (2) update only the adapters (i.e., we keep the original PLM parameters frozen) via language modeling training on a counterfactually augmented corpus. We showcase ADELE in gender debiasing of BERT: our extensive evaluation, encompassing three intrinsic and two extrinsic bias measures, renders ADELE very effective in bias mitigation. We further show that -- due to its modular nature -- ADELE, coupled with task adapters, retains fairness even after large-scale downstream training. Finally, by means of multilingual BERT, we successfully transfer ADELE to six target languages.
    Structural Adapters in Pretrained Language Models for AMR-to-text Generation. (arXiv:2103.09120v2 [cs.CL] UPDATED)
    (2 min) Pretrained language models (PLM) have recently advanced graph-to-text generation, where the input graph is linearized into a sequence and fed into the PLM to obtain its representation. However, efficiently encoding the graph structure in PLMs is challenging because such models were pretrained on natural language, and modeling structured data may lead to catastrophic forgetting of distributional knowledge. In this paper, we propose StructAdapt, an adapter method to encode graph structure into PLMs. Contrary to prior work, StructAdapt effectively models interactions among the nodes based on the graph connectivity, only training graph structure-aware adapter parameters. In this way, we incorporate task-specific knowledge while maintaining the topological structure of the graph. We empirically show the benefits of explicitly encoding graph structure into PLMs using StructAdapt, outperforming the state of the art on two AMR-to-text datasets, training only 5.1% of the PLM parameters.
    Forget me not: A Gentle Reminder to Mind the Simple Multi-Layer Perceptron Baseline for Text Classification. (arXiv:2109.03777v1 [cs.CL])
    (2 min) Graph neural networks have triggered a resurgence of graph-based text classification. We show that a simple MLP baseline already achieves comparable performance on benchmark datasets, questioning the importance of synthetic graph structures. When considering an inductive scenario, i.e., when adding new documents to a corpus, a simple MLP even outperforms most graph-based models. We further fine-tune DistilBERT for comparison and find that it outperforms all state-of-the-art models. We suggest that future studies use at least an MLP baseline to contextualize the results. We provide recommendations for the design and training of such a baseline.
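    A minimal TF-IDF plus MLP baseline of the kind the paper recommends reporting; the dataset, vectorizer settings and hidden size below are illustrative choices, not the paper's configuration.
        from sklearn.datasets import fetch_20newsgroups
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.metrics import accuracy_score

        # Illustrative corpus; any labeled text classification dataset works.
        train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
        test = fetch_20newsgroups(subset="test", categories=["sci.space", "rec.autos"])

        model = make_pipeline(
            TfidfVectorizer(max_features=30000, sublinear_tf=True),
            MLPClassifier(hidden_layer_sizes=(512,), max_iter=100, random_state=0),
        )
        model.fit(train.data, train.target)
        print("test accuracy:", accuracy_score(test.target, model.predict(test.data)))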
    Mixup Decoding for Diverse Machine Translation. (arXiv:2109.03402v1 [cs.CL])
    (2 min) Diverse machine translation aims at generating various target language translations for a given source language sentence. Leveraging the linear relationship in the sentence latent space introduced by the mixup training, we propose a novel method, MixDiversity, to generate different translations for the input sentence by linearly interpolating it with different sentence pairs sampled from the training corpus when decoding. To further improve the faithfulness and diversity of the translations, we propose two simple but effective approaches to select diverse sentence pairs in the training corpus and adjust the interpolation weight for each pair correspondingly. Moreover, by controlling the interpolation weight, our method can achieve the trade-off between faithfulness and diversity without any additional training, which is required in most of the previous methods. Experiments on WMT'16 en-ro, WMT'14 en-de, and WMT'17 zh-en are conducted to show that our method substantially outperforms all previous diverse machine translation methods.
    A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings. (arXiv:2001.06381v2 [cs.CL] UPDATED)
    (2 min) This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings. Our method integrates multiple word embeddings created from complementary techniques, textual sources, knowledge bases and languages. Existing word vectors are projected to a common semantic space using linear transformations and averaging. With our method the resulting meta-embeddings maintain the dimensionality of the original embeddings without losing information while dealing with the out-of-vocabulary problem. An extensive empirical evaluation demonstrates the effectiveness of our technique with respect to previous work on various intrinsic and extrinsic multilingual evaluations, obtaining competitive results for Semantic Textual Similarity and state-of-the-art performance for word similarity and POS tagging (English and Spanish). The resulting cross-lingual meta-embeddings also exhibit excellent cross-lingual transfer learning capabilities. In other words, we can leverage pre-trained source embeddings from a resource-rich language in order to improve the word representations for under-resourced languages.
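    A toy sketch of the projection-and-averaging idea: learn an orthogonal (Procrustes) map between two embedding spaces from shared-vocabulary anchors, then average the aligned vectors; the vocabulary, dimensionality and embeddings below are random stand-ins, not the paper's resources.
        import numpy as np

        rng = np.random.default_rng(0)
        vocab = ["dog", "cat", "car", "road", "tree"]
        d = 50

        # Two toy embedding spaces for the same vocabulary (e.g. different techniques or corpora).
        emb_a = {w: rng.normal(size=d) for w in vocab}
        emb_b = {w: rng.normal(size=d) for w in vocab}

        # Learn an orthogonal map from space B to space A via Procrustes on shared words.
        A = np.stack([emb_a[w] for w in vocab])
        B = np.stack([emb_b[w] for w in vocab])
        U, _, Vt = np.linalg.svd(B.T @ A)
        W = U @ Vt                                    # maps B-space vectors into A-space

        def meta_embedding(word):
            """Average of the A vector and the projected B vector in the common space."""
            return (emb_a[word] + emb_b[word] @ W) / 2.0

        print(meta_embedding("dog")[:5])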
    Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections. (arXiv:2104.04670v5 [cs.CL] UPDATED)
    (2 min) Large pre-trained language models (LMs) such as GPT-3 have acquired a surprising ability to perform zero-shot learning. For example, to classify sentiment without any training examples, we can "prompt" the LM with the review and the label description "Does the user like this movie?", and ask whether the next word is "yes" or "no". However, the next word prediction training objective is still misaligned with the target zero-shot learning objective. To address this weakness, we propose meta-tuning, which directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets. We focus on classification tasks, and construct the meta-dataset by aggregating 43 existing datasets and annotating 441 label descriptions in a question-answering (QA) format. When evaluated on unseen tasks, meta-tuned models outperform a same-sized QA model and the previous SOTA zero-shot learning system based on natural language inference. Additionally, increasing parameter count from 220M to 770M improves AUC-ROC scores by 6.3%, and we forecast that even larger models would perform better. Therefore, measuring zero-shot learning performance on language models out-of-the-box might underestimate their true potential, and community-wide efforts on aggregating datasets and unifying their formats can help build models that answer prompts better.
    Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis. (arXiv:2109.03439v1 [eess.AS])
    (2 min) Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at transferring a speaking style to the synthesised speech in a target speaker's voice. Most previous CSST approaches rely on expensive high-quality data carrying desired speaking style during training and require a reference utterance to obtain speaking style descriptors as conditioning on the generation of a new sentence. This work presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text. Referee is built by cascading a text-to-style (T2S) model with a style-to-wave (S2W) model. Phonetic PosteriorGram (PPG), phoneme-level pitch and energy contours are adopted as fine-grained speaking style descriptors, which are predicted from text using the T2S model. A novel pretrain-refinement method is adopted to learn a robust T2S model by only using readily accessible low-quality data. The S2W model is trained with high-quality target data, which is adopted to effectively aggregate style descriptors and generate high-fidelity speech in the target speaker's voice. Experimental results are presented, showing that Referee outperforms a global-style-token (GST)-based baseline approach in CSST.
    BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents. (arXiv:2108.04539v3 [cs.CL] UPDATED)
    (2 min) Understanding documents from their visual snapshots is an emerging problem that requires both advanced computer vision and NLP methods. The recent advance in OCR enables the accurate recognition of text blocks, yet it is still challenging to extract key information from documents due to the diversity of their layouts. Although recent studies on pre-trained language models show the importance of incorporating layout information on this task, the conjugation of texts and their layouts still follows the style of BERT optimized for understanding the 1D text. This implies there is room for further improvement considering the 2D nature of text layouts. This paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), which effectively utilizes the information included in individual text blocks and their layouts. Specifically, BROS encodes spatial information by utilizing relative positions and learns spatial dependencies between OCR blocks with a novel area-masking strategy. These two novel approaches lead to an efficient encoding of spatial layout information highlighted by the robust performance of BROS under low-resource environments. We also introduce a general-purpose parser that can be combined with BROS to extract key information even when there is no order information between text blocks. BROS shows its superiority on four public benchmarks -- FUNSD, SROIE*, CORD, and SciTSR -- and its robustness in practical cases where order information of text blocks is not available. Further experiments with a varying number of training examples demonstrate the high training efficiency of our approach. Our code will be open to the public.
    A Dual-Decoder Conformer for Multilingual Speech Recognition. (arXiv:2109.03277v1 [cs.CL])
    (2 min) Transformer-based models have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. This work proposes a dual-decoder transformer model for low-resource multilingual speech recognition for Indian languages. Our proposed model consists of a Conformer [1] encoder, two parallel transformer decoders, and a language classifier. We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict grapheme sequence along with language information. We consider phoneme recognition and language identification as auxiliary tasks in the multi-task learning framework. We jointly optimize the network for phoneme recognition, grapheme recognition, and language identification tasks with Joint CTC-Attention [2] training. Our experiments show that we can obtain a significant reduction in WER over the baseline approaches. We also show that our dual-decoder approach obtains significant improvement over the single decoder approach.
    Formal Query Building with Query Structure Prediction for Complex Question Answering over Knowledge Base. (arXiv:2109.03614v1 [cs.CL])
    (2 min) Formal query building is an important part of complex question answering over knowledge bases. It aims to build correct executable queries for questions. Recent methods try to rank candidate queries generated by a state-transition strategy. However, this candidate generation strategy ignores the structure of queries, resulting in a considerable number of noisy queries. In this paper, we propose a new formal query building approach that consists of two stages. In the first stage, we predict the query structure of the question and leverage the structure to constrain the generation of the candidate queries. We propose a novel graph generation framework to handle the structure prediction task and design an encoder-decoder model to predict the argument of the predetermined operation in each generative step. In the second stage, we follow the previous methods to rank the candidate queries. The experimental results show that our formal query building approach outperforms existing methods on complex questions while staying competitive on simple questions.
    Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach. (arXiv:2109.03645v1 [cs.CL])
    (2 min) In the context of neural machine translation, data augmentation (DA) techniques may be used for generating additional training samples when the available parallel data are scarce. Many DA approaches aim at expanding the support of the empirical data distribution by generating new sentence pairs that contain infrequent words, thus making it closer to the true data distribution of parallel sentences. In this paper, we propose to follow a completely different approach and present a multi-task DA approach in which we generate new sentence pairs with transformations, such as reversing the order of the target sentence, which produce unfluent target sentences. During training, these augmented sentences are used as auxiliary tasks in a multi-task framework with the aim of providing new contexts where the target prefix is not informative enough to predict the next word. This strengthens the encoder and forces the decoder to pay more attention to the source representations of the encoder. Experiments carried out on six low-resource translation tasks show consistent improvements over the baseline and over DA methods aiming at extending the support of the empirical data distribution. The systems trained with our approach rely more on the source tokens, are more robust against domain shift and suffer from fewer hallucinations.
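    A minimal sketch of the kind of augmentation described above, under the assumption that augmented pairs are marked with an auxiliary-task token so they can be distinguished from genuine translations (the token and function names are illustrative):

        def make_auxiliary_pairs(parallel_corpus, task_token="<rev>"):
            # Reverse the target word order to create a deliberately unfluent target,
            # one of the transformations mentioned in the abstract.
            augmented = []
            for src, tgt in parallel_corpus:
                reversed_tgt = " ".join(tgt.split()[::-1])
                augmented.append((task_token + " " + src, reversed_tgt))
            return augmented

        print(make_auxiliary_pairs([("la casa es azul", "the house is blue")]))
        # [('<rev> la casa es azul', 'blue is house the')]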
    Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario. (arXiv:2109.03570v1 [cs.CL])
    (2 min) This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices, such as masking at word and subword level, varying the vocabulary size and testing with domain data, looking for better language representations. Interestingly, in the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model suitable for real-world clinical data. We evaluated our models on Named Entity Recognition (NER) tasks for biomedical documents and challenging hospital discharge reports. When compared against the competitive mBERT and BETO models, we outperform them in all NER tasks by a significant margin. Finally, we studied the impact of the model's vocabulary on the NER performances by offering an interesting vocabulary-centric analysis. The results confirm that domain-specific pretraining is fundamental to achieving higher performances in downstream NER tasks, even within a mid-resource scenario. To the best of our knowledge, we provide the first biomedical and clinical transformer-based pretrained language models for Spanish, intending to boost native Spanish NLP applications in biomedicine. Our models will be made freely available after publication.
    Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation. (arXiv:2108.13134v2 [cs.CL] UPDATED)
    (2 min) Although significant progress has been achieved in text summarization, factual inconsistency in generated summaries still severely limits its practical applications. Among the key factors to ensure factual consistency, a reliable automatic evaluation metric is the first and the most crucial one. However, existing metrics either neglect the intrinsic cause of the factual inconsistency or rely on auxiliary tasks, leading to an unsatisfactory correlation with human judgments or increasing the inconvenience of usage in practice. In light of these challenges, we propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation, which formulates the causal relationship among the source document, the generated summary, and the language prior. We remove the effect of the language prior, which can cause factual inconsistency, from the total causal effect on the generated summary, which yields a simple yet effective way to evaluate consistency without relying on other auxiliary tasks. We conduct a series of experiments on three public abstractive text summarization datasets, and demonstrate the advantages of the proposed metric in both improving the correlation with human judgments and the convenience of usage. The source code is available at https://github.com/xieyxclack/factual_coco.
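    The abstract does not spell out the estimator; as a rough approximation of the idea (not the paper's metric), one can contrast the likelihood a seq2seq summarizer assigns to the summary given the source document with the likelihood it assigns given an empty source, treating the latter as the language-prior term:

        import torch

        @torch.no_grad()
        def counterfactual_consistency(model, tokenizer, document, summary):
            # Illustrative approximation: total effect minus language-prior effect,
            # estimated as log p(summary | document) - log p(summary | empty source).
            def summary_log_prob(source_text):
                enc = tokenizer(source_text, return_tensors="pt", truncation=True)
                labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids
                loss = model(**enc, labels=labels).loss      # mean token NLL of the summary
                return -loss.item() * labels.size(1)         # approximate sequence log-prob
            return summary_log_prob(document) - summary_log_prob("")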
    Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media. (arXiv:2104.08116v2 [cs.CL] UPDATED)
    (2 min) Language use differs between domains and even within a domain, language use changes over time. For pre-trained language models like BERT, domain adaptation through continued pre-training has been shown to improve performance on in-domain downstream tasks. In this article, we investigate whether temporal adaptation can bring additional benefits. For this purpose, we introduce a corpus of social media comments sampled over three years. It contains unlabelled data for adaptation and evaluation on an upstream masked language modelling task as well as labelled data for fine-tuning and evaluation on a downstream document classification task. We find that temporality matters for both tasks: temporal adaptation improves upstream and temporal fine-tuning downstream task performance. Time-specific models generally perform better on past than on future test sets, which matches evidence on the bursty usage of topical words. However, adapting BERT to time and domain does not improve performance on the downstream task over only adapting to domain. Token-level analysis shows that temporal adaptation captures event-driven changes in language use in the downstream task, but not those changes that are actually relevant to task performance. Based on our findings, we discuss when temporal adaptation may be more effective.
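    Temporal adaptation here amounts to continued masked-language-model pre-training on comments from a given time slice; a hedged sketch with the Hugging Face transformers and datasets libraries is shown below (the corpus file and hyperparameters are assumptions).

        from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                                  DataCollatorForLanguageModeling, Trainer, TrainingArguments)
        from datasets import load_dataset

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

        # Hypothetical file of unlabelled comments from one time slice
        ds = load_dataset("text", data_files={"train": "comments_2019.txt"})["train"]
        ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128), batched=True)

        collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="bert-temporal-2019", num_train_epochs=1,
                                   per_device_train_batch_size=32),
            train_dataset=ds,
            data_collator=collator,
        )
        trainer.train()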
    Learning from Noisy Labels for Entity-Centric Information Extraction. (arXiv:2104.08656v2 [cs.CL] UPDATED)
    (2 min) Recent information extraction approaches have relied on training deep neural models. However, such models can easily overfit noisy labels and suffer from performance degradation. While it is very costly to filter noisy labels in large learning resources, recent studies show that such labels take more training steps to be memorized and are more frequently forgotten than clean labels, therefore are identifiable in training. Motivated by such properties, we propose a simple co-regularization framework for entity-centric information extraction, which consists of several neural models with identical structures but different parameter initialization. These models are jointly optimized with the task-specific losses and are regularized to generate similar predictions based on an agreement loss, which prevents overfitting on noisy labels. Extensive experiments on two widely used but noisy benchmarks for information extraction, TACRED and CoNLL03, demonstrate the effectiveness of our framework. We release our code to the community for future research.
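    A minimal sketch of such a co-regularized objective, assuming the agreement term is a KL divergence pulling each peer model's prediction toward the (detached) mean prediction; the weighting is an illustrative hyperparameter.

        import torch
        import torch.nn.functional as F

        def co_regularized_loss(logits_list, labels, agreement_weight=1.0):
            # Task loss for each peer model (identical structures, different initializations)
            task_loss = sum(F.cross_entropy(l, labels) for l in logits_list)
            # Agreement loss: each model is regularized toward the aggregated soft prediction
            probs = [F.softmax(l, dim=-1) for l in logits_list]
            mean_prob = torch.stack(probs).mean(dim=0).detach()
            agree_loss = sum(F.kl_div(F.log_softmax(l, dim=-1), mean_prob, reduction="batchmean")
                             for l in logits_list)
            return task_loss + agreement_weight * agree_loss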
    DeepZensols: Deep Natural Language Processing Framework. (arXiv:2109.03383v1 [cs.CL])
    (2 min) Reproducing results in publications by distributing publicly available source code is becoming ever more popular. Given the difficulty of reproducing machine learning (ML) experiments, there have been significant efforts in reducing the variance of these results. As in any science, the ability to consistently reproduce results effectively strengthens the underlying hypothesis of the work, and thus, should be regarded as important as the novel aspect of the research itself. The contribution of this work is a framework that is able to reproduce consistent results and provides a means of easily creating, training, and evaluating natural language processing (NLP) deep learning (DL) models.
    Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition. (arXiv:2109.01163v2 [eess.AS] CROSS LISTED)
    (2 min) The recently proposed Conformer architecture has shown state-of-the-art performance in Automatic Speech Recognition by combining convolution with attention to model both local and global dependencies. In this paper, we study how to reduce the Conformer architecture complexity with a limited computing budget, leading to a more efficient architecture design that we call Efficient Conformer. We introduce progressive downsampling to the Conformer encoder and propose a novel attention mechanism named grouped attention, allowing us to reduce attention complexity from $O(n^{2}d)$ to $O(n^{2}d / g)$ for sequence length $n$, hidden dimension $d$ and group size parameter $g$. We also experiment with the use of strided multi-head self-attention as a global downsampling operation. Our experiments are performed on the LibriSpeech dataset with CTC and RNN-Transducer losses. We show that within the same computing budget, the proposed architecture achieves better performance with faster training and decoding compared to the Conformer. Our 13M-parameter CTC model achieves competitive WERs of 3.6%/9.0% without using a language model and 2.7%/6.7% with an external n-gram language model on the test-clean/test-other sets while being 29% faster than our CTC Conformer baseline at inference and 36% faster to train.
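    As a rough, simplified reading of the grouped-attention idea (not the authors' implementation), the module below folds g neighbouring frames into the feature dimension before standard multi-head self-attention and unfolds them afterwards, which is what cuts the quadratic term by a factor of g; it assumes the sequence length is divisible by g.

        import torch
        import torch.nn as nn

        class GroupedSelfAttention(nn.Module):
            # Simplified grouped attention: fold g adjacent frames into the channel axis,
            # attend over the shorter sequence, then unfold (illustrative sketch only).
            def __init__(self, dim, num_heads=4, group_size=3):
                super().__init__()
                self.g = group_size
                self.attn = nn.MultiheadAttention(dim * group_size, num_heads, batch_first=True)

            def forward(self, x):                       # x: (batch, seq_len, dim), seq_len % g == 0
                b, n, d = x.shape
                xg = x.reshape(b, n // self.g, self.g * d)
                out, _ = self.attn(xg, xg, xg)          # attention over n / g positions
                return out.reshape(b, n, d)

        x = torch.randn(2, 12, 64)
        print(GroupedSelfAttention(64, num_heads=4, group_size=3)(x).shape)  # torch.Size([2, 12, 64])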
    Text-Free Prosody-Aware Generative Spoken Language Modeling. (arXiv:2109.03264v1 [cs.CL])
    (2 min) Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm.
    Word Equations: Inherently Interpretable Sparse Word Embeddings through Sparse Coding. (arXiv:2004.13847v2 [cs.CL] UPDATED)
    (2 min) Word embeddings are a powerful natural language processing technique, but they are extremely difficult to interpret. To enable interpretable NLP models, we create vectors where each dimension is inherently interpretable. By inherently interpretable, we mean a system where each dimension is associated with some human-understandable hint that can describe the meaning of that dimension. In order to create more interpretable word embeddings, we transform pretrained dense word embeddings into sparse embeddings. These new embeddings are inherently interpretable: each of their dimensions is created from and represents a natural language word or specific grammatical concept. We construct these embeddings through sparse coding, where each vector in the basis set is itself a word embedding. Therefore, each dimension of our sparse vectors corresponds to a natural language word. We also show that models trained using these sparse embeddings can achieve good performance and are more interpretable in practice, including through human evaluations.
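    A small sketch of the sparse-coding step using scikit-learn's SparseCoder, where each dictionary atom stands for the dense embedding of one interpretable basis word; the basis words and random vectors below are placeholders for real pretrained embeddings.

        import numpy as np
        from sklearn.decomposition import SparseCoder

        # Hypothetical basis: each row is the dense embedding of one interpretable word
        basis_words = ["animal", "food", "vehicle", "emotion"]
        D = np.random.randn(len(basis_words), 300)             # stand-in for real embeddings
        D /= np.linalg.norm(D, axis=1, keepdims=True)          # normalise dictionary atoms

        coder = SparseCoder(dictionary=D, transform_algorithm="lasso_lars", transform_alpha=0.1)
        dense_vec = np.random.randn(1, 300)                    # dense embedding of a target word
        sparse_vec = coder.transform(dense_vec)                # one coefficient per basis word

        for word, weight in zip(basis_words, sparse_vec[0]):
            if weight != 0:
                print(word, round(float(weight), 3))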
    Spelling provides a precise (but sometimes misplaced) phonological target. Orthography and acoustic variability in second language word learning. (arXiv:2109.03490v1 [cs.CL])
    (2 min) L1 French participants learned novel L2 English words over two days of learning sessions, with half of the words presented with their orthographic forms (Audio-Ortho) and half without (Audio only). One group heard the words pronounced by a single talker, while another group heard them pronounced by multiple talkers. On the third day, they completed a variety of tasks to evaluate their learning. Our results show a robust influence of orthography, with faster response times in both production (picture naming) and recognition (picture mapping) tasks for words learned in the Audio-Ortho condition. Moreover, formant analyses of the picture naming responses show that orthographic input pulls pronunciations of novel English words towards a non-native (French) phonological target. Words learned with their orthographic forms were pronounced more precisely (with smaller Dispersion Scores), but were misplaced in the vowel space (as reflected by smaller Euclidean distances with respect to French vowels). Finally, we found only limited evidence of an effect of talker-based acoustic variability: novel words learned with multiple talkers showed faster response times in the picture naming task, but only in the Audio-only condition, which suggests that orthographic information may have overwhelmed any advantage of talker-based acoustic variability.
    Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models. (arXiv:2109.03415v1 [cs.CL])
    (2 min) Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise, which suggests that the visual context might not be exploited by the model at all. We hypothesize that this might be caused by the nature of the commonly used evaluation benchmark, also known as Multi30K, where the translations of image captions were prepared without actually showing the images to human translators. In this paper, we present a qualitative study that examines the role of datasets in stimulating the use of the visual modality, and we propose methods to highlight the importance of visual signals in the datasets, which demonstrably increase the models' reliance on the source images. Our findings suggest that research on effective MMT architectures is currently impaired by the lack of suitable datasets, and that careful consideration must be given to the creation of future MMT datasets, for which we also provide useful insights.
    NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task--Next Sentence Prediction. (arXiv:2109.03564v1 [cs.CL])
    (2 min) Using prompts to make language models perform various downstream tasks, also known as prompt-based learning or prompt-learning, has lately achieved significant success in comparison to the pre-train and fine-tune paradigm. Nonetheless, virtually all prompt-based methods are token-level, meaning they all utilize GPT's left-to-right language model or BERT's masked language model to perform cloze-style tasks. In this paper, we attempt to accomplish several NLP tasks in the zero-shot scenario using an original BERT pre-training task that was abandoned by RoBERTa and other models: Next Sentence Prediction (NSP). Unlike token-level techniques, our sentence-level prompt-based method NSP-BERT does not need to fix the length of the prompt or the position to be predicted, allowing it to handle tasks such as entity linking with ease. Based on the characteristics of NSP-BERT, we offer several templates that are quick to build for various downstream tasks. We suggest a two-stage prompt method for word sense disambiguation tasks in particular. Our strategies for mapping the labels significantly enhance the model's performance on sentence pair tasks. On the FewCLUE benchmark, our NSP-BERT outperforms other zero-shot methods on most of these tasks and comes close to the few-shot methods.
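    A minimal sketch of sentence-level NSP scoring with the Hugging Face transformers library: each candidate label is verbalised as a template sentence and scored by BERT's NSP head; the templates below are illustrative, not those used in the paper.

        import torch
        from transformers import BertTokenizer, BertForNextSentencePrediction

        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

        def nsp_zero_shot(text, label_templates):
            # Return the label whose template BERT judges most likely to follow the text.
            scores = {}
            for label, template in label_templates.items():
                enc = tokenizer(text, template, return_tensors="pt")
                with torch.no_grad():
                    logits = model(**enc).logits            # index 0 = "is next sentence"
                scores[label] = torch.softmax(logits, dim=-1)[0, 0].item()
            return max(scores, key=scores.get), scores

        label, scores = nsp_zero_shot("The plot was dull and the acting was worse.",
                                      {"positive": "This is a great movie.",
                                       "negative": "This is a terrible movie."})
        print(label)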
    On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings. (arXiv:2109.03334v1 [cs.CL])
    (2 min) Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these "multi-hop" explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show these evaluations substantially underestimate model performance, both in terms of the relevance of included facts, as well as the completeness of model-generated explanations, because models regularly discover and produce valid explanations that are different than gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations to standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and schemas), and empirically show that while expert-augmented ratings provide better estimates of explanation quality, both original (gold) and expert-augmented automatic evaluations still substantially underestimate performance by up to 36% when compared with full manual expert judgements, with different models being disproportionately affected. This poses a significant methodological challenge to accurately evaluating explanations produced by compositional reasoning models.
    It is AI's Turn to Ask Human a Question: Question and Answer Pair Generation for Children Storybooks in FairytaleQA Dataset. (arXiv:2109.03423v1 [cs.CL])
    (2 min) Existing question answering (QA) datasets are created mainly to enable AI to answer questions asked by humans. In educational applications, however, teachers and parents may not know which questions they should ask a child to maximize the child's language learning. With a newly released book QA dataset (FairytaleQA), which educational experts labeled on 46 fairytale storybooks for early childhood readers, we developed an automated QA generation model architecture for this novel application. Our model (1) extracts candidate answers from a given storybook passage through carefully designed heuristics based on a pedagogical framework; (2) generates appropriate questions corresponding to each extracted answer using a language model; and (3) uses another QA model to rank the top QA pairs. Automatic and human evaluations show that our model outperforms baselines. We also demonstrate that our method can help alleviate the scarcity of children's book QA data via data augmentation on 200 unlabeled storybooks.
    Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering. (arXiv:2109.03381v1 [cs.CL])
    (2 min) Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for the optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations in a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. Besides, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. By this means, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks.
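    As a hedged sketch of the contrastive stage, the snippet below shows a standard InfoNCE objective over two augmented views of each utterance (for example, after span deletion and span substitution); the temperature and the exact pairing scheme are assumptions.

        import torch
        import torch.nn.functional as F

        def info_nce(view_a, view_b, temperature=0.1):
            # Row i of view_a should match row i of view_b and no other row
            # (illustrative contrastive objective, not the paper's exact loss).
            a = F.normalize(view_a, dim=-1)
            b = F.normalize(view_b, dim=-1)
            logits = a @ b.t() / temperature            # (batch, batch) similarity matrix
            targets = torch.arange(a.size(0), device=a.device)
            return F.cross_entropy(logits, targets)

        reps_del = torch.randn(8, 256)   # utterance representations after span deletion
        reps_sub = torch.randn(8, 256)   # utterance representations after span substitution
        print(info_nce(reps_del, reps_sub))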
    JNLP Team: Deep Learning Approaches for Legal Processing Tasks in COLIEE 2021. (arXiv:2106.13405v2 [cs.CL] UPDATED)
    (2 min) COLIEE is an annual competition in automatic computerized legal text processing. Automatic legal document processing is an ambitious goal, and the structure and semantics of the law are often far more complex than those of everyday language. In this article, we survey and report our methods and experimental results in using deep learning in legal document processing. The results show both the difficulties and the potential of this family of approaches.
    Social Analysis of Young Basque Speaking Communities in Twitter. (arXiv:2109.03487v1 [cs.CY])
    (2 min) In this paper we take into account both social and linguistic aspects to perform demographic analysis by processing a large number of tweets in the Basque language. The study of demographic characteristics and social relationships is approached by applying machine learning and modern deep-learning Natural Language Processing (NLP) techniques, combining social sciences with automatic text processing. More specifically, our main objective is to combine demographic inference and social analysis in order to detect young Basque Twitter users and to identify the communities that arise from their relationships or shared content. This social and demographic analysis is entirely based on the automatically collected tweets, using NLP to convert unstructured textual information into interpretable knowledge.
  • cs.CV updates on arXiv.org

    Egocentric View Hand Action Recognition by Leveraging Hand Surface and Hand Grasp Type. (arXiv:2109.03783v1 [cs.CV])
    (2 min) We introduce a multi-stage framework that uses mean curvature on a hand surface and focuses on learning interaction between hand and object by analyzing hand grasp type for hand action recognition in egocentric videos. The proposed method does not require 3D information of objects including 6D object poses which are difficult to annotate for learning an object's behavior while it interacts with hands. Instead, the framework synthesizes the mean curvature of the hand mesh model to encode the hand surface geometry in 3D space. Additionally, our method learns the hand grasp type which is highly correlated with the hand action. From our experiment, we notice that using hand grasp type and mean curvature of hand increases the performance of the hand action recognition.
    CAGAN: Text-To-Image Generation with Combined Attention GANs. (arXiv:2104.12663v3 [cs.CV] UPDATED)
    (0 min) Generating images according to natural language descriptions is a challenging task. Prior research has mainly focused on enhancing the quality of generation by investigating the use of spatial and/or textual attention, thereby neglecting the relationship between channels. In this work, we propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions. The proposed CAGAN utilises two attention models: word attention to draw different sub-regions conditioned on related words; and squeeze-and-excitation attention to capture non-linear interaction among channels. With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the IS and FID on the CUB dataset and the FID on the more challenging COCO dataset. Furthermore, we demonstrate that judging a model by a single evaluation metric can be misleading by developing an additional model with local self-attention which scores a higher IS, outperforming the state of the art on the CUB dataset, but generates unrealistic images through feature repetition.
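    For reference, the squeeze-and-excitation channel attention mentioned above follows a standard formulation; the block below is that standard form in PyTorch, not necessarily the authors' exact module.

        import torch
        import torch.nn as nn

        class SqueezeExcitation(nn.Module):
            # Standard SE block: global-pool each channel ("squeeze"), then gate the
            # channels with a small bottleneck MLP ("excitation").
            def __init__(self, channels, reduction=16):
                super().__init__()
                self.fc = nn.Sequential(
                    nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                    nn.Linear(channels // reduction, channels), nn.Sigmoid(),
                )

            def forward(self, x):                    # x: (batch, channels, height, width)
                w = x.mean(dim=(2, 3))               # squeeze: per-channel global average
                w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
                return x * w                         # excitation: reweight the channels

        print(SqueezeExcitation(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])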
    fastMRI+: Clinical Pathology Annotations for Knee and Brain Fully Sampled Multi-Coil MRI Data. (arXiv:2109.03812v1 [eess.IV])
    (0 min) Improving speed and image quality of Magnetic Resonance Imaging (MRI) via novel reconstruction approaches remains one of the highest impact applications for deep learning in medical imaging. The fastMRI dataset, unique in that it contains large volumes of raw MRI data, has enabled significant advances in accelerating MRI using deep learning-based reconstruction methods. While the impact of the fastMRI dataset on the field of medical imaging is unquestioned, the dataset currently lacks clinical expert pathology annotations, critical to addressing clinically relevant reconstruction frameworks and exploring important questions regarding rendering of specific pathology using such novel approaches. This work introduces fastMRI+, which consists of 16154 subspecialist expert bounding box annotations and 13 study-level labels for 22 different pathology categories on the fastMRI knee dataset, and 7570 subspecialist expert bounding box annotations and 643 study-level labels for 30 different pathology categories for the fastMRI brain dataset. The fastMRI+ dataset is open access and aims to support further research and advancement of medical imaging in MRI reconstruction and beyond.
    End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net. (arXiv:2106.00952v2 [cs.CV] UPDATED)
    (0 min) Information extraction from document images has received a lot of attention recently, due to the need for digitizing a large volume of unstructured documents such as invoices, receipts, bank transfers, etc. In this paper, we propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document, namely the Multi-Stage Attentional U-Net. To effectively capture the textual and spatial relations between 2D elements, our model leverages a specialized multi-stage encoder-decoder design, in conjunction with efficient use of the self-attention mechanism and the box convolution. Experimental results on different datasets show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters. Moreover, it also significantly improves over the baseline in erroneous-OCR and limited-training-data scenarios, thus becoming practical for real-world applications.
    An Improved Iterative Neural Network for High-Quality Image-Domain Material Decomposition in Dual-Energy CT. (arXiv:2012.01986v2 [eess.IV] UPDATED)
    (0 min) Dual-energy computed tomography (DECT) has been widely used in many applications that need material decomposition. Image-domain methods directly decompose material images from high- and low-energy attenuation images, and thus, are susceptible to noise and artifacts on attenuation images. The purpose of this study is to develop an improved iterative neural network (INN) for high-quality image-domain material decomposition in DECT, and to study its properties. We propose a new INN architecture for DECT material decomposition. The proposed INN architecture uses a distinct cross-material convolutional neural network (CNN) in its image refining modules, and uses image decomposition physics in its image reconstruction modules. The distinct cross-material CNN refiners incorporate distinct encoding-decoding filters and a cross-material model that captures correlations between different materials. We study the distinct cross-material CNN refiner with patch-based reformulation and a tight-frame condition. Numerical experiments with the extended cardiac-torso (XCAT) phantom and clinical data show that the proposed INN significantly improves the image quality over several image-domain material decomposition methods, including a conventional model-based image decomposition (MBID) method using an edge-preserving regularizer, a recent MBID method using pre-learned material-wise sparsifying transforms, and a noniterative deep CNN method. Our study with patch-based reformulations reveals that learned filters of distinct cross-material CNN refiners can approximately satisfy the tight-frame condition.
    Scaling-up Disentanglement for Image Translation. (arXiv:2103.14017v2 [cs.CV] UPDATED)
    (0 min) Image translation methods typically aim to manipulate a set of labeled attributes (given as supervision at training time e.g. domain label) while leaving the unlabeled attributes intact. Current methods achieve either: (i) disentanglement, which exhibits low visual fidelity and can only be satisfied where the attributes are perfectly uncorrelated. (ii) visually-plausible translations, which are clearly not disentangled. In this work, we propose OverLORD, a single framework for disentangling labeled and unlabeled attributes as well as synthesizing high-fidelity images, which is composed of two stages; (i) Disentanglement: Learning disentangled representations with latent optimization. Differently from previous approaches, we do not rely on adversarial training or any architectural biases. (ii) Synthesis: Training feed-forward encoders for inferring the learned attributes and tuning the generator in an adversarial manner to increase the perceptual quality. When the labeled and unlabeled attributes are correlated, we model an additional representation that accounts for the correlated attributes and improves disentanglement. We highlight that our flexible framework covers multiple settings as disentangling labeled attributes, pose and appearance, localized concepts, and shape and texture. We present significantly better disentanglement with higher translation quality and greater output diversity than state-of-the-art methods.
    Rethinking the Aligned and Misaligned Features in One-stage Object Detection. (arXiv:2108.12176v2 [cs.CV] UPDATED)
    (0 min) One-stage object detectors rely on a point feature to predict the detection results. However, the point feature often lacks the information of the whole object, thereby leading to a misalignment between the object and the point feature. Meanwhile, the classification and regression tasks are sensitive to different object regions, but their features are spatially aligned. Both of these two problems hinder the detection performance. In order to solve these two problems, we propose a simple and plug-in operator that can generate aligned and disentangled features for each task, respectively, without breaking the fully convolutional manner. By predicting two task-aware point sets that are located in each sensitive region, the proposed operator can align the point feature with the object and disentangle the two tasks from the spatial dimension. We also reveal an interesting finding of the opposite effect of the long-range skip connection for classification and regression. On the basis of the Object-Aligned and Task-disentangled operator (OAT), we propose OAT-Net, which explicitly exploits point-set features for accurate detection results. Extensive experiments on the MS-COCO dataset show that OAT can consistently boost different state-of-the-art one-stage detectors by $\sim$2 AP. Notably, OAT-Net with Res2Net-101-DCN backbone achieves 53.7 AP on the COCO test-dev.
    Salient Object Detection via Integrity Learning. (arXiv:2101.07663v4 [cs.CV] UPDATED)
    (0 min) Although current salient object detection (SOD) works have achieved significant progress, they fall short when it comes to the integrity of the predicted salient regions. We define the concept of integrity at both the micro and macro level. Specifically, at the micro level, the model should highlight all parts that belong to a certain salient object, while at the macro level, the model needs to discover all salient objects from the given image scene. To facilitate integrity learning for salient object detection, we design a novel Integrity Cognition Network (ICON), which explores three important components to learn strong integrity features. 1) Unlike the existing models that focus more on feature discriminability, we introduce a diverse feature aggregation (DFA) component to aggregate features with various receptive fields (i.e., kernel shape and context) and increase the feature diversity. Such diversity is the foundation for mining the integral salient objects. 2) Based on the DFA features, we introduce the integrity channel enhancement (ICE) component with the goal of enhancing feature channels that highlight the integral salient objects at the macro level, while suppressing the other distracting ones. 3) After extracting the enhanced features, the part-whole verification (PWV) method is employed to determine whether the part and whole object features have strong agreement. Such part-whole agreements can further improve the micro-level integrity for each salient object. To demonstrate the effectiveness of ICON, comprehensive experiments are conducted on seven challenging benchmarks, where promising results are achieved.
    Lidar Point Cloud Guided Monocular 3D Object Detection. (arXiv:2104.09035v2 [cs.CV] UPDATED)
    (0 min) Monocular 3D detection currently struggles with much lower detection rates than LiDAR-based methods. The poor accuracy is mainly caused by the absence of accurate location cues due to the ill-posed nature of monocular imagery. LiDAR point clouds, which provide precise spatial measurements, can offer beneficial information for the training of monocular methods. To make use of LiDAR point clouds, prior works project them to form depth map labels, subsequently training a dense depth estimator to extract explicit location features. This indirect and complicated approach introduces intermediate products, i.e., depth map predictions, incurring substantial computation costs and leading to suboptimal performance. In this paper, we propose LPCG (LiDAR point cloud guided monocular 3D object detection), which is a general framework for guiding the training of monocular 3D detectors with LiDAR point clouds. Specifically, we use LiDAR point clouds to generate pseudo labels, allowing monocular 3D detectors to benefit from massive, easily collected unlabeled data. LPCG works well under both supervised and unsupervised setups. Thanks to a general design, LPCG can be plugged into any monocular 3D detector, significantly boosting the performance. As a result, we take first place on the KITTI monocular 3D/BEV (bird's-eye-view) detection benchmark by a considerable margin. The code will be made publicly available soon.
    Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification. (arXiv:2012.03173v2 [cs.LG] UPDATED)
    (0 min) Deep AUC Maximization (DAM) is a new paradigm for learning a deep neural network by maximizing the AUC score of the model on a dataset. Most previous works of AUC maximization focus on the perspective of optimization by designing efficient stochastic algorithms, and studies on generalization performance of large-scale DAM on difficult tasks are missing. In this work, we aim to make DAM more practical for interesting real-world applications (e.g., medical image classification). First, we propose a new margin-based min-max surrogate loss function for the AUC score (named as AUC min-max-margin loss or simply AUC margin loss for short). It is more robust than the commonly used AUC square loss, while enjoying the same advantage in terms of large-scale stochastic optimization. Second, we conduct extensive empirical studies of our DAM method on four difficult medical image classification tasks, namely (i) classification of chest x-ray images for identifying many threatening diseases, (ii) classification of images of skin lesions for identifying melanoma, (iii) classification of mammogram for breast cancer screening, and (iv) classification of microscopic images for identifying tumor tissue. Our studies demonstrate that the proposed DAM method improves the performance of optimizing cross-entropy loss by a large margin, and also achieves better performance than optimizing the existing AUC square loss on these medical image classification tasks. Specifically, our DAM method has achieved the 1st place on Stanford CheXpert competition on Aug. 31, 2020. To the best of our knowledge, this is the first work that makes DAM succeed on large-scale medical image datasets. We also conduct extensive ablation studies to demonstrate the advantages of the new AUC margin loss over the AUC square loss on benchmark datasets. The proposed method is implemented in our open-sourced library LibAUC (www.libauc.org).
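    The exact min-max margin loss is defined in the paper and implemented in the authors' LibAUC library; as a rough stand-in that conveys the idea of a margin-based AUC surrogate, the sketch below penalises positive-negative score pairs whose gap falls below a margin (this pairwise squared-hinge form is an assumption, not the paper's formulation).

        import torch

        def pairwise_auc_margin_loss(scores, labels, margin=1.0):
            # Penalise positive-negative score pairs whose gap is below the margin.
            pos = scores[labels == 1]
            neg = scores[labels == 0]
            if pos.numel() == 0 or neg.numel() == 0:
                return scores.sum() * 0.0                      # keep the graph, zero loss
            diff = pos.unsqueeze(1) - neg.unsqueeze(0)         # all positive-negative gaps
            return torch.clamp(margin - diff, min=0).pow(2).mean()

        scores = torch.tensor([0.9, 0.2, 0.7, 0.4], requires_grad=True)
        labels = torch.tensor([1, 0, 1, 0])
        print(pairwise_auc_margin_loss(scores, labels))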
    ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations. (arXiv:2107.14483v4 [cs.LG] UPDATED)
    (0 min) Object manipulation from 3D visual inputs poses many challenges for building generalizable perception and policy models. However, 3D assets in existing benchmarks mostly lack the diversity of 3D shapes that align with real-world intra-class complexity in topology and geometry. Here we propose the SAPIEN Manipulation Skill Benchmark (ManiSkill) to benchmark manipulation skills over diverse objects in a full-physics simulator. 3D assets in ManiSkill include large intra-class topological and geometric variations. Tasks are carefully chosen to cover distinct types of manipulation challenges. The latest progress in 3D vision also makes us believe that we should customize the benchmark so that the challenge is inviting to researchers working on 3D deep learning. To this end, we simulate a moving panoramic camera that returns ego-centric point clouds or RGB-D images. In addition, we would like ManiSkill to serve a broad set of researchers interested in manipulation research. Besides supporting the learning of policies from interactions, we also support learning-from-demonstrations (LfD) methods, by providing a large number of high-quality demonstrations (~36,000 successful trajectories, ~1.5M point cloud/RGB-D frames in total). We provide baselines using 3D deep learning and LfD algorithms. All code of our benchmark (simulator, environment, SDK, and baselines) is open-sourced, and a challenge open to interdisciplinary researchers will be held based on the benchmark.
    Point-Based Neural Rendering with Per-View Optimization. (arXiv:2109.02369v2 [cs.CV] UPDATED)
    (0 min) There has recently been great interest in neural rendering methods. Some approaches use 3D geometry reconstructed with Multi-View Stereo (MVS) but cannot recover from the errors of this process, while others directly learn a volumetric neural representation, but suffer from expensive training and inference. We introduce a general approach that is initialized with MVS, but allows further optimization of scene properties in the space of input views, including depth and reprojected features, resulting in improved novel-view synthesis. A key element of our approach is our new differentiable point-based pipeline, based on bi-directional Elliptical Weighted Average splatting, a probabilistic depth test and effective camera selection. We use these elements together in our neural renderer, that outperforms all previous methods both in quality and speed in almost all scenes we tested. Our pipeline can be applied to multi-view harmonization and stylization in addition to novel-view synthesis.
    Axial multi-layer perceptron architecture for automatic segmentation of choroid plexus in multiple sclerosis. (arXiv:2109.03778v1 [eess.IV])
    (0 min) Choroid plexuses (CP) are structures of the ventricles of the brain which produce most of the cerebrospinal fluid (CSF). Several postmortem and in vivo studies have pointed towards their role in the inflammatory process in multiple sclerosis (MS). Automatic segmentation of CP from MRI thus has high value for studying their characteristics in large cohorts of patients. To the best of our knowledge, the only freely available tool for CP segmentation is FreeSurfer but its accuracy for this specific structure is poor. In this paper, we propose to automatically segment CP from non-contrast-enhanced T1-weighted MRI. To that end, we introduce a new model called "Axial-MLP" based on an assembly of axial multi-layer perceptrons (MLPs). This is inspired by recent works which showed that the self-attention layers of Transformers can be replaced with MLPs. This approach is systematically compared with a standard 3D U-Net, nnU-Net, FreeSurfer and FastSurfer. For our experiments, we make use of a dataset of 141 subjects (44 controls and 97 patients with MS). We show that all the tested deep learning (DL) methods outperform FreeSurfer (Dice around 0.7 for DL vs 0.33 for FreeSurfer). Axial-MLP is competitive with U-Nets even though it is slightly less accurate. The conclusions of our paper are two-fold: 1) the studied deep learning methods could be useful tools to study CP in large cohorts of MS patients; 2) Axial-MLP is a potentially viable alternative to convolutional neural networks for such tasks, although it could benefit from further improvements.
    Weakly supervised semantic segmentation of tomographic images in the diagnosis of stroke. (arXiv:2109.01887v1 [eess.IV] CROSS LISTED)
    (0 min) This paper presents an automatic algorithm for the segmentation of areas affected by an acute stroke on the non-contrast computed tomography brain images. The proposed algorithm is designed for learning in a weakly supervised scenario when some images are labeled accurately, and some images are labeled inaccurately. Wrong labels appear as a result of inaccuracy made by a radiologist in the process of manual annotation of computed tomography images. We propose methods for solving the segmentation problem in the case of inaccurately labeled training data. We use the U-Net neural network architecture with several modifications. Experiments on real computed tomography scans show that the proposed methods increase the segmentation accuracy.
    On Recognizing Occluded Faces in the Wild. (arXiv:2109.03672v1 [cs.CV])
    (0 min) Facial appearance variations due to occlusion have been one of the main challenges for face recognition systems. To facilitate further research in this area, it is necessary and important to have occluded face datasets collected from the real world, as synthetically generated occluded faces cannot represent the nature of the problem. In this paper, we present the Real World Occluded Faces (ROF) dataset, which contains faces with both upper face occlusion, due to sunglasses, and lower face occlusion, due to masks. We propose two evaluation protocols for this dataset. Benchmark experiments on the dataset have shown that no matter how powerful the deep face representation models are, their performance degrades significantly when they are tested on real-world occluded faces. It is observed that the performance drop is far less when the models are tested on synthetically generated occluded faces. The ROF dataset and the associated evaluation protocols are publicly available at the following link: https://github.com/ekremerakin/RealWorldOccludedFaces.
    Plant Disease Detection Using Image Processing and Machine Learning. (arXiv:2106.10698v2 [cs.CV] UPDATED)
    (0 min) One of the most important and tedious tasks in agricultural practice is the detection of disease on crops. It requires a great deal of time as well as skilled labor. This paper proposes a smart and efficient technique for crop disease detection that uses computer vision and machine learning techniques. The proposed system is able to detect 20 different diseases of 5 common plants with 93% accuracy.
    Deriving Explanation of Deep Visual Saliency Models. (arXiv:2109.03575v1 [cs.CV])
    (0 min) Deep neural networks have shown their profound impact on achieving human level performance in visual saliency prediction. However, it is still unclear how they learn the task and what it means in terms of understanding human visual system. In this work, we develop a technique to derive explainable saliency models from their corresponding deep neural architecture based saliency models by applying human perception theories and the conventional concepts of saliency. This technique helps us understand the learning pattern of the deep network at its intermediate layers through their activation maps. Initially, we consider two state-of-the-art deep saliency models, namely UNISAL and MSI-Net for our interpretation. We use a set of biologically plausible log-gabor filters for identifying and reconstructing the activation maps of them using our explainable saliency model. The final saliency map is generated using these reconstructed activation maps. We also build our own deep saliency model named cross-concatenated multi-scale residual block based network (CMRNet) for saliency prediction. Then, we evaluate and compare the performance of the explainable models derived from UNISAL, MSI-Net and CMRNet on three benchmark datasets with other state-of-the-art methods. Hence, we propose that this approach of explainability can be applied to any deep visual saliency model for interpretation which makes it a generic one.
    Anatomical-Guided Attention Enhances Unsupervised PET Image Denoising Performance. (arXiv:2109.00802v2 [physics.med-ph] UPDATED)
    (0 min) Although supervised convolutional neural networks (CNNs) often outperform conventional alternatives for denoising positron emission tomography (PET) images, they require many low- and high-quality reference PET image pairs. Herein, we propose an unsupervised 3D PET image denoising method based on an anatomical information-guided attention mechanism. The proposed magnetic resonance-guided deep decoder (MR-GDD) utilizes the spatial details and semantic features of MR-guidance image more effectively by introducing encoder-decoder and deep decoder subnetworks. Moreover, the specific shapes and patterns of the guidance image do not affect the denoised PET image, because the guidance image is input to the network through an attention gate. In a Monte Carlo simulation of [$^{18}$F]fluoro-2-deoxy-D-glucose (FDG), the proposed method achieved the highest peak signal-to-noise ratio and structural similarity (27.92 $\pm$ 0.44 dB/0.886 $\pm$ 0.007), as compared with Gaussian filtering (26.68 $\pm$ 0.10 dB/0.807 $\pm$ 0.004), image guided filtering (27.40 $\pm$ 0.11 dB/0.849 $\pm$ 0.003), deep image prior (DIP) (24.22 $\pm$ 0.43 dB/0.737 $\pm$ 0.017), and MR-DIP (27.65 $\pm$ 0.42 dB/0.879 $\pm$ 0.007). Furthermore, we experimentally visualized the behavior of the optimization process, which is often unknown in unsupervised CNN-based restoration problems. For preclinical (using [$^{18}$F]FDG and [$^{11}$C]raclopride) and clinical (using [$^{18}$F]florbetapir) studies, the proposed method demonstrates state-of-the-art denoising performance while retaining spatial resolution and quantitative accuracy, despite using a common network architecture for various noisy PET images with 1/10th of the full counts. These results suggest that the proposed MR-GDD can reduce PET scan times and PET tracer doses considerably without impacting patients.
    LiDARTouch: Monocular metric depth estimation with a few-beam LiDAR. (arXiv:2109.03569v1 [cs.CV])
    (0 min) Vision-based depth estimation is a key feature in autonomous systems, which often rely on a single camera or several independent ones. In such a monocular setup, dense depth is obtained with either additional input from one or several expensive LiDARs, e.g., with 64 beams, or camera-only methods, which suffer from scale-ambiguity and infinite-depth problems. In this paper, we propose a new alternative for densely estimating metric depth by combining a monocular camera with a light-weight LiDAR, e.g., with 4 beams, typical of today's automotive-grade mass-produced laser scanners. Inspired by recent self-supervised methods, we introduce a novel framework, called LiDARTouch, to estimate dense depth maps from monocular images with the help of "touches" of LiDAR, i.e., without the need for dense ground-truth depth. In our setup, the minimal LiDAR input contributes on three different levels: as an additional input to the model, in a self-supervised LiDAR reconstruction objective function, and to estimate changes of pose (a key component of self-supervised depth estimation architectures). Our LiDARTouch framework achieves new state of the art in self-supervised depth estimation on the KITTI dataset, thus supporting our choices of integrating the very sparse LiDAR signal with other visual features. Moreover, we show that the use of a few-beam LiDAR alleviates scale ambiguity and infinite-depth issues that camera-only methods suffer from. We also demonstrate that methods from the fully-supervised depth-completion literature can be adapted to a self-supervised regime with a minimal LiDAR signal.
    Digging into Uncertainty in Self-supervised Multi-view Stereo. (arXiv:2108.12966v2 [cs.CV] UPDATED)
    (0 min) Self-supervised multi-view stereo (MVS) with a pretext task of image reconstruction has achieved significant progress recently. However, previous methods are built upon intuitions, lacking comprehensive explanations about the effectiveness of the pretext task in self-supervised MVS. To this end, we propose to estimate epistemic uncertainty in self-supervised MVS, accounting for what the model ignores. Specifically, the limitations can be categorized into two types: ambiguous supervision in the foreground and invalid supervision in the background. To address these issues, we propose a novel Uncertainty reduction Multi-view Stereo (U-MVS) framework for self-supervised learning. To alleviate ambiguous supervision in the foreground, we incorporate an extra correspondence prior with a flow-depth consistency loss. The dense 2D correspondence of optical flows is used to regularize the 3D stereo correspondence in MVS. To handle the invalid supervision in the background, we use Monte-Carlo Dropout to acquire the uncertainty map and further filter the unreliable supervision signals in invalid regions. Extensive experiments on the DTU and Tanks & Temples benchmarks show that our U-MVS framework achieves the best performance among unsupervised MVS methods, with performance competitive with its supervised counterparts.
    Shuffled Patch-Wise Supervision for Presentation Attack Detection. (arXiv:2109.03484v1 [cs.CV])
    (0 min) Face anti-spoofing is essential to prevent false facial verification by using a photo, video, mask, or a different substitute for an authorized person's face. Most of the state-of-the-art presentation attack detection (PAD) systems suffer from overfitting, where they achieve near-perfect scores on a single dataset but fail on a different dataset with more realistic data. This problem drives researchers to develop models that perform well under real-world conditions. This is an especially challenging problem for frame-based presentation attack detection systems that use convolutional neural networks (CNN). To this end, we propose a new PAD approach, which combines pixel-wise binary supervision with a patch-based CNN. We believe that training a CNN with face patches allows the model to distinguish spoofs without learning background or dataset-specific traces. We tested the proposed method both on the standard benchmark datasets -- Replay-Mobile, OULU-NPU -- and on a real-world dataset. The proposed approach shows its superiority on challenging experimental setups. Namely, it achieves higher performance on OULU-NPU protocols 3 and 4 and on inter-dataset real-world experiments.
    Suppress-and-Refine Framework for End-to-End 3D Object Detection. (arXiv:2103.10042v2 [cs.CV] UPDATED)
    (0 min) 3D object detector based on Hough voting achieves great success and derives many follow-up works. Despite constantly refreshing the detection accuracy, these works suffer from handcrafted components used to eliminate redundant boxes, and thus are non-end-to-end and time-consuming. In this work, we propose a suppress-and-refine framework to remove these handcrafted components. To fully utilize full-resolution information and achieve real-time speed, it directly consumes feature points and redundant 3D proposals. Specifically, it first suppresses noisy 3D feature points and then feeds them to 3D proposals for the following RoI-aware refinement. With the gating mechanism to build fine proposal features and the self-attention mechanism to model relationships, our method can produce high-quality predictions with a small computation budget in an end-to-end manner. To this end, we present the first fully end-to-end 3D detector, SRDet, on the basis of VoteNet. It achieves state-of-the-art performance on the challenging ScanNetV2 and SUN RGB-D datasets with the fastest speed ever. Our code will be available at https://github.com/ZJULearning/SRDet.
    Image-based Plant Disease Diagnosis with Unsupervised Anomaly Detection Based on Reconstructability of Colors. (arXiv:2011.14306v5 [cs.CV] UPDATED)
    (0 min) This paper proposes an unsupervised anomaly detection technique for image-based plant disease diagnosis. The construction of large and publicly available datasets containing labeled images of healthy and diseased crop plants has led to growing interest in computer vision techniques for automatic plant disease diagnosis. Although supervised image classifiers based on deep learning can be a powerful tool for plant disease diagnosis, they require a huge amount of labeled data. The data mining technique of anomaly detection includes unsupervised approaches that do not require rare samples for training classifiers. We propose an unsupervised anomaly detection technique for image-based plant disease diagnosis that is based on the reconstructability of colors; a deep encoder-decoder network trained to reconstruct the colors of healthy plant images should fail to reconstruct the colors of symptomatic regions. Our proposed method includes a new image-based framework for plant disease detection that utilizes a conditional adversarial network called pix2pix and a new anomaly score based on the CIEDE2000 color difference. Experiments with the PlantVillage dataset demonstrated the superiority of our proposed method compared to an existing anomaly detector at identifying diseased crop images in terms of accuracy, interpretability and computational efficiency.
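    A small sketch of how a CIEDE2000-based anomaly score could be computed from an input image and its reconstruction (the reconstruction would come from the trained pix2pix model); the aggregation over the most-changed pixels is an assumption, not the paper's exact score.

        import numpy as np
        from skimage.color import rgb2lab, deltaE_ciede2000

        def color_anomaly_score(original_rgb, reconstructed_rgb, top_fraction=0.05):
            # Mean CIEDE2000 difference over the most-changed pixels; large values suggest
            # regions whose colour the healthy-image model could not reconstruct.
            delta = deltaE_ciede2000(rgb2lab(original_rgb), rgb2lab(reconstructed_rgb))
            k = max(1, int(top_fraction * delta.size))
            return float(np.sort(delta.ravel())[-k:].mean())

        img = np.random.rand(64, 64, 3)
        noisy = np.clip(img + 0.1 * np.random.randn(64, 64, 3), 0, 1)
        print(color_anomaly_score(img, noisy))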
    Tactile Image-to-Image Disentanglement of Contact Geometry from Motion-Induced Shear. (arXiv:2109.03615v1 [cs.RO])
    (0 min) Robotic touch, particularly when using soft optical tactile sensors, suffers from distortion caused by motion-dependent shear. The manner in which the sensor contacts a stimulus is entangled with the tactile information about the geometry of the stimulus. In this work, we propose a supervised convolutional deep neural network model that learns to disentangle, in the latent space, the components of sensor deformations caused by contact geometry from those due to sliding-induced shear. The approach is validated by reconstructing unsheared tactile images from sheared images and showing they match unsheared tactile images collected with no sliding motion. In addition, the unsheared tactile images give a faithful reconstruction of the contact geometry that is not possible from the sheared data, and robust estimation of the contact pose that can be used for servo control sliding around various 2D shapes. Finally, the contact geometry reconstruction in conjunction with servo control sliding were used for faithful full object reconstruction of various 2D shapes. The methods have broad applicability to deep learning models for robots with a shear-sensitive sense of touch.
    Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection. (arXiv:2109.03426v1 [cs.CV])
    (0 min) Owing to its great success in object detection and instance segmentation, Mask R-CNN has attracted considerable attention and is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting. However, two issues remain to be settled. The first is the dense text case, which is easily neglected but common in practice. There may exist multiple instances in one proposal, which makes it difficult for the mask head to distinguish different instances and degrades the performance. In this work, we argue that the performance degradation results from the learning confusion issue in the mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in the mask head, which alleviates the issue and promotes robustness significantly. We also propose instance-aware mask learning, in which the mask head learns to predict the shape of the whole instance rather than classify each pixel as text or non-text. With instance-aware mask learning, the mask branch can learn separated and compact masks. The second issue is that, due to large variations in scale and aspect ratio, RPN needs complicated anchor settings, making it hard to maintain and transfer across different datasets. To settle this issue, we propose an adaptive label assignment in which all instances, especially those with extreme aspect ratios, are guaranteed to be associated with enough anchors. Equipped with these components, the proposed method named MAYOR achieves state-of-the-art performance on five benchmarks including DAST1500, MSRA-TD500, ICDAR2015, CTW1500, and Total-Text.
    Unsupervised clothing change adaptive person ReID. (arXiv:2109.03702v1 [cs.CV])
    (0 min) Clothing changes and the lack of data labels are both crucial challenges in person ReID. For the former challenge, a person may appear multiple times at different locations wearing different clothing. However, most current person ReID research focuses on benchmarks in which a person's clothing is kept the same all the time. For the latter challenge, some researchers try to make a model learn from a labeled source dataset and transfer to an unlabeled target dataset, whereas purely unsupervised training is less explored. In this paper, we aim to solve both problems at the same time. We design a novel unsupervised model, Sync-Person-Cloud ReID, to solve the unsupervised clothing change person ReID problem. We develop a purely unsupervised clothing change person ReID pipeline with a person sync augmentation operation and a same-person feature restriction. The person sync augmentation supplies additional resources of the same person, which can be used as partly supervised input by the same-person feature restriction. Extensive experiments on clothing change ReID datasets demonstrate the strong performance of our method.
    Digitize-PID: Automatic Digitization of Piping and Instrumentation Diagrams. (arXiv:2109.03794v1 [cs.CV])
    (0 min) Digitization of scanned Piping and Instrumentation diagrams (P&ID), widely used in manufacturing and mechanical industries such as oil and gas over several decades, has become a critical bottleneck in dynamic inventory management and in the creation of smart P&IDs that are compatible with the latest CAD tools. Historically, P&ID sheets have been manually generated at the design stage, before being scanned and stored as PDFs. Current digitization initiatives involve manual processing and are consequently very time consuming, labour intensive and error-prone. Thanks to advances in image processing, machine learning and deep learning techniques, there are emerging works on P&ID digitization. However, existing solutions face several challenges owing to the variation in scale, size and noise in the P&IDs, the sheer complexity and crowdedness within drawings, and the domain knowledge required to interpret the drawings. This motivates our current solution, Digitize-PID, which comprises an end-to-end pipeline for detection of core components from P&IDs like pipes, symbols and textual information, followed by their association with each other and, eventually, the validation and correction of output data based on inherent domain knowledge. A novel and efficient kernel-based line detection and a two-step method for detection of complex symbols based on a fine-grained deep recognition technique are presented in the paper. In addition, we have created an annotated synthetic dataset, Dataset-P&ID, of 500 P&IDs by incorporating different types of noise and complex symbols, which is made available for public use (currently there exists no public P&ID dataset). We evaluate our proposed method on this synthetic dataset and a real-world anonymized private dataset of 12 P&ID sheets. Results show that Digitize-PID outperforms the existing state-of-the-art for P&ID digitization.
    Melatect: A Machine Learning Model Approach For Identifying Malignant Melanoma in Skin Growths. (arXiv:2109.03310v1 [cs.CV])
    (0 min) Malignant melanoma is a common skin cancer that is mostly curable before metastasis, where melanoma growths spawn in organs away from the original site. Melanoma is the most dangerous type of skin cancer if left untreated due to the high chance of metastasis. This paper presents Melatect, a machine learning model that identifies potential malignant melanoma. A recursive computer image analysis algorithm was used to create a machine learning model which is capable of detecting likely melanoma. The comparison is performed using 20,000 raw images of benign and malignant lesions from the International Skin Imaging Collaboration (ISIC) archive that were augmented to 60,000 images. Tests of the algorithm using subsets of the ISIC images suggest it accurately classifies lesions as malignant or benign over 95% of the time with no apparent bias or overfitting. The Melatect iOS app was later created (unpublished), in which the machine learning model was embedded. With the app, users have the ability to take pictures of skin lesions (moles) using the app, which are then processed through the machine learning model, and users are notified whether their lesion could be abnormal or not. Melatect provides a convenient way to get free advice on lesions and track these lesions over time.
    Self supervised contrastive learning for digital histopathology. (arXiv:2011.13971v2 [eess.IV] UPDATED)
    (0 min) Unsupervised learning has been a long-standing goal of machine learning and is especially important for medical image analysis, where the learning can compensate for the scarcity of labeled datasets. A promising subclass of unsupervised learning is self-supervised learning, which aims to learn salient features using the raw input as the learning signal. In this paper, we use a contrastive self-supervised learning method called SimCLR that achieved state-of-the-art results on natural-scene images and apply this method to digital histopathology by collecting and pretraining on 57 histopathology datasets without any labels. We find that combining multiple multi-organ datasets with different types of staining and resolution properties improves the quality of the learned features. Furthermore, we find using more images for pretraining leads to a better performance in multiple downstream tasks. Linear classifiers trained on top of the learned features show that networks pretrained on digital histopathology datasets perform better than ImageNet pretrained networks, boosting task performances by more than 28% in F1 scores on average. These findings may also be useful when applying newer contrastive techniques to histopathology data. Pretrained PyTorch models are made publicly available at https://github.com/ozanciga/self-supervised-histopathology.
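    For readers unfamiliar with SimCLR's objective, a generic implementation of the normalized temperature-scaled cross-entropy (NT-Xent) loss over a batch of paired augmented views is sketched below; this is a standard textbook formulation, not the authors' released code, and the temperature value is arbitrary.

        import torch
        import torch.nn.functional as F

        def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
            # z1, z2: (N, D) projection-head outputs for two augmented views of the same N images.
            n = z1.size(0)
            z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D) unit-norm embeddings
            sim = z @ z.t() / temperature                        # (2N, 2N) scaled cosine similarities
            sim.fill_diagonal_(float("-inf"))                    # a sample is never its own positive
            # The positive of sample i is i + N, and vice versa.
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
            return F.cross_entropy(sim, targets)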
    Identification of Social-Media Platform of Videos through the Use of Shared Features. (arXiv:2109.03598v1 [cs.CV])
    (0 min) Videos have become a powerful tool for spreading illegal content such as military propaganda, revenge porn, or bullying through social networks. To counter these illegal activities, it has become essential to try new methods to verify the origin of videos from these platforms. However, collecting datasets large enough to train neural networks for this task has become difficult because of the privacy regulations that have been enacted in recent years. To mitigate this limitation, in this work we propose two different solutions based on transfer learning and multitask learning to determine whether a video has been uploaded from or downloaded to a specific social platform through the use of shared features with images trained on the same task. By transferring features from the shallowest to the deepest levels of the network from the image task to videos, we measure the amount of information shared between these two tasks. Then, we introduce a model based on multitask learning, which learns from both tasks simultaneously. The promising experimental results show, in particular, the effectiveness of the multitask approach. According to our knowledge, this is the first work that addresses the problem of social media platform identification of videos through the use of shared features.
    The Animation Transformer: Visual Correspondence via Segment Matching. (arXiv:2109.02614v2 [cs.CV] UPDATED)
    (0 min) Visual correspondence is a fundamental building block on the way to building assistive tools for hand-drawn animation. However, while a large body of work has focused on learning visual correspondences at the pixel-level, few approaches have emerged to learn correspondence at the level of line enclosures (segments) that naturally occur in hand-drawn animation. Exploiting this structure in animation has numerous benefits: it avoids the intractable memory complexity of attending to individual pixels in high resolution images and enables the use of real-world animation datasets that contain correspondence information at the level of per-segment colors. To that end, we propose the Animation Transformer (AnT) which uses a transformer-based architecture to learn the spatial and visual relationships between segments across a sequence of images. AnT enables practical ML-assisted colorization for professional animation workflows and is publicly accessible as a creative tool in Cadmium.
    Cross-Site Severity Assessment of COVID-19 from CT Images via Domain Adaptation. (arXiv:2109.03478v1 [eess.IV])
    (0 min) Early and accurate severity assessment of Coronavirus disease 2019 (COVID-19) based on computed tomography (CT) images greatly helps in estimating intensive care unit events and in making clinical treatment-planning decisions. To augment the labeled data and improve the generalization ability of the classification model, it is necessary to aggregate data from multiple sites. This task faces several challenges including class imbalance between mild and severe infections, domain distribution discrepancy between sites, and the presence of heterogeneous features. In this paper, we propose a novel domain adaptation (DA) method with two components to address these problems. The first component is a stochastic class-balanced boosting sampling strategy that overcomes the imbalanced learning problem and improves the classification performance on poorly-predicted classes. The second component is a representation learning scheme that guarantees three properties: 1) domain transferability via a prototype triplet loss, 2) discriminability via a conditional maximum mean discrepancy loss, and 3) completeness via a multi-view reconstruction loss. In particular, we propose a domain translator and align the heterogeneous data to the estimated class prototypes (i.e., class centers) in a hyper-sphere manifold. Experiments on cross-site severity assessment of COVID-19 from CT images show that the proposed method can effectively tackle the imbalanced learning problem and outperform recent DA approaches.
    Learning Local-Global Contextual Adaptation for Fully End-to-End Bottom-Up Human Pose Estimation. (arXiv:2109.03622v1 [cs.CV])
    (0 min) This paper presents a method of learning Local-GlObal Contextual Adaptation for fully end-to-end and fast bottom-up human Pose estimation, dubbed LOGO-CAP. It is built on the conceptually simple center-offset formulation, which on its own lacks sufficient accuracy for pose estimation. Revisiting bottom-up human pose estimation in the spirit of "Thinking, Fast and Slow" by D. Kahneman, we introduce a "slow keypointer" to remedy the lack of sufficient accuracy of the "fast keypointer". In learning the "slow keypointer", the proposed LOGO-CAP lifts the initial "fast" keypoints by offset predictions to keypoint expansion maps (KEMs) to counter their uncertainty in two modules. Firstly, the local KEMs (e.g., 11x11) are extracted from a low-dimensional feature map. A proposed convolutional message passing module learns to "re-focus" the local KEMs to the keypoint attraction maps (KAMs) by accounting for the structured output prediction nature of human pose estimation, which is directly supervised by the object keypoint similarity (OKS) loss in training. Secondly, the global KEMs are extracted, with a sufficiently large region-of-interest (e.g., 97x97), from the keypoint heatmaps that are computed by a direct map-to-map regression. Then, a local-global contextual adaptation module is proposed to convolve the global KEMs using the learned KAMs as the kernels. This convolution can be understood as learnable offset-guided deformable and dynamic convolution in a pose-sensitive way. The proposed method is end-to-end trainable with near real-time inference speed, obtaining state-of-the-art performance on the COCO keypoint benchmark for bottom-up human pose estimation. With the COCO trained model, our LOGO-CAP also outperforms prior arts by a large margin on the challenging OCHuman dataset.
    Real-World Adversarial Examples involving Makeup Application. (arXiv:2109.03329v1 [cs.CR])
    (0 min) Deep neural networks have developed rapidly and have achieved outstanding performance in several tasks, such as image classification and natural language processing. However, recent studies have indicated that both digital and physical adversarial examples can fool neural networks. Face-recognition systems are used in various applications that involve security threats from physical adversarial examples. Herein, we propose a physical adversarial attack with the use of full-face makeup. The presence of makeup on a human face is a reasonable possibility, which increases the imperceptibility of attacks. In our attack framework, we combine a cycle-consistent generative adversarial network (CycleGAN) and a victimized classifier. The CycleGAN is used to generate adversarial makeup, and the architecture of the victimized classifier is VGG 16. Our experimental results show that our attack can effectively overcome manual errors in makeup application, such as color and position-related errors. We also demonstrate that the approaches used to train the models can influence physical attacks; the adversarial perturbations crafted from the pre-trained model are affected by the corresponding training data.
    Transformers in Vision: A Survey. (arXiv:2101.01169v3 [cs.CV] UPDATED)
    (0 min) Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.
    Adaptive Few-Shot Learning PoC Ultrasound COVID-19 Diagnostic System. (arXiv:2109.03793v1 [eess.IV])
    (0 min) This paper presents a novel ultrasound imaging point-of-care (PoC) COVID-19 diagnostic system. The adaptive visual diagnostics utilize few-shot learning (FSL) to generate encoded disease state models that are stored and classified using a dictionary of knowns. The pipeline's novel vocabulary-based feature processing adapts the knowledge of a pretrained deep neural network to compress the ultrasound images into discriminative descriptions. The computational efficiency of the FSL approach enables high diagnostic deep learning performance in PoC settings, where training data is limited and the annotation process is not strictly controlled. The algorithm performance is evaluated on the open source COVID-19 POCUS Dataset to validate the system's ability to distinguish COVID-19, pneumonia, and healthy disease states. The results of the empirical analyses demonstrate the appropriate efficiency and accuracy for scalable PoC use. The code for this work will be made publicly available on GitHub upon acceptance.
    Certifiable Outlier-Robust Geometric Perception: Exact Semidefinite Relaxations and Scalable Global Optimization. (arXiv:2109.03349v1 [cs.CV])
    (0 min) We propose the first general and scalable framework to design certifiable algorithms for robust geometric perception in the presence of outliers. Our first contribution is to show that estimation using common robust costs, such as truncated least squares (TLS), maximum consensus, Geman-McClure, Tukey's biweight, among others, can be reformulated as polynomial optimization problems (POPs). By focusing on the TLS cost, our second contribution is to exploit sparsity in the POP and propose a sparse semidefinite programming (SDP) relaxation that is much smaller than the standard Lasserre's hierarchy while preserving exactness, i.e., the SDP recovers the optimizer of the nonconvex POP with an optimality certificate. Our third contribution is to solve the SDP relaxations at an unprecedented scale and accuracy by presenting STRIDE, a solver that blends global descent on the convex SDP with fast local search on the nonconvex POP. Our fourth contribution is an evaluation of the proposed framework on six geometric perception problems including single and multiple rotation averaging, point cloud and mesh registration, absolute pose estimation, and category-level object pose and shape estimation. Our experiments demonstrate that (i) our sparse SDP relaxation is exact with up to 60%-90% outliers across applications; (ii) while still being far from real-time, STRIDE is up to 100 times faster than existing SDP solvers on medium-scale problems, and is the only solver that can solve large-scale SDPs with hundreds of thousands of constraints to high accuracy; (iii) STRIDE provides a safeguard to existing fast heuristics for robust estimation (e.g., RANSAC or Graduated Non-Convexity), i.e., it certifies global optimality if the heuristic estimates are optimal, or detects and allows escaping local optima when the heuristic estimates are suboptimal.
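    To make the first contribution concrete, the standard rewriting of the truncated least squares (TLS) cost as a polynomial optimization problem introduces one binary variable per measurement; the identity below is shown only for intuition and follows the usual textbook form.

        \min_{x} \sum_{i=1}^{N} \min\left( r_i^2(x), \; \bar{c}^{\,2} \right)
          \;=\; \min_{x,\, \theta_i \in \{+1,-1\}} \sum_{i=1}^{N} \frac{1+\theta_i}{2}\, r_i^2(x) \;+\; \frac{1-\theta_i}{2}\, \bar{c}^{\,2},

    where $r_i(x)$ is the residual of the $i$-th measurement and $\bar{c}$ is the truncation threshold; when $r_i(x)$ is a polynomial in $x$ and the constraints $\theta_i^2 = 1$ are added, the right-hand side is a polynomial optimization problem in $(x, \theta)$.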
    Toward Real-World Super-Resolution via Adaptive Downsampling Models. (arXiv:2109.03444v1 [eess.IV])
    (0 min) Most image super-resolution (SR) methods are developed on synthetic low-resolution (LR) and high-resolution (HR) image pairs that are constructed by a predetermined operation, e.g., bicubic downsampling. As existing methods typically learn an inverse mapping of the specific function, they produce blurry results when applied to real-world images whose exact formulation is different and unknown. Therefore, several methods attempt to synthesize much more diverse LR samples or learn a realistic downsampling model. However, due to restrictive assumptions on the downsampling process, they are still biased and less generalizable. This study proposes a novel method to simulate an unknown downsampling process without imposing restrictive prior knowledge. We propose a generalizable low-frequency loss (LFL) in the adversarial training framework to imitate the distribution of target LR images without using any paired examples. Furthermore, we design an adaptive data loss (ADL) for the downsampler, which can be adaptively learned and updated from the data during the training loops. Extensive experiments validate that our downsampling model can facilitate existing SR methods to perform more accurate reconstructions on various synthetic and real-world examples than the conventional approaches.
    MRI Reconstruction Using Deep Energy-Based Model. (arXiv:2109.03237v1 [eess.IV])
    (0 min) Purpose: Although recent deep energy-based generative models (EBMs) have shown encouraging results in many image generation tasks, how to take advantage of the self-adversarial cogitation of deep EBMs to boost the performance of Magnetic Resonance Imaging (MRI) reconstruction remains an open question. Methods: With the successful application of deep learning in a wide range of MRI reconstruction tasks, a line of emerging research involves formulating an optimization-based reconstruction method in the space of a generative model. Leveraging this, a novel regularization strategy is introduced in this article that takes advantage of the self-adversarial cogitation of the deep energy-based model. More precisely, we advocate alternately learning a more powerful energy-based model with maximum likelihood estimation to obtain the deep energy-based information, represented as an image prior. Simultaneously, implicit inference with Langevin dynamics is a unique property of the reconstruction. In contrast to other generative models for reconstruction, the proposed method utilizes deep energy-based information as the image prior in reconstruction to improve image quality. Results: Experimental results imply that the proposed technique achieves remarkable performance in terms of reconstruction accuracy, is competitive with state-of-the-art methods, and does not suffer from mode collapse. Conclusion: Algorithmically, an iterative approach was presented to strengthen EBM training with the gradient of the energy network. The robustness and the reproducibility of the algorithm were also experimentally validated. More importantly, the proposed reconstruction framework can be generalized for most MRI reconstruction scenarios.
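    As a point of reference for the Langevin-dynamics inference mentioned above, a generic (unadjusted) Langevin update for sampling under an energy-based prior is sketched below; the energy network, step size and number of steps are placeholders, not the settings used in the paper.

        import torch

        def langevin_refine(x: torch.Tensor, energy_fn, n_steps: int = 60, step_size: float = 1e-3) -> torch.Tensor:
            # Unadjusted Langevin dynamics on an energy function E(x):
            #   x_{k+1} = x_k - (step / 2) * dE/dx + sqrt(step) * noise
            # `energy_fn` maps a batch of images to one scalar energy per sample.
            x = x.clone().detach()
            for _ in range(n_steps):
                x.requires_grad_(True)
                energy = energy_fn(x).sum()
                grad, = torch.autograd.grad(energy, x)
                with torch.no_grad():
                    x = x - 0.5 * step_size * grad + step_size ** 0.5 * torch.randn_like(x)
            return x.detach()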
    Master Face Attacks on Face Recognition Systems. (arXiv:2109.03398v1 [cs.CV])
    (0 min) Face authentication is now widely used, especially on mobile devices, rather than authentication using a personal identification number or an unlock pattern, due to its convenience. It has thus become a tempting target for attackers using a presentation attack. Traditional presentation attacks use facial images or videos of the victim. Previous work has proven the existence of master faces, i.e., faces that match multiple enrolled templates in face recognition systems, and their existence extends the ability of presentation attacks. In this paper, we perform an extensive study on latent variable evolution (LVE), a method commonly used to generate master faces. We run an LVE algorithm for various scenarios and with more than one database and/or face recognition system to study the properties of the master faces and to understand in which conditions strong master faces could be generated. Moreover, through analysis, we hypothesize that master faces come from some dense areas in the embedding spaces of the face recognition systems. Last but not least, simulated presentation attacks using generated master faces generally preserve the false-matching ability of their original digital forms, thus demonstrating that the existence of master faces poses an actual threat.
    Unfolding Taylor's Approximations for Image Restoration. (arXiv:2109.03442v1 [cs.CV])
    (0 min) Deep learning provides a new avenue for image restoration, which demands a delicate balance between fine-grained details and high-level contextualized information while recovering the latent clear image. In practice, however, existing methods empirically construct encapsulated end-to-end mapping networks without examining the underlying rationale, and neglect the intrinsic prior knowledge of the restoration task. To solve the above problems, inspired by Taylor's Approximations, we unfold Taylor's Formula to construct a novel framework for image restoration. We find that the main part and the derivative part of Taylor's Approximation correspond, respectively, to the two competing goals of high-level contextualized information and spatial details in image restoration. Specifically, our framework consists of two steps, correspondingly responsible for the mapping and derivative functions. The former first learns the high-level contextualized information and the latter combines it with the degraded input to progressively recover local high-order spatial details. Our proposed framework is orthogonal to existing methods and thus can be easily integrated with them for further improvement, and extensive experiments demonstrate the effectiveness and scalability of our proposed framework.
    Self-Supervised Representation Learning using Visual Field Expansion on Digital Pathology. (arXiv:2109.03299v1 [eess.IV])
    (0 min) The examination of histopathology images is considered to be the gold standard for the diagnosis and stratification of cancer patients. A key challenge in the analysis of such images is their size, which can run into the gigapixels and can require tedious screening by clinicians. With the recent advances in computational medicine, automatic tools have been proposed to assist clinicians in their everyday practice. Such tools typically process these large images by slicing them into tiles that can then be encoded and utilized for different clinical models. In this study, we propose a novel generative framework that can learn powerful representations for such tiles by learning to plausibly expand their visual field. In particular, we developed a progressively grown generative model with the objective of visual field expansion. Thus trained, our model learns to generate different tissue types with fine details, while simultaneously learning powerful representations that can be used for different clinical endpoints, all in a self-supervised way. To evaluate the performance of our model, we conducted classification experiments on CAMELYON17 and CRC benchmark datasets, comparing favorably to other self-supervised and pre-trained strategies that are commonly used in digital pathology. Our code is available at https://github.com/jcboyd/cdpath21-gan.
    TSI: Temporal Saliency Integration for Video Action Recognition. (arXiv:2106.01088v2 [cs.CV] UPDATED)
    (0 min) Efficient spatiotemporal modeling is an important yet challenging problem for video action recognition. Existing state-of-the-art methods exploit motion clues to assist in short-term temporal modeling through temporal difference over consecutive frames. However, irrelevant noise is inevitably introduced by camera movement. Besides, movements of different actions can vary greatly. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module. Specifically, SME aims to highlight the motion-sensitive area through local-global motion modeling, where saliency alignment and pyramidal feature difference are conducted successively between neighboring frames to capture motion dynamics with less noise caused by misaligned backgrounds. CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions. Meanwhile, temporal interactions across different scales are integrated with an attention mechanism. Through these two modules, long short-term temporal relationships can be encoded efficiently by introducing limited additional parameters. Extensive experiments are conducted on several popular benchmarks (i.e., Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51), which demonstrate the effectiveness and superiority of our proposed method.
    ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning. (arXiv:2108.10501v2 [cs.CV] UPDATED)
    (0 min) The central idea of contrastive learning is to discriminate between different instances and force different views of the same instance to share the same representation. To avoid trivial solutions, augmentation plays an important role in generating different views, among which random cropping is shown to be effective for the model to learn a strong and generalized representation. Commonly used random crop operation keeps the difference between two views statistically consistent along the training process. In this work, we challenge this convention by showing that adaptively controlling the disparity between two augmented views along the training process enhances the quality of the learnt representation. Specifically, we present a parametric cubic cropping operation, ParamCrop, for video contrastive learning, which automatically crops a 3D cubic from the video by differentiable 3D affine transformations. ParamCrop is trained simultaneously with the video backbone using an adversarial objective and learns an optimal cropping strategy from the data. The visualizations show that the center distance and the IoU between two augmented views are adaptively controlled by ParamCrop and the learned change in the disparity along the training process is beneficial to learning a strong representation. Extensive ablation studies demonstrate the effectiveness of the proposed ParamCrop on multiple contrastive learning frameworks and video backbones. With ParamCrop, we improve the state-of-the-art performance on both HMDB51 and UCF101 datasets.
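    The cropping mechanism can be illustrated with a plain 2D differentiable affine crop built from affine_grid and grid_sample; ParamCrop itself operates on 3D video cubes and learns its parameters adversarially against the contrastive objective, so the sketch below (its parameterization and shapes included) is only an assumption-laden illustration of the underlying mechanism.

        import torch
        import torch.nn.functional as F

        def differentiable_crop(frames: torch.Tensor, scale: torch.Tensor,
                                tx: torch.Tensor, ty: torch.Tensor) -> torch.Tensor:
            # frames: (N, C, H, W); scale, tx, ty: (N,) tensors that may carry gradients,
            # e.g. produced by a small cropping network trained with an adversarial objective.
            zeros = torch.zeros_like(scale)
            row1 = torch.stack([scale, zeros, tx], dim=1)   # (N, 3): x-zoom and x-shift
            row2 = torch.stack([zeros, scale, ty], dim=1)   # (N, 3): y-zoom and y-shift
            theta = torch.stack([row1, row2], dim=1)        # (N, 2, 3) affine matrices
            grid = F.affine_grid(theta, list(frames.size()), align_corners=False)
            return F.grid_sample(frames, grid, align_corners=False)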
    Wanderlust: Online Continual Object Detection in the Real World. (arXiv:2108.11005v2 [cs.CV] UPDATED)
    (0 min) Online continual learning from data streams in dynamic environments is a critical direction in the computer vision field. However, realistic benchmarks and fundamental studies in this line are still missing. To bridge the gap, we present a new online continual object detection benchmark with an egocentric video dataset, Objects Around Krishna (OAK). OAK adopts the KrishnaCAM videos, an ego-centric video stream collected over nine months by a graduate student. OAK provides exhaustive bounding box annotations of 80 video snippets (~17.5 hours) for 105 object categories in outdoor scenes. The emergence of new object categories in our benchmark follows a pattern similar to what a single person might see in their day-to-day life. The dataset also captures the natural distribution shifts as the person travels to different places. These egocentric long-running videos provide a realistic playground for continual learning algorithms, especially in online embodied settings. We also introduce new evaluation metrics to evaluate the model performance and catastrophic forgetting and provide baseline studies for online continual object detection. We believe this benchmark will pose new exciting challenges for learning from non-stationary data in continual learning. The OAK dataset and the associated benchmark are released at https://oakdata.github.io/.
    RGB-D Salient Object Detection with Ubiquitous Target Awareness. (arXiv:2109.03425v1 [cs.CV])
    (0 min) Conventional RGB-D salient object detection methods aim to leverage depth as complementary information to find the salient regions in both modalities. However, the salient object detection results heavily rely on the quality of captured depth data which sometimes are unavailable. In this work, we make the first attempt to solve the RGB-D salient object detection problem with a novel depth-awareness framework. This framework only relies on RGB data in the testing phase, utilizing captured depth data as supervision for representation learning. To construct our framework as well as achieving accurate salient detection results, we propose a Ubiquitous Target Awareness (UTA) network to solve three important challenges in RGB-D SOD task: 1) a depth awareness module to excavate depth information and to mine ambiguous regions via adaptive depth-error weights, 2) a spatial-aware cross-modal interaction and a channel-aware cross-level interaction, exploiting the low-level boundary cues and amplifying high-level salient channels, and 3) a gated multi-scale predictor module to perceive the object saliency in different contextual scales. Besides its high performance, our proposed UTA network is depth-free for inference and runs in real-time with 43 FPS. Experimental evidence demonstrates that our proposed network not only surpasses the state-of-the-art methods on five public RGB-D SOD benchmarks by a large margin, but also verifies its extensibility on five public RGB SOD benchmarks.
    Built-in Elastic Transformations for Improved Robustness. (arXiv:2107.09391v2 [cs.CV] UPDATED)
    (0 min) We focus on building robustness in the convolutions of neural visual classifiers, especially against natural perturbations like elastic deformations, occlusions and Gaussian noise. Existing CNNs show outstanding performance on clean images, but fail to tackle naturally occurring perturbations. In this paper, we start from elastic perturbations, which approximate (local) view-point changes of the object. We present elastically-augmented convolutions (EAConv) by parameterizing filters as a combination of fixed elastically-perturbed basis functions and trainable weights for the purpose of integrating unseen viewpoints in the CNN. We show on the CIFAR-10 and STL-10 datasets that our method improves general robustness to unseen occlusion, zoom, rotation, image cut and Gaussian perturbations, while significantly improving performance on clean images without any data augmentation.
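    The general idea of filters built from a frozen basis with trainable mixing coefficients can be sketched as below; the basis here is random for illustration only, whereas EAConv uses elastically-perturbed basis filters, so this is not the authors' construction.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class BasisConv2d(nn.Module):
            # Convolution whose kernels are weighted sums of fixed basis filters.
            # Only the mixing coefficients are trainable; the basis itself is frozen.
            def __init__(self, in_channels: int, out_channels: int, kernel_size: int, num_bases: int = 8):
                super().__init__()
                basis = torch.randn(num_bases, kernel_size, kernel_size)  # placeholder fixed basis
                self.register_buffer("basis", basis)
                self.coeffs = nn.Parameter(0.1 * torch.randn(out_channels, in_channels, num_bases))

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # Mix (out, in, B) coefficients with (B, k, k) bases into (out, in, k, k) kernels.
                weight = torch.einsum("oib,bkl->oikl", self.coeffs, self.basis)
                return F.conv2d(x, weight, padding=weight.shape[-1] // 2)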
    Disentangling Alzheimer's disease neurodegeneration from typical brain aging using machine learning. (arXiv:2109.03723v1 [cs.LG])
    (0 min) Neuroimaging biomarkers that distinguish between typical brain aging and Alzheimer's disease (AD) are valuable for determining how much each contributes to cognitive decline. Machine learning models can derive multi-variate brain change patterns related to the two processes, including the SPARE-AD (Spatial Patterns of Atrophy for Recognition of Alzheimer's Disease) and SPARE-BA (of Brain Aging) investigated herein. However, substantial overlap between brain regions affected in the two processes confounds measuring them independently. We present a methodology toward disentangling the two. T1-weighted MRI images of 4,054 participants (48-95 years) with AD, mild cognitive impairment (MCI), or cognitively normal (CN) diagnoses from the iSTAGING (Imaging-based coordinate SysTem for AGIng and NeurodeGenerative diseases) consortium were analyzed. First, a subset of AD patients and CN adults were selected based purely on clinical diagnoses to train SPARE-BA1 (regression of age using CN individuals) and SPARE-AD1 (classification of CN versus AD). Second, analogous groups were selected based on clinical and molecular markers to train SPARE-BA2 and SPARE-AD2: amyloid-positive (A+) AD continuum group (consisting of A+AD, A+MCI, and A+ and tau-positive CN individuals) and amyloid-negative (A-) CN group. Finally, the combined group of the AD continuum and A-/CN individuals was used to train SPARE-BA3, with the intention to estimate brain age regardless of AD-related brain changes. Disentangled SPARE models derived brain patterns that were more specific to the two types of the brain changes. Correlation between the SPARE-BA and SPARE-AD was significantly reduced. Correlation of disentangled SPARE-AD was non-inferior to the molecular measurements and to the number of APOE4 alleles, but was less to AD-related psychometric test scores, suggesting contribution of advanced brain aging to these scores.
    FaceCook: Face Generation Based on Linear Scaling Factors. (arXiv:2109.03492v1 [cs.CV])
    (0 min) With the excellent disentanglement properties of state-of-the-art generative models, image editing has been the dominant approach to control the attributes of synthesised face images. However, these edited results often suffer from artifacts or incorrect feature rendering, especially when there is a large discrepancy between the image to be edited and the desired feature set. Therefore, we propose a new approach to mapping the latent vectors of the generative model to the scaling factors through solving a set of multivariate linear equations. The coefficients of the equations are the eigenvectors of the weight parameters of the pre-trained model, which form the basis of a hyper coordinate system. The qualitative and quantitative results both show that the proposed method outperforms the baseline in terms of image diversity. In addition, the method is much more time-efficient because synthesised images with the desired features can be obtained directly from the latent vectors, rather than by editing randomly generated images through many processing steps.
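    The core computational step, expressing a latent vector in a basis formed by eigenvectors of the pre-trained weights, can be sketched generically as follows; which weight matrix is used and how the resulting coordinates relate to the paper's scaling factors are assumptions made purely for illustration.

        import numpy as np

        def latent_to_scaling_factors(weight: np.ndarray, latent: np.ndarray) -> np.ndarray:
            # weight: (m, d) weight matrix of a pre-trained layer; latent: (d,) latent vector.
            # Returns the coordinates of the latent vector in the eigenbasis of W^T W.
            gram = weight.T @ weight                  # (d, d) symmetric matrix
            _, eigvecs = np.linalg.eigh(gram)         # columns are orthonormal eigenvectors
            # Solve eigvecs @ s = latent; with an orthonormal basis this is just a projection.
            return np.linalg.solve(eigvecs, latent)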
    Learning to Discriminate Information for Online Action Detection: Analysis and Application. (arXiv:2109.03393v1 [cs.CV])
    (0 min) Online action detection, which aims to identify an ongoing action from a streaming video, is an important subject in real-world applications. For this task, previous methods use recurrent neural networks for modeling temporal relations in an input sequence. However, these methods overlook the fact that the input image sequence includes not only the action of interest but background and irrelevant actions. This would induce recurrent units to accumulate unnecessary information for encoding features on the action of interest. To overcome this problem, we propose a novel recurrent unit, named Information Discrimination Unit (IDU), which explicitly discriminates the information relevancy between an ongoing action and others to decide whether to accumulate the input information. This enables learning more discriminative representations for identifying an ongoing action. In this paper, we further present a new recurrent unit, called Information Integration Unit (IIU), for action anticipation. Our IIU exploits the outputs from IDU as pseudo action labels as well as RGB frames to learn enriched features of observed actions effectively. In experiments on TVSeries and THUMOS-14, the proposed methods outperform state-of-the-art methods by a significant margin in online action detection and action anticipation. Moreover, we demonstrate the effectiveness of the proposed units by conducting comprehensive ablation studies.
    Scaled ReLU Matters for Training Vision Transformers. (arXiv:2109.03810v1 [cs.CV])
    (0 min) Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, the training of ViTs is much harder than that of CNNs, as it is sensitive to the training parameters, such as learning rate, optimizer and warmup epochs. The reasons for the training difficulty are empirically analysed in \cite{xiao2021early}, and the authors conjecture that the issue lies with the \textit{patchify-stem} of ViT models and propose that early convolutions help transformers see better. In this paper, we further investigate this problem and extend the above conclusion: early convolutions alone do not help for stable training; rather, it is the scaled ReLU operation in the \textit{convolutional stem} (\textit{conv-stem}) that matters. We verify, both theoretically and empirically, that scaled ReLU in the \textit{conv-stem} not only improves training stabilization, but also increases the diversity of patch tokens, thus boosting peak performance by a large margin while adding few parameters and FLOPs. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute for CNNs.
    Panoptic SegFormer. (arXiv:2109.03814v1 [cs.CV])
    (0 min) We present Panoptic SegFormer, a general framework for end-to-end panoptic segmentation with Transformers. The proposed method extends Deformable DETR with a unified mask prediction workflow for both things and stuff, making the panoptic segmentation pipeline concise and effective. With a ResNet-50 backbone, our method achieves 50.0\% PQ on the COCO test-dev split, surpassing previous state-of-the-art methods by significant margins without bells and whistles. Using a more powerful PVTv2-B5 backbone, Panoptic SegFormer achieves a new record of 54.1\% PQ and 54.4\% PQ on the COCO val and test-dev splits with single-scale input.
    Temporal RoI Align for Video Object Recognition. (arXiv:2109.03495v1 [cs.CV])
    (0 min) Video object detection is challenging in the presence of appearance deterioration in certain video frames. Therefore, it is a natural choice to aggregate temporal information from other frames of the same video into the current frame. However, RoI Align, one of the core procedures of video detectors, still extracts features from a single-frame feature map for proposals, so the extracted RoI features lack temporal information from the video. In this work, considering that the features of the same object instance are highly similar across frames in a video, a novel Temporal RoI Align operator is proposed to extract features from other frames' feature maps for current-frame proposals by utilizing feature similarity. The proposed Temporal RoI Align operator can extract temporal information from the entire video for proposals. We integrate it into single-frame video detectors and other state-of-the-art video detectors, and conduct quantitative experiments to demonstrate that the proposed Temporal RoI Align operator can consistently and significantly boost the performance. Besides, the proposed Temporal RoI Align can also be applied to video instance segmentation.
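    A bare-bones illustration of aggregating proposal features from other frames by feature similarity, which is the general mechanism the Temporal RoI Align operator relies on, is sketched below; the shapes, the cosine-similarity attention and the additive fusion are assumptions, not the operator's actual definition.

        import torch
        import torch.nn.functional as F

        def similarity_aggregate(current_roi: torch.Tensor, other_rois: torch.Tensor,
                                 temperature: float = 1.0) -> torch.Tensor:
            # current_roi: (R, D) features of R proposals in the current frame.
            # other_rois:  (T, R2, D) candidate features gathered from T support frames.
            t, r2, d = other_rois.shape
            support = other_rois.reshape(t * r2, d)
            # Cosine similarity between each current proposal and every support feature.
            sim = F.normalize(current_roi, dim=1) @ F.normalize(support, dim=1).t()  # (R, T*R2)
            attn = F.softmax(sim / temperature, dim=1)
            temporal_feature = attn @ support                                         # (R, D)
            # Simple additive fusion of per-frame and temporal information (an assumption).
            return current_roi + temporal_feature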
    Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking. (arXiv:2109.03805v1 [cs.CV])
    (0 min) Panoptic scene understanding and tracking of dynamic agents are essential for robots and automated vehicles to navigate in urban environments. As LiDARs provide accurate illumination-independent geometric depictions of the scene, performing these tasks using LiDAR point clouds provides reliable predictions. However, existing datasets lack diversity in the type of urban scenes and have a limited number of dynamic object instances which hinders both learning of these tasks as well as credible benchmarking of the developed methods. In this paper, we introduce the large-scale Panoptic nuScenes benchmark dataset that extends our popular nuScenes dataset with point-wise groundtruth annotations for semantic segmentation, panoptic segmentation, and panoptic tracking tasks. To facilitate comparison, we provide several strong baselines for each of these tasks on our proposed dataset. Moreover, we analyze the drawbacks of the existing metrics for the panoptic tracking problem and propose a novel instance-centric metric that addresses the concerns. We present extensive experiments that demonstrate the utility of Panoptic nuScenes compared to existing datasets and make the online evaluation server available at \url{nuScenes.org}. We believe that this extension will accelerate the research of novel methods for scene understanding of dynamic urban environments.
    RoadAtlas: Intelligent Platform for Automated Road Defect Detection and Asset Management. (arXiv:2109.03385v1 [cs.CV])
    (0 min) With the rapid development of intelligent detection algorithms based on deep learning, much progress has been made in automatic road defect recognition and road marking parsing. This can effectively address the issue of an expensive and time-consuming process for professional inspectors to review the street manually. Towards this goal, we present RoadAtlas, a novel end-to-end integrated system that can support 1) road defect detection, 2) road marking parsing, 3) a web-based dashboard for presenting and inputting data by users, and 4) a backend containing a well-structured database and developed APIs.
    Which and Where to Focus: A Simple yet Accurate Framework for Arbitrary-Shaped Nearby Text Detection in Scene Images. (arXiv:2109.03451v1 [cs.CV])
    (0 min) Scene text detection has drawn close attention from researchers. Though many methods have been proposed for horizontal and oriented texts, previous methods may not perform well when dealing with arbitrary-shaped texts such as curved texts. In particular, a confusion problem arises in the case of nearby text instances. In this paper, we propose a simple yet effective method for accurate arbitrary-shaped nearby scene text detection. Firstly, a One-to-Many Training Scheme (OMTS) is designed to eliminate confusion and enable the proposals to learn more appropriate groundtruths in the case of nearby text instances. Secondly, we propose a Proposal Feature Attention Module (PFAM) to exploit more effective features for each proposal, which can better adapt to arbitrary-shaped text instances. Finally, we propose a baseline that is based on Faster R-CNN and outputs the curve representation directly. Equipped with PFAM and OMTS, the detector can achieve state-of-the-art or competitive performance on several challenging benchmarks.
    Quasi-Dense Similarity Learning for Multiple Object Tracking. (arXiv:2006.06664v4 [cs.CV] UPDATED)
    (2 min) Similarity learning has been recognized as a crucial step for object tracking. However, existing multiple object tracking methods only use sparse ground truth matching as the training objective, while ignoring the majority of the informative regions on the images. In this paper, we present Quasi-Dense Similarity Learning, which densely samples hundreds of region proposals on a pair of images for contrastive learning. We can directly combine this similarity learning with existing detection methods to build Quasi-Dense Tracking (QDTrack) without turning to displacement regression or motion priors. We also find that the resulting distinctive feature space admits a simple nearest neighbor search at the inference time. Despite its simplicity, QDTrack outperforms all existing methods on MOT, BDD100K, Waymo, and TAO tracking benchmarks. It achieves 68.7 MOTA at 20.3 FPS on MOT17 without using external training data. Compared to methods with similar detectors, it boosts almost 10 points of MOTA and significantly decreases the number of ID switches on BDD100K and Waymo datasets. Our code and trained models are available at this http URL
    FIDNet: LiDAR Point Cloud Semantic Segmentation with Fully Interpolation Decoding. (arXiv:2109.03787v1 [cs.CV])
    (3 min) Projecting the point cloud onto the 2D spherical range image transforms LiDAR semantic segmentation into a 2D segmentation task on the range image. However, the LiDAR range image is still naturally different from the regular 2D RGB image; for example, each position on the range image encodes unique geometry information. In this paper, we propose a new projection-based LiDAR semantic segmentation pipeline that consists of a novel network structure and an efficient post-processing step. In our network structure, we design a FID (fully interpolation decoding) module that directly upsamples the multi-resolution feature maps using bilinear interpolation. Inspired by the 3D distance interpolation used in PointNet++, we argue this FID module is a 2D version of distance interpolation in $(\theta, \phi)$ space. As a parameter-free decoding module, the FID largely reduces model complexity while maintaining good performance. Besides the network structure, we empirically find that our model predictions have clear boundaries between different semantic classes. This makes us rethink whether the widely used K-nearest-neighbor post-processing is still necessary for our pipeline. We then realize that the many-to-one projection causes a blurring effect: some points are mapped into the same pixel and share the same label. Therefore, we propose to process those occluded points by assigning the nearest predicted label to them. This NLA (nearest label assignment) post-processing step shows better performance than KNN with faster inference speed in the ablation study. On the SemanticKITTI dataset, our pipeline achieves the best performance among all projection-based methods with $64 \times 2048$ resolution and all point-wise solutions. With a ResNet-34 as the backbone, both the training and testing of our model can be finished on a single RTX 2080 Ti with 11G memory. The code is released.
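    The NLA step can be pictured as a nearest-neighbour label transfer in 3D: points hidden by the many-to-one projection inherit the label of the closest point that did receive a prediction. The KD-tree sketch below is one generic way to implement this and is not necessarily the authors' implementation.

        import numpy as np
        from scipy.spatial import cKDTree

        def nearest_label_assignment(labeled_xyz: np.ndarray, labels: np.ndarray,
                                     occluded_xyz: np.ndarray) -> np.ndarray:
            # labeled_xyz:  (N, 3) points that received a prediction from the range image.
            # labels:       (N,)   predicted semantic labels for those points.
            # occluded_xyz: (M, 3) points hidden by the many-to-one projection.
            tree = cKDTree(labeled_xyz)
            _, nearest_idx = tree.query(occluded_xyz, k=1)
            return labels[nearest_idx]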
    Domain Adaptive Ensemble Learning. (arXiv:2003.07325v3 [cs.CV] UPDATED)
    (2 min) The problem of generalizing deep neural networks from multiple source domains to a target one is studied under two settings: When unlabeled target data is available, it is a multi-source unsupervised domain adaptation (UDA) problem, otherwise a domain generalization (DG) problem. We propose a unified framework termed domain adaptive ensemble learning (DAEL) to address both problems. A DAEL model is composed of a CNN feature extractor shared across domains and multiple classifier heads each trained to specialize in a particular source domain. Each such classifier is an expert to its own domain and a non-expert to others. DAEL aims to learn these experts collaboratively so that when forming an ensemble, they can leverage complementary information from each other to be more effective for an unseen target domain. To this end, each source domain is used in turn as a pseudo-target-domain with its own expert providing supervisory signal to the ensemble of non-experts learned from the other sources. For unlabeled target data under the UDA setting where real expert does not exist, DAEL uses pseudo-label to supervise the ensemble learning. Extensive experiments on three multi-source UDA datasets and two DG datasets show that DAEL improves the state of the art on both problems, often by significant margins. The code is released at \url{https://github.com/KaiyangZhou/Dassl.pytorch}.
    DMN4: Few-shot Learning via Discriminative Mutual Nearest Neighbor Neural Network. (arXiv:2103.08160v3 [cs.CV] UPDATED)
    (2 min) Few-shot learning (FSL) aims to classify images under low-data regimes, where the conventional pooled global feature is likely to lose useful local characteristics. Recent work has achieved promising performances by using deep descriptors. They generally take all deep descriptors from neural networks into consideration while ignoring that some of them are useless in classification due to their limited receptive field, e.g., task-irrelevant descriptors could be misleading and multiple aggregative descriptors from background clutter could even overwhelm the object's presence. In this paper, we argue that a Mutual Nearest Neighbor (MNN) relation should be established to explicitly select the query descriptors that are most relevant to each task and discard less relevant ones from aggregative clutters in FSL. Specifically, we propose Discriminative Mutual Nearest Neighbor Neural Network (DMN4) for FSL. Extensive experiments demonstrate that our method outperforms the existing state-of-the-arts on both fine-grained and generalized datasets.
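    A minimal sketch of selecting mutual nearest neighbor (MNN) descriptor pairs between a query image and the support descriptors of one class is given below; the distance metric and shapes are assumptions, and DMN4 builds its actual classification rule on top of such a selection rather than being defined by it.

        import torch

        def mutual_nearest_neighbors(query_desc: torch.Tensor, support_desc: torch.Tensor):
            # query_desc:   (Nq, D) deep local descriptors from the query image.
            # support_desc: (Ns, D) deep local descriptors from the support set of one class.
            # Returns indices (i, j) such that query i and support j are each other's nearest neighbor.
            dist = torch.cdist(query_desc, support_desc)       # (Nq, Ns) pairwise distances
            nn_of_query = dist.argmin(dim=1)                   # nearest support for each query descriptor
            nn_of_support = dist.argmin(dim=0)                 # nearest query for each support descriptor
            query_idx = torch.arange(query_desc.size(0), device=query_desc.device)
            mutual = nn_of_support[nn_of_query] == query_idx   # mutual agreement check
            return query_idx[mutual], nn_of_query[mutual]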
    Self-Supervised 3D Hand Pose Estimation from monocular RGB via Contrastive Learning. (arXiv:2106.05953v3 [cs.CV] UPDATED)
    (0 min) Encouraged by the success of contrastive learning on image classification tasks, we propose a new self-supervised method for the structured regression task of 3D hand pose estimation. Contrastive learning makes use of unlabeled data for the purpose of representation learning via a loss formulation that encourages the learned feature representations to be invariant under any image transformation. For 3D hand pose estimation, it too is desirable to have invariance to appearance transformation such as color jitter. However, the task requires equivariance under affine transformations, such as rotation and translation. To address this issue, we propose an equivariant contrastive objective and demonstrate its effectiveness in the context of 3D hand pose estimation. We experimentally investigate the impact of invariant and equivariant contrastive objectives and show that learning equivariant features leads to better representations for the task of 3D hand pose estimation. Furthermore, we show that standard ResNets with sufficient depth, trained on additional unlabeled data, attain improvements of up to 14.5% in PA-EPE on FreiHAND and thus achieves state-of-the-art performance without any task specific, specialized architectures. Code and models are available at https://ait.ethz.ch/projects/2021/PeCLR/
    Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification. (arXiv:2109.03483v1 [cs.CV])
    (2 min) Person Re-Identification (Re-Id) in occlusion scenarios is a challenging problem because a pedestrian can be partially occluded. The use of local information for feature extraction and matching is still necessary. Therefore, we propose a Pose-guided inter- and intra-part relational transformer (Pirt) for occluded person Re-Id, which builds part-aware long-term correlations by introducing transformers. In our framework, we first develop a pose-guided feature extraction module with regional grouping and mask construction for robust feature representations. The positions of a pedestrian in the image under surveillance scenarios are relatively fixed, hence we propose an intra-part and inter-part relational transformer. The intra-part module creates local relations with mask-guided features, while the inter-part relationship builds correlations with transformers, to develop cross relationships between part nodes. With the collaborative learning of inter- and intra-part relationships, experiments reveal that our proposed Pirt model achieves a new state of the art on the public occluded dataset, and further extensions to standard non-occluded person Re-Id datasets also show comparable performance.
    SSEGEP: Small SEGment Emphasized Performance evaluation metric for medical image segmentation. (arXiv:2109.03435v1 [eess.IV])
    (2 min) Automatic image segmentation is a critical component of medical image analysis, and hence quantifying segmentation performance is crucial. Challenges in medical image segmentation are mainly due to spatial variations of regions to be segmented and imbalance in the distribution of classes. Commonly used metrics treat all detected pixels indiscriminately. However, pixels in smaller segments must be treated differently from pixels in larger segments, as detection of smaller segments aids in early treatment of the associated disease, and smaller segments are also easier to miss. To address this, we propose a novel evaluation metric for segmentation performance, emphasizing smaller segments, by assigning higher weightage to smaller segment pixels. Weighted false positives are also considered in deriving the new metric, named "SSEGEP" (Small SEGment Emphasized Performance evaluation metric), whose range is 0 (bad) to 1 (good). The experiments were performed on diverse anatomies (eye, liver, pancreas and breast) from publicly available datasets to show the applicability of the proposed metric across different imaging techniques. Mean opinion score (MOS) and statistical significance testing are used to quantify the relevance of the proposed approach. Across 33 fundus images, where the largest exudate is 1.41%, and the smallest is 0.0002% of the image, the proposed metric is 30% closer to MOS, as compared to the Dice Similarity Coefficient (DSC). Statistical significance testing resulted in a promising p-value on the order of 10^{-18} with SSEGEP for hepatic tumor compared to DSC. The proposed metric is found to perform better for images having multiple segments for a single label.
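    To illustrate the mechanics of up-weighting small segments, the toy metric below weights each ground-truth connected component inversely to its size; it is not the SSEGEP formula (which also weights false positives), but it shows how smaller segments can be emphasized.

        import numpy as np
        from scipy import ndimage

        def size_weighted_recall(pred: np.ndarray, gt: np.ndarray) -> float:
            # pred, gt: boolean masks of identical shape. Each connected component of `gt`
            # is treated as one segment; smaller segments get a larger per-pixel weight.
            components, num = ndimage.label(gt)
            if num == 0:
                return 1.0
            weights = np.zeros(gt.shape, dtype=float)
            for seg_id in range(1, num + 1):
                seg = components == seg_id
                weights[seg] = 1.0 / seg.sum()          # small segments dominate the score
            hit = np.where(pred & gt, weights, 0.0).sum()
            return float(hit / weights.sum())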
    Reconstructing High-resolution Turbulent Flows Using Physics-Guided Neural Networks. (arXiv:2109.03327v1 [physics.flu-dyn])
    (2 min) Direct numerical simulation (DNS) of turbulent flows is computationally expensive and cannot be applied to flows with large Reynolds numbers. Large eddy simulation (LES) is an alternative that is computationally less demanding, but is unable to capture all of the scales of turbulent transport accurately. Our goal in this work is to build a new data-driven methodology based on super-resolution techniques to reconstruct DNS data from LES predictions. We leverage the underlying physical relationships to regularize the relationships amongst different physical variables. We also introduce a hierarchical generative process and a reverse degradation process to fully explore the correspondence between DNS and LES data. We demonstrate the effectiveness of our method through a single-snapshot experiment and a cross-time experiment. The results confirm that our method can better reconstruct high-resolution DNS data over space and over time in terms of pixel-wise reconstruction error and structural similarity. Visual comparisons show that our method performs much better in capturing fine-level flow dynamics.
    Level Set Binocular Stereo with Occlusions. (arXiv:2109.03464v1 [cs.CV])
    (2 min) Localizing stereo boundaries and predicting nearby disparities are difficult because stereo boundaries induce occluded regions where matching cues are absent. Most modern computer vision algorithms treat occlusions secondarily (e.g., via left-right consistency checks after matching) or rely on high-level cues to improve nearby disparities (e.g., via deep networks and large training sets). They ignore the geometry of stereo occlusions, which dictates that the spatial extent of occlusion must equal the amplitude of the disparity jump that causes it. This paper introduces an energy and level-set optimizer that improves boundaries by encoding occlusion geometry. Our model applies to two-layer, figure-ground scenes, and it can be implemented cooperatively using messages that pass predominantly between parents and children in an undecimated hierarchy of multi-scale image patches. In a small collection of figure-ground scenes curated from Middlebury and Falling Things stereo datasets, our model provides more accurate boundaries than previous occlusion-handling stereo techniques. This suggests new directions for creating cooperative stereo systems that incorporate occlusion cues in a human-like manner.
    Recalibrating the KITTI Dataset Camera Setup for Improved Odometry Accuracy. (arXiv:2109.03462v1 [cs.RO])
    (2 min) Over the last decade, one of the most relevant public datasets for evaluating odometry accuracy has been the KITTI dataset. Besides the quality and rich sensor setup, its success is also due to the online evaluation tool, which enables researchers to benchmark and compare algorithms. The results are evaluated on the test subset solely, without any knowledge of the ground truth, yielding unbiased, overfit-free and therefore relevant validation for robot localization based on cameras, 3D lasers or a combination of both. However, like any sensor setup, it requires prior calibration, and since rectified stereo images are provided, results depend on the default calibration parameters. A natural question therefore arises: can a better set of calibration parameters be found that would yield higher odometry accuracy? In this paper, we propose a new approach for one-shot calibration of the KITTI dataset multiple camera setup. The approach yields better calibration parameters, both in the sense of lower calibration reprojection errors and lower visual odometry error. We conducted experiments showing for three different odometry algorithms, namely SOFT2, ORB-SLAM2 and VISO2, that odometry accuracy is significantly improved with the proposed calibration parameters. Moreover, our odometry, SOFT2, in conjunction with the proposed calibration method achieved the highest accuracy on the official KITTI scoreboard with 0.53% translational and 0.0009 deg/m rotational error, outperforming even 3D laser-based methods.
    Simple Video Generation using Neural ODEs. (arXiv:2109.03292v1 [cs.CV])
    (2 min) Despite having been studied to a great extent, the task of conditional generation of sequences of frames, or videos, remains extremely challenging. It is a common belief that a key step towards solving this task resides in modelling accurately both spatial and temporal information in video signals. A promising direction to do so has been to learn latent variable models that predict the future in latent space and project back to pixels, as suggested in recent literature. Following this line of work and building on top of a family of models introduced in prior work, Neural ODE, we investigate an approach that models time-continuous dynamics over a continuous latent space with a differential equation with respect to time. The intuition behind this approach is that these trajectories in latent space could then be extrapolated to generate video frames beyond the time steps for which the model is trained. We show that our approach yields promising results in the task of future frame prediction on the Moving MNIST dataset with 1 and 2 digits.
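    A minimal sketch of the latent-ODE intuition (a toy stand-in, not the paper's architecture): a frame is encoded into a latent state, the latent dynamics dz/dt = f(z) are integrated with a fixed-step Euler solver, and each state is decoded back to pixels; because time is continuous, the same dynamics can be integrated past the training horizon to extrapolate future frames.

        import torch
        import torch.nn as nn

        class LatentODEVideo(nn.Module):
            def __init__(self, frame_dim=64 * 64, latent_dim=32):
                super().__init__()
                self.enc = nn.Sequential(nn.Linear(frame_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
                self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, frame_dim))
                self.f = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))

            def forward(self, first_frame, n_steps, dt=0.1):
                z = self.enc(first_frame)                 # encode the conditioning frame to a latent state
                frames = []
                for _ in range(n_steps):                  # Euler steps; an adaptive ODE solver could be swapped in
                    z = z + dt * self.f(z)
                    frames.append(self.dec(z))
                return torch.stack(frames, dim=1)         # (batch, n_steps, frame_dim)

        model = LatentODEVideo()
        future = model(torch.rand(2, 64 * 64), n_steps=20)   # extrapolate 20 frames from one frame
        print(future.shape)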
    Multi-Branch Deep Radial Basis Function Networks for Facial Emotion Recognition. (arXiv:2109.03336v1 [cs.CV])
    (2 min) Emotion recognition (ER) from facial images is one of the landmark tasks in affective computing, with major developments in the last decade. Initial efforts on ER relied on handcrafted features that were used to characterize facial images and then fed to standard predictive models. Recent methodologies comprise end-to-end trainable deep learning methods that simultaneously learn both features and the predictive model. Perhaps the most successful models are based on convolutional neural networks (CNNs). While these models have excelled at this task, they still fail at capturing local patterns that could emerge in the learning process. We hypothesize these patterns could be captured by variants based on locally weighted learning. Specifically, in this paper we propose a CNN-based architecture enhanced with multiple branches formed by radial basis function (RBF) units that aims at exploiting local information at the final stage of the learning process. Intuitively, these RBF units capture local patterns shared by similar instances using an intermediate representation, and the outputs of the RBFs are then fed to a softmax layer that exploits this information to improve the predictive performance of the model. This feature could be particularly advantageous in ER, as cultural / ethnicity differences may be identified by the local units. We evaluate the proposed method on several ER datasets and show that the proposed methodology achieves state-of-the-art performance on some of them, even when we adopt a pre-trained VGG-Face model as backbone. We show it is the incorporation of local information that makes the proposed model competitive.
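    A minimal sketch of such an RBF branch (sizes and parameterization are illustrative assumptions, not the paper's exact design): activations are Gaussian similarities to learned prototype vectors, so each unit responds to a local neighbourhood of the feature space.

        import torch
        import torch.nn as nn

        class RBFBranch(nn.Module):
            def __init__(self, in_dim, n_units):
                super().__init__()
                self.centres = nn.Parameter(torch.randn(n_units, in_dim))   # learned prototypes
                self.log_gamma = nn.Parameter(torch.zeros(n_units))         # per-unit bandwidth

            def forward(self, x):                          # x: (batch, in_dim) CNN features
                d2 = torch.cdist(x, self.centres) ** 2     # squared distance to each centre
                return torch.exp(-torch.exp(self.log_gamma) * d2)

        # Outputs of several such branches could be concatenated and fed to the softmax layer.
        feats = torch.randn(8, 512)
        branch = RBFBranch(512, n_units=16)
        print(branch(feats).shape)                         # torch.Size([8, 16])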
    Elastic Significant Bit Quantization and Acceleration for Deep Neural Networks. (arXiv:2109.03513v1 [cs.CV])
    (2 min) Quantization has been proven to be a vital method for improving the inference efficiency of deep neural networks (DNNs). However, it is still challenging to strike a good balance between accuracy and efficiency while quantizing DNN weights or activation values from high-precision formats to their quantized counterparts. We propose a new method called elastic significant bit quantization (ESB) that controls the number of significant bits of quantized values to obtain better inference accuracy with fewer resources. We design a unified mathematical formula to constrain the quantized values of the ESB with a flexible number of significant bits. We also introduce a distribution difference aligner (DDA) to quantitatively align the distributions between the full-precision weight or activation values and quantized values. Consequently, ESB is suitable for various bell-shaped distributions of weights and activation of DNNs, thus maintaining a high inference accuracy. Benefitting from fewer significant bits of quantized values, ESB can reduce the multiplication complexity. We implement ESB as an accelerator and quantitatively evaluate its efficiency on FPGAs. Extensive experimental results illustrate that ESB quantization consistently outperforms state-of-the-art methods and achieves average accuracy improvements of 4.78%, 1.92%, and 3.56% over AlexNet, ResNet18, and MobileNetV2, respectively. Furthermore, ESB as an accelerator can achieve 10.95 GOPS peak performance of 1k LUTs without DSPs on the Xilinx ZCU102 FPGA platform. Compared with CPU, GPU, and state-of-the-art accelerators on FPGAs, the ESB accelerator can improve the energy efficiency by up to 65x, 11x, and 26x, respectively.
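    The notion of keeping a fixed number of significant bits can be illustrated with a small quantiser (an illustrative sketch, not ESB's exact formulation or its DDA alignment): each value keeps its binary exponent while its mantissa is rounded to k bits, so small and large magnitudes share the same relative precision.

        import numpy as np

        def quantize_significant_bits(x, k):
            # Round each value to k significant (mantissa) bits while keeping its exponent.
            x = np.asarray(x, dtype=np.float64)
            sign = np.sign(x)
            mag = np.abs(x)
            out = np.zeros_like(mag)
            nz = mag > 0
            exp = np.floor(np.log2(mag[nz]))               # exponent of each non-zero value
            step = 2.0 ** (exp - (k - 1))                  # quantisation step for k significant bits
            out[nz] = np.round(mag[nz] / step) * step
            return sign * out

        w = np.array([0.7431, -0.0162, 3.14159, 0.0])
        print(quantize_significant_bits(w, k=3))           # every value keeps 3 significant bits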
    Matching in the Dark: A Dataset for Matching Image Pairs of Low-light Scenes. (arXiv:2109.03585v1 [cs.CV])
    (2 min) This paper considers matching images of low-light scenes, aiming to widen the frontier of SfM and visual SLAM applications. Recent image sensors can record the brightness of scenes with more than eight-bit precision, available in their RAW-format image. We are interested in making full use of such high-precision information to match extremely low-light scene images that conventional methods cannot handle. For extreme low-light scenes, even if some of their brightness information exists in the RAW format images' low bits, the standard raw image processing on cameras fails to utilize them properly. As was recently shown by Chen et al., CNNs can learn to produce images with a natural appearance from such RAW-format images. To consider if and how well we can utilize such information stored in RAW-format images for image matching, we have created a new dataset named MID (matching in the dark). Using it, we experimentally evaluated combinations of eight image-enhancing methods and eleven image matching methods consisting of classical/neural local descriptors and classical/neural initial point-matching methods. The results show the advantage of using the RAW-format images and the strengths and weaknesses of the above component methods. They also imply there is room for further research.
    YouRefIt: Embodied Reference Understanding with Language and Gesture. (arXiv:2109.03413v1 [cs.CV])
    (2 min) We study the understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment. Of note, this new visual task requires understanding multimodal cues with perspective-taking to identify which object is being referred to. To tackle this problem, we introduce YouRefIt, a new crowd-sourced dataset of embodied reference collected in various physical scenes; the dataset contains 4,195 unique reference clips in 432 indoor scenes. To the best of our knowledge, this is the first embodied reference dataset that allows us to study referring expressions in daily physical scenes to understand referential behavior, human communication, and human-robot interaction. We further devise two benchmarks for image-based and video-based embodied reference understanding. Comprehensive baselines and extensive experiments provide the very first result of machine perception on how the referring expressions and gestures affect the embodied reference understanding. Our results provide essential evidence that gestural cues are as critical as language cues in understanding the embodied reference.
    Capturing the objects of vision with neural networks. (arXiv:2109.03351v1 [q-bio.NC])
    (2 min) Human visual perception carves a scene at its physical joints, decomposing the world into objects, which are selectively attended, tracked, and predicted as we engage our surroundings. Object representations emancipate perception from the sensory input, enabling us to keep in mind that which is out of sight and to use perceptual content as a basis for action and symbolic cognition. Human behavioral studies have documented how object representations emerge through grouping, amodal completion, proto-objects, and object files. Deep neural network (DNN) models of visual object recognition, by contrast, remain largely tethered to the sensory input, despite achieving human-level performance at labeling objects. Here, we review related work in both fields and examine how these fields can help each other. The cognitive literature provides a starting point for the development of new experimental tasks that reveal mechanisms of human object perception and serve as benchmarks driving development of deep neural network models that will put the object into object recognition.
    Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion. (arXiv:2109.03551v1 [cs.SD])
    (2 min) Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. Its validity rests on the assumption that the same phonemes of the speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of EL speech can break this assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, as we assume that the lip movements of laryngectomees remain similar to those of healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method significantly outperforms audio-only alignment in terms of objective and subjective evaluations.
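    The alignment machinery itself is plain dynamic time warping; only the per-frame features change. A generic DTW sketch follows (the lip descriptors and distance below are placeholders, not the paper's exact representations):

        import numpy as np

        def dtw_path(source_feats, target_feats, dist=lambda a, b: np.linalg.norm(a - b)):
            # source_feats/target_feats: sequences of per-frame descriptors, e.g. flattened
            # grayscale lip crops instead of acoustic features.
            n, m = len(source_feats), len(target_feats)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = dist(source_feats[i - 1], target_feats[j - 1])
                    cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            # Backtrack the optimal warping path.
            path, i, j = [], n, m
            while i > 0 and j > 0:
                path.append((i - 1, j - 1))
                step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
                i, j = (i - 1, j - 1) if step == 0 else ((i - 1, j) if step == 1 else (i, j - 1))
            return path[::-1], cost[n, m]

        src = np.random.rand(40, 256)                      # 40 source frames of 16x16 lip crops
        tgt = np.random.rand(55, 256)                      # 55 target frames
        alignment, total_cost = dtw_path(src, tgt)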
    GTT-Net: Learned Generalized Trajectory Triangulation. (arXiv:2109.03408v1 [cs.CV])
    (2 min) We present GTT-Net, a supervised learning framework for the reconstruction of sparse dynamic 3D geometry. We build on a graph-theoretic formulation of the generalized trajectory triangulation problem, where non-concurrent multi-view imaging geometry is known but global image sequencing is not provided. GTT-Net learns pairwise affinities modeling the spatio-temporal relationships among our input observations and leverages them to determine 3D geometry estimates. Experiments reconstructing 3D motion-capture sequences show GTT-Net outperforms the state of the art in terms of accuracy and robustness. Within the context of articulated motion reconstruction, our proposed architecture is 1) able to learn and enforce semantic 3D motion priors for shared training and test domains, while being 2) able to generalize its performance across different training and test domains. Moreover, GTT-Net provides a computationally streamlined framework for trajectory triangulation with applications to multi-instance reconstruction and event segmentation.
  • cs.IR updates on arXiv.org

    Chord Recognition- Music and Audio Information Retrieval. (arXiv:2105.07019v2 [cs.SD] UPDATED)
    (2 min) Music Information Retrieval (MIR) is a collaborative scientific field that helps to build innovative information research themes, novel frameworks and connected delivery mechanisms, in addition to making the world's massive collection of music accessible to everyone. Modern rock music proved difficult to analyse: tempo estimation and chord recognition did not work well. All of the findings indicate that modern rock and metal music can be analysed, despite its complexity, but that further research is needed in this area to make it useful. Using a neural network has been one of the simplest ways of dealing with the problem. The neural network method takes the pitch class profile vector as input; because the vector contains only 12 semitone-class elements, it is sufficient for chord recognition. Of course, there are other ways of achieving this task, most of which depend on pitch class profiling to transform the chord into a form that can be recognised, but the recognition process is time-consuming, centred on extremely complicated and memory-intensive methods.
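    A 12-element pitch class profile can be computed directly from a frame's magnitude spectrum by folding every FFT bin into its semitone class; the sketch below (sample rate, frame length and reference frequency are arbitrary choices) produces the kind of vector a chord-recognition network would take as input.

        import numpy as np

        def pitch_class_profile(frame, sr=22050, f_ref=440.0):
            # Accumulate each FFT bin's energy into the semitone class of its frequency.
            spectrum = np.abs(np.fft.rfft(frame))
            freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
            pcp = np.zeros(12)
            for f, mag in zip(freqs[1:], spectrum[1:]):    # skip the DC bin
                semitone = int(round(12 * np.log2(f / f_ref))) % 12
                pcp[semitone] += mag ** 2
            return pcp / (pcp.sum() + 1e-12)

        frame = np.sin(2 * np.pi * 440.0 * np.arange(2048) / 22050)   # a pure A4 tone
        print(np.argmax(pitch_class_profile(frame)))                  # 0, the class containing A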
    BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. (arXiv:2104.08663v3 [cs.IR] UPDATED)
    (2 min) Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show that BM25 is a robust baseline and that re-ranking and late-interaction-based models on average achieve the best zero-shot performance, albeit at high computational cost. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards more robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
    AutoDebias: Learning to Debias for Recommendation. (arXiv:2105.04170v4 [cs.LG] UPDATED)
    (2 min) Recommender systems rely on user behavior data, such as ratings and clicks, to build personalization models. However, the collected data is observational rather than experimental, causing various biases in the data which significantly affect the learned model. Most existing work on recommendation debiasing, such as inverse propensity scoring and imputation approaches, focuses on one or two specific biases, lacking the universal capacity to account for mixed or even unknown biases in the data. Towards this research gap, we first analyze the origin of biases from the perspective of risk discrepancy, which represents the difference between the expected empirical risk and the true risk. Remarkably, we derive a general learning framework that summarizes most existing debiasing strategies by specifying some parameters of the general framework. This provides a valuable opportunity to develop a universal solution for debiasing, e.g., by learning the debiasing parameters from data. However, the training data lacks important signal about how the data is biased and what the unbiased data looks like. To move this idea forward, we propose AutoDebias, which leverages another (small) set of uniform data to optimize the debiasing parameters by solving a bi-level optimization problem with meta-learning. Through theoretical analyses, we derive the generalization bound for AutoDebias and prove its ability to acquire the appropriate debiasing strategy. Extensive experiments on two real datasets and a simulated dataset demonstrate the effectiveness of AutoDebias. The code is available at https://github.com/DongHande/AutoDebias.
    Forget me not: A Gentle Reminder to Mind the Simple Multi-Layer Perceptron Baseline for Text Classification. (arXiv:2109.03777v1 [cs.CL])
    (2 min) Graph neural networks have triggered a resurgence of graph-based text classification. We show that a simple MLP baseline already achieves comparable performance on benchmark datasets, questioning the importance of synthetic graph structures. When considering an inductive scenario, i.e., when adding new documents to a corpus, a simple MLP even outperforms most graph-based models. We further fine-tune DistilBERT for comparison and find that it outperforms all state-of-the-art models. We suggest that future studies use at least an MLP baseline to contextualize the results. We provide recommendations for the design and training of such a baseline.
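    Such a baseline is a few lines with standard tooling; the sketch below (dataset and hyper-parameters are illustrative, not the paper's exact configuration) shows the kind of bag-of-words MLP the authors argue should always be reported.

        from sklearn.datasets import fetch_20newsgroups
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline

        train = fetch_20newsgroups(subset="train")
        test = fetch_20newsgroups(subset="test")
        baseline = make_pipeline(
            TfidfVectorizer(max_features=30000, sublinear_tf=True),   # bag-of-words features
            MLPClassifier(hidden_layer_sizes=(256,), max_iter=20),    # one hidden layer
        )
        baseline.fit(train.data, train.target)
        print("test accuracy:", baseline.score(test.data, test.target))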
    A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions. (arXiv:2109.03540v1 [cs.IR])
    (2 min) In light of the emergence of deep reinforcement learning (DRL) in recommender systems research and several fruitful results in recent years, this survey aims to provide a timely and comprehensive overview of the recent trends of deep reinforcement learning in recommender systems. We start with the motivation of applying DRL in recommender systems. Then, we provide a taxonomy of current DRL-based recommender systems and a summary of existing methods. We discuss emerging topics and open issues, and provide our perspective on advancing the domain. This survey serves as introductory material for readers from academia and industry into the topic and identifies notable opportunities for further research.
    R2-D2: A Modular Baseline for Open-Domain Question Answering. (arXiv:2109.03502v1 [cs.CL])
    (2 min) This work presents a novel four-stage open-domain QA pipeline, R2-D2 (Rank twice, reaD twice). The pipeline is composed of a retriever, a passage reranker, an extractive reader, a generative reader and a mechanism that aggregates the final prediction from all of the system's components. We demonstrate its strength across three open-domain QA datasets: NaturalQuestions, TriviaQA and EfficientQA, surpassing the state of the art on the first two. Our analysis demonstrates that: (i) combining the extractive and generative readers yields absolute improvements of up to 5 points in exact match and is at least twice as effective as a posterior-averaging ensemble of the same models with different parameters; (ii) the extractive reader with fewer parameters can match the performance of the generative reader on extractive QA datasets.
    Tracing Affordance and Item Adoption on Music Streaming Platforms. (arXiv:2109.03538v1 [cs.IR])
    (2 min) Popular music streaming platforms offer users a diverse network of content exploration through a triad of affordances: organic, algorithmic and editorial access modes. Whilst offering great potential for discovery, such platform developments also present the modern user with daily adoption decisions on two fronts: platform affordance adoption and the adoption of recommendations therein. Following a carefully constrained set of Deezer users over a 2-year observation period, our work explores the factors driving user behaviour in the broad sense, by differentiating users on the basis of their temporal daily usage, their adoption of the main platform affordances, and the ways in which they react to them, especially in terms of recommendation adoption. Diverging from a perspective common in studies on the effects of recommendation, we assume and confirm that users exhibit very diverse behaviours in using and adopting the platform affordances. The resulting complex and quite heterogeneous picture demonstrates that there is no blanket answer for the adoption of recommendation features and of the recommendations themselves.
    AppQ: Warm-starting App Recommendation Based on View Graphs. (arXiv:2109.03798v1 [cs.IR])
    (2 min) Current app ranking and recommendation systems are mainly based on user-generated information, e.g., number of downloads and ratings. However, new apps often have few (or even no) user feedback, suffering from the classic cold-start problem. How to quickly identify and then recommend new apps of high quality is a challenging issue. Here, a fundamental requirement is the capability to accurately measure an app's quality based on its inborn features, rather than user-generated features. Since users obtain first-hand experience of an app by interacting with its views, we speculate that the inborn features are largely related to the visual quality of individual views in an app and the ways the views switch to one another. In this work, we propose AppQ, a novel app quality grading and recommendation system that extracts inborn features of apps based on app source code. In particular, AppQ works in parallel to perform code analysis to extract app-level features as well as dynamic analysis to capture view-level layout hierarchy and the switching among views. Each app is then expressed as an attributed view graph, which is converted into a vector and fed to classifiers for recognizing its quality classes. Our evaluation with an app dataset from Google Play reports that AppQ achieves the best performance with accuracy of 85.0%. This shows a lot of promise to warm-start app grading and recommendation systems with AppQ.
    Dual Correction Strategy for Ranking Distillation in Top-N Recommender System. (arXiv:2109.03459v1 [cs.IR])
    (2 min) Knowledge Distillation (KD), which transfers the knowledge of a well-trained large model (teacher) to a small model (student), has become an important area of research for the practical deployment of recommender systems. Recently, Relaxed Ranking Distillation (RRD) has shown that distilling the ranking information in the recommendation list significantly improves performance. However, the method still has limitations in that 1) it does not fully utilize the prediction errors of the student model, which makes training less efficient, and 2) it only distills user-side ranking information, which provides an insufficient view under sparse implicit feedback. This paper presents the Dual Correction strategy for Distillation (DCD), which transfers ranking information from the teacher model to the student model in a more efficient manner. Most importantly, DCD uses the discrepancy between the teacher model and the student model predictions to decide which knowledge should be distilled. By doing so, DCD essentially provides learning guidance tailored to "correcting" what the student model has failed to predict accurately. This process is applied to transfer ranking information from the user side as well as the item side to address sparse implicit user feedback. Our experiments show that the proposed method outperforms state-of-the-art baselines, and ablation studies validate the effectiveness of each component.
  • cs.LG updates on arXiv.org

    Scaling-up Disentanglement for Image Translation. (arXiv:2103.14017v2 [cs.CV] UPDATED)
    (2 min) Image translation methods typically aim to manipulate a set of labeled attributes (given as supervision at training time, e.g. a domain label) while leaving the unlabeled attributes intact. Current methods achieve either (i) disentanglement, which exhibits low visual fidelity and can only be satisfied when the attributes are perfectly uncorrelated, or (ii) visually plausible translations, which are clearly not disentangled. In this work, we propose OverLORD, a single framework for disentangling labeled and unlabeled attributes as well as synthesizing high-fidelity images, which is composed of two stages: (i) Disentanglement: learning disentangled representations with latent optimization. Differently from previous approaches, we do not rely on adversarial training or any architectural biases. (ii) Synthesis: training feed-forward encoders for inferring the learned attributes and tuning the generator in an adversarial manner to increase the perceptual quality. When the labeled and unlabeled attributes are correlated, we model an additional representation that accounts for the correlated attributes and improves disentanglement. We highlight that our flexible framework covers multiple settings, such as disentangling labeled attributes, pose and appearance, localized concepts, and shape and texture. We present significantly better disentanglement with higher translation quality and greater output diversity than state-of-the-art methods.
    PermuteFormer: Efficient Relative Position Encoding for Long Sequences. (arXiv:2109.02377v2 [cs.CL] UPDATED)
    (2 min) A recent variation of the Transformer, Performer, scales the Transformer to longer sequences with a linear attention mechanism. However, it is not compatible with relative position encoding, which has advantages over absolute position encoding. In this paper, we discuss possible ways to add relative position encoding to Performer. Based on the analysis, we propose PermuteFormer, a Performer-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies a position-dependent transformation on queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by the absolute positions of tokens. PermuteFormer introduces negligible computational overhead by design, so it runs as fast as Performer. We evaluate PermuteFormer on Long-Range Arena, a dataset for long sequences, as well as WikiText-103, a language modeling dataset. The experiments show that PermuteFormer uniformly improves the performance of Performer with almost no computational overhead and outperforms the vanilla Transformer on most of the tasks.
    Predicting Intraoperative Hypoxemia with Joint Sequence Autoencoder Networks. (arXiv:2104.14756v3 [cs.LG] UPDATED)
    (2 min) We present an end-to-end model using streaming physiological time series to accurately predict near-term risk for hypoxemia, a rare, but life-threatening condition known to cause serious patient harm during surgery. Our proposed model makes inference on both hypoxemia outcomes and future input sequences, enabled by a joint sequence autoencoder that simultaneously optimizes a discriminative decoder for label prediction, and two auxiliary decoders trained for data reconstruction and forecast, which seamlessly learns future-indicative latent representation. All decoders share a memory-based encoder that helps capture the global dynamics of patient data. In a large surgical cohort of 73,536 surgeries at a major academic medical center, our model outperforms all baselines and gives a large performance gain over the state-of-the-art hypoxemia prediction system. With a high sensitivity cutoff at 80%, it presents 99.36% precision in predicting hypoxemia and 86.81% precision in predicting the much more severe and rare hypoxemic condition, persistent hypoxemia. With exceptionally low rate of false alarms, our proposed model is promising in improving clinical decision making and easing burden on the health system.
    Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. (arXiv:2104.01027v2 [cs.SD] UPDATED)
    (2 min) Self-supervised learning of speech representations has been a very active research area, but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data used for pre-training differs from the domain of the labeled data used for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at https://github.com/pytorch/fairseq.
    Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking. (arXiv:2109.03805v1 [cs.CV])
    (2 min) Panoptic scene understanding and tracking of dynamic agents are essential for robots and automated vehicles to navigate in urban environments. As LiDARs provide accurate illumination-independent geometric depictions of the scene, performing these tasks using LiDAR point clouds provides reliable predictions. However, existing datasets lack diversity in the type of urban scenes and have a limited number of dynamic object instances, which hinders both the learning of these tasks and credible benchmarking of the developed methods. In this paper, we introduce the large-scale Panoptic nuScenes benchmark dataset that extends our popular nuScenes dataset with point-wise groundtruth annotations for semantic segmentation, panoptic segmentation, and panoptic tracking tasks. To facilitate comparison, we provide several strong baselines for each of these tasks on our proposed dataset. Moreover, we analyze the drawbacks of the existing metrics for the panoptic tracking problem and propose a novel instance-centric metric that addresses the concerns. We present extensive experiments that demonstrate the utility of Panoptic nuScenes compared to existing datasets and make the online evaluation server available at nuScenes.org. We believe that this extension will accelerate the research of novel methods for scene understanding of dynamic urban environments.
    AppQ: Warm-starting App Recommendation Based on View Graphs. (arXiv:2109.03798v1 [cs.IR])
    (2 min) Current app ranking and recommendation systems are mainly based on user-generated information, e.g., number of downloads and ratings. However, new apps often have few (or even no) user feedback, suffering from the classic cold-start problem. How to quickly identify and then recommend new apps of high quality is a challenging issue. Here, a fundamental requirement is the capability to accurately measure an app's quality based on its inborn features, rather than user-generated features. Since users obtain first-hand experience of an app by interacting with its views, we speculate that the inborn features are largely related to the visual quality of individual views in an app and the ways the views switch to one another. In this work, we propose AppQ, a novel app quality grading and recommendation system that extracts inborn features of apps based on app source code. In particular, AppQ works in parallel to perform code analysis to extract app-level features as well as dynamic analysis to capture view-level layout hierarchy and the switching among views. Each app is then expressed as an attributed view graph, which is converted into a vector and fed to classifiers for recognizing its quality classes. Our evaluation with an app dataset from Google Play reports that AppQ achieves the best performance with accuracy of 85.0%. This shows a lot of promise to warm-start app grading and recommendation systems with AppQ.
    Discrete-Valued Latent Preference Matrix Estimation with Graph Side Information. (arXiv:2003.07040v2 [cs.IT] UPDATED)
    (0 min) Incorporating graph side information into recommender systems has been widely used to better predict ratings, but relatively few works have focused on theoretical guarantees. Ahn et al. (2018) firstly characterized the optimal sample complexity in the presence of graph side information, but the results are limited due to strict, unrealistic assumptions made on the unknown latent preference matrix and the structure of user clusters. In this work, we propose a new model in which 1) the unknown latent preference matrix can have any discrete values, and 2) users can be clustered into multiple clusters, thereby relaxing the assumptions made in prior work. Under this new model, we fully characterize the optimal sample complexity and develop a computationally-efficient algorithm that matches the optimal sample complexity. Our algorithm is robust to model errors and outperforms the existing algorithms in terms of prediction performance on both synthetic and real data.
    Interpretable and Efficient Heterogeneous Graph Convolutional Network. (arXiv:2005.13183v3 [cs.LG] UPDATED)
    (0 min) Graph Convolutional Network (GCN) has achieved extraordinary success in learning effective task-specific representations of nodes in graphs. However, regarding Heterogeneous Information Network (HIN), existing HIN-oriented GCN methods still suffer from two deficiencies: (1) they cannot flexibly explore all possible meta-paths and extract the most useful ones for a target object, which hinders both effectiveness and interpretability; (2) they often need to generate intermediate meta-path based dense graphs, which leads to high computational complexity. To address the above issues, we propose an interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN) to learn the representations of objects in HINs. It is designed as a hierarchical aggregation architecture, i.e., object-level aggregation first, followed by type-level aggregation. The novel architecture can automatically extract useful meta-paths for each object from all possible meta-paths (within a length limit), which brings good model interpretability. It can also reduce the computational cost by avoiding intermediate HIN transformation and neighborhood attention. We provide theoretical analysis about the proposed ie-HGCN in terms of evaluating the usefulness of all possible meta-paths, its connection to the spectral graph convolution on HINs, and its quasi-linear time complexity. Extensive experiments on three real network datasets demonstrate the superiority of ie-HGCN over the state-of-the-art methods.
    Quasi-Dense Similarity Learning for Multiple Object Tracking. (arXiv:2006.06664v4 [cs.CV] UPDATED)
    (2 min) Similarity learning has been recognized as a crucial step for object tracking. However, existing multiple object tracking methods only use sparse ground truth matching as the training objective, while ignoring the majority of the informative regions on the images. In this paper, we present Quasi-Dense Similarity Learning, which densely samples hundreds of region proposals on a pair of images for contrastive learning. We can directly combine this similarity learning with existing detection methods to build Quasi-Dense Tracking (QDTrack) without turning to displacement regression or motion priors. We also find that the resulting distinctive feature space admits a simple nearest neighbor search at the inference time. Despite its simplicity, QDTrack outperforms all existing methods on MOT, BDD100K, Waymo, and TAO tracking benchmarks. It achieves 68.7 MOTA at 20.3 FPS on MOT17 without using external training data. Compared to methods with similar detectors, it boosts almost 10 points of MOTA and significantly decreases the number of ID switches on BDD100K and Waymo datasets. Our code and trained models are available at this http URL
    Priming PCA with EigenGame. (arXiv:2109.03709v1 [cs.LG])
    (2 min) We introduce primed-PCA (pPCA), an extension of the recently proposed EigenGame algorithm for computing principal components in a large-scale setup. Our algorithm first runs EigenGame to get an approximation of the principal components, and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use of EigenGame, this second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of EigenGame is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon EigenGame under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we achieve improvements in convergence speed by factors of 5-25 on the datasets of the original EigenGame paper.
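    The two-step structure is easy to sketch (a simplified stand-in in which a random matrix plays the role of EigenGame's rough output): orthonormalise the approximate directions, project the centred data into the small subspace they span, and solve the exact eigenproblem there.

        import numpy as np

        def primed_pca(X, V_approx):
            # V_approx: d x k approximate principal directions; k is small, so the
            # exact eigendecomposition in the projected subspace is cheap.
            Xc = X - X.mean(axis=0)
            Q, _ = np.linalg.qr(V_approx)                  # orthonormal basis of the approximate span
            Y = Xc @ Q                                     # n x k projected data
            C = (Y.T @ Y) / (len(Y) - 1)                   # k x k covariance: a tiny exact problem
            vals, vecs = np.linalg.eigh(C)
            order = np.argsort(vals)[::-1]
            return Q @ vecs[:, order], vals[order]         # refined components in the original space

        X = np.random.randn(5000, 100)
        V0 = np.random.randn(100, 5)                       # stand-in for a cheap approximate first step
        components, variances = primed_pca(X, V0)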
    A New One-Point Residual-Feedback Oracle For Black-Box Learning and Control. (arXiv:2006.10820v2 [math.OC] UPDATED)
    (2 min) Zeroth-order optimization (ZO) algorithms have been recently used to solve black-box or simulation-based learning and control problems, where the gradient of the objective function cannot be easily computed but can be approximated using the objective function values. Many existing ZO algorithms adopt two-point feedback schemes due to their fast convergence rate compared to one-point feedback schemes. However, two-point schemes require two evaluations of the objective function at each iteration, which can be impractical in applications where the data are not all available a priori, e.g., in online optimization. In this paper, we propose a novel one-point feedback scheme that queries the function value once at each iteration and estimates the gradient using the residual between two consecutive points. When optimizing a deterministic Lipschitz function, we show that the query complexity of ZO with the proposed one-point residual feedback matches that of ZO with the existing two-point schemes. Moreover, the query complexity of the proposed algorithm can be improved when the objective function has Lipschitz gradient. Then, for stochastic bandit optimization problems where only noisy objective function values are given, we show that ZO with one-point residual feedback achieves the same convergence rate as that of two-point scheme with uncontrollable data samples. We demonstrate the effectiveness of the proposed one-point residual feedback via extensive numerical experiments.
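    The estimator itself is compact; the sketch below applies it to a toy quadratic (step sizes and the test function are illustrative, not the paper's tuned settings). Each iteration makes a single new function query and reuses the previous one as the residual.

        import numpy as np

        def zo_residual_feedback(f, x0, step=0.002, delta=0.5, iters=5000, seed=0):
            # g_t = (f(x_t + delta*u_t) - f(x_{t-1} + delta*u_{t-1})) / delta * u_t,
            # i.e. one new function query per iteration plus the remembered residual.
            rng = np.random.default_rng(seed)
            x = np.array(x0, dtype=float)
            f_prev = f(x + delta * rng.standard_normal(x.shape))   # the only extra query, made once
            for _ in range(iters):
                u = rng.standard_normal(x.shape)
                f_curr = f(x + delta * u)                  # single query this iteration
                grad_est = (f_curr - f_prev) / delta * u
                x -= step * grad_est
                f_prev = f_curr
            return x

        x_hat = zo_residual_feedback(lambda v: np.sum((v - 3.0) ** 2), x0=np.zeros(5))
        print(x_hat)                                       # drifts toward the minimiser [3, 3, 3, 3, 3]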
    Maximum and Leaky Maximum Propagation. (arXiv:2105.10277v2 [cs.LG] UPDATED)
    (2 min) In this work, we present an alternative to conventional residual connections, which is inspired by maxout nets. This means that instead of the addition in residual connections, our approach only propagates the maximum value or, in the leaky formulation, propagates a percentage of both. In our evaluation on different public data sets, we show that the presented approaches are comparable to residual connections and have other interesting properties, such as better generalization with a constant batch normalization, faster learning, and the possibility to generalize without additional activation functions. In addition, the proposed approaches work very well when formed into ensembles together with residual networks. https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/?p=%2FMaximumPropagation&mode=list
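    A sketch of one possible reading of the idea (the "leaky" mixing below is our own interpretation of "a percentage of both", not necessarily the authors' exact formulation): the residual addition x + F(x) is replaced by an element-wise maximum, optionally blended with the element-wise minimum.

        import torch
        import torch.nn as nn

        class MaxPropBlock(nn.Module):
            def __init__(self, dim, leak=0.0):
                super().__init__()
                self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.leak = leak                           # leak=0 -> pure maximum propagation

            def forward(self, x):
                fx = self.f(x)
                mx = torch.maximum(x, fx)                  # propagate the larger of skip and branch
                mn = torch.minimum(x, fx)
                return (1 - self.leak) * mx + self.leak * mn

        block = MaxPropBlock(64, leak=0.1)
        print(block(torch.randn(4, 64)).shape)             # torch.Size([4, 64])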
    Multiscale Laplacian Learning. (arXiv:2109.03718v1 [cs.LG])
    (2 min) Machine learning methods have greatly changed science, engineering, finance, business, and other fields. Despite the tremendous accomplishments of machine learning and deep learning methods, many challenges still remain. In particular, the performance of machine learning methods is often severely affected in case of diverse data, usually associated with smaller data sets or data related to areas of study where the size of the data sets is constrained by the complexity and/or high cost of experiments. Moreover, data with limited labeled samples is a challenge to most learning approaches. In this paper, the aforementioned challenges are addressed by integrating graph-based frameworks, multiscale structure, modified and adapted optimization procedures and semi-supervised techniques. This results in two innovative multiscale Laplacian learning (MLL) approaches for machine learning tasks, such as data classification, and for tackling diverse data, data with limited samples and smaller data sets. The first approach, called multikernel manifold learning (MML), integrates manifold learning with multikernel information and solves a regularization problem consisting of a loss function and a warped kernel regularizer using multiscale graph Laplacians. The second approach, called the multiscale MBO (MMBO) method, introduces multiscale Laplacians to a modification of the famous classical Merriman-Bence-Osher (MBO) scheme, and makes use of fast solvers for finding the approximations to the extremal eigenvectors of the graph Laplacian. We demonstrate the performance of our methods experimentally on a variety of data sets, such as biological, text and image data, and compare them favorably to existing approaches.
    Word Equations: Inherently Interpretable Sparse Word Embeddings through Sparse Coding. (arXiv:2004.13847v2 [cs.CL] UPDATED)
    (0 min) Word embeddings are a powerful natural language processing technique, but they are extremely difficult to interpret. To enable interpretable NLP models, we create vectors where each dimension is inherently interpretable. By inherently interpretable, we mean a system where each dimension is associated with some human-understandable hint that can describe the meaning of that dimension. In order to create more interpretable word embeddings, we transform pretrained dense word embeddings into sparse embeddings. These new embeddings are inherently interpretable: each of their dimensions is created from and represents a natural language word or specific grammatical concept. We construct these embeddings through sparse coding, where each vector in the basis set is itself a word embedding. Therefore, each dimension of our sparse vectors corresponds to a natural language word. We also show that models trained using these sparse embeddings can achieve good performance and are more interpretable in practice, including through human evaluations.
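    The construction can be illustrated with scikit-learn's sparse coder (the random vectors and basis words below are stand-ins for pretrained embeddings, purely for illustration): each dense vector is expressed as a sparse combination of atoms that are themselves word embeddings, so every non-zero coefficient points to a readable basis word.

        import numpy as np
        from sklearn.decomposition import sparse_encode

        rng = np.random.default_rng(0)
        basis_words = ["animal", "royal", "city", "sport", "food"]
        basis = rng.standard_normal((len(basis_words), 50))
        basis /= np.linalg.norm(basis, axis=1, keepdims=True)         # atoms = unit-norm word embeddings
        dense_vec = 0.8 * basis[0] + 0.3 * basis[1] + 0.02 * rng.standard_normal(50)

        codes = sparse_encode(dense_vec[None, :], basis, algorithm="lasso_lars", alpha=0.05)
        for word, coef in zip(basis_words, codes[0]):
            if abs(coef) > 1e-3:
                print(f"{word}: {coef:.2f}")                          # a sparse, word-level description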
    Active Learning by Acquiring Contrastive Examples. (arXiv:2109.03764v1 [cs.CL])
    (2 min) Common acquisition functions for active learning use either uncertainty or diversity sampling, aiming to select difficult and diverse data points from the pool of unlabeled data, respectively. In this work, leveraging the best of both worlds, we propose an acquisition function that opts for selecting contrastive examples, i.e. data points that are similar in the model feature space and yet for which the model outputs maximally different predictive likelihoods. We compare our approach, CAL (Contrastive Active Learning), with a diverse set of acquisition functions on four natural language understanding tasks and seven datasets. Our experiments show that CAL performs consistently better than or on par with the best performing baseline across all tasks, on both in-domain and out-of-domain data. We also conduct an extensive ablation study of our method, and we further analyze all actively acquired datasets, showing that CAL achieves a better trade-off between uncertainty and diversity compared to other strategies.
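    The acquisition rule can be sketched in a few lines (a simplified rendering of the idea with assumed array inputs, not the paper's exact pipeline): score each unlabeled point by the divergence between its predictive distribution and those of its nearest labeled neighbours in feature space.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def contrastive_acquisition(pool_feats, pool_probs, labeled_feats, labeled_probs,
                                    n_select=10, k=5):
            # High scores = points that look like labeled examples in feature space but
            # receive very different predictive distributions, i.e. contrastive examples.
            nn_index = NearestNeighbors(n_neighbors=k).fit(labeled_feats)
            _, idx = nn_index.kneighbors(pool_feats)
            eps = 1e-12
            scores = []
            for probs, neigh in zip(pool_probs, idx):
                neigh_probs = labeled_probs[neigh]
                kl = np.sum(neigh_probs * (np.log(neigh_probs + eps) - np.log(probs + eps)), axis=1)
                scores.append(kl.mean())
            return np.argsort(scores)[::-1][:n_select]     # indices of pool points to annotate next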
    Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions. (arXiv:2010.04678v2 [cs.MS] UPDATED)
    (2 min) Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
    FedGraphNN: A Federated Learning System and Benchmark for Graph Neural Networks. (arXiv:2104.07145v2 [cs.LG] UPDATED)
    (3 min) Graph Neural Network (GNN) research is rapidly growing thanks to the capacity of GNNs in learning distributed representations from graph-structured data. However, centralizing a massive amount of real-world graph data for GNN training is prohibitive due to privacy concerns, regulation restrictions, and commercial competitions. Federated learning (FL), a trending distributed learning paradigm, provides possibilities to solve this challenge while preserving data privacy. Despite recent advances in vision and language domains, there is no suitable platform for the FL of GNNs. To this end, we introduce FedGraphNN, an open FL benchmark system that can facilitate research on federated GNNs. FedGraphNN is built on a unified formulation of graph FL and contains a wide range of datasets from different domains, popular GNN models, and FL algorithms, with secure and efficient system support. Particularly for the datasets, we collect, preprocess, and partition 36 datasets from 7 domains, including both publicly available ones and specifically obtained ones such as hERG and Tencent. Our empirical analysis showcases the utility of our benchmark system, while exposing significant challenges in graph FL: federated GNNs perform worse in most datasets with a non-IID split than centralized GNNs; the GNN model that attains the best result in the centralized setting may not maintain its advantage in the FL setting. These results imply that more research efforts are needed to unravel the mystery behind federated GNNs. Moreover, our system performance analysis demonstrates that the FedGraphNN system is computationally efficient and secure for large-scale graph datasets. We maintain the source code at https://github.com/FedML-AI/FedGraphNN.
    The use of Generative Adversarial Networks to characterise new physics in multi-lepton final states at the LHC. (arXiv:2105.14933v2 [hep-ph] UPDATED)
    (2 min) Semi-supervision in Machine Learning can be used in searches for new physics where the signal-plus-background regions are not labelled. This strongly reduces model dependency in the search for signals Beyond the Standard Model. The approach has the drawback that over-fitting can give rise to fake signals. Tossing toy Monte Carlo (MC) events can be used to estimate the corresponding trials factor through frequentist inference. However, MC events that are based on full detector simulations are resource intensive. Generative Adversarial Networks (GANs) can be used to mimic MC generators. GANs are powerful generative models, but often suffer from training instability. We therefore present a review of GANs. We advocate the use of the Wasserstein GAN (WGAN) with weight clipping and the WGAN with gradient penalty (WGAN-GP), where the norm of the gradient of the critic is penalized with respect to its input. Following the emergence of multi-lepton anomalies at the LHC, we apply GANs to the generation of di-lepton final states in association with b-quarks at the LHC. A good agreement between the MC events and the WGAN-GP events is found for the observables selected in the study.
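    For reference, the gradient-penalty term can be sketched as below (a generic WGAN-GP example over flat feature vectors, not the specific setup used for the di-lepton study): the critic's gradient norm is pushed toward 1 at random points interpolated between real and generated samples.

        import torch

        def gradient_penalty(critic, real, fake, lambda_gp=10.0):
            # real, fake: (N, D) flattened samples; for images the eps tensor would need
            # extra singleton dimensions to broadcast over channels and pixels.
            eps = torch.rand(real.size(0), 1, device=real.device)
            interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
            crit_out = critic(interp)
            grads = torch.autograd.grad(outputs=crit_out.sum(), inputs=interp,
                                        create_graph=True)[0]
            return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()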
    Transformers in Vision: A Survey. (arXiv:2101.01169v3 [cs.CV] UPDATED)
    (2 min) Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences as compared to recurrent networks, e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future works.
    Highly Scalable and Provably Accurate Classification in Poincare Balls. (arXiv:2109.03781v1 [cs.LG])
    (2 min) Many high-dimensional and large-volume data sets of practical relevance have hierarchical structures induced by trees, graphs or time series. Such data sets are hard to process in Euclidean spaces and one often seeks low-dimensional embeddings in other space forms to perform the required learning tasks. For hierarchical data, the space of choice is a hyperbolic space since it guarantees low-distortion embeddings for tree-like structures. Unfortunately, the geometry of hyperbolic spaces has properties not encountered in Euclidean spaces that pose challenges when trying to rigorously analyze algorithmic solutions. Here, for the first time, we establish a unified framework for learning scalable and simple hyperbolic linear classifiers with provable performance guarantees. The gist of our approach is to focus on Poincaré ball models and formulate the classification problems using tangent space formalisms. Our results include a new hyperbolic and second-order perceptron algorithm as well as an efficient and highly accurate convex optimization setup for hyperbolic support vector machine classifiers. All algorithms provably converge and are highly scalable, as they have complexities comparable to those of their Euclidean counterparts. We demonstrate their performance on synthetic data sets comprising millions of points, as well as on complex real-world data sets such as single-cell RNA-seq expression measurements, CIFAR10, Fashion-MNIST and mini-ImageNet.
    Privacy Enhancing Machine Learning via Removal of Unwanted Dependencies. (arXiv:2007.15710v4 [cs.LG] UPDATED)
    (2 min) The rapid rise of IoT and Big Data has facilitated copious data driven applications to enhance our quality of life. However, the omnipresent and all-encompassing nature of the data collection can generate privacy concerns. Hence, there is a strong need to develop techniques that ensure the data serve only the intended purposes, giving users control over the information they share. To this end, this paper studies new variants of supervised and adversarial learning methods, which remove the sensitive information in the data before they are sent out for a particular application. The explored methods optimize privacy preserving feature mappings and predictive models simultaneously in an end-to-end fashion. Additionally, the models are built with an emphasis on placing little computational burden on the user side so that the data can be desensitized on device in a cheap manner. Experimental results on mobile sensing and face datasets demonstrate that our models can successfully maintain the utility performances of predictive models while causing sensitive predictions to perform poorly.
    FedZKT: Zero-Shot Knowledge Transfer towards Heterogeneous On-Device Models in Federated Learning. (arXiv:2109.03775v1 [cs.LG])
    (2 min) Federated learning enables distributed devices to collaboratively learn a shared prediction model without centralizing on-device training data. Most current algorithms require comparable individual efforts to train on-device models of the same structure and size, impeding participation from resource-constrained devices. Given the widespread yet heterogeneous devices nowadays, this paper proposes a new framework supporting federated learning across heterogeneous on-device models via Zero-shot Knowledge Transfer, named FedZKT. Specifically, FedZKT allows participating devices to independently determine their on-device models. To transfer knowledge across on-device models, FedZKT develops a zero-shot distillation approach, in contrast to prior research based on a public dataset or a pre-trained data generator. To minimize the on-device workload, the resource-intensive distillation task is assigned to the server, which constructs a generator to adversarially train with the ensemble of the received heterogeneous on-device models. The distilled central knowledge is then sent back in the form of the corresponding on-device model parameters, which can be easily absorbed on the device side. Experiments demonstrate the effectiveness and robustness of FedZKT towards heterogeneous on-device models and challenging federated learning scenarios, such as non-iid data distributions and straggler effects.
    Continuous Safety Verification of Neural Networks. (arXiv:2010.05689v2 [cs.LG] UPDATED)
    (2 min) Deploying deep neural networks (DNNs) as core functions in autonomous driving creates unique verification and validation challenges. In particular, the continuous engineering paradigm of gradually perfecting a DNN-based perception can make the previously established result of safety verification no longer valid. This can occur either due to newly encountered examples (i.e., input domain enlargement) inside the Operational Design Domain or due to subsequent parameter fine-tuning activities of a DNN. This paper considers approaches to transfer results established in the previous DNN safety verification problem to the modified problem setting. By considering the reuse of state abstractions, network abstractions, and Lipschitz constants, we develop several sufficient conditions that only require formally analyzing a small part of the DNN in the new problem. The overall concept is evaluated in a $1/10$-scaled vehicle equipped with a DNN controller that determines the visual waypoint from the perceived image.
    Improving Adversarial Robustness via Attention and Adversarial Logit Pairing. (arXiv:1908.11435v2 [cs.LG] UPDATED)
    (2 min) Though deep neural networks have achieved state-of-the-art performance in visual classification, recent studies have shown that they are all vulnerable to the attack of adversarial examples. In this paper, we develop improved techniques for defending against adversarial examples. First, we propose an enhanced defense technique denoted Attention and Adversarial Logit Pairing (AT+ALP), which encourages both the attention map and the logits for pairs of examples to be similar. When applied to clean examples and their adversarial counterparts, AT+ALP improves accuracy on adversarial examples over adversarial training. We show that AT+ALP can effectively increase the average activations of adversarial examples in the key area and demonstrate that it focuses on discriminative features to improve the robustness of the model. Finally, we conduct extensive experiments using a wide range of datasets and the experimental results show that our AT+ALP achieves state-of-the-art defense performance. For example, on the 17 Flower Category Database, under strong 200-iteration PGD gray-box and black-box attacks where prior art has 34% and 39% accuracy, our method achieves 50% and 51%. Compared with previous work, our work is evaluated under a highly challenging PGD attack: the maximum perturbation $\epsilon \in \{0.25,0.5\}$ i.e. $L_\infty \in \{0.25,0.5\}$ with 10 to 200 attack iterations. To the best of our knowledge, such a strong attack has not been previously explored on a wide range of datasets.
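    A minimal sketch of the logit-pairing idea described above is given below (PyTorch): the usual adversarial-training cross-entropy is combined with a penalty that pulls the logits of each clean/adversarial pair together. The attention-map pairing term of AT+ALP is omitted and the weighting factor is a placeholder, so this illustrates the mechanism rather than the paper's exact loss.

        import torch
        import torch.nn.functional as F

        def logit_pairing_loss(model, x_clean, x_adv, y, lam=0.5):
            # Cross-entropy on adversarial examples (adversarial training term)
            # plus an L2 penalty between clean and adversarial logits (pairing term).
            logits_clean = model(x_clean)
            logits_adv = model(x_adv)
            ce = F.cross_entropy(logits_adv, y)
            pairing = ((logits_clean - logits_adv) ** 2).mean()
            return ce + lam * pairing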
    Learning to Communicate Using Counterfactual Reasoning. (arXiv:2006.07200v2 [cs.LG] UPDATED)
    (2 min) Learning to communicate in order to share state information is an active problem in the area of multi-agent reinforcement learning (MARL). The credit assignment problem, the non-stationarity of the communication environment and the creation of influenceable agents are major challenges within this research field which need to be overcome in order to learn a valid communication protocol. This paper introduces the novel multi-agent counterfactual communication learning (MACC) method which adapts counterfactual reasoning in order to overcome the credit assignment problem for communicating agents. Secondly, the non-stationarity of the communication environment while learning the communication Q-function is overcome by creating the communication Q-function using the action policy of the other agents and the Q-function of the action environment. Additionally, a social loss function is introduced in order to create influenceable agents which is required to learn a valid communication protocol. Our experiments show that MACC is able to outperform the state-of-the-art baselines in four different scenarios in the Particle environment.
    Challenges and Countermeasures for Adversarial Attacks on Deep Reinforcement Learning. (arXiv:2001.09684v2 [cs.LG] UPDATED)
    (2 min) Deep Reinforcement Learning (DRL) has numerous applications in the real world thanks to its outstanding ability in quickly adapting to the surrounding environments. Despite its great advantages, DRL is susceptible to adversarial attacks, which precludes its use in real-life critical systems and applications (e.g., smart grids, traffic controls, and autonomous vehicles) unless its vulnerabilities are addressed and mitigated. Thus, this paper provides a comprehensive survey that discusses emerging attacks in DRL-based systems and the potential countermeasures to defend against these attacks. We first cover some fundamental backgrounds about DRL and present emerging adversarial attacks on machine learning techniques. We then investigate more details of the vulnerabilities that the adversary can exploit to attack DRL along with the state-of-the-art countermeasures to prevent such attacks. Finally, we highlight open issues and research challenges for developing solutions to deal with attacks for DRL-based intelligent systems.
    A robust approach for deep neural networks in presence of label noise: relabelling and filtering instances during training. (arXiv:2109.03748v1 [cs.LG])
    (0 min) Deep learning has outperformed other machine learning algorithms in a variety of tasks, and as a result, it has become more and more popular and widely used. However, like other machine learning algorithms, deep learning, and convolutional neural networks (CNNs) in particular, performs worse when the data sets contain label noise. Therefore, it is important to develop algorithms that help the training of deep networks and their generalization to noise-free test sets. In this paper, we propose a robust training strategy against label noise, called RAFNI, that can be used with any CNN. This algorithm filters and relabels instances of the training set based on the predictions, and their probabilities, made by the backbone neural network during the training process. That way, the algorithm improves the generalization ability of the CNN on its own. RAFNI consists of three mechanisms: two mechanisms that filter instances and one mechanism that relabels instances. In addition, it neither assumes that the noise rate is known nor requires it to be estimated. We evaluated our algorithm using different data sets of several sizes and characteristics. We also compared it with state-of-the-art models using the CIFAR10 and CIFAR100 benchmarks under different types and rates of label noise and found that RAFNI achieves better results in most cases.
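    The filtering/relabelling step can be sketched as follows. The thresholds and the exact rules below are placeholders chosen for illustration; RAFNI uses its own mechanisms and does not necessarily apply these values.

        import numpy as np

        def filter_and_relabel(probs, labels, relabel_thr=0.95, filter_thr=0.05):
            # probs: (n, num_classes) softmax outputs from the backbone network.
            # Relabel an instance when the network is very confident in another class;
            # drop an instance when its current label has very low probability and no
            # alternative class is confident enough.
            probs = np.asarray(probs)
            labels = np.asarray(labels).copy()
            pred = probs.argmax(axis=1)
            conf = probs.max(axis=1)
            relabel = (pred != labels) & (conf >= relabel_thr)
            labels[relabel] = pred[relabel]
            label_prob = probs[np.arange(len(labels)), labels]
            keep = ~((label_prob < filter_thr) & (conf < relabel_thr))
            return labels, keep   # keep is a boolean mask over the training instances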
    Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models. (arXiv:2004.13805v4 [cs.CL] UPDATED)
    (0 min) As it has been unveiled that pre-trained language models (PLMs) are to some extent capable of recognizing syntactic concepts in natural language, much effort has been made to develop a method for extracting complete (binary) parses from PLMs without training separate parsers. We improve upon this paradigm by proposing a novel chart-based method and an effective top-K ensemble technique. Moreover, we demonstrate that we can broaden the scope of application of the approach into multilingual settings. Specifically, we show that by applying our method on multilingual PLMs, it becomes possible to induce non-trivial parses for sentences from nine languages in an integrated and language-agnostic manner, attaining performance superior or comparable to that of unsupervised PCFGs. We also verify that our approach is robust to cross-lingual transfer. Finally, we provide analyses on the inner workings of our method. For instance, we discover universal attention heads which are consistently sensitive to syntactic information irrespective of the input language.
    Parkinson's Disease Diagnosis based on Gait Cycle Analysis Through an Interpretable Interval Type-2 Neuro-Fuzzy System. (arXiv:2109.02442v2 [cs.LG] UPDATED)
    (2 min) In this paper, an interpretable classifier using an interval type-2 fuzzy neural network for detecting patients suffering from Parkinson's Disease (PD) based on analyzing the gait cycle is presented. The proposed method utilizes clinical features extracted from the vertical Ground Reaction Force (vGRF), measured by 16 wearable sensors placed in the soles of subjects' shoes, and learns interpretable fuzzy rules. Therefore, experts can verify the decision made by the proposed method by investigating the firing strength of the interpretable fuzzy rules. Moreover, experts can utilize the extracted fuzzy rules for diagnosing patients or adjust them based on their knowledge. To improve the robustness of the proposed method against uncertainty and noisy sensor measurements, Interval Type-2 Fuzzy Logic is applied. To learn the fuzzy rules, two paradigms are proposed: 1) a batch learning approach based on clustering available samples to extract the initial fuzzy rules, and 2) a complementary online learning approach to improve the rule base as new labeled samples are encountered. The performance of the method is evaluated for classifying patients and healthy subjects in different conditions, including the presence of noise or the observation of new instances. Moreover, the performance of the model is compared to several previous supervised and unsupervised machine learning approaches. The final Accuracy, Precision, Recall, and F1 Score of the proposed method are 88.74%, 89.41%, 95.10%, and 92.16%, respectively. Finally, the extracted fuzzy sets for each feature are reported.
    Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification. (arXiv:2012.03173v2 [cs.LG] UPDATED)
    (3 min) Deep AUC Maximization (DAM) is a new paradigm for learning a deep neural network by maximizing the AUC score of the model on a dataset. Most previous work on AUC maximization focuses on the perspective of optimization by designing efficient stochastic algorithms, and studies on the generalization performance of large-scale DAM on difficult tasks are missing. In this work, we aim to make DAM more practical for interesting real-world applications (e.g., medical image classification). First, we propose a new margin-based min-max surrogate loss function for the AUC score (named the AUC min-max-margin loss, or simply the AUC margin loss). It is more robust than the commonly used AUC square loss, while enjoying the same advantage in terms of large-scale stochastic optimization. Second, we conduct extensive empirical studies of our DAM method on four difficult medical image classification tasks, namely (i) classification of chest x-ray images for identifying many threatening diseases, (ii) classification of images of skin lesions for identifying melanoma, (iii) classification of mammograms for breast cancer screening, and (iv) classification of microscopic images for identifying tumor tissue. Our studies demonstrate that the proposed DAM method improves the performance of optimizing cross-entropy loss by a large margin, and also achieves better performance than optimizing the existing AUC square loss on these medical image classification tasks. Specifically, our DAM method achieved 1st place in the Stanford CheXpert competition on Aug. 31, 2020. To the best of our knowledge, this is the first work that makes DAM succeed on large-scale medical image datasets. We also conduct extensive ablation studies to demonstrate the advantages of the new AUC margin loss over the AUC square loss on benchmark datasets. The proposed method is implemented in our open-sourced library LibAUC (www.libauc.org).
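    For intuition, a generic pairwise margin-based AUC surrogate is sketched below in PyTorch: it penalizes positive/negative score pairs whose difference falls short of a margin. This only illustrates what "margin-based surrogate for the AUC score" refers to; it is not the min-max AUC margin loss proposed in the paper (for that, see the LibAUC library referenced above), and the margin value is a placeholder.

        import torch

        def pairwise_auc_margin_surrogate(scores, labels, margin=1.0):
            # scores: (n,) model outputs; labels: (n,) with 1 for positive, 0 for negative.
            pos = scores[labels == 1]
            neg = scores[labels == 0]
            if pos.numel() == 0 or neg.numel() == 0:
                return scores.sum() * 0.0   # degenerate batch: no pairs to compare
            diff = pos.unsqueeze(1) - neg.unsqueeze(0)   # all positive-negative score pairs
            return torch.clamp(margin - diff, min=0).pow(2).mean()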
    Detecting False Data Injection Attacks in Smart Grids with Modeling Errors: A Deep Transfer Learning Based Approach. (arXiv:2104.06307v3 [eess.SP] UPDATED)
    (2 min) Most traditional false data injection attack (FDIA) detection approaches rely on a key assumption, i.e., the power system can be accurately modeled. However, the transmission line parameters are dynamic and cannot be accurately known during operation, and thus the involved modeling errors should not be neglected. In this paper, an illustrative case reveals that modeling errors in transmission lines significantly weaken the detection effectiveness of conventional FDIA approaches. To tackle this issue, we propose an FDIA detection mechanism from the perspective of transfer learning. Specifically, the simulated power system is treated as a source domain, which provides abundant simulated normal and attack data. The running real-world system, whose transmission line parameters are unknown, is taken as the target domain, where sufficient real normal data are collected for tracking the latest system states online. The designed transfer strategy, which aims at making full use of the data at hand, is divided into two optimization stages. In the first stage, a deep neural network (DNN) is built by simultaneously optimizing several well-designed objective terms with both simulated data and real data, and then it is fine-tuned via real data in the second stage. Several case studies on the IEEE 14-bus and 118-bus systems verify the effectiveness of the proposed mechanism.
    GitTables: A Large-Scale Corpus of Relational Tables. (arXiv:2106.07258v2 [cs.DB] UPDATED)
    (2 min) The practical success of deep learning has sparked interest in improving relational table tasks, like data search, with models trained on large table corpora. Existing corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need additional resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of currently 1.7M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 20M tables. We annotate table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions. The corpus is available at https://gittables.github.io. Our analysis of GitTables shows that its structure, content, and topical coverage differ significantly from existing table corpora. We evaluate our annotation pipeline on hand-labeled tables from the T2Dv2 benchmark and find that our approach provides results on par with human annotations. We demonstrate a use case of GitTables by training a semantic type detection model on it, obtaining high prediction accuracy. We also show that the same model trained on tables from the Web generalizes poorly.
    Image-based Plant Disease Diagnosis with Unsupervised Anomaly Detection Based on Reconstructability of Colors. (arXiv:2011.14306v5 [cs.CV] UPDATED)
    (2 min) This paper proposes an unsupervised anomaly detection technique for image-based plant disease diagnosis. The construction of large and publicly available datasets containing labeled images of healthy and diseased crop plants led to growing interest in computer vision techniques for automatic plant disease diagnosis. Although supervised image classifiers based on deep learning can be a powerful tool for plant disease diagnosis, they require a huge amount of labeled data. The data mining technique of anomaly detection includes unsupervised approaches that do not require rare samples for training classifiers. We propose an unsupervised anomaly detection technique for image-based plant disease diagnosis that is based on the reconstructability of colors; a deep encoder-decoder network trained to reconstruct the colors of \textit{healthy} plant images should fail to reconstruct colors of symptomatic regions. Our proposed method includes a new image-based framework for plant disease detection that utilizes a conditional adversarial network called pix2pix and a new anomaly score based on CIEDE2000 color difference. Experiments with PlantVillage dataset demonstrated the superiority of our proposed method compared to an existing anomaly detector at identifying diseased crop images in terms of accuracy, interpretability and computational efficiency.
    Emergent Unfairness in Algorithmic Fairness-Accuracy Trade-Off Research. (arXiv:2102.01203v3 [cs.CY] UPDATED)
    (2 min) Across machine learning (ML) sub-disciplines, researchers make explicit mathematical assumptions in order to facilitate proof-writing. We note that, specifically in the area of fairness-accuracy trade-off optimization scholarship, similar attention is not paid to the normative assumptions that ground this approach. Such assumptions presume that 1) accuracy and fairness are in inherent opposition to one another, 2) strict notions of mathematical equality can adequately model fairness, 3) it is possible to measure the accuracy and fairness of decisions independent from historical context, and 4) collecting more data on marginalized individuals is a reasonable solution to mitigate the effects of the trade-off. We argue that such assumptions, which are often left implicit and unexamined, lead to inconsistent conclusions: While the intended goal of this work may be to improve the fairness of machine learning models, these unexamined, implicit assumptions can in fact result in emergent unfairness. We conclude by suggesting a concrete path forward toward a potential resolution.
    tvopt: A Python Framework for Time-Varying Optimization. (arXiv:2011.07119v2 [cs.MS] UPDATED)
    (2 min) This paper introduces tvopt, a Python framework for prototyping and benchmarking time-varying (or online) optimization algorithms. The paper first describes the theoretical approach that informed the development of tvopt. Then it discusses the different components of the framework and their use for modeling and solving time-varying optimization problems. In particular, tvopt provides functionalities for defining both centralized and distributed online problems, and a collection of built-in algorithms to solve them, for example gradient-based methods, ADMM and other splitting methods. Moreover, the framework implements prediction strategies to improve the accuracy of the online solvers. The paper then proposes some numerical results on a benchmark problem and discusses their implementation using tvopt. The code for tvopt is available at https://github.com/nicola-bastianello/tvopt.
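    As a toy illustration of the kind of problem tvopt targets (written in plain NumPy rather than the tvopt API), the snippet below tracks the minimizer of a cost whose target drifts over time by taking one online gradient step per sampling instant. The cost, drift, and step size are all placeholders.

        import numpy as np

        def b(t):
            # Drifting target of the time-varying cost f_t(x) = 0.5 * ||x - b(t)||^2.
            return np.array([np.cos(0.1 * t), np.sin(0.1 * t)])

        x = np.zeros(2)
        step = 0.5
        for t in range(100):
            grad = x - b(t)        # gradient of f_t at the current iterate
            x = x - step * grad    # one online gradient step per time instant
        print("final tracking error:", np.linalg.norm(x - b(99)))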
    How Can Increased Randomness in Stochastic Gradient Descent Improve Generalization?. (arXiv:2108.09507v2 [stat.ML] UPDATED)
    (3 min) Recent works report that increasing the learning rate or decreasing the minibatch size in stochastic gradient descent (SGD) can improve test set performance. We argue this is expected under some conditions in models with a loss function with multiple local minima. Our main contribution is an approximate but analytical approach inspired by methods in Physics to study the role of the SGD learning rate and batch size in generalization. We characterize test set performance under a shift between the training and test data distributions for loss functions with multiple minima. The shift can simply be due to sampling, and is therefore typically present in practical applications. We show that the resulting shift in local minima worsens test performance by picking up curvature, implying that generalization improves by selecting wide and/or little-shifted local minima. We then specialize to SGD, and study its test performance under stationarity. Because obtaining the exact stationary distribution of SGD is intractable, we derive a Fokker-Planck approximation of SGD and obtain its stationary distribution instead. This process shows that the learning rate divided by the minibatch size plays a role analogous to temperature in statistical mechanics, and implies that SGD, including its stationary distribution, is largely invariant to changes in learning rate or batch size that leave its temperature constant. We show that increasing SGD temperature encourages the selection of local minima with lower curvature, and can enable better generalization. We provide experiments on CIFAR10 demonstrating the temperature invariance of SGD, improvement of the test loss as SGD temperature increases, and quantifying the impact of sampling versus domain shift in driving this effect. Finally, we present synthetic experiments showing how our theory applies in a simplified loss with two local minima.
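    The learning-rate-over-batch-size "temperature" mentioned above suggests a simple invariance check: two SGD configurations with the same ratio should behave similarly. The sketch below only demonstrates the bookkeeping with placeholder values; it is not the paper's experiment.

        # Two configurations with matched temperature T = learning_rate / batch_size.
        config_a = {"learning_rate": 0.10, "batch_size": 128}   # T = 0.10 / 128
        config_b = {"learning_rate": 0.05, "batch_size": 64}    # T = 0.05 / 64 (same T)

        def temperature(cfg):
            return cfg["learning_rate"] / cfg["batch_size"]

        assert abs(temperature(config_a) - temperature(config_b)) < 1e-12
        # Halving both the learning rate and the batch size leaves the temperature,
        # and (per the argument above) the stationary behavior of SGD, unchanged.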
    Desiderata for Representation Learning: A Causal Perspective. (arXiv:2109.03795v1 [stat.ML])
    (2 min) Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious, efficient, or disentangled. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing non-spuriousness and efficiency (in supervised representation learning) and disentanglement (in unsupervised representation learning) using counterfactual quantities and observable consequences of causal assertions. This yields computable metrics that can be used to assess the degree to which representations satisfy the desiderata of interest and learn non-spurious and disentangled representations from single observational datasets.
    Training Algorithm Matters for the Performance of Neural Network Potential. (arXiv:2109.03769v1 [physics.chem-ph])
    (2 min) One hidden yet important issue for developing neural network potentials (NNPs) is the choice of training algorithm. Here we compare the performance of two popular training algorithms, the adaptive moment estimation algorithm (Adam) and the extended Kalman filter algorithm (EKF), using the Behler-Parrinello neural network (BPNN) and two publicly accessible datasets of liquid water. It is found that NNPs trained with EKF are more transferable and less sensitive to the value of the learning rate, as compared to Adam. In both cases, error metrics of the test set do not always serve as a good indicator for the actual performance of NNPs. Instead, we show that their performance correlates well with a Fisher information based similarity measure.
    Wanderlust: Online Continual Object Detection in the Real World. (arXiv:2108.11005v2 [cs.CV] UPDATED)
    (2 min) Online continual learning from data streams in dynamic environments is a critical direction in the computer vision field. However, realistic benchmarks and fundamental studies in this line are still missing. To bridge the gap, we present a new online continual object detection benchmark with an egocentric video dataset, Objects Around Krishna (OAK). OAK adopts the KrishnaCAM videos, an ego-centric video stream collected over nine months by a graduate student. OAK provides exhaustive bounding box annotations of 80 video snippets (~17.5 hours) for 105 object categories in outdoor scenes. The emergence of new object categories in our benchmark follows a pattern similar to what a single person might see in their day-to-day life. The dataset also captures the natural distribution shifts as the person travels to different places. These egocentric long-running videos provide a realistic playground for continual learning algorithms, especially in online embodied settings. We also introduce new evaluation metrics to evaluate the model performance and catastrophic forgetting and provide baseline studies for online continual object detection. We believe this benchmark will pose new exciting challenges for learning from non-stationary data in continual learning. The OAK dataset and the associated benchmark are released at https://oakdata.github.io/.
    Anatomical-Guided Attention Enhances Unsupervised PET Image Denoising Performance. (arXiv:2109.00802v2 [physics.med-ph] UPDATED)
    (0 min) Although supervised convolutional neural networks (CNNs) often outperform conventional alternatives for denoising positron emission tomography (PET) images, they require many low- and high-quality reference PET image pairs. Herein, we propose an unsupervised 3D PET image denoising method based on an anatomical information-guided attention mechanism. The proposed magnetic resonance-guided deep decoder (MR-GDD) utilizes the spatial details and semantic features of MR-guidance image more effectively by introducing encoder-decoder and deep decoder subnetworks. Moreover, the specific shapes and patterns of the guidance image do not affect the denoised PET image, because the guidance image is input to the network through an attention gate. In a Monte Carlo simulation of [$^{18}$F]fluoro-2-deoxy-D-glucose (FDG), the proposed method achieved the highest peak signal-to-noise ratio and structural similarity (27.92 $\pm$ 0.44 dB/0.886 $\pm$ 0.007), as compared with Gaussian filtering (26.68 $\pm$ 0.10 dB/0.807 $\pm$ 0.004), image guided filtering (27.40 $\pm$ 0.11 dB/0.849 $\pm$ 0.003), deep image prior (DIP) (24.22 $\pm$ 0.43 dB/0.737 $\pm$ 0.017), and MR-DIP (27.65 $\pm$ 0.42 dB/0.879 $\pm$ 0.007). Furthermore, we experimentally visualized the behavior of the optimization process, which is often unknown in unsupervised CNN-based restoration problems. For preclinical (using [$^{18}$F]FDG and [$^{11}$C]raclopride) and clinical (using [$^{18}$F]florbetapir) studies, the proposed method demonstrates state-of-the-art denoising performance while retaining spatial resolution and quantitative accuracy, despite using a common network architecture for various noisy PET images with 1/10th of the full counts. These results suggest that the proposed MR-GDD can reduce PET scan times and PET tracer doses considerably without impacting patients.
    On the Fundamental Trade-offs in Learning Invariant Representations. (arXiv:2109.03386v1 [cs.LG])
    (2 min) Many applications of representation learning, such as privacy-preservation, algorithmic fairness and domain adaptation, desire explicit control over semantic information being discarded. This goal is often formulated as satisfying two potentially competing objectives: maximizing utility for predicting a target attribute while simultaneously being independent or invariant with respect to a known semantic attribute. In this paper, we \emph{identify and determine} two fundamental trade-offs between utility and semantic dependence induced by the statistical dependencies between the data and its corresponding target and semantic attributes. We derive closed-form solutions for the global optima of the underlying optimization problems under mild assumptions, which in turn yields closed formulae for the exact trade-offs. We also derive empirical estimates of the trade-offs and show their convergence to the corresponding population counterparts. Finally, we numerically quantify the trade-offs on representative problems and compare to the solutions achieved by baseline representation learning algorithms.
    Deriving Explanation of Deep Visual Saliency Models. (arXiv:2109.03575v1 [cs.CV])
    (0 min) Deep neural networks have shown their profound impact on achieving human-level performance in visual saliency prediction. However, it is still unclear how they learn the task and what it means in terms of understanding the human visual system. In this work, we develop a technique to derive explainable saliency models from their corresponding deep neural architecture-based saliency models by applying human perception theories and the conventional concepts of saliency. This technique helps us understand the learning pattern of the deep network at its intermediate layers through their activation maps. Initially, we consider two state-of-the-art deep saliency models, namely UNISAL and MSI-Net, for our interpretation. We use a set of biologically plausible log-Gabor filters for identifying and reconstructing their activation maps using our explainable saliency model. The final saliency map is generated using these reconstructed activation maps. We also build our own deep saliency model named cross-concatenated multi-scale residual block based network (CMRNet) for saliency prediction. Then, we evaluate and compare the performance of the explainable models derived from UNISAL, MSI-Net and CMRNet on three benchmark datasets with other state-of-the-art methods. Hence, we propose that this explainability approach can be applied to any deep visual saliency model for interpretation, which makes it generic.
    StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm. (arXiv:2108.10411v2 [cs.LG] UPDATED)
    (0 min) Kernel ridge regression (KRR) is a popular scheme for non-linear non-parametric learning. However, existing implementations of KRR require that all the data is stored in the main memory, which severely limits the use of KRR in contexts where data size far exceeds the memory size. Such applications are increasingly common in data mining, bioinformatics, and control. A powerful paradigm for computing on data sets that are too large for memory is the streaming model of computation, where we process one data sample at a time, discarding each sample before moving on to the next one. In this paper, we propose StreaMRAK - a streaming version of KRR. StreaMRAK improves on existing KRR schemes by dividing the problem into several levels of resolution, which allows continual refinement to the predictions. The algorithm reduces the memory requirement by continuously and efficiently integrating new samples into the training model. With a novel sub-sampling scheme, StreaMRAK reduces memory and computational complexities by creating a sketch of the original data, where the sub-sampling density is adapted to the bandwidth of the kernel and the local dimensionality of the data. We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum. The results show that the proposed algorithm is fast and accurate.
    C-MinHash: Rigorously Reducing $K$ Permutations to Two. (arXiv:2109.03337v1 [stat.ML])
    (0 min) Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of MinHash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset, in order to obtain reliable results for (e.g.,) building large-scale learning models or approximate near neighbor search in massive data. In this paper, we propose {\bf Circulant MinHash (C-MinHash)} and provide the surprising theoretical results that we just need \textbf{two} independent random permutations. For C-MinHash, we first conduct an initial permutation on the data vector, then we use a second permutation to generate hash values. Basically, the second permutation is re-used $K$ times via circulant shifting to produce $K$ hashes. Unlike classical MinHash, these $K$ hashes are obviously correlated, but we are able to provide rigorous proofs that we still obtain an unbiased estimate of the Jaccard similarity and the theoretical variance is uniformly smaller than that of the classical MinHash with $K$ independent permutations. The theoretical proofs of C-MinHash require some non-trivial efforts. Numerical experiments are conducted to justify the theory and demonstrate the effectiveness of C-MinHash.
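    The circulant re-use described above can be sketched as follows: a first permutation is applied to the data once, and a single second permutation, shifted circulantly by k, produces the k-th hash. The exact indexing conventions and any additional details of C-MinHash may differ from this illustration.

        import numpy as np

        def c_minhash(binary_vec, K, seed=0):
            D = len(binary_vec)
            rng = np.random.default_rng(seed)
            sigma = rng.permutation(D)           # initial permutation applied to the data
            pi = rng.permutation(D)              # the single re-used permutation
            locations = sigma[np.flatnonzero(binary_vec)]
            hashes = np.empty(K, dtype=int)
            for k in range(K):
                shifted = pi[(locations + k) % D]   # circulant shift of pi by k
                hashes[k] = shifted.min()           # classical MinHash on the shifted values
            return hashes

        # The fraction of matching hashes estimates the Jaccard similarity.
        rng = np.random.default_rng(1)
        v1 = rng.integers(0, 2, size=1024)
        v2 = v1.copy(); v2[:100] = 1 - v2[:100]
        print(np.mean(c_minhash(v1, 256) == c_minhash(v2, 256)))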
    YAHPO Gym -- Design Criteria and a new Multifidelity Benchmark for Hyperparameter Optimization. (arXiv:2109.03670v1 [cs.LG])
    (0 min) When developing and analyzing new hyperparameter optimization (HPO) methods, it is vital to empirically evaluate and compare them on well-curated benchmark suites. In this work, we list desirable properties and requirements for such benchmarks and propose a new set of challenging and relevant multifidelity HPO benchmark problems motivated by these requirements. For this, we revisit the concept of surrogate-based benchmarks and empirically compare them to more widely-used tabular benchmarks, showing that the latter ones may induce bias in performance estimation and ranking of HPO methods. We present a new surrogate-based benchmark suite for multifidelity HPO methods consisting of 9 benchmark collections that constitute over 700 multifidelity HPO problems in total. All our benchmarks also allow for querying of multiple optimization targets, enabling the benchmarking of multi-objective HPO. We examine and compare our benchmark suite with respect to the defined requirements and show that our benchmarks provide viable additions to existing suites.
    Plant Disease Detection Using Image Processing and Machine Learning. (arXiv:2106.10698v2 [cs.CV] UPDATED)
    (0 min) One of the important and tedious tasks in agricultural practice is the detection of diseases on crops, which requires considerable time as well as skilled labor. This paper proposes a smart and efficient technique for the detection of crop diseases using computer vision and machine learning techniques. The proposed system is able to detect 20 different diseases of 5 common plants with 93% accuracy.
    Real-World Adversarial Examples involving Makeup Application. (arXiv:2109.03329v1 [cs.CR])
    (0 min) Deep neural networks have developed rapidly and have achieved outstanding performance in several tasks, such as image classification and natural language processing. However, recent studies have indicated that both digital and physical adversarial examples can fool neural networks. Face-recognition systems are used in various applications that involve security threats from physical adversarial examples. Herein, we propose a physical adversarial attack with the use of full-face makeup. The presence of makeup on the human face is a reasonable possibility, which increases the imperceptibility of attacks. In our attack framework, we combine the cycle-adversarial generative network (cycle-GAN) and a victimized classifier. The cycle-GAN is used to generate adversarial makeup, and the architecture of the victimized classifier is VGG-16. Our experimental results show that our attack can effectively overcome manual errors in makeup application, such as color and position-related errors. We also demonstrate that the approaches used to train the models can influence physical attacks; the adversarial perturbations crafted from the pre-trained model are affected by the corresponding training data.
    DMN4: Few-shot Learning via Discriminative Mutual Nearest Neighbor Neural Network. (arXiv:2103.08160v3 [cs.CV] UPDATED)
    (0 min) Few-shot learning (FSL) aims to classify images under low-data regimes, where the conventional pooled global feature is likely to lose useful local characteristics. Recent work has achieved promising performances by using deep descriptors. They generally take all deep descriptors from neural networks into consideration while ignoring that some of them are useless in classification due to their limited receptive field, e.g., task-irrelevant descriptors could be misleading and multiple aggregative descriptors from background clutter could even overwhelm the object's presence. In this paper, we argue that a Mutual Nearest Neighbor (MNN) relation should be established to explicitly select the query descriptors that are most relevant to each task and discard less relevant ones from aggregative clutters in FSL. Specifically, we propose Discriminative Mutual Nearest Neighbor Neural Network (DMN4) for FSL. Extensive experiments demonstrate that our method outperforms the existing state-of-the-arts on both fine-grained and generalized datasets.
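    The mutual-nearest-neighbor selection referred to above can be sketched as follows: a query descriptor is kept only if its nearest support descriptor has that query descriptor as its own nearest neighbor. This is only an illustration of the MNN relation on random vectors; DMN4 embeds it in an episodic few-shot classification pipeline.

        import numpy as np

        def mutual_nearest_neighbor_mask(query_desc, support_desc):
            # query_desc: (Nq, d), support_desc: (Ns, d); rows assumed L2-normalized.
            sims = query_desc @ support_desc.T     # cosine similarities
            q_to_s = sims.argmax(axis=1)           # nearest support for each query descriptor
            s_to_q = sims.argmax(axis=0)           # nearest query for each support descriptor
            return s_to_q[q_to_s] == np.arange(len(query_desc))

        rng = np.random.default_rng(0)
        q = rng.normal(size=(10, 64)); q /= np.linalg.norm(q, axis=1, keepdims=True)
        s = rng.normal(size=(25, 64)); s /= np.linalg.norm(s, axis=1, keepdims=True)
        print("descriptors kept:", mutual_nearest_neighbor_mask(q, s).sum(), "of", len(q))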
    Learning Neuro-Symbolic Relational Transition Models for Bilevel Planning. (arXiv:2105.14074v2 [cs.AI] UPDATED)
    (0 min) In robotic domains, learning and planning are complicated by continuous state spaces, continuous action spaces, and long task horizons. In this work, we address these challenges with Neuro-Symbolic Relational Transition Models (NSRTs), a novel class of models that are data-efficient to learn, compatible with powerful robotic planning methods, and generalizable over objects. NSRTs have both symbolic and neural components, enabling a bilevel planning scheme where symbolic AI planning in an outer loop guides continuous planning with neural models in an inner loop. Experiments in four robotic planning domains show that NSRTs can be learned after only tens or hundreds of training episodes, and then used for fast planning in new tasks that require up to 60 actions and involve many more objects than were seen during training. Video: https://tinyurl.com/chitnis-nsrts
    Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. (arXiv:2012.12803v3 [cs.LG] UPDATED)
    (0 min) Recent work of Erlingsson, Feldman, Mironov, Raghunathan, Talwar, and Thakurta [EFMRTT19] demonstrates that random shuffling amplifies differential privacy guarantees of locally randomized data. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously [BEMMRLRKTS17] and has led to significant interest in the shuffle model of privacy [CSUZZ19; EFMRTT19]. We show that random shuffling of $n$ data records that are input to $\varepsilon_0$-differentially private local randomizers results in an $(O((1-e^{-\varepsilon_0})\sqrt{\frac{e^{\varepsilon_0}\log(1/\delta)}{n}}), \delta)$-differentially private algorithm. This significantly improves over previous work and achieves the asymptotically optimal dependence in $\varepsilon_0$. Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees. Importantly, our work also yields an algorithm for deriving tighter bounds on the resulting $\varepsilon$ and $\delta$ as well as R\'enyi differential privacy guarantees. We show numerically that our algorithm gets to within a small constant factor of the optimal bound. As a direct corollary of our analysis, we derive a simple and nearly optimal algorithm for frequency estimation in the shuffle model of privacy. We also observe that our result implies the first asymptotically optimal privacy analysis of noisy stochastic gradient descent that applies to sampling without replacement.
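    To get a feel for the scaling of the amplified guarantee quoted above, the snippet below evaluates the expression inside the $O(\cdot)$ for some illustrative parameters. The unspecified constant factor is ignored, so the numbers only indicate how the bound scales with $\varepsilon_0$, $n$, and $\delta$.

        import numpy as np

        def shuffle_amplified_eps(eps0, n, delta):
            # Expression inside the O(.) of the bound above; constant factor omitted.
            return (1 - np.exp(-eps0)) * np.sqrt(np.exp(eps0) * np.log(1 / delta) / n)

        # Example: eps0 = 1 local randomizers, n = 1e6 shuffled reports, delta = 1e-6.
        print(shuffle_amplified_eps(1.0, 1_000_000, 1e-6))   # ~0.004, up to constants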
    AutoDebias: Learning to Debias for Recommendation. (arXiv:2105.04170v4 [cs.LG] UPDATED)
    (0 min) Recommender systems rely on user behavior data like ratings and clicks to build personalization models. However, the collected data is observational rather than experimental, causing various biases in the data, which significantly affect the learned model. Most existing work for recommendation debiasing, such as the inverse propensity scoring and imputation approaches, focuses on one or two specific biases, lacking the universal capacity that can account for mixed or even unknown biases in the data. Towards this research gap, we first analyze the origin of biases from the perspective of \textit{risk discrepancy}, which represents the difference between the expected empirical risk and the true risk. Remarkably, we derive a general learning framework that well summarizes most existing debiasing strategies by specifying some parameters of the general framework. This provides a valuable opportunity to develop a universal solution for debiasing, e.g., by learning the debiasing parameters from data. However, the training data lacks important signals of how the data is biased and what the unbiased data looks like. To move this idea forward, we propose \textit{AutoDebias}, which leverages another (small) set of uniform data to optimize the debiasing parameters by solving the bi-level optimization problem with meta-learning. Through theoretical analyses, we derive the generalization bound for AutoDebias and prove its ability to acquire the appropriate debiasing strategy. Extensive experiments on two real datasets and a simulated dataset demonstrate the effectiveness of AutoDebias. The code is available at \url{https://github.com/DongHande/AutoDebias}.
    Residual Model Learning for Microrobot Control. (arXiv:2104.00631v2 [cs.RO] UPDATED)
    (0 min) A majority of microrobots are constructed using compliant materials that are difficult to model analytically, limiting the utility of traditional model-based controllers. Challenges in data collection on microrobots and large errors between simulated models and real robots make current model-based learning and sim-to-real transfer methods difficult to apply. We propose a novel framework residual model learning (RML) that leverages approximate models to substantially reduce the sample complexity associated with learning an accurate robot model. We show that using RML, we can learn a model of the Harvard Ambulatory MicroRobot (HAMR) using just 12 seconds of passively collected interaction data. The learned model is accurate enough to be leveraged as "proxy-simulator" for learning walking and turning behaviors using model-free reinforcement learning algorithms. RML provides a general framework for learning from extremely small amounts of interaction data, and our experiments with HAMR clearly demonstrate that RML substantially outperforms existing techniques.
    Self-explaining variational posterior distributions for Gaussian Process models. (arXiv:2109.03708v1 [cs.LG])
    (0 min) Bayesian methods have become a popular way to incorporate prior knowledge and a notion of uncertainty into machine learning models. At the same time, the complexity of modern machine learning makes it challenging to comprehend a model's reasoning process, let alone express specific prior assumptions in a rigorous manner. While primarily interested in the former issue, recent developments in transparent machine learning could also broaden the range of prior information that we can provide to complex Bayesian models. Inspired by the idea of self-explaining models, we introduce a corresponding concept for variational Gaussian Processes. On the one hand, our contribution improves transparency for these types of models. More importantly, though, our proposed self-explaining variational posterior distribution allows us to incorporate both general prior knowledge about a target function as a whole and prior knowledge about the contribution of individual features.
    Forget me not: A Gentle Reminder to Mind the Simple Multi-Layer Perceptron Baseline for Text Classification. (arXiv:2109.03777v1 [cs.CL])
    (0 min) Graph neural networks have triggered a resurgence of graph-based text classification. We show that a simple MLP baseline already achieves comparable performance on benchmark datasets, questioning the importance of synthetic graph structures. When considering an inductive scenario, i.e., when adding new documents to a corpus, a simple MLP even outperforms most graph-based models. We further fine-tune DistilBERT for comparison and find that it outperforms all state-of-the-art models. We suggest that future studies use at least an MLP baseline to contextualize the results. We provide recommendations for the design and training of such a baseline.
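    A baseline of the kind advocated above can be as simple as bag-of-words features feeding a small multi-layer perceptron. The sketch below (scikit-learn) uses a TF-IDF vectorizer and one hidden layer; the toy data and all hyperparameters are illustrative and are not taken from the paper.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline

        texts = ["the team won the match", "stocks fell sharply today",
                 "the striker scored twice", "markets rallied after the report"]
        labels = ["sports", "finance", "sports", "finance"]

        baseline = make_pipeline(
            TfidfVectorizer(),
            MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0),
        )
        baseline.fit(texts, labels)
        print(baseline.predict(["the goalkeeper saved a penalty"]))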
    Graph Neural Networks: Methods, Applications, and Opportunities. (arXiv:2108.10733v2 [cs.LG] UPDATED)
    (0 min) In the last decade or so, we have witnessed deep learning reinvigorating the machine learning field. It has solved many problems in the domains of computer vision, speech recognition, natural language processing, and various other tasks with state-of-the-art performance. The data is generally represented in the Euclidean space in these domains. Various other domains conform to non-Euclidean space, for which graph is an ideal representation. Graphs are suitable for representing the dependencies and interrelationships between various entities. Traditionally, handcrafted features for graphs are incapable of providing the necessary inference for various tasks from this complex data representation. Recently, there is an emergence of employing various advances in deep learning to graph data-based tasks. This article provides a comprehensive survey of graph neural networks (GNNs) in each learning setting: supervised, unsupervised, semi-supervised, and self-supervised learning. Taxonomy of each graph based learning setting is provided with logical divisions of methods falling in the given learning setting. The approaches for each learning task are analyzed from both theoretical as well as empirical standpoints. Further, we provide general architecture guidelines for building GNNs. Various applications and benchmark datasets are also provided, along with open challenges still plaguing the general applicability of GNNs.
    End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net. (arXiv:2106.00952v2 [cs.CV] UPDATED)
    (0 min) Information extraction from document images has received a lot of attention recently, due to the need for digitizing a large volume of unstructured documents such as invoices, receipts, bank transfers, etc. In this paper, we propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document, namely the \textit{Multi-Stage Attentional U-Net}. To effectively capture the textual and spatial relations between 2D elements, our model leverages a specialized multi-stage encoder-decoders design, in conjunction with efficient uses of the self-attention mechanism and the box convolution. Experimental results on different datasets show that our model outperforms the baseline U-Net architecture by a large margin while using 40\% fewer parameters. Moreover, it also significantly improved the baseline in erroneous OCR and limited training data scenario, thus becomes practical for real-world applications.
    Axial multi-layer perceptron architecture for automatic segmentation of choroid plexus in multiple sclerosis. (arXiv:2109.03778v1 [eess.IV])
    (0 min) Choroid plexuses (CP) are structures of the ventricles of the brain which produce most of the cerebrospinal fluid (CSF). Several postmortem and in vivo studies have pointed towards their role in the inflammatory process in multiple sclerosis (MS). Automatic segmentation of CP from MRI thus has high value for studying their characteristics in large cohorts of patients. To the best of our knowledge, the only freely available tool for CP segmentation is FreeSurfer, but its accuracy for this specific structure is poor. In this paper, we propose to automatically segment CP from non-contrast-enhanced T1-weighted MRI. To that end, we introduce a new model called "Axial-MLP" based on an assembly of Axial multi-layer perceptrons (MLPs). This is inspired by recent works which showed that the self-attention layers of Transformers can be replaced with MLPs. This approach is systematically compared with a standard 3D U-Net, nnU-Net, FreeSurfer and FastSurfer. For our experiments, we make use of a dataset of 141 subjects (44 controls and 97 patients with MS). We show that all the tested deep learning (DL) methods outperform FreeSurfer (Dice around 0.7 for DL vs 0.33 for FreeSurfer). Axial-MLP is competitive with U-Nets even though it is slightly less accurate. The conclusions of our paper are two-fold: 1) the studied deep learning methods could be useful tools to study CP in large cohorts of MS patients; 2) Axial-MLP is a potentially viable alternative to convolutional neural networks for such tasks, although it could benefit from further improvements.
    On Event-Driven Knowledge Graph Completion in Digital Factories. (arXiv:2109.03655v1 [cs.LG])
    (0 min) Smart factories are equipped with machines that can sense their manufacturing environments, interact with each other, and control production processes. Smooth operation of such factories requires that the machines and engineering personnel that conduct their monitoring and diagnostics share a detailed common industrial knowledge about the factory, e.g., in the form of knowledge graphs. Creation and maintenance of such knowledge is expensive and requires automation. In this work we show how machine learning that is specifically tailored towards industrial applications can help in knowledge graph completion. In particular, we show how knowledge completion can benefit from event logs that are common in smart factories. We evaluate this on the knowledge graph from a real world-inspired smart factory with encouraging results.
    A comparison of combined data assimilation and machine learning methods for offline and online model error correction. (arXiv:2107.11114v2 [stat.ML] UPDATED)
    (0 min) Recent studies have shown that it is possible to combine machine learning methods with data assimilation to reconstruct a dynamical system using only sparse and noisy observations of that system. The same approach can be used to correct the error of a knowledge-based model. The resulting surrogate model is hybrid, with a statistical part supplementing a physical part. In practice, the correction can be added as an integrated term (i.e. in the model resolvent) or directly inside the tendencies of the physical model. The resolvent correction is easy to implement. The tendency correction is more technical, in particular it requires the adjoint of the physical model, but also more flexible. We use the two-scale Lorenz model to compare the two methods. The accuracy in long-range forecast experiments is somewhat similar between the surrogate models using the resolvent correction and the tendency correction. By contrast, the surrogate models using the tendency correction significantly outperform the surrogate models using the resolvent correction in data assimilation experiments. Finally, we show that the tendency correction opens the possibility to make online model error correction, i.e. improving the model progressively as new observations become available. The resulting algorithm can be seen as a new formulation of weak-constraint 4D-Var. We compare online and offline learning using the same framework with the two-scale Lorenz system, and show that with online learning, it is possible to extract all the information from sparse and noisy observations.
    DRLD-SP: A Deep Reinforcement Learning-based Dynamic Service Placement in Edge-Enabled Internet of Vehicles. (arXiv:2106.06291v2 [cs.NI] UPDATED)
    (0 min) The growth of 5G and edge computing has enabled the emergence of the Internet of Vehicles (IoV). It supports different types of services with different resource and service requirements. However, limited resources at the edge, high mobility of vehicles, increasing demand, and dynamicity in service request-types have made service placement a challenging task. A typical static placement solution is not effective as it does not consider the traffic mobility and service dynamics. Handling dynamics in IoV for service placement is an important and challenging problem, which is the primary focus of our work in this paper. We propose a Deep Reinforcement Learning-based Dynamic Service Placement (DRLD-SP) framework with the objective of minimizing the maximum edge resource usage and service delay while considering the vehicle's mobility, varying demand, and dynamics in the requests for different types of services. We use SUMO and MATLAB to carry out simulation experiments. The experimental results show that the proposed DRLD-SP approach is effective and outperforms other static and dynamic placement approaches.
    How Well Generative Adversarial Networks Learn Distributions. (arXiv:1811.03179v4 [math.ST] UPDATED)
    (0 min) This paper studies the rates of convergence for learning distributions implicitly with the adversarial framework and Generative Adversarial Networks (GANs), which subsume Wasserstein, Sobolev, MMD GAN, and Generalized/Simulated Method of Moments (GMM/SMM) as special cases. We study a wide range of parametric and nonparametric target distributions under a host of objective evaluation metrics. We investigate how to obtain valid statistical guarantees for GANs through the lens of regularization. On the nonparametric end, we derive the optimal minimax rates for distribution estimation under the adversarial framework. On the parametric end, we establish a theory for general neural network classes (including deep leaky ReLU networks) that characterizes the interplay on the choice of generator and discriminator pair. We discover and isolate a new notion of regularization, called the generator-discriminator-pair regularization, that sheds light on the advantage of GANs compared to classical parametric and nonparametric approaches for explicit distribution estimation. We develop novel oracle inequalities as the main technical tools for analyzing GANs, which are of independent interest.
    Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models. (arXiv:2010.13933v3 [stat.ML] UPDATED)
    (0 min) The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold in the interpolation regime, the training error remains zero while the test error decreases. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.
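    The interpolation threshold described above can be reproduced in a few lines with minimum-norm least squares on random nonlinear features: as the number of fitted parameters p crosses the number of training points n, the training error reaches zero and the test error typically spikes before decreasing again. This toy setup (random tanh features, NumPy) illustrates the phenomenon and is not the specific models analyzed in the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        n, n_test, d = 100, 1000, 20
        w_true = rng.normal(size=d)
        X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
        y = X @ w_true + 0.5 * rng.normal(size=n)
        yt = Xt @ w_true

        for p in [20, 50, 100, 200, 800]:
            W = rng.normal(size=(d, p)) / np.sqrt(d)
            F, Ft = np.tanh(X @ W), np.tanh(Xt @ W)    # random nonlinear features
            beta = np.linalg.pinv(F) @ y               # minimum-norm least-squares fit
            print(f"p={p:4d}  train MSE={np.mean((F @ beta - y) ** 2):.3f}"
                  f"  test MSE={np.mean((Ft @ beta - yt) ** 2):.3f}")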
    On the Approximation Power of Two-Layer Networks of Random ReLUs. (arXiv:2102.02336v2 [cs.LG] UPDATED)
    (0 min) This paper considers the following question: how well can depth-two ReLU networks with randomly initialized bottom-level weights represent smooth functions? We give near-matching upper- and lower-bounds for $L_2$-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower-bounds are based on (robust versions of) dimensionality arguments.
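    The object studied above, a depth-two ReLU network with random, fixed bottom-level weights, can be instantiated as follows: draw the first layer at random and fit only the top layer (here by least squares) to a smooth target. The target function, width, and sample size are placeholders for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        d, width, n = 2, 500, 2000

        X = rng.uniform(-1, 1, size=(n, d))
        target = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])   # smooth target function

        W = rng.normal(size=(d, width))      # random (fixed) bottom-level weights
        b = rng.uniform(-1, 1, size=width)   # random biases
        H = np.maximum(X @ W + b, 0.0)       # hidden-layer ReLU features
        coef, *_ = np.linalg.lstsq(H, target, rcond=None)   # fit the top layer only

        print("empirical L2 approximation error:", np.mean((H @ coef - target) ** 2))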
    DeepAltTrip: Top-k Alternative Itineraries for Trip Recommendation. (arXiv:2109.03535v1 [cs.LG])
    (0 min) Trip itinerary recommendation finds an ordered sequence of Points-of-Interest (POIs) from a large number of candidate POIs in a city. In this paper, we propose a deep learning-based framework, called DeepAltTrip, that learns to recommend top-k alternative itineraries for given source and destination POIs. These alternative itineraries would be not only popular given the historical routes adopted by past users but also dissimilar (or diverse) to each other. The DeepAltTrip consists of two major components: (i) Itinerary Net (ITRNet) which estimates the likelihood of POIs on an itinerary by using graph autoencoders and two (forward and backward) LSTMs; and (ii) a route generation procedure to generate k diverse itineraries passing through relevant POIs obtained using ITRNet. For the route generation step, we propose a novel sampling algorithm that can seamlessly handle a wide variety of user-defined constraints. To the best of our knowledge, this is the first work that learns from historical trips to provide a set of alternative itineraries to the users. Extensive experiments conducted on eight popular real-world datasets show the effectiveness and efficacy of our approach over state-of-the-art methods.
    Effective Sequence-to-Sequence Dialogue State Tracking. (arXiv:2108.13990v2 [cs.CL] UPDATED)
    (0 min) Sequence-to-sequence models have been applied to a wide variety of NLP tasks, but how to properly use them for dialogue state tracking has not been systematically investigated. In this paper, we study this problem from the perspectives of pre-training objectives as well as the formats of context representations. We demonstrate that the choice of pre-training objective makes a significant difference to the state tracking quality. In particular, we find that masked span prediction is more effective than auto-regressive language modeling. We also explore using Pegasus, a span prediction-based pre-training objective for text summarization, for the state tracking model. We found that pre-training for the seemingly distant summarization task works surprisingly well for dialogue state tracking. In addition, we found that while the recurrent state context representation also works reasonably well, the model may have a hard time recovering from earlier mistakes. We conducted experiments on the MultiWOZ 2.1-2.4, WOZ 2.0, and DSTC2 datasets with consistent observations.
    On the space of coefficients of a Feed Forward Neural Network. (arXiv:2109.03362v1 [cs.LG])
    (0 min) We define and establish the conditions for `equivalent neural networks' - neural networks with different weights, biases, and threshold functions that result in the same associated function. We prove that given a neural network $\mathcal{N}$ with piece-wise linear activation, the space of coefficients describing all equivalent neural networks is given by a semialgebraic set. This result is obtained by studying different representations of a given piece-wise linear function using the Tarski-Seidenberg theorem.
    Graph-MVP: Multi-View Prototypical Contrastive Learning for Multiplex Graphs. (arXiv:2109.03560v1 [cs.LG])
    (0 min) Contrastive Learning (CL) is one of the most popular self-supervised learning frameworks for graph representation learning, which trains a Graph Neural Network (GNN) by discriminating positive and negative node pairs. However, there are two challenges for CL on graphs. On the one hand, traditional CL methods will unavoidably introduce semantic errors since they will treat some semantically similar nodes as negative pairs. On the other hand, most of the existing CL methods ignore the multiplex nature of real-world graphs, where nodes are connected by various relations and each relation represents a view of the graph. To address these challenges, we propose a novel Graph Multi-View Prototypical (Graph-MVP) framework to extract node embeddings on multiplex graphs. Firstly, we introduce a Graph Prototypical Contrastive Learning (Graph-PCL) framework to capture both node-level and semantic-level information for each view of multiplex graphs. Graph-PCL captures the node-level information by a simple yet effective data transformation technique. It captures the semantic-level information by an Expectation-Maximization (EM) algorithm, which alternately performs clustering over node embeddings and parameter updating for the GNN. Next, we introduce Graph-MVP based on Graph-PCL to jointly model different views of the multiplex graphs. Our key insight behind Graph-MVP is that different view-specific embeddings of the same node should have similar underlying semantics, based on which we propose two versions of Graph-MVP: Graph-MVP_hard and Graph-MVP_soft to align embeddings across views. Finally, we evaluate the proposed Graph-PCL and Graph-MVP on a variety of real-world datasets and downstream tasks. The experimental results demonstrate the effectiveness of the proposed Graph-PCL and Graph-MVP frameworks.
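    To make the EM-style alternation concrete, below is a minimal, hypothetical sketch of one round of prototypical contrastive learning over pre-computed node embeddings for a single view; the actual Graph-PCL alternates such clustering with GNN parameter updates and additionally uses its data transformation for node-level contrast:

        # One illustrative EM-style round: cluster embeddings into prototypes,
        # then evaluate a prototype-level InfoNCE loss (whose gradient would
        # drive the GNN update in the real method).
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(2)
        Z = rng.normal(size=(200, 32))                          # node embeddings (one view)
        Z /= np.linalg.norm(Z, axis=1, keepdims=True)

        km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(Z)   # E-step: prototypes
        protos = km.cluster_centers_
        protos /= np.linalg.norm(protos, axis=1, keepdims=True)

        tau = 0.2                                               # temperature
        logits = Z @ protos.T / tau
        logits -= logits.max(axis=1, keepdims=True)             # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        loss = -log_prob[np.arange(len(Z)), km.labels_].mean()  # M-step surrogate objective
        print(f"prototypical contrastive loss: {loss:.3f}")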
    Effective and interpretable dispatching rules for dynamic job shops via guided empirical learning. (arXiv:2109.03323v1 [cs.LG])
    (0 min) The emergence of Industry 4.0 is making production systems more flexible and also more dynamic. In these settings, schedules often need to be adapted in real-time by dispatching rules. Although substantial progress was made until the '90s, the performance of these rules is still rather limited. The machine learning literature is developing a variety of methods to improve them, but the resulting rules are difficult to interpret and do not generalise well for a wide range of settings. This paper is the first major attempt at combining machine learning with domain problem reasoning for scheduling. The idea consists of using the insights obtained with the latter to guide the empirical search of the former. Our hypothesis is that this guided empirical learning process should result in dispatching rules that are effective and interpretable and which generalise well to different instance classes. We test our approach in the classical dynamic job shop scheduling problem minimising tardiness, which is one of the most well-studied scheduling problems. Nonetheless, results suggest that our approach was able to find new state-of-the-art rules, which significantly outperform the existing literature in the vast majority of settings, from loose to tight due dates and from low utilisation conditions to congested shops. Overall, the average improvement is 19%. Moreover, the rules are compact, interpretable, and generalise well to extreme, unseen scenarios.
    CyGIL: A Cyber Gym for Training Autonomous Agents over Emulated Network Systems. (arXiv:2109.03331v1 [cs.CR])
    (0 min) Given the success of reinforcement learning (RL) in various domains, it is promising to explore the application of its methods to the development of intelligent and autonomous cyber agents. Enabling this development requires a representative RL training environment. To that end, this work presents CyGIL: an experimental testbed of an emulated RL training environment for network cyber operations. CyGIL uses a stateless environment architecture and incorporates the MITRE ATT&CK framework to establish a high fidelity training environment, while presenting a sufficiently abstracted interface to enable RL training. Its comprehensive action space and flexible game design allow the agent training to focus on particular advanced persistent threat (APT) profiles, and to incorporate a broad range of potential threats and vulnerabilities. By striking a balance between fidelity and simplicity, it aims to leverage state of the art RL algorithms for application to real-world cyber defence.
    BERTnesia: Investigating the capture and forgetting of knowledge in BERT. (arXiv:2010.09313v2 [cs.CL] UPDATED)
    (0 min) Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this paper, we probe BERT specifically to understand and measure the relational knowledge it captures. We utilize knowledge base completion tasks to probe every layer of pre-trained as well as fine-tuned BERT (ranking, question answering, NER). Our findings show that knowledge is not just contained in BERT's final layers. Intermediate layers contribute a significant amount (17-60%) to the total knowledge found. Probing intermediate layers also reveals how different types of knowledge emerge at varying rates. When BERT is fine-tuned, relational knowledge is forgotten, and the extent of forgetting depends on the fine-tuning objective rather than on the size of the dataset. We found that ranking models forget the least and retain more knowledge in their final layer. We release our code on GitHub so that the experiments can be repeated.
    Class-conditioned Domain Generalization via Wasserstein Distributional Robust Optimization. (arXiv:2109.03676v1 [cs.LG])
    (0 min) Given multiple source domains, domain generalization aims at learning a universal model that performs well on any unseen but related target domain. In this work, we focus on the domain generalization scenario where domain shifts occur among class-conditional distributions of different domains. Existing approaches are not sufficiently robust when the variation of conditional distributions given the same class is large. We extend the concept of distributionally robust optimization to solve the class-conditional domain generalization problem. Our approach optimizes the worst-case performance of a classifier over class-conditional distributions within a Wasserstein ball centered around the barycenter of the source conditional distributions. We also propose an iterative algorithm for learning the optimal radius of the Wasserstein balls automatically. Experiments show that the proposed framework has better performance on unseen target domains than approaches without domain generalization.
    Entangled Datasets for Quantum Machine Learning. (arXiv:2109.03400v1 [quant-ph])
    (0 min) High-quality, large-scale datasets have played a crucial role in the development and success of classical machine learning. Quantum Machine Learning (QML) is a new field that aims to use quantum computers for data analysis, with the hope of obtaining a quantum advantage of some sort. While most proposed QML architectures are benchmarked using classical datasets, there is still doubt whether QML on classical datasets will achieve such an advantage. In this work, we argue that one should instead employ quantum datasets composed of quantum states. For this purpose, we introduce the NTangled dataset composed of quantum states with different amounts and types of multipartite entanglement. We first show how a quantum neural network can be trained to generate the states in the NTangled dataset. Then, we use the NTangled dataset to benchmark QML models for supervised learning classification tasks. We also consider an alternative entanglement-based dataset, which is scalable and is composed of states prepared by quantum circuits with different depths. As a byproduct of our results, we introduce a novel method for generating multipartite entangled states, providing a use-case of quantum neural networks for quantum entanglement theory.
    A Review of Sound Source Localization with Deep Learning Methods. (arXiv:2109.03465v1 [cs.SD])
    (0 min) This article is a review on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of the deep learning-based sound source localization methods. Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.
    ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations. (arXiv:2107.14483v4 [cs.LG] UPDATED)
    (0 min) Object manipulation from 3D visual inputs poses many challenges for building generalizable perception and policy models. However, 3D assets in existing benchmarks mostly lack the diversity of 3D shapes that align with real-world intra-class complexity in topology and geometry. Here we propose SAPIEN Manipulation Skill Benchmark (ManiSkill) to benchmark manipulation skills over diverse objects in a full-physics simulator. 3D assets in ManiSkill include large intra-class topological and geometric variations. Tasks are carefully chosen to cover distinct types of manipulation challenges. The latest progress in 3D vision also makes us believe that we should customize the benchmark so that the challenge is inviting to researchers working on 3D deep learning. To this end, we simulate a moving panoramic camera that returns ego-centric point clouds or RGB-D images. In addition, we would like ManiSkill to serve a broad set of researchers interested in manipulation research. Besides supporting the learning of policies from interactions, we also support learning-from-demonstrations (LfD) methods, by providing a large number of high-quality demonstrations (~36,000 successful trajectories, ~1.5M point cloud/RGB-D frames in total). We provide baselines using 3D deep learning and LfD algorithms. All code of our benchmark (simulator, environment, SDK, and baselines) is open-sourced, and a challenge open to interdisciplinary researchers will be held based on the benchmark.
    Dual Correction Strategy for Ranking Distillation in Top-N Recommender System. (arXiv:2109.03459v1 [cs.IR])
    (0 min) Knowledge Distillation (KD), which transfers the knowledge of a well-trained large model (teacher) to a small model (student), has become an important area of research for practical deployment of recommender systems. Recently, Relaxed Ranking Distillation (RRD) has shown that distilling the ranking information in the recommendation list significantly improves the performance. However, the method still has limitations in that 1) it does not fully utilize the prediction errors of the student model, which makes training less efficient, and 2) it only distills the user-side ranking information, which provides an insufficient view under the sparse implicit feedback. This paper presents the Dual Correction strategy for Distillation (DCD), which transfers the ranking information from the teacher model to the student model in a more efficient manner. Most importantly, DCD uses the discrepancy between the teacher model and the student model predictions to decide which knowledge should be distilled. By doing so, DCD essentially provides the learning guidance tailored to "correcting" what the student model has failed to accurately predict. This process is applied for transferring the ranking information from the user-side as well as the item-side to address sparse implicit user feedback. Our experiments show that the proposed method outperforms the state-of-the-art baselines, and ablation studies validate the effectiveness of each component.
    EMA: Auditing Data Removal from Trained Models. (arXiv:2109.03675v1 [cs.LG])
    (0 min) Data auditing is a process to verify whether certain data have been removed from a trained model. A recently proposed method (Liu et al., 2020) uses Kolmogorov-Smirnov (KS) distance for such data auditing. However, it fails under certain practical conditions. In this paper, we propose a new method called Ensembled Membership Auditing (EMA) for auditing data removal to overcome these limitations. We compare both methods using benchmark datasets (MNIST and SVHN) and Chest X-ray datasets with multi-layer perceptrons (MLP) and convolutional neural networks (CNN). Our experiments show that EMA is robust under various conditions, including the failure cases of the previously proposed method. Our code is available at: https://github.com/Hazelsuko07/EMA.
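    For intuition, the sketch below illustrates the Kolmogorov-Smirnov style check the abstract refers to, with assumed inputs (model confidences on a calibration set of known non-members versus on the queried data); it is not the authors' EMA implementation:

        # KS-distance check: do the audited samples behave like data the model
        # has never seen? Confidence scores here are synthetic stand-ins.
        import numpy as np
        from scipy.stats import ks_2samp

        rng = np.random.default_rng(3)
        conf_calibration = rng.beta(2, 5, size=1000)   # confidences on known non-member data
        conf_query = rng.beta(5, 2, size=200)          # confidences on the data being audited

        stat, p_value = ks_2samp(conf_query, conf_calibration)
        # A large KS statistic (small p-value) suggests the queried data still
        # influence the model, i.e. it may not have been removed.
        print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")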
    Conservative Policy Construction Using Variational Autoencoders for Logged Data with Missing Values. (arXiv:2109.03747v1 [cs.LG])
    (0 min) In high-stakes applications of data-driven decision making like healthcare, it is of paramount importance to learn a policy that maximizes the reward while avoiding potentially dangerous actions when there is uncertainty. There are two main challenges usually associated with this problem. Firstly, learning through online exploration is not possible due to the critical nature of such applications. Therefore, we need to resort to observational datasets with no counterfactuals. Secondly, such datasets are usually imperfect, additionally cursed with missing values in the attributes of features. In this paper, we consider the problem of constructing personalized policies using logged data when there are missing values in the attributes of features in both training and test data. The goal is to recommend an action (treatment) when $\tilde{X}$, a degraded version of $X$ with missing values, is observed. We consider three strategies for dealing with missingness. In particular, we introduce the conservative strategy, where the policy is designed to safely handle the uncertainty due to missingness. To implement this strategy, we need to estimate the posterior distribution $p(X \mid \tilde{X})$, which we do with a variational autoencoder. Specifically, our method is based on partial variational autoencoders (PVAE), which are designed to capture the underlying structure of features with missing values.
    Distributed Reinforcement Learning for Age of Information Minimization in Real-Time IoT Systems. (arXiv:2104.01527v2 [cs.IT] UPDATED)
    (0 min) In this paper, the problem of minimizing the weighted sum of age of information (AoI) and total energy consumption of Internet of Things (IoT) devices is studied. In the considered model, each IoT device monitors a physical process that follows nonlinear dynamics. As the dynamics of the physical process vary over time, each device must find an optimal sampling frequency to sample the real-time dynamics of the physical system and send sampled information to a base station (BS). Due to limited wireless resources, the BS can only select a subset of devices to transmit their sampled information. Thus, edge devices must cooperatively sample their monitored dynamics based on the local observations and the BS must collect the sampled information from the devices immediately, hence avoiding the additional time and energy used for sampling and information transmission. To this end, it is necessary to jointly optimize the sampling policy of each device and the device selection scheme of the BS so as to accurately monitor the dynamics of the physical process using minimum energy. This problem is formulated as an optimization problem whose goal is to minimize the weighted sum of AoI cost and energy consumption. To solve this problem, we propose a novel distributed reinforcement learning (RL) approach for the sampling policy optimization. The proposed algorithm enables edge devices to cooperatively find the global optimal sampling policy using their own local observations. Given the sampling policy, the device selection scheme can be optimized thus minimizing the weighted sum of AoI and energy consumption of all devices. Simulations with real data of PM 2.5 pollution show that the proposed algorithm can reduce the sum of AoI by up to 17.8% and 33.9% and the total energy consumption by up to 13.2% and 35.1%, compared to a conventional deep Q network method and a uniform sampling policy.
    Model Selection's Disparate Impact in Real-World Deep Learning Applications. (arXiv:2104.00606v2 [cs.LG] UPDATED)
    (0 min) Algorithmic fairness has emphasized the role of biased data in automated decision outcomes. Recently, there has been a shift in attention to sources of bias that implicate fairness in other stages in the ML pipeline. We contend that one source of such bias, human preferences in model selection, remains under-explored in terms of its role in disparate impact across demographic groups. Using a deep learning model trained on real-world medical imaging data, we verify our claim empirically and argue that choice of metric for model comparison, especially those that do not take variability into account, can significantly bias model selection outcomes.
    DexRay: A Simple, yet Effective Deep Learning Approach to Android Malware Detection based on Image Representation of Bytecode. (arXiv:2109.03326v1 [cs.CR])
    (0 min) Computer vision has witnessed several advances in recent years, with unprecedented performance provided by deep representation learning research. Image formats thus appear attractive to other fields such as malware detection, where deep learning on images alleviates the need for comprehensively hand-crafted features generalising to different malware variants. We postulate that this research direction could become the next frontier in Android malware detection, and therefore requires a clear roadmap to ensure that new approaches indeed bring novel contributions. We contribute with a first building block by developing and assessing a baseline pipeline for image-based malware detection with straightforward steps. We propose DexRay, which converts the bytecode of the app DEX files into grey-scale "vector" images and feeds them to a 1-dimensional Convolutional Neural Network model. We view DexRay as foundational due to the exceedingly basic nature of the design choices, allowing us to infer what could be a minimal performance that can be obtained with image-based learning in malware detection. The performance of DexRay evaluated on over 158k apps demonstrates that, while simple, our approach is effective with a high detection rate (F1-score = 0.96). Finally, we investigate the impact of time decay and image-resizing on the performance of DexRay and assess its resilience to obfuscation. This work-in-progress paper contributes to the domain of Deep Learning-based malware detection by providing a sound, simple, yet effective approach (with available artefacts) that can be the basis to scope the many profound questions that will need to be investigated to fully develop this domain.
    Federated Learning Beyond the Star: Local D2D Model Consensus with Global Cluster Sampling. (arXiv:2109.03350v1 [cs.LG])
    (0 min) Federated learning has emerged as a popular technique for distributing model training across the network edge. Its learning architecture is conventionally a star topology between the devices and a central server. In this paper, we propose two-timescale hybrid federated learning (TT-HF), which migrates to a more distributed topology via device-to-device (D2D) communications. In TT-HF, local model training occurs at devices via successive gradient iterations, and the synchronization process occurs at two timescales: (i) macro-scale, where global aggregations are carried out via device-server interactions, and (ii) micro-scale, where local aggregations are carried out via D2D cooperative consensus formation in different device clusters. Our theoretical analysis reveals how device, cluster, and network-level parameters affect the convergence of TT-HF, and leads to a set of conditions under which a convergence rate of O(1/t) is guaranteed. Experimental results demonstrate the improvements in convergence and utilization that can be obtained by TT-HF over state-of-the-art federated learning baselines.
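    As an illustration of the micro-scale D2D step (an assumed ring topology and mixing weights, not the TT-HF procedure itself), per-device parameters in one cluster can be driven toward their cluster average by a few consensus iterations with a doubly stochastic mixing matrix:

        # D2D consensus sketch: repeated local averaging converges to the
        # cluster-wide parameter average without a server round.
        import numpy as np

        rng = np.random.default_rng(4)
        n_devices, dim = 5, 10
        params = rng.normal(size=(n_devices, dim))      # per-device model parameters

        W = np.zeros((n_devices, n_devices))            # ring-topology mixing matrix
        for i in range(n_devices):
            W[i, i] = 0.5
            W[i, (i - 1) % n_devices] = 0.25
            W[i, (i + 1) % n_devices] = 0.25

        target = params.mean(axis=0)
        for _ in range(20):                             # micro-scale consensus rounds
            params = W @ params
        print("max deviation from cluster average:", np.abs(params - target).max())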
    ADER:Adapting between Exploration and Robustness for Actor-Critic Methods. (arXiv:2109.03443v1 [cs.LG])
    (0 min) Combining off-policy reinforcement learning methods with function approximators such as neural networks has been found to lead to overestimation of the value function and sub-optimal solutions. Improvements such as TD3 have been proposed to address this issue. However, we surprisingly find that its performance lags behind the vanilla actor-critic methods (such as DDPG) in some primitive environments. In this paper, we show that the failure in some cases can be attributed to insufficient exploration. We reveal the culprit of insufficient exploration in TD3, and propose a novel algorithm for this problem that ADapts between Exploration and Robustness, namely ADER. To enhance the exploration ability while eliminating the overestimation bias, we introduce a dynamic penalty term in value estimation calculated from estimated uncertainty, which takes into account different compositions of the uncertainty in different learning stages. Experiments in several challenging environments demonstrate the superiority of the proposed method in continuous control tasks.
    Fixed Support Tree-Sliced Wasserstein Barycenter. (arXiv:2109.03431v1 [cs.AI])
    (0 min) The Wasserstein barycenter has been widely studied in various fields, including natural language processing and computer vision. However, solving the Wasserstein barycenter problem is computationally expensive because computing the Wasserstein distance requires quadratic time with respect to the number of supports. By contrast, the Wasserstein distance on a tree, called the tree-Wasserstein distance, can be computed in linear time and allows for the fast comparison of a large number of distributions. In this study, we propose a barycenter under the tree-Wasserstein distance, called the fixed support tree-Wasserstein barycenter (FS-TWB) and its extension, called the fixed support tree-sliced Wasserstein barycenter (FS-TSWB). More specifically, we first show that the FS-TWB and FS-TSWB problems are convex optimization problems and can be solved by using projected subgradient descent. Moreover, we propose a more efficient algorithm to compute the subgradient and objective function value by using the properties of tree-Wasserstein barycenter problems. Through real-world experiments, we show that, by using the proposed algorithm, the FS-TWB and FS-TSWB can be solved two orders of magnitude faster than the original Wasserstein barycenter.
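    To illustrate why the tree-Wasserstein distance can be computed in linear time, the sketch below (an assumed toy tree and histograms, not the FS-TWB/FS-TSWB solver) accumulates, in one bottom-up pass, the mass difference carried by each edge:

        # Tree-Wasserstein distance: weighted sum over edges of the absolute
        # difference in subtree probability mass.
        import numpy as np

        parent = np.array([-1, 0, 0, 1, 1, 2])              # node 0 is the root
        edge_w = np.array([0.0, 1.0, 2.0, 1.0, 3.0, 0.5])   # weight of edge (node, parent)

        def tree_wasserstein(mu, nu):
            subtree = mu - nu                                # signed mass per node
            # Children have larger indices than their parents here, so a single
            # reverse sweep accumulates complete subtree masses.
            for v in range(len(parent) - 1, 0, -1):
                subtree[parent[v]] += subtree[v]
            return float(np.sum(edge_w[1:] * np.abs(subtree[1:])))

        mu = np.array([0.0, 0.2, 0.1, 0.3, 0.2, 0.2])
        nu = np.array([0.1, 0.1, 0.3, 0.2, 0.1, 0.2])
        print("tree-Wasserstein distance:", tree_wasserstein(mu, nu))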
    Learning Zero-sum Stochastic Games with Posterior Sampling. (arXiv:2109.03396v1 [cs.LG])
    (0 min) In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of $O(HS\sqrt{AT})$ in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent cannot be controlled and can take any arbitrary time-adaptive history-dependent strategy. This improves the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $A$ and $T$.
    Can Noise on Qubits Be Learned in Quantum Neural Network? A Case Study on QuantumFlow. (arXiv:2109.03430v1 [quant-ph])
    (0 min) In the noisy intermediate-scale quantum (NISQ) era, one of the key questions is how to deal with the high noise level existing in physical quantum bits (qubits). Quantum error correction is promising but requires an extensive number (e.g., over 1,000) of physical qubits to create one "perfect" qubit, exceeding the capacity of the existing quantum computers. This paper aims to tackle the noise issue from another angle: instead of creating perfect qubits for general quantum algorithms, we investigate the potential to mitigate the noise issue for dedicated algorithms. Specifically, this paper targets quantum neural networks (QNN), and proposes to learn the errors in the training phase, so that the identified QNN model can be resilient to noise. As a result, the implementation of QNN needs no or a small number of additional physical qubits, which is more realistic for the near-term quantum computers. To achieve this goal, an application-specific compiler is essential: on the one hand, the error cannot be learned if the mapping from logical qubits to physical qubits involves randomness; on the other hand, the compiler needs to be efficient so that the lengthy training procedure can be completed in a reasonable time. In this paper, we utilize the recent QNN framework, QuantumFlow, as a case study. Experimental results show that the proposed approach can optimize QNN models for different errors in qubits, achieving up to 28% accuracy improvement compared with the model obtained by error-agnostic training.
    A New Non-Negative Matrix Co-Factorisation Approach for Noisy Neonatal Chest Sound Separation. (arXiv:2109.03275v1 [eess.AS])
    (0 min) Obtaining high-quality heart and lung sounds enables clinicians to accurately assess a newborn's cardio-respiratory health and provide timely care. However, noisy chest sound recordings are common, hindering timely and accurate assessment. A new Non-negative Matrix Co-Factorisation-based approach is proposed to separate noisy chest sound recordings into heart, lung, and noise components to address this problem. The method trains on 20 high-quality heart and lung sounds in parallel with separating the sounds of the noisy recording. The method was tested on 68 10-second noisy recordings containing both heart and lung sounds and compared to current state-of-the-art Non-negative Matrix Factorisation methods. Results show significant improvements in heart and lung sound quality scores, as well as improved accuracy of 3.6 bpm and 1.2 bpm in heart and breathing rate estimation, respectively, when compared to existing methods.
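    As a rough sketch of the underlying factorisation idea (plain NMF on a stand-in magnitude spectrogram with an assumed grouping of components, rather than the proposed co-factorisation trained on clean reference recordings), the mixture is decomposed into spectral templates and activations whose components are then grouped into heart, lung, and noise sources:

        # Plain NMF separation sketch; V would normally be the magnitude
        # spectrogram of the noisy chest recording.
        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(5)
        V = rng.random((257, 400))                   # stand-in non-negative spectrogram

        model = NMF(n_components=6, init="random", random_state=0, max_iter=500)
        W = model.fit_transform(V)                   # spectral templates (freq x comp)
        H = model.components_                        # activations        (comp x time)

        # Hypothetical component-to-source assignment; in the paper this comes
        # from the co-factorisation with clean heart/lung references.
        groups = {"heart": [0, 1], "lung": [2, 3], "noise": [4, 5]}
        for name, idx in groups.items():
            V_src = W[:, idx] @ H[idx, :]            # reconstructed source spectrogram
            print(name, "energy share:", round(V_src.sum() / (W @ H).sum(), 3))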
    Federated Deep AUC Maximization for Heterogeneous Data with a Constant Communication Complexity. (arXiv:2102.04635v2 [cs.LG] UPDATED)
    (0 min) Deep AUC (area under the ROC curve) Maximization (DAM) has attracted much attention recently due to its great potential for imbalanced data classification. However, the research on Federated Deep AUC Maximization (FDAM) is still limited. Compared with standard federated learning (FL) approaches that focus on decomposable minimization objectives, FDAM is more complicated because its minimization objective is non-decomposable over individual examples. In this paper, we propose improved FDAM algorithms for heterogeneous data by solving the popular non-convex strongly-concave min-max formulation of DAM in a distributed fashion, which can also be applied to a class of non-convex strongly-concave min-max problems. A striking result of this paper is that the communication complexity of the proposed algorithm is a constant independent of the number of machines and also independent of the accuracy level, which improves an existing result by orders of magnitude. The experiments have demonstrated the effectiveness of our FDAM algorithm on benchmark datasets, and on medical chest X-ray images from different organizations. Our experiments show that FDAM using data from multiple hospitals can improve the AUC score on testing data from a single hospital for detecting life-threatening diseases based on chest radiographs.
    A Bottom-up method Towards the Automatic and Objective Monitoring of Smoking Behavior In-the-wild using Wrist-mounted Inertial Sensors. (arXiv:2109.03475v1 [eess.SP])
    (0 min) The consumption of tobacco has reached global epidemic proportions and is characterized as the leading cause of death and illness. Among the different ways of consuming tobacco (e.g., smokeless, cigars), smoking cigarettes is the most widespread. In this paper, we present a two-step, bottom-up algorithm towards the automatic and objective monitoring of cigarette-based smoking behavior during the day, using the 3D acceleration and orientation velocity measurements from a commercial smartwatch. In the first step, our algorithm performs the detection of individual smoking gestures (i.e., puffs) using an artificial neural network with both convolutional and recurrent layers. In the second step, we make use of the detected puff density to achieve the temporal localization of smoking sessions that occur throughout the day. In the experimental section, we provide an extended evaluation of each step of the proposed algorithm, using our publicly available, realistic Smoking Event Detection (SED) and Free-living Smoking Event Detection (SED-FL) datasets recorded under semi-controlled and free-living conditions, respectively. In particular, leave-one-subject-out (LOSO) experiments reveal an F1-score of 0.863 for the detection of puffs and an F1-score/Jaccard index equal to 0.878/0.604 towards the temporal localization of smoking sessions during the day. Finally, to gain further insight, we also compare the puff detection part of our algorithm with a similar approach found in the recent literature.
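    As an illustration of the second step (with an assumed window length and puff-count threshold, not the paper's exact procedure), smoking sessions can be localized by thresholding puff density in a sliding window over the detected puff timestamps:

        # Density-based temporal localization of smoking sessions from puff times.
        import numpy as np

        puff_times = np.array([12, 20, 31, 44, 58, 300, 3605, 3630, 3655, 3700])  # seconds

        window, step, min_puffs = 120.0, 10.0, 3
        t_grid = np.arange(0, puff_times.max() + window, step)
        density = np.array([((puff_times >= t) & (puff_times < t + window)).sum()
                            for t in t_grid])

        sessions, start = [], None
        for t, active in zip(t_grid, density >= min_puffs):
            if active and start is None:
                start = t                                   # session opens
            elif not active and start is not None:
                sessions.append((start, t + window))        # session closes
                start = None
        if start is not None:
            sessions.append((start, t_grid[-1] + window))
        print("detected smoking sessions (s):", sessions)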
    Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering. (arXiv:2109.03381v1 [cs.CL])
    (0 min) Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for the optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations in a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. Besides, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. By this means, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks.
    CRNNTL: convolutional recurrent neural network and transfer learning for QSAR modelling. (arXiv:2109.03309v1 [q-bio.QM])
    (0 min) In this study, we propose the convolutional recurrent neural network and transfer learning (CRNNTL) for QSAR modelling. The method was inspired by the applications of polyphonic sound detection and electrocardiogram classification. Our strategy takes advantage of both convolutional and recurrent neural networks for feature extraction, as well as the data augmentation method. Herein, CRNNTL is evaluated on 20 benchmark datasets in comparison with baseline methods. In addition, one isomer-based dataset is used to elucidate its ability for both local and global feature extraction. Then, the knowledge transfer performance of CRNNTL is tested, especially for small biological activity datasets. Finally, different latent representations from other types of autoencoders (AEs) were used to study the versatility of our model. The results show the effectiveness of CRNNTL using different latent representations. Moreover, efficient knowledge transfer is achieved to overcome data scarcity considering binding site similarity between different targets.
    Reconstructing High-resolution Turbulent Flows Using Physics-Guided Neural Networks. (arXiv:2109.03327v1 [physics.flu-dyn])
    (0 min) Direct numerical simulation (DNS) of turbulent flows is computationally expensive and cannot be applied to flows with large Reynolds numbers. Large eddy simulation (LES) is an alternative that is computationally less demanding, but is unable to capture all of the scales of turbulent transport accurately. Our goal in this work is to build a new data-driven methodology based on super-resolution techniques to reconstruct DNS data from LES predictions. We leverage the underlying physical relationships to regularize the relationships amongst different physical variables. We also introduce a hierarchical generative process and a reverse degradation process to fully explore the correspondence between DNS and LES data. We demonstrate the effectiveness of our method through a single-snapshot experiment and a cross-time experiment. The results confirm that our method can better reconstruct high-resolution DNS data over space and over time in terms of pixel-wise reconstruction error and structural similarity. Visual comparisons show that our method performs much better in capturing fine-level flow dynamics.
    Built-in Elastic Transformations for Improved Robustness. (arXiv:2107.09391v2 [cs.CV] UPDATED)
    (0 min) We focus on building robustness in the convolutions of neural visual classifiers, especially against natural perturbations like elastic deformations, occlusions and Gaussian noise. Existing CNNs show outstanding performance on clean images, but fail to tackle naturally occurring perturbations. In this paper, we start from elastic perturbations, which approximate (local) view-point changes of the object. We present elastically-augmented convolutions (EAConv) by parameterizing filters as a combination of fixed elastically-perturbed basis functions and trainable weights for the purpose of integrating unseen viewpoints in the CNN. We show on the CIFAR-10 and STL-10 datasets that the general robustness of our method on unseen occlusion, zoom, rotation, image cut and Gaussian perturbations improves, while significantly improving the performance on clean images without any data augmentation.
    Disentangling Alzheimer's disease neurodegeneration from typical brain aging using machine learning. (arXiv:2109.03723v1 [cs.LG])
    (0 min) Neuroimaging biomarkers that distinguish between typical brain aging and Alzheimer's disease (AD) are valuable for determining how much each contributes to cognitive decline. Machine learning models can derive multi-variate brain change patterns related to the two processes, including the SPARE-AD (Spatial Patterns of Atrophy for Recognition of Alzheimer's Disease) and SPARE-BA (of Brain Aging) investigated herein. However, substantial overlap between brain regions affected in the two processes confounds measuring them independently. We present a methodology toward disentangling the two. T1-weighted MRI images of 4,054 participants (48-95 years) with AD, mild cognitive impairment (MCI), or cognitively normal (CN) diagnoses from the iSTAGING (Imaging-based coordinate SysTem for AGIng and NeurodeGenerative diseases) consortium were analyzed. First, a subset of AD patients and CN adults were selected based purely on clinical diagnoses to train SPARE-BA1 (regression of age using CN individuals) and SPARE-AD1 (classification of CN versus AD). Second, analogous groups were selected based on clinical and molecular markers to train SPARE-BA2 and SPARE-AD2: amyloid-positive (A+) AD continuum group (consisting of A+AD, A+MCI, and A+ and tau-positive CN individuals) and amyloid-negative (A-) CN group. Finally, the combined group of the AD continuum and A-/CN individuals was used to train SPARE-BA3, with the intention to estimate brain age regardless of AD-related brain changes. Disentangled SPARE models derived brain patterns that were more specific to the two types of brain changes. The correlation between SPARE-BA and SPARE-AD was significantly reduced. The correlation of the disentangled SPARE-AD was non-inferior to that of the molecular measurements and the number of APOE4 alleles, but was lower for AD-related psychometric test scores, suggesting a contribution of advanced brain aging to these scores.
    Federated Edge Learning with Misaligned Over-The-Air Computation. (arXiv:2102.13604v3 [cs.IT] UPDATED)
    (0 min) Over-the-air computation (OAC) is a promising technique to realize fast model aggregation in the uplink of federated edge learning. OAC, however, hinges on accurate channel-gain precoding and strict synchronization among the edge devices, which are challenging in practice. As such, how to design the maximum likelihood (ML) estimator in the presence of residual channel-gain mismatch and asynchronies is an open problem. To fill this gap, this paper formulates the problem of misaligned OAC for federated edge learning and puts forth a whitened matched filtering and sampling scheme to obtain oversampled, but independent, samples from the misaligned and overlapped signals. Given the whitened samples, a sum-product ML estimator and an aligned-sample estimator are devised to estimate the arithmetic sum of the transmitted symbols. In particular, the computational complexity of our sum-product ML estimator is linear in the packet length and hence is significantly lower than the conventional ML estimator. Extensive simulations on the test accuracy versus the average received energy per symbol to noise power spectral density ratio (EsN0) yield two main results: 1) In the low EsN0 regime, the aligned-sample estimator can achieve superior test accuracy provided that the phase misalignment is non-severe. In contrast, the ML estimator does not work well due to the error propagation and noise enhancement in the estimation process. 2) In the high EsN0 regime, the ML estimator attains the optimal learning performance regardless of the severity of phase misalignment. On the other hand, the aligned-sample estimator suffers from a test-accuracy loss caused by phase misalignment.
    fastMRI+: Clinical Pathology Annotations for Knee and Brain Fully Sampled Multi-Coil MRI Data. (arXiv:2109.03812v1 [eess.IV])
    (0 min) Improving speed and image quality of Magnetic Resonance Imaging (MRI) via novel reconstruction approaches remains one of the highest impact applications for deep learning in medical imaging. The fastMRI dataset, unique in that it contains large volumes of raw MRI data, has enabled significant advances in accelerating MRI using deep learning-based reconstruction methods. While the impact of the fastMRI dataset on the field of medical imaging is unquestioned, the dataset currently lacks clinical expert pathology annotations, critical to addressing clinically relevant reconstruction frameworks and exploring important questions regarding rendering of specific pathology using such novel approaches. This work introduces fastMRI+, which consists of 16154 subspecialist expert bounding box annotations and 13 study-level labels for 22 different pathology categories on the fastMRI knee dataset, and 7570 subspecialist expert bounding box annotations and 643 study-level labels for 30 different pathology categories for the fastMRI brain dataset. The fastMRI+ dataset is open access and aims to support further research and advancement of medical imaging in MRI reconstruction and beyond.
    Diagnostics-Guided Explanation Generation. (arXiv:2109.03756v1 [cs.LG])
    (0 min) Explanations shed light on a machine learning model's rationales and can aid in identifying deficiencies in its reasoning process. Explanation generation models are typically trained in a supervised way given human explanations. When such annotations are not available, explanations are often selected as those portions of the input that maximise a downstream task's performance, which corresponds to optimising an explanation's Faithfulness to a given model. Faithfulness is one of several so-called diagnostic properties, which prior work has identified as useful for gauging the quality of an explanation without requiring annotations. Other diagnostic properties are Data Consistency, which measures how similar explanations are for similar input instances, and Confidence Indication, which shows whether the explanation reflects the confidence of the model. In this work, we show how to directly optimise for these diagnostic properties when training a model to generate sentence-level explanations, which markedly improves explanation quality, agreement with human rationales, and downstream task performance on three complex reasoning tasks.
    A Survey on Machine Learning Techniques for Auto Labeling of Video, Audio, and Text Data. (arXiv:2109.03784v1 [cs.LG])
    (0 min) Machine learning has been utilized to perform tasks in many different domains such as classification, object detection, image segmentation and natural language analysis. Data labeling has always been one of the most important tasks in machine learning. However, labeling large amounts of data increases the monetary cost in machine learning. As a result, researchers started to focus on reducing data annotation and labeling costs. Transfer learning was designed and widely used as an efficient approach that can reasonably reduce the negative impact of limited data, which, in turn, reduces the data preparation cost. Even though transferring previous knowledge from a source domain reduces the amount of data needed in a target domain, large amounts of annotated data are still required to build robust models and improve the prediction accuracy of the model. Therefore, researchers started to pay more attention to automatic annotation and labeling. In this survey paper, we provide a review of previous techniques that focus on optimized data annotation and labeling for video, audio, and text data.
    RoadAtlas: Intelligent Platform for Automated Road Defect Detection and Asset Management. (arXiv:2109.03385v1 [cs.CV])
    (0 min) With the rapid development of intelligent detection algorithms based on deep learning, much progress has been made in automatic road defect recognition and road marking parsing. This can effectively address the issue of an expensive and time-consuming process for professional inspectors to review the street manually. Towards this goal, we present RoadAtlas, a novel end-to-end integrated system that can support 1) road defect detection, 2) road marking parsing, 3) a web-based dashboard for presenting and inputting data by users, and 4) a backend containing a well-structured database and developed APIs.
    Text-Free Prosody-Aware Generative Spoken Language Modeling. (arXiv:2109.03264v1 [cs.CL])
    (0 min) Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm.
    Open Aspect Target Sentiment Classification with Natural Language Prompts. (arXiv:2109.03685v1 [cs.CL])
    (0 min) For many business applications, we often seek to analyze sentiments associated with any arbitrary aspects of commercial products, despite having a very limited amount of labels or even without any labels at all. However, existing aspect target sentiment classification (ATSC) models are not trainable if annotated datasets are not available. Even with labeled data, they fall short of reaching satisfactory performance. To address this, we propose simple approaches that better solve ATSC with natural language prompts, enabling the task under zero-shot cases and enhancing supervised settings, especially for few-shot cases. Under the few-shot setting for SemEval 2014 Task 4 laptop domain, our method of reformulating ATSC as an NLI task outperforms supervised SOTA approaches by up to 24.13 accuracy points and 33.14 macro F1 points. Moreover, we demonstrate that our prompts could handle implicitly stated aspects as well: our models reach about 77% accuracy on detecting sentiments for aspect categories (e.g., food), which do not necessarily appear within the text, even though we trained the models only with explicitly mentioned aspect terms (e.g., fajitas) from just 16 reviews - while the accuracy of the no-prompt baseline is only around 65%.
    AWGAN: Empowering High-Dimensional Discriminator Output for Generative Adversarial Networks. (arXiv:2109.03378v1 [stat.ML])
    (0 min) Empirically, multidimensional discriminator (critic) output can be advantageous, yet a solid explanation for this has not been given. In this paper, (i) we rigorously prove that high-dimensional critic output has an advantage in distinguishing real and fake distributions; (ii) we also introduce a square-root velocity transformation (SRVT) block which further magnifies this advantage. The proof is based on our proposed maximal p-centrality discrepancy, which is bounded above by the p-Wasserstein distance and perfectly fits the Wasserstein GAN framework with high-dimensional critic output n. We also show that when n = 1, the proposed discrepancy is equivalent to the 1-Wasserstein distance. The SRVT block is applied to break the symmetric structure of high-dimensional critic output and improve the generalization capability of the discriminator network. In terms of implementation, the proposed framework does not require additional hyper-parameter tuning, which largely facilitates its usage. Experiments on image generation tasks show performance improvement on benchmark datasets.
    An Attribute-Aligned Strategy for Learning Speech Representation. (arXiv:2106.02810v2 [eess.AS] UPDATED)
    (0 min) Advances in speech technology have brought convenience to our lives. However, concerns are on the rise because speech signals contain multiple personal attributes, which can lead to sensitive information leakage or biased decisions. In this work, we propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism. Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes, to derive an identity-free representation for speech emotion recognition (SER), and an emotionless representation for speaker verification (SV). Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV, compared to the current state-of-the-art method using adversarial learning on a large emotion corpus, the MSP-Podcast. Also, our proposed learning strategy reduces the model and training process needed to achieve multiple privacy-preserving tasks.
    Forward and Inverse models in HCI: Physical simulation and deep learning for inferring 3D finger pose. (arXiv:2109.03366v1 [cs.HC])
    (0 min) We outline the role of forward and inverse modelling approaches in the design of human-computer interaction systems. Causal, forward models tend to be easier to specify and simulate, but HCI requires solutions of the inverse problem. We infer finger 3D position $(x,y,z)$ and pose (pitch and yaw) on a mobile device using capacitive sensors which can sense the finger up to 5 cm above the screen. We use machine learning to develop data-driven models to infer position, pose and sensor readings, based on training data from: (1) data generated by robots, (2) data from electrostatic simulators, and (3) human-generated data. Machine-learned emulation is used to accelerate the electrostatic simulation by a factor of millions. We combine a Conditional Variational Autoencoder with domain expertise/models and experimentally collected data. We compare forward and inverse model approaches to direct inference of finger pose. The combination gives the most accurate reported results on inferring 3D position and pose with a capacitive sensor on a mobile device.
    Simple Video Generation using Neural ODEs. (arXiv:2109.03292v1 [cs.CV])
    (0 min) Despite having been studied to a great extent, the task of conditional generation of sequences of frames, or videos, remains extremely challenging. It is a common belief that a key step towards solving this task resides in modelling accurately both spatial and temporal information in video signals. A promising direction to do so has been to learn latent variable models that predict the future in latent space and project back to pixels, as suggested in recent literature. Following this line of work and building on top of a family of models introduced in prior work, Neural ODE, we investigate an approach that models time-continuous dynamics over a continuous latent space with a differential equation with respect to time. The intuition behind this approach is that these trajectories in latent space could then be extrapolated to generate video frames beyond the time steps for which the model is trained. We show that our approach yields promising results in the task of future frame prediction on the Moving MNIST dataset with 1 and 2 digits.
    How do I update my model? On the resilience of Predictive Process Monitoring models to change. (arXiv:2109.03501v1 [cs.LG])
    (2 min) Existing, well-investigated Predictive Process Monitoring techniques typically construct a predictive model based on past process executions and then use it to predict the future of new ongoing cases, without the possibility of updating it with new cases when they complete their execution. This can make Predictive Process Monitoring too rigid to deal with the variability of processes working in real environments that continuously evolve and/or exhibit new variant behaviours over time. As a solution to this problem, we evaluate the use of three different strategies that allow the periodic rediscovery or incremental construction of the predictive model so as to exploit new available data. The evaluation focuses on the performance of the newly learned predictive models, in terms of accuracy and time, against the original one, and uses a number of real and synthetic datasets with and without explicit Concept Drift. The results provide evidence of the potential of incremental learning algorithms for Predictive Process Monitoring in real environments.
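    A minimal sketch of two of these strategies on synthetic drifting data (our own illustration, with assumed features and drift, not the paper's evaluation setup): periodic rediscovery retrains from scratch on all completed cases, while incremental construction updates the existing model with each newly completed batch:

        # Periodic rediscovery vs. incremental update under concept drift.
        import numpy as np
        from sklearn.linear_model import SGDClassifier

        rng = np.random.default_rng(6)

        def new_batch(n=200, drift=0.0):
            X = rng.normal(size=(n, 5))
            y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)   # boundary drifts with `drift`
            return X, y

        X0, y0 = new_batch()
        full = SGDClassifier(random_state=0).fit(X0, y0)       # periodic rediscovery
        incr = SGDClassifier(random_state=0).fit(X0, y0)       # incremental construction

        seen_X, seen_y = X0, y0
        for period, drift in enumerate([0.5, 1.0, 2.0], start=1):
            Xb, yb = new_batch(drift=drift)
            print(f"period {period}: full={full.score(Xb, yb):.2f}  incr={incr.score(Xb, yb):.2f}")
            seen_X, seen_y = np.vstack([seen_X, Xb]), np.concatenate([seen_y, yb])
            full = SGDClassifier(random_state=0).fit(seen_X, seen_y)  # retrain from scratch
            incr.partial_fit(Xb, yb)                                  # update in place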
    Uncertainty Quantification and Experimental Design for large-scale linear Inverse Problems under Gaussian Process Priors. (arXiv:2109.03457v1 [stat.ML])
    (2 min) We consider the use of Gaussian process (GP) priors for solving inverse problems in a Bayesian framework. As is well known, the computational complexity of GPs scales cubically in the number of datapoints. We here show that in the context of inverse problems involving integral operators, one faces additional difficulties that hinder inversion on large grids. Furthermore, in that context, covariance matrices can become too large to be stored. By leveraging results about sequential disintegrations of Gaussian measures, we are able to introduce an implicit representation of posterior covariance matrices that reduces the memory footprint by only storing low rank intermediate matrices, while allowing individual elements to be accessed on-the-fly without needing to build full posterior covariance matrices. Moreover, it allows for fast sequential inclusion of new observations. These features are crucial when considering sequential experimental design tasks. We demonstrate our approach by computing sequential data collection plans for excursion set recovery for a gravimetric inverse problem, where the goal is to provide fine resolution estimates of high density regions inside the Stromboli volcano, Italy. Sequential data collection plans are computed by extending the weighted integrated variance reduction (wIVR) criterion to inverse problems. Our results show that this criterion is able to significantly reduce the uncertainty on the excursion volume, reaching close to minimal levels of residual uncertainty. Overall, our techniques allow the advantages of probabilistic models to be brought to bear on large-scale inverse problems arising in the natural sciences.
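    As a sketch of the implicit-representation idea (with an assumed linear forward operator and squared-exponential prior, not the authors' code), the posterior covariance can be kept as "prior minus a low-rank factor", so that individual entries are evaluated on the fly instead of building the full posterior covariance matrix:

        # Implicit low-rank posterior covariance for a linear inverse problem
        # y = G f + noise with GP prior covariance K: post_cov = K - L @ L.T.
        import numpy as np

        rng = np.random.default_rng(7)
        n, m, s2 = 2000, 15, 0.01                       # grid size, observations, noise var

        grid = np.linspace(0, 1, n)
        K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.05 ** 2)  # prior cov
        G = rng.normal(size=(m, n)) / n                 # stand-in integral-type observation operator

        R = G @ K @ G.T + s2 * np.eye(m)                # small m x m matrix
        L = K @ G.T @ np.linalg.cholesky(np.linalg.inv(R))   # n x m low-rank factor

        def posterior_cov(i, j):
            """Posterior covariance of grid points i, j without forming an n x n update."""
            return K[i, j] - L[i] @ L[j]

        mid = n // 2
        print("prior var at mid-grid:    ", K[mid, mid])
        print("posterior var at mid-grid:", posterior_cov(mid, mid))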
    AgreementLearning: An End-to-End Framework for Learning with Multiple Annotators without Groundtruth. (arXiv:2109.03596v1 [cs.LG])
    (2 min) The annotation of domain experts is important for some medical applications where the objective groundtruth is ambiguous to define, e.g., the rehabilitation for some chronic diseases, and the prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper uses of the annotations may hinder developing reliable models. On one hand, forcing the use of a single groundtruth generated from multiple annotations is less informative for the modeling. On the other hand, feeding the model with all the annotations without proper regularization is noisy given existing disagreements. To address these issues, we propose a novel agreement learning framework to tackle the challenge of learning from multiple annotators without objective groundtruth. The framework has two streams, with one stream fitting the multiple annotators and the other stream learning agreement information between the annotators. In particular, the agreement learning stream provides regularization information to the classifier stream, tuning its decision to be better in line with the agreement between the annotators. The proposed method can be easily plugged into existing backbones developed with majority-voted groundtruth or multiple annotations. Experiments on two medical datasets demonstrate improved agreement levels with the annotators.
    Deep Learning for Multi-View Ultrasonic Image Fusion. (arXiv:2109.03616v1 [eess.IV])
    (2 min) Ultrasonic imaging is being used to obtain information about the acoustic properties of a medium by emitting waves into it and recording their interaction using ultrasonic transducer arrays. The Delay-And-Sum (DAS) algorithm forms images using the main path on which reflected signals travel back to the transducers. In some applications, different insonification paths can be considered, for instance by placing the transducers at different locations or if strong reflectors inside the medium are known a-priori. These different modes give rise to multiple DAS images reflecting different geometric information about the scatterers and the challenge is to either fuse them into one image or to directly extract higher-level information regarding the materials of the medium, e.g., a segmentation map. Traditional image fusion techniques typically use ad-hoc combinations of pre-defined image transforms, pooling operations and thresholding. In this work, we propose a deep neural network (DNN) architecture that directly maps all available data to a segmentation map while explicitly incorporating the DAS image formation for the different insonification paths as network layers. This enables information flow between data pre-processing and image post-processing DNNs, trained end-to-end. We compare our proposed method to a traditional image fusion technique using simulated data experiments, mimicking a non-destructive testing application with four image modes, i.e., two transducer locations and two internal reflection boundaries. Using our approach, it is possible to obtain much more accurate segmentation of defects.
    Predicting Process Name from Network Data. (arXiv:2109.03328v1 [cs.CR])
    (2 min) The ability to identify applications based on the network data they generate could be a valuable tool for cyber defense. We report on a machine learning technique capable of using netflow-like features to predict the application that generated the traffic. In our experiments, we used ground-truth labels obtained from host-based sensors deployed in a large enterprise environment; we applied random forests and multilayer perceptrons to the tasks of browser vs. non-browser identification, browser fingerprinting, and process name prediction. For each of these tasks, we demonstrate how machine learning models can achieve high classification accuracy using only netflow-like features as the basis for classification.
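    As a rough illustration of the classification setup described above (not the paper's actual features, labels, or enterprise data), a random forest can be trained on a handful of netflow-like features; the feature names and synthetic labels below are hypothetical stand-ins.

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score

      rng = np.random.default_rng(0)
      n = 5000
      X = np.column_stack([
          rng.lognormal(8, 2, n),     # bytes sent (toy)
          rng.lognormal(7, 2, n),     # bytes received (toy)
          rng.poisson(40, n),         # packet count (toy)
          rng.uniform(0, 600, n),     # flow duration in seconds (toy)
          rng.integers(0, 65535, n),  # destination port (toy)
      ])
      y = rng.integers(0, 5, n)       # stand-in for process-name labels from host sensors

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
      clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
      print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))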
    Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi. (arXiv:2109.03552v1 [cs.CL])
    (2 min) The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.
    Signal-domain representation of symbolic music for learning embedding spaces. (arXiv:2109.03454v1 [cs.LG])
    (2 min) A key aspect of machine learning models lies in their ability to learn efficient intermediate features. However, the input representation plays a crucial role in this process, and polyphonic musical scores remain a particularly complex type of information. In this paper, we introduce a novel representation of symbolic music data, which transforms a polyphonic score into a continuous signal. We evaluate the ability to learn meaningful features from this representation from a musical point of view. Hence, we introduce an evaluation method relying on principled generation of synthetic data. Finally, to test our proposed representation we conduct an extensive benchmark against recent polyphonic symbolic representations. We show that our signal-like representation leads to better reconstruction and disentangled features. This improvement is reflected in the metric properties and in the generation ability of the space learned from our signal-like representation according to music theory properties.
    R2-D2: A Modular Baseline for Open-Domain Question Answering. (arXiv:2109.03502v1 [cs.CL])
    (2 min) This work presents a novel four-stage open-domain QA pipeline R2-D2 (Rank twice, reaD twice). The pipeline is composed of a retriever, passage reranker, extractive reader, generative reader and a mechanism that aggregates the final prediction from all of the system's components. We demonstrate its strength across three open-domain QA datasets: NaturalQuestions, TriviaQA and EfficientQA, surpassing state-of-the-art on the first two. Our analysis demonstrates that: (i) combining the extractive and generative readers yields absolute improvements of up to 5 exact-match points and is at least twice as effective as the posterior averaging ensemble of the same models with different parameters, (ii) the extractive reader with fewer parameters can match the performance of the generative reader on extractive QA datasets.
    Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning. (arXiv:2109.03445v1 [stat.ML])
    (2 min) The stochastic approximation (SA) algorithm is a widely used probabilistic method for finding a solution to an equation of the form $\mathbf{f}(\boldsymbol{\theta}) = \mathbf{0}$ where $\mathbf{f} : \mathbb{R}^d \rightarrow \mathbb{R}^d$, when only noisy measurements of $\mathbf{f}(\cdot)$ are available. In the literature to date, one can make a distinction between "synchronous" updating, whereby the entire vector of the current guess $\boldsymbol{\theta}_t$ is updated at each time, and "asynchronous" updating, whereby only one component of $\boldsymbol{\theta}_t$ is updated. In convex and nonconvex optimization, there is also the notion of "batch" updating, whereby some but not all components of $\boldsymbol{\theta}_t$ are updated at each time $t$. In addition, there is also a distinction between using a "local" clock versus a "global" clock. In the literature to date, convergence proofs when a local clock is used make the assumption that the measurement noise is an i.i.d. sequence, an assumption that does not hold in Reinforcement Learning (RL). In this note, we provide a general theory of convergence for batch asynchronous stochastic approximation (BASA) that works whether the updates use a local clock or a global clock, for the case where the measurement noises form a martingale difference sequence. This is the most general result to date and encompasses all others.
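    The following is a minimal sketch of the batch asynchronous update scheme described above, on a toy linear root-finding problem; the step-size schedule, noise model and clock handling are illustrative simplifications, not the paper's exact algorithm.

      import numpy as np

      # Sketch of batch asynchronous SA for solving f(theta) = 0: at each step only a
      # random subset ("batch") of coordinates is updated with a noisy measurement of f.
      def basa(f, theta0, steps=10000, batch_size=2, noise=0.1, seed=0):
          rng = np.random.default_rng(seed)
          theta = theta0.astype(float).copy()
          for t in range(1, steps + 1):
              idx = rng.choice(len(theta), size=batch_size, replace=False)
              measurement = f(theta) + noise * rng.standard_normal(len(theta))
              alpha = 1.0 / t                          # global-clock step size (illustrative)
              theta[idx] += alpha * measurement[idx]   # update only the chosen coordinates
          return theta

      # Toy example: f(theta) = b - A theta, whose root is A^{-1} b.
      A = np.array([[3.0, 1.0], [1.0, 2.0]])
      b = np.array([1.0, 1.0])
      print(basa(lambda th: b - A @ th, np.zeros(2)))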
    Estimating Expected Calibration Errors. (arXiv:2109.03480v1 [cs.LG])
    (2 min) Uncertainty in probabilistic classifiers' predictions is a key concern when models are used to support human decision making, in broader probabilistic pipelines or when sensitive automatic decisions have to be taken. Studies have shown that most models are not intrinsically well calibrated, meaning that their decision scores are not consistent with posterior probabilities. Hence, being able to calibrate these models, or to enforce calibration while learning them, has regained interest in recent literature. In this context, properly assessing calibration is paramount to quantify new contributions tackling calibration. However, there is room for improvement in commonly used metrics, and the evaluation of calibration could benefit from deeper analyses. Thus this paper focuses on the empirical evaluation of calibration metrics in the context of classification. More specifically, it evaluates different estimators of the Expected Calibration Error ($ECE$), among which are legacy estimators and some novel ones proposed in this paper. We build an empirical procedure to quantify the quality of these $ECE$ estimators, and use it to decide which estimator should be used in practice for different settings.
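    For reference, here is a minimal sketch of the standard binned (histogram) $ECE$ estimator, one of the legacy estimators the paper evaluates; the alternative estimators proposed in the paper are not reproduced here.

      import numpy as np

      def binned_ece(confidences, correct, n_bins=15):
          # Bin predictions by confidence and sum the per-bin |accuracy - confidence|
          # gaps, weighted by each bin's share of the samples.
          confidences = np.asarray(confidences, dtype=float)
          correct = np.asarray(correct, dtype=float)
          edges = np.linspace(0.0, 1.0, n_bins + 1)
          ece = 0.0
          for lo, hi in zip(edges[:-1], edges[1:]):
              mask = (confidences > lo) & (confidences <= hi)
              if mask.any():
                  gap = abs(correct[mask].mean() - confidences[mask].mean())
                  ece += mask.mean() * gap
          return ece

      # Toy usage: over-confident predictions yield a large ECE.
      conf = np.array([0.9, 0.8, 0.95, 0.7, 0.85])
      corr = np.array([1, 0, 1, 0, 0])
      print(binned_ece(conf, corr, n_bins=5))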
    A Clustering-aided Ensemble Method for Predicting Ridesourcing Demand in Chicago. (arXiv:2109.03433v1 [cs.LG])
    (2 min) Accurately forecasting ridesourcing demand is important for effective transportation planning and policy-making. With the rise of Artificial Intelligence (AI), researchers have started to utilize machine learning models to forecast travel demand, which, in many cases, can produce higher prediction accuracy than statistical models. However, most existing machine-learning studies used a global model to predict the demand and ignored the influence of spatial heterogeneity (i.e., the spatial variations in the impacts of explanatory variables). Spatial heterogeneity can cause parameter estimates to vary over space; failing to consider these spatial variations may limit the model's prediction performance. To account for spatial heterogeneity, this study proposes a Clustering-aided Ensemble Method (CEM) to forecast the zone-to-zone (census-tract-to-census-tract) travel demand for ridesourcing services. Specifically, we develop a clustering framework to split the origin-destination pairs into different clusters and ensemble the cluster-specific machine learning models for prediction. We implement and test the proposed methodology by using the ridesourcing-trip data in Chicago. The results show that, with a more transparent and flexible model structure, the CEM significantly improves prediction accuracy compared to the benchmark models (i.e., global machine-learning and statistical models directly trained on all observations). This study offers transportation researchers and practitioners a new methodology for travel demand forecasting, especially for new travel modes like ridesourcing and micromobility.
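    A hypothetical sketch of the clustering-aided ensemble idea: cluster origin-destination pairs by their features, train one regressor per cluster, and route new observations to their cluster's model. The synthetic features, cluster count and learners below are illustrative, not the paper's specification.

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.ensemble import GradientBoostingRegressor

      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 6))    # toy OD-pair features (density, income, distance, ...)
      y = X[:, 0] * 3 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=2000)  # toy trip counts

      kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
      models = {}
      for c in range(4):
          mask = kmeans.labels_ == c
          models[c] = GradientBoostingRegressor(random_state=0).fit(X[mask], y[mask])

      def predict(X_new):
          # Assign each OD pair to its cluster, then use that cluster's model.
          clusters = kmeans.predict(X_new)
          return np.array([models[c].predict(x[None, :])[0] for c, x in zip(clusters, X_new)])

      print(predict(X[:3]))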
    Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud. (arXiv:2109.03285v1 [cs.LG])
    (2 min) Understanding the predictions made by machine learning (ML) models and their potential biases remains a challenging and labor-intensive task that depends on the application, the dataset, and the specific model. We present Amazon SageMaker Clarify, an explainability feature for Amazon SageMaker that launched in December 2020, providing insights into data and ML models by identifying biases and explaining predictions. It is deeply integrated into Amazon SageMaker, a fully managed service that enables data scientists and developers to build, train, and deploy ML models at any scale. Clarify supports bias detection and feature importance computation across the ML lifecycle, during data preparation, model evaluation, and post-deployment monitoring. We outline the desiderata derived from customer input, the modular architecture, and the methodology for bias and explanation computations. Further, we describe the technical challenges encountered and the tradeoffs we had to make. For illustration, we discuss two customer use cases. We present our deployment results including qualitative customer feedback and a quantitative evaluation. Finally, we summarize lessons learned, and discuss best practices for the successful adoption of fairness and explanation tools in practice.
    Sample and Communication-Efficient Decentralized Actor-Critic Algorithms with Finite-Time Analysis. (arXiv:2109.03699v1 [cs.LG])
    (2 min) Actor-critic (AC) algorithms have been widely adopted in decentralized multi-agent systems to learn the optimal joint control policy. However, existing decentralized AC algorithms either do not preserve the privacy of agents or are not sample and communication-efficient. In this work, we develop two decentralized AC and natural AC (NAC) algorithms that are private, and sample and communication-efficient. In both algorithms, agents share noisy information to preserve privacy and adopt mini-batch updates to improve sample and communication efficiency. Particularly for decentralized NAC, we develop a decentralized Markovian SGD algorithm with an adaptive mini-batch size to efficiently compute the natural policy gradient. Under Markovian sampling and linear function approximation, we prove the proposed decentralized AC and NAC algorithms achieve the state-of-the-art sample complexities $\mathcal{O}\big(\epsilon^{-2}\ln(\epsilon^{-1})\big)$ and $\mathcal{O}\big(\epsilon^{-3}\ln(\epsilon^{-1})\big)$, respectively, and the same small communication complexity $\mathcal{O}\big(\epsilon^{-1}\ln(\epsilon^{-1})\big)$. Numerical experiments demonstrate that the proposed algorithms achieve lower sample and communication complexities than the existing decentralized AC algorithm.
    A Deep Reinforcement Learning Approach for Constrained Online Logistics Route Assignment. (arXiv:2109.03467v1 [cs.LG])
    (2 min) As online shopping prevails and e-commerce platforms emerge, a tremendous number of parcels are transported every day. Thus, it is crucial for the logistics industry to properly assign a candidate logistics route to each shipping parcel, as this has a significant impact on total logistics cost optimization and on satisfying business constraints such as transit hub capacity and the delivery proportion of delivery providers. This online route-assignment problem can be viewed as a constrained online decision-making problem. Notably, the large number (beyond ${10^5}$) of daily parcels and the variability and non-Markovian characteristics of parcel information make it difficult to attain a (near-)optimal solution without violating constraints excessively. In this paper, we develop a model-free DRL approach named PPO-RA, in which Proximal Policy Optimization (PPO) is improved with dedicated techniques to address the challenges of route assignment (RA). The actor and critic networks use an attention mechanism and parameter sharing to accommodate each incoming parcel with varying numbers and identities of candidate routes, without modeling non-Markovian parcel arrival dynamics, since we assume i.i.d. parcel arrivals. We use recorded delivery parcel data to evaluate the performance of PPO-RA by comparing it with widely-used baselines via simulation. The results show the capability of the proposed approach to achieve considerable cost savings while satisfying most constraints.
    Preprocessing and Modeling of Radial Fan Data for Health State Prediction. (arXiv:2109.03468v1 [cs.LG])
    (2 min) Monitoring critical components of systems is a crucial step towards failure safety. Affordable sensors are available and the industry is in the process of introducing and extending monitoring solutions to improve product quality. Often, there is no expertise regarding how much data is required for a certain task (e.g., monitoring). Especially for vital machinery, a trend toward over-provisioned sensing can be observed, both in quality and in quantity. This often results in an excessive generation of data, which must nonetheless be transferred, processed and stored. In a previous case study, several sensors were mounted on a healthy radial fan, which was later artificially damaged. The gathered data was used for modeling (and therefore monitoring) a healthy state. The models were evaluated on a dataset created by using a faulty impeller. This paper focuses on the reduction of this data through downsampling and binning. Different models are created with linear regression and random forest regression, and the resulting difference in quality is discussed.
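    The two reduction steps discussed above can be illustrated on a toy sensor signal; the sampling rate, downsampling factor and bin width below are arbitrary choices for demonstration.

      import numpy as np
      import pandas as pd

      # Toy 100 Hz sensor signal with noise.
      t = pd.date_range("2021-01-01", periods=10_000, freq="10ms")
      signal = pd.Series(np.sin(np.linspace(0, 50, 10_000))
                         + np.random.normal(0, 0.05, 10_000), index=t)

      downsampled = signal.iloc[::10]          # downsampling: keep every 10th sample
      binned = signal.resample("1s").mean()    # binning: 1-second windows, mean-aggregated

      print(len(signal), len(downsampled), len(binned))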
    U-FNO -- an enhanced Fourier neural operator-based deep learning model for multiphase flow. (arXiv:2109.03697v1 [physics.geo-ph])
    (2 min) Numerical simulation of multiphase flow in porous media is essential for many geoscience applications. However, due to the multi-physics, non-linear, and multi-scale nature of the problem, these simulations are very expensive at desirable grid resolutions, and the computational cost often impedes rigorous engineering decision-making. Machine learning methods provide faster alternatives to traditional simulators by training neural network models on numerical simulation data mappings. Traditional convolutional neural network (CNN)-based models are accurate yet data-intensive and are prone to overfitting. Here we present a new architecture, U-FNO, an enhanced Fourier neural operator for solving the multiphase flow problem. The U-FNO is designed based on the Fourier neural operator (FNO) that learns an integral kernel in the Fourier space. Through a systematic comparison between a CNN benchmark and three FNO variants on a CO2-water multiphase problem in the context of CO2 geological storage, we show that the U-FNO architecture has the advantages of both the traditional CNN and the original FNO, providing significantly more accurate and efficient performance than previous architectures. The trained U-FNO provides gas saturation and pressure buildup predictions with a 10,000 times speedup compared to traditional numerical simulators while maintaining similar accuracy.
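    For orientation, the following is a minimal sketch of the core FNO building block referenced above: a 1D spectral convolution that keeps only the lowest Fourier modes and multiplies them by learned complex weights. The U-Net-style enhancements that define U-FNO are not reproduced here.

      import torch

      class SpectralConv1d(torch.nn.Module):
          def __init__(self, in_ch, out_ch, modes):
              super().__init__()
              self.modes = modes
              scale = 1.0 / (in_ch * out_ch)
              self.weight = torch.nn.Parameter(
                  scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat))

          def forward(self, x):                      # x: (batch, in_ch, n)
              x_ft = torch.fft.rfft(x)               # to Fourier space: (batch, in_ch, n//2 + 1)
              out_ft = torch.zeros(x.shape[0], self.weight.shape[1], x_ft.shape[-1],
                                   dtype=torch.cfloat, device=x.device)
              # Multiply only the lowest modes by learned complex weights.
              out_ft[:, :, :self.modes] = torch.einsum(
                  "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
              return torch.fft.irfft(out_ft, n=x.shape[-1])   # back to physical space

      x = torch.randn(4, 3, 64)
      print(SpectralConv1d(3, 8, modes=12)(x).shape)   # torch.Size([4, 8, 64])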
    Higher Order Kernel Mean Embeddings to Capture Filtrations of Stochastic Processes. (arXiv:2109.03582v1 [stat.ML])
    (2 min) Stochastic processes are random variables with values in some space of paths. However, reducing a stochastic process to a path-valued random variable ignores its filtration, i.e. the flow of information carried by the process through time. By conditioning the process on its filtration, we introduce a family of higher order kernel mean embeddings (KMEs) that generalizes the notion of KME and captures additional information related to the filtration. We derive empirical estimators for the associated higher order maximum mean discrepancies (MMDs) and prove consistency. We then construct a filtration-sensitive kernel two-sample test able to pick up information that gets missed by the standard MMD test. In addition, leveraging our higher order MMDs, we construct a family of universal kernels on stochastic processes that allows us to solve real-world calibration and optimal stopping problems in quantitative finance (such as the pricing of American options) via classical kernel-based regression methods. Finally, adapting existing tests for conditional independence to the case of stochastic processes, we design a causal-discovery algorithm to recover the causal graph of structural dependencies among interacting bodies solely from observations of their multidimensional trajectories.
    Computing on Functions Using Randomized Vector Representations. (arXiv:2109.03429v1 [cs.LG])
    (2 min) Vector space models for symbolic processing that encode symbols by random vectors have been proposed in cognitive science and connectionist communities under the names Vector Symbolic Architecture (VSA), and, synonymously, Hyperdimensional (HD) computing. In this paper, we generalize VSAs to function spaces by mapping continuous-valued data into a vector space such that the inner product between the representations of any two data points represents a similarity kernel. By analogy to VSA, we call this new function encoding and computing framework Vector Function Architecture (VFA). In VFAs, vectors can represent individual data points as well as elements of a function space (a reproducing kernel Hilbert space). The algebraic vector operations, inherited from VSA, correspond to well-defined operations in function space. Furthermore, we study a previously proposed method for encoding continuous data, fractional power encoding (FPE), which uses exponentiation of a random base vector to produce randomized representations of data points and fulfills the kernel properties for inducing a VFA. We show that the distribution from which elements of the base vector are sampled determines the shape of the FPE kernel, which in turn induces a VFA for computing with band-limited functions. In particular, VFAs provide an algebraic framework for implementing large-scale kernel machines with random features, extending Rahimi and Recht, 2007. Finally, we demonstrate several applications of VFA models to problems in image recognition, density estimation and nonlinear regression. Our analyses and results suggest that VFAs constitute a powerful new framework for representing and manipulating functions in distributed neural systems, with myriad applications in artificial intelligence.
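    A minimal sketch of fractional power encoding as described above, assuming a complex phasor base vector: a scalar is encoded by raising the base vector element-wise to that power, and the normalized inner product between two encodings behaves as a shift-invariant similarity kernel whose shape is determined by the phase distribution.

      import numpy as np

      d = 2048
      rng = np.random.default_rng(0)
      phases = rng.uniform(-np.pi, np.pi, d)   # phase distribution sets the kernel shape
      base = np.exp(1j * phases)               # random unit-modulus base vector

      def encode(x):
          return base ** x                     # element-wise fractional power

      def similarity(x, y):
          # Normalized inner product; depends only on the difference x - y.
          return np.real(np.vdot(encode(x), encode(y))) / d

      print(similarity(0.5, 0.5), similarity(0.5, 0.7), similarity(0.5, 3.0))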
    Single Plane-Wave Imaging using Physics-Based Deep Learning. (arXiv:2109.03661v1 [eess.IV])
    (2 min) In plane-wave imaging, multiple unfocused ultrasound waves are transmitted into a medium of interest from different angles and an image is formed from the recorded reflections. The number of plane waves used leads to a trade-off between frame-rate and image quality, with single-plane-wave (SPW) imaging being the fastest possible modality with the worst image quality. Recently, deep learning methods have been proposed to improve ultrasound imaging. One approach is to use image-to-image networks that work on the formed image and another is to directly learn a mapping from data to an image. Both approaches utilize purely data-driven models and require deep, expressive network architectures, combined with large numbers of training samples to obtain good results. Here, we propose a data-to-image architecture that incorporates a wave-physics-based image formation algorithm in-between deep convolutional neural networks. To achieve this, we implement the Fourier (FK) migration method as network layers and train the whole network end-to-end. We compare our proposed data-to-image network with an image-to-image network in simulated data experiments, mimicking a medical ultrasound application. Experiments show that it is possible to obtain high-quality SPW images, almost similar to an image formed using 75 plane waves over an angular range of $\pm$16$^\circ$. This illustrates the great potential of combining deep neural networks with physics-based image formation algorithms for SPW imaging.
    Shuffled Patch-Wise Supervision for Presentation Attack Detection. (arXiv:2109.03484v1 [cs.CV])
    (2 min) Face anti-spoofing is essential to prevent false facial verification by using a photo, video, mask, or a different substitute for an authorized person's face. Most of the state-of-the-art presentation attack detection (PAD) systems suffer from overfitting, where they achieve near-perfect scores on a single dataset but fail on a different dataset with more realistic data. This problem drives researchers to develop models that perform well under real-world conditions. This is an especially challenging problem for frame-based presentation attack detection systems that use convolutional neural networks (CNN). To this end, we propose a new PAD approach, which combines pixel-wise binary supervision with patch-based CNN. We believe that training a CNN with face patches allows the model to distinguish spoofs without learning background or dataset-specific traces. We tested the proposed method both on the standard benchmark datasets -- Replay-Mobile, OULU-NPU -- and on a real-world dataset. The proposed approach shows its superiority on challenging experimental setups. Namely, it achieves higher performance on OULU-NPU protocol 3, 4 and on inter-dataset real-world experiments.
    Tactile Image-to-Image Disentanglement of Contact Geometry from Motion-Induced Shear. (arXiv:2109.03615v1 [cs.RO])
    (2 min) Robotic touch, particularly when using soft optical tactile sensors, suffers from distortion caused by motion-dependent shear. The manner in which the sensor contacts a stimulus is entangled with the tactile information about the geometry of the stimulus. In this work, we propose a supervised convolutional deep neural network model that learns to disentangle, in the latent space, the components of sensor deformations caused by contact geometry from those due to sliding-induced shear. The approach is validated by reconstructing unsheared tactile images from sheared images and showing they match unsheared tactile images collected with no sliding motion. In addition, the unsheared tactile images give a faithful reconstruction of the contact geometry that is not possible from the sheared data, and a robust estimation of the contact pose that can be used for servo-controlled sliding around various 2D shapes. Finally, the contact geometry reconstruction in conjunction with servo-controlled sliding was used for faithful full object reconstruction of various 2D shapes. The methods have broad applicability to deep learning models for robots with a shear-sensitive sense of touch.
    Power to the Relational Inductive Bias: Graph Neural Networks in Electrical Power Grids. (arXiv:2109.03604v1 [cs.LG])
    (2 min) The application of graph neural networks (GNNs) to the domain of electrical power grids has high potential impact on smart grid monitoring. Even though there is a natural correspondence between power flow and message-passing in GNNs, their performance on power grids is not well understood. We argue that there is a gap: GNN research is driven by benchmarks whose graphs differ from power grids in several important aspects. Additionally, inductive learning of GNNs across multiple power grid topologies has not been explored with real-world data. We address this gap by means of (i) defining power grid graph datasets in inductive settings, (ii) an exploratory analysis of graph properties, and (iii) an empirical study of the concrete learning task of state estimation on real-world power grids. Our results show that GNNs are more robust to noise, with up to 400% lower error compared to baselines. Furthermore, due to the unique properties of electrical grids, we do not observe the well-known over-smoothing phenomenon of GNNs and find the best-performing models to be exceptionally deep, with up to 13 layers. This is in stark contrast to existing benchmark datasets, where the consensus is that 2 to 3 layer GNNs perform best. Our results demonstrate that a key challenge in this domain is to effectively handle long-range dependence.
    FaBiAN: A Fetal Brain magnetic resonance Acquisition Numerical phantom. (arXiv:2109.03624v1 [physics.med-ph])
    (3 min) Accurate characterization of in utero human brain maturation is critical as it involves complex and interconnected structural and functional processes that may influence health later in life. Magnetic resonance imaging is a powerful tool to investigate equivocal neurological patterns during fetal development. However, the number of acquisitions of satisfactory quality available in this cohort of sensitive subjects remains scarce, thus hindering the validation of advanced image processing techniques. Numerical phantoms can mitigate these limitations by providing a controlled environment with a known ground truth. In this work, we present FaBiAN, an open-source Fetal Brain magnetic resonance Acquisition Numerical phantom that simulates clinical T2-weighted fast spin echo sequences of the fetal brain. This unique tool is based on a general, flexible and realistic setup that includes stochastic fetal movements, thus providing images of the fetal brain throughout maturation comparable to clinical acquisitions. We demonstrate its value to evaluate the robustness and optimize the accuracy of an algorithm for super-resolution fetal brain magnetic resonance imaging from simulated motion-corrupted 2D low-resolution series as compared to a synthetic high-resolution reference volume. We also show that the images generated can complement clinical datasets to support data-intensive deep learning methods for fetal brain tissue segmentation.
    RepNAS: Searching for Efficient Re-parameterizing Blocks. (arXiv:2109.03508v1 [cs.LG])
    (2 min) In the past years, significant improvements have been made in the field of neural architecture search (NAS). However, it is still challenging to search for efficient networks due to the gap between the constraint used during the search and the real inference time. To search for a high-performance network with low inference time, several previous works set a computational complexity constraint for the search algorithm. However, many factors affect the speed of inference (e.g., FLOPs, MACs), and the correlation between a single indicator and the latency is not strong. Recently, some re-parameterization (Rep) techniques have been proposed to convert multi-branch architectures to single-path architectures that are inference-friendly. Nevertheless, multi-branch architectures are still human-defined and inefficient. In this work, we propose a new search space that is suitable for structural re-parameterization techniques. RepNAS, a one-stage NAS approach, is presented to efficiently search the optimal diverse branch block (ODBB) for each layer under a branch number constraint. Our experimental results show the searched ODBB can easily surpass the manual diverse branch block (DBB) with efficient training. Code and models will be made available soon.
    Understanding and Preparing Data of Industrial Processes for Machine Learning Applications. (arXiv:2109.03469v1 [cs.LG])
    (2 min) Industrial applications of machine learning face unique challenges due to the nature of raw industry data. Preprocessing and preparing raw industrial data for machine learning applications is a demanding task that often takes more time and work than the actual modeling process itself and poses additional challenges. This paper addresses one of those challenges, specifically, the challenge of missing values due to sensor unavailability at different production units of nonlinear production lines. In cases where only a small proportion of the data is missing, those missing values can often be imputed. In cases of large proportions of missing data, imputing is often not feasible, and removing observations containing missing values is often the only option. This paper presents a technique that allows utilizing all of the available data without removing large numbers of observations where data is only partially available. We not only discuss the principal idea of the presented method, but also show different possible implementations that can be applied depending on the data at hand. Finally, we demonstrate the application of the presented method with data from a steel production plant.

2021-09-08

  • cs.CL updates on arXiv.org

    Intuitive Contrasting Map for Antonym Embeddings. (arXiv:2004.12835v2 [cs.CL] UPDATED)
    (2 min) This paper shows that modern word embeddings contain information that distinguishes synonyms and antonyms despite small cosine similarities between the corresponding vectors. This information is encoded in the geometry of the embeddings and can be extracted with a straightforward and intuitive manifold learning procedure or a contrasting map. Such a map is trained on a small labeled subset of the data and can produce new embeddings that explicitly highlight specific semantic attributes of the word. The new embeddings produced by the map are shown to improve the performance on downstream tasks.
    Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual Styles. (arXiv:2109.03158v1 [cs.CL])
    (2 min) An individual's variation in writing style is often a function of both social and personal attributes. While structured social variation has been extensively studied, e.g., gender based variation, far less is known about how to characterize individual styles due to their idiosyncratic nature. We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features. The neural model achieves strong performance at authorship identification on short texts and through an analogy-based probing task, showing that the learned representations exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, we quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, we provide a description of idiolects through measuring inter- and intra-author variation, showing that variation in idiolects is often distinctive yet consistent.
    Aspect-Controllable Opinion Summarization. (arXiv:2109.03171v1 [cs.CL])
    (2 min) Recent work on opinion summarization produces general summaries based on a set of input reviews and the popularity of opinions expressed in them. In this paper, we propose an approach that allows the generation of customized summaries based on aspect queries (e.g., describing the location and room of a hotel). Using a review corpus, we create a synthetic training dataset of (review, summary) pairs enriched with aspect controllers which are induced by a multi-instance learning model that predicts the aspects of a document at different levels of granularity. We fine-tune a pretrained model using our synthetic dataset and generate aspect-specific summaries by modifying the aspect controllers. Experiments on two benchmarks show that our model outperforms the previous state of the art and generates personalized summaries by controlling the number of aspects discussed in them.
    FHAC at GermEval 2021: Identifying German toxic, engaging, and fact-claiming comments with ensemble learning. (arXiv:2109.03094v1 [cs.CL])
    (2 min) The availability of language representations learned by large pretrained neural network models (such as BERT and ELECTRA) has led to improvements in many downstream Natural Language Processing tasks in recent years. Pretrained models usually differ in pretraining objectives, architectures, and datasets they are trained on which can affect downstream performance. In this contribution, we fine-tuned German BERT and German ELECTRA models to identify toxic (subtask 1), engaging (subtask 2), and fact-claiming comments (subtask 3) in Facebook data provided by the GermEval 2021 competition. We created ensembles of these models and investigated whether and how classification performance depends on the number of ensemble members and their composition. On out-of-sample data, our best ensemble achieved a macro-F1 score of 0.73 (for all subtasks), and F1 scores of 0.72, 0.70, and 0.76 for subtasks 1, 2, and 3, respectively.
    Revisiting Context Choices for Context-aware Machine Translation. (arXiv:2109.02995v1 [cs.CL])
    (0 min) One of the most popular methods for context-aware machine translation (MT) is to use separate encoders for the source sentence and context as multiple sources for one target sentence. Recent work has cast doubt on whether these models actually learn useful signals from the context or are improvements in automatic evaluation metrics just a side-effect. We show that multi-source transformer models improve MT over standard transformer-base models even with empty lines provided as context, but the translation quality improves significantly (1.51 - 2.65 BLEU) when a sufficient amount of correct context is provided. We also show that even though randomly shuffling in-domain context can also improve over baselines, the correct context further improves translation quality and random out-of-domain context further degrades it.
    Knowledge Graph Augmented Political Perspective Detection in News Media. (arXiv:2108.03861v2 [cs.CL] UPDATED)
    (2 min) Identifying political perspective in news media has become an important task due to the rapid growth of political commentary and increasingly polarized ideologies. Previous approaches focus only on leveraging semantic information and leave out the rich social and political context that helps individuals understand political stances. In this paper, we propose a perspective detection method that incorporates external knowledge of real-world politics. Specifically, we construct a contemporary political knowledge graph with 1,071 entities and 10,703 triples. We then build a heterogeneous information network for each news document that jointly models article semantics and external knowledge in knowledge graphs. Finally, we apply gated relational graph convolutional networks and conduct political perspective detection as graph-level classification. Extensive experiments show that our method achieves the best performance and outperforms state-of-the-art methods by 5.49%. Numerous ablation studies further bear out the necessity of external knowledge and the effectiveness of our graph-based approach.
    Data Driven Content Creation using Statistical and Natural Language Processing Techniques for Financial Domain. (arXiv:2109.02935v1 [cs.CL])
    (2 min) Over the years, customers' expectation of getting information instantaneously has given rise to increased usage of channels like virtual assistants. Typically, customers try to get their questions answered by low-touch channels like search and virtual assistants first, before getting in touch with a live chat agent or a phone representative. Higher usage of these low-touch systems is a win-win for both customers and the organization, since it enables organizations to attain a low cost of service while customers get served without delay. In this paper, we propose a two-part framework where the first part describes methods to combine the information from different interaction channels like call, search, and chat. We do this by summarizing (using a stacked Bi-LSTM network) the high-touch interaction channel data such as call and chat into short, search-query-like customer intents and then creating an organically grown intent taxonomy from interaction data (using Hierarchical Agglomerative Clustering). The second part of the framework focuses on extracting customer questions by analyzing interaction data sources. It calculates similarity scores using TF-IDF and BERT (Devlin et al., 2019). It also maps these identified questions to the output of the first part of the framework using syntactic and semantic similarity.
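    As a toy illustration of the TF-IDF half of the similarity scoring mentioned above (the BERT-based scores would be computed analogously with contextual embeddings), with hypothetical intents and questions:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      intents = ["reset online banking password", "dispute a credit card charge"]
      questions = ["How do I reset my password?", "How can I dispute a charge on my card?"]

      # Fit a shared TF-IDF vocabulary, then score each question against each mined intent.
      vec = TfidfVectorizer().fit(intents + questions)
      scores = cosine_similarity(vec.transform(questions), vec.transform(intents))
      print(scores)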
    FinQA: A Dataset of Numerical Reasoning over Financial Data. (arXiv:2109.00122v2 [cs.CL] UPDATED)
    (2 min) The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks on general domains, the finance domain includes complex numerical reasoning and understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments on our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset -- the first of its kind -- should therefore enable significant, new community research into complex application domains. The dataset and code are publicly available at \url{https://github.com/czyssrs/FinQA}.
    Encoding Heterogeneous Social and Political Context for Entity Stance Prediction. (arXiv:2108.03881v2 [cs.CL] UPDATED)
    (2 min) Political stance detection has become an important task due to increasingly polarized political ideologies. Most existing works focus on identifying perspectives in news articles or social media posts, while social entities, such as individuals and organizations, produce these texts and actually take stances. In this paper, we propose the novel task of entity stance prediction, which aims to predict entities' stances given their social and political context. Specifically, we retrieve facts from Wikipedia about social entities regarding contemporary U.S. politics. We then annotate social entities' stances towards political ideologies with the help of domain experts. After defining the task of entity stance prediction, we propose a graph-based solution, which constructs a heterogeneous information network from collected facts and adopts gated relational graph convolutional networks for representation learning. Our model is then trained with a combination of supervised, self-supervised and unsupervised loss functions, which are motivated by multiple social and political phenomena. We conduct extensive experiments to compare our method with existing text and graph analysis baselines. Our model achieves the highest stance detection accuracy and yields inspiring insights regarding social entity stances. We further conduct ablation studies and parameter analyses to study the mechanism and effectiveness of our proposed approach.
    DuTrust: A Sentiment Analysis Dataset for Trustworthiness Evaluation. (arXiv:2108.13140v2 [cs.CL] UPDATED)
    (2 min) While deep learning models have greatly improved the performance of most artificial intelligence tasks, they are often criticized to be untrustworthy due to the black-box problem. Consequently, many works have been proposed to study the trustworthiness of deep learning. However, as most open datasets are designed for evaluating the accuracy of model outputs, there is still a lack of appropriate datasets for evaluating the inner workings of neural networks. The lack of datasets obviously hinders the development of trustworthiness research. Therefore, in order to systematically evaluate the factors for building trustworthy systems, we propose a novel and well-annotated sentiment analysis dataset to evaluate robustness and interpretability. To evaluate these factors, our dataset contains diverse annotations about the challenging distribution of instances, manual adversarial instances and sentiment explanations. Several evaluation metrics are further proposed for interpretability and robustness. Based on the dataset and metrics, we conduct comprehensive comparisons for the trustworthiness of three typical models, and also study the relations between accuracy, robustness and interpretability. We release this trustworthiness evaluation dataset at \url{https://github/xyz} and hope our work can facilitate the progress on building more trustworthy systems for real-world applications.
    MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. (arXiv:2109.00904v2 [cs.CL] UPDATED)
    (2 min) We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BITFIT, LNFIT, originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.
    NumGPT: Improving Numeracy Ability of Generative Pre-trained Models. (arXiv:2109.03137v1 [cs.CL])
    (2 min) Existing generative pre-trained language models (e.g., GPT) focus on modeling the language structure and semantics of general texts. However, those models do not consider the numerical properties of numbers and cannot perform robustly on numerical reasoning tasks (e.g., math word problems and measurement estimation). In this paper, we propose NumGPT, a generative pre-trained model that explicitly models the numerical properties of numbers in texts. Specifically, it leverages a prototype-based numeral embedding to encode the mantissa of the number and an individual embedding to encode the exponent of the number. A numeral-aware loss function is designed to integrate numerals into the pre-training objective of NumGPT. We conduct extensive experiments on four different datasets to evaluate the numeracy ability of NumGPT. The experiment results show that NumGPT outperforms baseline models (e.g., GPT and GPT with DICE) on a range of numerical reasoning tasks such as measurement estimation, number comparison, math word problems, and magnitude classification. Ablation studies are also conducted to evaluate the impact of pre-training and model hyperparameters on the performance.
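    A hypothetical sketch of the mantissa/exponent split described above: the number is decomposed into a scientific-notation mantissa and exponent, which are embedded separately and combined. The prototype-based mantissa encoding and the numeral-aware pre-training loss of NumGPT itself are not reproduced here.

      import math
      import torch

      class NumeralEmbedding(torch.nn.Module):
          def __init__(self, dim, max_exp=20):
              super().__init__()
              self.mantissa_proj = torch.nn.Linear(1, dim)              # continuous mantissa part
              self.exponent_emb = torch.nn.Embedding(2 * max_exp + 1, dim)
              self.max_exp = max_exp

          def forward(self, value: float):
              # Decompose value = mantissa * 10^exp, with mantissa in [1, 10).
              exp = 0 if value == 0 else int(math.floor(math.log10(abs(value))))
              mantissa = 0.0 if value == 0 else value / (10 ** exp)
              exp_idx = max(-self.max_exp, min(self.max_exp, exp)) + self.max_exp
              m = self.mantissa_proj(torch.tensor([[mantissa]], dtype=torch.float32))
              e = self.exponent_emb(torch.tensor([exp_idx]))
              return m + e                                              # combined numeral embedding

      print(NumeralEmbedding(dim=16)(1234.5).shape)   # torch.Size([1, 16])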
    Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers. (arXiv:2101.00234v2 [cs.CL] UPDATED)
    (2 min) Transformers have shown improved performance when compared to previous architectures for sequence processing such as RNNs. Despite their sizeable performance gains, as recently suggested, the model is computationally expensive to train and with a high parameter budget. In light of this, we explore parameter-sharing methods in Transformers with a specific focus on generative models. We perform an analysis of different parameter sharing/reduction methods and develop the Subformer. Our model combines sandwich-style parameter sharing, which overcomes naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
    Infusing Future Information into Monotonic Attention Through Language Models. (arXiv:2109.03121v1 [cs.CL])
    (2 min) Simultaneous neural machine translation (SNMT) models start emitting the target sequence before they have processed the source sequence. The recent adaptive policies for SNMT use monotonic attention to perform read/write decisions based on the partial source and target sequences. The lack of sufficient information might cause the monotonic attention to take poor read/write decisions, which in turn negatively affects the performance of the SNMT model. On the other hand, human translators make better read/write decisions since they can anticipate the immediate future words using linguistic information and domain knowledge. Motivated by human translators, in this work, we propose a framework to aid monotonic attention with an external language model to improve its decisions. We conduct experiments on the MuST-C English-German and English-French speech-to-text translation tasks to show the effectiveness of the proposed framework. The proposed SNMT method improves the quality-latency trade-off over the state-of-the-art monotonic multihead attention.
    Eliminating Sentiment Bias for Aspect-Level Sentiment Classification with Unsupervised Opinion Extraction. (arXiv:2109.02403v2 [cs.CL] UPDATED)
    (2 min) Aspect-level sentiment classification (ALSC) aims at identifying the sentiment polarity of a specified aspect in a sentence. ALSC is a practical setting in aspect-based sentiment analysis due to no opinion term labeling needed, but it fails to interpret why a sentiment polarity is derived for the aspect. To address this problem, recent works fine-tune pre-trained Transformer encoders for ALSC to extract an aspect-centric dependency tree that can locate the opinion words. However, the induced opinion words only provide an intuitive cue far below human-level interpretability. Besides, the pre-trained encoder tends to internalize an aspect's intrinsic sentiment, causing sentiment bias and thus affecting model performance. In this paper, we propose a span-based anti-bias aspect representation learning framework. It first eliminates the sentiment bias in the aspect embedding by adversarial learning against aspects' prior sentiment. Then, it aligns the distilled opinion candidates with the aspect by span-based dependency modeling to highlight the interpretable opinion terms. Our method achieves new state-of-the-art performance on five benchmarks, with the capability of unsupervised opinion extraction.
    ExCode-Mixed: Explainable Approaches towards Sentiment Analysis on Code-Mixed Data using BERT models. (arXiv:2109.03200v1 [cs.AI])
    (2 min) The increasing use of social media sites in countries like India has given rise to large volumes of code-mixed data. Sentiment analysis of this data can provide integral insights into people's perspectives and opinions. Developing robust explainability techniques which explain why models make their predictions becomes essential. In this paper, we propose an adequate methodology to integrate explainable approaches into code-mixed sentiment analysis.
    Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression. (arXiv:2109.03228v1 [cs.CL])
    (2 min) Recent studies on compression of pretrained language models (e.g., BERT) usually use preserved accuracy as the metric for evaluation. In this paper, we propose two new metrics, label loyalty and probability loyalty that measure how closely a compressed model (i.e., student) mimics the original model (i.e., teacher). We also explore the effect of compression with regard to robustness under adversarial attacks. We benchmark quantization, pruning, knowledge distillation and progressive module replacing with loyalty and robustness. By combining multiple compression techniques, we provide a practical strategy to achieve better accuracy, loyalty and robustness.
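    A hedged sketch of how the two loyalty metrics could be computed: label loyalty as the agreement rate between student and teacher predicted labels, and probability loyalty as one minus a Jensen-Shannon distance between their predictive distributions (one reasonable instantiation; the paper's exact definition may differ).

      import numpy as np
      from scipy.spatial.distance import jensenshannon

      def label_loyalty(p_teacher, p_student):
          # Fraction of examples where student and teacher predict the same label.
          return np.mean(p_teacher.argmax(1) == p_student.argmax(1))

      def probability_loyalty(p_teacher, p_student):
          # Similarity of predictive distributions via Jensen-Shannon distance.
          dists = [jensenshannon(t, s, base=2) for t, s in zip(p_teacher, p_student)]
          return 1.0 - float(np.mean(dists))

      p_t = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
      p_s = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])
      print(label_loyalty(p_t, p_s), probability_loyalty(p_t, p_s))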
    How much pretraining data do language models need to learn syntax? (arXiv:2109.03160v1 [cs.CL])
    (2 min) Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not always offer a better performance across the different syntactic phenomena and come at a higher financial and environmental cost.
    PAUSE: Positive and Annealed Unlabeled Sentence Embedding. (arXiv:2109.03155v1 [cs.CL])
    (2 min) Sentence embedding refers to a set of effective and versatile techniques for converting raw text into numerical vector representations that can be used in a wide range of natural language processing (NLP) applications. The majority of these techniques are either supervised or unsupervised. Compared to the unsupervised methods, the supervised ones make fewer assumptions about optimization objectives and usually achieve better results. However, the training requires a large amount of labeled sentence pairs, which is not available in many industrial scenarios. To that end, we propose a generic and end-to-end approach -- PAUSE (Positive and Annealed Unlabeled Sentence Embedding), capable of learning high-quality sentence embeddings from a partially labeled dataset. We experimentally show that PAUSE achieves, and sometimes surpasses, state-of-the-art results using only a small fraction of labeled sentence pairs on various benchmark tasks. When applied to a real industrial use case where labeled samples are scarce, PAUSE encourages us to extend our dataset without the liability of extensive manual annotation work.
    Toward Improving Coherence and Diversity of Slogan Generation. (arXiv:2102.05924v2 [cs.CL] UPDATED)
    (2 min) Previous work in slogan generation focused on utilising slogan skeletons mined from existing slogans. While some generated slogans can be catchy, they are often not coherent with the company's focus or style across their marketing communications because the skeletons are mined from other companies' slogans. We propose a sequence-to-sequence (seq2seq) transformer model to generate slogans from a brief company description. A naive seq2seq model fine-tuned for slogan generation is prone to introducing false information. We use company name delexicalisation and entity masking to alleviate this problem and improve the generated slogans' quality and truthfulness. Furthermore, we apply conditional training based on the first words' POS tag to generate syntactically diverse slogans. Our best model achieved a ROUGE-1/-2/-L F1 score of 35.58/18.47/33.32. Besides, automatic and human evaluations indicate that our method generates significantly more factual, diverse and catchy slogans than strong LSTM and transformer seq2seq baselines.
    WordBias: An Interactive Visual Tool for Discovering Intersectional Biases Encoded in Word Embeddings. (arXiv:2103.03598v2 [cs.CL] UPDATED)
    (2 min) Intersectional bias is a bias caused by an overlap of multiple social factors like gender, sexuality, race, disability, religion, etc. A recent study has shown that word embedding models can be laden with biases against intersectional groups like African American females, etc. The first step towards tackling such intersectional biases is to identify them. However, discovering biases against different intersectional groups remains a challenging task. In this work, we present WordBias, an interactive visual tool designed to explore biases against intersectional groups encoded in static word embeddings. Given a pretrained static word embedding, WordBias computes the association of each word along different groups based on race, age, etc. and then visualizes them using a novel interactive interface. Using a case study, we demonstrate how WordBias can help uncover biases against intersectional groups like Black Muslim Males, Poor Females, etc. encoded in word embedding. In addition, we also evaluate our tool using qualitative feedback from expert interviews. The source code for this tool can be publicly accessed for reproducibility at github.com/bhavyaghai/WordBias.
    Sentence-Permuted Paragraph Generation. (arXiv:2104.07228v2 [cs.CL] UPDATED)
    (2 min) Generating paragraphs with diverse content is important in many applications. Existing generation models produce similar content from homogenized contexts due to the fixed left-to-right sentence order. Our idea is to permute the sentence order to improve the content diversity of multi-sentence paragraphs. We propose a novel framework PermGen whose objective is to maximize the expected log-likelihood of output paragraph distributions with respect to all possible sentence orders. PermGen uses hierarchical positional embedding and designs new procedures for training, decoding, and candidate ranking in the sentence-permuted generation. Experiments on three paragraph generation benchmarks demonstrate PermGen generates more diverse outputs with a higher quality than existing models.
    Data-QuestEval: A Referenceless Metric for Data-to-Text Semantic Evaluation. (arXiv:2104.07555v3 [cs.CL] UPDATED)
    (2 min) QuestEval is a reference-less metric used in text-to-text tasks, that compares the generated summaries directly to the source text, by automatically asking and answering questions. Its adaptation to Data-to-Text tasks is not straightforward, as it requires multimodal Question Generation and Answering systems on the considered tasks, which are seldom available. To this purpose, we propose a method to build synthetic multimodal corpora enabling to train multimodal components for a data-QuestEval metric. The resulting metric is reference-less and multimodal; it obtains state-of-the-art correlations with human judgment on the WebNLG and WikiBio benchmarks. We make data-QuestEval's code and models available for reproducibility purpose, as part of the QuestEval project.
    Injecting Entity Types into Entity-Guided Text Generation. (arXiv:2009.13401v3 [cs.CL] UPDATED)
    (2 min) Recent successes in deep generative modeling have led to significant advances in natural language generation (NLG). Incorporating entities into neural generation models has demonstrated great improvements by helping to infer the summary topic and to generate coherent content. To enhance the role of entities in NLG, in this paper we aim to model the entity type in the decoding phase to generate contextual words accurately. We develop a novel NLG model to produce a target sequence based on a given list of entities. Our model has a multi-step decoder that injects the entity types into the process of entity mention generation. Experiments on two public news datasets demonstrate that type injection performs better than existing type embedding concatenation baselines.
    Joint model for intent and entity recognition. (arXiv:2109.03221v1 [cs.CL])
    (2 min) The semantic understanding of natural dialogues comprises several parts. Some of them, such as intent classification and entity detection, play a crucial role in deciding the next steps in handling user input. Handling each task as an individual problem can waste training resources, and the problems can also benefit from each other. This paper tackles these problems as one. Our new model, which combines intent and entity recognition into one system, achieves better metrics on both tasks with lower training requirements than solving each task separately. We also optimize the model based on the inputs.
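    A minimal PyTorch sketch of the general shared-encoder idea behind joint intent and entity recognition: one encoder feeds both an utterance-level intent head and a token-level entity head, so a single backward pass trains the shared parameters from both losses. The architecture, layer sizes, and pooling choice are assumptions for illustration, not the paper's exact model.

        import torch
        import torch.nn as nn

        class JointIntentEntityModel(nn.Module):
            """Shared encoder with two heads: utterance-level intent and token-level entity tags."""
            def __init__(self, vocab_size, num_intents, num_entity_tags, hidden=256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, hidden)
                self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
                self.intent_head = nn.Linear(2 * hidden, num_intents)
                self.entity_head = nn.Linear(2 * hidden, num_entity_tags)

            def forward(self, token_ids):
                states, _ = self.encoder(self.embed(token_ids))        # (B, T, 2H)
                intent_logits = self.intent_head(states.mean(dim=1))   # pooled utterance representation
                entity_logits = self.entity_head(states)               # per-token tag scores
                return intent_logits, entity_logits

        # Joint loss: one backward pass updates the shared encoder from both tasks.
        model = JointIntentEntityModel(vocab_size=10000, num_intents=7, num_entity_tags=9)
        tokens = torch.randint(0, 10000, (4, 12))
        intent_logits, entity_logits = model(tokens)
        loss = nn.functional.cross_entropy(intent_logits, torch.zeros(4, dtype=torch.long)) \
             + nn.functional.cross_entropy(entity_logits.reshape(-1, 9), torch.zeros(4 * 12, dtype=torch.long))
        loss.backward()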
    Unsupervised Conversation Disentanglement through Co-Training. (arXiv:2109.03199v1 [cs.CL])
    (2 min) Conversation disentanglement aims to separate intermingled messages into detached sessions, which is a fundamental task in understanding multi-party conversations. Existing work on conversation disentanglement relies heavily upon human-annotated datasets, which are expensive to obtain in practice. In this work, we explore training a conversation disentanglement model without referencing any human annotations. Our method is built upon a deep co-training algorithm, which consists of two neural networks: a message-pair classifier and a session classifier. The former is responsible for retrieving local relations between two messages while the latter categorizes a message to a session by capturing context-aware information. Both networks are initialized respectively with pseudo data built from an unannotated corpus. During the deep co-training process, we use the session classifier as a reinforcement learning component to learn a session assigning policy by maximizing the local rewards given by the message-pair classifier. For the message-pair classifier, we enrich its training data by retrieving message pairs with high confidence from the disentangled sessions predicted by the session classifier. Experimental results on the large Movie Dialogue Dataset demonstrate that our proposed approach achieves competitive performance compared to the previous supervised methods. Further experiments show that the predicted disentangled conversations can promote the performance on the downstream task of multi-party response selection.
    GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation. (arXiv:2109.03079v1 [cs.CL])
    (2 min) Practical dialogue systems require robust methods of detecting out-of-scope (OOS) utterances to avoid conversational breakdowns and related failure modes. Directly training a model with labeled OOS examples yields reasonable performance, but obtaining such data is a resource-intensive process. To tackle this limited-data problem, previous methods focus on better modeling the distribution of in-scope (INS) examples. We introduce GOLD as an orthogonal technique that augments existing data to train better OOS detectors operating in low-data regimes. GOLD generates pseudo-labeled candidates using samples from an auxiliary dataset and keeps only the most beneficial candidates for training through a novel filtering mechanism. In experiments across three target benchmarks, the top GOLD model outperforms all existing methods on all key metrics, achieving relative gains of 52.4%, 48.9% and 50.3% against median baseline performance. We also analyze the unique properties of OOS data to identify key factors for optimally applying our proposed method.
    Sequential Attention Module for Natural Language Processing. (arXiv:2109.03009v1 [cs.AI])
    (2 min) Recently, large pre-trained neural language models have attained remarkable performance on many downstream natural language processing (NLP) applications via fine-tuning. In this paper, we focus on how to further improve the token representations of such language models. We therefore propose a simple yet effective plug-and-play module, the Sequential Attention Module (SAM), on the token embeddings learned from a pre-trained language model. Our proposed SAM consists of two main attention modules deployed sequentially: a Feature-wise Attention Module (FAM) and a Token-wise Attention Module (TAM). More specifically, FAM can effectively identify the importance of features at each dimension and promote the effect via a dot-product on the original token embeddings for downstream NLP applications. Meanwhile, TAM can further re-weight the features at the token-wise level. Moreover, we propose an adaptive filter on FAM to prevent noise impact and increase information absorption. Finally, we conduct extensive experiments to demonstrate the advantages and properties of our proposed SAM. We first show how SAM plays a primary role in the champion solution of two subtasks of SemEval'21 Task 7. After that, we apply SAM to sentiment analysis and three popular NLP tasks and demonstrate that SAM consistently outperforms the state-of-the-art baselines.
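    A rough PyTorch sketch of the two-stage re-weighting idea described above: a feature-wise gate followed by a token-wise softmax weighting applied on top of pre-trained token embeddings. The concrete gating and scoring functions here are my assumptions; SAM's actual FAM/TAM formulations and adaptive filter may differ.

        import torch
        import torch.nn as nn

        class SequentialAttention(nn.Module):
            """Feature-wise attention followed by token-wise attention over token embeddings."""
            def __init__(self, dim):
                super().__init__()
                self.feature_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # FAM-style gate
                self.token_scorer = nn.Linear(dim, 1)                                 # TAM-style scorer

            def forward(self, x):                                     # x: (batch, seq_len, dim)
                x = x * self.feature_gate(x)                          # re-weight each feature dimension
                token_weights = torch.softmax(self.token_scorer(x), dim=1)  # (batch, seq_len, 1)
                return x * token_weights                              # re-weight tokens

        sam = SequentialAttention(dim=768)
        out = sam(torch.randn(2, 16, 768))   # plugged on top of pre-trained token embeddings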
    When differential privacy meets NLP: The devil is in the detail. (arXiv:2109.03175v1 [cs.CL])
    (2 min) Differential privacy provides a formal approach to the privacy of individuals. Applications of differential privacy in various scenarios, such as protecting users' original utterances, must satisfy certain mathematical properties. Our contribution is a formal analysis of ADePT, a differentially private auto-encoder for text rewriting (Krishna et al., 2021). ADePT achieves promising results on downstream tasks while providing tight privacy guarantees. Our proof reveals that ADePT is not differentially private, thus rendering the experimental results unsubstantiated. We also quantify the impact of the error in its private mechanism, showing that the true sensitivity is higher by at least a factor of 6 in an optimistic case of a very small encoder dimension, and that the proportion of utterances that are not privatized could easily reach 100% of the entire dataset. Our intention is neither to criticize the authors nor the peer-reviewing process, but rather to point out that if differential privacy applications in NLP rely on formal guarantees, these should be outlined in full and put under detailed scrutiny.
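    To make the sensitivity point concrete, here is a sketch of the standard Laplace mechanism, where the noise scale must be calibrated to the true sensitivity of the released quantity. This illustrates the general calibration principle only, not ADePT's specific mechanism; the numbers are illustrative.

        import numpy as np

        def laplace_mechanism(value, sensitivity, epsilon, rng=np.random.default_rng()):
            """Standard Laplace mechanism: noise scale = sensitivity / epsilon."""
            return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

        epsilon = 1.0
        assumed_sensitivity = 1.0      # what a mechanism might (wrongly) calibrate to
        true_sensitivity = 6.0         # if the true sensitivity is 6x larger ...
        noisy = laplace_mechanism(42.0, assumed_sensitivity, epsilon)
        # ... the release only satisfies (true_sensitivity / assumed_sensitivity) * epsilon = 6-DP,
        # a much weaker guarantee than the claimed 1-DP.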
    FH-SWF SG at GermEval 2021: Using Transformer-Based Language Models to Identify Toxic, Engaging, & Fact-Claiming Comments. (arXiv:2109.02966v1 [cs.CL])
    (2 min) In this paper we describe the methods we used for our submissions to the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments. For all three subtasks we fine-tuned freely available transformer-based models from the Huggingface model hub. We evaluated the performance of various pre-trained models after fine-tuning on 80% of the training data with different hyperparameters and submitted the predictions of the two best-performing models. We found that this approach worked best for subtask 3, for which we achieved an F1-score of 0.736.
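    A minimal sketch of fine-tuning a Hub checkpoint for binary comment classification with the transformers Trainer API, which is the general workflow described above. The checkpoint name, toy data, and hyperparameters are placeholders, not the team's actual configuration.

        import torch
        from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                                  Trainer, TrainingArguments)

        class CommentDataset(torch.utils.data.Dataset):
            def __init__(self, texts, labels, tokenizer):
                self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
                self.labels = labels
            def __len__(self):
                return len(self.labels)
            def __getitem__(self, i):
                item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
                item["labels"] = torch.tensor(self.labels[i])
                return item

        model_name = "bert-base-german-cased"   # placeholder checkpoint
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

        train_ds = CommentDataset(["ein Kommentar ...", "noch einer ..."], [0, 1], tokenizer)
        args = TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=16)
        Trainer(model=model, args=args, train_dataset=train_ds).train()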
    Rare Words Degenerate All Words. (arXiv:2109.03127v1 [cs.CL])
    (2 min) Despite advances in neural network language models, the representation degeneration problem of embeddings remains challenging. Recent studies have found that the learned output embeddings degenerate into a narrow-cone distribution, which makes the similarity between any pair of embeddings positive. Prior work analyzed the cause of the degeneration problem and regarded it as common to most embeddings. However, we find that the degeneration problem originates especially from the training of the embeddings of rare words. In this study, we analyze the intrinsic mechanism of the degeneration of rare word embeddings with respect to their gradients of the negative log-likelihood loss function. Furthermore, we theoretically and empirically demonstrate that the degeneration of rare word embeddings causes the degeneration of non-rare word embeddings, and that the overall degeneration problem can be alleviated by preventing the degeneration of rare word embeddings. Based on our analyses, we propose a novel method, Adaptive Gradient Partial Scaling (AGPS), to address the degeneration problem. Experimental results demonstrate the effectiveness of the proposed method both qualitatively and quantitatively.
    POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling. (arXiv:2109.03039v1 [cs.IR])
    (2 min) Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm where users are allowed, via natural language dialogues, to communicate with search systems. Evaluating such systems is very challenging since search results are presented in the format of natural language sentences. Given the unlimited number of possible responses, collecting relevance assessments for all the possible responses is infeasible. In this paper, we propose POSSCORE, a simple yet effective automatic evaluation method for conversational search. The proposed embedding-based metric takes the influence of the part of speech (POS) of the terms in the response into account. To the best of our knowledge, our work is the first to systematically demonstrate the importance of incorporating syntactic information, such as POS labels, for conversational search evaluation. Experimental results demonstrate that our metrics can correlate with human preference, achieving significant improvements over state-of-the-art baseline metrics.
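    A generic sketch of the idea of weighting embedding matches by POS category: each response token's best embedding match against the reference is scaled by a weight that depends on its POS tag. The weights, greedy matching scheme, and normalization below are assumptions for illustration, not POSSCORE's exact formulation.

        import numpy as np

        POS_WEIGHTS = {"NOUN": 1.0, "VERB": 1.0, "ADJ": 0.8, "ADV": 0.6}   # illustrative weights

        def pos_weighted_similarity(response, reference, emb):
            """Greedy embedding matching with each response token scaled by a POS weight.

            `response` and `reference` are lists of (token, pos) pairs; `emb` maps tokens to unit vectors.
            """
            scores, weights = [], []
            for tok, pos in response:
                best = max(float(np.dot(emb[tok], emb[r])) for r, _ in reference)
                w = POS_WEIGHTS.get(pos, 0.3)
                scores.append(w * best)
                weights.append(w)
            return sum(scores) / (sum(weights) + 1e-12)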
    Generate & Rank: A Multi-task Framework for Math Word Problems. (arXiv:2109.03034v1 [cs.CL])
    (2 min) Math word problem (MWP) solving is a challenging and critical task in natural language processing. Many recent studies formalize MWP as a generation task and have adopted sequence-to-sequence models to transform problem descriptions into mathematical expressions. However, mathematical expressions are prone to minor mistakes, while the generation objective does not explicitly handle such mistakes. To address this limitation, we devise a new ranking task for MWP and propose Generate & Rank, a multi-task framework based on a generative pre-trained language model. By joint training with generation and ranking, the model learns from its own mistakes and is able to distinguish between correct and incorrect expressions. Meanwhile, we perform tree-based disturbance specially designed for MWP and an online update to boost the ranker. We demonstrate the effectiveness of our proposed method on the benchmarks, and the results show that our method consistently outperforms baselines on all datasets. In particular, on the classical Math23k, our method is 7% (78.4% $\rightarrow$ 85.4%) higher than the state-of-the-art.
    Query-Variant Advertisement Text Generation with Association Knowledge. (arXiv:2004.06438v2 [cs.CL] UPDATED)
    (2 min) Advertising is an important revenue source for many companies. However, it is expensive to manually create advertisements that meet the needs of various queries for massive items. In this paper, we propose the query-variant advertisement text generation task, which aims to generate candidate advertisements for different queries with various needs given the item keywords. In this task, for many different queries there is only one general-purpose advertisement and no predefined query-advertisement pair, which discourages traditional end-to-end models from generating query-variant advertisements for different queries with different needs. To deal with this problem, we propose a query-variant advertisement text generation model that takes keywords and associated external knowledge as input during training and adds different queries during inference. Adding external knowledge helps the model adapt to information beyond the item keywords during training, which makes the transition between training and inference smoother when the query is added during inference. Both automatic and human evaluation show that our model can generate more attractive and query-focused advertisements than the strong baselines.
    Patient Outcome and Zero-shot Diagnosis Prediction with Hypernetwork-guided Multitask Learning. (arXiv:2109.03062v1 [cs.CL])
    (2 min) Multitask deep learning has been applied to patient outcome prediction from text, taking clinical notes as input and training deep neural networks with a joint loss function over multiple tasks. However, the joint training scheme of multitask learning suffers from inter-task interference, and diagnosis prediction among the multiple tasks has a generalizability issue due to rare diseases or unseen diagnoses. To solve these challenges, we propose a hypernetwork-based approach that generates task-conditioned parameters and coefficients of multitask prediction heads to learn task-specific prediction and to balance the multitask learning. We also incorporate semantic task information to improve the generalizability of our task-conditioned multitask model. Experiments on early and discharge notes extracted from the real-world MIMIC database show that our method can achieve better performance on multitask patient outcome prediction than strong baselines in most cases. Besides, our method can effectively handle the scenario with limited information and improve zero-shot prediction on unseen diagnosis categories.
    Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues using BERT. (arXiv:2109.02938v1 [cs.CL])
    (2 min) This paper presents an automatic method to evaluate the naturalness of natural language generation in dialogue systems. While this task was previously carried out through expensive and time-consuming human labor, we present the novel task of automatic naturalness evaluation of generated language. By fine-tuning the BERT model, our proposed naturalness evaluation method shows robust results and outperforms the baselines: support vector machines, bi-directional LSTMs, and BLEURT. In addition, the training speed and evaluation performance of the naturalness model are improved by transfer learning from quality and informativeness linguistic knowledge.
    Fully Synthetic Data Improves Neural Machine Translation with Knowledge Distillation. (arXiv:2012.15455v2 [cs.CL] UPDATED)
    (2 min) This paper explores augmenting monolingual data for knowledge distillation in neural machine translation. Source language monolingual text can be incorporated as a forward translation. Interestingly, we find the best way to incorporate target language monolingual text is to translate it to the source language and round-trip translate it back to the target language, resulting in a fully synthetic corpus. We find that combining monolingual data from both source and target languages yields better performance than a corpus twice as large only in one language. Moreover, experiments reveal that the improvement depends upon the provenance of the test set. If the test set was originally in the source language (with the target side written by translators), then forward translating source monolingual data matters. If the test set was originally in the target language (with the source written by translators), then incorporating target monolingual data matters.
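    A sketch of the data-construction recipe described above: forward-translate source monolingual text, and round-trip target monolingual text so both sides of those pairs are machine-generated. The `translate(sentences, direction)` hook is a hypothetical stand-in for any trained NMT system.

        def build_synthetic_corpus(src_mono, tgt_mono, translate):
            """Combine forward-translated source monolingual data with round-tripped target data."""
            pairs = []
            # Source monolingual text: forward translate to obtain synthetic targets.
            for s, t in zip(src_mono, translate(src_mono, "src->tgt")):
                pairs.append((s, t))
            # Target monolingual text: translate to source, then round-trip back to target,
            # so both sides of the pair are machine-generated (fully synthetic).
            back = translate(tgt_mono, "tgt->src")
            round_trip = translate(back, "src->tgt")
            for s, t in zip(back, round_trip):
                pairs.append((s, t))
            return pairs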
    Exploiting Reasoning Chains for Multi-hop Science Question Answering. (arXiv:2109.02905v1 [cs.CL])
    (2 min) We propose a novel Chain Guided Retriever-reader ({\tt CGR}) framework to model the reasoning chain for multi-hop Science Question Answering. Our framework is capable of performing explainable reasoning without the need of any corpus-specific annotations, such as the ground-truth reasoning chain, or human-annotated entity mentions. Specifically, we first generate reasoning chains from a semantic graph constructed by Abstract Meaning Representation of retrieved evidence facts. A \textit{Chain-aware loss}, concerning both local and global chain information, is also designed to enable the generated chains to serve as distant supervision signals for training the retriever, where reinforcement learning is also adopted to maximize the utility of the reasoning chains. Our framework allows the retriever to capture step-by-step clues of the entire reasoning process, which is not only shown to be effective on two challenging multi-hop Science QA tasks, namely OpenBookQA and ARC-Challenge, but also favors explainability.
    IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages. (arXiv:2109.02903v1 [cs.CL])
    (2 min) In this paper we present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. Different from existing pre-trained models, IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT for 12 language pairs and extreme summarization for 7 languages using multilingual fine-tuning show that IndicBART is competitive with or better than mBART50 despite containing significantly fewer parameters. Our analyses focus on identifying the impact of script unification (to Devanagari), corpora size as well as multilingualism on the final performance. The IndicBART model is available under the MIT license at https://indicnlp.ai4bharat.org/indic-bart .
    Countering Online Hate Speech: An NLP Perspective. (arXiv:2109.02941v1 [cs.CL])
    (2 min) Online hate speech has caught everyone's attention from the news related to the COVID-19 pandemic, US elections, and worldwide protests. Online toxicity - an umbrella term for online hateful behavior - manifests itself in forms such as online hate speech. Hate speech is a deliberate attack directed towards an individual or a group motivated by the targeted entity's identity or opinions. The rising mass communication through social media further exacerbates the harmful consequences of online hate speech. While there has been significant research on hate-speech identification using Natural Language Processing (NLP), work on utilizing NLP for the prevention of and intervention against online hate speech remains relatively scarce. This paper presents a holistic conceptual framework on hate-speech NLP countering methods along with a thorough survey of the current progress of NLP for countering online hate speech. It classifies the countering techniques based on their time of action, and identifies potential future research areas on this topic.
    Empathetic Dialogue Generation with Pre-trained RoBERTa-GPT2 and External Knowledge. (arXiv:2109.03004v1 [cs.CL])
    (2 min) One challenge for dialogue agents is to recognize the feelings of the conversation partner and respond accordingly. In this work, RoBERTa-GPT2 is proposed for empathetic dialogue generation, where the pre-trained auto-encoding RoBERTa is utilised as the encoder and the pre-trained auto-regressive GPT-2 as the decoder. With the combination of the pre-trained RoBERTa and GPT-2, our model achieves a new state-of-the-art emotion accuracy. To enable the empathetic ability of the RoBERTa-GPT2 model, we propose a commonsense knowledge and emotional concepts extractor, in which the commonsense and emotional concepts of the dialogue context are extracted for the GPT-2 decoder. The experimental results demonstrate that empathetic dialogue generation benefits from both the pre-trained encoder-decoder architecture and external knowledge.
    Don't Go Far Off: An Empirical Study on Neural Poetry Translation. (arXiv:2109.02972v1 [cs.CL])
    (2 min) Despite constant improvements in machine translation quality, automatic poetry translation remains a challenging problem due to the lack of open-sourced parallel poetic corpora, and to the intrinsic complexities involved in preserving the semantics, style, and figurative nature of poetry. We present an empirical investigation for poetry translation along several dimensions: 1) size and style of training data (poetic vs. non-poetic), including a zero-shot setup; 2) bilingual vs. multilingual learning; and 3) language-family-specific models vs. mixed-multilingual models. To accomplish this, we contribute a parallel dataset of poetry translations for several language pairs. Our results show that multilingual fine-tuning on poetic text significantly outperforms multilingual fine-tuning on non-poetic text that is 35X larger in size, both in terms of automatic metrics (BLEU, BERTScore) and human evaluation metrics such as faithfulness (meaning and poetic style). Moreover, multilingual fine-tuning on poetic data outperforms \emph{bilingual} fine-tuning on poetic data.
    Learning grounded word meaning representations on similarity graphs. (arXiv:2109.03084v1 [cs.CL])
    (2 min) This paper introduces a novel approach to learn visually grounded meaning representations of words as low-dimensional node embeddings on an underlying graph hierarchy. The lower level of the hierarchy models modality-specific word representations through dedicated but communicating graphs, while the higher level puts these representations together on a single graph to learn a representation jointly from both modalities. The topology of each graph models similarity relations among words, and is estimated jointly with the graph embedding. The assumption underlying this model is that words sharing similar meaning correspond to communities in an underlying similarity graph in a low-dimensional space. We name this model Hierarchical Multi-Modal Similarity Graph Embedding (HM-SGE). Experimental results validate the ability of HM-SGE to simulate human similarity judgements and concept categorization, outperforming the state of the art.
    Integrating Regular Expressions with Neural Networks via DFA. (arXiv:2109.02882v1 [cs.CL])
    (2 min) Human-designed rules are widely used to build industry applications. However, it is infeasible to maintain thousands of such hand-crafted rules. It is therefore very important to integrate rule knowledge into neural networks to build a hybrid model that achieves better performance. Specifically, the human-designed rules are formulated as Regular Expressions (REs), from which the equivalent Minimal Deterministic Finite Automatons (MDFAs) are constructed. We propose to use the MDFA as an intermediate model to capture the matched RE patterns as rule-based features for each input sentence and introduce these additional features into neural networks. We evaluate the proposed method on the ATIS intent classification task. The experimental results show that the proposed method achieves the best performance compared to neural networks and four other methods that combine REs and neural networks when the training dataset is relatively small.
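    A simplified sketch of the general rule-feature idea: each regular expression contributes a binary match feature that is concatenated with the neural sentence representation before classification. This uses Python's `re` directly rather than an explicit MDFA, and the example rules are illustrative, so it only approximates the approach described above.

        import re
        import torch

        RULES = [r"\bbook (a|one|two) flights?\b", r"\bfrom \w+ to \w+\b"]   # illustrative REs

        def rule_features(sentence):
            """Binary feature per rule: 1.0 if the regular expression matches the sentence."""
            return torch.tensor([1.0 if re.search(p, sentence) else 0.0 for p in RULES])

        def combine(sentence_embedding, sentence):
            # Concatenate rule-based features with the neural sentence representation
            # before the final classification layer.
            return torch.cat([sentence_embedding, rule_features(sentence)], dim=-1)

        feats = combine(torch.randn(128), "i want to book a flight from boston to denver")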
    Paraphrase Generation as Unsupervised Machine Translation. (arXiv:2109.02950v1 [cs.CL])
    (2 min) In this paper, we propose a new paradigm for paraphrase generation by treating the task as unsupervised machine translation (UMT), based on the assumption that there must be pairs of sentences expressing the same meaning in a large-scale unlabeled monolingual corpus. The proposed paradigm first splits a large unlabeled corpus into multiple clusters and trains multiple UMT models using pairs of these clusters. Then, based on the paraphrase pairs produced by these UMT models, a unified surrogate model can be trained to serve as the final Seq2Seq model to generate paraphrases, which can be directly used for testing in the unsupervised setup or be finetuned on labeled datasets in the supervised setup. The proposed method offers merits over machine-translation-based paraphrase generation methods, as it avoids reliance on bilingual sentence pairs. It also allows humans to intervene in the model so that more diverse paraphrases can be generated using different filtering criteria. Extensive experiments on existing paraphrase datasets for both the supervised and unsupervised setups demonstrate the effectiveness of the proposed paradigm.
    Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica. (arXiv:2109.02738v1 [cs.CL])
    (2 min) People convey their intention and attitude through the linguistic styles of the text that they write. In this study, we investigate lexicon usage across styles through two lenses: human perception and machine word importance, since words differ in the strength of the stylistic cues that they provide. To collect labels of human perception, we curate a new dataset, Hummingbird, on top of benchmarking style datasets. We have crowd workers highlight the representative words in the text that make them think the text has the following styles: politeness, sentiment, offensiveness, and five emotion types. We then compare these human word labels with word importance derived from a popular fine-tuned style classifier like BERT. Our results show that BERT often finds content words that are not relevant to the target style to be important for style prediction, whereas humans do not perceive them the same way, even though for some styles (e.g., positive sentiment and joy) human- and machine-identified words share significant overlap.
    Datasets: A Community Library for Natural Language Processing. (arXiv:2109.02846v1 [cs.CL])
    (2 min) The scale, variety, and quantity of publicly-available NLP datasets have grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.
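    A quick usage sketch of the unified interface the library provides: any hosted dataset is loaded, indexed, and transformed the same way. The dataset name below is just an example corpus.

        from datasets import load_dataset

        # One call loads a split with a uniform interface, whether the dataset is tiny or huge.
        dataset = load_dataset("imdb", split="train")
        print(dataset[0]["text"][:80], dataset[0]["label"])

        # Transformations are applied lazily over memory-mapped tables.
        dataset = dataset.map(lambda ex: {"n_words": len(ex["text"].split())})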
    Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval. (arXiv:2109.02789v1 [cs.IR])
    (2 min) Pretrained contextualized representations offer great success for many downstream tasks, including document ranking. The multilingual versions of such pretrained representations provide the possibility of jointly learning many languages with the same model. Although large gains are expected from such joint training, in the case of cross-lingual information retrieval (CLIR), models under a multilingual setting do not achieve the same level of performance as those under a monolingual setting. We hypothesize that the performance drop is due to the translation gap between query and documents. In the monolingual retrieval task, because of the same lexical inputs, it is easier for the model to identify the query terms that occur in documents. However, in multilingual pretrained models, where words in different languages are projected into the same hyperspace, the model tends to translate query terms into related terms, i.e., terms that appear in a similar context, in addition to or sometimes rather than synonyms in the target language. This property makes it difficult for the model to connect terms that co-occur in both query and document. To address this issue, we propose a novel Mixed Attention Transformer (MAT) that incorporates external word-level knowledge, such as a dictionary or translation table. We design a sandwich-like architecture to embed MAT into recent transformer-based deep neural models. By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence. Experimental results demonstrate the effectiveness of the external knowledge and the significant improvement of the MAT-embedded neural reranking model on the CLIR task.
    Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models. (arXiv:2109.02837v1 [cs.CL])
    (2 min) Commonsense reasoning benchmarks have been largely solved by fine-tuning language models. The downside is that fine-tuning may cause models to overfit to task-specific data and thereby forget their knowledge gained during pre-training. Recent works only propose lightweight model updates as models may already possess useful knowledge from past experience, but a challenge remains in understanding what parts and to what extent models should be refined for a given task. In this paper, we investigate what models learn from commonsense reasoning datasets. We measure the impact of three different adaptation methods on the generalization and accuracy of models. Our experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers. We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
    A Scalable AI Approach for Clinical Trial Cohort Optimization. (arXiv:2109.02808v1 [cs.CL])
    (2 min) FDA has been promoting enrollment practices that could enhance the diversity of clinical trial populations, through broadening eligibility criteria. However, how to broaden eligibility remains a significant challenge. We propose an AI approach to Cohort Optimization (AICO) through transformer-based natural language processing of the eligibility criteria and evaluation of the criteria using real-world data. The method can extract common eligibility criteria variables from a large set of relevant trials and measure the generalizability of trial designs to real-world patients. It overcomes the scalability limits of existing manual methods and enables rapid simulation of eligibility criteria design for a disease of interest. A case study on breast cancer trial design demonstrates the utility of the method in improving trial generalizability.
    Puzzle Solving without Search or Human Knowledge: An Unnatural Language Approach. (arXiv:2109.02797v1 [cs.LG])
    (2 min) The application of Generative Pre-trained Transformer (GPT-2) to learn text-archived game notation provides a model environment for exploring sparse reward gameplay. The transformer architecture proves amenable to training on solved text archives describing mazes, Rubik's Cube, and Sudoku solvers. The method benefits from fine-tuning the transformer architecture to visualize plausible strategies derived outside any guidance from human heuristics or domain expertise. The large search space ($>10^{19}$) for the games provides a puzzle environment in which the solution has few intermediate rewards and a final move that solves the challenge.
    WhyAct: Identifying Action Reasons in Lifestyle Vlogs. (arXiv:2109.02747v1 [cs.CV])
    (2 min) We aim to automatically identify human action reasons in online videos. We focus on the widespread genre of lifestyle vlogs, in which people perform actions while verbally describing them. We introduce and make publicly available the {\sc WhyAct} dataset, consisting of 1,077 visual actions manually annotated with their reasons. We describe a multimodal model that leverages visual and textual information to automatically infer the reasons corresponding to an action presented in the video.
    Detecting Inspiring Content on Social Media. (arXiv:2109.02734v1 [cs.CL])
    (2 min) Inspiration moves a person to see new possibilities and transforms the way they perceive their own potential. Inspiration has received little attention in psychology, and has not been researched before in the NLP community. To the best of our knowledge, this work is the first to study inspiration through machine learning methods. We aim to automatically detect inspiring content from social media data. To this end, we analyze social media posts to tease out what makes a post inspiring and what topics are inspiring. We release a dataset with the unique ids of 5,800 inspiring and 5,800 non-inspiring English-language public posts collected from a dump of Reddit public posts made available by a third party, and use linguistic heuristics to automatically detect which social media English-language posts are inspiring.
    End-to-end Neural Information Status Classification. (arXiv:2109.02753v1 [cs.CL])
    (2 min) Most previous studies on information status (IS) classification and bridging anaphora recognition assume that the gold mention or syntactic tree information is given (Hou et al., 2013; Roesiger et al., 2018; Hou, 2020; Yu and Poesio, 2020). In this paper, we propose an end-to-end neural approach for information status classification. Our approach consists of a mention extraction component and an information status assignment component. During the inference time, our system takes a raw text as the input and generates mentions together with their information status. On the ISNotes corpus (Markert et al., 2012), we show that our information status assignment component achieves new state-of-the-art results on fine-grained IS classification based on gold mentions. Furthermore, our system performs significantly better than other baselines for both mention extraction and fine-grained IS classification in the end-to-end setting. Finally, we apply our system on BASHI (Roesiger, 2018) and SciCorp (Roesiger, 2016) to recognize referential bridging anaphora. We find that our end-to-end system trained on ISNotes achieves competitive results on bridging anaphora recognition compared to the previous state-of-the-art system that relies on syntactic information and is trained on the in-domain datasets (Yu and Poesio, 2020).
    SS-BERT: Mitigating Identity Terms Bias in Toxic Comment Classification by Utilising the Notion of "Subjectivity" and "Identity Terms". (arXiv:2109.02691v1 [cs.CL])
    (2 min) Toxic comment classification models are often found biased toward identity terms which are terms characterizing a specific group of people such as "Muslim" and "black". Such bias is commonly reflected in false-positive predictions, i.e. non-toxic comments with identity terms. In this work, we propose a novel approach to tackle such bias in toxic comment classification, leveraging the notion of subjectivity level of a comment and the presence of identity terms. We hypothesize that when a comment is made about a group of people that is characterized by an identity term, the likelihood of that comment being toxic is associated with the subjectivity level of the comment, i.e. the extent to which the comment conveys personal feelings and opinions. Building upon the BERT model, we propose a new structure that is able to leverage these features, and thoroughly evaluate our model on 4 datasets of varying sizes and representing different social media platforms. The results show that our model can consistently outperform BERT and a SOTA model devised to address identity term bias in a different way, with a maximum improvement in F1 of 2.43% and 1.91% respectively.
    Text-to-Table: A New Way of Information Extraction. (arXiv:2109.02707v1 [cs.CL])
    (2 min) We study a new problem setting of information extraction (IE), referred to as text-to-table, which can be viewed as an inverse problem of the well-studied table-to-text. In text-to-table, given a text, one creates a table or several tables expressing the main content of the text, while the model is learned from text-table pair data. The problem setting differs from those of the existing methods for IE. First, the extraction can be carried out from long texts to large tables with complex structures. Second, the extraction is entirely data-driven, and there is no need to explicitly define the schemas. As far as we know, there has been no previous work that studies the problem. In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) problem. We first employ a seq2seq model fine-tuned from a pre-trained language model to perform the task. We also develop a new method within the seq2seq approach, exploiting two additional techniques in table generation: table constraint and table relation embeddings. We make use of four existing table-to-text datasets in our experiments on text-to-table. Experimental results show that the vanilla seq2seq model can outperform the baseline methods of using relation extraction and named entity extraction. The results also show that our method can further boost the performances of the vanilla seq2seq model. We further discuss the main challenges of the proposed task. The code and data will be made publicly available.
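    Casting text-to-table as sequence-to-sequence requires the target table to be serialized into a flat token sequence and de-serialized after decoding. Below is a minimal sketch of one such linearization scheme; the separator tokens are illustrative assumptions, not necessarily the scheme used in the paper.

        ROW_SEP, CELL_SEP = " <row> ", " <cell> "

        def table_to_sequence(table):
            """Serialize a list-of-rows table into a flat target string for a seq2seq model."""
            return ROW_SEP.join(CELL_SEP.join(cell for cell in row) for row in table)

        def sequence_to_table(seq):
            """Invert the serialization when decoding model output back into a table."""
            return [[cell.strip() for cell in row.split("<cell>")]
                    for row in seq.split("<row>")]

        seq = table_to_sequence([["Team", "Points"], ["Spurs", "104"]])
        assert sequence_to_table(seq) == [["Team", "Points"], ["Spurs", "104"]]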
  • cs.CV updates on arXiv.org

    Intelligent Motion Planning for a Cost-effective Object Follower Mobile Robotic System with Obstacle Avoidance. (arXiv:2109.02700v1 [cs.RO])
    (0 min) A few industries use manually controlled robots for carrying material, but these cannot be used all the time in all places. It is therefore convenient to have robots that can follow a specific human by tracking a uniquely coloured object held by that person. We propose a robotic system that uses robot vision and deep learning to obtain the required linear and angular velocities, {\nu} and {\omega}, respectively, which in turn allow the robot to avoid obstacles while following the uniquely coloured object held by the human. The proposed methodology accurately detects the position of the uniquely coloured object under any kind of lighting, tells us the horizontal pixel value where the robot is present, and indicates whether the object is close to or far from the robot. Moreover, the artificial neural networks used in this problem yielded a small error in linear and angular velocity prediction, and the PI controller used to control the linear and angular velocities, and thereby the position of the robot, gave impressive results; this methodology outperforms the other methodologies considered.
    On the Out-of-distribution Generalization of Probabilistic Image Modelling. (arXiv:2109.02639v1 [cs.CV])
    (0 min) Out-of-distribution (OOD) detection and lossless compression constitute two problems that can be solved by the training of probabilistic models on a first dataset with subsequent likelihood evaluation on a second dataset, where data distributions differ. By defining the generalization of probabilistic models in terms of likelihood we show that, in the case of image models, the OOD generalization ability is dominated by local features. This motivates our proposal of a Local Autoregressive model that exclusively models local image features towards improving OOD performance. We apply the proposed model to OOD detection tasks and achieve state-of-the-art unsupervised OOD detection performance without the introduction of additional data. Additionally, we employ our model to build a new lossless image compressor: NeLLoC (Neural Local Lossless Compressor) and report state-of-the-art compression rates and model size.
    Learning to Infer Shape Programs Using Self Training. (arXiv:2011.13045v2 [cs.CV] UPDATED)
    (0 min) Inferring programs which generate 2D and 3D shapes is important for reverse engineering, editing, and more. Training such inference models is challenging due to the lack of paired (shape, program) data in most domains. A popular approach is to pre-train a model on synthetic data and then fine-tune on real shapes using slow, unstable reinforcement learning. In this paper, we argue that self-training is a viable alternative for fine-tuning such models. Self-training is a semi-supervised learning paradigm where a model assigns pseudo-labels to unlabeled data, and then retrains with (data, pseudo-label) pairs as the new ground truth. We show that for constructive solid geometry and assembly-based modeling, self-training outperforms state-of-the-art reinforcement learning approaches. Additionally, shape program inference has a unique property that circumvents a potential downside of self-training (incorrect pseudo-label assignment): inferred programs are executable. For a given shape from our distribution of interest $\mathbf{x}^*$ and its predicted program $\mathbf{z}$, one can execute $\mathbf{z}$ to obtain a shape $\mathbf{x}$ and train on $(\mathbf{z}, \mathbf{x})$ pairs, rather than $(\mathbf{z}, \mathbf{x}^*)$ pairs. We term this procedure latent execution self training (LEST). We demonstrate that self training infers shape programs with higher shape reconstruction accuracy and converges significantly faster than reinforcement learning approaches, and in some domains, LEST can further improve this performance.
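    A sketch of the latent execution self-training loop described above: the inferred program is executed, and training uses the shape the program actually produces rather than the original target shape. `model.infer`, `execute`, and `train` are hypothetical hooks standing in for the real components.

        def latent_execution_self_training(model, unlabeled_shapes, execute, train, rounds=3):
            """Self-training where pseudo-labels are repaired by executing the inferred program."""
            for _ in range(rounds):
                pairs = []
                for target_shape in unlabeled_shapes:
                    program = model.infer(target_shape)   # pseudo-label for the target shape
                    rendered = execute(program)           # the shape this program actually produces
                    pairs.append((rendered, program))     # train on (x, z) instead of (x*, z)
                train(model, pairs)
            return model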
    Grassmannian Graph-attentional Landmark Selection for Domain Adaptation. (arXiv:2109.02990v1 [cs.CV])
    (0 min) Domain adaptation aims to leverage information from the source domain to improve the classification performance in the target domain. It mainly utilizes two schemes: sample reweighting and feature matching. While the first scheme allocates different weights to individual samples, the second scheme matches the features of the two domains using global structural statistics. The two schemes are complementary to each other and are expected to jointly work for robust domain adaptation. Several methods combine the two schemes, but the underlying relationship of samples is insufficiently analyzed due to the neglect of the hierarchy of samples and the geometric properties between samples. To better combine the advantages of the two schemes, we propose a Grassmannian graph-attentional landmark selection (GGLS) framework for domain adaptation. GGLS presents a landmark selection scheme using attention-induced neighbors of the graphical structure of samples, and performs distribution adaptation and knowledge adaptation over the Grassmann manifold. The former treats the landmarks of each sample differently, and the latter avoids feature distortion and achieves better geometric properties. Experimental results on different real-world cross-domain visual recognition tasks demonstrate that GGLS provides better classification accuracies compared with state-of-the-art domain adaptation methods.
    MultiScene: A Large-scale Dataset and Benchmark for Multi-scene Recognition in Single Aerial Images. (arXiv:2104.02846v3 [cs.CV] UPDATED)
    (2 min) Aerial scene recognition is a fundamental research problem in interpreting high-resolution aerial imagery. Over the past few years, most studies focus on classifying an image into one scene category, while in real-world scenarios, it is more often that a single image contains multiple scenes. Therefore, in this paper, we investigate a more practical yet underexplored task -- multi-scene recognition in single images. To this end, we create a large-scale dataset, called MultiScene, composed of 100,000 unconstrained high-resolution aerial images. Considering that manually labeling such images is extremely arduous, we resort to low-cost annotations from crowdsourcing platforms, e.g., OpenStreetMap (OSM). However, OSM data might suffer from incompleteness and incorrectness, which introduce noise into image labels. To address this issue, we visually inspect 14,000 images and correct their scene labels, yielding a subset of cleanly-annotated images, named MultiScene-Clean. With it, we can develop and evaluate deep networks for multi-scene recognition using clean data. Moreover, we provide crowdsourced annotations of all images for the purpose of studying network learning with noisy labels. We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images and learning from noisy labels for this task, respectively. To facilitate progress, we make our dataset and trained models available on https://gitlab.lrz.de/ai4eo/reasoning/multiscene.
    Towards Robust Object Detection: Bayesian RetinaNet for Homoscedastic Aleatoric Uncertainty Modeling. (arXiv:2108.00784v2 [cs.CV] UPDATED)
    (2 min) According to recent studies, commonly used computer vision datasets contain about 4% label errors. For example, the COCO dataset is known for its high level of noise in data labels, which limits its use for training robust deep neural architectures in real-world scenarios. To model such noise, in this paper we propose homoscedastic aleatoric uncertainty estimation and present a series of novel loss functions to address the problem of image object detection at scale. Specifically, the proposed functions are based on Bayesian inference, and we have incorporated them into the common community-adopted object detection deep learning architecture RetinaNet. We also show that modeling homoscedastic aleatoric uncertainty using our novel functions allows us to increase model interpretability and to improve object detection performance when evaluated on the COCO dataset.
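    For orientation, here is the standard learned-log-variance weighting usually used for homoscedastic aleatoric uncertainty: the task loss is scaled by exp(-s) and regularized by s, with s learned. This is a generic sketch of that common formulation, not necessarily the paper's exact RetinaNet loss functions.

        import torch
        import torch.nn as nn

        class HomoscedasticLoss(nn.Module):
            """Weights a task loss by a learned log-variance: loss * exp(-s) + s."""
            def __init__(self):
                super().__init__()
                self.log_var = nn.Parameter(torch.zeros(()))   # s, learned jointly with the model

            def forward(self, task_loss):
                return torch.exp(-self.log_var) * task_loss + self.log_var

        wrap = HomoscedasticLoss()
        loss = wrap(torch.tensor(2.3, requires_grad=True))   # wrap any detection loss term
        loss.backward()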
    DPR-CAE: Capsule Autoencoder with Dynamic Part Representation for Image Parsing. (arXiv:2104.14735v2 [cs.CV] UPDATED)
    (2 min) Parsing an image into a hierarchy of objects, parts, and relations is important and also challenging in many computer vision tasks. This paper proposes a simple and effective capsule autoencoder to address this issue, called DPR-CAE. In our approach, the encoder parses the input into a set of part capsules, including pose, intensity, and dynamic vector. The decoder introduces a novel dynamic part representation (DPR) by combining the dynamic vector and a shared template bank. These part representations are then regulated by corresponding capsules to composite the final output in an interpretable way. Besides, an extra translation-invariant module is proposed to avoid directly learning the uncertain scene-part relationship in our DPR-CAE, which makes the resulting method achieve a promising performance gain on rm-MNIST and rm-Fashion-MNIST. DPR-CAE can be easily combined with the existing stacked capsule autoencoder, and experimental results show it significantly improves performance in terms of unsupervised object classification. Our code is available in the Appendix.
    Benchmarking Robustness of Deep Learning Classifiers Using Two-Factor Perturbation. (arXiv:2103.03102v3 [cs.LG] UPDATED)
    (2 min) The accuracy of DL classifiers is unstable in that it often changes significantly when retested on adversarial images, imperfect images, or perturbed images. This paper adds to the small but fundamental body of work on benchmarking the robustness of DL classifiers on defective images. Unlike existing single-factor digital perturbation work, we provide state-of-the-art two-factor perturbation that applies two natural perturbations to images in different sequences. The two-factor perturbations include (1) two digital perturbations (salt & pepper noise and Gaussian noise) applied in both sequences, and (2) one digital perturbation (salt & pepper noise) and a geometric perturbation (rotation) applied in different sequences. To measure robust DL classifiers, previous work provided 15 types of single-factor corruption. We created 69 benchmarking image sets, including a clean set, sets with single-factor perturbations, and sets with two-factor perturbation conditions. To the best of our knowledge, this is the first report that two-factor perturbed images improve both the robustness and accuracy of DL classifiers. Previous research evaluating deep learning (DL) classifiers has often used top-1/top-5 accuracy, so researchers have usually offered tables, line diagrams, and bar charts to display the accuracy of DL classifiers. But these existing approaches cannot quantitatively evaluate the robustness of DL classifiers. We introduce a new two-dimensional, statistical visualization tool, including mean accuracy and coefficient of variation (CV), to benchmark the robustness of DL classifiers. All source codes and related image sets are shared on websites (this http URL or https://github.com/daiweiworking/RobustDeepLearningUsingPerturbations ) to support future academic research and industry projects.
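    A small numpy sketch of the first two-factor condition described above: salt & pepper noise and Gaussian noise applied in both orders. The noise amounts and image here are illustrative placeholders, not the benchmark's actual settings.

        import numpy as np

        def salt_and_pepper(img, amount=0.02, rng=np.random.default_rng(0)):
            out = img.copy()
            mask = rng.random(img.shape)
            out[mask < amount / 2] = 0.0          # pepper
            out[mask > 1 - amount / 2] = 1.0      # salt
            return out

        def gaussian(img, sigma=0.05, rng=np.random.default_rng(0)):
            return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

        img = np.random.rand(32, 32, 3)
        # Two-factor perturbation: the same two corruptions applied in both sequences.
        sp_then_gauss = gaussian(salt_and_pepper(img))
        gauss_then_sp = salt_and_pepper(gaussian(img))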
    Learning Fast Sample Re-weighting Without Reward Data. (arXiv:2109.03216v1 [cs.LG])
    (2 min) Training sample re-weighting is an effective approach for tackling data biases such as imbalanced and corrupted labels. Recent methods develop learning-based algorithms to learn sample re-weighting strategies jointly with model training based on the frameworks of reinforcement learning and meta learning. However, depending on additional unbiased reward data limits their general applicability. Furthermore, existing learning-based sample re-weighting methods require nested optimization of models and weighting parameters, which requires expensive second-order computation. This paper addresses these two problems and presents a novel learning-based fast sample re-weighting (FSR) method that does not require additional reward data. The method is based on two key ideas: learning from history to build proxy reward data, and feature sharing to reduce the optimization cost. Our experiments show the proposed method achieves competitive results compared to the state of the art on label noise robustness and long-tailed recognition, and does so while achieving significantly improved training efficiency. The source code is publicly available at https://github.com/google-research/google-research/tree/master/ieg.
    Multi-Modal RGB-D Scene Recognition Across Domains. (arXiv:2103.14672v2 [cs.CV] UPDATED)
    (2 min) Scene recognition is one of the basic problems in computer vision research with extensive applications in robotics. When available, depth images provide helpful geometric cues that complement the RGB texture information and help to identify discriminative scene image features. Depth sensing technology developed fast in the last years and a great variety of 3D cameras have been introduced, each with different acquisition properties. However, those properties are often neglected when targeting big data collections, so multi-modal images are gathered disregarding their original nature. In this work, we put under the spotlight the existence of a possibly severe domain shift issue within multi-modality scene recognition datasets. As a consequence, a scene classification model trained on one camera may not generalize on data from a different camera, only providing a low recognition performance. Starting from the well-known SUN RGB-D dataset, we designed an experimental testbed to study this problem and we use it to benchmark the performance of existing methods. Finally, we introduce a novel adaptive scene recognition approach that leverages self-supervised translation between modalities. Indeed, learning to go from RGB to depth and vice-versa is an unsupervised procedure that can be trained jointly on data of multiple cameras and may help to bridge the gap among the extracted feature distributions. Our experimental results confirm the effectiveness of the proposed approach.
    Agent-Environment Network for Temporal Action Proposal Generation. (arXiv:2107.08323v2 [cs.CV] UPDATED)
    (2 min) Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding the video context due to the lack of an attention mechanism expressing the concept of an action, the agent who performs the action, or the interaction between the agent and the environment. Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network. Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to tell which humans/agents are acting, and (ii) an environment pathway, operating at a global level to tell how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e. C3D and SlowFast, show that our method consistently outperforms state-of-the-art methods regardless of the employed backbone network.
    Go Wider Instead of Deeper. (arXiv:2107.11817v3 [cs.LG] UPDATED)
    (2 min) More transformer blocks with residual connections have recently achieved impressive results on various tasks. To achieve better performance with fewer trainable parameters, recent methods propose to go shallower by parameter sharing or model compression along the depth. However, weak modeling capacity limits their performance. In contrast, going wider by introducing more trainable matrices and parameters would produce a huge model requiring advanced parallelism for training and inference. In this paper, we propose a parameter-efficient framework that goes wider instead of deeper. Specifically, following existing works, we adapt parameter sharing to compress along the depth. But such deployment would limit the performance. To maximize modeling capacity, we scale along the model width by replacing the feed-forward network (FFN) with a mixture-of-experts (MoE). Across transformer blocks, instead of sharing normalization layers, we propose to use individual layer norms to transform the various semantic representations in a more parameter-efficient way. To evaluate our plug-and-run framework, we design WideNet and conduct comprehensive experiments on popular computer vision and natural language processing benchmarks. On ImageNet-1K, our best model outperforms the Vision Transformer (ViT) by $1.5\%$ with $0.72 \times$ trainable parameters. Using $0.46 \times$ and $0.13 \times$ parameters, our WideNet can still surpass ViT and ViT-MoE by $0.8\%$ and $2.1\%$, respectively. On four natural language processing datasets, WideNet outperforms ALBERT by $1.8\%$ on average and surpasses BERT using factorized embedding parameterization by $0.8\%$ with fewer parameters.
    DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection. (arXiv:2012.04886v2 [cs.CV] UPDATED)
    (2 min) As moving objects always draw more attention from human eyes, temporal motion information is often exploited complementarily with spatial information to detect salient objects in videos. Although efficient tools such as optical flow have been proposed to extract temporal motion information, they often encounter difficulties when used for saliency detection due to the movement of the camera or the partial movement of salient objects. In this paper, we investigate the complementary roles of spatial and temporal information and propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of spatiotemporal information. We construct a symmetric two-bypass network to explicitly extract spatial and temporal features. A dynamic weight generator (DWG) is designed to automatically learn the reliability of the corresponding saliency branch, and a top-down cross attentive aggregation (CAA) procedure is designed to facilitate dynamic complementary aggregation of spatiotemporal features. Finally, the features are modified by spatial attention with the guidance of the coarse saliency map and then go through the decoder for the final saliency map. Experimental results on five benchmarks, VOS, DAVIS, FBMS, SegTrack-v2, and ViSal, demonstrate that the proposed method achieves superior performance over state-of-the-art algorithms. The source code is available at https://github.com/TJUMMG/DS-Net.
    Real-time Human Action Recognition Using Locally Aggregated Kinematic-Guided Skeletonlet and Supervised Hashing-by-Analysis Model. (arXiv:2105.11312v2 [cs.CV] UPDATED)
    (2 min) 3D action recognition refers to the classification of action sequences consisting of 3D skeleton joints. While much research has been devoted to 3D action recognition, it mainly suffers from three problems: highly complicated articulation, a great amount of noise, and low implementation efficiency. To tackle all these problems, we propose a real-time 3D action recognition framework by integrating the locally aggregated kinematic-guided skeletonlet (LAKS) with a supervised hashing-by-analysis (SHA) model. We first define the skeletonlet as a few combinations of joint offsets grouped in terms of kinematic principles, and then represent an action sequence using LAKS, which consists of a denoising phase and a locally aggregating phase. The denoising phase detects noisy action data and adjusts it by replacing all the features within it with the features of the corresponding previous frame, while the locally aggregating phase sums the difference between an offset feature of the skeletonlet and its cluster center over all the offset features of the sequence. Finally, the SHA model combines sparse representation with a hashing model, aiming to promote recognition accuracy while maintaining high efficiency. Experimental results on the MSRAction3D, UTKinectAction3D and Florence3DAction datasets demonstrate that the proposed method outperforms state-of-the-art methods in both recognition accuracy and implementation efficiency.
    Improving Transferability of Domain Adaptation Networks Through Domain Alignment Layers. (arXiv:2109.02693v1 [cs.CV])
    (2 min) Deep learning (DL) has been the primary approach in various computer vision tasks due to the strong results it has achieved. However, in real-world scenarios with partially labeled or unlabeled data, DL methods are also prone to the well-known domain shift problem. Multi-source unsupervised domain adaptation (MSDA) aims at learning a predictor for an unlabeled domain by transferring weak knowledge from a bag of source models. However, most works conduct domain adaptation leveraging only the extracted features and reduce their domain shift from the perspective of loss function design. In this paper, we argue that it is not sufficient to handle domain shift only based on domain-level features; it is also essential to align such information in the feature space. Unlike previous works, we focus on the network design and propose to embed Multi-Source versions of DomaIn Alignment Layers (MS-DIAL) at different levels of the predictor. These layers are designed to match the feature distributions between different domains and can be easily applied to various MSDA methods. To show the robustness of our approach, we conduct an extensive experimental evaluation considering two challenging scenarios: digit recognition and object classification. The experimental results indicate that our approach can improve state-of-the-art MSDA methods, yielding relative gains of up to +30.64% in classification accuracy.
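    Domain alignment layers of this kind are often realized as per-domain normalization with weights shared elsewhere; the sketch below is a hypothetical minimal version (not the MS-DIAL implementation) that keeps one BatchNorm per source/target domain so each domain is standardized with its own statistics.

```python
import torch
import torch.nn as nn

class MultiSourceDomainAlignment(nn.Module):
    """One BatchNorm per domain; convolutional weights around this layer stay shared."""
    def __init__(self, num_features, num_domains):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(num_features) for _ in range(num_domains))

    def forward(self, x, domain_idx):
        # All samples in `x` are assumed to come from the same domain `domain_idx`.
        return self.bns[domain_idx](x)

# Usage: embed the layer at several levels of a shared feature extractor.
layer = MultiSourceDomainAlignment(num_features=64, num_domains=3)
feats = torch.randn(8, 64, 32, 32)
aligned = layer(feats, domain_idx=1)
print(aligned.shape)
```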
    Rethinking Crowdsourcing Annotation: Partial Annotation with Salient Labels for Multi-Label Image Classification. (arXiv:2109.02688v1 [cs.CV])
    (2 min) Annotated images are required for both supervised model training and evaluation in image classification. Manually annotating images is arduous and expensive, especially for multi-labeled images. A recent trend for conducting such laborious annotation tasks is crowdsourcing, where images are annotated from scratch by volunteers or paid workers online (e.g., workers on Amazon Mechanical Turk). However, the quality of crowdsourced image annotations cannot be guaranteed, and incompleteness and incorrectness are two major concerns. To address these concerns, we rethink crowdsourced annotation: our simple hypothesis is that if annotators only partially annotate multi-label images with salient labels they are confident in, there will be fewer annotation errors and annotators will spend less time on uncertain labels. As a pleasant surprise, with the same annotation budget, we show that a multi-label image classifier supervised by images with salient annotations can outperform models supervised by fully annotated images. Our contributions are two-fold: an active learning approach is proposed to acquire salient labels for multi-label images, and a novel Adaptive Temperature Associated Model (ATAM) that specifically uses partial annotations is proposed for multi-label image classification. We conduct experiments on practical crowdsourcing data, the Open Street Map (OSM) dataset, and the benchmark dataset COCO 2014. When compared with state-of-the-art classification methods trained on fully annotated images, the proposed ATAM achieves higher accuracy. The proposed idea is promising for crowdsourced data annotation. Our code will be publicly available.
    Pseudo-mask Matters in Weakly-supervised Semantic Segmentation. (arXiv:2108.12995v2 [cs.CV] UPDATED)
    (2 min) Most weakly supervised semantic segmentation (WSSS) methods follow a pipeline that first generates pseudo-masks and then trains the segmentation model with the pseudo-masks in a fully supervised manner. However, we identify several issues related to the pseudo-masks, including the generation of high-quality pseudo-masks from class activation maps (CAMs) and training with noisy pseudo-mask supervision. To address these issues, we propose the following designs to push the performance to a new state of the art: (i) Coefficient of Variation Smoothing to smooth the CAMs adaptively; (ii) Proportional Pseudo-mask Generation to project the expanded CAMs to pseudo-masks based on a new metric indicating the importance of each class at each location, instead of the scores from binary classifiers; (iii) a Pretended Under-Fitting strategy to suppress the influence of noise in the pseudo-masks; (iv) Cyclic Pseudo-mask to boost the pseudo-masks during training of fully supervised semantic segmentation (FSSS). Experiments based on our methods achieve new state-of-the-art results on two challenging weakly supervised semantic segmentation datasets, pushing the mIoU to 70.0% and 40.2% on PASCAL VOC 2012 and MS COCO 2014, respectively. Code, including the segmentation framework, is released at https://github.com/Eli-YiLi/PMM
    TrouSPI-Net: Spatio-temporal attention on parallel atrous convolutions and U-GRUs for skeletal pedestrian crossing prediction. (arXiv:2109.00953v2 [cs.CV] UPDATED)
    (2 min) Understanding the behaviors and intentions of pedestrians is still one of the main challenges for vehicle autonomy, as accurate predictions of their intentions can help ensure their safety and the driving comfort of vehicles. In this paper, we address pedestrian crossing prediction in urban traffic environments by linking the dynamics of a pedestrian's skeleton to a binary crossing intention. We introduce TrouSPI-Net: a context-free, lightweight, multi-branch predictor. TrouSPI-Net extracts spatio-temporal features at different time resolutions by encoding pseudo-image sequences of skeletal joint positions and processing them with parallel attention modules and atrous convolutions. The proposed approach is then enhanced by processing features such as relative distances of skeletal joints, bounding box positions, or ego-vehicle speed with U-GRUs. Using the newly proposed evaluation procedures for two large public naturalistic datasets for studying pedestrian behavior in traffic, JAAD and PIE, we evaluate TrouSPI-Net and analyze its performance. Experimental results show that TrouSPI-Net achieves an F1 score of 0.76 on JAAD and 0.80 on PIE, outperforming the current state of the art while being lightweight and context-free.
    Animatable Neural Radiance Fields from Monocular RGB Videos. (arXiv:2106.13629v2 [cs.CV] UPDATED)
    (2 min) We present animatable neural radiance fields (animatable NeRF) for detailed human avatar creation from monocular videos. Our approach extends neural radiance fields (NeRF) to dynamic scenes with human movements by introducing explicit pose-guided deformation while learning the scene representation network. In particular, we estimate the human pose for each frame and learn a constant canonical space for the detailed human template, which enables natural shape deformation from the observation space to the canonical space under the explicit control of the pose parameters. To compensate for inaccurate pose estimation, we introduce a pose refinement strategy that updates the initial pose during the learning process, which not only helps to learn a more accurate human reconstruction but also accelerates convergence. In experiments we show that the proposed approach achieves 1) implicit human geometry and appearance reconstruction with high-quality details, 2) photo-realistic rendering of the human from novel views, and 3) animation of the human with novel poses.
    Graph Representation Learning for Road Type Classification. (arXiv:2107.07791v3 [cs.LG] UPDATED)
    (2 min) We present a novel learning-based approach to graph representations of road networks employing state-of-the-art graph convolutional neural networks. Our approach is applied to realistic road networks of 17 cities from Open Street Map. While edge features are crucial to generate descriptive graph representations of road networks, graph convolutional networks usually rely on node features only. We show that the highly representative edge features can still be integrated into such networks by applying a line graph transformation. We also propose a method for neighborhood sampling based on a topological neighborhood composed of both local and global neighbors. We compare the performance of learning representations using different types of neighborhood aggregation functions in transductive and inductive tasks and in supervised and unsupervised learning. Furthermore, we propose a novel aggregation approach, Graph Attention Isomorphism Network, GAIN. Our results show that GAIN outperforms state-of-the-art methods on the road type classification problem.
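    The line graph transformation mentioned above can be reproduced directly with networkx; in the toy sketch below (made-up road segments, not the paper's 17-city data), edge attributes such as road type become node attributes of the line graph, which node-centric graph convolutions can then consume.

```python
import networkx as nx

road = nx.Graph()
road.add_edge("A", "B", highway="residential", length=120.0)
road.add_edge("B", "C", highway="primary", length=340.0)
road.add_edge("B", "D", highway="secondary", length=80.0)

line = nx.line_graph(road)               # nodes of `line` are the edges of `road`
for u, v in line.nodes:                  # copy edge attributes onto line-graph nodes
    line.nodes[(u, v)].update(road.edges[u, v])

print(list(line.nodes(data=True)))
# Each former road segment is now a node carrying its own features; two segments
# are connected in `line` iff they share an intersection in the original network.
```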
    Fair Comparison: Quantifying Variance in Results for Fine-grained Visual Categorization. (arXiv:2109.03156v1 [cs.CV])
    (2 min) For the task of image classification, researchers work arduously to develop the next state-of-the-art (SOTA) model, each benchmarking their own performance against that of their predecessors and of their peers. Unfortunately, the metric used most frequently to describe a model's performance, average categorization accuracy, is often used in isolation. As the number of classes increases, such as in fine-grained visual categorization (FGVC), the amount of information conveyed by average accuracy alone dwindles. While its most glaring weakness is its failure to describe the model's performance on a class-by-class basis, average accuracy also fails to describe how performance may vary from one trained model of the same architecture, on the same dataset, to another (both averaged across all categories and at the per-class level). We first demonstrate the magnitude of these variations across models and across class distributions based on attributes of the data, comparing results on different visual domains and different per-class image distributions, including long-tailed distributions and few-shot subsets. We then analyze the impact various FGVC methods have on overall and per-class variance. From this analysis, we highlight the importance of reporting and comparing methods based on information beyond overall accuracy, and point out techniques that mitigate variance in FGVC results.
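    The kind of reporting the paper argues for can be computed in a few lines; the sketch below (synthetic predictions, purely illustrative) derives per-class accuracies for several independent runs and their variance across runs, instead of a single averaged number.

```python
import numpy as np

# predictions[r] holds the predicted labels of run r on a fixed test set (toy data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)                 # 5 classes
predictions = [np.where(rng.random(1000) < 0.8, labels, rng.integers(0, 5, size=1000))
               for _ in range(4)]                      # 4 independent training runs

per_class = np.array([[np.mean(pred[labels == c] == c) for c in range(5)]
                      for pred in predictions])        # shape: (runs, classes)

print("overall accuracy per run :", per_class.mean(axis=1))
print("per-class mean accuracy  :", per_class.mean(axis=0))
print("per-class std across runs:", per_class.std(axis=0))
```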
    Material Measurement Units for a Circular Economy: Foundations through a Survey. (arXiv:2103.01997v2 [cs.CV] UPDATED)
    (2 min) Long-term availability of minerals and industrial materials is a necessary condition for sustainable development as they are the constituents of any manufacturing product. To enhance the efficiency of material management, we define a computer-vision-enabled material measurement system and provide a survey of works relevant to its development with particular emphasis on the foundations. A network of such systems for wide-area material stock monitoring is also covered. Finally, challenges and future research directions are discussed. As the first article bridging industrial ecology and advanced computer vision, this survey is intended to support both research communities towards more sustainable manufacturing.
    LSD-StructureNet: Modeling Levels of Structural Detail in 3D Part Hierarchies. (arXiv:2108.13459v2 [cs.CV] UPDATED)
    (2 min) Generative models for 3D shapes represented by hierarchies of parts can generate realistic and diverse sets of outputs. However, existing models suffer from the key practical limitation of modelling shapes holistically and thus cannot perform conditional sampling, i.e. they are not able to generate variants of individual parts of generated shapes without modifying the rest of the shape. This is limiting for applications such as 3D CAD design that involve adjusting created shapes at multiple levels of detail. To address this, we introduce LSD-StructureNet, an augmentation to the StructureNet architecture that enables re-generation of parts situated at arbitrary positions in the hierarchies of its outputs. We achieve this by learning individual, probabilistic conditional decoders for each hierarchy depth. We evaluate LSD-StructureNet on the PartNet dataset, the largest dataset of 3D shapes represented by hierarchies of parts. Our results show that, contrary to existing methods, LSD-StructureNet can perform conditional sampling without impacting inference speed or the realism and diversity of its outputs.
    PLAM: a Posit Logarithm-Approximate Multiplier. (arXiv:2102.09262v2 [cs.LG] UPDATED)
    (2 min) The Posit Number System was introduced in 2017 as a replacement for floating-point numbers. Since then, the community has explored its application to neural network related tasks and produced some unit designs which are still far from being competitive with their floating-point counterparts. This paper proposes a Posit Logarithm-Approximate Multiplication (PLAM) scheme to significantly reduce the complexity of posit multipliers, the most power-hungry units within deep neural network architectures. When compared with state-of-the-art posit multipliers, experiments show that the proposed technique reduces the area, power, and delay of hardware multipliers by up to 72.86%, 81.79%, and 17.01%, respectively, without accuracy degradation.
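    Logarithm-approximate multiplication is commonly built on Mitchell's approximation, where log2(1+f) is replaced by f so a multiply becomes an add in the log domain; the snippet below illustrates only this underlying arithmetic on ordinary floats (the paper applies the idea to posit hardware, which is not reproduced here).

```python
import math

def mitchell_multiply(a: float, b: float) -> float:
    """Approximate a*b for positive floats via Mitchell's log/antilog approximation."""
    ma, ea = math.frexp(a)          # a = ma * 2**ea with ma in [0.5, 1)
    mb, eb = math.frexp(b)
    ma, ea = ma * 2.0, ea - 1       # renormalize so the mantissa lies in [1, 2)
    mb, eb = mb * 2.0, eb - 1
    s = (ma - 1.0) + (mb - 1.0)     # approximates log2(ma) + log2(mb)
    if s < 1.0:
        return (1.0 + s) * 2.0 ** (ea + eb)   # antilog approximation: 2**s ~ 1 + s
    return s * 2.0 ** (ea + eb + 1)           # carry into the exponent when s >= 1

for a, b in [(3.7, 2.2), (12.0, 0.9), (1.5, 1.5)]:
    approx, exact = mitchell_multiply(a, b), a * b
    print(f"{a} * {b}: exact={exact:.4f} approx={approx:.4f} "
          f"rel.err={(approx - exact) / exact:+.2%}")
```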
    Experimental Quantum Generative Adversarial Networks for Image Generation. (arXiv:2010.06201v3 [quant-ph] UPDATED)
    (2 min) Quantum machine learning is expected to be one of the first practical applications of near-term quantum devices. Pioneering theoretical works suggest that quantum generative adversarial networks (GANs) may exhibit a potential exponential advantage over classical GANs, thus attracting widespread attention. However, it remains elusive whether quantum GANs implemented on near-term quantum devices can actually solve real-world learning tasks. Here, we devise a flexible quantum GAN scheme to narrow this knowledge gap, which can accomplish image generation with arbitrarily high-dimensional features and can also take advantage of quantum superposition to train multiple examples in parallel. For the first time, we experimentally achieve the learning and generation of real-world hand-written digit images on a superconducting quantum processor. Moreover, we utilize a gray-scale bar dataset to exhibit the competitive performance between quantum GANs and classical GANs based on multilayer perceptron and convolutional neural network architectures, respectively, benchmarked by the Fr\'echet Distance score. Our work provides guidance for developing advanced quantum generative models on near-term quantum devices and opens up an avenue for exploring quantum advantages in various GAN-related learning tasks.
    RCNN-SliceNet: A Slice and Cluster Approach for Nuclei Centroid Detection in Three-Dimensional Fluorescence Microscopy Images. (arXiv:2106.15753v2 [eess.IV] UPDATED)
    (2 min) Robust and accurate nuclei centroid detection is important for understanding biological structures in fluorescence microscopy images. Existing automated nuclei localization methods face three main challenges: (1) most object detection methods work only on 2D images and are difficult to extend to 3D volumes; (2) segmentation-based models can be used on 3D volumes but are computationally expensive for large microscopy volumes and have difficulty distinguishing different instances of objects; (3) hand-annotated ground truth is limited for 3D microscopy volumes. To address these issues, we present a scalable approach for nuclei centroid detection of 3D microscopy volumes. We describe RCNN-SliceNet, which detects 2D nuclei centroids for each slice of the volume from different directions, and use 3D agglomerative hierarchical clustering (AHC) to estimate the 3D centroids of nuclei in a volume. The model was trained with synthetic microscopy data generated using Spatially Constrained Cycle-Consistent Adversarial Networks (SpCycleGAN) and tested on different types of real 3D microscopy data. Extensive experimental results demonstrate that our proposed method can accurately count and detect the nuclei centroids in a 3D microscopy volume.
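    The slice-and-cluster stage can be mimicked with off-the-shelf tools; in the toy sketch below (fabricated detections and an assumed distance threshold), per-slice 2D centroids carrying their slice index are merged into 3D nuclei centroids with agglomerative hierarchical clustering and per-cluster averaging.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row is (x, y, z): a 2D detection plus the index of the slice it came from.
detections = np.array([
    [10.2, 14.9, 3], [10.6, 15.1, 4], [9.8, 15.3, 5],      # nucleus 1
    [40.0, 22.1, 7], [40.4, 21.8, 8],                      # nucleus 2
    [70.5, 55.0, 2], [70.1, 54.6, 3], [69.8, 55.4, 4],     # nucleus 3
])

# The distance threshold plays the role of the expected nucleus radius (a free parameter).
clust = AgglomerativeClustering(n_clusters=None, distance_threshold=6.0, linkage="average")
labels = clust.fit_predict(detections)

centroids_3d = np.array([detections[labels == k].mean(axis=0)
                         for k in np.unique(labels)])
print(centroids_3d)   # one (x, y, z) centroid per detected nucleus
```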
    Training Deep Networks from Zero to Hero: avoiding pitfalls and going beyond. (arXiv:2109.02752v1 [cs.LG])
    (2 min) Training deep neural networks on real-world data can be challenging. Using models as black boxes, even with transfer learning, can result in poor generalization or inconclusive results on small datasets or specific applications. This tutorial covers the basic steps as well as more recent options to improve models, in particular, but not restricted to, supervised learning. It can be particularly useful for datasets that are not as well-prepared as those in challenges, and under scarce annotation and/or small data. We describe basic procedures such as data preparation, optimization, and transfer learning, but also recent architectural choices such as the use of transformer modules, alternative convolutional layers, activation functions, wide and deep networks, as well as training procedures including curriculum, contrastive, and self-supervised learning.
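    A minimal transfer-learning recipe of the sort such a tutorial covers might look as follows (a generic sketch, not the authors' code): freeze a pretrained backbone, replace the classification head, and fine-tune with a standard optimizer and scheduler.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10
model = models.resnet18(pretrained=True)       # ImageNet-pretrained backbone
for p in model.parameters():                   # freeze all backbone weights
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new task-specific head

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
criterion = nn.CrossEntropyLoss()

def train_step(images, targets):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch just to show the call pattern; replace with a real DataLoader.
print(train_step(torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 2, 3])))
scheduler.step()
```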
    Efficient Vision Transformers via Fine-Grained Manifold Distillation. (arXiv:2107.01378v3 [cs.CV] UPDATED)
    (2 min) This paper studies the model compression problem of vision transformers. Benefiting from the self-attention module, transformer architectures have shown extraordinary performance on many computer vision tasks. Although network performance is boosted, transformers often require more computational resources, including memory usage and inference complexity. Compared with existing knowledge distillation approaches, we propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches. We then explore an efficient fine-grained manifold distillation approach that simultaneously calculates cross-image, cross-patch, and randomly-selected manifolds in teacher and student models. Experimental results conducted on several benchmarks demonstrate the superiority of the proposed algorithm for distilling portable transformer models with higher performance. For example, our approach achieves 75.06% Top-1 accuracy on the ImageNet-1k dataset when training a DeiT-Tiny model, which outperforms other ViT distillation methods.
    Self-supervised Tumor Segmentation through Layer Decomposition. (arXiv:2109.03230v1 [cs.CV])
    (2 min) In this paper, we propose a self-supervised approach for tumor segmentation. Specifically, we advocate a zero-shot setting, where models from self-supervised learning should be directly applicable to the downstream task, without using any manual annotations whatsoever. We make the following contributions. First, with careful examination of existing self-supervised learning approaches, we reveal the surprising result that, given suitable data augmentation, models trained from scratch in fact achieve comparable performance to those pre-trained with self-supervised learning. Second, inspired by the fact that tumors tend to be characterized independently of their contexts, we propose a scalable pipeline for generating synthetic tumor data, and train a self-supervised model that minimises the generalisation gap with the downstream task. Third, we conduct extensive ablation studies on different downstream datasets, BraTS2018 for brain tumor segmentation and LiTS2017 for liver tumor segmentation. While evaluating the model transferability for tumor segmentation under a low-annotation regime, including an extreme case of zero-shot segmentation, the proposed approach demonstrates state-of-the-art performance, substantially outperforming all existing self-supervised approaches and opening up the usage of self-supervised learning in practical scenarios.
    Smart Traffic Monitoring System using Computer Vision and Edge Computing. (arXiv:2109.03141v1 [cs.CV])
    (2 min) Traffic management systems capture tremendous amounts of video data and leverage advances in video processing to detect and monitor traffic incidents. The collected data are traditionally forwarded to the traffic management center (TMC) for in-depth analysis and may thus overload the network paths to the TMC. To alleviate such bottlenecks, we propose to utilize edge computing by equipping edge nodes that are close to cameras with computing resources (e.g., cloudlets). A cloudlet, with limited computing resources compared to the TMC, provides limited video processing capabilities. In this paper, we focus on two common traffic monitoring tasks, congestion detection and speed detection, and propose a two-tier edge computing based model that takes into account both the limited computing capability in cloudlets and the unstable network condition to the TMC. Our solution utilizes two algorithms for each task, one implemented at the edge and the other at the TMC, which are designed with their different computing resources in mind. While the TMC provides strong computation power, the video quality it receives depends on the underlying network conditions. On the other hand, the edge processes very high-quality video but with limited computing resources. Our model captures this trade-off. We evaluate the performance of the proposed two-tier model as well as the traffic monitoring algorithms via test-bed experiments under different weather and network conditions and show that our proposed hybrid edge-cloud solution outperforms both the cloud-only and edge-only solutions.
    Meta-Semi: A Meta-learning Approach for Semi-supervised Learning. (arXiv:2007.02394v3 [cs.LG] UPDATED)
    (2 min) Deep learning based semi-supervised learning (SSL) algorithms have led to promising results in recent years. However, they tend to introduce multiple tunable hyper-parameters, making them less practical in real SSL scenarios where the labeled data is scarce for extensive hyper-parameter search. In this paper, we propose a novel meta-learning based SSL algorithm (Meta-Semi) that requires tuning only one additional hyper-parameter, compared with a standard supervised deep learning algorithm, to achieve competitive performance under various conditions of SSL. We start by defining a meta optimization problem that minimizes the loss on labeled data through dynamically reweighting the loss on unlabeled samples, which are associated with soft pseudo labels during training. As the meta problem is computationally intensive to solve directly, we propose an efficient algorithm to dynamically obtain the approximate solutions. We show theoretically that Meta-Semi converges to the stationary point of the loss function on labeled data under mild conditions. Empirically, Meta-Semi outperforms state-of-the-art SSL algorithms significantly on the challenging semi-supervised CIFAR-100 and STL-10 tasks, and achieves competitive performance on CIFAR-10 and SVHN.
    How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild. (arXiv:2106.03932v2 [cs.CV] UPDATED)
    (2 min) Successful active speaker detection requires a three-stage pipeline: (i) audio-visual encoding for all speakers in the clip, (ii) inter-speaker relation modeling between a reference speaker and the background speakers within each frame, and (iii) temporal modeling for the reference speaker. Each stage of this pipeline plays an important role for the final performance of the created architecture. Based on a series of controlled experiments, this work presents several practical guidelines for audio-visual active speaker detection. Correspondingly, we present a new architecture called ASDNet, which achieves a new state-of-the-art on the AVA-ActiveSpeaker dataset with a mAP of 93.5% outperforming the second best with a large margin of 4.7%. Our code and pretrained models are publicly available.
    CovarianceNet: Conditional Generative Model for Correct Covariance Prediction in Human Motion Prediction. (arXiv:2109.02965v1 [cs.CV])
    (2 min) Correctly characterizing uncertainty when predicting human motion is as important as the accuracy of the prediction itself. We present a new method to correctly predict the uncertainty associated with the predicted distribution of future trajectories. Our approach, CovarianceNet, is based on a conditional generative model with Gaussian latent variables that predicts the parameters of a bi-variate Gaussian distribution. The combination of CovarianceNet with a motion prediction model results in a hybrid approach that outputs a uni-modal distribution. We show how some state-of-the-art methods in motion prediction become overconfident when predicting uncertainty, according to our proposed metric and validated on the ETH dataset \cite{pellegrini2009you}. CovarianceNet correctly predicts uncertainty, which makes our method suitable for applications that use predicted distributions, e.g., planning or decision making.
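    A network head that predicts a full bi-variate Gaussian is typically trained with a negative log-likelihood loss over its five outputs (means, standard deviations, correlation); the sketch below is our own illustration of that loss, not the CovarianceNet implementation.

```python
import math
import torch

def bivariate_gaussian_nll(pred, target):
    """pred: (N, 5) = [mu_x, mu_y, log_sigma_x, log_sigma_y, rho_logit]; target: (N, 2)."""
    mu = pred[:, :2]
    sigma = pred[:, 2:4].exp()                         # positive standard deviations
    rho = torch.tanh(pred[:, 4])                       # correlation in (-1, 1)
    dx = (target[:, 0] - mu[:, 0]) / sigma[:, 0]
    dy = (target[:, 1] - mu[:, 1]) / sigma[:, 1]
    one_m_rho2 = 1.0 - rho ** 2 + 1e-6                 # small epsilon for stability
    z = dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy
    nll = (z / (2.0 * one_m_rho2)
           + torch.log(sigma[:, 0]) + torch.log(sigma[:, 1])
           + 0.5 * torch.log(one_m_rho2)
           + math.log(2.0 * math.pi))
    return nll.mean()

pred = torch.randn(8, 5, requires_grad=True)           # stand-in for the network output
target = torch.randn(8, 2)
loss = bivariate_gaussian_nll(pred, target)
loss.backward()
print(loss.item())
```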
    ConvNets for Counting: Object Detection of Transient Phenomena in Steelpan Drums. (arXiv:2102.00632v2 [cs.CV] UPDATED)
    (2 min) We train an object detector built from convolutional neural networks to count interference fringes in elliptical antinode regions in frames of high-speed video recordings of transient oscillations in Caribbean steelpan drums illuminated by electronic speckle pattern interferometry (ESPI). The annotations provided by our model aim to contribute to the understanding of time-dependent behavior in such drums by tracking the development of sympathetic vibration modes. The system is trained on a dataset of crowdsourced human-annotated images obtained from the Zooniverse Steelpan Vibrations Project. Due to the small number of human-annotated images and the ambiguity of the annotation task, we also evaluate the model on a large corpus of synthetic images whose properties have been matched to the real images by style transfer using a Generative Adversarial Network. Applying the model to thousands of unlabeled video frames, we measure oscillations consistent with audio recordings of these drum strikes. One unanticipated result is that sympathetic oscillations of higher-octave notes significantly precede the rise in sound intensity of the corresponding second harmonic tones; the mechanism responsible for this remains unidentified. This paper primarily concerns the development of the predictive model; further exploration of the steelpan images and deeper physical insights await its further application.
    FDA: Feature Decomposition and Aggregation for Robust Airway Segmentation. (arXiv:2109.02920v1 [eess.IV])
    (2 min) 3D Convolutional Neural Networks (CNNs) have been widely adopted for airway segmentation. The performance of 3D CNNs is greatly influenced by the dataset, while the public airway datasets consist mainly of clean CT scans with coarse annotations, making them difficult to generalize to noisy CT scans (e.g., COVID-19 CT scans). In this work, we propose a new dual-stream network to address the variability between the clean domain and the noisy domain, which utilizes clean CT scans and a small amount of labeled noisy CT scans for airway segmentation. We design two different encoders to extract the transferable clean features and the unique noisy features separately, followed by two independent decoders. Furthermore, the transferable features are refined by channel-wise feature recalibration and Signed Distance Map (SDM) regression. The feature recalibration module emphasizes critical features, and the SDM pays more attention to the bronchi, which is beneficial for extracting transferable topological features robust to coarse labels. Extensive experimental results demonstrate the clear improvement brought by our proposed method. Compared to other state-of-the-art transfer learning methods, our method accurately segments more bronchi in the noisy CT scans.
    Analysis of MRI Biomarkers for Brain Cancer Survival Prediction. (arXiv:2109.02785v1 [q-bio.QM])
    (2 min) Prediction of Overall Survival (OS) of brain cancer patients from multi-modal MRI is a challenging field of research. Most of the existing literature on survival prediction is based on radiomic features, which do not consider non-biological factors or the functional neurological status of the patient(s). Besides, the selection of an appropriate cut-off for survival and the presence of censored data create further problems. Application of deep learning models for OS prediction is also limited due to the lack of large annotated publicly available datasets. In this scenario we analyse the potential of two novel neuroimaging feature families, extracted from brain parcellation atlases and spatial habitats, along with classical radiomic and geometric features, to study their combined predictive power for analysing overall survival. A cross-validation strategy with grid search is proposed to simultaneously select and evaluate the most predictive feature subset based on its predictive power. A Cox Proportional Hazard (CoxPH) model is employed for univariate feature selection, followed by the prediction of patient-specific survival functions by three multivariate parsimonious models, viz. Coxnet, Random Survival Forests (RSF) and Survival SVM (SSVM). The brain cancer MRI data used for this research were taken from two open-access collections, TCGA-GBM and TCGA-LGG, available from The Cancer Imaging Archive (TCIA). Corresponding survival data for each patient were downloaded from The Cancer Genome Atlas (TCGA). A high cross-validation $C$-index score of $0.82\pm.10$ was achieved using RSF with the best $24$ selected features. Age was found to be the most important biological predictor. There were $9$, $6$, $6$ and $2$ features selected from the parcellation, habitat, radiomic and region-based feature groups respectively.
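    A toy version of the univariate-screening-then-multivariate-model step can be written with the lifelines package as below; the feature names, simulated effect sizes, and the 0.1 significance cut-off are all assumptions for illustration, not the paper's protocol.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
age = rng.normal(60, 10, n)
df = pd.DataFrame({
    "age": age,
    "habitat_volume": rng.normal(0.0, 1.0, n),
    "radiomic_entropy": rng.normal(0.0, 1.0, n),
    # survival time shortens with age in this toy data; ~30% of cases are censored
    "duration": rng.exponential(365.0 * np.exp(-0.03 * (age - 60.0))),
    "event": (rng.random(n) > 0.3).astype(int),
})

# Univariate CoxPH screening: keep features whose single-variable model is significant.
selected = []
for col in ["age", "habitat_volume", "radiomic_entropy"]:
    cph = CoxPHFitter().fit(df[[col, "duration", "event"]], "duration", "event")
    if cph.summary.loc[col, "p"] < 0.1:
        selected.append(col)

# Multivariate model on the selected subset, evaluated by the concordance index.
final = CoxPHFitter().fit(df[selected + ["duration", "event"]], "duration", "event")
print("selected features:", selected, " C-index:", round(final.concordance_index_, 3))
```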
    Deep Collaborative Multi-Modal Learning for Unsupervised Kinship Estimation. (arXiv:2109.02804v1 [cs.CV])
    (2 min) Kinship verification is a long-standing research challenge in computer vision. The visual differences present in faces have a significant effect on the recognition capabilities of kinship systems. We argue that aggregating multiple sources of visual knowledge can better describe the characteristics of a subject for precise kinship identification. Typically, age-invariant features can represent more natural facial details. Such age-related transformations are essential for face recognition due to the biological effects of aging. However, existing methods mainly focus on employing single-view image features for kinship identification, while more meaningful visual properties such as race and age are ignored in the feature learning step. To this end, we propose a novel deep collaborative multi-modal learning (DCML) approach to integrate the underlying information presented in facial properties in an adaptive manner and strengthen facial details for effective unsupervised kinship verification. Specifically, we construct a well-designed adaptive feature fusion mechanism, which can jointly leverage the complementary properties from different visual perspectives to produce composite features and draw greater attention to the most informative components of spatial feature maps. In particular, an adaptive weighting strategy is developed based on a novel attention mechanism, which can enhance the dependencies between different properties by decreasing the information redundancy in channels in a self-adaptive manner. Extensive experimental evaluations conducted on four widely-used datasets show that our DCML method consistently outperforms state-of-the-art kinship verification methods.
    Efficient ADMM-based Algorithms for Convolutional Sparse Coding. (arXiv:2109.02969v1 [cs.LG])
    (2 min) Convolutional sparse coding improves on the standard sparse approximation by incorporating a global shift-invariant model. The most efficient convolutional sparse coding methods are based on the alternating direction method of multipliers and the convolution theorem. The only major difference between these methods is how they approach a convolutional least-squares fitting subproblem. This letter presents a solution to this subproblem, which improves the efficiency of the state-of-the-art algorithms. We also use the same approach for developing an efficient convolutional dictionary learning method. Furthermore, we propose a novel algorithm for convolutional sparse coding with a constraint on the approximation error.
    Rethinking Common Assumptions to Mitigate Racial Bias in Face Recognition Datasets. (arXiv:2109.03229v1 [cs.CV])
    (2 min) Many existing works have made great strides towards reducing racial bias in face recognition. However, most of these methods attempt to rectify bias that manifests in models during training instead of directly addressing a major source of the bias, the dataset itself. Exceptions to this are BUPT-Balancedface/RFW and Fairface, but these works assume that primarily training on a single race or not racially balancing the dataset are inherently disadvantageous. We demonstrate that these assumptions are not necessarily valid. In our experiments, training on only African faces induced less bias than training on a balanced distribution of faces and distributions skewed to include more African faces produced more equitable models. We additionally notice that adding more images of existing identities to a dataset in place of adding new identities can lead to accuracy boosts across racial categories. Our code is available at https://github.com/j-alex-hanson/rethinking-race-face-datasets
    Backdoor Attacks Against Deep Learning Systems in the Physical World. (arXiv:2006.14580v4 [cs.CV] UPDATED)
    (3 min) Backdoor attacks embed hidden malicious behaviors into deep learning models, which only activate and cause misclassifications on model inputs containing a specific trigger. Existing works on backdoor attacks and defenses, however, mostly focus on digital attacks that use digitally generated patterns as triggers. A critical question remains unanswered: can backdoor attacks succeed using physical objects as triggers, thus making them a credible threat against deep learning systems in the real world? We conduct a detailed empirical study to explore this question for facial recognition, a critical deep learning task. Using seven physical objects as triggers, we collect a custom dataset of 3205 images of ten volunteers and use it to study the feasibility of physical backdoor attacks under a variety of real-world conditions. Our study reveals two key findings. First, physical backdoor attacks can be highly successful if they are carefully configured to overcome the constraints imposed by physical objects. In particular, the placement of successful triggers is largely constrained by the target model's dependence on key facial features. Second, four of today's state-of-the-art defenses against (digital) backdoors are ineffective against physical backdoors, because the use of physical objects breaks core assumptions used to construct these defenses. Our study confirms that (physical) backdoor attacks are not a hypothetical phenomenon but rather pose a serious real-world threat to critical classification tasks. We need new and more robust defenses against backdoors in the physical world.
    Generative-Adversarial-Networks-based Ghost Recognition. (arXiv:2103.13858v2 [cs.CV] UPDATED)
    (2 min) Nowadays, target recognition techniques play an important role in many fields. However, current methods based on target image information suffer from the influence of image quality and the time cost of image reconstruction. In this paper, we propose a novel imaging-free target recognition method combining ghost imaging (GI) and generative adversarial networks (GANs). Based on the mechanism of GI, a sequence of random speckle patterns is employed to illuminate the target, and a bucket detector without spatial resolution is utilized to receive the echo signal. The bucket signal sequence formed after continuous detections is assembled into a bucket signal array, which serves as a sample for the GAN. Then, a conditional GAN is used to map the bucket signal array to the target category. In practical application, the speckle sequence from the training step is employed to illuminate the target, and the bucket signal array is fed into the GAN for recognition. The proposed method alleviates the problems of conventional recognition methods based on target image information and provides a certain degree of robustness to turbulence. Extensive experiments show that the proposed method achieves promising performance.
    PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System. (arXiv:2109.03144v1 [cs.CV])
    (2 min) Optical Character Recognition (OCR) systems have been widely used in a variety of application scenarios, yet designing an OCR system remains a challenging task. In previous work, we proposed a practical ultra-lightweight OCR system (PP-OCR) to balance accuracy against efficiency. To improve the accuracy of PP-OCR while keeping its high efficiency, in this paper we propose a more robust OCR system, PP-OCRv2. We introduce a bag of tricks to train a better text detector and a better text recognizer, including Collaborative Mutual Learning (CML), CopyPaste, Lightweight CPU Network (LCNet), Unified-Deep Mutual Learning (U-DML) and Enhanced CTCLoss. Experiments on real data show that the precision of PP-OCRv2 is 7% higher than PP-OCR under the same inference cost, and is comparable to the server models of PP-OCR, which use the ResNet series as backbones. All of the above-mentioned models are open-sourced and the code is available in the GitHub repository PaddleOCR, which is powered by PaddlePaddle.
    Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention. (arXiv:2109.02955v1 [cs.CV])
    (2 min) Automatically describing video, or video captioning, has been widely studied in the multimedia field. This paper proposes a new task of sensor-augmented egocentric-video captioning, a newly constructed dataset for it called MMAC Captions, and a method for the newly proposed task that effectively utilizes multi-modal data of video and motion sensors, or inertial measurement units (IMUs). While conventional video captioning tasks have difficulty in dealing with detailed descriptions of human activities due to the limited view of a fixed camera, egocentric vision has greater potential to be used for generating the finer-grained descriptions of human activities on the basis of a much closer view. In addition, we utilize wearable-sensor data as auxiliary information to mitigate the inherent problems in egocentric vision: motion blur, self-occlusion, and out-of-camera-range activities. We propose a method for effectively utilizing the sensor data in combination with the video data on the basis of an attention mechanism that dynamically determines the modality that requires more attention, taking the contextual information into account. We compared the proposed sensor-fusion method with strong baselines on the MMAC Captions dataset and found that using sensor data as supplementary information to the egocentric-video data was beneficial, and that our proposed method outperformed the strong baselines, demonstrating the effectiveness of the proposed method.
    Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors. (arXiv:2109.02993v1 [cs.CV])
    (2 min) Significant advancements in deepfake generation have caused security and privacy concerns. Attackers can easily impersonate a person's identity in an image by replacing their face with the target person's face. Moreover, a new domain of cloning human voices using deep-learning technologies is also emerging. An attacker can now generate realistic cloned voices using only a few seconds of audio of the target person. Given the emerging threat of harm that deepfakes can cause, researchers have proposed deepfake detection methods. However, these methods only focus on detecting a single modality, i.e., either video or audio. On the other hand, to develop a good deepfake detector that can cope with the recent advancements in deepfake generation, we need a detector that can detect deepfakes of multiple modalities, i.e., videos and audios. To build such a detector, we need a dataset that contains both video and corresponding audio deepfakes. We use a recent multimodal deepfake dataset, the Audio-Video Multimodal Deepfake Detection Dataset (FakeAVCeleb), which contains not only deepfake videos but also synthesized fake audios. We perform detailed baseline experiments using state-of-the-art unimodal, ensemble-based, and multimodal detection methods to evaluate it. We conclude through detailed experimentation that unimodal detectors, which address only a single modality (video or audio), do not perform well compared to ensemble-based methods, whereas purely multimodal baselines provide the worst performance.
    Semi-Relaxed Quantization with DropBits: Training Low-Bit Neural Networks via Bit-wise Regularization. (arXiv:1911.12990v3 [cs.CV] UPDATED)
    (2 min) Network quantization, which aims to reduce the bit-lengths of network weights and activations, has emerged as one of the key ingredients for reducing the size of neural networks for deployment to resource-limited devices. To overcome the challenge of transforming continuous activations and weights into discrete ones, a recent study called Relaxed Quantization (RQ) [Louizos et al. 2019] successfully employs the popular Gumbel-Softmax, which allows this transformation with efficient gradient-based optimization. However, RQ with this Gumbel-Softmax relaxation still suffers from a bias-variance trade-off depending on the temperature parameter of Gumbel-Softmax. To resolve the issue, we propose a novel method, Semi-Relaxed Quantization (SRQ), that uses a multi-class straight-through estimator to effectively reduce the bias and variance, along with a new regularization technique, DropBits, that replaces dropout regularization by randomly dropping bits instead of neurons to further reduce the bias of the multi-class straight-through estimator in SRQ. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels to find the proper bit-length for each layer using DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support the quantized lottery ticket hypothesis: learning heterogeneous quantization levels outperforms using the same but fixed quantization levels from scratch.
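    Quantization-aware training methods of this family build on straight-through gradient estimation; the generic sketch below (not the paper's SRQ or DropBits) rounds weights to a k-bit uniform grid in the forward pass while passing gradients through the rounding unchanged.

```python
import torch

class UniformQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, bits=4):
        levels = 2 ** bits - 1
        x = x.clamp(0.0, 1.0)                       # assume inputs normalized to [0, 1]
        return torch.round(x * levels) / levels

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                    # straight-through: identity gradient

w = torch.rand(6, requires_grad=True)
q = UniformQuantSTE.apply(w, 3)                     # 3-bit uniform quantization
q.sum().backward()
print(q)
print(w.grad)                                       # all ones despite the rounding
```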
    Fine-grained Hand Gesture Recognition in Multi-viewpoint Hand Hygiene. (arXiv:2109.02917v1 [cs.CV])
    (2 min) This paper contributes a new high-quality dataset for hand gesture recognition in hand hygiene systems, named "MFH". Generally, current datasets do not focus on: (i) fine-grained actions; and (ii) data mismatch between different viewpoints, both of which arise in realistic settings. To address these issues, the MFH dataset contains a total of 731,147 samples obtained from different camera views in 6 non-overlapping locations. Additionally, each sample belongs to one of the seven steps introduced by the World Health Organization (WHO). As a minor contribution, inspired by advances in fine-grained image recognition and distribution adaptation, this paper recommends using a self-supervised learning method to handle these problems. The extensive experiments on the benchmark MFH dataset show that the introduced method yields competitive performance in both accuracy and macro F1-score. The code and the MFH dataset are available at https://github.com/willogy-team/hand-gesture-recognition-smc2021.
    Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution. (arXiv:2109.03075v1 [cs.CV])
    (2 min) Knowledge distillation (KD) is an effective framework that aims to transfer meaningful information from a large teacher to a smaller student. Generally, KD involves how to define and transfer knowledge. Previous KD methods often focus on mining various forms of knowledge, for example, feature maps and refined information. However, the knowledge is derived from the primary supervised task and is thus highly task-specific. Motivated by the recent success of self-supervised representation learning, we propose an auxiliary self-supervision augmented task to guide networks to learn more meaningful features. Therefore, we can derive soft self-supervision augmented distributions as richer dark knowledge from this task for KD. Unlike previous knowledge, this distribution encodes joint knowledge from supervised and self-supervised feature learning. Beyond knowledge exploration, another crucial aspect is how to learn and distill our proposed knowledge effectively. To take full advantage of hierarchical feature maps, we propose to append several auxiliary branches at various hidden layers. Each auxiliary branch is guided to learn the self-supervision augmented task and distill this distribution from teacher to student. Thus, we call our KD method Hierarchical Self-Supervision Augmented Knowledge Distillation (HSSAKD). Experiments on standard image classification show that both offline and online HSSAKD achieve state-of-the-art performance in the field of KD. Transfer experiments on object detection further verify that HSSAKD can guide the network to learn better features, which can be attributed to learning and distilling the auxiliary self-supervision augmented task effectively.
    Progressive Self-Guided Loss for Salient Object Detection. (arXiv:2101.02412v2 [cs.CV] UPDATED)
    (2 min) We present a simple yet effective progressive self-guided loss function to facilitate deep learning-based salient object detection (SOD) in images. The saliency maps produced by the most relevant works still suffer from incomplete predictions due to the internal complexity of salient objects. Our proposed progressive self-guided loss simulates a morphological closing operation on the model predictions for progressively creating auxiliary training supervisions to step-wisely guide the training process. We demonstrate that this new loss function can guide the SOD model to highlight more complete salient objects step-by-step and meanwhile help to uncover the spatial dependencies of the salient object pixels in a region growing manner. Moreover, a new feature aggregation module is proposed to capture multi-scale features and aggregate them adaptively by a branch-wise attention mechanism. Benefiting from this module, our SOD framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively. Experimental results on several benchmark datasets show that our loss function not only advances the performance of existing SOD models without architecture modification but also helps our proposed framework to achieve state-of-the-art performance.
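    The core self-guidance step can be sketched with a standard morphological closing on the current prediction to synthesize a more complete auxiliary target; the kernel size and threshold below are assumed values, and the snippet is an illustration of the idea rather than the paper's loss.

```python
import numpy as np
from scipy import ndimage

def auxiliary_supervision(pred_map, kernel=5, thr=0.5):
    """pred_map: (H, W) saliency probabilities in [0, 1]."""
    binary = pred_map > thr
    closed = ndimage.binary_closing(binary, structure=np.ones((kernel, kernel)))
    # Use the closed mask where it fills holes/gaps, keep the prediction elsewhere.
    return np.maximum(pred_map, closed.astype(np.float32))

pred = np.clip(np.random.rand(64, 64), 0, 1).astype(np.float32)  # stand-in prediction
aux_target = auxiliary_supervision(pred)
print(aux_target.shape, aux_target.min(), aux_target.max())
```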
    Self-Attentive 3D Human Pose and Shape Estimation from Videos. (arXiv:2103.14182v2 [cs.CV] UPDATED)
    (2 min) We consider the task of estimating 3D human pose and shape from videos. While existing frame-based approaches have made significant progress, these methods are independently applied to each image, thereby often leading to inconsistent predictions. In this work, we present a video-based learning algorithm for 3D human pose and shape estimation. The key insights of our method are two-fold. First, to address the inconsistent temporal prediction issue, we exploit temporal information in videos and propose a self-attention module that jointly considers short-range and long-range dependencies across frames, resulting in temporally coherent estimations. Second, we model human motion with a forecasting module that allows the transition between adjacent frames to be smooth. We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. Extensive experimental results show that our algorithm performs favorably against the state-of-the-art methods.
    End to end hyperspectral imaging system with coded compression imaging process. (arXiv:2109.02643v1 [eess.IV])
    (2 min) Hyperspectral images (HSIs) can provide rich spatial and spectral information with extensive application prospects. Recently, several methods using convolutional neural networks (CNNs) to reconstruct HSIs have been developed. However, most deep learning methods fit a brute-force mapping relationship between the compressive and standard HSIs; thus, the learned mapping becomes invalid when the observation data deviate from the training data. To recover three-dimensional HSIs from two-dimensional compressive images, we present dual-camera equipment with a physics-informed self-supervising CNN method based on a coded aperture snapshot spectral imaging system. Our method effectively exploits the spatial-spectral relationships in the coded spectral information and forms a self-supervising system based on the camera quantum effect model. The experimental results show that our method can be adapted to a wide range of imaging environments with good performance. In addition, compared with most network-based methods, our system does not require a dedicated dataset for pre-training. Therefore, it has greater scenario adaptability and better generalization ability. Meanwhile, our system can be constantly fine-tuned and self-improved in real-life scenarios.
    Fishr: Invariant Gradient Variances for Out-of-distribution Generalization. (arXiv:2109.02934v1 [cs.LG])
    (2 min) Learning robust models that generalize well under changes in the data distribution is critical for real-world applications. To this end, there has been a growing surge of interest in learning simultaneously from multiple training domains while enforcing different types of invariance across those domains. Yet, all existing approaches fail to show systematic benefits under fair evaluation protocols. In this paper, we propose a new learning scheme to enforce domain invariance in the space of the gradients of the loss function: specifically, we introduce a regularization term that matches the domain-level variances of gradients across training domains. Critically, our strategy, named Fishr, exhibits close relations with the Fisher Information and the Hessian of the loss. We show that forcing domain-level gradient covariances to be similar during the learning procedure eventually aligns the domain-level loss landscapes locally around the final weights. Extensive experiments demonstrate the effectiveness of Fishr for out-of-distribution generalization. In particular, Fishr improves the state of the art on the DomainBed benchmark and performs significantly better than Empirical Risk Minimization. The code is released at https://github.com/alexrame/fishr.
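    A stripped-down toy version of gradient-variance matching is shown below (our own simplification on a linear model, not the released Fishr code): per-sample gradients are collected for each domain, their element-wise variance is computed, and the mismatch between domain variances is added to the ERM loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 3)
params = list(model.parameters())

def grad_variance(x, y):
    """Element-wise variance over samples of per-sample loss gradients."""
    per_sample = []
    for xi, yi in zip(x, y):
        loss = F.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0))
        grads = torch.autograd.grad(loss, params, create_graph=True)
        per_sample.append(torch.cat([g.flatten() for g in grads]))
    return torch.stack(per_sample).var(dim=0)

# Three toy training domains, each with its own inputs and labels.
domains = [(torch.randn(16, 10), torch.randint(0, 3, (16,))) for _ in range(3)]

variances = [grad_variance(x, y) for x, y in domains]
mean_var = torch.stack(variances).mean(dim=0)
penalty = sum(((v - mean_var) ** 2).mean() for v in variances)

erm_loss = sum(F.cross_entropy(model(x), y) for x, y in domains) / len(domains)
total = erm_loss + 1.0 * penalty        # the weight 1.0 is an arbitrary choice here
total.backward()
print("total loss:", float(total))
```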
    A New Basis for Sparse Principal Component Analysis. (arXiv:2007.00596v2 [stat.ML] UPDATED)
    (2 min) Previous versions of sparse principal component analysis (PCA) have presumed that the eigen-basis (a $p \times k$ matrix) is approximately sparse. We propose a method that presumes the $p \times k$ matrix becomes approximately sparse after a $k \times k$ rotation. The simplest version of the algorithm initializes with the leading $k$ principal components. Then, the principal components are rotated with a $k \times k$ orthogonal rotation to make them approximately sparse. Finally, soft-thresholding is applied to the rotated principal components. This approach differs from prior approaches because it uses an orthogonal rotation to approximate a sparse basis. One consequence is that a sparse component need not be a leading eigenvector, but rather a mixture of them. In this way, we propose a new (rotated) basis for sparse PCA. In addition, our approach avoids "deflation" and the multiple tuning parameters it requires. Our sparse PCA framework is versatile; for example, it extends naturally to a two-way analysis of a data matrix for simultaneous dimensionality reduction of rows and columns. We provide evidence showing that for the same level of sparsity, the proposed sparse PCA method is more stable and can explain more variance compared to alternative methods. Through three applications -- sparse coding of images, analysis of transcriptome sequencing data, and large-scale clustering of social networks -- we demonstrate the modern usefulness of sparse PCA in exploring multivariate data.
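    The three-step recipe (leading components, orthogonal rotation, soft-thresholding) can be demonstrated numerically; the sketch below uses a varimax rotation on toy Gaussian data with an assumed threshold of 0.05 and is only an illustration of the idea.

```python
import numpy as np

def varimax(Phi, gamma=1.0, n_iter=100, tol=1e-8):
    """Standard varimax rotation of a p x k loading matrix."""
    p, k = Phi.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        L = Phi @ R
        u, s, vt = np.linalg.svd(
            Phi.T @ (L ** 3 - (gamma / p) * L @ np.diag(np.sum(L ** 2, axis=0))))
        R = u @ vt
        d_new = s.sum()
        if d != 0.0 and d_new / d < 1 + tol:
            break
        d = d_new
    return Phi @ R

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X -= X.mean(axis=0)

k = 3
_, _, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:k].T                      # p x k leading eigen-basis
rotated = varimax(loadings)              # orthogonal k x k rotation toward sparsity
sparse = np.sign(rotated) * np.maximum(np.abs(rotated) - 0.05, 0.0)   # soft-threshold
print("non-zeros before/after:", np.count_nonzero(rotated), np.count_nonzero(sparse))
```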
    Learning from Self-Discrepancy via Multiple Co-teaching for Cross-Domain Person Re-Identification. (arXiv:2104.02265v5 [cs.CV] UPDATED)
    (2 min) Employing a clustering strategy to assign unlabeled target images pseudo labels has become a trend for person re-identification (re-ID) algorithms in domain adaptation. A potential limitation of these clustering-based methods is that they tend to introduce noisy labels, which will undoubtedly hamper the performance of the re-ID system. To handle this limitation, an intuitive solution is to utilize collaborative training to purify the pseudo-label quality. However, a challenge remains: the complementarity of the two networks, which inevitably share a high similarity, gradually weakens as training goes on; worse still, these approaches typically fail to consider the self-discrepancy of intra-class relations. To address this issue, in this paper we propose a multiple co-teaching framework for domain adaptive person re-ID, opening up a promising direction for the self-discrepancy problem under unsupervised conditions. On top of that, a mean-teaching mechanism is leveraged to enlarge the difference and discover more complementary features. Comprehensive experiments conducted on several large-scale datasets show that our method achieves competitive performance compared with the state of the art.
    Support Vector Machine for Handwritten Character Recognition. (arXiv:2109.03081v1 [cs.CV])
    (2 min) Handwriting recognition has been one of the most fascinating and challenging research areas in the fields of image processing and pattern recognition. It contributes enormously to the advancement of automation processes. In this paper, a system for the recognition of unconstrained handwritten Malayalam characters is proposed. A database of 10,000 character samples of 44 basic Malayalam characters is used in this work. A discriminative feature set of 64 local and 4 global features is used to train and test an SVM classifier, achieving 92.24% accuracy.
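    An SVM character-recognition baseline in the same spirit can be put together with scikit-learn; the sketch below uses the library's built-in 8x8 digits data rather than the Malayalam character database described in the paper, and the kernel settings are arbitrary.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()                       # 8x8 handwritten digit images
X = digits.images.reshape(len(digits.images), -1)     # flatten to 64 features per sample
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.3, random_state=0)

clf = svm.SVC(kernel="rbf", C=10, gamma=0.001)        # arbitrary but reasonable settings
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```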
    Unpaired Adversarial Learning for Single Image Deraining with Rain-Space Contrastive Constraints. (arXiv:2109.02973v1 [cs.CV])
    (2 min) Deep learning-based single image deraining (SID) with unpaired information is of immense importance, as relying on paired synthetic data often limits generality and scalability in real-world applications. However, we note that directly employing unpaired adversarial learning and cycle-consistency constraints in the SID task is insufficient to learn the underlying relationship from rainy inputs to clean outputs, since the domain knowledge between rainy and rain-free images is asymmetrical. To address this limitation, we develop an effective unpaired SID method which explores mutual properties of the unpaired exemplars in a contrastive learning manner within a GAN framework, named CDR-GAN. The proposed method mainly consists of two cooperative branches: a Bidirectional Translation Branch (BTB) and a Contrastive Guidance Branch (CGB). Specifically, BTB takes full advantage of the cyclic architecture of adversarial consistency to exploit latent feature distributions and guide transfer ability between the two domains by equipping it with bidirectional mapping. Simultaneously, CGB implicitly constrains the embeddings of different exemplars in rain space by encouraging similar feature distributions to be closer while pushing dissimilar ones further away, in order to better assist rain removal and image restoration. During training, we explore several loss functions to further constrain the proposed CDR-GAN. Extensive experiments show that our method performs favorably against existing unpaired deraining approaches on both synthetic and real-world datasets, and even outperforms several fully-supervised or semi-supervised models.
    nnFormer: Interleaved Transformer for Volumetric Segmentation. (arXiv:2109.03201v1 [cs.CV])
    (2 min) Transformers, the default model of choice in natural language processing, have drawn scant attention from the medical imaging community. Given their ability to exploit long-term dependencies, transformers are promising for helping convolutional neural networks (convnets) overcome their inherent shortcomings in spatial inductive bias. However, most recently proposed transformer-based segmentation approaches simply treat transformers as assisted modules to help encode global context into convolutional representations, without investigating how to optimally combine self-attention (i.e., the core of transformers) with convolution. To address this issue, in this paper we introduce nnFormer (i.e., Not-aNother transFormer), a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution. In practice, nnFormer learns volumetric representations from 3D local volumes. Compared to the naive voxel-level self-attention implementation, such volume-based operations help to reduce the computational complexity by approximately 98% and 99.5% on the Synapse and ACDC datasets, respectively. In comparison to prior network configurations, nnFormer achieves tremendous improvements over previous transformer-based methods on the two commonly used datasets Synapse and ACDC. For instance, nnFormer outperforms Swin-UNet by over 7 percentage points on Synapse. Even when compared to nnUNet, currently the best performing fully-convolutional medical segmentation network, nnFormer still provides slightly better performance on Synapse and ACDC.
    Learning to Combine the Modalities of Language and Video for Temporal Moment Localization. (arXiv:2109.02925v1 [cs.CV])
    (2 min) Temporal moment localization aims to retrieve the video segment that best matches a moment specified by a query. Existing methods generate the visual and semantic embeddings independently and fuse them without fully considering the long-term temporal relationship between them. To address these shortcomings, we introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), which mimics the human cognitive process of localizing temporal moments: it focuses on the part of a video segment related to the part of a query and accumulates contextual information across the entire video recurrently. In addition, we devise a two-stream attention mechanism for both attended and unattended video features with respect to the input query, so that necessary visual information is not neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal interaction network (TACI) that generates two 2D proposal maps, obtained globally from the integrated contextual features produced by CM-LSTM and locally from boundary score sequences, and then combines them into a final 2D map in an end-to-end manner. On the TML benchmark dataset ActivityNet-Captions, TACI outperforms state-of-the-art TML methods with R@1 of 45.50% and 27.23% for IoU@0.5 and IoU@0.7, respectively. In addition, we show that revising state-of-the-art methods by replacing their original LSTM with our CM-LSTM yields performance gains.
    Brand Label Albedo Extraction of eCommerce Products using Generative Adversarial Network. (arXiv:2109.02929v1 [cs.CV])
    (2 min) In this paper we present our solution for extracting the albedo of branded labels on e-commerce products. To this end, we generate a large-scale photo-realistic synthetic dataset for albedo extraction, followed by training a generative model to translate images with diverse lighting conditions to albedo. We performed an extensive evaluation to test the generalisation of our method to in-the-wild images. The experimental results show that our solution generalises well compared to the existing method, both on unseen rendered images and on in-the-wild images.
    Perceptual Video Compression with Recurrent Conditional GAN. (arXiv:2109.03082v1 [eess.IV])
    (2 min) This paper proposes a Perceptual Learned Video Compression (PLVC) approach with a recurrent conditional generative adversarial network. In our approach, the recurrent auto-encoder-based generator learns to fully explore temporal correlation for compressing video. More importantly, we propose a recurrent conditional discriminator, which judges raw and compressed video conditioned on both spatial and temporal information, including the latent representation, temporal motion and hidden states in recurrent cells. This way, during adversarial training, it pushes the generated video to be not only spatially photo-realistic but also temporally consistent with the ground truth and coherent across video frames. Therefore, the proposed PLVC model learns to compress video towards good perceptual quality at low bit-rates. The experimental results show that our PLVC approach outperforms previous traditional and learned approaches on several perceptual quality metrics. A user study further validates the outstanding perceptual performance of PLVC in comparison with the latest learned video compression approaches and the official HEVC test model (HM 16.20). The code will be released at https://github.com/RenYang-home/PLVC.
    Automatic Generation of Dense Non-rigid Optical Flow. (arXiv:1812.01946v5 [cs.CV] UPDATED)
    (2 min) There hardly exist any large-scale datasets with dense optical flow of non-rigid motion from real-world imagery as of today. The reason lies mainly in the setup required to derive ground-truth optical flow: a series of images with known camera poses along their trajectory, and an accurate 3D model of a textured scene. Human annotation is not only too tedious for large databases; it can also hardly yield accurate optical flow. To circumvent the need for manual annotation, we propose a framework to automatically generate optical flow from real-world videos. The method extracts and matches objects from video frames to compute initial constraints, and applies a deformation over the objects of interest to obtain dense optical flow fields. We propose several ways to augment the optical flow variations. Extensive experimental results show that training on our automatically generated optical flow outperforms training on rigid synthetic data for FlowNet-S, LiteFlowNet, PWC-Net, and RAFT. Datasets and the implementation of our optical flow generation framework are released at https://github.com/lhoangan/arap_flow
    Kinship Verification Based on Cross-Generation Feature Interaction Learning. (arXiv:2109.02809v1 [cs.CV])
    (2 min) Kinship verification from facial images has been recognized as an emerging yet challenging technique in many potential computer vision applications. In this paper, we propose a novel cross-generation feature interaction learning (CFIL) framework for robust kinship verification. In particular, an effective collaborative weighting strategy is constructed to explore the characteristics of cross-generation relations by jointly extracting features from parent and child image pairs. Specifically, we take parents and children as a whole to extract expressive local and non-local features. Different from traditional works that measure similarity by distance, we interpolate the similarity calculations as interior auxiliary weights into the deep CNN architecture to learn the whole and natural features. These similarity weights not only involve the corresponding single points but also excavate the multiple relationships across points, where local and non-local features are calculated using these two kinds of distance measurements. Importantly, instead of conducting similarity computation and feature extraction separately, we integrate similarity learning and feature extraction into one unified learning process. The integrated representations deduced from local and non-local features can comprehensively express the informative semantics embedded in the images and preserve abundant correlation knowledge from image pairs. Extensive experiments demonstrate the efficiency and superiority of the proposed model compared to several state-of-the-art kinship verification methods.
    Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos. (arXiv:2109.03223v1 [cs.CV])
    (2 min) Out of all existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as combinations of instrument, verb, and target, is highly challenging to identify accurately. Triplet components can be difficult to recognize individually; the task requires not only recognizing all three triplet components simultaneously, but also correctly establishing the data association between them. To achieve this, we introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels. We first introduce a new form of spatial attention to capture individual action triplet components in a scene, called the Class Activation Guided Attention Mechanism (CAGAM). This technique focuses on the recognition of verbs and targets using activations resulting from instruments. To solve the association problem, our RDV model adds a new form of semantic attention inspired by Transformer networks. Using multiple heads of cross- and self-attention, RDV is able to effectively capture relationships between instruments, verbs, and targets. We also introduce CholecT50, a dataset of 50 endoscopic videos in which every frame has been annotated with labels from 100 triplet classes. Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.
    Few-shot Learning via Dependency Maximization and Instance Discriminant Analysis. (arXiv:2109.02820v1 [cs.CV])
    (2 min) We study the few-shot learning (FSL) problem, where a model learns to recognize new objects with extremely few labeled training examples per category. Most previous FSL approaches resort to the meta-learning paradigm, where the model accumulates inductive bias through learning many training tasks so as to solve a new unseen few-shot task. In contrast, we propose a simple approach that exploits the unlabeled data accompanying the few-shot task to improve few-shot performance. First, we propose a Dependency Maximization method based on the Hilbert-Schmidt norm of the cross-covariance operator, which maximizes the statistical dependency between the embedded features of the unlabeled data and their label predictions, together with the supervised loss over the support set. We then use the obtained model to infer pseudo-labels for the unlabeled data. Furthermore, we propose an Instance Discriminant Analysis to evaluate the credibility of each pseudo-labeled example and select the most faithful ones into an augmented support set to retrain the model as in the first step. We iterate the above process until the pseudo-labels for the unlabeled data become stable. Following the standard transductive and semi-supervised FSL settings, our experiments show that the proposed method outperforms previous state-of-the-art methods on four widely used benchmarks, including mini-ImageNet, tiered-ImageNet, CUB, and CIFAR-FS.
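    For readers unfamiliar with the dependency measure, the following is an illustrative (biased) empirical HSIC estimate between embedded features and label predictions; the kernels, bandwidth, and data are placeholders, and the paper's objective adds such a dependency term alongside the supervised loss.

        # Hedged sketch: biased empirical Hilbert-Schmidt Independence Criterion with RBF kernels.
        import numpy as np

        def rbf_kernel(A, sigma=1.0):
            sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
            return np.exp(-sq / (2 * sigma**2))

        def hsic(X, Y, sigma=1.0):
            n = X.shape[0]
            K, L = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
            H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
            return np.trace(K @ H @ L @ H) / (n - 1) ** 2

        X = np.random.randn(64, 32)    # embedded features of unlabeled data (placeholder)
        Y = np.random.randn(64, 5)     # soft label predictions (placeholder)
        print(hsic(X, Y))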
    Statistical analysis of locally parameterized shapes. (arXiv:2109.03027v1 [stat.ME])
    (2 min) The alignment of shapes has been a crucial step in statistical shape analysis, for example, in calculating mean shape, detecting locational differences between two shape populations, and classification. Procrustes alignment is the most commonly used method and state of the art. In this work, we uncover that alignment might seriously affect the statistical analysis. For example, alignment can induce false shape differences and lead to misleading results and interpretations. We propose a novel hierarchical shape parameterization based on local coordinate systems. The locally parameterized shapes are translation and rotation invariant. Thus, the inherent alignment problems from the commonly used global coordinate system for shape representation can be avoided using this parameterization. The new parameterization is also superior for shape deformation and simulation. The method's power is demonstrated on the hypothesis testing of simulated data as well as the left hippocampi of patients with Parkinson's disease and controls.
    Journalistic Guidelines Aware News Image Captioning. (arXiv:2109.02865v1 [cs.CV])
    (2 min) The task of news article image captioning aims to generate descriptive and informative captions for news article images. Unlike conventional image captions that simply describe the content of the image in general terms, news image captions follow journalistic guidelines and rely heavily on named entities to describe the image content, often drawing context from the whole article they are associated with. In this work, we propose a new approach to this task, motivated by caption guidelines that journalists follow. Our approach, Journalistic Guidelines Aware News Image Captioning (JoGANIC), leverages the structure of captions to improve the generation quality and guide our representation design. Experimental results, including detailed ablation studies, on two large-scale publicly available datasets show that JoGANIC substantially outperforms state-of-the-art methods both on caption generation and named entity related metrics.
    FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting. (arXiv:2109.02974v1 [cs.CV])
    (2 min) The Transformer, as a strong and flexible architecture for modelling long-range relations, has been widely explored in vision tasks. However, when used in video inpainting, which requires fine-grained representation, existing methods still suffer from blurry edges in detail due to hard patch splitting. Here we aim to tackle this problem by proposing FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion based on novel Soft Split and Soft Composition operations. The soft split divides the feature map into many patches with a given overlapping interval. The soft composition, in turn, stitches the different patches back into a whole feature map, where pixels in overlapping regions are summed up. These two modules are used in tokenization before the Transformer layers and in de-tokenization after them, for effective mapping between tokens and features. As a result, sub-patch level information interaction is enabled for more effective feature propagation between neighboring patches, resulting in synthesized vivid content for hole regions in videos. Moreover, in FuseFormer, we elaborately insert the soft composition and soft split into the feed-forward network, enabling the 1D linear layers to model 2D structure and further enhancing sub-patch level feature fusion. In both quantitative and qualitative evaluations, our proposed FuseFormer surpasses state-of-the-art methods. We also conduct a detailed analysis to examine its superiority.
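    A rough sketch of the soft split / soft composition idea using overlapping patch extraction and re-stitching (unfold/fold); the shapes, kernel size, and stride are assumptions for illustration, not the released FuseFormer code.

        # Hedged sketch: overlapping "soft split" and summing "soft composition" on a feature map.
        import torch
        import torch.nn.functional as F

        x = torch.randn(2, 64, 32, 32)            # (B, C, H, W) feature map
        kernel, stride = (8, 8), (4, 4)           # stride < kernel gives the overlapping interval

        patches = F.unfold(x, kernel_size=kernel, stride=stride)    # soft split: (B, C*8*8, L)
        # ... tokens would pass through Transformer layers here ...
        recomposed = F.fold(patches, output_size=(32, 32),
                            kernel_size=kernel, stride=stride)      # soft composition: overlaps are summed

        # Dividing by the per-pixel overlap count turns the sum into an average.
        counts = F.fold(torch.ones_like(patches), output_size=(32, 32),
                        kernel_size=kernel, stride=stride)
        normalized = recomposed / counts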
    Improving Dietary Assessment Via Integrated Hierarchy Food Classification. (arXiv:2109.02736v1 [cs.CV])
    (2 min) Image-based dietary assessment refers to the process of determining what someone eats and how much energy and nutrients are consumed from visual data. Food classification is the first and most crucial step. Existing methods focus on improving accuracy measured by the rate of correct classification based on visual information alone, which is very challenging due to the high complexity and inter-class similarity of foods. Further, accuracy in food classification is conceptual, as the description of a food can always be refined. In this work, we introduce a new food classification framework to improve the quality of predictions by integrating information from multiple domains while maintaining classification accuracy. We apply a multi-task network based on a hierarchical structure that uses both visual and nutrition-domain-specific information to cluster similar foods. Our method is validated on the modified VIPER-FoodNet (VFN) food image dataset by including associated energy and nutrient information. We achieve comparable classification accuracy with existing methods that use visual information only, but with lower error in energy and nutrient values for incorrect predictions.
    Improving Phenotype Prediction using Long-Range Spatio-Temporal Dynamics of Functional Connectivity. (arXiv:2109.03115v1 [q-bio.NC])
    (2 min) The study of functional brain connectivity (FC) is important for understanding the underlying mechanisms of many psychiatric disorders. Many recent analyses adopt graph convolutional networks to study non-linear interactions between functionally-correlated states. However, although patterns of brain activation are known to be hierarchically organised in both space and time, many methods have failed to extract powerful spatio-temporal features. To overcome those challenges, and improve understanding of long-range functional dynamics, we translate an approach, from the domain of skeleton-based action recognition, designed to model interactions across space and time. We evaluate this approach using the Human Connectome Project (HCP) dataset on sex classification and fluid intelligence prediction. To account for subject topographic variability of functional organisation, we modelled functional connectomes using multi-resolution dual-regressed (subject-specific) ICA nodes. Results show a prediction accuracy of 94.4% for sex classification (an increase of 6.2% compared to other methods), and an improvement in correlation with fluid intelligence of 0.325 vs 0.144, relative to a baseline model that encodes space and time separately. Results suggest that explicit encoding of spatio-temporal dynamics of brain functional activity may improve the precision with which behavioural and cognitive phenotypes may be predicted in the future.
    ICCAD Special Session Paper: Quantum-Classical Hybrid Machine Learning for Image Classification. (arXiv:2109.02862v1 [cs.CV])
    (2 min) Image classification is a major application domain for conventional deep learning (DL). Quantum machine learning (QML) has the potential to revolutionize image classification. In a typical DL-based image classification pipeline, a convolutional neural network (CNN) extracts features from the image and a multi-layer perceptron (MLP) creates the actual decision boundaries. QML models can be useful in both of these tasks: convolution with parameterized quantum circuits (Quanvolution) can extract rich features from the images, while quantum neural network (QNN) models can create complex decision boundaries. Therefore, Quanvolution and QNN can be used to create an end-to-end QML model for image classification. Alternatively, we can extract image features separately using classical dimension reduction techniques such as Principal Component Analysis (PCA) or a Convolutional Autoencoder (CAE) and use the extracted features to train a QNN. We review two proposals for quantum-classical hybrid ML models for image classification, namely the Quanvolutional Neural Network and dimension reduction using a classical algorithm followed by a QNN. In particular, we make a case for trainable filters in Quanvolution and for CAE-based feature extraction for image datasets (instead of dimension reduction using linear transformations such as PCA). We discuss various design choices, potential opportunities, and drawbacks of these models. We also release a Python-based framework to create and explore these hybrid models with a variety of design choices.
    Deep SIMBAD: Active Landmark-based Self-localization Using Ranking-based Scene Descriptor. (arXiv:2109.02786v1 [cs.RO])
    (2 min) Landmark-based robot self-localization has recently garnered interest as a highly-compressive, domain-invariant approach for performing visual place recognition (VPR) across domains (e.g., time of day, weather, and season). However, landmark-based self-localization can be an ill-posed problem for a passive observer (e.g., under manual robot control), as many viewpoints may not provide an effective landmark view. In this study, we consider an active self-localization task by an active observer and present a novel reinforcement learning (RL)-based next-best-view (NBV) planner. Our contributions are as follows. (1) SIMBAD-based VPR: we formulate the problem of landmark-based compact scene description as SIMBAD (similarity-based pattern recognition) and further present its deep learning extension. (2) VPR-to-NBV knowledge transfer: we address the challenge of RL under uncertainty (i.e., active self-localization) by transferring the state recognition ability of VPR to the NBV planner. (3) NNQL-based NBV: we regard the available VPR as the experience database by adapting a nearest-neighbor approximation of Q-learning (NNQL). The result is an extremely compact data structure that compresses both the VPR and NBV into a single incremental inverted index. Experiments using the public NCLT dataset validate the effectiveness of the proposed approach.
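    As a toy illustration of the nearest-neighbor Q-value approximation mentioned above (state features, actions, and values are placeholders, and the paper's inverted-index storage is not reproduced here):

        # Hedged sketch: Q(s, a) estimated from the k nearest stored experiences.
        import numpy as np

        class NNQ:
            def __init__(self, k=5):
                self.k = k
                self.states, self.actions, self.qvals = [], [], []

            def add(self, state, action, q):
                self.states.append(np.asarray(state, float))
                self.actions.append(action)
                self.qvals.append(q)

            def q(self, state, action):
                idx = [i for i, a in enumerate(self.actions) if a == action]
                if not idx:
                    return 0.0
                S = np.stack([self.states[i] for i in idx])
                d = np.linalg.norm(S - np.asarray(state, float), axis=1)
                nearest = np.argsort(d)[: self.k]
                return float(np.mean([self.qvals[idx[i]] for i in nearest]))

        memory = NNQ(k=3)
        memory.add([0.1, 0.2], action=0, q=1.0)
        memory.add([0.15, 0.25], action=0, q=0.8)
        print(memory.q([0.12, 0.22], action=0))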
    Zero-Shot Open Set Detection by Extending CLIP. (arXiv:2109.02748v1 [cs.CV])
    (2 min) In a regular open set detection problem, samples of known classes (also called closed set classes) are used to train a special classifier. In testing, the classifier can (1) classify the test samples of known classes to their respective classes and (2) also detect samples that do not belong to any of the known classes (we say they belong to some unknown or open set classes). This paper studies the problem of zero-shot open-set detection, which still performs the same two tasks in testing but has no training except using the given known class names. This paper proposes a novel and yet simple method (called ZO-CLIP) to solve the problem. ZO-CLIP builds on top of the recent advances in zero-shot classification through multi-modal representation learning. It first extends the pre-trained multi-modal model CLIP by training a text-based image description generator on top of CLIP. In testing, it uses the extended model to generate some candidate unknown class names for each test sample and computes a confidence score based on both the known class names and candidate unknown class names for zero-shot open set detection. Experimental results on 5 benchmark datasets for open set detection confirm that ZO-CLIP outperforms the baselines by a large margin.
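    The abstract leaves the exact scoring rule to the paper; one plausible sketch (assumed score design, random placeholder embeddings rather than real CLIP outputs) of turning similarities to known and candidate-unknown class-name embeddings into an open-set confidence is:

        # Hedged sketch: softmax over image-text similarities; mass on known names = "known" confidence.
        import numpy as np

        def open_set_score(image_emb, known_text_embs, candidate_text_embs, temperature=0.01):
            def norm(v):
                return v / np.linalg.norm(v, axis=-1, keepdims=True)
            sims = norm(np.vstack([known_text_embs, candidate_text_embs])) @ norm(image_emb)
            probs = np.exp(sims / temperature)
            probs /= probs.sum()
            known_mass = probs[: len(known_text_embs)].sum()
            return known_mass, int(probs[: len(known_text_embs)].argmax())

        img = np.random.randn(512)              # placeholder image embedding
        known = np.random.randn(10, 512)        # embeddings of 10 known class names
        candidates = np.random.randn(5, 512)    # embeddings of generated candidate unknown names
        conf, pred = open_set_score(img, known, candidates)
        print("known-class confidence:", conf, "predicted known class:", pred)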
    DeepFakes: Detecting Forged and Synthetic Media Content Using Machine Learning. (arXiv:2109.02874v1 [cs.CV])
    (2 min) The rapid advancement of deep learning makes differentiating authentic facial images and video clips from manipulated ones unprecedentedly harder. The underlying technology for manipulating facial appearance through deep generative approaches, known as DeepFake, has emerged recently and has fostered a vast number of malicious face manipulation applications. Consequently, techniques that can assess the integrity of digital visual content are indispensable to reduce the impact of DeepFake creations. A large body of research on DeepFake creation and detection keeps pushing both sides beyond the current state of the art. This study presents challenges, research trends, and directions related to DeepFake creation and detection techniques by reviewing notable research in the DeepFake domain, to facilitate the development of more robust approaches that could deal with more advanced DeepFakes in the future.
    Crash Report Data Analysis for Creating Scenario-Wise, Spatio-Temporal Attention Guidance to Support Computer Vision-based Perception of Fatal Crash Risks. (arXiv:2109.02710v1 [stat.AP])
    (3 min) Reducing traffic fatalities and serious injuries is a top priority of the US Department of Transportation. Computer vision (CV)-based crash anticipation in the near-crash phase is receiving growing attention. The ability to perceive fatal crash risks earlier is also critical because it will improve the reliability of crash anticipation. Yet, annotated image data for training a reliable AI model for the early visual perception of crash risks are not abundant. The Fatality Analysis Reporting System contains big data on fatal crashes. It is a reliable data source for learning the relationship between driving scene characteristics and fatal crashes, to compensate for the limitations of CV. Therefore, this paper develops a data analytics model, named scenario-wise, spatio-temporal attention guidance, from fatal crash report data, which can estimate the relevance of detected objects to fatal crashes from their environment and context information. First, the paper identifies five sparse variables that allow decomposing the 5-year fatal crash dataset to develop scenario-wise attention guidance. Then, exploratory analysis of the location- and time-related variables of the crash report data suggests grouping fatal crashes into spatially defined groups. A group's temporal pattern is an indicator of the similarity of fatal crashes within the group. Hierarchical clustering and K-means clustering merge the spatially defined groups into six clusters according to the similarity of their temporal patterns. After that, association rule mining discovers the statistical relationship between the temporal information of driving scenes and crash features for each cluster. The paper shows how the developed attention guidance supports the design and implementation of a preliminary CV model that can identify objects likely to be involved in fatal crashes from their environment and context information.
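    A simplified sketch of the clustering step described above, with the spatial groups and their temporal pattern vectors stubbed out (the real variables come from the fatal crash report data):

        # Hedged sketch: merge spatially defined groups into six clusters by their temporal patterns.
        import numpy as np
        from sklearn.cluster import AgglomerativeClustering, KMeans

        temporal_patterns = np.random.rand(40, 24)     # 40 spatial groups x 24 hourly bins (placeholder)
        temporal_patterns /= temporal_patterns.sum(axis=1, keepdims=True)

        hier = AgglomerativeClustering(n_clusters=6).fit(temporal_patterns)
        km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(temporal_patterns)
        print(hier.labels_, km.labels_)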
    Graph Attention Layer Evolves Semantic Segmentation for Road Pothole Detection: A Benchmark and Algorithms. (arXiv:2109.02711v1 [cs.CV])
    (2 min) Existing road pothole detection approaches can be classified as computer vision-based or machine learning-based. The former approaches typically employ 2-D image analysis/understanding or 3-D point cloud modeling and segmentation algorithms to detect road potholes from vision sensor data. The latter approaches generally address road pothole detection using convolutional neural networks (CNNs) in an end-to-end manner. However, road potholes are not necessarily ubiquitous, and it is challenging to prepare a large well-annotated dataset for CNN training. In this regard, while computer vision-based methods were the mainstream research trend in the past decade, machine learning-based methods were rarely discussed. Recently, we published the first stereo vision-based road pothole detection dataset and a novel disparity transformation algorithm, whereby the damaged and undamaged road areas can be clearly distinguished. However, there are no benchmarks currently available for state-of-the-art (SoTA) CNNs trained using either disparity images or transformed disparity images. Therefore, in this paper, we first discuss the SoTA CNNs designed for semantic segmentation and evaluate their performance for road pothole detection with extensive experiments. Additionally, inspired by graph neural networks (GNNs), we propose a novel CNN layer, referred to as the graph attention layer (GAL), which can be easily deployed in any existing CNN to optimize image feature representations for semantic segmentation. Our experiments compare GAL-DeepLabv3+, our best-performing implementation, with nine SoTA CNNs on three modalities of training data: RGB images, disparity images, and transformed disparity images. The experimental results suggest that our proposed GAL-DeepLabv3+ achieves the best overall pothole detection accuracy on all training data modalities.
    Automatic Landmarks Correspondence Detection in Medical Images with an Application to Deformable Image Registration. (arXiv:2109.02722v1 [cs.CV])
    (2 min) Deformable Image Registration (DIR) can benefit from additional guidance using corresponding landmarks in the images. However, the benefits thereof are largely understudied, especially due to the lack of automatic detection methods for corresponding landmarks in three-dimensional (3D) medical images. In this work, we present a Deep Convolutional Neural Network (DCNN), called DCNN-Match, that learns to predict landmark correspondences in 3D images in a self-supervised manner. We explored five variants of DCNN-Match that use different loss functions and tested DCNN-Match separately as well as in combination with the open-source registration software Elastix to assess its impact on a common DIR approach. We employed lower-abdominal Computed Tomography (CT) scans from cervical cancer patients: 121 pelvic CT scan pairs containing simulated elastic transformations and 11 pairs demonstrating clinical deformations. Our results show significant improvement in DIR performance when landmark correspondences predicted by DCNN-Match were used in case of simulated as well as clinical deformations. We also observed that the spatial distribution of the automatically identified landmarks and the associated matching errors affect the extent of improvement in DIR. Finally, DCNN-Match was found to generalize well to Magnetic Resonance Imaging (MRI) scans without requiring retraining, indicating easy applicability to other datasets.
    Robustness and Generalization via Generative Adversarial Training. (arXiv:2109.02765v1 [cs.CV])
    (2 min) While deep neural networks have achieved remarkable success in various computer vision tasks, they often fail to generalize to new domains and subtle variations of input images. Several defenses have been proposed to improve the robustness against these variations. However, current defenses can only withstand the specific attack used in training, and the models often remain vulnerable to other input variations. Moreover, these methods often degrade the performance of the model on clean images and do not generalize to out-of-domain samples. In this paper we present Generative Adversarial Training, an approach that simultaneously improves the model's generalization to the test set and to out-of-domain samples as well as its robustness to unseen adversarial attacks. Instead of altering a low-level, pre-defined aspect of images, we generate a spectrum of low-level, mid-level and high-level changes using generative models with a disentangled latent space. Adversarial training with these examples enables the model to withstand a wide range of attacks by observing a variety of input alterations during training. We show that our approach not only improves the performance of the model on clean images and out-of-domain samples but also makes it robust against unforeseen attacks, outperforming prior work. We validate the effectiveness of our method by demonstrating results on various tasks such as classification, segmentation and object detection.
    CIM: Class-Irrelevant Mapping for Few-Shot Classification. (arXiv:2109.02840v1 [cs.CV])
    (2 min) Few-shot classification (FSC) has been one of the most actively studied problems in recent years. The general setting consists of two phases: (1) pre-train a feature extraction model (FEM) with base data (which has large amounts of labeled samples); (2) use the FEM to extract the features of novel data (which has few labeled samples and categories entirely different from the base data), then classify them with a to-be-designed classifier. The adaptability of the pre-trained FEM to novel data determines the accuracy of the novel features and thereby affects the final classification performance. To this end, how to appraise the pre-trained FEM is the most crucial question in the FSC community. It would seem that traditional Class Activation Mapping (CAM) based methods can achieve this by overlaying weighted feature maps. However, due to the particularity of FSC (e.g., there is no backpropagation when using the pre-trained FEM to extract novel features), we cannot activate the feature map with the novel classes. To address this challenge, we propose a simple, flexible method, dubbed Class-Irrelevant Mapping (CIM). Specifically, we first introduce dictionary learning theory and view the channels of the feature map as the bases of a dictionary. We then utilize the feature map to fit the feature vector of an image and obtain the corresponding channel weights. Finally, we overlay the weighted feature map for visualization to appraise the ability of the pre-trained FEM on novel data. For fair use of CIM in evaluating different models, we propose a new measurement index, called Feature Localization Accuracy (FLA). In experiments, we first compare our CIM with CAM in regular tasks and achieve outstanding performance. Next, we use our CIM to appraise several classical FSC frameworks without considering the classification results and discuss them.
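    The dictionary-fit idea can be sketched as a least-squares problem: treat the C feature-map channels as bases and solve for channel weights that reconstruct a target feature vector, then overlay the weighted channels as a localization map. The target vector and normalization below are placeholders, not the paper's exact formulation.

        # Hedged sketch: fit channel weights by least squares and overlay the weighted feature map.
        import numpy as np

        C, H, W = 64, 14, 14
        feature_map = np.random.randn(C, H, W)         # placeholder FEM output
        target = np.random.randn(H * W)                # placeholder target feature vector

        D = feature_map.reshape(C, -1).T               # (H*W, C) dictionary of channel bases
        weights, *_ = np.linalg.lstsq(D, target, rcond=None)

        localization = (weights[:, None, None] * feature_map).sum(axis=0)        # (H, W) map
        localization = (localization - localization.min()) / (np.ptp(localization) + 1e-8)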
    STRIVE: Scene Text Replacement In Videos. (arXiv:2109.02762v1 [cs.CV])
    (2 min) We propose replacing scene text in videos using deep style transfer and learned photometric transformations. Building on recent progress on still-image text replacement, we present extensions that alter text while preserving the appearance and motion characteristics of the original video. Compared to the problem of still-image text replacement, our method addresses additional challenges introduced by video, namely effects induced by changing lighting, motion blur, diverse variations in camera-object pose over time, and preservation of temporal consistency. We parse the problem into three steps. First, the text in all frames is normalized to a frontal pose using a spatio-temporal transformer network. Second, the text is replaced in a single reference frame using a state-of-art still-image text replacement method. Finally, the new text is transferred from the reference to the remaining frames using a novel learned image transformation network that captures lighting and blur effects in a temporally consistent manner. Results on synthetic and challenging real videos show realistic text transfer, competitive quantitative and qualitative performance, and superior inference speed relative to alternatives. We introduce new synthetic and real-world datasets with paired text objects. To the best of our knowledge this is the first attempt at deep video text replacement.
    Single-Camera 3D Head Fitting for Mixed Reality Clinical Applications. (arXiv:2109.02740v1 [cs.CV])
    (2 min) We address the problem of estimating the shape of a person's head, defined as the geometry of the complete head surface, from a video taken with a single moving camera, and determining the alignment of the fitted 3D head for all video frames, irrespective of the person's pose. 3D head reconstructions commonly tend to focus on perfecting the face reconstruction, leaving the scalp to a statistical approximation. Our goal is to reconstruct the head model of each person to enable future mixed reality applications. To do this, we recover a dense 3D reconstruction and camera information via structure-from-motion and multi-view stereo. These are then used in a new two-stage fitting process to recover the 3D head shape by iteratively fitting a 3D morphable model of the head with the dense reconstruction in canonical space and fitting it to each person's head, using both traditional facial landmarks and scalp features extracted from the head's segmentation mask. Our approach recovers consistent geometry for varying head shapes, from videos taken by different people, with different smartphones, and in a variety of environments from living rooms to outdoor spaces.
    Pano3D: A Holistic Benchmark and a Solid Baseline for $360^o$ Depth Estimation. (arXiv:2109.02749v1 [cs.CV])
    (2 min) Pano3D is a new benchmark for depth estimation from spherical panoramas. It aims to assess performance across all depth estimation traits: the primary trait of direct depth estimation performance, targeting precision and accuracy, as well as the secondary traits of boundary preservation and smoothness. Moreover, Pano3D moves beyond typical intra-dataset evaluation to inter-dataset performance assessment. By disentangling the capacity to generalize to unseen data into different test splits, Pano3D represents a holistic benchmark for $360^o$ depth estimation. We use it as a basis for an extended analysis seeking to offer insights into classical choices for depth estimation. This results in a solid baseline for panoramic depth that follow-up works can build upon to steer future progress.
    Vision Transformers For Weeds and Crops Classification Of High Resolution UAV Images. (arXiv:2109.02716v1 [cs.CV])
    (2 min) Crop and weed monitoring is an important challenge for agriculture and food production nowadays. Thanks to recent advances in data acquisition and computation technologies, agriculture is evolving toward smarter, more precise farming to meet the demand for high-yield, high-quality crop production. Classification and recognition in Unmanned Aerial Vehicle (UAV) images are important phases of crop monitoring. Deep learning models based on Convolutional Neural Networks (CNNs) have achieved high performance in image classification in the agricultural domain. Despite this success, CNNs still face many challenges, such as high computation cost and the need for large labelled datasets. The transformer architecture from natural language processing offers an alternative approach to deal with these limitations. Making use of the self-attention paradigm, Vision Transformer (ViT) models can achieve competitive or better results without applying any convolution operations. In this paper, we adopt the self-attention mechanism via ViT models for the classification of weed and crop plants: red beet, off-type beet (green leaves), parsley and spinach. Our experiments show that, with a small set of labelled training data, ViT models perform better than the state-of-the-art CNN-based models EfficientNet and ResNet, with a top accuracy of 99.8% achieved by the ViT model.
    GCsT: Graph Convolutional Skeleton Transformer for Action Recognition. (arXiv:2109.02860v1 [cs.CV])
    (2 min) Graph convolutional networks (GCNs) achieve promising performance for skeleton-based action recognition. However, in most GCN-based methods, the spatial-temporal graph convolution is strictly restricted by the graph topology and only captures short-term temporal context, thus lacking flexibility in feature extraction. In this work, we present a novel architecture, named Graph Convolutional skeleton Transformer (GCsT), which addresses these limitations of GCNs by introducing Transformers. Our GCsT enjoys all the benefits of Transformers (i.e., dynamic attention and global context) while keeping the advantages of GCNs (i.e., hierarchy and local topology structure). In GCsT, the spatial-temporal GCN forces the capture of local dependencies while the Transformer dynamically extracts global spatial-temporal relationships. Furthermore, the proposed GCsT shows stronger expressive capability by incorporating additional information present in skeleton sequences; the Transformer allows such information to be introduced into the model almost effortlessly. We validate the proposed GCsT by conducting extensive experiments, achieving state-of-the-art performance on the NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA datasets.
    Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds. (arXiv:2109.02763v1 [cs.SD])
    (2 min) Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene understanding purely based on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this aim, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision teacher methods and a sound student method -- the student method is trained to generate the same results as the teacher methods do. This way, the auditory system can be trained without using human annotations. To further boost the performance, we propose another novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results show that 1) our method achieves good results for all four tasks, 2) the four tasks are mutually beneficial -- training them together achieves the best performance, 3) the number and orientation of microphones are both important, and 4) features learned from the standard spectrogram and features obtained by the classic signal processing pipeline are complementary for auditory perception tasks. The data and code are released.
    WhyAct: Identifying Action Reasons in Lifestyle Vlogs. (arXiv:2109.02747v1 [cs.CV])
    (2 min) We aim to automatically identify human action reasons in online videos. We focus on the widespread genre of lifestyle vlogs, in which people perform actions while verbally describing them. We introduce and make publicly available the WhyAct dataset, consisting of 1,077 visual actions manually annotated with their reasons. We describe a multimodal model that leverages visual and textual information to automatically infer the reasons corresponding to an action presented in the video.
  • cs.IR updates on arXiv.org

    Intuitive Contrasting Map for Antonym Embeddings. (arXiv:2004.12835v2 [cs.CL] UPDATED)
    (2 min) This paper shows that modern word embeddings contain information that distinguishes synonyms and antonyms despite small cosine similarities between the corresponding vectors. This information is encoded in the geometry of the embeddings and can be extracted with a straightforward and intuitive manifold learning procedure, or a contrasting map. Such a map is trained on a small labeled subset of the data and can produce new embeddings that explicitly highlight specific semantic attributes of a word. The new embeddings produced by the map are shown to improve performance on downstream tasks.
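    As a small illustration of the claim that the synonym/antonym signal is extractable from embedding geometry with a simple supervised map (this is not the paper's contrasting map; features, data, and model are placeholders):

        # Hedged sketch: classify pairs as synonym vs. antonym from simple embedding-pair features.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)

        def pair_features(e1, e2):
            return np.concatenate([np.abs(e1 - e2), e1 * e2], axis=-1)

        # placeholder embeddings for a small labeled subset (1 = synonym, 0 = antonym)
        e1, e2 = rng.normal(size=(200, 300)), rng.normal(size=(200, 300))
        labels = rng.integers(0, 2, size=200)

        clf = LogisticRegression(max_iter=1000).fit(pair_features(e1, e2), labels)
        print(clf.predict(pair_features(e1[:5], e2[:5])))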
    Recommendation Fairness: From Static to Dynamic. (arXiv:2109.03150v1 [cs.IR])
    (2 min) Driven by the need to capture users' evolving interests and optimize their long-term experiences, more and more recommender systems have started to model recommendation as a Markov decision process and employ reinforcement learning to address the problem. Shouldn't research on the fairness of recommender systems follow the same trend from static evaluation and one-shot intervention to dynamic monitoring and non-stop control? In this paper, we portray the recent developments in recommender systems first and then discuss how fairness could be baked into the reinforcement learning techniques for recommendation. Moreover, we argue that in order to make further progress in recommendation fairness, we may want to consider multi-agent (game-theoretic) optimization, multi-objective (Pareto) optimization, and simulation-based optimization, in the general framework of stochastic games.
    PEEK: A Large Dataset of Learner Engagement with Educational Videos. (arXiv:2109.03154v1 [cs.IR])
    (2 min) Educational recommenders have received much less attention than e-commerce and entertainment-related recommenders, even though efficient intelligent tutors have great potential to improve learning gains. One of the main challenges in advancing this research direction is the scarcity of large, publicly available datasets. In this work, we release a large, novel dataset of learners engaging with educational videos in the wild. The dataset, named Personalised Educational Engagement with Knowledge Topics (PEEK), is the first publicly available dataset of this nature. The video lectures have been associated with Wikipedia concepts related to the material of the lecture, thus providing a humanly intuitive taxonomy. We believe that granular learner engagement signals, in unison with rich content representations, will pave the way to building powerful personalization algorithms that will revolutionise educational and informational recommendation systems. Towards this goal, we (1) construct a novel dataset from a popular video lecture repository, (2) identify a set of benchmark algorithms to model engagement, and (3) run extensive experimentation on the PEEK dataset to demonstrate its value. Our experiments with the dataset show promise for building powerful informational recommender systems. The dataset and the support code are available publicly.
    Solving Fashion Recommendation -- The Farfetch Challenge. (arXiv:2108.01314v2 [cs.LG] UPDATED)
    (2 min) Recommendation engines are integral to the modern e-commerce experience, both for the seller and the end user. Accurate recommendations lead to higher revenue and a better user experience. In this paper, we present our solution to the ECML PKDD Farfetch Fashion Recommendation Challenge. The goal of this challenge is to maximize the chances of a click when users are presented with a set of fashion items. We approach this problem as a binary classification problem. Our winning solution utilizes CatBoost as the classifier and Bayesian optimization for hyperparameter tuning. Our baseline model achieved an MRR of 0.5153 on the validation set. Bayesian optimization of the hyperparameters improved the MRR to 0.5240 on the validation set. Our final submission on the test set achieved an MRR of 0.5257.
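    A minimal sketch of the model side of such a solution (feature matrix, labels, and hyperparameter values are placeholders; the actual search space and Bayesian optimizer are not shown):

        # Hedged sketch: CatBoost click classifier; hyperparameters like these would be tuned
        # by Bayesian optimization on a validation set.
        import numpy as np
        from catboost import CatBoostClassifier

        X = np.random.randn(1000, 20)              # placeholder user/item/context features
        y = np.random.randint(0, 2, size=1000)     # click / no-click labels

        model = CatBoostClassifier(
            iterations=500,
            depth=6,
            learning_rate=0.05,
            loss_function="Logloss",
            verbose=False,
        )
        model.fit(X, y)
        scores = model.predict_proba(X)[:, 1]      # ranking scores for MRR-style evaluation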
    Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval. (arXiv:2109.02789v1 [cs.IR])
    (2 min) Pretrained contextualized representations offer great success for many downstream tasks, including document ranking. Multilingual versions of such pretrained representations provide the possibility of jointly learning many languages with the same model. Although such joint training is expected to bring large gains, in the case of cross-lingual information retrieval (CLIR), models under a multilingual setting do not achieve the same level of performance as those under a monolingual setting. We hypothesize that the performance drop is due to the translation gap between queries and documents. In the monolingual retrieval task, because of the shared lexical inputs, it is easier for the model to identify the query terms that occur in documents. However, in multilingual pretrained models, where words from different languages are projected into the same hyperspace, the model tends to translate query terms into related terms, i.e., terms that appear in a similar context, in addition to or sometimes rather than synonyms in the target language. This property creates difficulties for the model in connecting terms that co-occur in both query and document. To address this issue, we propose a novel Mixed Attention Transformer (MAT) that incorporates external word-level knowledge, such as a dictionary or translation table. We design a sandwich-like architecture to embed MAT into recent transformer-based deep neural models. By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence. Experimental results demonstrate the effectiveness of the external knowledge and the significant improvement of the MAT-embedded neural reranking model on the CLIR task.
    Recommending Burgers based on Pizza Preferences: Addressing Data Sparsity with a Product of Experts. (arXiv:2104.12822v3 [cs.IR] UPDATED)
    (2 min) In this paper, we describe a method to tackle data sparsity and create recommendations in domains with limited knowledge about user preferences. We expand the variational autoencoder collaborative filtering from a single-domain to a multi-domain setting. The intuition is that user-item interactions in a source domain can augment the recommendation quality in a target domain. The intuition can be taken to its extreme, where, in a cross-domain setup, the user history in a source domain is enough to generate high-quality recommendations in a target one. We thus create a Product-of-Experts (POE) architecture for recommendations that jointly models user-item interactions across multiple domains. The method is resilient to missing data for one or more of the domains, which is a situation often found in real life. We present results on two widely-used datasets - Amazon and Yelp, which support the claim that holistic user preference knowledge leads to better recommendations. Surprisingly, we find that in some cases, a POE recommender that does not access the target domain user representation can surpass a strong VAE recommender baseline trained on the target domain.
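    The product-of-experts combination itself is standard for diagonal Gaussian posteriors: precisions add, and the mean is precision-weighted. The sketch below shows only that combination step (encoders, decoders, and training are omitted, and the inclusion of a standard-normal prior expert is an assumption):

        # Hedged sketch: combine per-domain Gaussian posteriors with a product of experts.
        import torch

        def product_of_experts(mus, logvars):
            """mus, logvars: lists of (B, D) tensors, one per available domain."""
            mus = [torch.zeros_like(mus[0])] + list(mus)          # N(0, I) prior expert
            logvars = [torch.zeros_like(logvars[0])] + list(logvars)
            precisions = [torch.exp(-lv) for lv in logvars]
            total_precision = sum(precisions)
            mu = sum(p * m for p, m in zip(precisions, mus)) / total_precision
            logvar = -torch.log(total_precision)
            return mu, logvar

        mu, logvar = product_of_experts(
            [torch.randn(8, 32), torch.randn(8, 32)],     # e.g. source- and target-domain encoders
            [torch.zeros(8, 32), torch.zeros(8, 32)],
        )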
    Refining BERT Embeddings for Document Hashing via Mutual Information Maximization. (arXiv:2109.02867v1 [cs.IR])
    (2 min) Existing unsupervised document hashing methods are mostly established on generative models. Due to the difficulty of capturing long dependency structures, these methods rarely model the raw documents directly, but instead model features extracted from them (e.g., bag-of-words (BOW), TFIDF). In this paper, we propose to learn hash codes from BERT embeddings after observing their tremendous successes on downstream tasks. As a first attempt, we modify existing generative hashing models to accommodate the BERT embeddings. However, little improvement is observed over the codes learned from the old BOW or TFIDF features. We attribute this to the reconstruction requirement in generative hashing, which forces irrelevant information that is abundant in the BERT embeddings to be compressed into the codes as well. To remedy this issue, a new unsupervised hashing paradigm is further proposed based on the mutual information (MI) maximization principle. Specifically, the method first constructs appropriate global and local codes from the documents and then seeks to maximize their mutual information. Experimental results on three benchmark datasets demonstrate that the proposed method is able to generate hash codes that outperform existing ones learned from BOW features by a substantial margin.
    DeepFakes: Detecting Forged and Synthetic Media Content Using Machine Learning. (arXiv:2109.02874v1 [cs.CV])
    (2 min) The rapid advancement of deep learning makes differentiating authentic facial images and video clips from manipulated ones unprecedentedly harder. The underlying technology for manipulating facial appearance through deep generative approaches, known as DeepFake, has emerged recently and has fostered a vast number of malicious face manipulation applications. Consequently, techniques that can assess the integrity of digital visual content are indispensable to reduce the impact of DeepFake creations. A large body of research on DeepFake creation and detection keeps pushing both sides beyond the current state of the art. This study presents challenges, research trends, and directions related to DeepFake creation and detection techniques by reviewing notable research in the DeepFake domain, to facilitate the development of more robust approaches that could deal with more advanced DeepFakes in the future.
    Sparse Distributed Memory using Spiking Neural Networks on Nengo. (arXiv:2109.03111v1 [cs.NE])
    (2 min) We present a Spiking Neural Network (SNN) based Sparse Distributed Memory (SDM) implemented on the Nengo framework. We base our work on previous work by Furber et al. (2004), implementing SDM using N-of-M codes. As an integral part of the SDM design, we have implemented Correlation Matrix Memory (CMM) using SNNs on Nengo. Our SNN implementation uses Leaky Integrate-and-Fire (LIF) spiking neuron models in Nengo. Our objective is to understand how well SNN-based SDMs perform in comparison to conventional SDMs. Towards this, we have simulated both conventional and SNN-based SDM and CMM on Nengo. We observe that the SNN-based models perform similarly to the conventional ones. In order to evaluate the performance of different SNNs, we repeated the experiment using Adaptive-LIF, Spiking Rectified Linear Unit, and Izhikevich models and obtained similar results. We conclude that it is indeed feasible to develop some types of associative memories using spiking neurons whose memory capacity and other features are similar to those of non-spiking implementations. Finally, we have implemented an application where MNIST images, encoded with N-of-M codes, are associated with their labels and stored in the SNN-based SDM.
    POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling. (arXiv:2109.03039v1 [cs.IR])
    (2 min) Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm in which users are allowed, via natural language dialogues, to communicate with search systems. Evaluating such systems is very challenging, since search results are presented in the form of natural language sentences. Given the unlimited number of possible responses, collecting relevance assessments for all of them is infeasible. In this paper, we propose POSSCORE, a simple yet effective automatic evaluation method for conversational search. The proposed embedding-based metric takes into account the influence of the part of speech (POS) of the terms in the response. To the best of our knowledge, our work is the first to systematically demonstrate the importance of incorporating syntactic information, such as POS labels, for conversational search evaluation. Experimental results demonstrate that our metric correlates with human preference, achieving significant improvements over state-of-the-art baseline metrics.
    Hyper Meta-Path Contrastive Learning for Multi-Behavior Recommendation. (arXiv:2109.02859v1 [cs.IR])
    (2 min) User purchasing prediction with multi-behavior information remains a challenging problem for current recommendation systems. Various methods have been proposed to address it via leveraging the advantages of graph neural networks (GNNs) or multi-task learning. However, most existing works do not take the complex dependencies among different behaviors of users into consideration. They utilize simple and fixed schemes, like neighborhood information aggregation or mathematical calculation of vectors, to fuse the embeddings of different user behaviors to obtain a unified embedding to represent a user's behavioral patterns which will be used in downstream recommendation tasks. To tackle the challenge, in this paper, we first propose the concept of hyper meta-path to construct hyper meta-paths or hyper meta-graphs to explicitly illustrate the dependencies among different behaviors of a user. How to obtain a unified embedding for a user from hyper meta-paths and avoid the previously mentioned limitations simultaneously is critical. Thanks to the recent success of graph contrastive learning, we leverage it to learn embeddings of user behavior patterns adaptively instead of assigning a fixed scheme to understand the dependencies among different behaviors. A new graph contrastive learning based framework is proposed by coupling with hyper meta-paths, namely HMG-CR, which consistently and significantly outperforms all baselines in extensive comparison experiments.
  • cs.LG updates on arXiv.org

    Normalizing field flows: Solving forward and inverse stochastic differential equations using physics-informed flow models. (arXiv:2108.12956v2 [cs.LG] UPDATED)
    (2 min) We introduce in this work the normalizing field flows (NFF) for learning random fields from scattered measurements. More precisely, we construct a bijective transformation (a normalizing flow characterized by neural networks) between a Gaussian random field with the Karhunen-Loève (KL) expansion structure and the target stochastic field, where the KL expansion coefficients and the invertible networks are trained by maximizing the sum of the log-likelihood over the scattered measurements. This NFF model can be used to solve data-driven forward, inverse, and mixed forward/inverse stochastic partial differential equations in a unified framework. We demonstrate the capability of the proposed NFF model for learning non-Gaussian processes and different types of stochastic partial differential equations.
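    For background, a truncated Karhunen-Loève expansion of a 1D Gaussian random field on a grid looks like the sketch below (the covariance kernel, grid, and truncation are placeholders; the NFF model itself, which learns a flow on top of such a reference field, is not shown):

        # Hedged sketch: sample a Gaussian random field from a truncated KL expansion.
        import numpy as np

        x = np.linspace(0, 1, 100)
        cov = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)     # exponential covariance kernel

        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        n_terms = 10
        xi = np.random.randn(n_terms)                            # KL expansion coefficients
        field = eigvecs[:, :n_terms] @ (np.sqrt(eigvals[:n_terms]) * xi)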
    Point-Cloud Deep Learning of Porous Media for Permeability Prediction. (arXiv:2107.14038v2 [eess.IV] UPDATED)
    (2 min) We propose a novel deep learning framework for predicting the permeability of porous media from their digital images. Unlike convolutional neural networks, instead of feeding the whole image volume as input to the network, we model the boundary between solid matrix and pore spaces as point clouds and feed them as inputs to a neural network based on the PointNet architecture. This approach overcomes the memory restrictions of graphics processing units and their consequences for the choice of batch size and for convergence. Compared to convolutional neural networks, the proposed deep learning methodology provides the freedom to select larger batch sizes, because it significantly reduces the size of the network inputs. Specifically, we use the classification branch of PointNet and adjust it for a regression task. As a test case, two- and three-dimensional synthetic digital rock images are considered. We investigate the effect of different components of our neural network on its performance. We compare our deep learning strategy with a convolutional neural network from various perspectives, specifically regarding the maximum possible batch size. We inspect the generalizability of our network by predicting the permeability of real-world rock samples as well as synthetic digital rocks that are statistically different from the samples used during training. The network predicts the permeability of digital rocks a few thousand times faster than a Lattice Boltzmann solver, with a high level of prediction accuracy.
    Towards Robust Object Detection: Bayesian RetinaNet for Homoscedastic Aleatoric Uncertainty Modeling. (arXiv:2108.00784v2 [cs.CV] UPDATED)
    (2 min) According to recent studies, commonly used computer vision datasets contain about 4% label errors. For example, the COCO dataset is known for its high level of noise in data labels, which limits its use for training robust deep neural architectures in real-world scenarios. To model such noise, in this paper we propose homoscedastic aleatoric uncertainty estimation and present a series of novel loss functions to address the problem of image object detection at scale. Specifically, the proposed functions are based on Bayesian inference, and we have incorporated them into the common community-adopted object detection deep learning architecture RetinaNet. We also show that modeling homoscedastic aleatoric uncertainty with our novel functions increases model interpretability and improves object detection performance when evaluated on the COCO dataset.
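    The paper's specific detection losses are not given in the abstract, so the sketch below only illustrates the generic homoscedastic-aleatoric-uncertainty idea (in the spirit of Kendall and Gal): a learned log-variance attenuates the task loss and is itself penalized.

```python
# Hedged sketch of homoscedastic aleatoric uncertainty weighting (generic form,
# not the paper's RetinaNet losses): exp(-s) * loss + s with a learned scalar s.
import torch
import torch.nn as nn

class HomoscedasticLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(()))   # one noise level for the whole task

    def forward(self, task_loss):
        # attenuate the loss by the learned noise, pay a penalty for claiming large noise
        return torch.exp(-self.log_var) * task_loss + self.log_var

crit = HomoscedasticLoss()
loss = crit(torch.tensor(1.7))      # e.g. a batch's detection loss value
loss.backward()                     # gradients flow into the learned log-variance
print(float(loss), float(crit.log_var.grad))
```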
    Convergence Analysis of Nonconvex Distributed Stochastic Zeroth-order Coordinate Method. (arXiv:2103.12954v2 [math.OC] UPDATED)
    (2 min) This paper investigates the stochastic distributed nonconvex optimization problem of minimizing a global cost function formed by the summation of $n$ local cost functions. We solve such a problem by involving zeroth-order (ZO) information exchange. In this paper, we propose a ZO distributed primal-dual coordinate method (ZODIAC) to solve the stochastic optimization problem. Agents approximate gradient coordinates via their own local stochastic ZO oracles, using an adaptive smoothing parameter. We show that the proposed algorithm achieves the convergence rate of $\mathcal{O}(\sqrt{p}/\sqrt{T})$ for general nonconvex cost functions. We demonstrate the efficiency of the proposed algorithm through a numerical example in comparison with the existing state-of-the-art centralized and distributed ZO algorithms.
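    For readers unfamiliar with zeroth-order methods, here is a short sketch of one common coordinate-wise ZO gradient estimate (central differences with a smoothing parameter); the paper's exact estimator and adaptive smoothing schedule may differ.

```python
# Sketch: coordinate-wise zeroth-order gradient estimate from function evaluations only.
# The toy cost and smoothing value are illustrative.
import numpy as np

def zo_coordinate_grad(f, x, mu=1e-4):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)   # central difference per coordinate
    return g

f = lambda x: np.sum(x ** 2) + np.sin(x[0])          # toy cost function
x = np.array([0.5, -1.0, 2.0])
print(zo_coordinate_grad(f, x))                      # close to [2*0.5 + cos(0.5), -2, 4]
```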
    PLAM: a Posit Logarithm-Approximate Multiplier. (arXiv:2102.09262v2 [cs.LG] UPDATED)
    (2 min) The Posit Number System was introduced in 2017 as a replacement for floating-point numbers. Since then, the community has explored its application in Neural Network related tasks and produced some unit designs which are still far from being competitive with their floating-point counterparts. This paper proposes a Posit Logarithm-Approximate Multiplication (PLAM) scheme to significantly reduce the complexity of posit multipliers, the most power-hungry units within Deep Neural Network architectures. When comparing with state-of-the-art posit multipliers, experiments show that the proposed technique reduces the area, power, and delay of hardware multipliers up to 72.86%, 81.79%, and 17.01%, respectively, without accuracy degradation.
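    The abstract does not spell out the approximation itself, so the sketch below only demonstrates the underlying logarithm-approximate multiplication idea (Mitchell-style) on ordinary floats: log2 is replaced by the cheap "exponent plus fractional mantissa" approximation. The paper applies this idea inside posit hardware multipliers; this float demo is only meant to show the kind of error such an approximation introduces.

```python
# Sketch of Mitchell-style logarithm-approximate multiplication on positive floats.
# The exact antilog is used here for simplicity; hardware designs also approximate it,
# and sign/zero handling is omitted.
import math

def approx_log2(x):
    e = math.floor(math.log2(x))        # exponent (read directly from the format in hardware)
    m = x / (2 ** e) - 1.0              # mantissa fraction in [0, 1)
    return e + m                        # Mitchell approximation: log2(1 + m) ~= m

def approx_mul(a, b):
    return 2 ** (approx_log2(a) + approx_log2(b))

a, b = 3.7, 12.25
print(a * b, approx_mul(a, b))          # exact vs. logarithm-approximate product
```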
    How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild. (arXiv:2106.03932v2 [cs.CV] UPDATED)
    (2 min) Successful active speaker detection requires a three-stage pipeline: (i) audio-visual encoding for all speakers in the clip, (ii) inter-speaker relation modeling between a reference speaker and the background speakers within each frame, and (iii) temporal modeling for the reference speaker. Each stage of this pipeline plays an important role for the final performance of the created architecture. Based on a series of controlled experiments, this work presents several practical guidelines for audio-visual active speaker detection. Correspondingly, we present a new architecture called ASDNet, which achieves a new state-of-the-art on the AVA-ActiveSpeaker dataset with a mAP of 93.5% outperforming the second best with a large margin of 4.7%. Our code and pretrained models are publicly available.
    Semiparametric Bayesian Networks. (arXiv:2109.03008v1 [cs.LG])
    (2 min) We introduce semiparametric Bayesian networks that combine parametric and nonparametric conditional probability distributions. Their aim is to incorporate the advantages of both components: the bounded complexity of parametric models and the flexibility of nonparametric ones. We demonstrate that semiparametric Bayesian networks generalize two well-known types of Bayesian networks: Gaussian Bayesian networks and kernel density estimation Bayesian networks. For this purpose, we consider two different conditional probability distributions required in a semiparametric Bayesian network. In addition, we present modifications of two well-known algorithms (greedy hill-climbing and PC) to learn the structure of a semiparametric Bayesian network from data. To realize this, we employ a score function based on cross-validation. In addition, using a validation dataset, we apply an early-stopping criterion to avoid overfitting. To evaluate the applicability of the proposed algorithm, we conduct an exhaustive experiment on synthetic data sampled by mixing linear and nonlinear functions, multivariate normal data sampled from Gaussian Bayesian networks, real data from the UCI repository, and bearings degradation data. As a result of this experiment, we conclude that the proposed algorithm accurately learns the combination of parametric and nonparametric components, while achieving a performance comparable with those provided by state-of-the-art methods.
    Interactive slice visualization for exploring machine learning models. (arXiv:2101.06986v2 [stat.ML] UPDATED)
    (2 min) Machine learning models fit complex algorithms to arbitrarily large datasets. These algorithms are well-known to be high on performance and low on interpretability. We use interactive visualization of slices of predictor space to address the interpretability deficit; in effect opening up the black-box of machine learning algorithms, for the purpose of interrogating, explaining, validating and comparing model fits. Slices are specified directly through interaction, or using various touring algorithms designed to visit high-occupancy sections or regions where the model fits have interesting properties. The methods presented here are implemented in the R package \pkg{condvis2}.
    What you need to know to train recurrent neural networks to make Flip Flops memories and more. (arXiv:2010.07858v2 [cs.LG] UPDATED)
    (2 min) Training neural networks to perform different tasks is relevant across various disciplines that go beyond Machine Learning. In particular, Recurrent Neural Networks (RNN) are of great interest to different scientific communities, for example, Computational Neuroscience and Dynamical Systems, among others. Open-source frameworks dedicated to Machine Learning, such as Tensorflow and Keras, have produced significant changes in the development of technologies that we currently use. One relevant problem that can be approached is how to build models for the study of dynamical systems, and how to extract the relevant information to be able to answer the scientific questions of interest. The purpose of the present work is to contribute to this aim by using a temporal processing task, in this case, a 3-bit Flip Flop memory, to show the modeling procedure at every step: from equations to software code using Tensorflow and Keras. The obtained networks are analyzed to describe the dynamics and to show different visualization and analysis tools. The code developed in this work is provided to be used as a base for modeling other systems.
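    A minimal Keras sketch of the 3-bit Flip Flop task follows (not the paper's released code): each of three input channels receives sparse +/-1 pulses and the target holds the last pulse seen on that channel; hyper-parameters are illustrative.

```python
# Sketch: generate 3-bit flip-flop data and fit a small Keras RNN on it.
# Sequence length, pulse probability, and network size are illustrative choices.
import numpy as np
import tensorflow as tf

def make_flipflop_batch(n=256, T=100, p=0.05):
    x = np.zeros((n, T, 3), dtype="float32")
    y = np.zeros((n, T, 3), dtype="float32")
    pulses = (np.random.rand(n, T, 3) < p) * np.random.choice([-1.0, 1.0], (n, T, 3))
    state = np.zeros((n, 3))
    for t in range(T):
        x[:, t] = pulses[:, t]
        state = np.where(pulses[:, t] != 0, pulses[:, t], state)   # remember last pulse
        y[:, t] = state
    return x, y

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, return_sequences=True, input_shape=(None, 3)),
    tf.keras.layers.Dense(3),
])
model.compile(optimizer="adam", loss="mse")
x, y = make_flipflop_batch()
model.fit(x, y, epochs=5, verbose=0)        # brief training on one generated batch
```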
    A New Basis for Sparse Principal Component Analysis. (arXiv:2007.00596v2 [stat.ML] UPDATED)
    (2 min) Previous versions of sparse principal component analysis (PCA) have presumed that the eigen-basis (a $p \times k$ matrix) is approximately sparse. We propose a method that presumes the $p \times k$ matrix becomes approximately sparse after a $k \times k$ rotation. The simplest version of the algorithm initializes with the leading $k$ principal components. Then, the principal components are rotated with a $k \times k$ orthogonal rotation to make them approximately sparse. Finally, soft-thresholding is applied to the rotated principal components. This approach differs from prior approaches because it uses an orthogonal rotation to approximate a sparse basis. One consequence is that a sparse component need not be a leading eigenvector, but rather a mixture of them. In this way, we propose a new (rotated) basis for sparse PCA. In addition, our approach avoids the "deflation" step and the multiple tuning parameters it requires. Our sparse PCA framework is versatile; for example, it extends naturally to a two-way analysis of a data matrix for simultaneous dimensionality reduction of rows and columns. We provide evidence showing that, for the same level of sparsity, the proposed sparse PCA method is more stable and can explain more variance compared to alternative methods. Through three applications -- sparse coding of images, analysis of transcriptome sequencing data, and large-scale clustering of social networks -- we demonstrate the modern usefulness of sparse PCA in exploring multivariate data.
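    A hedged numpy sketch of the recipe on toy data follows: take the leading k principal components, apply an orthogonal rotation toward sparsity (varimax is used here as a concrete choice; the paper's rotation criterion may differ), then soft-threshold the rotated loadings.

```python
# Sketch: leading PCs -> orthogonal (varimax) rotation -> soft-thresholding.
# Data, threshold, and the choice of varimax are illustrative assumptions.
import numpy as np

def varimax(Phi, gamma=1.0, iters=50):
    p, k = Phi.shape
    R = np.eye(k)
    for _ in range(iters):
        L = Phi @ R
        u, s, vt = np.linalg.svd(Phi.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(0))))
        R = u @ vt
    return Phi @ R

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X -= X.mean(0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:3].T                                  # p x k leading eigen-basis
V_rot = varimax(V)                            # rotate within the span toward sparsity
V_sparse = np.sign(V_rot) * np.maximum(np.abs(V_rot) - 0.05, 0.0)   # soft-thresholding
print(np.mean(V_sparse == 0.0))               # fraction of zeroed loadings
```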
    Learning Practically Feasible Policies for Online 3D Bin Packing. (arXiv:2108.13680v2 [cs.RO] UPDATED)
    (2 min) We tackle the Online 3D Bin Packing Problem, a challenging yet practically useful variant of the classical Bin Packing Problem. In this problem, the items are delivered to the agent without the full sequence being revealed in advance. The agent must directly pack these items into the target bin stably without changing their arrival order, and no further adjustment is permitted. Online 3D-BPP can be naturally formulated as a Markov Decision Process (MDP). We adopt deep reinforcement learning, in particular, the on-policy actor-critic framework, to solve this MDP with a constrained action space. To learn a practically feasible packing policy, we propose three critical designs. First, we propose an online analysis of packing stability based on a novel stacking tree. It attains high analysis accuracy while reducing the computational complexity from $O(N^2)$ to $O(N \log N)$, making it especially suited for RL training. Second, we propose decoupled packing policy learning for different dimensions of placement, which enables high-resolution spatial discretization and hence high packing precision. Third, we introduce a reward function that guides the robot to place items in a far-to-near order and therefore simplifies collision avoidance in the movement planning of the robotic arm. Furthermore, we provide a comprehensive discussion of several key implementation issues. The extensive evaluation demonstrates that our learned policy outperforms the state-of-the-art methods significantly and is practically usable for real-world applications.
    Fishr: Invariant Gradient Variances for Out-of-distribution Generalization. (arXiv:2109.02934v1 [cs.LG])
    (2 min) Learning robust models that generalize well under changes in the data distribution is critical for real-world applications. To this end, there has been a growing surge of interest to learn simultaneously from multiple training domains - while enforcing different types of invariance across those domains. Yet, all existing approaches fail to show systematic benefits under fair evaluation protocols. In this paper, we propose a new learning scheme to enforce domain invariance in the space of the gradients of the loss function: specifically, we introduce a regularization term that matches the domain-level variances of gradients across training domains. Critically, our strategy, named Fishr, exhibits close relations with the Fisher Information and the Hessian of the loss. We show that forcing domain-level gradient covariances to be similar during the learning procedure eventually aligns the domain-level loss landscapes locally around the final weights. Extensive experiments demonstrate the effectiveness of Fishr for out-of-distribution generalization. In particular, Fishr improves the state of the art on the DomainBed benchmark and performs significantly better than Empirical Risk Minimization. The code is released at https://github.com/alexrame/fishr.
    Quick Learning Mechanism with Cross-Domain Adaptation for Intelligent Fault Diagnosis. (arXiv:2103.08889v2 [stat.ML] UPDATED)
    (2 min) A fault diagnostic model trained for a laboratory case machine fails to perform well on industrial machines running under variable operating conditions. For every new operating condition of such machines, a new diagnostic model has to be trained, which is a time-consuming and uneconomical process. Therefore, we propose a quick learning mechanism that can transform the existing diagnostic model into a new model suitable for industrial machines operating in different conditions. The proposed method uses the Net2Net transformation followed by fine-tuning to cancel/minimize the maximum mean discrepancy between the new data and the previous one. The fine-tuning of the model requires very few labelled target samples and very few iterations of training. Therefore, the proposed method is capable of learning the new target data pattern quickly. The effectiveness of the proposed fault diagnosis method has been demonstrated on the Case Western Reserve University dataset, the Intelligent Maintenance Systems bearing dataset, and the Paderborn University dataset under wide variations of the operating conditions. It has been validated that the diagnostic model trained on artificially damaged fault datasets can be used to quickly train another model for a real damage dataset.
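    For reference, a minimal numpy sketch of the squared maximum mean discrepancy (RBF kernel) that the fine-tuning step is described as minimizing between source and target data follows; the bandwidth and the stand-in features are assumptions.

```python
# Sketch: biased estimate of squared MMD with an RBF kernel between two feature sets.
# Feature vectors and bandwidth are illustrative assumptions.
import numpy as np

def rbf(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    return rbf(x, x, sigma).mean() + rbf(y, y, sigma).mean() - 2 * rbf(x, y, sigma).mean()

src = np.random.randn(64, 8)          # stand-in features from the laboratory condition
tgt = np.random.randn(64, 8) + 0.5    # stand-in features from the new operating condition
print(mmd2(src, tgt))
```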
    Deep Learning for Exotic Option Valuation. (arXiv:2103.12551v2 [q-fin.CP] UPDATED)
    (2 min) A common approach to valuing exotic options involves choosing a model and then determining its parameters to fit the volatility surface as closely as possible. We refer to this as the model calibration approach (MCA). A disadvantage of MCA is that some information in the volatility surface is lost during the calibration process and the prices of exotic options will not in general be consistent with those of plain vanilla options. We consider an alternative approach where the structure of the user's preferred model is preserved but points on the volatility surface are features input to a neural network. We refer to this as the volatility feature approach (VFA) model. We conduct experiments showing that VFA can be expected to outperform MCA for the volatility surfaces encountered in practice. Once the upfront computational time has been invested in developing the neural network, the valuation of exotic options using VFA is very fast.
    Meta-Semi: A Meta-learning Approach for Semi-supervised Learning. (arXiv:2007.02394v3 [cs.LG] UPDATED)
    (2 min) Deep learning based semi-supervised learning (SSL) algorithms have led to promising results in recent years. However, they tend to introduce multiple tunable hyper-parameters, making them less practical in real SSL scenarios where the labeled data is scarce for extensive hyper-parameter search. In this paper, we propose a novel meta-learning based SSL algorithm (Meta-Semi) that requires tuning only one additional hyper-parameter, compared with a standard supervised deep learning algorithm, to achieve competitive performance under various conditions of SSL. We start by defining a meta optimization problem that minimizes the loss on labeled data through dynamically reweighting the loss on unlabeled samples, which are associated with soft pseudo labels during training. As the meta problem is computationally intensive to solve directly, we propose an efficient algorithm to dynamically obtain the approximate solutions. We show theoretically that Meta-Semi converges to the stationary point of the loss function on labeled data under mild conditions. Empirically, Meta-Semi outperforms state-of-the-art SSL algorithms significantly on the challenging semi-supervised CIFAR-100 and STL-10 tasks, and achieves competitive performance on CIFAR-10 and SVHN.
    Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers. (arXiv:2101.00234v2 [cs.CL] UPDATED)
    (2 min) Transformers have shown improved performance when compared to previous architectures for sequence processing such as RNNs. Despite their sizeable performance gains, as recently suggested, the model is computationally expensive to train and has a high parameter budget. In light of this, we explore parameter-sharing methods in Transformers with a specific focus on generative models. We perform an analysis of different parameter sharing/reduction methods and develop the Subformer. Our model combines sandwich-style parameter sharing, which overcomes naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
    Refined approachability algorithms and application to regret minimization with global costs. (arXiv:2009.03831v4 [cs.LG] UPDATED)
    (2 min) Blackwell's approachability is a framework where two players, the Decision Maker and the Environment, play a repeated game with vector-valued payoffs. The goal of the Decision Maker is to make the average payoff converge to a given set called the target. When this is indeed possible, simple algorithms which guarantee the convergence are known. This abstract tool was successfully used for the construction of optimal strategies in various repeated games, but also found several applications in online learning. By extending an approach proposed by (Abernethy et al., 2011), we construct and analyze a class of Follow the Regularized Leader algorithms (FTRL) for Blackwell's approachability which are able to minimize not only the Euclidean distance to the target set (as it is often the case in the context of Blackwell's approachability) but a wide range of distance-like quantities. This flexibility enables us to apply these algorithms to closely minimize the quantity of interest in various online learning problems. In particular, for regret minimization with $\ell_p$ global costs, we obtain the first bounds with explicit dependence in $p$ and the dimension $d$.
    Deep Personalized Glucose Level Forecasting Using Attention-based Recurrent Neural Networks. (arXiv:2106.00884v2 [cs.LG] UPDATED)
    (2 min) In this paper, we study the problem of blood glucose forecasting and provide a deep personalized solution. Predicting blood glucose level in people with diabetes has significant value because health complications of abnormal glucose level are serious, sometimes even leading to death. Therefore, having a model that can accurately and quickly warn patients of potential problems is essential. To develop a better deep model for blood glucose forecasting, we analyze the data and detect important patterns. These observations helped us to propose a method that has several key advantages over existing methods: 1- it learns a personalized model for each patient as well as a global model; 2- it uses an attention mechanism and extracted time features to better learn long-term dependencies in the data; 3- it introduces a new, robust training procedure for time series data. We empirically show the efficacy of our model on a real dataset.
    Learning Fast Sample Re-weighting Without Reward Data. (arXiv:2109.03216v1 [cs.LG])
    (2 min) Training sample re-weighting is an effective approach for tackling data biases such as imbalanced and corrupted labels. Recent methods develop learning-based algorithms to learn sample re-weighting strategies jointly with model training, based on the frameworks of reinforcement learning and meta learning. However, their dependence on additional unbiased reward data limits their general applicability. Furthermore, existing learning-based sample re-weighting methods require nested optimizations of models and weighting parameters, which requires expensive second-order computation. This paper addresses these two problems and presents a novel learning-based fast sample re-weighting (FSR) method that does not require additional reward data. The method is based on two key ideas: learning from history to build proxy reward data, and feature sharing to reduce the optimization cost. Our experiments show the proposed method achieves competitive results compared to the state of the art on label noise robustness and long-tailed recognition, and does so while achieving significantly improved training efficiency. The source code is publicly available at https://github.com/google-research/google-research/tree/master/ieg.
    Fast approximations of the Jeffreys divergence between univariate Gaussian mixture models via exponential polynomial densities. (arXiv:2107.05901v3 [cs.IT] UPDATED)
    (2 min) The Jeffreys divergence is a renowned symmetrization of the oriented Kullback-Leibler divergence broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed form, various techniques with pros and cons have been proposed in the literature to either estimate, approximate, or lower and upper bound this divergence. In this paper, we propose a simple yet fast heuristic to approximate the Jeffreys divergence between two univariate Gaussian mixtures with an arbitrary number of components. Our heuristic relies on converting the mixtures into pairs of dually parameterized probability densities belonging to an exponential family. In particular, we consider the versatile polynomial exponential family densities, and design a divergence to measure in closed form the goodness of fit between a Gaussian mixture and its polynomial exponential density approximation. This goodness-of-fit divergence is a generalization of the Hyv\"arinen divergence used to estimate models with computationally intractable normalizers. It allows us to perform model selection by choosing the orders of the polynomial exponential densities used to approximate the mixtures. We demonstrate experimentally that our heuristic to approximate the Jeffreys divergence improves by several orders of magnitude the computational time of stochastic Monte Carlo estimations while approximating the Jeffreys divergence reasonably well, especially when the mixtures have a very small number of modes. Besides, our mixture-to-exponential family conversion techniques may prove useful in other settings.
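    To make the comparison concrete, here is a hedged numpy/scipy sketch of the slow baseline the heuristic is meant to accelerate: a stochastic Monte Carlo estimate of the Jeffreys divergence between two univariate Gaussian mixtures (mixture parameters below are illustrative).

```python
# Sketch: Monte Carlo estimate of the Jeffreys divergence J(p, q) = KL(p||q) + KL(q||p)
# between two univariate Gaussian mixtures. Parameters and sample size are illustrative.
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, w, mu, sigma):
    return sum(wi * norm.pdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))

def gmm_sample(n, w, mu, sigma, rng):
    comp = rng.choice(len(w), size=n, p=w)
    return rng.normal(np.array(mu)[comp], np.array(sigma)[comp])

def mc_jeffreys(p, q, n=20000, seed=0):
    rng = np.random.default_rng(seed)
    xp = gmm_sample(n, *p, rng)
    xq = gmm_sample(n, *q, rng)
    kl_pq = np.mean(np.log(gmm_pdf(xp, *p) / gmm_pdf(xp, *q)))
    kl_qp = np.mean(np.log(gmm_pdf(xq, *q) / gmm_pdf(xq, *p)))
    return kl_pq + kl_qp

p = ([0.3, 0.7], [-1.0, 2.0], [0.5, 1.0])   # (weights, means, standard deviations)
q = ([0.5, 0.5], [0.0, 2.5], [0.7, 0.8])
print(mc_jeffreys(p, q))
```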
    Novelty detection using ensembles with regularized disagreement. (arXiv:2012.05825v2 [cs.LG] UPDATED)
    (2 min) Despite their excellent performance on in-distribution (ID) data, machine learning-based prediction systems often predict out-of-distribution (OOD) samples incorrectly while indicating high confidence. Instead, they should flag samples that are not similar to the training data, for example, when new classes emerge over time. Even though current OOD detection algorithms can successfully distinguish completely different data sets, they fail to reliably identify samples from novel classes. We develop a new ensemble-based procedure that promotes model diversity and exploits regularization to limit disagreement to only OOD samples, using a batch containing an unknown mixture of ID and OOD data. We show that our procedure significantly outperforms state-of-the-art methods, including those that have access, during training, to data that is known to be OOD. We run extensive comparisons of our approach on a variety of novel-class detection scenarios, on standard image data sets such as SVHN/CIFAR-10/CIFAR-100, as well as on new disease detection on medical image data sets.
    Error Controlled Actor-Critic. (arXiv:2109.02517v2 [cs.LG] UPDATED)
    (2 min) The approximation error of the value function inevitably causes an overestimation phenomenon and has a negative impact on the convergence of the algorithms. To mitigate the negative effects of the approximation error, we propose Error Controlled Actor-Critic, which confines the approximation error of the value function. We present an analysis of how the approximation error can hinder the optimization process of actor-critic methods. Then, we derive an upper bound on the approximation error of the Q-function approximator and find that the error can be lowered by restricting the KL-divergence between every two consecutive policies during policy training. The results of experiments on a range of continuous control tasks demonstrate that the proposed actor-critic algorithm apparently reduces the approximation error and significantly outperforms other model-free RL algorithms.
    Diff-ResNets for Few-shot Learning -- an ODE Perspective. (arXiv:2105.03155v2 [cs.LG] UPDATED)
    (2 min) Interpreting deep neural networks from the ordinary differential equations (ODEs) perspective has inspired many efficient and robust network architectures. However, existing ODE based approaches ignore the relationship among data points, which is a critical component in many problems including few-shot learning and semi-supervised learning. In this paper, inspired by the diffusive ODEs, we propose a novel diffusion residual network (Diff-ResNet) to strengthen the interactions among data points. Under the structured data assumption, it is proved that the diffusion mechanism can decrease the distance-diameter ratio that improves the separability of inter-class points and reduces the distance among local intra-class points. This property can be easily adopted by the residual networks for constructing the separable hyperplanes. The synthetic binary classification experiments demonstrate the effectiveness of the proposed diffusion mechanism. Moreover, extensive experiments of few-shot image classification and semi-supervised graph node classification in various datasets validate the advantages of the proposed Diff-ResNet over existing few-shot learning methods.
    CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation. (arXiv:2106.10796v2 [cs.LG] UPDATED)
    (2 min) Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combined with a parallel communication mechanism such as pipelining, gradient compression can greatly alleviate the impact of communication overhead. However, two problems of gradient compression remain to be solved. First, gradient compression brings extra computation cost, which delays the next training iteration. Second, gradient compression usually leads to a decrease in convergence accuracy.
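    The abstract does not specify CD-SGD's compressor, so the sketch below only illustrates one widely used family of gradient compression schemes, top-k sparsification with local error feedback; k and the residual handling are illustrative choices.

```python
# Sketch: top-k gradient compression with error feedback (a generic scheme, not
# necessarily the one used in CD-SGD).
import numpy as np

def topk_compress(grad, residual, k):
    g = grad + residual                      # add back what was dropped last round
    idx = np.argsort(np.abs(g))[-k:]         # keep the k largest-magnitude entries
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]
    return sparse, g - sparse                # transmitted gradient, new local residual

grad = np.random.randn(1000)
residual = np.zeros_like(grad)
sparse, residual = topk_compress(grad, residual, k=50)
print(np.count_nonzero(sparse), np.linalg.norm(residual))
```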
    GANSER: A Self-supervised Data Augmentation Framework for EEG-based Emotion Recognition. (arXiv:2109.03124v1 [cs.LG])
    (2 min) The data scarcity problem in Electroencephalography (EEG) based affective computing makes it difficult to build an effective model with high accuracy and stability using machine learning algorithms, especially deep learning models. Data augmentation has recently achieved considerable performance improvement for deep learning models: increased accuracy, stability, and reduced over-fitting. In this paper, we propose a novel data augmentation framework, namely Generative Adversarial Network-based Self-supervised Data Augmentation (GANSER). As the first to combine adversarial training with self-supervised learning for EEG-based emotion recognition, the proposed framework can generate high-quality and high-diversity simulated EEG samples. In particular, we utilize adversarial training to learn an EEG generator and force the generated EEG signals to approximate the distribution of real samples, ensuring the quality of augmented samples. A transformation function is employed to mask parts of EEG signals and force the generator to synthesize potential EEG signals based on the remaining parts, to produce a wide variety of samples. The masking possibility during transformation is introduced as prior knowledge to guide the extraction of distinguishable features for simulated EEG signals and to generalize the classifier to the augmented sample space. Finally, extensive experiments demonstrate that our proposed method improves emotion recognition performance and achieves state-of-the-art results.
    Experimental Quantum Generative Adversarial Networks for Image Generation. (arXiv:2010.06201v3 [quant-ph] UPDATED)
    (0 min) Quantum machine learning is expected to be one of the first practical applications of near-term quantum devices. Pioneer theoretical works suggest that quantum generative adversarial networks (GANs) may exhibit a potential exponential advantage over classical GANs, thus attracting widespread attention. However, it remains elusive whether quantum GANs implemented on near-term quantum devices can actually solve real-world learning tasks. Here, we devise a flexible quantum GAN scheme to narrow this knowledge gap, which could accomplish image generation with arbitrarily high-dimensional features, and could also take advantage of quantum superposition to train multiple examples in parallel. For the first time, we experimentally achieve the learning and generation of real-world hand-written digit images on a superconducting quantum processor. Moreover, we utilize a gray-scale bar dataset to exhibit the competitive performance between quantum GANs and the classical GANs based on multilayer perceptron and convolutional neural network architectures, respectively, benchmarked by the Fr\'echet Distance score. Our work provides guidance for developing advanced quantum generative models on near-term quantum devices and opens up an avenue for exploring quantum advantages in various GAN-related learning tasks.
    OdoNet: Untethered Speed Aiding for Vehicle Navigation Without Hardware Wheeled Odometer. (arXiv:2109.03091v1 [cs.RO])
    (2 min) Odometer has been proven to significantly improve the accuracy of the Global Navigation Satellite System / Inertial Navigation System (GNSS/INS) integrated vehicle navigation in GNSS-challenged environments. However, the odometer is inaccessible in many applications, especially for aftermarket devices. To apply forward speed aiding without hardware wheeled odometer, we propose OdoNet, an untethered one-dimensional Convolution Neural Network (CNN)-based pseudo-odometer model learning from a single Inertial Measurement Unit (IMU), which can act as an alternative to the wheeled odometer. Dedicated experiments have been conducted to verify the feasibility and robustness of the OdoNet. The results indicate that the IMU individuality, the vehicle loads, and the road conditions have little impact on the robustness and precision of the OdoNet, while the IMU biases and the mounting angles may notably ruin the OdoNet. Thus, a data-cleaning procedure is added to effectively mitigate the impacts of the IMU biases and the mounting angles. Compared to the process using only non-holonomic constraint (NHC), after employing the pseudo-odometer, the positioning error is reduced by around 68%, while the percentage is around 74% for the hardware wheeled odometer. In conclusion, the proposed OdoNet can be employed as an untethered pseudo-odometer for vehicle navigation, which can efficiently improve the accuracy and reliability of the positioning in GNSS-denied environments.
    SimJEB: Simulated Jet Engine Bracket Dataset. (arXiv:2105.03534v2 [cs.CE] UPDATED)
    (0 min) This paper introduces the Simulated Jet Engine Bracket Dataset (SimJEB): a new, public collection of crowdsourced mechanical brackets and accompanying structural simulations. SimJEB is applicable to a wide range of geometry processing tasks; the complexity of the shapes in SimJEB offer a challenge to automated geometry cleaning and meshing, while categorical labels and structural simulations facilitate classification and regression (i.e. engineering surrogate modeling). In contrast to existing shape collections, SimJEB's models are all designed for the same engineering function and thus have consistent structural loads and support conditions. On the other hand, SimJEB models are more complex, diverse, and realistic than the synthetically generated datasets commonly used in parametric surrogate model evaluation. The designs in SimJEB were derived from submissions to the GrabCAD Jet Engine Bracket Challenge: an open engineering design competition with over 700 hand-designed CAD entries from 320 designers representing 56 countries. Each model has been cleaned, categorized, meshed, and simulated with finite element analysis according to the original competition specifications. The result is a collection of 381 diverse, high-quality and application-focused designs for advancing geometric deep learning, engineering surrogate modeling, automated cleaning and related geometry processing tasks.
    Auxiliary Diagnosing Coronary Stenosis Using Machine Learning. (arXiv:2007.10316v4 [q-bio.TO] UPDATED)
    (2 min) How can we accurately classify and diagnose whether an individual has Coronary Stenosis (CS) without an invasive physical examination? This problem has not been solved satisfactorily. To this end, four machine learning (ML) algorithms, i.e., Boosted Tree (BT), Decision Tree (DT), Logistic Regression (LR) and Random Forest (RF), are employed in this paper. First, eleven features, including basic information about an individual, symptoms and results of routine physical examination, are selected, and one label is specified, indicating whether an individual suffers from different severities of coronary artery stenosis or not. On this basis, a sample set is constructed. Second, each of these four ML algorithms learns from the sample set to obtain the corresponding optimal classification results. The experimental results show that RF performs better than the other three algorithms, and classifies whether an individual has CS with an accuracy of 95.7% (=90/94).
    ConvNets for Counting: Object Detection of Transient Phenomena in Steelpan Drums. (arXiv:2102.00632v2 [cs.CV] UPDATED)
    (0 min) We train an object detector built from convolutional neural networks to count interference fringes in elliptical antinode regions in frames of high-speed video recordings of transient oscillations in Caribbean steelpan drums illuminated by electronic speckle pattern interferometry (ESPI). The annotations provided by our model aim to contribute to the understanding of time-dependent behavior in such drums by tracking the development of sympathetic vibration modes. The system is trained on a dataset of crowdsourced human-annotated images obtained from the Zooniverse Steelpan Vibrations Project. Due to the small number of human-annotated images and the ambiguity of the annotation task, we also evaluate the model on a large corpus of synthetic images whose properties have been matched to the real images by style transfer using a Generative Adversarial Network. Applying the model to thousands of unlabeled video frames, we measure oscillations consistent with audio recordings of these drum strikes. One unanticipated result is that sympathetic oscillations of higher-octave notes significantly precede the rise in sound intensity of the corresponding second harmonic tones; the mechanism responsible for this remains unidentified. This paper primarily concerns the development of the predictive model; further exploration of the steelpan images and deeper physical insights await its further application.
    Improving Phenotype Prediction using Long-Range Spatio-Temporal Dynamics of Functional Connectivity. (arXiv:2109.03115v1 [q-bio.NC])
    (0 min) The study of functional brain connectivity (FC) is important for understanding the underlying mechanisms of many psychiatric disorders. Many recent analyses adopt graph convolutional networks, to study non-linear interactions between functionally-correlated states. However, although patterns of brain activation are known to be hierarchically organised in both space and time, many methods have failed to extract powerful spatio-temporal features. To overcome those challenges, and improve understanding of long-range functional dynamics, we translate an approach, from the domain of skeleton-based action recognition, designed to model interactions across space and time. We evaluate this approach using the Human Connectome Project (HCP) dataset on sex classification and fluid intelligence prediction. To account for subject topographic variability of functional organisation, we modelled functional connectomes using multi-resolution dual-regressed (subject-specific) ICA nodes. Results show a prediction accuracy of 94.4% for sex classification (an increase of 6.2% compared to other methods), and an improvement of correlation with fluid intelligence of 0.325 vs 0.144, relative to a baseline model that encodes space and time separately. Results suggest that explicit encoding of spatio-temporal dynamics of brain functional activity may improve the precision with which behavioural and cognitive phenotypes may be predicted in the future.
    Semi-Relaxed Quantization with DropBits: Training Low-Bit Neural Networks via Bit-wise Regularization. (arXiv:1911.12990v3 [cs.CV] UPDATED)
    (0 min) Network quantization, which aims to reduce the bit-lengths of the network weights and activations, has emerged as one of the key ingredients for reducing the size of neural networks for deployment to resource-limited devices. In order to overcome the nature of transforming continuous activations and weights to discrete ones, a recent study called Relaxed Quantization (RQ) [Louizos et al. 2019] successfully employs the popular Gumbel-Softmax that allows this transformation with efficient gradient-based optimization. However, RQ with this Gumbel-Softmax relaxation still suffers from a bias-variance trade-off depending on the temperature parameter of Gumbel-Softmax. To resolve the issue, we propose a novel method, Semi-Relaxed Quantization (SRQ), that uses a multi-class straight-through estimator to effectively reduce the bias and variance, along with a new regularization technique, DropBits, which replaces dropout regularization by randomly dropping bits instead of neurons to further reduce the bias of the multi-class straight-through estimator in SRQ. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels to find the proper bit-length for each layer using DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support the quantized lottery ticket hypothesis: learning heterogeneous quantization levels outperforms the case using the same but fixed quantization levels from scratch.
    Explanations for Occluded Images. (arXiv:2103.03622v2 [cs.LG] UPDATED)
    (0 min) Existing algorithms for explaining the output of image classifiers perform poorly on inputs where the object of interest is partially occluded. We present a novel, black-box algorithm for computing explanations that uses a principled approach based on causal theory. We have implemented the method in the DEEPCOVER tool. We obtain explanations that are much more accurate than those generated by the existing explanation tools on images with occlusions and observe a level of performance comparable to the state of the art when explaining images without occlusions.
    Backdoor Attacks Against Deep Learning Systems in the Physical World. (arXiv:2006.14580v4 [cs.CV] UPDATED)
    (0 min) Backdoor attacks embed hidden malicious behaviors into deep learning models, which only activate and cause misclassifications on model inputs containing a specific trigger. Existing works on backdoor attacks and defenses, however, mostly focus on digital attacks that use digitally generated patterns as triggers. A critical question remains unanswered: can backdoor attacks succeed using physical objects as triggers, thus making them a credible threat against deep learning systems in the real world? We conduct a detailed empirical study to explore this question for facial recognition, a critical deep learning task. Using seven physical objects as triggers, we collect a custom dataset of 3205 images of ten volunteers and use it to study the feasibility of physical backdoor attacks under a variety of real-world conditions. Our study reveals two key findings. First, physical backdoor attacks can be highly successful if they are carefully configured to overcome the constraints imposed by physical objects. In particular, the placement of successful triggers is largely constrained by the target model's dependence on key facial features. Second, four of today's state-of-the-art defenses against (digital) backdoors are ineffective against physical backdoors, because the use of physical objects breaks core assumptions used to construct these defenses. Our study confirms that (physical) backdoor attacks are not a hypothetical phenomenon but rather pose a serious real-world threat to critical classification tasks. We need new and more robust defenses against backdoors in the physical world.
    Countering Online Hate Speech: An NLP Perspective. (arXiv:2109.02941v1 [cs.CL])
    (0 min) Online hate speech has caught everyone's attention from the news related to the COVID-19 pandemic, US elections, and worldwide protests. Online toxicity - an umbrella term for online hateful behavior - manifests itself in forms such as online hate speech. Hate speech is a deliberate attack directed towards an individual or a group motivated by the targeted entity's identity or opinions. The rising mass communication through social media further exacerbates the harmful consequences of online hate speech. While there has been significant research on hate-speech identification using Natural Language Processing (NLP), the work on utilizing NLP for the prevention of and intervention against online hate speech remains relatively scarce. This paper presents a holistic conceptual framework on hate-speech NLP countering methods, along with a thorough survey of the current progress of NLP for countering online hate speech. It classifies the countering techniques based on their time of action, and identifies potential future research areas on this topic.
    Distance preserving model order reduction of graph-Laplacians and cluster analysis. (arXiv:1809.03048v3 [cs.LG] UPDATED)
    (0 min) Graph-Laplacians and their spectral embeddings play an important role in multiple areas of machine learning. This paper is focused on graph-Laplacian dimension reduction for the spectral clustering of data as a primary application. Spectral embedding provides a low-dimensional parametrization of the data manifold which makes the subsequent task (e.g., clustering) much easier. However, despite reducing the dimensionality of data, the overall computational cost may still be prohibitive for large data sets due to two factors. First, computing the partial eigendecomposition of the graph-Laplacian typically requires a large Krylov subspace. Second, after the spectral embedding is complete, one still has to operate with the same number of data points. For example, clustering of the embedded data is typically performed with various relaxations of k-means whose computational cost scales poorly with respect to the size of the data set. In this work, we switch the focus from the entire data set to a subset of graph vertices (target subset). We develop two novel algorithms for such a low-dimensional representation of the original graph that preserves important global distances between the nodes of the target subset. In particular, it allows us to ensure that target subset clustering is consistent with the spectral clustering of the full data set if one were to perform such a clustering. That is achieved by a properly parametrized reduced-order model (ROM) of the graph-Laplacian that accurately approximates the diffusion transfer function of the original graph for inputs and outputs restricted to the target subset. Working with a small target subset greatly reduces the required dimension of the Krylov subspace and allows one to exploit conventional algorithms (like approximations of k-means) in the regimes where they are most robust and efficient.
    Utilizing a digital swarm intelligence platform to improve consensus among radiologists and exploring its applications. (arXiv:2107.07341v2 [cs.HC] UPDATED)
    (0 min) Radiologists today play a key role in making diagnostic decisions and labeling images for training A.I. algorithms. Low inter-reader reliability (IRR) can be seen between experts when interpreting challenging cases. While team-based decisions are known to outperform individual decisions, inter-personal biases often creep into group interactions, limiting non-dominant participants from expressing true opinions. To overcome the dual problems of low consensus and inter-personal bias, we explored a solution modeled on biological swarms of bees. Two separate cohorts, three radiologists and five radiology residents, collaborated on a digital swarm platform in real time and in a blinded fashion, grading meniscal lesions on knee MR exams. These consensus votes were benchmarked against clinical (arthroscopy) and radiological (senior-most radiologist) observations. The IRR of the consensus votes was compared to the IRR of the majority and most confident votes of the two cohorts. The radiologist cohort saw an improvement of 23% in IRR of swarm votes over majority vote. A similar improvement of 23% in IRR of 3-resident swarm votes over majority vote was observed. The 5-resident swarm had an even higher improvement of 32% in IRR over majority vote. Swarm consensus votes also improved specificity by up to 50%. The swarm consensus votes outperformed individual and majority vote decisions in both the radiologist and resident cohorts. The 5-resident swarm had higher IRR than the 3-resident swarm, indicating a positive effect of increased swarm size. The attending and resident swarms also outperformed predictions from a state-of-the-art A.I. algorithm. Utilizing a digital swarm platform improved agreement and allows participants to express judgement-free intent, resulting in superior clinical performance and robust A.I. training labels.
    Analysis of MRI Biomarkers for Brain Cancer Survival Prediction. (arXiv:2109.02785v1 [q-bio.QM])
    (0 min) Prediction of Overall Survival (OS) of brain cancer patients from multi-modal MRI is a challenging field of research. Most of the existing literature on survival prediction is based on Radiomic features, which does not consider either non-biological factors or the functional neurological status of the patient(s). Besides, the selection of an appropriate cut-off for survival and the presence of censored data create further problems. Application of deep learning models for OS prediction is also limited due to the lack of large annotated publicly available datasets. In this scenario we analyse the potential of two novel neuroimaging feature families, extracted from brain parcellation atlases and spatial habitats, along with classical radiomic and geometric features; to study their combined predictive power for analysing overall survival. A cross validation strategy with grid search is proposed to simultaneously select and evaluate the most predictive feature subset based on its predictive power. A Cox Proportional Hazard (CoxPH) model is employed for univariate feature selection, followed by the prediction of patient-specific survival functions by three multivariate parsimonious models viz. Coxnet, Random survival forests (RSF) and Survival SVM (SSVM). The brain cancer MRI data used for this research was taken from two open-access collections TCGA-GBM and TCGA-LGG available from The Cancer Imaging Archive (TCIA). Corresponding survival data for each patient was downloaded from The Cancer Genome Atlas (TCGA). A high cross validation $C-index$ score of $0.82\pm.10$ was achieved using RSF with the best $24$ selected features. Age was found to be the most important biological predictor. There were $9$, $6$, $6$ and $2$ features selected from the parcellation, habitat, radiomic and region-based feature groups respectively.
    Prescriptive Process Monitoring Under Resource Constraints: A Causal Inference Approach. (arXiv:2109.02894v1 [cs.LG])
    (0 min) Prescriptive process monitoring is a family of techniques to optimize the performance of a business process by triggering interventions at runtime. Existing prescriptive process monitoring techniques assume that the number of interventions that may be triggered is unbounded. In practice, though, specific interventions consume resources with finite capacity. For example, in a loan origination process, an intervention may consist of preparing an alternative loan offer to increase the applicant's chances of taking a loan. This intervention requires a certain amount of time from a credit officer, and thus, it is not possible to trigger this intervention in all cases. This paper proposes a prescriptive process monitoring technique that triggers interventions to optimize a cost function under fixed resource constraints. The proposed technique relies on predictive modeling to identify cases that are likely to lead to a negative outcome, in combination with causal inference to estimate the effect of an intervention on the outcome of the case. These outputs are then used to allocate resources to interventions to maximize a cost function. A preliminary empirical evaluation suggests that the proposed approach produces a higher net gain than a purely predictive (non-causal) baseline.
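    To illustrate the allocation step under a resource constraint, here is a toy sketch: given each case's estimated treatment effect of intervening, the limited intervention budget is spent on the cases with the highest expected net gain. The cost values and the gain formula are illustrative assumptions, not the paper's cost function.

```python
# Sketch: greedy allocation of a fixed number of interventions by estimated net gain.
# Uplift estimates, costs, and the budget are illustrative.
import numpy as np

rng = np.random.default_rng(1)
uplift = rng.uniform(-0.05, 0.30, size=200)      # estimated reduction in P(negative outcome)
cost_negative, cost_intervention = 100.0, 10.0   # illustrative monetary values

net_gain = uplift * cost_negative - cost_intervention
budget = 20                                      # only 20 interventions can be resourced
chosen = np.argsort(net_gain)[::-1][:budget]     # spend the budget on the largest gains
chosen = chosen[net_gain[chosen] > 0]            # never trigger an intervention at a loss
print(len(chosen), round(float(net_gain[chosen].sum()), 2))
```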
    Fruit-CoV: An Efficient Vision-based Framework for Speedy Detection and Diagnosis of SARS-CoV-2 Infections Through Recorded Cough Sounds. (arXiv:2109.03219v1 [cs.SD])
    (0 min) SARS-CoV-2, the virus behind the disease colloquially known as COVID-19, had its initial outbreak in December 2019. The deadly virus has spread across the world, giving rise to a global pandemic since March 2020. In addition, a recent variant of SARS-CoV-2 named Delta is intractably contagious and responsible for more than four million deaths across the world. Therefore, it is vital to have an at-home self-testing service for SARS-CoV-2. In this study, we introduce Fruit-CoV, a two-stage vision framework, which is capable of detecting SARS-CoV-2 infections through recorded cough sounds. Specifically, we convert sounds into Log-Mel Spectrograms and use the EfficientNet-V2 network to extract their visual features in the first stage. In the second stage, we use 14 convolutional layers extracted from the large-scale Pretrained Audio Neural Networks for audio pattern recognition (PANNs) and the Wavegram-Log-Mel-CNN to aggregate feature representations of the Log-Mel Spectrograms. Finally, we use the combined features to train a binary classifier. In this study, we use a dataset provided by the AICovidVN 115M Challenge, which includes a total of 7371 recorded cough sounds collected throughout Vietnam, India, and Switzerland. Experimental results show that our proposed model achieves an AUC score of 92.8% and ranks first on the leaderboard of the AICovidVN Challenge. More importantly, our proposed framework can be integrated into a call center or a VoIP system to speed up the detection of SARS-CoV-2 infections through online/recorded cough sounds.
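    A short sketch of the first-stage preprocessing described above, converting a cough recording into a log-Mel spectrogram, follows (here with librosa on a synthetic signal; the paper's exact mel parameters are not given, so these values are assumptions).

```python
# Sketch: log-Mel spectrogram of a stand-in cough recording with librosa.
# Sample rate, FFT size, hop length, and mel-band count are illustrative.
import numpy as np
import librosa

sr = 16000
y = 0.1 * np.random.randn(sr * 2).astype(np.float32)        # stand-in for a 2 s recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, n_fft=1024, hop_length=256)
log_mel = librosa.power_to_db(mel, ref=np.max)              # image-like input for the CNN stage
print(log_mel.shape)                                        # (64, num_frames)
```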
    Regularized Learning in Banach Spaces. (arXiv:2109.03159v1 [cs.LG])
    (0 min) This article presents a different way to study the theory of regularized learning for generalized data including representer theorems and convergence theorems. The generalized data are composed of linear functionals and real scalars to represent the discrete information of the local models. By the extension of the classical machine learning, the empirical risks are computed by the generalized data and the loss functions. According to the techniques of regularization, the global solutions are approximated by minimizing the regularized empirical risks over the Banach spaces. The Banach spaces are adaptively chosen to endow the generalized input data with compactness such that the existence and convergence of the approximate solutions are guaranteed by the weak* topology.
    NumGPT: Improving Numeracy Ability of Generative Pre-trained Models. (arXiv:2109.03137v1 [cs.CL])
    (0 min) Existing generative pre-trained language models (e.g., GPT) focus on modeling the language structure and semantics of general texts. However, those models do not consider the numerical properties of numbers and cannot perform robustly on numerical reasoning tasks (e.g., math word problems and measurement estimation). In this paper, we propose NumGPT, a generative pre-trained model that explicitly models the numerical properties of numbers in texts. Specifically, it leverages a prototype-based numeral embedding to encode the mantissa of the number and an individual embedding to encode the exponent of the number. A numeral-aware loss function is designed to integrate numerals into the pre-training objective of NumGPT. We conduct extensive experiments on four different datasets to evaluate the numeracy ability of NumGPT. The experiment results show that NumGPT outperforms baseline models (e.g., GPT and GPT with DICE) on a range of numerical reasoning tasks such as measurement estimation, number comparison, math word problems, and magnitude classification. Ablation studies are also conducted to evaluate the impact of pre-training and model hyperparameters on the performance.
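    The numeral embedding described above builds on separating a number's mantissa from its exponent; the tiny sketch below shows such a split (a base-10 decomposition is used here as an illustrative choice, and the embedding layers themselves are omitted).

```python
# Sketch: decompose a number into mantissa and exponent, the two parts NumGPT is
# described as embedding separately. Base-10 split is an illustrative assumption.
import math

def split_number(x):
    if x == 0:
        return 0.0, 0
    exponent = math.floor(math.log10(abs(x)))
    mantissa = x / (10 ** exponent)          # mantissa in [1, 10) up to sign
    return mantissa, exponent

for v in [3.14, 12000, 0.0042]:
    print(v, split_number(v))                # e.g. 12000 -> (1.2, 4)
```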
    Robust Density Estimation under Besov IPM Losses. (arXiv:2004.08597v2 [math.ST] UPDATED)
    (0 min) We study minimax convergence rates of nonparametric density estimation in the Huber contamination model, in which a proportion of the data comes from an unknown outlier distribution. We provide the first results for this problem under a large family of losses, called Besov integral probability metrics (IPMs), that includes $\mathcal{L}^p$, Wasserstein, Kolmogorov-Smirnov, and other common distances between probability distributions. Specifically, under a range of smoothness assumptions on the population and outlier distributions, we show that a re-scaled thresholding wavelet series estimator achieves minimax optimal convergence rates under a wide variety of losses. Finally, based on connections that have recently been shown between nonparametric density estimation under IPM losses and generative adversarial networks (GANs), we show that certain GAN architectures also achieve these minimax rates.
    Multi-Center Federated Learning. (arXiv:2005.01026v2 [cs.LG] UPDATED)
    (0 min) Federated learning has received great attention for its capability to train a large-scale model in a decentralized manner without needing to access user data directly. It helps protect the users' private data from centralized collecting. Unlike distributed machine learning, federated learning aims to tackle non-IID data from heterogeneous sources in various real-world applications, such as those on smartphones. Existing federated learning approaches usually adopt a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. However, due to the diverse nature of user behaviors, assigning users' gradients to different global models (i.e., centers) can better capture the heterogeneity of data distributions across users. Our paper proposes a novel multi-center aggregation mechanism for federated learning, which learns multiple global models from the non-IID user data and simultaneously derives the optimal matching between users and centers. We formulate the problem as a joint optimization that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Our experimental results on benchmark datasets show that our method outperforms several popular federated learning methods.
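    To make the multi-center idea concrete, here is a simplified numpy sketch of the alternating (EM-style) step: assign each client's model update to its nearest global center, then re-estimate each center from its assigned updates. The synthetic updates, Euclidean distance, and hard assignments are illustrative simplifications of the paper's stochastic EM formulation.

```python
# Sketch: k-means-like E/M alternation between client-to-center assignment and
# center re-estimation on synthetic "client update" vectors.
import numpy as np

rng = np.random.default_rng(0)
client_updates = np.vstack([rng.normal(c, 0.3, size=(10, 5)) for c in (-1.0, 0.0, 1.0)])
centers = client_updates[rng.choice(len(client_updates), 3, replace=False)]

for _ in range(10):
    dists = np.linalg.norm(client_updates[:, None, :] - centers[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)                       # E-step: match clients to centers
    centers = np.vstack([                               # M-step: mean of assigned updates
        client_updates[assign == k].mean(0) if np.any(assign == k) else centers[k]
        for k in range(3)
    ])
print(np.bincount(assign))                              # clients per global model
```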
    Deep Convolutional Neural Networks Predict Elasticity Tensors and their Bounds in Homogenization. (arXiv:2109.03020v1 [cond-mat.mtrl-sci])
    (0 min) In the present work, 3D convolutional neural networks (CNNs) are trained to link random heterogeneous, two-phase materials of arbitrary phase fractions to their elastic macroscale stiffness thus replacing explicit homogenization simulations. In order to reduce the uncertainty of the true stiffness of the synthetic composites due to unknown boundary conditions (BCs), the CNNs predict beyond the stiffness for periodic BC the upper bound through kinematically uniform BC, and the lower bound through stress uniform BC. This work describes the workflow of the homogenization-CNN, from microstructure generation over the CNN design, the operations of convolution, nonlinear activation and pooling as well as training and validation along with backpropagation up to performance measurements in tests. Therein the CNNs demonstrate the predictive accuracy not only for the standard test set but also for samples of the real, two-phase microstructure of a diamond-based coating. The CNN that covers all three boundary types is virtually as accurate as the separate treatment in three different nets. The CNNs of this contribution provide through stiffness bounds an indicator of the proper RVE size for individual snapshot samples. Moreover, they enable statistical analyses for the effective elastic stiffness on ensembles of synthetical microstructures without costly simulations.
    ICCAD Special Session Paper: Quantum-Classical Hybrid Machine Learning for Image Classification. (arXiv:2109.02862v1 [cs.CV])
    (0 min) Image classification is a major application domain for conventional deep learning (DL). Quantum machine learning (QML) has the potential to revolutionize image classification. In a typical DL-based image classification pipeline, we use a convolutional neural network (CNN) to extract features from the image and a multi-layer perceptron (MLP) to create the actual decision boundaries. QML models can be useful in both of these tasks: convolution with parameterized quantum circuits (Quanvolution) can extract rich features from the images, while quantum neural network (QNN) models can create complex decision boundaries. Therefore, Quanvolution and QNN can be combined into an end-to-end QML model for image classification. Alternatively, we can extract image features separately using classical dimension reduction techniques such as Principal Component Analysis (PCA) or a Convolutional Autoencoder (CAE) and use the extracted features to train a QNN. We review two proposals for quantum-classical hybrid ML models for image classification, namely the Quanvolutional Neural Network and dimension reduction using a classical algorithm followed by a QNN. In particular, we make a case for trainable filters in Quanvolution and for CAE-based feature extraction for image datasets (instead of dimension reduction using linear transformations such as PCA). We discuss various design choices, potential opportunities, and drawbacks of these models. We also release a Python-based framework to create and explore these hybrid models with a variety of design choices.
    BioNetExplorer: Architecture-Space Exploration of Bio-Signal Processing Deep Neural Networks for Wearables. (arXiv:2109.02909v1 [eess.SP])
    (0 min) In this work, we propose the BioNetExplorer framework to systematically generate and explore multiple DNN architectures for bio-signal processing in wearables. Our framework adapts key neural architecture parameters to search for an embedded DNN with a low hardware overhead, which can be deployed in wearable edge devices to analyse the bio-signal data and to extract the relevant information, such as arrhythmia and seizure. Our framework also enables hardware-aware DNN architecture search using genetic algorithms by imposing user requirements and hardware constraints (storage, FLOPs, etc.) during the exploration stage, thereby limiting the number of networks explored. Moreover, BioNetExplorer can also be used to search for DNNs based on the user-required output classes; for instance, a user might require a specific output class due to genetic predisposition or a pre-existing heart condition. The use of genetic algorithms reduces the exploration time, on average, by 9x, compared to exhaustive exploration. We are successful in identifying Pareto-optimal designs, which can reduce the storage overhead of the DNN by ~30MB for a quality loss of less than 0.5%. To enable low-cost embedded DNNs, BioNetExplorer also employs different model compression techniques to further reduce the storage overhead of the network by up to 53x for a quality loss of <0.2%.
    Understanding Model Drift in a Large Cellular Network. (arXiv:2109.03011v1 [cs.NI])
    (2 min) Operational networks are increasingly using machine learning models for a variety of tasks, including detecting anomalies, inferring application performance, and forecasting demand. Accurate models are important, yet accuracy can degrade over time due to concept drift, whereby either the characteristics of the data change over time (data drift) or the relationship between the features and the target predictor changes over time (model drift). Drift is important to detect because changes in properties of the underlying data or relationships to the target prediction can require model retraining, which can be time-consuming and expensive. Concept drift occurs in operational networks for a variety of reasons, ranging from software upgrades to seasonality to changes in user behavior. Yet, despite the prevalence of drift in networks, its extent and effects on prediction accuracy have not been extensively studied. This paper presents an initial exploration into concept drift in a large cellular network in the United States for a major metropolitan area in the context of demand forecasting. We find that concept drift arises largely due to data drift, and it appears across different key performance indicators (KPIs), models, training set sizes, and time intervals. We identify the sources of concept drift for the particular problem of forecasting downlink volume. Weekly and seasonal patterns introduce both high- and low-frequency model drift, while disasters and upgrades result in sudden drift due to exogenous shocks. Regions with high population density, lower traffic volumes, and higher speeds also tend to correlate with more concept drift. The features that contribute most significantly to concept drift are User Equipment (UE) downlink packets, UE uplink packets, and Real-time Transport Protocol (RTP) total received packets.
    Benchmarking Robustness of Deep Learning Classifiers Using Two-Factor Perturbation. (arXiv:2103.03102v3 [cs.LG] UPDATED)
    (2 min) The accuracy of DL classifiers is unstable in that it often changes significantly when retested on adversarial, imperfect, or perturbed images. This paper adds to the small but fundamental body of work on benchmarking the robustness of DL classifiers on defective images. Unlike existing single-factor digital perturbation work, we provide state-of-the-art two-factor perturbation that applies two natural perturbations to images in different sequences. The two-factor perturbations include (1) two digital perturbations (salt & pepper noise and Gaussian noise) applied in both sequences, and (2) one digital perturbation (salt & pepper noise) and one geometric perturbation (rotation) applied in different sequences. To measure robust DL classifiers, previous work provided 15 types of single-factor corruption. We created 69 benchmarking image sets, including a clean set, sets with single-factor perturbations, and sets with two-factor perturbation conditions. To the best of our knowledge, this is the first report showing that two-factor perturbed images improve both the robustness and accuracy of DL classifiers. Previous research evaluating deep learning (DL) classifiers has often used top-1/top-5 accuracy, so researchers have usually offered tables, line diagrams, and bar charts to display the accuracy of DL classifiers; however, these existing approaches cannot quantitatively evaluate the robustness of DL classifiers. We introduce a new two-dimensional statistical visualization tool, combining mean accuracy and coefficient of variation (CV), to benchmark the robustness of DL classifiers. All source code and related image sets are shared online (this http URL or https://github.com/daiweiworking/RobustDeepLearningUsingPerturbations ) to support future academic research and industry projects.
    Certified Robustness to Programmable Transformations in LSTMs. (arXiv:2102.07818v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks for natural language processing are fragile in the face of adversarial examples -- small input perturbations, like synonym substitution or word duplication, which cause a neural network to change its prediction. We present an approach to certifying the robustness of LSTMs (and extensions of LSTMs) and training models that can be efficiently certified. Our approach can certify robustness to intractably large perturbation spaces defined programmatically in a language of string transformations. Our evaluation shows that (1) our approach can train models that are more robust to combinations of string transformations than those produced using existing techniques; (2) our approach can show high certification accuracy of the resulting models.
    Adversarial Parameter Defense by Multi-Step Risk Minimization. (arXiv:2109.02889v1 [cs.LG])
    (0 min) Previous studies demonstrate DNNs' vulnerability to adversarial examples and show that adversarial training can establish a defense against them. In addition, recent studies show that deep neural networks also exhibit vulnerability to parameter corruptions. The vulnerability of model parameters is of crucial value to the study of model robustness and generalization. In this work, we introduce the concept of parameter corruption and propose to leverage loss change indicators for measuring the flatness of the loss basin and the robustness of neural network parameters. On this basis, we analyze parameter corruptions and propose a multi-step adversarial corruption algorithm. To enhance neural networks, we propose an adversarial parameter defense algorithm that minimizes the average risk of multiple adversarial parameter corruptions. Experimental results show that the proposed algorithm can improve both the parameter robustness and accuracy of neural networks.
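    A hedged PyTorch sketch of the core idea of a multi-step, loss-ascending parameter corruption; the step size, number of steps, and gradient normalization are illustrative assumptions rather than the paper's exact algorithm.
        import torch

        def multi_step_parameter_corruption(model, loss_fn, data, target, steps=3, eps=1e-2):
            # Repeatedly perturb the parameters along the loss-ascent direction
            # to probe how flat (robust) the loss basin is around the current weights.
            for _ in range(steps):
                model.zero_grad()
                loss = loss_fn(model(data), target)
                loss.backward()
                with torch.no_grad():
                    for p in model.parameters():
                        if p.grad is not None:
                            p.add_(eps * p.grad / (p.grad.norm() + 1e-12))  # ascend the loss
            return model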
    Scale-invariant representation of machine learning. (arXiv:2109.02914v1 [cs.LG])
    (0 min) The success of machine learning stems from its structured data representation. Similar data have close representations as compressed codes for classification or emergent labels for clustering. We observe that the frequency of the internal representation follows power laws in both supervised and unsupervised learning. The scale-invariant distribution implies that machine learning largely compresses frequent typical data, and at the same time, differentiates many atypical data as outliers. In this study, we derive how the power laws can naturally arise in machine learning. In terms of information theory, the scale-invariant representation corresponds to a maximally uncertain data grouping among possible representations that guarantee pre-specified learning accuracy.
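    A small numpy sketch of how one might check the claimed power-law behavior empirically: count how often each internal representation (e.g., a discrete code or cluster label) occurs and fit the slope of the rank-frequency curve on log-log axes. This diagnostic is an illustration, not the paper's derivation.
        import numpy as np
        from collections import Counter

        def rank_frequency_slope(codes):
            # codes: iterable of discrete representation labels assigned to samples
            freqs = np.array(sorted(Counter(codes).values(), reverse=True), dtype=float)
            ranks = np.arange(1, len(freqs) + 1, dtype=float)
            # Slope of log(frequency) vs log(rank); a value near -1 suggests Zipf-like scaling.
            slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
            return slope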
    Revisiting Recursive Least Squares for Training Deep Neural Networks. (arXiv:2109.03220v1 [cs.LG])
    (2 min) Recursive least squares (RLS) algorithms were once widely used for training small-scale neural networks, due to their fast convergence. However, previous RLS algorithms are unsuitable for training deep neural networks (DNNs), since they have high computational complexity and too many preconditions. In this paper, to overcome these drawbacks, we propose three novel RLS optimization algorithms for training feedforward neural networks, convolutional neural networks and recurrent neural networks (including long short-term memory networks), by using the error backpropagation and our average-approximation RLS method, together with the equivalent gradients of the linear least squares loss function with respect to the linear outputs of hidden layers. Compared with previous RLS optimization algorithms, our algorithms are simple and elegant. They can be viewed as an improved stochastic gradient descent (SGD) algorithm, which uses the inverse autocorrelation matrix of each layer as the adaptive learning rate. Their time and space complexities are only several times those of SGD. They only require the loss function to be the mean squared error and the activation function of the output layer to be invertible. In fact, our algorithms can be also used in combination with other first-order optimization algorithms without requiring these two preconditions. In addition, we present two improved methods for our algorithms. Finally, we demonstrate their effectiveness compared to the Adam algorithm on MNIST, CIFAR-10 and IMDB datasets, and investigate the influences of their hyperparameters experimentally.
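    For context, a textbook recursive least squares update for a single linear model in numpy; it illustrates the inverse-autocorrelation-matrix idea mentioned above, but it is not the paper's layer-wise average-approximation algorithm, and the forgetting factor and initialization are illustrative.
        import numpy as np

        def rls_fit(X, y, lam=0.99, delta=1e2):
            # Classic RLS for y ≈ w·x with forgetting factor lam; P tracks the
            # inverse autocorrelation matrix that acts as an adaptive learning rate.
            d = X.shape[1]
            w = np.zeros(d)
            P = delta * np.eye(d)
            for x, t in zip(X, y):
                Px = P @ x
                k = Px / (lam + x @ Px)          # gain vector
                w = w + k * (t - w @ x)          # correct by the prediction error
                P = (P - np.outer(k, Px)) / lam  # update the inverse autocorrelation
            return w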
    Injecting Entity Types into Entity-Guided Text Generation. (arXiv:2009.13401v3 [cs.CL] UPDATED)
    (2 min) Recent successes in deep generative modeling have led to significant advances in natural language generation (NLG). Incorporating entities into neural generation models has demonstrated great improvements by helping to infer the summary topic and to generate coherent content. To enhance the role of entities in NLG, in this paper, we aim to model the entity type in the decoding phase to generate contextual words accurately. We develop a novel NLG model to produce a target sequence based on a given list of entities. Our model has a multi-step decoder that injects the entity types into the process of entity mention generation. Experiments on two public news datasets demonstrate that type injection performs better than existing type embedding concatenation baselines.
    ExCode-Mixed: Explainable Approaches towards Sentiment Analysis on Code-Mixed Data using BERT models. (arXiv:2109.03200v1 [cs.AI])
    (2 min) The increasing use of social media sites in countries like India has given rise to large volumes of code-mixed data. Sentiment analysis of this data can provide integral insights into people's perspectives and opinions. Developing robust explainability techniques which explain why models make their predictions becomes essential. In this paper, we propose an adequate methodology to integrate explainable approaches into code-mixed sentiment analysis.
    Safe-Critical Modular Deep Reinforcement Learning with Temporal Logic through Gaussian Processes and Control Barrier Functions. (arXiv:2109.02791v1 [cs.RO])
    (2 min) Reinforcement learning (RL) is a promising approach that has seen only limited success in real-world applications, because ensuring safe exploration and facilitating adequate exploitation are challenges when controlling robotic systems with unknown models and measurement uncertainties. Such a learning problem becomes even more intractable for complex tasks over continuous spaces (state space and action space). In this paper, we propose a learning-based control framework consisting of several aspects: (1) linear temporal logic (LTL) is leveraged to specify complex tasks over infinite horizons, which are translated into a novel automaton structure; (2) we propose an innovative reward scheme for the RL agent with the formal guarantee that globally optimal policies maximize the probability of satisfying the LTL specifications; (3) based on a reward shaping technique, we develop a modular policy-gradient architecture that utilizes the benefits of the automaton structure to decompose overall tasks and improve the performance of learned controllers; (4) by incorporating Gaussian Processes (GPs) to estimate the uncertain dynamical systems, we synthesize a model-based safeguard using Exponential Control Barrier Functions (ECBFs) to address problems with high-order relative degrees. In addition, we utilize the properties of LTL automata and ECBFs to construct a guiding process that further improves the efficiency of exploration. Finally, we demonstrate the effectiveness of the framework in several robotic environments and show that such an ECBF-based modular deep RL algorithm achieves near-perfect success rates and guards safety with high probability during training.
    Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks. (arXiv:2109.03040v1 [cs.LG])
    (2 min) Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g. visual systems, robotics etc. However, existing software solutions are not efficient. Therefore, many hardware accelerators have been proposed to optimize the performance, power and resource utilization of the implementation. Amongst existing solutions, Field Programmable Gate Array (FPGA) based architectures provide better cost-energy-performance trade-offs as well as scalability while minimizing development time. In this paper, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs. Our architecture consists of parallel Multiply and Accumulate (MAC) units with caching techniques and interconnection networks to exploit maximum data parallelism. In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations. As a result, our architecture achieves a significant reduction in resource utilization with competitive accuracy. Furthermore, we developed assembly-type microinstructions to access the co-processing fabric and manage layer-wise parallelism, thereby reusing limited resources. Finally, we tested our architecture with kernel sizes up to 9x9 on a Xilinx Virtex 7 FPGA, achieving a throughput of up to 226.2 GOp/s for a 3x3 kernel size.
    Iterative Pseudo-Labeling with Deep Feature Annotation and Confidence-Based Sampling. (arXiv:2109.02717v1 [cs.LG])
    (2 min) Training deep neural networks is challenging when large and annotated datasets are unavailable. Extensive manual annotation of data samples is time-consuming, expensive, and error-prone, notably when it needs to be done by experts. To address this issue, increased attention has been devoted to techniques that propagate uncertain labels (also called pseudo labels) to large amounts of unsupervised samples and use them for training the model. However, these techniques still need hundreds of supervised samples per class in the training set and a validation set with extra supervised samples to tune the model. We improve a recent iterative pseudo-labeling technique, Deep Feature Annotation (DeepFA), by selecting the most confident unsupervised samples to iteratively train a deep neural network. Our confidence-based sampling strategy relies on only dozens of annotated training samples per class with no validation set, considerably reducing user effort in data annotation. We first ascertain the best configuration for the baseline -- a self-trained deep neural network -- and then evaluate our confidence DeepFA for different confidence thresholds. Experiments on six datasets show that DeepFA already outperforms the self-trained baseline, but confidence DeepFA can considerably outperform the original DeepFA and the baseline.
    Large-Scale System Identification Using a Randomized SVD. (arXiv:2109.02703v1 [math.OC])
    (2 min) Learning a dynamical system from input/output data is a fundamental task in the control design pipeline. In the partially observed setting there are two components to identification: parameter estimation to learn the Markov parameters, and system realization to obtain a state space model. In both sub-problems it is implicitly assumed that standard numerical algorithms such as the singular value decomposition (SVD) can be easily and reliably computed. When trying to fit a high-dimensional model to data, for example in the cyber-physical system setting, even computing an SVD is intractable. In this work we show that an approximate matrix factorization obtained using randomized methods can replace the standard SVD in the realization algorithm while maintaining the non-asymptotic (in data-set size) performance and robustness guarantees of classical methods. Numerical examples illustrate that for large system models, this is the only method capable of producing a model.
    Pano3D: A Holistic Benchmark and a Solid Baseline for $360^o$ Depth Estimation. (arXiv:2109.02749v1 [cs.CV])
    (2 min) Pano3D is a new benchmark for depth estimation from spherical panoramas. It aims to assess performance across all depth estimation traits, the primary direct depth estimation performance targeting precision and accuracy, and also the secondary traits, boundary preservation, and smoothness. Moreover, Pano3D moves beyond typical intra-dataset evaluation to inter-dataset performance assessment. By disentangling the capacity to generalize to unseen data into different test splits, Pano3D represents a holistic benchmark for $360^o$ depth estimation. We use it as a basis for an extended analysis seeking to offer insights into classical choices for depth estimation. This results in a solid baseline for panoramic depth that follow-up works can build upon to steer future progress.
    Graph Attention Layer Evolves Semantic Segmentation for Road Pothole Detection: A Benchmark and Algorithms. (arXiv:2109.02711v1 [cs.CV])
    (2 min) Existing road pothole detection approaches can be classified as computer vision-based or machine learning-based. The former approaches typically employ 2-D image analysis/understanding or 3-D point cloud modeling and segmentation algorithms to detect road potholes from vision sensor data. The latter approaches generally address road pothole detection using convolutional neural networks (CNNs) in an end-to-end manner. However, road potholes are not necessarily ubiquitous and it is challenging to prepare a large well-annotated dataset for CNN training. In this regard, while computer vision-based methods were the mainstream research trend in the past decade, machine learning-based methods were merely discussed. Recently, we published the first stereo vision-based road pothole detection dataset and a novel disparity transformation algorithm, whereby the damaged and undamaged road areas can be highly distinguished. However, there are no benchmarks currently available for state-of-the-art (SoTA) CNNs trained using either disparity images or transformed disparity images. Therefore, in this paper, we first discuss the SoTA CNNs designed for semantic segmentation and evaluate their performance for road pothole detection with extensive experiments. Additionally, inspired by graph neural network (GNN), we propose a novel CNN layer, referred to as graph attention layer (GAL), which can be easily deployed in any existing CNN to optimize image feature representations for semantic segmentation. Our experiments compare GAL-DeepLabv3+, our best-performing implementation, with nine SoTA CNNs on three modalities of training data: RGB images, disparity images, and transformed disparity images. The experimental results suggest that our proposed GAL-DeepLabv3+ achieves the best overall pothole detection accuracy on all training data modalities.
    Solving Fashion Recommendation -- The Farfetch Challenge. (arXiv:2108.01314v2 [cs.LG] UPDATED)
    (2 min) Recommendation engines are integral to the modern e-commerce experience, both for the seller and the end user. Accurate recommendations lead to higher revenue and a better user experience. In this paper, we present our solution to the ECML PKDD Farfetch Fashion Recommendation Challenge. The goal of this challenge is to maximize the chances of a click when users are presented with a set of fashion items. We approached this problem as a binary classification problem. Our winning solution utilizes CatBoost as the classifier and Bayesian optimization for hyperparameter tuning. Our baseline model achieved an MRR of 0.5153 on the validation set. Bayesian optimization of hyperparameters improved the MRR to 0.5240 on the validation set. Our final submission on the test set achieved an MRR of 0.5257.
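    A short sketch of the binary-classification framing and of the MRR metric quoted above, assuming the catboost package is available; the hyperparameters and the helper name are illustrative, not the winning configuration.
        import numpy as np
        from catboost import CatBoostClassifier  # assumes the catboost package is installed

        def mean_reciprocal_rank(groups):
            # groups: list of (scores, labels) per impression; labels marks the clicked item with 1
            rr = []
            for scores, labels in groups:
                order = np.argsort(-np.asarray(scores))
                rank = int(np.flatnonzero(np.asarray(labels)[order] == 1)[0]) + 1
                rr.append(1.0 / rank)
            return float(np.mean(rr))

        # Illustrative classifier: predict a click probability per (user, item) row,
        # then rank the items shown in each impression by that probability.
        model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.05, verbose=0)
        # model.fit(X_train, y_train)
        # scores = model.predict_proba(X_impression)[:, 1]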
    Recommending Burgers based on Pizza Preferences: Addressing Data Sparsity with a Product of Experts. (arXiv:2104.12822v3 [cs.IR] UPDATED)
    (2 min) In this paper, we describe a method to tackle data sparsity and create recommendations in domains with limited knowledge about user preferences. We expand the variational autoencoder collaborative filtering from a single-domain to a multi-domain setting. The intuition is that user-item interactions in a source domain can augment the recommendation quality in a target domain. The intuition can be taken to its extreme, where, in a cross-domain setup, the user history in a source domain is enough to generate high-quality recommendations in a target one. We thus create a Product-of-Experts (POE) architecture for recommendations that jointly models user-item interactions across multiple domains. The method is resilient to missing data for one or more of the domains, which is a situation often found in real life. We present results on two widely-used datasets - Amazon and Yelp, which support the claim that holistic user preference knowledge leads to better recommendations. Surprisingly, we find that in some cases, a POE recommender that does not access the target domain user representation can surpass a strong VAE recommender baseline trained on the target domain.
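    A minimal numpy sketch of the standard product-of-Gaussian-experts rule that such a POE architecture typically builds on, where the combined precision is the sum of the per-domain precisions; the paper's exact parameterization may differ, and domains with missing data are simply dropped from the product.
        import numpy as np

        def product_of_experts(mus, logvars):
            # mus, logvars: lists of per-domain Gaussian posterior parameters (equal-shaped arrays);
            # omit a domain from the lists when its data are missing.
            precisions = [np.exp(-lv) for lv in logvars]
            total_prec = np.sum(precisions, axis=0)
            mu = np.sum([p * m for p, m in zip(precisions, mus)], axis=0) / total_prec
            var = 1.0 / total_prec
            return mu, var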
    Using Satellite Imagery and Machine Learning to Estimate the Livelihood Impact of Electricity Access. (arXiv:2109.02890v1 [econ.GN])
    (2 min) In many regions of the world, sparse data on key economic outcomes inhibits the development, targeting, and evaluation of public policy. We demonstrate how advancements in satellite imagery and machine learning can help ameliorate these data and inference challenges. In the context of an expansion of the electrical grid across Uganda, we show how a combination of satellite imagery and computer vision can be used to develop local-level livelihood measurements appropriate for inferring the causal impact of electricity access on livelihoods. We then show how ML-based inference techniques deliver more reliable estimates of the causal impact of electrification than traditional alternatives when applied to these data. We estimate that grid access improves village-level asset wealth in rural Uganda by 0.17 standard deviations, more than doubling the growth rate over our study period relative to untreated areas. Our results provide country-scale evidence on the impact of a key infrastructure investment, and provide a low-cost, generalizable approach to future policy evaluation in data sparse environments.
    Least-Squares ReLU Neural Network (LSNN) Method For Scalar Nonlinear Hyperbolic Conservation Law. (arXiv:2105.11627v2 [math.NA] UPDATED)
    (2 min) We introduced the least-squares ReLU neural network (LSNN) method for solving the linear advection-reaction problem with discontinuous solution and showed that the method outperforms mesh-based numerical methods in terms of the number of degrees of freedom. This paper studies the LSNN method for scalar nonlinear hyperbolic conservation laws. The method is a discretization of an equivalent least-squares (LS) formulation in the set of neural network functions with the ReLU activation function. Evaluation of the LS functional is done by using numerical integration and a conservative finite volume scheme. Numerical results on some test problems show that the method is capable of approximating the discontinuous interface of the underlying problem automatically through the free breaking lines of the ReLU neural network. Moreover, the method does not exhibit the common Gibbs phenomenon along the discontinuous interface.
    On the Out-of-distribution Generalization of Probabilistic Image Modelling. (arXiv:2109.02639v1 [cs.CV])
    (2 min) Out-of-distribution (OOD) detection and lossless compression constitute two problems that can be solved by the training of probabilistic models on a first dataset with subsequent likelihood evaluation on a second dataset, where data distributions differ. By defining the generalization of probabilistic models in terms of likelihood we show that, in the case of image models, the OOD generalization ability is dominated by local features. This motivates our proposal of a Local Autoregressive model that exclusively models local image features towards improving OOD performance. We apply the proposed model to OOD detection tasks and achieve state-of-the-art unsupervised OOD detection performance without the introduction of additional data. Additionally, we employ our model to build a new lossless image compressor: NeLLoC (Neural Local Lossless Compressor) and report state-of-the-art compression rates and model size.
    Backpropagation and fuzzy algorithm Modelling to Resolve Blood Supply Chain Issues in the Covid-19 Pandemic. (arXiv:2109.02645v1 [cs.LG])
    (2 min) Blood stock shortages and uncertain demand have become a major problem for countries worldwide. This study therefore aims to address blood distribution issues during the Covid-19 pandemic in Bengkulu, Indonesia. A backpropagation algorithm was used to improve the chances of finding available and potential donors, and distance, age, and length of donation were measured to identify the right person to donate blood when it is needed. The backpropagation network uses three inputs to classify eligible donors, namely age, body weight, and bias. In addition, the system automatically queries and counts these variables via the Fuzzy Tahani method while accessing the large donor database.
    Recommendation Fairness: From Static to Dynamic. (arXiv:2109.03150v1 [cs.IR])
    (2 min) Driven by the need to capture users' evolving interests and optimize their long-term experiences, more and more recommender systems have started to model recommendation as a Markov decision process and employ reinforcement learning to address the problem. Shouldn't research on the fairness of recommender systems follow the same trend from static evaluation and one-shot intervention to dynamic monitoring and non-stop control? In this paper, we portray the recent developments in recommender systems first and then discuss how fairness could be baked into the reinforcement learning techniques for recommendation. Moreover, we argue that in order to make further progress in recommendation fairness, we may want to consider multi-agent (game-theoretic) optimization, multi-objective (Pareto) optimization, and simulation-based optimization, in the general framework of stochastic games.
    Self-adaptive deep neural network: Numerical approximation to functions and PDEs. (arXiv:2109.02839v1 [math.NA])
    (2 min) Designing an optimal deep neural network for a given task is important and challenging in many machine learning applications. To address this issue, we introduce a self-adaptive algorithm: the adaptive network enhancement (ANE) method, written as loops of the form train, estimate and enhance. Starting with a small two-layer neural network (NN), the step train is to solve the optimization problem at the current NN; the step estimate is to compute a posteriori estimator/indicators using the solution at the current NN; the step enhance is to add new neurons to the current NN. Novel network enhancement strategies based on the computed estimator/indicators are developed in this paper to determine how many new neurons and when a new layer should be added to the current NN. The ANE method provides a natural process for obtaining a good initialization in training the current NN; in addition, we introduce an advanced procedure on how to initialize newly added neurons for a better approximation. We demonstrate that the ANE method can automatically design a nearly minimal NN for learning functions exhibiting sharp transitional layers as well as discontinuous solutions of hyperbolic partial differential equations.
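    A schematic Python sketch of the train–estimate–enhance loop described above; the callable names, stopping rule, and enhancement interface are illustrative placeholders rather than the ANE method's actual implementation.
        def adaptive_network_enhancement(build_net, train, estimate, enhance,
                                         width0=2, tol=1e-3, max_rounds=10):
            # build_net(width) returns a small two-layer starting network;
            # train solves the optimization at the current network;
            # estimate returns (error_estimate, indicators) a posteriori;
            # enhance adds neurons or a new layer guided by the indicators.
            net = build_net(width0)
            for _ in range(max_rounds):
                net = train(net)                 # solve at the current NN (warm-started)
                err, indicators = estimate(net)  # a posteriori estimator/indicators
                if err < tol:
                    break
                net = enhance(net, indicators)   # grow the network where needed
            return net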
    Optimizing Quantum Variational Circuits with Deep Reinforcement Learning. (arXiv:2109.03188v1 [cs.LG])
    (2 min) Quantum Machine Learning (QML) is considered to be one of the most promising applications of near term quantum devices. However, the optimization of quantum machine learning models presents numerous challenges arising from the imperfections of hardware and the fundamental obstacles in navigating an exponentially scaling Hilbert space. In this work, we evaluate the potential of contemporary methods in deep reinforcement learning to augment gradient based optimization routines in quantum variational circuits. We find that reinforcement learning augmented optimizers consistently outperform gradient descent in noisy environments. All code and pretrained weights are available to replicate the results or deploy the models at https://github.com/lockwo/rl_qvc_opt.
    Robust Predictable Control. (arXiv:2109.03214v1 [cs.LG])
    (2 min) Many of the challenges facing today's reinforcement learning (RL) algorithms, such as robustness, generalization, transfer, and computational efficiency are closely related to compression. Prior work has convincingly argued why minimizing information is useful in the supervised learning setting, but standard RL algorithms lack an explicit mechanism for compression. The RL setting is unique because (1) its sequential nature allows an agent to use past information to avoid looking at future observations and (2) the agent can optimize its behavior to prefer states where decision making requires few bits. We take advantage of these properties to propose a method (RPC) for learning simple policies. This method brings together ideas from information bottlenecks, model-based RL, and bits-back coding into a simple and theoretically-justified algorithm. Our method jointly optimizes a latent-space model and policy to be self-consistent, such that the policy avoids states where the model is inaccurate. We demonstrate that our method achieves much tighter compression than prior methods, achieving up to 5x higher reward than a standard information bottleneck. We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.
    Dual-constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior. (arXiv:2009.03714v2 [cs.LG] UPDATED)
    (2 min) Nonnegative matrix factorization is usually powerful for learning the "shallow" parts-based representation, but it clearly fails to discover deep hierarchical information within both the basis and representation spaces. In this paper, we propose a new enriched prior based Dual-constrained Deep Semi-Supervised Coupled Factorization Network, called DS2CF-Net, for learning the hierarchical coupled representations. To extract hidden deep features, DS2CF-Net is modeled as a deep-structure and geometrical structure-constrained neural network. Specifically, DS2CF-Net designs a deep coupled factorization architecture using multiple layers of linear transformations, which jointly updates the bases and new representations in each layer. To improve the discriminating ability of the learned deep representations and deep coefficients, our network enriches the supervised prior by the joint deep coefficients-regularized label prediction, and incorporates the enriched prior information as additional label and structure constraints. The label constraint can enable samples of the same label to have the same coordinates in the new feature space, while the structure constraint forces the coefficient matrices in each layer to be block-diagonal so that the enhanced prior using the self-expressive label propagation is more accurate. Our network also integrates adaptive dual-graph learning to retain the local manifold structures of both the data manifold and the feature manifold by minimizing the reconstruction errors in each layer. Extensive experiments on several real databases demonstrate that our DS2CF-Net can obtain state-of-the-art performance for representation learning and clustering.
    Instance-dependent Label-noise Learning under a Structural Causal Model. (arXiv:2109.02986v1 [stat.ML])
    (2 min) Label noise will degenerate the performance of deep learning algorithms because deep neural networks easily overfit label errors. Let X and Y denote the instance and clean label, respectively. When Y is a cause of X, according to which many datasets have been constructed, e.g., SVHN and CIFAR, the distributions of P(X) and P(Y|X) are entangled. This means that the unsupervised instances are helpful to learn the classifier and thus reduce the side effect of label noise. However, it remains elusive on how to exploit the causal information to handle the label noise problem. In this paper, by leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning. In particular, we show that properly modeling the instances will contribute to the identifiability of the label noise transition matrix and thus lead to a better classifier. Empirically, our method outperforms all state-of-the-art methods on both synthetic and real-world label-noise datasets.
    PEEK: A Large Dataset of Learner Engagement with Educational Videos. (arXiv:2109.03154v1 [cs.IR])
    (2 min) Educational recommenders have received much less attention in comparison to e-commerce and entertainment-related recommenders, even though efficient intelligent tutors have great potential to improve learning gains. One of the main challenges in advancing this research direction is the scarcity of large, publicly available datasets. In this work, we release a large, novel dataset of learners engaging with educational videos in-the-wild. The dataset, named Personalised Educational Engagement with Knowledge Topics (PEEK), is the first publicly available dataset of this nature. The video lectures have been associated with Wikipedia concepts related to the material of the lecture, thus providing a humanly intuitive taxonomy. We believe that granular learner engagement signals in unison with rich content representations will pave the way to building powerful personalization algorithms that will revolutionise educational and informational recommendation systems. Towards this goal, we 1) construct a novel dataset from a popular video lecture repository, 2) identify a set of benchmark algorithms to model engagement, and 3) run extensive experimentation on the PEEK dataset to demonstrate its value. Our experiments with the dataset show promise in building powerful informational recommender systems. The dataset and the supporting code are publicly available.
    Efficient ADMM-based Algorithms for Convolutional Sparse Coding. (arXiv:2109.02969v1 [cs.LG])
    (2 min) Convolutional sparse coding improves on the standard sparse approximation by incorporating a global shift-invariant model. The most efficient convolutional sparse coding methods are based on the alternating direction method of multipliers and the convolution theorem. The only major difference between these methods is how they approach a convolutional least-squares fitting subproblem. This letter presents a solution to this subproblem, which improves the efficiency of the state-of-the-art algorithms. We also use the same approach for developing an efficient convolutional dictionary learning method. Furthermore, we propose a novel algorithm for convolutional sparse coding with a constraint on the approximation error.
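    For orientation, a generic numpy ADMM loop for the sparse coding problem $\min_x \frac{1}{2}\|Dx-s\|^2 + \lambda\|x\|_1$; in the convolutional setting discussed above, the least-squares x-update is instead solved efficiently in the Fourier domain via the convolution theorem. This dense-dictionary version is only an illustrative sketch.
        import numpy as np

        def soft(v, t):
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        def admm_sparse_code(D, s, lam, rho=1.0, iters=100):
            n = D.shape[1]
            z = np.zeros(n)
            u = np.zeros(n)
            A = np.linalg.inv(D.T @ D + rho * np.eye(n))   # cache the x-update system
            Dts = D.T @ s
            for _ in range(iters):
                x = A @ (Dts + rho * (z - u))              # least-squares fitting subproblem
                z = soft(x + u, lam / rho)                 # proximal step (soft-thresholding)
                u = u + x - z                              # dual variable update
            return z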
    PAUSE: Positive and Annealed Unlabeled Sentence Embedding. (arXiv:2109.03155v1 [cs.CL])
    (2 min) Sentence embedding refers to a set of effective and versatile techniques for converting raw text into numerical vector representations that can be used in a wide range of natural language processing (NLP) applications. The majority of these techniques are either supervised or unsupervised. Compared to the unsupervised methods, the supervised ones make fewer assumptions about optimization objectives and usually achieve better results. However, the training requires a large number of labeled sentence pairs, which are not available in many industrial scenarios. To that end, we propose a generic and end-to-end approach -- PAUSE (Positive and Annealed Unlabeled Sentence Embedding), capable of learning high-quality sentence embeddings from a partially labeled dataset. We experimentally show that PAUSE achieves, and sometimes surpasses, state-of-the-art results using only a small fraction of labeled sentence pairs on various benchmark tasks. When applied to a real industrial use case where labeled samples are scarce, PAUSE encourages us to extend our dataset without the liability of extensive manual annotation work.
    COCO Denoiser: Using Co-Coercivity for Variance Reduction in Stochastic Convex Optimization. (arXiv:2109.03207v1 [cs.LG])
    (2 min) First-order methods for stochastic optimization have undeniable relevance, in part due to their pivotal role in machine learning. Variance reduction for these algorithms has become an important research topic. In contrast to common approaches, which rarely leverage global models of the objective function, we exploit convexity and L-smoothness to improve the noisy estimates outputted by the stochastic gradient oracle. Our method, named COCO denoiser, is the joint maximum likelihood estimator of multiple function gradients from their noisy observations, subject to co-coercivity constraints between them. The resulting estimate is the solution of a convex Quadratically Constrained Quadratic Problem. Although this problem is expensive to solve by interior point methods, we exploit its structure to apply an accelerated first-order algorithm, the Fast Dual Proximal Gradient method. Besides analytically characterizing the proposed estimator, we show empirically that increasing the number and proximity of the queried points leads to better gradient estimates. We also apply COCO in stochastic settings by plugging it in existing algorithms, such as SGD, Adam or STRSAGA, outperforming their vanilla versions, even in scenarios where our modelling assumptions are mismatched.
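    Under an isotropic Gaussian noise model for the oracle outputs, one natural way to write the resulting quadratically constrained quadratic problem is the following sketch, consistent with the description above (the paper's exact formulation may add scaling or regularization): minimize over $g_1,\dots,g_k$ the objective $\sum_{i=1}^{k}\|g_i-\hat g_i\|^2$ subject to $\|g_i-g_j\|^2 \le L\,\langle g_i-g_j,\; x_i-x_j\rangle$ for all $i,j$, where $\hat g_i$ is the noisy gradient observed at query point $x_i$ and the constraints encode co-coercivity of the gradient of an $L$-smooth convex function.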
    Learning to Bid in Contextual First Price Auctions. (arXiv:2109.03173v1 [cs.LG])
    (2 min) In this paper, we investigate the problem about how to bid in repeated contextual first price auctions. We consider a single bidder (learner) who repeatedly bids in the first price auctions: at each time $t$, the learner observes a context $x_t\in \mathbb{R}^d$ and decides the bid based on historical information and $x_t$. We assume a structured linear model of the maximum bid of all the others $m_t = \alpha_0\cdot x_t + z_t$, where $\alpha_0\in \mathbb{R}^d$ is unknown to the learner and $z_t$ is randomly sampled from a noise distribution $\mathcal{F}$ with log-concave density function $f$. We consider both \emph{binary feedback} (the learner can only observe whether she wins or not) and \emph{full information feedback} (the learner can observe $m_t$) at the end of each time $t$. For binary feedback, when the noise distribution $\mathcal{F}$ is known, we propose a bidding algorithm, by using maximum likelihood estimation (MLE) method to achieve at most $\widetilde{O}(\sqrt{\log(d) T})$ regret. Moreover, we generalize this algorithm to the setting with binary feedback and the noise distribution is unknown but belongs to a parametrized family of distributions. For the full information feedback with \emph{unknown} noise distribution, we provide an algorithm that achieves regret at most $\widetilde{O}(\sqrt{dT})$. Our approach combines an estimator for log-concave density functions and then MLE method to learn the noise distribution $\mathcal{F}$ and linear weight $\alpha_0$ simultaneously. We also provide a lower bound result such that any bidding policy in a broad class must achieve regret at least $\Omega(\sqrt{T})$, even when the learner receives the full information feedback and $\mathcal{F}$ is known.
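    To make the binary-feedback estimator concrete: under the stated model the learner wins at time $t$ exactly when $b_t > m_t = \alpha_0\cdot x_t + z_t$, so the win probability is $F(b_t - \alpha_0\cdot x_t)$, and the MLE step fits $\alpha$ by maximizing the Bernoulli log-likelihood $\hat\alpha = \arg\max_{\alpha} \sum_{t} \big[ w_t \log F(b_t - \alpha\cdot x_t) + (1-w_t)\log\big(1 - F(b_t - \alpha\cdot x_t)\big) \big]$, where $w_t\in\{0,1\}$ indicates whether the learner won round $t$. This is a sketch of the natural objective implied by the abstract, not necessarily the paper's exact estimator.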
    Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders. (arXiv:2005.02921v2 [stat.ME] UPDATED)
    (3 min) Random effect models are popular statistical models for detecting and correcting spurious sample correlations due to hidden confounders in genome-wide gene expression data. In applications where some confounding factors are known, estimating simultaneously the contribution of known and latent variance components in random effect models is a challenge that has so far relied on numerical gradient-based optimizers to maximize the likelihood function. This is unsatisfactory because the resulting solution is poorly characterized and the efficiency of the method may be suboptimal. Here we prove analytically that maximum-likelihood latent variables can always be chosen orthogonal to the known confounding factors, in other words, that maximum-likelihood latent variables explain sample covariances not already explained by known factors. Based on this result we propose a restricted maximum-likelihood method which estimates the latent variables by maximizing the likelihood on the restricted subspace orthogonal to the known confounding factors, and show that this reduces to probabilistic PCA on that subspace. The method then estimates the variance-covariance parameters by maximizing the remaining terms in the likelihood function given the latent variables, using a newly derived analytic solution for this problem. Compared to gradient-based optimizers, our method attains greater or equal likelihood values, can be computed using standard matrix operations, results in latent factors that don't overlap with any known factors, and has a runtime reduced by several orders of magnitude. Hence the restricted maximum-likelihood method facilitates the application of random effect modelling strategies for learning latent variance components to much larger gene expression datasets than possible with current methods.
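    A compact numpy sketch of the two-step structure implied by the result above: residualize the expression matrix against the known confounders, then run PCA on that restricted subspace to obtain latent variables orthogonal to the known factors. The variable names and the plain-SVD shortcut (in place of the full probabilistic PCA and variance-component estimation) are illustrative.
        import numpy as np

        def latent_factors_orthogonal_to_known(Y, X, k):
            # Y: samples x genes expression matrix; X: samples x known-confounder design matrix
            Q, _ = np.linalg.qr(X)                     # orthonormal basis of the known factors
            R = Y - Q @ (Q.T @ Y)                      # remove variation already explained by X
            U, s, _ = np.linalg.svd(R, full_matrices=False)
            return U[:, :k] * s[:k]                    # sample scores of the top-k latent components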
    On the Convergence of Decentralized Adaptive Gradient Methods. (arXiv:2109.03194v1 [cs.LG])
    (2 min) Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization algorithms are rapidly becoming a focal point. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In this paper, we introduce novel convergent decentralized adaptive gradient methods and rigorously incorporate adaptive gradient methods into decentralized training procedures. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent. We illustrate the benefit of our generic decentralized framework on a prototype method, i.e., AMSGrad, both theoretically and numerically.
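    A toy numpy sketch of the "decentralize an adaptive method" recipe on the prototype named above: each node takes a local AMSGrad step and then gossip-averages its parameters with neighbors through a mixing matrix W. The step sizes, mixing matrix, and update order are illustrative assumptions, not the paper's exact framework.
        import numpy as np

        def decentralized_amsgrad_step(xs, grads, states, W, lr=1e-3,
                                       b1=0.9, b2=0.999, eps=1e-8):
            # xs: (n_nodes, d) parameters; grads: (n_nodes, d) local stochastic gradients;
            # states: dict of (n_nodes, d) arrays m, v, vhat; W: doubly-stochastic mixing matrix.
            m, v, vhat = states["m"], states["v"], states["vhat"]
            m[:] = b1 * m + (1 - b1) * grads
            v[:] = b2 * v + (1 - b2) * grads ** 2
            vhat[:] = np.maximum(vhat, v)              # AMSGrad's non-decreasing second moment
            xs = xs - lr * m / (np.sqrt(vhat) + eps)   # local adaptive step
            return W @ xs                              # gossip averaging with neighbors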
    Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression. (arXiv:2109.03228v1 [cs.CL])
    (2 min) Recent studies on compression of pretrained language models (e.g., BERT) usually use preserved accuracy as the metric for evaluation. In this paper, we propose two new metrics, label loyalty and probability loyalty that measure how closely a compressed model (i.e., student) mimics the original model (i.e., teacher). We also explore the effect of compression with regard to robustness under adversarial attacks. We benchmark quantization, pruning, knowledge distillation and progressive module replacing with loyalty and robustness. By combining multiple compression techniques, we provide a practical strategy to achieve better accuracy, loyalty and robustness.
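    A hedged numpy sketch of how the two metrics could be computed from teacher and student output distributions; both definitions here are plausible readings of the abstract (exact-label agreement and a total-variation-based distributional agreement), and the paper's precise formulas may differ.
        import numpy as np

        def label_loyalty(teacher_probs, student_probs):
            # Fraction of examples where the student predicts the teacher's label.
            return float(np.mean(teacher_probs.argmax(1) == student_probs.argmax(1)))

        def probability_loyalty(teacher_probs, student_probs):
            # Distribution-level agreement via mean total-variation similarity (illustrative choice).
            tv = 0.5 * np.abs(teacher_probs - student_probs).sum(axis=1)
            return float(np.mean(1.0 - tv))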
    Intuitive Contrasting Map for Antonym Embeddings. (arXiv:2004.12835v2 [cs.CL] UPDATED)
    (2 min) This paper shows that modern word embeddings contain information that distinguishes synonyms and antonyms despite small cosine similarities between the corresponding vectors. This information is encoded in the geometry of the embeddings and can be extracted with a straightforward and intuitive manifold learning procedure, or a contrasting map. Such a map is trained on a small labeled subset of the data and can produce new embeddings that explicitly highlight specific semantic attributes of the word. The new embeddings produced by the map are shown to improve performance on downstream tasks.
    Learning TSP Requires Rethinking Generalization. (arXiv:2006.07054v3 [cs.LG] UPDATED)
    (2 min) End-to-end training of neural network solvers for combinatorial optimization problems such as the Travelling Salesman Problem is intractable and inefficient beyond a few hundred nodes. While state-of-the-art Machine Learning approaches perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances of practical scales. Towards leveraging transfer learning to solve large-scale TSPs, this paper identifies inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zero-shot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols.
    BERT based classification system for detecting rumours on Twitter. (arXiv:2109.02975v1 [cs.LG])
    (2 min) The role of social media in opinion formation has far-reaching implications in all spheres of society. Though social media provide platforms for expressing news and views, it is hard to control the quality of posts due to the sheer volumes of posts on platforms like Twitter and Facebook. Misinformation and rumours have lasting effects on society, as they tend to influence people's opinions and also may motivate people to act irrationally. It is therefore very important to detect and remove rumours from these platforms. The only way to prevent the spread of rumours is through automatic detection and classification of social media posts. Our focus in this paper is Twitter, as it is relatively easy to collect data from this platform. The majority of previous studies used supervised learning approaches to classify rumours on Twitter. These approaches rely on feature extraction to obtain both content and context features from the text of tweets to distinguish rumours and non-rumours. Manually extracting features, however, is time-consuming considering the volume of tweets. We propose a novel approach to deal with this problem, utilising BERT sentence embeddings to identify rumours on Twitter rather than the usual feature extraction techniques. We use BERT sentence embeddings to represent each tweet as a vector according to its contextual meaning, and we classify those vectors into rumours or non-rumours using various supervised learning techniques. Our BERT-based models improved accuracy by approximately 10% compared to previous methods.
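    A hedged sketch of the embed-then-classify pipeline described above, using the Hugging Face transformers library to take the [CLS] vector of each tweet and a scikit-learn classifier on top; the model choice, pooling strategy, and classifier are illustrative assumptions rather than the paper's exact setup.
        import torch
        from transformers import AutoModel, AutoTokenizer
        from sklearn.linear_model import LogisticRegression

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        encoder = AutoModel.from_pretrained("bert-base-uncased")

        def embed(tweets):
            enc = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
            with torch.no_grad():
                out = encoder(**enc)
            return out.last_hidden_state[:, 0, :].numpy()  # [CLS] embedding per tweet

        # clf = LogisticRegression(max_iter=1000).fit(embed(train_tweets), train_labels)
        # preds = clf.predict(embed(test_tweets))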
    Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model. (arXiv:2109.02773v1 [cs.SD])
    (2 min) An emerging trend in audio processing is capturing low-level speech representations from raw waveforms. These representations have shown promising results on a variety of tasks, such as speech recognition and speech separation. Compared to handcrafted features, learning speech features via backpropagation theoretically gives the model greater flexibility in how it represents data for different tasks. However, empirical studies show that, in some tasks such as voice spoof detection, handcrafted features are more competitive than learned features. Instead of evaluating handcrafted features and raw waveforms independently, this paper proposes an Auxiliary Rawnet model to complement handcrafted features with features learned from raw waveforms. A key benefit of the approach is that it can improve accuracy at a relatively low computational cost. The proposed Auxiliary Rawnet model is tested on the ASVspoof 2019 dataset, and the results indicate that a light-weight waveform encoder can potentially boost the performance of handcrafted-feature-based encoders in exchange for a small amount of additional computational work.
    Brand Label Albedo Extraction of eCommerce Products using Generative Adversarial Network. (arXiv:2109.02929v1 [cs.CV])
    (2 min) In this paper we present our solution for extracting the albedo of branded labels on e-commerce products. To this end, we generate a large-scale photo-realistic synthetic dataset for albedo extraction, followed by training a generative model to translate images with diverse lighting conditions to albedo. We performed an extensive evaluation to test how well our method generalizes to in-the-wild images. From the experimental results, we observe that our solution generalizes well compared to the existing method, both on unseen rendered images and on in-the-wild images.
    A Scalable AI Approach for Clinical Trial Cohort Optimization. (arXiv:2109.02808v1 [cs.CL])
    (2 min) FDA has been promoting enrollment practices that could enhance the diversity of clinical trial populations, through broadening eligibility criteria. However, how to broaden eligibility remains a significant challenge. We propose an AI approach to Cohort Optimization (AICO) through transformer-based natural language processing of the eligibility criteria and evaluation of the criteria using real-world data. The method can extract common eligibility criteria variables from a large set of relevant trials and measure the generalizability of trial designs to real-world patients. It overcomes the scalability limits of existing manual methods and enables rapid simulation of eligibility criteria design for a disease of interest. A case study on breast cancer trial design demonstrates the utility of the method in improving trial generalizability.
    SS-BERT: Mitigating Identity Terms Bias in Toxic Comment Classification by Utilising the Notion of "Subjectivity" and "Identity Terms". (arXiv:2109.02691v1 [cs.CL])
    (2 min) Toxic comment classification models are often found biased toward identity terms which are terms characterizing a specific group of people such as "Muslim" and "black". Such bias is commonly reflected in false-positive predictions, i.e. non-toxic comments with identity terms. In this work, we propose a novel approach to tackle such bias in toxic comment classification, leveraging the notion of subjectivity level of a comment and the presence of identity terms. We hypothesize that when a comment is made about a group of people that is characterized by an identity term, the likelihood of that comment being toxic is associated with the subjectivity level of the comment, i.e. the extent to which the comment conveys personal feelings and opinions. Building upon the BERT model, we propose a new structure that is able to leverage these features, and thoroughly evaluate our model on 4 datasets of varying sizes and representing different social media platforms. The results show that our model can consistently outperform BERT and a SOTA model devised to address identity term bias in a different way, with a maximum improvement in F1 of 2.43% and 1.91% respectively.
    gen2Out: Detecting and Ranking Generalized Anomalies. (arXiv:2109.02704v1 [cs.LG])
    (2 min) In a cloud of m-dimensional data points, how would we spot, as well as rank, both single-point and group anomalies? We are the first to generalize anomaly detection in two dimensions: The first dimension is that we handle both point-anomalies and group-anomalies under a unified view -- we shall refer to them as generalized anomalies. The second dimension is that gen2Out not only detects, but also ranks, anomalies in suspiciousness order. Detection and ranking of anomalies has numerous applications: for example, in EEG recordings of an epileptic patient, an anomaly may indicate a seizure; in computer network traffic data, it may signify a power failure or a DoS/DDoS attack. We start by setting some reasonable axioms; surprisingly, none of the earlier methods pass all the axioms. Our main contribution is the gen2Out algorithm, which has the following desirable properties: (a) Principled and sound anomaly scoring that obeys the axioms for detectors, (b) Doubly general in that it detects, as well as ranks, generalized anomalies -- both point- and group-anomalies, (c) Scalable, it is fast and scalable, linear in input size, (d) Effective, experiments on real-world epileptic recordings (200GB) demonstrate the effectiveness of gen2Out as confirmed by clinicians. Experiments on 27 real-world benchmark datasets show that gen2Out detects ground truth groups, matches or outperforms point-anomaly baseline algorithms on accuracy, with no competition for group-anomalies, and requires about 2 minutes for 1 million data points on a stock machine.
    OKSP: A Novel Deep Learning Automatic Event Detection Pipeline for Seismic Monitoring in Costa Rica. (arXiv:2109.02723v1 [physics.geo-ph])
    (2 min) Small magnitude earthquakes are the most abundant but the most difficult to locate robustly and well, due to their low amplitudes and high frequencies, which are usually obscured by heterogeneous noise sources. They carry crucial information about the stress state and the spatio-temporal behavior of fault systems during the earthquake cycle, so their full characterization is crucial for improving earthquake hazard assessment. Modern DL algorithms, along with increasing computational power, are exploiting continuously growing seismological databases, allowing scientists to improve the completeness of earthquake catalogs, systematically detecting smaller magnitude earthquakes and reducing the errors introduced mainly by human intervention. In this work, we introduce OKSP, a novel automatic earthquake detection pipeline for seismic monitoring in Costa Rica. Using the Kabre supercomputer from the Costa Rica High Technology Center, we applied OKSP to the day before and the first 5 days following the Puerto Armuelles, M6.5, earthquake that occurred on 26 June 2019 along the Costa Rica-Panama border, and found 1100 more earthquakes previously unidentified by the Volcanological and Seismological Observatory of Costa Rica. Among these events, a total of 23 earthquakes with magnitudes below 1.0 occurred a day to hours prior to the mainshock, shedding light on the rupture initiation and earthquake interaction leading to this productive seismic sequence. Our observations show that, for the study period, the model was 100% exhaustive and 82% precise, resulting in an F1 score of 0.90. This effort represents the very first attempt to automatically detect earthquakes in Costa Rica using deep learning methods and demonstrates that, in the near future, earthquake monitoring routines will be carried out entirely by AI algorithms.
    Puzzle Solving without Search or Human Knowledge: An Unnatural Language Approach. (arXiv:2109.02797v1 [cs.LG])
    (2 min) The application of the Generative Pre-trained Transformer (GPT-2) to learn text-archived game notation provides a model environment for exploring sparse-reward gameplay. The transformer architecture proves amenable to training on solved text archives describing mazes, Rubik's Cube, and Sudoku solvers. The method benefits from fine-tuning the transformer architecture to visualize plausible strategies derived without any guidance from human heuristics or domain expertise. The large search space ($>10^{19}$) for the games provides a puzzle environment in which the solution has few intermediate rewards and a final move that solves the challenge.
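    As a rough illustration of this setup (not taken from the paper), the sketch below fine-tunes an off-the-shelf GPT-2 with the Hugging Face transformers library on text-serialized puzzle solutions; the toy corpus and its serialization format are assumptions.
        # Hypothetical sketch: fine-tune GPT-2 on text-serialized puzzle solutions.
        # The serialization format below is an assumption, not the paper's exact one.
        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
        model = GPT2LMHeadModel.from_pretrained("gpt2")

        # Toy corpus of solved games written as move sequences (assumed format).
        corpus = [
            "MAZE: start (0,0) goal (3,3) moves: R R D D R D",
            "SUDOKU: givens 53..7.... solution: 5 3 4 6 7 8 9 1 2",
        ]

        optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
        model.train()
        for epoch in range(3):
            for text in corpus:
                batch = tokenizer(text, return_tensors="pt")
                out = model(**batch, labels=batch["input_ids"])  # causal LM loss
                out.loss.backward()
                optimizer.step()
                optimizer.zero_grad()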
    FastAudio: A Learnable Audio Front-End for Spoof Speech Detection. (arXiv:2109.02774v1 [cs.SD])
    (2 min) Voice assistants, such as smart speakers, have exploded in popularity. It is currently estimated that the smart speaker adoption rate has exceeded 35% among the US adult population. Manufacturers have integrated speaker identification technology, which attempts to determine the identity of the person speaking, to provide personalized services to different members of the same family. Speaker identification can also play an important role in controlling how the smart speaker is used. For example, it is not critical to correctly identify the user when playing music. However, when reading the user's email out loud, it is critical to verify that the speaker making the request is the authorized user. Speaker verification systems, which authenticate the speaker's identity, are therefore needed as a gatekeeper to protect against various spoofing attacks that aim to impersonate the enrolled user. This paper compares popular learnable front-ends, which learn audio representations through joint, end-to-end training with downstream tasks. We categorize the front-ends by defining two generic architectures and then analyze the filtering stages of both types in terms of learning constraints. We propose replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks. The proposed FastAudio front-end is then tested with two popular back-ends to measure performance on the LA track of the ASVspoof 2019 dataset. The FastAudio front-end achieves a relative improvement of 27% compared with fixed front-ends, outperforming all other learnable front-ends on this task.
    ArGoT: A Glossary of Terms extracted from the arXiv. (arXiv:2109.02801v1 [cs.DL])
    (2 min) We introduce ArGoT, a data set of mathematical terms extracted from the articles hosted on the arXiv website. A term is any mathematical concept defined in an article. Using labels in the articles' source code and examples from other popular math websites, we mine all the terms in the arXiv data and compile a comprehensive vocabulary of mathematical terms. Each term can then be organized into a dependency graph using the terms' definitions and the arXiv's metadata. Using both hyperbolic and standard word embeddings, we demonstrate how this structure is reflected in the text's vector representation and how the embeddings capture relations of entailment between mathematical concepts. This data set is part of an ongoing effort to align natural mathematical text with existing Interactive Theorem Prover (ITP) libraries of formally verified statements.
    Predicting Mood Disorder Symptoms with Remotely Collected Videos Using an Interpretable Multimodal Dynamic Attention Fusion Network. (arXiv:2109.03029v1 [cs.LG])
    (2 min) We developed a novel, interpretable multimodal classification method to identify symptoms of mood disorders viz. depression, anxiety and anhedonia using audio, video and text collected from a smartphone application. We used CNN-based unimodal encoders to learn dynamic embeddings for each modality and then combined these through a transformer encoder. We applied these methods to a novel dataset - collected by a smartphone application - on 3002 participants across up to three recording sessions. Our method demonstrated better multimodal classification performance compared to existing methods that employed static embeddings. Lastly, we used SHapley Additive exPlanations (SHAP) to prioritize important features in our model that could serve as potential digital markers.
    Besov Function Approximation and Binary Classification on Low-Dimensional Manifolds Using Convolutional Residual Networks. (arXiv:2109.02832v1 [stat.ML])
    (2 min) Most existing statistical theories on deep neural networks have sample complexities cursed by the data dimension and therefore cannot well explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit the low-dimensional geometric structures of real-world data sets. We establish theoretical guarantees for convolutional residual networks (ConvResNets) in terms of function approximation and statistical estimation for binary classification. Specifically, given data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, which gives an excess risk of the order of $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$, instead of the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.
    Optimizing model-agnostic Random Subspace ensembles. (arXiv:2109.03099v1 [cs.LG])
    (2 min) This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach alternates between (1) learning an ensemble of models using a parametric version of the Random Subspace approach, in which feature subsets are sampled according to Bernoulli distributions, and (2) identifying the parameters of the Bernoulli distributions that minimize the generalization error of the ensemble model. Parameter optimization is rendered tractable by using an importance sampling approach that can estimate the expected model output for any given parameter set without the need to learn new models. Whereas the degree of randomization is controlled by a hyper-parameter in standard Random Subspace, it is automatically tuned in our parametric version. Furthermore, model-agnostic feature importance scores can easily be derived from the trained ensemble model. We show the good performance of the proposed approach, both in terms of prediction and feature ranking, on simulated and real-world datasets. We also show that our approach can be successfully used for the reconstruction of gene regulatory networks.
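    A minimal sketch of the parametric Random Subspace idea described above, assuming scikit-learn decision trees as base learners: feature subsets are drawn from per-feature Bernoulli distributions and predictions are averaged over the ensemble. The importance-sampling optimization of the Bernoulli parameters is omitted, and all names here are illustrative.
        # Hedged sketch of a parametric Random Subspace ensemble: each member is
        # trained on a feature subset drawn from per-feature Bernoulli distributions.
        # The importance-sampling step that tunes the Bernoulli parameters is omitted.
        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 10))
        y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=200)

        p = np.full(X.shape[1], 0.5)        # Bernoulli parameters (fixed here, not optimized)
        members = []
        for _ in range(25):
            mask = rng.random(X.shape[1]) < p          # sample a feature subset
            if not mask.any():
                mask[rng.integers(X.shape[1])] = True  # keep at least one feature
            tree = DecisionTreeRegressor(max_depth=3).fit(X[:, mask], y)
            members.append((mask, tree))

        def predict(X_new):
            # Ensemble prediction: average over members, each seeing its own subset.
            return np.mean([t.predict(X_new[:, m]) for m, t in members], axis=0)

        print(predict(X[:5]))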
    Bringing a Ruler Into the Black Box: Uncovering Feature Impact from Individual Conditional Expectation Plots. (arXiv:2109.02724v1 [cs.LG])
    (2 min) As machine learning systems become more ubiquitous, methods for understanding and interpreting these models become increasingly important. In particular, practitioners are often interested both in what features the model relies on and how the model relies on them -- the feature's impact on model predictions. Prior work on feature impact, including partial dependence plots (PDPs) and Individual Conditional Expectation (ICE) plots, has focused on a visual interpretation of feature impact. We propose a natural extension to ICE plots: ICE feature impact, a model-agnostic, performance-agnostic feature impact metric derived from ICE plots that can be interpreted as a close analogy to linear regression coefficients. Additionally, we introduce an in-distribution variant of ICE feature impact to vary the influence of out-of-distribution points, as well as heterogeneity and non-linearity measures to characterize feature impact. Lastly, we demonstrate ICE feature impact's utility in several tasks using real-world data.
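    A minimal sketch of one way to distil a scalar impact score from ICE curves, in the spirit of the abstract (average slope of the per-instance curves, analogous to a regression coefficient); the paper's exact estimator may differ, and the model and data below are placeholders.
        # Hedged sketch: derive a scalar impact score for one feature from ICE curves
        # as the mean absolute slope across instances (the paper's estimator may differ).
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(-1, 1, size=(300, 4))
        y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=300)
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

        def ice_feature_impact(model, X, feature, grid_size=20):
            grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_size)
            curves = np.empty((X.shape[0], grid_size))
            for j, v in enumerate(grid):
                Xv = X.copy()
                Xv[:, feature] = v                 # sweep the feature, keep others fixed
                curves[:, j] = model.predict(Xv)   # one ICE curve per instance
            slopes = np.diff(curves, axis=1) / np.diff(grid)
            return np.mean(np.abs(slopes))         # average slope magnitude = impact

        for f in range(X.shape[1]):
            print(f, round(ice_feature_impact(model, X, f), 3))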
    Trojan Signatures in DNN Weights. (arXiv:2109.02836v1 [cs.LG])
    (2 min) Deep neural networks have been shown to be vulnerable to backdoor, or trojan, attacks where an adversary has embedded a trigger in the network at training time such that the model correctly classifies all standard inputs, but generates a targeted, incorrect classification on any input which contains the trigger. In this paper, we present the first ultra light-weight and highly effective trojan detection method that does not require access to the training/test data, does not involve any expensive computations, and makes no assumptions on the nature of the trojan trigger. Our approach focuses on analysis of the weights of the final, linear layer of the network. We empirically demonstrate several characteristics of these weights that occur frequently in trojaned networks, but not in benign networks. In particular, we show that the distribution of the weights associated with the trojan target class is clearly distinguishable from the weights associated with other classes. Using this, we demonstrate the effectiveness of our proposed detection method against state-of-the-art attacks across a variety of architectures, datasets, and trigger types.
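    A minimal sketch of the kind of final-layer weight statistic the abstract alludes to: summarize each output class's weight row and flag the class whose summary is an outlier. The z-score test below is an assumption for illustration, not necessarily the paper's statistic.
        # Hedged sketch: look for a class whose final linear-layer weights are
        # distributionally separated from the rest (a possible trojan target).
        # The specific test statistic in the paper may differ from this z-score.
        import numpy as np

        rng = np.random.default_rng(0)
        num_classes, dim = 10, 512
        W = rng.normal(0, 0.02, size=(num_classes, dim))   # stand-in for a classifier head
        W[3] += 0.05                                       # class 3 plays the "trojaned" target

        per_class_mean = W.mean(axis=1)                    # summary statistic per class row
        mu, sigma = per_class_mean.mean(), per_class_mean.std()
        z = (per_class_mean - mu) / sigma
        suspect = int(np.argmax(np.abs(z)))
        print("most anomalous class:", suspect, "z =", round(float(z[suspect]), 2))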
    Individual Mobility Prediction via Attentive Marked Temporal Point Processes. (arXiv:2109.02715v1 [cs.LG])
    (2 min) Individual mobility prediction is an essential task for transportation demand management and traffic system operation. There exists a large body of work on modeling location sequences and predicting the next location of users; however, little attention is paid to the prediction of the next trip, which is governed by the strong spatiotemporal dependencies between diverse attributes, including trip start time $t$, origin $o$, and destination $d$. To fill this gap, in this paper we propose a novel point process-based model -- Attentive Marked Temporal Point Processes (AMTPP) -- to model human mobility and predict the whole trip $(t,o,d)$ in a joint manner. To encode the influence of historical trips, AMTPP employs the self-attention mechanism with a carefully designed positional embedding to capture the daily/weekly periodicity and regularity in individual travel behavior. Given the unique peaked nature of inter-event times in human behavior, we use an asymmetric log-Laplace mixture distribution to precisely model the distribution of trip start time $t$. Furthermore, an origin-destination (OD) matrix learning block is developed to model the relationship between every origin and destination pair. Experimental results on two large metro trip datasets demonstrate the superior performance of AMTPP.
    Refinement of Hottopixx and its Postprocessing. (arXiv:2109.02863v1 [cs.LG])
    (2 min) Hottopixx, proposed by Bittorf et al. at NIPS 2012, is an algorithm for solving nonnegative matrix factorization (NMF) problems under the separability assumption. Separable NMFs have important applications, such as topic extraction from documents and unmixing of hyperspectral images. In such applications, the robustness of the algorithm to noise is key to success. Hottopixx has been shown to be robust to noise, and its robustness can be further enhanced through postprocessing. However, there is a drawback: Hottopixx and its postprocessing require us to estimate, before running, the noise level of the matrix we want to factorize, since they use it as part of the input data. Noise-level estimation is not an easy task. In this paper, we overcome this drawback. We present a refinement of Hottopixx and its postprocessing that runs without prior knowledge of the noise level. We show that the refinement has almost the same robustness to noise as the original algorithm.
    Training Deep Networks from Zero to Hero: avoiding pitfalls and going beyond. (arXiv:2109.02752v1 [cs.LG])
    (2 min) Training deep neural networks can be challenging on real-world data. Using models as black boxes, even with transfer learning, can result in poor generalization or inconclusive results when it comes to small datasets or specific applications. This tutorial covers the basic steps, as well as more recent options, for improving models, in particular, but not restricted to, supervised learning. It can be particularly useful for datasets that are not as well prepared as those in challenges, and also under scarce annotation and/or small data. We describe basic procedures such as data preparation, optimization and transfer learning, but also recent architectural choices such as the use of transformer modules, alternative convolutional layers, activation functions, and wide and deep networks, as well as training procedures including curriculum, contrastive and self-supervised learning.
    Sequential Diagnosis Prediction with Transformer and Ontological Representation. (arXiv:2109.03069v1 [cs.LG])
    (2 min) Sequential diagnosis prediction on Electronic Health Records (EHRs) has been proven crucial for predictive analytics in the medical domain. EHR data, the sequential records of a patient's interactions with healthcare systems, have numerous inherent characteristics of temporality, irregularity and data insufficiency. Some recent works train healthcare predictive models using the sequential information in EHR data, but they are vulnerable to irregular, temporal EHR data with admission/discharge states, and to insufficient data. To mitigate this, we propose an end-to-end robust transformer-based model called SETOR, which exploits neural ordinary differential equations to handle both irregular intervals between a patient's visits with admitted timestamps and the length of stay in each visit, alleviates the limitation of insufficient data by integrating medical ontology, and captures the dependencies between the patient's visits by employing multi-layer transformer blocks. Experiments conducted on two real-world healthcare datasets show that our sequential diagnosis prediction model SETOR not only achieves better predictive results than previous state-of-the-art approaches, irrespective of sufficient or insufficient training data, but also derives more interpretable embeddings of medical codes. The experimental code is available at the GitHub repository (https://github.com/Xueping/SETOR).
    Position-based Hash Embeddings For Scaling Graph Neural Networks. (arXiv:2109.00101v2 [cs.LG] UPDATED)
    (2 min) Graph Neural Networks (GNNs) bring the power of deep representation learning to graph and relational data and achieve state-of-the-art performance in many applications. GNNs compute node representations by taking into account the topology of the node's ego-network and the features of the ego-network's nodes. When the nodes do not have high-quality features, GNNs learn an embedding layer to compute node embeddings and use them as input features. However, the size of the embedding layer is linear in the product of the number of nodes in the graph and the dimensionality of the embedding, and it does not scale to big data and graphs with hundreds of millions of nodes. To reduce the memory associated with this embedding layer, hashing-based approaches, commonly used in applications like NLP and recommender systems, can potentially be used. However, a direct application of these ideas fails to exploit the fact that in many real-world graphs, nodes that are topologically close tend to be related to each other (homophily) and as such their representations will be similar. In this work, we present approaches that take advantage of the nodes' position in the graph to dramatically reduce the memory required, with minimal, if any, degradation in the quality of the resulting GNN model. Our approaches decompose a node's embedding into two components: a position-specific component and a node-specific component. The position-specific component models homophily, and the node-specific component models the node-to-node variation. Extensive experiments using different datasets and GNN models show that our methods are able to reduce the memory requirements by 88% to 97% while achieving, in nearly all cases, better classification accuracy than other competing approaches, including the full embeddings.
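    A minimal sketch of the described decomposition, assuming PyTorch: a node's embedding is the sum of a position-specific vector (shared by nodes in the same position bucket) and a small hashed node-specific vector. How position buckets are computed is left abstract here and is an assumption.
        # Hedged sketch: node embedding = position-specific component (shared by nearby
        # nodes) + hashed node-specific component. Bucket assignment below is a toy
        # stand-in for whatever position encoding is actually used.
        import torch
        import torch.nn as nn

        class PositionHashEmbedding(nn.Module):
            def __init__(self, num_position_buckets, num_hash_buckets, dim):
                super().__init__()
                self.pos_emb = nn.Embedding(num_position_buckets, dim)   # models homophily
                self.node_emb = nn.Embedding(num_hash_buckets, dim)      # node-to-node variation

            def forward(self, node_ids, position_buckets):
                hashed = node_ids % self.node_emb.num_embeddings          # simple modular hash
                return self.pos_emb(position_buckets) + self.node_emb(hashed)

        emb = PositionHashEmbedding(num_position_buckets=1000, num_hash_buckets=50_000, dim=64)
        node_ids = torch.tensor([7, 123_456_789, 42])
        pos_buckets = torch.tensor([3, 3, 998])        # first two nodes share a position bucket
        print(emb(node_ids, pos_buckets).shape)        # torch.Size([3, 64])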
    Go Wider Instead of Deeper. (arXiv:2107.11817v3 [cs.LG] UPDATED)
    (2 min) Transformer models with more blocks and residual connections have recently achieved impressive results on various tasks. To achieve better performance with fewer trainable parameters, recent methods have been proposed to go shallower through parameter sharing or model compression along the depth dimension. However, weak modeling capacity limits their performance. In contrast, going wider by introducing more trainable matrices and parameters would produce a huge model requiring advanced parallelism for training and inference. In this paper, we propose a parameter-efficient framework that goes wider instead of deeper. Specifically, following existing works, we adopt parameter sharing to compress along the depth. But such a deployment would limit the performance. To maximize modeling capacity, we scale along the model width by replacing the feed-forward network (FFN) with a mixture-of-experts (MoE). Across transformer blocks, instead of sharing normalization layers, we propose to use individual layernorms to transform the various semantic representations in a more parameter-efficient way. To evaluate our plug-and-run framework, we design WideNet and conduct comprehensive experiments on popular computer vision and natural language processing benchmarks. On ImageNet-1K, our best model outperforms Vision Transformer (ViT) by $1.5\%$ with $0.72 \times$ trainable parameters. Using $0.46 \times$ and $0.13 \times$ parameters, our WideNet can still surpass ViT and ViT-MoE by $0.8\%$ and $2.1\%$, respectively. On four natural language processing datasets, WideNet outperforms ALBERT by $1.8\%$ on average and surpasses BERT with factorized embedding parameterization by $0.8\%$ with fewer parameters.
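    A minimal sketch of the parameter layout described above, assuming PyTorch: the attention and MoE-FFN parameters are shared across depth, while each depth keeps its own pair of LayerNorms; the tiny top-1 router is illustrative only and omits load-balancing details.
        # Hedged sketch of a WideNet-style layout: shared attention/MoE across depth,
        # individual LayerNorms per depth. The tiny top-1 MoE is illustrative only.
        import torch
        import torch.nn as nn

        class TinyMoE(nn.Module):
            def __init__(self, dim, num_experts=4):
                super().__init__()
                self.router = nn.Linear(dim, num_experts)
                self.experts = nn.ModuleList(
                    nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                    for _ in range(num_experts)
                )

            def forward(self, x):                       # x: (tokens, dim)
                gate = self.router(x).softmax(-1)
                idx = gate.argmax(-1)                   # top-1 expert per token
                out = torch.zeros_like(x)
                for e, expert in enumerate(self.experts):
                    sel = idx == e
                    if sel.any():
                        out[sel] = gate[sel, e:e + 1] * expert(x[sel])
                return out

        class SharedBlockWideNet(nn.Module):
            def __init__(self, dim, depth):
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
                self.moe = TinyMoE(dim)                 # shared across all depths
                self.ln1 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))  # individual
                self.ln2 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))  # per depth

            def forward(self, x):                       # x: (batch, seq, dim)
                for ln1, ln2 in zip(self.ln1, self.ln2):
                    h = ln1(x)
                    x = x + self.attn(h, h, h, need_weights=False)[0]
                    h = ln2(x)
                    x = x + self.moe(h.reshape(-1, h.size(-1))).reshape_as(h)
                return x

        print(SharedBlockWideNet(dim=64, depth=6)(torch.randn(2, 10, 64)).shape)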
    Hyper Meta-Path Contrastive Learning for Multi-Behavior Recommendation. (arXiv:2109.02859v1 [cs.IR])
    (2 min) User purchasing prediction with multi-behavior information remains a challenging problem for current recommendation systems. Various methods have been proposed to address it via leveraging the advantages of graph neural networks (GNNs) or multi-task learning. However, most existing works do not take the complex dependencies among different behaviors of users into consideration. They utilize simple and fixed schemes, like neighborhood information aggregation or mathematical calculation of vectors, to fuse the embeddings of different user behaviors to obtain a unified embedding to represent a user's behavioral patterns which will be used in downstream recommendation tasks. To tackle the challenge, in this paper, we first propose the concept of hyper meta-path to construct hyper meta-paths or hyper meta-graphs to explicitly illustrate the dependencies among different behaviors of a user. How to obtain a unified embedding for a user from hyper meta-paths and avoid the previously mentioned limitations simultaneously is critical. Thanks to the recent success of graph contrastive learning, we leverage it to learn embeddings of user behavior patterns adaptively instead of assigning a fixed scheme to understand the dependencies among different behaviors. A new graph contrastive learning based framework is proposed by coupling with hyper meta-paths, namely HMG-CR, which consistently and significantly outperforms all baselines in extensive comparison experiments.
    Learning to Infer Shape Programs Using Self Training. (arXiv:2011.13045v2 [cs.CV] UPDATED)
    (2 min) Inferring programs which generate 2D and 3D shapes is important for reverse engineering, editing, and more. Training such inference models is challenging due to the lack of paired (shape, program) data in most domains. A popular approach is to pre-train a model on synthetic data and then fine-tune on real shapes using slow, unstable reinforcement learning. In this paper, we argue that self-training is a viable alternative for fine-tuning such models. Self-training is a semi-supervised learning paradigm where a model assigns pseudo-labels to unlabeled data, and then retrains with (data, pseudo-label) pairs as the new ground truth. We show that for constructive solid geometry and assembly-based modeling, self-training outperforms state-of-the-art reinforcement learning approaches. Additionally, shape program inference has a unique property that circumvents a potential downside of self-training (incorrect pseudo-label assignment): inferred programs are executable. For a given shape from our distribution of interest $\mathbf{x}^*$ and its predicted program $\mathbf{z}$, one can execute $\mathbf{z}$ to obtain a shape $\mathbf{x}$ and train on $(\mathbf{z}, \mathbf{x})$ pairs, rather than $(\mathbf{z}, \mathbf{x}^*)$ pairs. We term this procedure latent execution self training (LEST). We demonstrate that self training infers shape programs with higher shape reconstruction accuracy and converges significantly faster than reinforcement learning approaches, and in some domains, LEST can further improve this performance.
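    A minimal sketch of one LEST round under stated assumptions: `model.infer_program`, `execute`, and `train_step` are hypothetical placeholders for the shape-to-program model, the program executor, and a supervised update, respectively.
        # Hedged sketch of latent execution self-training (LEST). `infer_program`,
        # `execute`, and `train_step` are hypothetical placeholders, not the paper's API.
        def lest_round(model, unlabeled_shapes, execute, train_step):
            pairs = []
            for x_star in unlabeled_shapes:
                z = model.infer_program(x_star)   # pseudo-label: predicted program
                x = execute(z)                    # execute it to get a shape it *does* produce
                pairs.append((x, z))              # train on (executed shape, program), not (x*, z)
            for x, z in pairs:
                train_step(model, inputs=x, targets=z)
            return model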
    Graph Representation Learning for Road Type Classification. (arXiv:2107.07791v3 [cs.LG] UPDATED)
    (2 min) We present a novel learning-based approach to graph representations of road networks employing state-of-the-art graph convolutional neural networks. Our approach is applied to realistic road networks of 17 cities from Open Street Map. While edge features are crucial to generate descriptive graph representations of road networks, graph convolutional networks usually rely on node features only. We show that the highly representative edge features can still be integrated into such networks by applying a line graph transformation. We also propose a method for neighborhood sampling based on a topological neighborhood composed of both local and global neighbors. We compare the performance of learning representations using different types of neighborhood aggregation functions in transductive and inductive tasks and in supervised and unsupervised learning. Furthermore, we propose a novel aggregation approach, Graph Attention Isomorphism Network, GAIN. Our results show that GAIN outperforms state-of-the-art methods on the road type classification problem.
    Early ICU Mortality Prediction and Survival Analysis for Respiratory Failure. (arXiv:2109.03048v1 [cs.LG])
    (2 min) Respiratory failure is one of the major causes of death in the critical care unit. During the COVID-19 outbreak, critical care units experienced an extreme shortage of mechanical ventilation because of respiratory-failure-related syndromes. To help with this, early mortality risk prediction for patients suffering respiratory failure can provide timely support for clinical treatment and resource management. In this study, we propose a dynamic modeling approach for early mortality risk prediction of respiratory failure patients based on the first 24 hours of ICU physiological data. Our proposed model is validated on the eICU Collaborative Research Database. We achieved high AUROC performance (80-83%) and significantly improved AUCPR by 4% on Day 5 after ICU admission, compared to state-of-the-art prediction models. In addition, we illustrated that the survival curve includes time-varying information for survival analysis of early ICU admissions.
    HMSG: Heterogeneous Graph Neural Network based on Metapath Subgraph Learning. (arXiv:2109.02868v1 [cs.AI])
    (2 min) Many real-world data can be represented as heterogeneous graphs with different types of nodes and connections. Heterogeneous graph neural network models aim to embed nodes or subgraphs into a low-dimensional vector space for various downstream tasks such as node classification, link prediction, etc. Although several models have been proposed recently, they either only aggregate information from the same type of neighbors, or indiscriminately treat homogeneous and heterogeneous neighbors in the same way. Based on these observations, we propose a new heterogeneous graph neural network model named HMSG to comprehensively capture structural, semantic and attribute information from both homogeneous and heterogeneous neighbors. Specifically, we first decompose the heterogeneous graph into multiple metapath-based homogeneous and heterogeneous subgraphs, each of which carries specific semantic and structural information. Then message aggregation methods are applied to each subgraph independently, so that information can be learned in a more targeted and efficient manner. Through a type-specific attribute transformation, node attributes can also be transferred among different types of nodes. Finally, we fuse the information from the subgraphs together to obtain the complete representation. Extensive experiments on several datasets for node classification, node clustering and link prediction tasks show that HMSG outperforms state-of-the-art baselines on all evaluation metrics.
    Motion Artifact Reduction In Photoplethysmography For Reliable Signal Selection. (arXiv:2109.02755v1 [eess.SP])
    (2 min) Photoplethysmography (PPG) is a non-invasive and economical technique for extracting vital signs of the human body. Although it has been widely used in consumer- and research-grade wrist devices to track a user's physiology, the PPG signal is very sensitive to motion, which can corrupt the signal's quality. Existing Motion Artifact (MA) reduction techniques have been developed and evaluated using either synthetic noisy signals or signals collected during high-intensity activities - both of which are difficult to generalize to real-life scenarios. Therefore, it is valuable to collect realistic PPG signals while performing Activities of Daily Living (ADL) to develop practical signal denoising and analysis methods. In this work, we propose an automatic pseudo-clean PPG generation process for reliable PPG signal selection. For each noisy PPG segment, the corresponding pseudo-clean PPG reduces the MAs and contains rich temporal details depicting cardiac features. Our experimental results show that 71% of the pseudo-clean PPG segments collected during ADL can be considered high quality, with derived MAEs of heart rate and respiration rate of 1.46 BPM and 3.93 BrPM, respectively. Therefore, our proposed method can determine the reliability of the raw noisy PPG by considering the quality of the corresponding pseudo-clean PPG signal.
    Zero-Shot Open Set Detection by Extending CLIP. (arXiv:2109.02748v1 [cs.CV])
    (2 min) In a regular open set detection problem, samples of known classes (also called closed set classes) are used to train a special classifier. In testing, the classifier can (1) classify the test samples of known classes to their respective classes and (2) also detect samples that do not belong to any of the known classes (we say they belong to some unknown or open set classes). This paper studies the problem of zero-shot open-set detection, which still performs the same two tasks in testing but has no training except using the given known class names. This paper proposes a novel and yet simple method (called ZO-CLIP) to solve the problem. ZO-CLIP builds on top of the recent advances in zero-shot classification through multi-modal representation learning. It first extends the pre-trained multi-modal model CLIP by training a text-based image description generator on top of CLIP. In testing, it uses the extended model to generate some candidate unknown class names for each test sample and computes a confidence score based on both the known class names and candidate unknown class names for zero-shot open set detection. Experimental results on 5 benchmark datasets for open set detection confirm that ZO-CLIP outperforms the baselines by a large margin.
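    A minimal sketch of the scoring step, with `encode_image`, `encode_text`, and `generate_candidate_labels` as hypothetical placeholders for CLIP-style encoders and the caption-based label generator; the thresholding rule is an assumption, not necessarily the paper's decision rule.
        # Hedged sketch of zero-shot open-set scoring: softmax image-text similarities
        # over known class names plus generated candidate unknown names, then use the
        # probability mass on known names as the "known" confidence. All callables
        # passed in are hypothetical placeholders.
        import numpy as np

        def open_set_score(image, known_names, encode_image, encode_text,
                           generate_candidate_labels, threshold=0.5):
            candidates = generate_candidate_labels(image)          # e.g. from a caption head
            names = list(known_names) + [c for c in candidates if c not in known_names]
            img = encode_image(image)
            txt = np.stack([encode_text(n) for n in names])
            sims = txt @ img / (np.linalg.norm(txt, axis=1) * np.linalg.norm(img) + 1e-8)
            probs = np.exp(sims / 0.07) / np.exp(sims / 0.07).sum()   # temperature-scaled softmax
            known_mass = probs[: len(known_names)].sum()
            if known_mass < threshold:
                return "unknown", probs
            return names[int(np.argmax(probs[: len(known_names)]))], probs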
    Machine Learning: Challenges, Limitations, and Compatibility for Audio Restoration Processes. (arXiv:2109.02692v1 [cs.SD])
    (2 min) In this paper, machine learning networks are explored for their use in restoring degraded and compressed speech audio. The project's intent was to train a new model on voice data to learn the features of compression artifacting distortion, introduced by data loss from lossy compression and resolution loss, using an existing algorithm presented in SEGAN: Speech Enhancement Generative Adversarial Network. The resulting generator from the model was then to be used to restore degraded speech audio. This paper details an examination of the compatibility and operational issues presented by working with deprecated code, which prevented the trained model from being developed successfully. It further serves as an examination of the challenges, limitations, and compatibility issues in the current state of machine learning.
    Few-shot Learning in Emotion Recognition of Spontaneous Speech Using a Siamese Neural Network with Adaptive Sample Pair Formation. (arXiv:2109.02915v1 [cs.LG])
    (2 min) Speech-based machine learning (ML) has been heralded as a promising solution for tracking prosodic and spectrotemporal patterns in real life that are indicative of emotional changes, providing a valuable window into one's cognitive and mental state. Yet, the scarcity of labelled data in ambulatory studies prevents the reliable training of ML models, which usually rely on "data-hungry" distribution-based learning. Leveraging the abundance of labelled speech data from acted emotions, this paper proposes a few-shot learning approach for automatically recognizing emotion in spontaneous speech from a small number of labelled samples. Few-shot learning is implemented via a metric learning approach through a siamese neural network, which models the relative distance between samples rather than relying on learning absolute patterns of the corresponding distributions of each emotion. Results indicate the feasibility of the proposed metric learning in recognizing emotions from spontaneous speech in four datasets, even with a small number of labelled samples. They further demonstrate superior performance of the proposed metric learning compared to commonly used adaptation methods, including network fine-tuning and adversarial learning. Findings from this work provide a foundation for the ambulatory tracking of human emotion in spontaneous speech, contributing to the real-life assessment of mental health degradation.
    Few-shot Learning via Dependency Maximization and Instance Discriminant Analysis. (arXiv:2109.02820v1 [cs.CV])
    (2 min) We study the few-shot learning (FSL) problem, where a model learns to recognize new objects with extremely few labeled training data per category. Most previous FSL approaches resort to the meta-learning paradigm, where the model accumulates inductive bias through learning many training tasks so as to solve a new unseen few-shot task. In contrast, we propose a simple approach to exploit the unlabeled data accompanying the few-shot task to improve few-shot performance. Firstly, we propose a Dependency Maximization method based on the Hilbert-Schmidt norm of the cross-covariance operator, which maximizes the statistical dependency between the embedded features of those unlabeled data and their label predictions, together with the supervised loss over the support set. We then use the obtained model to infer pseudo-labels for the unlabeled data. Furthermore, we propose an Instance Discriminant Analysis to evaluate the credibility of each pseudo-labeled example and select the most faithful ones into an augmented support set, which is used to retrain the model as in the first step. We iterate the above process until the pseudo-labels for the unlabeled data become stable. Following the standard transductive and semi-supervised FSL setting, our experiments show that the proposed method outperforms previous state-of-the-art methods on four widely used benchmarks, including mini-ImageNet, tiered-ImageNet, CUB, and CIFAR-FS.
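    A minimal sketch of the (biased) empirical HSIC between unlabeled-sample embeddings and their predicted label distributions, i.e. the dependency term being maximized; the kernels and toy data are simplifying assumptions.
        # Hedged sketch: biased empirical HSIC (Hilbert-Schmidt Independence Criterion)
        # between embeddings of unlabeled samples and their predicted label distributions.
        # Gaussian / linear kernels here are simplifying assumptions; this is the standard
        # empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator.
        import numpy as np

        def hsic(X, Y, sigma=1.0):
            n = X.shape[0]
            d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            K = np.exp(-d2 / (2 * sigma ** 2))        # Gaussian kernel on embeddings
            L = Y @ Y.T                               # linear kernel on predicted label probs
            H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
            return np.trace(K @ H @ L @ H) / (n - 1) ** 2

        rng = np.random.default_rng(0)
        emb = rng.normal(size=(32, 16))               # embeddings of unlabeled samples
        logits = emb[:, :5] + 0.1 * rng.normal(size=(32, 5))
        probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
        print(hsic(emb, probs))                       # larger = stronger dependency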
    Robustness and Generalization via Generative Adversarial Training. (arXiv:2109.02765v1 [cs.CV])
    (2 min) While deep neural networks have achieved remarkable success in various computer vision tasks, they often fail to generalize to new domains and subtle variations of input images. Several defenses have been proposed to improve robustness against these variations. However, current defenses can only withstand the specific attack used in training, and the models often remain vulnerable to other input variations. Moreover, these methods often degrade the model's performance on clean images and do not generalize to out-of-domain samples. In this paper we present Generative Adversarial Training, an approach that simultaneously improves the model's generalization to the test set and to out-of-domain samples, as well as its robustness to unseen adversarial attacks. Instead of altering a low-level, pre-defined aspect of images, we generate a spectrum of low-level, mid-level and high-level changes using generative models with a disentangled latent space. Adversarial training with these examples enables the model to withstand a wide range of attacks by observing a variety of input alterations during training. We show that our approach not only improves the model's performance on clean images and out-of-domain samples but also makes it robust against unforeseen attacks, outperforming prior work. We validate the effectiveness of our method by demonstrating results on various tasks such as classification, segmentation and object detection.

2021-09-07

  • cs.CL updates on arXiv.org

    Quantum Natural Language Processing on Near-Term Quantum Computers. (arXiv:2005.04147v2 [cs.CL] UPDATED)
    (0 min) In this work, we describe a full-stack pipeline for natural language processing on near-term quantum computers, aka QNLP. The language-modelling framework we employ is that of compositional distributional semantics (DisCoCat), which extends and complements the compositional structure of pregroup grammars. Within this model, the grammatical reduction of a sentence is interpreted as a diagram, encoding a specific interaction of words according to the grammar. It is this interaction which, together with a specific choice of word embedding, realises the meaning (or "semantics") of a sentence. Building on the formal quantum-like nature of such interactions, we present a method for mapping DisCoCat diagrams to quantum circuits. Our methodology is compatible both with NISQ devices and with established Quantum Machine Learning techniques, paving the way to near-term applications of quantum technology to natural language processing.
    Relation Extraction from Tables using Artificially Generated Metadata. (arXiv:2108.10750v3 [cs.CL] UPDATED)
    (0 min) Relation Extraction (RE) from tables is the task of identifying relations between pairs of columns of a table. Generally, RE models for this task require labelled tables for training. These labelled tables can also be generated artificially from a Knowledge Graph (KG), which makes the cost of acquiring them much lower than that of manual annotation. However, unlike real tables, these synthetic tables lack associated metadata, such as column headers, captions, etc.; this is because synthetic tables are created from KGs that do not store such metadata. Meanwhile, previous works have shown that metadata is important for accurate RE from tables. To address this issue, we propose methods to artificially create some of this metadata for synthetic tables. Afterward, we experiment with a BERT-based model, in line with recently published works, that takes as input a combination of the proposed artificial metadata and table content. Our empirical results show that this leads to an improvement of 9\%-45\% in F1 score, in absolute terms, over 2 tabular datasets.
    TravelBERT: Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation. (arXiv:2109.01048v2 [cs.CL] UPDATED)
    (0 min) Existing technologies expand BERT from different perspectives, e.g. designing different pre-training tasks, different semantic granularities, and different model architectures. Few models consider expanding BERT to handle different text formats. In this paper, we propose a heterogeneous knowledge language model (HKLM), a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text and well-structured text. To capture the corresponding relations among these multi-format sources of knowledge, our approach uses a masked language model objective to learn word knowledge, and uses a triple classification objective and a title matching objective to learn entity knowledge and topic knowledge, respectively. To obtain the aforementioned multi-format text, we construct a corpus in the tourism domain and conduct experiments on 5 tourism NLP datasets. The results show that our approach outperforms pre-training on plain text using only 1/4 of the data. The code, datasets, corpus and knowledge graph will be released.
    Transformer Models for Text Coherence Assessment. (arXiv:2109.02176v1 [cs.CL])
    (0 min) Coherence is an important aspect of text quality and is crucial for ensuring its readability. It is especially desirable for outputs from text generation systems like summarization, question answering, machine translation, question generation, table-to-text, etc. An automated coherence scoring model is also helpful in essay scoring or providing writing feedback. A large body of previous work has leveraged entity-based methods, syntactic patterns, discourse relations, and, more recently, traditional deep learning architectures for text coherence assessment. Previous work suffers from drawbacks like the inability to handle long-range dependencies, out-of-vocabulary words, or to model sequence information. We hypothesize that coherence assessment is a cognitively complex task that requires deeper models and can benefit from other related tasks. Accordingly, in this paper, we propose four different Transformer-based architectures for the task: a vanilla Transformer, a hierarchical Transformer, a multi-task learning-based model, and a model with fact-based input representation. Our experiments with popular benchmark datasets across multiple domains on four different coherence assessment tasks demonstrate that our models achieve state-of-the-art results, outperforming existing models by a good margin.
    FewshotQA: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models. (arXiv:2109.01951v1 [cs.CL])
    (0 min) The task of learning from only a few examples (called a few-shot setting) is of key importance and relevance to a real-world setting. For question answering (QA), the current state-of-the-art pre-trained models typically need fine-tuning on tens of thousands of examples to obtain good results. Their performance degrades significantly in a few-shot setting (< 100 examples). To address this, we propose a simple fine-tuning framework that leverages pre-trained text-to-text models and is directly aligned with their pre-training framework. Specifically, we construct the input as a concatenation of the question, a mask token representing the answer span, and a context. Given this input, the model is fine-tuned using the same objective as its pre-training objective. Through experimental studies on various few-shot configurations, we show that this formulation leads to significant gains on multiple QA benchmarks (an absolute gain of 34.2 F1 points on average when there are only 16 training examples). The gains extend further when used with larger models (e.g., 72.3 F1 on SQuAD using BART-large with only 32 examples) and translate well to a multilingual setting. On the multilingual TydiQA benchmark, our model outperforms XLM-RoBERTa-large by an absolute margin of up to 40 F1 points and an average of 33 F1 points in a few-shot setting (<= 64 training examples). We conduct detailed ablation studies to analyze the factors contributing to these gains.
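    A minimal sketch of this input/target construction with a BART-style text-to-text model; the exact concatenation format and target string are assumptions on our part, not necessarily the paper's.
        # Hedged sketch of the few-shot QA formulation with a BART-type model:
        # input = question + <mask> + context; target = question + answer + context,
        # so fine-tuning reuses the same denoising objective as pre-training.
        # The concatenation format below is an assumption.
        from transformers import BartTokenizer, BartForConditionalGeneration

        tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
        model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

        question = "Where is the Eiffel Tower?"
        context = "The Eiffel Tower is a landmark in Paris, France."
        answer = "Paris"

        source = f"{question} {tokenizer.mask_token} {context}"
        target = f"{question} {answer} {context}"

        batch = tokenizer(source, return_tensors="pt")
        labels = tokenizer(target, return_tensors="pt").input_ids
        loss = model(**batch, labels=labels).loss     # one fine-tuning step's loss
        loss.backward()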
    What's in your Head? Emergent Behaviour in Multi-Task Transformer Models. (arXiv:2104.06129v2 [cs.CL] UPDATED)
    (0 min) The primary paradigm for multi-task training in natural language processing is to represent the input with a shared pre-trained language model, and add a small, thin network (head) per task. Given an input, a target head is the head that is selected for outputting the final prediction. In this work, we examine the behaviour of non-target heads, that is, the output of heads when given input that belongs to a different task than the one they were trained for. We find that non-target heads exhibit emergent behaviour, which may either explain the target task, or generalize beyond their original task. For example, in a numerical reasoning task, a span extraction head extracts from the input the arguments to a computation that results in a number generated by a target generative head. In addition, a summarization head that is trained with a target question answering head, outputs query-based summaries when given a question and a context from which the answer is to be extracted. This emergent behaviour suggests that multi-task training leads to non-trivial extrapolation of skills, which can be harnessed for interpretability and generalization.
    Supervised Contrastive Learning for Multimodal Unreliable News Detection in COVID-19 Pandemic. (arXiv:2109.01850v1 [cs.CL])
    (0 min) As the digital news industry becomes the main channel of information dissemination, the adverse impact of fake news is explosively magnified. The credibility of a news report should not be considered in isolation. Rather, previously published news articles on a similar event could be used to assess the credibility of a news report. Inspired by this, we propose a BERT-based multimodal unreliable news detection framework, which captures both textual and visual information from unreliable articles using a contrastive learning strategy. The contrastive learner interacts with the unreliable news classifier to push similar credible news (or similar unreliable news) closer together, while moving news articles with similar content but opposite credibility labels away from each other in the multimodal embedding space. Experimental results on a COVID-19 related dataset, ReCOVery, show that our model outperforms a number of competitive baselines in unreliable news detection.
    Fastformer: Additive Attention Can Be All You Need. (arXiv:2108.09084v6 [cs.CL] UPDATED)
    (0 min) The Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity in the input sequence length. Although there are many methods for Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with the global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models while achieving comparable or even better long-text modeling performance.
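    A minimal single-head sketch of additive attention in this style: queries are pooled into a global query, which interacts element-wise with the keys; the pooled result modulates the values. Multi-head handling and some projection details are simplified away.
        # Hedged single-head sketch of additive attention with a global query/key:
        # a learned scorer pools queries into one global query, that query modulates
        # the keys element-wise, the modulated keys are pooled into a global key,
        # and the global key modulates the values. Runs in time linear in sequence length.
        import torch
        import torch.nn as nn

        class AdditiveAttention(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
                self.wq = nn.Linear(dim, 1)           # additive-attention scorer for queries
                self.wk = nn.Linear(dim, 1)           # additive-attention scorer for keys
                self.out = nn.Linear(dim, dim)
                self.scale = dim ** -0.5

            def forward(self, x):                      # x: (batch, seq, dim)
                q, k, v = self.q(x), self.k(x), self.v(x)
                alpha = torch.softmax(self.wq(q) * self.scale, dim=1)    # (B, S, 1)
                global_q = (alpha * q).sum(dim=1, keepdim=True)           # (B, 1, D)
                p = global_q * k                                          # element-wise interaction
                beta = torch.softmax(self.wk(p) * self.scale, dim=1)
                global_k = (beta * p).sum(dim=1, keepdim=True)
                u = global_k * v                                          # modulate values
                return self.out(u) + q                                    # residual with queries

        print(AdditiveAttention(64)(torch.randn(2, 128, 64)).shape)       # torch.Size([2, 128, 64])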
    End-to-End Self-Debiasing Framework for Robust NLU Training. (arXiv:2109.02071v1 [cs.CL])
    (0 min) Existing Natural Language Understanding (NLU) models have been shown to incorporate dataset biases, leading to strong performance on in-distribution (ID) test sets but poor performance on out-of-distribution (OOD) ones. We introduce a simple yet effective debiasing framework whereby the shallow representations of the main model are used to derive a bias model, and both models are trained simultaneously. We demonstrate on three well-studied NLU tasks that, despite its simplicity, our method leads to competitive OOD results. It significantly outperforms other debiasing approaches on two tasks, while still delivering high in-distribution performance.
    Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser. (arXiv:2109.02297v1 [cs.CL])
    (0 min) Considering the importance of building a good Visual Dialog (VD) Questioner, many researchers study the topic under a Q-Bot-A-Bot image-guessing game setting, where the Questioner needs to raise a series of questions to collect information about an undisclosed image. Although progress has been made with Supervised Learning (SL) and Reinforcement Learning (RL), issues still exist. Firstly, previous methods do not provide explicit and effective guidance for the Questioner to generate visually related and informative questions. Secondly, the effect of RL is hampered by an incompetent component, i.e., the Guesser, which makes image predictions based on the generated dialogs and assigns rewards accordingly. To enhance the VD Questioner: 1) we propose a Related entity enhanced Questioner (ReeQ) that generates questions under the guidance of related entities and learns an entity-based questioning strategy from human dialogs; 2) we propose an Augmented Guesser (AugG) that is strong and optimized especially for the VD setting. Experimental results on the VisDial v1.0 dataset show that our approach achieves state-of-the-art performance on both the image-guessing task and question diversity. A human study further shows that our model generates more visually related, informative and coherent questions.
    Putting a Spin on Language: A Quantum Interpretation of Unary Connectives for Linguistic Applications. (arXiv:2004.04128v3 [cs.CL] UPDATED)
    (0 min) Extended versions of the Lambek Calculus currently used in computational linguistics rely on unary modalities to allow for the controlled application of structural rules affecting word order and phrase structure. These controlled structural operations give rise to derivational ambiguities that are missed by the original Lambek Calculus or its pregroup simplification. Proposals for compositional interpretation of extended Lambek Calculus in the compact closed category of FVect and linear maps have been made, but in these proposals the syntax-semantics mapping ignores the control modalities, effectively restricting their role to the syntax. Our aim is to turn the modalities into first-class citizens of the vectorial interpretation. Building on the directional density matrix semantics, we extend the interpretation of the type system with an extra spin density matrix space. The interpretation of proofs then results in ambiguous derivations being tensored with orthogonal spin states. Our method introduces a way of simultaneously representing co-existing interpretations of ambiguous utterances, and provides a uniform framework for the integration of lexical and derivational ambiguity.
    Re-entry Prediction for Online Conversations via Self-Supervised Learning. (arXiv:2109.02020v1 [cs.CL])
    (0 min) In recent years, online discussions and opinion sharing on social media have been booming. The re-entry prediction task has thus been proposed to help people keep track of the discussions they wish to continue. Nevertheless, existing works only focus on exploiting chatting history and context information, and ignore the potentially useful learning signals underlying conversation data, such as conversation thread patterns and the repeated engagement of target users, which help better understand the behavior of target users in conversations. In this paper, we propose three interesting and well-founded auxiliary tasks, namely, Spread Pattern, Repeated Target user, and Turn Authorship, as self-supervised signals for re-entry prediction. These auxiliary tasks are trained together with the main task in a multi-task manner. Experimental results on two datasets newly collected from Twitter and Reddit show that our method outperforms the previous state of the art with fewer parameters and faster convergence. Extensive experiments and analysis show the effectiveness of our proposed models and also point out some key ideas in designing self-supervised tasks.
    Nearest Neighbour Few-Shot Learning for Cross-lingual Classification. (arXiv:2109.02221v1 [cs.CL])
    (0 min) Even though large pre-trained multilingual models (e.g. mBERT, XLM-R) have led to significant performance gains on a wide range of cross-lingual NLP tasks, success on many downstream tasks still relies on the availability of sufficient annotated data. Traditional fine-tuning of pre-trained models using only a few target samples can cause over-fitting. This can be quite limiting, as most languages in the world are under-resourced. In this work, we investigate cross-lingual adaptation using a simple nearest-neighbour few-shot (<15 samples) inference technique for classification tasks. We experiment with a total of 16 distinct languages across two NLP tasks: XNLI and PAWS-X. Our approach consistently improves on traditional fine-tuning using only a handful of labeled samples in target locales. We also demonstrate its generalization capability across tasks.
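    A minimal sketch of nearest-neighbour few-shot inference over sentence embeddings; `encode` is a placeholder for any multilingual sentence encoder (e.g. mean-pooled mBERT/XLM-R states), and the cosine-similarity choice is an assumption.
        # Hedged sketch of nearest-neighbour few-shot inference: embed the handful of
        # labelled target-language examples, then label each test sentence by its
        # nearest labelled neighbour under cosine similarity. `encode` is a placeholder.
        import numpy as np

        def nn_few_shot_predict(test_texts, support_texts, support_labels, encode):
            S = np.stack([encode(t) for t in support_texts])        # (num_support, dim)
            S = S / (np.linalg.norm(S, axis=1, keepdims=True) + 1e-8)
            preds = []
            for t in test_texts:
                q = encode(t)
                q = q / (np.linalg.norm(q) + 1e-8)
                preds.append(support_labels[int(np.argmax(S @ q))])  # cosine nearest neighbour
            return preds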
    Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training. (arXiv:2109.02284v1 [cs.CL])
    (0 min) Learning a multilingual and multi-domain translation model is challenging, as the heterogeneous and imbalanced data make the model converge inconsistently over different corpora in the real world. One common practice is to adjust the share of each corpus in the training, so that the learning process is balanced and low-resource cases can benefit from the high-resource ones. However, automatic balancing methods usually depend on the intra- and inter-dataset characteristics, which are usually agnostic or require human priors. In this work, we propose an approach, MultiUAT, that dynamically adjusts the training data usage based on the model's uncertainty on a small set of trusted clean data for multi-corpus machine translation. We experiment with two classes of uncertainty measures on multilingual (16 languages with 4 settings) and multi-domain settings (4 for in-domain and 2 for out-of-domain on English-German translation) and demonstrate that our approach MultiUAT substantially outperforms its baselines, including both static and dynamic strategies. We analyze the cross-domain transfer and show the deficiency of static and similarity-based methods.
    CSDS: A Fine-Grained Chinese Dataset for Customer Service Dialogue Summarization. (arXiv:2108.13139v2 [cs.CL] UPDATED)
    (0 min) Dialogue summarization has drawn much attention recently. Especially in the customer service domain, agents could use dialogue summaries to boost their work by quickly grasping the customer's issues and the service progress. These applications require summaries to contain the perspective of a single speaker and to have a clear topic flow structure, neither of which is available in existing datasets. Therefore, in this paper, we introduce a novel Chinese dataset for Customer Service Dialogue Summarization (CSDS). CSDS improves abstractive summaries in two aspects: (1) in addition to the overall summary for the whole dialogue, role-oriented summaries are also provided to capture different speakers' viewpoints; (2) all the summaries sum up each topic separately, thus containing the topic-level structure of the dialogue. We define the tasks in CSDS as generating the overall summary and the different role-oriented summaries for a given dialogue. Next, we compare various summarization methods on CSDS, and experimental results show that existing methods are prone to generating redundant and incoherent summaries. Besides, performance becomes much worse on role-oriented summaries and topic structures. We hope that this study can benchmark Chinese dialogue summarization and benefit further studies.
    Learning Hierarchical Structures with Differentiable Nondeterministic Stacks. (arXiv:2109.01982v1 [cs.CL])
    (0 min) Learning hierarchical structures in sequential data -- from simple algorithmic patterns to natural language -- in a reliable, generalizable way remains a challenging problem for neural language models. Past work has shown that recurrent neural networks (RNNs) struggle to generalize on held-out algorithmic or syntactic patterns without supervision or some inductive bias. To remedy this, many papers have explored augmenting RNNs with various differentiable stacks, by analogy with finite automata and pushdown automata. In this paper, we present a stack RNN model based on the recently proposed Nondeterministic Stack RNN (NS-RNN) that achieves lower cross-entropy than all previous stack RNNs on five context-free language modeling tasks (within 0.05 nats of the information-theoretic lower bound), including a task in which the NS-RNN previously failed to outperform a deterministic stack RNN baseline. Our model assigns arbitrary positive weights instead of probabilities to stack actions, and we provide an analysis of why this improves training. We also propose a restricted version of the NS-RNN that makes it practical to use for language modeling on natural language and present results on the Penn Treebank corpus.
    Uncovering the Limits of Text-based Emotion Detection. (arXiv:2109.01900v1 [cs.CL])
    (0 min) Identifying emotions from text is crucial for a variety of real-world tasks. We consider the two largest corpora now available for emotion classification: GoEmotions, with 58k messages labelled by readers, and Vent, with 33M writer-labelled messages. We design a benchmark and evaluate several feature spaces and learning algorithms, including two simple yet novel models on top of BERT that outperform previous strong baselines on GoEmotions. Through an experiment with human participants, we also analyze the differences between how writers express emotions and how readers perceive them. Our results suggest that emotions expressed by writers are harder to identify than emotions that readers perceive. We share a public web interface for researchers to explore our models.
    Imposing Relation Structure in Language-Model Embeddings Using Contrastive Learning. (arXiv:2109.00840v2 [cs.CL] UPDATED)
    (0 min) Though language model text embeddings have revolutionized NLP research, their ability to capture high-level semantic information, such as relations between entities in text, is limited. In this paper, we propose a novel contrastive learning framework that trains sentence embeddings to encode the relations in a graph structure. Given a sentence (unstructured text) and its graph, we use contrastive learning to impose relation-related structure on the token-level representations of the sentence obtained with a CharacterBERT (El Boukkouri et al., 2020) model. The resulting relation-aware sentence embeddings achieve state-of-the-art results on the relation extraction task using only a simple KNN classifier, thereby demonstrating the success of the proposed method. Additional visualization by a t-SNE analysis shows the effectiveness of the learned representation space compared to baselines. Furthermore, we show that we can learn a different space for named entity recognition, again using a contrastive learning objective, and we demonstrate how to successfully combine both representation spaces in an entity-relation task.
    SideControl: Controlled Open-domain Dialogue Generation via Additive Side Networks. (arXiv:2109.01958v1 [cs.CL])
    (0 min) Transformer-based pre-trained language models boost the performance of open-domain dialogue systems. Prior works leverage Transformer-based pre-trained language models to generate texts with desired attributes in two general approaches: (1) gradient-based methods: updating all latent representations of pre-trained models with gradients from attribute models; (2) weighted-decoding methods: re-ranking beam candidates from pre-trained models with attribute functions. However, gradient-based methods lead to high computation cost and can easily get overfitted on small training sets, while weighted-decoding methods are inherently constrained by the low-variance high-bias pre-trained model. In this work, we propose a novel approach to control the generation of Transformer-based pre-trained language models: the SideControl framework, which leverages a novel control attributes loss to incorporate useful control signals, and is shown to perform well with very limited training samples. We evaluate our proposed method on two benchmark open-domain dialogue datasets, and results show that the SideControl framework has better controllability, higher generation quality and better sample-efficiency than existing gradient-based and weighted-decoding baselines.
    LightTag: Text Annotation Platform. (arXiv:2109.02320v1 [cs.CL])
    (2 min) Text annotation tools assume that their users' goal is to create a labeled corpus. However, users view annotation as a necessary evil on the way to delivering business value through NLP. Thus an annotation tool should optimize for the throughput of the global NLP process, not only the productivity of individual annotators. LightTag is a text annotation tool designed and built on that principle. This paper shares our design rationale, data modeling choices, and user interface decisions, and then illustrates how those choices serve the full NLP lifecycle.
    Knowing False Negatives: An Adversarial Training Method for Distantly Supervised Relation Extraction. (arXiv:2109.02099v1 [cs.CL])
    (0 min) Distantly supervised relation extraction (RE) automatically aligns unstructured text with relation instances in a knowledge base (KB). Due to the incompleteness of current KBs, sentences implying certain relations may be annotated as N/A instances, which causes the so-called false negative (FN) problem. Current RE methods usually overlook this problem, inducing improper biases in both training and testing procedures. To address this issue, we propose a two-stage approach. First, it identifies possible FN samples by heuristically leveraging the memory mechanism of deep neural networks. Then, it aligns those unlabeled data with the training data into a unified feature space by adversarial training to assign pseudo labels and further utilize the information contained in them. Experiments on two widely used benchmark datasets demonstrate the effectiveness of our approach.
    Matching-oriented Product Quantization For Ad-hoc Retrieval. (arXiv:2104.07858v2 [cs.CL] UPDATED)
    (2 min) Product quantization (PQ) is a widely used technique for ad-hoc retrieval. Recent studies propose supervised PQ, where the embedding and quantization models can be jointly trained with supervised learning. However, there is a lack of appropriate formulation of the joint training objective; thus, the improvements over previous non-supervised baselines are limited in reality. In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated. With the minimization of MCL, we are able to maximize the matching probability of query and ground-truth key, which contributes to the optimal retrieval accuracy. Given that the exact computation of MCL is intractable due to the demand of vast contrastive samples, we further propose the Differentiable Cross-device Sampling (DCS), which significantly augments the contrastive samples for precise approximation of MCL. We conduct extensive experimental studies on four real-world datasets, whose results verify the effectiveness of MoPQ.
    The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models. (arXiv:2103.06678v2 [cs.CL] UPDATED)
    (2 min) In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.
    A Neural Network-Based Linguistic Similarity Measure for Entrainment in Conversations. (arXiv:2109.01924v1 [cs.CL])
    (0 min) Linguistic entrainment is a phenomenon where people tend to mimic each other in conversation. The core instrument to quantify entrainment is a linguistic similarity measure between conversational partners. Most of the current similarity measures are based on bag-of-words approaches that rely on linguistic markers, ignoring the overall language structure and dialogue context. To address this issue, we propose to use a neural network model to perform the similarity measure for entrainment. Our model is context-aware, and it further leverages a novel component to learn the shared high-level linguistic features across dialogues. We first investigate the effectiveness of our novel component. Then we use the model to perform similarity measure in a corpus-based entrainment analysis. We observe promising results for both evaluation tasks.
    An automated domain-independent text reading, interpreting and extracting approach for reviewing the scientific literature. (arXiv:2107.14638v3 [cs.CL] UPDATED)
    (0 min) We present here a machine learning-based (ML) natural language processing (NLP) approach capable of automatically recognizing and extracting categorical and numerical parameters from a corpus of articles. The approach (named a.RIX) operates with a concomitant/interchangeable use of ML models such as neural networks (NNs), latent semantic analysis (LSA), naive Bayes classifiers (NBC), and a pattern recognition model using regular expressions (REGEX). A corpus of 7,873 scientific articles dealing with natural products (NPs) was used to demonstrate the efficiency of the a.RIX engine. The engine automatically extracts categorical and numerical parameters such as (i) the plant species from which active molecules are extracted, (ii) the microorganism species against which the active molecules act, and (iii) the values of minimum inhibitory concentration (MIC) against these microorganisms. The parameters are extracted without part-of-speech tagging (POS) or named entity recognition (NER) approaches (i.e. without the need for text annotation), and model training is performed with unsupervised approaches. In this way, a.RIX can be used on articles from essentially any scientific field. Finally, it can potentially make the current article reviewing process obsolete in some areas, especially those in which machine learning models capture text structure, text semantics, and latent knowledge.
    Emotion Dynamics in Movie Dialogues. (arXiv:2103.01345v5 [cs.CL] UPDATED)
    (0 min) Emotion dynamics is a framework for measuring how an individual's emotions change over time. It is a powerful tool for understanding how we behave and interact with the world. In this paper, we introduce a framework to track emotion dynamics through one's utterances. Specifically we introduce a number of utterance emotion dynamics (UED) metrics inspired by work in Psychology. We use this approach to trace emotional arcs of movie characters. We analyze thousands of such character arcs to test hypotheses that inform our broader understanding of stories. Notably, we show that there is a tendency for characters to use increasingly more negative words and become increasingly emotionally discordant with each other until about 90 percent of the narrative length. UED also has applications in behavior studies, social sciences, and public health.
    Handwritten Character Recognition of South Indian Scripts: A Review. (arXiv:1106.0107v1 [cs.CV] CROSS LISTED)
    (0 min) Handwritten character recognition has always been a frontier area of research in the field of pattern recognition and image processing, and there is a large demand for OCR on handwritten documents. Even though sufficient studies have been performed on foreign scripts such as Chinese, Japanese and Arabic characters, only a few works can be traced for handwritten character recognition of Indian scripts, especially the South Indian scripts. This paper provides an overview of offline handwritten character recognition in South Indian scripts, namely Malayalam, Tamil, Kannada and Telugu.
    Mitigating harm in language models with conditional-likelihood filtration. (arXiv:2108.07790v2 [cs.CL] UPDATED)
    (0 min) Language models trained on large-scale unfiltered datasets curated from the open web acquire systemic biases, prejudices, and harmful views from their training data. We present a methodology for programmatically identifying and removing harmful text from web-scale datasets. A pretrained language model is used to calculate the log-likelihood of researcher-written trigger phrases conditioned on a specific document, which is used to identify and filter documents from the dataset. We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text, with a marginal decrease in performance on standard language modeling benchmarks compared to unfiltered baselines. We provide a partial explanation for this performance gap by surfacing examples of hate speech and other undesirable content from standard language modeling benchmarks. Finally, we discuss the generalization of this method and how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.
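    A minimal sketch of the conditional-likelihood scoring step described above, assuming a Hugging Face causal LM such as GPT-2; the trigger phrase, threshold, and example documents are illustrative placeholders, not the authors' actual choices:

    ```python
    # Sketch: score a document by the log-likelihood of a researcher-written
    # trigger phrase conditioned on it, then filter high-scoring documents.
    # Assumes the Hugging Face `transformers` library; the trigger phrase and
    # threshold below are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def trigger_loglik(document: str, trigger: str) -> float:
        doc_ids = tokenizer(document, return_tensors="pt").input_ids
        trig_ids = tokenizer(" " + trigger, return_tensors="pt").input_ids
        input_ids = torch.cat([doc_ids, trig_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Log-probability of each trigger token given the preceding context.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        trig_positions = range(doc_ids.shape[1] - 1, input_ids.shape[1] - 1)
        token_lp = [log_probs[pos, input_ids[0, pos + 1]] for pos in trig_positions]
        return torch.stack(token_lp).mean().item()

    # Keep only documents for which the (hypothetical) trigger phrase is unlikely.
    docs = ["an innocuous page about gardening", "another short web document"]
    TRIGGER = "example trigger phrase"   # placeholder, not from the paper
    THRESHOLD = -8.0                     # placeholder threshold
    filtered = [d for d in docs if trigger_loglik(d, TRIGGER) < THRESHOLD]
    ```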
    M^2-MedDialog: A Dataset and Benchmarks for Multi-domain Multi-service Medical Dialogues. (arXiv:2109.00430v2 [cs.CL] UPDATED)
    (0 min) Medical dialogue systems (MDSs) aim to assist doctors and patients with a range of professional medical services, i.e., diagnosis, consultation, and treatment. However, a one-stop MDS is still unexplored because: (1) no existing dataset contains large-scale dialogues with both multiple medical services and fine-grained medical labels (i.e., intents, slots, values); (2) no model has addressed an MDS based on multiple-service conversations in a unified framework. In this work, we first build a Multiple-domain Multiple-service medical dialogue (M^2-MedDialog) dataset, which contains 1,557 conversations between doctors and patients, covering 276 types of diseases, 2,468 medical entities, and 3 specialties of medical services. To the best of our knowledge, it is the only medical dialogue dataset that includes both multiple medical services and fine-grained medical labels. Then, we formulate a one-stop MDS as a sequence-to-sequence generation problem. We unify an MDS with causal language modeling and conditional causal language modeling, respectively. Specifically, we employ several pretrained models (i.e., BERT-WWM, BERT-MED, GPT2, and MT5) and their variants to establish benchmarks on the M^2-MedDialog dataset. We also propose pseudo labeling and natural perturbation methods to expand the M^2-MedDialog dataset and enhance the state-of-the-art pretrained models. We demonstrate the results achieved by the benchmarks so far through extensive experiments on M^2-MedDialog. We release the dataset, the code, as well as the evaluation scripts to facilitate future research in this important research direction.
    When Retriever-Reader Meets Scenario-Based Multiple-Choice Questions. (arXiv:2108.13875v2 [cs.CL] UPDATED)
    (0 min) Scenario-based question answering (SQA) requires retrieving and reading paragraphs from a large corpus to answer a question which is contextualized by a long scenario description. Since a scenario contains both keyphrases for retrieval and much noise, retrieval for SQA is extremely difficult. Moreover, it can hardly be supervised due to the lack of relevance labels of paragraphs for SQA. To meet the challenge, in this paper we propose a joint retriever-reader model called JEEVES where the retriever is implicitly supervised only using QA labels via a novel word weighting mechanism. JEEVES significantly outperforms a variety of strong baselines on multiple-choice questions in three SQA datasets.
    Opinion Prediction with User Fingerprinting. (arXiv:2108.00270v2 [cs.CL] UPDATED)
    (0 min) Opinion prediction is an emerging research area with diverse real-world applications, such as market research and situational awareness. We identify two lines of approaches to the problem of opinion prediction. One uses topic-based sentiment analysis with time-series modeling, while the other uses static embedding of text. The latter approaches seek user-specific solutions by generating user fingerprints. Such approaches are useful in predicting users' reactions to unseen content. In this work, we propose a novel dynamic fingerprinting method that leverages contextual embedding of a user's comments conditioned on their relevant reading history. We integrate BERT variants with a recurrent neural network to generate predictions. The results show up to 13% improvement in micro F1-score compared to previous approaches. Experimental results also reveal previously unknown insights, such as better predictions as the dynamic history length increases and the impact of the nature of the article on performance, laying the foundation for further research.
    Constructive and Toxic Speech Detection for Open-domain Social Media Comments in Vietnamese. (arXiv:2103.10069v5 [cs.CL] UPDATED)
    (0 min) The rise of social media has led to an increase in comments on online forums. However, many comments are not informative for users, and some are toxic and harmful to people. In this paper, we create a dataset for constructive and toxic speech detection, named UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset), with 10,000 human-annotated comments. For these tasks, we propose a system for constructive and toxic speech detection using PhoBERT, the state-of-the-art transfer learning model for Vietnamese NLP. With this system, we obtain F1-scores of 78.59% and 59.40% for classifying constructive and toxic comments, respectively. Besides, we implement various baseline models, such as traditional machine learning and deep neural network-based models, to evaluate the dataset. With these results, we can address several tasks on online discussions and develop a framework for automatically identifying the constructiveness and toxicity of Vietnamese social media comments.
    Hocalarim: Mining Turkish Student Reviews. (arXiv:2109.02325v1 [cs.CL])
    (0 min) We introduce Hocalarim (MyProfessors), the largest student review dataset available for the Turkish language. It consists of over 5000 professor reviews left online by students, with different aspects of education rated on a scale of 1 to 5 stars. We investigate the properties of the dataset and present its statistics. We examine the impact of students' institution type on their ratings and the correlation of students' bias to give positive or negative feedback.
    QUEACO: Borrowing Treasures from Weakly-labeled Behavior Data for Query Attribute Value Extraction. (arXiv:2108.08468v3 [cs.CL] UPDATED)
    (0 min) We study the problem of query attribute value extraction, which aims to identify named entities in user queries as diverse surface-form attribute values and then transform them into canonical forms. Such a problem consists of two phases: named entity recognition (NER) and attribute value normalization (AVN). However, existing works only focus on the NER phase and neglect the equally important AVN. To bridge this gap, this paper proposes a unified query attribute value extraction system in e-commerce search named QUEACO, which involves both phases. Moreover, by leveraging large-scale weakly-labeled behavior data, we further improve the extraction performance with less supervision cost. Specifically, for the NER phase, QUEACO adopts a novel teacher-student network, where a teacher network that is trained on the strongly-labeled data generates pseudo-labels to refine the weakly-labeled data for training a student network. Meanwhile, the teacher network can be dynamically adapted by the feedback of the student's performance on strongly-labeled data to maximally denoise the noisy supervision from the weak labels. For the AVN phase, we also leverage the weakly-labeled query-to-attribute behavior data to normalize surface-form attribute values from queries into canonical forms from products. Extensive experiments on a real-world large-scale e-commerce dataset demonstrate the effectiveness of QUEACO.
    BERT might be Overkill: A Tiny but Effective Biomedical Entity Linker based on Residual Convolutional Neural Networks. (arXiv:2109.02237v1 [cs.CL])
    (0 min) Biomedical entity linking is the task of linking entity mentions in a biomedical document to referent entities in a knowledge base. Recently, many BERT-based models have been introduced for the task. While these models have achieved competitive results on many datasets, they are computationally expensive and contain about 110M parameters. Little is known about the factors contributing to their impressive performance and whether the over-parameterization is needed. In this work, we shed some light on the inner working mechanisms of these large BERT-based models. Through a set of probing experiments, we have found that the entity linking performance only changes slightly when the input word order is shuffled or when the attention scope is limited to a fixed window size. From these observations, we propose an efficient convolutional neural network with residual connections for biomedical entity linking. Because of the sparse connectivity and weight sharing properties, our model has a small number of parameters and is highly efficient. On five public datasets, our model achieves comparable or even better linking accuracy than the state-of-the-art BERT-based models while having about 60 times fewer parameters.
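    A hedged sketch of the kind of lightweight residual convolutional encoder the abstract describes; widths, depth, and the pooling choice are illustrative assumptions, not the paper's exact architecture:

    ```python
    # Sketch: a small residual 1-D CNN mention encoder for entity linking,
    # operating on pre-computed token embeddings. Dimensions are illustrative.
    import torch
    import torch.nn as nn

    class ResidualConvBlock(nn.Module):
        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            pad = kernel_size // 2
            self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
            self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
            self.act = nn.ReLU()

        def forward(self, x):            # x: (batch, channels, seq_len)
            h = self.act(self.conv1(x))
            h = self.conv2(h)
            return self.act(x + h)       # residual connection

    class ConvMentionEncoder(nn.Module):
        def __init__(self, emb_dim: int = 128, depth: int = 3):
            super().__init__()
            self.blocks = nn.Sequential(*[ResidualConvBlock(emb_dim) for _ in range(depth)])

        def forward(self, token_embs):   # (batch, seq_len, emb_dim)
            h = self.blocks(token_embs.transpose(1, 2))
            return h.max(dim=-1).values  # pooled mention representation
    ```

    In such a setup, the pooled mention vector would typically be scored against candidate entity embeddings with a simple similarity measure such as cosine similarity.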
    STaCK: Sentence Ordering with Temporal Commonsense Knowledge. (arXiv:2109.02247v1 [cs.CL])
    (0 min) Sentence order prediction is the task of finding the correct order of sentences in a randomly ordered document. Correctly ordering the sentences requires an understanding of coherence with respect to the chronological sequence of events described in the text. Document-level contextual understanding and commonsense knowledge centered around these events are often essential in uncovering this coherence and predicting the exact chronological order. In this paper, we introduce STaCK -- a framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences. Our graph network accumulates temporal evidence using knowledge of 'past' and 'future' and formulates sentence ordering as a constrained edge classification problem. We report results on five different datasets, and empirically show that the proposed method is naturally suitable for order prediction. The implementation of this work is publicly available at: https://github.com/declare-lab/sentence-ordering.
    EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference. (arXiv:2011.14203v5 [cs.AR] UPDATED)
    (0 min) Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP. EdgeBERT employs entropy-based early exit prediction in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) in which the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system generates up to 7x, 2.5x, and 53x lower energy compared to conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
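    A minimal sketch of the entropy-based early-exit rule mentioned above; the per-layer classifier logits and the entropy threshold are illustrative placeholders:

    ```python
    # Sketch: entropy-based early exit over per-layer classifier heads.
    # `layer_logits` would come from intermediate classifiers attached to each
    # transformer layer; the threshold below is an illustrative placeholder.
    import torch

    def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(logits, dim=-1)
        return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

    def early_exit(layer_logits, threshold: float = 0.2):
        """Return (layer_index, prediction) for the first confident layer."""
        for i, logits in enumerate(layer_logits):
            if predictive_entropy(logits).item() < threshold:
                return i, logits.argmax(dim=-1)
        return len(layer_logits) - 1, layer_logits[-1].argmax(dim=-1)

    # Example with random per-layer logits for a 3-class task.
    layer_logits = [torch.randn(3) for _ in range(12)]
    exit_layer, prediction = early_exit(layer_logits)
    ```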
    Contextualized Embeddings based Convolutional Neural Networks for Duplicate Question Identification. (arXiv:2109.01560v2 [cs.CL] UPDATED)
    (0 min) Question Paraphrase Identification (QPI) is a critical task for large-scale Question-Answering forums. The purpose of QPI is to determine whether a given pair of questions are semantically identical or not. Previous approaches for this task have yielded promising results, but have often relied on complex recurrence mechanisms that are expensive and time-consuming in nature. In this paper, we propose a novel architecture combining a Bidirectional Transformer Encoder with Convolutional Neural Networks for the QPI task. We produce the predictions from the proposed architecture using two different inference setups: Siamese and Matched Aggregation. Experimental results demonstrate that our model achieves state-of-the-art performance on the Quora Question Pairs dataset. We empirically prove that the addition of convolution layers to the model architecture improves the results in both inference setups. We also investigate the impact of partial and complete fine-tuning and analyze the trade-off between computational power and accuracy in the process. Based on the obtained results, we conclude that the Matched-Aggregation setup consistently outperforms the Siamese setup. Our work provides insights into what architecture combinations and setups are likely to produce better results for the QPI task.
    Sent2Span: Span Detection for PICO Extraction in the Biomedical Text without Span Annotations. (arXiv:2109.02254v1 [cs.CL])
    (0 min) The rapid growth in published clinical trials makes it difficult to maintain up-to-date systematic reviews, which requires finding all relevant trials. This leads to policy and practice decisions based on out-of-date, incomplete, and biased subsets of available clinical evidence. Extracting and then normalising Population, Intervention, Comparator, and Outcome (PICO) information from clinical trial articles may be an effective way to automatically assign trials to systematic reviews and avoid searching and screening - the two most time-consuming systematic review processes. We propose and test a novel approach to PICO span detection. The major difference between our proposed method and previous approaches is that it detects spans without needing annotated span data, using only crowdsourced sentence-level annotations. Experiments on two datasets show that our PICO span detection achieves much higher recall than fully supervised methods, with PICO sentence detection at least as good as human annotations. By removing the reliance on expert annotations for span detection, this work could be used in a human-machine pipeline for turning low-quality, crowdsourced sentence-level PICO annotations into structured information that can be used to quickly assign trials to relevant systematic reviews.
    Planning with Learned Entity Prompts for Abstractive Summarization. (arXiv:2104.07606v2 [cs.CL] UPDATED)
    (0 min) We introduce a simple but flexible mechanism to learn an intermediate plan to ground the generation of abstractive summaries. Specifically, we prepend (or prompt) target summaries with entity chains -- ordered sequences of entities mentioned in the summary. Transformer-based sequence-to-sequence models are then trained to generate the entity chain and then continue generating the summary conditioned on the entity chain and the input. We experimented with both pretraining and finetuning with this content planning objective. When evaluated on CNN/DailyMail, XSum, SAMSum and BillSum, we demonstrate empirically that the grounded generation with the planning objective improves entity specificity and planning in summaries for all datasets, and achieves state-of-the-art performance on XSum and SAMSum in terms of Rouge. Moreover, we demonstrate empirically that planning with entity chains provides a mechanism to control hallucinations in abstractive summaries. By prompting the decoder with a modified content plan that drops hallucinated entities, we outperform state-of-the-art approaches for faithfulness when evaluated automatically and by humans.
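    A minimal sketch of how a training target prompted with an entity chain might be assembled; the separator strings are illustrative placeholders rather than the paper's exact tokens:

    ```python
    # Sketch: prepend an ordered entity chain to the target summary so a
    # seq2seq model first generates the content plan, then the summary.
    # The marker strings below are illustrative placeholders.
    def build_target(summary: str, entities: list[str]) -> str:
        chain = " | ".join(entities)
        return f"[ENTITYCHAIN] {chain} [SUMMARY] {summary}"

    target = build_target(
        "Frodo leaves the Shire and travels to Mordor.",
        ["Frodo", "the Shire", "Mordor"],
    )
    # -> "[ENTITYCHAIN] Frodo | the Shire | Mordor [SUMMARY] Frodo leaves ..."
    ```

    At inference time, dropping entities that do not appear in the input from the generated chain (before continuing the summary) is the kind of plan edit the abstract describes for reducing hallucination.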
    A Partition Filter Network for Joint Entity and Relation Extraction. (arXiv:2108.12202v5 [cs.CL] UPDATED)
    (0 min) In joint entity and relation extraction, existing work either sequentially encodes task-specific features, leading to an imbalance in inter-task feature interaction where features extracted later have no direct contact with those that come first, or encodes entity features and relation features in a parallel manner, meaning that feature representation learning for each task is largely independent of the other except for input sharing. We propose a partition filter network to model two-way interaction between tasks properly, where feature encoding is decomposed into two steps: partition and filter. In our encoder, we leverage two gates, an entity gate and a relation gate, to segment neurons into two task partitions and one shared partition. The shared partition represents inter-task information valuable to both tasks and is evenly shared across the two tasks to ensure proper two-way interaction. The task partitions represent intra-task information and are formed through the concerted efforts of both gates, making sure that encoding of task-specific features is dependent upon each other. Experiment results on six public datasets show that our model performs significantly better than previous approaches. In addition, contrary to what previous work claims, our auxiliary experiments suggest that relation prediction contributes to named entity prediction in a non-negligible way. The source code can be found at https://github.com/Coopercoppers/PFN.
    Cross-Task Generalization via Natural Language Crowdsourcing Instructions. (arXiv:2104.08773v2 [cs.CL] UPDATED)
    (0 min) Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. NLP models built with the conventional paradigm, however, often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that is equipped with the understanding of human-readable instructions that define the tasks, and can generalize to new tasks. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions and 193k task instances. The instructions are obtained from crowdsourcing instructions used to collect existing NLP datasets and mapped to a unified schema. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models can benefit from instructions to generalize across tasks. These models, however, are far behind supervised task-specific models, indicating significant room for more progress in this direction.
    Efficient Combinatorial Optimization for Word-level Adversarial Textual Attack. (arXiv:2109.02229v1 [cs.CL])
    (0 min) Over the past few years, various word-level textual attack approaches have been proposed to reveal the vulnerability of deep neural networks used in natural language processing. Typically, these approaches involve an important optimization step to determine which substitute should be used for each word in the original input. However, current research on this step is still rather limited, from the perspectives of both problem understanding and problem solving. In this paper, we address these issues by uncovering the theoretical properties of the problem and proposing an efficient local search algorithm (LS) to solve it. We establish the first provable approximation guarantee on solving the problem in general cases. Notably, for adversarial textual attack, it is even better than the previous bound, which only holds in a special case. Extensive experiments involving five NLP tasks, six datasets and eleven NLP models show that LS can largely reduce the number of queries, usually by an order of magnitude, to achieve high attack success rates. Further experiments show that the adversarial examples crafted by LS usually have higher quality, exhibit better transferability, and can bring more robustness improvement to victim models via adversarial training.
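    A hedged sketch of the general shape of a greedy local search over per-word substitutes; the candidate lists, scoring function, and stopping rule are placeholders, and this is not the paper's exact algorithm:

    ```python
    # Sketch: greedy local search over word substitutions. `score` is any
    # attack objective (e.g. the victim model's loss on the perturbed text);
    # the candidate substitutes and scoring function are placeholders.
    from typing import Callable

    def local_search(words: list[str],
                     substitutes: list[list[str]],
                     score: Callable[[list[str]], float],
                     max_rounds: int = 5) -> list[str]:
        current = list(words)
        best = score(current)
        for _ in range(max_rounds):
            improved = False
            for i, options in enumerate(substitutes):
                for cand in options:
                    trial = current[:i] + [cand] + current[i + 1:]
                    s = score(trial)
                    if s > best:            # higher score = stronger attack
                        current, best, improved = trial, s, True
            if not improved:
                break
        return current
    ```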
    Improving Numerical Reasoning Skills in the Modular Approach for Complex Question Answering on Text. (arXiv:2109.02289v1 [cs.CL])
    (0 min) Numerical reasoning skills are essential for complex question answering (CQA) over text. This requires operations including counting, comparison, addition and subtraction. A successful approach to CQA on text, Neural Module Networks (NMNs), follows the programmer-interpreter paradigm and leverages specialised modules to perform compositional reasoning. However, the NMNs framework does not consider the relationship between numbers and entities in both questions and paragraphs. We propose effective techniques to improve NMNs' numerical reasoning capabilities by making the interpreter question-aware and capturing the relationship between entities and numbers. On the same subset of the DROP dataset for CQA on text, experimental results show that our additions outperform the original NMNs by 3.0 points on the overall F1 score.
    Artificial Intelligence (AI) in Action: Addressing the COVID-19 Pandemic with Natural Language Processing (NLP). (arXiv:2010.16413v3 [cs.CL] UPDATED)
    (0 min) The COVID-19 pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
    Transformer Feed-Forward Layers Are Key-Value Memories. (arXiv:2012.14913v2 [cs.CL] UPDATED)
    (0 min) Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.
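    The key-value reading of a feed-forward layer can be written compactly; a small sketch with illustrative dimensions:

    ```python
    # Sketch: a transformer feed-forward layer read as a key-value memory.
    # Coefficients m = f(x @ K.T) weight the value rows of V, so the output
    # is a weighted sum of "memories". Dimensions are illustrative.
    import torch

    d_model, d_ff = 512, 2048
    K = torch.randn(d_ff, d_model)   # keys: one per hidden unit
    V = torch.randn(d_ff, d_model)   # values: one per hidden unit
    x = torch.randn(d_model)         # a single token representation

    m = torch.relu(x @ K.T)          # memory coefficients (activation per key)
    out = m @ V                      # output = coefficient-weighted sum of values

    # The same computation, written as the usual two-layer FFN:
    ffn_out = torch.relu(x @ K.T) @ V
    assert torch.allclose(out, ffn_out)
    ```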
    Teaching Autoregressive Language Models Complex Tasks By Demonstration. (arXiv:2109.02102v1 [cs.CL])
    (0 min) This paper demonstrates that by fine-tuning an autoregressive language model (GPT-Neo) on appropriately structured step-by-step demonstrations, it is possible to teach it to execute a mathematical task that has previously proved difficult for Transformers - longhand modulo operations - with a relatively small number of examples. Specifically, we fine-tune GPT-Neo to solve the numbers__div_remainder task from the DeepMind Mathematics Dataset; Saxton et al. (arXiv:1904.01557) reported below 40% accuracy on this task with 2 million training examples. We show that after fine-tuning on 200 appropriately structured demonstrations of solving long division problems and reporting the remainders, the smallest available GPT-Neo model achieves over 80% accuracy. This is achieved by constructing an appropriate dataset for fine-tuning, with no changes to the learning algorithm. These results suggest that fine-tuning autoregressive language models on small sets of well-crafted demonstrations may be a useful paradigm for enabling individuals without training in machine learning to coax such models to perform some kinds of complex multi-step tasks.
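    A minimal sketch of how such a step-by-step demonstration could be laid out as a fine-tuning example; the textual template (repeated subtraction) is an illustrative placeholder, not the authors' exact format:

    ```python
    # Sketch: build a step-by-step demonstration string for a remainder
    # problem, suitable as a fine-tuning example for a causal LM.
    # The template is illustrative; assumes b > 0.
    def remainder_demonstration(a: int, b: int) -> str:
        lines = [f"What is the remainder of {a} divided by {b}?"]
        remaining, quotient = a, 0
        while remaining >= b:
            remaining -= b
            quotient += 1
            lines.append(f"{a} - {b} * {quotient} = {remaining}")
        lines.append(f"Answer: {remaining}")
        return "\n".join(lines)

    print(remainder_demonstration(17, 5))
    # What is the remainder of 17 divided by 5?
    # 17 - 5 * 1 = 12
    # 17 - 5 * 2 = 7
    # 17 - 5 * 3 = 2
    # Answer: 2
    ```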
    Impact and dynamics of hate and counter speech online. (arXiv:2009.08392v3 [cs.SI] UPDATED)
    (0 min) Citizen-generated counter speech is a promising way to fight hate speech and promote peaceful, non-polarized discourse. However, there is a lack of large-scale longitudinal studies of its effectiveness for reducing hate speech. To this end, we perform an exploratory analysis of the effectiveness of counter speech using several different macro- and micro-level measures to analyze 180,000 political conversations that took place on German Twitter over four years. We report on the dynamic interactions of hate and counter speech over time and provide insights into whether, as in 'classic' bullying situations, organized efforts are more effective than independent individuals in steering online discourse. Taken together, our results build a multifaceted picture of the dynamics of hate and counter speech online. While we make no causal claims due to the complexity of discourse dynamics, our findings suggest that organized hate speech is associated with changes in public discourse and that counter speech -- especially when organized -- may help curb hateful rhetoric in online discourse.
    A Masked Segmental Language Model for Unsupervised Natural Language Segmentation. (arXiv:2104.07829v2 [cs.CL] UPDATED)
    (0 min) Segmentation remains an important preprocessing step both in languages where "words" or other important syntactic/semantic units (like morphemes) are not clearly delineated by white space, as well as when dealing with continuous speech data, where there is often no meaningful pause between words. Near-perfect supervised methods have been developed for use in resource-rich languages such as Chinese, but many of the world's languages are both morphologically complex, and have no large dataset of "gold" segmentations into meaningful units. To solve this problem, we propose a new type of Segmental Language Model (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021) for use in both unsupervised and lightly supervised segmentation tasks. We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture, harnessing the power of a bi-directional masked modeling context and attention. In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese (PKU Corpus) in segmentation quality, and performs similarly to the Recurrent model on English (PTB). We conclude by discussing the different challenges posed in segmenting phonemic-type writing systems.
    Data Augmentation for Cross-Domain Named Entity Recognition. (arXiv:2109.01758v1 [cs.CL])
    (2 min) Current work in named entity recognition (NER) shows that data augmentation techniques can produce more robust models. However, most existing techniques focus on augmenting in-domain data in low-resource scenarios where annotated data is quite limited. In contrast, we study cross-domain data augmentation for the NER task. We investigate the possibility of leveraging data from high-resource domains by projecting it into the low-resource domains. Specifically, we propose a novel neural architecture to transform the data representation from a high-resource to a low-resource domain by learning the patterns (e.g. style, noise, abbreviations, etc.) in the text that differentiate them and a shared feature space where both domains are aligned. We experiment with diverse datasets and show that transforming the data to the low-resource domain representation achieves significant improvements over only using data from high-resource domains.
    Self-Supervised Detection of Contextual Synonyms in a Multi-Class Setting: Phenotype Annotation Use Case. (arXiv:2109.01935v1 [cs.CL])
    (2 min) Contextualised word embeddings are a powerful tool for detecting contextual synonyms. However, most current state-of-the-art (SOTA) deep learning concept extraction methods remain supervised and underexploit the potential of the context. In this paper, we propose a self-supervised pre-training approach which is able to detect contextual synonyms of concepts by being trained on data created by shallow matching. We apply our methodology in a sparse multi-class setting (over 15,000 concepts) to extract phenotype information from electronic health records. We further investigate data augmentation techniques to address the problem of class sparsity. Our approach achieves a new SOTA for unsupervised phenotype concept annotation on clinical text on F1 and Recall, outperforming the previous SOTA with gains of up to 4.5 and 4.0 absolute points, respectively. After fine-tuning with as little as 20% of the labelled data, we also outperform BioBERT and ClinicalBERT. The extrinsic evaluation on three ICU benchmarks also shows the benefit of using the phenotypes annotated by our model as features.
    Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach. (arXiv:2109.01862v1 [cs.CL])
    (2 min) In recent years, neural paraphrase generation based on Seq2Seq has achieved superior performance; however, the generated paraphrases still lack diversity. In this paper, we focus on improving the diversity between the generated paraphrase and the original sentence, i.e., making the generated paraphrase as different from the original sentence as possible. We propose BTmPG (Back-Translation guided multi-round Paraphrase Generation), which leverages multi-round paraphrase generation to improve diversity and employs back-translation to preserve semantic information. We evaluate BTmPG on two benchmark datasets. Both automatic and human evaluation show BTmPG can improve the diversity of paraphrases while preserving the semantics of the original sentence.
    Efficient Attention Branch Network with Combined Loss Function for Automatic Speaker Verification Spoof Detection. (arXiv:2109.02051v1 [cs.SD])
    (2 min) Many endeavors have sought to develop countermeasure techniques as enhancements to Automatic Speaker Verification (ASV) systems, in order to make them more robust against spoof attacks. As evidenced by the latest ASVspoof 2019 countermeasure challenge, models currently deployed for the task of ASV are, at their best, devoid of suitable degrees of generalization to unseen attacks. Upon further investigation of the proposed methods, it appears that a broader, three-tiered view of the proposed systems, comprising the classifier, the feature extraction phase, and the model loss function, may to some extent lessen the problem. Accordingly, the present study proposes the Efficient Attention Branch Network (EABN) modular architecture with a combined loss function to address the generalization problem...
    Frustratingly Simple Pretraining Alternatives to Masked Language Modeling. (arXiv:2109.01819v1 [cs.CL])
    (2 min) Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced by a [MASK] placeholder in a multi-class setting over the entire vocabulary. When pretraining, it is common to use other auxiliary objectives on the token or sequence level alongside MLM to improve downstream performance (e.g. next sentence prediction). However, no previous work has examined whether other simpler objectives, linguistically intuitive or not, can be used standalone as the main pretraining objective. In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements for MLM. Empirical results on GLUE and SQuAD show that our proposed methods achieve comparable or better performance than MLM using a BERT-BASE architecture. We further validate our methods using smaller models, showing that pretraining BERT-MEDIUM, a model with 41% of BERT-BASE's parameters, results in only a 1% drop in GLUE scores with our best objective.
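    As one concrete illustration of a token-level classification objective of this kind (not necessarily one of the five objectives the paper studies), input tokens can be randomly swapped and the model trained to flag the swapped positions:

    ```python
    # Sketch: a token-level binary classification objective as an MLM
    # alternative - randomly resample some input positions and train the
    # model to predict, for every position, whether it was resampled.
    # This objective is illustrative, not necessarily one from the paper.
    import random

    def corrupt_tokens(tokens: list[str], swap_prob: float = 0.15):
        corrupted, labels = list(tokens), []
        for i in range(len(tokens)):
            if random.random() < swap_prob:
                j = random.randrange(len(tokens))
                corrupted[i] = tokens[j]      # replace with another token from the sequence
                labels.append(1)              # 1 = resampled position
            else:
                labels.append(0)
        return corrupted, labels

    inputs, targets = corrupt_tokens("the cat sat on the mat".split())
    ```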
    Weakly Supervised Relative Spatial Reasoning for Visual Question Answering. (arXiv:2109.01934v1 [cs.CV])
    (2 min) Vision-and-language (V&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of visual reasoning is spatial understanding, which involves understanding relative locations of objects, i.e., implicitly learning the geometry of the scene. In this work, we evaluate the faithfulness of V&L models to such geometric understanding, by formulating the prediction of pair-wise relative locations of objects as a classification as well as a regression task. Our findings suggest that state-of-the-art transformer-based V&L models lack sufficient abilities to excel at this task. Motivated by this, we design two objectives as proxies for 3D spatial reasoning (SR) -- object centroid estimation, and relative position estimation, and train V&L with weak supervision from off-the-shelf depth estimators. This leads to considerable improvements in accuracy for the "GQA" visual question answering challenge (in fully supervised, few-shot, and O.O.D settings) as well as improvements in relative spatial reasoning. Code and data will be released at https://github.com/pratyay-banerjee/weak_sup_vqa.
    Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment. (arXiv:2109.01949v1 [cs.LG])
    (2 min) Self-supervised learning provides an opportunity to explore unlabeled chest X-rays and their associated free-text reports accumulated in clinical routine without manual supervision. This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports. The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching. Both are bidirectionally constrained on Cross-Entropy based and ranking-based Triplet Matching Losses. The region-word matching is calculated using the attention mechanism without direct supervision about their mapping. The pre-trained multi-modal representation learning paves the way for downstream tasks concerning image and/or text encoding. We demonstrate the representation learning quality by cross-modality retrievals and multi-label classifications on two datasets: OpenI-IU and MIMIC-CXR.
    Error Detection in Large-Scale Natural Language Understanding Systems Using Transformer Models. (arXiv:2109.01754v1 [cs.CL])
    (2 min) Large-scale conversational assistants like Alexa, Siri, Cortana and Google Assistant process every utterance using multiple models for domain, intent and named entity recognition. Given the decoupled nature of model development and large traffic volumes, it is extremely difficult to identify utterances processed erroneously by such systems. We address this challenge to detect domain classification errors using offline Transformer models. We combine utterance encodings from a RoBERTa model with the N-best hypotheses produced by the production system. We then fine-tune end-to-end in a multitask setting using a small dataset of human-annotated utterances with domain classification errors. We tested our approach for detecting misclassifications from one domain that accounts for <0.5% of the traffic in a large-scale conversational AI system. Our approach achieves an F1 score of 30%, outperforming a bi-LSTM baseline by 16.9% and a standalone RoBERTa model by 4.8%. We improve this further by 2.2%, to 32.2%, by ensembling multiple models.
    Counterfactual Evaluation for Explainable AI. (arXiv:2109.01962v1 [cs.CL])
    (2 min) While recent years have witnessed the emergence of various explainable methods in machine learning, to what degree the explanations really represent the reasoning process behind the model prediction -- namely, the faithfulness of explanation -- is still an open problem. One commonly used way to measure faithfulness is the use of erasure-based criteria. Though conceptually simple, erasure-based criteria can inevitably introduce biases and artifacts. We propose a new methodology to evaluate the faithfulness of explanations from the counterfactual reasoning perspective: the model should produce substantially different outputs for the original input and its corresponding counterfactual edited on a faithful feature. Specifically, we introduce two algorithms to find the proper counterfactuals in both discrete and continuous scenarios and then use the acquired counterfactuals to measure faithfulness. Empirical results on several datasets show that, compared with existing metrics, our proposed counterfactual evaluation method achieves top correlation with the ground truth.
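    A minimal sketch of the counterfactual check itself; the model interface, the edited input, and the divergence measure are illustrative placeholders:

    ```python
    # Sketch: counterfactual faithfulness check - edit the input on a feature
    # the explanation marks as important and measure how much the model's
    # output distribution changes. All components here are placeholders.
    import torch

    def output_shift(model, original_ids, counterfactual_ids):
        """Compare output distributions for an input and its counterfactual edit."""
        with torch.no_grad():
            probs_orig = torch.softmax(model(original_ids).logits, dim=-1)
            probs_cf = torch.softmax(model(counterfactual_ids).logits, dim=-1)
        # A large shift means the edited feature mattered to the prediction,
        # supporting the explanation that marked that feature as important.
        return (probs_orig - probs_cf).abs().sum().item()
    ```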
    On the ability of monolingual models to learn language-agnostic representations. (arXiv:2109.01942v1 [cs.CL])
    (2 min) Pretrained multilingual models have become a de facto default approach for zero-shot cross-lingual transfer. Previous work has shown that these models are able to achieve cross-lingual representations when pretrained on two or more languages with shared parameters. In this work, we provide evidence that a model can achieve language-agnostic representations even when pretrained on a single language. That is, we find that monolingual models pretrained and finetuned on different languages achieve competitive performance compared to those that use the same target language. Surprisingly, the models show similar performance on the same task regardless of the pretraining language. For example, models pretrained on distant languages such as German and Portuguese perform similarly on English tasks.
    Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark. (arXiv:2109.01839v1 [cs.CL])
    (2 min) As a new kind of expressive element, Internet memes are popular and extensively used in online chatting scenarios since they make dialogues vivid, moving, and interesting. However, most current dialogue research focuses on text-only dialogue tasks. In this paper, we propose a new task named Meme incorporated Open-domain Dialogue (MOD). Compared to previous dialogue tasks, MOD is much more challenging since it requires the model to understand the multimodal elements as well as the emotions behind them. To facilitate MOD research, we construct a large-scale open-domain multimodal dialogue dataset incorporating abundant Internet memes into utterances. The dataset consists of ~45K Chinese conversations with ~606K utterances. Each conversation contains about 13 utterances with about 4 Internet memes on average, and each utterance equipped with an Internet meme is annotated with the corresponding emotion. In addition, we present a simple and effective method, which utilizes a unified generation network to solve the MOD task. Experimental results demonstrate that our method trained on the proposed corpus is able to achieve expressive communication including texts and memes. The corpus and models are publicly available at https://github.com/lizekang/DSTC10-MOD.
    Data Efficient Masked Language Modeling for Vision and Language. (arXiv:2109.02040v1 [cs.CL])
    (2 min) Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pre-training the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low resource settings. Further, our pre-training approach substantially outperforms the baseline model on a prompt-based probing task designed to elicit image objects. These results and our analysis indicate that our method allows for better utilization of the training data.
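    One simple alternative masking strategy in the spirit described above; the stop-word list, masking rate, and always-mask-one rule are illustrative assumptions rather than the paper's exact strategy:

    ```python
    # Sketch: an alternative masking strategy for cross-modal MLM that avoids
    # the two failure modes noted above - short captions where nothing gets
    # masked and masks falling mostly on stop-words. The heuristic is illustrative.
    import random

    STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "in", "with", "."}

    def mask_caption(tokens: list[str], mask_prob: float = 0.15, mask_token: str = "[MASK]"):
        content_positions = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]
        candidates = content_positions or list(range(len(tokens)))
        n_to_mask = max(1, round(mask_prob * len(tokens)))   # always mask >= 1 token
        chosen = set(random.sample(candidates, min(n_to_mask, len(candidates))))
        masked = [mask_token if i in chosen else t for i, t in enumerate(tokens)]
        return masked, sorted(chosen)

    masked, positions = mask_caption("a dog is playing with a ball".split())
    ```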
    Representation Learning for Efficient and Effective Similarity Search and Recommendation. (arXiv:2109.01815v1 [cs.IR])
    (2 min) How data is represented and operationalized is critical for building computational solutions that are both effective and efficient. A common approach is to represent data objects as binary vectors, denoted hash codes, which require little storage and enable efficient similarity search through direct indexing into a hash table or through similarity computations in an appropriate space. Due to the limited expressibility of hash codes, compared to real-valued representations, a core open challenge is how to generate hash codes that well capture semantic content or latent properties using a small number of bits, while ensuring that the hash codes are distributed in a way that does not reduce their search efficiency. State of the art methods use representation learning for generating such hash codes, focusing on neural autoencoder architectures where semantics are encoded into the hash codes by learning to reconstruct the original inputs of the hash codes. This thesis addresses the above challenge and makes a number of contributions to representation learning that (i) improve effectiveness of hash codes through more expressive representations and a more effective similarity measure than the current state of the art, namely the Hamming distance, and (ii) improve efficiency of hash codes by learning representations that are especially suited to the choice of search method. The contributions are empirically validated on several tasks related to similarity search and recommendation.
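    For context, the search-efficiency argument rests on cheap Hamming-distance computation over packed binary codes; a small sketch with illustrative sizes:

    ```python
    # Sketch: Hamming-distance search over binary hash codes packed into bytes.
    # Code length and corpus size are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    codes = rng.integers(0, 2, size=(10_000, 64), dtype=np.uint8)   # 64-bit codes
    query = rng.integers(0, 2, size=64, dtype=np.uint8)

    packed_codes = np.packbits(codes, axis=1)    # (10_000, 8) bytes
    packed_query = np.packbits(query)            # (8,) bytes

    # XOR then popcount gives Hamming distances to every corpus item.
    distances = np.unpackbits(packed_codes ^ packed_query, axis=1).sum(axis=1)
    top10 = np.argsort(distances)[:10]
    ```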
    ALLWAS: Active Learning on Language models in WASserstein space. (arXiv:2109.01691v1 [cs.CL])
    (2 min) Active learning has emerged as a standard paradigm in areas with scarcity of labeled training data, such as in the medical domain. Language models have emerged as the prevalent choice of several natural language tasks due to the performance boost offered by these models. However, in several domains, such as medicine, the scarcity of labeled training data is a common issue. Also, these models may not work well in cases where class imbalance is prevalent. Active learning may prove helpful in these cases to boost the performance with a limited label budget. To this end, we propose a novel method using sampling techniques based on submodular optimization and optimal transport for active learning in language models, dubbed ALLWAS. We construct a sampling strategy based on submodular optimization of the designed objective in the gradient domain. Furthermore, to enable learning from few samples, we propose a novel strategy for sampling from the Wasserstein barycenters. Our empirical evaluations on standard benchmark datasets for text classification show that our methods perform significantly better (>20% relative increase in some cases) than existing approaches for active learning on language models.
  • cs.CV updates on arXiv.org

    Semi-Supervised Raw-to-Raw Mapping. (arXiv:2106.13883v3 [cs.CV] UPDATED)
    (2 min) The raw-RGB colors of a camera sensor vary due to the spectral sensitivity differences across different sensor makes and models. This paper focuses on the task of mapping between different sensor raw-RGB color spaces. Prior work addressed this problem using a pairwise calibration to achieve accurate color mapping. Although accurate, this approach is less practical as it requires: (1) capturing a pair of images by both camera devices with a color calibration object placed in each new scene; (2) accurate image alignment or manual annotation of the color calibration object. This paper aims to tackle color mapping in the raw space through a more practical setup. Specifically, we present a semi-supervised raw-to-raw mapping method trained on a small set of paired images alongside an unpaired set of images captured by each camera device. Through extensive experiments, we show that our method achieves better results compared to other domain adaptation alternatives in addition to the single-calibration solution. We have generated a new dataset of raw images from two different smartphone cameras as part of this effort. Our dataset includes unpaired and paired sets for our semi-supervised training and evaluation.
    CAMS: Color-Aware Multi-Style Transfer. (arXiv:2106.13920v2 [cs.CV] UPDATED)
    (2 min) Image style transfer aims to manipulate the appearance of a source image, or "content" image, to share similar texture and colors of a target "style" image. Ideally, the style transfer manipulation should also preserve the semantic content of the source image. A commonly used approach to assist in transferring styles is based on Gram matrix optimization. One problem of Gram matrix-based optimization is that it does not consider the correlation between colors and their styles. Specifically, certain textures or structures should be associated with specific colors. This is particularly challenging when the target style image exhibits multiple style types. In this work, we propose a color-aware multi-style transfer method that generates aesthetically pleasing results while preserving the style-color correlation between style and generated images. We achieve this desired outcome by introducing a simple but efficient modification to classic Gram matrix-based style transfer optimization. A nice feature of our method is that it enables the users to manually select the color associations between the target style and content image for more transfer flexibility. We validated our method with several qualitative comparisons, including a user study conducted with 30 participants. In comparison with prior work, our method is simple, easy to implement, and achieves visually appealing results when targeting images that have multiple styles. Source code is available at https://github.com/mahmoudnafifi/color-aware-style-transfer.
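    For reference, the Gram matrix that such style losses (and the proposed color-aware modification) build on is computed from a layer's feature maps as follows; shapes are illustrative:

    ```python
    # Sketch: Gram matrix of CNN feature maps, the building block of classic
    # Gram-based style-transfer losses.
    import torch

    def gram_matrix(features: torch.Tensor) -> torch.Tensor:
        # features: (channels, height, width) from one layer of a CNN.
        c, h, w = features.shape
        f = features.reshape(c, h * w)
        return (f @ f.T) / (c * h * w)    # (channels, channels), normalized

    style_feats = torch.randn(64, 32, 32)
    G = gram_matrix(style_feats)
    ```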
    Greedy Offset-Guided Keypoint Grouping for Human Pose Estimation. (arXiv:2107.03098v2 [cs.CV] UPDATED)
    (2 min) We propose a simple yet reliable bottom-up approach with a good trade-off between accuracy and efficiency for the problem of multi-person pose estimation. Given an image, we employ an Hourglass Network to infer all the keypoints from different persons indiscriminately as well as the guiding offsets connecting the adjacent keypoints belonging to the same persons. Then, we greedily group the candidate keypoints into multiple human poses (if any), utilizing the predicted guiding offsets. And we refer to this process as greedy offset-guided keypoint grouping (GOG). Moreover, we revisit the encoding-decoding method for the multi-person keypoint coordinates and reveal some important facts affecting accuracy. Experiments have demonstrated the obvious performance improvements brought by the introduced components. Our approach is comparable to the state of the art on the challenging COCO dataset under fair conditions. The source code and our pre-trained model are publicly available online.
    How Does Heterogeneous Label Noise Impact Generalization in Neural Nets?. (arXiv:2106.15475v2 [cs.CV] UPDATED)
    (2 min) Incorrectly labeled examples, or label noise, is common in real-world computer vision datasets. While the impact of label noise on learning in deep neural networks has been studied in prior work, these studies have exclusively focused on homogeneous label noise, i.e., the degree of label noise is the same across all categories. However, in the real-world, label noise is often heterogeneous, with some categories being affected to a greater extent than others. Here, we address this gap in the literature. We hypothesized that heterogeneous label noise would only affect the classes that had label noise unless there was transfer from those classes to the classes without label noise. To test this hypothesis, we designed a series of computer vision studies using MNIST, CIFAR-10, CIFAR-100, and MS-COCO where we imposed heterogeneous label noise during the training of multi-class, multi-task, and multi-label systems. Our results provide evidence in support of our hypothesis: label noise only affects the class affected by it unless there is transfer.
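    A small sketch of how heterogeneous (per-class) label noise can be injected for a study like this; the noise rates and the label-flipping rule below are illustrative assumptions, not the paper's exact protocol.
    import numpy as np
    def add_heterogeneous_label_noise(labels, num_classes, noise_rates, seed=0):
        """Flip each label of class c to a random other class with probability noise_rates[c]."""
        rng = np.random.default_rng(seed)
        noisy = labels.copy()
        for c, rate in enumerate(noise_rates):
            idx = np.where(labels == c)[0]
            flip = idx[rng.random(len(idx)) < rate]
            # draw a replacement class guaranteed to differ from the true one
            noisy[flip] = (labels[flip] + rng.integers(1, num_classes, len(flip))) % num_classes
        return noisy
    labels = np.random.randint(0, 10, size=1000)
    rates = [0.4] * 5 + [0.0] * 5        # e.g. 40% noise on classes 0-4, clean labels for 5-9
    noisy = add_heterogeneous_label_noise(labels, 10, rates)
    print("fraction of corrupted labels:", (noisy != labels).mean())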
    Topological Semantic Mapping by Consolidation of Deep Visual Features. (arXiv:2106.12709v2 [cs.CV] UPDATED)
    (2 min) Many works in the recent literature introduce semantic mapping methods that use CNNs (Convolutional Neural Networks) to recognize semantic properties in images. The types of properties (e.g., room size, place category, and objects) and their classes (e.g., kitchen and bathroom, for place category) are usually predefined and restricted to a specific task. Thus, all the visual data acquired and processed during the construction of the maps are lost and only the recognized semantic properties remain on the maps. In contrast, this work introduces a topological semantic mapping method that uses deep visual features extracted by a CNN (GoogLeNet), from 2D images captured in multiple views of the environment as the robot operates, to create, through averages, consolidated representations of the visual features acquired in the regions covered by each topological node. These representations allow flexible recognition of semantic properties of the regions and can be reused in other visual tasks. Experiments with a real-world indoor dataset showed that the method is able to consolidate the visual features of regions and use them to recognize objects and place categories as semantic properties, and to indicate the topological location of images, with very promising results.
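    A toy sketch of the consolidation-by-averaging step described above (not the authors' code): each topological node keeps a running mean of the deep features observed in its region. The 1024-dimensional feature size and the node bookkeeping are assumptions for illustration.
    import numpy as np
    class TopologicalNode:
        """Consolidates deep visual features observed in one region via a running average."""
        def __init__(self, dim):
            self.mean = np.zeros(dim)
            self.count = 0
        def update(self, feature):
            self.count += 1
            self.mean += (feature - self.mean) / self.count   # incremental mean
    nodes = {}
    observations = [(0, np.random.rand(1024)) for _ in range(5)]   # (node_id, CNN feature)
    for node_id, feature in observations:
        nodes.setdefault(node_id, TopologicalNode(1024)).update(feature)
    print(nodes[0].count, nodes[0].mean.shape)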
    Transformer Networks for Data Augmentation of Human Physical Activity Recognition. (arXiv:2109.01081v2 [cs.LG] UPDATED)
    (2 min) Data augmentation is a widely used technique in classification to increase the data used in training. It improves generalization and reduces the amount of annotated human activity data needed for training, which in turn reduces the labour and time spent on the dataset. Sensor time-series data, unlike images, cannot be augmented by computationally simple transformation algorithms. State-of-the-art models like Recurrent Generative Adversarial Networks (RGAN) are used to generate realistic synthetic data. In this paper, transformer-based generative adversarial networks, which have global attention over the data, are compared with RGAN on the PAMAP2 and Real World Human Activity Recognition data sets. The newer approach provides improvements in time and savings in computational resources needed for data augmentation compared to the previous approach.
    VMAF And Variants: Towards A Unified VQA. (arXiv:2103.07770v6 [eess.IV] UPDATED)
    (2 min) Video quality assessment (VQA) is now a fast-growing subject, maturing in the full reference (FR) case, yet challenging in the exploding no reference (NR) case. We investigate variants of the popular VMAF video quality assessment algorithm for the FR case, using both support vector regression and feedforward neural networks. We extend it to the NR case, using some different features but similar learning, to develop a partially unified framework for VQA. When fully trained, FR algorithms such as VMAF perform well on test datasets, with 90%+ match in PCC and SRCC; but for predicting performance in the wild, we train/test from scratch for each database. With an 80/20 train/test split, we still achieve 90%+ performance on average in both PCC and SRCC, with 8-9% gains over VMAF. Moreover, we even get decent performance (~75%) if we ignore the reference, treating FR as NR, partly justifying our attempts at unification. In the true NR case, we reduce complexity vs. leading recent algorithms VIDEVAL, RAPIQUE, yet achieve a stunning 90% in SRCC (~12% gain), while roughly matching in PCC (78% vs. 79.6%). At lower complexities, we can still achieve 87% in SRCC, 70% in PCC. In short, we find encouraging improvements in trainability in both FR and NR, while also constraining computational complexity against leading methods.
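    To make the learning setup concrete, here is a hedged sketch of the FR regression step: per-video quality features regressed onto subjective scores with an SVR and evaluated with PCC/SRCC. The feature set and the random data are placeholders, not the paper's features or results.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import train_test_split
    from scipy.stats import pearsonr, spearmanr
    # hypothetical per-video quality features (e.g., fidelity and motion terms) and subjective scores
    X = np.random.rand(200, 6)
    y = np.random.rand(200) * 100
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)  # 80/20 split
    model = SVR(kernel="rbf", C=10.0, epsilon=0.5).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print("PCC:", pearsonr(pred, y_te)[0], "SRCC:", spearmanr(pred, y_te)[0])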
    Image Segmentation, Compression and Reconstruction from Edge Distribution Estimation with Random Field and Random Cluster Theories. (arXiv:2104.10762v12 [eess.IV] UPDATED)
    (2 min) Random field and random cluster theory are used to describe certain mathematical results concerning the probability distribution of image pixel intensities characterized as generic $2D$ integer arrays. The size of the smallest bounded region within an image is estimated for segmenting an image, from which, the equilibrium distribution of intensities can be recovered. From the estimated bounded regions, properties of the sub-optimal and equilibrium distributions of intensities are derived, which leads to an image compression methodology whereby only slightly more than half of all pixels are required for a worst-case reconstruction of the original image. A custom deep belief network and heuristic allows for the unsupervised segmentation, detection and localization of objects in an image. An example illustrates the mathematical results.
    A Comprehensive Survey of Image-Based Food Recognition and Volume Estimation Methods for Dietary Assessment. (arXiv:2106.11776v3 [cs.CV] UPDATED)
    (2 min) Dietary studies have shown that dietary-related problems such as obesity are associated with other chronic diseases like hypertension, irregular blood sugar levels, and increased risk of heart attacks. The primary cause of these problems is poor lifestyle choices and unhealthy dietary habits, which are manageable using interactive mHealth apps. However, traditional dietary monitoring systems using manual food logging suffer from imprecision, underreporting, time consumption, and low adherence. Recent dietary monitoring systems tackle these challenges by automatic assessment of dietary intake through machine learning methods. This survey discusses the best-performing methodologies that have been developed so far for automatic food recognition and volume estimation. First, we present the rationale of visual-based methods for food recognition. The core of the paper is the presentation, discussion, and evaluation of these methods on popular food image databases. Following that, we discuss the mobile applications that implement these methods. The survey ends with a discussion of research gaps and open issues in this area.
    Neural Marching Cubes. (arXiv:2106.11272v3 [cs.GR] UPDATED)
    (3 min) We introduce Neural Marching Cubes (NMC), a data-driven approach for extracting a triangle mesh from a discretized implicit field. Classical MC is defined by coarse tessellation templates isolated to individual cubes. While more refined tessellations have been proposed, they all make heuristic assumptions, such as trilinearity, when determining the vertex positions and local mesh topologies in each cube. In principle, none of these approaches can reconstruct geometric features that reveal coherence or dependencies between nearby cubes (e.g., a sharp edge), as such information is unaccounted for, resulting in poor estimates of the true underlying implicit field. To tackle these challenges, we re-cast MC from a deep learning perspective, by designing tessellation templates more apt at preserving geometric features, and learning the vertex positions and mesh topologies from training meshes, to account for contextual information from nearby cubes. We develop a compact per-cube parameterization to represent the output triangle mesh, while being compatible with neural processing, so that a simple 3D convolutional network can be employed for the training. We show that all topological cases in each cube that are applicable to our design can be easily derived using our representation, and the resulting tessellations can also be obtained naturally and efficiently by following a few design guidelines. In addition, our network learns local features with limited receptive fields, hence it generalizes well to new shapes and new datasets. We evaluate our neural MC approach by quantitative and qualitative comparisons to all well-known MC variants. In particular, we demonstrate the ability of our network to recover sharp features such as edges and corners, a long-standing issue of MC and its variants. Our network also reconstructs local mesh topologies more accurately than previous approaches.
    Accurate and fast matrix factorization for low-rank learning. (arXiv:2104.10785v4 [stat.ML] UPDATED)
    (2 min) In this paper, we tackle two important problems in low-rank learning, which are partial singular value decomposition and numerical rank estimation of huge matrices. By using the concepts of Krylov subspaces such as Golub-Kahan bidiagonalization (GK-bidiagonalization) as well as Ritz vectors, we propose two methods for solving these problems in a fast and accurate way. Our experiments show the advantages of the proposed methods compared to the traditional and randomized singular value decomposition methods. The proposed methods are appropriate for applications involving huge matrices where the accuracy of the desired singular values and of all their corresponding singular vectors is essential. As a real application, we evaluate the performance of our methods on the problem of Riemannian similarity learning between two different image datasets, MNIST and USPS.
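    As a rough illustration of the problem setting (not the proposed GK-bidiagonalization/Ritz-vector method itself), a partial SVD and a crude numerical-rank estimate can be obtained with SciPy's Krylov-based solver; the matrix size and tolerance below are arbitrary.
    import numpy as np
    from scipy.sparse.linalg import svds
    A = np.random.rand(5000, 300)                 # large (here: tall) matrix of interest
    # svds computes a partial SVD via Krylov-subspace iteration (ARPACK by default)
    U, s, Vt = svds(A, k=10)
    order = np.argsort(s)[::-1]                   # svds returns singular values in ascending order
    U, s, Vt = U[:, order], s[order], Vt[order]
    # a crude numerical-rank estimate from the decay of the leading singular values
    tol = s[0] * 1e-3
    print("leading singular values:", s[:5], "estimated rank >=", int((s > tol).sum()))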
    Multi-Agent Variational Occlusion Inference Using People as Sensors. (arXiv:2109.02173v1 [cs.RO])
    (0 min) Autonomous vehicles must reason about spatial occlusions in urban environments to ensure safety without being overly cautious. Prior work explored occlusion inference from observed social behaviors of road agents. Inferring occupancy from agent behaviors is an inherently multimodal problem; a driver may behave in the same manner for different occupancy patterns ahead of them (e.g., a driver may move at constant speed in traffic or on an open road). Past work, however, does not account for this multimodality, thus neglecting to model this source of aleatoric uncertainty in the relationship between driver behaviors and their environment. We propose an occlusion inference method that characterizes observed behaviors of human agents as sensor measurements, and fuses them with those from a standard sensor suite. To capture the aleatoric uncertainty, we train a conditional variational autoencoder with a discrete latent space to learn a multimodal mapping from observed driver trajectories to an occupancy grid representation of the view ahead of the driver. Our method handles multi-agent scenarios, combining measurements from multiple observed drivers using evidential theory to solve the sensor fusion problem. Our approach is validated on a real-world dataset, outperforming baselines and demonstrating real-time capable performance. Our code is available at https://github.com/sisl/MultiAgentVariationalOcclusionInference .
    To be Critical: Self-Calibrated Weakly Supervised Learning for Salient Object Detection. (arXiv:2109.01770v1 [cs.CV])
    (0 min) Weakly-supervised salient object detection (WSOD) aims to develop saliency models using image-level annotations. Despite the success of previous works, explorations on an effective training strategy for the saliency network and accurate matches between image-level annotations and salient objects are still inadequate. In this work, 1) we propose a self-calibrated training strategy by explicitly establishing a mutual calibration loop between pseudo labels and network predictions, liberating the saliency network from error-prone propagation caused by pseudo labels. 2) we prove that even a much smaller dataset (merely 1.8% of ImageNet) with well-matched annotations can facilitate models to achieve better performance as well as generalizability. This sheds new light on the development of WSOD and encourages more contributions to the community. Comprehensive experiments demonstrate that our method outperforms all the existing WSOD methods by adopting the self-calibrated strategy only. Steady improvements are further achieved by training on the proposed dataset. Additionally, our method achieves 94.7% of the performance of fully-supervised methods on average. What is more, the fully supervised models adopting our predicted results as "ground truths" achieve successful results (95.6% for BASNet and 97.3% for ITSD on F-measure), while costing only 0.32% of labeling time for pixel-level annotation.
    Tailoring: encoding inductive biases by optimizing unsupervised objectives at prediction time. (arXiv:2009.10623v5 [cs.LG] UPDATED)
    (0 min) From CNNs to attention mechanisms, encoding inductive biases into neural networks has been a fruitful source of improvement in machine learning. Adding auxiliary losses to the main objective function is a general way of encoding biases that can help networks learn better representations. However, since auxiliary losses are minimized only on training data, they suffer from the same generalization gap as regular task losses. Moreover, by adding a term to the loss function, the model optimizes a different objective than the one we care about. In this work we address both problems: first, we take inspiration from \textit{transductive learning} and note that after receiving an input but before making a prediction, we can fine-tune our networks on any unsupervised loss. We call this process {\em tailoring}, because we customize the model to each input to ensure our prediction satisfies the inductive bias. Second, we formulate {\em meta-tailoring}, a nested optimization similar to that in meta-learning, and train our models to perform well on the task objective after adapting them using an unsupervised loss. The advantages of tailoring and meta-tailoring are discussed theoretically and demonstrated empirically on a diverse set of examples.
    Predicting isocitrate dehydrogenase mutation status in glioma using structural brain networks and graph neural networks. (arXiv:2109.01854v1 [eess.IV])
    (0 min) Glioma is a common malignant brain tumor that shows distinct survival among patients. The isocitrate dehydrogenase (IDH) gene mutation status provides critical diagnostic and prognostic value for glioma and is now accepted as the standard of care. A non-invasive prediction of IDH mutation based on the pre-treatment MRI has crucial clinical significance. Machine learning and deep learning models show reasonable performance in predicting IDH mutation status. However, most models neglect the systematic brain alterations caused by tumor invasion, where the infiltration along white matter tracts throughout the brain is identified as a hallmark of glioma. The structural brain network provides an effective tool to characterise brain organisation, which can be captured by graph neural networks (GNNs) for a more accurate prediction of IDH mutation status. Here we propose a method to predict the IDH mutation using a GNN, based on the structural brain network of patients. Specifically, we first construct a network template of healthy subjects, which consists of atlases of edges (white matter tracts) and nodes (cortical and subcortical brain regions) to provide regions of interest (ROIs). Next, we employ autoencoders to extract the latent multi-modal MRI features from the edge and node ROIs in patients. These edge and node features of the brain network are then used to train a GNN architecture to predict IDH mutation status. The results show that the proposed method outperforms the baseline models using 3D-CNN and 3D-DenseNet. In addition, the model interpretation suggests that the model can identify the tracts infiltrated by the tumor, consistent with clinical prior knowledge. In conclusion, integrating brain networks with GNNs offers a new avenue to study brain lesions using computational neuroscience and computer vision approaches.
    Efficient Privacy Preserving Edge Computing Framework for Image Classification. (arXiv:2005.04563v2 [cs.LG] UPDATED)
    (0 min) In order to extract knowledge from the large amounts of data collected by edge devices, the traditional cloud-based approach that requires data upload may not be feasible due to communication bandwidth limitations as well as privacy and security concerns of end users. To address these challenges, a novel privacy-preserving edge computing framework is proposed in this paper for image classification. Specifically, an autoencoder is trained unsupervised at each edge device individually, then the obtained latent vectors are transmitted to the edge server for the training of a classifier. This framework reduces the communication overhead and protects the data of the end users. Compared to federated learning, the training of the classifier in the proposed framework is not subject to the constraints of the edge devices, and the autoencoder can be trained independently at each edge device without any server involvement. Furthermore, the privacy of the end users' data is protected by transmitting latent vectors without the additional cost of encryption. Experimental results provide insights on the image classification performance vs. various design parameters such as the data compression ratio of the autoencoder and the model complexity.
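    A compact sketch of the framework's data flow under assumed toy dimensions (784-d inputs, 32-d latents, random data and labels): the autoencoder is trained locally without labels, only latent vectors are transmitted, and the server trains the classifier on them. This is an illustration, not the paper's architecture or training schedule.
    import torch
    import torch.nn as nn
    class EdgeAutoencoder(nn.Module):
        def __init__(self, in_dim=784, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))
        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z
    # edge device: train the autoencoder unsupervised on local (unlabeled) data
    edge_data = torch.rand(512, 784)
    ae = EdgeAutoencoder()
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(5):
        recon, _ = ae(edge_data)
        loss = nn.functional.mse_loss(recon, edge_data)
        opt.zero_grad(); loss.backward(); opt.step()
    # transmission: only compact latent vectors leave the device
    with torch.no_grad():
        latents = ae.encoder(edge_data)               # (512, 32) instead of (512, 784)
    # edge server: train a classifier on the received latents and labels
    labels = torch.randint(0, 10, (512,))
    clf = nn.Linear(32, 10)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(5):
        loss = nn.functional.cross_entropy(clf(latents), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    print("compression ratio:", 784 / 32)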
    Data Efficient Masked Language Modeling for Vision and Language. (arXiv:2109.02040v1 [cs.CL])
    (0 min) Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pre-training the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low resource settings. Further, our pre-training approach substantially outperforms the baseline model on a prompt-based probing task designed to elicit image objects. These results and our analysis indicate that our method allows for better utilization of the training data.
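    A minimal sketch of one alternative masking strategy in the spirit described above (mask content words and guarantee at least one masked token per caption); the stop-word list, masking rate, and token handling are illustrative assumptions, not the paper's exact strategies.
    import random
    STOP_WORDS = {"a", "an", "the", "is", "on", "of", "in", "and", "."}
    def mask_content_words(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
        """Mask only content words and guarantee at least one masked token per caption."""
        rng = random.Random(seed)
        candidates = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]
        if not candidates:
            candidates = list(range(len(tokens)))
        chosen = [i for i in candidates if rng.random() < mask_prob]
        if not chosen:                                   # short captions: force one mask
            chosen = [rng.choice(candidates)]
        return [mask_token if i in chosen else t for i, t in enumerate(tokens)], chosen
    tokens = "a dog is playing on the green grass .".split()
    masked, positions = mask_content_words(tokens)
    print(masked, positions)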
    Handwritten Character Recognition of South Indian Scripts: A Review. (arXiv:1106.0107v1 [cs.CV] CROSS LISTED)
    (0 min) Handwritten character recognition has always been a frontier area of research in the field of pattern recognition and image processing, and there is a large demand for OCR on handwritten documents. Even though sufficient studies have been performed on foreign scripts like Chinese, Japanese and Arabic characters, only very few works can be traced for handwritten character recognition of Indian scripts, especially the South Indian scripts. This paper provides an overview of offline handwritten character recognition in South Indian scripts, namely Malayalam, Tamil, Kannada and Telugu.
    RiWNet: A moving object instance segmentation Network being Robust in adverse Weather conditions. (arXiv:2109.01820v1 [cs.CV])
    (0 min) Segmenting each moving object instance in a scene is essential for many applications. Like many other computer vision tasks, methods for this task perform well in optimal weather but tend to fail in adverse weather. The usual ways to achieve robustness to weather conditions are to train the network on data of a given weather pattern or to fuse multiple sensors. We focus on a new possibility, that is, to improve the network's resilience to weather interference through its structural design. First, we propose a novel FPN structure called RiWFPN with a progressive top-down interaction and attention refinement module. RiWFPN can directly replace other FPN structures to improve the robustness of the network in non-optimal weather conditions. Then we extend SOLOV2 to capture temporal information in video to learn motion information, and propose a moving object instance segmentation network with RiWFPN called RiWNet. Finally, in order to verify the effect of moving instance segmentation under different weather disturbances, we propose a VKTTI-moving dataset, which is a moving instance segmentation dataset based on the VKTTI dataset, taking into account different weather scenes such as rain, fog, sunset, morning as well as overcast. The experiments show how RiWFPN improves the network's resilience to adverse weather effects compared to other FPN structures. We compare RiWNet to several other state-of-the-art methods on several challenging datasets, and RiWNet shows better performance, especially under adverse weather conditions.
    FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. (arXiv:2108.05080v3 [cs.CV] UPDATED)
    (0 min) While significant advancements have been made in the generation of deepfakes using deep learning technologies, its misuse is a well-known issue now. Deepfakes can cause severe security and privacy issues as they can be used to impersonate a person's identity in a video by replacing his/her face with another person's face. Recently, a new problem of generating synthesized human voice of a person is emerging, where AI-based deep learning models can synthesize any person's voice requiring just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audios and videos, a new generation of deepfake detectors is needed to focus on both video and audio collectively. A large amount of good quality datasets is typically required to capture the real-world scenarios to develop a competent deepfake detector. Existing deepfake datasets either contain deepfake videos or audios, which are racially biased as well. Hence, there is a crucial need for creating a good video as well as an audio deepfake dataset, which can be used to detect audio and video deepfake simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset (FakeAVCeleb) that contains not only deepfake videos but also respective synthesized lip-synced fake audios. We generate this dataset using the current most popular deepfake generation methods. We selected real YouTube videos of celebrities with four racial backgrounds (Caucasian, Black, East Asian, and South Asian) to develop a more realistic multimodal dataset that addresses racial bias and further help develop multimodal deepfake detectors. We performed several experiments using state-of-the-art detection methods to evaluate our deepfake dataset and demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.
    Rotation Equivariant Feature Image Pyramid Network for Object Detection in Optical Remote Sensing Imagery. (arXiv:2106.00880v3 [cs.CV] UPDATED)
    (0 min) Detection of objects is extremely important in various aerial vision-based applications. Over the last few years, the methods based on convolutional neural networks have made substantial progress. However, because of the large variety of object scales, densities, and arbitrary orientations, the current detectors struggle with the extraction of semantically strong features for small-scale objects by a predefined convolution kernel. To address this problem, we propose the rotation equivariant feature image pyramid network (REFIPN), an image pyramid network based on rotation-equivariant convolution. The proposed model adopts a single-shot detector in parallel with a lightweight image pyramid module to extract representative features and generate regions of interest in an optimization approach. The proposed network extracts features at a wide range of scales and orientations by using novel convolution filters. These features are used to generate vector fields and determine the weight and angle of the highest-scoring orientation for all spatial locations on an image. By this approach, the performance for small-sized object detection is enhanced without sacrificing the performance for large-sized object detection. The performance of the proposed model is validated on two commonly used aerial benchmarks and the results show our proposed model can achieve state-of-the-art performance with satisfactory efficiency.
    CTRL-C: Camera calibration TRansformer with Line-Classification. (arXiv:2109.02259v1 [cs.CV])
    (0 min) Single image camera calibration is the task of estimating the camera parameters from a single input image, such as the vanishing points, focal length, and horizon line. In this work, we propose Camera calibration TRansformer with Line-Classification (CTRL-C), an end-to-end neural network-based approach to single image camera calibration, which directly estimates the camera parameters from an image and a set of line segments. Our network adopts the transformer architecture to capture the global structure of an image with multi-modal inputs in an end-to-end manner. We also propose an auxiliary task of line classification to train the network to extract the global geometric information from lines effectively. Our experiments demonstrate that CTRL-C outperforms the previous state-of-the-art methods on the Google Street View and SUN360 benchmark datasets.
    Automatic Segmentation of the Optic Nerve Head Region in Optical Coherence Tomography: A Methodological Review. (arXiv:2109.02322v1 [eess.IV])
    (0 min) The optic nerve head represents the intraocular section of the optic nerve (ONH), which is prone to damage by intraocular pressure. The advent of optical coherence tomography (OCT) has enabled the evaluation of novel optic nerve head parameters, namely the depth and curvature of the lamina cribrosa (LC). Together with the Bruch's membrane opening minimum-rim-width, these seem to be promising optic nerve head parameters for diagnosis and monitoring of retinal diseases such as glaucoma. Nonetheless, these optical coherence tomography derived biomarkers are mostly extracted through manual segmentation, which is time-consuming and prone to bias, thus limiting their usability in clinical practice. The automatic segmentation of optic nerve head in OCT scans could further improve the current clinical management of glaucoma and other diseases. This review summarizes the current state-of-the-art in automatic segmentation of the ONH in OCT. PubMed and Scopus were used to perform a systematic review. Additional works from other databases (IEEE, Google Scholar and ARVO IOVS) were also included, resulting in a total of 27 reviewed studies. For each algorithm, the methods, the size and type of dataset used for validation, and the respective results were carefully analyzed. The results show that deep learning-based algorithms provide the highest accuracy, sensitivity and specificity for segmenting the different structures of the ONH including the LC. However, a lack of consensus regarding the definition of segmented regions, extracted parameters and validation approaches has been observed, highlighting the importance and need of standardized methodologies for ONH segmentation.
    Right Ventricular Segmentation from Short- and Long-Axis MRIs via Information Transition. (arXiv:2109.02171v1 [eess.IV])
    (0 min) Right ventricular (RV) segmentation from magnetic resonance imaging (MRI) is a crucial step for cardiac morphology and function analysis. However, automatic RV segmentation from MRI is still challenging, mainly due to the heterogeneous intensity, the complex variable shapes, and the unclear RV boundary. Moreover, current methods for the RV segmentation tend to suffer from performance degradation at the basal and apical slices of MRI. In this work, we propose an automatic RV segmentation framework, where the information from long-axis (LA) views is utilized to assist the segmentation of short-axis (SA) views via information transition. Specifically, we employed the transformed segmentation from LA views as prior information, to extract the ROI from SA views for better segmentation. The information transition aims to remove the surrounding ambiguous regions, such as the tricuspid valve regions, in the SA views. We tested our model on a public dataset with 360 multi-center, multi-vendor and multi-disease subjects that consist of both LA and SA MRIs. Our experimental results show that including LA views can be effective to improve the accuracy of the SA segmentation. Our model is publicly available at https://github.com/NanYoMy/MMs-2.
    What Do Compressed Deep Neural Networks Forget?. (arXiv:1911.05248v3 [cs.LG] UPDATED)
    (0 min) Deep neural network pruning and quantization techniques have demonstrated it is possible to achieve high levels of compression with surprisingly little degradation to test set accuracy. However, this measure of performance conceals significant differences in how different classes and images are impacted by model compression techniques. We find that models with radically different numbers of weights have comparable top-line performance metrics but diverge considerably in behavior on a narrow subset of the dataset. This small subset of data points, which we term Pruning Identified Exemplars (PIEs), are systematically more impacted by the introduction of sparsity. Compression disproportionately impacts model performance on the underrepresented long-tail of the data distribution. PIEs over-index on atypical or noisy images that are far more challenging for both humans and algorithms to classify. Our work provides intuition into the role of capacity in deep neural networks and the trade-offs incurred by compression. An understanding of this disparate impact is critical given the widespread deployment of compressed models in the wild.
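    A small sketch of how such exemplars can be identified, assuming per-model predicted labels are already available: an example is flagged when the modal prediction of the compressed population disagrees with that of the dense population. The population sizes and random predictions are placeholders, not the paper's setup.
    import numpy as np
    def modal_prediction(pred_matrix):
        """Most frequent predicted class per example across a population of models."""
        return np.array([np.bincount(col).argmax() for col in pred_matrix.T])
    # hypothetical predicted labels: rows = independently trained models, cols = test examples
    preds_dense  = np.random.randint(0, 10, size=(30, 1000))   # uncompressed population
    preds_pruned = np.random.randint(0, 10, size=(30, 1000))   # e.g. heavily pruned population
    pie_mask = modal_prediction(preds_dense) != modal_prediction(preds_pruned)
    print("fraction of Pruning Identified Exemplars:", pie_mask.mean())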
    Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment. (arXiv:2109.01949v1 [cs.LG])
    (0 min) Self-supervised learning provides an opportunity to explore unlabeled chest X-rays and their associated free-text reports accumulated in clinical routine without manual supervision. This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports. The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching. Both are bidirectionally constrained on Cross-Entropy based and ranking-based Triplet Matching Losses. The region-word matching is calculated using the attention mechanism without direct supervision about their mapping. The pre-trained multi-modal representation learning paves the way for downstream tasks concerning image and/or text encoding. We demonstrate the representation learning quality by cross-modality retrievals and multi-label classifications on two datasets: OpenI-IU and MIMIC-CXR.
    OCTAVA: an open-source toolbox for quantitative analysis of optical coherence tomography angiography images. (arXiv:2109.01835v1 [eess.IV])
    (0 min) Optical coherence tomography angiography (OCTA) performs non-invasive visualization and characterization of microvasculature in research and clinical applications mainly in ophthalmology and dermatology. A wide variety of instruments, imaging protocols, processing methods and metrics have been used to describe the microvasculature, such that comparing different study outcomes is currently not feasible. With the goal of contributing to standardization of OCTA data analysis, we report a user-friendly, open-source toolbox, OCTAVA (OCTA Vascular Analyzer), to automate the pre-processing, segmentation, and quantitative analysis of en face OCTA maximum intensity projection images in a standardized workflow. We present each analysis step, including optimization of filtering and choice of segmentation algorithm, and definition of metrics. We perform quantitative analysis of OCTA images from different commercial and non-commercial instruments and samples and show OCTAVA can accurately and reproducibly determine metrics for characterization of microvasculature. Wide adoption could enable studies and aggregation of data on a scale sufficient to develop reliable microvascular biomarkers for early detection, and to guide treatment, of microvascular disease.
    A New Semi-Automated Algorithm for Volumetric Segmentation of the Left Ventricle in Temporal 3D Echocardiography Sequences. (arXiv:2109.01132v2 [eess.IV] UPDATED)
    (0 min) Purpose: Echocardiography is commonly used as a non-invasive imaging tool in clinical practice for the assessment of cardiac function. However, delineation of the left ventricle is challenging due to the inherent properties of ultrasound imaging, such as the presence of speckle noise and the low signal-to-noise ratio. Methods: We propose a semi-automated segmentation algorithm for the delineation of the left ventricle in temporal 3D echocardiography sequences. The method requires minimal user interaction and relies on a diffeomorphic registration approach. Advantages of the method include no dependence on prior geometrical information, training data, or registration from an atlas. Results: The method was evaluated using three-dimensional ultrasound scan sequences from 18 patients from the Mazankowski Alberta Heart Institute, Edmonton, Canada, and compared to manual delineations provided by an expert cardiologist and four other registration algorithms. The segmentation approach yielded the following results over the cardiac cycle: a mean absolute difference of 1.01 (0.21) mm, a Hausdorff distance of 4.41 (1.43) mm, and a Dice overlap score of 0.93 (0.02). Conclusions: The method performed well compared to the four other registration algorithms.
    Out-of-Distribution Detection for Automotive Perception. (arXiv:2011.01413v2 [cs.CV] UPDATED)
    (0 min) Neural networks (NNs) are widely used for object classification in autonomous driving. However, NNs can fail on input data not well represented by the training dataset, known as out-of-distribution (OOD) data. A mechanism to detect OOD samples is important for safety-critical applications, such as automotive perception, to trigger a safe fallback mode. NNs often rely on softmax normalization for confidence estimation, which can lead to high confidences being assigned to OOD samples, thus hindering the detection of failures. This paper presents a method for determining whether inputs are OOD, which does not require OOD data during training and does not increase the computational cost of inference. The latter property is especially important in automotive applications with limited computational resources and real-time constraints. Our proposed approach outperforms state-of-the-art methods on real-world automotive datasets.
    Dual-Camera Super-Resolution with Aligned Attention Modules. (arXiv:2109.01349v2 [cs.CV] UPDATED)
    (0 min) We present a novel approach to reference-based super-resolution (RefSR) with the focus on dual-camera super-resolution (DCSR), which utilizes reference images for high-quality and high-fidelity results. Our proposed method generalizes the standard patch-based feature matching with spatial alignment operations. We further explore the dual-camera super-resolution that is one promising application of RefSR, and build a dataset that consists of 146 image pairs from the main and telephoto cameras in a smartphone. To bridge the domain gaps between real-world images and the training images, we propose a self-supervised domain adaptation strategy for real-world images. Extensive experiments on our dataset and a public benchmark demonstrate clear improvement achieved by our method over state of the art in both quantitative evaluation and visual comparisons.
    CX-ToM: Counterfactual Explanations with Theory-of-Mind for Enhancing Human Trust in Image Recognition Models. (arXiv:2109.01401v2 [cs.AI] UPDATED)
    (0 min) We propose CX-ToM, short for counterfactual explanations with theory-of-mind, a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN). In contrast to the current methods in XAI that generate explanations as a single-shot response, we pose explanation as an iterative communication process, i.e. dialog, between the machine and the human user. More concretely, our CX-ToM framework generates a sequence of explanations in a dialog by mediating the differences between the minds of the machine and the human user. To do this, we use Theory of Mind (ToM), which helps us in explicitly modeling the human's intention, the machine's mind as inferred by the human, as well as the human's mind as inferred by the machine. Moreover, most state-of-the-art XAI frameworks provide attention (or heat map) based explanations. In our work, we show that these attention-based explanations are not sufficient for increasing human trust in the underlying CNN model. In CX-ToM, we instead use counterfactual explanations called fault-lines, which we define as follows: given an input image I for which a CNN classification model M predicts class c_pred, a fault-line identifies the minimal semantic-level features (e.g., stripes on zebra, pointed ears of dog), referred to as explainable concepts, that need to be added to or deleted from I in order to alter the classification category of I by M to another specified class c_alt. We argue that, due to the iterative, conceptual and counterfactual nature of CX-ToM explanations, our framework is practical and more natural for both expert and non-expert users to understand the internal workings of complex deep learning models. Extensive quantitative and qualitative experiments verify our hypotheses, demonstrating that our CX-ToM significantly outperforms the state-of-the-art explainable AI models.
    Confidence Adaptive Regularization for Deep Learning with Noisy Labels. (arXiv:2108.08212v2 [cs.LG] UPDATED)
    (0 min) Recent studies on the memorization effects of deep neural networks on noisy labels show that the networks first fit the correctly-labeled training samples before memorizing the mislabeled samples. Motivated by this early-learning phenomenon, we propose a novel method to prevent memorization of the mislabeled samples. Unlike the existing approaches which use the model output to identify or ignore the mislabeled samples, we introduce an indicator branch to the original model and enable the model to produce a confidence value for each sample. The confidence values are incorporated in our loss function which is learned to assign large confidence values to correctly-labeled samples and small confidence values to mislabeled samples. We also propose an auxiliary regularization term to further improve the robustness of the model. To improve the performance, we gradually correct the noisy labels with a well-designed target estimation strategy. We provide the theoretical analysis and conduct the experiments on synthetic and real-world datasets, demonstrating that our approach achieves comparable results to the state-of-the-art methods.
    Optimal Target Shape for LiDAR Pose Estimation. (arXiv:2109.01181v2 [cs.CV] UPDATED)
    (0 min) Targets are essential in problems such as object tracking in cluttered or textureless environments, camera (and multi-sensor) calibration tasks, and simultaneous localization and mapping (SLAM). Target shapes for these tasks typically are symmetric (square, rectangular, or circular) and work well for structured, dense sensor data such as pixel arrays (i.e., image). However, symmetric shapes lead to pose ambiguity when using sparse sensor data such as LiDAR point clouds and suffer from the quantization uncertainty of the LiDAR. This paper introduces the concept of optimizing target shape to remove pose ambiguity for LiDAR point clouds. A target is designed to induce large gradients at edge points under rotation and translation relative to the LiDAR to ameliorate the quantization uncertainty associated with point cloud sparseness. Moreover, given a target shape, we present a means that leverages the target's geometry to estimate the target's vertices while globally estimating the pose. Both the simulation and the experimental results (verified by a motion capture system) confirm that by using the optimal shape and the global solver, we achieve centimeter error in translation and a few degrees in rotation even when a partially illuminated target is placed 30 meters away. All the implementations and datasets are available at https://github.com/UMich-BipedLab/optimal_shape_global_pose_estimation.
    Multi-task fully convolutional network for tree species mapping in dense forests using small training hyperspectral data. (arXiv:2106.00799v2 [cs.CV] UPDATED)
    (0 min) This work proposes a multi-task fully convolutional architecture for tree species mapping in dense forests from sparse and scarce polygon-level annotations using hyperspectral UAV-borne data. Our model implements a partial loss function that enables dense tree semantic labeling outcomes from non-dense training samples, and a distance regression complementary task that enforces tree crown boundary constraints and substantially improves the model performance. Our multi-task architecture uses a shared backbone network that learns common representations for both tasks and two task-specific decoders, one for the semantic segmentation output and one for the distance map regression. We report that introducing the complementary task boosts the semantic segmentation performance compared to the single-task counterpart by up to 11%, reaching an average user's accuracy of 88.63% and an average producer's accuracy of 88.59%, achieving state-of-the-art performance for tree species classification in tropical forests.
    Temporal-Coded Deep Spiking Neural Network with Easy Training and Robust Performance. (arXiv:1909.10837v4 [cs.CV] UPDATED)
    (0 min) Spiking neural networks (SNNs) are interesting both theoretically and practically because of their strongly bio-inspired nature and potentially outstanding energy efficiency. Unfortunately, their development has fallen far behind the conventional deep neural network (DNN), mainly because of difficult training and lack of widely accepted hardware experiment platforms. In this paper, we show that a deep temporal-coded SNN can be trained easily and directly over the benchmark datasets CIFAR10 and ImageNet, with testing accuracy within 1% of the DNN of equivalent size and architecture. Training becomes similar to DNN thanks to the closed-form solution to the spiking waveform dynamics. Considering that SNNs should be implemented in practical neuromorphic hardware, we train the deep SNN with weights quantized to 8, 4, 2 bits and with weights perturbed by random noise to demonstrate its robustness in practical applications. In addition, we develop a phase-domain signal processing circuit schematic to implement our spiking neuron with 90% gain of energy efficiency over existing work. This paper demonstrates that the temporal-coded deep SNN is feasible for applications with high performance and high energy efficiency.
    Learning to drive from a world on rails. (arXiv:2105.00636v2 [cs.RO] UPDATED)
    (0 min) We learn an interactive vision-based driving policy from pre-recorded driving logs via a model-based approach. A forward model of the world supervises a driving policy that predicts the outcome of any potential driving trajectory. To support learning from pre-recorded logs, we assume that the world is on rails, meaning neither the agent nor its actions influence the environment. This assumption greatly simplifies the learning problem, factorizing the dynamics into a nonreactive world model and a low-dimensional and compact forward model of the ego-vehicle. Our approach computes action-values for each training trajectory using a tabular dynamic-programming evaluation of the Bellman equations; these action-values in turn supervise the final vision-based driving policy. Despite the world-on-rails assumption, the final driving policy acts well in a dynamic and reactive world. At the time of writing, our method ranks first on the CARLA leaderboard, attaining a 25% higher driving score while using 40 times less data. Our method is also an order of magnitude more sample-efficient than state-of-the-art model-free reinforcement learning techniques on navigational tasks in the ProcGen benchmark.
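    A heavily simplified sketch of the tabular Bellman evaluation idea, assuming the per-step, per-action rewards along a logged trajectory are already available from the (non-reactive) world and ego models; the real method evaluates over a discretized ego state rather than a single scalar value per step, so this is only an illustration of the backward backup.
    import numpy as np
    def evaluate_trajectory(rewards, gamma=0.95):
        """Backward Bellman backup of action-values along one recorded trajectory.
        rewards[t, a]: model-predicted reward of taking action a at step t, with the
        rest of the world assumed to evolve independently of the ego action ("on rails")."""
        T, A = rewards.shape
        q = np.zeros((T, A))
        v_next = 0.0
        for t in reversed(range(T)):
            q[t] = rewards[t] + gamma * v_next
            v_next = q[t].max()          # greedy value of the next step
        return q
    rewards = np.random.rand(100, 6)      # 100 logged timesteps, 6 discretized driving actions
    q_values = evaluate_trajectory(rewards)
    print(q_values.shape, q_values[0].argmax())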
    Efficient Action Recognition Using Confidence Distillation. (arXiv:2109.02137v1 [cs.CV])
    (0 min) Modern neural networks are powerful predictive models. However, when it comes to recognizing that they may be wrong about their predictions, they perform poorly. For example, for one of the most common activation functions, the ReLU and its variants, even a well-calibrated model can produce incorrect but high-confidence predictions. In the related task of action recognition, most current classification methods are based on clip-level classifiers that densely sample a given video for non-overlapping, same-sized clips and aggregate the results using an aggregation function - typically averaging - to achieve video-level predictions. While this approach has been shown to be effective, it is sub-optimal in recognition accuracy and has a high computational overhead. To mitigate both these issues, we propose the confidence distillation framework to teach the teacher's representation of uncertainty to the student sampler and divide the task of full video prediction between the student and the teacher models. We conduct extensive experiments on three action recognition datasets and demonstrate that our framework achieves significant improvements in action recognition accuracy (up to 20%) and computational efficiency (more than 40%).
    Instance-level Image Retrieval using Reranking Transformers. (arXiv:2103.12236v2 [cs.CV] UPDATED)
    (0 min) Instance-level image retrieval is the task of searching in a large database for images that match an object in a query image. To address this task, systems usually rely on a retrieval step that uses global image descriptors, and a subsequent step that performs domain-specific refinements or reranking by leveraging operations such as geometric verification based on local features. In this work, we propose Reranking Transformers (RRTs) as a general model to incorporate both local and global features to rerank the matching images in a supervised fashion and thus replace the relatively expensive process of geometric verification. RRTs are lightweight and can be easily parallelized so that reranking a set of top matching results can be performed in a single forward-pass. We perform extensive experiments on the Revisited Oxford and Paris datasets, and the Google Landmarks v2 dataset, showing that RRTs outperform previous reranking approaches while using much fewer local descriptors. Moreover, we demonstrate that, unlike existing approaches, RRTs can be optimized jointly with the feature extractor, which can lead to feature representations tailored to downstream tasks and further accuracy improvements. The code and trained models are publicly available at https://github.com/uvavision/RerankingTransformer.
    Intelligent Monitoring of Stress Induced by Water Deficiency in Plants using Deep Learning. (arXiv:2104.07911v3 [cs.CV] UPDATED)
    (0 min) In the recent decade, high-throughput plant phenotyping techniques, which combine non-invasive image analysis and machine learning, have been successfully applied to identify and quantify plant health and diseases. However, these techniques usually do not consider the progressive nature of plant stress and often require images showing severe signs of stress to ensure high confidence detection, thereby reducing the feasibility for early detection and recovery of plants under stress. To overcome the problem mentioned above, we propose a deep learning pipeline for the temporal analysis of the visual changes induced in the plant due to stress and apply it to the specific water stress identification case in Chickpea plant shoot images. For this, we have considered an image dataset of two chickpea varieties JG-62 and Pusa-372, under three water stress conditions; control, young seedling, and before flowering, captured over five months. We have employed a variant of Convolutional Neural Network - Long Short Term Memory (CNN-LSTM) network to learn spatio-temporal patterns from the chickpea plant dataset and use them for water stress classification. Our model has achieved ceiling level classification performance of 98.52% on JG-62 and 97.78% on Pusa-372 chickpea plant data and has outperformed the best reported time-invariant technique by at least 14% for both JG-62 and Pusa-372 species, to the best of our knowledge. Furthermore, our CNN-LSTM model has demonstrated robustness to noisy input, with a less than 2.5% dip in average model accuracy and a small standard deviation about the mean for both species. Lastly, we have performed an ablation study to analyze the performance of the CNN-LSTM model by decreasing the number of temporal session data used for training.
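    A bare-bones CNN-LSTM sketch of the kind of spatio-temporal classifier described above: a per-frame CNN feeds an LSTM that classifies the whole temporal session. The layer sizes, number of stress classes, and input resolution are illustrative assumptions, not the paper's architecture.
    import torch
    import torch.nn as nn
    class CNNLSTM(nn.Module):
        """Per-frame CNN features fed to an LSTM for sequence-level stress classification."""
        def __init__(self, num_classes=4, feat_dim=64, hidden=128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_classes)
        def forward(self, x):                 # x: (batch, time, 3, H, W)
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
            _, (h, _) = self.lstm(feats)
            return self.head(h[-1])           # classify from the last hidden state
    model = CNNLSTM()
    sessions = torch.rand(2, 5, 3, 64, 64)     # 2 plants, 5 temporal sessions each
    print(model(sessions).shape)               # torch.Size([2, 4])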
    Self-supervised Product Quantization for Deep Unsupervised Image Retrieval. (arXiv:2109.02244v1 [cs.CV])
    (0 min) Supervised deep learning-based hash and vector quantization are enabling fast and large-scale image retrieval systems. By fully exploiting label annotations, they are achieving outstanding retrieval performances compared to the conventional methods. However, it is painstaking to assign labels precisely for a vast amount of training data, and also, the annotation process is error-prone. To tackle these issues, we propose the first deep unsupervised image retrieval method dubbed Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner. We design a Cross Quantized Contrastive learning strategy that jointly learns codewords and deep visual descriptors by comparing individually transformed images (views). Our method analyzes the image contents to extract descriptive features, allowing us to understand image representations for accurate retrieval. By conducting extensive experiments on benchmarks, we demonstrate that the proposed method yields state-of-the-art results even without supervised pretraining.
    Fast Image-Anomaly Mitigation for Autonomous Mobile Robots. (arXiv:2109.01889v1 [cs.CV])
    (0 min) Camera anomalies like rain or dust can severelydegrade image quality and its related tasks, such as localizationand segmentation. In this work we address this importantissue by implementing a pre-processing step that can effectivelymitigate such artifacts in a real-time fashion, thus supportingthe deployment of autonomous systems with limited computecapabilities. We propose a shallow generator with aggregation,trained in an adversarial setting to solve the ill-posed problemof reconstructing the occluded regions. We add an enhancer tofurther preserve high-frequency details and image colorization.We also produce one of the largest publicly available datasets1to train our architecture and use realistic synthetic raindrops toobtain an improved initialization of the model. We benchmarkour framework on existing datasets and on our own imagesobtaining state-of-the-art results while enabling real-time per-formance, with up to 40x faster inference time than existingapproaches.
    On the Surprising Efficiency of Committee-based Models. (arXiv:2012.01988v4 [cs.CV] UPDATED)
    (0 min) Committee-based models, i.e., model ensembles or cascades, are underexplored in recent work on developing efficient models. While committee-based models themselves are not new, there lacks a systematic understanding of their efficiency in comparison with single models. To fill this gap, we conduct a comprehensive analysis of the efficiency of committee-based models. We find that committee-based models provide a complementary paradigm to achieve superior efficiency without tuning the architecture: even the most simplistic method for building ensembles or cascades from existing pre-trained networks can attain a significant speedup and higher accuracy over state-of-the-art single models, and also outperforms sophisticated neural architecture search methods (e.g., BigNAS). The superior efficiency of committee-based models holds true for several tasks, including image classification, video classification, and semantic segmentation, and various architecture families, such as EfficientNet, ResNet, MobileNetV2, and X3D.
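    A toy sketch of the cascade flavor of committee-based models: a cheap model handles confident inputs and defers the rest to a larger model. The models here are random stand-ins and the 0.9 threshold is an arbitrary assumption; only the routing logic is the point.
    import numpy as np
    def cascade_predict(x, small_model, large_model, threshold=0.9):
        """Run the cheap model first; fall back to the expensive one for low-confidence inputs."""
        probs = small_model(x)
        preds = probs.argmax(axis=1)
        hard = probs.max(axis=1) < threshold          # inputs the small model is unsure about
        if hard.any():
            preds[hard] = large_model(x[hard]).argmax(axis=1)
        return preds, hard.mean()
    rng = np.random.default_rng(0)
    small_model = lambda x: rng.dirichlet(np.ones(10) * 0.3, size=len(x))   # stand-in classifier
    large_model = lambda x: rng.dirichlet(np.ones(10) * 0.3, size=len(x))   # stand-in classifier
    x = np.zeros((100, 224, 224, 3))
    preds, deferred = cascade_predict(x, small_model, large_model)
    print("fraction deferred to the large model:", deferred)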
    Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss. (arXiv:2109.02100v1 [cs.LG])
    (0 min) Network quantization, which aims to reduce the bit-lengths of the network weights and activations, has emerged for their deployments to resource-limited devices. Although recent studies have successfully discretized a full-precision network, they still incur large quantization errors after training, thus giving rise to a significant performance gap between a full-precision network and its quantized counterpart. In this work, we propose a novel quantization method for neural networks, Cluster-Promoting Quantization (CPQ) that finds the optimal quantization grids while naturally encouraging the underlying full-precision weights to gather around those quantization grids cohesively during training. This property of CPQ is thanks to our two main ingredients that enable differentiable quantization: i) the use of the categorical distribution designed by a specific probabilistic parametrization in the forward pass and ii) our proposed multi-class straight-through estimator (STE) in the backward pass. Since our second component, multi-class STE, is intrinsically biased, we additionally propose a new bit-drop technique, DropBits, that revises the standard dropout regularization to randomly drop bits instead of neurons. As a natural extension of DropBits, we further introduce the way of learning heterogeneous quantization levels to find proper bit-length for each layer by imposing an additional regularization on DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support a new hypothesis for quantization: learning heterogeneous quantization levels outperforms the case using the same but fixed quantization levels from scratch.
    Utilizing Adversarial Targeted Attacks to Boost Adversarial Robustness. (arXiv:2109.01945v1 [cs.CV])
    (0 min) Adversarial attacks have been shown to be highly effective at degrading the performance of deep neural networks (DNNs). The most prominent defense is adversarial training, a method for learning a robust model. Nevertheless, adversarial training does not make DNNs immune to adversarial perturbations. We propose a novel solution by adopting the recently suggested Predictive Normalized Maximum Likelihood. Specifically, our defense performs adversarial targeted attacks according to different hypotheses, where each hypothesis assumes a specific label for the test sample. Then, by comparing the hypothesis probabilities, we predict the label. Our refinement process corresponds to recent findings of the adversarial subspace properties. We extensively evaluate our approach on 16 adversarial attack benchmarks using ResNet-50, WideResNet-28, and a 2-layer ConvNet trained with ImageNet, CIFAR10, and MNIST, showing a significant improvement of up to 5.7%, 3.7%, and 0.6% respectively.
    Robust Data Hiding Using Inverse Gradient Attention. (arXiv:2011.10850v4 [cs.CV] UPDATED)
    (0 min) Data hiding is the procedure of encoding desired information into the cover image to resist potential noise while ensuring the embedded image has few perceptual perturbations from the original one. Recently, with the tremendous successes gained by deep neural networks in various fields, research on data hiding with deep learning models has attracted increasing attention. In the data hiding task, each pixel of the cover image should be treated differently since pixels have divergent tolerabilities. Neglecting the sensitivity of each pixel will inevitably affect the model's robustness for information hiding. Targeting this problem, we propose a novel deep data hiding scheme with Inverse Gradient Attention (IGA), combining the ideas of adversarial learning and attention mechanisms to endow different pixels with different sensitivities. With the proposed component, the model can spotlight pixels that are more robust for data hiding. Empirically, extensive experiments show that the proposed model outperforms the state-of-the-art methods on two prevalent datasets under multiple evaluations. Besides, we further identify and discuss the connections between the proposed inverse gradient attention and high-frequency regions within images.
    Moving Object Detection for Event-based Vision using k-means Clustering. (arXiv:2109.01879v1 [cs.CV])
    (0 min) Moving object detection is a crucial task in computer vision. Event-based cameras are bio-inspired cameras that work by mimicking the working of the human eye. These cameras have multiple advantages over conventional frame-based cameras, like reduced latency, HDR, reduced motion blur during high motion, low power consumption, etc. However, these advantages come at a high cost, as event-based cameras are noise sensitive and have low resolution. Moreover, the task of moving object detection in these cameras is difficult, as event-based sensors capture only the binary changes in brightness of a scene, lacking useful visual features like texture and color. In this paper, we investigate the application of the k-means clustering technique in detecting moving objects in event-based data. Experimental results in publicly available datasets using k-means show significant improvement in performance over the state-of-the-art methods.
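    A minimal sketch of the clustering step, assuming events are available as (x, y, timestamp) triples; the paper's exact feature construction and post-processing are not reproduced.

        import numpy as np
        from sklearn.cluster import KMeans

        events = np.random.rand(5000, 3)              # columns: x, y, normalized timestamp
        kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(events)
        labels = kmeans.labels_                       # cluster id per event
        centers = kmeans.cluster_centers_             # candidate moving-object centroids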
    Exploring Long Tail Visual Relationship Recognition with Large Vocabulary. (arXiv:2004.00436v6 [cs.CV] UPDATED)
    (0 min) Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study concerning the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long-tail (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and make a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to LTVRR setup, dubbed as RelMix. Both VilHub and RelMix can be easily integrated on top of existing models and despite being simple, our results show that they can remarkably improve the performance, especially on tail classes. Benchmarks, code, and models have been made available at: https://github.com/Vision-CAIR/LTVRR.
    TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations. (arXiv:2006.10187v4 [cs.CV] UPDATED)
    (0 min) Topology matters. Despite the recent success of point cloud processing with geometric deep learning, it remains arduous to capture the complex topologies of point cloud data with a learning model. Given a point cloud dataset containing objects with various genera, or scenes with multiple objects, we propose an autoencoder, TearingNet, which tackles the challenging task of representing the point clouds using a fixed-length descriptor. Unlike existing works directly deforming predefined primitives of genus zero (e.g., a 2D square patch) to an object-level point cloud, our TearingNet is characterized by a proposed Tearing network module and a Folding network module interacting with each other iteratively. Particularly, the Tearing network module learns the point cloud topology explicitly. By breaking the edges of a primitive graph, it tears the graph into patches or with holes to emulate the topology of a target point cloud, leading to faithful reconstructions. Experimentation shows the superiority of our proposal in terms of reconstructing point clouds as well as generating more topology-friendly representations than benchmarks.
    A realistic approach to generate masked faces applied on two novel masked face recognition data sets. (arXiv:2109.01745v1 [cs.CV])
    (0 min) The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical masks to cover their noses and mouths. Traditional data sets (e.g., CelebA, CASIA-WebFace) used for training these systems were released before the pandemic, so they now seem unsuited due to the lack of examples of people wearing masks. We propose a method for enhancing data sets containing faces without masks by creating synthetic masks and overlaying them on faces in the original images. Our method relies on Spark AR Studio, a developer program made by Facebook that is used to create Instagram face filters. In our approach, we use 9 masks of different colors, shapes and fabrics. We employ our method to generate a number of 445,446 (90%) samples of masks for the CASIA-WebFace data set and 196,254 (96.8%) masks for the CelebA data set, releasing the mask images at https://github.com/securifai/masked_faces. We show that our method produces significantly more realistic training examples of masks overlaid on faces by asking volunteers to qualitatively compare it to other methods or data sets designed for the same task. We also demonstrate the usefulness of our method by evaluating state-of-the-art face recognition systems (FaceNet, VGG-face, ArcFace) trained on the enhanced data sets and showing that they outperform equivalent systems trained on the original data sets (containing faces without masks), when the test benchmark contains masked faces.
    Learning to Generate Scene Graph from Natural Language Supervision. (arXiv:2109.02227v1 [cs.CV])
    (0 min) Learning from image-text data has demonstrated recent success for many recognition tasks, yet is currently limited to visual features or individual visual concepts such as objects. In this paper, we propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as scene graph. To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph. Further, we design a Transformer-based model to predict these "pseudo" labels via a masked token prediction task. Learning from only image-sentence pairs, our model achieves 30% relative gain over a latest method trained with human-annotated unlocalized scene graphs. Our model also shows strong results for weakly and fully supervised scene graph generation. In addition, we explore an open-vocabulary setting for detecting scene graphs, and present the first result for open-set scene graph generation. Our code is available at https://github.com/YiwuZhong/SGG_from_NLS.
    Regional Adversarial Training for Better Robust Generalization. (arXiv:2109.00678v2 [cs.CV] UPDATED)
    (0 min) Adversarial training (AT) has been demonstrated as one of the most promising defense methods against various adversarial attacks. To our knowledge, existing AT-based methods usually train with the locally most adversarial perturbed points and treat all the perturbed points equally, which may lead to considerably weaker adversarial robust generalization on test data. In this work, we introduce a new adversarial training framework that considers the diversity as well as characteristics of the perturbed points in the vicinity of benign samples. To realize the framework, we propose a Regional Adversarial Training (RAT) defense method that first utilizes the attack path generated by the typical iterative attack method of projected gradient descent (PGD), and constructs an adversarial region based on the attack path. Then, RAT samples diverse perturbed training points efficiently inside this region, and utilizes a distance-aware label smoothing mechanism to capture our intuition that perturbed points at different locations should have different impact on the model performance. Extensive experiments on several benchmark datasets show that RAT consistently makes significant improvement on standard adversarial training (SAT), and exhibits better robust generalization.
    Spatiotemporal Inconsistency Learning for DeepFake Video Detection. (arXiv:2109.01860v1 [cs.CV])
    (0 min) The rapid development of facial manipulation techniques has aroused public concerns in recent years. Following the success of deep learning, existing methods always formulate DeepFake video detection as a binary classification problem and develop frame-based and video-based solutions. However, little attention has been paid to capturing the spatial-temporal inconsistency in forged videos. To address this issue, we term this task as a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it into a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along with both horizontal and vertical directions. And the ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation. Moreover, our STIL block is flexible and could be plugged into existing 2D CNNs. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against the state-of-the-art competitors.
    Graph Signal Processing for Geometric Data and Beyond: Theory and Applications. (arXiv:2008.01918v3 [cs.CV] UPDATED)
    (0 min) Geometric data acquired from real-world scenes, e.g., 2D depth images, 3D point clouds, and 4D dynamic point clouds, have found a wide range of applications including immersive telepresence, autonomous driving, surveillance, etc. Due to irregular sampling patterns of most geometric data, traditional image/video processing methodologies are limited, while Graph Signal Processing (GSP) -- a fast-developing field in the signal processing community -- enables processing signals that reside on irregular domains and plays a critical role in numerous applications of geometric data from low-level processing to high-level analysis. To further advance the research in this field, we provide the first timely and comprehensive overview of GSP methodologies for geometric data in a unified manner by bridging the connections between geometric data and graphs, among the various geometric data modalities, and with spectral/nodal graph filtering techniques. We also discuss the recently developed Graph Neural Networks (GNNs) and interpret the operation of these networks from the perspective of GSP. We conclude with a brief discussion of open problems and challenges.
    Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark. (arXiv:2109.01839v1 [cs.CL])
    (0 min) As a new kind of expressive element, Internet memes are popular and extensively used in online chatting scenarios since they make dialogues vivid, moving, and interesting. However, most current dialogue research focuses on text-only dialogue tasks. In this paper, we propose a new task named Meme incorporated Open-domain Dialogue (MOD). Compared to previous dialogue tasks, MOD is much more challenging since it requires the model to understand the multimodal elements as well as the emotions behind them. To facilitate MOD research, we construct a large-scale open-domain multimodal dialogue dataset incorporating abundant Internet memes into utterances. The dataset consists of $\sim$45K Chinese conversations with $\sim$606K utterances. Each conversation contains about $13$ utterances with about $4$ Internet memes on average, and each utterance equipped with an Internet meme is annotated with the corresponding emotion. In addition, we present a simple and effective method, which utilizes a unified generation network to solve the MOD task. Experimental results demonstrate that our method trained on the proposed corpus is able to achieve expressive communication including texts and memes. The corpus and models have been made publicly available at https://github.com/lizekang/DSTC10-MOD.
    Cross-Task Generalization via Natural Language Crowdsourcing Instructions. (arXiv:2104.08773v2 [cs.CL] UPDATED)
    (0 min) Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. NLP models built with the conventional paradigm, however, often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that is equipped with the understanding of human-readable instructions that define the tasks, and can generalize to new tasks. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions and 193k task instances. The instructions are obtained from crowdsourcing instructions used to collect existing NLP datasets and mapped to a unified schema. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models can benefit from instructions to generalize across tasks. These models, however, are far behind supervised task-specific models, indicating significant room for more progress in this direction.
    A Survey on Assessing the Generalization Envelope of Deep Neural Networks: Predictive Uncertainty, Out-of-distribution and Adversarial Samples. (arXiv:2008.09381v4 [cs.LG] UPDATED)
    (0 min) Deep Neural Networks (DNNs) achieve state-of-the-art performance on numerous applications. However, it is difficult to tell beforehand if a DNN receiving an input will deliver the correct output since their decision criteria are usually nontransparent. A DNN delivers the correct output if the input is within the area enclosed by its generalization envelope. In this case, the information contained in the input sample is processed reasonably by the network. It is of large practical importance to assess at inference time if a DNN generalizes correctly. Currently, the approaches to achieve this goal are investigated in different problem set-ups rather independently from one another, leading to three main research and literature fields: predictive uncertainty, out-of-distribution detection and adversarial example detection. This survey connects the three fields within the larger framework of investigating the generalization performance of machine learning methods and in particular DNNs. We underline the common ground, point at the most promising approaches and give a structured overview of the methods that provide at inference time means to establish if the current input is within the generalization envelope of a DNN.
    Underwater 3D Reconstruction Using Light Fields. (arXiv:2109.02116v1 [cs.CV])
    (2 min) Underwater 3D reconstruction is challenging due to the refraction of light at the water-air interface (most electronic devices cannot be directly submerged in water). In this paper, we present an underwater 3D reconstruction solution using light field cameras. We first develop a light field camera calibration algorithm that simultaneously estimates the camera parameters and the geometry of the water-air interface. We then design a novel depth estimation algorithm for 3D reconstruction. Specifically, we match correspondences on curved epipolar lines caused by water refraction. We also observe that view-dependent specular reflection is very weak in the underwater environment, so that the angularly sampled rays in the light field have nearly uniform intensity. We therefore propose an angular uniformity constraint for depth optimization. We also develop a fast algorithm for locating the angular patches in the presence of non-linear light paths. Extensive synthetic and real experiments demonstrate that our method can perform underwater 3D reconstruction with high accuracy.
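    A toy illustration of an angular-uniformity style cost: penalize the intensity variance across the angular samples of one scene point, which should be small underwater where specular reflection is weak. The exact cost used in the paper is assumed here, not reproduced.

        import numpy as np

        def angular_uniformity_cost(angular_samples):
            # angular_samples: intensities of one point across sub-aperture views.
            return np.var(angular_samples)

        print(angular_uniformity_cost(np.array([0.51, 0.50, 0.52, 0.49])))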
    Deep Learning for Embodied Vision Navigation: A Survey. (arXiv:2108.04097v2 [cs.RO] UPDATED)
    (0 min) Navigation is one of the fundamental capabilities of an autonomous robot, and long-term navigation guided by semantic instructions is a `holy grail` goal of intelligent robotics. The development of 3D simulation technology provides large amounts of data for simulating real-world environments, and deep learning has proven its ability to robustly learn various embodied navigation tasks. However, deep learning for embodied navigation is still in its infancy due to the unique challenges of exploration and of learning from partially observed visual input. Recently, the area has become increasingly active, with numerous methods proposed to tackle its different challenges. To point out promising directions for future research, this paper presents a comprehensive review of embodied navigation tasks and recent progress in deep learning based methods, covering two major tasks: target-oriented navigation and instruction-oriented navigation.
    AIBench Scenario: Scenario-distilling AI Benchmarking. (arXiv:2005.03459v4 [cs.PF] UPDATED)
    (2 min) Modern real-world application scenarios like Internet services consist of a diversity of AI and non-AI modules with huge code sizes and long, complicated execution paths, which raises serious benchmarking and evaluation challenges. Using AI component or micro benchmarks alone can lead to error-prone conclusions. This paper presents a methodology to attack the above challenge. We formalize a real-world application scenario as a Directed Acyclic Graph-based model and propose rules to distill it into a permutation of essential AI and non-AI tasks, which we call a scenario benchmark. Together with seventeen industry partners, we extract nine typical scenario benchmarks. We design and implement an extensible, configurable, and flexible benchmark framework. We implement two Internet service AI scenario benchmarks based on the framework as proxies to two real-world application scenarios. We consider scenario, component, and micro benchmarks as three indispensable parts of evaluation. Our evaluation shows the advantage of our methodology over using component or micro AI benchmarks alone. The specifications, source code, testbed, and results are publicly available from \url{https://www.benchcouncil.org/aibench/scenario/}.
    DRIVE: Deep Reinforced Accident Anticipation with Visual Explanation. (arXiv:2107.10189v2 [cs.CV] UPDATED)
    (0 min) Traffic accident anticipation aims to accurately and promptly predict the occurrence of a future accident from dashcam videos, which is vital for a safety-guaranteed self-driving system. To encourage an early and accurate decision, existing approaches typically focus on capturing the cues of spatial and temporal context before a future accident occurs. However, their decision-making lacks visual explanation and ignores the dynamic interaction with the environment. In this paper, we propose Deep ReInforced accident anticipation with Visual Explanation, named DRIVE. The method simulates both the bottom-up and top-down visual attention mechanism in a dashcam observation environment so that the decision from the proposed stochastic multi-task agent can be visually explained by attentive regions. Moreover, the proposed dense anticipation reward and sparse fixation reward are effective in training the DRIVE model with our improved reinforcement learning algorithm. Experimental results show that the DRIVE model achieves state-of-the-art performance on multiple real-world traffic accident datasets. Code and pre-trained model are available at \url{https://www.rit.edu/actionlab/drive}.
    Information Theory-Guided Heuristic Progressive Multi-View Coding. (arXiv:2109.02344v1 [cs.CV])
    (2 min) Multi-view representation learning captures comprehensive information from multiple views of a shared context. Recent works intuitively apply contrastive learning (CL) to learn representations in a pairwise manner, which still has limitations: view-specific noise is not filtered when learning view-shared representations; fake negative pairs, where the negative term actually belongs to the same class as the positive, are treated the same as real negative pairs; and evenly measuring the similarities between terms might interfere with optimization. Importantly, few works study the theoretical framework of generalized self-supervised multi-view learning, especially for more than two views. To this end, we rethink the existing multi-view learning paradigm from an information-theoretical perspective and then propose a novel information-theoretical framework for generalized multi-view learning. Guided by it, we build a multi-view coding method with a three-tier progressive architecture, namely Information theory-guided heuristic Progressive Multi-view Coding (IPMC). In the distribution tier, IPMC aligns the distributions between views to reduce view-specific noise. In the set tier, IPMC builds self-adjusted pools for contrasting, which utilizes a view filter to adaptively modify the pools. Lastly, in the instance tier, we adopt a designed unified loss to learn discriminative representations and reduce gradient interference. Theoretically and empirically, we demonstrate the superiority of IPMC over state-of-the-art methods.
    Audio-Visual Transformer Based Crowd Counting. (arXiv:2109.01926v1 [cs.CV])
    (2 min) Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid the visual models; however, the performance is limited due to the lack of an effective approach for feature extraction and integration. This paper proposes a new audio-visual multi-task network to address the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modality association and productive feature extraction. The proposed network introduces the notion of auxiliary and explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to finally output the crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in between. Extensive experimental evaluations show that the proposed scheme outperforms the state-of-the-art networks under all evaluation settings with up to 33.8% improvement. We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.
    Spatial Domain Feature Extraction Methods for Unconstrained Handwritten Malayalam Character Recognition. (arXiv:2109.02153v1 [cs.CV])
    (2 min) Handwritten character recognition is an active research challenge, especially for Indian scripts. This paper deals with handwritten Malayalam, with a complete set of basic characters, vowel and consonant signs, and compound characters that may be present in the script. Spatial domain features suitable for recognition are chosen in this work. For classification, k-NN, SVM and ELM are employed.
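    A hedged sketch of the classification stage with generic k-NN and SVM classifiers from scikit-learn applied to spatial-domain feature vectors; the actual features, class count, and hyperparameters in the paper may differ, and the placeholder data below is random.

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.svm import SVC
        from sklearn.model_selection import train_test_split

        X = np.random.rand(1000, 64)          # placeholder spatial-domain features
        y = np.random.randint(0, 44, 1000)    # placeholder character labels

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        for clf in (KNeighborsClassifier(n_neighbors=5), SVC(kernel="rbf", C=10)):
            clf.fit(X_tr, y_tr)
            print(type(clf).__name__, clf.score(X_te, y_te))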
    Multi-view Image-based Hand Geometry Refinement using Differentiable Monte Carlo Ray Tracing. (arXiv:2107.05509v2 [cs.CV] UPDATED)
    (0 min) The amount and quality of datasets and tools available in the research field of hand pose and shape estimation act as evidence to the significant progress that has been made. However, even the datasets of the highest quality, reported to date, have shortcomings in annotation. We propose a refinement approach, based on differentiable ray tracing, and demonstrate how a high-quality publicly available, multi-camera dataset of hands (InterHand2.6M) can become an even better dataset, with respect to annotation quality. Differentiable ray tracing has not been employed so far to relevant problems and is hereby shown to be superior to the approximative alternatives that have been employed in the past. To tackle the lack of reliable ground truth, as far as quantitative evaluation is concerned, we resort to realistic synthetic data, to show that the improvement we induce is indeed significant. The same becomes evident in real data through visual evaluation.
    CPFN: Cascaded Primitive Fitting Networks for High-Resolution Point Clouds. (arXiv:2109.00113v2 [cs.CV] UPDATED)
    (0 min) Representing human-made objects as a collection of base primitives has a long history in computer vision and reverse engineering. In the case of high-resolution point cloud scans, the challenge is to be able to detect both large primitives as well as those explaining the detailed parts. While the classical RANSAC approach requires case-specific parameter tuning, state-of-the-art networks are limited by memory consumption of their backbone modules such as PointNet++, and hence fail to detect the fine-scale primitives. We present Cascaded Primitive Fitting Networks (CPFN) that relies on an adaptive patch sampling network to assemble detection results of global and local primitive detection networks. As a key enabler, we present a merging formulation that dynamically aggregates the primitives across global and local scales. Our evaluation demonstrates that CPFN improves the state-of-the-art SPFN performance by 13-14% on high-resolution point cloud datasets and specifically improves the detection of fine-scale primitives by 20-22%.
    Image Compression with Recurrent Neural Network and Generalized Divisive Normalization. (arXiv:2109.01999v1 [eess.IV])
    (2 min) Image compression is a method to remove spatial redundancy between adjacent pixels and reconstruct a high-quality image. In the past few years, deep learning has gained huge attention from the research community and produced promising image reconstruction results. Therefore, recent methods focused on developing deeper and more complex networks, which significantly increased network complexity. In this paper, two effective novel blocks are developed: analysis and synthesis blocks that employ the convolution layer and Generalized Divisive Normalization (GDN) on the variable-rate encoder and decoder sides. Our network utilizes a pixel RNN approach for quantization. Furthermore, to improve the whole network, we encode a residual image using LSTM cells to reduce unnecessary information. Experimental results demonstrate that the proposed variable-rate framework with novel blocks outperforms existing methods and standard image codecs, such as George's and JPEG, in terms of image similarity. The project page along with code and models is available at https://khawar512.github.io/cvpr/
    Triplet-Watershed for Hyperspectral Image Classification. (arXiv:2103.09384v3 [cs.CV] UPDATED)
    (0 min) Hyperspectral images (HSI) consist of rich spatial and spectral information, which can potentially be used for several applications. However, noise, band correlations and high dimensionality restrict the applicability of such data. This is recently addressed using creative deep learning network architectures such as ResNet, SSRN, and A2S2K. However, the last layer, i.e., the classification layer, remains unchanged and is taken to be the softmax classifier. In this article, we propose to use a watershed classifier. The watershed classifier extends the watershed operator from Mathematical Morphology for classification. In its vanilla form, the watershed classifier does not have any trainable parameters. In this article, we propose a novel approach to train deep learning networks to obtain representations suitable for the watershed classifier. The watershed classifier exploits the connectivity patterns, a characteristic of HSI datasets, for better inference. We show that exploiting such characteristics allows the Triplet-Watershed to achieve state-of-the-art results in supervised and semi-supervised contexts. These results are validated on Indianpines (IP), University of Pavia (UP), Kennedy Space Center (KSC) and University of Houston (UH) datasets, relying on a simple convnet architecture using a quarter of the parameters compared to previous state-of-the-art networks. The source code for reproducing the experiments and supplementary material (high resolution images) is available at https://github.com/ac20/TripletWatershed Code.
    Multi-Spectral Image Synthesis for Crop/Weed Segmentation in Precision Farming. (arXiv:2009.05750v2 [cs.CV] UPDATED)
    (0 min) An effective perception system is a fundamental component for farming robots, as it enables them to properly perceive the surrounding environment and to carry out targeted operations. The most recent methods make use of state-of-the-art machine learning techniques to learn a valid model for the target task. However, those techniques need a large amount of labeled data for training. A recent approach to deal with this issue is data augmentation through Generative Adversarial Networks (GANs), where entire synthetic scenes are added to the training data, thus enlarging and diversifying their informative content. In this work, we propose an alternative solution with respect to the common data augmentation methods, applying it to the fundamental problem of crop/weed segmentation in precision farming. Starting from real images, we create semi-artificial samples by replacing the most relevant object classes (i.e., crop and weeds) with their synthesized counterparts. To do that, we employ a conditional GAN (cGAN), where the generative model is trained by conditioning the shape of the generated object. Moreover, in addition to RGB data, we take into account also near-infrared (NIR) information, generating four channel multi-spectral synthetic images. Quantitative experiments, carried out on three publicly available datasets, show that (i) our model is capable of generating realistic multi-spectral images of plants and (ii) the usage of such synthetic images in the training process improves the segmentation performance of state-of-the-art semantic segmentation convolutional networks.
    On the Tightness of Semidefinite Relaxations for Rotation Estimation. (arXiv:2101.02099v2 [cs.CV] UPDATED)
    (0 min) Why is it that semidefinite relaxations have been so successful in numerous applications in computer vision and robotics for solving non-convex optimization problems involving rotations? In studying the empirical performance we note that there are few failure cases reported in the literature, in particular for estimation problems with a single rotation, motivating us to gain further theoretical understanding. A general framework based on tools from algebraic geometry is introduced for analyzing the power of semidefinite relaxations of problems with quadratic objective functions and rotational constraints. Applications include registration, hand-eye calibration and rotation averaging. We characterize the extreme points, and show that there exist failure cases for which the relaxation is not tight, even in the case of a single rotation. We also show that some problem classes are always tight given an appropriate parametrization. Our theoretical findings are accompanied with numerical simulations, providing further evidence and understanding of the results.
    Stochastic Neural Radiance Fields: Quantifying Uncertainty in Implicit 3D Representations. (arXiv:2109.02123v1 [cs.CV])
    (0 min) Neural Radiance Fields (NeRF) has become a popular framework for learning implicit 3D representations and addressing different tasks such as novel-view synthesis or depth-map estimation. However, in downstream applications where decisions need to be made based on automatic predictions, it is critical to leverage the confidence associated with the model estimations. Whereas uncertainty quantification is a long-standing problem in Machine Learning, it has been largely overlooked in the recent NeRF literature. In this context, we propose Stochastic Neural Radiance Fields (S-NeRF), a generalization of standard NeRF that learns a probability distribution over all the possible radiance fields modeling the scene. This distribution allows us to quantify the uncertainty associated with the scene information provided by the model. S-NeRF optimization is posed as a Bayesian learning problem, which is efficiently addressed using the Variational Inference framework. Exhaustive experiments over benchmark datasets demonstrate that S-NeRF is able to provide more reliable predictions and confidence values than generic approaches previously proposed for uncertainty estimation in other domains.
    GeneAnnotator: A Semi-automatic Annotation Tool for Visual Scene Graph. (arXiv:2109.02226v1 [cs.CV])
    (0 min) In this manuscript, we introduce a semi-automatic scene graph annotation tool for images, the GeneAnnotator. This software allows human annotators to describe the existing relationships between participators in the visual scene in the form of directed graphs, hence enabling the learning and reasoning on visual relationships, e.g., image captioning, VQA and scene graph generation, etc. The annotations for certain image datasets could either be merged in a single VG150 data-format file to support most existing models for scene graph learning or transformed into a separated annotation file for each single image to build customized datasets. Moreover, GeneAnnotator provides a rule-based relationship recommending algorithm to reduce the heavy annotation workload. With GeneAnnotator, we propose Traffic Genome, a comprehensive scene graph dataset with 1000 diverse traffic images, which in return validates the effectiveness of the proposed software for scene graph annotation. The project source code, with usage examples and sample data is available at https://github.com/Milomilo0320/A-Semi-automatic-Annotation-Software-for-Scene-Graph, under the Apache open-source license.
    Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing. (arXiv:2109.02281v1 [cs.CV])
    (0 min) Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction. In this paper, we propose a Spatial-Temporal Semantic Consistency method to capture class-exclusive context information. Specifically, we design a spatial-temporal consistency loss to constrain the semantic consistency in spatial and temporal dimensions. In addition, we adopt a pseudo-labeling strategy to enrich the training dataset. We obtain scores of 59.84% and 58.85% mIoU on the development (test part 1) and testing sets of VSPW, respectively, and our method wins 1st place in the VSPW challenge at ICCV2021.
    Online Continual Learning in Image Classification: An Empirical Survey. (arXiv:2101.10423v3 [cs.LG] UPDATED)
    (3 min) Online continual learning for image classification studies the problem of learning to classify images from an online stream of data and tasks, where tasks may include new classes (class incremental) or data nonstationarity (domain incremental). One of the key challenges of continual learning is to avoid catastrophic forgetting (CF), i.e., forgetting old tasks in the presence of more recent tasks. Over the past few years, many methods and tricks have been introduced to address this problem, but many have not been fairly and systematically compared under a variety of realistic and practical settings. To better understand the relative advantages of various approaches and the settings where they work best, this survey aims to (1) compare state-of-the-art methods such as MIR, iCARL, and GDumb and determine which works best at different experimental settings; (2) determine if the best class incremental methods are also competitive in the domain incremental setting; (3) evaluate the performance of 7 simple but effective tricks, such as the "review" trick and the nearest class mean (NCM) classifier, to assess their relative impact. Regarding (1), we observe iCaRL remains competitive when the memory buffer is small; GDumb outperforms many recently proposed methods in medium-size datasets and MIR performs the best in larger-scale datasets. For (2), we note that GDumb performs quite poorly while MIR -- already competitive for (1) -- is also strongly competitive in this very different but important setting. Overall, this allows us to conclude that MIR is overall a strong and versatile method across a wide variety of settings. For (3), we find that all 7 tricks are beneficial, and when augmented with the "review" trick and NCM classifier, MIR produces performance levels that bring online continual learning much closer to its ultimate goal of matching offline training.
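    As one concrete example of the tricks evaluated, here is a minimal sketch of a nearest-class-mean (NCM) classifier over fixed embeddings; the embedding network and the survey's exact protocol are assumed given and are not shown.

        import torch

        def update_class_means(features, labels, num_classes):
            # features: (N, D) embeddings of replay/stream samples; labels: (N,).
            means = torch.zeros(num_classes, features.size(1))
            for c in range(num_classes):
                mask = labels == c
                if mask.any():
                    means[c] = features[mask].mean(dim=0)
            return means

        def ncm_predict(embeddings, class_means):
            dists = torch.cdist(embeddings, class_means)   # (B, C) Euclidean distances
            return dists.argmin(dim=1)                     # nearest class mean wins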
    (M)SLAe-Net: Multi-Scale Multi-Level Attention embedded Network for Retinal Vessel Segmentation. (arXiv:2109.02084v1 [eess.IV])
    (2 min) Segmentation plays a crucial role in diagnosis. Studying the retinal vasculature from fundus images helps identify early signs of many crucial illnesses such as diabetic retinopathy. Due to the varying shape, size, and patterns of retinal vessels, along with artefacts and noise in fundus images, no one-stage method can accurately segment retinal vessels. In this work, we propose a multi-scale, multi-level attention embedded CNN architecture ((M)SLAe-Net) to address the issue of multi-stage processing for robust and precise segmentation of retinal vessels. We do this by extracting features at multiple scales and multiple levels of the network, enabling our model to holistically extract local and global features. Multi-scale features are extracted using our novel dynamic dilated pyramid pooling (D-DPP) module. We also aggregate the features from all the network levels. These effectively resolve the issues of varying shapes and artefacts, and hence the need for multiple stages. To assist in better pixel-level classification, we use the Squeeze and Attention (SA) module, a smartly adapted version of the Squeeze and Excitation (SE) module, for segmentation tasks in our network to facilitate pixel-group attention. Our unique network design and novel D-DPP module, with an efficient task-specific loss function for thin vessels, enable better cross-data performance. Exhaustive experimental results on DRIVE, STARE, HRF, and CHASE-DB1 show the superiority of our method.
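    A hedged sketch of a dilated pyramid pooling block (parallel 3x3 convolutions with increasing dilation, concatenated and fused by a 1x1 convolution); the dynamic D-DPP module in the paper is more involved, so this only conveys the multi-scale idea.

        import torch
        import torch.nn as nn

        class DilatedPyramidPooling(nn.Module):
            def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
                super().__init__()
                self.branches = nn.ModuleList([
                    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
                    for d in dilations
                ])
                self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

            def forward(self, x):
                feats = [branch(x) for branch in self.branches]   # same spatial size per branch
                return self.fuse(torch.cat(feats, dim=1))

        block = DilatedPyramidPooling(16, 32)
        print(block(torch.randn(1, 16, 64, 64)).shape)            # torch.Size([1, 32, 64, 64])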
    Siamese Basis Function Networks for Data-efficient Defect Classification in Technical Domains. (arXiv:2012.01338v7 [cs.CV] UPDATED)
    (2 min) Training deep learning models in technical domains is often accompanied by the challenge that although the task is clear, insufficient data for training is available. In this work, we propose a novel approach based on the combination of Siamese networks and radial basis function networks to perform classification without pretraining, by measuring the distance between images in semantic space in a data-efficient manner. We develop the models using three technical datasets, the NEU dataset, the BSD dataset, and the TEX dataset. In addition to the technical domain, we show the general applicability to classical datasets (CIFAR-10 and MNIST) as well. The approach is tested against state-of-the-art models (Resnet50 and Resnet101) by stepwise reduction of the number of samples available for training. The authors show that the proposed approach outperforms the state-of-the-art models in the low data regime.
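    A sketch under assumptions: a shared encoder produces embeddings and radial basis functions on distances to learned prototypes produce class scores. This combines the two named ingredients generically and is not the exact architecture of the paper.

        import torch
        import torch.nn as nn

        class SiameseRBF(nn.Module):
            def __init__(self, in_dim, embed_dim, num_classes, gamma=1.0):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                             nn.Linear(128, embed_dim))
                self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim))
                self.gamma = gamma

            def forward(self, x):
                z = self.encoder(x)                         # shared (Siamese) embedding
                d2 = torch.cdist(z, self.prototypes) ** 2   # squared distance to each prototype
                return torch.exp(-self.gamma * d2)          # RBF activation per class

        model = SiameseRBF(in_dim=64, embed_dim=16, num_classes=6)
        print(model(torch.randn(8, 64)).shape)              # torch.Size([8, 6])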
    Identification of Driver Phone Usage Violations via State-of-the-Art Object Detection with Tracking. (arXiv:2109.02119v1 [cs.CV])
    (2 min) The use of mobile phones while driving has been a major factor in road traffic incidents, and the process of capturing such violations can be a laborious task. Advancements in both modern object detection frameworks and high-performance hardware have paved the way for a more automated approach to video surveillance. In this work, we propose a custom-trained state-of-the-art object detector to work with roadside cameras to capture driver phone usage without the need for human intervention. The proposed approach also addresses the issues caused by windscreen glare and introduces the steps required to remedy this. Twelve pre-trained models are fine-tuned with our custom dataset using four popular object detection methods: YOLO, SSD, Faster R-CNN, and CenterNet. Out of all the object detectors tested, YOLO yields the highest accuracy levels of up to 96% (AP10) and frame rates of up to ~30 FPS. The DeepSort object tracking algorithm is also integrated into the best-performing model to collect records of only the unique violations and to enable the proposed approach to count the number of vehicles. The proposed automated system will collect the output images of the identified violations, timestamps of each violation, and the total vehicle count. Data can be accessed via a purpose-built user interface.
    Reasoning Graph Networks for Kinship Verification: from Star-shaped to Hierarchical. (arXiv:2109.02219v1 [cs.CV])
    (2 min) In this paper, we investigate the problem of facial kinship verification by learning hierarchical reasoning graph networks. Conventional methods usually focus on learning discriminative features for each facial image of a paired sample and neglect how to fuse the obtained two facial image features and reason about the relations between them. To address this, we propose a Star-shaped Reasoning Graph Network (S-RGN). Our S-RGN first constructs a star-shaped graph where each surrounding node encodes the information of comparisons in a feature dimension and the central node is employed as the bridge for the interaction of surrounding nodes. Then we perform relational reasoning on this star graph with iterative message passing. The proposed S-RGN uses only one central node to analyze and process information from all surrounding nodes, which limits its reasoning capacity. We further develop a Hierarchical Reasoning Graph Network (H-RGN) to exploit more powerful and flexible capacity. More specifically, our H-RGN introduces a set of latent reasoning nodes and constructs a hierarchical graph with them. Then bottom-up comparative information abstraction and top-down comprehensive signal propagation are iteratively performed on the hierarchical graph to update the node features. Extensive experimental results on four widely used kinship databases show that the proposed methods achieve very competitive results.
    Fair Federated Learning for Heterogeneous Face Data. (arXiv:2109.02351v1 [cs.LG])
    (2 min) We consider the problem of achieving fair classification in Federated Learning (FL) under data heterogeneity. Most of the approaches proposed for fair classification require diverse data that represent the different demographic groups involved. In contrast, it is common for each client to own data that represents only a single demographic group. Hence the existing approaches cannot be adopted for fair classification models at the client level. To resolve this challenge, we propose several aggregation techniques. We empirically validate these techniques by comparing the resulting fairness metrics and accuracy on CelebA, UTK, and FairFace datasets.
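    For context, a minimal sketch of plain FedAvg-style weighted aggregation of client model states; the fairness-aware aggregation techniques proposed in the paper replace or reweight this step and are not reproduced here.

        import copy

        def aggregate(client_states, client_sizes):
            # client_states: list of PyTorch state_dicts; client_sizes: samples per client.
            total = float(sum(client_sizes))
            global_state = copy.deepcopy(client_states[0])
            for key in global_state:
                global_state[key] = sum(
                    (n / total) * state[key].float()
                    for state, n in zip(client_states, client_sizes)
                )
            return global_state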
    Hierarchical Object-to-Zone Graph for Object Navigation. (arXiv:2109.02066v1 [cs.CV])
    (2 min) The goal of object navigation is to reach the expected objects according to visual information in the unseen environments. Previous works usually implement deep models to train an agent to predict actions in real-time. However, in the unseen environment, when the target object is not in egocentric view, the agent may not be able to make wise decisions due to the lack of guidance. In this paper, we propose a hierarchical object-to-zone (HOZ) graph to guide the agent in a coarse-to-fine manner, and an online-learning mechanism is also proposed to update HOZ according to the real-time observation in new environments. In particular, the HOZ graph is composed of scene nodes, zone nodes and object nodes. With the pre-learned HOZ graph, the real-time observation and the target goal, the agent can constantly plan an optimal path from zone to zone. In the estimated path, the next potential zone is regarded as sub-goal, which is also fed into the deep reinforcement learning model for action prediction. Our methods are evaluated on the AI2-Thor simulator. In addition to widely used evaluation metrics SR and SPL, we also propose a new evaluation metric of SAE that focuses on the effective action rate. Experimental results demonstrate the effectiveness and efficiency of our proposed method.
    Does Melania Trump have a body double from the perspective of automatic face recognition?. (arXiv:2109.02283v1 [cs.CV])
    (2 min) In this paper, we explore whether automatic face recognition can help in verifying widespread misinformation on social media, particularly conspiracy theories that are based on the existence of body doubles. The conspiracy theory addressed in this paper is the case of the Melania Trump body double. We employed four different state-of-the-art descriptors for face recognition to verify the integrity of the claim of the studied conspiracy theory. In addition, we assessed the impact of different image quality metrics on the variation of face recognition results. Two sets of image quality metrics were considered: acquisition-related metrics and subject-related metrics.
    Automated Cardiac Resting Phase Detection Targeted on the Right Coronary Artery. (arXiv:2109.02342v1 [eess.IV])
    (2 min) Purpose: Static cardiac imaging such as late gadolinium enhancement, mapping, or 3-D coronary angiography requires prior information, e.g., the phase during a cardiac cycle with the least motion, called the resting phase (RP). The purpose of this work is to propose a fully automated framework that allows the detection of the right coronary artery (RCA) RP within CINE series. Methods: The proposed prototype system consists of three main steps. First, the localization of the regions of interest (ROI) is performed. Second, as CINE series are time-resolved, the cropped ROI series over all time points are used to track motion quantitatively. Third, the output motion values are used to classify RPs. In this work, we focused on detecting the area containing the outer edge of the cross-section of the RCA as our target. The proposed framework was evaluated on 102 clinically acquired datasets at 1.5T and 3T. The automatically classified RPs were compared with the ground truth RPs annotated manually by a medical expert to test the robustness and feasibility of the framework. Results: The predicted RCA RPs showed high agreement with the expert-annotated RPs, with 92.7% accuracy, 90.5% sensitivity and 95.0% specificity for the unseen study dataset. The mean absolute difference of the start and end RP was 13.6 ${\pm}$ 18.6 ms for the validation study dataset (n=102). Conclusion: In this work, automated RP detection has been introduced by the proposed framework and demonstrated feasibility, robustness, and applicability for diverse static imaging acquisitions.
    GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization. (arXiv:2109.02220v1 [cs.CV])
    (2 min) Model compression techniques have recently gained explosive attention for obtaining efficient AI models for various real-time applications. Channel pruning is one important compression strategy and is widely used in slimming various DNNs. Previous gate-based or importance-based pruning methods aim to remove channels whose importance is smallest. However, it remains unclear what criteria the channel importance should be measured on, leading to various channel selection heuristics. Some other sampling-based pruning methods deploy sampling strategies to train sub-nets, which often causes training instability and degraded performance of the compressed model. In view of these research gaps, we present a new module named Gates with Differentiable Polarization (GDP), inspired by principled optimization ideas. GDP can be plugged before convolutional layers without bells and whistles, to control the on-and-off of each channel or whole layer block. During the training process, the polarization effect will drive a subset of gates to smoothly decrease to exact zero, while other gates gradually stay away from zero by a large margin. When training terminates, those zero-gated channels can be painlessly removed, while other non-zero gates can be absorbed into the succeeding convolution kernel, causing no interruption to training and no damage to the trained model. Experiments conducted over CIFAR-10 and ImageNet datasets show that the proposed GDP algorithm achieves state-of-the-art performance on various benchmark DNNs at a broad range of pruning ratios. We also apply GDP to DeepLabV3Plus-ResNet50 on the challenging Pascal VOC segmentation task, whose test performance sees no drop (even slightly improved) with over 60% FLOPs saving.
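    A hedged illustration of the gating idea: a learnable per-channel gate placed before a convolution, with a simple polarization-style penalty that pushes gates toward 0 or 1 so zero-gated channels can be removed. The actual GDP gate function and its optimization differ from this toy version.

        import torch
        import torch.nn as nn

        class ChannelGate(nn.Module):
            def __init__(self, channels):
                super().__init__()
                self.gate = nn.Parameter(torch.ones(channels))

            def forward(self, x):
                g = torch.sigmoid(self.gate)           # keep gates in (0, 1)
                return x * g.view(1, -1, 1, 1)

            def polarization_penalty(self):
                g = torch.sigmoid(self.gate)
                return (g * (1.0 - g)).sum()           # small when gates sit near 0 or 1

        gate = ChannelGate(64)
        out = gate(torch.randn(2, 64, 32, 32))
        loss = out.mean() + 1e-3 * gate.polarization_penalty()
        loss.backward()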
    Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis. (arXiv:2109.02081v1 [cs.CV])
    (2 min) Deep person generation has attracted extensive research attention due to its wide applications in virtual agents, video conferencing, online shopping and art/movie production. With the advancement of deep learning, visual appearances (face, pose, cloth) of a person image can be easily generated or manipulated on demand. In this survey, we first summarize the scope of person generation, and then systematically review recent progress and technical trends in deep person generation, covering three major tasks: talking-head generation (face), pose-guided person generation (pose) and garment-oriented person generation (cloth). More than two hundred papers are covered for a thorough overview, and the milestone works are highlighted to witness the major technical breakthrough. Based on these fundamental tasks, a number of applications are investigated, e.g., virtual fitting, digital human, generative data augmentation. We hope this survey could shed some light on the future prospects of deep person generation, and provide a helpful foundation for full applications towards digital human.
    Investigating Customization Strategies and Convergence Behaviors of Task-specific ADMM. (arXiv:1909.10819v2 [cs.CV] UPDATED)
    (2 min) The Alternating Direction Method of Multipliers (ADMM) has been a popular algorithmic framework for separable optimization problems with linear constraints. Since numerical ADMM does not exploit the particular structure of the problem at hand or the input data information, leveraging task-specific modules (e.g., neural networks and other data-driven architectures) to extend ADMM is a significant but challenging task. This work focuses on designing a flexible algorithmic framework to incorporate various task-specific modules (with no additional constraints) to improve the performance of ADMM in real-world applications. Specifically, we propose Guidance from Optimality (GO), a new customization strategy, to embed task-specific modules into ADMM (GO-ADMM). By introducing an optimality-based criterion to guide the propagation, GO-ADMM establishes an updating scheme agnostic to the choice of additional modules. Existing task-specific methods just plug their task-specific modules into the numerical iterations in a straightforward manner, and even with some restrictive constraints on the plug-in modules, they can only obtain relatively weak convergence properties for the resulting ADMM iterations. Fortunately, without any restrictions on the embedded modules, we prove the convergence of GO-ADMM with respect to objective values and constraint violations, and derive the worst-case convergence rate measured by iteration complexity. Extensive experiments are conducted to verify the theoretical results and demonstrate the efficiency of GO-ADMM.
    Neural Radiance Flow for 4D View Synthesis and Video Processing. (arXiv:2012.09790v2 [cs.CV] UPDATED)
    (2 min) We present a method, Neural Radiance Flow (NeRFlow), to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images. Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene. By enforcing consistency across different modalities, our representation enables multi-view rendering in diverse dynamic scenes, including water pouring, robotic interaction, and real images, outperforming state-of-the-art methods for spatial-temporal view synthesis. Our approach works even when input images are captured with only one camera. We further demonstrate that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.
    Learning Fine-Grained Motion Embedding for Landscape Animation. (arXiv:2109.02216v1 [cs.CV])
    (2 min) In this paper, we focus on landscape animation, which aims to generate time-lapse videos from a single landscape image. Motion is crucial for landscape animation as it determines how objects move in videos. Existing methods are able to generate appealing videos by learning motion from real time-lapse videos. However, current methods suffer from inaccurate motion generation, which leads to unrealistic video results. To tackle this problem, we propose a model named FGLA to generate high-quality and realistic videos by learning Fine-Grained motion embedding for Landscape Animation. Our model consists of two parts: (1) a motion encoder which embeds time-lapse motion in a fine-grained way. (2) a motion generator which generates realistic motion to animate input images. To train and evaluate on diverse time-lapse videos, we build the largest high-resolution Time-lapse video dataset with Diverse scenes, namely Time-lapse-D, which includes 16,874 video clips with over 10 million frames. Quantitative and qualitative experimental results demonstrate the superiority of our method. In particular, our method achieves relative improvements of 19% on LIPIS and 5.6% on FVD compared with state-of-the-art methods on our dataset. A user study carried out with 700 human subjects shows that our approach visually outperforms existing methods by a large margin.
    Dissecting Image Crops. (arXiv:2011.11831v4 [cs.CV] UPDATED)
    (2 min) The elementary operation of cropping underpins nearly every computer vision system, ranging from data augmentation and translation invariance to computational photography and representation learning. This paper investigates the subtle traces introduced by this operation. For example, despite refinements to camera optics, lenses will leave behind certain clues, notably chromatic aberration and vignetting. Photographers also leave behind other clues relating to image aesthetics and scene composition. We study how to detect these traces, and investigate the impact that cropping has on the image distribution. While our aim is to dissect the fundamental impact of spatial crops, there are also a number of practical implications to our work, such as revealing faulty photojournalism and equipping neural network researchers with a better understanding of shortcut learning. Code is available at https://github.com/basilevh/dissecting-image-crops.
    Sparse-MLP: A Fully-MLP Architecture with Conditional Computation. (arXiv:2109.02008v1 [cs.LG])
    (2 min) Mixture of Experts (MoE) with sparse conditional computation has been proved an effective architecture for scaling attention-based models to more parameters with comparable computation cost. In this paper, we propose Sparse-MLP, scaling the recent MLP-Mixer model with sparse MoE layers, to achieve a more computation-efficient architecture. We replace a subset of dense MLP blocks in the MLP-Mixer model with Sparse blocks. In each Sparse block, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, and one with MLP experts mixing information within patches along the channel dimension. Besides, to reduce the computational cost of routing and improve expert capacity, we design Re-represent layers in each Sparse block. These layers re-scale image representations by two simple but effective linear transformations. By pre-training on ImageNet-1k with the MoCo v3 algorithm, our models can outperform dense MLP models with comparable parameters and less computational cost on several downstream image classification tasks.
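    A toy sketch of sparse conditional computation with top-1 expert routing over tokens; the two-stage patch/channel mixing and the Re-represent layers of Sparse-MLP are not shown, so this only illustrates the MoE routing pattern.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Top1MoE(nn.Module):
            def __init__(self, dim, hidden, num_experts=4):
                super().__init__()
                self.router = nn.Linear(dim, num_experts)
                self.experts = nn.ModuleList([
                    nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                    for _ in range(num_experts)
                ])

            def forward(self, x):                          # x: (tokens, dim)
                gates = F.softmax(self.router(x), dim=-1)
                top_gate, top_idx = gates.max(dim=-1)      # one expert per token
                out = torch.zeros_like(x)
                for e, expert in enumerate(self.experts):
                    mask = top_idx == e
                    if mask.any():
                        out[mask] = top_gate[mask, None] * expert(x[mask])
                return out

        moe = Top1MoE(dim=32, hidden=64)
        print(moe(torch.randn(10, 32)).shape)              # torch.Size([10, 32])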
    Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond. (arXiv:1710.03113v5 [cs.LG] UPDATED)
    (3 min) The rapid emergence of high-dimensional data in various areas has brought new challenges to current ensemble clustering research. To deal with the curse of dimensionality, recently considerable efforts in ensemble clustering have been made by means of different subspace-based techniques. However, besides the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains a surprisingly open problem in ensemble clustering how to create and aggregate a large population of diversified metrics, and furthermore, how to jointly investigate the multi-level diversity in the large populations of metrics, subspaces, and clusters in a unified framework. To tackle this problem, this paper proposes a novel multidiversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed. Further, an entropy-based criterion is utilized to explore the cluster-wise diversity in ensembles, based on which three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments are conducted on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, which demonstrate the superiority of our algorithms over the state-of-the-art. The source code is available at https://github.com/huangdonghere/MDEC.
    Square Root Marginalization for Sliding-Window Bundle Adjustment. (arXiv:2109.02182v1 [cs.CV])
    (2 min) In this paper we propose a novel square root sliding-window bundle adjustment suitable for real-time odometry applications. The square root formulation pervades three major aspects of our optimization-based sliding-window estimator: for bundle adjustment we eliminate landmark variables with nullspace projection; to store the marginalization prior we employ a matrix square root of the Hessian; and when marginalizing old poses we avoid forming normal equations and update the square root prior directly with a specialized QR decomposition. We show that the proposed square root marginalization is algebraically equivalent to the conventional use of Schur complement (SC) on the Hessian. Moreover, it elegantly deals with rank-deficient Jacobians producing a prior equivalent to SC with Moore-Penrose inverse. Our evaluation of visual and visual-inertial odometry on real-world datasets demonstrates that the proposed estimator is 36% faster than the baseline. It furthermore shows that in single precision, conventional Hessian-based marginalization leads to numeric failures and reduced accuracy. We analyse numeric properties of the marginalization prior to explain why our square root form does not suffer from the same effect and therefore entails superior performance.
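    A small numerical sketch (illustrative only, not the paper's estimator) of why a matrix square root can stand in for the Hessian: the R factor of a QR decomposition of the stacked Jacobian J satisfies R^T R = J^T J = H, and its condition number is the square root of H's, which is the intuition behind the reported single-precision stability.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    J = rng.standard_normal((20, 6))        # stacked residual Jacobian (rows >> cols)
    H = J.T @ J                             # conventional Hessian of the normal equations

    Q, R = np.linalg.qr(J)                  # thin QR; R is a 6x6 upper-triangular factor
    print(np.allclose(R.T @ R, H))          # True: R is a valid matrix square root of H

    # cond(R) equals cond(J), whereas cond(H) = cond(J)**2, so the square root form
    # needs roughly half the significant digits of the Hessian form.
    print(np.linalg.cond(R) ** 2, np.linalg.cond(H))    # approximately equal
    ```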
    Navigational Path-Planning For All-Terrain Autonomous Agricultural Robot. (arXiv:2109.02015v1 [cs.RO])
    (2 min) The shortage of workforce and the increasing cost of maintenance have forced many farm industrialists to shift towards automated and mechanized approaches. The key component of autonomous systems is the path planning technique used. A coverage path planning (CPP) algorithm is used for navigating over farmlands to perform various agricultural operations such as seeding, ploughing, or spraying pesticides and fertilizers. This paper compares novel algorithms for autonomous navigation of farmlands. To reduce navigational constraints, a high-resolution grid map representation specific to Indian environments is taken into consideration. The free space is covered by distinguishing the grid cells as covered, unexplored, partially explored, or containing an obstacle. The performance of the compared algorithms is evaluated with metrics such as time efficiency, space efficiency, accuracy, and robustness to changes in the environment. Robotic Operating System (ROS), Dassault Systemes Experience Platform (3DS Experience), MATLAB, and Python were used for the simulation of the compared algorithms. The results demonstrated the applicability of the algorithms for autonomous field navigation and their feasibility for robotic path planning.
    Stimuli-Aware Visual Emotion Analysis. (arXiv:2109.01812v1 [cs.CV])
    (2 min) Visual emotion analysis (VEA) has attracted great attention recently, due to the increasing tendency of expressing and understanding emotions through images on social networks. Different from traditional vision tasks, VEA is inherently more challenging since it involves a much higher level of complexity and ambiguity in the human cognitive process. Most of the existing methods adopt deep learning techniques to extract general features from the whole image, disregarding the specific features evoked by various emotional stimuli. Inspired by the Stimuli-Organism-Response (S-O-R) emotion model in psychological theory, we propose a stimuli-aware VEA method consisting of three stages, namely stimuli selection (S), feature extraction (O) and emotion prediction (R). First, specific emotional stimuli (i.e., color, object, face) are selected from images by employing off-the-shelf tools. To the best of our knowledge, this is the first time a stimuli selection process has been introduced into VEA in an end-to-end network. Then, we design three specific networks, i.e., Global-Net, Semantic-Net and Expression-Net, to extract distinct emotional features from different stimuli simultaneously. Finally, benefiting from the inherent structure of Mikel's wheel, we design a novel hierarchical cross-entropy loss to distinguish hard false examples from easy ones in an emotion-specific manner. Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets. An ablation study and visualizations further prove the validity and interpretability of our method.
    How Reliable Are Out-of-Distribution Generalization Methods for Medical Image Segmentation?. (arXiv:2109.01668v1 [eess.IV])
    (2 min) The recent achievements of Deep Learning rely on the test data being similar in distribution to the training data. In an ideal case, Deep Learning models would achieve Out-of-Distribution (OoD) Generalization, i.e. reliably make predictions on out-of-distribution data. Yet in practice, models usually fail to generalize well when facing a shift in distribution. Several methods were thereby designed to improve the robustness of the features learned by a model through Regularization- or Domain-Prediction-based schemes. Segmenting medical images such as MRIs of the hippocampus is essential for the diagnosis and treatment of neuropsychiatric disorders. But these brain images often suffer from distribution shift due to the patient's age and various pathologies affecting the shape of the organ. In this work, we evaluate OoD Generalization solutions for the problem of hippocampus segmentation in MR data using both fully- and semi-supervised training. We find that no method performs reliably in all experiments. Only the V-REx loss stands out as it remains easy to tune, while it outperforms a standard U-Net in most cases.
    Recognition of COVID-19 Disease Utilizing X-Ray Imaging of the Chest Using CNN. (arXiv:2109.02103v1 [eess.IV])
    (2 min) As the COVID-19 pandemic continues, the utilization of chest X-ray (CXR) images as a complementary screening technique to RT-PCR testing is growing in clinical use for respiratory complaints. Many new deep learning approaches have been developed as a consequence. The goal of this research is to assess convolutional neural networks (CNNs) for diagnosing COVID-19 utilizing chest X-ray images. The performance of CNNs with one, three, and four convolution layers has been evaluated in this research. A dataset of 13,808 CXR photographs is used in this research. When evaluated on X-ray images with three splits of the dataset, our preliminary experimental results show that the CNN model with three convolution layers can reliably detect COVID-19 with 96 percent accuracy (96 percent precision). This indicates the potential of the proposed model for reliable screening of COVID-19.
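    For illustration, a minimal three-convolution-layer CNN of the kind evaluated above might look like the sketch below; the input size, channel widths, and binary output head are assumptions, not the authors' exact architecture.

    ```python
    import torch
    import torch.nn as nn

    class ThreeConvCNN(nn.Module):
        def __init__(self, num_classes=2):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 112 -> 56
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
            )
            self.classifier = nn.Sequential(
                nn.Flatten(), nn.Linear(64 * 28 * 28, 128), nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = ThreeConvCNN()
    logits = model(torch.randn(4, 1, 224, 224))   # batch of 4 grayscale chest X-rays
    print(logits.shape)                            # torch.Size([4, 2])
    ```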
    F3S: Free Flow Fever Screening. (arXiv:2109.01733v1 [cs.CV])
    (0 min) Identification of people with elevated body temperature can reduce or dramatically slow down the spread of infectious diseases like COVID-19. We present a novel fever-screening system, F3S, that uses edge machine learning techniques to accurately measure core body temperatures of multiple individuals in a free-flow setting. F3S performs real-time sensor fusion of visual and thermal camera data streams to detect elevated body temperature, and it has several unique features: (a) visual and thermal streams represent very different modalities, and we dynamically associate semantically-equivalent regions across visual and thermal frames by using a new, dynamic alignment technique that analyzes content and context in real-time, (b) we track people through occlusions, identify the eye (inner canthus), forehead, face and head regions where possible, and provide an accurate temperature reading by using a prioritized refinement algorithm, and (c) we robustly detect elevated body temperature even in the presence of personal protective equipment like masks, sunglasses, or hats, all of which can be affected by hot weather and lead to spurious temperature readings. F3S has been deployed at over a dozen large commercial establishments, providing contact-less, free-flow, real-time fever screening for thousands of employees and customers in indoor and outdoor settings.
    Dual Transfer Learning for Event-based End-task Prediction via Pluggable Event to Image Translation. (arXiv:2109.01801v1 [cs.CV])
    (0 min) Event cameras are novel sensors that perceive the per-pixel intensity changes and output asynchronous event streams with high dynamic range and less motion blur. It has been shown that events alone can be used for end-task learning, e.g., semantic segmentation, based on encoder-decoder-like networks. However, as events are sparse and mostly reflect edge information, it is difficult to recover original details relying merely on the decoder. Moreover, most methods resort to pixel-wise loss alone for supervision, which might be insufficient to fully exploit the visual details from sparse events, thus leading to less optimal performance. In this paper, we propose a simple yet flexible two-stream framework named Dual Transfer Learning (DTL) to effectively enhance the performance on the end-tasks without adding extra inference cost. The proposed approach consists of three parts: an event to end-task learning (EEL) branch, an event to image translation (EIT) branch, and a transfer learning (TL) module that simultaneously explores the feature-level affinity information and pixel-level knowledge from the EIT branch to improve the EEL branch. This simple yet novel method leads to strong representation learning from events and is evidenced by the significant performance boost on the end-tasks such as semantic segmentation and depth estimation.
    Multi-View Spatial-Temporal Graph Convolutional Networks with Domain Generalization for Sleep Stage Classification. (arXiv:2109.01824v1 [eess.SP])
    (0 min) Sleep stage classification is essential for sleep assessment and disease diagnosis. Although previous attempts to classify sleep stages have achieved high classification performance, several challenges remain open: 1) How to effectively utilize time-varying spatial and temporal features from multi-channel brain signals remains challenging. Prior works have not been able to fully utilize the spatial topological information among brain regions. 2) Due to the many differences found in individual biological signals, how to overcome the differences of subjects and improve the generalization of deep neural networks is important. 3) Most deep learning methods ignore the interpretability of the model to the brain. To address the above challenges, we propose a multi-view spatial-temporal graph convolutional networks (MSTGCN) with domain generalization for sleep stage classification. Specifically, we construct two brain view graphs for MSTGCN based on the functional connectivity and physical distance proximity of the brain regions. The MSTGCN consists of graph convolutions for extracting spatial features and temporal convolutions for capturing the transition rules among sleep stages. In addition, attention mechanism is employed for capturing the most relevant spatial-temporal information for sleep stage classification. Finally, domain generalization and MSTGCN are integrated into a unified framework to extract subject-invariant sleep features. Experiments on two public datasets demonstrate that the proposed model outperforms the state-of-the-art baselines.
    Towards Accurate Alignment in Real-time 3D Hand-Mesh Reconstruction. (arXiv:2109.01723v1 [cs.CV])
    (0 min) 3D hand-mesh reconstruction from RGB images facilitates many applications, including augmented reality (AR). However, this requires not only real-time speed and accurate hand pose and shape but also plausible mesh-image alignment. While existing works already achieve promising results, meeting all three requirements is very challenging. This paper presents a novel pipeline by decoupling the hand-mesh reconstruction task into three stages: a joint stage to predict hand joints and segmentation; a mesh stage to predict a rough hand mesh; and a refine stage to fine-tune it with an offset mesh for mesh-image alignment. With careful design in the network structure and in the loss functions, we can promote high-quality finger-level mesh-image alignment and drive the models together to deliver real-time predictions. Extensive quantitative and qualitative results on benchmark datasets demonstrate that the quality of our results outperforms the state-of-the-art methods on hand-mesh/pose precision and hand-image alignment. In the end, we also showcase several real-time AR scenarios.
    CodeNeRF: Disentangled Neural Radiance Fields for Object Categories. (arXiv:2109.01750v1 [cs.GR])
    (0 min) CodeNeRF is an implicit 3D neural representation that learns the variation of object shapes and textures across a category and can be trained, from a set of posed images, to synthesize novel views of unseen objects. Unlike the original NeRF, which is scene specific, CodeNeRF learns to disentangle shape and texture by learning separate embeddings. At test time, given a single unposed image of an unseen object, CodeNeRF jointly estimates camera viewpoint, and shape and appearance codes via optimization. Unseen objects can be reconstructed from a single image, and then rendered from new viewpoints or their shape and texture edited by varying the latent codes. We conduct experiments on the SRN benchmark, which show that CodeNeRF generalises well to unseen objects and achieves on-par performance with methods that require known camera pose at test time. Our results on real-world images demonstrate that CodeNeRF can bridge the sim-to-real gap. Project page: https://github.com/wayne1123/code-nerf
    Spatio-temporal-spectral-angular observation model that integrates observations from UAV and mobile mapping vehicle for better urban mapping. (arXiv:2109.00900v2 [cs.CV] UPDATED)
    (0 min) In a complex urban scene, observation from a single sensor unavoidably leads to voids in observations, failing to describe urban objects in a comprehensive manner. In this paper, we propose a spatio-temporal-spectral-angular observation model to integrate observations from UAV and mobile mapping vehicle platform, realizing a joint, coordinated observation operation from both air and ground. We develop a multi-source remote sensing data acquisition system to effectively acquire multi-angle data of complex urban scenes. Multi-source data fusion solves the missing data problem caused by occlusion and achieves accurate, rapid, and complete collection of holographic spatial and temporal information in complex urban scenes. We carried out an experiment on Baisha Town, Chongqing, China and obtained multi-sensor, multi-angle data from UAV and mobile mapping vehicle. We first extracted the point cloud from UAV and then integrated the UAV and mobile mapping vehicle point cloud. The integrated results combined both the characteristic of UAV and mobile mapping vehicle point cloud, confirming the practicability of the proposed joint data acquisition platform and the effectiveness of spatio-temporal-spectral-angular observation model. Compared with the observation from UAV or mobile mapping vehicle alone, the integrated system provides an effective data acquisition solution towards comprehensive urban monitoring.
    Seam Carving Detection and Localization using Two-Stage Deep Neural Networks. (arXiv:2109.01764v1 [cs.CV])
    (0 min) Seam carving is a method to resize an image in a content aware fashion. However, this method can also be used to carve out objects from images. In this paper, we propose a two-step method to detect and localize seam carved images. First, we build a detector to detect small patches in an image that has been seam carved. Next, we compute a heatmap on an image based on the patch detector's output. Using these heatmaps, we build another detector to detect if a whole image is seam carved or not. Our experimental results show that our approach is effective in detecting and localizing seam carved images.
    On robustness of generative representations against catastrophic forgetting. (arXiv:2109.01844v1 [cs.CV])
    (0 min) Catastrophic forgetting of previously learned knowledge while learning new tasks is a widely observed limitation of contemporary neural networks. Although many continual learning methods are proposed to mitigate this drawback, the main question remains unanswered: what is the root cause of catastrophic forgetting? In this work, we aim at answering this question by posing and validating a set of research hypotheses related to the specificity of representations built internally by neural models. More specifically, we design a set of empirical evaluations that compare the robustness of representations in discriminative and generative models against catastrophic forgetting. We observe that representations learned by discriminative models are more prone to catastrophic forgetting than their generative counterparts, which sheds new light on the advantages of developing generative models for continual learning. Finally, our work opens new research pathways and possibilities to adopt generative models in continual learning beyond mere replay mechanisms.
    PR-Net: Preference Reasoning for Personalized Video Highlight Detection. (arXiv:2109.01799v1 [cs.CV])
    (0 min) Personalized video highlight detection aims to shorten a long video to interesting moments according to a user's preference, which has recently attracted the community's attention. Current methods regard the user's history as holistic information to predict the user's preference while neglecting the inherent diversity of the user's interests, resulting in a vague preference representation. In this paper, we propose a simple yet efficient preference reasoning framework (PR-Net) to explicitly take the diverse interests into account for frame-level highlight prediction. Specifically, distinct user-specific preferences for each input query frame are produced, presented as the similarity weighted sum of history highlights to the corresponding query frame. Next, distinct comprehensive preferences are formed by the user-specific preferences and a learnable generic preference for more overall highlight measurement. Lastly, the degree of highlight and non-highlight for each query frame is calculated as semantic similarity to its comprehensive and non-highlight preferences, respectively. In addition, to alleviate the ambiguity due to incomplete annotation, a new bi-directional contrastive loss is proposed to ensure a compact and differentiable metric space. In this way, our method significantly outperforms state-of-the-art methods with a relative improvement of 12% in mean accuracy precision.
    Towards disease-aware image editing of chest X-rays. (arXiv:2109.01071v2 [eess.IV] UPDATED)
    (0 min) Disease-aware image editing by means of generative adversarial networks (GANs) constitutes a promising avenue for advancing the use of AI in the healthcare sector. Here, we present a proof of concept of this idea. While GAN-based techniques have been successful in generating and manipulating natural images, their application to the medical domain is still in its infancy. Working with the CheXpert data set, we show that StyleGAN can be trained to generate realistic chest X-rays. Inspired by the Cyclic Reverse Generator (CRG) framework, we train an encoder that allows for faithfully inverting the generator on synthetic X-rays and provides organ-level reconstructions of real ones. Employing a guided manipulation of latent codes, we confer the medical condition of cardiomegaly (increased heart size) onto real X-rays from healthy patients. This work was presented in the Medical Imaging meets NeurIPS Workshop 2020, which was held as part of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020) in Vancouver, Canada.
    Personalized Image Semantic Segmentation. (arXiv:2107.13978v3 [cs.CV] UPDATED)
    (0 min) Semantic segmentation models trained on public datasets have achieved great success in recent years. However, these models do not consider the personalization issue of segmentation, even though it is important in practice. In this paper, we address the problem of personalized image segmentation. The objective is to generate more accurate segmentation results on unlabeled personalized images by investigating the data's personalized traits. To open up future research in this area, we collect a large dataset containing various users' personalized images called PIS (Personalized Image Semantic Segmentation). We also survey some recent research related to this problem and report their performance on our dataset. Furthermore, by observing the correlation among a user's personalized images, we propose a baseline method that incorporates the inter-image context when segmenting certain images. Extensive experiments show that our method outperforms the existing methods on the proposed dataset. The code and the PIS dataset will be made publicly available.
    Tackling the Background Bias in Sparse Object Detection via Cropped Windows. (arXiv:2106.02288v2 [cs.CV] UPDATED)
    (0 min) Object detection on Unmanned Aerial Vehicles (UAVs) is still a challenging task. The recordings are mostly sparse and contain only small objects. In this work, we propose a simple tiling method that improves the detection capability in the remote sensing case without modifying the model itself. By reducing the background bias and enabling the usage of higher image resolutions during training, our method can improve the performance of models substantially. The procedure was validated on three different data sets and outperformed similar approaches in performance and speed.
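    The tiling idea can be sketched as follows (a minimal illustration, not the paper's implementation): split each large aerial frame into overlapping crops, run the detector per crop at full resolution, and shift the detected boxes back into frame coordinates before cross-tile non-maximum suppression.

    ```python
    import numpy as np

    def tile_image(img, tile=1024, overlap=128):
        """Yield (crop, x_offset, y_offset) covering the whole image with overlapping tiles."""
        h, w = img.shape[:2]
        step = tile - overlap
        for y in range(0, max(h - overlap, 1), step):
            for x in range(0, max(w - overlap, 1), step):
                yield img[y:y + tile, x:x + tile], x, y

    def detect_tiled(img, detector, tile=1024, overlap=128):
        boxes = []
        for crop, x0, y0 in tile_image(img, tile, overlap):
            for (x1, y1, x2, y2, score) in detector(crop):      # detector is any callable
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score))
        return boxes   # in practice, followed by non-maximum suppression across tiles

    frame = np.zeros((2160, 3840, 3), dtype=np.uint8)            # one 4K aerial frame
    dummy_detector = lambda crop: []                              # placeholder detector
    print(len(detect_tiled(frame, dummy_detector)))
    ```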
    Free Lunch for Co-Saliency Detection: Context Adjustment. (arXiv:2108.02093v2 [cs.CV] UPDATED)
    (0 min) We unveil a long-standing problem in the prevailing co-saliency detection systems: there is indeed inconsistency between training and testing. Constructing a high-quality co-saliency detection dataset involves time-consuming and labor-intensive pixel-level labeling, which has forced most recent works to rely instead on semantic segmentation or saliency detection datasets for training. However, the lack of proper co-saliency and the absence of multiple foreground objects in these datasets can lead to spurious variations and inherent biases learned by models. To tackle this, we introduce the idea of counterfactual training through context adjustment, and propose a "cost-free" group-cut-paste (GCP) procedure to leverage images from off-the-shelf saliency detection datasets and synthesize new samples. Following GCP, we collect a novel dataset called Context Adjustment Training (CAT). CAT consists of 33,500 images, making it four times larger than the current co-saliency detection datasets. All images are automatically annotated with high-quality mask annotations, object categories, and edge maps. Extensive experiments with state-of-the-art models are conducted to demonstrate the superiority of our dataset. We hope that the scale, diversity, and quality of our dataset can benefit researchers in this area and beyond. The dataset and benchmark toolkit will be publicly accessible through our project page.
    Generative Models Improve Radiomics Performance in Different Tasks and Different Datasets: An Experimental Study. (arXiv:2109.02252v1 [q-bio.QM])
    (0 min) Radiomics is an active area of research focusing on high throughput feature extraction from medical images with a wide array of applications in clinical practice, such as clinical decision support in oncology. However, noise in low dose computed tomography (CT) scans can impair the accurate extraction of radiomic features. In this article, we investigate the possibility of using deep learning generative models to improve the performance of radiomics from low dose CTs. We used two datasets of low dose CT scans, NSCLC Radiogenomics and LIDC-IDRI, as test datasets for two tasks: pre-treatment survival prediction and lung cancer diagnosis. We used encoder-decoder networks and conditional generative adversarial networks (CGANs) trained in a previous study as generative models to transform low dose CT images into full dose CT images. Radiomic features extracted from the original and improved CT scans were used to build two classifiers, a support vector machine (SVM) and a deep attention based multiple instance learning model, for survival prediction and lung cancer diagnosis respectively. Finally, we compared the performance of the models derived from the original and improved CT scans. Encoder-decoder networks and CGANs improved the area under the curve (AUC) of survival prediction from 0.52 to 0.57 (p-value<0.01). The encoder-decoder network and the CGAN improved the AUC of lung cancer diagnosis from 0.84 to 0.88 and 0.89 respectively (p-value<0.01). Moreover, there was no statistically significant difference in AUC improvement between the encoder-decoder network and the CGAN (p-value=0.34) when the networks were trained at 75 and 100 epochs. Generative models can improve the performance of low dose CT-based radiomics in different tasks. Hence, denoising using generative models seems to be a necessary pre-processing step for calculating radiomic features from low dose CTs.
    Deep learning facilitates fully automated brain image registration of optoacoustic tomography and magnetic resonance imaging. (arXiv:2109.01880v1 [eess.IV])
    (0 min) Multi-spectral optoacoustic tomography (MSOT) is an emerging optical imaging method providing multiplex molecular and functional information from the rodent brain. It can be greatly augmented by magnetic resonance imaging (MRI) that offers excellent soft-tissue contrast and high-resolution brain anatomy. Nevertheless, registration of multi-modal images remains challenging, chiefly due to the entirely different image contrast rendered by these modalities. Previously reported registration algorithms mostly relied on manual user-dependent brain segmentation, which compromised data interpretation and accurate quantification. Here we propose a fully automated registration method for MSOT-MRI multimodal imaging empowered by deep learning. The automated workflow includes neural network-based image segmentation to generate suitable masks, which are subsequently registered using an additional neural network. Performance of the algorithm is showcased with datasets acquired by cross-sectional MSOT and high-field MRI preclinical scanners. The automated registration method is further validated with manual and half-automated registration, demonstrating its robustness and accuracy.
    Fine-tuning deep learning model parameters for improved super-resolution of dynamic MRI with prior-knowledge. (arXiv:2102.02711v3 [eess.IV] UPDATED)
    (0 min) Dynamic imaging is a beneficial tool for interventions to assess physiological changes. Nonetheless, during dynamic MRI, the spatial resolution is compromised in order to achieve a high temporal resolution. To overcome this spatio-temporal trade-off, this research presents a super-resolution (SR) MRI reconstruction with prior knowledge based fine-tuning to maximise spatial information while reducing the required scan-time for dynamic MRIs. A U-Net based network with perceptual loss is trained on a benchmark dataset and fine-tuned using one subject-specific static high resolution MRI as prior knowledge to obtain high resolution dynamic images during the inference stage. 3D dynamic data for three subjects were acquired with different parameters to test the generalisation capabilities of the network. The method was tested for different levels of in-plane undersampling for dynamic MRI. The reconstructed dynamic SR results after fine-tuning showed higher similarity with the high resolution ground-truth, while quantitatively achieving statistically significant improvement. The average SSIM at the lowest resolution evaluated in this research (6.25% of the k-space) before and after fine-tuning were 0.939 ± 0.008 and 0.957 ± 0.006 respectively. This could theoretically result in an acceleration factor of 16, which can potentially be acquired in less than half a second. The proposed approach shows that the super-resolution MRI reconstruction with prior-information can alleviate the spatio-temporal trade-off in dynamic MRI, even for high acceleration factors.
    Weakly Supervised Few-Shot Segmentation Via Meta-Learning. (arXiv:2109.01693v1 [cs.CV])
    (0 min) Semantic segmentation is a classic computer vision task with multiple applications, which includes medical and remote sensing image analysis. Despite recent advances with deep-based approaches, labeling samples (pixels) for training models is laborious and, in some cases, unfeasible. In this paper, we present two novel meta learning methods, named WeaSeL and ProtoSeg, for the few-shot semantic segmentation task with sparse annotations. We conducted extensive evaluation of the proposed methods in different applications (12 datasets) in medical imaging and agricultural remote sensing, which are very distinct fields of knowledge and usually subject to data scarcity. The results demonstrated the potential of our method, achieving suitable results for segmenting both coffee/orange crops and anatomical parts of the human body in comparison with full dense annotation.
    Ranking Models in Unlabeled New Environments. (arXiv:2108.10310v2 [cs.CV] UPDATED)
    (0 min) Consider a scenario where we are supplied with a number of ready-to-use models trained on a certain source domain and hope to directly apply the most appropriate ones to different target domains based on the models' relative performance. Ideally we should annotate a validation set for model performance assessment on each new target environment, but such annotations are often very expensive. Under this circumstance, we introduce the problem of ranking models in unlabeled new environments. For this problem, we propose to adopt a proxy dataset that 1) is fully labeled and 2) well reflects the true model rankings in a given target environment, and use the performance rankings on the proxy sets as surrogates. We first select labeled datasets as the proxy. Specifically, datasets that are more similar to the unlabeled target domain are found to better preserve the relative performance rankings. Motivated by this, we further propose to search the proxy set by sampling images from various datasets that have similar distributions as the target. We analyze the problem and its solutions on the person re-identification (re-ID) task, for which sufficient datasets are publicly available, and show that a carefully constructed proxy set effectively captures relative performance ranking in new environments. Code is available at https://github.com/sxzrt/Proxy-Set.
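    A minimal sketch of the proxy-selection intuition (illustrative only; it assumes a fixed feature extractor and uses a simple diagonal Fréchet-style distance rather than the paper's procedure): score each candidate labeled dataset by how close its feature statistics are to those of the unlabeled target and pick the closest one as the proxy.

    ```python
    import numpy as np

    def frechet_like_distance(feats_a, feats_b):
        """Distance between two feature clouds using means and diagonal covariances."""
        mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
        var_a, var_b = feats_a.var(0), feats_b.var(0)
        return np.sum((mu_a - mu_b) ** 2) + np.sum(var_a + var_b - 2 * np.sqrt(var_a * var_b))

    rng = np.random.default_rng(0)
    target_feats = rng.normal(0.5, 1.0, size=(500, 128))          # unlabeled target domain
    candidates = {"dataset_A": rng.normal(0.4, 1.0, size=(500, 128)),
                  "dataset_B": rng.normal(2.0, 1.5, size=(500, 128))}

    scores = {name: frechet_like_distance(f, target_feats) for name, f in candidates.items()}
    proxy = min(scores, key=scores.get)                            # dataset_A: most similar
    print(scores, "-> proxy:", proxy)
    ```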
    ISyNet: Convolutional Neural Networks design for AI accelerator. (arXiv:2109.01932v1 [cs.CV])
    (0 min) In recent years, Deep Learning has achieved significant results in many practical problems, such as computer vision, natural language processing, speech recognition and many others. For many years, the main goal of research was to improve the quality of models, even if the complexity was impractically high. However, for production solutions, which often require real-time operation, the latency of the model plays a very important role. Current state-of-the-art architectures are found with neural architecture search (NAS) taking model complexity into account. However, designing a search space suitable for specific hardware is still a challenging task. To address this problem we propose a measure of the hardware efficiency of a neural architecture search space, the matrix efficiency measure (MEM); a search space comprising hardware-efficient operations; a latency-aware scaling method; and ISyNet, a set of architectures designed to be fast on the specialized neural processing unit (NPU) hardware and accurate at the same time. We show the advantage of the designed architectures for the NPU devices on ImageNet and the generalization ability for the downstream classification and detection tasks.
    Sensor Data Augmentation with Resampling for Contrastive Learning in Human Activity Recognition. (arXiv:2109.02054v1 [cs.HC])
    (0 min) Human activity recognition plays an increasingly important role not only in our daily lives, but also in the medical and rehabilitation fields. The development of deep learning has also contributed to the advancement of human activity recognition, but the large amount of data annotation work required to train deep learning models is a major obstacle to its development. Contrastive learning has started to be used in the field of sensor-based human activity recognition due to its ability to avoid the cost of labeling large datasets and its ability to better distinguish between sample representations of different instances. Data augmentation, an important part of contrastive learning, has a significant impact on model effectiveness, but current data augmentation methods do not perform well in contrastive learning frameworks for wearable sensor-based activity recognition. To optimize the effect of contrastive learning models, in this paper, we investigate the sampling frequency of sensors and propose a resampling data augmentation method. In addition, we propose a contrastive learning framework for human activity recognition and apply the resampling augmentation method to its data augmentation phase. The experimental results show that the resampling augmentation method outperforms supervised learning by 9.88% on UCI HAR and 7.69% on Motion Sensor in the fine-tuning evaluation of contrastive learning with a small amount of labeled data, and also reveal that not all data augmentation methods have positive effects in the contrastive learning framework. Finally, we explored the influence of combining different augmentation methods on contrastive learning, and the results showed that most combinations of augmentation methods outperformed single augmentations.
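    A minimal sketch of resampling-based augmentation for sensor windows (not the authors' exact pipeline; the window length, channel count, and crop/pad policy are assumptions): re-sample each window to a different length to simulate a change in sampling frequency, then restore the original length so two such views can form a positive pair.

    ```python
    import numpy as np
    from scipy.signal import resample

    def resample_augment(window, factor):
        """window: (time, channels) array; factor: e.g. 0.8 or 1.25."""
        t = window.shape[0]
        resampled = resample(window, int(round(t * factor)), axis=0)
        if resampled.shape[0] >= t:                    # crop back to the original length
            return resampled[:t]
        pad = np.zeros((t - resampled.shape[0], window.shape[1]))
        return np.concatenate([resampled, pad], axis=0)

    accel = np.random.randn(128, 3)                    # 128 samples of 3-axis accelerometer data
    view1 = resample_augment(accel, 0.8)               # two augmented "views" of one window,
    view2 = resample_augment(accel, 1.25)              # usable as a positive pair in contrastive learning
    print(view1.shape, view2.shape)                    # (128, 3) (128, 3)
    ```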
    CORSAIR: Convolutional Object Retrieval and Symmetry-AIded Registration. (arXiv:2103.06911v3 [cs.CV] UPDATED)
    (0 min) This paper considers online object-level mapping using partial point-cloud observations obtained online in an unknown environment. We develop an approach for fully Convolutional Object Retrieval and Symmetry-AIded Registration (CORSAIR). Our model extends the Fully Convolutional Geometric Features model to learn a global object-shape embedding in addition to local point-wise features from the point-cloud observations. The global feature is used to retrieve a similar object from a category database, and the local features are used for robust pose registration between the observed and the retrieved object. Our formulation also leverages symmetries present in the object shapes to obtain promising local-feature pairs from different symmetry classes for matching. We present results from synthetic and real-world datasets with different object categories to verify the robustness of our method.
    Robust Attentive Deep Neural Network for Exposing GAN-generated Faces. (arXiv:2109.02167v1 [cs.CV])
    (0 min) GAN-based techniques that generate and synthesize realistic faces have caused severe social concerns and security problems. Existing methods for detecting GAN-generated faces can perform well on limited public datasets. However, images from existing public datasets do not represent real-world scenarios well enough in terms of view variations and data distributions (where real faces largely outnumber synthetic faces). The state-of-the-art methods do not generalize well in real-world problems and lack the interpretability of detection results. Performance of existing GAN-face detection models degrades significantly when facing imbalanced data distributions. To address these shortcomings, we propose a robust, attentive, end-to-end network that can spot GAN-generated faces by analyzing their eye inconsistencies. Specifically, our model learns to identify inconsistent eye components by localizing and comparing the iris artifacts between the two eyes automatically. Our deep network addresses the imbalance learning issues by considering the AUC loss and the traditional cross-entropy loss jointly. Comprehensive evaluations of the FFHQ dataset in terms of both balanced and imbalanced scenarios demonstrate the superiority of the proposed method.
    SHAD3S: A model to Sketch, Shade and Shadow. (arXiv:2011.06822v3 [cs.GR] UPDATED)
    (0 min) Hatching is a common method used by artists to accentuate the third dimension of a sketch, and to illuminate the scene. Our system SHAD3S attempts to compete with a human at hatching generic three-dimensional (3D) shapes, and also tries to assist her in a form exploration exercise. The novelty of our approach lies in the fact that we make no assumptions about the input other than that it represents a 3D shape, and yet, given contextual information on illumination and texture, we synthesise an accurate hatch pattern over the sketch, without access to 3D or pseudo 3D. In the process, we contribute towards a) a cheap yet effective method to synthesise a sufficiently large high fidelity dataset, pertinent to the task; b) creating a pipeline with a conditional generative adversarial network (CGAN); and c) creating an interactive utility with GIMP that serves as a tool for artists to engage with automated hatching or a form-exploration exercise. User evaluation of the tool suggests that the model performance does generalise satisfactorily over diverse input, both in terms of style as well as shape. A simple comparison of inception scores suggests that the generated distribution is as diverse as the ground truth.
    Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos. (arXiv:2104.12671v3 [cs.CV] UPDATED)
    (0 min) Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval, and temporal action localization, showing state-of-the-art results on four different datasets.
    A Privacy-Preserving Image Retrieval Scheme Using A Codebook Generated From Independent Plain-Image Dataset. (arXiv:2109.01841v1 [eess.IV])
    (0 min) In this paper, we propose a privacy-preserving image-retrieval scheme using a codebook generated by using a plain-image dataset. Encryption-then-compression (EtC) images, which were proposed for EtC systems, have been used in conventional privacy-preserving image-retrieval schemes, in which a codebook is generated from EtC images uploaded by image owners, and extended SIMPLE descriptors are then calculated as image descriptors by using the codebook. In contrast, in the proposed scheme, a codebook is generated from a dataset independent of uploaded images. The use of an independent dataset enables us not only to use a codebook that does not require recalculation but also to constantly provide a high retrieval accuracy. In an experiment, the proposed scheme is demonstrated to maintain a high retrieval performance, even if codebooks are generated from a plain image dataset independent of image owners' encrypted images.
    Image recognition via Vietoris-Rips complex. (arXiv:2109.02231v1 [cs.CV])
    (0 min) Extracting informative features from images has been of capital importance in computer vision. In this paper, we propose a method based on algebraic topology to extract such features from images. To that end, we construct a weighted graph from an image, which captures local information of the image. By considering this weighted graph as a pseudo-metric space, we construct a Vietoris-Rips complex with a parameter $\varepsilon$ by a well-known process of algebraic topology. From this Vietoris-Rips complex, we can extract information about the complexity of the image and detect a sub-image with a relatively high concentration of information. The parameter $\varepsilon$ of the Vietoris-Rips complex produces robustness to noise. We empirically show that the extracted feature captures image characteristics well.
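    The construction can be sketched concretely (an illustrative toy example, not the paper's exact pseudo-metric): treat pixels of a small patch as points whose distance combines spatial and intensity differences, keep all pairs closer than epsilon as edges of the Vietoris-Rips 1-skeleton, and read off simple statistics such as edge count and number of connected components.

    ```python
    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    patch = np.random.rand(8, 8)                                   # toy grayscale patch
    ys, xs = np.mgrid[0:8, 0:8]
    # each pixel becomes a point: (x, y, scaled intensity); the scale factor is an assumption
    points = np.stack([xs.ravel(), ys.ravel(), 10 * patch.ravel()], axis=1)

    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                            # pairwise pseudo-metric

    eps = 1.5
    adjacency = csr_matrix((dist < eps) & (dist > 0))               # Rips edges at scale eps
    n_components, _ = connected_components(adjacency, directed=False)
    print("edges:", adjacency.nnz // 2, "connected components:", n_components)
    ```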
    Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images. (arXiv:2109.02288v1 [cs.CV])
    (0 min) Recovering the 3D structure of an object from a single image is a challenging task due to its ill-posed nature. One approach is to utilize the plentiful photos of the same object category to learn a strong 3D shape prior for the object. This approach was successfully demonstrated by the recent work of Wu et al. (2020), which obtained impressive 3D reconstruction networks with unsupervised learning. However, their algorithm is only applicable to symmetric objects. In this paper, we eliminate the symmetry requirement with a novel unsupervised algorithm that can learn a 3D reconstruction network from a multi-image dataset. Our algorithm is more general and covers the symmetry-required scenario as a special case. In addition, we employ a novel albedo loss that improves the reconstructed details and realism. Our method surpasses the previous work in both quality and robustness, as shown in experiments on datasets of various structures, including single-view, multi-view, image-collection, and video sets.
    Sparse Spatial Attention Network for Semantic Segmentation. (arXiv:2109.01915v1 [cs.CV])
    (0 min) The spatial attention mechanism captures long-range dependencies by aggregating global contextual information to each query location, which is beneficial for semantic segmentation. In this paper, we present a sparse spatial attention network (SSANet) to improve the efficiency of the spatial attention mechanism without sacrificing the performance. Specifically, a sparse non-local (SNL) block is proposed to sample a subset of key and value elements for each query element to capture long-range relations adaptively and generate a sparse affinity matrix to aggregate contextual information efficiently. Experimental results show that the proposed approach outperforms other context aggregation methods and achieves state-of-the-art performance on the Cityscapes, PASCAL Context and ADE20K datasets.
    Parsing Table Structures in the Wild. (arXiv:2109.02199v1 [cs.CV])
    (0 min) This paper tackles the problem of table structure parsing (TSP) from images in the wild. In contrast to existing studies that mainly focus on parsing well-aligned tabular images with simple layouts from scanned PDF documents, we aim to establish a practical table structure parsing system for real-world scenarios where tabular input images are taken or scanned with severe deformation, bending or occlusions. For designing such a system, we propose an approach named Cycle-CenterNet on top of CenterNet with a novel cycle-pairing module to simultaneously detect and group tabular cells into structured tables. In the cycle-pairing module, a new pairing loss function is proposed for the network training. Alongside our Cycle-CenterNet, we also present a large-scale dataset, named Wired Table in the Wild (WTW), which includes well-annotated structure parsing of multiple style tables in several scenes such as photos, scanned files, and web pages. In experiments, we demonstrate that our Cycle-CenterNet consistently achieves the best accuracy of table structure parsing on the new WTW dataset, with a 24.6% absolute improvement evaluated by the TEDS metric. A more comprehensive experimental analysis also validates the advantages of our proposed methods for the TSP task.
    Visual Recognition with Deep Learning from Biased Image Datasets. (arXiv:2109.02357v1 [cs.CV])
    (0 min) In practice, and especially when training deep neural networks, visual recognition rules are often learned based on various sources of information. On the other hand, the recent deployment of facial recognition systems with uneven predictive performances on different population segments highlights the representativeness issues possibly induced by a naive aggregation of image datasets. Indeed, sampling bias does not vanish simply by considering larger datasets, and ignoring its impact may completely jeopardize the generalization capacity of the learned prediction rules. In this paper, we show how biasing models, originally introduced for nonparametric estimation in (Gill et al., 1988), and recently revisited from the perspective of statistical learning theory in (Laforgue and Clémençon, 2019), can be applied to remedy these problems in the context of visual recognition. Based on the (approximate) knowledge of the biasing mechanisms at work, our approach consists in reweighting the observations, so as to form a nearly debiased estimator of the target distribution. One key condition for our method to be theoretically valid is that the supports of the distributions generating the biased datasets at our disposal must overlap, and cover the support of the target distribution. In order to meet this requirement in practice, we propose to use a low dimensional image representation, shared across the image databases. Finally, we provide numerical experiments highlighting the relevance of our approach whenever the biasing functions are appropriately chosen.
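    The reweighting idea can be sketched with a toy example (illustrative only, not the paper's estimator): given rough knowledge of the per-stratum sampling rates, weight each observation by the target proportion divided by the observed proportion so that weighted averages approximate statistics under the target distribution.

    ```python
    import numpy as np

    strata = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])      # observed: 70% stratum 0, 30% stratum 1
    target_props = {0: 0.5, 1: 0.5}                          # assumed target proportions

    observed_props = {s: np.mean(strata == s) for s in target_props}
    weights = np.array([target_props[s] / observed_props[s] for s in strata])
    weights /= weights.sum()                                 # normalized importance weights

    # A "nearly debiased" estimate of any statistic is then its weighted average,
    # e.g. a debiased error estimate: np.sum(weights * per_sample_loss).
    print(observed_props, weights.round(3))
    ```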
    Inpainting Transformer for Anomaly Detection. (arXiv:2104.13897v2 [cs.CV] UPDATED)
    (0 min) Anomaly detection in computer vision is the task of identifying images which deviate from a set of normal images. A common approach is to train deep convolutional autoencoders to inpaint covered parts of an image and compare the output with the original image. By training on anomaly-free samples only, the model is assumed to be unable to reconstruct anomalous regions properly. For anomaly detection by inpainting, we suggest that it is beneficial to incorporate information from potentially distant regions. In particular, we pose anomaly detection as a patch-inpainting problem and propose to solve it with a purely self-attention based approach discarding convolutions. The proposed Inpainting Transformer (InTra) is trained to inpaint covered patches in a large sequence of image patches, thereby integrating information across large regions of the input image. When training from scratch, in comparison to other methods not using extra training data, InTra achieves results on par with the current state-of-the-art on the MVTec AD dataset for detection and surpasses them on segmentation.
    Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks. (arXiv:2109.02096v1 [cs.SD])
    (0 min) This research project investigates the application of deep learning to timbre transfer, where the timbre of a source audio can be converted to the timbre of a target audio with minimal loss in quality. The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio. It is applied to the Flickr 8k Audio dataset for transferring the vocal timbre between speakers and the URMP dataset for transferring the musical timbre between instruments. Furthermore, variations of the adopted approach are trained, and generalised performance is compared using the metrics SSIM (Structural Similarity Index) and FAD (Fréchet Audio Distance). It was found that a many-to-many approach outperforms a one-to-one approach in terms of reconstructive capabilities, and that the adoption of a basic over a bottleneck residual block design is more suitable for enriching content information about a latent space. It was also found that the decision on whether the cyclic loss takes a variational autoencoder or vanilla autoencoder approach does not have a significant impact on the reconstructive and adversarial translation aspects of the model.
    Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution. (arXiv:2109.02079v1 [cs.CV])
    (0 min) Hyperspectral imaging has become increasingly crucial due to its abundant spectral information. However, it has poor spatial resolution due to the limitations of the current imaging mechanism. Many convolutional neural networks have been proposed for the hyperspectral image super-resolution problem. However, convolutional neural network (CNN) based methods only consider local information rather than global information, owing to the limited receptive field of the convolution kernels. In this paper, we design a network based on the transformer for fusing the low-resolution hyperspectral images and high-resolution multispectral images to obtain the high-resolution hyperspectral images. Thanks to the representational ability of the transformer, our approach is able to explore the intrinsic relationships of features globally. Furthermore, considering that the LR-HSIs hold the main spectral structure, the network focuses on spatial detail estimation, relieving it of the burden of reconstructing the whole data. This reduces the mapping space of the proposed network, which enhances the final performance. Various experiments and quality indexes show our approach's superiority compared with other state-of-the-art methods.
    Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation. (arXiv:2109.02303v1 [cs.CV])
    (0 min) 3D human shape and pose estimation is the essential task for human motion analysis, which is widely used in many 3D applications. However, existing methods cannot simultaneously capture the relations at multiple levels, including spatial-temporal level and human joint level. Therefore they fail to make accurate predictions in some hard scenarios when there is cluttered background, occlusion, or extreme pose. To this end, we propose Multi-level Attention Encoder-Decoder Network (MAED), including a Spatial-Temporal Encoder (STE) and a Kinematic Topology Decoder (KTD) to model multi-level attentions in a unified framework. STE consists of a series of cascaded blocks based on Multi-Head Self-Attention, and each block uses two parallel branches to learn spatial and temporal attention respectively. Meanwhile, KTD aims at modeling the joint level attention. It regards pose estimation as a top-down hierarchical process similar to SMPL kinematic tree. With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE on the three widely used benchmarks 3DPW, MPI-INF-3DHP, and Human3.6M respectively. Our code is available at https://github.com/ziniuwan/maed.
    Training Multi-Object Detector by Estimating Bounding Box Distribution for Input Image. (arXiv:1911.12721v4 [cs.CV] UPDATED)
    (0 min) In multi-object detection using neural networks, the fundamental problem is, "How should the network learn a variable number of bounding boxes in different input images?". Previous methods train a multi-object detection network through a procedure that directly assigns the ground truth bounding boxes to the specific locations of the network's output. However, this procedure makes the training of a multi-object detection network too heuristic and complicated. In this paper, we reformulate the multi-object detection task as a problem of density estimation of bounding boxes. Instead of assigning each ground truth to specific locations of network's output, we train a network by estimating the probability density of bounding boxes in an input image using a mixture model. For this purpose, we propose a novel network for object detection called Mixture Density Object Detector (MDOD), and the corresponding objective function for the density-estimation-based training. We applied MDOD to MS COCO dataset. Our proposed method not only deals with multi-object detection problems in a new approach, but also improves detection performances through MDOD. The code is available: https://github.com/yoojy31/MDOD.
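    A minimal sketch of the density-estimation view (illustrative, not MDOD itself; the component count and Gaussian form are assumptions): the network predicts a mixture over 4D box coordinates, and training minimizes the negative log-likelihood of the ground-truth boxes under that mixture.

    ```python
    import math
    import torch

    def mixture_nll(gt_boxes, means, log_scales, logit_pi):
        """gt_boxes: (N, 4); means/log_scales: (K, 4); logit_pi: (K,)."""
        log_pi = torch.log_softmax(logit_pi, dim=0)                        # (K,)
        diff = gt_boxes[:, None, :] - means[None, :, :]                    # (N, K, 4)
        log_prob = -0.5 * (diff / log_scales.exp()) ** 2 \
                   - log_scales - 0.5 * math.log(2 * math.pi)              # per-dimension Gaussian log-density
        comp_log_prob = log_prob.sum(-1) + log_pi                          # (N, K)
        return -torch.logsumexp(comp_log_prob, dim=1).mean()               # average negative log-likelihood

    gt = torch.tensor([[0.2, 0.3, 0.4, 0.5], [0.6, 0.6, 0.8, 0.9]])        # two boxes (cx, cy, w, h)
    means = torch.randn(5, 4, requires_grad=True)                           # K = 5 mixture components
    log_scales = torch.zeros(5, 4, requires_grad=True)
    logit_pi = torch.zeros(5, requires_grad=True)
    loss = mixture_nll(gt, means, log_scales, logit_pi)
    loss.backward()                                                          # gradients flow to the mixture parameters
    print(loss.item())
    ```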
    Robust fine-tuning of zero-shot models. (arXiv:2109.01903v1 [cs.CV])
    (0 min) Large pre-trained models such as CLIP offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning approaches substantially improve accuracy in-distribution, they also reduce out-of-distribution robustness. We address this tension by introducing a simple and effective method for improving robustness: ensembling the weights of the zero-shot and fine-tuned models. Compared to standard fine-tuning, the resulting weight-space ensembles provide large accuracy improvements out-of-distribution, while matching or improving in-distribution accuracy. On ImageNet and five derived distribution shifts, weight-space ensembles improve out-of-distribution accuracy by 2 to 10 percentage points while increasing in-distribution accuracy by nearly 1 percentage point relative to standard fine-tuning. These improvements come at no additional computational cost during fine-tuning or inference.
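    The weight-space ensembling itself is a one-line operation; the sketch below uses toy stand-in models (hypothetical names) to show the interpolation of the zero-shot and fine-tuned parameters with a mixing coefficient alpha.

    ```python
    import torch
    import torch.nn as nn

    def weight_space_ensemble(zero_shot_state, fine_tuned_state, alpha=0.5):
        """Interpolate two state_dicts of the same architecture in weight space."""
        return {k: (1 - alpha) * zero_shot_state[k] + alpha * fine_tuned_state[k]
                for k in zero_shot_state}

    # Toy stand-ins for the zero-shot and fine-tuned models (the real case would be,
    # e.g., a CLIP image encoder plus classification head with identical architecture).
    zero_shot, fine_tuned, merged = nn.Linear(4, 2), nn.Linear(4, 2), nn.Linear(4, 2)
    merged.load_state_dict(weight_space_ensemble(zero_shot.state_dict(),
                                                 fine_tuned.state_dict(), alpha=0.5))
    # alpha = 0 recovers the zero-shot model and alpha = 1 the fine-tuned one; intermediate
    # values trade in-distribution accuracy against out-of-distribution robustness.
    ```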
    A Comprehensive Approach for UAV Small Object Detection with Simulation-based Transfer Learning and Adaptive Fusion. (arXiv:2109.01800v1 [cs.CV])
    (0 min) Precise detection of Unmanned Aerial Vehicles (UAVs) plays a critical role in UAV defense systems. Deep learning is widely adopted for UAV object detection, but research on this topic is limited by the amount of available data and the small scale of UAVs. To tackle these problems, a novel comprehensive approach that combines transfer learning based on simulation data and adaptive fusion is proposed. Firstly, the open-source plugin AirSim proposed by Microsoft is used to generate a large amount of realistic simulation data. Secondly, transfer learning is applied to obtain a YOLOv5 model pre-trained on the simulated dataset and fine-tuned on the real-world dataset. Finally, an adaptive fusion mechanism is proposed to further improve small object detection performance. Experimental results demonstrate the effectiveness of simulation-based transfer learning, which leads to a 2.7% performance increase on UAV object detection. Furthermore, with transfer learning and the adaptive fusion mechanism, a 7.1% improvement is achieved compared to the original YOLO v5 model.
    Deep Saliency Prior for Reducing Visual Distraction. (arXiv:2109.01980v1 [cs.CV])
    (0 min) Using only a model that was trained to predict where people look at images, and no additional training data, we can produce a range of powerful editing effects for reducing distraction in images. Given an image and a mask specifying the region to edit, we backpropagate through a state-of-the-art saliency model to parameterize a differentiable editing operator, such that the saliency within the masked region is reduced. We demonstrate several operators, including: a recoloring operator, which learns to apply a color transform that camouflages and blends distractors into their surroundings; a warping operator, which warps less salient image regions to cover distractors, gradually collapsing objects into themselves and effectively removing them (an effect akin to inpainting); a GAN operator, which uses a semantic prior to fully replace image regions with plausible, less salient alternatives. The resulting effects are consistent with cognitive research on the human visual system (e.g., since color mismatch is salient, the recoloring operator learns to harmonize objects' colors with their surrounding to reduce their saliency), and, importantly, are all achieved solely through the guidance of the pretrained saliency model, with no additional supervision. We present results on a variety of natural images and conduct a perceptual study to evaluate and validate the changes in viewers' eye-gaze between the original images and our edited results.
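    A minimal sketch of the optimization loop, assuming a differentiable pretrained saliency model saliency_model (a hypothetical placeholder returning a per-pixel saliency map) and a simple per-channel recoloring operator; this illustrates backpropagating through a saliency model, not the authors' editing operators:

        import torch

        def reduce_saliency(image, mask, saliency_model, steps=200, lr=0.05):
            """image: (1, 3, H, W) in [0, 1]; mask: (1, 1, H, W) marking the distractor region."""
            gain = torch.ones(1, 3, 1, 1, requires_grad=True)    # per-channel recoloring params
            bias = torch.zeros(1, 3, 1, 1, requires_grad=True)
            opt = torch.optim.Adam([gain, bias], lr=lr)
            for _ in range(steps):
                recolored = (image * gain + bias).clamp(0, 1)
                edited = image * (1 - mask) + recolored * mask    # edit only inside the mask
                loss = (saliency_model(edited) * mask).mean()     # saliency within the masked region
                opt.zero_grad()
                loss.backward()
                opt.step()
            return (image * (1 - mask) + (image * gain + bias).clamp(0, 1) * mask).detach()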
    Robust Mitosis Detection Using a Cascade Mask-RCNN Approach With Domain-Specific Residual Cycle-GAN Data Augmentation. (arXiv:2109.01878v1 [cs.CV])
    (0 min) For the MIDOG mitosis detection challenge, we created a cascade algorithm consisting of a Mask-RCNN detector, followed by a classification ensemble consisting of ResNet50 and DenseNet201 to refine detected mitotic candidates. The MIDOG training data consists of 200 frames originating from four scanners, three of which are annotated for mitotic instances with centroid annotations. Our main algorithmic choices are as follows: first, to enhance the generalizability of our detector and classification networks, we use a state-of-the-art residual Cycle-GAN to transform each scanner domain to every other scanner domain. During training, we then randomly load, for each image, one of the four domains. In this way, our networks can learn from the fourth non-annotated scanner domain even if we don't have annotations for it. Second, for training the detector network, rather than using centroid-based fixed-size bounding boxes, we create mitosis-specific bounding boxes. We do this by manually annotating a small selection of mitoses, training a Mask-RCNN on this small dataset, and applying it to the rest of the data to obtain full annotations. We trained the follow-up classification ensemble using only the challenge-provided positive and hard-negative examples. On the preliminary test set, the algorithm scores an F1 score of 0.7578, putting us as the second-place team on the leaderboard.
    RAMA: A Rapid Multicut Algorithm on GPU. (arXiv:2109.01838v1 [cs.DC])
    (0 min) We propose a highly parallel primal-dual algorithm for the multicut (a.k.a. correlation clustering) problem, a classical graph clustering problem widely used in machine learning and computer vision. Our algorithm consists of three steps executed recursively: (1) finding conflicted cycles that correspond to violated inequalities of the underlying multicut relaxation, (2) performing message passing between the edges and cycles to optimize the Lagrange relaxation arising from the found violated cycles, producing reduced costs, and (3) contracting edges with high reduced costs through matrix-matrix multiplications. Our algorithm produces primal solutions and dual lower bounds that estimate the distance to optimum. We implement our algorithm on GPUs and show resulting improvements in execution speed of one to two orders of magnitude, without sacrificing solution quality, compared to traditional serial algorithms that run on CPUs. We can solve very large-scale benchmark problems with up to $\mathcal{O}(10^8)$ variables in a few seconds with small primal-dual gaps. We make our code available at https://github.com/pawelswoboda/RAMA.
    Navigating the Mise-en-Page: Interpretive Machine Learning Approaches to the Visual Layouts of Multi-Ethnic Periodicals. (arXiv:2109.01732v1 [cs.CV])
    (0 min) This paper presents a computational method of analysis that draws from machine learning, library science, and literary studies to map the visual layouts of multi-ethnic newspapers from the late 19th and early 20th century United States. This work departs from prior approaches to newspapers that focus on individual pieces of textual and visual content. Our method combines Chronicling America's MARC data and the Newspaper Navigator machine learning dataset to identify the visual patterns of newspaper page layouts. By analyzing high-dimensional visual similarity, we aim to better understand how editors spoke and protested through the layout of their papers.
    Exploring Separable Attention for Multi-Contrast MR Image Super-Resolution. (arXiv:2109.01664v1 [eess.IV])
    (0 min) Super-resolving the Magnetic Resonance (MR) image of a target contrast under the guidance of the corresponding auxiliary contrast, which provides additional anatomical information, is a new and effective solution for fast MR imaging. However, current multi-contrast super-resolution (SR) methods tend to concatenate different contrasts directly, ignoring their relationships in different cues, e.g., in the foreground and background. In this paper, we propose a separable attention network (comprising a foreground priority attention and a background separation attention), named SANet. Our method can explore the foreground and background areas in the forward and reverse directions with the help of the auxiliary contrast, enabling it to learn clearer anatomical structures and edge information for the SR of a target-contrast MR image. SANet provides three appealing benefits: (1) It is the first model to explore a separable attention mechanism that uses the auxiliary contrast to predict the foreground and background regions, diverting more attention to refining any uncertain details between these regions and correcting the fine areas in the reconstructed results. (2) A multi-stage integration module is proposed to learn the response of multi-contrast fusion at different stages, obtain the dependency between the fused features, and improve their representation ability. (3) Extensive experiments with various state-of-the-art multi-contrast SR methods on fastMRI and clinical in vivo datasets demonstrate the superiority of our model.
    Revisiting 3D ResNets for Video Recognition. (arXiv:2109.01696v1 [cs.CV])
    (0 min) A recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition. This short note studies effective training and scaling strategies for video recognition models. We propose a simple scaling strategy for 3D ResNets, in combination with improved training strategies and minor architectural changes. The resulting models, termed 3D ResNet-RS, attain competitive performance of 81.0 on Kinetics-400 and 83.8 on Kinetics-600 without pre-training. When pre-trained on a large Web Video Text dataset, our best model achieves 83.5 and 84.3 on Kinetics-400 and Kinetics-600. The proposed scaling rule is further evaluated in a self-supervised setup using contrastive learning, demonstrating improved performance. Code is available at: https://github.com/tensorflow/models/tree/master/official.
    Robotic Waste Sorter with Agile Manipulation and Quickly Trainable Detector. (arXiv:2104.01260v2 [cs.RO] UPDATED)
    (0 min) Owing to human labor shortages, the automation of labor-intensive manual waste sorting is needed. The goal of automating waste sorting is to replace the human role of robust detection and agile manipulation of waste items with robots. To achieve this, we propose three methods. First, we provide a combined manipulation method using graspless push-and-drop and pick-and-release manipulation. Second, we provide a robotic system that can automatically collect object images to quickly train a deep neural-network model. Third, we provide a method to mitigate the differences in the appearance of target objects between two scenes: one for dataset collection and the other for waste sorting in a recycling factory. If such differences exist, the performance of a trained waste detector may decrease. We address differences in illumination and background by applying object scaling, histogram matching with histogram equalization, and background synthesis to the source target-object images. Via experiments in an indoor experimental workplace for waste sorting, we confirm that the proposed methods enable quick collection of the training image sets for three classes of waste items (i.e., aluminum can, glass bottle, and plastic bottle) and detection with higher performance than methods that do not consider the differences. We also confirm that the proposed method enables the robot to quickly manipulate the objects.
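    A minimal sketch of the illumination-adaptation step via histogram matching, assuming scikit-image >= 0.19 for the channel_axis argument; the scaling and background-synthesis steps, and the authors' exact pipeline, are not reproduced here:

        import numpy as np
        from skimage import exposure

        def adapt_illumination(source_img, factory_reference_img):
            """Match the colour histogram of a collected object image to a reference image
            from the target (factory) scene. Both inputs: uint8 arrays of shape (H, W, 3)."""
            matched = exposure.match_histograms(source_img, factory_reference_img, channel_axis=-1)
            return np.clip(matched, 0, 255).astype(np.uint8)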
    Virtual Temporal Samples for Recurrent Neural Networks: applied to semantic segmentation in agriculture. (arXiv:2106.10118v2 [cs.CV] UPDATED)
    (0 min) This paper explores the potential for performing temporal semantic segmentation in the context of agricultural robotics without temporally labelled data. We achieve this by proposing to generate virtual temporal samples from labelled still images. By exploiting the relatively static scene and assuming that the robot (camera) moves, we are able to generate virtually labelled temporal sequences with no extra annotation effort. Normally, training a recurrent neural network (RNN) requires labelled samples from a video (temporal) sequence, which is laborious and has stymied work in this direction. By generating virtual temporal samples, we demonstrate that it is possible to train a lightweight RNN to perform semantic segmentation on two challenging agricultural datasets. Our results show that by training a temporal semantic segmenter using virtual samples we can increase performance by an absolute amount of $4.6$ and $4.9$ on the sweet pepper and sugar beet datasets, respectively. This indicates that our virtual data augmentation technique is able to accurately classify agricultural images temporally without complicated synthetic data generation techniques or the overhead of labelling large amounts of temporal sequences.
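    A minimal sketch of generating a virtual temporal sequence from one labelled still image by sliding a crop window to mimic camera translation; the window size, step, and direction are illustrative assumptions, not the paper's exact generation scheme:

        import numpy as np

        def virtual_temporal_sample(image, label, crop_hw=(256, 256), num_frames=4, step=8):
            """image: (H, W, C) still frame; label: (H, W) per-pixel annotation of that frame.
            Returns a (num_frames, h, w, C) image sequence and the matching label sequence."""
            h, w = crop_hw
            assert image.shape[0] >= h + num_frames * step and image.shape[1] >= w + num_frames * step
            frames, labels = [], []
            for t in range(num_frames):
                y, x = t * step, t * step                       # simulated camera motion
                frames.append(image[y:y + h, x:x + w])
                labels.append(label[y:y + h, x:x + w])
            return np.stack(frames), np.stack(labels)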
    Weakly Supervised Relative Spatial Reasoning for Visual Question Answering. (arXiv:2109.01934v1 [cs.CV])
    (0 min) Vision-and-language (V&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of visual reasoning is spatial understanding, which involves understanding relative locations of objects, i.e., implicitly learning the geometry of the scene. In this work, we evaluate the faithfulness of V&L models to such geometric understanding, by formulating the prediction of pair-wise relative locations of objects as a classification as well as a regression task. Our findings suggest that state-of-the-art transformer-based V&L models lack sufficient abilities to excel at this task. Motivated by this, we design two objectives as proxies for 3D spatial reasoning (SR) -- object centroid estimation, and relative position estimation, and train V&L with weak supervision from off-the-shelf depth estimators. This leads to considerable improvements in accuracy for the "GQA" visual question answering challenge (in fully supervised, few-shot, and OOD settings) as well as improvements in relative spatial reasoning. Code and data will be released at https://github.com/pratyay-banerjee/weak_sup_vqa.
    Parallel Capsule Networks for Classification of White Blood Cells. (arXiv:2108.02644v2 [cs.CV] UPDATED)
    (0 min) Capsule Networks (CapsNets) are a machine learning architecture proposed to overcome some of the shortcomings of convolutional neural networks (CNNs). However, CapsNets have mainly outperformed CNNs on datasets where images are small and/or the objects to identify have minimal background noise. In this work, we present a new architecture, parallel CapsNets, which exploits the concept of branching the network to isolate certain capsules, allowing each branch to identify different entities. We applied our concept to the two current types of CapsNet architectures, studying the performance for networks with different layers of capsules. We tested our design on a public, highly unbalanced dataset of acute myeloid leukaemia images (15 classes). Our experiments showed that conventional CapsNets achieve performance similar to our baseline CNN (ResNeXt-50) but exhibit instability problems. In contrast, parallel CapsNets can outperform ResNeXt-50, are more stable, and show better rotational invariance than both conventional CapsNets and ResNeXt-50.
    Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering. (arXiv:2109.01847v1 [cs.CV])
    (0 min) Implicit neural rendering techniques have shown promising results for novel view synthesis. However, existing methods usually encode the entire scene as a whole, which is generally not aware of object identity and limits the ability to perform high-level editing tasks such as moving or adding furniture. In this paper, we present a novel neural scene rendering system, which learns an object-compositional neural radiance field and produces realistic rendering with editing capability for a cluttered, real-world scene. Specifically, we design a novel two-pathway architecture, in which the scene branch encodes the scene geometry and appearance, and the object branch encodes each standalone object conditioned on learnable object activation codes. To survive the training in heavily cluttered scenes, we propose a scene-guided training strategy to solve the 3D space ambiguity in the occluded regions and learn sharp boundaries for each object. Extensive experiments demonstrate that our system not only achieves competitive performance for static scene novel-view synthesis, but also produces realistic rendering for object-level editing.
    Counterfactual Explanation Based on Gradual Construction for Deep Networks. (arXiv:2008.01897v2 [cs.LG] UPDATED)
    (0 min) To understand the black-box characteristics of deep networks, counterfactual explanation, which deduces not only the important features of an input space but also how those features should be modified to classify an input as a target class, has gained increasing interest. The patterns that deep networks have learned from a training dataset can be grasped by observing the feature variation among various classes. However, current approaches perform the feature modification to increase the classification probability for the target class irrespective of the internal characteristics of deep networks. This often leads to unclear explanations that deviate from real-world data distributions. To address this problem, we propose a counterfactual explanation method that exploits the statistics learned from a training dataset. In particular, we gradually construct an explanation by iterating over masking and composition steps. The masking step aims to select an important feature from the input data to be classified as a target class. Meanwhile, the composition step aims to optimize the previously selected feature by ensuring that its output score is close to the logit space of the training data that are classified as the target class. Experimental results show that our method produces human-friendly interpretations on various classification datasets and verify that such interpretations can be achieved with fewer feature modifications.
    Scene Graphs: A Survey of Generations and Applications. (arXiv:2104.01111v2 [cs.CV] UPDATED)
    (0 min) A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, people look forward to a higher level of understanding and reasoning about visual scenes. For example, given an image, we want to not only detect and recognize objects in the image, but also know the relationship between objects (visual relationship detection), and generate a text description (image captioning) based on the image content. Alternatively, we might want the machine to tell us what the little girl in the image is doing (Visual Question Answering (VQA)), or even remove the dog from the image and find similar images (image editing and retrieval), etc. These tasks require a higher level of understanding and reasoning for image vision tasks. The scene graph is just such a powerful tool for scene understanding. Therefore, scene graphs have attracted the attention of a large number of researchers, and related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs exists at present. To this end, this survey conducts a comprehensive investigation of current scene graph research. More specifically, we first summarize the general definition of the scene graph, then conduct a comprehensive and systematic discussion of scene graph generation (SGG) methods and of SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we provide some insights into the future development of scene graphs. We believe this will be a very helpful foundation for future research on scene graphs.
    Tensor Normalization and Full Distribution Training. (arXiv:2109.02345v1 [cs.CV])
    (0 min) In this work, we introduce pixel-wise tensor normalization, which is inserted after rectified linear units and, together with batch normalization, provides a significant improvement in the accuracy of modern deep neural networks. In addition, this work deals with the robustness of networks. We show that the factorized superposition of images from the training set and the reformulation of the multi-class problem into a multi-label problem yield significantly more robust networks. The reformulation and the adjustment of the multi-class log loss also improve the results compared to the overlay with only one class as the label. https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/?p=%2FTNandFDT&mode=list
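    One plausible reading of a pixel-wise tensor normalization layer, sketched below as rescaling the channel vector at every spatial location to unit RMS after a ReLU; this is an assumption for illustration, not the paper's verified formulation:

        import torch
        import torch.nn as nn

        class PixelwiseNorm(nn.Module):
            """Assumed form: divide each spatial position's channel vector by its RMS norm."""
            def __init__(self, eps=1e-8):
                super().__init__()
                self.eps = eps

            def forward(self, x):                                    # x: (N, C, H, W)
                rms = x.pow(2).mean(dim=1, keepdim=True).add(self.eps).sqrt()
                return x / rms

        # Assumed placement: conv -> batch norm -> ReLU -> pixel-wise normalization
        block = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16),
                              nn.ReLU(), PixelwiseNorm())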
    GOHOME: Graph-Oriented Heatmap Output for Future Motion Estimation. (arXiv:2109.01827v1 [cs.CV])
    (0 min) In this paper, we propose GOHOME, a method leveraging graph representations of the High Definition Map and sparse projections to generate a heatmap output representing the future position probability distribution for a given agent in a traffic scene. This heatmap output yields an unconstrained 2D grid representation of the agent's possible future locations, allowing inherent multimodality and a measure of the uncertainty of the prediction. Our graph-oriented model avoids the high computational burden of representing the surrounding context as squared images and processing it with classical CNNs, and focuses instead only on the most probable lanes where the agent could end up in the immediate future. GOHOME reaches 3rd place on the Argoverse Motion Forecasting Benchmark on the MissRate$_6$ metric while achieving a significant speed-up and reduction in memory burden compared to the 1st-place method HOME. We also highlight that the heatmap output enables multimodal ensembling and improves the 1st-place MissRate$_6$ by more than 15% with our best ensemble.
    Comparing the Machine Readability of Traffic Sign Pictograms in Austria and Germany. (arXiv:2109.02362v1 [cs.CV])
    (0 min) We compare the machine readability of pictograms found on Austrian and German traffic signs. To that end, we train classification models on synthetic data sets and evaluate their classification accuracy in a controlled setting. In particular, we focus on differences between currently deployed pictograms in the two countries, and a set of new pictograms designed to increase human readability. Besides other results, we find that machine-learning models generalize poorly to data sets with pictogram designs they have not been trained on. We conclude that manufacturers of advanced driver-assistance systems (ADAS) must take special care to properly address small visual differences between current and newly designed traffic sign pictograms, as well as between pictograms from different countries.
  • cs.IR updates on arXiv.org

    Relation Extraction from Tables using Artificially Generated Metadata. (arXiv:2108.10750v3 [cs.CL] UPDATED)
    (2 min) Relation Extraction (RE) from tables is the task of identifying relations between pairs of columns of a table. Generally, RE models for this task require labelled tables for training. These labelled tables can also be generated artificially from a Knowledge Graph (KG), which makes the cost of acquiring them much lower in comparison to manual annotations. However, unlike real tables, these synthetic tables lack associated metadata, such as column headers, captions, etc.; this is because synthetic tables are created from KGs that do not store such metadata. Meanwhile, previous works have shown that metadata is important for accurate RE from tables. To address this issue, we propose methods to artificially create some of this metadata for synthetic tables. Afterward, we experiment with a BERT-based model, in line with recently published works, that takes as input a combination of the proposed artificial metadata and table content. Our empirical results show that this leads to an improvement of 9%-45% in F1 score, in absolute terms, over two tabular datasets.
    Towards Retrieval-based Conversational Recommendation. (arXiv:2109.02311v1 [cs.IR])
    (2 min) Conversational recommender systems (CRS) have attracted immense attention recently. The most recent approaches rely on neural models trained on recorded dialogs between humans, implementing an end-to-end learning process. These systems are commonly designed to generate responses given the user's utterances in natural language. One main challenge is that these generated responses both have to be appropriate for the given dialog context and must be grammatically and semantically correct. An alternative to such generation-based approaches is to retrieve responses from pre-recorded dialog data and to adapt them if needed. Such retrieval-based approaches were successfully explored in the context of general conversational systems, but have received limited attention in recent years for CRS. In this work, we re-assess the potential of such approaches and design and evaluate a novel technique for response retrieval and ranking. A user study (N=90) revealed that the responses from our system were on average of higher quality than those of two recent generation-based systems. We furthermore found that the quality ranking of the two generation-based approaches is not aligned with the results from the literature, which points to open methodological questions. Overall, our research underlines that retrieval-based approaches should be considered an alternative or complement to language generation approaches.
    Artificial Intelligence (AI) in Action: Addressing the COVID-19 Pandemic with Natural Language Processing (NLP). (arXiv:2010.16413v3 [cs.CL] UPDATED)
    (2 min) The COVID-19 pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
    Patent-KG: A Patent Knowledge Graph for Engineering Design. (arXiv:2108.11899v2 [cs.IR] UPDATED)
    (2 min) To facilitate knowledge reuse in engineering design, several dataset approaches have been proposed and applied by designers. This paper builds a patent-based knowledge graph, patent-KG, to represent the knowledge facts in patents for engineering design. The proposed patent-KG approach uses a new unsupervised mechanism to extract knowledge facts from patents by searching the attention graph in language models. This method avoids using expensive labelled data in supervised learning or listing complex syntactic rules in rule-based extraction. The extracted entities are compared with other benchmarks and show a higher coverage of engineering words. The extracted relationships are also compared with other benchmarks, and the results show meaningful advantages.
    Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context. (arXiv:2009.08978v3 [cs.IR] UPDATED)
    (2 min) Recommender systems research tends to evaluate model performance offline and on randomly sampled targets, yet the same systems are later used to predict user behavior sequentially from a fixed point in time. Simulating online recommender system performance is notoriously difficult and the discrepancy between online and offline behaviors is typically not accounted for in offline evaluations. This disparity permits weaknesses to go unnoticed until the model is deployed in a production setting. In this paper, we first demonstrate how omitting temporal context when evaluating recommender system performance leads to false confidence. To overcome this, we postulate that offline evaluation protocols can only model real-life use-cases if they account for temporal context. Next, we propose a training procedure to further embed the temporal context in existing models. We use a multi-objective approach to introduce temporal context into traditionally time-unaware recommender systems and confirm its advantage via the proposed evaluation protocol. Finally, we validate that the Pareto Fronts obtained with the added objective dominate those produced by state-of-the-art models that are only optimized for accuracy on three real-world publicly available datasets. The results show that including our temporal objective can improve recall@20 by up to 20%.
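    A minimal sketch of the evaluation issue raised here, contrasting a random holdout with a temporal holdout that predicts forward from a fixed point in time; the DataFrame and column names are hypothetical:

        import pandas as pd

        def random_split(interactions: pd.DataFrame, test_frac=0.2, seed=0):
            """Random holdout, which ignores temporal context."""
            test = interactions.sample(frac=test_frac, random_state=seed)
            return interactions.drop(test.index), test

        def temporal_split(interactions: pd.DataFrame, timestamp_col="timestamp", test_frac=0.2):
            """Hold out the most recent interactions so evaluation respects temporal context."""
            ordered = interactions.sort_values(timestamp_col)
            cutoff = int(len(ordered) * (1 - test_frac))
            return ordered.iloc[:cutoff], ordered.iloc[cutoff:]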
    Global-Local Item Embedding for Temporal Set Prediction. (arXiv:2109.02074v1 [cs.LG])
    (2 min) Temporal set prediction is becoming increasingly important as many companies employ recommender systems in their online businesses, e.g., personalized purchase prediction of shopping baskets. While most previous techniques have focused on leveraging a user's history, the study of combining it with other users' histories remains untapped potential. This paper proposes Global-Local Item Embedding (GLOIE), which learns to utilize the temporal properties of sets across all users as well as within a single user, using the terms global and local information to distinguish the two temporal patterns. GLOIE uses a Variational Autoencoder (VAE) and a dynamic graph-based model to capture global and local information, and then applies attention to integrate the resulting item embeddings. Additionally, we propose to use a Tweedie output for the decoder of the VAE, as it can easily model zero-inflated and long-tailed distributions, which is more suitable for several real-world data distributions than Gaussian or multinomial counterparts. When evaluated on three public benchmarks, our algorithm consistently outperforms previous state-of-the-art methods in most ranking metrics.
    When Retriever-Reader Meets Scenario-Based Multiple-Choice Questions. (arXiv:2108.13875v2 [cs.CL] UPDATED)
    (2 min) Scenario-based question answering (SQA) requires retrieving and reading paragraphs from a large corpus to answer a question which is contextualized by a long scenario description. Since a scenario contains both keyphrases for retrieval and much noise, retrieval for SQA is extremely difficult. Moreover, it can hardly be supervised due to the lack of relevance labels of paragraphs for SQA. To meet the challenge, in this paper we propose a joint retriever-reader model called JEEVES where the retriever is implicitly supervised only using QA labels via a novel word weighting mechanism. JEEVES significantly outperforms a variety of strong baselines on multiple-choice questions in three SQA datasets.
    QUEACO: Borrowing Treasures from Weakly-labeled Behavior Data for Query Attribute Value Extraction. (arXiv:2108.08468v3 [cs.CL] UPDATED)
    (3 min) We study the problem of query attribute value extraction, which aims to identify named entities from user queries as diverse surface-form attribute values and afterward transform them into formally canonical forms. Such a problem consists of two phases: named entity recognition (NER) and attribute value normalization (AVN). However, existing works only focus on the NER phase but neglect the equally important AVN. To bridge this gap, this paper proposes a unified query attribute value extraction system in e-commerce search named QUEACO, which involves both phases. Moreover, by leveraging large-scale weakly-labeled behavior data, we further improve the extraction performance with less supervision cost. Specifically, for the NER phase, QUEACO adopts a novel teacher-student network, where a teacher network that is trained on the strongly-labeled data generates pseudo-labels to refine the weakly-labeled data for training a student network. Meanwhile, the teacher network can be dynamically adapted by the feedback of the student's performance on strongly-labeled data to maximally denoise the noisy supervision from the weak labels. For the AVN phase, we also leverage the weakly-labeled query-to-attribute behavior data to normalize surface-form attribute values from queries into canonical forms from products. Extensive experiments on a real-world large-scale e-commerce dataset demonstrate the effectiveness of QUEACO.
    Adherence and Constancy in LIME-RS Explanations for Recommendation. (arXiv:2109.00818v2 [cs.IR] UPDATED)
    (2 min) Explainable recommendation has attracted a lot of attention due to a renewed interest in explainable artificial intelligence. In particular, post-hoc approaches have proved to be the most easily applicable ones to increasingly complex recommendation models, which are then treated as black boxes. The most recent literature has shown that for post-hoc explanations based on local surrogate models, there are problems related to the robustness of the approach itself. This consideration becomes even more relevant in human-related tasks like recommendation. The explanation also has the arduous task of enhancing increasingly relevant aspects of user experience such as transparency or trustworthiness. This paper aims to show how the characteristics of a classical post-hoc model based on surrogates are strongly model-dependent and do not prove to be accountable for the explanations generated.
    Matching-oriented Product Quantization For Ad-hoc Retrieval. (arXiv:2104.07858v2 [cs.CL] UPDATED)
    (2 min) Product quantization (PQ) is a widely used technique for ad-hoc retrieval. Recent studies propose supervised PQ, where the embedding and quantization models can be jointly trained with supervised learning. However, there is a lack of appropriate formulation of the joint training objective; thus, the improvements over previous non-supervised baselines are limited in reality. In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated. With the minimization of MCL, we are able to maximize the matching probability of query and ground-truth key, which contributes to the optimal retrieval accuracy. Given that the exact computation of MCL is intractable due to the demand of vast contrastive samples, we further propose the Differentiable Cross-device Sampling (DCS), which significantly augments the contrastive samples for precise approximation of MCL. We conduct extensive experimental studies on four real-world datasets, whose results verify the effectiveness of MoPQ.
    Fairness via AI: Bias Reduction in Medical Information. (arXiv:2109.02202v1 [cs.AI])
    (2 min) Most Fairness in AI research focuses on exposing biases in AI systems. A broader lens on fairness reveals that AI can serve a greater aspiration: rooting out societal inequities from their source. Specifically, we focus on inequities in health information, and aim to reduce bias in that domain using AI. The AI algorithms under the hood of search engines and social media, many of which are based on recommender systems, have an outsized impact on the quality of medical and health information online. Therefore, embedding bias detection and reduction into these recommender systems serving up medical and health content online could have an outsized positive impact on patient outcomes and wellbeing. In this position paper, we offer the following contributions: (1) we propose a novel framework of Fairness via AI, inspired by insights from medical education, sociology and antiracism; (2) we define a new term, bisinformation, which is related to, but distinct from, misinformation, and encourage researchers to study it; (3) we propose using AI to study, detect and mitigate biased, harmful, and/or false health information that disproportionately hurts minority groups in society; and (4) we suggest several pillars and pose several open problems in order to seed inquiry in this new space. While part (3) of this work specifically focuses on the health domain, the fundamental computer science advances and contributions stemming from research efforts in bias reduction and Fairness via AI have broad implications in all areas of society.
    FedMatch: Federated Learning Over Heterogeneous Question Answering Data. (arXiv:2108.05069v2 [cs.IR] UPDATED)
    (3 min) Question Answering (QA), a popular and promising technique for intelligent information access, faces a dilemma about data, as do most other AI techniques. On one hand, modern QA methods rely on deep learning models which are typically data-hungry. Therefore, it is expected to collect and fuse all the available QA datasets together in a common site for developing a powerful QA model. On the other hand, real-world QA datasets are typically distributed in the form of isolated islands belonging to different parties. Due to the increasing awareness of privacy security, it is almost impossible to integrate the data scattered around, or the cost is prohibitive. A possible solution to this dilemma is a new approach known as federated learning, which is a privacy-preserving machine learning technique over distributed datasets. In this work, we propose to adopt federated learning for QA with a special concern for the statistical heterogeneity of the QA data. Here the heterogeneity refers to the fact that annotated QA data are typically non-identically and independently distributed (non-IID) and of unbalanced sizes in practice. Traditional federated learning methods may sacrifice the accuracy of individual models under the heterogeneous situation. To tackle this problem, we propose a novel Federated Matching framework for QA, named FedMatch, with a backbone-patch architecture. The shared backbone is to distill the common knowledge of all the participants while the private patch is a compact and efficient module to retain the domain information for each participant. To facilitate the evaluation, we build a benchmark collection based on several QA datasets from different domains to simulate the heterogeneous situation in practice. Empirical studies demonstrate that our model can achieve significant improvements against the baselines over all the datasets.
    Representation Learning for Efficient and Effective Similarity Search and Recommendation. (arXiv:2109.01815v1 [cs.IR])
    (2 min) How data is represented and operationalized is critical for building computational solutions that are both effective and efficient. A common approach is to represent data objects as binary vectors, denoted hash codes, which require little storage and enable efficient similarity search through direct indexing into a hash table or through similarity computations in an appropriate space. Due to the limited expressibility of hash codes, compared to real-valued representations, a core open challenge is how to generate hash codes that well capture semantic content or latent properties using a small number of bits, while ensuring that the hash codes are distributed in a way that does not reduce their search efficiency. State-of-the-art methods use representation learning for generating such hash codes, focusing on neural autoencoder architectures where semantics are encoded into the hash codes by learning to reconstruct the original inputs of the hash codes. This thesis addresses the above challenge and makes a number of contributions to representation learning that (i) improve effectiveness of hash codes through more expressive representations and a more effective similarity measure than the current state of the art, namely the Hamming distance, and (ii) improve efficiency of hash codes by learning representations that are especially suited to the choice of search method. The contributions are empirically validated on several tasks related to similarity search and recommendation.
    Attentive Knowledge-aware Graph Convolutional Networks with Collaborative Guidance for Recommendation. (arXiv:2109.02046v1 [cs.IR])
    (2 min) To alleviate data sparsity and cold-start problems of traditional recommender systems (RSs), incorporating knowledge graphs (KGs) to supplement auxiliary information has attracted considerable attention recently. However, simply integrating KGs in current KG-based RS models is not necessarily a guarantee to improve the recommendation performance, which may even weaken the holistic model capability. This is because the construction of these KGs is independent of the collection of historical user-item interactions; hence, information in these KGs may not always be helpful for recommendation to all users. In this paper, we propose attentive Knowledge-aware Graph convolutional networks with Collaborative Guidance for personalized Recommendation (CG-KGR). CG-KGR is a novel knowledge-aware recommendation model that enables ample and coherent learning of KGs and user-item interactions, via our proposed Collaborative Guidance Mechanism. Specifically, CG-KGR first encapsulates historical interactions to interactive information summarization. Then CG-KGR utilizes it as guidance to extract information out of KGs, which eventually provides more precise personalized recommendation. We conduct extensive experiments on four real-world datasets over two recommendation tasks, i.e., Top-K recommendation and Click-Through rate (CTR) prediction. The experimental results show that the CG-KGR model significantly outperforms recent state-of-the-art models by 4.0-53.2% and 0.4-3.2%, in terms of Recall metric on Top-K recommendation and AUC on CTR prediction, respectively.
    Recommending Researchers in Machine Learning based on Author-Topic Model. (arXiv:2109.02022v1 [cs.IR])
    (2 min) The aim of this paper is to uncover the researchers in machine learning using the author-topic model (ATM). We collect 16,855 scientific papers from six top journals in the field of machine learning published from 1997 to 2016 and analyze them using ATM. The dataset is broken down into 4 intervals to identify the top researchers and find similar researchers using their similarity score. The similarity score is calculated using the Hellinger distance. The researchers are plotted using t-SNE, which reduces the dimensionality of the data while approximately preserving the distances between points. The analysis in our study helps upcoming researchers find the top researchers in their area of interest.
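    A minimal sketch of the Hellinger-distance similarity between two researchers' topic distributions; the conversion from distance to a similarity score is an assumption for illustration:

        import numpy as np

        def hellinger(p, q):
            """Hellinger distance between two discrete distributions (1-D arrays summing to 1)."""
            return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

        # Toy author-topic mixtures for two researchers (hypothetical values).
        p = np.array([0.7, 0.2, 0.1])
        q = np.array([0.6, 0.3, 0.1])
        similarity = 1.0 - hellinger(p, q)   # assumed similarity score: 1 - distance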
    Navigating the Mise-en-Page: Interpretive Machine Learning Approaches to the Visual Layouts of Multi-Ethnic Periodicals. (arXiv:2109.01732v1 [cs.CV])
    (2 min) This paper presents a computational method of analysis that draws from machine learning, library science, and literary studies to map the visual layouts of multi-ethnic newspapers from the late 19th and early 20th century United States. This work departs from prior approaches to newspapers that focus on individual pieces of textual and visual content. Our method combines Chronicling America's MARC data and the Newspaper Navigator machine learning dataset to identify the visual patterns of newspaper page layouts. By analyzing high-dimensional visual similarity, we aim to better understand how editors spoke and protested through the layout of their papers.
    Urban Fire Station Location Planning: A Systematic Approach using Predicted Demand and Service Quality Index. (arXiv:2109.02160v1 [cs.LG])
    (2 min) In this article, we propose a systematic approach for fire station location planning. We develop a machine learning model, based on Random Forest, for demand prediction and utilize the model further to define a generalized index to measure the quality of fire service in urban settings. Our model is built upon spatial data collected from multiple different sources. The efficacy of proper facility planning depends on the choice of candidate locations where fire stations can be placed, along with existing stations, if any. Also, the travel time from these candidates to demand locations needs to be taken into account to maintain fire safety standards. Here, we propose a travel-time-based clustering technique to identify suitable candidates. Finally, we develop an optimization problem to select the best locations to install new fire stations. Our optimization problem builds upon the maximum coverage problem, based on integer programming. We present a detailed experimental study of our proposed approach in collaboration with the City of Victoria Fire Department, MN, USA. Our demand prediction model achieves a true positive rate of 70% and a false positive rate of approximately 22%. We aid the Victoria Fire Department in selecting a location for a new fire station using our approach. We present detailed results on improvement statistics obtained by locating a new facility, as suggested by our methodology, in the city of Victoria.
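    A toy sketch of the maximum-coverage integer program at the core of the site-selection step, written with PuLP; the candidate sites, demand weights, coverage sets, and budget are hypothetical, and the authors' full formulation (travel-time clustering, service quality index) is not reproduced:

        import pulp

        candidates = ["c1", "c2", "c3"]                                  # candidate station sites
        demand = {"d1": 10.0, "d2": 7.0, "d3": 5.0}                      # predicted demand weights
        covers = {"c1": {"d1", "d2"}, "c2": {"d2", "d3"}, "c3": {"d3"}}  # reachable within travel-time limit
        budget = 1                                                       # number of new stations

        prob = pulp.LpProblem("max_coverage", pulp.LpMaximize)
        x = pulp.LpVariable.dicts("open", candidates, cat="Binary")
        y = pulp.LpVariable.dicts("covered", demand, cat="Binary")

        prob += pulp.lpSum(demand[d] * y[d] for d in demand)             # maximize covered demand
        prob += pulp.lpSum(x[c] for c in candidates) <= budget           # station budget
        for d in demand:                                                 # a demand point counts only if
            prob += y[d] <= pulp.lpSum(x[c] for c in candidates if d in covers[c])  # a covering site opens

        prob.solve()
        chosen = [c for c in candidates if x[c].value() == 1]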
  • cs.LG updates on arXiv.org

    Multimodal Reward Shaping for Efficient Exploration in Reinforcement Learning. (arXiv:2107.08888v2 [cs.LG] UPDATED)
    (2 min) Maintaining the long-term exploration capability of the agent remains one of the critical challenges in deep reinforcement learning. A representative solution is to leverage reward shaping to provide intrinsic rewards that encourage the agent to explore. However, most existing methods suffer from vanishing intrinsic rewards, which cannot provide sustainable exploration incentives. Moreover, they rely heavily on complex models and additional memory to record learning procedures, resulting in high computational complexity and low robustness. To tackle this problem, entropy-based methods have been proposed to evaluate global exploration performance, encouraging the agent to visit the state space more equitably. However, the sample complexity of estimating the state visitation entropy is prohibitive when handling environments with high-dimensional observations. In this paper, we introduce a novel metric, Jain's fairness index (JFI), to replace the entropy regularizer, which addresses the exploration problem from a brand-new perspective. In sharp contrast to the entropy regularizer, JFI is more computable and robust and can easily be generalized to arbitrary tasks. Furthermore, we leverage a variational auto-encoder (VAE) model to capture the life-long novelty of states, which is combined with the global JFI score to form multimodal intrinsic rewards. Finally, extensive simulation results demonstrate that our multimodal reward shaping (MMRS) method can achieve higher performance than other benchmark schemes.
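    Jain's fairness index itself has a standard closed form; a minimal sketch over state-visitation counts is given below (how the paper combines JFI with the VAE-based novelty term is not reproduced here):

        import numpy as np

        def jains_fairness_index(visit_counts):
            """JFI(x) = (sum x_i)^2 / (n * sum x_i^2); lies in (0, 1] and equals 1
            when all states are visited equally often."""
            x = np.asarray(visit_counts, dtype=float)
            return float(x.sum() ** 2 / (len(x) * np.square(x).sum() + 1e-12))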
    Neural Marching Cubes. (arXiv:2106.11272v3 [cs.GR] UPDATED)
    (3 min) We introduce Neural Marching Cubes (NMC), a data-driven approach for extracting a triangle mesh from a discretized implicit field. Classical MC is defined by coarse tessellation templates isolated to individual cubes. While more refined tessellations have been proposed, they all make heuristic assumptions, such as trilinearity, when determining the vertex positions and local mesh topologies in each cube. In principle, none of these approaches can reconstruct geometric features that reveal coherence or dependencies between nearby cubes (e.g., a sharp edge), as such information is unaccounted for, resulting in poor estimates of the true underlying implicit field. To tackle these challenges, we re-cast MC from a deep learning perspective, by designing tessellation templates more apt at preserving geometric features, and learning the vertex positions and mesh topologies from training meshes, to account for contextual information from nearby cubes. We develop a compact per-cube parameterization to represent the output triangle mesh, while being compatible with neural processing, so that a simple 3D convolutional network can be employed for the training. We show that all topological cases in each cube that are applicable to our design can be easily derived using our representation, and the resulting tessellations can also be obtained naturally and efficiently by following a few design guidelines. In addition, our network learns local features with limited receptive fields, hence it generalizes well to new shapes and new datasets. We evaluate our neural MC approach by quantitative and qualitative comparisons to all well-known MC variants. In particular, we demonstrate the ability of our network to recover sharp features such as edges and corners, a long-standing issue of MC and its variants. Our network also reconstructs local mesh topologies more accurately than previous approaches.
    Multi-Reference Alignment for sparse signals, Uniform Uncertainty Principles and the Beltway Problem. (arXiv:2106.12996v2 [math.ST] UPDATED)
    (2 min) Motivated by cutting-edge applications like cryo-electron microscopy (cryo-EM), the Multi-Reference Alignment (MRA) model entails the learning of an unknown signal from repeated measurements of its images under the latent action of a group of isometries and additive noise of magnitude $\sigma$. Despite significant interest, a clear picture for understanding rates of estimation in this model has emerged only recently, particularly in the high-noise regime $\sigma \gg 1$ that is highly relevant in applications. Recent investigations have revealed a remarkable asymptotic sample complexity of order $\sigma^6$ for certain signals whose Fourier transforms have full support, in stark contrast to the traditional $\sigma^2$ that arise in regular models. Often prohibitively large in practice, these results have prompted the investigation of variations around the MRA model where better sample complexity may be achieved. In this paper, we show that \emph{sparse} signals exhibit an intermediate $\sigma^4$ sample complexity even in the classical MRA model. Our results explore and exploit connections of the MRA estimation problem with two classical topics in applied mathematics: the \textit{beltway problem} from combinatorial optimization, and \textit{uniform uncertainty principles} from harmonic analysis.
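    A minimal sketch of the observation model described here, specialized to the cyclic group: each measurement is a uniformly random cyclic shift of the signal plus Gaussian noise of level sigma:

        import numpy as np

        def mra_observations(signal, num_obs, sigma, seed=None):
            """Generate y_j = shift(signal, s_j) + sigma * noise for random cyclic shifts s_j."""
            rng = np.random.default_rng(seed)
            d = len(signal)
            shifts = rng.integers(0, d, size=num_obs)
            clean = np.stack([np.roll(signal, s) for s in shifts])
            return clean + sigma * rng.standard_normal(clean.shape)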
    A method for estimating the entropy of time series using artificial neural network. (arXiv:2107.08399v2 [cs.LG] UPDATED)
    (3 min) Measuring the predictability and complexity of time series is an essential tool in designing and controlling nonlinear systems. Different entropy measures exist in the literature for analyzing the predictability and complexity of time series. However, existing methods have some drawbacks related to a strong dependence of the entropy on the parameters of the methods, as well as on the length and amplitude of the time series. To overcome these difficulties, this study proposes a new method for estimating the entropy of a time series using the LogNNet neural network model. The LogNNet reservoir matrix is filled with the time series elements according to our algorithm. The network is trained on the MNIST-10 dataset and the classification accuracy is calculated. The accuracy is considered as the entropy measure and denoted NNetEn. The novelty of this entropy calculation is that the time series is involved in mixing the input information in the reservoir. The greater the complexity of the time series, the better the ability of the neural network to learn, and the higher the classification accuracy and NNetEn values. The number of epochs in the training process of LogNNet is considered as the control parameter. We introduce a new time series characteristic, called time series learning inertia, that determines the learning rate of the neural network. The robustness and efficiency of the method are verified on chaotic, periodic, random, binary, and constant time series. The comparison of NNetEn with other methods of entropy estimation demonstrates that our method is more robust and accurate and can be widely used in practice.
    Explainable AI (XAI) for PHM of Industrial Asset: A State-of-The-Art, PRISMA-Compliant Systematic Review. (arXiv:2107.03869v2 [cs.AI] UPDATED)
    (2 min) A state-of-the-art systematic review on XAI applied to Prognostics and Health Management (PHM) of industrial assets is presented. This work provides an overview of the general trend of XAI in PHM, and answers the question of accuracy versus explainability, the extent of human involvement, explanation assessment, and uncertainty quantification in the PHM-XAI domain. Research articles associated with the subject, from 2015 to 2021, were selected from five known databases following PRISMA guidelines. Data were then extracted from the selected articles and examined. Several findings were synthesized. Firstly, while the discipline is still young, the analysis indicated the growing acceptance of XAI in the PHM domain. Secondly, XAI functions as a double-edged sword: it is assimilated as a tool to execute PHM tasks as well as a means of explanation, particularly in diagnostic and anomaly detection activities, implying a real need for XAI in PHM. Thirdly, the review showed that PHM-XAI papers produce either good or excellent results in general, suggesting that PHM performance is unaffected by XAI. Fourthly, human role, evaluation metrics, and uncertainty management are areas requiring further attention from the PHM community. Adequate assessment metrics to cater to PHM needs are urgently needed. Finally, most case studies featured in the accepted articles are based on real, industrial data, indicating that the available PHM-XAI blends are fit to solve complex, real-world challenges, increasing confidence in AI adoption in the industry.
    Constrained Restless Bandits for Dynamic Scheduling in Cyber-Physical Systems. (arXiv:1904.08962v5 [cs.SY] UPDATED)
    (3 min) This paper studies a class of constrained restless multi-armed bandits (CRMAB). The constraints are in the form of time varying set of actions (set of available arms). This variation can be either stochastic or semi-deterministic. Given a set of arms, a fixed number of them can be chosen to be played in each decision interval. The play of each arm yields a state dependent reward. The current states of arms are partially observable through binary feedback signals from arms that are played. The current availability of arms is fully observable. The objective is to maximize long term cumulative reward. The uncertainty about future availability of arms along with partial state information makes this objective challenging. Applications for CRMAB can be found in resource allocation in cyber-physical systems involving components with time varying availability. First, this optimization problem is analyzed using Whittle's index policy. To this end, a constrained restless single-armed bandit is studied. It is shown to admit a threshold-type optimal policy and is also indexable. An algorithm to compute Whittle's index is presented. An alternate solution method with lower complexity is also presented in the form of an online rollout policy. A detailed discussion on the complexity of both these schemes is also presented, which suggests that online rollout policy with short look ahead is simpler to implement than Whittle's index computation. Further, upper bounds on the value function are derived in order to estimate the degree of sub-optimality of various solutions. The simulation study compares the performance of Whittle's index, online rollout, myopic and modified Whittle's index policies.
    Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach. (arXiv:2102.05406v3 [cs.LG] UPDATED)
    (2 min) We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment, importantly without any prior knowledge on the degree of non-stationarity. By plugging different algorithms into our black-box, we provide a list of examples showing that our approach not only recovers recent results for (contextual) multi-armed bandits achieved by very specialized algorithms, but also significantly improves the state of the art for (generalized) linear bandits, episodic MDPs, and infinite-horizon MDPs in various ways. Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.
    Weather-based forecasting of energy generation, consumption and price for electrical microgrids management. (arXiv:2107.01034v4 [eess.SY] UPDATED)
    (2 min) The Intergovernmental Panel on Climate Change proposes different mitigation strategies to achieve the net emissions reductions that would be required to follow a pathway that limits global warming to 1.5°C with no or limited overshoot. The transition towards a carbon-free society goes through an inevitable increase in the share of renewable generation in the energy mix and a drastic decrease in the total consumption of fossil fuels. Therefore, this thesis studies the integration of renewables in power systems by investigating forecasting and decision-making tools. Indeed, in contrast to conventional power plants, renewable energy is subject to uncertainty. Most of the generation technologies based on renewable sources are non-dispatchable, and their production is stochastic and complex to predict in advance. A high share of renewables is challenging for power systems that have been designed and sized for dispatchable units. In this context, probabilistic forecasts, which aim at modeling the distribution of all possible future realizations, have become a vital tool to equip decision-makers, hopefully leading to better decisions in energy applications. This thesis focuses on two main research questions: (1) How to produce reliable probabilistic renewable generation forecasts, consumption, and electricity prices? (2) How to make decisions with uncertainty using probabilistic forecasts? The thesis perimeter is the energy management of "small" systems such as microgrids at a residential scale on a day-ahead basis. It is divided into two main parts to propose directions to address both research questions: (1) a forecasting part; (2) a planning and control part.
    A Critical Connectivity Radius for Segmenting Randomly-Generated, High Dimensional Data Points. (arXiv:1602.03822v8 [cs.LG] UPDATED)
    (2 min) Motivated by a $2$-dimensional (unsupervised) image segmentation task whereby local regions of pixels are clustered via edge detection methods, a more general probabilistic mathematical framework is devised. Critical thresholds are calculated that indicate strong correlation between randomly-generated, high dimensional data points that have been projected into structures in a partition of a bounded, $2$-dimensional area, of which, an image is a special case. A neighbor concept for structures in the partition is defined and a critical radius is uncovered. Measured from a central structure in localized regions of the partition, the radius indicates strong, long and short range correlation in the count of occupied structures. The size of a short interval of radii is estimated upon which the transition from short-to-long range correlation is virtually assured, which defines a demarcation of when an image ceases to be "interesting".
    Accurate and fast matrix factorization for low-rank learning. (arXiv:2104.10785v4 [stat.ML] UPDATED)
    (2 min) In this paper, we tackle two important problems in low-rank learning, which are partial singular value decomposition and numerical rank estimation of huge matrices. By using the concepts of Krylov subspaces such as Golub-Kahan bidiagonalization (GK-bidiagonalization) as well as Ritz vectors, we propose two methods for solving these problems in a fast and accurate way. Our experiments show the advantages of the proposed methods compared to the traditional and randomized singular value decomposition methods. The proposed methods are appropriate for applications involving huge matrices where the accuracy of the desired singular values and also all of their corresponding singular vectors are essential. As a real application, we evaluate the performance of our methods on the problem of Riemannian similarity learning between two various image datasets of MNIST and USPS.
    Using BART to Perform Pareto Optimization and Quantify its Uncertainties. (arXiv:2101.02558v2 [cs.LG] UPDATED)
    (2 min) Techniques to reduce the energy burden of an industrial ecosystem often require solving a multiobjective optimization problem. However, collecting experimental data can often be either expensive or time-consuming. In such cases, statistical methods can be helpful. This article proposes Pareto Front (PF) and Pareto Set (PS) estimation methods using Bayesian Additive Regression Trees (BART), which is a non-parametric model whose assumptions are typically less restrictive than popular alternatives, such as Gaussian Processes (GPs). These less restrictive assumptions allow BART to handle scenarios (e.g. high-dimensional input spaces, nonsmooth responses, large datasets) that GPs find difficult. The performance of our BART-based method is compared to a GP-based method using analytic test functions, demonstrating convincing advantages. Finally, our BART-based methodology is applied to a motivating engineering problem. Supplementary materials, which include a theorem proof, algorithms, and R code, for this article are available online.
    Learning with Holographic Reduced Representations. (arXiv:2109.02157v1 [cs.AI])
    (2 min) Holographic Reduced Representations (HRR) are a method for performing symbolic AI on top of real-valued vectors (Plate, 1995) by associating each vector with an abstract concept, and providing mathematical operations to manipulate vectors as if they were classic symbolic objects. This method has seen little use outside of older symbolic AI work and cognitive science. Our goal is to revisit this approach to understand if it is viable for enabling a hybrid neural-symbolic approach to learning as a differentiable component of a deep learning architecture. HRRs today are not effective in a differentiable solution due to numerical instability, a problem we solve by introducing a projection step that forces the vectors to exist in a well-behaved point in space. In doing so we improve the concept retrieval efficacy of HRRs by over $100\times$. Using multi-label classification we demonstrate how to leverage the symbolic HRR properties to develop an output layer and loss function that is able to learn effectively, and allows us to investigate some of the pros and cons of an HRR neuro-symbolic learning approach.
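    A minimal numpy sketch of the HRR operations discussed above: binding via circular convolution, approximate unbinding via the involution, and a projection that forces every Fourier coefficient onto the unit circle so repeated binding stays numerically stable. The exact placement of the projection inside the paper's architecture is an assumption here.

```python
import numpy as np

def projection(x):
    # Force every Fourier coefficient onto the unit circle (the stabilizing step).
    f = np.fft.fft(x)
    return np.real(np.fft.ifft(f / np.abs(f)))

def bind(a, b):
    # Circular convolution implemented in the Fourier domain.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    # Approximate inverse: bind with the involution of a (exact for projected vectors).
    a_inv = np.concatenate(([a[0]], a[:0:-1]))
    return bind(c, a_inv)

d = 1024
rng = np.random.default_rng(0)
key = projection(rng.normal(0, 1 / np.sqrt(d), d))
value = projection(rng.normal(0, 1 / np.sqrt(d), d))
trace = bind(key, value)
retrieved = unbind(trace, key)
# Cosine similarity between the retrieved vector and the stored value (close to 1).
print(np.dot(retrieved, value) / (np.linalg.norm(retrieved) * np.linalg.norm(value)))
```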
    RDFFrames: Knowledge Graph Access for Machine Learning Tools. (arXiv:2002.03614v4 [cs.DB] UPDATED)
    (3 min) Knowledge graphs represented as RDF datasets are integral to many machine learning applications. RDF is supported by a rich ecosystem of data management systems and tools, most notably RDF database systems that provide a SPARQL query interface. Surprisingly, machine learning tools for knowledge graphs do not use SPARQL, despite the obvious advantages of using a database system. This is due to the mismatch between SPARQL and machine learning tools in terms of data model and programming style. Machine learning tools work on data in tabular format and process it using an imperative programming style, while SPARQL is declarative and has as its basic operation matching graph patterns to RDF triples. We posit that a good interface to knowledge graphs from a machine learning software stack should use an imperative, navigational programming paradigm based on graph traversal rather than the SPARQL query paradigm based on graph patterns. In this paper, we present RDFFrames, a framework that provides such an interface. RDFFrames provides an imperative Python API that gets internally translated to SPARQL, and it is integrated with the PyData machine learning software stack. RDFFrames enables the user to make a sequence of Python calls to define the data to be extracted from a knowledge graph stored in an RDF database system, and it translates these calls into a compact SPARQL query, executes it on the database system, and returns the results in a standard tabular format. Thus, RDFFrames is a useful tool for data preparation that combines the usability of PyData with the flexibility and performance of RDF database systems.
    Artificial Intelligence in Dry Eye Disease. (arXiv:2109.01658v1 [cs.LG])
    (2 min) Dry eye disease (DED) has a prevalence of between 5 and 50%, depending on the diagnostic criteria used and population under study. However, it remains one of the most underdiagnosed and undertreated conditions in ophthalmology. Many tests used in the diagnosis of DED rely on an experienced observer for image interpretation, which may be considered subjective and result in variation in diagnosis. Since artificial intelligence (AI) systems are capable of advanced problem solving, use of such techniques could lead to more objective diagnosis. Although the term 'AI' is commonly used, recent success in its applications to medicine is mainly due to advancements in the sub-field of machine learning, which has been used to automatically classify images and predict medical outcomes. Powerful machine learning techniques have been harnessed to understand nuances in patient data and medical images, aiming for consistent diagnosis and stratification of disease severity. This is the first literature review on the use of AI in DED. We provide a brief introduction to AI, report its current use in DED research and its potential for application in the clinic. Our review found that AI has been employed in a wide range of DED clinical tests and research applications, primarily for interpretation of interferometry, slit-lamp and meibography images. While initial results are promising, much work is still needed on model development, clinical testing and standardisation.
    Frustratingly Simple Pretraining Alternatives to Masked Language Modeling. (arXiv:2109.01819v1 [cs.CL])
    (2 min) Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced by a [MASK] placeholder in a multi-class setting over the entire vocabulary. When pretraining, it is common to use other auxiliary objectives on the token or sequence level alongside MLM to improve downstream performance (e.g. next sentence prediction). However, no previous work has examined whether simpler, linguistically intuitive (or not) objectives can be used standalone as the main pretraining objective. In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements for MLM. Empirical results on GLUE and SQuAD show that our proposed methods achieve performance comparable to or better than MLM using a BERT-BASE architecture. We further validate our methods with smaller models, showing that pretraining BERT-MEDIUM, a model with 41% of BERT-BASE's parameters, results in only a 1% drop in GLUE scores with our best objective.
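    For orientation, here is one hedged example of what a token-level classification objective can look like (a simple replaced-token detection target; the paper's five objectives are not reproduced here). Each position gets a label that any encoder with a per-token classification head can be trained to predict instead of MLM.

```python
import random

def make_token_level_example(token_ids, vocab_size, replace_prob=0.15, seed=None):
    """Illustrative token-level pretraining target: corrupt a fraction of tokens with
    random vocabulary items and label each position 0 (kept) or 1 (replaced)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < replace_prob:
            inputs.append(rng.randrange(vocab_size))   # corrupted position
            labels.append(1)
        else:
            inputs.append(tok)                         # original token kept
            labels.append(0)
    return inputs, labels

print(make_token_level_example([101, 2023, 2003, 1037, 7953, 102], vocab_size=30522, seed=0))
```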
    Stochastic-Sign SGD for Federated Learning with Theoretical Guarantees. (arXiv:2002.10940v4 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) has emerged as a prominent distributed learning paradigm. FL entails some pressing needs for developing novel parameter estimation approaches with theoretical guarantees of convergence, which are also communication efficient, differentially private and Byzantine resilient in the heterogeneous data distribution settings. Quantization-based SGD solvers have been widely adopted in FL and the recently proposed SIGNSGD with majority vote shows a promising direction. However, no existing methods enjoy all the aforementioned properties. In this paper, we propose an intuitively-simple yet theoretically-sound method based on SIGNSGD to bridge the gap. We present Stochastic-Sign SGD which utilizes novel stochastic-sign based gradient compressors enabling the aforementioned properties in a unified framework. We also present an error-feedback variant of the proposed Stochastic-Sign SGD which further improves the learning performance in FL. We test the proposed method with extensive experiments using deep neural networks on the MNIST dataset and the CIFAR-10 dataset. The experimental results corroborate the effectiveness of the proposed method.
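    A minimal sketch of the two ingredients described above, under illustrative choices of the gradient bound B and worker noise: a stochastic 1-bit compressor whose output is an unbiased (scaled) estimate of each gradient coordinate, and a coordinate-wise majority vote at the server.

```python
import numpy as np

def stochastic_sign(grad, B, rng):
    # Each coordinate becomes +1 with probability (B + g_i) / (2B) and -1 otherwise,
    # so E[q_i] = g_i / B: the 1-bit message is an unbiased scaled gradient estimate.
    g = np.clip(grad, -B, B)
    p = (B + g) / (2.0 * B)
    return np.where(rng.random(g.shape) < p, 1.0, -1.0)

def majority_vote(signs):
    # Server aggregates the 1-bit messages with a coordinate-wise majority vote.
    return np.sign(np.sum(signs, axis=0))

rng = np.random.default_rng(0)
true_grad = np.array([0.6, -0.4, 0.2, -0.8])
workers = [stochastic_sign(true_grad + rng.normal(0, 0.1, 4), B=1.0, rng=rng) for _ in range(31)]
print(majority_vote(np.stack(workers)))   # recovers sign(true_grad) with high probability
```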
    Multi-label Classification via Adaptive Resonance Theory-based Clustering. (arXiv:2103.01511v3 [cs.LG] UPDATED)
    (2 min) This paper proposes a multi-label classification algorithm capable of continual learning by applying an Adaptive Resonance Theory (ART)-based clustering algorithm and the Bayesian approach for label probability computation. The ART-based clustering algorithm adaptively and continually generates prototype nodes corresponding to given data, and the generated nodes are used as classifiers. The label probability computation independently counts the number of label appearances for each class and calculates the Bayesian probabilities. Thus, the label probability computation can cope with an increase in the number of labels. Experimental results with synthetic and real-world multi-label datasets show that the proposed algorithm has competitive classification performance to other well-known algorithms while realizing continual learning.
    QUEACO: Borrowing Treasures from Weakly-labeled Behavior Data for Query Attribute Value Extraction. (arXiv:2108.08468v3 [cs.CL] UPDATED)
    (3 min) We study the problem of query attribute value extraction, which aims to identify named entities from user queries as diverse surface form attribute values and afterward transform them into formally canonical forms. Such a problem consists of two phases: named entity recognition (NER) and attribute value normalization (AVN). However, existing works only focus on the NER phase but neglect the equally important AVN. To bridge this gap, this paper proposes a unified query attribute value extraction system in e-commerce search named QUEACO, which covers both phases. Moreover, by leveraging large-scale weakly-labeled behavior data, we further improve the extraction performance with less supervision cost. Specifically, for the NER phase, QUEACO adopts a novel teacher-student network, where a teacher network that is trained on the strongly-labeled data generates pseudo-labels to refine the weakly-labeled data for training a student network. Meanwhile, the teacher network can be dynamically adapted by the feedback of the student's performance on strongly-labeled data to maximally denoise the noisy supervision from the weak labels. For the AVN phase, we also leverage the weakly-labeled query-to-attribute behavior data to normalize surface form attribute values from queries into canonical forms from products. Extensive experiments on a real-world large-scale E-commerce dataset demonstrate the effectiveness of QUEACO.
    An empirical evaluation of attention-based multi-head models for improved turbofan engine remaining useful life prediction. (arXiv:2109.01761v1 [cs.LG])
    (2 min) A single unit (head) is the conventional input feature extractor in deep learning architectures trained on multivariate time series signals. The importance of the fixed-dimensional vector representation generated by the single-head network has been demonstrated for industrial machinery condition monitoring and predictive maintenance. However, processing heterogeneous sensor signals with a single head may result in a model that cannot explicitly account for the diversity in time-varying multivariate inputs. This work extends the conventional single-head deep learning models to a more robust form by developing context-specific heads to independently capture the inherent pattern of each sensor reading in multivariate time series signals. Using the turbofan aircraft engine benchmark dataset (CMAPSS), an extensive experiment is performed to verify the effectiveness and benefits of multi-head fully connected neurons, recurrent networks, convolution network, the transformer-style stand-alone attention network, and their variants for remaining useful life estimation. Moreover, the effect of different attention mechanisms on the multi-head models is also evaluated. In addition, each architecture's relative advantage and computational overhead are analyzed. Results show that utilizing the attention layer is task-sensitive and model-dependent, as it does not provide consistent improvement across the models investigated. The result is further compared with five state-of-the-art models, and the comparison shows that a relatively simple multi-head architecture performs better than the state-of-the-art models. The results presented in this study demonstrate the importance of multi-head models and attention mechanisms to improved understanding of the remaining useful life of industrial assets.
    Multi-View Spatial-Temporal Graph Convolutional Networks with Domain Generalization for Sleep Stage Classification. (arXiv:2109.01824v1 [eess.SP])
    (2 min) Sleep stage classification is essential for sleep assessment and disease diagnosis. Although previous attempts to classify sleep stages have achieved high classification performance, several challenges remain open: 1) How to effectively utilize time-varying spatial and temporal features from multi-channel brain signals remains challenging. Prior works have not been able to fully utilize the spatial topological information among brain regions. 2) Due to the many differences found in individual biological signals, how to overcome the differences of subjects and improve the generalization of deep neural networks is important. 3) Most deep learning methods ignore the interpretability of the model to the brain. To address the above challenges, we propose a multi-view spatial-temporal graph convolutional networks (MSTGCN) with domain generalization for sleep stage classification. Specifically, we construct two brain view graphs for MSTGCN based on the functional connectivity and physical distance proximity of the brain regions. The MSTGCN consists of graph convolutions for extracting spatial features and temporal convolutions for capturing the transition rules among sleep stages. In addition, attention mechanism is employed for capturing the most relevant spatial-temporal information for sleep stage classification. Finally, domain generalization and MSTGCN are integrated into a unified framework to extract subject-invariant sleep features. Experiments on two public datasets demonstrate that the proposed model outperforms the state-of-the-art baselines.
    Learning-based decentralized offloading decision making in an adversarial environment. (arXiv:2104.12827v3 [cs.LG] UPDATED)
    (2 min) Vehicular fog computing (VFC) pushes the cloud computing capability to the distributed fog nodes at the edge of the Internet, enabling compute-intensive and latency-sensitive computing services for vehicles through task offloading. However, a heterogeneous mobility environment introduces uncertainties in terms of resource supply and demand, which are inevitable bottlenecks for the optimal offloading decision. Also, these uncertainties bring extra challenges to task offloading under the oblivious adversary attack and data privacy risks. In this article, we develop a new adversarial online learning algorithm with bandit feedback based on adversarial multi-armed bandit theory, to enable scalable and low-complexity offloading decision making. Specifically, we focus on optimizing fog node selection with the aim of minimizing the offloading service costs in terms of delay and energy. The key is to implicitly tune the exploration bonus in the selection process and the assessment rules of the designed algorithm, taking into account volatile resource supply and demand. We theoretically prove that the input-size dependent selection rule allows the algorithm to choose a suitable fog node without exploring the sub-optimal actions, and that an appropriate score patching rule allows it to quickly adapt to evolving circumstances, which reduces variance and bias simultaneously, thereby achieving a better exploitation-exploration balance. Simulation results verify the effectiveness and robustness of the proposed algorithm.
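    For reference, the classical adversarial-bandit baseline this line of work builds on looks roughly like the EXP3 loop below; the paper's input-size dependent selection rule and score patching are its own refinements and are not reproduced here. The cost function and parameters are placeholders.

```python
import numpy as np

def exp3(num_nodes, rounds, cost_fn, gamma=0.1, seed=0):
    """Minimal EXP3-style fog node selection (a sketch, not the paper's tuned algorithm).
    cost_fn(t, node) returns the observed offloading cost in [0, 1] for the chosen node."""
    rng = np.random.default_rng(seed)
    weights = np.ones(num_nodes)
    choices = []
    for t in range(rounds):
        probs = (1 - gamma) * weights / weights.sum() + gamma / num_nodes
        node = rng.choice(num_nodes, p=probs)
        reward = 1.0 - cost_fn(t, node)      # bandit feedback: only the chosen node is observed
        est = reward / probs[node]           # importance-weighted reward estimate
        weights[node] *= np.exp(gamma * est / num_nodes)
        choices.append(node)
    return choices

# toy usage: node 2 is consistently cheaper, so later choices concentrate on it
print(exp3(4, 200, lambda t, n: 0.2 if n == 2 else 0.8)[-10:])
```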
    NAS-OoD: Neural Architecture Search for Out-of-Distribution Generalization. (arXiv:2109.02038v1 [cs.LG])
    (2 min) Recent advances in Out-of-Distribution (OoD) generalization reveal the robustness of deep learning models against distribution shifts. However, existing works focus on OoD algorithms, such as invariant risk minimization, domain generalization, or stable learning, without considering the influence of deep model architectures on OoD generalization, which may lead to sub-optimal performance. Neural Architecture Search (NAS) methods search for architectures based on their performance on the training data, which may result in poor generalization for OoD tasks. In this work, we propose robust Neural Architecture Search for OoD generalization (NAS-OoD), which optimizes the architecture with respect to its performance on generated OoD data by gradient descent. Specifically, a data generator is learned to synthesize OoD data by maximizing losses computed by different neural architectures, while the goal for architecture search is to find the optimal architecture parameters that minimize the synthetic OoD data losses. The data generator and the neural architecture are jointly optimized in an end-to-end manner, and the minimax training process effectively discovers robust architectures that generalize well for different distribution shifts. Extensive experimental results show that NAS-OoD achieves superior performance on various OoD generalization benchmarks with deep models having far fewer parameters. In addition, on a real industry dataset, the proposed NAS-OoD method reduces the error rate by more than 70% compared with the state-of-the-art method, demonstrating the proposed method's practicality for real applications.
    Efficient Privacy Preserving Edge Computing Framework for Image Classification. (arXiv:2005.04563v2 [cs.LG] UPDATED)
    (2 min) To extract knowledge from the large amounts of data collected by edge devices, the traditional cloud-based approach that requires data upload may not be feasible due to communication bandwidth limitations as well as privacy and security concerns of end users. To address these challenges, a novel privacy-preserving edge computing framework is proposed in this paper for image classification. Specifically, an autoencoder is trained in an unsupervised manner at each edge device individually, and the obtained latent vectors are then transmitted to the edge server for the training of a classifier. This framework reduces the communication overhead and protects the data of the end users. Compared to federated learning, the training of the classifier in the proposed framework is not subject to the constraints of the edge devices, and the autoencoder can be trained independently at each edge device without any server involvement. Furthermore, the privacy of the end users' data is protected by transmitting latent vectors without the additional cost of encryption. Experimental results provide insights on the image classification performance vs. various design parameters such as the data compression ratio of the autoencoder and the model complexity.
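    A short PyTorch sketch of the pipeline described above, with illustrative layer sizes: each device trains an autoencoder on its own unlabeled data, and only the low-dimensional latent vectors (plus labels) are shipped to the edge server, which trains the classifier.

```python
import torch
import torch.nn as nn

class EdgeAutoencoder(nn.Module):
    # Trained locally and unsupervised on each edge device; only encoder outputs leave the device.
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))
    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_edge_autoencoder(model, data, epochs=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x in data:                     # data: iterable of (batch, in_dim) tensors
            opt.zero_grad()
            loss = loss_fn(model(x), x)    # reconstruction loss only; no labels, no server
            loss.backward()
            opt.step()
    return model

# toy usage on random data standing in for one device's local images
device_data = [torch.rand(32, 784) for _ in range(10)]
edge_model = train_edge_autoencoder(EdgeAutoencoder(), device_data, epochs=2)
latents = torch.cat([edge_model.encoder(x) for x in device_data]).detach()  # what gets transmitted
print(latents.shape)
```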
    Federated Learning using Smart Contracts on Blockchains, based on Reward Driven Approach. (arXiv:2107.10243v2 [cs.CR] UPDATED)
    (2 min) Over the recent years, Federated machine learning continues to gain interest and momentum where there is a need to draw insights from data while preserving the data provider's privacy. However, one among other existing challenges in the adoption of federated learning has been the lack of fair, transparent and universally agreed incentivization schemes for rewarding the federated learning contributors. Smart contracts on a blockchain network provide transparent, immutable and independently verifiable proofs by all participants of the network. We leverage this open and transparent nature of smart contracts on a blockchain to define incentivization rules for the contributors, which is based on a novel scalar quantity - federated contribution. Such a smart contract based reward-driven model has the potential to revolutionize the federated learning adoption in enterprises. Our contribution is two-fold: first is to show how smart contract based blockchain can be a very natural communication channel for federated learning. Second, leveraging this infrastructure, we can show how an intuitive measure of each agents' contribution can be built and integrated with the life cycle of the training and reward process.
    Information Theory-Guided Heuristic Progressive Multi-View Coding. (arXiv:2109.02344v1 [cs.CV])
    (2 min) Multi-view representation learning captures comprehensive information from multiple views of a shared context. Recent works intuitively apply contrastive learning (CL) to learn representations in a pairwise manner, which still has several limitations: view-specific noise is not filtered when learning view-shared representations; fake negative pairs, where the negative terms are actually within the same class as the positive, are treated the same as real negative pairs; and evenly measuring the similarities between terms might interfere with optimization. Importantly, few works study the theoretical framework of generalized self-supervised multi-view learning, especially for more than two views. To this end, we rethink the existing multi-view learning paradigm from the information-theoretical perspective and then propose a novel information-theoretical framework for generalized multi-view learning. Guided by it, we build a multi-view coding method with a three-tier progressive architecture, namely Information theory-guided heuristic Progressive Multi-view Coding (IPMC). In the distribution-tier, IPMC aligns the distribution between views to reduce view-specific noise. In the set-tier, IPMC builds self-adjusted pools for contrasting, which utilizes a view filter to adaptively modify the pools. Lastly, in the instance-tier, we adopt a designed unified loss to learn discriminative representations and reduce gradient interference. Theoretically and empirically, we demonstrate the superiority of IPMC over state-of-the-art methods.
    Deep learning facilitates fully automated brain image registration of optoacoustic tomography and magnetic resonance imaging. (arXiv:2109.01880v1 [eess.IV])
    (2 min) Multi-spectral optoacoustic tomography (MSOT) is an emerging optical imaging method providing multiplex molecular and functional information from the rodent brain. It can be greatly augmented by magnetic resonance imaging (MRI) that offers excellent soft-tissue contrast and high-resolution brain anatomy. Nevertheless, registration of multi-modal images remains challenging, chiefly due to the entirely different image contrast rendered by these modalities. Previously reported registration algorithms mostly relied on manual user-dependent brain segmentation, which compromised data interpretation and accurate quantification. Here we propose a fully automated registration method for MSOT-MRI multimodal imaging empowered by deep learning. The automated workflow includes neural network-based image segmentation to generate suitable masks, which are subsequently registered using an additional neural network. Performance of the algorithm is showcased with datasets acquired by cross-sectional MSOT and high-field MRI preclinical scanners. The automated registration method is further validated with manual and half-automated registration, demonstrating its robustness and accuracy.
    Estimating the probabilities of causation via deep monotonic twin networks. (arXiv:2109.01904v1 [cs.LG])
    (2 min) There has been much recent work using machine learning to answer causal queries. Most focus on interventional queries, such as the conditional average treatment effect. However, as noted by Pearl, interventional queries only form part of a larger hierarchy of causal queries, with counterfactuals sitting at the top. Despite this, our community has not fully succeeded in adapting machine learning tools to answer counterfactual queries. This work addresses this challenge by showing how to implement twin network counterfactual inference -- an alternative to abduction, action, & prediction counterfactual inference -- with deep learning to estimate counterfactual queries. We show how the graphical nature of twin networks makes them particularly amenable to deep learning, yielding simple neural network architectures that, when trained, are capable of counterfactual inference. Importantly, we show how to enforce known identifiability constraints during training, ensuring the answer to each counterfactual query is uniquely determined. We demonstrate our approach by using it to accurately estimate the probabilities of causation -- important counterfactual queries that quantify the degree to which one event was a necessary or sufficient cause of another -- on both synthetic and real data.
    Using Differentiable Programming for Flexible Statistical Modeling. (arXiv:2012.05722v2 [cs.LG] UPDATED)
    (2 min) Differentiable programming has recently received much interest as a paradigm that facilitates taking gradients of computer programs. While the corresponding flexible gradient-based optimization approaches so far have been used predominantly for deep learning or enriching the latter with modeling components, we want to demonstrate that they can also be useful for statistical modeling per se, e.g., for quick prototyping when classical maximum likelihood approaches are challenging or not feasible. In an application from a COVID-19 setting, we utilize differentiable programming to quickly build and optimize a flexible prediction model adapted to the data quality challenges at hand. Specifically, we develop a regression model, inspired by delay differential equations, that can bridge temporal gaps of observations in the central German registry of COVID-19 intensive care cases for predicting future demand. With this exemplary modeling challenge, we illustrate how differentiable programming can enable simple gradient-based optimization of the model by automatic differentiation. This allowed us to quickly prototype a model under time pressure that outperforms simpler benchmark models. We thus exemplify the potential of differentiable programming also outside deep learning applications, to provide more options for flexible applied statistical modeling.
    Bacteriophage classification for assembled contigs using Graph Convolutional Network. (arXiv:2102.03746v2 [q-bio.GN] UPDATED)
    (2 min) Motivation: Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biology of microbes. As the most abundant biological entities on the planet, the number of discovered phages is only the tip of the iceberg. Recently, many new phages have been revealed using high throughput sequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-like sequences, there is a serious lag in taxonomic classification of phages. High diversity, abundance, and limited known phages pose great challenges for taxonomic analysis. In particular, alignment-based tools have difficulty in classifying fast accumulating contigs assembled from metagenomic data. Results: In this work, we present a novel semi-supervised learning model, named PhaGCN, to conduct taxonomic classification for phage contigs. In this learning model, we construct a knowledge graph by combining the DNA sequence features learned by convolutional neural network (CNN) and protein sequence similarity gained from gene-sharing network. Then we apply graph convolutional network (GCN) to utilize both the labeled and unlabeled samples in training to enhance the learning ability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show that our method competes favorably against available phage classification tools.
    Trainable Discrete Feature Embeddings for Variational Quantum Classifier. (arXiv:2106.09415v2 [quant-ph] UPDATED)
    (2 min) Quantum classifiers provide sophisticated embeddings of input data in Hilbert space promising quantum advantage. The advantage stems from quantum feature maps encoding the inputs into quantum states with variational quantum circuits. A recent work shows how to map discrete features with fewer quantum bits using Quantum Random Access Coding (QRAC), an important primitive to encode binary strings into quantum states. We propose a new method to embed discrete features with trainable quantum circuits by combining QRAC and a recently proposed strategy for training quantum feature maps called quantum metric learning. We show that the proposed trainable embedding not only requires as few qubits as QRAC but also overcomes the limitations of QRAC to classify inputs whose classes are based on hard Boolean functions. We numerically demonstrate its use in variational quantum classifiers to achieve better performances in classifying real-world datasets, and thus its possibility to leverage near-term quantum computers for quantum machine learning.
    Model retraining and information sharing in a supply chain with long-term fluctuating demands. (arXiv:2109.01784v1 [physics.soc-ph])
    (2 min) Demand forecasting based on empirical data is a viable approach for optimizing a supply chain. However, in this approach, a model constructed from past data occasionally becomes outdated due to long-term changes in the environment, in which case the model should be updated (i.e., retrained) using the latest data. In this study, we examine the effects of updating models in a supply chain using a minimal setting. We demonstrate that when each party in the supply chain has its own forecasting model, uncoordinated model retraining causes the bullwhip effect even if a very simple replenishment policy is applied. Our results also indicate that sharing the forecasting model among the parties involved significantly reduces the bullwhip effect.
    Customer 360-degree Insights in Predicting Chronic Diabetes. (arXiv:2109.01863v1 [cs.LG])
    (2 min) Chronic diseases such as diabetes are quite prevalent in the world and are responsible for a significant number of deaths per year. In addition, treatments for such chronic diseases account for a high healthcare cost. However, research has shown that diabetes can be proactively managed and prevented while lowering these healthcare costs. We have mined a sample of ten million customers' 360-degree data representing the state of Texas, USA, with attributes current as of late 2018. The sample received from a market research data vendor has over 1000 customer attributes consisting of demography, lifestyle, and in some cases self-reported chronic conditions. In this study, we have developed a classification model to predict chronic diabetes with an accuracy of 80%. We demonstrate a use case where a large volume of 360-degree customer data can be useful to predict and hence proactively prevent chronic diseases such as diabetes.
    Robust fine-tuning of zero-shot models. (arXiv:2109.01903v1 [cs.CV])
    (2 min) Large pre-trained models such as CLIP offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning approaches substantially improve accuracy in-distribution, they also reduce out-of-distribution robustness. We address this tension by introducing a simple and effective method for improving robustness: ensembling the weights of the zero-shot and fine-tuned models. Compared to standard fine-tuning, the resulting weight-space ensembles provide large accuracy improvements out-of-distribution, while matching or improving in-distribution accuracy. On ImageNet and five derived distribution shifts, weight-space ensembles improve out-of-distribution accuracy by 2 to 10 percentage points while increasing in-distribution accuracy by nearly 1 percentage point relative to standard fine-tuning. These improvements come at no additional computational cost during fine-tuning or inference.
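    The core operation, ensembling in weight space, is a parameter-wise interpolation of the two checkpoints. A minimal sketch (the mixing coefficient and model names are placeholders):

```python
import torch

def weight_space_ensemble(zero_shot_state, fine_tuned_state, alpha=0.5):
    """Interpolate the two checkpoints parameter by parameter: alpha=0 recovers the
    zero-shot model, alpha=1 the fine-tuned one. Both state dicts must come from
    the same architecture."""
    return {name: (1 - alpha) * zero_shot_state[name] + alpha * fine_tuned_state[name]
            for name in zero_shot_state}

# usage (model names are placeholders):
# model.load_state_dict(weight_space_ensemble(zero_shot.state_dict(), fine_tuned.state_dict(), alpha=0.7))
```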
    Efficient On-Chip Learning for Optical Neural Networks Through Power-Aware Sparse Zeroth-Order Optimization. (arXiv:2012.11148v3 [cs.ET] UPDATED)
    (0 min) Optical neural networks (ONNs) have demonstrated record-breaking potential in high-performance neuromorphic computing due to their ultra-high execution speed and low energy consumption. However, current learning protocols fail to provide scalable and efficient solutions to photonic circuit optimization in practical applications. In this work, we propose a novel on-chip learning framework to release the full potential of ONNs for power-efficient in situ training. Instead of deploying implementation-costly back-propagation, we directly optimize the device configurations with computation budgets and power constraints. We are the first to model the ONN on-chip learning as a resource-constrained stochastic noisy zeroth-order optimization problem, and propose a novel mixed-training strategy with two-level sparsity and power-aware dynamic pruning to offer a scalable on-chip training solution in practical ONN deployment. Compared with previous methods, we are the first to optimize over 2,500 optical components on chip. We can achieve much better optimization stability, 3.7x-7.6x higher efficiency, and save >90% power under practical device variations and thermal crosstalk.
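    As a generic illustration of resource-constrained zeroth-order training, the sketch below takes simultaneous-perturbation steps that need only two forward evaluations of the loss per update and no back-propagation; the paper's two-level sparsity and power-aware pruning are not modeled here.

```python
import numpy as np

def spsa_step(params, loss_fn, lr=0.01, delta=0.01, rng=None):
    """One simultaneous-perturbation zeroth-order step: estimate the gradient from two
    loss evaluations under a random +/- perturbation (a generic sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    direction = rng.choice([-1.0, 1.0], size=params.shape)
    g_hat = (loss_fn(params + delta * direction) - loss_fn(params - delta * direction)) / (2 * delta) * direction
    return params - lr * g_hat

# toy usage: minimize a quadratic "device configuration" objective without gradients
rng = np.random.default_rng(0)
theta = np.ones(5)
for _ in range(500):
    theta = spsa_step(theta, lambda p: np.sum((p - 0.3) ** 2), rng=rng)
print(theta.round(2))   # approaches the optimum at 0.3
```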
    A Critical Review of the state-of-the-art on Deep Neural Networks for Blood Glucose Prediction in Patients with Diabetes. (arXiv:2109.02178v1 [q-bio.QM])
    (0 min) This article compares ten recently proposed neural networks and proposes two ensemble neural network-based models for blood glucose prediction. All of them are tested under the same dataset, preprocessing workflow, and tools using the OhioT1DM Dataset at three different prediction horizons: 30, 60, and 120 minutes. We compare their performance using the most common metrics in blood glucose prediction and rank the best-performing ones using three methods devised for the statistical comparison of the performance of multiple algorithms: scmamp, model confidence set, and superior predictive ability. Our analysis highlights those models with the highest probability of being the best predictors, estimates the increase in error of the models that perform more poorly with respect to the best ones, and provides a guide for their use in clinical practice.
    Learning Bayesian Networks Under Sparsity Constraints: A Parameterized Complexity Analysis. (arXiv:2004.14724v3 [cs.DS] UPDATED)
    (0 min) We study the problem of learning the structure of an optimal Bayesian network when additional constraints are posed on the network or on its moralized graph. More precisely, we consider the constraint that the network or its moralized graph are close, in terms of vertex or edge deletions, to a sparse graph class $\Pi$. For example, we show that learning an optimal network whose moralized graph has vertex deletion distance at most $k$ from a graph with maximum degree 1 can be computed in polynomial time when $k$ is constant. This extends previous work that gave an algorithm with such a running time for the vertex deletion distance to edgeless graphs [Korhonen & Parviainen, NIPS 2015]. We then show that further extensions or improvements are presumably impossible. For example, we show that learning optimal networks where the network or its moralized graph have maximum degree $2$ or connected components of size at most $c$, $c\ge 3$, is NP-hard. Finally, we show that learning an optimal network with at most $k$ edges in the moralized graph presumably has no $f(k)\cdot |I|^{O(1)}$-time algorithm and that, in contrast, an optimal network with at most $k$ arcs can be computed in $2^{O(k)}\cdot |I|^{O(1)}$ time where $|I|$ is the total input size.
    Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond. (arXiv:1710.03113v5 [cs.LG] UPDATED)
    (0 min) The rapid emergence of high-dimensional data in various areas has brought new challenges to current ensemble clustering research. To deal with the curse of dimensionality, recently considerable efforts in ensemble clustering have been made by means of different subspace-based techniques. However, besides the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains a surprisingly open problem in ensemble clustering how to create and aggregate a large population of diversified metrics, and furthermore, how to jointly investigate the multi-level diversity in the large populations of metrics, subspaces, and clusters in a unified framework. To tackle this problem, this paper proposes a novel multidiversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed. Further, an entropy-based criterion is utilized to explore the cluster-wise diversity in ensembles, based on which three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments are conducted on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, which demonstrate the superiority of our algorithms over the state-of-the-art. The source code is available at https://github.com/huangdonghere/MDEC.
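    A small sketch of the metric-subspace sampling step under assumed parameter ranges: each draw pairs a random feature subspace with a randomly scaled exponential (RBF-style) similarity kernel, and repeating the draw yields the pool of similarity matrices from which diversified base clusterings are built.

```python
import numpy as np

def diversified_similarity(X, rng, subspace_frac=0.5, sigma_range=(0.1, 10.0)):
    """Draw one metric-subspace pair and return its similarity matrix.
    The kernel parameterization and ranges here are illustrative, not the paper's exact choices."""
    n, d = X.shape
    dims = rng.choice(d, size=max(1, int(subspace_frac * d)), replace=False)  # random subspace
    sigma = rng.uniform(*sigma_range)                                         # random kernel scale
    sq_dists = np.sum((X[:, None, dims] - X[None, :, dims]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
pool = [diversified_similarity(X, rng) for _ in range(10)]  # e.g. feed each to spectral clustering
print(len(pool), pool[0].shape)
```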
    An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments. (arXiv:2002.11561v7 [cs.SD] UPDATED)
    (0 min) The problem of training with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually show very good performance when trained with large datasets. However, in many applications, it is not possible to obtain such a high number of samples. In the image domain, typical FSL applications include those related to face recognition. In the audio domain, music fraud or speaker recognition can clearly benefit from FSL methods. This paper deals with the application of FSL to the detection of specific and intentional acoustic events given by different types of sound alarms, such as door bells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments where many events corresponding to a wide variety of sound classes take place. Therefore, the detection of such alarms in a practical scenario can be considered an open-set recognition (OSR) problem. To address the lack of a dedicated public dataset for audio FSL, researchers usually make modifications on other available datasets. This paper is aimed at providing the audio recognition community with a carefully annotated dataset (https://zenodo.org/record/3689288) for FSL in an OSR context comprised of 1360 clips from 34 classes divided into pattern sounds and unwanted sounds. To facilitate and promote research in this area, results with state-of-the-art baseline systems based on transfer learning are also presented.
    A Transformer-based Model to Detect Phishing URLs. (arXiv:2109.02138v1 [cs.CR])
    (0 min) Phishing attacks are among the emerging security issues that have recently drawn significant attention in the cyber security community. There are numerous existing approaches for phishing URL detection. However, malicious URL detection is still a research hotspot because attackers can bypass newly introduced detection mechanisms by changing their tactics. This paper introduces a transformer-based malicious URL detection model, which has significant accuracy and outperforms current detection methods. We conduct experiments and compare our model with six existing classical detection models. Experiments demonstrate that our transformer-based model is the best-performing model from all perspectives among the seven models and achieves 97.3% detection accuracy.
    Towards high-accuracy deep learning inference of compressible turbulent flows over aerofoils. (arXiv:2109.02183v1 [physics.flu-dyn])
    (0 min) The present study investigates the accurate inference of Reynolds-averaged Navier-Stokes solutions for the compressible flow over aerofoils in two dimensions with a deep neural network. Our approach yields networks that learn to generate precise flow fields for varying body-fitted, structured grids by providing them with an encoding of the corresponding mapping to a canonical space for the solutions. We apply the deep neural network model to a benchmark case of incompressible flow at randomly given angles of attack and Reynolds numbers and achieve an improvement of more than an order of magnitude compared to previous work. Further, for transonic flow cases, the deep neural network model accurately predicts complex flow behaviour at high Reynolds numbers, such as shock wave/boundary layer interaction, and quantitative distributions like pressure coefficient, skin friction coefficient as well as wake total pressure profiles downstream of aerofoils. The proposed deep learning method significantly speeds up the predictions of flow fields and shows promise for enabling fast aerodynamic designs.
    Event-Based Communication in Multi-Agent Distributed Q-Learning. (arXiv:2109.01417v2 [cs.AI] UPDATED)
    (0 min) We present in this work an approach to reduce the communication of information needed on a multi-agent learning system inspired by Event Triggered Control (ETC) techniques. We consider a baseline scenario of a distributed Q-learning problem on a Markov Decision Process (MDP). Following an event-based approach, N agents explore the MDP and communicate experiences to a central learner only when necessary, which performs updates of the actor Q functions. We analyse the convergence guarantees retained with respect to a regular Q-learning algorithm, and present experimental results showing that event-based communication results in a substantial reduction of data transmission rates in such distributed systems. Additionally, we discuss what effects (desired and undesired) these event-based approaches have on the learning processes studied, and how they can be applied to more complex multi-agent learning systems.
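    A sketch of the agent-side triggering logic under an assumed trigger rule (TD-error thresholding); the paper analyses its own event condition and convergence guarantees, which are not reproduced here.

```python
import numpy as np

class EventTriggeredAgent:
    """Agent-side sketch: explore the MDP and forward a transition to the central
    learner only when the local TD error exceeds a threshold (the triggering 'event').
    q_local is a periodically synced copy of the central Q-table."""
    def __init__(self, q_local, threshold=0.1, gamma=0.99):
        self.q, self.threshold, self.gamma = q_local, threshold, gamma
        self.sent = 0
    def maybe_send(self, s, a, r, s_next, send_fn):
        td_error = r + self.gamma * np.max(self.q[s_next]) - self.q[s, a]
        if abs(td_error) > self.threshold:   # communicate only informative experiences
            send_fn((s, a, r, s_next))
            self.sent += 1

# toy usage: a surprising reward triggers one transmission
q = np.zeros((5, 2))
agent = EventTriggeredAgent(q)
agent.maybe_send(0, 1, 1.0, 2, send_fn=lambda exp: None)
print(agent.sent)   # 1
```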
    Optimal transport weights for causal inference. (arXiv:2109.01991v1 [stat.ME])
    (0 min) Weighting methods are a common tool to de-bias estimates of causal effects. And though there are an increasing number of seemingly disparate methods, many of them can be folded into one unifying regime: causal optimal transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between a source and target population. Our approach is model-free but can also incorporate moments or any other important functions of covariates that the researcher desires to balance. We find that the causal optimal transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control study examining the effect of misoprostol versus oxytocin for treatment of post-partum hemorrhage.
    Variational Physics Informed Neural Networks: the role of quadratures and test functions. (arXiv:2109.02035v1 [math.NA])
    (0 min) In this work we analyze how Gaussian or Newton-Cotes quadrature rules of different precisions and piecewise polynomial test functions of different degrees affect the convergence rate of Variational Physics Informed Neural Networks (VPINN) with respect to mesh refinement, while solving elliptic boundary-value problems. Using a Petrov-Galerkin framework relying on an inf-sup condition, we derive an a priori error estimate in the energy norm between the exact solution and a suitable high-order piecewise interpolant of a computed neural network. Numerical experiments confirm the theoretical predictions, and also indicate that the error decay follows the same behavior when the neural network is not interpolated. Our results suggest, somehow counterintuitively, that for smooth solutions the best strategy to achieve a high decay rate of the error consists in choosing test functions of the lowest polynomial degree, while using quadrature formulas of suitably high precision.
    Fair Federated Learning for Heterogeneous Face Data. (arXiv:2109.02351v1 [cs.LG])
    (0 min) We consider the problem of achieving fair classification in Federated Learning (FL) under data heterogeneity. Most of the approaches proposed for fair classification require diverse data that represent the different demographic groups involved. In contrast, it is common for each client to own data that represents only a single demographic group. Hence the existing approaches cannot be adopted for fair classification models at the client level. To resolve this challenge, we propose several aggregation techniques. We empirically validate these techniques by comparing the resulting fairness metrics and accuracy on CelebA, UTK, and FairFace datasets.
    Anomaly Detection on IT Operation Series via Online Matrix Profile. (arXiv:2108.12093v2 [cs.LG] UPDATED)
    (2 min) Anomaly detection on time series is a fundamental task in monitoring the Key Performance Indicators (KPIs) of IT systems. Many of the existing approaches in the literature show good performance while requiring a lot of training resources. In this paper, the online matrix profile, which requires no training, is proposed to address this issue. The anomalies are detected by referring to the past subsequence that is the closest to the current one. The distance significance is introduced based on the online matrix profile, which demonstrates a prominent pattern when an anomaly occurs. Another training-free approach spectral residual is integrated into our approach to further enhance the detection accuracy. Moreover, the proposed approach is sped up by at least four times for long time series by the introduced cache strategy. In comparison to the existing approaches, the online matrix profile makes a good trade-off between accuracy and efficiency. More importantly, it is generic to various types of time series in the sense that it works without the constraint from any trained model.
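    The core scoring rule can be written down in a few lines: for each time step, compute the z-normalized distance between the current length-m subsequence and its nearest non-overlapping past subsequence, and flag large values. This brute-force version is quadratic; the paper's contribution is computing it online and incrementally, with the spectral-residual combination and caching on top.

```python
import numpy as np

def nearest_past_distance(series, m):
    """Distance from each current length-m subsequence to its nearest past subsequence;
    large values flag anomalies. A direct O(n^2) sketch of the matrix-profile idea."""
    def znorm(x):
        s = x.std()
        return (x - x.mean()) / s if s > 0 else x - x.mean()
    n = len(series)
    scores = np.full(n, np.nan)
    for i in range(2 * m, n - m + 1):
        cur = znorm(series[i:i + m])
        past = [np.linalg.norm(cur - znorm(series[j:j + m])) for j in range(0, i - m + 1)]
        scores[i] = min(past)
    return scores

t = np.sin(np.linspace(0, 20 * np.pi, 500))
t[300:310] += 3.0                                   # inject an anomaly
print(np.nanargmax(nearest_past_distance(t, m=25)))  # index near the injected window
```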
    Transformer Networks for Data Augmentation of Human Physical Activity Recognition. (arXiv:2109.01081v2 [cs.LG] UPDATED)
    (0 min) Data augmentation is a widely used technique in classification to increase the data used in training. It improves generalization and reduces the amount of annotated human activity data needed for training, which cuts the labour and time required to build the dataset. Sensor time-series data, unlike images, cannot be augmented by computationally simple transformation algorithms. State-of-the-art models like Recurrent Generative Adversarial Networks (RGAN) are used to generate realistic synthetic data. In this paper, transformer-based generative adversarial networks, which have global attention on the data, are compared with RGAN on the PAMAP2 and Real World Human Activity Recognition data sets. The newer approach provides improvements in time and savings in the computational resources needed for data augmentation compared with the previous approach.
    Hindsight Reward Tweaking via Conditional Deep Reinforcement Learning. (arXiv:2109.02332v1 [cs.LG])
    (0 min) Designing optimal reward functions has long been desired but remains extremely difficult in reinforcement learning (RL). When it comes to modern complex tasks, sophisticated reward functions are widely used to simplify policy learning, yet even a tiny adjustment to them is expensive to evaluate due to the drastically increasing cost of training. To this end, we propose a hindsight reward tweaking approach by designing a novel paradigm for deep reinforcement learning to model the influences of reward functions within a near-optimal space. We simply extend the input observation with a condition vector linearly correlated with the effective environment reward parameters and train the model in a conventional manner except for randomizing reward configurations, obtaining a hyper-policy whose characteristics are sensitively regulated over the condition space. We demonstrate the feasibility of this approach and study one of its potential applications, policy performance boosting, with multiple MuJoCo tasks.
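    The conditioning mechanism itself is simple to sketch (parameter ranges here are placeholders): append the effective reward parameters to the observation and randomize them across training episodes so a single hyper-policy covers a neighbourhood of reward functions.

```python
import numpy as np

def augment_observation(obs, reward_weights):
    # Condition the policy on the effective reward parameters by appending them to the observation.
    return np.concatenate([obs, reward_weights])

def sample_training_episode_config(num_terms, rng, low=0.5, high=1.5):
    # During training the reward weights are randomized each episode; at evaluation a fixed
    # weight vector is chosen to read off the corresponding behaviour of the hyper-policy.
    return rng.uniform(low, high, size=num_terms)

rng = np.random.default_rng(0)
w = sample_training_episode_config(3, rng)
obs = np.zeros(8)
print(augment_observation(obs, w).shape)   # (11,) -- observation plus condition vector
```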
    Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context. (arXiv:2009.08978v3 [cs.IR] UPDATED)
    (0 min) Recommender systems research tends to evaluate model performance offline and on randomly sampled targets, yet the same systems are later used to predict user behavior sequentially from a fixed point in time. Simulating online recommender system performance is notoriously difficult and the discrepancy between online and offline behaviors is typically not accounted for in offline evaluations. This disparity permits weaknesses to go unnoticed until the model is deployed in a production setting. In this paper, we first demonstrate how omitting temporal context when evaluating recommender system performance leads to false confidence. To overcome this, we postulate that offline evaluation protocols can only model real-life use-cases if they account for temporal context. Next, we propose a training procedure to further embed the temporal context in existing models. We use a multi-objective approach to introduce temporal context into traditionally time-unaware recommender systems and confirm its advantage via the proposed evaluation protocol. Finally, we validate that the Pareto Fronts obtained with the added objective dominate those produced by state-of-the-art models that are only optimized for accuracy on three real-world publicly available datasets. The results show that including our temporal objective can improve recall@20 by up to 20%.
    Inferring feature importance with uncertainties in high-dimensional data. (arXiv:2109.00855v2 [cs.LG] UPDATED)
    (0 min) Estimating feature importance is a significant aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley value based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published feature importance measure of SAGE (Shapley additive global importance) and introduce sub-SAGE which can be estimated without resampling for tree-based models. We argue that the uncertainties can be estimated from bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as high-dimensional genomics data.
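    A generic sketch of attaching bootstrap uncertainty to any per-feature importance estimator; the estimator itself (e.g. a SAGE-style measure or a tree ensemble's importances) is passed in as a function, and the resampling-free sub-SAGE shortcut from the paper is not reproduced.

```python
import numpy as np

def bootstrap_importance(X, y, importance_fn, n_boot=100, seed=0):
    """Point estimates and bootstrap 95% intervals for per-feature importances.
    importance_fn(X, y) must return one importance value per feature."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
        estimates.append(importance_fn(X[idx], y[idx]))
    estimates = np.asarray(estimates)
    return estimates.mean(axis=0), np.percentile(estimates, [2.5, 97.5], axis=0)

# toy usage with a correlation-based stand-in for a real importance estimator
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(size=200)
corr_importance = lambda X, y: np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
print(bootstrap_importance(X, y, corr_importance))
```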
    (M)SLAe-Net: Multi-Scale Multi-Level Attention embedded Network for Retinal Vessel Segmentation. (arXiv:2109.02084v1 [eess.IV])
    (0 min) Segmentation plays a crucial role in diagnosis. Studying the retinal vasculature from fundus images helps identify early signs of many crucial illnesses such as diabetic retinopathy. Due to the varying shape, size, and patterns of retinal vessels, along with artefacts and noise in fundus images, no one-stage method can accurately segment retinal vessels. In this work, we propose a multi-scale, multi-level attention embedded CNN architecture ((M)SLAe-Net) to address the issue of multi-stage processing for robust and precise segmentation of retinal vessels. We do this by extracting features at multiple scales and multiple levels of the network, enabling our model to holistically extract local and global features. Multi-scale features are extracted using our novel dynamic dilated pyramid pooling (D-DPP) module. We also aggregate the features from all the network levels. These effectively resolve the issues of varying shapes and artefacts, and hence the need for multiple stages. To assist in better pixel-level classification, we use the Squeeze and Attention (SA) module, a smartly adapted version of the Squeeze and Excitation (SE) module for segmentation tasks, in our network to facilitate pixel-group attention. Our unique network design and novel D-DPP module with an efficient task-specific loss function for thin vessels enable our model to achieve better cross-data performance. Exhaustive experimental results on DRIVE, STARE, HRF, and CHASE-DB1 show the superiority of our method.
    Fairness via AI: Bias Reduction in Medical Information. (arXiv:2109.02202v1 [cs.AI])
    (0 min) Most Fairness in AI research focuses on exposing biases in AI systems. A broader lens on fairness reveals that AI can serve a greater aspiration: rooting out societal inequities from their source. Specifically, we focus on inequities in health information, and aim to reduce bias in that domain using AI. The AI algorithms under the hood of search engines and social media, many of which are based on recommender systems, have an outsized impact on the quality of medical and health information online. Therefore, embedding bias detection and reduction into these recommender systems serving up medical and health content online could have an outsized positive impact on patient outcomes and wellbeing. In this position paper, we offer the following contributions: (1) we propose a novel framework of Fairness via AI, inspired by insights from medical education, sociology and antiracism; (2) we define a new term, bisinformation, which is related to, but distinct from, misinformation, and encourage researchers to study it; (3) we propose using AI to study, detect and mitigate biased, harmful, and/or false health information that disproportionately hurts minority groups in society; and (4) we suggest several pillars and pose several open problems in order to seed inquiry in this new space. While part (3) of this work specifically focuses on the health domain, the fundamental computer science advances and contributions stemming from research efforts in bias reduction and Fairness via AI have broad implications in all areas of society.
    Impact and dynamics of hate and counter speech online. (arXiv:2009.08392v3 [cs.SI] UPDATED)
    (0 min) Citizen-generated counter speech is a promising way to fight hate speech and promote peaceful, non-polarized discourse. However, there is a lack of large-scale longitudinal studies of its effectiveness for reducing hate speech. To this end, we perform an exploratory analysis of the effectiveness of counter speech using several different macro- and micro-level measures to analyze 180,000 political conversations that took place on German Twitter over four years. We report on the dynamic interactions of hate and counter speech over time and provide insights into whether, as in `classic' bullying situations, organized efforts are more effective than independent individuals in steering online discourse. Taken together, our results build a multifaceted picture of the dynamics of hate and counter speech online. While we make no causal claims due to the complexity of discourse dynamics, our findings suggest that organized hate speech is associated with changes in public discourse and that counter speech -- especially when organized -- may help curb hateful rhetoric in online discourse.
    High-Dimensional Sparse Linear Bandits. (arXiv:2011.04020v2 [stat.ML] UPDATED)
    (0 min) Stochastic linear bandits with high-dimensional sparse features are a practical model for a variety of domains, including personalized medicine and online advertising. We derive a novel $\Omega(n^{2/3})$ dimension-free minimax regret lower bound for sparse linear bandits in the data-poor regime where the horizon is smaller than the ambient dimension and where the feature vectors admit a well-conditioned exploration distribution. This is complemented by a nearly matching upper bound for an explore-then-commit algorithm showing that $\Theta(n^{2/3})$ is the optimal rate in the data-poor regime. The results complement existing bounds for the data-rich regime and provide another example where carefully balancing the trade-off between information and regret is necessary. Finally, we prove a dimension-free $O(\sqrt{n})$ regret upper bound under an additional assumption on the magnitude of the signal for relevant features.
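    To make the explore-then-commit idea concrete, here is a minimal sketch under illustrative assumptions (uniform exploration for roughly n^(2/3) rounds, a Lasso fit, then committing to the estimated best arm); the paper's algorithm, constants, and analysis are more refined.
        import numpy as np
        from sklearn.linear_model import Lasso

        rng = np.random.default_rng(0)
        d, n, k = 100, 2000, 20                       # ambient dimension, horizon, number of arms
        theta = np.zeros(d); theta[:3] = 1.0          # sparse unknown parameter
        arms = rng.normal(size=(k, d)) / np.sqrt(d)

        n_explore = int(n ** (2 / 3))                 # explore for about n^(2/3) rounds
        X, r = [], []
        for t in range(n_explore):
            a = arms[rng.integers(k)]                 # uniform exploration
            X.append(a); r.append(a @ theta + 0.1 * rng.normal())

        theta_hat = Lasso(alpha=0.01).fit(np.array(X), np.array(r)).coef_
        best = arms[np.argmax(arms @ theta_hat)]      # commit for the remaining rounds
        per_round_regret = np.max(arms @ theta) - best @ theta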
    BEAUTY Powered BEAST. (arXiv:2103.00674v3 [stat.ME] UPDATED)
    (0 min) We study nonparametric dependence detection with the proposed binary expansion approximation of uniformity (BEAUTY) approach, which generalizes the celebrated Euler's formula, and approximates the characteristic function of any copula with a linear combination of expectations of binary interactions from marginal binary expansions. This novel theory enables a unification of many important tests through approximations from some quadratic forms of symmetry statistics, where the deterministic weight matrix characterizes the power properties of each test. To achieve a robust power, we study test statistics with data-adaptive weights, referred to as the binary expansion adaptive symmetry test (BEAST). By utilizing the properties of the binary expansion filtration, we show that the Neyman-Pearson test of uniformity can be approximated by an oracle weighted sum of symmetry statistics. The BEAST with this oracle provides a benchmark of feasible power against any alternative by leading all existing tests with a substantial margin. To approach this oracle power, we develop the BEAST through a regularized resampling approximation of the oracle test. The BEAST improves the empirical power of many existing tests against a wide spectrum of common alternatives while providing clear interpretation of the form of dependency upon rejection.
    Online mirror descent and dual averaging: keeping pace in the dynamic case. (arXiv:2006.02585v3 [cs.LG] UPDATED)
    (0 min) Online mirror descent (OMD) and dual averaging (DA) -- two fundamental algorithms for online convex optimization -- are known to have very similar (and sometimes identical) performance guarantees when used with a fixed learning rate. Under dynamic learning rates, however, OMD is provably inferior to DA and suffers a linear regret, even in common settings such as prediction with expert advice. We modify the OMD algorithm through a simple technique that we call stabilization. We give essentially the same abstract regret bound for OMD with stabilization and for DA by modifying the classical OMD convergence analysis in a careful and modular way that allows for straightforward and flexible proofs. Simple corollaries of these bounds show that OMD with stabilization and DA enjoy the same performance guarantees in many applications -- even under dynamic learning rates. We also shed light on the similarities between OMD and DA and show simple conditions under which stabilized-OMD and DA generate the same iterates.
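    For readers unfamiliar with the two baseline algorithms, here is a small sketch of OMD and DA on the probability simplex with the negative-entropy mirror map (exponentiated gradient) under a dynamic learning rate; the paper's stabilization technique is not reproduced here.
        import numpy as np

        def omd_step(w, grad, eta):
            """Online mirror descent: multiplicative update from the last iterate."""
            w = w * np.exp(-eta * grad)
            return w / w.sum()

        def da_step(grad_sum, eta):
            """Dual averaging: rebuild the iterate from the running gradient sum."""
            w = np.exp(-eta * grad_sum)
            return w / w.sum()

        d = 5
        w_omd = np.full(d, 1.0 / d)
        grad_sum = np.zeros(d)
        for t in range(1, 101):
            grad = np.random.default_rng(t).normal(size=d)  # loss gradient revealed at round t
            eta_t = 1.0 / np.sqrt(t)                        # dynamic (decreasing) learning rate
            w_omd = omd_step(w_omd, grad, eta_t)
            grad_sum += grad
            w_da = da_step(grad_sum, eta_t)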
    From Static to Dynamic Prediction: Wildfire Risk Assessment Based on Multiple Environmental Factors. (arXiv:2103.10901v2 [cs.LG] UPDATED)
    (0 min) Wildfire is one of the biggest disasters that frequently occur on the west coast of the United States. Many efforts have been made to understand the causes of the increases in wildfire intensity and frequency in recent years. In this work, we propose static and dynamic prediction models to analyze and assess the areas with high wildfire risks in California by utilizing a multitude of environmental data including population density, Normalized Difference Vegetation Index (NDVI), Palmer Drought Severity Index (PDSI), tree mortality area, tree mortality number, and altitude. Moreover, we focus on a better understanding of the impacts of different factors so as to inform preventive actions. To validate our models and findings, we divide the land of California into 4,242 grids of 0.1 degrees $\times$ 0.1 degrees in latitude and longitude, and compute the risk of each grid based on spatial and temporal conditions. To verify the generalizability of our models, we further expand the scope of wildfire risk assessment from California to Washington without any fine-tuning. By performing counterfactual analysis, we uncover the effects of several possible methods on reducing the number of high-risk wildfires. Taken together, our study has the potential to estimate, monitor, and reduce the risks of wildfires across diverse areas provided that such environmental data are available.
    Visual Recognition with Deep Learning from Biased Image Datasets. (arXiv:2109.02357v1 [cs.CV])
    (0 min) In practice, and especially when training deep neural networks, visual recognition rules are often learned based on various sources of information. On the other hand, the recent deployment of facial recognition systems with uneven predictive performances on different population segments highlights the representativeness issues possibly induced by a naive aggregation of image datasets. Indeed, sampling bias does not vanish simply by considering larger datasets, and ignoring its impact may completely jeopardize the generalization capacity of the learned prediction rules. In this paper, we show how biasing models, originally introduced for nonparametric estimation in (Gill et al., 1988), and recently revisited from the perspective of statistical learning theory in (Laforgue and Cl\'emen\c{c}on, 2019), can be applied to remedy these problems in the context of visual recognition. Based on the (approximate) knowledge of the biasing mechanisms at work, our approach consists in reweighting the observations, so as to form a nearly debiased estimator of the target distribution. One key condition for our method to be theoretically valid is that the supports of the distributions generating the biased datasets at our disposal must overlap, and cover the support of the target distribution. In order to meet this requirement in practice, we propose to use a low-dimensional image representation, shared across the image databases. Finally, we provide numerical experiments highlighting the relevance of our approach whenever the biasing functions are appropriately chosen.
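    The reweighting step can be pictured with the following hedged sketch: assuming the selection probability of each observation under the biasing mechanism is (approximately) known, each example is weighted by its inverse in the training loss; the data, model, and weights below are purely illustrative.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def inverse_bias_weights(selection_prob):
            """selection_prob[i]: assumed probability that example i entered the
            biased pool, relative to the target distribution."""
            w = 1.0 / np.clip(selection_prob, 1e-6, None)
            return w * len(w) / w.sum()                     # normalize to mean weight 1

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 16))                     # e.g. shared low-dimensional representations
        y = (X[:, 0] > 0).astype(int)
        sel_prob = np.where(y == 1, 0.9, 0.3)               # class 1 heavily over-sampled (illustrative)

        clf = LogisticRegression().fit(X, y, sample_weight=inverse_bias_weights(sel_prob))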
    Model Compression. (arXiv:2105.10059v2 [cs.LG] UPDATED)
    (0 min) With time, machine learning models have increased in their scope, functionality and size. Consequently, the increased functionality and size of such models require high-end hardware to both train and provide inference after the fact. This paper explores the possibilities within the domain of model compression, discusses the efficiency of combining various levels of pruning and quantization, and proposes a quality measurement metric to objectively decide which combination is best in terms of minimizing the accuracy delta and maximizing the size reduction factor.
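    The paper's exact metric is not reproduced in the abstract; as a purely hypothetical illustration, a score of this kind could trade off the accuracy delta against the size reduction factor, for example:
        def compression_score(acc_base, acc_comp, size_base, size_comp, alpha=1.0):
            """Hypothetical score: rewards size reduction, penalizes accuracy loss."""
            accuracy_delta = acc_base - acc_comp            # in accuracy points
            size_reduction = size_base / size_comp          # e.g. 4.0 means 4x smaller
            return size_reduction - alpha * accuracy_delta

        # 4x smaller losing 1.2 points vs 8x smaller losing 5.0 points:
        print(compression_score(76.0, 74.8, 100.0, 25.0))   # 4.0 - 1.2 = 2.8
        print(compression_score(76.0, 71.0, 100.0, 12.5))   # 8.0 - 5.0 = 3.0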
    Robust Importance Sampling for Error Estimation in the Context of Optimal Bayesian Transfer Learning. (arXiv:2109.02150v1 [stat.ML])
    (0 min) Classification has been a major task for building intelligent systems as it enables decision-making under uncertainty. Classifier design aims at building models from training data for representing feature-label distributions--either explicitly or implicitly. In many scientific or clinical settings, training data are typically limited, which makes designing accurate classifiers and evaluating their classification error extremely challenging. While transfer learning (TL) can alleviate this issue by incorporating data from relevant source domains to improve learning in a different target domain, it has received little attention for performance assessment, notably in error estimation. In this paper, we fill this gap by investigating knowledge transferability in the context of classification error estimation within a Bayesian paradigm. We introduce a novel class of Bayesian minimum mean-square error (MMSE) estimators for optimal Bayesian transfer learning (OBTL), which enables rigorous evaluation of classification error under uncertainty in a small-sample setting. Using Monte Carlo importance sampling, we employ the proposed estimator to evaluate the classification accuracy of a broad family of classifiers that span diverse learning capabilities. Experimental results based on both synthetic data as well as real-world RNA sequencing (RNA-seq) data show that our proposed OBTL error estimation scheme clearly outperforms standard error estimators, especially in a small-sample setting, by tapping into the data from other relevant domains.
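    The Monte Carlo importance sampling ingredient can be sketched generically as follows (this is the textbook estimator, not the OBTL-specific one): sample from a tractable proposal and reweight by the density ratio. All densities below are illustrative.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        p = stats.norm(loc=2.0, scale=0.5)        # target density
        q = stats.norm(loc=0.0, scale=2.0)        # proposal density we can sample from
        f = lambda x: (x > 2.5).astype(float)     # quantity of interest, e.g. an error event

        x = q.rvs(size=100_000, random_state=rng)
        w = p.pdf(x) / q.pdf(x)                   # importance weights
        estimate = np.mean(w * f(x))              # estimates E_p[f(X)]
        print(estimate, 1 - p.cdf(2.5))           # compare against the exact value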
    Towards Memory-Efficient Neural Networks via Multi-Level in situ Generation. (arXiv:2108.11430v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks (DNNs) have shown superior performance in a variety of tasks. As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices. Though extensive efficient accelerator designs, from traditional electronics to emerging photonics, have been successfully demonstrated, they are still bottlenecked by expensive memory accesses due to tremendous gaps between the bandwidth/power/latency of electrical memory and computing cores. Previous solutions fail to fully leverage the ultra-fast computational speed of emerging DNN accelerators to break through the critical memory bound. In this work, we propose a general and unified framework to trade expensive memory transactions for ultra-fast on-chip computations, directly translating to performance improvement. We are the first to jointly explore the intrinsic correlations and bit-level redundancy within DNN kernels and propose a multi-level in situ generation mechanism with mixed-precision bases to achieve on-the-fly recovery of high-resolution parameters with minimum hardware overhead. Extensive experiments demonstrate that our proposed joint method can boost the memory efficiency by 10-20x with comparable accuracy over four state-of-the-art designs, when benchmarked on ResNet-18/DenseNet-121/MobileNetV2/V3 with various tasks.
    A Survey on Assessing the Generalization Envelope of Deep Neural Networks: Predictive Uncertainty, Out-of-distribution and Adversarial Samples. (arXiv:2008.09381v4 [cs.LG] UPDATED)
    (0 min) Deep Neural Networks (DNNs) achieve state-of-the-art performance on numerous applications. However, it is difficult to tell beforehand if a DNN receiving an input will deliver the correct output, since their decision criteria are usually nontransparent. A DNN delivers the correct output if the input is within the area enclosed by its generalization envelope. In this case, the information contained in the input sample is processed reasonably by the network. It is of great practical importance to assess at inference time if a DNN generalizes correctly. Currently, the approaches to achieve this goal are investigated in different problem set-ups rather independently from one another, leading to three main research and literature fields: predictive uncertainty, out-of-distribution detection and adversarial example detection. This survey connects the three fields within the larger framework of investigating the generalization performance of machine learning methods and in particular DNNs. We underline the common ground, point at the most promising approaches and give a structured overview of the methods that provide, at inference time, the means to establish whether the current input is within the generalization envelope of a DNN.
    Data science and Machine learning in the Clouds: A Perspective for the Future. (arXiv:2109.01661v1 [cs.DC])
    (2 min) As we are fast approaching the beginning of a paradigm shift in the field of science, data-driven science (the so-called fourth science paradigm) is going to be the driving force in research and innovation. From medicine to biodiversity and from astronomy to geology, all of these fields are going to be affected by this paradigm shift. The huge amount of data to be processed under this new paradigm will be a major concern in the future, and one will strongly require cloud-based services for all aspects of these computations (from storage to compute and other services). Another aspect will be the energy consumption and performance of prediction jobs and tasks within such a scientific paradigm, which will change the way one sees computation. Data science has heavily impacted or rather triggered the emergence of Machine Learning, Signal/Image/Video processing related algorithms, Artificial intelligence, Robotics, health informatics, geoinformatics, and many more such areas of interest. Hence, we envisage an era where data science can deliver its promises with the help of the existing cloud-based platforms and services with the addition of new services. In this article, we discuss data-driven science and machine learning and how they are going to be linked through cloud-based services in the future. We also discuss the rise of paradigms like approximate computing, quantum computing and many more in recent times and their applicability in big data processing, data science, analytics, prediction and machine learning in cloud environments.
    Training Meta-Surrogate Model for Transferable Adversarial Attack. (arXiv:2109.01983v1 [cs.LG])
    (2 min) We consider adversarial attacks on a black-box model when no queries are allowed. In this setting, many methods directly attack surrogate models and transfer the obtained adversarial examples to fool the target model. Plenty of previous works investigated what kind of attacks on the surrogate model can generate more transferable adversarial examples, but their performance is still limited due to the mismatches between surrogate models and the target model. In this paper, we tackle this problem from a novel angle -- instead of using the original surrogate models, can we obtain a Meta-Surrogate Model (MSM) such that attacks on this model can be more easily transferred to other models? We show that this goal can be mathematically formulated as a well-posed (bi-level-like) optimization problem and design a differentiable attacker to make training feasible. Given one or a set of surrogate models, our method can thus obtain an MSM such that adversarial examples generated on the MSM enjoy superior transferability. Comprehensive experiments on Cifar-10 and ImageNet demonstrate that by attacking the MSM, we can obtain stronger transferable adversarial examples to fool black-box models, including adversarially trained ones, with much higher success rates than existing methods. The proposed method reveals significant security challenges of deep models and is promising as a benchmark for evaluating the robustness of deep models in the black-box setting.
    How Reliable Are Out-of-Distribution Generalization Methods for Medical Image Segmentation?. (arXiv:2109.01668v1 [eess.IV])
    (2 min) The recent achievements of Deep Learning rely on the test data being similar in distribution to the training data. In an ideal case, Deep Learning models would achieve Out-of-Distribution (OoD) Generalization, i.e. reliably make predictions on out-of-distribution data. Yet in practice, models usually fail to generalize well when facing a shift in distribution. Several methods were thereby designed to improve the robustness of the features learned by a model through Regularization- or Domain-Prediction-based schemes. Segmenting medical images such as MRIs of the hippocampus is essential for the diagnosis and treatment of neuropsychiatric disorders. But these brain images often suffer from distribution shift due to the patient's age and various pathologies affecting the shape of the organ. In this work, we evaluate OoD Generalization solutions for the problem of hippocampus segmentation in MR data using both fully- and semi-supervised training. We find that no method performs reliably in all experiments. Only the V-REx loss stands out as it remains easy to tune, while it outperforms a standard U-Net in most cases.
    Pointspectrum: Equivariance Meets Laplacian Filtering for Graph Representation Learning. (arXiv:2109.02358v1 [cs.LG])
    (0 min) Graph Representation Learning (GRL) has become essential for modern graph data mining and learning tasks. GRL aims to capture the graph's structural information and exploit it in combination with node and edge attributes to compute low-dimensional representations. While Graph Neural Networks (GNNs) have been used in state-of-the-art GRL architectures, they have been shown to suffer from over-smoothing when many GNN layers need to be stacked. In a different GRL approach, spectral methods based on graph filtering have emerged that address over-smoothing; however, up to now, they employ traditional neural networks that cannot efficiently exploit the structure of graph data. Motivated by this, we propose PointSpectrum, a spectral method that incorporates a set equivariant network to account for a graph's structure. PointSpectrum enhances the efficiency and expressiveness of spectral methods, while it outperforms or competes with state-of-the-art GRL methods. Overall, PointSpectrum addresses over-smoothing by employing a graph filter and captures a graph's structure through set equivariance, lying at the intersection of GNNs and spectral methods. Our findings are promising for the benefits and applicability of this architectural shift for spectral methods and GRL.
    RAMA: A Rapid Multicut Algorithm on GPU. (arXiv:2109.01838v1 [cs.DC])
    (0 min) We propose a highly parallel primal-dual algorithm for the multicut (a.k.a. correlation clustering) problem, a classical graph clustering problem widely used in machine learning and computer vision. Our algorithm consists of three steps executed recursively: (1) finding conflicted cycles that correspond to violated inequalities of the underlying multicut relaxation, (2) performing message passing between the edges and cycles to optimize the Lagrange relaxation coming from the found violated cycles, producing reduced costs, and (3) contracting edges with high reduced costs through matrix-matrix multiplications. Our algorithm produces primal solutions and dual lower bounds that estimate the distance to optimum. We implement our algorithm on GPUs and show resulting improvements of one to two orders of magnitude in execution speed without sacrificing solution quality compared to traditional serial algorithms that run on CPUs. We can solve very large scale benchmark problems with up to $\mathcal{O}(10^8)$ variables in a few seconds with small primal-dual gaps. We make our code available at https://github.com/pawelswoboda/RAMA.
    Tolerating Adversarial Attacks and Byzantine Faults in Distributed Machine Learning. (arXiv:2109.02018v1 [cs.CR])
    (0 min) Adversarial attacks attempt to disrupt the training, retraining and use of artificial intelligence and machine learning models in large-scale distributed machine learning systems. This poses security risks to the prediction outcome. For example, attackers attempt to poison the model by either presenting inaccurate, misrepresentative data or altering the models' parameters. In addition, Byzantine faults, including software, hardware and network issues, occur in distributed systems and also lead to a negative impact on the prediction outcome. In this paper, we propose a novel distributed training algorithm, partial synchronous stochastic gradient descent (ParSGD), which defends against adversarial attacks and/or tolerates Byzantine faults. We demonstrate the effectiveness of our algorithm under three common adversarial attacks against the ML models and a Byzantine fault during the training phase. Our results show that, using ParSGD, ML models can still produce accurate predictions as if they were not being attacked or experiencing failures at all, even when almost half of the nodes are compromised or have failed. We will report the experimental evaluations of ParSGD in comparison with other algorithms.
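    ParSGD itself is not specified in the abstract; as a hedged illustration of the broader family of Byzantine-robust aggregation rules it belongs to, the sketch below uses a coordinate-wise median instead of the mean so that a minority of corrupted workers cannot drag the aggregated gradient arbitrarily far. The worker counts and values are illustrative.
        import numpy as np

        def median_aggregate(worker_grads):
            """worker_grads: array of shape (n_workers, n_params)."""
            return np.median(worker_grads, axis=0)

        rng = np.random.default_rng(0)
        honest = rng.normal(loc=1.0, scale=0.1, size=(7, 4))   # 7 honest gradients near 1.0
        byzantine = np.full((3, 4), 100.0)                     # 3 poisoned gradients
        grads = np.vstack([honest, byzantine])

        print(np.mean(grads, axis=0))      # the mean is ruined by the attackers
        print(median_aggregate(grads))     # the median stays close to the honest value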
    A realistic approach to generate masked faces applied on two novel masked face recognition data sets. (arXiv:2109.01745v1 [cs.CV])
    (0 min) The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical masks to cover their noses and mouths. Traditional data sets (e.g., CelebA, CASIA-WebFace) used for training these systems were released before the pandemic, so they now seem unsuited due to the lack of examples of people wearing masks. We propose a method for enhancing data sets containing faces without masks by creating synthetic masks and overlaying them on faces in the original images. Our method relies on Spark AR Studio, a developer program made by Facebook that is used to create Instagram face filters. In our approach, we use 9 masks of different colors, shapes and fabrics. We employ our method to generate 445,446 (90%) masked samples for the CASIA-WebFace data set and 196,254 (96.8%) for the CelebA data set, releasing the mask images at https://github.com/securifai/masked_faces. We show that our method produces significantly more realistic training examples of masks overlaid on faces by asking volunteers to qualitatively compare it to other methods or data sets designed for the same task. We also demonstrate the usefulness of our method by evaluating state-of-the-art face recognition systems (FaceNet, VGG-face, ArcFace) trained on the enhanced data sets and showing that they outperform equivalent systems trained on the original data sets (containing faces without masks) when the test benchmark contains masked faces.
    Automatic Online Multi-Source Domain Adaptation. (arXiv:2109.01996v1 [cs.LG])
    (0 min) Knowledge transfer across several streaming processes remains a challenging problem, not only because of the different distributions of each stream but also because of the rapidly changing and never-ending environments of data streams. Despite growing research achievements in this area, most existing works are developed for a single source domain, which limits their ability to exploit multi-source domains, beneficial for recovering from concept drifts quickly and avoiding the negative transfer problem. An online domain adaptation technique under multi-source streaming processes, namely automatic online multi-source domain adaptation (AOMSDA), is proposed in this paper. The online domain adaptation strategy of AOMSDA is formulated under a coupled generative and discriminative approach of a denoising autoencoder (DAE), where a central moment discrepancy (CMD)-based regularizer is integrated to handle the existence of multi-source domains, thereby taking advantage of complementary information sources. The asynchronous concept drifts taking place at different time periods are addressed by a self-organizing structure and a node re-weighting strategy. Our numerical study demonstrates that AOMSDA is capable of outperforming its counterparts in 5 of 8 study cases, while the ablation study depicts the advantage of each learning component. In addition, AOMSDA is applicable to any number of source streams. The source code of AOMSDA is shared publicly at https://github.com/Renchunzi-Xie/AOMSDA.git.
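    A small sketch of a central-moment-discrepancy-style regularizer between two feature batches is given below; the normalization constants of the original CMD formulation are omitted, and the data are illustrative.
        import numpy as np

        def cmd(x, y, k_max=5):
            """x, y: arrays of shape (batch, features). Returns a scalar discrepancy."""
            d = np.linalg.norm(x.mean(axis=0) - y.mean(axis=0))        # first moments
            cx, cy = x - x.mean(axis=0), y - y.mean(axis=0)
            for k in range(2, k_max + 1):                              # higher central moments
                d += np.linalg.norm((cx ** k).mean(axis=0) - (cy ** k).mean(axis=0))
            return d

        rng = np.random.default_rng(0)
        src = rng.normal(0.0, 1.0, size=(256, 32))
        tgt = rng.normal(0.5, 1.5, size=(256, 32))
        print(cmd(src, tgt))    # grows as the source/target feature statistics diverge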
    Multimodal Detection of COVID-19 Symptoms using Deep Learning & Probability-based Weighting of Modes. (arXiv:2109.01669v1 [cs.LG])
    (0 min) The COVID-19 pandemic is one of the most challenging healthcare crises of the 21st century. As the virus continues to spread on a global scale, the majority of efforts have been on the development of vaccines and the mass immunization of the public. While the daily case numbers were following a decreasing trend, the emergence of new virus mutations and variants still poses a significant threat. As economies start recovering and societies start opening up with people going back into office buildings, schools, and malls, we still need to have the ability to detect and minimize the spread of COVID-19. Individuals with COVID-19 may show multiple symptoms such as cough, fever, and shortness of breath. Many existing detection techniques treat all symptoms as equally important. However, it has been shown that some symptoms are more prevalent than others. In this paper, we present a multimodal method to predict COVID-19 by incorporating existing deep learning classifiers using convolutional neural networks and our novel probability-based weighting function that considers the prevalence of each symptom. The experiments were performed on an existing dataset with respect to the three considered modes of cough, fever, and shortness of breath. The results show considerable improvements in the detection of COVID-19 using our weighting function when compared to an equal weighting function.
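    The late-fusion idea can be pictured with the hedged sketch below, where per-symptom classifier probabilities are combined with prevalence-derived weights; the prevalence values and the linear fusion rule are hypothetical stand-ins, not the paper's exact function.
        import numpy as np

        prevalence = {"cough": 0.68, "fever": 0.44, "shortness_of_breath": 0.19}  # hypothetical
        weights = np.array(list(prevalence.values()))
        weights = weights / weights.sum()                  # normalize to sum to 1

        def fuse(per_mode_probs):
            """per_mode_probs: P(positive) from each mode's classifier,
            ordered as (cough, fever, shortness_of_breath)."""
            return float(weights @ np.asarray(per_mode_probs))

        print(fuse([0.9, 0.4, 0.2]))    # fused probability, dominated by the cough mode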
    VARGAN: Variance Enforcing Network Enhanced GAN. (arXiv:2109.02117v1 [cs.LG])
    (0 min) Generative adversarial networks (GANs) are one of the most widely used generative models. GANs can learn complex multi-modal distributions and generate realistic samples. Despite the major success of GANs in generating synthetic data, they might suffer from an unstable training process and mode collapse. In this paper, we introduce a new GAN architecture called variance enforcing GAN (VARGAN), which incorporates a third network to introduce diversity in the generated samples. The third network measures the diversity of the generated samples, which is used to penalize the generator's loss for low-diversity samples. The network is trained on the available training data and undesired distributions with limited modality. On a set of synthetic and real-world image data, VARGAN generates a more diverse set of samples compared to recent state-of-the-art models. High diversity and low computational complexity, as well as fast convergence, make VARGAN a promising model to alleviate mode collapse.
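    The abstract does not spell out the penalty, so the sketch below is only a hypothetical caricature of the idea of penalizing low-diversity batches in the generator loss; in VARGAN the diversity signal comes from a learned third network rather than a fixed statistic, and the margin here is an arbitrary choice.
        import torch

        def diversity_penalty(fake_batch, margin=1.0):
            """Penalize batches whose average per-feature variance falls below a margin."""
            var = fake_batch.var(dim=0).mean()
            return torch.relu(margin - var)     # zero once the batch is diverse enough

        fake = torch.randn(64, 128) * 0.1       # a near-collapsed batch of generator outputs
        print(diversity_penalty(fake))          # positive penalty pushes the generator toward diversity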
    What Do Compressed Deep Neural Networks Forget?. (arXiv:1911.05248v3 [cs.LG] UPDATED)
    (0 min) Deep neural network pruning and quantization techniques have demonstrated it is possible to achieve high levels of compression with surprisingly little degradation to test set accuracy. However, this measure of performance conceals significant differences in how different classes and images are impacted by model compression techniques. We find that models with radically different numbers of weights have comparable top-line performance metrics but diverge considerably in behavior on a narrow subset of the dataset. This small subset of data points, which we term Pruning Identified Exemplars (PIEs) are systematically more impacted by the introduction of sparsity. Compression disproportionately impacts model performance on the underrepresented long-tail of the data distribution. PIEs over-index on atypical or noisy images that are far more challenging for both humans and algorithms to classify. Our work provides intuition into the role of capacity in deep neural networks and the trade-offs incurred by compression. An understanding of this disparate impact is critical given the widespread deployment of compressed models in the wild.
    Deep Neural Networks with ReLU-Sine-Exponential Activations Break Curse of Dimensionality on H\"older Class. (arXiv:2103.00542v5 [cs.LG] UPDATED)
    (0 min) In this paper, we construct neural networks with ReLU, sine and $2^x$ as activation functions. For general continuous $f$ defined on $[0,1]^d$ with continuity modulus $\omega_f(\cdot)$, we construct ReLU-sine-$2^x$ networks that enjoy an approximation rate $\mathcal{O}(\omega_f(\sqrt{d})\cdot2^{-M}+\omega_{f}\left(\frac{\sqrt{d}}{N}\right))$, where $M,N\in \mathbb{N}^{+}$ denote the hyperparameters related to widths of the networks. As a consequence, we can construct a ReLU-sine-$2^x$ network with depth $5$ and width $\max\left\{\left\lceil2d^{3/2}\left(\frac{3\mu}{\epsilon}\right)^{1/{\alpha}}\right\rceil,2\left\lceil\log_2\frac{3\mu d^{\alpha/2}}{2\epsilon}\right\rceil+2\right\}$ that approximates $f\in \mathcal{H}_{\mu}^{\alpha}([0,1]^d)$ within a given tolerance $\epsilon >0$ measured in the $L^p$ norm, $p\in[1,\infty)$, where $\mathcal{H}_{\mu}^{\alpha}([0,1]^d)$ denotes the H\"older continuous function class defined on $[0,1]^d$ with order $\alpha \in (0,1]$ and constant $\mu > 0$. Therefore, the ReLU-sine-$2^x$ networks overcome the curse of dimensionality on $\mathcal{H}_{\mu}^{\alpha}([0,1]^d)$. In addition to their super expressive power, functions implemented by ReLU-sine-$2^x$ networks are (generalized) differentiable, enabling us to apply SGD for training.
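    As a toy illustration only (the specific depth-5 construction is not reproduced), a dense layer can split its units across the three activation types used in the paper:
        import numpy as np

        def mixed_layer(x, w, b):
            """Apply ReLU, sine and 2^x to three groups of units of a dense layer."""
            z = x @ w + b
            groups = np.array_split(np.arange(z.shape[-1]), 3)
            z[..., groups[0]] = np.maximum(z[..., groups[0]], 0.0)   # ReLU
            z[..., groups[1]] = np.sin(z[..., groups[1]])            # sine
            z[..., groups[2]] = np.exp2(z[..., groups[2]])           # 2^x
            return z

        rng = np.random.default_rng(0)
        x = rng.normal(size=(4, 8))
        w = rng.normal(size=(8, 9)); b = np.zeros(9)
        print(mixed_layer(x, w, b).shape)    # (4, 9)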
    Cohort Characteristics and Factors Associated with Cannabis Use among Adolescents in Canada Using Pattern Discovery and Disentanglement Method. (arXiv:2109.01739v1 [cs.LG])
    (0 min) COMPASS is a longitudinal, prospective cohort study collecting data annually from students attending high school in jurisdictions across Canada. We aimed to discover significant frequent/rare associations of behavioral factors among Canadian adolescents related to cannabis use. We use a subset of the COMPASS dataset which contains 18,761 records of students in grades 9 to 12 with 31 selected features (attributes) involving various characteristics, from living habits to academic performance. We then used the Pattern Discovery and Disentanglement (PDD) algorithm that we have developed to detect strong and rare (yet statistically significant) associations from the dataset. PDD used criteria derived from disentangled statistical spaces (known as Re-projected Adjusted-Standardized Residual Vector Spaces, notated as RARV). It outperformed methods using other popular criteria reported in the literature (i.e. support and confidence). Association results showed that PDD can discover: i) a smaller set of succinct significant associations in clusters; ii) frequent and rare, yet significant, patterns supported by population health relevant studies; iii) patterns from a dataset with extremely imbalanced groups (majority class: minority class = 88.3%: 11.7%).
    Soft Hierarchical Graph Recurrent Networks for Many-Agent Partially Observable Environments. (arXiv:2109.02032v1 [cs.LG])
    (0 min) The recent progress in multi-agent deep reinforcement learning (MADRL) makes it more practical in real-world tasks, but its relatively poor scalability and the partially observable constraints raise challenges to its performance and deployment. Based on our intuitive observation that human society could be regarded as a large-scale partially observable environment, where each individual has the function of communicating with neighbors and remembering its own experience, we propose a novel network structure called hierarchical graph recurrent network (HGRN) for multi-agent cooperation under partial observability. Specifically, we construct the multi-agent system as a graph, use the hierarchical graph attention network (HGAT) to achieve communication between neighboring agents, and exploit GRU to enable agents to record historical information. To encourage exploration and improve robustness, we design a maximum-entropy learning method to learn stochastic policies of a configurable target action entropy. Based on the above technologies, we propose a value-based MADRL algorithm called Soft-HGRN and its actor-critic variant named SAC-HRGN. Experimental results based on three homogeneous tasks and one heterogeneous environment not only show that our approach achieves clear improvements compared with four baselines, but also demonstrate the interpretability, scalability, and transferability of the proposed model. Ablation studies prove the function and necessity of each component.
    Artificial Intelligence (AI) in Action: Addressing the COVID-19 Pandemic with Natural Language Processing (NLP). (arXiv:2010.16413v3 [cs.CL] UPDATED)
    (0 min) The COVID-19 pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
    ALLWAS: Active Learning on Language models in WASserstein space. (arXiv:2109.01691v1 [cs.CL])
    (0 min) Active learning has emerged as a standard paradigm in areas with scarcity of labeled training data, such as the medical domain. Language models have emerged as the prevalent choice for several natural language tasks due to the performance boost they offer. However, in several domains, such as medicine, the scarcity of labeled training data is a common issue. Also, these models may not work well in cases where class imbalance is prevalent. Active learning may prove helpful in these cases to boost the performance with a limited label budget. To this end, we propose a novel method using sampling techniques based on submodular optimization and optimal transport for active learning in language models, dubbed ALLWAS. We construct a sampling strategy based on submodular optimization of the designed objective in the gradient domain. Furthermore, to enable learning from few samples, we propose a novel strategy for sampling from the Wasserstein barycenters. Our empirical evaluations on standard benchmark datasets for text classification show that our methods perform significantly better (>20% relative increase in some cases) than existing approaches for active learning on language models.
    Communication Efficient Tensor Factorization for Decentralized Healthcare Networks. (arXiv:2109.01718v1 [cs.LG])
    (0 min) Tensor factorization has proved to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where high-dimensional Electronic Health Records (EHRs) with patients' histories of medical procedures, medications, diagnoses, lab tests, etc., are converted to meaningful and interpretable medical concepts. Federated tensor factorization distributes the tensor computation to multiple workers under the coordination of a central server, which enables jointly learning the phenotypes across multiple hospitals while preserving the privacy of the patient information. However, existing federated tensor factorization algorithms encounter the single-point-of-failure issue with the involvement of the central server, which is not only easily exposed to external attacks but also limits the number of clients sharing information with the server under restricted uplink bandwidth. In this paper, we propose CiderTF, a communication-efficient decentralized generalized tensor factorization, which reduces the uplink communication cost by leveraging a four-level communication reduction strategy designed for generalized tensor factorization, which has the flexibility of modeling different tensor distributions with multiple kinds of loss functions. Experiments on two real-world EHR datasets demonstrate that CiderTF achieves comparable convergence with a communication reduction of up to 99.99%.
    Node Feature Kernels Increase Graph Convolutional Network Robustness. (arXiv:2109.01785v1 [cs.LG])
    (0 min) The robustness of the much-used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper, the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at: https://github.com/ChangminWu/RobustGCN.
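    The remedy described in the abstract can be sketched directly: add a node feature kernel (here a crudely scaled X X^T) to the adjacency matrix before the symmetric normalization used in GCN message passing. The scaling and toy graph below are illustrative choices, not the paper's exact recipe.
        import numpy as np

        def augmented_propagation(A, X, gamma=1.0):
            """A: (n, n) adjacency matrix, X: (n, d) node features."""
            K = X @ X.T                                        # node feature kernel
            K = K / (np.abs(K).max() + 1e-12)                  # crude scaling (illustrative)
            M = A + gamma * K                                  # augmented message-passing matrix
            deg = np.clip(M.sum(axis=1), 1e-12, None)
            D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
            return D_inv_sqrt @ M @ D_inv_sqrt @ X             # one propagation step

        A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
        X = np.random.default_rng(0).normal(size=(3, 4))
        print(augmented_propagation(A, X).shape)               # (3, 4)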
    Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment. (arXiv:2109.01949v1 [cs.LG])
    (0 min) Self-supervised learning provides an opportunity to explore unlabeled chest X-rays and their associated free-text reports accumulated in clinical routine without manual supervision. This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports. The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching. Both are bidirectionally constrained by cross-entropy-based and ranking-based triplet matching losses. The region-word matching is calculated using the attention mechanism without direct supervision about their mapping. The pre-trained multi-modal representation learning paves the way for downstream tasks concerning image and/or text encoding. We demonstrate the representation learning quality by cross-modality retrievals and multi-label classifications on two datasets: OpenI-IU and MIMIC-CXR.
    Acceleration Method for Learning Fine-Layered Optical Neural Networks. (arXiv:2109.01731v1 [cs.LG])
    (0 min) An optical neural network (ONN) is a promising system due to its high-speed and low-power operation. Its linear unit performs a multiplication of an input vector and a weight matrix in optical analog circuits. Among them, a circuit with a multiple-layered structure of programmable Mach-Zehnder interferometers (MZIs) can realize a specific class of unitary matrices with a limited number of MZIs as its weight matrix. The circuit is effective for balancing the number of programmable MZIs and ONN performance. However, learning the MZI parameters of the circuit takes a long time with the conventional automatic differentiation (AD) provided by machine learning platforms. To address this time-consuming problem, we propose an acceleration method for learning MZI parameters. We create customized complex-valued derivatives for an MZI, exploiting Wirtinger derivatives and a chain rule. They are incorporated into our newly developed function module implemented in C++ to collectively calculate their values in a multi-layered structure. Our method is simple, fast, and versatile as well as compatible with the conventional AD. We demonstrate that our method works 20 times faster than the conventional AD when a pixel-by-pixel MNIST task is performed in a complex-valued recurrent neural network with an MZI-based hidden unit.
    Reconfigurable Intelligent Surface Empowered Over-the-Air Federated Edge Learning. (arXiv:2109.02353v1 [cs.IT])
    (0 min) Federated edge learning (FEEL) has emerged as a revolutionary paradigm to develop AI services at the edge of 6G wireless networks as it supports collaborative model training at a massive number of mobile devices. However, model communication over wireless channels, especially in uplink model uploading of FEEL, has been widely recognized as a bottleneck that critically limits the efficiency of FEEL. Although over-the-air computation can alleviate the excessive cost of radio resources in FEEL model uploading, practical implementations of over-the-air FEEL still suffer from several challenges, including strong straggler issues, large communication overheads, and potential privacy leakage. In this article, we study these challenges in over-the-air FEEL and leverage reconfigurable intelligent surface (RIS), a key enabler of future wireless systems, to address these challenges. We study the state-of-the-art solutions on RIS-empowered FEEL and explore the promising research opportunities for adopting RIS to enhance FEEL performance.
    Estimating permeability of 3D micro-CT images by physics-informed CNNs based on DNS. (arXiv:2109.01818v1 [cs.LG])
    (0 min) In recent years, convolutional neural networks (CNNs) have attracted increasing interest for their ability to perform fast approximation of effective hydrodynamic parameters in porous media research and applications. This paper presents a novel methodology for permeability prediction from micro-CT scans of geological rock samples. The training data set for CNNs dedicated to permeability prediction consists of permeability labels that are typically generated by classical lattice Boltzmann methods (LBM) that simulate the flow through the pore space of the segmented image data. We instead perform direct numerical simulation (DNS) by solving the stationary Stokes equation in an efficient and distributed-parallel manner. As such, we circumvent the convergence issues of LBM that are frequently observed on complex pore geometries, and therefore improve the generality and accuracy of our training data set. Using the DNS-computed permeabilities, a physics-informed CNN (PhyCNN) is trained by additionally providing a tailored characteristic quantity of the pore space. More precisely, by exploiting the connection to flow problems on a graph representation of the pore space, additional information about confined structures is provided to the network in terms of the maximum flow value, which is the key innovative component of our workflow. As a result, unprecedented prediction accuracy and robustness are observed for a variety of sandstone samples from archetypal rock formations.
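    The key extra input mentioned in the abstract, a maximum flow value on a graph representation of the pore space, can be sketched with networkx on a toy graph; the node names and capacities below are illustrative, whereas in practice they would be derived from pore-throat geometry.
        import networkx as nx

        G = nx.DiGraph()
        # toy pore network: "inlet" -> pores -> "outlet", capacities are illustrative
        G.add_edge("inlet", "p1", capacity=3.0)
        G.add_edge("inlet", "p2", capacity=2.0)
        G.add_edge("p1", "p3", capacity=2.5)
        G.add_edge("p2", "p3", capacity=1.5)
        G.add_edge("p3", "outlet", capacity=3.5)

        max_flow = nx.maximum_flow_value(G, "inlet", "outlet")
        print(max_flow)   # scalar passed to the CNN as an additional characteristic quantity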
    Multi-Kernel Correntropy for Robust Learning. (arXiv:1905.10115v2 [cs.LG] UPDATED)
    (0 min) As a novel similarity measure that is defined as the expectation of a kernel function between two random variables, correntropy has been successfully applied in robust machine learning and signal processing to combat large outliers. The kernel function in correntropy is usually a zero-mean Gaussian kernel. In a recent work, the concept of mixture correntropy (MC) was proposed to improve the learning performance, where the kernel function is a mixture Gaussian kernel, namely a linear combination of several zero-mean Gaussian kernels with different widths. In both correntropy and mixture correntropy, the center of the kernel function is, however, always located at zero. In the present work, to further improve the learning performance, we propose the concept of multi-kernel correntropy (MKC), in which each component of the mixture Gaussian kernel can be centered at a different location. The properties of the MKC are investigated and an efficient approach is proposed to determine the free parameters in MKC. Experimental results show that the learning algorithms under the maximum multi-kernel correntropy criterion (MMKCC) can outperform those under the original maximum correntropy criterion (MCC) and the maximum mixture correntropy criterion (MMCC).
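    A minimal sketch of a multi-kernel correntropy-style similarity on error samples is shown below: a mixture of Gaussian kernels whose components may have different widths and different (non-zero) centers. The component parameters here are arbitrary examples rather than ones selected by the paper's procedure.
        import numpy as np

        def multi_kernel_correntropy(e, centers, sigmas, alphas):
            """e: error samples (y - y_hat); centers/sigmas/alphas: per-component
            kernel centers, bandwidths and mixture weights (summing to 1)."""
            e = np.asarray(e, dtype=float)[:, None]
            k = np.exp(-((e - np.asarray(centers)) ** 2) / (2 * np.asarray(sigmas) ** 2))
            return float(np.mean(k @ np.asarray(alphas)))

        errors = np.random.default_rng(0).normal(0.5, 1.0, size=1000)
        print(multi_kernel_correntropy(errors, centers=[0.0, 0.5],
                                       sigmas=[1.0, 0.5], alphas=[0.5, 0.5]))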
    Detect the Interactions that Matter in Matter: Geometric Attention for Many-Body Systems. (arXiv:2106.02549v4 [cs.LG] UPDATED)
    (0 min) Attention mechanisms are developing into a viable alternative to convolutional layers as an elementary building block of neural networks (NNs). Their main advantage is that they are not restricted to capturing local dependencies in the input, but can draw arbitrary connections. This unprecedented capability coincides with the long-standing problem of modeling global atomic interactions in molecular force fields and other many-body problems. In its original formulation, however, attention is not applicable to the continuous domains in which the atoms live. For this purpose, we propose a variant that describes geometric relations for arbitrary atomic configurations in Euclidean space and also respects all relevant physical symmetries. We furthermore demonstrate how the successive application of our learned attention matrices effectively translates the molecular geometry into a set of individual atomic contributions on-the-fly.
    Hierarchical 3D Feature Learning for Pancreas Segmentation. (arXiv:2109.01667v1 [eess.IV])
    (0 min) We propose a novel 3D fully convolutional deep network for automated pancreas segmentation from both MRI and CT scans. More specifically, the proposed model consists of a 3D encoder that learns to extract volume features at different scales; features taken at different points of the encoder hierarchy are then sent to multiple 3D decoders that individually predict intermediate segmentation maps. Finally, all segmentation maps are combined to obtain a unique detailed segmentation mask. We test our model on both CT and MRI imaging data: the publicly available NIH Pancreas-CT dataset (consisting of 82 contrast-enhanced CTs) and a private MRI dataset (consisting of 40 MRI scans). Experimental results show that our model outperforms existing methods on CT pancreas segmentation, obtaining an average Dice score of about 88%, and yields promising segmentation performance on a very challenging MRI data set (average Dice score is about 77%). Additional control experiments demonstrate that the achieved performance is due to the combination of our 3D fully-convolutional deep network and the hierarchical representation decoding, thus substantiating our architectural design.
    Nonasymptotic one-and two-sample tests in high dimension with unknown covariance structure. (arXiv:2109.01730v1 [cs.LG])
    (0 min) Let $\mathbf{X} = (X_i)_{1\leq i \leq n}$ be an i.i.d. sample of square-integrable variables in $\mathbb{R}^d$, with common expectation $\mu$ and covariance matrix $\Sigma$, both unknown. We consider the problem of testing if $\mu$ is $\eta$-close to zero, i.e. $\|\mu\| \leq \eta $ against $\|\mu\| \geq (\eta + \delta)$; we also tackle the more general two-sample mean closeness testing problem. The aim of this paper is to obtain nonasymptotic upper and lower bounds on the minimal separation distance $\delta$ such that we can control both the Type I and Type II errors at a given level. The main technical tools are concentration inequalities, first for a suitable estimator of $\|\mu\|^2$ used a test statistic, and secondly for estimating the operator and Frobenius norms of $\Sigma$ coming into the quantiles of said test statistic. These properties are obtained for Gaussian and bounded distributions. A particular attention is given to the dependence in the pseudo-dimension $d_*$ of the distribution, defined as $d_* := \|\Sigma\|_2^2/\|\Sigma\|_\infty^2$. In particular, for $\eta=0$, the minimum separation distance is ${\Theta}(d_*^{\frac{1}{4}}\sqrt{\|\Sigma\|_\infty/n})$, in contrast with the minimax estimation distance for $\mu$, which is ${\Theta}(d_e^{\frac{1}{2}}\sqrt{\|\Sigma\|_\infty/n})$ (where $d_e:=\|\Sigma\|_1/\|\Sigma\|_\infty$). This generalizes a phenomenon spelled out in particular by Baraud (2002).
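    The natural building block for such tests is an unbiased estimator of $\|\mu\|^2$; a small sketch on Gaussian toy data is given below (the calibrated thresholds via concentration inequalities, which are the paper's contribution, are not reproduced).
        import numpy as np

        def mean_norm_sq_estimate(X):
            """X: (n, d) i.i.d. sample. Unbiased estimate of ||E[X]||^2."""
            n = X.shape[0]
            xbar = X.mean(axis=0)
            S = np.atleast_2d(np.cov(X, rowvar=False))   # unbiased covariance estimate
            return float(xbar @ xbar - np.trace(S) / n)

        rng = np.random.default_rng(0)
        X = rng.normal(loc=0.2, scale=1.0, size=(500, 50))   # true ||mu||^2 = 50 * 0.04 = 2
        print(mean_norm_sq_estimate(X))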
    Learning Temporal Quantum Tomography. (arXiv:2103.13973v3 [quant-ph] UPDATED)
    (0 min) Quantifying and verifying the control level in preparing a quantum state are central challenges in building quantum devices. The quantum state is characterized from experimental measurements, using a procedure known as tomography, which requires a vast number of resources. Furthermore, the tomography for a quantum device with temporal processing, which is fundamentally different from the standard tomography, has not been formulated. We develop a practical and approximate tomography method using a recurrent machine learning framework for this intriguing situation. The method is based on repeated quantum interactions between a system called quantum reservoir with a stream of quantum states. Measurement data from the reservoir are connected to a linear readout to train a recurrent relation between quantum channels applied to the input stream. We demonstrate our algorithms for quantum learning tasks followed by the proposal of a quantum short-term memory capacity to evaluate the temporal processing ability of near-term quantum devices.
    Attentive Neural Controlled Differential Equations for Time-series Classification and Forecasting. (arXiv:2109.01876v1 [cs.LG])
    (0 min) Neural networks inspired by differential equations have proliferated over the past several years. Neural ordinary differential equations (NODEs) and neural controlled differential equations (NCDEs) are two representative examples. In theory, NCDEs provide better representation learning capability for time-series data than NODEs. In particular, it is known that NCDEs are suitable for processing irregular time-series data. Whereas NODEs have been successfully extended with attention, it has not yet been studied how to integrate attention into NCDEs. To this end, we present the method of Attentive Neural Controlled Differential Equations (ANCDEs) for time-series classification and forecasting, where dual NCDEs are used: one for generating attention values, and the other for evolving hidden vectors for a downstream machine learning task. We conduct experiments with three real-world time-series datasets and 10 baselines. After dropping some values, we also conduct irregular time-series experiments. Our method consistently shows the best accuracy in all cases by non-trivial margins. Our visualizations also show that the presented attention mechanism works as intended by focusing on crucial information.
    Parallel Capsule Networks for Classification of White Blood Cells. (arXiv:2108.02644v2 [cs.CV] UPDATED)
    (0 min) Capsule Networks (CapsNets) are a machine learning architecture proposed to overcome some of the shortcomings of convolutional neural networks (CNNs). However, CapsNets have mainly outperformed CNNs on datasets where images are small and/or the objects to identify have minimal background noise. In this work, we present a new architecture, parallel CapsNets, which exploits the concept of branching the network to isolate certain capsules, allowing each branch to identify different entities. We applied our concept to the two current types of CapsNet architectures, studying the performance for networks with different layers of capsules. We tested our design on a public, highly unbalanced dataset of acute myeloid leukaemia images (15 classes). Our experiments showed that conventional CapsNets achieve performance similar to our baseline CNN (ResNeXt-50) but exhibit instability problems. In contrast, parallel CapsNets can outperform ResNeXt-50, are more stable, and show better rotational invariance than both conventional CapsNets and ResNeXt-50.
    Cross-Task Generalization via Natural Language Crowdsourcing Instructions. (arXiv:2104.08773v2 [cs.CL] UPDATED)
    (0 min) Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. NLP models built with the conventional paradigm, however, often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that is equipped with the understanding of human-readable instructions that define the tasks, and can generalize to new tasks. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions and 193k task instances. The instructions are obtained from crowdsourcing instructions used to collect existing NLP datasets and mapped to a unified schema. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models can benefit from instructions to generalize across tasks. These models, however, are far behind supervised task-specific models, indicating significant room for more progress in this direction.
    Sparse-MLP: A Fully-MLP Architecture with Conditional Computation. (arXiv:2109.02008v1 [cs.LG])
    (0 min) Mixture of Experts (MoE) with sparse conditional computation has proved to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost. In this paper, we propose Sparse-MLP, scaling the recent MLP-Mixer model with sparse MoE layers, to achieve a more computation-efficient architecture. We replace a subset of dense MLP blocks in the MLP-Mixer model with Sparse blocks. In each Sparse block, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, and one with MLP experts mixing information within patches along the channel dimension. In addition, to reduce the computational cost of routing and improve expert capacity, we design Re-represent layers in each Sparse block. These layers re-scale image representations with two simple but effective linear transformations. By pre-training on ImageNet-1k with the MoCo v3 algorithm, our models can outperform dense MLP models with comparable parameters and less computational cost on several downstream image classification tasks.
    Mobility Functional Areas and COVID-19 Spread. (arXiv:2103.16894v2 [stat.AP] UPDATED)
    (2 min) This work introduces a new concept of functional areas called Mobility Functional Areas (MFAs), i.e., the geographic zones highly interconnected according to the analysis of mobile positioning data. The MFAs do not coincide necessarily with administrative borders as they are built observing natural human mobility and, therefore, they can be used to inform, in a bottom-up approach, local transportation, spatial planning, health and economic policies. After presenting the methodology behind the MFAs, this study focuses on the link between the COVID-19 pandemic and the MFAs in Austria. It emerges that the MFAs registered an average number of infections statistically larger than the areas in the rest of the country, suggesting the usefulness of the MFAs in the context of targeted re-escalation policy responses to this health crisis. The MFAs dataset is openly available to other scholars for further analyses.
    Weakly Supervised Relative Spatial Reasoning for Visual Question Answering. (arXiv:2109.01934v1 [cs.CV])
    (2 min) Vision-and-language (V\&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of visual reasoning is spatial understanding, which involves understanding relative locations of objects, i.e.\ implicitly learning the geometry of the scene. In this work, we evaluate the faithfulness of V\&L models to such geometric understanding, by formulating the prediction of pair-wise relative locations of objects as a classification as well as a regression task. Our findings suggest that state-of-the-art transformer-based V\&L models lack sufficient abilities to excel at this task. Motivated by this, we design two objectives as proxies for 3D spatial reasoning (SR) -- object centroid estimation, and relative position estimation, and train V\&L with weak supervision from off-the-shelf depth estimators. This leads to considerable improvements in accuracy for the "GQA" visual question answering challenge (in fully supervised, few-shot, and O.O.D settings) as well as improvements in relative spatial reasoning. Code and data will be released \href{https://github.com/pratyay-banerjee/weak_sup_vqa}{here}.
    A Perceptually-Validated Metric for Crowd Trajectory Quality Evaluation. (arXiv:2108.12346v2 [cs.LG] UPDATED)
    (2 min) Simulating crowds requires controlling a very large number of trajectories and is usually performed using crowd motion algorithms for which appropriate parameter values need to be found. The relation between parameter values for simulation techniques and the quality of the resulting trajectories has been studied either through perceptual experiments or by comparison with real crowd trajectories. In this paper, we integrate both strategies. A quality metric, QF, is proposed to abstract from reference data while capturing the most salient features that affect the perception of trajectory realism. QF weights and combines cost functions that are based on several individual, local, and global properties of trajectories. These trajectory features are selected from the literature and from interviews with experts. To validate the capacity of QF to capture perceived trajectory quality, we conduct an online experiment that demonstrates the high agreement between the automatic quality score and non-expert users. To further demonstrate the usefulness of QF, we use it in a data-free parameter tuning application able to tune any parametric microscopic crowd simulation model that outputs independent trajectories for characters. The learnt parameters for the tuned crowd motion model maintain the influence of the reference data which was used to weight the terms of QF.
    The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers. (arXiv:2108.12284v2 [cs.LG] UPDATED)
    (2 min) Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic generalization. We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS. On SCAN, relative positional embedding largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100% accuracy on the length split with a cutoff at 26. Importantly, performance differences between these models are typically invisible on the IID data split. This calls for proper generalization validation sets for developing neural networks that generalize systematically. We publicly release the code to reproduce our results.
    Physics-Informed Deep Learning: A Promising Technique for System Reliability Assessment. (arXiv:2108.10828v2 [stat.ML] UPDATED)
    (2 min) Considerable research has been devoted to deep learning-based predictive models for system prognostics and health management in the reliability and safety community. However, there is limited study on the utilization of deep learning for system reliability assessment. This paper aims to bridge this gap and explore this new interface between deep learning and system reliability assessment by exploiting the recent advances of physics-informed deep learning. Particularly, we present an approach to frame system reliability assessment in the context of physics-informed deep learning and discuss the potential value of physics-informed generative adversarial networks for the uncertainty quantification and measurement data incorporation in system reliability assessment. The proposed approach is demonstrated by three numerical examples involving a dual-processor computing system. The results indicate the potential value of physics-informed deep learning to alleviate computational challenges and combine measurement data and mathematical models for system reliability assessment.
    Data-Driven Learning of 3-Point Correlation Functions as Microstructure Representations. (arXiv:2109.02255v1 [cond-mat.mtrl-sci])
    (2 min) This paper considers the open challenge of identifying complete, concise, and explainable quantitative microstructure representations for disordered heterogeneous material systems. Completeness and conciseness have been achieved through existing data-driven methods, e.g., deep generative models, which, however, do not provide mathematically explainable latent representations. This study investigates representations composed of three-point correlation functions, which are a special type of spatial convolutions. We show that a variety of microstructures can be characterized by a concise subset of three-point correlations, and the identification of such subsets can be achieved by Bayesian optimization. Lastly, we show that the proposed representation can directly be used to compute material properties based on the effective medium theory.
    A Variational Approach to Privacy and Fairness. (arXiv:2006.06332v3 [stat.ML] UPDATED)
    (2 min) In this article, we propose a new variational approach to learn private and/or fair representations. This approach is based on the Lagrangians of a new formulation of the privacy and fairness optimization problems that we propose. In this formulation, we aim to generate representations of the data that keep a prescribed level of the relevant information that is not shared by the private or sensitive data, while minimizing the remaining information they keep. The proposed approach (i) exhibits the similarities of the privacy and fairness problems, (ii) allows us to control the trade-off between utility and privacy or fairness through the Lagrange multiplier parameter, and (iii) can be comfortably incorporated to common representation learning algorithms such as the VAE, the $\beta$-VAE, the VIB, or the nonlinear IB.
    The Phonexia VoxCeleb Speaker Recognition Challenge 2021 System Description. (arXiv:2109.02052v1 [cs.SD])
    (0 min) We describe the Phonexia submission for the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) in the unsupervised speaker verification track. Our solution was very similar to IDLab's winning submission for VoxSRC-20. An embedding extractor was bootstrapped using momentum contrastive learning, with input augmentations as the only source of supervision. This was followed by several iterations of clustering to assign pseudo-speaker labels that were then used for supervised embedding extractor training. Finally, score fusion was performed by averaging the zt-normalized cosine scores of five different embedding extractors. We also briefly describe unsuccessful solutions involving i-vectors instead of DNN embeddings and PLDA instead of cosine scoring.
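    A rough sketch of that fusion step is given below, assuming per-system utterance embeddings are already available; the cohort-based zt-normalization is simplified here to a plain z-normalization against an impostor cohort, and all variable names are hypothetical.

        import numpy as np

        def cosine_score(enroll, test):
            # Cosine similarity between an enrollment and a test embedding.
            return float(np.dot(enroll, test) / (np.linalg.norm(enroll) * np.linalg.norm(test)))

        def normalize(score, enroll, cohort):
            # Simplified stand-in for zt-norm: normalize a trial score against the
            # scores of the enrollment embedding versus an impostor cohort.
            cohort_scores = np.array([cosine_score(enroll, c) for c in cohort])
            return (score - cohort_scores.mean()) / (cohort_scores.std() + 1e-8)

        def fused_score(systems, enroll_id, test_id, cohort_ids):
            # systems: list of dicts (one per embedding extractor) mapping utterance id -> embedding.
            # The fused trial score is the average of the normalized per-system scores.
            per_system = []
            for embs in systems:
                raw = cosine_score(embs[enroll_id], embs[test_id])
                per_system.append(normalize(raw, embs[enroll_id], [embs[c] for c in cohort_ids]))
            return float(np.mean(per_system))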
    Approximate information state for approximate planning and reinforcement learning in partially observed systems. (arXiv:2010.08843v2 [cs.LG] UPDATED)
    (0 min) We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundamental notion of information state. We provide two equivalent definitions of information state: (i) a function of history which is sufficient to compute the expected reward and predict its next value; (ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in the literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS-based multi-time-scale policy gradient algorithms and detailed numerical experiments with low-, moderate-, and high-dimensional environments.
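    Spelled out in symbols, the first definition above says that a function $\sigma$ of the history $H_t$ yields an information state $Z_t = \sigma(H_t)$ if (this is a paraphrase of the stated properties, not the paper's exact notation):

        \mathbb{E}[R_t \mid H_t = h_t, A_t = a_t] = \mathbb{E}[R_t \mid Z_t = \sigma(h_t), A_t = a_t],
        \qquad
        \mathbb{P}(Z_{t+1} \mid H_t = h_t, A_t = a_t) = \mathbb{P}(Z_{t+1} \mid Z_t = \sigma(h_t), A_t = a_t).

    An approximate information state satisfies these two conditions only up to tolerances, which is what yields the bounded loss of optimality mentioned above.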
    On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games. (arXiv:2109.01795v1 [cs.GT])
    (0 min) Similar to the role of Markov decision processes in reinforcement learning, Stochastic Games (SGs) lay the foundation for the study of multi-agent reinforcement learning (MARL) and sequential agent interactions. In this paper, we show that computing an approximate Markov Perfect Equilibrium (MPE) in a finite-state discounted Stochastic Game to exponential precision is PPAD-complete. We adopt a function with a polynomially bounded description in the strategy space to convert the MPE computation into a fixed-point problem, even though the stochastic game may demand an exponential number of pure strategies, in the number of states, for each agent. The completeness result follows from a reduction of the fixed-point problem to End of the Line. Our results indicate that finding an MPE in SGs is highly unlikely to be NP-hard unless NP = co-NP. Our work offers confidence for MARL research to study MPE computation on general-sum SGs and to develop fruitful algorithms as currently done on zero-sum SGs.
    Structural Optimization Makes Graph Classification Simpler and Better. (arXiv:2109.02027v1 [cs.LG])
    (0 min) In deep neural networks, better results can often be obtained by increasing the complexity of previously developed basic models. However, it is unclear whether there is a way to boost performance by decreasing the complexity of such models. Here, based on an optimization method, we investigate the feasibility of improving graph classification performance while simplifying the model learning process. Inspired by progress in structural information assessment, we optimize the given data sample from graphs to encoding trees. In particular, we minimize the structural entropy of the transformed encoding tree to decode the key structure underlying a graph. This transformation is denoted as structural optimization. Furthermore, we propose a novel feature combination scheme, termed hierarchical reporting, for encoding trees. In this scheme, features are transferred from leaf nodes to root nodes by following the hierarchical structures of encoding trees. We then present an implementation of the scheme in a tree kernel and a convolutional network to perform graph classification. The tree kernel follows label propagation in the Weisfeiler-Lehman (WL) subtree kernel, but it has a lower runtime complexity $O(n)$. The convolutional network is a special implementation of our tree kernel in the deep learning field and is called Encoding Tree Learning (ETL). We empirically validate our tree kernel and convolutional network with several graph classification benchmarks and demonstrate that our methods achieve better performance and lower computational consumption than competing approaches.
    Weakly supervised semantic segmentation of tomographic images in the diagnosis of stroke. (arXiv:2109.01887v1 [eess.IV])
    (0 min) This paper presents an automatic algorithm for the segmentation of areas affected by an acute stroke in non-contrast computed tomography brain images. The proposed algorithm is designed for learning in a weakly supervised scenario in which some images are labeled accurately and some are labeled inaccurately. Incorrect labels arise from inaccuracies introduced by radiologists during the manual annotation of computed tomography images. We propose methods for solving the segmentation problem in the case of inaccurately labeled training data. We use the U-Net neural network architecture with several modifications. Experiments on real computed tomography scans show that the proposed methods increase the segmentation accuracy.
    Weakly Supervised Few-Shot Segmentation Via Meta-Learning. (arXiv:2109.01693v1 [cs.CV])
    (0 min) Semantic segmentation is a classic computer vision task with multiple applications, including medical and remote sensing image analysis. Despite recent advances with deep learning-based approaches, labeling samples (pixels) for training models is laborious and, in some cases, infeasible. In this paper, we present two novel meta-learning methods, named WeaSeL and ProtoSeg, for the few-shot semantic segmentation task with sparse annotations. We conducted an extensive evaluation of the proposed methods in different applications (12 datasets) in medical imaging and agricultural remote sensing, which are very distinct fields of knowledge and usually subject to data scarcity. The results demonstrate the potential of our method, achieving suitable results for segmenting both coffee/orange crops and anatomical parts of the human body in comparison with full dense annotation.
    Deep Saliency Prior for Reducing Visual Distraction. (arXiv:2109.01980v1 [cs.CV])
    (0 min) Using only a model that was trained to predict where people look at images, and no additional training data, we can produce a range of powerful editing effects for reducing distraction in images. Given an image and a mask specifying the region to edit, we backpropagate through a state-of-the-art saliency model to parameterize a differentiable editing operator, such that the saliency within the masked region is reduced. We demonstrate several operators, including: a recoloring operator, which learns to apply a color transform that camouflages and blends distractors into their surroundings; a warping operator, which warps less salient image regions to cover distractors, gradually collapsing objects into themselves and effectively removing them (an effect akin to inpainting); a GAN operator, which uses a semantic prior to fully replace image regions with plausible, less salient alternatives. The resulting effects are consistent with cognitive research on the human visual system (e.g., since color mismatch is salient, the recoloring operator learns to harmonize objects' colors with their surrounding to reduce their saliency), and, importantly, are all achieved solely through the guidance of the pretrained saliency model, with no additional supervision. We present results on a variety of natural images and conduct a perceptual study to evaluate and validate the changes in viewers' eye-gaze between the original images and our edited results.
    Assessing the Knowledge State of Online Students -- New Data, New Approaches, Improved Accuracy. (arXiv:2109.01753v1 [cs.LG])
    (0 min) We consider the problem of assessing the changing knowledge state of individual students as they go through online courses. This student performance (SP) modeling problem, also known as knowledge tracing, is a critical step for building adaptive online teaching systems. Specifically, we conduct a study of how to utilize various types and large amounts of student log data to train accurate machine learning models that predict the knowledge state of future students. This study is the first to use four very large datasets made available recently from four distinct intelligent tutoring systems. Our results include a new machine learning approach that defines a new state of the art for SP modeling, improving over earlier methods in several ways: First, we achieve improved accuracy by introducing new features that can be easily computed from conventional question-response logs (e.g., the pattern in the student's most recent answers). Second, we take advantage of features of the student history that go beyond question-response pairs (e.g., which video segments the student watched or skipped) as well as information about prerequisite structure in the curriculum. Third, we train multiple specialized models for different aspects of the curriculum (e.g., specializing in early versus later segments of the student history), then combine these specialized models to create a group prediction of student knowledge. Taken together, these innovations yield an average AUC score across these four datasets of 0.807, compared to 0.766 for the previous best logistic regression approach, and also outperform state-of-the-art deep neural net approaches. Importantly, we observe consistent improvements from each of our three methodological innovations, in each dataset, suggesting that our methods are of general utility and likely to produce improvements for other online tutoring systems as well.
    Providing an Approach to Predicting Customer Quality in E-Commerce Social Networks Based on Big Data and Unsupervised Learning Method. (arXiv:2109.02080v1 [cs.SI])
    (0 min) One of the goals of every business enterprise is to increase customer loyalty. The degree of customer loyalty is called customer quality, and forecasting it affects strategic marketing practices. The purpose of this study is to predict the quality of customers of large e-commerce social networks using big data algorithms and unsupervised learning. For this purpose, a graph-based social network analysis framework was used for community detection in the Stanford Network Analysis Platform (SNAP). The quality of customers was then predicted within the detected communities. The results showed that variety of visits, with an impact of 37.13%, has the greatest effect on customer quality; the remaining parameters, in decreasing order of impact, were: number of frequent customer visits (28.56%), role in social networks (28.37%), indirect transactions (26.74%), activity days (25.62%), and customer social network size (25.06%).
    Exploring Long Tail Visual Relationship Recognition with Large Vocabulary. (arXiv:2004.00436v6 [cs.CV] UPDATED)
    (0 min) Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study concerning the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long-tail (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and make a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to LTVRR setup, dubbed as RelMix. Both VilHub and RelMix can be easily integrated on top of existing models and despite being simple, our results show that they can remarkably improve the performance, especially on tail classes. Benchmarks, code, and models have been made available at: https://github.com/Vision-CAIR/LTVRR.
    User-Oriented Smart General AI System under Causal Inference. (arXiv:2103.14561v2 [cs.LG] UPDATED)
    (0 min) General AI system solves a wide range of tasks with high performance in an automated fashion. The best general AI algorithm designed by one individual is different from that devised by another. The best performance records achieved by different users are also different. An inevitable component of general AI is tacit knowledge that depends upon user-specific comprehension of task information and individual model design preferences that are related to user technical experiences. Tacit knowledge affects model performance but cannot be automatically optimized in general AI algorithms. In this paper, we propose User-Oriented Smart General AI System under Causal Inference, abbreviated as UOGASuCI, where UOGAS represents User-Oriented General AI System and uCI means under the framework of causal inference. User characteristics that have a significant influence upon tacit knowledge can be extracted from observed model training experiences of many users in external memory modules. Under the framework of causal inference, we manage to identify the optimal value of user characteristics that are connected with the best model performance designed by users. We make suggestions to users about how different user characteristics can improve the best model performance achieved by users. By recommending updating user characteristics associated with individualized tacit knowledge comprehension and technical preferences, UOGAS helps users design models with better performance.
    A Closed Loop Gradient Descent Algorithm applied to Rosenbrock's function. (arXiv:2108.12883v3 [math.OC] UPDATED)
    (0 min) We introduce a novel adaptive damping technique for an inertial gradient system which finds application as a gradient descent algorithm for unconstrained optimisation. In an example using the non-convex Rosenbrock's function, we show an improvement over existing momentum-based gradient optimisation methods. Using Lyapunov stability analysis, we also demonstrate the performance of the continuous-time version of the algorithm. Using numerical simulations, we consider the performance of its discrete-time counterpart, obtained via the symplectic Euler method of discretisation.
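    For context, the sketch below runs a plain heavy-ball (fixed-damping momentum) descent on Rosenbrock's function; it does not reproduce the paper's adaptive damping rule and is only the kind of momentum baseline such a method would be compared against.

        import numpy as np

        def rosenbrock(p):
            x, y = p
            return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

        def rosenbrock_grad(p):
            x, y = p
            return np.array([-2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2),
                             200.0 * (y - x ** 2)])

        def heavy_ball(p0, lr=1e-4, beta=0.9, steps=100_000):
            # Classical momentum: v <- beta * v - lr * grad(p); p <- p + v.
            p, v = np.array(p0, dtype=float), np.zeros(2)
            for _ in range(steps):
                v = beta * v - lr * rosenbrock_grad(p)
                p = p + v
            return p

        # With a sufficiently small step size the iterates drift along the curved
        # valley towards the global minimum at (1, 1).
        final = heavy_ball([-1.5, 2.0])
        print(final, rosenbrock(final))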
    Training Agents using Upside-Down Reinforcement Learning. (arXiv:1912.02877v2 [cs.LG] UPDATED)
    (0 min) We develop Upside-Down Reinforcement Learning (UDRL), a method for learning to act using only supervised learning techniques. Unlike traditional algorithms, UDRL does not use reward prediction or search for an optimal policy. Instead, it trains agents to follow commands such as "obtain so much total reward in so much time." Many of its general principles are outlined in a companion report; the goal of this paper is to develop a practical learning algorithm and show that this conceptually simple perspective on agent training can produce a range of rewarding behaviors for multiple episodic environments. Experiments show that on some tasks UDRL's performance can be surprisingly competitive with, and even exceed that of some traditional baseline algorithms developed over decades of research. Based on these results, we suggest that alternative approaches to expected reward maximization have an important role to play in training useful autonomous agents.
    Tailoring: encoding inductive biases by optimizing unsupervised objectives at prediction time. (arXiv:2009.10623v5 [cs.LG] UPDATED)
    (0 min) From CNNs to attention mechanisms, encoding inductive biases into neural networks has been a fruitful source of improvement in machine learning. Adding auxiliary losses to the main objective function is a general way of encoding biases that can help networks learn better representations. However, since auxiliary losses are minimized only on training data, they suffer from the same generalization gap as regular task losses. Moreover, by adding a term to the loss function, the model optimizes a different objective than the one we care about. In this work we address both problems: first, we take inspiration from transductive learning and note that after receiving an input but before making a prediction, we can fine-tune our networks on any unsupervised loss. We call this process tailoring, because we customize the model to each input to ensure our prediction satisfies the inductive bias. Second, we formulate meta-tailoring, a nested optimization similar to that in meta-learning, and train our models to perform well on the task objective after adapting them using an unsupervised loss. The advantages of tailoring and meta-tailoring are discussed theoretically and demonstrated empirically on a diverse set of examples.
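    The abstract does not specify which unsupervised losses are used, so the following sketch only illustrates the general tailoring pattern: copy the model, take a few gradient steps on an unsupervised loss computed on the test input (here, hypothetically, prediction entropy), then predict with the adapted copy.

        import copy
        import torch
        import torch.nn.functional as F

        def tailored_predict(model, x, steps=3, lr=1e-3):
            # Adapt a throwaway copy of the model to this particular input by
            # minimizing an unsupervised loss, then predict with the adapted copy.
            adapted = copy.deepcopy(model)
            adapted.train()
            opt = torch.optim.SGD(adapted.parameters(), lr=lr)
            for _ in range(steps):
                probs = F.softmax(adapted(x), dim=-1)
                entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
                opt.zero_grad()
                entropy.backward()
                opt.step()
            adapted.eval()
            with torch.no_grad():
                return adapted(x)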
    Learning to drive from a world on rails. (arXiv:2105.00636v2 [cs.RO] UPDATED)
    (0 min) We learn an interactive vision-based driving policy from pre-recorded driving logs via a model-based approach. A forward model of the world supervises a driving policy that predicts the outcome of any potential driving trajectory. To support learning from pre-recorded logs, we assume that the world is on rails, meaning neither the agent nor its actions influence the environment. This assumption greatly simplifies the learning problem, factorizing the dynamics into a nonreactive world model and a low-dimensional and compact forward model of the ego-vehicle. Our approach computes action-values for each training trajectory using a tabular dynamic-programming evaluation of the Bellman equations; these action-values in turn supervise the final vision-based driving policy. Despite the world-on-rails assumption, the final driving policy acts well in a dynamic and reactive world. At the time of writing, our method ranks first on the CARLA leaderboard, attaining a 25% higher driving score while using 40 times less data. Our method is also an order of magnitude more sample-efficient than state-of-the-art model-free reinforcement learning techniques on navigational tasks in the ProcGen benchmark.
    A Novel Multi-Centroid Template Matching Algorithm and Its Application to Cough Detection. (arXiv:2109.00630v2 [cs.LG] UPDATED)
    (0 min) Cough is a major symptom of respiratory-related diseases. There exists a tremendous amount of work on detecting coughs from audio, but there has been no effort to identify coughs solely from inertial measurement unit (IMU) data. Coughing causes motion across the whole body and especially of the neck and head. Therefore, head motion data captured during coughing by a head-worn IMU sensor could be leveraged to detect coughs using a template matching algorithm. In time series template matching problems, K-Nearest Neighbors (KNN) combined with an elastic distance measure (especially Dynamic Time Warping (DTW)) achieves outstanding performance. However, it is often regarded as prohibitively time-consuming. The Nearest Centroid Classifier was proposed as an alternative, but its accuracy is compromised by relying on only a single centroid per class. Centroid-based classifiers perform clustering and averaging within each cluster, but require manually setting the number of clusters. We propose a novel self-tuning multi-centroid template-matching algorithm, which can automatically adjust the number of clusters to balance accuracy and inference time. Through experiments conducted on synthetic datasets and a real-world earbud-based cough dataset, we demonstrate the superiority of our proposed algorithm and present results for cough detection with a single accelerometer sensor on the earbuds platform.
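    For reference, the elastic distance at the core of such template matching is standard dynamic time warping; a textbook O(nm) implementation for 1-D sequences is sketched below (the multi-centroid selection itself is not shown).

        import numpy as np

        def dtw_distance(a, b):
            # Classic dynamic-programming DTW between 1-D sequences a and b.
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        # A template-matching classifier assigns a motion window to the class of the
        # nearest template (centroid) under dtw_distance.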
    Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss. (arXiv:2109.02100v1 [cs.LG])
    (0 min) Network quantization, which aims to reduce the bit-lengths of the network weights and activations, has emerged for their deployments to resource-limited devices. Although recent studies have successfully discretized a full-precision network, they still incur large quantization errors after training, thus giving rise to a significant performance gap between a full-precision network and its quantized counterpart. In this work, we propose a novel quantization method for neural networks, Cluster-Promoting Quantization (CPQ) that finds the optimal quantization grids while naturally encouraging the underlying full-precision weights to gather around those quantization grids cohesively during training. This property of CPQ is thanks to our two main ingredients that enable differentiable quantization: i) the use of the categorical distribution designed by a specific probabilistic parametrization in the forward pass and ii) our proposed multi-class straight-through estimator (STE) in the backward pass. Since our second component, multi-class STE, is intrinsically biased, we additionally propose a new bit-drop technique, DropBits, that revises the standard dropout regularization to randomly drop bits instead of neurons. As a natural extension of DropBits, we further introduce the way of learning heterogeneous quantization levels to find proper bit-length for each layer by imposing an additional regularization on DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support a new hypothesis for quantization: learning heterogeneous quantization levels outperforms the case using the same but fixed quantization levels from scratch.
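    As background on the gradient trick involved, a generic straight-through estimator for uniform quantization looks as follows; the paper's probabilistic forward pass, multi-class STE, and DropBits mechanisms are not reproduced here.

        import torch

        def ste_uniform_quantize(w, bits=4):
            # Quantize w to 2**bits uniform levels over its observed range.
            # The forward pass uses the rounded values; the backward pass behaves
            # as the identity, since the rounding residual is detached.
            lo, hi = w.min().detach(), w.max().detach()
            scale = ((hi - lo) / (2 ** bits - 1)).clamp_min(1e-8)
            q = torch.round((w - lo) / scale) * scale + lo
            return w + (q - w).detach()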
    Siamese Basis Function Networks for Data-efficient Defect Classification in Technical Domains. (arXiv:2012.01338v7 [cs.CV] UPDATED)
    (0 min) Training deep learning models in technical domains is often accompanied by the challenge that, although the task is clear, insufficient data is available for training. In this work, we propose a novel approach based on the combination of Siamese networks and radial basis function networks to perform data-efficient classification without pretraining by measuring the distance between images in semantic space. We develop the models using three technical datasets: the NEU dataset, the BSD dataset, and the TEX dataset. In addition to the technical domain, we show the general applicability of the approach to classical datasets (CIFAR-10 and MNIST) as well. The approach is tested against state-of-the-art models (ResNet50 and ResNet101) by stepwise reduction of the number of samples available for training. We show that the proposed approach outperforms the state-of-the-art models in the low-data regime.
    Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks. (arXiv:2006.13866v2 [cs.LG] UPDATED)
    (0 min) Sampling methods (e.g., node-wise, layer-wise, or subgraph sampling) have become an indispensable strategy to speed up training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into embedding approximation variance in the forward stage and stochastic gradient variance in the backward stage, and that both types of variance must be mitigated to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails better generalization compared to existing methods.
    Counterfactual Explanation Based on Gradual Construction for Deep Networks. (arXiv:2008.01897v2 [cs.LG] UPDATED)
    (0 min) To understand the black-box characteristics of deep networks, counterfactual explanation, which deduces not only the important features of an input space but also how those features should be modified to classify an input as a target class, has gained increasing interest. The patterns that deep networks have learned from a training dataset can be grasped by observing feature variation among the various classes. However, current approaches perform the feature modification to increase the classification probability for the target class irrespective of the internal characteristics of deep networks. This often leads to unclear explanations that deviate from real-world data distributions. To address this problem, we propose a counterfactual explanation method that exploits the statistics learned from a training dataset. In particular, we gradually construct an explanation by iterating over masking and composition steps. The masking step selects an important feature from the input data to be classified as the target class. The composition step then optimizes the previously selected feature by ensuring that its output score is close to the logit space of the training data that are classified as the target class. Experimental results show that our method produces human-friendly interpretations on various classification datasets and verify that such interpretations can be achieved with fewer feature modifications.
    Scalable Feature Selection for (Multitask) Gradient Boosted Trees. (arXiv:2109.01965v1 [stat.ML])
    (0 min) Gradient Boosted Decision Trees (GBDTs) are widely used for building ranking and relevance models in search and recommendation. Considerations such as latency and interpretability dictate the use of as few features as possible to train these models. Feature selection in GBDT models typically involves heuristically ranking the features by importance and selecting the top few, or by performing a full backward feature elimination routine. On-the-fly feature selection methods proposed previously scale suboptimally with the number of features, which can be daunting in high dimensional settings. We develop a scalable forward feature selection variant for GBDT, via a novel group testing procedure that works well in high dimensions, and enjoys favorable theoretical performance and computational guarantees. We show via extensive experiments on both public and proprietary datasets that the proposed method offers significant speedups in training time, while being as competitive as existing GBDT methods in terms of model performance metrics. We also extend the method to the multitask setting, allowing the practitioner to select common features across tasks, as well as selecting task-specific features.
    Mitigating harm in language models with conditional-likelihood filtration. (arXiv:2108.07790v2 [cs.CL] UPDATED)
    (0 min) Language models trained on large-scale unfiltered datasets curated from the open web acquire systemic biases, prejudices, and harmful views from their training data. We present a methodology for programmatically identifying and removing harmful text from web-scale datasets. A pretrained language model is used to calculate the log-likelihood of researcher-written trigger phrases conditioned on a specific document, which is used to identify and filter documents from the dataset. We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text, with a marginal decrease in performance on standard language modeling benchmarks compared to unfiltered baselines. We provide a partial explanation for this performance gap by surfacing examples of hate speech and other undesirable content from standard language modeling benchmarks. Finally, we discuss the generalization of this method and how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.
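    A sketch of the scoring step is given below, using a generic Hugging Face causal language model as a stand-in (the model and trigger phrases actually used are not specified here): the log-likelihood of a researcher-written trigger phrase is computed conditioned on each document, and documents scoring above a threshold are filtered out.

        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM

        tok = AutoTokenizer.from_pretrained("gpt2")                    # placeholder model choice
        lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

        @torch.no_grad()
        def trigger_loglik(document, trigger):
            doc = tok(document, return_tensors="pt").input_ids
            trig = tok(" " + trigger, return_tensors="pt").input_ids
            ids = torch.cat([doc, trig], dim=1)
            logp = lm(ids).logits.log_softmax(dim=-1)
            # The logits at position t predict the token at position t + 1, so sum the
            # log-probabilities assigned to each trigger token given everything before it.
            return sum(logp[0, pos - 1, ids[0, pos]].item()
                       for pos in range(doc.size(1), ids.size(1)))

        def keep_document(document, triggers, threshold):
            # Drop documents under which any trigger phrase is too likely.
            return all(trigger_loglik(document, t) < threshold for t in triggers)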
    Gradient Normalization for Generative Adversarial Networks. (arXiv:2109.02235v1 [cs.LG])
    (0 min) In this paper, we propose a novel normalization method called gradient normalization (GN) to tackle the training instability of Generative Adversarial Networks (GANs) caused by the sharp gradient space. Unlike existing work such as gradient penalty and spectral normalization, the proposed GN only imposes a hard 1-Lipschitz constraint on the discriminator function, which increases the capacity of the discriminator. Moreover, the proposed gradient normalization can be applied to different GAN architectures with little modification. Extensive experiments on four datasets show that GANs trained with gradient normalization outperform existing methods in terms of both Frechet Inception Distance and Inception Score.
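    The abstract does not restate the normalization itself; a commonly cited form of gradient normalization rescales the discriminator output by the norm of its input gradient, roughly as sketched below (treat the exact denominator as an assumption rather than the paper's definition).

        import torch

        def gradient_normalized(f, x):
            # f: discriminator network producing one scalar per sample.
            # Rescale f(x) by the norm of its gradient with respect to the input,
            # which caps the local slope of the normalized function.
            x = x.requires_grad_(True)
            fx = f(x).flatten()
            grad = torch.autograd.grad(fx.sum(), x, create_graph=True)[0]
            grad_norm = grad.flatten(start_dim=1).norm(dim=1)
            return fx / (grad_norm + fx.abs() + 1e-8)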
    Pandemic Drugs at Pandemic Speed: Infrastructure for Accelerating COVID-19 Drug Discovery with Hybrid Machine Learning- and Physics-based Simulations on High Performance Computers. (arXiv:2103.02843v2 [cs.DC] UPDATED)
    (0 min) The race to meet the challenges of the global pandemic has served as a reminder that the existing drug discovery process is expensive, inefficient and slow. There is a major bottleneck screening the vast number of potential small molecules to shortlist lead compounds for antiviral drug development. New opportunities to accelerate drug discovery lie at the interface between machine learning methods, in this case developed for linear accelerators, and physics-based methods. The two in silico methods, each have their own advantages and limitations which, interestingly, complement each other. Here, we present an innovative infrastructural development that combines both approaches to accelerate drug discovery. The scale of the potential resulting workflow is such that it is dependent on supercomputing to achieve extremely high throughput. We have demonstrated the viability of this workflow for the study of inhibitors for four COVID-19 target proteins and our ability to perform the required large-scale calculations to identify lead antiviral compounds through repurposing on a variety of supercomputers.
    Regional Adversarial Training for Better Robust Generalization. (arXiv:2109.00678v2 [cs.CV] UPDATED)
    (0 min) Adversarial training (AT) has been demonstrated as one of the most promising defense methods against various adversarial attacks. To our knowledge, existing AT-based methods usually train with the locally most adversarial perturbed points and treat all the perturbed points equally, which may lead to considerably weaker adversarial robust generalization on test data. In this work, we introduce a new adversarial training framework that considers the diversity as well as characteristics of the perturbed points in the vicinity of benign samples. To realize the framework, we propose a Regional Adversarial Training (RAT) defense method that first utilizes the attack path generated by the typical iterative attack method of projected gradient descent (PGD), and constructs an adversarial region based on the attack path. Then, RAT samples diverse perturbed training points efficiently inside this region, and utilizes a distance-aware label smoothing mechanism to capture our intuition that perturbed points at different locations should have different impact on the model performance. Extensive experiments on several benchmark datasets show that RAT consistently makes significant improvement on standard adversarial training (SAT), and exhibits better robust generalization.
    Effective and scalable clustering of SARS-CoV-2 sequences. (arXiv:2108.08143v3 [q-bio.PE] UPDATED)
    (0 min) SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutionary history, such as those that build a phylogenetic tree. Hence, new and scalable methods will need to be devised in order to make use of the ever increasing number of viral sequences being collected. Since identifying variants is an important part of understanding the evolution of a virus, in this paper, we propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants. Using a $k$-mer based feature vector generation and efficient feature selection methods, our approach is effective in identifying variants, as well as being efficient and scalable to millions of sequences. Such a clustering method allows us to show the relative proportion of each variant over time, giving the rate of spread of each variant in different locations -- something which is important for vaccine development and distribution. We also compute the importance of each amino acid position of the spike protein in identifying a given variant in terms of information gain. Positions of high variant-specific importance tend to agree with those reported by the USA's Centers for Disease Control and Prevention (CDC), further demonstrating our approach.
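    To make the featurization step concrete, a minimal k-mer count vectorizer for protein sequences is sketched below; the value of k, the handling of ambiguous symbols, and the feature selection step used in the paper are not specified here.

        from collections import Counter
        from itertools import product

        AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

        def kmer_vector(sequence, k=3):
            # Count every length-k substring and lay the counts out in a fixed order,
            # producing one dense feature vector per sequence.
            vocab = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
            counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
            vec = [0] * len(vocab)
            for kmer, c in counts.items():
                if kmer in vocab:          # k-mers with ambiguous symbols are skipped
                    vec[vocab[kmer]] = c
            return vec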
    Deep State-Space Gaussian Processes. (arXiv:2008.04733v3 [stat.ML] UPDATED)
    (0 min) This paper is concerned with a state-space approach to deep Gaussian process (DGP) regression. We construct the DGP by hierarchically putting transformed Gaussian process (GP) priors on the length scales and magnitudes of the next level of Gaussian processes in the hierarchy. The idea of the state-space approach is to represent the DGP as a non-linear hierarchical system of linear stochastic differential equations (SDEs), where each SDE corresponds to a conditional GP. The DGP regression problem then becomes a state estimation problem, and we can estimate the state efficiently with sequential methods by using the Markov property of the state-space DGP. The computational complexity scales linearly with respect to the number of measurements. Based on this, we formulate state-space MAP as well as Bayesian filtering and smoothing solutions to the DGP regression problem. We demonstrate the performance of the proposed models and methods on synthetic non-stationary signals and apply the state-space DGP to detection of the gravitational waves from LIGO measurements.
    Urban Fire Station Location Planning: A Systematic Approach using Predicted Demand and Service Quality Index. (arXiv:2109.02160v1 [cs.LG])
    (0 min) In this article, we propose a systematic approach for fire station location planning. We develop a machine learning model, based on Random Forest, for demand prediction and further utilize the model to define a generalized index measuring the quality of fire service in urban settings. Our model is built upon spatial data collected from multiple different sources. The efficacy of facility planning depends on the choice of candidate locations where fire stations can be placed alongside existing stations, if any. The travel time from these candidates to demand locations also needs to be accounted for to maintain fire safety standards. Here, we propose a travel-time-based clustering technique to identify suitable candidates. Finally, we develop an optimization problem to select the best locations for new fire stations. The optimization problem builds on the maximum coverage problem and is formulated as an integer program. We present a detailed experimental study of our proposed approach in collaboration with the City of Victoria Fire Department, MN, USA. Our demand prediction model achieves a true positive rate of approximately 70% and a false positive rate of approximately 22%. We aided the Victoria Fire Department in selecting a location for a new fire station using our approach, and we present detailed statistics on the improvement obtained by locating a new facility, as suggested by our methodology, in the City of Victoria.
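    The paper formulates station siting as a maximum coverage problem solved by integer programming; the greedy routine below is not that formulation, but it illustrates the underlying coverage objective (choose k candidate sites that cover as much predicted demand as possible within a travel-time limit).

        def greedy_max_coverage(candidates, demand, covers, k):
            # candidates: candidate site ids
            # demand: dict demand_id -> predicted demand weight (e.g., from the RF model)
            # covers: dict candidate_id -> set of demand ids reachable within the
            #         travel-time standard
            # k: number of new stations to place
            chosen, covered = [], set()
            for _ in range(k):
                best = max((c for c in candidates if c not in chosen),
                           key=lambda c: sum(demand[d] for d in covers[c] - covered),
                           default=None)
                if best is None:
                    break
                chosen.append(best)
                covered |= covers[best]
            return chosen, covered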
    Quantifying the Reproducibility of Graph Neural Networks using Multigraph Brain Data. (arXiv:2109.02248v1 [cs.LG])
    (0 min) Graph neural networks (GNNs) have witnessed an unprecedented proliferation in tackling several problems in computer vision, computer-aided diagnosis, and related fields. While prior studies have focused on boosting model accuracy, quantifying the reproducibility of the most discriminative features identified by GNNs remains an open problem, raising concerns about their reliability in clinical applications in particular. Specifically, the reproducibility of biological markers across clinical datasets and across distribution shifts between classes (e.g., healthy and disordered brains) is of paramount importance in revealing the underpinning mechanisms of diseases as well as propelling the development of personalized treatment. Motivated by these issues, we propose, for the first time, reproducibility-based GNN selection (RG-Select), a framework for GNN reproducibility assessment via the quantification of the most discriminative features (i.e., biomarkers) shared between different models. To ascertain the soundness of our framework, the reproducibility assessment embraces variations of different factors such as training strategies and data perturbations. Despite these challenges, our framework successfully yielded replicable conclusions across different training strategies and various clinical datasets. Our findings could thus pave the way for the development of biomarker trustworthiness and reliability assessment methods for computer-aided diagnosis and prognosis tasks. RG-Select code is available on GitHub at https://github.com/basiralab/RG-Select.
    Representation Learning for Efficient and Effective Similarity Search and Recommendation. (arXiv:2109.01815v1 [cs.IR])
    (0 min) How data is represented and operationalized is critical for building computational solutions that are both effective and efficient. A common approach is to represent data objects as binary vectors, denoted \textit{hash codes}, which require little storage and enable efficient similarity search through direct indexing into a hash table or through similarity computations in an appropriate space. Due to the limited expressibility of hash codes, compared to real-valued representations, a core open challenge is how to generate hash codes that well capture semantic content or latent properties using a small number of bits, while ensuring that the hash codes are distributed in a way that does not reduce their search efficiency. State of the art methods use representation learning for generating such hash codes, focusing on neural autoencoder architectures where semantics are encoded into the hash codes by learning to reconstruct the original inputs of the hash codes. This thesis addresses the above challenge and makes a number of contributions to representation learning that (i) improve effectiveness of hash codes through more expressive representations and a more effective similarity measure than the current state of the art, namely the Hamming distance, and (ii) improve efficiency of hash codes by learning representations that are especially suited to the choice of search method. The contributions are empirically validated on several tasks related to similarity search and recommendation.
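    As a reminder of why hash codes make search cheap, the Hamming distance between two codes packed into integers is just an XOR followed by a popcount:

        def hamming_distance(code_a: int, code_b: int) -> int:
            # Both arguments are b-bit hash codes packed into integers; bits that
            # differ survive the XOR and are counted.
            return bin(code_a ^ code_b).count("1")

        # Example: the 8-bit codes 0b10110010 and 0b10011010 differ in 2 positions.
        assert hamming_distance(0b10110010, 0b10011010) == 2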
    Data Efficient Masked Language Modeling for Vision and Language. (arXiv:2109.02040v1 [cs.CL])
    (0 min) Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pre-training the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low resource settings. Further, our pre-training approach substantially outperforms the baseline model on a prompt-based probing task designed to elicit image objects. These results and our analysis indicate that our method allows for better utilization of the training data.
    Neural Radiance Flow for 4D View Synthesis and Video Processing. (arXiv:2012.09790v2 [cs.CV] UPDATED)
    (0 min) We present a method, Neural Radiance Flow (NeRFlow), to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images. Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene. By enforcing consistency across different modalities, our representation enables multi-view rendering in diverse dynamic scenes, including water pouring, robotic interaction, and real images, outperforming state-of-the-art methods for spatial-temporal view synthesis. Our approach works even when input images are captured with only one camera. We further demonstrate that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.
    Estimating Leaf Water Content using Remotely Sensed Hyperspectral Data. (arXiv:2109.02250v1 [eess.SP])
    (0 min) Plant water stress may occur due to limited availability of water to the roots/soil or due to increased transpiration. These factors adversely affect plant physiology and photosynthetic ability to the extent that they have been shown to inhibit both growth and yield [18]. Early identification of plant water stress status enables suitable corrective measures to be applied to obtain the expected crop yield. Further, improving crop yield through precision agriculture methods is a key component of climate policy and the UN sustainable development goals [1]. Leaf water content (LWC) is a measure that can be used to estimate water content and identify stressed plants. LWC during the early crop growth stages is an important indicator of plant productivity and yield. The effect of water stress can be instantaneous, affecting gaseous exchange [15], or long-term, significantly reducing growth and yield [9, 18, 22]. It is thus necessary to identify potential plant water stress during the early stages of growth [15] so that corrective irrigation can be introduced to alleviate stress. LWC is also useful for identifying plant genotypes that are tolerant to water stress and salinity, by measuring the stability of LWC even under artificially induced water stress [18, 25]. Such experiments generally employ destructive procedures to obtain the LWC, which are time-consuming and labor-intensive. Accordingly, this research develops a non-destructive method to estimate LWC from UAV-based hyperspectral data.
    Communication Efficient Federated Learning with Energy Awareness over Wireless Networks. (arXiv:2004.07351v3 [cs.LG] UPDATED)
    (0 min) In federated learning (FL), reducing the communication overhead is one of the most critical challenges since the parameter server and the mobile devices share the training parameters over wireless links. With such consideration, we adopt the idea of SignSGD in which only the signs of the gradients are exchanged. Moreover, most of the existing works assume Channel State Information (CSI) available at both the mobile devices and the parameter server, and thus the mobile devices can adopt fixed transmission rates dictated by the channel capacity. In this work, only the parameter server side CSI is assumed, and channel capacity with outage is considered. In this case, an essential problem for the mobile devices is to select appropriate local processing and communication parameters (including the transmission rates) to achieve a desired balance between the overall learning performance and their energy consumption. Two optimization problems are formulated and solved, which optimize the learning performance given the energy consumption requirement, and vice versa. Furthermore, considering that the data may be distributed across the mobile devices in a highly uneven fashion in FL, a stochastic sign-based algorithm is proposed. Extensive simulations are performed to demonstrate the effectiveness of the proposed methods.
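    The sign-exchange idea the paper builds on can be summarized in a few lines (this omits the channel, outage, and energy models that are the actual focus of the work):

        import numpy as np

        def device_message(local_gradient):
            # Each device uploads only the sign of its local gradient: one bit per coordinate.
            return np.sign(local_gradient)

        def server_update(sign_messages, lr):
            # Majority vote across devices, then a signed step of size lr;
            # the returned vector is added to the global model parameters.
            return -lr * np.sign(np.sum(sign_messages, axis=0))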
    Questioning the AI: Informing Design Practices for Explainable AI User Experiences. (arXiv:2001.02478v3 [cs.HC] UPDATED)
    (0 min) A surge of interest in explainable AI (XAI) has led to a vast collection of algorithmic work on the topic. While many recognize the necessity to incorporate explainability features in AI systems, how to address real-world user needs for understanding AI remains an open question. By interviewing 20 UX and design practitioners working on various AI products, we seek to identify gaps between the current XAI algorithmic work and practices to create explainable AI products. To do so, we develop an algorithm-informed XAI question bank in which user needs for explainability are represented as prototypical questions users might ask about the AI, and use it as a study probe. Our work contributes insights into the design space of XAI, informs efforts to support design practices in this space, and identifies opportunities for future XAI work. We also provide an extended XAI question bank and discuss how it can be used for creating user-centered XAI.
    Learning to Select Cuts for Efficient Mixed-Integer Programming. (arXiv:2105.13645v2 [cs.LG] UPDATED)
    (0 min) Cutting plane methods play a significant role in modern solvers for tackling mixed-integer programming (MIP) problems. Proper selection of cuts would remove infeasible solutions in the early stage, thus largely reducing the computational burden without hurting the solution accuracy. However, the major cut selection approaches heavily rely on heuristics, which strongly depend on the specific problem at hand and thus limit their generalization capability. In this paper, we propose a data-driven and generalizable cut selection approach, named Cut Ranking, in the settings of multiple instance learning. To measure the quality of the candidate cuts, a scoring function, which takes the instance-specific cut features as inputs, is trained and applied in cut ranking and selection. In order to evaluate our method, we conduct extensive experiments on both synthetic datasets and real-world datasets. Compared with commonly used heuristics for cut selection, the learning-based policy has shown to be more effective, and is capable of generalizing over multiple problems with different properties. Cut Ranking has been deployed in an industrial solver for large-scale MIPs. In the online A/B testing of the product planning problems with more than $10^7$ variables and constraints daily, Cut Ranking has achieved the average speedup ratio of 12.42% over the production solver without any accuracy loss of solution.
    Multi-agent online learning in time-varying games. (arXiv:1809.03066v3 [cs.GT] UPDATED)
    (0 min) We examine the long-run behavior of multi-agent online learning in games that evolve over time. Specifically, we focus on a wide class of policies based on mirror descent, and we show that the induced sequence of play (a) converges to Nash equilibrium in time-varying games that stabilize in the long run to a strictly monotone limit; and (b) it stays asymptotically close to the evolving equilibrium of the sequence of stage games (assuming they are strongly monotone). Our results apply to both gradient-based and payoff-based feedback - i.e., the "bandit feedback" case where players only get to observe the payoffs of their chosen actions.
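    For reference, the generic mirror descent update behind this class of policies, written with the Bregman divergence $D_h$ of a regularizer $h$ (standard textbook form; the paper's particular feedback models are not restated here), is:

        X_{t+1} = \arg\max_{x \in \mathcal{X}} \Big\{ \eta_t \langle \hat{v}_t, x \rangle - D_h(x, X_t) \Big\},
        \qquad
        D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle,

    where $\hat{v}_t$ denotes the gradient-based or bandit payoff feedback received at stage $t$.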
    Non-Euclidean Analysis of Joint Variations in Multi-Object Shapes. (arXiv:2109.02230v1 [stat.ML])
    (0 min) This paper considers joint analysis of multiple functionally related structures in classification tasks. In particular, our method is driven by how functionally correlated brain structures vary together between autism and control groups. To do so, we devised a method based on a novel combination of (1) non-Euclidean statistics that can faithfully represent non-Euclidean data in Euclidean spaces and (2) a non-parametric integrative analysis method that can decompose multi-block Euclidean data into joint, individual, and residual structures. We find that the resulting joint structure is effective, robust, and interpretable in recognizing the underlying patterns of the joint variation of multi-block non-Euclidean data. We verified the method by classifying structural shape data collected from cases that did and did not develop Autism Spectrum Disorder (ASD).
    Identification of Driver Phone Usage Violations via State-of-the-Art Object Detection with Tracking. (arXiv:2109.02119v1 [cs.CV])
    (0 min) The use of mobile phones while driving has been a major factor in road traffic incidents, and the process of capturing such violations can be a laborious task. Advances in both modern object detection frameworks and high-performance hardware have paved the way for a more automated approach to video surveillance. In this work, we propose a custom-trained, state-of-the-art object detector to work with roadside cameras to capture driver phone usage without the need for human intervention. The proposed approach also addresses the issues caused by windscreen glare and introduces the steps required to remedy this. Twelve pre-trained models are fine-tuned with our custom dataset using four popular object detection methods: YOLO, SSD, Faster R-CNN, and CenterNet. Of all the object detectors tested, YOLO yields the highest accuracy of up to 96% (AP10) and frame rates of up to ~30 FPS. The DeepSort object tracking algorithm is also integrated into the best-performing model to collect records of only the unique violations and to enable the proposed approach to count the number of vehicles. The proposed automated system collects the output images of the identified violations, the timestamp of each violation, and the total vehicle count. Data can be accessed via a purpose-built user interface.
    Revisiting 3D ResNets for Video Recognition. (arXiv:2109.01696v1 [cs.CV])
    (0 min) A recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition. This short note studies effective training and scaling strategies for video recognition models. We propose a simple scaling strategy for 3D ResNets, in combination with improved training strategies and minor architectural changes. The resulting models, termed 3D ResNet-RS, attain competitive performance of 81.0 on Kinetics-400 and 83.8 on Kinetics-600 without pre-training. When pre-trained on a large Web Video Text dataset, our best model achieves 83.5 and 84.3 on Kinetics-400 and Kinetics-600. The proposed scaling rule is further evaluated in a self-supervised setup using contrastive learning, demonstrating improved performance. Code is available at: https://github.com/tensorflow/models/tree/master/official.
    Error Detection in Large-Scale Natural Language Understanding Systems Using Transformer Models. (arXiv:2109.01754v1 [cs.CL])
    (0 min) Large-scale conversational assistants like Alexa, Siri, Cortana and Google Assistant process every utterance using multiple models for domain, intent and named entity recognition. Given the decoupled nature of model development and large traffic volumes, it is extremely difficult to identify utterances processed erroneously by such systems. We address this challenge to detect domain classification errors using offline Transformer models. We combine utterance encodings from a RoBERTa model with the N-best hypotheses produced by the production system. We then fine-tune end-to-end in a multitask setting using a small dataset of human-annotated utterances with domain classification errors. We tested our approach for detecting misclassifications from one domain that accounts for <0.5% of the traffic in a large-scale conversational AI system. Our approach achieves an F1 score of 30%, outperforming a bi-LSTM baseline by 16.9% and a standalone RoBERTa model by 4.8%. We improve this further by 2.2% to 32.2% by ensembling multiple models.
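    As a hedged illustration of the described setup, the sketch below concatenates a precomputed RoBERTa utterance encoding with a vector of N-best hypothesis features and classifies whether the production domain decision was an error. The feature dimensions, the concatenation scheme and the classifier head are assumptions; the real system fine-tunes the encoder end-to-end in a multitask setting.

    ```python
    import torch
    import torch.nn as nn

    class DomainErrorDetector(nn.Module):
        """Hypothetical sketch: concatenate an utterance encoding with N-best
        hypothesis features and predict whether the domain classification was wrong."""
        def __init__(self, enc_dim: int = 768, nbest_dim: int = 20, hidden: int = 128):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(enc_dim + nbest_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

        def forward(self, utterance_enc, nbest_features):
            return self.head(torch.cat([utterance_enc, nbest_features], dim=-1))

    detector = DomainErrorDetector()
    utterance_enc = torch.randn(4, 768)   # stand-in for RoBERTa utterance encodings
    nbest_features = torch.randn(4, 20)   # stand-in for N-best hypothesis features
    logits = detector(utterance_enc, nbest_features)  # (4, 2): correct vs. misclassified
    ```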
    Training Graph Neural Networks by Graphon Estimation. (arXiv:2109.01918v1 [cs.LG])
    (0 min) In this work, we propose to train a graph neural network via resampling from a graphon estimate obtained from the underlying network data. More specifically, the graphon, or link probability matrix, of the underlying network is first estimated, from which a new network is resampled and used during the training process at each layer. The uncertainty induced by the resampling helps mitigate the well-known issue of over-smoothing in graph neural network (GNN) models. Our framework is general, computationally efficient, and conceptually simple. Another appealing feature of our method is that it requires minimal additional tuning during the training process. Extensive numerical results show that our approach is competitive with, and in many cases outperforms, other GNN training methods that reduce over-smoothing.
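    To make the resampling idea concrete, here is a small sketch under stated assumptions: the link-probability matrix is estimated with a crude low-rank eigen-approximation of the observed adjacency (a stand-in for whatever graphon estimator the paper uses), and a fresh adjacency is drawn from it each training epoch.

    ```python
    import numpy as np

    def estimate_link_probabilities(adj: np.ndarray, rank: int = 5) -> np.ndarray:
        """Crude graphon / link-probability estimate via a low-rank eigen-approximation
        (an illustrative placeholder for the paper's estimator)."""
        vals, vecs = np.linalg.eigh(adj.astype(float))
        idx = np.argsort(np.abs(vals))[::-1][:rank]       # keep the largest-magnitude modes
        p_hat = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
        return np.clip(p_hat, 0.0, 1.0)

    def resample_graph(p_hat: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Draw a new symmetric adjacency matrix from the estimated link probabilities."""
        upper = np.triu(rng.random(p_hat.shape) < p_hat, k=1)
        return (upper | upper.T).astype(float)

    rng = np.random.default_rng(0)
    adj = np.triu((rng.random((100, 100)) < 0.1), k=1)
    adj = (adj | adj.T).astype(float)                     # toy undirected graph
    p_hat = estimate_link_probabilities(adj)
    adj_epoch = resample_graph(p_hat, rng)                # use a fresh sample per training epoch
    ```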
    Multi-Agent Variational Occlusion Inference Using People as Sensors. (arXiv:2109.02173v1 [cs.RO])
    (0 min) Autonomous vehicles must reason about spatial occlusions in urban environments to ensure safety without being overly cautious. Prior work explored occlusion inference from observed social behaviors of road agents. Inferring occupancy from agent behaviors is an inherently multimodal problem; a driver may behave in the same manner for different occupancy patterns ahead of them (e.g., a driver may move at constant speed in traffic or on an open road). Past work, however, does not account for this multimodality, thus neglecting to model this source of aleatoric uncertainty in the relationship between driver behaviors and their environment. We propose an occlusion inference method that characterizes observed behaviors of human agents as sensor measurements, and fuses them with those from a standard sensor suite. To capture the aleatoric uncertainty, we train a conditional variational autoencoder with a discrete latent space to learn a multimodal mapping from observed driver trajectories to an occupancy grid representation of the view ahead of the driver. Our method handles multi-agent scenarios, combining measurements from multiple observed drivers using evidential theory to solve the sensor fusion problem. Our approach is validated on a real-world dataset, outperforming baselines and demonstrating real-time capable performance. Our code is available at https://github.com/sisl/MultiAgentVariationalOcclusionInference .
    CX-ToM: Counterfactual Explanations with Theory-of-Mind for Enhancing Human Trust in Image Recognition Models. (arXiv:2109.01401v2 [cs.AI] UPDATED)
    (0 min) We propose CX-ToM, short for counterfactual explanations with theory-of-mind, a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN). In contrast to the current methods in XAI that generate explanations as a single-shot response, we pose explanation as an iterative communication process, i.e., a dialog between the machine and the human user. More concretely, our CX-ToM framework generates a sequence of explanations in a dialog by mediating the differences between the minds of the machine and the human user. To do this, we use Theory of Mind (ToM), which helps us explicitly model the human's intention, the machine's mind as inferred by the human, and the human's mind as inferred by the machine. Moreover, most state-of-the-art XAI frameworks provide attention (or heat map) based explanations. In our work, we show that these attention-based explanations are not sufficient for increasing human trust in the underlying CNN model. In CX-ToM, we instead use counterfactual explanations called fault-lines, which we define as follows: given an input image I for which a CNN classification model M predicts class c_pred, a fault-line identifies the minimal semantic-level features (e.g., stripes on a zebra, pointed ears of a dog), referred to as explainable concepts, that need to be added to or deleted from I in order to alter the classification of I by M to another specified class c_alt. We argue that, due to the iterative, conceptual and counterfactual nature of CX-ToM explanations, our framework is practical and more natural for both expert and non-expert users to understand the internal workings of complex deep learning models. Extensive quantitative and qualitative experiments verify our hypotheses, demonstrating that our CX-ToM significantly outperforms the state-of-the-art explainable AI models.
    Temporal Aware Deep Reinforcement Learning. (arXiv:2109.02145v1 [cs.LG])
    (0 min) The function approximators employed by traditional image-based Deep Reinforcement Learning (DRL) algorithms usually lack a temporal learning component and instead focus on learning the spatial component. We propose a technique in which both the temporal and spatial components are jointly learned. Our technique was tested against a generic DQN and outperformed it in terms of maximum rewards as well as sample complexity. This algorithm has implications in robotics as well as sequential decision-making domains.
    Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks. (arXiv:2108.12943v2 [cs.LG] UPDATED)
    (0 min) Convolutional neural networks have been successful in solving many socially important and economically significant problems. Their ability to learn complex high-dimensional functions hierarchically can be attributed to the use of nonlinear activation functions. A key discovery that made training deep networks feasible was the adoption of the Rectified Linear Unit (ReLU) activation function to alleviate the vanishing gradient problem caused by using saturating activation functions. Since then, many improved variants of the ReLU activation have been proposed. However, a majority of activation functions used today are non-oscillatory and monotonically increasing, owing to their biological plausibility. This paper demonstrates that oscillatory activation functions can improve gradient flow and reduce network size. It is shown that oscillatory activation functions allow neurons to switch classification (the sign of their output) within the interior of the positive and negative half-spaces defined by the neuronal hyperplane, allowing complex decisions with fewer neurons. A new oscillatory activation function C(z) = z cos z that outperforms Sigmoid, Swish, Mish and ReLU on a variety of architectures and benchmarks is presented. This new activation function allows even single neurons to exhibit nonlinear decision boundaries, and the paper presents a single-neuron solution to the famous XOR problem. Experimental results indicate that replacing the activation function in the convolutional layers with C(z) significantly improves performance on CIFAR-10, CIFAR-100 and Imagenette.
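    The proposed activation is simple enough to state directly. Below is a minimal PyTorch module for the growing cosine unit C(z) = z cos z; the drop-in-replacement usage in a small convolutional block is an illustration, not the authors' training setup.

    ```python
    import torch
    import torch.nn as nn

    class GrowingCosineUnit(nn.Module):
        """Oscillatory activation C(z) = z * cos(z)."""
        def forward(self, z: torch.Tensor) -> torch.Tensor:
            return z * torch.cos(z)

    # Example: replace ReLU in a small convolutional block with the GCU activation.
    block = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        GrowingCosineUnit(),
        nn.Conv2d(16, 16, kernel_size=3, padding=1),
        GrowingCosineUnit(),
    )
    out = block(torch.randn(1, 3, 32, 32))  # -> shape (1, 16, 32, 32)
    ```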
    Reinforcement Learning for Battery Energy Storage Dispatch augmented with Model-based Optimizer. (arXiv:2109.01659v1 [cs.LG])
    (0 min) Reinforcement learning has been found useful in solving optimal power flow (OPF) problems in electric power distribution systems. However, the use of largely model-free reinforcement learning algorithms that completely ignore the physics-based modeling of the power grid compromises the optimizer performance and poses scalability challenges. This paper proposes a novel approach to synergistically combine the physics-based models with learning-based algorithms using imitation learning to solve distribution-level OPF problems. Specifically, we propose imitation learning based improvements in deep reinforcement learning (DRL) methods to solve the OPF problem for a specific case of battery storage dispatch in the power distribution systems. The proposed imitation learning algorithm uses the approximate optimal solutions obtained from a linearized model-based OPF solver to provide a good initial policy for the DRL algorithms while improving the training efficiency. The effectiveness of the proposed approach is demonstrated using IEEE 34-bus and 123-bus distribution feeders with numerous distribution-level battery storage systems.
    Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets. (arXiv:2108.05935v2 [cs.LG] UPDATED)
    (0 min) The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Various tools and techniques are available that assess data quality with respect to general cleaning and profiling checks. However, these techniques are not applicable to detecting data issues in the context of machine learning tasks, such as noisy labels or overlapping classes. We revisit data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain and remediate issues in the data, and systematically and automatically capture all the changes applied to the data. We introduce the Data Quality Toolkit for machine learning as a library of key quality metrics and relevant remediation techniques to analyze and enhance the readiness of structured training datasets for machine learning projects. The toolkit can reduce the turn-around times of data preparation pipelines and streamline the data quality assessment process. Our toolkit is publicly available via the IBM API Hub [1] platform; any developer can assess data quality using IBM's Data Quality for AI APIs [2]. Detailed tutorials are also available on the IBM Learning Path [3].
    Distributed Learning with Dependent Samples. (arXiv:2002.03757v2 [cs.LG] UPDATED)
    (0 min) This paper focuses on the learning rate analysis of distributed kernel ridge regression for strong mixing sequences. Using a recently developed integral operator approach and a classical covariance inequality for Banach-valued strong mixing sequences, we succeed in deriving the optimal learning rate for distributed kernel ridge regression. As a byproduct, we also deduce a sufficient condition on the mixing property to guarantee the optimal learning rates for kernel ridge regression. Our results extend the applicable range of distributed learning from i.i.d. samples to non-i.i.d. sequences.
    A Multi-view Multi-task Learning Framework for Multi-variate Time Series Forecasting. (arXiv:2109.01657v1 [cs.LG])
    (0 min) Multi-variate time series (MTS) data is a ubiquitous class of data abstraction in the real world. Any instance of MTS is generated from a hybrid dynamical system, and its specific dynamics are usually unknown. The hybrid nature of such a dynamical system is a result of complex external attributes, such as geographic location and time of day, each of which can be categorized as either a spatial or a temporal attribute. Therefore, there are two fundamental views which can be used to analyze MTS data, namely the spatial view and the temporal view. Moreover, from each of these two views, we can partition the set of data samples of MTS into disjoint forecasting tasks in accordance with their associated attribute values. Samples of the same task then manifest similar forthcoming patterns, which are easier to predict than in the original single-view setting. Considering this insight, we propose a novel multi-view multi-task (MVMT) learning framework for MTS forecasting. Rather than being explicitly presented, in most scenarios MVMT information is deeply concealed in the MTS data, which severely hinders the model from capturing it naturally. To this end, we develop two kinds of basic operations, namely task-wise affine transformation and task-wise normalization. Applying these two operations with prior knowledge on the spatial and temporal views allows the model to adaptively extract MVMT information while predicting. Extensive experiments on three datasets are conducted to illustrate that canonical architectures can be greatly enhanced by the MVMT learning framework in terms of both effectiveness and efficiency. In addition, we design rich case studies to reveal the properties of the representations produced at different phases of the entire prediction procedure.
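    The abstract does not spell out the two operations, so the following is a hedged sketch of one plausible form: per-task scale and shift parameters (task-wise affine transformation) and per-task feature standardization (task-wise normalization). The parameter shapes and the task-indexing scheme are assumptions, not the paper's exact design.

    ```python
    import torch
    import torch.nn as nn

    class TaskWiseAffine(nn.Module):
        """Applies a learned per-task scale (gamma) and shift (beta) to hidden features."""
        def __init__(self, num_tasks: int, num_features: int):
            super().__init__()
            self.gamma = nn.Parameter(torch.ones(num_tasks, num_features))
            self.beta = nn.Parameter(torch.zeros(num_tasks, num_features))

        def forward(self, h: torch.Tensor, task_ids: torch.Tensor) -> torch.Tensor:
            # h: (batch, num_features); task_ids: (batch,) integer task index per sample
            return h * self.gamma[task_ids] + self.beta[task_ids]

    def task_wise_normalize(h, task_ids, num_tasks, eps=1e-5):
        """Standardizes features within each task group of the batch."""
        out = h.clone()
        for t in range(num_tasks):
            mask = task_ids == t
            if mask.sum() > 1:
                mu, sigma = h[mask].mean(0), h[mask].std(0)
                out[mask] = (h[mask] - mu) / (sigma + eps)
        return out

    affine = TaskWiseAffine(num_tasks=4, num_features=16)
    h = torch.randn(8, 16)
    task_ids = torch.randint(0, 4, (8,))
    h = affine(task_wise_normalize(h, task_ids, num_tasks=4), task_ids)
    ```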
    A Neural Network-Based Linguistic Similarity Measure for Entrainment in Conversations. (arXiv:2109.01924v1 [cs.CL])
    (0 min) Linguistic entrainment is a phenomenon where people tend to mimic each other in conversation. The core instrument to quantify entrainment is a linguistic similarity measure between conversational partners. Most of the current similarity measures are based on bag-of-words approaches that rely on linguistic markers, ignoring the overall language structure and dialogue context. To address this issue, we propose to use a neural network model to perform the similarity measure for entrainment. Our model is context-aware, and it further leverages a novel component to learn the shared high-level linguistic features across dialogues. We first investigate the effectiveness of our novel component. Then we use the model to perform similarity measure in a corpus-based entrainment analysis. We observe promising results for both evaluation tasks.
    Confidence Adaptive Regularization for Deep Learning with Noisy Labels. (arXiv:2108.08212v2 [cs.LG] UPDATED)
    (0 min) Recent studies on the memorization effects of deep neural networks on noisy labels show that the networks first fit the correctly labeled training samples before memorizing the mislabeled samples. Motivated by this early-learning phenomenon, we propose a novel method to prevent memorization of the mislabeled samples. Unlike the existing approaches, which use the model output to identify or ignore the mislabeled samples, we introduce an indicator branch to the original model that enables the model to produce a confidence value for each sample. The confidence values are incorporated into our loss function, which is designed so that the model learns to assign large confidence values to correctly labeled samples and small confidence values to mislabeled samples. We also propose an auxiliary regularization term to further improve the robustness of the model. To improve performance, we gradually correct the noisy labels with a well-designed target estimation strategy. We provide a theoretical analysis and conduct experiments on synthetic and real-world datasets, demonstrating that our approach achieves results comparable to the state-of-the-art methods.
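    As a loose, hypothetical sketch of the general idea (not the authors' exact formulation), a per-sample confidence can down-weight the cross-entropy, with a regularizer that discourages the trivial solution of setting every confidence to zero; the particular weighting and penalty below are assumptions.

    ```python
    import torch
    import torch.nn.functional as F

    def confidence_adaptive_loss(logits, confidence_logits, targets, reg_weight=0.1):
        """Hypothetical sketch: weight the per-sample cross-entropy by a learned
        confidence in [0, 1], and penalize low confidence so that collapsing all
        confidences to zero is not optimal."""
        confidence = torch.sigmoid(confidence_logits).squeeze(-1)  # (batch,)
        ce = F.cross_entropy(logits, targets, reduction="none")    # (batch,)
        weighted_ce = (confidence * ce).mean()
        reg = -torch.log(confidence + 1e-8).mean()                 # pushes confidence toward 1
        return weighted_ce + reg_weight * reg

    logits = torch.randn(16, 10, requires_grad=True)
    conf_logits = torch.randn(16, 1, requires_grad=True)
    targets = torch.randint(0, 10, (16,))
    loss = confidence_adaptive_loss(logits, conf_logits, targets)
    loss.backward()
    ```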
    Empirical observation of negligible fairness-accuracy trade-offs in machine learning for public policy. (arXiv:2012.02972v3 [cs.LG] UPDATED)
    (0 min) The growing use of machine learning in policy and social impact settings has raised concerns about fairness implications, especially for racial minorities. These concerns have generated considerable interest among machine learning and artificial intelligence researchers, who have developed new methods and established theoretical bounds for improving fairness, focusing on the source data, regularization and model training, or post-hoc adjustments to model scores. However, little work has studied the practical trade-offs between fairness and accuracy in real-world settings to understand how these bounds and methods translate into policy choices and impact on society. Our empirical study fills this gap by investigating the impact of mitigating disparities on accuracy, focusing on the common context of using machine learning to inform benefit allocation in resource-constrained programs across education, mental health, criminal justice, and housing safety. Here we describe applied work in which we find fairness-accuracy trade-offs to be negligible in practice. In each setting studied, by explicitly focusing on achieving equity and using our proposed post-hoc disparity mitigation methods, fairness was substantially improved without sacrificing accuracy. This observation was robust across the policy contexts studied, the scale of resources available for intervention, time, and the relative size of the protected groups. These empirical results challenge a commonly held assumption that reducing disparities either requires accepting an appreciable drop in accuracy or developing novel, complex methods, making reducing disparities in these applications more practical.
    Analysis of Discriminator in RKHS Function Space for Kullback-Leibler Divergence Estimation. (arXiv:2002.11187v4 [cs.LG] UPDATED)
    (0 min) Several scalable sample-based methods to compute the Kullback-Leibler (KL) divergence between two distributions have been proposed and applied in large-scale machine learning models. While they have been found to be unstable, the theoretical root cause of the problem is not clear. In this paper, we study a generative adversarial network based approach that uses a neural network discriminator to estimate KL divergence. We argue that, in such cases, high fluctuations in the estimates are a consequence of not controlling the complexity of the discriminator function space. We provide a theoretical underpinning and remedy for this problem by first constructing a discriminator in the Reproducing Kernel Hilbert Space (RKHS). This enables us to leverage sample complexity and mean embedding to theoretically relate the error probability bound of the KL estimates to the complexity of the discriminator in the RKHS. Based on this theory, we then present a scalable way to control the complexity of the discriminator for a reliable estimation of KL divergence. We support both our proposed theory and our method to control the complexity of the RKHS discriminator through controlled experiments.
    Online Continual Learning in Image Classification: An Empirical Survey. (arXiv:2101.10423v3 [cs.LG] UPDATED)
    (0 min) Online continual learning for image classification studies the problem of learning to classify images from an online stream of data and tasks, where tasks may include new classes (class incremental) or data nonstationarity (domain incremental). One of the key challenges of continual learning is to avoid catastrophic forgetting (CF), i.e., forgetting old tasks in the presence of more recent ones. Over the past few years, many methods and tricks have been introduced to address this problem, but many have not been fairly and systematically compared under a variety of realistic and practical settings. To better understand the relative advantages of various approaches and the settings where they work best, this survey aims to (1) compare state-of-the-art methods such as MIR, iCaRL, and GDumb and determine which works best at different experimental settings; (2) determine if the best class incremental methods are also competitive in the domain incremental setting; (3) evaluate the performance of 7 simple but effective tricks, such as the "review" trick and the nearest class mean (NCM) classifier, to assess their relative impact. Regarding (1), we observe that iCaRL remains competitive when the memory buffer is small; GDumb outperforms many recently proposed methods on medium-size datasets and MIR performs best on larger-scale datasets. For (2), we note that GDumb performs quite poorly while MIR -- already competitive for (1) -- is also strongly competitive in this very different but important setting. Overall, this allows us to conclude that MIR is a strong and versatile method across a wide variety of settings. For (3), we find that all 7 tricks are beneficial, and when augmented with the "review" trick and the NCM classifier, MIR produces performance levels that bring online continual learning much closer to its ultimate goal of matching offline training.
    Postulating Exoplanetary Habitability via a Novel Anomaly Detection Method. (arXiv:2109.02273v1 [astro-ph.EP])
    (0 min) A profound shift in the study of cosmology came with the discovery of thousands of exoplanets and the possibility of the existence of billions of them in our Galaxy. The biggest goal in these searches is whether there are other life-harbouring planets. However, the question of which of these detected planets are habitable, potentially habitable, or maybe even inhabited is still not answered. Some potentially habitable exoplanets have been hypothesized, but since Earth is the only known habitable planet, measures of habitability are necessarily determined with Earth as the reference. Several recent works introduced new habitability metrics based on optimization methods. Classification of potentially habitable exoplanets using supervised learning is another emerging area of study. However, both the modeling and supervised learning approaches suffer from drawbacks. We propose an anomaly detection method, the Multi-Stage Memetic Algorithm (MSMA), to detect anomalies, and extend it to an unsupervised clustering algorithm, MSMVMCA, to detect potentially habitable exoplanets as anomalies. The algorithm is based on the postulate that Earth is an anomaly, with the possibility of the existence of a few other anomalies among thousands of data points. We describe an MSMA-based clustering approach with a novel distance function to detect habitable candidates as anomalies (including Earth). The results are cross-matched with the habitable exoplanet catalog (PHL-HEC) of the Planetary Habitability Laboratory (PHL), with both optimistic and conservative lists of potentially habitable exoplanets.
    FBCNN: A Deep Neural Network Architecture for Portable and Fast Brain-Computer Interfaces. (arXiv:2109.02165v1 [eess.SP])
    (0 min) Objective: To propose a novel deep neural network (DNN) architecture -- the filter bank convolutional neural network (FBCNN) -- to improve SSVEP classification in single-channel BCIs with small data lengths. Methods: We propose two models: the FBCNN-2D and the FBCNN-3D. The FBCNN-2D utilizes a filter bank to create sub-band components of the electroencephalography (EEG) signal, which it transforms using the fast Fourier transform (FFT) and analyzes with a 2D CNN. The FBCNN-3D utilizes the same filter bank, but it transforms the sub-band components into spectrograms via the short-time Fourier transform (STFT) and analyzes them with a 3D CNN. We made use of transfer learning. To train the FBCNN-3D, we proposed a new technique, called inter-dimensional transfer learning, to transfer knowledge from a 2D DNN to a 3D DNN. Our BCI was conceived so as not to require calibration from the final user; therefore, the test subject data was separated from training and validation. Results: The mean test accuracy was 85.7% for the FBCNN-2D and 85% for the FBCNN-3D. Mean F1-scores were 0.858 and 0.853. Alternative classification methods, SVM, FBCCA and a CNN, had mean accuracies of 79.2%, 80.1% and 81.4%, respectively. Conclusion: The FBCNNs surpassed traditional SSVEP classification methods in our simulated BCI by a considerable margin (about 5% higher accuracy). Transfer learning and inter-dimensional transfer learning made training much faster and more predictable. Significance: We proposed a new and flexible type of DNN, which had better performance than standard methods in SSVEP classification for portable and fast BCIs.
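    A hedged sketch of the FBCNN-2D front end as described: band-pass filter an EEG segment into sub-bands, take the FFT magnitude of each sub-band, and stack the results as a 2D input for a CNN. The band edges, filter order and sampling rate below are placeholders, not the paper's values.

    ```python
    import numpy as np
    from scipy.signal import butter, filtfilt

    def filter_bank_fft(eeg: np.ndarray, fs: float,
                        bands=((6, 16), (14, 32), (30, 60))) -> np.ndarray:
        """Returns an array of shape (num_bands, num_freq_bins): one FFT magnitude
        spectrum per band-pass filtered copy of a single-channel EEG segment."""
        features = []
        for low, high in bands:
            b, a = butter(4, [low, high], btype="bandpass", fs=fs)
            sub_band = filtfilt(b, a, eeg)
            features.append(np.abs(np.fft.rfft(sub_band)))
        return np.stack(features)

    fs = 250.0
    eeg = np.random.randn(int(2 * fs))     # 2-second single-channel segment (toy data)
    x_2d = filter_bank_fft(eeg, fs)        # e.g. (3, 251); fed to a 2D CNN downstream
    ```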
    Does Melania Trump have a body double from the perspective of automatic face recognition?. (arXiv:2109.02283v1 [cs.CV])
    (0 min) In this paper, we explore whether automatic face recognition can help in verifying widespread misinformation on social media, particularly conspiracy theories that are based on the existence of body doubles. The conspiracy theory addressed in this paper is the case of the Melania Trump body double. We employed four different state-of-the-art descriptors for face recognition to verify the integrity of the claim of the studied conspiracy theory. In addition, we assessed the impact of different image quality metrics on the variation of face recognition results. Two sets of image quality metrics were considered: acquisition-related metrics and subject-related metrics.
    SEC4SR: A Security Analysis Platform for Speaker Recognition. (arXiv:2109.01766v1 [cs.CR])
    (0 min) Adversarial attacks have been expanded to speaker recognition (SR). However, existing attacks are often assessed using different SR models, recognition tasks and datasets, and only a few adversarial defenses borrowed from computer vision are considered. Yet, these defenses have not been thoroughly evaluated against adaptive attacks. Thus, there is still a lack of quantitative understanding about the strengths and limitations of adversarial attacks and defenses. More effective defenses are also required for securing SR systems. To bridge this gap, we present SEC4SR, the first platform enabling researchers to systematically and comprehensively evaluate adversarial attacks and defenses in SR. SEC4SR incorporates 4 white-box and 2 black-box attacks and 24 defenses, including our novel feature-level transformations. It also contains techniques for mounting adaptive attacks. Using SEC4SR, we conduct the largest-scale empirical study to date on adversarial attacks and defenses in SR, involving 23 defenses, 15 attacks and 4 attack settings. Our study provides many useful findings that may advance future research, such as: (1) all the transformations slightly degrade accuracy on benign examples and their effectiveness varies with the attack; (2) most transformations become less effective under adaptive attacks, but some become more effective; (3) few transformations combined with adversarial training yield stronger defenses over some but not all attacks, while our feature-level transformation combined with adversarial training yields the strongest defense over all the attacks. Extensive experiments demonstrate the capabilities and advantages of SEC4SR, which can benefit future research in SR.
    Uncovering the Limits of Text-based Emotion Detection. (arXiv:2109.01900v1 [cs.CL])
    (0 min) Identifying emotions from text is crucial for a variety of real-world tasks. We consider the two largest corpora currently available for emotion classification: GoEmotions, with 58k messages labelled by readers, and Vent, with 33M writer-labelled messages. We design a benchmark and evaluate several feature spaces and learning algorithms, including two simple yet novel models on top of BERT that outperform previous strong baselines on GoEmotions. Through an experiment with human participants, we also analyze the differences between how writers express emotions and how readers perceive them. Our results suggest that emotions expressed by writers are harder to identify than emotions that readers perceive. We share a public web interface for researchers to explore our models.
    Comparing the Machine Readability of Traffic Sign Pictograms in Austria and Germany. (arXiv:2109.02362v1 [cs.CV])
    (0 min) We compare the machine readability of pictograms found on Austrian and German traffic signs. To that end, we train classification models on synthetic data sets and evaluate their classification accuracy in a controlled setting. In particular, we focus on differences between currently deployed pictograms in the two countries, and a set of new pictograms designed to increase human readability. Besides other results, we find that machine-learning models generalize poorly to data sets with pictogram designs they have not been trained on. We conclude that manufacturers of advanced driver-assistance systems (ADAS) must take special care to properly address small visual differences between current and newly designed traffic sign pictograms, as well as between pictograms from different countries.
    Supervised DKRC with Images for Offline System Identification. (arXiv:2109.02241v1 [cs.LG])
    (2 min) Koopman spectral theory has provided a new perspective in the field of dynamical systems in recent years. Modern dynamical systems are becoming increasingly non-linear and complex, and there is a need for a framework to model these systems in a compact and comprehensive representation for prediction and control. The central problem in applying Koopman theory to a system of interest is that the choice of finite-dimensional basis functions is typically done a priori, using expert knowledge of the system's dynamics. Our approach learns these basis functions in a supervised manner, where a combination of autoencoders and deep neural networks learns the basis functions for any given system. We demonstrate this approach on a simple pendulum example, in which we obtain a linear representation of the non-linear system and then predict the future state trajectories given some initial conditions. We also explore how changing the input representation of the dynamical system's time-series data can impact the quality of the learned basis functions. This alternative representation is compared to the traditional raw time-series data approach to determine which method results in lower reconstruction and prediction error of the true non-linear dynamics of the system.
    FewshotQA: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models. (arXiv:2109.01951v1 [cs.CL])
    (2 min) The task of learning from only a few examples (called a few-shot setting) is of key importance and relevance to a real-world setting. For question answering (QA), the current state-of-the-art pre-trained models typically need fine-tuning on tens of thousands of examples to obtain good results. Their performance degrades significantly in a few-shot setting (< 100 examples). To address this, we propose a simple fine-tuning framework that leverages pre-trained text-to-text models and is directly aligned with their pre-training framework. Specifically, we construct the input as a concatenation of the question, a mask token representing the answer span and a context. Given this input, the model is fine-tuned using the same objective as its pre-training objective. Through experimental studies on various few-shot configurations, we show that this formulation leads to significant gains on multiple QA benchmarks (an absolute gain of 34.2 F1 points on average when there are only 16 training examples). The gains extend further when used with larger models (e.g., 72.3 F1 on SQuAD using BART-large with only 32 examples) and translate well to a multilingual setting. On the multilingual TydiQA benchmark, our model outperforms XLM-RoBERTa-large by an absolute margin of up to 40 F1 points and an average of 33 F1 points in a few-shot setting (<= 64 training examples). We conduct detailed ablation studies to analyze the factors contributing to these gains.
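    The input construction is simple to illustrate. Below is a minimal sketch of the described input/target format for a text-to-text model; the literal mask token (e.g. `<mask>` for BART-style models) and the field separators are assumptions, and the actual fine-tuning would use the corresponding pre-trained tokenizer and seq2seq objective.

    ```python
    def build_fewshot_qa_example(question: str, context: str, answer: str,
                                 mask_token: str = "<mask>"):
        """Constructs an (input, target) pair: question + masked answer slot + context,
        with the target reconstructing the answer in place of the mask."""
        source = f"question: {question} answer: {mask_token} context: {context}"
        target = f"answer: {answer}"
        return source, target

    src, tgt = build_fewshot_qa_example(
        question="Where is the Eiffel Tower?",
        context="The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
        answer="Paris",
    )
    print(src)   # question: ... answer: <mask> context: ...
    print(tgt)   # answer: Paris
    ```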
    MLCTR: A Fast Scalable Coupled Tensor Completion Based on Multi-Layer Non-Linear Matrix Factorization. (arXiv:2109.01773v1 [cs.LG])
    (2 min) Firm earnings prediction plays a vital role in investment decisions, dividend expectations, and share prices. It often involves multiple tensor-compatible datasets with non-linear multi-way relationships, spatiotemporal structures, and different levels of sparsity. Current non-linear tensor completion algorithms tend to learn noisy embeddings and incur overfitting. This paper focuses on the embedding-learning aspect of the tensor completion problem and proposes a new multi-layer neural network architecture for tensor factorization and completion (MLCTR). The network architecture entails multiple advantages: a series of low-rank matrix factorization (MF) building blocks to minimize overfitting, interleaved transfer functions in each layer for non-linearity, and by-pass connections to reduce the vanishing gradient problem and increase the depth of the neural network. Furthermore, the model employs Stochastic Gradient Descent (SGD) based optimization for fast convergence in training. Our algorithm is highly efficient for imputing missing values in the EPS data. Experiments confirm that our strategy of incorporating non-linearity in the factor matrices demonstrates impressive performance in embedding learning and end-to-end tensor models, and outperforms approaches with non-linearity in the phase of reconstructing tensors from factor matrices.
    Barycenteric distribution alignment and manifold-restricted invertibility for domain generalization. (arXiv:2109.01902v1 [cs.LG])
    (2 min) For the Domain Generalization (DG) problem where the hypotheses are composed of a common representation function followed by a labeling function, we point out a shortcoming in existing approaches that fail to explicitly optimize for a term, appearing in a well-known and widely adopted upper bound to the risk on the unseen domain, that is dependent on the representation to be learned. To this end, we first derive a novel upper bound to the prediction risk. We show that imposing a mild assumption on the representation to be learned, namely manifold-restricted invertibility, is sufficient to deal with this issue. Further, unlike existing approaches, our novel upper bound does not require the assumption of Lipschitzness of the loss function. In addition, the distributional discrepancy in the representation space is handled via the Wasserstein-2 barycenter cost. In this context, we creatively leverage old and recent transport inequalities, which link various optimal transport metrics, in particular the $L^1$ distance (also known as the total variation distance) and the Wasserstein-2 distances, with the Kullback-Leibler divergence. These analyses and insights motivate a new representation learning cost for DG that additively balances three competing objectives: 1) minimizing classification error across seen domains via cross-entropy, 2) enforcing domain-invariance in the representation space via the Wasserstein-2 barycenter cost, and 3) promoting a non-degenerate, nearly-invertible representation via one of two mechanisms, viz., an autoencoder-based reconstruction loss or a mutual information loss. It is to be noted that the proposed algorithms completely bypass the use of any adversarial training mechanism that is typical of many current domain generalization approaches. Simulation results on several standard datasets demonstrate superior performance compared to several well-known DG algorithms.
    Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters. (arXiv:2109.01313v2 [cs.DC] UPDATED)
    (0 min) Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
    Global-Local Item Embedding for Temporal Set Prediction. (arXiv:2109.02074v1 [cs.LG])
    (0 min) Temporal set prediction is becoming increasingly important as many companies employ recommender systems in their online businesses, e.g., personalized purchase prediction of shopping baskets. While most previous techniques have focused on leveraging a user's history, the study of combining it with other users' histories remains untapped potential. This paper proposes Global-Local Item Embedding (GLOIE), which learns to utilize the temporal properties of sets across all users as well as within a user, coining the terms global and local information to distinguish the two temporal patterns. GLOIE uses a Variational Autoencoder (VAE) and a dynamic graph-based model to capture global and local information, and then applies attention to integrate the resulting item embeddings. Additionally, we propose to use a Tweedie output for the decoder of the VAE, as it can easily model zero-inflated and long-tailed distributions, which are more suitable for several real-world data distributions than Gaussian or multinomial counterparts. When evaluated on three public benchmarks, our algorithm consistently outperforms previous state-of-the-art methods in most ranking metrics.
    Neural Operator: Learning Maps Between Function Spaces. (arXiv:2108.08481v2 [cs.LG] UPDATED)
    (0 min) The classical development of neural networks has primarily focused on learning mappings between finite-dimensional Euclidean spaces or finite sets. We propose a generalization of neural networks tailored to learn operators mapping between infinite-dimensional function spaces. We formulate the approximation of operators by composition of a class of linear integral operators and nonlinear activation functions, so that the composed operator can approximate complex nonlinear operators. We prove a universal approximation theorem for our construction. Furthermore, we introduce four classes of operator parameterizations: graph-based operators, low-rank operators, multipole graph-based operators, and Fourier operators, and describe efficient algorithms for computing with each one. The proposed neural operators are resolution-invariant: they share the same network parameters between different discretizations of the underlying function spaces and can be used for zero-shot super-resolution. Numerically, the proposed models show superior performance compared to existing machine learning based methodologies on Burgers' equation, Darcy flow, and the Navier-Stokes equation, while being several orders of magnitude faster than conventional PDE solvers.
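    As an illustration of the Fourier parameterization, here is a compact PyTorch sketch of a 1D spectral convolution layer in the style of a Fourier neural operator: transform the input to Fourier space, apply learned complex weights to the lowest modes, and transform back. The channel counts and mode truncation are placeholders, not values from the paper.

    ```python
    import torch
    import torch.nn as nn

    class SpectralConv1d(nn.Module):
        """Fourier-space linear layer: keep only the lowest `modes` frequencies and
        mix channels there with learned complex weights."""
        def __init__(self, channels: int, modes: int):
            super().__init__()
            self.modes = modes
            scale = 1.0 / channels
            self.weight = nn.Parameter(
                scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, grid_points)
            x_ft = torch.fft.rfft(x)                      # complex spectrum
            out_ft = torch.zeros_like(x_ft)
            out_ft[:, :, :self.modes] = torch.einsum(
                "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
            return torch.fft.irfft(out_ft, n=x.size(-1))  # back to physical space

    layer = SpectralConv1d(channels=8, modes=16)
    u = torch.randn(2, 8, 64)   # a batch of discretized input functions
    v = layer(u)                # same shape; weights are independent of the grid resolution
    ```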
    A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning. (arXiv:2109.02355v1 [stat.ML])
    (0 min) The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, which results in them perfectly fitting (i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models -- from simple linear models to deep neural networks -- have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance. Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.
    Using Logical Specifications of Objectives in Multi-Objective Reinforcement Learning. (arXiv:1910.01723v3 [cs.LG] UPDATED)
    (0 min) It is notoriously difficult to control the behavior of reinforcement learning agents. Agents often learn to exploit the environment or reward signal and need to be retrained multiple times. The multi-objective reinforcement learning (MORL) framework separates a reward function into several objectives. An ideal MORL agent learns to generalize to novel combinations of objectives, allowing for better control of an agent's behavior without requiring retraining. Many MORL approaches use a weight vector to parameterize the importance of each objective. However, this approach suffers from a lack of expressiveness and interpretability. We propose using propositional logic to specify the importance of multiple objectives. By using a logic where predicates correspond directly to objectives, specifications are inherently more interpretable. Additionally, the set of specifications that can be expressed with formal languages is a superset of what can be expressed by weight vectors. In this paper, we define a formal language based on propositional logic with quantitative semantics. We encode logical specifications using a recurrent neural network and show that MORL agents parameterized by these encodings are able to generalize to novel specifications over objectives and achieve performance comparable to single-objective baselines.
    Thompson Sampling for Bandits with Clustered Arms. (arXiv:2109.01656v1 [cs.LG])
    (0 min) We propose algorithms based on a multi-level Thompson sampling scheme, for the stochastic multi-armed bandit and its contextual variant with linear expected rewards, in the setting where arms are clustered. We show, both theoretically and empirically, how exploiting a given cluster structure can significantly improve the regret and computational cost compared to using standard Thompson sampling. In the case of the stochastic multi-armed bandit we give upper bounds on the expected cumulative regret showing how it depends on the quality of the clustering. Finally, we perform an empirical evaluation showing that our algorithms perform well compared to previously proposed algorithms for bandits with clustered arms.
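    To make the multi-level scheme concrete, here is a small hedged sketch for Bernoulli rewards with Beta posteriors: first sample a score per cluster from a cluster-level posterior (here, crudely aggregated over the cluster's arms), then run standard Thompson sampling among the arms of the chosen cluster. The aggregation rule is an assumption, not necessarily the paper's construction.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]     # arm indices grouped into clusters
    true_means = rng.uniform(0.1, 0.9, size=9)       # unknown Bernoulli reward means
    alpha = np.ones(9)                               # per-arm Beta(1, 1) priors
    beta = np.ones(9)

    for t in range(5000):
        # Level 1: sample one score per cluster from its aggregated Beta posterior.
        cluster_scores = [rng.beta(alpha[c].sum(), beta[c].sum()) for c in clusters]
        chosen_cluster = clusters[int(np.argmax(cluster_scores))]
        # Level 2: Thompson sampling among the arms of the chosen cluster.
        arm_scores = rng.beta(alpha[chosen_cluster], beta[chosen_cluster])
        arm = chosen_cluster[int(np.argmax(arm_scores))]
        reward = rng.random() < true_means[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward

    print("estimated best arm:", int(np.argmax(alpha / (alpha + beta))))
    ```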
    Sense and Learn: Self-Supervision for Omnipresent Sensors. (arXiv:2009.13233v2 [cs.LG] UPDATED)
    (0 min) Learning general-purpose representations from multisensor data produced by omnipresent sensing systems (or IoT in general) has numerous applications in diverse use cases. Existing purely supervised end-to-end deep learning techniques depend on the availability of a massive amount of well-curated data, which is notoriously difficult to acquire but required to achieve a sufficient level of generalization on a task of interest. In this work, we leverage the self-supervised learning paradigm towards realizing the vision of continual learning from unlabeled inputs. We present a generalized framework named Sense and Learn for representation or feature learning from raw sensory data. It consists of several auxiliary tasks that can learn high-level and broadly useful features entirely from unannotated data, without any human involvement in the tedious labeling process. We demonstrate the efficacy of our approach on several publicly available datasets from different domains and in various settings, including linear separability, semi-supervised or few-shot learning, and transfer learning. Our methodology achieves results that are competitive with the supervised approaches and closes the gap through fine-tuning a network while learning the downstream tasks in most cases. In particular, we show that the self-supervised network can be utilized as initialization to significantly boost performance in a low-data regime with as few as 5 labeled instances per class, which is of high practical importance to real-world problems. Likewise, the learned representations with self-supervision are found to be highly transferable between related datasets, even when few labeled instances are available from the target domains. The self-learning nature of our methodology opens up exciting possibilities for on-device continual learning.
    Connecting GANs, MFGs, and OT. (arXiv:2002.04112v4 [cs.GT] UPDATED)
    (0 min) Generative adversarial networks (GANs) have enjoyed tremendous success in image generation and processing, and have recently attracted growing interest in financial modeling. This paper analyzes GANs from the perspectives of mean-field games (MFGs) and optimal transport. More specifically, from the game-theoretical perspective, GANs are interpreted as MFGs under the Pareto optimality criterion or mean-field controls; from the optimal transport perspective, GANs minimize the optimal transport cost, indexed by the generator, from the known latent distribution to the unknown true distribution of the data. The MFG perspective of GANs leads to a GAN-based computational method (MFGANs) to solve MFGs: one neural network for the backward Hamilton-Jacobi-Bellman equation and one neural network for the forward Fokker-Planck equation, with the two neural networks trained in an adversarial way. Numerical experiments demonstrate the superior performance of this proposed algorithm, especially in the higher-dimensional case, when compared with existing neural network approaches.
    Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series. (arXiv:2010.05073v3 [cs.LG] UPDATED)
    (0 min) Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been gaining lots of attention, the lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval as well as for the extended effect interval are provided, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon's dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.
    High-quality Thermal Gibbs Sampling with Quantum Annealing Hardware. (arXiv:2109.01690v1 [quant-ph])
    (0 min) Quantum Annealing (QA) was originally intended for accelerating the solution of combinatorial optimization tasks that have natural encodings as Ising models. However, recent experiments on QA hardware platforms have demonstrated that, in the operating regime corresponding to weak interactions, the QA hardware behaves like a noisy Gibbs sampler at a hardware-specific effective temperature. This work builds on those insights and identifies a class of small hardware-native Ising models that are robust to noise effects and proposes a novel procedure for executing these models on QA hardware to maximize Gibbs sampling performance. Experimental results indicate that the proposed protocol results in high-quality Gibbs samples from a hardware-specific effective temperature and that the QA annealing time can be used to adjust the effective temperature of the output distribution. The procedure proposed in this work provides a new approach to using QA hardware for Ising model sampling presenting potential new opportunities for applications in machine learning and physics simulation.
    CodeNeRF: Disentangled Neural Radiance Fields for Object Categories. (arXiv:2109.01750v1 [cs.GR])
    (0 min) CodeNeRF is an implicit 3D neural representation that learns the variation of object shapes and textures across a category and can be trained, from a set of posed images, to synthesize novel views of unseen objects. Unlike the original NeRF, which is scene specific, CodeNeRF learns to disentangle shape and texture by learning separate embeddings. At test time, given a single unposed image of an unseen object, CodeNeRF jointly estimates camera viewpoint, and shape and appearance codes via optimization. Unseen objects can be reconstructed from a single image, and then rendered from new viewpoints or their shape and texture edited by varying the latent codes. We conduct experiments on the SRN benchmark, which show that CodeNeRF generalises well to unseen objects and achieves on-par performance with methods that require known camera pose at test time. Our results on real-world images demonstrate that CodeNeRF can bridge the sim-to-real gap. Project page: \url{https://github.com/wayne1123/code-nerf}
    Towards a Common Testing Terminology for Software Engineering and Artificial Intelligence Experts. (arXiv:2108.13837v2 [cs.SE] UPDATED)
    (0 min) Analytical quality assurance, especially testing, is an integral part of software-intensive system development. With the increased use of Artificial Intelligence (AI) and Machine Learning (ML) as part of such systems, this becomes more difficult, as well-understood software testing approaches cannot be applied directly to the AI-enabled parts of the system. The required adaptation of classical testing approaches and development of new concepts for AI would benefit from a deeper understanding and exchange between AI and software engineering experts. We see a major obstacle in the different terminologies used in the two communities. As we consider a mutual understanding of the testing terminology to be key, this paper contributes a mapping between the most important concepts from classical software testing and AI testing. In the mapping, we highlight differences in the relevance and naming of the mapped concepts.
    Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks. (arXiv:2109.02096v1 [cs.SD])
    (0 min) This research project investigates the application of deep learning to timbre transfer, where the timbre of a source audio can be converted to the timbre of a target audio with minimal loss in quality. The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio. It is applied to the Flickr 8k Audio dataset for transferring vocal timbre between speakers and to the URMP dataset for transferring musical timbre between instruments. Furthermore, variations of the adopted approach are trained, and generalised performance is compared using the metrics SSIM (Structural Similarity Index) and FAD (Fréchet Audio Distance). It was found that a many-to-many approach supersedes a one-to-one approach in terms of reconstructive capabilities, and that the adoption of a basic over a bottleneck residual block design is more suitable for enriching content information about a latent space. It was also found that the decision on whether the cyclic loss takes a variational autoencoder or a vanilla autoencoder approach does not have a significant impact on the reconstructive and adversarial translation aspects of the model.
    Fast Hypergraph Regularized Nonnegative Tensor Ring Factorization Based on Low-Rank Approximation. (arXiv:2109.02314v1 [cs.LG])
    (0 min) For high-dimensional data representation, nonnegative tensor ring (NTR) decomposition equipped with manifold learning has become a promising model to exploit the multi-dimensional structure and extract features from tensor data. However, existing methods such as graph regularized tensor ring decomposition (GNTR) only model the pair-wise similarities of objects. For tensor data with a complex manifold structure, a graph cannot exactly capture the similarity relationships. In this paper, in order to effectively utilize the higher-dimensional and complicated similarities among objects, we introduce hypergraphs into the framework of NTR to further enhance feature extraction, upon which a hypergraph regularized nonnegative tensor ring decomposition (HGNTR) method is developed. To reduce the computational complexity and suppress noise, we apply a low-rank approximation trick to accelerate HGNTR (called LraHGNTR). Our experimental results show that, compared with other state-of-the-art algorithms, the proposed HGNTR and LraHGNTR achieve higher performance in clustering tasks; in addition, LraHGNTR greatly reduces running time without decreasing accuracy.
    On Faster Convergence of Scaled Sign Gradient Descent. (arXiv:2109.01806v1 [math.OC])
    (0 min) Communication has been seen as a significant bottleneck in industrial applications over large-scale networks. To alleviate the communication burden, sign-based optimization algorithms have gained popularity recently in both industrial and academic communities, which is shown to be closely related to adaptive gradient methods, such as Adam. Along this line, this paper investigates faster convergence for a variant of sign-based gradient descent, called scaled signGD, in three cases: 1) the objective function is strongly convex; 2) the objective function is nonconvex but satisfies the Polyak-Lojasiewicz (PL) inequality; 3) the gradient is stochastic, called scaled signGD in this case. For the first two cases, it can be shown that the scaled signGD converges at a linear rate. For case 3), the algorithm is shown to converge linearly to a neighborhood of the optimal value when a constant learning rate is employed, and the algorithm converges at a rate of $O(1/k)$ when using a diminishing learning rate, where $k$ is the iteration number. The results are also extended to the distributed setting by majority vote in a parameter-server framework. Finally, numerical experiments on logistic regression are performed to corroborate the theoretical findings.
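    The abstract does not define the scaling explicitly, so the following is a hedged sketch of one common scaled-sign update, where the sign of the gradient is rescaled by the mean absolute gradient magnitude (reminiscent of how sign-based methods relate to adaptive schemes); it is not necessarily the paper's exact variant.

    ```python
    import numpy as np

    def scaled_sign_gd(grad_fn, x0, lr=0.1, steps=200):
        """Minimal sketch: x_{k+1} = x_k - lr * scale_k * sign(grad_k),
        with scale_k taken here as the mean absolute entry of the gradient."""
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(steps):
            g = grad_fn(x)
            scale = np.mean(np.abs(g))
            x -= lr * scale * np.sign(g)
        return x

    # Example: strongly convex quadratic f(x) = 0.5 * x^T A x - b^T x.
    A = np.diag([1.0, 10.0])
    b = np.array([1.0, -2.0])
    x_final = scaled_sign_gd(lambda x: A @ x - b, x0=np.zeros(2))
    print(x_final, np.linalg.solve(A, b))   # the iterate should approach the true minimizer
    ```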
    Eden: A Unified Environment Framework for Booming Reinforcement Learning Algorithms. (arXiv:2109.01768v1 [cs.LG])
    (0 min) With AlphaGo defeating top human players, reinforcement learning (RL) algorithms have gradually become the code-base for building stronger artificial intelligence (AI). RL algorithm design must first adapt to a specific environment, so well-designed environments guide the rapid and profound development of RL algorithms. However, the existing environments, which can be divided into real-world games and customized toy environments, have obvious shortcomings. Real-world games are designed for human entertainment and are too difficult for most RL researchers. Customized toy environments lack a widely accepted, unified evaluation standard for all RL algorithms. Therefore, we introduce the first virtual user-friendly environment framework for RL. In this framework, the environment can be easily configured to realize all kinds of RL tasks in mainstream research, and all the mainstream state-of-the-art (SOTA) RL algorithms can be conveniently evaluated and compared. Our contributions mainly include the following aspects: 1. a single configurable environment for every category of SOTA RL algorithm; 2. combined environments spanning more than one category of RL algorithm; 3. an evaluation standard for all kinds of RL algorithms. With all these efforts, a possibility for breeding an AI with general competency in a variety of tasks is provided, and maybe it will open up a new chapter for AI.
    On robustness of generative representations against catastrophic forgetting. (arXiv:2109.01844v1 [cs.CV])
    (0 min) Catastrophic forgetting of previously learned knowledge while learning new tasks is a widely observed limitation of contemporary neural networks. Although many continual learning methods are proposed to mitigate this drawback, the main question remains unanswered: what is the root cause of catastrophic forgetting? In this work, we aim at answering this question by posing and validating a set of research hypotheses related to the specificity of representations built internally by neural models. More specifically, we design a set of empirical evaluations that compare the robustness of representations in discriminative and generative models against catastrophic forgetting. We observe that representations learned by discriminative models are more prone to catastrophic forgetting than their generative counterparts, which sheds new light on the advantages of developing generative models for continual learning. Finally, our work opens new research pathways and possibilities to adopt generative models in continual learning beyond mere replay mechanisms.
    Mean-Variance Efficient Reinforcement Learning by Expected Quadratic Utility Maximization. (arXiv:2010.01404v3 [cs.LG] UPDATED)
    (0 min) Risk management is critical in decision making, and the mean-variance (MV) trade-off is one of the most common criteria. However, in reinforcement learning (RL) for sequential decision making under uncertainty, most existing methods for MV control suffer from computational difficulties caused by the double sampling problem. In this paper, in contrast to strict MV control, we consider learning MV efficient policies that achieve Pareto efficiency with respect to the MV trade-off. To this end, we train an agent to maximize the expected quadratic utility function, a common objective of risk management in finance and economics. We call our approach direct expected quadratic utility maximization (EQUM). EQUM does not suffer from the double sampling issue because it does not require gradient estimation of the variance. We confirm that the maximizer of the EQUM objective directly corresponds to an MV efficient policy under a certain condition. We conduct experiments with benchmark settings to demonstrate the effectiveness of EQUM.
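    To make the objective concrete (a minimal sketch; the risk-aversion coefficient b and the use of sampled episode returns are assumptions about the general form of a quadratic utility, not the paper's exact estimator), maximizing expected quadratic utility trades off the mean of the return against its second moment, and hence against its variance:

    ```python
    import numpy as np

    def expected_quadratic_utility(returns, b=0.1):
        """Monte Carlo estimate of E[u(R)] with u(R) = R - (b/2) R^2.

        Since E[u(R)] = E[R] - (b/2) (Var[R] + E[R]^2), maximizing this objective
        steers the policy toward a mean-variance trade-off without ever estimating
        a gradient of the variance (the source of the double sampling problem).
        returns : array of sampled episode returns under the current policy.
        b       : illustrative risk-aversion coefficient (assumed hyperparameter).
        """
        r = np.asarray(returns, dtype=float)
        return np.mean(r - 0.5 * b * r**2)

    # A REINFORCE-style update would then weight log-probabilities by u(R) instead of R,
    # e.g. loss = -mean(log_pi * (returns - 0.5 * b * returns**2)).
    ```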
    Tensor Normalization and Full Distribution Training. (arXiv:2109.02345v1 [cs.CV])
    (0 min) In this work, we introduce pixel-wise tensor normalization, which is inserted after rectified linear units and, together with batch normalization, provides a significant improvement in the accuracy of modern deep neural networks. In addition, this work addresses the robustness of networks. We show that a factorized superposition of images from the training set, together with reformulating the multi-class problem as a multi-label problem, yields significantly more robust networks. The reformulation and the corresponding adjustment of the multi-class log loss also improve the results compared to overlaying images with only one class as the label. https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/?p=%2FTNandFDT&mode=list
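    As a rough illustration of turning overlaid training images into a multi-label target (a minimal sketch under the assumption that the superposition amounts to a convex combination of two images with both class labels marked present, which is only one plausible reading of the abstract; the mixing weight alpha is illustrative):

    ```python
    import numpy as np

    def superpose(img_a, label_a, img_b, label_b, num_classes, alpha=0.5):
        """Blend two images and return a multi-hot target instead of a single class.

        alpha : illustrative mixing weight; the paper's exact factorization may differ.
        """
        mixed = alpha * img_a + (1.0 - alpha) * img_b
        target = np.zeros(num_classes)
        target[label_a] = 1.0
        target[label_b] = 1.0      # multi-label: both source classes are present
        return mixed, target

    # Training would then use a multi-label loss (e.g. per-class sigmoid with binary
    # cross-entropy) over 'target' rather than a single-class softmax log loss.
    ```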
    Spike and slab variational Bayes for high dimensional logistic regression. (arXiv:2010.11665v2 [stat.ML] UPDATED)
    (0 min) Variational Bayes (VB) is a popular scalable alternative to Markov chain Monte Carlo for Bayesian inference. We study a mean-field spike and slab VB approximation of widely used Bayesian model selection priors in sparse high-dimensional logistic regression. We provide non-asymptotic theoretical guarantees for the VB posterior in both $\ell_2$ and prediction loss for a sparse truth, giving optimal (minimax) convergence rates. Since the VB algorithm does not depend on the unknown truth to achieve optimality, our results shed light on effective prior choices. We confirm the improved performance of our VB algorithm over common sparse VB approaches in a numerical study.
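    As a small illustration of the mean-field spike-and-slab family (a sketch only; q(beta_j) = gamma_j N(mu_j, sigma_j^2) + (1 - gamma_j) delta_0 is the standard form of such an approximation, and the parameter names below are illustrative rather than the paper's notation):

    ```python
    import numpy as np

    def vb_predict_proba(X, gamma, mu):
        """Plug-in logistic prediction with the VB posterior mean E_q[beta_j] = gamma_j * mu_j.

        gamma : per-coordinate posterior inclusion probabilities (spike vs. slab).
        mu    : per-coordinate slab means.
        """
        beta_mean = gamma * mu
        logits = X @ beta_mean
        return 1.0 / (1.0 + np.exp(-logits))

    def select_support(gamma, threshold=0.5):
        """Variable selection by thresholding the posterior inclusion probabilities."""
        return np.where(gamma > threshold)[0]
    ```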
    Provably Correct Training of Neural Network Controllers Using Reachability Analysis. (arXiv:2102.10806v2 [eess.SY] UPDATED)
    (0 min) In this paper, we consider the problem of training neural network (NN) controllers for nonlinear dynamical systems that are guaranteed to satisfy safety and liveness (e.g., reach-avoid) properties. Our approach combines model-based design methodologies for dynamical systems with data-driven approaches to achieve this target. We confine our attention to NNs with Rectified Linear Unit (ReLU) nonlinearity, which are known to represent Continuous Piecewise Affine (CPWA) functions. Given a mathematical model of the dynamical system, we compute a finite-state abstract model that captures the closed-loop behavior under all possible CPWA controllers. Using this finite-state abstract model, our framework identifies a family of CPWA functions guaranteed to satisfy the safety requirements. We augment the learning algorithm with an NN weight projection operator during training that enforces the resulting NN to represent a CPWA function from this provably safe family. Moreover, the proposed framework uses the finite-state abstract model to identify candidate CPWA functions that may satisfy the liveness properties, and uses these candidates to bias the NN training toward the liveness specification. We show the efficacy of the proposed framework both in simulation and on an actual robotic vehicle.
    On the stability properties of Gated Recurrent Units neural networks. (arXiv:2011.06806v5 [eess.SY] UPDATED)
    (0 min) The goal of this paper is to provide sufficient conditions for guaranteeing the Input-to-State Stability (ISS) and the Incremental Input-to-State Stability (δISS) of Gated Recurrent Unit (GRU) neural networks. These conditions, devised for both single-layer and multi-layer architectures, consist of nonlinear inequalities on the network's weights. They can be employed to check the stability of trained networks, or can be enforced as constraints during the training procedure of a GRU. The resulting training procedure is tested on a Quadruple Tank nonlinear benchmark system, showing satisfactory modeling performance.
    Nonparametric Extrema Analysis in Time Series for Envelope Extraction, Peak Detection and Clustering. (arXiv:2109.02082v1 [cs.LG])
    (0 min) In this paper, we propose a nonparametric approach that can be used in envelope extraction, peak-burst detection and clustering in time series. Our problem formalization results in a naturally defined splitting/forking of the time series. With a possibly hierarchical implementation, it can be used for various applications in machine learning, signal processing and mathematical finance. From an incoming input signal, our iterative procedure sequentially creates two signals (one upper bounding and one lower bounding signal) by minimizing the cumulative $L_1$ drift. We show that a solution can be efficiently calculated by use of a Viterbi-like path tracking algorithm together with an optimal elimination rule. We consider many interesting settings, where our algorithm has near-linear time complexities.
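    To make the bounding-signal idea concrete, the following sketch poses the upper envelope as a small convex program; this is an illustrative formulation of minimizing cumulative L1 drift subject to bounding the signal, not the paper's Viterbi-like algorithm, and the use of cvxpy and the tightness weight lam are assumptions:

    ```python
    import numpy as np
    import cvxpy as cp

    def upper_envelope(x, lam=0.5):
        """Upper bounding signal u with small cumulative L1 drift.

        Solves: minimize sum_t |u_t - u_{t-1}| + lam * sum_t (u_t - x_t)
                subject to u_t >= x_t for all t.
        lam trades off tightness against smoothness (illustrative choice).
        """
        x = np.asarray(x, dtype=float)
        u = cp.Variable(x.size)
        objective = cp.Minimize(cp.sum(cp.abs(cp.diff(u))) + lam * cp.sum(u - x))
        problem = cp.Problem(objective, [u >= x])
        problem.solve()
        return u.value

    # Example: envelope of a noisy, decaying oscillation
    t = np.linspace(0, 4 * np.pi, 200)
    signal = np.sin(t) * np.exp(-0.1 * t) + 0.05 * np.random.randn(t.size)
    env = upper_envelope(signal)
    ```

    A lower bounding signal follows symmetrically by flipping the sign of the signal, which is how the splitting/forking of the time series into two envelopes can be pictured.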